Pixel’s Neighbors Are Noteworthy: Localized Vision–Language Attention for Remote Sensing Semantic Segmentation

Zeng, Cheng; Tao, Sheng; Tan, Xiaowei; Xiao, Zhifeng; Hu, Lei

doi:10.3390/rs18111708

by Cheng Zeng ¹,², Sheng Tao ¹, Xiaowei Tan ¹,*, Zhifeng Xiao ³ and Lei Hu ⁴

¹ School of Artificial Intelligence, Hubei University, Wuhan 430062, China

² Key Laboratory of Intelligent Sensing System and Security, Hubei University, Ministry of Education, Wuhan 430062, China

³ State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, 129 Luoyu Road, Wuhan 430079, China

⁴ School of Computer Science, Hubei University, Wuhan 430062, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(11), 1708; https://doi.org/10.3390/rs18111708

Submission received: 29 March 2026 / Revised: 19 May 2026 / Accepted: 22 May 2026 / Published: 26 May 2026

(This article belongs to the Section Remote Sensing Image Processing)

Highlights

What are the main findings?

We propose LoVLANet, a localized vision–language segmentation framework for RGB remote sensing imagery that explicitly combines class-label semantics with neighborhood-aware attention.
By reformulating attention with key–key similarity and a Gaussian-weighted spatial prior, LoVLANet enhances spatial coherence and boundary delineation, achieving 69.77% mIoU on LoveDA and 92.49% mIoU on GID under the adopted experimental settings.

What are the implications of the main findings?

The results indicate that pixel-level neighborhood relationships should be incorporated directly into visual–language attention formation for high-resolution remote sensing segmentation, rather than being addressed only through feature fusion or post-processing.
This work provides a practical design for combining language guidance with locality-aware dense prediction, supporting more interpretable remote sensing scene understanding.

Abstract

In recent years, vision–language models (VLMs) have been introduced into remote sensing semantic segmentation to provide richer semantic representations through visual–textual alignment. However, most existing VLM-based segmentation methods focus on global semantic alignment while neglecting pixel-level local neighborhood features, which are crucial for reliably understanding remote sensing imagery with high spatial resolution, complex structures, and strong spatial continuity. To address this issue, we propose LoVLANet (Localized Vision–Language Attention Network), a novel vision–language segmentation framework that integrates language-driven global semantics with local spatial context. LoVLANet consists of a text encoder, a visual encoder, and a segmentation decoder. Specifically, the text encoder is inherited from RemoteCLIP to preserve domain-adapted vision–language alignment. The visual encoder is built upon a Vision Transformer (ViT). To enhance local dependency modeling, we propose a Neighborhood Key–Key Encoder. It leverages a Gaussian-weighted neighborhood matrix for spatial correlation and uses key–key similarity to emphasize intrinsic semantic similarity over query-driven features, thereby preserving spatial consistency. Finally, the segmentation decoder fuses multi-scale visual features and aligns the image–text representations to generate accurate pixel-level segmentation results. Experiments on RGB remote sensing benchmarks, including LoveDA and GID, show that LoVLANet achieves competitive segmentation performance under the adopted experimental settings, with improved mIoU and clearer boundary delineation in qualitative visualizations. These results suggest the effectiveness of explicitly modeling local neighborhood relationships in VLM-based segmentation for supervised remote sensing scene understanding.

Keywords:

remote sensing; semantic segmentation; vision–language model; Vision Transformer; neighborhood attention; RemoteCLIP; Gaussian weighting; local context modeling

1. Introduction

Remote sensing semantic segmentation is a fundamental task in Earth observation and geospatial information analysis, playing a critical role in applications such as land-cover mapping, urban expansion monitoring, agricultural assessment, and disaster response. Unlike natural images, remote sensing imagery is often characterized by ultra-high spatial resolution, complex spatial structures, pronounced intra-class variability, and high inter-class similarity. Under such conditions, a single pixel is frequently insufficient to represent stable semantic information independently. Instead, accurate semantic discrimination heavily relies on spatially continuous structures formed by surrounding neighborhoods. For instance, linear objects such as roads and rivers depend on long-range connectivity, croplands and grasslands typically exhibit large-scale homogeneous regions, and buildings and bare land are distinguished through local geometric patterns and contextual relationships. From the perspectives of physical imaging mechanisms and land-cover distributions, pixel-level semantic inference in remote sensing is inherently a neighborhood-aware problem. Consequently, segmentation models without explicit neighborhood modeling may suffer from fragmented boundaries, locally unstable predictions, and degraded recognition of small or low-contrast objects [1,2].

In recent years, the rapid development of vision–language models (VLMs) has introduced a new paradigm for semantic segmentation. Representative models such as CLIP [3] align visual and textual embeddings into a shared semantic space via contrastive learning, thereby enabling effective semantic correspondence across diverse scenes. Motivated by this, existing studies have explored language-guided pixel-level prediction, allowing segmentation results to benefit from semantic descriptions in addition to conventional category supervision. In remote sensing, recent efforts have further adapted vision–language pretraining models to satellite and aerial imagery, such as RemoteCLIP [4], which incorporates remote sensing-specific image–text data during pretraining and improves domain-adapted semantic representation for remote sensing imagery.

However, many recent vision–language segmentation methods mainly emphasize global semantic alignment in their architectural design, implicitly assuming that Transformer attention can adequately capture spatial relationships. In high-resolution remote sensing imagery, purely global token interaction may be insufficient to capture fine-grained local spatial dependencies among neighboring pixels, which are crucial for spatial continuity and boundary delineation [2]. As a result, although such models exhibit strong semantic discrimination ability, they can still struggle to maintain spatial consistency and boundary accuracy under complex remote sensing scenes.

Meanwhile, extensive research has acknowledged the importance of local modeling in semantic segmentation and introduced spatial constraints through convolutional operations and context aggregation (e.g., FCN [5], PSPNet [6], DeepLabv3+ [7], and OCRNet [8]), window-based Transformers (e.g., Swin [9]), or local refinement during decoding [1,2]. Nevertheless, in many existing designs, local information is incorporated in an indirect manner, either during feature extraction or post-processing. Convolutional operations rely on fixed receptive fields (often enhanced by dilated convolution for larger context [7]), window-based attention restricts computational scope, and post-processing techniques impose smoothing constraints after prediction. Crucially, such strategies do not explicitly allow neighborhood relationships to participate in the formation of attention weights and, thus, may not fundamentally reshape the semantic decision-making process in attention computation. This limitation becomes more pronounced in vision–language segmentation frameworks, where language-guided global semantics may amplify long-range semantic correlations without adequate spatial constraints, thereby exacerbating prediction fragmentation and local instability.

Based on these observations, we argue that neighborhood modeling in remote sensing semantic segmentation should not be confined to feature fusion or post-processing stages but should be explicitly incorporated into the semantic decision-making process. To this end, we propose LoVLANet, a localized vision–language segmentation framework that integrates RemoteCLIP-based textual semantics, ViT-based visual representations, and Gaussian-weighted key–key neighborhood attention. By allowing local spatial priors to participate in attention formation, LoVLANet aims to improve spatial consistency, boundary delineation, and low-contrast land-cover recognition while preserving the semantic representation capability of vision–language models.

Experiments on the LoveDA and GID datasets, conducted under a supervised training paradigm, show that LoVLANet obtains competitive segmentation results and favorable local-structure preservation under the adopted experimental settings. Ablation studies further verify the critical role of neighborhood-aware attention in enhancing local structural modeling, providing empirical evidence that explicit modeling of pixel neighborhoods is crucial for remote sensing semantic segmentation.

The main contributions of this work are summarized as follows:

We propose LoVLANet, a localized vision–language segmentation framework tailored for remote sensing imagery, and we develop a heterogeneous encoder architecture that leverages the RemoteCLIP text encoder [4] to guide visual feature extraction. Unlike conventional ViT backbones that rely solely on global token interactions, we reformulate the standard self-attention mechanism by introducing the Neighborhood Key–Key Encoder (NKKE), which explicitly emphasizes local spatial dependencies for high-resolution imagery.
We introduce a Gaussian-weighted local-aware attention module within the NKKE. By explicitly encoding pixel-to-neighborhood relationships and utilizing key–key similarity, this module improves spatial consistency and boundary precision while helping to suppress spectral confusion.
Comprehensive experiments on the LoveDA and GID datasets systematically compare LoVLANet with state-of-the-art methods such as UNetFormer [1], DC-Swin [10], D2LS [11], SFA-Net [12], SegMAN [13], and SegNeXt [2]. Results show that LoVLANet achieves competitive segmentation performance under the same fixed experimental protocol.

2. Materials and Methods

2.1. Related Work

2.1.1. Single-Modal Remote Sensing Semantic Segmentation

Semantic segmentation has long been a core task in the remote sensing community, enabling pixel-level understanding for applications such as land-cover classification, urban mapping, and environmental monitoring. Early advances in semantic segmentation were primarily driven by convolutional neural networks (CNNs). The seminal work on fully convolutional networks (FCNs) [5] first introduced an end-to-end fully convolutional architecture, extending image classification networks to dense pixel-wise prediction and laying the foundation for deep learning-based semantic segmentation. However, FCNs rely on simple upsampling operations during decoding, which limits their ability to recover fine spatial details and often leads to blurred boundaries and loss of local structural information.

To address these limitations, U-Net [14] proposed a classical encoder–decoder architecture with skip connections that transfer high-resolution features from shallow layers to the decoding stage. By progressively downsampling to extract multi-scale features and gradually restoring spatial resolution during decoding, U-Net effectively preserves local semantic information while incorporating contextual cues. CNN-based encoder–decoder architectures have since achieved remarkable performance in dense prediction tasks, including medical image analysis and remote sensing semantic segmentation.

Despite their strengths in local feature modeling, CNN-based methods are inherently constrained by fixed receptive fields, which limits their ability to capture global semantic context and long-range dependencies [6,7]. When applied to high-resolution imagery with large-scale variations and complex spatial structures, CNNs often struggle to simultaneously model fine-grained local details and global contextual relationships, thereby restricting further performance improvements.

To overcome these limitations, Transformer architectures were introduced into computer vision. Originally proposed for natural language processing [15], Transformers leverage self-attention mechanisms to efficiently model long-range dependencies within sequences. Vision Transformer (ViT) [16] successfully extended this paradigm to visual tasks by representing images as sequences of patches and modeling global contextual relationships through self-attention. Compared with traditional CNN-based approaches, ViT demonstrates strong capability in global semantic modeling, and has inspired a wide range of Transformer-based semantic segmentation methods [2,17].

Nevertheless, the self-attention mechanism in standard Transformers is inherently content-driven and largely position-agnostic, lacking explicit inductive bias toward local spatial continuity [16,18,19]. This limitation becomes particularly pronounced in high-resolution remote sensing imagery, where insufficient local modeling often results in unstable predictions within local regions and imprecise boundary delineation. Consequently, although Transformers substantially enhance global context modeling, their ability to capture fine-grained local spatial structures in dense prediction tasks remains limited.

In addition, we include DC-Swin, which corresponds to the Swin-style encoder implementation [9] provided in the official UNetFormer/GeoSeg codebase [10], as a backbone-level baseline in our experiments.

2.1.2. Vision–Language Models for Semantic Segmentation

Building upon the success of single-modal visual models, VLMs have introduced a new paradigm for semantic segmentation. Representative models such as CLIP [3] align visual and textual embeddings into a shared semantic space via contrastive learning on large-scale image–text pairs, thereby enhancing cross-modal understanding and semantic consistency between visual features and textual concepts. This paradigm provides semantic segmentation models with richer textual priors and reduces their reliance on purely visual category supervision.

Inspired by CLIP [3], several studies have extended vision–language alignment to semantic segmentation. Methods such as CLIP-Adapter [20], MaskCLIP [21], Lang-Seg [22] and CLIPSeg [23] leverage textual prompts to guide pixel-level prediction, allowing segmentation outputs to respond to semantic descriptions rather than fixed label sets.

These approaches demonstrate promising flexibility under diverse remote sensing scenarios. In the remote sensing domain, recent efforts have further explored adapting vision–language pretraining models to satellite and aerial imagery. RemoteCLIP [4], in particular, incorporates remote sensing-specific image–text data during pretraining, improving semantic alignment for remote sensing imagery. These studies highlight the importance of language priors for remote sensing semantic understanding, offering an effective means to mitigate class distribution shifts and reduce annotation costs.

However, most existing vision–language segmentation methods still adopt standard Transformer-based designs, where visual features are aligned with textual embeddings mainly at global or coarse-grained token levels, while fine-grained pixel-level spatial structures are only implicitly considered. Although locality can be partially introduced by convolutional backbones or refinement modules, explicit neighborhood-aware modeling in the attention formation for cross-modal alignment remains limited. As a consequence, these models may suffer from fragmented predictions, blurred boundaries, and degraded recognition of small or low-contrast objects in high-resolution remote sensing imagery.

Hajimiri et al. [24] also explored neighborhood information for open-vocabulary semantic segmentation by improving CLIP-based patch localization in a training-free setting. Different from this general open-vocabulary segmentation scenario, LoVLANet focuses on supervised RGB optical remote sensing semantic segmentation, where neighborhood modeling is introduced to address spatial continuity, boundary fragmentation, and low-contrast land-cover confusion in high-resolution imagery. In this context, the proposed key–key semantic affinity and Gaussian-weighted spatial prior serve as task-oriented spatial constraints for trainable pixel-level semantic discrimination and local structure preservation.

Overall, the existing single-modal segmentation models and VLM-based methods have improved global context modeling and semantic alignment, but explicit neighborhood-aware modeling in attention formation remains insufficient for dense remote sensing prediction. Therefore, the detailed localized vision–language framework and its formulation are presented in the following Methodology Section.

2.1.3. Attention Mechanisms and Local Context Modeling

Attention mechanisms constitute the core of Transformer-based models [15,16], enabling flexible modeling of dependencies among tokens. While standard self-attention excels at capturing global interactions, it often overlooks local spatial details, particularly in dense prediction scenarios. To address this issue, prior studies have explored local attention mechanisms, window-based Transformers such as Swin Transformer [9], and hybrid convolution–Transformer architectures that introduce convolutional inductive biases or multi-scale conv-attention designs [18,19,25].

In remote sensing semantic segmentation, this locality issue becomes more critical due to ultra-high spatial resolution and complex land-cover structures. Accurate recognition of elongated objects (e.g., roads and rivers), fine-grained boundaries, and small or low-contrast categories often requires explicit modeling of neighborhood continuity. Therefore, recent remote sensing segmentation approaches have paid increasing attention to local structure modeling and boundary preservation, such as UNetFormer [1] and SegNeXt [2], which incorporate locality-enhanced designs to mitigate fragmented predictions.

Nevertheless, localized spatial modeling within vision–language segmentation frameworks, especially for remote sensing imagery, remains under-explored. Most existing VLM-based segmentation methods still primarily rely on content-driven global attention for cross-modal alignment, where neighborhood relationships are not explicitly involved in attention weight computation. In contrast, our work introduces a Gaussian-weighted local-aware attention module that explicitly models interactions between each pixel and its surrounding neighborhood. By embedding this neighborhood-aware mechanism into a multimodal framework, the proposed method effectively bridges global semantic understanding and local spatial consistency.

2.2. Methodology

2.2.1. Overall Framework

We propose LoVLANet, a novel semantic segmentation framework tailored for remote sensing imagery. As illustrated in Figure 1, LoVLANet adopts a dual-stream encoder–decoder architecture to explicitly integrate language-driven global semantics with local spatial context.

First, in the visual stream, an input three-channel RGB remote sensing image $X \in \mathbb{R}^{H \times W \times 3}$ is fed into the Neighborhood Key–Key Encoder. Through a series of specialized encoder blocks, this module captures fine-grained local spatial dependencies and progressively extracts multi-scale hierarchical visual features, denoted as $\{F_{1}, F_{2}, F_{3}, F_{4}\}$. Concurrently, in the textual stream, a set of discrete category prompts $Y = \{y_{1}, y_{2}, \dots, y_{N}\}$ (e.g., “Building”, “Road”) is processed by a Tokenizer and fed into a pretrained RemoteCLIP [4] text encoder. The extracted text representations are then processed by an $L_{2}$ normalization layer to generate domain-adapted dense text embeddings $\hat{T} = \{\hat{t}_{1}, \hat{t}_{2}, \dots, \hat{t}_{N}\}$. Finally, in the decoder, the hierarchical visual features $\{F_{1}, F_{2}, F_{3}, F_{4}\}$ are aggregated via a spatial feature fusion module, which utilizes progressive upsampling and skip connections to restore spatial resolution. The fused representations are $L_{2}$-normalized to form dense visual embeddings $\hat{V} = \{\hat{v}_{1}, \hat{v}_{2}, \dots, \hat{v}_{HW}\}$. In the Feature Alignment module, the visual features $\hat{V}$ and textual sequence $\hat{T}$ undergo matrix multiplication to compute cross-modal similarities. The resulting similarity map is subsequently processed by the Segmentation Head to generate the final pixel-level prediction output.
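The Feature Alignment step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `l2_normalize` and `align`, the tensor sizes, and the use of a plain `argmax` as a stand-in for the Segmentation Head are all hypothetical, and random arrays replace the real decoder features and RemoteCLIP text embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def align(visual, text):
    """Cosine-similarity map between pixel features and class embeddings.

    visual: (H, W, C) fused decoder features (assumed layout)
    text:   (N, C) one embedding per category prompt
    returns (H, W, N) similarity map
    """
    H, W, C = visual.shape
    v_hat = l2_normalize(visual.reshape(H * W, C))  # (HW, C) dense visual embeddings
    t_hat = l2_normalize(text)                      # (N, C) text embeddings
    sim = v_hat @ t_hat.T                           # matrix multiplication, (HW, N)
    return sim.reshape(H, W, -1)

# Toy inputs: random features stand in for real encoder outputs.
rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 5, 8))
text = rng.normal(size=(3, 8))

sim_map = align(visual, text)
pred = sim_map.argmax(axis=-1)  # hypothetical stand-in for the Segmentation Head
```

Because both sides are unit-normalized, every entry of `sim_map` is a cosine similarity in [-1, 1], so the per-pixel scores are comparable across classes.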

2.2.2. Remote Sensing-Oriented Text Encoder

We adopt the text encoder of RemoteCLIP [4]. Compared with CLIP [3], RemoteCLIP provides more domain-adapted semantic representations for remote sensing scenarios. Given a set of discrete category names or descriptive textual prompts $Y = \{y_{1}, y_{2}, \dots, y_{N}\}$, the inputs are first processed by a Tokenizer. The text encoder then maps these prompts into a shared vision–language semantic space, generating discriminative embeddings that serve as semantic anchors. To align with the visual features for dense pixel-level segmentation, these representations are passed through an $L_{2}$-normalization layer, yielding the continuous text embeddings $\hat{T} = \{\hat{t}_{1}, \hat{t}_{2}, \dots, \hat{t}_{N}\}$.

2.2.3. Neighborhood Key–Key Encoder

For the visual encoder, we employ ViT as the backbone. Due to the intrinsic complexity of remote sensing scenes—including strong spatial correlations among land-cover distributions, significant pixel-level dependencies within the same category, and distance-dependent semantic decay across regions—backbone feature representations alone remain insufficient. Specifically, purely visual features struggle to precisely model local spatial dependencies and semantic correlations among neighboring pixels, while the semantic anchors provided by the text encoder require more refined spatial alignment with visual representations. To address these challenges, we replace standard global self-attention with our proposed Neighborhood Key–Key Encoder. By introducing targeted modifications to the attention mechanism, this module strengthens spatial–semantic relationship modeling within visual features while enabling efficient alignment with textual semantic anchors. In LoVLANet, this design serves as a task-oriented spatial–semantic modeling mechanism for RGB optical remote sensing segmentation, aiming to improve local structure preservation, boundary consistency, and low-contrast land-cover recognition.

The Neighborhood Key–Key Encoder is illustrated in the lower-left part of Figure 1. To provide an overall understanding of the Neighborhood Key–Key Encoder, we first summarize the role of each component before presenting the detailed formulation. Given the visual token sequence extracted by the ViT backbone, this encoder aims to generate locality-enhanced visual representations for subsequent spatial feature fusion and vision–language alignment. Specifically, key–key similarity models intrinsic semantic affinity among visual tokens, allowing tokens with similar land-cover characteristics to reinforce each other. The Gaussian-weighted spatial prior introduces a distance-aware neighborhood constraint, assigning larger influence to nearby tokens and reducing noisy interactions from distant regions. The MLP simplification further reduces unnecessary global transformation in the final encoder stage and encourages locality-driven spatial–semantic reasoning. Through these components, the Neighborhood Key–Key Encoder jointly considers semantic relevance and spatial proximity, thereby improving local structure preservation, boundary consistency, and low-contrast land-cover recognition.

Neighborhood Key–Key Attention

To effectively capture local spatial correlations in remote sensing imagery, we redesign the standard self-attention mechanism and introduce three key modifications tailored for semantic segmentation. As illustrated in the dashed box of Figure 1, the Neighborhood Key–Key Attention module eliminates conventional query-driven interactions and constructs a localized, key-centric attention map. First, we reformulate the key–value attention mechanism to emphasize pixel-wise semantic representation, enabling the model to better understand the relationships between individual pixels and their semantic roles. This design facilitates more precise semantic discrimination at the pixel level, which is essential for dense prediction tasks. Second, we introduce a Gaussian-weighted neighborhood adjacency matrix to explicitly encode spatial distance relationships between pixels. Unlike conventional self-attention, where tokens at different spatial locations are treated with equal positional importance, the proposed weighting strategy assigns higher influence to spatially closer pixels. This design embeds a locality-aware inductive bias into the attention computation, allowing the model to more accurately capture spatial continuity and neighborhood dependencies inherent in remote sensing imagery. Third, we remove image-level classification components commonly used in standard Transformer architectures. This modification eliminates unnecessary global classification constraints and ensures that the attention mechanism is fully dedicated to pixel-level semantic reasoning, making it more suitable for semantic segmentation tasks.

The combination of these three modifications enables the proposed neighborhood-aware attention module to explicitly integrate semantic relevance and spatial proximity into the attention computation. As a result, local spatial structures are more effectively modeled, boundary delineation is improved, and semantic alignment between visual features and textual anchors becomes more consistent.

Key–Key Attention

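In standard self-attention, the similarity between tokens is computed via the query–key interaction. Given a sequence of token features $X \in \mathbb{R}^{N \times d}$, linear projections produce the query, key, and value:

$$Q = X W_{Q} \tag{1}$$

$$K = X W_{K} \tag{2}$$

$$V = X W_{V} \tag{3}$$

Equivalently, in fused form,

$$[q, k, v] = Z W^{qkv} \tag{4}$$

where $Z$ denotes the input embeddings (the token features written as $X$ above) and $W^{qkv}$ are learnable projection matrices. With $q_{i}$, $k_{j}$, and $v_{j}$ denoting rows of the projected matrices, the similarity between patch $i$ and patch $j$ is computed as

$$\mathrm{sim}_{ij} = \frac{q_{i} k_{j}^{\top}}{\sqrt{d}} \tag{5}$$

and the attention weights are obtained by normalizing over all patches $j$:

$$A_{ij} = \mathrm{softmax}_{j}(\mathrm{sim}_{ij}) = \frac{\exp(\mathrm{sim}_{ij})}{\sum_{l} \exp(\mathrm{sim}_{il})}, \tag{6}$$

with the final output computed as the attention-weighted sum of values followed by the output projection:

$$\mathrm{SA}(Z)_{i} = \Big( \sum_{j} A_{ij} v_{j} \Big) W^{o} . \tag{7}$$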
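For reference, Equations (1)–(7) amount to a handful of matrix products. The following single-head NumPy sketch uses a row-vector convention and makes the sum over patches explicit; the function names and the random weights are illustrative assumptions rather than anything specified in the text.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Z, W_q, W_k, W_v, W_o):
    """Single-head self-attention over N tokens of width d (rows of Z)."""
    q, k, v = Z @ W_q, Z @ W_k, Z @ W_v   # query, key, value projections
    d = q.shape[-1]
    sim = q @ k.T / np.sqrt(d)            # sim[i, j] = q_i . k_j / sqrt(d)
    A = softmax(sim, axis=-1)             # each row sums to 1 over patches j
    return (A @ v) @ W_o, A               # weighted sum of values, then output projection

# Toy example with random token features and weights.
rng = np.random.default_rng(0)
N, d = 6, 8
Z = rng.normal(size=(N, d))
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) for _ in range(4))
out, A = self_attention(Z, W_q, W_k, W_v, W_o)
```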

Although this formulation has become the standard in modern Vision Transformers, interpreting its internal mechanism from an intuitive perspective remains difficult. From a more explainable viewpoint, the query vector of patch i represents what this patch seeks, the key vector of patch j encodes the information contained in that patch, and the value vector determines what information will eventually be contributed. However, the conventional $Q K^{\top}$ formulation does not explicitly emphasize the intrinsic similarity among patches, which is particularly important in remote sensing imagery where spatial continuity and local homogeneity are common.

To address this limitation, we shift the similarity modeling from a query-driven process to a key-centric design. Instead of computing similarity via $Q K^{\top}$, we directly measure the affinity between key vectors using

$$\mathrm{sim}_{ij} = \frac{k_{i} k_{j}^{\top}}{\sqrt{d}}, \tag{8}$$

where d is the feature dimension. This formulation captures the inherent semantic correlation between patches: patches with similar key representations (e.g., sharing similar textures or land-cover types) exhibit higher similarity and, therefore, exert stronger influence on the attention distribution. This property is particularly desirable for remote sensing tasks, where objects usually appear in spatially coherent clusters.

Based on this similarity metric, we introduce a key–key attention mechanism, defined as

$$\mathrm{Attention}_{KK}(K, V) = \mathrm{Softmax}\left( \frac{K K^{\top}}{\sqrt{d}} \right) V . \tag{9}$$

By constructing the attention affinity matrix purely from key vectors, the proposed design enhances the interaction between semantically related patches and strengthens local contextual awareness. Consequently, the model better captures region-level coherence and produces more locality-aware feature representations, which is crucial for dense prediction tasks such as remote sensing semantic segmentation.
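A minimal NumPy sketch of the key–key formulation follows, assuming a single head and random key and value matrices in place of learned projections; `key_key_attention` is a hypothetical name used only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def key_key_attention(K, V):
    """Attention whose affinity comes from keys only: Softmax(K K^T / sqrt(d)) V."""
    d = K.shape[-1]
    affinity = K @ K.T / np.sqrt(d)   # affinity[i, j] = k_i . k_j / sqrt(d)
    return softmax(affinity, axis=-1) @ V, affinity

# Toy example: 6 tokens with 8-dimensional keys and values.
rng = np.random.default_rng(0)
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
out, affinity = key_key_attention(K, V)
```

One structural consequence is visible directly in the code: the pre-softmax affinity matrix is symmetric, and its diagonal holds the non-negative values ‖k_i‖²/√d, neither of which is guaranteed for a query–key product with separate projections.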

Gaussian-Weighted Neighborhood Matrix

Recognizing the importance of local neighborhoods for each patch, we further reinforce spatial locality by introducing a lightweight spatial attention mechanism. This mechanism assigns position-dependent weights to neighboring patches, enabling surrounding regions to contribute unequally based on their spatial proximity. In particular, we adopt a Gaussian kernel to suppress long-distance interactions, thereby explicitly favoring local spatial consistency. The Gaussian distribution is defined as

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right) . \tag{10}$$

Removing the normalization factor yields

$$K(x, \mu, \Sigma) = \exp\left( -\frac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right) . \tag{11}$$

Assuming an isotropic covariance $\Sigma = \sigma^{2} I$, the kernel reaches its maximum when $x = \mu$ and decreases monotonically as the Euclidean distance between $x$ and $\mu$ increases. We further define a spatial weighting function $\phi((i, j))$ that treats pixel coordinates as inputs to the kernel. After normalization and embedding into an $H \times W$ matrix, the kernel can be written as

$$\phi(x; \mu, \sigma) = \exp\left( -\frac{\| x - \mu \|^{2}}{2\sigma^{2}} \right) . \tag{12}$$

Substituting pixel coordinates, let $x = p_{i}$ and $\mu = p_{j}$. The squared Euclidean distance then becomes $\| p_{i} - p_{j} \|^{2}$, resulting in the following pairwise Gaussian weighting matrix:

$$G(i, j) = \exp\left( -\frac{\| p_{i} - p_{j} \|^{2}}{2\sigma^{2}} \right) . \tag{13}$$

Here, $p_{i}$ and $p_{j}$ denote the 2D spatial coordinates of pixels i and j, and $\sigma$ controls the kernel width. As the spatial distance increases, the exponent becomes more negative, causing the kernel value to decay rapidly. This behavior reflects the fact that Gaussian distributions concentrate probability mass around the center.
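The pairwise Gaussian weighting matrix can be built directly from grid coordinates. The sketch below assumes tokens laid out on an H × W grid in row-major order with integer coordinates; the function name `gaussian_prior` and the 3 × 3 example are illustrative choices, and the dense (HW) × (HW) construction is meant for clarity rather than memory efficiency.

```python
import numpy as np

def gaussian_prior(H, W, sigma):
    """Pairwise weights G[i, j] = exp(-||p_i - p_j||^2 / (2 sigma^2)) on an H x W grid."""
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)  # (HW, 2), row-major
    diff = coords[:, None, :] - coords[None, :, :]                     # (HW, HW, 2)
    sq_dist = (diff ** 2).sum(axis=-1)                                 # squared Euclidean distance
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

G = gaussian_prior(3, 3, sigma=1.0)
# G[0, 0] is the self-weight (distance 0), G[0, 1] a direct horizontal neighbor
# (squared distance 1), and G[0, 8] the opposite corner (squared distance 8).
```

Every diagonal entry equals 1, the matrix is symmetric, and all entries lie in (0, 1], so a smaller σ concentrates the weight more tightly around each token.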

To further justify this, let $r = \| p_{i} - p_{j} \| \geq 0$, and rewrite the kernel as a radial function

$$g(r) = \exp\left( -\frac{r^{2}}{2\sigma^{2}} \right) . \tag{14}$$

Taking the derivative with respect to r yields

$$\frac{dg}{dr} = \exp\left( -\frac{r^{2}}{2\sigma^{2}} \right) \cdot \left( -\frac{r}{\sigma^{2}} \right) = -\frac{r}{\sigma^{2}} g(r) . \tag{15}$$

For any $r > 0$, we have $\frac{dg}{dr} < 0$, indicating that $g(r)$ is strictly monotonically decreasing. When $r = 0$, $\frac{dg}{dr} = 0$, corresponding to the maximum value of the function. This establishes the mathematical property that the Gaussian kernel assigns smaller weights to pixels farther from the center.
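The derivative and the monotonic decay can be checked numerically. The short sketch below compares the closed-form derivative with a finite-difference estimate; the value σ = 2 and the sampling grid are arbitrary choices made for illustration.

```python
import numpy as np

sigma = 2.0
r = np.linspace(0.0, 10.0, 1001)          # radial distances, step 0.01

g = np.exp(-r**2 / (2.0 * sigma**2))      # radial Gaussian kernel g(r)
analytic = -(r / sigma**2) * g            # closed-form derivative dg/dr
numeric = np.gradient(g, r)               # finite-difference estimate of dg/dr

# Largest disagreement between the two on interior points.
max_err = np.max(np.abs(numeric[1:-1] - analytic[1:-1]))
# True if g strictly decreases from one sample to the next.
strictly_decreasing = bool(np.all(np.diff(g) < 0))
```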

Finally, we define the neighborhood-aware attention mechanism as

$$\mathrm{Attention}_{NA}(K, V) = \mathrm{Softmax}\left( \frac{K K^{\top}}{\sqrt{d}} + G \right) V, \tag{16}$$

where G is the Gaussian-weighted spatial prior matrix.
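To see how the spatial prior acts inside the softmax, the following sketch combines the key–key affinity with a Gaussian prior on a small grid. It is an illustration under assumptions: a single head, a row-major 4 × 4 token grid, σ = 1, and deliberately identical keys so that the semantic term is constant and only the prior shapes the attention. The helper names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grid_gaussian(H, W, sigma):
    """Gaussian prior between all token pairs of a row-major H x W grid."""
    rows, cols = np.divmod(np.arange(H * W), W)
    p = np.stack([rows, cols], axis=1).astype(float)
    sq = ((p[:, None] - p[None]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def neighborhood_attention(K, V, G):
    """Softmax(K K^T / sqrt(d) + G) V: prior added to the logits before the softmax."""
    d = K.shape[-1]
    A = softmax(K @ K.T / np.sqrt(d) + G, axis=-1)
    return A @ V, A

H, W, d = 4, 4, 8
G = grid_gaussian(H, W, sigma=1.0)
K_same = np.zeros((H * W, d))   # identical keys: the semantic affinity is constant
V = np.eye(H * W)               # identity values, so the output equals the weights
out, A = neighborhood_attention(K_same, V, G)
```

With identical keys, token 0 attends most to itself, less to its direct neighbor, and least to the far corner, which is the distance ordering the prior is meant to induce. Since the entries of G lie in (0, 1], the prior in this unscaled form shifts each logit by at most 1; whether G is rescaled in practice is not stated in the text.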

This design directly integrates the Gaussian spatial prior with the semantic similarity matrix derived from $K K^{\top}$, thereby enhancing both spatial continuity and local context modeling. Compared with conventional fusion strategies, this additive design, which injects the prior at the similarity-matrix level before the softmax, exhibits the following advantages:

Better balancing between semantic relevance and spatial closeness. Pure semantic affinity is easily dominated by distant but semantically similar regions, while additive fusion introduces Gaussian priors directly at the similarity matrix level. This ensures that attention is simultaneously guided by pixel-level spatial proximity and semantic relationships.
More interpretable spatial weighting. By explicitly modeling distance decay with a Gaussian kernel, the fusion process becomes spatially interpretable. Nearby regions naturally obtain larger weights, while distant pixels are adaptively suppressed.
More constrained attention distribution. High-resolution remote sensing images contain complex textures. The Gaussian prior constrains the attention distribution by reducing excessive concentration on distant patches, resulting in more locality-aware feature aggregation for dense prediction.

Removing the Feed-Forward Network

Precise spatial localization is critical for semantic segmentation, yet it is often under-emphasized in Vision Transformers (ViTs). Unlike convolutional neural networks (CNNs), which naturally encode local inductive bias through spatially constrained kernels, most components in ViTs operate in a global manner and do not explicitly account for local patch relationships.

In standard ViT architectures, the feed-forward network (FFN), following self-attention, is primarily designed for image-level recognition tasks, where global feature transformation is beneficial. However, such global transformations are not specifically tailored for dense prediction and may introduce redundant global mixing that is less relevant for pixel-wise semantic reasoning. Therefore, we remove the FFN in the final encoder block of the visual backbone, retaining only the attention-based transformation.

Since the self-attention mechanism in LoVLANet has been explicitly redesigned to incorporate local spatial constraints (Section Key–Key Attention), the contribution of the attention module becomes central to spatial–semantic modeling. In this setting, retaining additional FFN layers or strong skip connections may bias the network toward earlier encoder outputs, thereby weakening the effect of the proposed neighborhood-aware attention. By removing the segmentation-irrelevant FFN component, the model is encouraged to focus on locality-driven similarity modeling rather than redundant global transformations.

This simplification leads to a more compact architecture that emphasizes local semantic consistency, reduces interference from unnecessary global operations, and is better suited for high-resolution remote sensing semantic segmentation.

In summary, the neighborhood-aware attention module enhances traditional self-attention by (1) replacing Q K^{⊤} with K K^{⊤}, (2) incorporating Gaussian-based spatial priors, and (3) removing the FFN (MLP) of the final encoder block to better preserve local semantic consistency.

2.2.4. Decoder

The decoder is designed to bridge the resolution gap among hierarchical visual features and align the fused spatial representation with cross-modal textual prototypes. As illustrated in Figure 1, it mainly consists of two components: the spatial feature fusion (SFF) module and the cross-modal feature alignment (CMFA) module. First, the SFF module progressively fuses multi-level visual features from the encoder to generate a unified high-resolution representation. Then, the CMFA module aligns the fused visual features with textual semantic prototypes in the shared vision–language space, producing the final pixel-level segmentation result.

SFF module: The decoder takes the hierarchical features {F_{1}, F_{2}, F_{3}, F_{4}} from the visual encoder as input. To effectively aggregate contextual information across different scales, we adopt a top-down fusion strategy. Specifically, deeper features (e.g., F_{4}) are first upsampled and then combined with the adjacent shallower features (e.g., F_{3}) through element-wise addition. This process is repeated progressively via successive upsampling operations and residual connections, gradually restoring the spatial resolution and yielding a unified high-resolution feature map P_{1}.
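The top-down fusion described for the SFF module can be sketched as below. Several details are assumptions for illustration only, since the text does not specify them: all four feature maps are taken to share the same channel count, each level is taken to have half the spatial resolution of the next shallower one, nearest-neighbor upsampling stands in for the upsampling operator, and any lateral projections or residual blocks are omitted.

```python
import numpy as np


def upsample2x(x: np.ndarray) -> np.ndarray:
    """Nearest-neighbor 2x upsampling of a (C, H, W) array (assumed operator)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def top_down_fuse(features):
    """Fuse [F1, F2, F3, F4] (shallow to deep) by repeated upsample-and-add."""
    fused = features[-1]  # start from the deepest map, F4
    for shallower in reversed(features[:-1]):
        fused = upsample2x(fused) + shallower  # element-wise addition
    return fused


# Hypothetical pyramid: 4 channels, spatial sizes 32, 16, 8, 4.
C = 4
feats = [np.ones((C, 32 // 2 ** k, 32 // 2 ** k)) for k in range(4)]
P1 = top_down_fuse(feats)
print(P1.shape)  # (4, 32, 32)
```

With all-ones inputs, every entry of the fused map equals the number of levels, which makes it easy to confirm that each level contributes exactly once.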

CMFA module: To perform prediction in the shared vision–language space, the fused feature map P_{1} is first subjected to L_{2} normalization and then flattened into a sequence of dense visual embeddings \hat{V} = {{\hat{v}}_{1}, {\hat{v}}_{2}, \dots, {\hat{v}}_{H W}}. Meanwhile, the normalized text embeddings from the textual branch are denoted as \hat{T} = {{\hat{t}}_{1}, {\hat{t}}_{2}, \dots, {\hat{t}}_{N}}. In the CMFA module, we compute the dot-product similarity between visual embeddings and textual prototypes, and obtain the dense logit map S as

S = τ \cdot (\hat{V} \otimes {\hat{T}}^{⊤}),

(17)

where τ is a learnable temperature scaling parameter. Finally, the logit map S \in R^{H W \times N} is reshaped into the spatial layout and upsampled via bilinear interpolation to match the original image size, generating the final pixel-level semantic segmentation result.
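A minimal sketch of the logit computation in Equation (17) is given below. The array shapes, the function names, and the fixed value τ = 10.0 are assumptions for illustration (in the paper τ is learnable), and the final bilinear upsampling step is left out.

```python
import numpy as np


def l2_normalize(x: np.ndarray, axis: int = -1, eps: float = 1e-12) -> np.ndarray:
    """Scale vectors along `axis` to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def cmfa_logits(P1: np.ndarray, T: np.ndarray, tau: float) -> np.ndarray:
    """Dense logits S = tau * (V_hat @ T_hat^T), reshaped to (H, W, N).

    P1: (D, H, W) fused visual features; T: (N, D) class text embeddings.
    """
    D, H, W = P1.shape
    V_hat = l2_normalize(P1.reshape(D, H * W).T)  # (HW, D) pixel embeddings
    T_hat = l2_normalize(T)                       # (N, D) text prototypes
    S = tau * (V_hat @ T_hat.T)                   # (HW, N) scaled cosine similarity
    return S.reshape(H, W, -1)


rng = np.random.default_rng(0)
D, H, W, N = 16, 8, 8, 6  # hypothetical sizes
T = rng.standard_normal((N, D))
P1 = rng.standard_normal((D, H, W))
P1[:, 0, 0] = T[2]  # make pixel (0, 0) match class 2 exactly
tau = 10.0          # illustrative fixed temperature
S = cmfa_logits(P1, T, tau)
pred = S.argmax(axis=-1)  # per-pixel class index
print(S.shape, pred[0, 0])
```

Because both sides are unit-normalized, every logit lies in the interval [-τ, τ], and a pixel whose feature coincides with a text prototype receives the maximum logit τ for that class.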

2.2.5. Training Objective

The overall training objective of LoVLANet combines a segmentation loss with an optional vision–language alignment loss. For semantic segmentation, we employ the standard pixel-wise cross-entropy loss:

L_{seg} = - \frac{1}{N} \sum_{i = 1}^{N} \sum_{c = 1}^{C} I (y_{i} = c) log p_{i} (c),

(18)

where N denotes the number of valid pixels involved in loss computation for a single image (excluding invalid or ignored background pixels; not to be confused with the number of text prototypes N in Equation (17)); C is the number of predefined semantic categories in the training set; y_{i} represents the ground-truth semantic label of the i-th pixel; I (\cdot) is the indicator function that equals 1 when the condition holds and 0 otherwise; and p_{i} (c) denotes the predicted probability of pixel i belonging to class c.

The predicted probability p_{i} (c) is obtained by computing the cosine similarity between the visual embedding of pixel i and the corresponding class text embedding, followed by temperature scaling and Softmax normalization. This formulation naturally aligns pixel-level predictions with language-guided semantic representations.
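The segmentation loss in Equation (18), with ignored pixels excluded from the average, can be sketched as follows. The function name and the ignore value 255 are assumptions for this example; the logits are taken to be the temperature-scaled similarities described above.

```python
import numpy as np


def masked_cross_entropy(logits: np.ndarray, labels: np.ndarray,
                         ignore_index: int = 255) -> float:
    """Mean cross-entropy over valid pixels only.

    logits: (P, C) per-pixel class scores; labels: (P,) integer class ids.
    Pixels labeled `ignore_index` (a hypothetical value) are skipped.
    """
    valid = labels != ignore_index
    z = logits[valid]
    y = labels[valid]
    # Log-softmax over classes, computed stably.
    z = z - z.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # The indicator I(y_i = c) selects the log-probability of the true class.
    return float(-log_p[np.arange(y.size), y].mean())


# Uniform scores over 6 classes; the fourth pixel is ignored.
logits = np.zeros((5, 6))
labels = np.array([0, 1, 2, 255, 5])
loss_uniform = masked_cross_entropy(logits, labels)
print(loss_uniform)  # equals log(6) for uniform predictions
```

This sketch assumes at least one valid pixel per image; an all-ignored input would need separate handling.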

To further enhance semantic consistency between visual features and textual embeddings, we introduce a vision–language contrastive alignment loss. Specifically, given the visual embedding f_{v} and the text embedding f_{t}, the alignment loss is defined as

L_{align} = - log \frac{exp (sim (f_{v}, f_{t}) / τ)}{\sum_{t^{'}} exp (sim (f_{v}, f_{t^{'}}) / τ)},

(19)

where sim (\cdot, \cdot) denotes cosine similarity, τ is a temperature parameter, and t^{'} indicates negative text embeddings.
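A small sketch of the alignment loss in Equation (19) follows. It assumes that the denominator sums over all candidate text embeddings, including the matched one, as in the usual contrastive formulation; the function name, the `positive` index argument, and the default τ = 0.07 are illustrative assumptions rather than values given in the paper.

```python
import numpy as np


def alignment_loss(f_v: np.ndarray, text_embs: np.ndarray,
                   positive: int, tau: float = 0.07) -> float:
    """-log of the softmax probability assigned to the matched text embedding.

    f_v: (D,) visual embedding; text_embs: (N, D) candidate text embeddings;
    positive: index of the matched text embedding.
    """
    f_v = f_v / np.linalg.norm(f_v)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = (t @ f_v) / tau   # cosine similarity divided by the temperature
    sims = sims - sims.max() # stabilize the log-sum-exp
    log_prob = sims - np.log(np.exp(sims).sum())
    return float(-log_prob[positive])


# Three orthogonal text embeddings; the visual embedding matches the first.
text_embs = np.eye(3)
f_v = np.array([1.0, 0.0, 0.0])
loss_match = alignment_loss(f_v, text_embs, positive=0, tau=1.0)
loss_mismatch = alignment_loss(f_v, text_embs, positive=1, tau=1.0)
print(loss_match < loss_mismatch)  # True
```

The total objective in Equation (20) would then add this term, weighted by λ, to the segmentation loss.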

The final training objective is formulated as a weighted combination of the two loss terms:

L = L_{seg} + λ L_{align},

(20)

where λ is a weighting coefficient that balances pixel-wise segmentation accuracy and cross-modal semantic alignment. In all experiments, λ is set to 1 unless otherwise specified.

3. Results

In this section, we conduct a comprehensive evaluation of the proposed LoVLANet framework on two widely used remote sensing semantic segmentation benchmarks, namely LoveDA and GID. We first introduce the datasets and experimental settings. Then, ablation studies are performed to verify the effectiveness of individual components, including the proposed KK attention model, Gaussian local prior, and the simplified MLP design. Finally, LoVLANet is compared with state-of-the-art (SOTA) methods to analyze its performance under the adopted experimental protocol.

3.1. Datasets

In this study, all experiments use RGB remote sensing images as the model input.

3.1.1. LoveDA

The LoveDA [26] dataset is collected from 18 administrative regions across three cities in China, namely, Nanjing, Changzhou, and Wuhan. It covers two representative scenarios (urban and rural), with an overall area of approximately 536.15 km². LoveDA contains 5987 high-resolution RGB images with a spatial resolution of 0.3 m, each of size 1024 \times 1024 pixels.

LoveDA provides large-scale pixel-level annotations over six land-cover categories (building, road, water, barren land, forest, and agricultural land), with 166,768 annotated instances in total. The dataset is challenging due to strong intra-class variation and inter-class similarity across urban and rural scenes, particularly for boundary-sensitive classes (e.g., road) and spectrally similar categories (e.g., forest vs. agricultural land).

Following the official split, 2522 images are used for training and 1669 images for validation. All quantitative comparisons and qualitative visualizations are conducted on the validation split. In addition, pixels labeled as no-data/ignore regions are excluded from both loss computation and metric evaluation, and we report the performance on the six foreground categories. The official test set is not used since its evaluation relies on the online server and differs in background/ignore-label handling.

3.1.2. GID

The GID-5 [27] dataset is derived from imagery acquired by the GF-2 satellite. By fusing panchromatic and multispectral bands, the dataset achieves a spatial resolution of 4 m, with each image measuring approximately 6800 \times 7200 pixels. The dataset spans more than 60 cities across China, covering a total area of about 75,900 km². All images are free of cloud contamination and exhibit consistent imaging quality, encompassing diverse scenes such as urban–rural mixed areas, natural vegetation, and water bodies. The spectral and textural characteristics closely reflect real-world remote sensing applications.

In this work, we adopt the widely used five-class version of GID. The dataset contains 150 large-scale images with pixel-level annotations and follows a train–validation split strategy, as no official independent test set is provided. Specifically, 120 images are used for training to optimize model parameters, while the remaining 30 images serve as the validation set for hyperparameter tuning and model selection. To further evaluate performance on held-out images, six images are split from the validation set and used as a test set.

The five land-cover categories in GID are built-up, farmland, forest, meadow, and water. These classes exhibit diverse spatial characteristics: built-up areas have complex boundaries and concentrated distributions; farmland forms large continuous regions but is spectrally similar to forest; forest regions show strong texture continuity; meadow is sparsely distributed with a low pixel proportion; and water bodies exhibit relatively homogeneous spectral signatures with clear boundaries.

3.2. Evaluation Metrics

We adopt four standard metrics for evaluating semantic segmentation: mean Intersection over Union (mIoU), Overall Accuracy (OA), mean F1-score (mF1), and Cohen’s Kappa. mIoU and mF1 are averaged over all categories. OA measures the proportion of correctly classified pixels among all pixels. Kappa measures agreement corrected for chance. These metrics are computed pixel-wise from the confusion matrix and are defined as follows:

IoU_{i} = \frac{x_{i i}}{\sum_{j = 0}^{C - 1} x_{i j} + \sum_{j = 0}^{C - 1} x_{j i} - x_{i i}}

(21)

mIoU = \frac{1}{C} \sum_{i = 0}^{C - 1} IoU_{i}

(22)

OA = \frac{\sum_{i = 0}^{C - 1} x_{i i}}{\sum_{i = 0}^{C - 1} \sum_{j = 0}^{C - 1} x_{i j}}

(23)

p_{e} = \frac{\sum_{i = 0}^{C - 1} (\sum_{j = 0}^{C - 1} x_{i j}) (\sum_{j = 0}^{C - 1} x_{j i})}{{(\sum_{i = 0}^{C - 1} \sum_{j = 0}^{C - 1} x_{i j})}^{2}}

(24)

Kappa = \frac{OA - p_{e}}{1 - p_{e}}

(25)

where C is the total number of classes, and x_{i j} denotes the number of pixels of class i that are predicted to be class j.
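The metric definitions above can be checked with a short sketch that computes them from a confusion matrix whose entry (i, j) counts pixels of true class i predicted as class j. The function name and the toy 2 × 2 matrix are assumptions for this example. The per-class F1 formula, 2 x_{ii} divided by the sum of row i and column i, is the standard definition and is assumed here because the text names mF1 without giving its equation.

```python
import numpy as np


def segmentation_metrics(conf: np.ndarray) -> dict:
    """Compute IoU, mIoU, mF1, OA, and Kappa from a (C, C) confusion matrix."""
    conf = conf.astype(np.float64)
    tp = np.diag(conf)       # x_ii: correctly classified pixels per class
    row = conf.sum(axis=1)   # sum_j x_ij: pixels whose true class is i
    col = conf.sum(axis=0)   # sum_j x_ji: pixels predicted as class i
    total = conf.sum()
    iou = tp / (row + col - tp)
    f1 = 2.0 * tp / (row + col)           # assumed standard per-class F1
    oa = tp.sum() / total
    pe = (row * col).sum() / total ** 2   # chance agreement
    kappa = (oa - pe) / (1.0 - pe)
    return {"IoU": iou, "mIoU": iou.mean(), "mF1": f1.mean(),
            "OA": oa, "Kappa": kappa}


# Toy two-class confusion matrix (rows: ground truth, columns: prediction).
conf = np.array([[8, 2],
                 [1, 9]])
m = segmentation_metrics(conf)
print(m["mIoU"], m["OA"], m["Kappa"])
```

For this toy matrix, OA is 0.85, the chance agreement is 0.5, and Kappa is therefore 0.7. A class that never appears in either the labels or the predictions would produce a division by zero and needs to be masked out in practice.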

3.3. Implementation Details

All experiments are implemented in PyTorch 1.13.1 and conducted on an NVIDIA RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA). To ensure a fair comparison, we follow the standard training and data augmentation protocols widely adopted in remote sensing semantic segmentation. LoVLANet adopts a heterogeneous vision–language backbone, where the text encoder is inherited from RemoteCLIP [4], and the visual encoder is instantiated as a ViT-B/32 model implemented with the timm library (version 1.0.15). The proposed neighborhood-aware attention module is integrated into the visual encoder, aiming to enhance local pixel–neighborhood dependency modeling. During training, the initial learning rate is 4 \times 10^{- 4} with a weight decay of 0.01. We use a batch size of 2 and train the model for 60 epochs. The learning rate is decayed using a polynomial schedule:

lr (e) = {lr}_{0} {(1 - \frac{e}{E})}^{0.9},

(26)

where e denotes the current epoch and E is the total number of epochs.
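The schedule in Equation (26) can be written directly as a function of the epoch index. The defaults below use the reported settings (initial learning rate 4 × 10^{-4}, 60 epochs, power 0.9); the function name `poly_lr` is an assumption for this sketch.

```python
def poly_lr(epoch: int, total_epochs: int = 60,
            base_lr: float = 4e-4, power: float = 0.9) -> float:
    """Polynomial decay: lr(e) = lr_0 * (1 - e / E) ** power."""
    return base_lr * (1.0 - epoch / total_epochs) ** power


# Learning rate at the start, midpoint, and end of training.
for e in (0, 30, 60):
    print(e, poly_lr(e))
```

The rate starts at the initial value and reaches zero at the final epoch, decaying slightly faster than linearly early on because the exponent is below 1.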

For consistency evaluation, Cohen’s Kappa is computed on foreground classes only. Background pixels are excluded from both boundary extraction and metric computation for the LoveDA and GID datasets.

3.4. Performance Comparison

For both LoveDA and GID datasets, we adopt a foreground-only segmentation setting, where background regions are not explicitly predicted and are excluded from evaluation and visualization. We benchmark the proposed LoVLANet framework against six representative state-of-the-art semantic segmentation methods, namely UNetFormer [1], DC-Swin [10], D2LS [11], SFA-Net [12], SegMAN [13], and SegNeXt [2]. These comparison methods cover different local and contextual modeling strategies for remote sensing semantic segmentation, including CNN–Transformer hybrid feature fusion, Swin/window-based attention, dynamic dictionary-based semantic representation, state-space and local attention mechanisms, and convolutional attention designs. Therefore, they provide representative references for evaluating the effectiveness of LoVLANet against existing frameworks that also attempt to enhance spatial context or local structure modeling. All compared methods are evaluated under the same fixed training and evaluation protocol, including identical data splits, input resolutions, data augmentation strategies, and evaluation metrics. Under this protocol, LoVLANet achieves slightly higher mIoU than the closest competing method on LoveDA and shows more evident improvements in complementary metrics and qualitative local-structure preservation.

Unless otherwise stated, the official implementations and recommended configurations provided by the original authors are adopted. The quantitative comparison results on the LoveDA dataset are reported in Table 1, while the corresponding results on the GID dataset are summarized in Table 2.

3.4.1. Performance Comparison on the LoveDA Dataset

Quantitative comparisons on the LoveDA dataset are summarized in Table 1.

As shown in Table 1, LoVLANet obtains the highest mIoU, mF1, OA, and Kappa on the LoveDA validation set under the adopted fixed protocol. The modest mIoU gain over D2LS [11] is accompanied by favorable mF1, OA, and Kappa, and LoVLANet obtains the best IoU for road, water, and agricultural land. These results indicate that the proposed method provides a competitive performance trend and improves the preservation of spatially continuous land-cover structures.

Figure 2 shows that LoVLANet produces more spatially coherent predictions with clearer boundaries than competing methods, especially for roads, water bodies, and agricultural land. These qualitative observations are consistent with the quantitative improvements reported in Table 1.

Figure 3 further provides a zoomed-in comparison of local spatial structures. As highlighted by the red boxes, competing methods often break road continuity under complex backgrounds, while LoVLANet better restores connected road structures and suppresses background noise by combining Gaussian-weighted spatial priors with key–key semantic affinity. This demonstrates the effectiveness of localized vision–language reasoning for preserving local spatial structures.

3.4.2. Performance Comparison on the GID Dataset

Table 2 reports the quantitative results on the GID dataset.

As shown in Table 2, LoVLANet obtains the highest results across all overall metrics and category-level IoUs on the GID dataset. Compared with the strongest competing method, the mIoU improves from 89.79% to 92.49%, with particularly clear gains in the meadow category and improvements in the built-up, farmland, forest, and water categories. These results indicate that LoVLANet improves both average segmentation accuracy and category-level performance under the adopted experimental setting.

For the GID dataset, only foreground semantic classes are visualized, consistent with the evaluation protocol. As shown in Figure 4, LoVLANet reduces fragmented predictions and preserves more continuous boundaries in large-scale scenes, especially for linear structures and densely built-up regions. These visual results are consistent with the improvements in OA and Kappa reported in Table 2.

Figure 5 further provides zoomed-in examples of challenging spatial structures. DC-Swin produces an anomalous region within a contiguous water body, and SegNeXt fails to capture thin diagonal built-up structures, whereas LoVLANet suppresses isolated noise and recovers more precise local shapes. This supports the effectiveness of neighborhood-aware attention in preserving regional homogeneity and structural continuity.

3.5. Ablation Studies

We conduct extensive ablation studies to investigate the contribution of each proposed component, including KK attention, Gaussian spatial prior (Gauss), and the reduced MLP (ReduceMLP). Unless otherwise specified, a ViT backbone with a RemoteCLIP [4] text encoder is used as the baseline, which achieves an mIoU of 66.25% on the LoveDA dataset and 89.53% on the GID dataset.

3.5.1. Results on the LoveDA Dataset

The ablation results on the LoveDA dataset are summarized in Table 3.

As shown in Table 3, the full model achieves the best mIoU of 69.77% and the best mF1 of 81.50%, demonstrating the overall effectiveness of combining KK attention, the Gaussian spatial prior, and ReduceMLP. It also obtains the highest IoU for the barren land category, indicating improved recognition of difficult and low-contrast land-cover regions. However, the full model does not achieve the best value for every individual category or overall metric. For example, KK attention alone performs best in the water and forest categories, while the Gauss + ReduceMLP setting obtains slightly higher OA and Kappa. This suggests that individual modules may favor specific land-cover patterns, whereas the full model provides a more balanced trade-off between category-level accuracy and mean segmentation performance.

The category-level results also reveal a limitation of using KK attention alone. This KK-only setting should be interpreted as a diagnostic ablation rather than the final behavior of the proposed model. KK attention improves water from 67.86% to 81.75% and forest from 57.05% to 62.21%, because these categories usually contain more homogeneous regions where neighboring tokens tend to share similar semantic responses. However, the building IoU decreases from 74.95% to 66.70%. This may be because buildings contain sharper boundaries, smaller object instances, and more complex geometric structures, where relying only on key–key semantic affinity may over-smooth local responses and weaken fine structural details. This category-specific trade-off motivates the combination of KK attention with the Gaussian spatial prior and ReduceMLP. After combining KK attention with ReduceMLP, the building IoU increases to 78.13%, showing that the reduced projection complexity helps recover structured object details. The full model further achieves the best mIoU and mF1, indicating that the proposed modules compensate for each other and provide a more balanced performance across categories.

3.5.2. Sensitivity Analysis of the Gaussian Kernel Width

To study the influence of the Gaussian kernel width, we conduct a sensitivity analysis of σ on the LoveDA validation set. In this work, σ is treated as a fixed hyperparameter rather than a learnable parameter. Since the Gaussian prior is constructed on the ViT token grid instead of the original geospatial coordinate system, the same σ value is used for both LoveDA and GID to maintain a consistent neighborhood range in feature space. As shown in Table 4, the performance varies only moderately under different σ values. The best result is achieved when σ = 2.0, with 69.77% mIoU, 81.50% mF1, and 84.55% OA. Therefore, σ is fixed to 2.0 in all experiments.

3.5.3. Results on the GID Dataset

The ablation results on the GID dataset are reported in Table 5.

As shown in Table 5, the full model achieves the best mIoU of 92.49% and the best mF1 of 96.01%, both tied with the strongest partial configurations. More importantly, it obtains the highest IoU for farmland, forest, and water, showing that the complete design improves category-level segmentation for several major land-cover types. Although ReduceMLP alone gives slightly higher OA and Kappa, the differences are marginal, and the full model provides stronger mean performance and more balanced category-level results. These observations indicate that KK attention, the Gaussian spatial prior, and ReduceMLP are complementary rather than uniformly optimal for every single metric.

3.5.4. Computational Cost Analysis

To address the practical efficiency of the proposed modules, we further report the computational cost of representative ablation variants under a unified inference setting. All measurements are conducted using a single 1024 \times 1024 RGB input on the same GPU. As shown in Table 6, KK attention and ReduceMLP reduce the parameter count and inference cost compared with the baseline. The full LoVLANet model has 155.16 M parameters and requires 434.64 G FLOPs, with an inference time of 46.17 ms (21.66 FPS) and 2.94 GB peak memory. Compared with the baseline, the full model reduces the parameters from 166.97 M to 155.16 M, FLOPs from 446.73 G to 434.64 G, inference time from 49.44 ms to 46.17 ms, and memory usage from 3.20 GB to 2.94 GB. Although self-attention has quadratic complexity with respect to the number of tokens, the proposed modifications do not increase the token resolution or introduce additional dense attention branches. Therefore, under the evaluated 1024 \times 1024 input setting, the proposed design does not add extra computational burden beyond the baseline attention structure.
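The relative savings implied by the figures quoted above can be worked out with a few lines of arithmetic. The dictionary keys and the helper name are assumptions for this sketch, and the FPS conversion assumes one image per forward pass, which matches the single-input measurement setting.

```python
# Figures quoted in the text for the baseline and the full model.
baseline = {"params_M": 166.97, "flops_G": 446.73, "latency_ms": 49.44, "memory_GB": 3.20}
full = {"params_M": 155.16, "flops_G": 434.64, "latency_ms": 46.17, "memory_GB": 2.94}


def relative_reduction(before: float, after: float) -> float:
    """Percentage reduction from `before` to `after`."""
    return 100.0 * (before - after) / before


reductions = {k: relative_reduction(baseline[k], full[k]) for k in baseline}
fps = 1000.0 / full["latency_ms"]  # frames per second at batch size 1

for name, pct in reductions.items():
    print(f"{name}: {pct:.2f}% lower")
print(f"throughput: {fps:.2f} FPS")
```

Under these numbers the full model uses roughly 7.1% fewer parameters, 2.7% fewer FLOPs, 6.6% less inference time, and 8.1% less peak memory than the baseline, and the latency of 46.17 ms corresponds to the reported 21.66 FPS.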

4. Discussion

The experimental results suggest that explicitly modeling pixel-level neighborhood relationships is particularly beneficial for remote sensing segmentation, where local spatial continuity and structural coherence are crucial for accurate dense prediction. By combining global language priors from RemoteCLIP [4] with ViT-based visual representations and neighborhood-aware attention, LoVLANet improves semantic discrimination while preserving local consistency in complex scenes. Across the two evaluated benchmarks, this improvement in local spatial consistency holds under the adopted experimental settings, which supports the effectiveness of the proposed neighborhood-aware design for RGB optical remote sensing segmentation. The reported margins are discussed together with complementary metrics, category-level IoUs, ablation trends, and qualitative visualizations, rather than as standalone evidence. In addition, LoveDA and GID are evaluated as separate supervised benchmark settings rather than as a cross-dataset transfer-learning protocol.

The quantitative improvements and visual comparisons are consistent with the design goal of the Neighborhood Key–Key Encoder. Key–key similarity encourages tokens with similar land-cover characteristics to reinforce each other, while the Gaussian spatial prior constrains this interaction according to spatial proximity. The ablation results on LoveDA and GID further show that KK attention, the Gaussian prior, and ReduceMLP are complementary rather than uniformly optimal for every category or metric. Therefore, the full model should be interpreted as a balanced combination of component roles, rather than as a single component that dominates all cases.

Nevertheless, some limitations remain. The current neighborhood modeling strategy is still relatively fixed and may be less effective when handling highly irregular object boundaries or severely fragmented regions.

At the same time, the results should be interpreted with caution. The sensitivity analysis treats the Gaussian kernel width as a fixed hyperparameter, so the selected neighborhood range may not be optimal for every object scale or boundary shape. The category-level ablation results also indicate that stronger semantic affinity can benefit homogeneous regions such as water or forest, but may risk smoothing small objects or sharply bounded structures when used alone. These observations define the main uncertainty of the current design: it improves mean performance and local consistency under the adopted settings, while some category-specific trade-offs remain. The KK-only ablation is, therefore, used to analyze component behavior rather than optimization dynamics.

These observations suggest that future work should further improve neighborhood-aware modeling, especially for cases with irregular boundaries, fragmented regions, and scale-varying land-cover structures.

5. Conclusions

In this paper, we proposed LoVLANet, a localized vision–language segmentation framework for remote sensing imagery. By integrating RemoteCLIP [4]-based language priors, a customized ViT visual backbone, and a neighborhood-aware attention mechanism, the proposed method effectively enhances both semantic discrimination and spatial consistency. Experiments on the LoveDA and GID benchmarks show that LoVLANet achieves competitive and generally improved segmentation performance under the adopted experimental settings.

The ablation results further verify the effectiveness and complementarity of the proposed components. More broadly, the results suggest that local spatial relationships should be explicitly considered during attention formation in high-resolution remote sensing semantic segmentation, rather than being treated only as a later feature-fusion refinement. These findings provide a cautious but general design insight supported by the current experiments: global semantic guidance and local spatial constraints are complementary for improving dense land-cover prediction under the evaluated RGB optical benchmarks. In future work, we plan to explore more flexible neighborhood-aware modeling strategies, so the model can better balance semantic affinity and local structural preservation under complex land-cover patterns.

Author Contributions

Conceptualization, X.T.; methodology, X.T.; software, S.T.; validation, S.T., C.Z. and L.H.; formal analysis, S.T.; investigation, L.H.; resources, C.Z. and Z.X.; data curation, S.T.; writing—original draft preparation, S.T.; writing—review and editing, C.Z. and X.T.; visualization, S.T.; supervision, C.Z.; project administration, Z.X.; funding acquisition, C.Z. and Z.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Open Fund of the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, grant number 24R04; the Hubei Provincial Key Research and Development Program, grant number 2025BAB061; the Hubei Provincial Major Science and Technology Program, grant number 2025BEA001; and the Wuhan Natural Science Foundation Focused Program, grant number 2026040101030022.

Data Availability Statement

Datasets relevant to our paper are available online.

Acknowledgments

The numerical calculations in this paper were performed using the facilities of the School of Artificial Intelligence, Hubei University.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214.
Guo, M.; Lu, C.; Hou, Q.; Liu, Z.; Cheng, M.; Hu, S. SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation. In Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, 28 November–9 December 2022; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates Inc.: Red Hook, NY, USA, 2022.
Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Meila, M., Zhang, T., Eds.; PMLR: Cambridge, MA, USA, 2021; Volume 139, pp. 8748–8763.
Liu, F.; Chen, D.; Guan, Z.; Zhou, X.; Zhu, J.; Ye, Q.; Fu, L.; Zhou, J. RemoteCLIP: A Vision Language Foundation Model for Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5622216.
Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651.
Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
Yuan, Y.; Chen, X.; Wang, J. Object-Contextual Representations for Semantic Segmentation. In Proceedings of the Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; pp. 173–190.
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002.
Wang, L. GeoSeg: A Semantic Segmentation Toolbox Including UNetFormer and DC-Swin. 2022. Available online: https://github.com/WangLibo1995/GeoSeg (accessed on 11 October 2025).
Zou, X.; Li, Y.; Zhang, S.; Li, K.; Wang, S.; Tao, P.; Xing, J.; Lang, C. Dynamic Dictionary Learning for Remote Sensing Image Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Honolulu, HI, USA, 19–23 October 2025; pp. 22457–22466.
Hwang, G.; Jeong, J.; Lee, S.J. SFA-Net: Semantic Feature Adjustment Network for Remote Sensing Image Segmentation. Remote Sens. 2024, 16, 3278.
Fu, Y.; Lou, M.; Yu, Y. SegMAN: Omni-scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 19077–19087.
Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer: Cham, Switzerland, 2015; pp. 234–241.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021.
Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Proceedings of the Advances in Neural Information Processing Systems; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 12077–12090.
D’Ascoli, S.; Touvron, H.; Leavitt, M.L.; Morcos, A.S.; Biroli, G.; Sagun, L. ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Meila, M., Zhang, T., Eds.; PMLR: Cambridge, MA, USA, 2021; Volume 139, pp. 2286–2296.
Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. CvT: Introducing Convolutions to Vision Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 22–31.
Gao, P.; Geng, S.; Zhang, R.; Ma, T.; Fang, R.; Zhang, Y.; Li, H.; Qiao, Y. CLIP-Adapter: Better Vision-Language Models with Feature Adapters. Int. J. Comput. Vis. 2024, 132, 581–595.
Dong, X.; Bao, J.; Zheng, Y.; Zhang, T.; Chen, D.; Yang, H.; Zeng, M.; Zhang, W.; Yuan, L.; Chen, D.; et al. MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 10995–11005.
Li, B.; Weinberger, K.Q.; Belongie, S.; Koltun, V.; Ranftl, R. Language-driven Semantic Segmentation. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
Lüddecke, T.; Ecker, A. Image Segmentation Using Text and Image Prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–20 June 2022; pp. 7086–7096. [Google Scholar]
Hajimiri, S.; Ayed, I.B.; Dolz, J. Pay Attention to Your Neighbours: Training-Free Open-Vocabulary Semantic Segmentation. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; pp. 5061–5071. [Google Scholar] [CrossRef]
Xu, W.; Xu, Y.; Chang, T.; Tu, Z. Co-Scale Conv-Attentional Image Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9981–9990. [Google Scholar]
Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks; Vanschoren, J., Yeung, S., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 1. [Google Scholar]
Tong, X.Y.; Xia, G.S.; Lu, Q.; Shen, H.; Li, S.; You, S.; Zhang, L. Land-cover classification with high-resolution remote sensing images using transferable deep models. Remote Sens. Environ. 2020, 237, 111322. [Google Scholar] [CrossRef]

Figure 1. Detailed architecture of the proposed Localized Vision–Language Attention Network (LoVLANet) framework. The visual encoder extracts hierarchical features $\{F_{1}, F_{2}, F_{3}, F_{4}\}$ while explicitly capturing localized spatial dependencies via the Neighborhood Key-Key Encoder. The text encoder generates dense text embeddings $\hat{T} \in \{\hat{t}_{1}, \dots, \hat{t}_{N}\}$ from category prompts. Finally, the decoder fuses spatial features into dense visual embeddings $\hat{V} \in \{\hat{v}_{1}, \dots, \hat{v}_{HW}\}$ and performs feature alignment for pixel-level segmentation. Colors in the output mask represent predicted semantic classes.
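
As a rough illustration of the idea behind the Neighborhood Key-Key Encoder, the NumPy sketch below computes attention from key–key similarity and biases it with a Gaussian spatial prior over the token grid. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the additive combination of the prior with the logits, the scaling by the square root of the feature dimension, and the toy grid size are all hypothetical choices made for illustration. Only the use of key–key similarity, a Gaussian neighborhood weighting, and the value σ = 2.0 (the best setting in Table 4) come from the article.

```python
import numpy as np

def gaussian_neighborhood(h, w, sigma):
    """Return an (h*w, h*w) matrix of Gaussian weights between grid positions."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pos = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(np.float64)  # (h*w, 2)
    sq_dist = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def neighborhood_kk_attention(keys, values, h, w, sigma=2.0):
    """Key-key attention biased toward spatial neighbors (assumed formulation)."""
    d = keys.shape[-1]
    # Key-key similarity replaces the usual query-key product.
    logits = keys @ keys.T / np.sqrt(d)
    # ASSUMPTION: the Gaussian prior is added to the logits before the softmax.
    logits = logits + gaussian_neighborhood(h, w, sigma)
    # Row-wise softmax with max subtraction for numerical stability.
    logits = logits - logits.max(axis=-1, keepdims=True)
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ values, attn

# Toy example: a 4 x 4 token grid with 8-dimensional features (hypothetical sizes).
rng = np.random.default_rng(0)
h, w, d = 4, 4, 8
K = rng.standard_normal((h * w, d))
V = rng.standard_normal((h * w, d))
out, attn = neighborhood_kk_attention(K, V, h, w, sigma=2.0)
print(out.shape, attn.shape)
```

Because the prior depends only on grid geometry, it can be computed once per input resolution and reused across layers and batches.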

Figure 2. Qualitative comparison results on the LoveDA dataset. Colors denote semantic class labels.

Figure 3. Zoomed-in visual comparison on the LoveDA dataset. The red boxes highlight the effectiveness of the proposed Neighborhood Key-Key Encoder in preserving the topological continuity of linear structures (e.g., roads) against complex background interference. Colors denote semantic class labels.

Figure 4. Qualitative comparison results on the GID dataset. Colors denote semantic class labels.

Figure 5. Zoomed-in visual comparison on the GID dataset. The red boxes highlight the observed advantage of LoVLANet in suppressing semantic noise and preserving the fine-grained continuity of challenging spatial structures. Colors denote semantic class labels.

Table 1. Quantitative performance comparison on the LoveDA validation set (%).

Method	mIoU	Building IoU	Road IoU	Water IoU	Barren IoU	Forest IoU	Agricultural IoU	mF1	OA	Kappa
UNetFormer [1]	56.37	55.39	73.54	68.38	35.89	46.86	58.17	71.24	72.61	64.52
DC-Swin (UNetFormer)	66.00	83.09	66.09	77.17	44.07	53.33	72.25	78.68	81.63	75.33
D2LS [11]	69.08	80.92	74.35	77.15	43.42	63.63	75.04	80.98	83.69	78.11
SFA-Net [12]	64.11	75.55	72.59	75.43	46.66	49.47	65.00	77.46	78.58	71.93
SegMAN [13]	67.79	86.31	71.15	76.27	43.97	52.70	76.35	79.84	83.50	77.33
SegNeXt [2]	68.21	82.81	73.54	77.87	41.81	56.40	76.82	80.15	83.69	77.94
Ours (LoVLANet)	69.77	81.11	74.75	79.45	46.51	59.57	77.24	81.50	84.55	79.07

Note: Bold values indicate the best result in each column (including ties).
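
The mIoU column of Table 1 agrees, to within rounding, with the unweighted mean of the six per-category IoU values listed in each row. The short check below recomputes it; the numbers are copied from Table 1, while the dictionary layout and the helper name `mean_iou` are illustrative choices rather than anything from the article.

```python
# Per-category IoU values (%) and reported mIoU (%), copied from Table 1.
# Category order: Building, Road, Water, Barren, Forest, Agricultural.
table1 = {
    "UNetFormer": ([55.39, 73.54, 68.38, 35.89, 46.86, 58.17], 56.37),
    "DC-Swin":    ([83.09, 66.09, 77.17, 44.07, 53.33, 72.25], 66.00),
    "D2LS":       ([80.92, 74.35, 77.15, 43.42, 63.63, 75.04], 69.08),
    "SFA-Net":    ([75.55, 72.59, 75.43, 46.66, 49.47, 65.00], 64.11),
    "SegMAN":     ([86.31, 71.15, 76.27, 43.97, 52.70, 76.35], 67.79),
    "SegNeXt":    ([82.81, 73.54, 77.87, 41.81, 56.40, 76.82], 68.21),
    "LoVLANet":   ([81.11, 74.75, 79.45, 46.51, 59.57, 77.24], 69.77),
}

def mean_iou(per_class_iou):
    """Unweighted mean of per-class IoU values."""
    return sum(per_class_iou) / len(per_class_iou)

for name, (ious, reported) in table1.items():
    computed = mean_iou(ious)
    print(f"{name:10s} computed={computed:.3f} reported={reported:.2f}")
```

Small differences in the last digit are expected, since the per-category values in the table are themselves rounded to two decimals.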

Table 2. Quantitative performance comparison on the GID dataset (%).

Method	mIoU	Built-Up IoU	Farmland IoU	Forest IoU	Meadow IoU	Water IoU	mF1	OA	Kappa
UNetFormer [1]	87.61	95.19	93.08	91.46	63.97	94.35	92.92	96.43	95.01
DC-Swin (UNetFormer)	88.55	95.96	93.39	91.39	68.83	93.17	93.61	96.55	95.15
D2LS [11]	89.58	95.56	93.89	91.90	71.81	94.74	94.25	96.86	95.60
SFA-Net [12]	83.48	93.51	90.63	89.75	51.39	92.15	90.03	95.11	93.16
SegMAN [13]	89.58	95.56	94.01	91.05	72.31	94.97	94.26	96.88	95.62
SegNeXt [2]	89.79	95.57	93.94	91.50	73.14	94.82	94.40	96.89	95.63
Ours (LoVLANet)	92.49	96.53	94.64	92.94	81.11	95.35	96.01	97.55	96.56

Note: Bold values indicate the best result in each column (including ties).
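
For readers who want to reproduce the four summary columns (mIoU, mF1, OA, Kappa), the sketch below derives them from a confusion matrix using the standard textbook definitions. It assumes rows are ground-truth classes and columns are predictions, and that every class occurs at least once. The function name and the toy matrix are hypothetical, and the definitions are assumed to match those in Section 3.2 rather than being taken from it.

```python
import numpy as np

def segmentation_metrics(conf):
    """Compute mIoU, mF1, OA and Cohen's kappa from a confusion matrix.

    conf[i, j] counts pixels of true class i predicted as class j.
    """
    conf = np.asarray(conf, dtype=np.float64)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp   # predicted as the class but wrong
    fn = conf.sum(axis=1) - tp   # pixels of the class that were missed
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    total = conf.sum()
    oa = tp.sum() / total        # overall accuracy
    # Expected agreement by chance, from the row and column marginals.
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1.0 - pe)
    return {"mIoU": iou.mean(), "mF1": f1.mean(), "OA": oa, "Kappa": kappa}

# Toy two-class confusion matrix (made-up counts, not data from the article).
toy = [[5, 1],
       [2, 2]]
metrics = segmentation_metrics(toy)
for key, value in metrics.items():
    print(f"{key}: {100 * value:.2f}%")
```

In practice the confusion matrix would be accumulated over all validation tiles before the metrics are computed, so that large and small images contribute in proportion to their pixel counts.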

Table 3. Ablation results of the proposed modules on the LoveDA validation set (%).

KK	Gauss	Reduce MLP	mIoU	Building IoU	Road IoU	Water IoU	Barren IoU	Forest IoU	Agricultural IoU	mF1	OA	Kappa
			66.25	74.95	62.97	67.86	31.06	57.05	75.04	79.01	82.06	75.69
✓			66.83	66.70	66.68	81.75	20.65	62.21	77.40	78.88	82.64	76.69
	✓		68.13	74.39	68.13	80.59	22.48	54.36	81.13	79.82	84.03	78.27
		✓	69.24	76.94	68.93	80.03	27.61	58.35	81.92	80.77	84.66	79.14
✓	✓		67.93	68.02	80.27	81.10	29.24	58.69	80.65	80.14	83.95	78.17
✓		✓	68.99	78.13	64.61	78.71	28.82	59.39	81.15	80.81	84.25	78.61
	✓	✓	69.47	73.11	67.67	80.89	31.83	59.26	81.76	81.08	84.67	79.21
✓	✓	✓	69.77	73.28	67.33	79.04	33.67	58.93	80.52	81.50	84.55	79.07

Note: Bold values indicate the best result in each column (including ties).

Table 4. Sensitivity analysis of the Gaussian kernel width $σ$ on the LoveDA validation set (%).

$σ$	mIoU	mF1	OA
0.5	68.91	80.82	84.02
1.0	69.28	81.12	84.21
1.5	69.61	81.36	84.39
2.0	69.77	81.50	84.55
2.5	69.58	81.34	84.43
3.0	69.25	81.07	84.18

Note: Bold values indicate the best result in each column (including ties).
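
To give a feel for what the values of σ in Table 4 mean spatially, the sketch below prints the weight an unnormalized Gaussian assigns to a neighbor at several grid distances. The kernel form exp(−d² / (2σ²)), the choice of distances, and the assumption that distance is measured in token-grid units are illustrative and not taken from the article; only the list of σ values comes from Table 4.

```python
import math

sigmas = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]  # values swept in Table 4
offsets = [0, 1, 2, 4]                    # hypothetical grid distances

def gaussian_weight(distance, sigma):
    """Unnormalized Gaussian weight for a neighbor at the given distance."""
    return math.exp(-distance ** 2 / (2.0 * sigma ** 2))

for sigma in sigmas:
    row = "  ".join(f"d={d}: {gaussian_weight(d, sigma):.3f}" for d in offsets)
    print(f"sigma={sigma:.1f}  {row}")
```

Under this form, a small σ concentrates almost all of the weight on the pixel itself, whereas a large σ flattens the prior toward uniform. That is one plausible reading of why Table 4 peaks at an intermediate value.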

Table 5. Ablation results of the proposed modules on the GID dataset (%).

KK	Gauss	Reduce MLP	mIoU	Built-Up IoU	Farmland IoU	Forest IoU	Meadow IoU	Water IoU	mF1	OA	Kappa
			89.53	94.95	93.15	91.04	72.63	93.91	94.26	96.69	95.35
✓			91.97	96.62	94.27	92.38	78.55	94.98	95.71	97.41	96.36
	✓		92.36	96.63	94.32	92.07	81.86	94.89	95.95	97.42	96.37
		✓	92.45	96.65	94.58	92.45	80.94	95.30	95.98	97.57	96.59
✓	✓		92.49	96.59	94.53	92.59	81.25	95.14	96.01	97.52	96.51
✓		✓	92.21	96.52	94.48	92.32	79.73	95.04	95.85	97.47	96.45
	✓	✓	92.25	96.59	94.28	92.33	80.79	94.88	95.88	97.41	96.36
✓	✓	✓	92.49	96.53	94.64	92.94	81.11	95.35	96.01	97.55	96.56

Note: Bold values indicate the best result in each column (including ties).

Table 6. Computational cost comparison of representative ablation variants using a single $1024 \times 1024$ RGB input.

Variant	Params (M)	FLOPs (G)	Time (ms)	FPS	Memory (GB)
Baseline	166.97	446.73	49.44	20.23	3.20
KK	159.88	439.80	48.62	20.57	3.08
ReduceMLP	162.25	440.60	47.96	20.85	3.02
KK + ReduceMLP	155.16	433.70	45.82	21.82	2.91
Full	155.16	434.64	46.17	21.66	2.94
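
The FPS column in Table 6 is consistent with 1000 divided by the per-image latency in milliseconds, and the savings of each variant can be expressed relative to the baseline. The sketch below checks the first relation and prints the relative changes; the figures are copied from Table 6, while the helper names and the dictionary layout are illustrative.

```python
# Values copied from Table 6:
# (params in M, FLOPs in G, time in ms, reported FPS, memory in GB).
table6 = {
    "Baseline":       (166.97, 446.73, 49.44, 20.23, 3.20),
    "KK":             (159.88, 439.80, 48.62, 20.57, 3.08),
    "ReduceMLP":      (162.25, 440.60, 47.96, 20.85, 3.02),
    "KK + ReduceMLP": (155.16, 433.70, 45.82, 21.82, 2.91),
    "Full":           (155.16, 434.64, 46.17, 21.66, 2.94),
}

def fps_from_latency(time_ms):
    """Frames per second implied by a per-image latency in milliseconds."""
    return 1000.0 / time_ms

def relative_change(new, base):
    """Signed percentage change of `new` with respect to `base`."""
    return 100.0 * (new - base) / base

base_params, base_flops, base_time, _, base_mem = table6["Baseline"]
for name, (params, flops, time_ms, fps, mem) in table6.items():
    print(
        f"{name:15s} fps={fps_from_latency(time_ms):.2f} (reported {fps:.2f})  "
        f"params {relative_change(params, base_params):+.2f}%  "
        f"FLOPs {relative_change(flops, base_flops):+.2f}%  "
        f"time {relative_change(time_ms, base_time):+.2f}%  "
        f"memory {relative_change(mem, base_mem):+.2f}%"
    )
```

Read this way, the full model uses roughly 7% fewer parameters than the baseline.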
