TransUV: A TransNeXt-Based Model with Multi-Scale and Attention Fusion for Fine-Grained Urban Village Extraction

Lin, Xiaobao; Wang, Yu; Zhou, Yaming; Wang, Guangjun; Chen, Sai

doi:10.3390/rs18020223


by Xiaobao Lin ¹, Yu Wang ²,*, Yaming Zhou ², Guangjun Wang ¹ and Sai Chen ¹

¹ School of Land Science and Technology, China University of Geosciences (Beijing), Beijing 100083, China
² Satellite Application Center for Ecology and Environment, Ministry of Ecology and Environment, Beijing 100094, China
* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(2), 223; https://doi.org/10.3390/rs18020223

Submission received: 26 November 2025 / Revised: 25 December 2025 / Accepted: 5 January 2026 / Published: 9 January 2026

(This article belongs to the Section AI Remote Sensing)


Highlights

What are the main findings?

The novel model introduced leverages a modified TransNeXt backbone with multi-scale attention fusion for the fine-grained extraction of urban villages (UVs) from high-resolution remote sensing imagery.
The model comprises a Multi-level Feature Enhancement Module (MFEM) and an Advanced Attention Fusion Module (AAFM) to enhance boundary clarity and texture representation. It achieves an mIoU of 86.67% and an overall accuracy of 92.98% and outperforms several mainstream models.

What are the implications of the main findings?

TransUV demonstrates strong generalization capability across complex urban environments, which facilitates the precise delineation of irregular and densely structured UVs.
The proposed approach achieves a balance between efficiency and accuracy, making it a robust solution for urban village extraction in urban renewal and monitoring applications.

Abstract

Urban villages (UVs) are widespread in rapidly urbanizing regions, but their fine-grained delineation from high-resolution remote sensing imagery remains a challenge due to complex spatial textures and ambiguous boundaries. To address this issue, this paper proposes TransUV, a TransNeXt-based encoder–decoder segmentation framework tailored to UV extraction. At the encoder front end, a Multi-level Feature Enhancement Module (MFEM) injects boundary- and texture-aware inductive bias by combining Laplacian-of-Gaussian (LoG) filtering with Gaussian smoothing, which strengthens edge responses while suppressing noise. At the decoder stage, we design a lightweight SegUV decoder equipped with an Advanced Attention Fusion Module (AAFM) that adaptively fuses multi-scale features using complementary channel, spatial, and directional attention. Experiments on 0.5 m imagery from two Chinese cities demonstrate that TransUV achieves an mIoU of 86.67% and an overall accuracy of 92.98%, significantly outperforming other mainstream models.

Keywords:

urban villages; high-resolution remote sensing; multi-scale attention fusion; semantic segmentation; TransUV model


1. Introduction

Urban villages (UVs) typically refer to areas within cities characterized by inadequate planning, underdeveloped infrastructure, poor sanitary conditions, and high population density [1,2,3]. These areas are historical outcomes of the long-term evolution of the institutional rural-urban dichotomy and represent a concentration of the contradictions between land systems and urban expansion [4,5,6]. While UVs provide low-cost living spaces for migrant populations, they are also associated with typical “urban maladies” such as excessive building density (>70%) and insufficient green space per capita (<1 m²) [7]. Due to the generally inadequate infrastructure (particularly drainage systems) and weak environmental management, UVs often become hotspots for black and odorous water bodies [8].

Target 11.1 of the United Nations Sustainable Development Goals (SDG 11.1) explicitly calls for the following: “By 2030, ensure access for all to adequate, safe, and affordable housing and basic services and upgrade slums.” Consequently, the precise identification of marginalized urban spaces, including UVs, is of critical importance. The identified UVs can also provide key auxiliary spatial information for the screening and management of urban black and odorous water bodies. Driven by critical urban challenges, resource availability, and policy biases, urban studies have shown a predominant focus on large cities [9]. China is home to over 100 large cities with permanent populations exceeding one million, yet publicly accessible spatial data remain extremely scarce, significantly hindering the implementation and monitoring of urban renewal policies [10]. Traditional methods relying on manual surveys are not only inefficient and slow to update, but also struggle to meet the demands of rapid urban renewal and sustainable governance [11].

With the rapid advancement of remote sensing technology, the fast and systematic acquisition of land surface information has become feasible [12]. The technology provides essential technical support for natural resource surveys, infrastructure censuses, and assessments of urban sustainable development. It also offers a new technical pathway for the automatic identification of UVs. To improve the accuracy and efficiency of UV identification in complex urban environments, existing studies have mainly followed two technical routes: (1) Fusing multi-source data to enrich the original feature space and enhance the separability between UVs and their surrounding urban environment [13,14]; and (2) Optimizing identification methods based on a single remote sensing data source to reduce dependence on auxiliary data and improve method generalizability [15,16,17].

Along the first route, in comparison with the more prevalent fusion schemes in other remote sensing applications (e.g., image-image fusion or image–geographic data fusion), UV identification tasks often additionally incorporate diverse urban spatial data such as street-view images, points of interest (POIs), and human mobility trajectories to represent their formation mechanisms and intensive human activities. Huang et al. [18] integrated remote sensing imagery with street-view images to enhance the utilization of street-level information and achieved good performance in UV identification in Shenzhen; Xiao et al. [19] combined remote sensing imagery with POI data to identify UVs in multiple cities, including Shenzhen, Fuzhou, and Beijing, and reported significant improvements in classification accuracy; Chen et al. [10] fused remote sensing and mobility trajectory data to achieve fine-scale UV mapping with a spatial resolution of 2.5 m in Shenzhen. These studies demonstrate that multi-source data fusion strategies can effectively enhance the accuracy of UV identification under complex urban backgrounds. Fan et al. [20] proposed the SemiUIS method, which integrates crowdsourced geospatial data with a semi-supervised learning strategy to achieve high-accuracy mapping of urban informal settlements using only a limited number of labeled samples. However, the acquisition cost of multi-source urban spatial data is high due to the incomplete spatial coverage, and data quality may vary significantly across cities. As a result, the spatial generalization ability of the constructed models is limited, and it is particularly difficult to extend such methods to cities where multi-source data are scarce, which constrains their practical applicability [21].

The second route focuses on identification methods based on a single remote sensing data source [22]. Compared with multi-source fusion approaches, these methods rely less on auxiliary data and are thus more suitable for engineering applications. With the increasing diversity of remote sensing data and improvements in spatial resolution, related methods have continued to evolve. Early studies mainly relied on traditional machine learning and object-based image analysis techniques. Wurm et al. [23] employed a random forest classifier within the Kennaugh element framework using synthetic aperture radar (SAR) imagery to extract UVs. This approach effectively captured multi-scale texture features while maintaining rotation invariance and relatively low computational complexity. Kit et al. [24] used multi-temporal high-resolution satellite imagery and applied Canny and LSD edge detection algorithms to identify UVs and analyze their changes in Hyderabad, India. They successfully achieved high-accuracy extraction for 2003 and 2010 without additional corrections. D’Oleire-Oltmanns et al. [25] adopted an object-based image analysis (OBIA) approach based on multi-temporal medium- to high-resolution optical imagery to effectively identify UVs in the Pearl River Delta region. This approach was widely used before the rise of deep learning. Subsequently, Huang et al. [26] introduced a scene-classification-based framework on this basis to characterize the differences between UVs and other urban land-cover types from multiple dimensions. They achieved high identification accuracy across multi-temporal datasets in Shenzhen and Wuhan, which also verifies the transferability of the method.

As deep learning advances, particularly with the rapid maturation of semantic segmentation models, methods based on convolutional neural networks (CNNs) [27] and Transformers have gradually become the mainstream for the automatic extraction of UVs [28,29,30,31,32]. Feng et al. [33] proposed a network structure that integrates multi-scale dilated convolutions with non-local feature extraction modules. This structure effectively handles the variations in the shape and scale of UVs and achieved an overall accuracy of 94.27% in experiments conducted in parts of Beijing. Gella et al. [34] employed a Mask R-CNN combined with domain-adaptive transfer learning to explore the feasibility of transferring models trained on historical imagery to newer imagery and evaluated its ability to capture the spatiotemporal dynamics of UVs. Chai et al. [35] proposed a multi-scale masked Transformer model (MaskUV), which can simultaneously capture local textures and global contextual information. It achieved an F1-score of 84.39% and an intersection over union (IoU) of 73.00% on their UVSet dataset, demonstrating strong performance in UVs identification in the Pearl River Delta. Li et al. [15] proposed the UV-Mamba model based on high-resolution remote sensing imagery. In this model, state space modeling is employed to effectively enhance the capture of long-range spatial context from a single data source, thereby improving the identification accuracy of UVs. Nevertheless, when applied to UV areas that are highly heterogeneous, structurally complex, and rapidly evolving, existing semantic segmentation methods still suffer from insufficient feature representation, local misclassification, and blurred boundaries. At the same time, many models have a large number of parameters, low inference efficiency, and limited cross-scene generalization ability, which pose challenges for their deployment and promotion in real-world operational settings [4].

To address core challenges in UV extraction, such as high-density building distribution, blurred boundaries, and irregular shapes, this paper proposes a novel remote sensing image segmentation model named TransUV. The key innovations include a task-driven structural redesign: the TransNeXt backbone network is employed to jointly model both local and global information [36], and is integrated with a Multi-level Feature Enhancement Module (MFEM) and an Advanced Attention Fusion Module (AAFM), achieving comprehensive optimization from low-level features to high-level semantics. The specific contributions are as follows:

At the front end of the encoder, we propose the Multi-level Feature Enhancement Module (MFEM). This module introduces learnable Laplacian of Gaussian (LoG) and Gaussian filtering as priors to effectively enhance edge and texture responses, suppress noise, and thus significantly improve contour discriminability.
In the decoder, we design the Advanced Attention Fusion Module (AAFM). This module integrates channel, spatial, and directional attention mechanisms and adaptively fuses them through dynamic weighting, thereby enhancing the representation of UV structures that exhibit strong directionality and irregular morphology.
A training sample selection strategy based on a coverage threshold is proposed. This strategy filters out fragmented samples with low coverage to alleviate class imbalance and labeling noise, thereby improving training stability and deployment robustness.

Experimental results demonstrate that TransUV outperforms state-of-the-art methods in terms of boundary clarity, regional integrity, and cross-scene generalization capability, offering a more effective solution for the automated and fine-grained monitoring of UVs in complex environments.

2. Study Area and Data

2.1. Study Area

This study focuses on the main urban cores of Kunming and Nanning, two cities with relatively concentrated distributions of UVs, with areas of 608.92 km² and 526.39 km², respectively (Figure 1). As the capital of Yunnan Province, Kunming has witnessed extensive urban expansion driven by its unique climatic and ecological advantages. However, along with rapid development, numerous older UVs still remain, resulting in a diverse urban land-use structure and an urgent need for urban renewal [37]. Nanning is located in the southern subtropical monsoon climate zone, featuring a humid environment and abundant greenery. Nanning is a “gateway city” oriented towards ASEAN; during urbanization, its traditional rural settlements have gradually been surrounded by urban functions, creating a typical urban–rural nested structure and prominent UV issues [38].

To identify the boundaries of the study areas, we defined the core built-up areas of both cities according to the scope of the main urban cores, current administrative boundaries, and functional zoning. The study area in Kunming (Figure 1a) encompasses Panlong, Wuhua, Guandu, Chenggong, and Xishan districts; the study area in Nanning (Figure 1b) spans Qingxiu, Xingning, Liangqing, Yongning, Jiangnan, and Xixiangtang districts.

As shown in Figure 2, UVs in both study areas exhibit typical characteristics such as high building density, narrow alleyways, and irregular morphologies. Nevertheless, there are notable inter-city differences in spatial characteristics: the roofs in Kunming’s UVs display a diverse color palette (including gray-white, light blue, and some dark tones), indicating relatively high heterogeneity. By contrast, the roofs in Nanning’s UVs are predominantly a uniform blue, reflecting the widespread use of blue metal sheets, which produces a highly distinctive, homogeneous roof signature across large areas. In this study, the annotations of UVs from Kunming and Nanning were integrated and used to jointly train the TransUV model. This enables the model to simultaneously learn and capture the distinct spatial characteristics of UVs in both cities, thereby providing data support and a theoretical basis for subsequent precise extraction and urban renewal strategies.

2.2. Data

The primary data source consists of 0.5 m spatial resolution remote sensing imagery acquired in 2023 by the Jilin-1 satellite constellation, obtained through the Ovital map software (v10.3.0). Owing to its sub-meter spatial resolution, Jilin-1 can precisely capture multiple urban geographic features, such as building structures, road network textures, vegetation, and water bodies, presenting substantial advantages in spatial detail and temporal currency. This provides a strong data foundation for the fine-grained identification of UVs’ spatial characteristics [39,40,41].

To ensure the accuracy of sample annotations, this study established a rigorous quality control process. First, to precisely define the geographic scope for UV identification, we used the global urban boundary derived from the 2020 GAIA dataset (Global Artificial Impervious Area) published by Li et al. [42], combined with remote sensing images from the 2023 Jilin-1 satellite. Through visual interpretation and boundary revision, the built-up area within the study region was delineated. Within this defined area, two researchers, each with a background in remote sensing and urban planning and over two years of experience in high-resolution image analysis, independently performed annotations using a “back-to-back” strategy [43]. Prior to annotation, both researchers received systematic specialized training, covering the criteria for distinguishing the morphological textures of UVs and rural villages, operation standards for annotation tools, and quality control methods, to ensure consistency in the initial annotation results.

For difficult cases and discrepancies that occurred during the annotation process, all inconsistent areas were submitted to an arbitration panel composed of senior experts for final adjudication. This panel consists of remote sensing experts from the Satellite Environment Application Center of the Ministry of Ecology and Environment and professors from China University of Geosciences (Beijing). The expert panel carried out focused discussions and adjudications based on multi-source image features, urban planning regulations, and field knowledge, and produced authoritative final annotation results. This process effectively guaranteed the reliability of the annotated data, offering a solid benchmark for model training and evaluation.

During the data preprocessing stage, the images and their corresponding labels were uniformly cropped into 512 × 512 pixel patches. An area threshold method (i.e., UV patch coverage ≥ 20%) was then applied to filter valid samples, resulting in a total of 2021 valid patches (see Figure 3). To construct the training and validation sets, we adopted a random sampling strategy and divided the data at an 8:2 ratio. In this process, no pre-stratification was performed based on city or other attributes, ensuring that each patch had an equal probability of being assigned to either the training or validation set. This approach helps avoid biases that may arise from manual stratification, ensuring the independence and representativeness of the training and validation sets in terms of the overall data distribution. As a result, the final split consisted of 1616 patches in the training set and 405 patches in the validation set, with the sample distribution between the two cities in the training and validation sets shown in Table 1.
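The filtering and splitting steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' pipeline: the function names `uv_coverage` and `filter_and_split`, the list-of-lists mask representation, and the seed value are assumptions made for the example; only the 20% coverage threshold and the 8:2 ratio come from the text.

```python
import random


def uv_coverage(mask):
    """Fraction of pixels labelled as urban village (1) in a 2-D binary mask."""
    total = sum(len(row) for row in mask)
    positive = sum(sum(row) for row in mask)
    return positive / total if total else 0.0


def filter_and_split(masks, threshold=0.20, train_ratio=0.8, seed=42):
    """Keep patches whose UV coverage is >= threshold, then split them at random.

    Returns two lists of patch indices (training, validation). No stratification
    by city or any other attribute is applied, mirroring the description above.
    """
    valid = [i for i, m in enumerate(masks) if uv_coverage(m) >= threshold]
    rng = random.Random(seed)  # hypothetical seed, used only for reproducibility
    rng.shuffle(valid)
    n_train = int(len(valid) * train_ratio)
    return valid[:n_train], valid[n_train:]


# Toy example: eleven 1 x 10 masks with coverage 0.0, 0.1, ..., 1.0.
toy_masks = [[[1] * c + [0] * (10 - c)] for c in range(11)]
train_ids, val_ids = filter_and_split(toy_masks)
```

Under the assumption that the training count is obtained by truncation, 2021 valid patches at a ratio of 0.8 give 1616 training and 405 validation patches, which agrees with the split reported above.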

3. TransUV Architecture

3.1. Overview

As shown in Figure 4, the proposed TransUV model adopts an encoder–decoder framework. In the encoder, the model uses TransNeXt [36] as the backbone for feature extraction. Its aggregation attention mechanism can jointly model local details and global context, providing a feature representation that better adapts to the dense and irregular structure of UVs. Specifically, we introduce an MFEM at the front end of the encoder, which explicitly enhances edge and texture responses using techniques such as LoG filtering during the initial feature extraction stage, addressing the issue of boundary ambiguity in UVs. In the decoder, the proposed SegUV decoder progressively restores spatial details and outputs fine segmentation results, with its core being the AAFM. This module integrates channel, spatial, and directional perception attention, enabling the adaptive fusion of multi-scale features, particularly suited for capturing the complex directional structures presented by internal alleys and building arrangements in UVs.

Overall, TransUV, through the collaborative design of the TransNeXt backbone, MFEM, and AAFM, forms a specialized processing pipeline tailored to the morphological features of UVs. This architecture not only overcomes the limitations of general Transformer models (such as ViT and Swin Transformer) in maintaining fine-grained features and modeling long-range dependencies across windows but also significantly improves the extraction accuracy of UV instances in complex urban scenarios.

3.2. Encoder

In this study, TransNeXt-Tiny is adopted as the encoder, which utilizes a bionic foveal vision mechanism to accomplish multi-scale feature extraction. The encoder employs a four-stage hierarchical backbone structure and an overlapping patch embedding mechanism similar to that of PVTv2. In the first to third stages, each TransNeXt Block is composed of stacked Aggregated Attention modules and Convolutional Gated Linear Units (Convolutional GLUs) to jointly model local and global features. Since the feature map resolution in the fourth stage is relatively low and traditional feature pooling modules cannot operate effectively, a Multi-Head Self-Attention (MHSA) module, consistent with that in PVTv2, is employed to maintain global information modeling capability.

To enhance the model’s capability in perceiving edges and textures in complex UV scenarios, this study introduces an MFEM into the encoder, as illustrated in Figure 5. The proposed module integrates mathematical operations such as Laplacian of Gaussian (LoG) filtering and Gaussian smoothing to effectively strengthen boundary responses in the input features while suppressing noise interference, thereby providing more discriminative feature representations for subsequent Transformer-based encoding. Key parameters within the MFEM (e.g., the scale factor K and standard deviation σ) were optimized based on the ablation experiments presented in Appendix A.

First, in the LoGFilter Block, a 7 × 7 convolution kernel combined with Laplacian of Gaussian (LoG) filtering is employed to enhance the saliency of building boundaries. LoG filtering highlights gradient variations in edge regions through the composite operation of Gaussian smoothing and the second-order Laplacian derivative (k = 7, σ = 1), and its mathematical formulation is given as:

$$\mathrm{LoG}(x, y, \sigma) = \frac{x^{2} + y^{2} - 2\sigma^{2}}{2\pi\sigma^{4}} \exp\left(-\frac{x^{2} + y^{2}}{2\sigma^{2}}\right) \tag{1}$$

where $(x, y)$ denote the pixel coordinates, and $\sigma$ controls the scale of the Gaussian kernel.
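As an illustration of Equation (1), the following sketch samples the LoG function on a 7 × 7 grid with σ = 1, the values stated above. The function name `log_kernel` and the choice of integer pixel offsets centred on the origin are assumptions; the paper describes the LoG filter as a learnable prior, so a kernel like this would at most serve as an initialisation, not as the final weights.

```python
import math


def log_kernel(size=7, sigma=1.0):
    """Sample Eq. (1) on a size x size grid of integer offsets centred at (0, 0)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            r2 = x * x + y * y
            # Prefactor of Eq. (1): (x^2 + y^2 - 2*sigma^2) / (2*pi*sigma^4)
            coeff = (r2 - 2.0 * sigma ** 2) / (2.0 * math.pi * sigma ** 4)
            row.append(coeff * math.exp(-r2 / (2.0 * sigma ** 2)))
        kernel.append(row)
    return kernel


kernel = log_kernel()  # 7 x 7 list of lists
```

With these settings the centre weight is negative and the response changes sign where x² + y² = 2σ², which is the zero crossing that makes the filter sensitive to edges.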

Subsequently, the module conducts preliminary downsampling through depthwise separable convolution and introduces a Gaussian Block to suppress pseudo-textures and high-frequency noise. The Gaussian smoothing kernel (k = 9, σ = 0.5) is defined as:

$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^{2}} \exp\left(-\frac{x^{2} + y^{2}}{2\sigma^{2}}\right) \tag{2}$$
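Equation (2) can be sampled in the same way. The sketch below builds a 9 × 9 Gaussian kernel with σ = 0.5; the function name `gaussian_kernel` and the optional normalisation to unit sum are assumptions (normalising is common practice for smoothing filters, but the text does not state whether it is applied here).

```python
import math


def gaussian_kernel(size=9, sigma=0.5, normalize=True):
    """Sample Eq. (2) on a size x size grid of integer offsets centred at (0, 0)."""
    half = size // 2
    kernel = [
        [
            math.exp(-(x * x + y * y) / (2.0 * sigma ** 2)) / (2.0 * math.pi * sigma ** 2)
            for x in range(-half, half + 1)
        ]
        for y in range(-half, half + 1)
    ]
    if normalize:
        # Assumed step: rescale so the weights sum to 1 and mean intensity is preserved.
        total = sum(sum(row) for row in kernel)
        kernel = [[v / total for v in row] for row in kernel]
    return kernel


smooth = gaussian_kernel()
```

Because σ = 0.5 is small relative to the 9 × 9 support, most of the weight sits on the centre pixel and its immediate neighbours, so the smoothing is mild.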

Its weighted averaging mechanism eliminates local detail disturbances while preserving the overall structural information, thus providing clean feature inputs for subsequent hierarchical processing. By combining structured downsampling and max-pooling operations, the MFEM integrates multi-level features to compress the spatial resolution to one-quarter of the original input, generating a compact yet semantically rich low-dimensional representation that facilitates global contextual modeling in the TransNeXt backbone network.

Within the TransNeXt backbone, the Aggregated Attention mechanism employs a dual-path design to emulate the multi-scale perception process of the biological visual system, as shown in Figure 6. This mechanism integrates local and global attention within a unified framework, enabling each token to capture fine-grained information from neighboring features while also acquiring coarse-grained contextual information from downsampled global features. Through hierarchical aggregation, the model achieves an effective balance between global modeling and local detail representation, thereby enhancing the completeness and discriminability of feature representations. The underlying principles are as follows:

For a given query position, local features are extracted from the neighborhood using a k × k sliding window, while global context is captured via the downsampled feature map. The attention computation is formulated as follows:

$$S_{(i,j)\sim\rho(i,j)} = \left(Q_{(i,j)} + QE\right)\hat{K}_{\rho(i,j)}^{T} \tag{3}$$

$$S_{(i,j)\sim\sigma(X)} = \left(Q_{(i,j)} + QE\right)\hat{K}_{\sigma(X)}^{T} \tag{4}$$

where $\rho(i,j)$ stands for the local window, $\sigma(X)$ denotes the globally pooled features, and $QE$ is the learnable Query Embedding.

The attention logits derived from two parallel processing pathways are concatenated and subsequently normalized, ensuring that fine-grained and coarse-grained features are assigned appropriate weights within a unified softmax operation:

$$A_{(i,j)} = \mathrm{softmax}\left(\tau \log N \times \mathrm{Concat}\left(S_{(i,j)\sim\rho(i,j)}, S_{(i,j)\sim\sigma(X)}\right) + B_{(i,j)}\right) \tag{5}$$

where Concat denotes the concatenation operation, $\tau \in \mathbb{R}$ is a learnable temperature parameter, $N$ denotes the number of valid tokens, and $B_{(i,j)}$ represents the positional bias.

$$\mathrm{AA}(X_{(i,j)}) = \left(A_{(i,j)\sim\rho(i,j)} + Q_{(i,j)} T\right) V_{\rho(i,j)} + A_{(i,j)\sim\sigma(X)} V_{\sigma(X)} \tag{6}$$

where $A_{(i,j)\sim\rho(i,j)}$ and $A_{(i,j)\sim\sigma(X)}$ denote the local and global attention weights after splitting, respectively, and $T$ is a learnable token.
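To make the data flow of Equations (3) to (6) concrete, the sketch below evaluates them for a single query vector in plain Python. It is a simplified illustration, not the TransNeXt implementation: the function and argument names are hypothetical, keys are used as given (any normalisation implied by the hat notation is left to the caller), N is assumed to equal the total number of local and global keys, and the learnable tokens T are passed as one vector per local position.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregated_attention(q, qe, k_local, v_local, k_global, v_global,
                         tau=1.0, bias=None, tokens=None):
    """Single-query sketch of Eqs. (3)-(6); all vectors are plain Python lists."""
    n_local = len(k_local)
    n = n_local + len(k_global)             # N, assumed: every key is a valid token
    q_eff = [a + b for a, b in zip(q, qe)]  # Q + QE
    # Eqs. (3) and (4): local and global logits, already concatenated.
    scores = [dot(q_eff, k) for k in k_local + k_global]
    bias = bias if bias is not None else [0.0] * n
    # Eq. (5): one softmax shared by the local and global paths.
    attn = softmax([tau * math.log(n) * s + b for s, b in zip(scores, bias)])
    a_local, a_global = attn[:n_local], attn[n_local:]
    if tokens is not None:
        # Eq. (6): the Q T term is added to the local attention weights.
        a_local = [w + dot(q, t) for w, t in zip(a_local, tokens)]
    dim = len(v_local[0])
    return [
        sum(w * v[d] for w, v in zip(a_local, v_local))
        + sum(w * v[d] for w, v in zip(a_global, v_global))
        for d in range(dim)
    ]
```

The point of the shared softmax is that local and global keys compete for the same probability mass, so the balance between fine-grained and coarse-grained context is decided per query rather than fixed in advance.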

This aggregated attention mechanism is crucial for enhancing the model’s ability to handle features at different spatial resolutions, and it is particularly effective for tasks requiring multi-scale detail processing.

3.3. Decoder

To address the challenges in UV remote sensing images, such as high-density building distributions, blurred boundaries, and occlusions, this study proposes an enhanced decoder structure—SegUV. While maintaining a lightweight design, this structure significantly improves feature representation and semantic segmentation accuracy through structural optimization.

To mitigate differences in channel dimensions and semantic levels across stages, a Channel Alignment Module is first introduced. It employs 1 × 1 convolutions to map multi-scale features to a unified dimension of 256. The core of the decoder is the AAFM, which aims to enhance feature learning by incorporating multiple attention mechanisms. As shown in Figure 7, the module integrates various attention branches, including spatial attention, lightweight channel attention, and direction-aware attention, combined with a dynamic weight fusion strategy. This design effectively improves the model’s robustness and further enhances segmentation accuracy, particularly when handling complex spatial structures and textures. The detailed design is as follows:

1. Spatial Attention Branch

The spatial attention branch captures spatial information from the input feature map through average pooling and max pooling operations. The outputs of these operations are concatenated and processed with a 7 × 7 convolution. This enables the network to capture both spatial context and frequency information. The computation is as follows:

$$A_{spatial} = \sigma\left(\mathrm{Conv}_{7\times7}\left(\mathrm{Concat}\left(\mathrm{Mean}(X); \mathrm{Max}(X)\right)\right)\right) \tag{7}$$

where $\mathrm{Mean}(X)$ and $\mathrm{Max}(X)$ represent the average pooling and max pooling of the input feature map $X$, respectively.

2. Lightweight Channel Attention Branch

This branch implements channel attention on the input feature map using two 1 × 1 convolutional layers. The first convolution reduces the channel dimension and the second restores it, generating an attention map to weight the input features. The specific operations are as follows:

$$A_{channel} = \sigma\left(\mathrm{Conv}\left(\mathrm{ReLU}\left(\mathrm{Conv}(X)\right)\right)\right) \tag{8}$$

where $\sigma$ denotes the Sigmoid activation function, and $X$ is the input feature map.

Through this branch, the network can learn the importance of each channel and adaptively enhance or suppress specific channel features.

In this way, the spatial attention learns the correlations of the global spatial structure.

3. Direction-Aware Attention Branch

The direction-aware branch processes the input feature map through multiple convolutional layers, each of which extracts texture responses along different directions. By leveraging these direction-specific convolutional outputs, the network captures directional information in the image. All directional features are then fused via concatenation followed by a 1 × 1 convolution, to produce the final direction-aware attention map. The computation process of this branch is as follows:

$$A_{direction} = \sigma\left(\mathrm{Conv}_{1\times1}\left(\mathrm{Concat}\left(\mathrm{Conv}_{1}(X), \dots, \mathrm{Conv}_{D}(X)\right)\right)\right) \tag{9}$$

where $\mathrm{Conv}_{i}(X)$ denotes the outputs of convolutions along different directions, and $D$ represents the number of directions.

4. Dynamic Weight Fusion Mechanism

To adaptively fuse the outputs of different attention branches, a lightweight weight generator is introduced. This module generates three weight values through adaptive pooling and convolution operations, which are used to weight the spatial attention, lightweight channel attention, and direction-aware attention branches. The computation process is as follows:

$$w_{1}, w_{2}, w_{3} = \mathrm{Softmax}\left(\mathrm{Conv2d}\left(\mathrm{ReLU}\left(\mathrm{Conv2d}\left(\mathrm{AvgPool}(X)\right)\right)\right)\right) \tag{10}$$

where $w_{1}, w_{2}, w_{3}$ are used to weight the outputs of the different attention branches.

Using the computed weights, the module performs a weighted fusion of the outputs from the attention branches to obtain the final fused feature map. The fusion operation is defined as:

$$F_{fused} = w_{1} \cdot A_{channel} + w_{2} \cdot A_{spatial} + w_{3} \cdot A_{direction} \tag{11}$$
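The fusion step in Equations (10) and (11) amounts to a convex combination of the three attention maps. The sketch below shows only that combination: the `logits` argument stands in for the output of the weight generator (AvgPool, Conv2d, ReLU, Conv2d), which is not implemented here, and the three maps are assumed to have already been broadcast to the same single-channel H × W shape. All names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_attention(a_channel, a_spatial, a_direction, logits):
    """Weighted fusion of three H x W attention maps, following Eq. (11).

    `logits` is a list of three raw scores; Eq. (10) turns them into
    weights w1, w2, w3 that are positive and sum to 1.
    """
    w1, w2, w3 = softmax(logits)
    return [
        [w1 * c + w2 * s + w3 * d for c, s, d in zip(row_c, row_s, row_d)]
        for row_c, row_s, row_d in zip(a_channel, a_spatial, a_direction)
    ]


# Equal logits give each branch a weight of one third.
fused = fuse_attention([[0.0, 1.0]], [[1.0, 1.0]], [[0.5, 0.0]], [0.0, 0.0, 0.0])
```

Because the weights come from a softmax, the fused value at each position stays within the range spanned by the three branch outputs, so no single branch can push the result outside the sigmoid range.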

Finally, a convolutional layer processes the fused features to produce the transformed feature map. Subsequently, this map is combined with the original input through a residual connection to further enhance feature representation:

$$\mathrm{Output} = \mathrm{ReLU}\left(\mathrm{GN}\left(\mathrm{Conv}_{3\times3}\left(F_{fused}\right)\right)\right) + X \tag{12}$$

where GN denotes the Group Normalization operation. The final Output integrates both the attention-enhanced features and the residual information.

3.4. Accuracy Evaluation

To systematically evaluate the performance of the proposed model in UV extraction tasks, this study conducts both accuracy evaluation and ablation experiments. In the accuracy evaluation, commonly used metrics such as Overall Accuracy (OA), mean Intersection over Union (mIoU), Precision, Recall, and F1-score are adopted as the primary evaluation criteria.

OA reflects the overall classification capability of the model; mIoU provides a more accurate measure of segmentation performance for the UV class under imbalanced class distributions; Precision evaluates the reliability of the model’s predictions for UV regions; Recall indicates the model’s performance in terms of detection completeness; and F1-score, as the harmonic mean of Precision and Recall, is used to comprehensively assess the balance between accuracy and completeness.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{13}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{14}$$

$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{15}$$

$$\mathrm{OA} = \frac{TP + TN}{TP + FP + FN + TN} \tag{16}$$

$$\mathrm{mIoU} = \frac{1}{k + 1} \sum_{i = 0}^{k} \frac{TP_{i}}{TP_{i} + FP_{i} + FN_{i}} \tag{17}$$

where TP, FP, FN, and TN denote true positives, false positives, false negatives, and true negatives of the predictions, respectively; the subscript $i$ indexes the classes, and $k + 1$ is the number of classes.
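Equations (13) to (17) can be checked with a short calculation from confusion-matrix counts. The sketch below assumes a two-class setting (UV and background), in which the background IoU is obtained by swapping the roles of positives and negatives; the function name `binary_metrics` and the example counts are illustrative and are not taken from the experiments. Zero denominators are not handled.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, Recall, F1-score, OA and two-class mIoU from pixel counts."""
    precision = tp / (tp + fp)                           # Eq. (13)
    recall = tp / (tp + fn)                              # Eq. (14)
    f1 = 2 * precision * recall / (precision + recall)   # Eq. (15)
    oa = (tp + tn) / (tp + fp + fn + tn)                 # Eq. (16)
    iou_uv = tp / (tp + fp + fn)
    # For the background class, TN plays the role of TP, and FP/FN swap roles.
    iou_bg = tn / (tn + fn + fp)
    miou = (iou_uv + iou_bg) / 2                         # Eq. (17) with k + 1 = 2
    return {"precision": precision, "recall": recall, "f1": f1,
            "oa": oa, "miou": miou}


# Made-up counts, for illustration only.
scores = binary_metrics(tp=80, fp=20, fn=10, tn=890)
```

In this toy case OA is 0.97 while the UV-class IoU is only about 0.73, which illustrates why mIoU is the more informative metric when the background dominates the image.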

4. Results and Analysis

4.1. Experimental Environment and Implementation Details

All experiments were implemented in PyTorch 2.0, with training and inference run on an NVIDIA GeForce RTX 4070 GPU (8 GB memory). Models were trained for 150 epochs using the AdamW optimizer with a batch size of 2 and an initial learning rate of 0.0006. A combination of linear warm-up and polynomial decay (PolyLR, power = 1.0, eta_min = 0.0) was used for learning rate scheduling. Specifically, the first 5 epochs used a linear warm-up strategy (start_factor = 1 × 10⁻⁶) to improve training stability at the early stage.
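The learning rate schedule can be written out explicitly. The sketch below is one plausible reading of the settings above and not the authors' configuration: it assumes the schedule is stepped once per epoch, that warm-up ramps linearly from start_factor × base rate to the base rate over the first 5 epochs, and that the polynomial decay then runs over the remaining 145 epochs. The function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=6e-4, total_epochs=150, warmup_epochs=5,
                  start_factor=1e-6, power=1.0, eta_min=0.0):
    """Learning rate at a given epoch: linear warm-up, then polynomial decay."""
    if epoch < warmup_epochs:
        # Linear ramp of the scaling factor from start_factor to 1.
        factor = start_factor + (1.0 - start_factor) * epoch / warmup_epochs
        return base_lr * factor
    # Assumed: decay starts when warm-up ends and reaches eta_min at the last epoch.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return (base_lr - eta_min) * (1.0 - progress) ** power + eta_min


schedule = [learning_rate(e) for e in range(151)]
```

With power = 1.0 the decay phase is a straight line from the base rate down to eta_min, so the schedule as a whole is a short ramp up followed by a long linear ramp down.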

To reduce the impact of randomness, all experiments were initialized with random parameters and repeated three times with random seeds set to 42, 2023, and 3407, respectively. After each training run, the main metrics (mIoU, OA, Precision, Recall, and F1 Score) were computed, and the final reported results are the averages over the three runs. All models were trained under the same data split, input size, data augmentation strategies, and hyperparameters to ensure comparability and fairness.

4.2. Comparison Experiments

To comprehensively evaluate the performance of the proposed TransUV model, several representative semantic segmentation networks were selected as baselines. These include CNN-based models (U-Net, PSPNet, DeepLab v3+), Transformer-based models (ViT, Swin Transformer, Segmenter), and hybrid architectures (SegFormer, Mask2Former). All models were trained from scratch without using any pre-trained weights. The input image size was uniformly set to 512 × 512 pixels, and the dataset was split into training and validation sets with a ratio of 8:2.

From the results in Table 2, TransUV achieves the best performance on both key metrics, with an mIoU of 86.67% and an OA of 92.98%, clearly outperforming advanced models such as Mask2Former and DeepLab v3+. This indicates that when dealing with UVs characterized by high building density and ambiguous boundaries, TransUV possesses stronger feature representation and spatial discrimination capability. In terms of efficiency, TransUV has 28.67 M parameters and 34.58 G FLOPs. Although its computational cost is slightly higher than that of the lightweight SegFormer, it is substantially lower than that of large models such as ViT and Swin Transformer, thereby achieving a favorable balance between accuracy and efficiency. These results demonstrate that TransUV not only excels in theoretical performance but also has strong potential for practical applications.

To further analyze the trade-off between performance and complexity, a FLOPs–mIoU bubble chart (Figure 8) was plotted based on the results in Table 2. In this chart, the horizontal axis represents the computational cost FLOPs (G), the vertical axis denotes segmentation accuracy mIoU (%), and the bubble area corresponds to the number of parameters. As shown in Figure 8, Segmenter has relatively small FLOPs and parameter size, yet it exhibits markedly low accuracy. PSPNet, DeepLab v3+, ViT, Swin Transformer, and Mask2Former are clustered on the right side of the chart, with large FLOPs and parameter scales. Among them, ViT and Swin Transformer have the highest computational and parameter complexity and can be regarded as typical “high-complexity–medium-to-high-accuracy” models. Meanwhile, SegFormer achieves relatively high mIoU with extremely low FLOPs and parameter size, making it stand out among lightweight models.

Compared with these baselines, TransUV is located in the upper-left region of the bubble chart. It achieves the highest mIoU while maintaining a relatively low computational cost and a moderate number of parameters. To be more specific, among models with mIoU close to or exceeding 84%, the FLOPs of TransUV are much lower than those of PSPNet, DeepLab v3+, ViT, Swin Transformer, and Mask2Former, and only slightly higher than those of the ultra-lightweight SegFormer, yet TransUV yields clearly superior accuracy. This suggests that TransUV lies close to the Pareto frontier in the two-dimensional “accuracy–efficiency” space, combining high accuracy with low computational cost and thus providing a more cost-effective option for deploying fine-scale UV extraction models in real-world operational systems.
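The Pareto-frontier reading of the bubble chart can be made concrete with a small dominance check. The sketch below is illustrative only: the TransUV point uses the FLOPs and mIoU reported in the text, while the three `baseline_*` entries are hypothetical placeholders rather than values from Table 2. The function name `pareto_front` is likewise an assumption.

```python
from typing import Dict, List, Tuple


def pareto_front(models: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the models that are not dominated in (lower FLOPs, higher mIoU).

    A model is dominated if another model is at least as good on both
    criteria and strictly better on at least one of them.
    """
    front = []
    for name, (flops, miou) in models.items():
        dominated = any(
            (o_flops <= flops and o_miou >= miou)
            and (o_flops < flops or o_miou > miou)
            for other, (o_flops, o_miou) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# (FLOPs in G, mIoU in %)
models = {
    "TransUV": (34.58, 86.67),         # values reported in the text
    "baseline_light": (10.0, 84.0),    # hypothetical placeholder
    "baseline_heavy": (150.0, 85.0),   # hypothetical placeholder
    "baseline_weak": (20.0, 60.0),     # hypothetical placeholder
}
print(pareto_front(models))
```

Replacing the placeholders with the actual rows of Table 2 would show which models are non-dominated in the accuracy–efficiency plane.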

On the basis of quantitative evaluation, this study further analyzes the performance of different models in UV segmentation through visual comparisons. As shown in Figure 9, Segmenter (Figure 9c) is nearly unable to extract UV regions, indicating its limited adaptability to highly fragmented and spatially heterogeneous scenes. Traditional CNN-based methods, including PSPNet (Figure 9d), U-Net (Figure 9f), and DeepLab v3+ (Figure 9g), generally leave many holes when segmenting dense UV areas, accompanied by poor boundary continuity and completeness. This limitation fundamentally originates from the local receptive field of convolutional operations [44,45]. Although convolution is effective in capturing local texture features, it has difficulty establishing a coherent long-range understanding of entire UVs as a single, unified entity. In particular, U-Net (Figure 9f) shows obvious misclassification in the fourth sample, where a large number of regular urban buildings are incorrectly identified as UVs, further demonstrating its limited ability to represent the complex morphological characteristics of UVs.

In contrast, Transformer-based networks and hybrid architectures achieve better internal integrity in the segmentation results, with significantly reduced fragmentation. Only ViT (Figure 9h) still shows a small number of holes. This improvement mainly benefits from the long-range semantic modeling capability enabled by the self-attention mechanism in Transformers. However, despite their clear advantages in global consistency, Transformer-based models still exhibit certain ambiguities in UV boundary delineation and occasionally misclassify non-UV regions near object edges. Notably, TransUV (Figure 9k) delivers the best overall performance among all compared methods, generating more accurate, continuous, and complete UV boundaries while effectively suppressing misclassification around building edges. These results indicate that TransUV achieves a more favorable balance between global contextual modeling and fine-grained boundary perception, making it particularly suitable for fine-grained UV extraction in complex high-resolution remote sensing scenes.

4.3. Ablation Experiment

To verify the contribution of each core module in the TransUV model, we systematically conducted ablation experiments on the UV datasets. The results are shown in Table 3. The experiments used TransNeXt as the baseline model, which achieved an mIoU of 84.13% and an overall accuracy (OA) of 91.52%. This demonstrates that the baseline model has a strong initial feature extraction ability.

After replacing the decoder of the baseline model with the SegUV decoder, the mIoU increased to 85.33%, the OA to 92.21%, and the F1 Score to 91.04%. This change indicates that this structure improves the feature reconstruction and semantic information fusion capabilities during the decoding process. After further adding the FCN auxiliary head module, all indicators showed a stable upward trend: the mIoU increased to 86.15%, the OA to 92.67%, the F1 Score to 91.60%, and the Recall to 91.69%. This indicates that multi-level semantic aggregation helps enhance the model’s ability to discriminate UV areas. Introducing the MFEM at the encoder front end further improved the mIoU to 86.30% and the OA to 92.79%, and raised the Precision to 93.33%. This indicates that the module is effective in complex boundary extraction and antialiasing processing. Finally, the AAFM was embedded in the decoder, and the overall performance of the model reached its best level: the mIoU improved to 86.67%, the OA to 92.98%, and the F1 Score to 91.89%, while a high Recall (91.32%) and Precision (92.46%) were maintained. This shows that the multi-branch attention mechanism plays a crucial role in fusing multi-scale context and direction-aware features. The differing trends across metrics indicate that the MFEM and AAFM play clear and complementary roles. The MFEM markedly raises Precision by strengthening feature discriminability, while the AAFM recovers Recall through global contextual fusion while keeping Precision high. Together, the two modules yield a comprehensive improvement in the overall performance of the model.
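The step-by-step gains reported above can be tallied with a few lines of arithmetic. The snippet below uses only the mIoU, Precision, and Recall values quoted in the text; the variable names are illustrative. It also checks that the final F1 Score is consistent with the final Precision and Recall under Equation (15). Because the reported figures are averages over three runs, this agreement is expected to hold only approximately.

```python
# mIoU (%) after each ablation step, as quoted in the text.
steps = [
    ("TransNeXt baseline", 84.13),
    ("+ SegUV decoder", 85.33),
    ("+ FCN auxiliary head", 86.15),
    ("+ MFEM", 86.30),
    ("+ AAFM (full TransUV)", 86.67),
]

# Incremental mIoU gain contributed by each added component.
gains = [
    (name, round(miou - prev_miou, 2))
    for (_, prev_miou), (name, miou) in zip(steps, steps[1:])
]
total_gain = round(steps[-1][1] - steps[0][1], 2)

# Consistency check of the final F1 Score against Precision and Recall.
precision, recall = 92.46, 91.32
f1_from_pr = 2 * precision * recall / (precision + recall)

for name, gain in gains:
    print(f"{name}: +{gain:.2f} mIoU")
print(f"total: +{total_gain:.2f} mIoU, F1 from P/R: {f1_from_pr:.2f}")
```

Read this way, the SegUV decoder and the auxiliary head account for most of the mIoU gain, while the MFEM and AAFM add smaller increments that mainly shift the balance between Precision and Recall.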

To further investigate the influence of the MFEM and AAFM on the model’s attention distribution, Grad-CAM was employed to generate feature heatmaps. As shown in Figure 10, we compared the feature activation patterns of the base model (Base), the model with only the MFEM (+MFEM), and the model incorporating both the MFEM and AAFM (+MFEM&AAFM) across different network stages (Stage-1 to Stage-4) as well as at the final output (Head).

The results indicate that, compared with the base model, the introduction of the MFEM significantly enhances the model’s ability to focus on key regions. In the early stages (Stage-1 and Stage-2), feature activations already exhibit a more distinct initial aggregation trend than those of the base model. As the network depth increases (Stage-3 and Stage-4), the model’s attention becomes more clearly and stably focused on the UV areas. In the final Head output, high-response regions are highly localized within the internal structures of UVs, thus significantly enhancing the recognition accuracy of the target regions. Building on the integration of the MFEM, the further incorporation of the AAFM effectively enhances the capture of UV boundaries. Although its activation patterns in the shallow stages (Stage-1 and Stage-2) are similar to those of the +MFEM model, more extensive and coherent response regions are observed in the deeper stages (Stage-3 and Stage-4). Most importantly, in the final Head heatmap, the activation intensity along the contours of UVs is significantly enhanced. This indicates that the AAFM optimizes feature fusion and attention focusing, particularly by increasing the model’s sensitivity to discriminative boundary features.
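For reference, the standard Grad-CAM formulation behind such heatmaps is given below. The notation is the usual one and is not taken from the paper: A^k is the k-th feature map of the inspected layer, Z is the number of spatial positions, and y^c is the score of class c. For segmentation, y^c is often taken as the sum of the class-c logits over the pixels of interest; the text does not state which definition was used here, so this is an assumption.

```latex
\alpha_k^{c} = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^{c}}{\partial A_{ij}^{k}},
\qquad
L_{\text{Grad-CAM}}^{c} = \mathrm{ReLU}\!\left( \sum_{k} \alpha_k^{c} A^{k} \right)
```

The ReLU keeps only the features that contribute positively to the class score, which is why the heatmaps highlight regions supporting the UV prediction rather than those opposing it.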

4.4. Inference over the Study Areas

In this study, only a small portion of each study area was used for training and validation. The best-performing model weights obtained from this phase were then transferred and applied to the entire study areas to generate complete prediction maps of UVs in the central urban districts of Kunming and Nanning.

In terms of overall extraction performance, the predicted urban-village patches are highly consistent in spatial location and geometric shape with residential areas in the high-resolution imagery. These residential areas are characterized by high building density, low building height, and predominantly self-built houses. The locally enlarged views I–IV in Figure 11 demonstrate that the proposed method can accurately delineate the irregular boundaries of UVs as well as their fine internal road networks, and exhibits strong capability in identifying both large, contiguous clusters and scattered patches embedded within the urban built-up area. At the macro scale, the spatial distribution of the predicted results is largely consistent with the typical urban-village areas recognized in previous studies and local planning practice. It should also be noted that, due to similarities in building morphology, height, and roof materials, a few low-rise, high-greening villa compounds located on the urban fringes of Kunming and Nanning are misclassified as urban-village patches. Such misdetections exist to a certain extent; however, the corresponding patches account for only a small proportion of the total area and number of predictions, and thus have limited impact on the interpretation of the overall spatial pattern and subsequent statistical analyses.

In the central urban area of Kunming, urban-village patches are mainly distributed within the continuous built-up areas of Wuhua District, Panlong District, Guandu District, and the northern part of Chenggong District, which forms a belt-like pattern along the north–south urban development axis. The corridor formed by Wuhua–Panlong–Guandu constitutes the core urban area, within which the purple urban-village patches are highly concentrated and partially merged into large contiguous clusters, indicating a high degree of agglomeration of UVs in the city center. Further south into Chenggong District, UVs still appear sporadically along the main development corridor, but the patches decrease markedly within the newly developed groups at the southernmost part of the area. In contrast, almost no urban-village patches are found in the extensive mountainous and water-covered areas in the western and southern parts of Xishan District. Only a few patches appear at the edges of the built-up area adjacent to the main urban core. Overall, the spatial pattern of UVs in Kunming can be summarized as “high concentration in the center, belt-like extension along the north–south axis, and scarcity in mountainous and water areas”.

In the central urban area of Nanning, the spatial distribution of UVs is highly consistent with the valley-type urban morphology. Almost all predicted patches are embedded within the built-up areas along both banks of the Yongjiang River, forming a pronounced belt-shaped cluster pattern along the river. UVs are most concentrated in Xixiangtang District and Jiangnan District, where large, continuous patches are distributed along the northern and southern banks of the Yongjiang, respectively. Numerous urban-village patches are also present in the northern part of Xingning District and the northern part of Liangqing District. These patches, together with those in Xixiangtang and Jiangnan, constitute a high-density ring within the main urban area. In contrast, because of the extensive green spaces and mountainous terrain in the central part of Qingxiu District, UVs are relatively sparse there and occur only sporadically along the built-up edges near the river valley and district boundaries. Yongning District has the fewest urban-village patches overall. Small patches are observed only along the Yongjiang River and at the northern fringe adjacent to the main urban area, while the southeastern part of the district, dominated by non-built-up land, shows almost no urban-village predictions.

By comparing the two cities, it can be observed that UVs in Kunming are more prominently distributed along the north–south urban development axis, whereas those in Nanning are mainly organized along the east–west river valley, forming a clear “river-oriented belt with clusters on both banks” structure. Despite differences in topography and development axes, both cities exhibit a common pattern: UVs are primarily concentrated within and around the existing built-up areas of the old city, and their occurrence gradually weakens towards newly developed urban districts and non-construction areas such as mountains and water bodies. This consistent ring-like distribution further corroborates the spatial plausibility and reliability of the model’s extraction results.

4.5. Case Study: Multi-Temporal Prediction of UV Redevelopment

This section takes Liede Village (113.327°–113.339°E, 23.112°–23.119°N) in Tianhe District, Guangzhou City, as a case study to evaluate the ability of the TransUV model to capture the dynamic changes of UVs from multi-temporal remote sensing images. As the first systematic old city renovation project in Guangzhou (Guangzhou Municipal Government, 2009), Liede Village’s renovation process is representative: demolition work was initiated in 2007 and the renovation was basically completed by the end of 2009. Villagers moved into their new homes before the Spring Festival in 2010. The study selected high-resolution remote sensing images from three time points: 2007 (before demolition), 2009 (during demolition), and 2017 (after renovation). We use the TransUV model to identify the UVs in the study area at each time point and analyze their changes.

The experimental results show that the TransUV model can effectively identify the UV areas of Liede Village in different periods (Figure 12). In the 2007 image, the model identified a large number of densely built UV areas, indicating that the area had not yet undergone renovation. The predicted masks exhibit a continuous and complete distribution, demonstrating that the model can obtain UV extraction results with good internal consistency and integrity in high-density building environments. By 2009, with the progress of the demolition work, the red areas identified by the model had significantly decreased, and their edges clearly traced the boundary between the remaining UV buildings and the exposed ground surface. The interior of the masks remained coherent, indicating that the model could distinguish mixed land types. By 2017, the original UVs had been redeveloped into modern residential communities. In the model output, UV predictions had completely disappeared from the site, which was now occupied by regularly arranged light gray new building clusters and green belts. The masking results are highly consistent with the actual land features, clearly presenting the reconstruction of the spatial pattern after urban renewal.

The results of the research indicate that the TransUV model can accurately identify the range changes of UVs during the renovation process, and the recognition results match well with the actual land features, verifying the applicability of the model in multi-period UVs recognition tasks.

5. Discussion

5.1. Rationality of the Area Threshold Strategy

In this study, an area threshold strategy based on UV coverage ratio was introduced in the label screening stage to enhance the quality of the label set and reduce training cost. Specifically, for each image patch, we calculated the UV coverage ratio r, and samples with r < 20% were directly discarded. Figure 13 shows the distribution of r for all labeled patches. The violin–box plot indicates that the 25th, 50th, and 75th percentiles are 7.5%, 19.4%, and 38.6%, respectively, and most samples fall within the 0–40% range. The 20% threshold is slightly higher than the empirical median of 19.4% and can therefore be regarded as a data-driven “median split” that statistically separates samples into “weak-coverage” and “large-area coverage” groups.

The examples on the right side of Figure 13 further provide an intuitive justification for this division. When r ≈ 7.5%, UV areas appear only as scattered fragments, are highly sensitive to slice position and minor annotation perturbations, and are still dominated by background land-cover types as a whole. In contrast, when r ≈ 38.6%, UV areas form connected patches with more stable texture and morphological characteristics, and the discriminative information is more concentrated. If these two types of samples were mixed for training, the few fragmented UV targets would easily be “overwhelmed” by the background in the feature space, thereby amplifying the effects of label noise and class imbalance.

In terms of data scale, 4143 labeled patches were initially available. After applying the 20% area threshold, 2021 patches were retained for model training and validation, accounting for approximately 48.8% of all samples; the discarded low-coverage patches thus make up about half of the original set. On the one hand, this substantially reduces redundant and unstable samples and focuses training on more representative UV patterns. This is beneficial for enhancing model convergence and generalization. On the other hand, the remaining dataset is still sufficiently large, so the learning capacity of the model is not compromised by a lack of samples. It should be noted that this strategy unavoidably sacrifices the recall of very small and isolated UV fragments. However, given that the main objective of this study is to identify contiguous UV areas, this trade-off, which prioritizes “regional integrity”, is reasonable and interpretable.
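The screening rule discussed in this section can be sketched as follows. This is an illustrative sketch rather than the authors' preprocessing code: it assumes binary label masks in which 1 marks UV pixels, and the helper names `coverage_ratio` and `screen_patches` as well as the toy masks are hypothetical.

```python
from typing import List, Sequence

Mask = Sequence[Sequence[int]]  # assumed binary label mask: 1 = UV, 0 = background


def coverage_ratio(mask: Mask) -> float:
    """Share of pixels labeled as UV in one patch (the ratio r in the text)."""
    total = sum(len(row) for row in mask)
    uv = sum(1 for row in mask for value in row if value == 1)
    return uv / total if total else 0.0


def screen_patches(masks: List[Mask], threshold: float = 0.20) -> List[int]:
    """Indices of patches kept by the rule: discard samples with r < threshold."""
    return [i for i, mask in enumerate(masks) if coverage_ratio(mask) >= threshold]


# Toy 2 x 4 masks; real patches are 512 x 512 pixels.
masks = [
    [[0, 0, 0, 0], [0, 0, 0, 1]],  # r = 1/8 = 12.5%, discarded
    [[1, 1, 0, 0], [1, 0, 0, 0]],  # r = 3/8 = 37.5%, kept
    [[0, 0, 0, 0], [0, 0, 0, 0]],  # r = 0%, discarded
]
kept = screen_patches(masks)

# Share of patches retained in the study, from the counts given in the text.
retained_share = 2021 / 4143
print(kept, f"{100 * retained_share:.1f}%")
```

Because the threshold sits just above the median coverage of 19.4%, the retained share is expected to be slightly below one half, which matches the reported figure.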

5.2. Effectiveness and Limitations of the Proposed Method

The proposed TransUV model achieves an mIoU of 86.67% and an OA of 92.98% in the UV recognition task. Its overall performance surpasses that of multiple mainstream baseline methods, demonstrating the effectiveness of the proposed framework in complex urban environments. On the encoder side, the model incorporates the MFEM, and on the decoder side, it employs the AAFM. These components jointly enhance the representation of texture and boundary information across multiple scales and, via attention mechanisms, adaptively select and fuse key features, thereby alleviating typical problems in UV segmentation such as fuzzy boundaries, missed small objects, and class confusion.

Visual comparisons further substantiate these quantitative findings. TransUV exhibits exceptional performance in boundary completeness and regional coherence, even in challenging scenarios with fragmented morphologies or complex materials. This capability stems from a more fundamental architectural advantage. When compared to classical CNN architectures like U-Net and PSPNet, TransUV’s integration of global contextual modeling, inherited from the TransNeXt backbone, enables a more holistic understanding of the scene. This overcomes the inherent limitation of CNNs, whose local receptive fields often lead to fragmented predictions in highly heterogeneous areas like UVs. Conversely, compared to pure Transformer-based models (e.g., ViT, Swin Transformer), the carefully designed MFEM and AAFM ensure that fine-grained local details are preserved and enhanced throughout the network. This results in a superior balance between capturing long-range dependencies and extracting precise local textures, yielding sharper detail restoration and competitive inference efficiency.

Nevertheless, this study still has certain limitations. We observe that, in some cases, TransUV tends to misclassify low-rise, highly vegetated residential villa areas as UVs. This is mainly due to the high morphological similarity between these two land-use types in terms of building density and roof materials. Such misclassifications typically occur in suburban villa areas that share visual characteristics with UVs, including relatively dense building layouts and irregular texture patterns. From a mechanistic perspective, the low-rise and high-density building configurations of villa areas are spatially similar to those of UVs, while commonly used roof materials, such as clay tiles and cement tiles, exhibit overlapping spectral reflectance characteristics with the blue color-coated steel roofs widely found in UVs. In addition, the fragmented texture patterns induced by vegetation shadows in highly vegetated villa areas further exacerbate feature confusion during model discrimination.

To mitigate the aforementioned misclassification issues, future research could further improve the model’s discriminative capability through multi-source data fusion and temporal feature modeling. One promising direction is the integration of height-related data, such as LiDAR or Digital Surface Models (DSM), to explicitly exploit vertical structural differences between land-use types. Villa areas generally follow relatively strict height regulations, whereas UVs often exhibit irregular vertical extensions and variations in building height. Moreover, incorporating multi-temporal remote sensing imagery to capture the temporal evolution of different land-use types is also worthy of further exploration. UVs typically exhibit more dynamic evolution patterns, such as informal expansion or frequent modifications, while villa areas tend to remain relatively stable over longer time scales. These multi-source and multi-temporal data can provide critical contextual information beyond the visual features captured by single-date high-resolution imagery, thereby further enhancing the reliability and cross-scene generalization ability of UVs identification.

6. Conclusions

In order to enhance the accuracy of UV extraction from high-resolution remote sensing imagery, this paper proposes TransUV, a multi-scale attention-fusion segmentation framework built upon TransNeXt. During data preprocessing, an area threshold strategy based on UV coverage is introduced to retain high-quality samples and alleviate label noise and class imbalance. In model design, MFEM is incorporated at the encoder front end to enhance boundary and texture cues, while AAFM is embedded in a lightweight decoder (SegUV) to enable adaptive multi-scale feature fusion and semantic refinement.

Experimental results show that, under identical settings, TransUV consistently outperforms representative CNN-based and Transformer-based baselines on key metrics (e.g., mIoU and recall). It produces more complete boundaries and better detail preservation in qualitative comparisons. These results provide evidence for the effectiveness of the proposed data screening strategy and task-oriented architectural design for UV extraction.

Overall, TransUV offers a feasible technical solution for the fine-grained recognition of complex UV land-cover patterns in high-resolution imagery. Future work will further evaluate its robustness and generalizability across broader geographic regions and heterogeneous data sources. It will also explore the integration of complementary information (e.g., height or multi-temporal observations) to reduce confusion among morphologically similar land-use types. The proposed approach provides a methodological reference for large-scale UV monitoring and urban renewal analysis.

Author Contributions

Conceptualization, X.L. and G.W.; Data Curation, X.L. and S.C.; Methodology, Y.W., Y.Z. and S.C.; Validation, X.L.; Writing—Original Draft, X.L.; Writing—Review and Editing, Y.W., Y.Z. and G.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Chengdu Urban and Rural Black and Odorous Water Remote Sensing Identification Ground Verification Project (HBD508) and the China Geological Survey Project “Construction of Marine Geological Information System and Product Development” (DD20191006).

Data Availability Statement

Due to licensing and usage restrictions, these data have not been authorized for public release.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Appendix A

This appendix investigates, through ablation experiments, how the core parameters of the two key sub-modules of the Multi-level Feature Enhancement Module (MFEM), namely the LoGFilter Block and the Gaussian Block, influence the model’s segmentation performance. The parameters of the LoGFilter Block are the scale factor k and the standard deviation σ; the Gaussian Block likewise has its own scale factor k and standard deviation σ. All experiments were conducted under identical training settings, and the evaluation metrics (mIoU, OA, and F1-Score) are consistent with those used in the main text.

Table A1. Ablation Results for Parameters of the LoG Filter Block.

k	σ	mIoU (%)	OA (%)	F1-Score (%)
5	1	85.80	92.50	91.25
9	1	86.76	93.05	91.88
7	0.5	86.38	92.83	91.65
7	2	86.62	92.97	91.80
7	1	86.82	93.07	91.96

Table A2. Ablation Results for Parameters of the Gaussian Block.

k	σ	mIoU (%)	OA (%)	F1-Score (%)
7	0.5	86.12	92.68	91.46
11	0.5	86.18	92.71	91.51
9	0.2	86.03	92.64	91.38
9	1	86.36	92.81	91.66
9	0.5	86.82	93.07	91.96

Experimental results indicate that the parameter combination k = 7, σ = 1 yields the best performance for the LoG Filter Block. For the Gaussian Block, the optimal parameter combination is k = 9, σ = 0.5.
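To make the two parameters concrete, the sketch below builds a Laplacian-of-Gaussian kernel and a Gaussian kernel at the best-performing settings. It rests on assumptions: k is interpreted as the (odd) kernel size, the LoG kernel follows the textbook continuous formula sampled on an integer grid and shifted to zero sum, and the Gaussian kernel is normalized to unit sum. The paper's LoGFilter Block and Gaussian Block may differ in these details, and the function names are hypothetical.

```python
import math
from typing import List

Kernel = List[List[float]]


def gaussian_kernel(k: int, sigma: float) -> Kernel:
    """k x k Gaussian smoothing kernel, normalized to sum to 1."""
    half = k // 2
    raw = [[math.exp(-(x * x + y * y) / (2.0 * sigma ** 2))
            for x in range(-half, half + 1)]
           for y in range(-half, half + 1)]
    total = sum(sum(row) for row in raw)
    return [[value / total for value in row] for row in raw]


def log_kernel(k: int, sigma: float) -> Kernel:
    """k x k Laplacian-of-Gaussian kernel, shifted so that it sums to 0."""
    half = k // 2
    s2 = sigma ** 2
    raw = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            r2 = x * x + y * y
            # Textbook LoG: -(1 / (pi * sigma^4)) * (1 - r^2 / (2 sigma^2)) * exp(-r^2 / (2 sigma^2))
            row.append(-(1.0 / (math.pi * s2 * s2))
                       * (1.0 - r2 / (2.0 * s2))
                       * math.exp(-r2 / (2.0 * s2)))
        raw.append(row)
    # Subtract the mean so that flat image regions produce a zero response.
    mean = sum(sum(row) for row in raw) / (k * k)
    return [[value - mean for value in row] for row in raw]


log_k = log_kernel(k=7, sigma=1.0)         # best LoG setting in Table A1
gauss_k = gaussian_kernel(k=9, sigma=0.5)  # best Gaussian setting in Table A2
print(len(log_k), len(gauss_k))
```

Under this reading, a larger σ widens the band of edges the LoG filter responds to, while k must be large enough to hold the kernel's support without truncating it.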

References

Liu, Y.; He, S.; Wu, F.; Webster, C. Urban villages under China’s rapid urbanization: Unregulated assets and transitional neighbourhoods. Habitat Int. 2010, 34, 135–144.
Chen, B.; Feng, Q.; Niu, B.; Yan, F.; Gao, B.; Yang, J.; Gong, J.; Liu, J. Multi-modal fusion of satellite and street-view images for urban village classification based on a dual-branch deep neural network. Int. J. Appl. Earth Obs. Geoinf. 2022, 109, 102794.
Wei, H.; Qi, W.; Liu, S.; Li, Y.; Yin, Y. Uncovering the edge urban villages using social media big data: A case study in Beijing, China. Habitat Int. 2025, 159, 103357.
Cao, R.; Tu, W.; Chen, D.; Zhang, W. Mapping urban villages in China: Progress and challenges. Comput. Environ. Urban Syst. 2025, 119, 102282.
Liu, R.; Wong, T.-C. Urban village redevelopment in Beijing: The state-dominated formalization of informal housing. Cities 2018, 72, 160–172.
Kasula, P.; Dedekorkut-Howes, A.; Shearer, H.; Baum, S. Social inclusion of urban villages: A systematic review of global urban planning practices. Cities 2026, 169, 106509.
Gao, Q.L.; Yue, Y.; Tu, W.; Cao, J.; Li, Q.Q. Segregation or integration? Exploring activity disparities between migrants and settled urban residents using human mobility data. Trans. GIS 2021, 25, 2791–2820.
Hu, H.; Sun, Y.; Xi, J. Treatment and water quality improvement technology of black and malodorous water body in urban area. Environ. Prot. 2015, 43, 24–26.
Bo, S.; Cheng, C. Political hierarchy and urban primacy: Evidence from China. J. Comp. Econ. 2021, 49, 933–946.
Chen, D.; Tu, W.; Cao, R.; Zhang, Y.; He, B.; Wang, C.; Shi, T.; Li, Q. A hierarchical approach for fine-grained urban villages recognition fusing remote and social sensing data. Int. J. Appl. Earth Obs. Geoinf. 2022, 106, 102661.
Ayo, B. Integrating Openstreetmap Data and Sentinel-2 Imagery for Classifying and Monitoring Informal Settlements. Master’s Thesis, Universidade NOVA de Lisboa, Lisbon, Portugal, 2020.
Yinnian, L. Development of hyperspectral imaging remote sensing technology. Natl. Remote Sens. Bull. 2021, 25, 439–459.
Fan, R.; Li, F.; Han, W.; Yan, J.; Li, J.; Wang, L. Fine-scale urban informal settlements mapping by fusing remote sensing images and building data via a transformer-based multimodal fusion network. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16.
Tan, X.; Meng, Q.; Zhao, F.; Zhang, L.; Hu, X.; Jancsó, T. HR-UVFormer: A top-down and multimodal hierarchical extraction approach for urban villages. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4407115.
Li, L.; Chen, B.; Zou, X.; Xing, J.; Tao, P. UV-Mamba: A DCN-Enhanced State Space Model for Urban Village Boundary Identification in High-Resolution Remote Sensing Images. In Proceedings of the ICASSP 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5.
Wang, Z.; Sun, Q.; Zhang, X.; Hu, Z.; Chen, J.; Zhong, C.; Li, H. CUGUV: A Benchmark Dataset for Promoting Large-Scale Urban Village Mapping with Deep Learning Models. Sci. Data 2025, 12, 390.
Gibril, M.B.A.; Al-Ruzouq, R.; Bolcek, J.; Shanableh, A.; Jena, R. Building Extraction from Satellite Images Using Mask R-CNN and Swin Transformer. In Proceedings of the 2024 34th International Conference Radioelektronika (RADIOELEKTRONIKA), Žilina, Slovakia, 17–18 April 2024; pp. 1–5.
Huang, Y.; Zhang, F.; Gao, Y.; Tu, W.; Duarte, F.; Ratti, C.; Guo, D.; Liu, Y. Comprehensive urban space representation with varying numbers of street-level images. Comput. Environ. Urban Syst. 2023, 106, 102043.
Xiao, C.; Zhou, J.; Huang, J.; Zhu, H.; Xu, T.; Dou, D.; Xiong, H. A contextual master-slave framework on urban region graph for urban village detection. In Proceedings of the 2023 IEEE 39th International Conference on Data Engineering (ICDE), Anaheim, CA, USA, 3–7 April 2023; pp. 736–748.
Fan, R.; Niu, H.; Xu, Z.; Chen, J.; Feng, R.; Wang, L. Refined urban informal settlements mapping at agglomeration-scale with the guidance of background-knowledge from easy-accessed crowdsourced geospatial data. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4401716.
Metzger, N.; Daudt, R.C.; Tuia, D.; Schindler, K. High-resolution population maps derived from sentinel-1 and sentinel-2. Remote Sens. Environ. 2024, 314, 114383.
Shu, Y.; Cai, Z.; Li, G.; Yan, Q.; Li, B.; Si, W.; Qiao, D. Use of Multi-Feature Extraction and Transfer Learning to Identify Urban Villages in China. Remote Sens. 2025, 17, 424.
Wurm, M.; Taubenböck, H.; Weigand, M.; Schmitt, A. Slum mapping in polarimetric SAR data using spatial features. Remote Sens. Environ. 2017, 194, 190–204.
Kit, O.; Lüdeke, M. Automated detection of slum area change in Hyderabad, India using multitemporal satellite imagery. ISPRS J. Photogramm. Remote Sens. 2013, 83, 130–137.
d’Oleire-Oltmanns, S.; Coenradie, B.; Kleinschmit, B. An object-based classification approach for mapping migrant housing in the mega-urban area of the Pearl River Delta (China). Remote Sens. 2011, 3, 1710–1723.
Huang, X.; Liu, H.; Zhang, L. Spatiotemporal detection and analysis of urban villages in mega city regions of China using high-resolution remotely sensed imagery. IEEE Trans. Geosci. Remote Sens. 2015, 53, 3639–3657.
Pan, Z.; Xu, J.; Guo, Y.; Hu, Y.; Wang, G. Deep learning segmentation and classification for urban village using a worldview satellite image based on U-Net. Remote Sens. 2020, 12, 1574.
Abascal, A.; Vanhuysse, S.; Grippa, T.; Rodriguez-Carreño, I.; Georganos, S.; Wang, J.; Kuffer, M.; Martinez-Diez, P.; Santamaria-Varas, M.; Wolff, E. AI perceives like a local: Predicting citizen deprivation perception using satellite imagery. npj Urban Sustain. 2024, 4, 20.
Wang, Q.; Chen, W.; Huang, Z.; Tang, H.; Yang, L. MultiSenseSeg: A cost-effective unified multimodal semantic segmentation model for remote sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4703724.
Shi, Q.; Liu, M.; Liu, X.; Liu, P.; Zhang, P.; Yang, J.; Li, X. Domain adaption for fine-grained urban village extraction from satellite images. IEEE Geosci. Remote Sens. Lett. 2019, 17, 1430–1434. [Google Scholar] [CrossRef]
Chang, Y.; Yu, X.; Yang, X.; Chen, Z.; Chen, P.; Yang, X.; Bai, Y. Agricultural Greenhouse Extraction Based on Multi-Scale Feature Fusion and GF-2 Remote Sensing Imagery. Remote Sens. 2025, 17, 2061. [Google Scholar] [CrossRef]
Zhuozheng, L.; Zhennan, X.; Runliang, X.; Jiahao, S.; Ruihui, M.; Chen, L.; Daofang, L.; Li, X. GPRNet: A Geometric Prior-Refined Semantic Segmentation Network for Land Use and Land Cover Mapping. Remote Sens. 2025, 17, 3856. [Google Scholar]
Feng, Q.; Chen, B.; Niu, B.; Ren, Y.; Wang, Y.; Liu, J. Identification of urban villages from remote sensing image based on multi-scale dilated convolutional neural network. Trans. Chin. Soc. Agric. Mach. 2021, 52, 181–189. [Google Scholar]
Gella, G.W.; Wendt, L.; Lang, S.; Tiede, D.; Hofer, B.; Gao, Y.; Braun, A. Mapping of dwellings in IDP/refugee settlements from very high-resolution satellite imagery using a mask region-based convolutional neural network. Remote Sens. 2022, 14, 689. [Google Scholar] [CrossRef]
Chai, Z.; Liu, M.; Shi, Q.; Zhang, Y.; Zuo, M.; He, D. Fine-Grained Urban Village Extraction By Mask Transformer from High-Resolution Satellite Images in Pearl River Delta. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 13657–13668. [Google Scholar] [CrossRef]
Shi, D. Transnext: Robust foveal visual perception for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 17773–17783. [Google Scholar]
Zhang, Z.; Wang, B.; Buyantuev, A.; He, X.; Gao, W.; Wang, Y.; Dawazhaxi; Yang, Z. Urban agglomeration of Kunming and Yuxi cities in Yunnan, China: The relative importance of government policy drivers and environmental constraints. Landsc. Ecol. 2019, 34, 663–679. [Google Scholar] [CrossRef]
Shuangshuang, T.; Zhenhua, J.; Hualou, L.; Daifei, J.; Xiaoling, G. Spatial pattern and classification of rural settlements in Guangxi. Econ. Geogr. 2023, 43, 159–168. [Google Scholar]
Li, D.; Wang, M.; Jiang, J. China’s high-resolution optical remote sensing satellites and their mapping applications. Geo-Spat. Inf. Sci. 2021, 24, 85–94. [Google Scholar] [CrossRef]
Li, J.; Bai, Y.; Huang, S.; Yang, S.; Sun, Y.; Yang, X. Color-Distortion Correction for Jilin-1 KF01 Series Satellite Imagery Using a Data-Driven Method. Remote Sens. 2024, 16, 4721. [Google Scholar] [CrossRef]
Yi, Z.; Cheng, X.; Ma, J.; Zhu, R.; Tian, J.; Zhou, Y.; Zhao, X.; Li, H. CGEarthEye: A High-Resolution Remote Sensing Vision Foundation Model Based on the Jilin-1 Satellite Constellation. arXiv 2025, arXiv:2507.00356. [Google Scholar]
Li, X.; Gong, P.; Zhou, Y.; Wang, J.; Bai, Y.; Chen, B.; Hu, T.; Xiao, Y.; Xu, B.; Yang, J. Mapping global urban boundaries from the global artificial impervious area (GAIA) data. Environ. Res. Lett. 2020, 15, 094044. [Google Scholar] [CrossRef]
Wang, X.; Chen, L.; Ban, T.; Lyu, D.; Guan, Y.; Wu, X.; Zhou, X.; Chen, H. Accurate label refinement from multiannotator of remote sensing data. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4700413. [Google Scholar] [CrossRef]
Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B. Attention u-net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30; NIPS Foundation: La Jolla, CA, USA; San Diego, CA, USA, 2017. [Google Scholar]

Figure 1. Geographical location and scope of the study area. (a) Kunming; (b) Nanning.

Figure 2. Schematic diagram of typical UVs in the study area: (a) Kunming City; (b) Nanning City.

Figure 3. Spatial distribution and sample examples of UVs in the study area.

Figure 4. Overall architecture of TransUV.

Figure 5. Structure of the MFEM.

Figure 6. Structure of Aggregated Attention, where the green portion represents the Sliding Window Attention, the yellow portion represents the Global Pooled Attention, and the blue portion represents the Positional Attention.

Figure 7. Structure of the AAFM.

Figure 8. Complexity comparison of the nine network models in terms of FLOPs and parameter size.

Figure 9. Visualization of experimental results: (a) original image; (b) ground truth; (c) Segmenter; (d) PSPNet; (e) Segformer; (f) Unet; (g) Deeplab v3+; (h) Vit; (i) Swin transformer; (j) Mask2former; (k) TransUV (our model). The red circles highlight poorly segmented areas, such as misclassified or omitted regions.

Figure 10. Visualization of feature activation heatmaps for the baseline and proposed modules across different network stages. This figure illustrates the Grad-CAM attention maps of the input image (overlaid on the geographical background) at various network depths (Stage-1 to Stage-4) and the final output layer (Head). The color scale represents the intensity of the activation, where warmer colors (red, yellow) indicate higher attention and cooler colors (blue, purple) indicate lower attention.

Figure 11. Predicted UV distribution in the central urban areas of Kunming (a) and Nanning (b), with local zoomed examples (I–IV). The red circles highlight areas where a small number of villas were incorrectly classified as UV.

Figure 12. Predicted results of UVs in Liede Village: (a–c) predictions for October 2007, October 2009, and October 2017, respectively. The red areas represent the predicted extent of UVs.

Figure 13. Distribution of the target/background ratio and threshold screening for UV labels. The red dashed line marks the threshold used to screen samples.

Table 1. Description of training and validation samples for UV extraction. All image patches have a spatial size of 512 × 512 pixels.

City	Training Patches	Validation Patches	Total Patches
Kunming	991	254	1245
Nanning	625	151	776
Total	1616	405	2021
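The counts in Table 1 can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the authors' pipeline; the names `table1` and `train_fraction` are illustrative. It confirms that each row and the "Total" row are internally consistent and shows that the split is approximately 80/20 between training and validation in both cities.

```python
# Patch counts copied from Table 1: city -> (training, validation, total).
table1 = {
    "Kunming": (991, 254, 1245),
    "Nanning": (625, 151, 776),
    "Total": (1616, 405, 2021),
}


def train_fraction(train: int, val: int) -> float:
    """Share of patches assigned to the training set."""
    return train / (train + val)


# Each row's total should equal training + validation.
rows_consistent = all(t + v == tot for t, v, tot in table1.values())

# The "Total" row should equal the column sums of the two city rows.
city_rows = [table1["Kunming"], table1["Nanning"]]
column_sums = tuple(sum(col) for col in zip(*city_rows))
totals_consistent = column_sums == table1["Total"]

print("rows consistent:", rows_consistent)
print("totals consistent:", totals_consistent)
for city, (t, v, _) in table1.items():
    print(f"{city}: {100 * train_fraction(t, v):.1f}% training")
```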

Table 2. Comparison Experiment.

Model	mIoU (%)	OA (%)	FLOPs (G)	Params (M)
Segmenter	70.60	82.80	12.67	6.71
PSPNet	83.33	90.61	178.48	48.98
Segformer	83.71	91.09	6.38	3.72
Unet	82.78	90.54	202.85	29.06
Deeplab v3+	84.84	91.78	176.38	43.59
Vit	83.18	90.75	258.07	57.99
Swin transformer	83.27	91.07	236.15	59.94
Mask2former	85.33	91.99	235.57	46.82
TransUV	86.67	92.98	34.58	28.67
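One way to read the accuracy and complexity columns of Table 2 together is to ask which models are not dominated, that is, for which no other model is at least as good on mIoU, FLOPs, and parameter count and strictly better on one of them. The sketch below is an illustrative analysis under that assumption, not a procedure described by the authors; the helper `dominates` and the variable names are hypothetical.

```python
# Rows copied from Table 2: (model, mIoU %, OA %, FLOPs G, params M).
table2 = [
    ("Segmenter", 70.60, 82.80, 12.67, 6.71),
    ("PSPNet", 83.33, 90.61, 178.48, 48.98),
    ("Segformer", 83.71, 91.09, 6.38, 3.72),
    ("Unet", 82.78, 90.54, 202.85, 29.06),
    ("Deeplab v3+", 84.84, 91.78, 176.38, 43.59),
    ("Vit", 83.18, 90.75, 258.07, 57.99),
    ("Swin transformer", 83.27, 91.07, 236.15, 59.94),
    ("Mask2former", 85.33, 91.99, 235.57, 46.82),
    ("TransUV", 86.67, 92.98, 34.58, 28.67),
]


def dominates(a, b):
    """True if a is no worse than b on mIoU, FLOPs and params, and better on at least one."""
    no_worse = a[1] >= b[1] and a[3] <= b[3] and a[4] <= b[4]
    better = a[1] > b[1] or a[3] < b[3] or a[4] < b[4]
    return no_worse and better


# Keep the models that no other row dominates (the Pareto front).
pareto = [m[0] for m in table2 if not any(dominates(o, m) for o in table2)]
print("Non-dominated models:", pareto)

# Relative cost of the second most accurate model (Mask2former) versus TransUV.
by_name = {row[0]: row for row in table2}
flops_ratio = by_name["Mask2former"][3] / by_name["TransUV"][3]
print(f"Mask2former / TransUV FLOPs ratio: {flops_ratio:.2f}")
```

With the values in Table 2, only Segformer (the lightest model) and TransUV (the most accurate) remain on the front, which is consistent with the efficiency and accuracy balance claimed in the Highlights.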

Table 3. Ablation Experiment.

Model	mIoU (%)	OA (%)	F1-Score (%)	Recall (%)	Precision (%)
TransNeXt	84.13	91.52	90.24	90.48	90.00
+SegUV	85.33	92.21	91.04	90.84	91.24
+FCN	86.15	92.67	91.60	91.69	91.50
+MFEM	86.30	92.79	91.57	89.86	93.33
+AAFM	86.67	92.98	91.89	91.32	92.46
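The F1-Score column of Table 3 can be sanity-checked against the Precision and Recall columns. The sketch below assumes the standard harmonic-mean definition, F1 = 2PR / (P + R), applied directly to the reported percentages; how the authors average the metrics (for example, per class before averaging) is not restated in this table, so agreement is only expected up to rounding. The names `table3` and `f1_from` are illustrative.

```python
# Rows copied from Table 3: (configuration, reported F1, recall, precision), all in %.
table3 = [
    ("TransNeXt", 90.24, 90.48, 90.00),
    ("+SegUV", 91.04, 90.84, 91.24),
    ("+FCN", 91.60, 91.69, 91.50),
    ("+MFEM", 91.57, 89.86, 93.33),
    ("+AAFM", 91.89, 91.32, 92.46),
]


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


max_gap = 0.0
for name, reported_f1, recall, precision in table3:
    recomputed = f1_from(precision, recall)
    gap = abs(recomputed - reported_f1)
    max_gap = max(max_gap, gap)
    print(f"{name:10s} reported={reported_f1:.2f} recomputed={recomputed:.2f} gap={gap:.3f}")

print(f"largest gap: {max_gap:.3f} percentage points")
```

The recomputed values stay within about 0.01 percentage points of the reported ones. The check also makes the +MFEM row easier to interpret: its precision rises to 93.33% while recall drops to 89.86%, so its F1 is nearly unchanged relative to +FCN.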

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Lin, X.; Wang, Y.; Zhou, Y.; Wang, G.; Chen, S. TransUV: A TransNeXt-Based Model with Multi-Scale and Attention Fusion for Fine-Grained Urban Village Extraction. Remote Sens. 2026, 18, 223. https://doi.org/10.3390/rs18020223
