1. Introduction
Urban villages (UVs) typically refer to areas within cities characterized by inadequate planning, underdeveloped infrastructure, poor sanitary conditions, and high population density [
1,
2,
3]. These areas are historical outcomes of the long-term evolution of the institutional rural-urban dichotomy and represent a concentration of the contradictions between land systems and urban expansion [
4,
5,
6]. While UVs provide low-cost living spaces for migrant populations, they are also associated with typical “urban maladies” such as excessive building density (>70%) and insufficient green space per capita (<1 m²) [
7]. Due to the generally inadequate infrastructure (particularly drainage systems) and weak environmental management, UVs often become hotspots for black and odorous water bodies [
8].
The United Nations Sustainable Development Goal target 11.1 (SDG 11.1) explicitly calls for the following: “By 2030, ensure access for all to adequate, safe, and affordable housing and basic services and upgrade slums.” Consequently, the precise identification of marginalized urban spaces, including UVs, is of critical importance. The identified UVs can also provide key auxiliary spatial information for the screening and management of urban black and odorous water bodies. Driven by critical urban challenges, resource availability, and policy biases, urban studies have shown a predominant focus on large cities [
9]. China is home to over 100 large cities with permanent populations exceeding one million, yet publicly accessible spatial data remain extremely scarce, significantly hindering the implementation and monitoring of urban renewal policies [
10]. Traditional methods relying on manual surveys are not only inefficient and slow to update, but also struggle to meet the demands of rapid urban renewal and sustainable governance [
11].
With the rapid advancement of remote sensing technology, the fast and systematic acquisition of land surface information has become feasible [
12]. The technology provides essential technical support for natural resource surveys, infrastructure censuses, and assessments of urban sustainable development. It also offers a new technical pathway for the automatic identification of UVs. To improve the accuracy and efficiency of UV identification in complex urban environments, existing studies have mainly followed two technical routes: (1) Fusing multi-source data to enrich the original feature space and enhance the separability between UVs and their surrounding urban environment [
13,
14]; and (2) Optimizing identification methods based on a single remote sensing data source to reduce dependence on auxiliary data and improve method generalizability [
15,
16,
17].
Along the first route, in comparison with the more prevalent fusion schemes in other remote sensing applications (e.g., image-image fusion or image–geographic data fusion), UV identification tasks often additionally incorporate diverse urban spatial data such as street-view images, points of interest (POIs), and human mobility trajectories to represent their formation mechanisms and intensive human activities. Huang et al. [
18] integrated remote sensing imagery with street-view images to enhance the utilization of street-level information and achieved good performance in UV identification in Shenzhen; Xiao et al. [
19] combined remote sensing imagery with POI data to identify UVs in multiple cities, including Shenzhen, Fuzhou, and Beijing, and reported significant improvements in classification accuracy; Chen et al. [
10] fused remote sensing and mobility trajectory data to achieve fine-scale UV mapping with a spatial resolution of 2.5 m in Shenzhen. These studies demonstrate that multi-source data fusion strategies can effectively enhance the accuracy of UV identification under complex urban backgrounds. Fan et al. [
20] proposed the SemiUIS method, which integrates crowdsourced geospatial data with a semi-supervised learning strategy to achieve high-accuracy mapping of urban informal settlements using only a limited number of labeled samples. However, multi-source urban spatial data are costly to acquire and often have incomplete spatial coverage, and data quality may vary significantly across cities. As a result, the spatial generalization ability of the constructed models is limited, and it is particularly difficult to extend such methods to cities where multi-source data are scarce, which constrains their practical applicability [
21].
The second route focuses on identification methods based on a single remote sensing data source [
22]. Compared with multi-source fusion approaches, these methods rely less on auxiliary data and are thus more suitable for engineering applications. With the increasing diversity of remote sensing data and improvements in spatial resolution, related methods have continued to evolve. Early studies mainly relied on traditional machine learning and object-based image analysis techniques. Wurm et al. [
23] employed a random forest classifier within the Kennaugh element framework using synthetic aperture radar (SAR) imagery to extract UVs. This approach effectively captures multi-scale texture features while maintaining rotation invariance and relatively low computational complexity. Kit et al. [
24] used multi-temporal high-resolution satellite imagery and applied Canny and LSD edge detection algorithms to identify UVs and analyze their changes in Hyderabad, India. They successfully achieved high-accuracy extraction for 2003 and 2010 without additional corrections. D’Oleire-Oltmanns et al. [
25] adopted an object-based image analysis (OBIA) approach based on multi-temporal medium- to high-resolution optical imagery to effectively identify UVs in the Pearl River Delta region. This approach was widely used before the rise of deep learning. Subsequently, Huang et al. [
26] introduced a scene-classification-based framework on this basis to characterize the differences between UVs and other urban land-cover types from multiple dimensions. They achieved high identification accuracy across multi-temporal datasets in Shenzhen and Wuhan, which also verifies the transferability of the method.
As deep learning advances, particularly with the rapid maturation of semantic segmentation models, methods based on convolutional neural networks (CNNs) [
27] and Transformers have gradually become the mainstream for the automatic extraction of UVs [
28,
29,
30,
31,
32]. Feng et al. [
33] proposed a network structure that integrates multi-scale dilated convolutions with non-local feature extraction modules. This structure effectively handles the variations in the shape and scale of UVs and achieved an overall accuracy of 94.27% in experiments conducted in parts of Beijing. Gella et al. [
34] employed a Mask R-CNN combined with domain-adaptive transfer learning to explore the feasibility of transferring models trained on historical imagery to newer imagery and evaluated its ability to capture the spatiotemporal dynamics of UVs. Chai et al. [
35] proposed a multi-scale masked Transformer model (MaskUV), which can simultaneously capture local textures and global contextual information. It achieved an F1-score of 84.39% and an intersection over union (IoU) of 73.00% on their UVSet dataset, demonstrating strong performance in UV identification in the Pearl River Delta. Li et al. [
15] proposed the UV-Mamba model based on high-resolution remote sensing imagery. In this model, state space modeling is employed to effectively enhance the capture of long-range spatial context from a single data source, thereby improving the identification accuracy of UVs. Nevertheless, when applied to UV areas that are highly heterogeneous, structurally complex, and rapidly evolving, existing semantic segmentation methods still suffer from insufficient feature representation, local misclassification, and blurred boundaries. At the same time, many models have a large number of parameters, low inference efficiency, and limited cross-scene generalization ability, which pose challenges for their deployment and promotion in real-world operational settings [
4].
To address core challenges in UV extraction, such as high-density building distribution, blurred boundaries, and irregular shapes, this paper proposes a novel remote sensing image segmentation model named TransUV. The key innovations include a task-driven structural redesign: the TransNeXt backbone network is employed to jointly model both local and global information [
36], and is integrated with a Multi-level Feature Enhancement Module (MFEM) and an Advanced Attention Fusion Module (AAFM), achieving comprehensive optimization from low-level features to high-level semantics. The specific contributions are as follows:
(1) At the front end of the encoder, we propose the Multi-level Feature Enhancement Module (MFEM). This module introduces learnable Laplacian of Gaussian (LoG) and Gaussian filtering as priors to effectively enhance edge and texture responses, suppress noise, and thus significantly improve contour discriminability.
(2) In the decoder, we design the Advanced Attention Fusion Module (AAFM). This module integrates channel, spatial, and directional attention mechanisms and adaptively fuses them through dynamic weighting, thereby enhancing the representation of UV structures that exhibit strong directionality and irregular morphology.
(3) A training sample selection strategy based on a coverage threshold is proposed. This strategy filters out fragmented samples with low coverage to alleviate class imbalance and labeling noise, thereby improving training stability and deployment robustness.
Experimental results demonstrate that TransUV outperforms state-of-the-art methods in terms of boundary clarity, regional integrity, and cross-scene generalization capability, offering a more effective solution for the automated and fine-grained monitoring of UVs in complex environments.
3. TransUV Architecture
3.1. Overview
As shown in Figure 4, the proposed TransUV model adopts an encoder–decoder framework. In the encoder, the model uses TransNeXt [
36] as the backbone for feature extraction. Its aggregated attention mechanism can jointly model local details and global context, providing a feature representation that better adapts to the dense and irregular structure of UVs. Specifically, we introduce an MFEM at the front end of the encoder, which explicitly enhances edge and texture responses using techniques such as LoG filtering during the initial feature extraction stage, addressing the issue of boundary ambiguity in UVs. In the decoder, the proposed SegUV decoder progressively restores spatial details and outputs fine segmentation results, with the AAFM at its core. This module integrates channel, spatial, and direction-aware attention, enabling the adaptive fusion of multi-scale features, and is particularly suited to capturing the complex directional structures presented by internal alleys and building arrangements in UVs.
Overall, TransUV, through the collaborative design of the TransNeXt backbone, MFEM, and AAFM, forms a specialized processing pipeline tailored to the morphological features of UVs. This architecture not only overcomes the limitations of general Transformer models (such as ViT and Swin Transformer) in maintaining fine-grained features and modeling long-range dependencies across windows but also significantly improves the extraction accuracy of UV instances in complex urban scenarios.
3.2. Encoder
In this study, TransNeXt-Tiny is adopted as the encoder, which utilizes a bionic foveal vision mechanism to accomplish multi-scale feature extraction. The encoder employs a four-stage hierarchical backbone structure and an overlapping patch embedding mechanism similar to that of PVTv2. In the first to third stages, each TransNeXt Block is composed of stacked Aggregated Attention modules and Convolutional Gated Linear Units (Convolutional GLUs) to jointly model local and global features. Since the feature map resolution in the fourth stage is relatively low and traditional feature pooling modules cannot operate effectively, a Multi-Head Self-Attention (MHSA) module, consistent with that in PVTv2, is employed to maintain global information modeling capability.
To enhance the model’s capability in perceiving edges and textures in complex UV scenarios, this study introduces an MFEM into the encoder, as illustrated in Figure 5. The proposed module integrates mathematical operations such as Laplacian of Gaussian (LoG) filtering and Gaussian smoothing to effectively strengthen boundary responses in the input features while suppressing noise interference, thereby providing more discriminative feature representations for subsequent Transformer-based encoding. Key parameters within the MFEM (e.g., the scale factor K and standard deviation σ) were optimized based on the ablation experiments presented in Appendix A.
First, in the LoGFilter Block, a 7 × 7 convolution kernel combined with LoG filtering is employed to enhance the saliency of building boundaries. LoG filtering highlights gradient variations in edge regions through the composite operation of Gaussian smoothing and the second-order Laplacian derivative (k = 7, σ = 1), and its mathematical formulation is given as:
$$\mathrm{LoG}(x, y) = -\frac{1}{\pi\sigma^{4}}\left[1 - \frac{x^{2} + y^{2}}{2\sigma^{2}}\right] e^{-\frac{x^{2} + y^{2}}{2\sigma^{2}}}$$
where $(x, y)$ denote the pixel coordinates, and σ controls the scale of the Gaussian kernel.
Subsequently, the module conducts preliminary downsampling through depthwise separable convolution and introduces a Gaussian Block to suppress pseudo-textures and high-frequency noise. The Gaussian smoothing kernel (k = 9, σ = 0.5) is defined as:
$$G(x, y) = \frac{1}{2\pi\sigma^{2}}\, e^{-\frac{x^{2} + y^{2}}{2\sigma^{2}}}$$
Its weighted averaging mechanism eliminates local detail disturbances while preserving the overall structural information, thus providing clean feature inputs for subsequent hierarchical processing. By combining structured downsampling and max-pooling operations, the MFEM integrates multi-level features to compress the spatial resolution to one-quarter of the original input, generating a compact yet semantically rich low-dimensional representation that facilitates global contextual modeling in the TransNeXt backbone network.
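As a minimal sketch of the two filtering priors described above, the LoG and Gaussian kernels can be sampled on a discrete grid using the stated sizes and standard deviations (k = 7, σ = 1 and k = 9, σ = 0.5). The function names `log_kernel` and `gaussian_kernel` are hypothetical, and the zero-mean and unit-sum normalizations are assumptions rather than details given in the text. In the MFEM these filters are described as learnable, so such kernels would at most correspond to initial values.

```python
import math


def log_kernel(size=7, sigma=1.0):
    """Sample the Laplacian of Gaussian on a size x size grid centred at (0, 0)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            r2 = x * x + y * y
            value = (-1.0 / (math.pi * sigma ** 4)
                     * (1.0 - r2 / (2.0 * sigma ** 2))
                     * math.exp(-r2 / (2.0 * sigma ** 2)))
            row.append(value)
        kernel.append(row)
    # Assumption: subtract the mean so the kernel sums to zero and
    # perfectly flat regions produce no response.
    mean = sum(sum(row) for row in kernel) / (size * size)
    return [[v - mean for v in row] for row in kernel]


def gaussian_kernel(size=9, sigma=0.5):
    """Sample a Gaussian on a size x size grid and normalise it to sum to one."""
    half = size // 2
    raw = [[math.exp(-(x * x + y * y) / (2.0 * sigma ** 2))
            for x in range(-half, half + 1)]
           for y in range(-half, half + 1)]
    # Assumption: unit-sum normalisation, so smoothing preserves mean intensity.
    total = sum(sum(row) for row in raw)
    return [[v / total for v in row] for row in raw]


log_k = log_kernel()          # 7 x 7 edge-enhancing prior
gauss_k = gaussian_kernel()   # 9 x 9 smoothing prior
```

With σ = 0.5 almost all of the Gaussian weight falls on the centre pixel and its immediate neighbours, so the 9 × 9 kernel acts as a mild smoother.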
Within the TransNeXt backbone, the Aggregated Attention mechanism employs a dual-path design to emulate the multi-scale perception process of the biological visual system, as shown in
Figure 6. This mechanism integrates local and global attention within a unified framework, enabling each token to capture fine-grained information from neighboring features while also acquiring coarse-grained contextual information from downsampled global features. Through hierarchical aggregation, the model achieves an effective balance between global modeling and local detail representation, thereby enhancing the completeness and discriminability of feature representations. The underlying principles are as follows:
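For a given query position $(i, j)$, local features are extracted from the neighborhood using a k × k sliding window, while global context is captured via the downsampled feature map. Following the TransNeXt formulation [36], the attention computation is formulated as follows:
$$S_{(i,j)\sim\rho(i,j)} = \left(Q_{(i,j)} + QE\right) K_{\rho(i,j)}^{\top}, \qquad S_{(i,j)\sim\sigma(X)} = \left(Q_{(i,j)} + QE\right) K_{\sigma(X)}^{\top}$$
where $\rho(i,j)$ stands for the local window, $\sigma(X)$ denotes the globally pooled features, and $QE$ is the learnable Query Embedding.
The attention logits derived from two parallel processing pathways are concatenated and subsequently normalized, ensuring that fine-grained and coarse-grained features are assigned appropriate weights within a unified softmax operation:
$$A_{(i,j)} = \mathrm{softmax}\left(\tau \log N \cdot \mathrm{Concat}\left(S_{(i,j)\sim\rho(i,j)},\, S_{(i,j)\sim\sigma(X)}\right) + B_{(i,j)}\right)$$
where Concat denotes the concatenation operation, $\tau$ is a learnable temperature parameter, $N$ denotes the number of valid tokens, and $B_{(i,j)}$ represents the positional bias.
$$A_{(i,j)\sim\rho(i,j)},\; A_{(i,j)\sim\sigma(X)} = \mathrm{Split}\left(A_{(i,j)}\right)$$
$$\mathrm{AA}\left(X_{(i,j)}\right) = \left(A_{(i,j)\sim\rho(i,j)} + Q_{(i,j)} T\right) V_{\rho(i,j)} + A_{(i,j)\sim\sigma(X)} V_{\sigma(X)}$$
where $A_{(i,j)\sim\rho(i,j)}$, $A_{(i,j)\sim\sigma(X)}$ denote the local and global attention weights after splitting, respectively, $V$ denotes the corresponding values, and $T$ is a learnable token.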
This aggregated attention mechanism is crucial for enhancing the model’s ability to handle features at different spatial resolutions, and it is particularly effective for tasks requiring multi-scale detail processing.
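The unified softmax over the two pathways can be illustrated for a single query as follows. This is a simplified sketch: the function name `aggregated_attention_weights` is hypothetical, the single `temperature` argument stands in for the learnable temperature scaling, and the positional bias, query embedding, and learnable tokens are omitted.

```python
import math


def aggregated_attention_weights(local_logits, global_logits, temperature=1.0):
    """Concatenate local and global logits, apply one softmax, then split."""
    logits = [temperature * s for s in list(local_logits) + list(global_logits)]
    peak = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    k = len(local_logits)
    return weights[:k], weights[k:]         # local weights, global weights


# Three tokens from the sliding window and two pooled global tokens.
local_w, global_w = aggregated_attention_weights([2.0, 1.0, 0.5], [0.0, -1.0])
```

Because both groups share one normalization, a strong local match lowers the weight given to the pooled global tokens and vice versa, which is the competition between fine-grained and coarse-grained features that the text describes.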
3.3. Decoder
To address the challenges in UV remote sensing images, such as high-density building distributions, blurred boundaries, and occlusions, this study proposes an enhanced decoder structure—SegUV. While maintaining a lightweight design, this structure significantly improves feature representation and semantic segmentation accuracy through structural optimization.
To mitigate differences in channel dimensions and semantic levels across stages, a Channel Alignment Module is first introduced. It employs 1 × 1 convolutions to map multi-scale features to a unified dimension of 256. The core of the decoder is the AAFM, which aims to enhance feature learning by incorporating multiple attention mechanisms. As shown in
Figure 7, the module integrates various attention branches, including spatial attention, lightweight channel attention, and direction-aware attention, combined with a dynamic weight fusion strategy. This design effectively improves the model’s robustness and further enhances segmentation accuracy, particularly when handling complex spatial structures and textures. The detailed design is as follows:
(1) Spatial Attention Branch
The spatial attention branch captures spatial information from the input feature map through average pooling and max pooling operations. The outputs of these operations are concatenated and processed with a 7 × 7 convolution. This enables the network to capture both spatial context and frequency information. The computation is as follows:
$$F_{s} = \mathrm{Conv}_{7\times 7}\left(\mathrm{Concat}\left(F_{avg},\, F_{max}\right)\right)$$
where $F_{avg}$ and $F_{max}$ represent the average pooling and max pooling of the input feature map $F$, respectively.
In this way, the spatial attention learns the correlations of the global spatial structure.
(2) Lightweight Channel Attention Branch
This branch implements channel attention on the input feature map using two 1 × 1 convolutional layers. First, one convolution reduces the channel dimension, and then another convolution restores it, generating an attention map to weight the input features. The specific operations are as follows:
$$F_{c} = \sigma\left(\mathrm{Conv}_{1\times 1}\left(\mathrm{Conv}_{1\times 1}(F)\right)\right) \odot F$$
where $\sigma$ denotes the Sigmoid activation function, and $F$ is the input feature map.
Through this branch, the network can learn the importance of each channel and adaptively enhance or suppress specific channel features.
(3) Direction-Aware Attention Branch
The direction-aware branch processes the input feature map through multiple convolutional layers, each of which extracts texture responses along a different direction. By leveraging these direction-specific convolutional outputs, the network captures directional information in the image. All directional features are then fused via concatenation followed by a 1 × 1 convolution to produce the final direction-aware attention map. The computation process of this branch is as follows:
$$F_{d} = \mathrm{Conv}_{1\times 1}\left(\mathrm{Concat}\left(D_{1}, D_{2}, \ldots, D_{n}\right)\right)$$
where $D_{i}$ denotes the outputs of convolutions along different directions, and $n$ represents the number of directions.
(4) Dynamic Weight Fusion Mechanism
To adaptively fuse the outputs of different attention branches, a lightweight weight generator is introduced. This module generates three weight values through adaptive pooling and convolution operations, which correspond to the weighted fusion of spatial attention, lightweight channel attention, and direction-aware attention. The computation process is as follows:
$$[w_{1}, w_{2}, w_{3}] = \mathrm{Conv}\left(\mathrm{AdaptivePool}(F)\right)$$
where $w_{1}$, $w_{2}$, and $w_{3}$ are used to weight the outputs of the different attention branches.
Using the computed weights, the module performs a weighted fusion of the outputs from the attention branches to obtain the final fused feature map. With $F_{s}$, $F_{c}$, and $F_{d}$ denoting the outputs of the spatial, channel, and direction-aware branches, the fusion operation is defined as:
$$F_{fuse} = w_{1} F_{s} + w_{2} F_{c} + w_{3} F_{d}$$
Finally, a convolutional layer processes the fused features to produce the transformed feature map. Subsequently, this map is combined with the original input through a residual connection to further enhance feature representation:
$$\mathrm{Output} = F + \mathrm{GN}\left(\mathrm{Conv}\left(F_{fuse}\right)\right)$$
where GN denotes the Group Normalization operation. The final Output integrates both the attention-enhanced features and the residual information.
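The dynamic weighting and residual steps can be sketched on flattened feature vectors as below. This is an illustration under stated assumptions, not the module itself: the names `softmax` and `fuse_branches` are hypothetical, normalizing the three weights with a softmax is an assumption (the text does not state how the weights are normalized), and the final convolution and Group Normalization are omitted.

```python
import math


def softmax(values):
    """Numerically stable softmax over a short list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_branches(x, spatial, channel, directional, raw_weights):
    """Weight the three branch outputs, sum them, and add the input back."""
    # Assumption: the three generated weights are softmax-normalised.
    w_s, w_c, w_d = softmax(raw_weights)
    fused = [w_s * s + w_c * c + w_d * d
             for s, c, d in zip(spatial, channel, directional)]
    # Residual connection (the convolution and GroupNorm steps are left out).
    return [xi + fi for xi, fi in zip(x, fused)]


out = fuse_branches(
    x=[1.0, 2.0],
    spatial=[3.0, 0.0],
    channel=[0.0, 3.0],
    directional=[3.0, 3.0],
    raw_weights=[0.0, 0.0, 0.0],   # equal scores give equal weights of 1/3
)
```

With equal raw scores the fused map is the plain average of the three branches; unequal scores shift the result toward whichever branch the weight generator favours for a given input.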
3.4. Accuracy Evaluation
To systematically evaluate the performance of the proposed model in UV extraction tasks, this study conducts both accuracy evaluation and ablation experiments. In the accuracy evaluation, commonly used metrics such as Overall Accuracy (OA), mean Intersection over Union (mIoU), Precision, Recall, and F1 Score are adopted as the primary evaluation criteria.
OA reflects the overall classification capability of the model; mIoU provides a more accurate measure of segmentation performance for the UVs class under imbalanced class distributions; Precision evaluates the reliability of the model’s predictions for UVs regions; Recall indicates the model’s performance in terms of detection completeness; and F1 Score, as the harmonic mean of Precision and Recall, is used to comprehensively assess the balance between accuracy and completeness.
$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{mIoU} = \frac{1}{k}\sum_{i=1}^{k} \frac{TP_{i}}{TP_{i} + FP_{i} + FN_{i}}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where TP, FP, FN, and TN denote true positives, false positives, false negatives, and true negatives of the predictions, respectively, $i$ represents the class, and $k$ is the number of classes.
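For a two-class setting (UV versus background), the metrics above can be computed from pixel counts as in the following sketch. The function names and the example counts are hypothetical and serve only to show the arithmetic; the code assumes that no denominator is zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    return 2.0 * precision * recall / (precision + recall)


def binary_metrics(tp, fp, fn, tn):
    """OA, mIoU, Precision, Recall, and F1 for a UV / background problem."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    iou_uv = tp / (tp + fp + fn)          # IoU of the UV class
    iou_bg = tn / (tn + fn + fp)          # IoU of the background class
    return {
        "OA": (tp + tn) / (tp + fp + fn + tn),
        "mIoU": (iou_uv + iou_bg) / 2.0,
        "Precision": precision,
        "Recall": recall,
        "F1": f1_score(precision, recall),
    }


# Hypothetical pixel counts, chosen only to illustrate the calculation.
metrics = binary_metrics(tp=80, fp=10, fn=20, tn=90)
```

The same harmonic-mean relation gives a quick consistency check on reported values: applying `f1_score` to the 91.32% recall and the 92.46% value reported alongside it in Section 4.3 (treated here as precision) yields approximately 91.89%, which matches the F1 Score reported there.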
4. Results and Analysis
4.1. Experimental Environment and Implementation Details
All experiments were implemented in the PyTorch 2.0 framework, with training and inference performed on an NVIDIA GeForce RTX 4070 GPU (8 GB memory). Training used mini-batch optimization with a batch size of 2 and the AdamW optimizer, with an initial learning rate of 0.0006. The total number of training epochs was 150. A combination of linear warm-up and polynomial decay (PolyLR, power = 1.0, eta_min = 0.0) was used for learning rate scheduling. Specifically, the first 5 epochs used a linear warm-up strategy (start_factor = 1 × 10⁻⁶) to improve training stability at the early stage.
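The learning rate schedule can be sketched as a plain function of the epoch index using the stated hyperparameters. The function name `learning_rate` is hypothetical, and two details are assumptions: that the rate is updated once per epoch, and that the polynomial decay starts when the warm-up ends and reaches eta_min at epoch 150.

```python
def learning_rate(epoch, base_lr=6e-4, total_epochs=150, warmup_epochs=5,
                  start_factor=1e-6, power=1.0, eta_min=0.0):
    """Linear warm-up followed by polynomial decay, evaluated per epoch."""
    if epoch < warmup_epochs:
        # Ramp the scaling factor linearly from start_factor up to 1.
        factor = start_factor + (1.0 - start_factor) * epoch / warmup_epochs
        return base_lr * factor
    # Assumption: decay begins right after warm-up and ends at total_epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return eta_min + (base_lr - eta_min) * (1.0 - progress) ** power


schedule = [learning_rate(e) for e in range(151)]
```

With power = 1.0 the decay is a straight line from 0.0006 at the end of the warm-up down to 0 at the final epoch.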
To reduce the impact of randomness, all experiments were initialized with random parameters and repeated three times with random seeds set to 42, 2023, and 3407, respectively. After each training run, the main metrics (mIoU, OA, Precision, Recall, and F1-score) were computed, and the final reported results are the average performance over the three runs. All models were trained under the same data split, input size, data augmentation strategies, and hyperparameters to ensure comparability and fairness.
4.2. Comparison Experiments
To comprehensively evaluate the performance of the proposed TransUV model, several representative semantic segmentation networks were selected as baselines. These include CNN-based models (U-Net, PSPNet, DeepLab v3+), Transformer-based models (ViT, Swin Transformer, Segmenter), and hybrid architectures (SegFormer, Mask2Former). All models were trained from scratch without using any pre-trained weights. The input image size was uniformly set to 512 × 512 pixels, and the dataset was split into training and validation sets with a ratio of 8:2.
As shown in Table 2, TransUV achieves the best performance on both key metrics, with an mIoU of 86.67% and an OA of 92.98%, clearly outperforming advanced models such as Mask2Former and DeepLab v3+. This indicates that when dealing with UVs characterized by high building density and ambiguous boundaries, TransUV possesses stronger feature representation and spatial discrimination capability. In terms of efficiency, TransUV has 28.67 M parameters and 34.58 G FLOPs. Although its computational cost is slightly higher than that of the lightweight SegFormer, it is substantially lower than that of large models such as ViT and Swin Transformer, thereby achieving a favorable balance between accuracy and efficiency. These results demonstrate that TransUV not only achieves the highest accuracy in these experiments but also has strong potential for practical applications.
To further analyze the trade-off between performance and complexity, a FLOPs–mIoU bubble chart (
Figure 8) was plotted based on the results in
Table 2. In this chart, the horizontal axis represents the computational cost FLOPs (G), the vertical axis denotes segmentation accuracy mIoU (%), and the bubble area corresponds to the number of parameters. As shown in
Figure 8, Segmenter has relatively small FLOPs and parameter size, yet it exhibits markedly low accuracy. PSPNet, DeepLab v3+, ViT, Swin Transformer, and Mask2Former are clustered on the right side of the chart, with large FLOPs and parameter scales. Among them, ViT and Swin Transformer have the highest computational and parameter complexity and can be regarded as typical “high-complexity–medium-to-high-accuracy” models. Meanwhile, SegFormer achieves relatively high mIoU with extremely low FLOPs and parameter size, making it stand out among lightweight models.
Compared with these baselines, TransUV is located in the upper-left region of the bubble chart. It achieves the highest mIoU while maintaining a relatively low computational cost and a moderate number of parameters. More specifically, among models with mIoU close to or exceeding 84%, the FLOPs of TransUV are much lower than those of PSPNet, DeepLab v3+, ViT, Swin Transformer, and Mask2Former, and are only slightly higher than those of the ultra-lightweight SegFormer, yet TransUV yields clearly superior accuracy. This suggests that TransUV lies close to the Pareto frontier in the two-dimensional “accuracy–efficiency” space, combining high accuracy with low computational cost and thus providing a more cost-effective option for deploying fine-scale UV extraction models in real-world operational systems.
On the basis of quantitative evaluation, this study further analyzes the performance of different models in UV segmentation through visual comparisons. As shown in
Figure 9, Segmenter (
Figure 9c) is nearly unable to effectively extract UV regions, indicating its limited adaptability to highly fragmented and spatially heterogeneous complex scenes. Traditional CNN-based methods, including PSPNet (
Figure 9d), U-Net (
Figure 9f), and DeepLab v3+ (
Figure 9g), generally have a large number of holes when segmenting dense UV areas, accompanied by poor boundary continuity and completeness. This limitation fundamentally originates from the local receptive field of convolutional operations [
44,
45]. Although convolution is effective in capturing local texture features, it has difficulty establishing a coherent long-range understanding of entire UVs as a single, unified entity. In particular, U-Net (
Figure 9f) shows obvious misclassification in the fourth sample, where a large number of regular urban buildings are incorrectly identified as UVs, further demonstrating its limited ability to represent the complex morphological characteristics of UVs.
In contrast, Transformer-based networks and hybrid architectures achieve better internal integrity in the segmentation results, with significantly reduced fragmentation. Only ViT (
Figure 9h) still shows a small number of holes. This improvement mainly benefits from the long-range semantic modeling capability enabled by the self-attention mechanism in Transformers. However, despite their clear advantages in global consistency, Transformer-based models still exhibit certain ambiguities in UV boundary delineation and occasionally misclassify non-UV regions near object edges. Notably, TransUV (
Figure 9k) delivers the best overall performance among all compared methods, generating more accurate, continuous, and complete UV boundaries while effectively suppressing misclassification around building edges. These results indicate that TransUV (
Figure 9k) achieves a more favorable balance between global contextual modeling and fine-grained boundary perception, making it particularly suitable for fine-grained UV extraction in complex high-resolution remote sensing scenes.
4.3. Ablation Experiment
To verify the contribution of each core module in the TransUV model, we systematically conducted ablation experiments on the UV datasets. The results are shown in
Table 3. The experiment used TransNeXt as the baseline model, which had an mIoU of 84.13% and an overall accuracy (OA) of 91.52%. This demonstrates that the baseline model has a strong initial feature extraction ability.
After replacing the decoder of the baseline model with the SegUV decoder, the mIoU increased to 85.33%, the OA increased to 92.21%, and the F1 Score reached 91.04%. This change indicates that this structure effectively improves the feature reconstruction and semantic information fusion capabilities during the decoding process. After further adding the FCN auxiliary head module, all indicators showed a stable upward trend. The mIoU increased to 86.15%, the OA reached 92.67%, the F1 Score increased to 91.60%, and the Recall also increased to 91.69%. This indicates that multi-level semantic aggregation helps enhance the model’s ability to discriminate UV areas. Introducing the MFEM at the encoding end further improved the mIoU to 86.30% and the OA to 92.79%, and the Precision increased to 93.33%. This indicates that the module has significant effects in complex boundary extraction and antialiasing processing. Finally, the AAFM is embedded in the decoder, and the overall performance of the model reaches its optimal level. The mIoU is improved to 86.67%, the OA is 92.98%, and the F1 Score increases to 91.89%, while a high Recall (91.32%) and Precision (92.46%) are maintained. This shows that the multi-branch attention mechanism plays a crucial role in fusing multi-scale context and direction-aware features. The differentiated trends in the improvement of various metrics demonstrate that the MFEM and AAFM play clear and complementary roles. The MFEM significantly enhances precision by strengthening feature discriminability, while the AAFM effectively restores recall through global contextual fusion while maintaining high precision. The synergy between the two ultimately achieves a comprehensive improvement in the overall performance of the model.
To further investigate the influence of the MFEM and AAFM on the model’s attention distribution, Grad-CAM was employed to generate feature heatmaps. As shown in
Figure 10, we compared the feature activation patterns of the base model (Base), the model with only the MFEM (+MFEM), and the model incorporating both the MFEM and AAFM (+MFEM&AAFM) across different network stages (Stage-1 to Stage-4) as well as at the final output (Head).
The results indicate that, compared with the base model, the introduction of the MFEM significantly enhances the model’s ability to focus on key regions. In the early stages (Stage-1 and Stage-2), feature activations already exhibit a more distinct initial aggregation trend than those of the base model. As the network depth increases (Stage-3 and Stage-4), the model’s attention becomes more clearly and stably focused on the UV areas. In the final Head output, high-response regions are highly localized within the internal structures of UVs, thus significantly enhancing the recognition accuracy of the target regions. Building on the integration of the MFEM, the further incorporation of the AAFM effectively enhances the capture of UV boundaries. Although its activation patterns in the shallow stages (Stage-1 and Stage-2) are similar to those of the +MFEM model, more extensive and coherent response regions are observed in the deeper stages (Stage-3 and Stage-4). Most importantly, in the final Head heatmap, the activation intensity along the contours of UVs is significantly enhanced. This indicates that the AAFM optimizes feature fusion and attention focusing, particularly by increasing the model’s sensitivity to discriminative boundary features.
4.4. Inference over the Study Areas
In this study, only a small portion of each study area was used for training and validation. The best-performing model weights obtained from this phase were then transferred and applied to the entire study areas to generate complete prediction maps of UVs in the central urban districts of Kunming and Nanning.
In terms of overall extraction performance, the predicted urban-village patches are highly consistent in spatial location and geometric shape with residential areas in the high-resolution imagery. These residential areas are characterized by high building density, low building height, and predominantly self-built houses. The locally enlarged views I–IV in Figure 11 demonstrate that the proposed method can accurately delineate the irregular boundaries of UVs as well as their fine internal road networks, and exhibits strong capability in identifying both large, contiguous clusters and scattered patches embedded within the urban built-up area. At the macro scale, the spatial distribution of the predicted results is largely consistent with the typical urban-village areas recognized in previous studies and local planning practice. It should also be noted that, due to similarities in building morphology, height, and roof materials, a few low-rise, high-greening villa compounds located on the urban fringes of Kunming and Nanning are misclassified as urban-village patches. Such misdetections exist to a certain extent; however, the corresponding patches account for only a small proportion of the total area and number of predictions, and thus have limited impact on the interpretation of the overall spatial pattern and subsequent statistical analyses.
In the central urban area of Kunming, urban-village patches are mainly distributed within the continuous built-up areas of Wuhua District, Panlong District, Guandu District, and the northern part of Chenggong District, forming a belt-like pattern along the north–south urban development axis. The corridor formed by Wuhua–Panlong–Guandu constitutes the core urban area, within which the purple urban-village patches are highly concentrated and partially merged into large contiguous clusters, indicating a high degree of agglomeration of UVs in the city center. Further south into Chenggong District, UVs still appear sporadically along the main development corridor, but the patches decrease markedly within the newly developed clusters at the southernmost part of the area. In contrast, almost no urban-village patches are found in the extensive mountainous and water-covered areas in the western and southern parts of Xishan District; only a few patches appear at the edges of the built-up area adjacent to the main urban core. Overall, the spatial pattern of UVs in Kunming can be summarized as “high concentration in the center, belt-like extension along the north–south axis, and scarcity in mountainous and water areas”.
In the central urban area of Nanning, the spatial distribution of UVs is highly consistent with the valley-type urban morphology. Almost all predicted patches are embedded within the built-up areas along both banks of the Yongjiang River, forming a pronounced belt-shaped cluster pattern along the river. UVs are most concentrated in Xixiangtang District and Jiangnan District, where large, continuous patches are distributed along the northern and southern banks of the Yongjiang, respectively. Numerous urban-village patches are also present in the northern part of Xingning District and the northern part of Liangqing District. These patches, together with those in Xixiangtang and Jiangnan, constitute a high-density ring within the main urban area. In contrast, because of the extensive green spaces and mountainous terrain in the central part of Qingxiu District, UVs there are relatively sparse and occur only sporadically along the built-up edges near the river valley and district boundaries. Yongning District has the fewest urban-village patches overall. Small patches are observed only along the Yongjiang River and at the northern fringe adjacent to the main urban area, while the southeastern part of the district, dominated by non-built-up land, shows almost no urban-village predictions.
By comparing the two cities, it can be observed that UVs in Kunming are more prominently distributed along the north–south urban development axis, whereas those in Nanning are mainly organized along the east–west river valley, forming a clear “river-oriented belt with clusters on both banks” structure. Despite differences in topography and development axes, both cities exhibit a common pattern: UVs are primarily concentrated within and around the existing built-up areas of the old city, and their occurrence gradually weakens towards newly developed urban districts and non-construction areas such as mountains and water bodies. This consistent ring-like distribution further corroborates the spatial plausibility and reliability of the model’s extraction results.
4.5. Case Study: Multi-Temporal Prediction of UV Redevelopment
This section takes Liede Village (113.327°–113.339°E, 23.112°–23.119°N) in Tianhe District, Guangzhou City, as a case study to evaluate the ability of the TransUV model to capture the dynamic changes of UVs from multi-temporal remote sensing images. As the first systematic old city renovation project in Guangzhou (Guangzhou Municipal Government, 2009), the renovation of Liede Village is representative: demolition work was initiated in 2007, the renovation was largely completed by the end of 2009, and villagers moved into their new homes before the Spring Festival in 2010. We selected high-resolution remote sensing images from three time points: 2007 (before demolition), 2009 (during demolition), and 2017 (after renovation). The TransUV model was then applied to identify the UVs in the study area at each time point and to analyze their changes.
The experimental results show that the TransUV model can effectively identify the UV areas of Liede Village in different periods (Figure 12). In the 2007 image, the model identified a large number of densely built UV areas, indicating that the area had not yet undergone renovation. The predicted masks exhibit a continuous and complete distribution, indicating that the model can obtain UV extraction results with good internal consistency and integrity in high-density building environments. By 2009, with the progress of the demolition work, the red areas identified by the model had shrunk markedly, and their edges clearly delineated the boundary between the remaining UV buildings and the exposed ground surface. The internal mask remained coherent, indicating that the model could distinguish mixed land types. By 2017, the original UVs had been upgraded to modern residential communities. In the model output, the characteristics of UVs had completely disappeared, replaced by regularly arranged light-gray new building clusters and green belts. The predicted masks are highly consistent with the actual land features, clearly presenting the reconstruction of the spatial pattern after urban renewal.
These results indicate that the TransUV model can accurately track changes in the extent of UVs during the renovation process and that the recognition results match the actual land features well, verifying the applicability of the model to multi-temporal UV recognition tasks.
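The multi-temporal comparison can be made quantitative by converting each predicted mask into a UV area and expressing it relative to the earliest date. The sketch below is illustrative only: the function names, the pixel size, and the toy masks are assumptions and do not reproduce the Liede Village results.

```python
import numpy as np


def uv_area_m2(mask, pixel_size_m):
    """Area of UV pixels (nonzero entries) in square metres."""
    return np.count_nonzero(mask) * pixel_size_m ** 2


def area_change(masks_by_year, pixel_size_m):
    """Return {year: (area_m2, relative change vs. the earliest year)}."""
    years = sorted(masks_by_year)
    areas = {y: uv_area_m2(masks_by_year[y], pixel_size_m) for y in years}
    base = areas[years[0]]
    return {
        y: (areas[y], (areas[y] - base) / base if base > 0 else float("nan"))
        for y in years
    }


# Toy 10 x 10 masks, made up for illustration only (not the paper's outputs).
toy = {
    2007: np.ones((10, 10), dtype=bool),                             # fully UV
    2009: np.pad(np.ones((4, 10), dtype=bool), ((0, 6), (0, 0))),    # partly demolished
    2017: np.zeros((10, 10), dtype=bool),                            # no UV left
}
summary = area_change(toy, pixel_size_m=0.5)  # 0.5 m pixel size is an assumption
for year, (area, rel) in summary.items():
    print(f"{year}: {area:.1f} m^2, change vs. first date {rel:+.0%}")
```

The masks must be co-registered and share the same pixel size for the areas to be comparable across dates.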
5. Discussion
5.1. Rationality of the Area Threshold Strategy
In this study, an area threshold strategy based on the UV coverage ratio was introduced in the label screening stage to enhance the quality of the label set and reduce training cost. Specifically, for each image patch, we calculated the UV coverage ratio r, and samples with r < 20% were directly discarded. Figure 13 shows the distribution of r for all labeled patches. The violin–box plot indicates that the 25th, 50th, and 75th percentiles are 7.5%, 19.4%, and 38.6%, respectively, and most samples fall within the 0–40% range. The 20% threshold is slightly higher than the empirical median of 19.4% and can therefore be regarded as a data-driven “median split” that statistically separates samples into “weak-coverage” and “large-area coverage” groups.
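The screening rule can be sketched in a few lines. This is a minimal illustration assuming each label patch is a binary array in which nonzero pixels mark UV; the function names are hypothetical, and only the 20% threshold and the three percentiles come from the text above.

```python
import numpy as np


def coverage_ratio(label_mask):
    """Fraction r of pixels labeled as UV (nonzero) in one label patch."""
    mask = np.asarray(label_mask)
    return np.count_nonzero(mask) / mask.size


def filter_by_coverage(label_masks, threshold=0.20):
    """Return indices of patches with r >= threshold, plus all ratios.

    Patches with r < threshold are discarded, matching the rule in the text.
    """
    ratios = np.array([coverage_ratio(m) for m in label_masks])
    keep = np.flatnonzero(ratios >= threshold)
    return keep, ratios


def coverage_quartiles(ratios):
    """25th, 50th, and 75th percentiles of the coverage ratios."""
    return np.percentile(ratios, [25, 50, 75])
```

Called on the full label set, `coverage_quartiles` would give the three values read off the violin–box plot, and `keep` would index the patches passed on to training and validation.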
The examples on the right side of Figure 13 further provide an intuitive justification for this division. When r ≈ 7.5%, UV areas appear only as scattered fragments, are highly sensitive to slice position and minor annotation perturbations, and are still dominated by background land-cover types as a whole. In contrast, when r ≈ 38.6%, UV areas form connected patches with more stable texture and morphological characteristics, and the discriminative information is more concentrated. If these two types of samples were mixed for training, the few fragmented UV targets would easily be “overwhelmed” by the background in the feature space, thereby amplifying the effects of label noise and class imbalance.
In terms of data scale, 4143 labeled patches were initially available. After applying the 20% area threshold, 2021 patches (approximately 48.8% of all samples) were retained for model training and validation, while the remaining 2122 low-coverage patches (about 51.2%) were removed. On the one hand, this substantially reduces redundant and unstable samples and focuses training on more representative UV patterns, which is beneficial for model convergence and generalization. On the other hand, the remaining dataset is still sufficiently large, so the learning capacity of the model is not compromised by a lack of samples. It should be noted that this strategy unavoidably sacrifices the recall of very small and isolated UV fragments. However, given that the main objective of this study is to identify contiguous UV areas, this trade-off, which prioritizes “regional integrity”, is reasonable and interpretable.
5.2. Effectiveness and Limitations of the Proposed Method
The proposed TransUV model achieves an mIoU of 86.67% and an OA of 92.98% in the UV recognition task. Its overall performance surpasses that of multiple mainstream baseline methods, demonstrating the effectiveness of the proposed framework in complex urban environments. On the encoder side, the model incorporates the MFEM, and on the decoder side, it employs the AAFM. These components jointly enhance the representation of texture and boundary information across multiple scales and, via attention mechanisms, adaptively select and fuse key features, thereby alleviating typical problems in UV segmentation such as fuzzy boundaries, missed small objects, and class confusion.
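For reference, the two headline metrics can be computed from a pixel-level confusion matrix. The sketch below uses the standard definitions of overall accuracy and mean intersection over union and assumes a two-class setting (background and UV); it is not taken from the authors' evaluation code.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes=2):
    """Rows are reference classes, columns are predicted classes."""
    t = np.asarray(y_true).ravel().astype(np.int64)
    p = np.asarray(y_pred).ravel().astype(np.int64)
    counts = np.bincount(num_classes * t + p, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def overall_accuracy(cm):
    """OA: correctly classified pixels divided by all pixels."""
    return np.trace(cm) / cm.sum()


def mean_iou(cm):
    """mIoU: per-class TP / (TP + FP + FN), averaged over classes present."""
    tp = np.diag(cm).astype(np.float64)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    valid = union > 0  # skip classes absent from both reference and prediction
    return float(np.mean(tp[valid] / union[valid]))
```

Because mIoU penalizes both false positives and false negatives for each class, it is typically lower than OA on the same predictions, as is the case for the values reported here.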
Visual comparisons further substantiate these quantitative findings. TransUV performs well in boundary completeness and regional coherence, even in challenging scenarios with fragmented morphologies or complex materials. This capability stems from a more fundamental architectural advantage. Compared with classical CNN architectures such as U-Net and PSPNet, TransUV integrates global contextual modeling, inherited from the TransNeXt backbone, which enables a more holistic understanding of the scene. This overcomes the inherent limitation of CNNs, whose local receptive fields often lead to fragmented predictions in highly heterogeneous areas like UVs. Conversely, compared with pure Transformer-based models (e.g., ViT, Swin Transformer), the carefully designed MFEM and AAFM ensure that fine-grained local details are preserved and enhanced throughout the network. This results in a better balance between capturing long-range dependencies and extracting precise local textures, yielding sharper detail restoration and competitive inference efficiency.
Nevertheless, this study still has certain limitations. We observe that, in some cases, TransUV tends to misclassify low-rise, highly vegetated residential villa areas as UVs. This is mainly due to the high morphological similarity between these two land-use types in terms of building density and roof materials. Such misclassifications typically occur in suburban villa areas that share visual characteristics with UVs, including relatively dense building layouts and irregular texture patterns. From a mechanistic perspective, the low-rise and high-density building configurations of villa areas are spatially similar to those of UVs, while commonly used roof materials, such as clay tiles and cement tiles, exhibit overlapping spectral reflectance characteristics with the blue color-coated steel roofs widely found in UVs. In addition, the fragmented texture patterns induced by vegetation shadows in highly vegetated villa areas further exacerbate feature confusion during model discrimination.
To mitigate the aforementioned misclassification issues, future research could further improve the model’s discriminative capability through multi-source data fusion and temporal feature modeling. One promising direction is the integration of height-related data, such as LiDAR or Digital Surface Models (DSM), to explicitly exploit vertical structural differences between land-use types. Villa areas generally follow relatively strict height regulations, whereas UVs often exhibit irregular vertical extensions and variations in building height. Moreover, incorporating multi-temporal remote sensing imagery to capture the temporal evolution of different land-use types is also worthy of further exploration. UVs typically exhibit more dynamic evolution patterns, such as informal expansion or frequent modifications, while villa areas tend to remain relatively stable over longer time scales. These multi-source and multi-temporal data can provide critical contextual information beyond the visual features captured by single-date high-resolution imagery, thereby further enhancing the reliability and cross-scene generalization ability of UV identification.
6. Conclusions
In order to enhance the accuracy of UV extraction from high-resolution remote sensing imagery, this paper proposes TransUV, a multi-scale attention-fusion segmentation framework built upon TransNeXt. During data preprocessing, an area threshold strategy based on UV coverage is introduced to retain high-quality samples and alleviate label noise and class imbalance. In model design, MFEM is incorporated at the encoder front end to enhance boundary and texture cues, while AAFM is embedded in a lightweight decoder (SegUV) to enable adaptive multi-scale feature fusion and semantic refinement.
Experimental results show that, under identical settings, TransUV consistently outperforms representative CNN-based and Transformer-based baselines on key metrics (e.g., mIoU and recall). It produces more complete boundaries and better detail preservation in qualitative comparisons. These results provide evidence for the effectiveness of the proposed data screening strategy and task-oriented architectural design for UV extraction.
Overall, TransUV offers a feasible technical solution for the fine-grained recognition of complex UV land-cover patterns in high-resolution imagery. Future work will further evaluate its robustness and generalizability across broader geographic regions and heterogeneous data sources. It will also explore the integration of complementary information (e.g., height or multi-temporal observations) to reduce confusion among morphologically similar land-use types. The proposed approach provides a methodological reference for large-scale UV monitoring and urban renewal analysis.