Article

Accurate Extraction of Rural Residential Buildings in Alpine Mountainous Areas by Combining Shadow Processing with FF-SwinT

1 Institute of International Rivers and Eco-Security, Yunnan University, Kunming 650500, China
2 Yantai Center of Coastal Zone Geological Survey, China Geological Survey, Yantai 264001, China
3 School of Earth Sciences, Yunnan University, Kunming 650500, China
4 Power China Kunming Engineering Limited Corporation, Kunming 650033, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(14), 2463; https://doi.org/10.3390/rs17142463
Submission received: 17 May 2025 / Revised: 9 July 2025 / Accepted: 15 July 2025 / Published: 16 July 2025

Abstract

Precise extraction of rural settlements in alpine regions is critical for geographic data production, rural development, and spatial optimization. However, existing deep learning models are hindered by insufficient datasets and suboptimal algorithm structures, resulting in blurred boundaries and inadequate extraction accuracy. Therefore, this study uses high-resolution unmanned aerial vehicle (UAV) remote sensing images to construct a specialized dataset for the extraction of rural settlements in alpine mountainous areas, while introducing an innovative shadow mitigation technique that integrates multiple spectral characteristics. This methodology effectively addresses the challenges posed by intense shadows in settlements and environmental occlusions common in mountainous terrain analysis. Based on the comparative experiments with existing deep learning models, the Swin Transformer was selected as the baseline model. Building upon this, the Feature Fusion Swin Transformer (FF-SwinT) model was constructed by optimizing the data processing, loss function, and multi-view feature fusion. Finally, we rigorously evaluated it through ablation studies, generalization tests and large-scale image application experiments. The results show that the FF-SwinT has improved in many indicators compared with the traditional Swin Transformer, and the recognition results have clear edges and strong integrity. These results suggest that the FF-SwinT establishes a novel framework for rural settlement extraction in alpine mountain regions, which is of great significance for regional spatial optimization and development policy formulation.

1. Introduction

China has undergone one of the most extensive and rapid urbanization processes globally [1]. This process is particularly significant in rural areas, where many villages have been transformed into emerging towns. Urbanization has improved rural infrastructure and significantly enhanced residents’ quality of life. However, it has profoundly altered the spatial distribution patterns of surface features and brought about a series of problems in terms of planning, resources and ecology [2]. Consequently, regular updates to the urban and rural geographic information database are essential to support cadastral management and the development of Digital Earth. Artificial objects are important elements of spatial geographic information databases, and settlements, as the largest element of artificial objects, occupy an important position in urban planning, military reconnaissance and map drawing [3]. In order to realize rural revitalization and digital cities, the Chinese government has issued multiple policies and regulations concerning rural settlements, indicating the need to continuously promote the comprehensive improvement of rural land [4].
The rural regions in alpine mountainous areas represent distinctive zones in terms of rural development efforts, yet their harsh natural environment and limited agricultural productivity have hindered social and economic progress. Consequently, targeted investment in critical infrastructure development should be prioritized. This requires accurate information about the number, area, and distribution of rural settlements to support spatial optimization and the improvement of related management systems [5].
Compared with urban buildings, rural settlements in alpine mountainous areas are characterized by a small scale, scattered distribution and great internal differences, which make their extraction more challenging and have contributed to the scarcity of related research [6]. These challenges render low-resolution imagery insufficient for effective feature extraction, leading researchers to use very-high-resolution (VHR) data for building extraction studies [7,8,9]. Due to the dense population and scattered land use layout of villages in China, VHR UAV imagery offers high flexibility, low cost and strong real-time performance, making it a preferred tool for rural homestead monitoring. In addition, the relatively low cost of UAV remote sensing makes it feasible to conduct large-scale data collection in rural areas with limited resources. Although VHR UAV images provide detail, accurately extracting buildings from such data remains a challenging task [10,11]. To date, no public dataset focusing on rural buildings in alpine mountainous areas has been established.
In recent years, advancements in computer hardware have driven significant breakthroughs in deep learning, particularly within computer vision (CV). These developments have further expanded the applications of deep learning to Earth observation, particularly in tasks such as image classification and semantic segmentation [12]. Foundational CNN architectures such as AlexNet [13], GoogLeNet [14] and ResNet [15] have laid the groundwork for subsequent developments in image segmentation. Notably, convolutional neural network (CNN)-based image segmentation algorithms primarily rely on data-driven feature learning for automated classification. This approach exhibits exceptional robustness, with minimal sensitivity to spectral and shape variations in input images. CNN architectures demonstrate exceptional performance in building footprint detection tasks [16,17,18]. However, these models still suffer from the inherent limitations of CNNs, including potential spatial information loss, fixed input size constraints, and limited generalization capability. Building upon these foundations, Long et al. introduced a groundbreaking contribution to semantic segmentation, the fully convolutional network (FCN) [19]. This architecture innovatively substituted the traditional CNN’s fully connected layers with convolutional layers, becoming the first framework capable of processing arbitrary-sized inputs for segmentation tasks. The FCN architecture subsequently became a seminal work that revolutionized semantic segmentation methodologies. Subsequently, Ronneberger et al. enhanced the FCN’s segmentation performance for high-resolution imagery through skip connection integration [20]. This architectural innovation gained particular prominence in medical imaging applications. Chen et al. proposed the DeepLab series of models, building on the idea of introducing the FCN’s fully convolutional structure and end-to-end learning into their architecture. In 2018, they introduced DeepLabv3+, which achieved strong performance in complex scene segmentation [21]. In the same year, Wang et al. proposed the UPerNet model, which incorporated key advantages from the DeepLab series [22]. Compared to CNN-based methods, Transformer-based approaches demonstrate notable advantages in modeling sequential data and global features. Chen et al. proposed the ADF-Net model based on the Transformer architecture, which enhances the recognition of architectural features, improves the spatial information representation, and effectively captures the global dependence among buildings [23]. To address the high computational and memory costs of Transformer models for large-scale images, Liu et al. proposed the Swin Transformer [24]. This method captures complex geometric structures and environmental details in remote sensing imagery, which gives it clear advantages for building edge extraction and structural identification in densely built urban areas. However, its adaptability in complex environments, such as alpine mountainous regions, requires further validation.
In conclusion, this study focuses on the extraction of rural buildings in alpine mountain areas by introducing a shadow mitigation method that integrates multiple features to construct a specialized dataset tailored for rural buildings in alpine mountainous areas. Based on this foundation, we optimized the Swin Transformer architecture through enhanced data processing, refined loss functions, and multi-view feature fusion, and we developed a novel settlement extraction framework termed the FF-SwinT. We further evaluated the FF-SwinT using multiple quantitative metrics, confirming its robustness and accuracy for extracting rural settlements from remote sensing imagery in alpine mountainous regions.

2. Study Area

Deqin County (27°33′04″~29°15′12″N, 98°35′06″~99°32′20″E), located in the Diqing Tibetan Autonomous Prefecture of Yunnan Province, lies between the Yunling and Nushan Mountains, at the junction of Yunnan, Sichuan and Tibet (Figure 1). The county is dominated by high mountains and deep canyons, with an average elevation of 4270.2 m, and exhibits the topographical characteristics of being high in the north and low in the south. In Deqin County, 153 natural villages are located between 1800 m and 2400 m above sea level, accounting for 28.75% of the county’s total. Another 379 natural villages are located above 2400 m, representing 71.23% of the total natural villages in the county.
Deqin County has the typical characteristics of alpine mountainous regions and diverse topography, and the rural areas of Deqin County can well represent the typical rural conditions of alpine mountainous regions. Based on these characteristics, we selected several villages in Deqin County as case study sites. UAV imagery was collected and used to support the subsequent analyses.

3. Materials

3.1. Data Acquisition and Processing

3.1.1. UAV Remote Sensing Data Acquisition

The remote sensing imagery of Deqin County was obtained by UAV flights, yielding 0.05 m resolution UAV orthophoto images covering 20 rural settlements. The spatial distribution of the sampling points is shown in Figure 2. The orthophoto collection points are mainly distributed across Yunling town, Yanmen town, and Benzilan town in Deqin County. The orthophotos clearly reveal the spatial extent, roads, cultivated land and other key features of the rural settlements, including the roof structures, courtyard layout and other fine-scale details of individual homesteads.

3.1.2. UAV Remote Sensing Data Preprocessing

To guarantee the data reliability, the quality of UAV-acquired remote sensing imagery must be assessed prior to processing. This includes noise detection, distortion analysis, blur detection, and contrast analysis.
In the image processing stage, it is necessary to define the aerial survey parameters and review the inertial navigation data in advance. If positional deviations caused by image distortion are detected, distortion correction should be applied to ensure accurate geolocation restoration. Then, aerial triangulation can be performed using a UAV image processing system, and the methods of double virtual image area network adjustment and single camera image area network leveling can be adopted to improve the image processing accuracy. In addition, the potential POS anomalies and high overlap require targeted data correction to ensure the accuracy of the aerial triangulation network. Finally, to address the chromatic aberration caused by cloud cover or camera performance between aerial photographs, image correction, smoothing, enhancement, and stitching are needed to improve the overall quality of the final mosaic.

3.2. Create Deep Learning Datasets

Due to the computational constraints in terms of processing high-resolution imagery, deep learning models face challenges in both input handling and local feature extraction from large-scale remote sensing data. Consequently, image clipping becomes an essential preprocessing step. When determining the clipping size, too small a size may hinder the semantic segmentation task in capturing the complete morphological characteristics and spatial context information of the target. However, if the clipping size is too large, too much background information will be introduced, leading to unbalanced categories and increasing the consumption of computing resources. Consequently, achieving optimal model performance requires careful calibration between the spatial resolution and the image dimensions during preprocessing. By comparing the effects of common image task sizes (Figure 3), 512 × 512 was selected as the clipping size.
After labeling the samples with ArcGIS 10.8, the images were clipped with a sliding window of 512 × 512 pixels and a step size of 256 pixels, giving adjacent windows an overlap ratio of 50%. Following segmentation, image patches that did not contain labeled buildings, along with their corresponding remote sensing images, were removed. A total of 26,810 valid image samples were retained for model training.
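A minimal Python sketch of this sliding-window clipping and empty-patch filtering step is shown below; the function name, the use of NumPy arrays and the in-memory patch list are illustrative assumptions rather than the exact pipeline used in this study.

```python
import numpy as np

def sliding_window_clip(image, label, window=512, stride=256):
    """Clip an image/label pair into overlapping 512 x 512 patches (50% overlap)
    and drop patches whose label contains no building pixels.
    `image` is an (H, W, 3) array, `label` an (H, W) binary building mask."""
    patches = []
    h, w = label.shape
    for y in range(0, h - window + 1, stride):
        for x in range(0, w - window + 1, stride):
            lab = label[y:y + window, x:x + window]
            if lab.sum() == 0:          # no labeled building in this window: discard
                continue
            img = image[y:y + window, x:x + window]
            patches.append((img, lab))
    return patches
```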
Building upon the clipped samples, data enhancement can effectively increase the diversity of samples and help the model generalize better, thus improving its performance on unseen data. To maintain the high quality of the UAV remote sensing images and preserve the color and texture characteristics of the settlements themselves, this study mainly applied geometric transformations such as flipping and rotation to enhance the data. An example of the enhancement is shown in Figure 4. After data enhancement, a sample library containing 30,810 samples was generated.
The final processing stage utilized Python to structure the dataset according to the PASCAL VOC annotation standards. The dataset was divided into 90% (27,729 samples) for training and 10% (3081 samples) for model validation. This ratio was chosen based on the adequacy of the dataset size and the high demand for training data diversity in segmentation tasks. The total sample size of this study was 30,810, including 3081 samples in the validation set. This size significantly exceeds the minimum validation set requirement for small and medium-sized datasets. The 3081 validation samples cover various regions, lighting conditions, and shadow complexities, providing a reliable assessment of the model’s generalization ability on unseen data. Although adopting a conventional 80:20 split could improve the validation reliability, it would reduce the training set by approximately 3000 samples. This reduction may lead to the loss of critical training data, particularly for tasks requiring the learning of fine-grained features such as complex building contours.

4. Methods

4.1. Shadow Processing of Remote Sensing Image Based on Multi-Feature Fusion

The shadows cast by settlements in remote sensing images often weaken the optical features of these settlements, blur the boundaries between different features, and make object shapes difficult to distinguish. In this study, time-based features are introduced into the image processing workflow, and multiple feature fusion techniques are employed to mitigate the impact of shadows. Given that the majority of the radiation energy in visible remote sensing images originates from sunlight, the chromaticity of the shadow regions is expected to align with that of the regions directly illuminated by sunlight. Both high-brightness and shadowed regions remain unaffected by the normalized color space [25]. Therefore, the shadow feature can be differentiated by examining the contrast between the original color space and the normalized color space of the image. The corresponding formulas are presented in Equations (1)–(3):
$r = \frac{R}{R + G + B}$ (1)
$g = \frac{G}{R + G + B}$ (2)
$feature_1 = \mathrm{mean}(rR + gG)$ (3)
where R, G, and B denote the pixel values of the three channels in the RGB color space, while r and g represent the normalized contributions of the R and G channels within the RGB composition. By calculating the difference between the original RGB channels and the normalized ratio, this approach captures the spectral stability of shadows by using the proportional relationship between the channels. The chromaticity of the shadow area remains consistent despite variations in the solar radiation. This feature enhances the contrast between shadowed and non-shadow areas, as reflected in the inter-channel differences—representing a distinct spectral feature.
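To make this channel-ratio computation concrete, the following Python sketch evaluates Equations (1)–(3) on an RGB image array; interpreting the "mean" in Equation (3) as the per-pixel average of the two channel terms is an assumption of this sketch, and the function name is illustrative.

```python
import numpy as np

def shadow_feature1(img):
    """Spectral-stability feature (Eqs. (1)-(3)): compare the original R/G
    channels with their normalized-chromaticity counterparts.
    `img` is a float (H, W, 3) RGB array."""
    R, G, B = img[..., 0], img[..., 1], img[..., 2]
    total = R + G + B + 1e-6           # avoid division by zero
    r = R / total                      # Eq. (1)
    g = G / total                      # Eq. (2)
    return (r * R + g * G) / 2.0       # Eq. (3): per-pixel mean of the two terms
```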
On this basis, considering the varying sensitivity of the human eye to red, green, and blue light, a weighted method is employed to calculate the brightness and extract the color-based shadow distinguishing features (Equation (4)). K-means clustering is applied to combine the pixel matrix weight ω in the shadow region with the color shadow feature (feature2), resulting in a comprehensive brightness parameter [1]. This parameter incorporates both the human eye’s perception of shadows and spectral clustering features (Equation (5)).
$feature_2 = 0.46R + 0.50G + 0.04B$ (4)
$feature_2' = feature_2 + \lambda\omega$ (5)
where ω represents the pixel matrix extracted from the clustering results, and λ is the weight. Based on the human visual sensitivity to the RGB channels, the “low brightness” characteristics of shadows are directly quantified from the per-channel luminance, thereby enhancing the spectral difference at the channel level. The K-means clustering of pixels according to the spectral characteristics means that pixels in shadow areas will form continuous clusters in space due to the spectral similarity. Thus, the clustering results essentially capture the spatial continuity of shadows and help eliminate isolated low-brightness pixels that may be misclassified when relying solely on single-channel features.
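A minimal sketch of this brightness feature is given below, assuming scikit-learn's KMeans for the pixel clustering; treating the darkest cluster as the candidate shadow region, and the values of λ and the number of clusters, are illustrative choices rather than the parameters used in this study.

```python
import numpy as np
from sklearn.cluster import KMeans

def shadow_feature2(img, lam=0.5, n_clusters=2):
    """Brightness-based feature (Eqs. (4)-(5)): a perception-weighted luminance
    map refined by a K-means shadow-cluster weight.
    `img` is a float (H, W, 3) RGB array."""
    R, G, B = img[..., 0], img[..., 1], img[..., 2]
    f2 = 0.46 * R + 0.50 * G + 0.04 * B               # Eq. (4)
    # Cluster pixels on their spectral values; treat the darkest cluster as the
    # candidate shadow region and use its membership as the weight matrix w.
    pixels = img.reshape(-1, 3)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(pixels)
    dark_cluster = np.argmin(km.cluster_centers_.sum(axis=1))
    w = (km.labels_ == dark_cluster).reshape(f2.shape).astype(float)
    return f2 + lam * w                                # Eq. (5)
```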
Additionally, to eliminate noise, reduce the computational complexity, and enhance the accuracy of the settlement edge recognition, it is essential to remove the discrete shadow patches produced by vegetation surrounding the settlement. Since vegetation typically appears green in the visible spectrum, the minimum difference between the green band and the red and blue bands serves as an identification feature [26], as illustrated in Equation (6).
$feature_3 = G - \min(R, B)$ (6)
The green spectral characteristics of the vegetation region are captured by Equation (6), which enables the detection of vegetation-associated “speckled shadows”. The morphological characteristics (scattered and small scale) of this kind of shadow are significantly different from the elongated and continuous shadows typically cast by buildings.
Based on the features outlined above, the final representation of the shadow features is constructed by assigning feature weights, as shown in Equation (7). The discrimination of shadows is then performed as described in Equation (8):
$feature = \alpha \cdot feature_1 + b \cdot feature_2 + c \cdot feature_3$ (7)
$shadow = \begin{cases} 1, & \text{if } feature > V \\ 0, & \text{if } feature \le V \end{cases}$ (8)
where α, b and c are the feature weights, and V is the judgment threshold. The parameters suitable for the image are obtained by adjusting these values through testing.
This method combines the spectral channel features with the spatial characteristics and achieves the comprehensive depiction of shadows from both the spectral and spatial dimensions. This method addresses the limitations of using single features and improves the accuracy of shadow recognition. Single-channel characteristics (such as only relying on brightness) tend to misjudge dark non-shadow objects (such as dark roofs and water bodies) as shadows. The channel features (feature1, feature2) ensure the capture of the shadow core attribute of “low brightness and stable chromaticity” at the spectral level, and the spatial features (spot morphology of feature3) exclude non-shadow interference areas (such as vegetation spot shadows) through spatial continuity and morphological differences. Together, they realize “spectral screening + spatial verification”, which significantly improves the accuracy of shadow recognition.
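The fusion and thresholding steps of Equations (6)–(8) can be sketched as follows, reusing the shadow_feature1 and shadow_feature2 helpers from the sketches above; the weight values, the threshold and the min-max rescaling of each feature are illustrative assumptions, since the paper tunes these parameters experimentally per image.

```python
import numpy as np

def shadow_mask(img, a=0.4, b=0.4, c=0.2, V=0.5):
    """Fuse the three shadow features (Eq. (7)) and threshold them (Eq. (8)).
    `img` is a float (H, W, 3) RGB array; returns a binary mask (1 = shadow)."""
    R, G, B = img[..., 0], img[..., 1], img[..., 2]
    f1 = shadow_feature1(img)
    f2 = shadow_feature2(img)
    f3 = G - np.minimum(R, B)                              # Eq. (6), vegetation cue
    # Rescale each feature map to [0, 1] so that the weights are comparable.
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-6)
    feature = a * norm(f1) + b * norm(f2) + c * norm(f3)   # Eq. (7)
    return (feature > V).astype(np.uint8)                  # Eq. (8)
```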

4.2. Segmentation Model

A variety of image segmentation models have been proposed, and their segmentation accuracy is highly context-dependent. In this study, we trained and applied these models for the extraction of rural residential settlements in mountainous plateau regions. We quantitatively compared the model performance using the MIOU and F1-score metrics. Building on the most effective baseline model, we then optimized the framework to adapt the segmentation approach to the specific characteristics of rural settlements in this region. The following section provides a brief introduction to the basic segmentation models employed in this study.

4.2.1. FCN

Long et al.’s FCN architecture, building upon foundational CNNs like AlexNet, VGGNet, and GoogLeNet, established the first viable framework for pixel-level semantic segmentation, marking a watershed moment in deep learning-based image analysis [19]. The FCN’s fundamental breakthrough involves substituting conventional fully connected layers with convolutional operations, enabling spatial heatmap generation instead of singular classification outputs. This enables the model to process input images of arbitrary sizes. Additionally, the FCN incorporates a skip connection structure, which facilitates the combination of deep global semantic information with shallow local details, thereby enhancing the model’s ability to understand global semantics while preserving finer local features.

4.2.2. UNet

UNet, introduced in the seminal biomedical segmentation paper, employs a symmetrical encoder–decoder structure [27]. The encoder alternates between convolutional operations and downsampling layers, systematically compressing the spatial dimensions while abstracting the hierarchical features. The decoder, in contrast, is composed of deconvolutional layers and skip connections, which work together to upsample the feature maps from the encoder, restoring them to the original resolution of the input image. This enables the model to perform pixel-level classification effectively.

4.2.3. DANet

DANet is a dual attention network designed to enhance semantic segmentation tasks by incorporating contextual information [28]. DANet extends the FCN’s architecture by incorporating parallel channel and spatial attention modules, enabling cross-dimensional dependency modeling. This dual-path attention approach enhances the global context awareness and elevates the segmentation performance. However, in cases of small target segmentation, DANet still encounters challenges.

4.2.4. Deeplabv3-Plus

Deeplabv3-plus, introduced by the Google Brain team in 2018, is the latest version of the DeepLab series of models, designed to address issues related to spatial resolution and the fusion of contextual information in semantic segmentation [21]. While conventional CNNs expand the receptive fields via pooling layers, this approach inevitably sacrifices spatial resolution, resulting in critical feature degradation. To overcome this challenge, Deeplabv3-plus uses dilated convolutions in the encoder, which enlarge the effective receptive field and enable the network to capture a broader range of feature information without sacrificing resolution. Furthermore, Deeplabv3-plus incorporates the concept of deformable convolution and enhances the backbone network [29], which is responsible for extracting image features.

4.2.5. UPerNet

UPerNet is a semantic segmentation model that has shown impressive performance on benchmark datasets, effectively segmenting a wide variety of image concepts [22,30]. State-of-the-art segmentation models now commonly utilize deep CNN structures containing dozens of convolutional layers for enhanced feature learning. These deep architectures enable the upper layers to capture global semantic information through a large receptive field while maintaining low computational complexity. UPerNet addresses the challenge of insufficient receptive fields by integrating the Feature Pyramid Network (FPN) and the Pyramid Pooling Module (PPM) [31]. This combination allows UPerNet to capture multi-scale features more effectively, enhancing its ability to perform unified perceptual analysis and improving the segmentation performance across different image scales and tasks.

4.2.6. Transformer

A Transformer is not a specific algorithm designed for semantic segmentation but rather a foundational framework that can be applied to various tasks, including semantic segmentation [32]. Each Transformer encoder layer integrates three core operations: multi-head attention, MLP transformation, and normalization between layers. The core innovation behind self-attention is its ability to model the relationships between input elements regardless of their position in the sequence, addressing challenges such as the vanishing gradient problem commonly encountered in sequence modeling tasks. This characteristic makes the Transformer architecture well suited for tasks involving long-range dependencies, including semantic segmentation, where capturing the global context is crucial for accurate pixel-level classification.

4.2.7. Swin Transformer

Developed by Microsoft Research Asia, the Swin Transformer adapts the NLP-oriented transformer principles to computer vision through its hierarchical feature representation approach [24]. Unlike standard Transformers, the Swin Transformer introduces several key innovations, including hierarchical distributed attention, a shift window mechanism, windowed relative position encoding, and the use of smaller transformer blocks. These modifications are designed to address the computational and memory challenges inherent in high-resolution image processing while maintaining robust performance. This makes the Swin Transformer particularly effective for tasks like image semantic segmentation, where both precision and efficiency are critical. The model is available in four different versions, with the Swin-T variant being selected in this paper due to its balance between computational efficiency and performance, particularly in the context of remote sensing image analysis.

4.3. Improved Swin Transformer

Comparative evaluation revealed the Swin-T’s superior accuracy among benchmarked models, establishing it as our baseline architecture. The optimization strategy focuses on incorporating adaptive cross-dimensional fusion mechanisms to enhance the multi-scale feature integration, thereby boosting the segmentation precision and robustness in remote sensing applications.

4.3.1. Optimization Loss Function

The original Swin Transformer uses binary cross entropy (BCE) loss as its loss function, which, however, has limitations, such as the sensitivity to class imbalance, the inadequate boundary and regional continuity constraints, and the issue of gradient vanishing [33]. To address these challenges, this study proposes a hybrid loss function that combines balanced cross-entropy (BalanCE) and dice loss to optimize the performance of the Swin Transformer. The following is a detailed explanation of each component of the loss function:
(1) BalanCE loss
BalanCE enhances the standard weighted cross-entropy loss by introducing a coefficient β, which allows for proportional adjustment of the loss contributions from both positive and negative classes. The core principle of this method remains based on traditional cross-entropy, making it effective for handling class imbalance. However, while it addresses the imbalance issue, it does not directly optimize the overall overlap of the segmentation region, which can result in limited effectiveness in capturing fine boundary details. The mathematical expression for this loss function is presented in Equation (9).
$L_{BalanCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[\beta y_i \log \hat{P}_i + (1-\beta)(1-y_i)\log\left(1-\hat{P}_i\right)\right]$ (9)
where $y_i \in \{0, 1\}$ represents the binary label of pixel $i$, with 1 indicating the positive class and 0 the negative class; $\hat{P}_i \in [0, 1]$ is the predicted probability of pixel $i$ belonging to the positive class, while $1 - \hat{P}_i$ is the probability of it belonging to the negative class; and $N$ represents the total number of pixels.
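A minimal PyTorch sketch of this balanced cross-entropy term is shown below; the function name and the default β value are illustrative, and `pred` is assumed to already contain per-pixel probabilities of the building class.

```python
import torch

def balance_ce_loss(pred, target, beta=0.7, eps=1e-7):
    """Balanced cross-entropy (Eq. (9)). `pred` holds per-pixel probabilities of
    the positive (building) class, `target` the binary labels; beta weights the
    positive term. beta = 0.7 is an illustrative value, not the paper's setting."""
    pred = pred.clamp(eps, 1.0 - eps)                      # numerical stability
    loss = -(beta * target * torch.log(pred)
             + (1.0 - beta) * (1.0 - target) * torch.log(1.0 - pred))
    return loss.mean()
```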
(2) Dice loss
Frequently employed in semantic segmentation, dice loss quantifies the overlap between the predicted and ground truth regions through set similarity measurement. It evaluates the similarity between the predicted segmentation region and the ground truth, performing well on datasets with class imbalances. However, a notable issue with dice loss is that when the prediction closely matches the true label, the gradient approaches zero. This can cause the training process to stall, hindering further improvements to the model accuracy. To address this issue, researchers have proposed using the squared dice loss, which replaces the simple sum in the denominator of the dice loss function with the sum of squares (Equation (10)). This modification helps maintain more stable gradients. However, squared dice loss also has limitations. Specifically, it lacks a weight adjustment mechanism, making it less effective for handling class imbalances. In particular, when certain classes contain fewer pixels, squared dice loss may fail to assign sufficient attention to these smaller classes, potentially leading to poor segmentation results for these classes.
$L_{SquaredDice} = 1 - \frac{2\sum_{i=1}^{N} y_i \hat{p}_i + smooth}{\sum_{i=1}^{N} y_i^2 + \sum_{i=1}^{N} \hat{p}_i^2 + smooth}$ (10)
(3) Combination loss
Considering that BalanCE adjusts the class weights through β, it resolves the category imbalance that squared dice loss cannot effectively address; conversely, squared dice loss retains its advantage in optimizing the region overlap and boundary details, compensating for BalanCE’s limitations in this respect. Building upon this, and motivated by the ambiguity of existing combined loss functions, we propose a new combined loss, LBCE-SDice, to resolve the class imbalance problem while simultaneously improving the model’s accuracy in predicting region shapes. Rather than putting forward a new single loss function, this work selects variants of existing losses and integrates them into a collaborative optimization mechanism adapted to the task of extracting buildings with fine-grained features. BalanCE differs from the original BCE or weighted cross-entropy (WCE): the β coefficient added on top of WCE dynamically adjusts the contribution ratio of the positive and negative losses, adapting more flexibly to scenes in which shadowed areas cause large fluctuations in the proportion of negative (non-building) samples. Squared dice loss is not the original dice loss either: replacing the simple sums with sums of squares resolves the vanishing-gradient problem of the original dice loss when predictions approach the ground truth, making it better suited to the fine optimization of building boundaries. BalanCE’s β coefficient compensates for squared dice’s lack of weight adjustment, while squared dice’s optimization of the region overlap compensates for BalanCE’s lack of boundary detail; LBCE-SDice thus achieves complementarity by design. In addition, the β coefficient can dynamically raise the loss weight of building pixels in ambiguous regions, preventing the model from misclassifying them because their values resemble negative samples, and squared dice’s sensitivity to the region overlap forces the model to learn the spatial correlation between buildings and shadows, indirectly improving the accuracy of the overall shape prediction. The specific LBCE-SDice formula is as follows.
$L_{BCE\text{-}SDice} = (1 - \alpha)L_{BalanCE} + \alpha L_{SquaredDice}$ (11)
where the constant $\alpha \in [0, 1]$ balances the contributions of the two terms.
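The squared dice term and the combination can be sketched in PyTorch as follows, reusing the balance_ce_loss helper from the previous sketch; the default α, β and smooth values are illustrative, not the values tuned in this study.

```python
import torch

def squared_dice_loss(pred, target, smooth=1.0):
    """Squared dice loss (Eq. (10)): sums of squares in the denominator keep the
    gradients stable when the predictions approach the labels."""
    inter = (pred * target).sum()
    denom = (pred ** 2).sum() + (target ** 2).sum() + smooth
    return 1.0 - (2.0 * inter + smooth) / denom

def bce_sdice_loss(pred, target, alpha=0.5, beta=0.7):
    """Combined loss L_BCE-SDice (Eq. (11)); alpha balances the two terms."""
    return (1.0 - alpha) * balance_ce_loss(pred, target, beta) \
        + alpha * squared_dice_loss(pred, target)
```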

4.3.2. Adaptive Cross-Dimensional Feature Fusion Improved Swin Transformer

The Swin Transformer’s Patch Partition and Patch Merging modules differ significantly from traditional segmentation algorithms. The Patch Partition limits the scope of the attention calculation through block processing, thereby reducing the computational complexity. The Patch Merging module performs downsampling and multi-scale feature extraction, similar to the pooling operations in CNNs. However, the Swin Transformer does not fully account for important features in both the channel and spatial dimensions, and it lacks an inter-channel and spatial attention mechanism. This results in a diminished ability to capture local information, leading to limitations in high-precision settlement target extraction tasks.
To address these limitations, this study integrates the Convolutional Block Attention Module (CBAM) to enhance the model’s ability to learn discriminative features adaptively. The CBAM applies a dual-attention mechanism that sequentially infers attention maps along the channel and spatial dimensions, enhancing the feature discriminability. This adaptive approach specifically addresses the Swin Transformer’s limitations in fine-grained structure detection by dynamically emphasizing task-critical features, as illustrated in Figure 5.
Figure 5 illustrates that the CBAM includes two sub-modules: a channel attention module (CAM) and a spatial attention module (SAM). The CAM keeps the channel dimension unchanged and compresses the spatial dimensions into scalars, so that the network can focus on the category information within the image. It adopts two pooling methods, average pooling and max pooling, to aggregate the spatial information of the feature maps, and the pooled descriptors are then processed by a shared multi-layer perceptron; the resulting channel attention vector is obtained by summing the two outputs element-wise. When extracting fine buildings, the image contains a large amount of background information unrelated to the buildings, and the buildings themselves comprise many categories; distinguishing this category information accurately is the basis of fine extraction. The original Swin Transformer treats the category information in different channels uniformly and cannot adaptively highlight the channel features that are decisive for building category judgment. By integrating spatial information through average and max pooling and generating channel attention vectors, the CAM makes the network automatically focus on the channels containing important category information and suppress irrelevant or secondary channels, so that the category attributes of buildings are identified more accurately, laying a foundation for the subsequent extraction. The purpose of adding the CAM is therefore to enhance the model’s adaptive ability to capture category information at the channel level, so as to cope with extraction scenes with diverse building categories and complex backgrounds.
The SAM preserves the spatial dimensions while reducing the channel depth, computing the inter-feature relationships to produce spatial attention maps. This dual-path approach employs max and average pooling for context aggregation, followed by convolutional fusion. The final refinement stage performs cross-dimensional attention fusion, applying multiplicative feature weighting at both the intermediate and final layers for adaptive enhancement. Buildings have specific spatial positions and morphological characteristics in the image, such as edges, outlines and texture distributions, which directly affect the accuracy and fineness of the extraction. The original Swin Transformer does not pay enough attention to spatial location information during feature processing, so it struggles to accurately locate the position and boundary of a building. The SAM compresses the multi-channel dimension into one channel, calculates the spatial relationships between features and generates a spatial attention map, which makes the network focus on the spatial area where a building is located and strengthens the capture of key spatial features such as edges and outlines. This improves the ability to distinguish adjacent buildings and to accurately outline their boundaries, effectively improving the spatial accuracy of the building extraction. Therefore, the SAM is added to enhance the model’s focus on spatial position information, addressing the complex spatial forms and high positioning requirements of buildings.
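A compact PyTorch sketch of the CBAM described above is given below; the reduction ratio and the 7 × 7 spatial kernel follow the common CBAM configuration and are assumptions rather than the exact settings embedded in the FF-SwinT.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CAM: squeeze the spatial dims with average/max pooling, pass both through
    a shared MLP, and sum the outputs into a per-channel weight."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """SAM: squeeze the channel dim with average/max pooling, then convolve."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Sequential channel-then-spatial attention applied multiplicatively."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()
    def forward(self, x):
        x = x * self.ca(x)
        return x * self.sa(x)
```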

4.3.3. Model Structure Based on Multi-Resolution Feature Fusion

Due to the presence of multi-scale and spatially chaotic settlements in remote sensing images of high-altitude mountain regions, these complex geometric background features often lead to significant information loss and a degradation of the segmentation quality when decoding with only a single-resolution feature. Although the semantic segmentation model based on the Swin Transformer performs well in extracting rural settlements, there are still instances where its extraction performance is suboptimal. Specifically, the PPM module in UPerNet fuses multi-scale associations among subregions by applying multi-scale pooling to different regions of the feature map, thereby aggregating the global context information. The FPN module, on the other hand, effectively integrates multi-scale information by fusing feature maps of different resolutions layer by layer, thus enhancing the model’s ability to recognize objects with varying complexity. The combination of the PPM and FPN can efficiently integrate features extracted at different stages of the encoder, thereby supporting the final prediction results and optimizing the processing of multi-dimensional targets.
The Swin Transformer has limited ability to aggregate global context information when processing features. Fine building extraction not only needs to pay attention to the local characteristics of the building itself but also needs to combine the global scene in which it is located. Without the effective integration of global context information, the model is prone to misclassifying isolated local areas as buildings or to missing the details of buildings closely related to the global scene. The PPM module can aggregate global context information from different spatial ranges by applying multi-scale pooling in different regions of the feature map, effectively associate local features with global scene features, help the model understand the attributes of buildings in the overall scene more accurately, and reduce the extraction error caused by misjudgment of local features. At the same time, buildings show significant multi-scale characteristics in the image, and there are differences in the expression of building characteristics at different scales, and the multi-scale association of each subregion is very important for extracting integrity. The Swin Transformer has insufficient ability to integrate multi-scale features, and it is difficult to consider the feature expression of buildings with different scales, which is prone to the problems of incomplete extraction of large-scale buildings and neglect of small-scale buildings. The PPM module generates feature subgraphs with different scales through multi-scale pooling and then fuses them with the original features, which can fully capture the multi-scale feature association from local to global, make the model pay attention to the feature information of buildings with different scales at the same time, enhance the adaptability to multi-scale buildings, and effectively improve the integrity and consistency of the extraction, especially when dealing with buildings with significant scale differences.
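A minimal PyTorch sketch of such a pyramid pooling module is shown below; the bin sizes (1, 2, 3, 6) follow the common PSPNet/UPerNet configuration, and the channel widths are illustrative rather than the exact ones used in the FF-SwinT decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    """Pyramid Pooling Module: pool the feature map at several scales, project
    each pooled map with a 1x1 conv, upsample back to the input size and
    concatenate with the original features to aggregate global context."""
    def __init__(self, in_ch, out_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, out_ch, 1, bias=False),
                          nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
            for b in bins])
        self.fuse = nn.Sequential(
            nn.Conv2d(in_ch + len(bins) * out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
    def forward(self, x):
        size = x.shape[2:]
        feats = [x] + [F.interpolate(stage(x), size=size, mode='bilinear',
                                     align_corners=False) for stage in self.stages]
        return self.fuse(torch.cat(feats, dim=1))
```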
To address these challenges, this study utilizes UPerNet as a decoder for the Swin Transformer model to achieve multi-resolution feature fusion processing. Additionally, a shadow processing strategy for remote sensing images based on multi-feature fusion is adopted, integrating channel and spatial features into the Swin Transformer encoder. The network is named FF-SwinT, and its model structure is shown in Figure 6.

5. Experimental Setup and Results Discussion

5.1. Experimental Environment and Parameters

The experiments in this study were run on the Windows 10 Professional Workstation Edition 64-bit operating system, with an Intel Core i7-10700 processor (8 cores, clocked at 3.9 GHz), an NVIDIA GeForce RTX 2080 Super graphics processor, and 64 GB of memory to ensure efficient data processing and computing capabilities. The development language is Python 3.8, the deep learning framework is PyTorch 1.8.1, and GPU acceleration is provided by CUDA 10.2, meeting the high performance requirements of the deep learning computations, as shown in Table 1.
All the methods in this experiment use an input image size of 512 × 512, and batch normalization (BN) is applied during network training. This normalization method accelerates convergence and improves model performance by standardizing the layer inputs to zero mean and unit variance, which also alleviates gradient instability. For the Swin Transformer implementation, we configure a batch size of 2, 480,000 iterations and the AdamW optimizer, parameters that enable stable training and efficient convergence of the feature extraction framework. The AdamW parameters are set as an initial learning rate of $lr = 6 \times 10^{-5}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ and a weight decay coefficient of 0.01. The learning rate decay strategy is polynomial decay (Poly), and a linear warmup strategy is used to preheat the learning rate, which makes the initial training of the model more stable.
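The optimizer and learning-rate schedule described above can be sketched in PyTorch as follows; the warmup length, the polynomial power and the placeholder model are illustrative assumptions, since the paper does not report these values.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Conv2d(3, 2, 3)            # placeholder standing in for the FF-SwinT network
max_iters, warmup_iters = 480_000, 1_500    # warmup length is an illustrative value

optimizer = AdamW(model.parameters(), lr=6e-5, betas=(0.9, 0.999),
                  weight_decay=0.01)

def poly_warmup(it, power=1.0):
    """Linear warmup followed by polynomial (Poly) decay of the LR factor."""
    if it < warmup_iters:
        return it / warmup_iters
    return (1 - (it - warmup_iters) / (max_iters - warmup_iters)) ** power

# scheduler.step() is called once per training iteration.
scheduler = LambdaLR(optimizer, lr_lambda=poly_warmup)
```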

5.2. Experimental Result

5.2.1. Image Shadow Extraction and Processing

By comprehensively considering the color space, human visual perception, spectral characteristics and the influence of vegetation, a discriminant expression for the shadow features is constructed to facilitate shadow extraction. The influence of shadow is then reduced by fusing multiple features. The final shadow extraction result and shadow mitigation are shown in Figure 7. As shown by the results, the discriminant expression identifies the shadow features well, and on this basis, the multi-feature shadow weakening processing is effective. Specifically, Figure 7a and Figure 7c, respectively, represent the images before and after shadow processing, and the blue area in Figure 7b represents the shadow extraction result. The quality of the shaded area in Figure 7a is noticeably enhanced in Figure 7c.

5.2.2. Evaluation of Semantic Segmentation Performance of Settlement Based on Multiple Models

This research employs a dataset split of 27,729 training samples and 3081 validation samples. Six semantic segmentation models for settlement extraction are implemented: FCN, UNet, DANet, Deeplabv3-plus, UPerNet and Swin Transformer. All models were trained for 480,000 iterations, at which point the accuracy metrics had converged. The model performance was evaluated using the mean intersection over union (MIOU), mean accuracy (mAccuracy) and average F1-score (F1-score), and the evaluation results are shown in Table 2. Additionally, the intersection over union (IOU) and accuracy results for the settlements and backgrounds are shown in Table 3.
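For reference, the evaluation metrics can be computed from a confusion matrix as sketched below; the per-class "accuracy" is interpreted here as the recall of each class, which is an assumption about the paper's definition, and the function name is illustrative.

```python
import numpy as np

def segmentation_metrics(pred, gt, n_classes=2):
    """Per-class IoU, accuracy and F1 from a confusion matrix, averaged into
    MIOU, mAccuracy and mean F1. `pred` and `gt` are integer label maps of the
    same shape (0 = background, 1 = settlement)."""
    cm = np.bincount(gt.ravel() * n_classes + pred.ravel(),
                     minlength=n_classes ** 2).reshape(n_classes, n_classes)
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp                 # pixels wrongly assigned to the class
    fn = cm.sum(axis=1) - tp                 # pixels of the class that were missed
    iou = tp / (tp + fp + fn + 1e-6)
    acc = tp / (tp + fn + 1e-6)              # per-class accuracy (recall)
    precision = tp / (tp + fp + 1e-6)
    f1 = 2 * precision * acc / (precision + acc + 1e-6)
    return {"MIOU": iou.mean(), "mAccuracy": acc.mean(), "mF1": f1.mean()}
```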
To show the performance difference of the model more intuitively, this paper qualitatively analyzes the representative test results. The model is evaluated in terms of the settlement integrity, edge definition, extraction effect of settlements with different shapes, extraction effect of settlements with different spatial scales, influence of the settlement aggregation degree and extraction effect of settlements with environmental noise. Figure 8 shows the prediction effects of the six models. The results show that the Swin Transformer model performs best, with the MIOU, mAccuracy and F1-score all being higher than the other models, especially the F1-score, reaching 94.91%, which shows that the model can distinguish different regions more accurately in image segmentation tasks.

5.2.3. Improved FF-SwinT Model Based on Ablation Experiment

To address the limitations of the Swin Transformer baseline, such as the class imbalance, inaccurate segmentation of complex edges and small targets, and insufficient multi-scale feature fusion, a new semantic segmentation model, the FF-SwinT, is proposed. In addition, an ablation experiment is designed to evaluate the effects of the shadow processing of remote sensing images based on multi-feature fusion, the improved combined loss function $L_{BCE\text{-}SDice}$ and the adaptive cross-dimensional attention feature fusion on the segmentation performance for settlements and background areas in high-precision remote sensing images of rural areas in alpine mountainous regions. The average evaluation metrics of the model ablation experiment on the datasets are shown in Table 4. As shown in Table 4, the shadow processing of remote sensing images based on multi-feature fusion, the combined loss function and the adaptive cross-dimensional attention feature fusion successively improve the accuracy of the Swin Transformer, and the accuracy of the FF-SwinT model improves significantly over that of the Swin Transformer. Table 5 presents more detailed evaluation results for the settlement and background classes in the high-resolution remote sensing datasets of alpine rural areas.
Among them, A stands for the shadow processing of remote sensing images based on multi-feature fusion, B stands for the combined loss function $L_{BCE\text{-}SDice}$ addressing the class imbalance and shape accuracy of the segmentation results, C stands for the adaptive cross-dimensional attention feature fusion, and FF-SwinT stands for the final model with multi-resolution feature fusion.

5.2.4. Cross-Regional Model Generalization Ability Verification

The FF-SwinT is designed for extracting rural settlements in alpine mountainous areas from high-resolution remote sensing imagery, with the goal of improving the extraction accuracy. In order to verify its generalization ability in other regions, this study uses the image data of other regions in western Sichuan (not included in the training dataset) for testing, and the results are shown in Figure 9. The test results show that the residential structures are accurately identified, with clearly defined segmentation boundaries. Although there are some slight deviations, the overall performance can meet the needs of actual data production and collection.
In addition, one of the main application goals of training the model on clipped images is to support the interpretation of large-scale remote sensing images. In order to test the completeness of the edges extracted from small-scale image patches and the accuracy of mosaicking the predictions back into large-scale images, some remote sensing images of the Benzilan community were selected. Firstly, the remote sensing images were clipped with the same window size and sliding segmentation method used for the model input. Then, the clipped images were fed into the model for prediction, and the resulting masks were spliced together according to the clipping order. Finally, the result was converted into a vector map with the ArcGIS tools and compared with the original remote sensing layer, and the final interpretation result was output, as shown in Figure 10. The results show that the overall effect is good, the settlement extraction is accurate and the vector edges are complete, which verifies the feasibility and effectiveness of the model.
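The tile-predict-mosaic workflow described above can be sketched as follows; the non-overlapping tiling, the border padding and the `predict_fn` callable are simplifying assumptions rather than the exact stitching procedure used in this study.

```python
import numpy as np

def predict_large_image(image, predict_fn, window=512, stride=512):
    """Tile a large orthophoto, predict each tile and mosaic the masks back in
    clipping order. `predict_fn` maps a (window, window, 3) tile to a binary
    mask of the same spatial size."""
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=np.uint8)
    for y in range(0, h, stride):
        for x in range(0, w, stride):
            tile = image[y:y + window, x:x + window]
            ph, pw = tile.shape[:2]
            if (ph, pw) != (window, window):       # pad incomplete border tiles
                tile = np.pad(tile, ((0, window - ph), (0, window - pw), (0, 0)))
            pred = predict_fn(tile)
            mask[y:y + ph, x:x + pw] = pred[:ph, :pw]
    return mask
```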

6. Discussion

Public datasets often perform poorly when used to train deep learning models for settlement classification in alpine mountainous rural areas. In terms of training sample datasets, this study generates sample datasets through data labeling. For the prediction datasets, the influence of image shadows on model segmentation is considered, and a multi-feature processing method is used to reduce the interference of shadows. On this basis, this study compared seven methods that performed well in settlement object extraction: FCN, UNet, DANet, Deeplabv3-plus, UPerNet, Transformer and Swin Transformer. Among them, the Swin Transformer shows the best building extraction performance. In addition, the results in Table 3 and Table 4 show that the Transformer-based models are clearly superior to the CNN-based models in the task of high-precision remote sensing image segmentation. This indicates that Transformer-based approaches outperform traditional CNNs in this application scenario, which is consistent with previous research results [34]. This may be because the self-attention mechanism of the Transformer enables the model to better capture the global feature information of the image, especially in complex terrain such as alpine mountains, where it can effectively distinguish subtle differences. In contrast, traditional CNN architectures (such as ResNet-101) mainly extract local features through local convolution operations, which makes it difficult to achieve the same global view.
Inspired by UPerNet and local feature enhancement, this study designs ablation experiments on the rural settlement datasets in alpine mountain areas on the basis of the optimal baseline, the Swin Transformer, and quantitatively analyzes the influence on model performance of image shadow processing, constructing a new loss function, embedding the CBAM and adopting a multi-resolution feature fusion structure as the decoder. These experiments further validate the effectiveness of the proposed method. As shown in Table 4 and Table 5, the basic indexes such as the MIOU, mAccuracy and F1 improve significantly as the improvements are introduced step by step: the MIOU increased from 91.16% to 93.3%, the mAccuracy increased from 95.06% to 96.68%, and the F1 increased from 94.91% to 96.68%. It is worth noting that the performance of the model improves significantly when operation C is added, that is, when the CBAM is embedded into the Swin Transformer, especially in terms of the MIOU, which may help capture finer-grained information or improve boundary processing. Regarding the contribution of individual modules, module A has the greatest influence, while the contributions of B and C are relatively small, but there is a significant synergistic effect when they are used in combination.
The FF-SwinT model is built around common problems in remote sensing building extraction (fuzzy features, scale differences, etc.), which gives it a foundation for cross-landscape transfer. For example, in desert and tropical environments, addressing specific problems such as blurring caused by sand and dust in deserts, or sparse samples and complex shadows in tropical areas, would require fine-tuning the interference-feature extraction, loss parameters and data enhancement. In principle, this could improve the detection rate of small-scale targets in deserts and optimize the accuracy of vegetation-occluded boundaries in the tropics. Although extreme scenes remain a limitation, transfer can be achieved through a "universal framework + scene-specific fine-tuning" strategy.

7. Conclusions

In order to solve the problems of the lack of datasets and the suboptimal performance of existing models for settlement extraction in rural areas of alpine mountains, this study constructed a dataset of UAV thematic remote sensing images. The Swin Transformer model was improved in terms of data processing, loss function, feature extraction and model structure, and the FF-SwinT model is proposed. The experimental results show that it achieved an MIOU of 93.3%, an mAccuracy of 96.72% and an F1-score of 96.68%, the best performance among all the tested models, with performance gains exceeding 5% compared with the other algorithms.
The dataset has made a new contribution to the extraction of small-scale buildings, and at the same time, it also provides more samples for common building segmentation tasks in different scenarios. Furthermore, the FF-SwinT model significantly advances the settlement extraction techniques in rural residential areas across plateau and mountainous regions, offering a more effective solution for building identification and extraction in complex geographical environments. However, this study did not compare and verify the building extraction accuracy of this model in other special scenes. In the follow-up study, more scenes will be analyzed.

Author Contributions

Formal analysis, G.L., J.L. and Z.G.; Investigation, G.L. and Z.G.; Methodology, G.L. and Z.G.; Software, G.L., Z.G. and F.Z.; Visualization, G.L. and J.L.; Writing—original draft, G.L., J.L. and Z.G.; Writing—review & editing, G.L. and J.L.; Data curation, Z.G.; Validation, Z.G.; Funding acquisition, F.Z.; Project administration, F.Z.; Resources, F.Z.; Supervision, F.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Scientific Research Foundation of Yunnan Province Education Department [grant number 2024J0023] and the Yunnan Revitalization Talent Support Program Young Talent Project [grant number C6213001229].

Data Availability Statement

The data are available from the corresponding author upon reasonable request.

Conflicts of Interest

Author Zuyu Gao works for Power China Kunming Engineering Limited Corporation. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Macqueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, 21 June–18 July 1965 and 27 December 1965–7 January 1966; University of California Press: Berkeley, CA, USA, 1967; pp. 281–298. [Google Scholar]
  2. Dadashpoor, H.; Azizi, P.; Moghadasi, M. Land use change, urbanization, and change in landscape pattern in a metropolitan area. Sci. Total Environ. 2019, 655, 707–719. [Google Scholar] [CrossRef] [PubMed]
  3. Ispir, D.A.; Yildiz, F. Using deep learning algorithms for built-up area extraction from high-resolution GÖKTÜRK-1 satellite imagery. Earth Sci. Inform. 2025, 18, 1–18. [Google Scholar]
  4. Yin, Q.; Sui, X.; Ye, B.; Zhou, Y.; Li, C.; Zou, M.; Zhou, S. What role does land consolidation play in the multi-dimensional rural revitalization in China? A research synthesis. Land Use Policy 2022, 120, 106261. [Google Scholar] [CrossRef]
  5. Luo, F.; Huang, Z. Large area house measurement based on the UAV tilt photogrammetry: Take the survey of houses in the west of Guangzhou finance city as an example. Geotech. Investig. Surv. 2019, 47, 55–58. [Google Scholar]
  6. Yu, M.; Zhou, F.; Xu, H.; Xu, S. Advancing Rural Building Extraction via Diverse Dataset Construction and Model Innovation with Attention and Context Learning. Appl. Sci. 2023, 13, 13149. [Google Scholar] [CrossRef]
  7. Huang, X.; Yuan, W.; Li, J.; Zhang, L. A new building extraction postprocessing framework for high-spatial-resolution remote-sensing imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 10, 654–668. [Google Scholar] [CrossRef]
  8. Guo, Z.; Du, S. Mining parameter information for building extraction and change detection with very high-resolution imagery and GIS data. GIScience Remote Sens. 2017, 54, 38–63. [Google Scholar] [CrossRef]
  9. Lee, D.H.; Lee, K.M.; Lee, S.U. Fusion of lidar and imagery for reliable building extraction. Photogramm. Eng. Remote Sens. 2008, 74, 215–225. [Google Scholar] [CrossRef]
  10. Toth, C.; Jóźków, G. Remote sensing platforms and sensors: A survey. ISPRS J. Photogramm. Remote Sens. 2016, 115, 22–36. [Google Scholar] [CrossRef]
  11. Ma, Y.; Wu, H.; Wang, L.; Huang, B.; Ranjan, R.; Zomaya, A.; Jie, W. Remote sensing big data computing: Challenges and opportunities. Future Gener. Comput. Syst. 2015, 51, 47–60. [Google Scholar] [CrossRef]
  12. Hoeser, T.; Kuenzer, C. Object detection and image segmentation with deep learning on earth observation data: A review-part i: Evolution and recent trends. Remote Sens. 2020, 12, 1667. [Google Scholar] [CrossRef]
  13. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  14. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  15. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  16. Zhao, W.; Du, S.; Wang, Q.; Emery, W.J. Contextually guided very-high-resolution imagery classification with semantic segments. ISPRS J. Photogramm. Remote Sens. 2017, 132, 48–60. [Google Scholar] [CrossRef]
  17. Hamaguchi, R.; Fujita, A.; Nemoto, K.; Imaizumi, T.; Hikosaka, S. Effective use of dilated convolutions for segmenting small object instances in remote sensing imagery. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1442–1450. [Google Scholar]
  18. Fu, G.; Liu, C.; Zhou, R.; Sun, T.; Zhang, Q. Classification for high resolution remote sensing imagery using a fully convolutional network. Remote Sens. 2017, 9, 498. [Google Scholar] [CrossRef]
  19. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  20. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Part III 18. pp. 234–241. [Google Scholar]
  21. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  22. Wang, R.; Jiang, H.; Li, Y. UPerNet with ConvNeXt for Semantic Segmentation. In Proceedings of the 2023 IEEE 3rd International Conference on Electronic Technology, Communication and Information (ICETCI), Changchun, China, 26–28 May 2023; pp. 764–769. [Google Scholar]
  23. Chen, P.; Lin, J.; Zhao, Q.; Zhou, L.; Yang, T.; Huang, X.; Wu, J. ADF-Net: An Attention-Guided Dual-Branch Fusion Network for Building Change Detection near the Shanghai Metro Line Using Sequences of TerraSAR-X Images. Remote Sens. 2024, 16, 1070. [Google Scholar] [CrossRef]
  24. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  25. Landabaso, J.; Pardas, M.; Xu, L. Shadow removal with blob-based morphological reconstruction for error correction. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’05), Philadelphia, PA, USA, 18–23 March 2005. [Google Scholar]
  26. Ashourloo, D.; Nematollahi, H.; Huete, A.; Aghighi, H.; Azadbakht, M.; Shahrabi, H.S.; Goodarzdashti, S. A new phenology-based method for mapping wheat and barley using time-series of Sentinel-2 images. Remote Sens. Environ. 2022, 280, 113206. [Google Scholar] [CrossRef]
  27. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 205–218. [Google Scholar]
  28. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 3146–3154. [Google Scholar]
  29. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  30. Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; Sun, J. Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 418–434. [Google Scholar]
  31. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  33. Guo, Q.; Wang, C.; Xiao, D.; Huang, Q. A novel multi-label pest image classifier using the modified Swin Transformer and soft binary cross entropy loss. Eng. Appl. Artif. Intell. 2023, 126, 107060. [Google Scholar] [CrossRef]
  34. Yuan, W.; Xu, W. MSST-Net: A multi-scale adaptive network for building extraction from remote sensing images based on swin transformer. Remote Sens. 2021, 13, 4743. [Google Scholar] [CrossRef]
Figure 1. Study area.
Figure 2. Sample distribution of the UAV high-precision remote sensing images in Deqin County.
Figure 3. Comparison of the window cropping effects of different image sizes.
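The window cropping compared in Figure 3 can be illustrated with a simple sliding-window tiling routine. The sketch below is only an illustrative NumPy example; the window sizes shown (256 and 512 pixels) and the function name are assumptions, not the exact preprocessing code used in the article.

```python
import numpy as np

def crop_windows(image, size, stride=None):
    """Crop an (H, W, C) image into square windows of the given size.

    A non-overlapping grid is used when stride is None; values are illustrative.
    """
    stride = stride or size
    h, w = image.shape[:2]
    tiles = []
    for top in range(0, h - size + 1, stride):
        for left in range(0, w - size + 1, stride):
            tiles.append(image[top:top + size, left:left + size])
    return tiles

# Compare two candidate window sizes, as in Figure 3 (sizes assumed here).
image = np.zeros((2048, 2048, 3), dtype=np.uint8)
tiles_256 = crop_windows(image, 256)
tiles_512 = crop_windows(image, 512)
```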
Figure 4. Data augmentation results: (a) vertical mirroring, (b) horizontal mirroring, (c) rotation 90°, (d) rotation 180°, and (e) rotation 270°.
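The five augmented views in Figure 4 correspond to standard mirroring and rotation transforms. The following NumPy sketch reproduces them for a single image tile; it is an illustrative example rather than the authors' preprocessing pipeline, and the same transform would also need to be applied to the corresponding label mask.

```python
import numpy as np

def augment(tile):
    """Return the five augmented views of an (H, W, C) tile shown in Figure 4."""
    return {
        "vertical_mirror": np.flipud(tile),    # (a) flip top-to-bottom
        "horizontal_mirror": np.fliplr(tile),  # (b) flip left-to-right
        "rot90": np.rot90(tile, k=1),          # (c) rotate 90 degrees
        "rot180": np.rot90(tile, k=2),         # (d) rotate 180 degrees
        "rot270": np.rot90(tile, k=3),         # (e) rotate 270 degrees
    }

tile = np.zeros((512, 512, 3), dtype=np.uint8)  # placeholder tile
views = augment(tile)
```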
Figure 5. Swin Transformer structure diagram with CBAM. (a) Swin Transformer, (b) Spatial attention module, (c) Channel attention module.
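Figure 5 couples the Swin Transformer backbone with channel and spatial attention modules in the style of CBAM. The PyTorch sketch below shows the standard formulation of these two modules as a minimal reference; it is not the exact implementation used in FF-SwinT, and the tensor shape in the usage example is assumed.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: global avg/max pooling followed by a shared two-layer MLP."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # squeeze spatial dims by averaging
        mx = self.mlp(x.amax(dim=(2, 3)))    # squeeze spatial dims by max
        weight = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * weight

class SpatialAttention(nn.Module):
    """Spatial attention: channel-wise avg/max maps fused by a 7x7 convolution."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        weight = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * weight

# CBAM applies channel attention first, then spatial attention, to a feature map.
feat = torch.randn(1, 96, 56, 56)            # shape assumed for illustration
out = SpatialAttention()(ChannelAttention(96)(feat))
```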
Figure 6. Swin Transformer structure diagram with the introduction of the UPerNet decoder. (a) Swin Transformer, (b) Spatial attention module, (c) Channel attention module, (d) PPM.
Figure 7. Image shadow processing results: (a) source image, (b) shadow extraction result, and (c) shadow processing result.
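The article's shadow mitigation integrates multiple spectral characteristics; as a deliberately simplified, generic illustration only (not the authors' method), the sketch below flags candidate shadow pixels in an RGB tile by combining low HSV brightness with relatively high saturation. The threshold values are assumptions.

```python
import numpy as np

def simple_shadow_mask(rgb, v_thresh=0.35, s_thresh=0.25):
    """Rough shadow mask for an (H, W, 3) float image scaled to [0, 1].

    Shadows tend to be dark (low HSV value) yet relatively saturated;
    the thresholds here are illustrative, not tuned values from the article.
    """
    v = rgb.max(axis=-1)                         # HSV value (brightness)
    s = (v - rgb.min(axis=-1)) / (v + 1e-10)     # HSV saturation
    return (v < v_thresh) & (s > s_thresh)       # boolean shadow mask
```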
Figure 8. Semantic segmentation deep learning algorithm settlement extraction results. (a) Integrality, (b) Complex shape, (c) Small settlement, (d) Large settlement, (e) Dense settlement, (f) Regular settlement, (g) Shielding construction.
Figure 9. Verification of the model’s universality.
Figure 10. Large-scale usability verification: (a) source image, (b) interpretation result, (c) interpretation result converted to a vector visualization, and (d) enlarged view of the interpretation result.
Table 1. Experimental environment.
Experiment Environment | Configuration Details
Operating System | Windows 10 Professional Workstation 64-bit
Central Processor (CPU) | Intel Core i7-10700, 3.9 GHz (16 cores)
Graphics Processor (GPU) | NVIDIA GeForce RTX 2080 Super
RAM | 64 GB
Programming Language | Python 3.8
Deep Learning Framework and Computing Platform | PyTorch 1.8.1 + CUDA 10.2
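As a quick reproducibility check of the configuration in Table 1, the following snippet (an illustrative sketch, not part of the original experiments) confirms the interpreter, PyTorch, and CUDA versions available at runtime.

```python
import sys

import torch

# Expected from Table 1: Python 3.8, PyTorch 1.8.1 built against CUDA 10.2.
print(f"Python:  {sys.version.split()[0]}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA build: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```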
Table 2. Average index results for the high-precision remote sensing image dataset of each model in alpine mountainous areas (%).
Segmentation Models | Backbone | MIoU | mAccuracy | F1-Score
FCN | ResNet-101 | 88.42 | 93.66 | 90.73
UNet | -- | 84.66 | 90.67 | 85.02
DANet | ResNet-101 | 87.05 | 92.75 | 89.14
Deeplabv3-plus | ResNet-101 | 88.35 | 93.78 | 91.11
UPerNet | ResNet-101 | 88.16 | 93.26 | 89.81
Swin Transformer | Swin Transformer | 91.16 | 95.06 | 94.91
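The indices reported in Tables 2 and 3 (IoU, accuracy, F1) are commonly computed from the confusion matrix; in the sketch below, per-class accuracy is taken as the recall of that class, and MIoU/mAccuracy as the mean over the settlement and background classes. This is an illustrative example of the standard definitions, not the evaluation code used in the article.

```python
import numpy as np

def class_metrics(pred, target, cls):
    """IoU, per-class accuracy (recall), and F1 for one class of an integer label map."""
    p, t = pred == cls, target == cls
    tp = np.logical_and(p, t).sum()
    fp = np.logical_and(p, ~t).sum()
    fn = np.logical_and(~p, t).sum()
    eps = 1e-10
    iou = tp / (tp + fp + fn + eps)
    recall = tp / (tp + fn + eps)                # per-class accuracy
    precision = tp / (tp + fp + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return {"IoU": iou, "Accuracy": recall, "F1": f1}

# MIoU / mAccuracy average the per-class values over settlement (1) and background (0).
pred = np.random.randint(0, 2, (512, 512))
target = np.random.randint(0, 2, (512, 512))
print(class_metrics(pred, target, cls=1))
```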
Table 3. Evaluation index results of each model on the settlement and background in the high-precision remote sensing image dataset of rural areas in alpine mountainous areas (%).
Segmentation Models | Class | IoU | Accuracy
FCN | Settlement | 81.68 | 89.76
FCN | Background | 95.16 | 97.56
UNet | Settlement | 75.72 | 84.05
UNet | Background | 93.61 | 97.29
DANet | Settlement | 79.54 | 94.57
DANet | Background | 88.17 | 97.32
Deeplabv3-plus | Settlement | 81.60 | 90.14
Deeplabv3-plus | Background | 95.11 | 97.42
UPerNet | Settlement | 81.24 | 88.83
UPerNet | Background | 95.08 | 97.70
Swin Transformer | Settlement | 86.97 | 92.21
Swin Transformer | Background | 95.35 | 97.92
Table 4. Average index results of the model ablation experiments (%).
Experiment Group | MIoU | mAccuracy | F1
Swin Transformer | 91.16 | 95.06 | 94.91
Swin Transformer + A | 91.30 | 95.89 | 95.98
Swin Transformer + A + B | 91.83 | 96.01 | 95.96
Swin Transformer + A + B + C | 92.66 | 96.22 | 96.17
FF-SwinT | 93.30 | 96.72 | 96.68
Table 5. Evaluation index results of the model ablation experiments for the settlement and background classes (%).
Experiment Group | Class | IoU | Accuracy
Swin Transformer | Settlement | 86.97 | 92.21
Swin Transformer | Background | 95.35 | 97.92
Swin Transformer + A | Settlement | 87.28 | 94.73
Swin Transformer + A | Background | 95.31 | 97.05
Swin Transformer + A + B | Settlement | 88.05 | 94.75
Swin Transformer + A + B | Background | 95.62 | 97.28
Swin Transformer + A + B + C | Settlement | 89.20 | 94.80
Swin Transformer + A + B + C | Background | 96.12 | 97.64
FF-SwinT | Settlement | 90.15 | 95.50
FF-SwinT | Background | 96.45 | 97.95
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
