Accurate Extraction of Rural Residential Buildings in Alpine Mountainous Areas by Combining Shadow Processing with FF-SwinT
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This manuscript focuses on a rural settlement extraction method for alpine areas based on the FF-SwinT model. The main contributions are as follows: since public datasets are ineffective for training deep learning models in rural alpine areas, the study proposes an image processing method that combines multiple features and improves the accuracy and robustness of the model by optimizing the Swin Transformer architecture. My suggestions are as follows:
- The abstract describes too many experimental results, which is not appropriate. The detailed results of the experiments should appear in the conclusion section.
- Swin Transformer is the base model used by the authors, and SAM, CAM, and PPM are the modules added by the authors; the necessity of adding these modules needs to be further explained.
- It is recommended to explain in detail the fused channel and spatial features and how these features enhance the model's ability to handle shaded regions.
- The comparison algorithms (FCN, UNet, DANet, DeepLabv3+, UperNet) used by the authors need to be cited as references.
- BCE loss and Dice loss are existing functions; please further explain the innovation.
- All symbols appearing in the formulas should be defined where they first appear. The authors use R, G, B in Formula 1, but the meaning of R, G, B is not introduced until below Formula 8; deferring all symbol explanations to the end is confusing.
- “We allocate 90% (27,729 samples) for training and 10% (3,081 samples) for model validation.” I think the ratio setting of 90% and 10% is not common. Why is it set that way?
- The authors constructed the dataset as one of the main research components, so please make it available on a public website.
- Reference 10 is cited twice in a row on the second page; this is not standard practice. In addition, many of the images in the manuscript are not vector graphics and appear blurred when enlarged, and the tables extend beyond the page margins.
- The language of the paper may need to be optimized to ensure logical clarity. Authors are advised to seek revisions from scholars whose native language is English.
Author Response
Comments 1: The abstract describes too many experimental results, which is not appropriate. The detailed results of the experiments should appear in the conclusion section.
Response 1: Thank you for your comments on our abstract. We have condensed the results in the abstract to make it more concise. The revised text reads:
The results show that FF-SwinT improves on the traditional Swin Transformer across multiple metrics, and the recognition results have clear edges and strong integrity.
Comments 2: Swin Transformer is the base model used by the authors, and SAM, CAM, and PPM are the modules added by the authors; the necessity of adding these modules needs to be further explained.
Response 2: Thank you for your valuable suggestions. Regarding your point that the necessity of the SAM, CAM and PPM modules should be further explained, we have analyzed this in depth and now add the following explanations:
In fine building extraction, accurately capturing feature information is critical, but the original Swin Transformer has limitations in this respect.
For CAM: when extracting fine buildings, the image contains a large amount of background information unrelated to buildings, and the buildings themselves span many categories; accurately distinguishing this category information is the basis of fine extraction. When processing features, the original Swin Transformer cannot adaptively highlight the channel features that play a key role in judging building categories. The CAM module integrates spatial information through average pooling and max pooling, then generates a channel attention feature vector through a shared neural network, which makes the network automatically focus on channels containing important category information and suppress irrelevant or secondary channels, so that the category attributes of buildings are identified more accurately, laying a foundation for subsequent extraction. The purpose of adding the CAM module is therefore to enhance the model's adaptive ability to capture category information at the channel level, so as to cope with extraction scenes featuring diverse building categories and complex backgrounds.
For SAM: buildings have specific spatial positions and morphological characteristics in the image, such as edges, outlines and texture distribution, which directly affect the accuracy and fineness of building extraction. When processing features, the original Swin Transformer does not pay enough attention to spatial location information, making it difficult to accurately locate the positions and boundaries of buildings. The SAM module compresses the multi-channel dimension into a single channel, computes the spatial relationships between features, and generates a spatial attention map, making the network focus on the spatial areas where buildings are located and strengthening the capture of key spatial features such as building edges and outlines. This is critical for distinguishing adjacent buildings and accurately outlining their boundaries, effectively improving the spatial accuracy of building extraction. The SAM module is therefore added to enhance the model's focus on spatial position information, addressing the complex spatial forms and high localization requirements of buildings.
In summary, the CAM and SAM modules compensate for the original Swin Transformer's weakness in capturing key features for fine building extraction along the channel and spatial dimensions, respectively. Working together, they achieve adaptive feature refinement and significantly improve the model's performance on this task, so adding these two modules is both necessary and reasonable.
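To make the two attention paths concrete, below is a minimal PyTorch sketch of CAM and SAM as described above. The class names, reduction ratio and kernel size are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CAM: squeeze spatial dims via average/max pooling, then re-weight channels."""
    def __init__(self, channels: int, reduction: int = 16):  # reduction ratio assumed
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):  # x: (B, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))   # average-pooled spatial descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))    # max-pooled spatial descriptor
        w = torch.sigmoid(avg + mx)[..., None, None]
        return x * w                          # highlight category-relevant channels

class SpatialAttention(nn.Module):
    """SAM: squeeze channels to a single map, then re-weight spatial positions."""
    def __init__(self, kernel_size: int = 7):  # kernel size assumed
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):  # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)     # (B, 1, H, W)
        mx = x.amax(dim=1, keepdim=True)      # (B, 1, H, W)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w                          # focus on building locations and edges
```

Applying ChannelAttention followed by SpatialAttention to a feature map reproduces the usual CBAM ordering of channel-then-spatial refinement.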
Swin Transformer has limited ability to aggregate global context information when processing features. Fine building extraction must attend not only to the local characteristics of a building but also to the global scene in which it sits; without effective integration of global context, the model tends to misclassify isolated local areas as buildings or to miss building details that are closely tied to the overall scene. The PPM module aggregates global context from different spatial extents by applying multi-scale pooling over the feature map, effectively associating local features with global scene features; this helps the model understand a building's role in the overall scene more accurately and reduces extraction errors caused by misjudging local features. At the same time, buildings exhibit significant multi-scale characteristics in the image, their features are expressed differently at different scales, and the multi-scale association of each sub-region is essential for extraction completeness. Swin Transformer integrates multi-scale features insufficiently and struggles to represent buildings of different scales simultaneously, which leads to incomplete extraction of large buildings and neglect of small ones. The PPM module generates feature sub-maps at different scales through multi-scale pooling and fuses them with the original features, fully capturing multi-scale feature associations from local to global; this lets the model attend to buildings of different scales at the same time, strengthens its adaptability to multi-scale buildings, and effectively improves the integrity and consistency of extraction, particularly for scenes with large scale differences between buildings.
In summary, by aggregating global context information and fusing multi-scale feature associations, the PPM module compensates for the original model's shortcomings in global scene understanding and multi-scale feature processing, and significantly improves the accuracy and integrity of fine building extraction; adding this module is therefore necessary.
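A minimal PyTorch sketch of such a pyramid pooling module follows; the bin sizes (1, 2, 3, 6) follow the common PSPNet setting and are an assumption here, as is the projection width:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """PPM: pool the feature map at several scales, project, upsample, concatenate."""
    def __init__(self, in_channels: int, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),  # regional-to-global context at bin size b
                nn.Conv2d(in_channels, in_channels // len(bins), 1, bias=False),
                nn.ReLU(inplace=True),
            )
            for b in bins
        ])

    def forward(self, x):  # x: (B, C, H, W)
        h, w = x.shape[2:]
        pooled = [
            F.interpolate(stage(x), size=(h, w), mode="bilinear", align_corners=False)
            for stage in self.stages
        ]
        return torch.cat([x] + pooled, dim=1)  # local features fused with multi-scale context
```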
Thank you again for your comments. We have added relevant contents in the revised manuscript.
Comments 3: It is recommended to explain in detail the fused channel and spatial features and how these features enhance the model's ability to handle shaded regions.
Response 3: Thank you for your valuable suggestions. Regarding your point that the fused channel and spatial features, and how they enhance the model's ability to handle shadowed areas, should be explained in detail, we have reviewed this carefully and now add the following explanation:
In the multi-feature fusion shadow processing method proposed in this paper, channel features and spatial features capture shadow characteristics from spectral attributes and spatial distribution, respectively. By computing the difference between the original RGB channels and their normalized ratios, the method exploits the proportional relationship between channels to capture the spectral stability of shadows: the chromaticity of a shadow area remains consistent under variations in solar radiation, and this inter-channel difference amplifies the contrast between shadowed and non-shadowed areas, yielding a distinct spectral feature. Based on the human eye's sensitivity to the RGB channels, the "low brightness" characteristic of shadows is quantified directly from per-channel luminance, which further strengthens the spectral difference at the channel level.

When K-means clusters pixels according to their spectral characteristics, pixels in shadow areas form spatially continuous clusters due to their spectral similarity. The clustering result therefore captures the spatial continuity of shadows and helps eliminate isolated low-brightness pixels that would be misclassified if only single-channel features were used. The green spectral signature of vegetation is captured by feature3, which locates the "speckled shadows" associated with vegetation; the spatial form of these shadows (scattered, small scale) differs markedly from the "continuous, elongated" form of building shadows, making this a spatial-form feature.

Together, these components combine channel and spatial characteristics to characterize shadows along the two dimensions of spectrum and space, compensating for the limitations of any single feature and improving shadow recognition accuracy. Single-channel features (e.g., brightness alone) tend to misclassify dark non-shadow objects such as dark roofs and water bodies as shadows. The channel features (feature1, feature2) lock onto the core shadow attributes of "low brightness and stable chromaticity" at the spectral level, while the spatial feature (the speckle morphology of feature3) excludes non-shadow interference such as vegetation speckle shadows through spatial continuity and morphological differences. Jointly they realize "spectral screening + spatial verification", which significantly improves shadow recognition accuracy.
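As a rough illustration of this "spectral screening + spatial verification" pipeline, the sketch below computes plausible stand-ins for feature1, feature2 and feature3 and clusters them with K-means. The paper's actual formulas (Eqs. 1-8), feature weights and the ω term are not reproduced here, so every expression in this sketch is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

def shadow_mask(img: np.ndarray) -> np.ndarray:
    """img: (H, W, 3) float RGB in [0, 1]; returns a boolean shadow mask."""
    R, G, B = img[..., 0], img[..., 1], img[..., 2]
    total = R + G + B + 1e-6

    # Channel feature: chromaticity (normalized channels) is stable under
    # illumination change, so a large gap between normalized and raw values
    # flags low-brightness pixels whose chromaticity is unchanged.
    norm = img / total[..., None]
    feature1 = (norm - img).max(axis=-1)

    # Brightness feature with perceptual channel weights (BT.601 weights; assumed).
    feature2 = 0.299 * R + 0.587 * G + 0.114 * B

    # Vegetation feature: green dominance locates speckled vegetation shadows.
    feature3 = np.minimum(G - R, G - B)

    # Spatial verification: cluster pixels so shadows form contiguous groups.
    stacked = np.stack([feature1, feature2, feature3], axis=-1).reshape(-1, 3)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(stacked)  # (H*W,)

    # Take the cluster with the lower mean brightness as the shadow class.
    darker = int(np.argmin([feature2.reshape(-1)[labels == k].mean() for k in (0, 1)]))
    return (labels == darker).reshape(img.shape[:2])
```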
We have added the relevant content to the revised manuscript. Thank you again for your guidance.
Comments 4: The comparison algorithms (FCN, UNet, DANet, DeepLabv3+, UperNet) used by the authors need to be cited as references.
Response 4: Thank you for your comment. In the revised manuscript, each comparison model is now introduced with a citation to the corresponding original work.
Comments 5: BCE loss and Dice loss are existing functions; please further explain the innovation.
Response 5: Thank you for your valuable suggestions. To address this, we have clarified the design rationale of the combined loss function used in this study and now explain its innovation from the following angles:
The innovation of this study is not a new single loss function, but the targeted selection and integration of existing loss function variants to meet the specific challenges of fine building extraction, forming a task-adapted collaborative optimization mechanism. This is embodied in the following two aspects:
(1) Targeted selection of loss function variants: matching the task's pain points.
BCE, the original Dice loss, and their variants already exist, but this study does not simply reuse them. Instead, specific variants of both losses were selected and optimized for the core difficulties of the building extraction task (class imbalance, blurred boundaries, and small buildings that are easily missed):
BalanCE differs from the original BCE and WCE: the β coefficient it adds on top of WCE dynamically adjusts the relative contribution of positive and negative losses, adapting more flexibly to scenes in which shadow areas cause large fluctuations in the proportion of negative (non-building) samples.
Squared Dice loss is not the original Dice loss either: it replaces the plain sums with sums of squares, which alleviates the vanishing gradient of the original Dice loss when predictions approach the ground truth and is better suited to fine optimization of building boundaries.
This kind of screening is not a random combination, but a "customized selection" of loss function variants based on task characteristics.
(2) Innovation of the fusion mechanism: collaborative design that compensates for complementary defects.
Most existing combined losses are a simple superposition of cross entropy and the original Dice loss, which does not address the inherent defects of either (the class-imbalance sensitivity of cross entropy and the gradient problem of Dice loss). In this study, the LBCE-Sdice combined loss achieves "defect complementarity" by design:
BalanCE's β coefficient compensates for Squared Dice's lack of weight adjustment, while Squared Dice's optimization of regional overlap compensates for BalanCE's weakness on boundary details. In addition, the β coefficient of BalanCE dynamically raises the loss weight of building pixels in ambiguous areas, preventing the model from misclassifying them merely because their values resemble negative samples, while Squared Dice's sensitivity to regional overlap forces the model to learn the spatial correlation between buildings and shadows, indirectly improving the accuracy of overall shape prediction.
This cooperative mechanism of "dynamic weight adjustment + regional morphological constraint" improves on the simple-superposition mode of existing combined losses and achieves an optimization effect greater than the sum of its parts.
To sum up, the innovation of this study is not the creation of a new loss function, but the construction of a more effective LBCE-Sdice combined loss based on an in-depth analysis of the difficulties of building extraction; it resolves the inherent defects of existing losses and of their simple combinations through customized selection of loss variants plus a complementary fusion mechanism. We have added the above analysis to the revised manuscript. Thank you again for your guidance.
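A PyTorch sketch of such a combined loss is shown below; the β value and the mixing weight λ are placeholders rather than the values used in the paper, and `pred` is assumed to hold sigmoid probabilities:

```python
import torch

def balan_ce(pred: torch.Tensor, target: torch.Tensor, beta: float = 0.7) -> torch.Tensor:
    """Balanced BCE: beta re-weights the positive vs. negative pixel losses."""
    eps = 1e-7
    pred = pred.clamp(eps, 1 - eps)
    pos = -beta * target * torch.log(pred)                   # building pixels
    neg = -(1 - beta) * (1 - target) * torch.log(1 - pred)   # background pixels
    return (pos + neg).mean()

def squared_dice(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Squared Dice: squared sums in the denominator keep gradients alive
    as predictions approach the ground truth."""
    eps = 1e-7
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / ((pred ** 2).sum() + (target ** 2).sum() + eps)

def lbce_sdice(pred, target, beta: float = 0.7, lam: float = 1.0):
    """Combined loss: class re-weighting plus regional-overlap constraint."""
    return balan_ce(pred, target, beta) + lam * squared_dice(pred, target)
```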
Comments 6: All symbols appearing in the formulas should be defined where they first appear. The authors use R, G, B in Formula 1, but the meaning of R, G, B is not introduced until below Formula 8; deferring all symbol explanations to the end is confusing.
Response 6: Thank you for your advice. In the revised manuscript, we have corrected this issue: the symbols are now defined immediately after the equation in which they first appear, which is more standard.
Comments 7: “We allocate 90% (27,729 samples) for training and 10% (3,081 samples) for model validation.” I think the ratio setting of 90% and 10% is not common. Why is it set that way?
Response 7: Thank you for your valuable question. This ratio is based on the adequacy of the overall sample size and the task's high demand for training data diversity. The total sample size in this study is 30,810, of which 3,081 form the validation set; this far exceeds the minimum validation-set requirement for small and medium-sized datasets. The 3,081 validation samples cover different areas, lighting conditions and levels of shadow complexity, so they can stably reflect the model's generalization to unseen data. Adopting the more common 8:2 ratio would further improve validation stability, but would remove about 3,000 training samples, potentially losing key sample information for a task that must learn fine-grained patterns such as complex contour features.
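For reference, a minimal sketch of such a 90/10 split; the stand-in dataset and the fixed random seed are assumptions:

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in dataset with the same size as the paper's 30,810 samples.
dataset = TensorDataset(torch.zeros(30_810, 1))
n_val = 3_081  # 10% held out for validation
train_set, val_set = random_split(
    dataset,
    [len(dataset) - n_val, n_val],
    generator=torch.Generator().manual_seed(42),  # fixed seed is an assumption
)
print(len(train_set), len(val_set))  # -> 27729 3081
```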
We have added the above analysis to the revised manuscript. Thank you again for your guidance.
Comments 8: The authors constructed the dataset as one of the main research components, so please make it available on the public website.
Response 8: Thank you for your advice. If the manuscript is accepted, we will make the dataset freely available on a public website.
Comments 9: Reference 10 is cited twice in a row on the second page; this is not standard practice. In addition, many of the images in the manuscript are not vector graphics and appear blurred when enlarged, and the tables extend beyond the page margins.
Response 9: Thank you for your comment. In the revised manuscript, we have corrected the references at the corresponding positions to make them more standard. Regarding the figures: high-resolution images were inserted in the manuscript, but they may have been compressed during conversion to PDF, lowering the resolution; the production team should be able to resolve this.
Comments 10: The language of the paper may need to be optimized to ensure logical clarity. Authors are advised to seek revisions from scholars whose native language is English.
Response 10: Thank you for your comment. After revising the manuscript, we invited a native English speaker to polish it so that it better conforms to English usage.
Reviewer 2 Report
Comments and Suggestions for Authors
Review Report for Manuscript ID: remotesensing-3677377
I would like to thank the authors for their valuable efforts in addressing rural feature extraction using UAV imagery. The manuscript attempts to apply an experimental approach to compare multiple deep learning models for feature extraction and accuracy assessment. However, several aspects require substantial improvement.
The methods section lacks sufficient detail regarding model architectures, training parameters, and validation strategies. Similarly, the results section needs further clarification, particularly with regard to performance metrics and uncertainty evaluation.
Additionally, the current title does not clearly reflect the manuscript's scope or content. I strongly recommend revising it to make it more concise, specific, and aligned with the study’s objectives.
Overall, extensive English language editing is necessary, and the references should be revised for consistency and relevance according to the MDPI style.
I recommend that the manuscript be reconsidered after major revisions.
Comments on the Quality of English Language
Extensive English language editing is necessary.
Author Response
Comments 1: The methods section lacks sufficient detail regarding model architectures, training parameters, and validation strategies. Similarly, the results section needs further clarification, particularly with regard to performance metrics and uncertainty evaluation.
Response 1: Thank you for your comments. In the revised manuscript, we have added more detail on dataset division, shadow processing, optimization methods, model construction and result analysis. The detailed revisions are as follows:
We allocate 90% (27,729 samples) for training and 10% (3,081 samples) for model validation. This ratio is based on the adequacy of the overall sample size and the task's high demand for training data diversity. The total sample size in this study is 30,810, of which 3,081 form the validation set; this far exceeds the minimum validation-set requirement for small and medium-sized datasets. The 3,081 validation samples cover different areas, lighting conditions and levels of shadow complexity, so they can stably reflect the model's generalization to unseen data. Adopting the more common 8:2 ratio would further improve validation stability, but would remove about 3,000 training samples, potentially losing key sample information for a task that must learn fine-grained patterns such as complex contour features.
Shadows cast by settlements in remote sensing images often weaken the optical features of those settlements, blur the boundaries between different ground features, and make object shapes difficult to distinguish. In this study, time-based features are introduced into the image processing workflow, and multiple feature fusion techniques are employed to mitigate the impact of shadows. Given that the majority of radiation energy in visible remote sensing images originates from sunlight, the chromaticity of the shadow regions is expected to align with that of the regions directly illuminated by sunlight. Chromaticity in both high-brightness and shadowed regions remains unaffected in the normalized color space. Therefore, the shadow feature can be differentiated by examining the contrast between the original color space and the normalized color space of the image. The corresponding formulas are presented in Eqs. (1), (2) and (3):
By calculating the difference between the original RGB channels and their normalized ratios, the method captures the spectral stability of shadows via the proportional relationship between channels: the chromaticity of a shadow area remains consistent under changes in solar radiation, and the inter-channel numerical difference amplifies the discrimination between this stable shadow region and non-shadow areas. This is a typical channel spectral feature.
On this basis, considering the varying sensitivity of the human eye to red, green, and blue light, a weighted method is employed to calculate the brightness and extract the color-based shadow distinguishing features (Eq. 4). K-means clustering is then applied to combine the pixel matrix weight ω in the shadow region with the color shadow feature (feature2), resulting in a comprehensive brightness parameter. This parameter incorporates both human eye perception of shadows and spectral clustering features (Eq. 5).
Based on the human eye's sensitivity weights for the RGB channels, the "low brightness" characteristic of shadows is quantified directly from the channel brightness dimension, further strengthening the spectral difference at the channel level. When the K-means algorithm clusters pixels by their spectral characteristics, pixels in shadow areas form spatially continuous clusters due to spectral similarity. The clustering result therefore captures the spatial continuity of shadows and avoids the isolated low-brightness points that single-channel features alone may misclassify.
Additionally, to eliminate noise, reduce computational complexity, and enhance the accuracy of settlement edge recognition, it is essential to remove the speckled shadows produced by vegetation surrounding the settlement. Since vegetation typically appears green in the visible spectrum, the minimum difference between the green band and the red and blue bands is utilized as an identification feature, as illustrated in Eq. 6.
Eq. 6 captures the green spectral characteristics of vegetated regions and thereby locates the "speckled shadows" associated with vegetation. The spatial form of these shadows (scattered, small scale) differs significantly from the "continuous, elongated" form of building shadows, making this a spatial-form feature.
Based on the features outlined above, the final representation of the shadow features is constructed by assigning feature weights, as shown in Eq. 7. The discrimination of shadows is then performed as described in Eq. 8.
This method combines channel characteristics with spatial characteristics, characterizing shadows along the two dimensions of spectrum and space; it compensates for the limitations of any single feature and improves shadow recognition accuracy. Single-channel features (e.g., brightness alone) tend to misclassify dark non-shadow objects (such as dark roofs and water bodies) as shadows. The channel features (feature1, feature2) lock onto the core shadow attributes of "low brightness and stable chromaticity" at the spectral level, while the spatial feature (the speckle morphology of feature3) excludes non-shadow interference (such as vegetation speckle shadows) through spatial continuity and morphological differences. Together they realize "spectral screening + spatial verification", which significantly improves shadow recognition accuracy.
Considering that BalanCE adjusts weights through β, it resolves the class-imbalance issue that Squared Dice cannot effectively address; conversely, Squared Dice's strength in segmentation-region overlap and boundary details compensates for BalanCE's limitations in optimizing region overlap. Building on this, and motivated by the shortcomings of existing combined loss functions, we propose a new combined loss, LBCE-Sdice, to resolve class imbalance while improving the model's accuracy in predicting region shapes. This does not introduce a new single loss function; rather, for the task of fine building extraction, variants of existing loss functions are selected and integrated, forming a task-adapted collaborative optimization mechanism. BalanCE differs from the original BCE and WCE: the β coefficient it adds on top of WCE dynamically adjusts the relative contribution of positive and negative losses, adapting more flexibly to scenes in which shadow areas cause large fluctuations in the proportion of negative (non-building) samples. Squared Dice loss is not the original Dice loss either: it replaces the plain sums with sums of squares, which alleviates the vanishing gradient of the original Dice loss when predictions approach the ground truth and is better suited to fine optimization of building boundaries. BalanCE's β coefficient compensates for Squared Dice's lack of weight adjustment, while Squared Dice's optimization of regional overlap compensates for BalanCE's weakness on boundary details, so LBCE-Sdice achieves "defect complementarity" by design. In addition, the β coefficient of BalanCE dynamically raises the loss weight of building pixels in ambiguous areas, preventing misclassification when pixel values resemble negative samples, while Squared Dice's sensitivity to regional overlap forces the model to learn the spatial correlation between buildings and shadows, indirectly improving the accuracy of overall shape prediction.
As shown in Fig. 5, the CBAM module comprises two sub-modules: the channel attention module (CAM) and the spatial attention module (SAM). CAM keeps the channel dimension unchanged and compresses the spatial dimensions to scalars, so that the network can focus on the category information in the image. It uses two pooling operations, average pooling and max pooling, to integrate the spatial information of the feature map into two spatial-context descriptors, which are fed into a shared neural network with a hidden layer; the final channel attention feature vector is obtained by element-wise summation. In fine building extraction, the image contains a large amount of background information unrelated to buildings, and the buildings themselves span many categories; distinguishing this category information accurately is the basis of fine extraction. The original Swin Transformer cannot adaptively highlight the channel features that play a key role in judging building categories. By integrating spatial information through average and max pooling and generating channel attention vectors, the CAM module makes the network automatically focus on channels containing important category information and suppress irrelevant or secondary channels, identifying building category attributes more accurately and laying a foundation for subsequent extraction. The CAM module is therefore added to enhance the model's adaptive ability to capture category information at the channel level, coping with scenes of diverse building categories and complex backgrounds.

SAM preserves the spatial dimensions while reducing the channel depth, computing inter-feature relationships to produce spatial attention maps. This dual-path approach employs max and average pooling for context aggregation, followed by convolutional fusion. The final refinement stage performs cross-dimensional attention fusion, applying multiplicative feature weighting at both intermediate and final layers for adaptive enhancement. Buildings have specific spatial positions and morphological characteristics in the image, such as edges, outlines and texture distribution, which directly affect the accuracy and fineness of building extraction. The original Swin Transformer does not pay enough attention to spatial location information, making it difficult to accurately locate building positions and boundaries. The SAM module compresses the multi-channel dimension into a single channel, computes the spatial relationships between features, and generates a spatial attention map, making the network focus on the spatial areas where buildings are located and strengthening the capture of key spatial features such as building edges and outlines. This is critical for distinguishing adjacent buildings and accurately outlining their boundaries, effectively improving the spatial accuracy of building extraction.
Therefore, the SAM module is added to enhance the model's focus on spatial position information, addressing the complex spatial forms and high localization requirements of buildings.
Swin Transformer has limited ability to aggregate global context information when processing features. Fine building extraction must attend not only to the local characteristics of a building but also to the global scene in which it sits; without effective integration of global context, the model tends to misclassify isolated local areas as buildings or to miss building details closely tied to the overall scene. The PPM module aggregates global context from different spatial extents by applying multi-scale pooling over the feature map, effectively associating local features with global scene features; this helps the model understand a building's role in the overall scene more accurately and reduces extraction errors caused by misjudging local features. At the same time, buildings exhibit significant multi-scale characteristics in the image, their features are expressed differently at different scales, and the multi-scale association of each sub-region is essential for extraction completeness. Swin Transformer integrates multi-scale features insufficiently and struggles to represent buildings of different scales simultaneously, which leads to incomplete extraction of large buildings and neglect of small ones. The PPM module generates feature sub-maps at different scales through multi-scale pooling and fuses them with the original features, fully capturing multi-scale feature associations from local to global; this lets the model attend to buildings of different scales at the same time, strengthens its adaptability to multi-scale buildings, and effectively improves the integrity and consistency of extraction, particularly for scenes with large scale differences between buildings.
This research uses a dataset split of 27,729 training samples and 3,081 validation samples. Six semantic segmentation models for settlement extraction are constructed: FCN, UNet, DANet, DeepLabv3+, UperNet and Swin Transformer. All models are trained for 480,000 iterations, at which point the accuracy metrics have converged. As can be seen from Tab. 4, multi-feature-fusion shadow processing of the remote sensing images, the combined loss function, and adaptive cross-dimensional attention feature fusion each improve the accuracy of the Swin Transformer in turn, and the accuracy of the FF-SwinT model is significantly higher than that of the Swin Transformer.
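For illustration, a generic iteration-based training loop matching this setup might look as follows; the optimizer, learning rate and batch size are assumptions, since the paper's hyperparameters are not reproduced here:

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_set, loss_fn, max_iters=480_000, batch_size=8, lr=6e-5):
    """Iteration-based training loop; optimizer choice, lr and batch size assumed."""
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, drop_last=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    done = 0
    while done < max_iters:
        for images, masks in loader:              # repeated epochs over the training set
            probs = torch.sigmoid(model(images))  # binary building-vs-background map
            loss = loss_fn(probs, masks)          # e.g. the LBCE-Sdice combined loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            done += 1
            if done >= max_iters:
                return
```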
The FF-SwinT model is built around problems common to remote sensing building extraction (blurred features, scale differences, etc.), giving it a foundation for cross-landscape transfer. For example, in desert or tropical environments, fine-tuning of the interference-feature extraction, loss parameters and data augmentation would be needed to address environment-specific issues such as sand and dust blur in deserts, or sparse samples and complex shadows in the tropics. In principle this can improve the detection rate of small-scale targets in deserts and the accuracy of vegetation-occluded boundaries in the tropics. Although extreme scenes remain a limitation, transfer can be realized through a "universal framework + scene-specific fine-tuning" strategy.
Comments 2: Additionally, the current title does not clearly reflect the manuscript's scope or content. I strongly recommend revising it to make it more concise, specific, and aligned with the study’s objectives.
Response 2: Thank you for your comment. We have changed the title of the paper to "Accurate Extraction of Rural Residential Buildings in Alpine Mountainous Areas by Combining Shadow Processing with FF-SwinT".
Comments 3: Overall, extensive English language editing is necessary, and the references should be revised for consistency and relevance according to the MDPI style.
Response 3: Thank you for your comment. After revising the manuscript, we invited a native English speaker to polish it so that it better conforms to English usage.
Reviewer 3 Report
Comments and Suggestions for Authors
Accurate and fast detection of rural buildings in alpine mountainous regions is crucial for spatial planning, rural development and the creation of geographic data. However, existing deep learning models show limited effectiveness in this specific environment, owing in part to insufficient datasets, blurred object edges and imperfect algorithm structures. In response to these challenges, a dedicated dataset based on high-resolution drone (UAV) images was developed in this work and a novel shadow reduction technique was applied, using multispectral image features. Based on a comparative analysis of different deep learning models, the Swin Transformer architecture was selected and then extended into the proposed FF-SwinT (Feature Fusion Swin Transformer) model, introducing improvements in data processing, the loss function and multi-view feature integration.
I have a few comments and questions:
Fig. 7: What are the differences between panels (a) and (c)? What do the blue areas mean?
What percentage of shadow is reduced in your experiment?
Table 2: There is no information about the number of iterations used during processing.
In my opinion, increasing the accuracy from 95% to 96% is neither significant nor necessarily repeatable.
Additional questions that may justify the method:
1. What is the innovation of the proposed FF-SwinT model compared to traditional methods?
2. What are the distinctive features of the drone (UAV) images used in the study?
3. What spectral features are used in this technique and how do they improve extraction quality?
4. Can the model be adapted to other types of landscapes, e.g. desert or tropical? What will be the difference in training and results?
Author Response
Comments 1: Fig. 7: What are the differences between panels (a) and (c)? What do the blue areas mean?
Response 1: Thank you for your question. Fig. 7(a) and 7(c) show the image before and after shadow processing, respectively, and the blue areas in Fig. 7(b) mark the shadow extraction result. Careful observation shows that the regions of Fig. 7(a) marked as shadow in Fig. 7(b) exhibit a marked improvement in quality in Fig. 7(c).
This has been clarified at the corresponding position in the revised manuscript.
Comments 2: What percentage of shadow is reduced in your experiment?
Response 2: Thank you for your question about the details of our manuscript. Shadows are common in building extraction tasks, but this study focuses on buildings in rural settlements in alpine mountainous areas, so although shadows are present, their proportion is not high. Unfortunately, we did not count the relevant pixels during shadow processing, so we cannot give a detailed percentage.
Comments 3: Table 2: There is no information about the number of iterations used during processing.
Response 3: Thank you for your comment. All models were trained for 480,000 iterations, at which point the accuracy metrics had converged. In the revised manuscript, this is now noted in the text above Tab. 2.
Comments 4: In my opinion, increasing the accuracy from 95% to 96% is neither significant nor necessarily repeatable.
Response 4: Thank you for your insight. Rural residential areas in alpine mountains are distinctive: the buildings are mostly low, scattered adobe or wooden houses, often affected by snow and ice, terrain shadows, and vegetation cover (such as alpine meadows), which makes missed detections and misclassifications the key difficulty. In such scenes, the improvement from 95% to 96% accuracy is not a simple numerical increase but a targeted breakthrough on these key cases.
Addressing the feature blur caused by terrain shadows, our multi-feature fusion shadow processing module contributed 0.63% of the accuracy improvement in the ablation experiment, showing that identifying building edges in shadowed areas, a weakness of the original model, has been improved. To address the missed detection of small, scattered residential areas, the LBCE-Sdice combined loss strengthens attention to small-sample regions through dynamic weighting and alone contributed 0.12% of the accuracy improvement, countering the original model's tendency to ignore small targets with a low pixel proportion. Likewise, for insufficient multi-scale feature fusion, adding PPM contributed a further 0.11%. The roughly 1% improvement is thus the combined result of several improvement modules, each designed for a specific pain point of the task; the gains have clear technical support and are not random fluctuation.
Comments 5: What is the innovation of the proposed FF-SwinT model compared to traditional methods?
Response 5: Thank you for your question. The innovation of the FF-SwinT model lies in a whole-process framework of "feature enhancement, context modeling, interference suppression, loss optimization", targeting the core pain points of remote sensing building extraction (especially rural residential areas in alpine mountains). Through cross-dimensional attention coordination (combining CBAM with Swin self-attention), it overcomes the one-sidedness of traditional feature extraction and enhances both channel and spatial features. Through multi-scale global context aggregation (PPM with hierarchical feature fusion), it overcomes the poor scale adaptability of traditional models, handling both large buildings and small scattered settlements. Its multi-feature-fusion shadow processing mechanism accurately distinguishes and suppresses shadow interference, breaking the robustness bottleneck of single-feature approaches. Finally, the LBCE-Sdice combined loss function realizes collaborative optimization of class balance and regional morphological constraints, overcoming the single-constraint limitation of traditional losses. Together these form a model adapted to complex scenes that significantly surpasses traditional methods in feature capture, interference handling and recognition quality.
We have also added this summary of the innovations at the corresponding position in the revised manuscript.
Comments 6: What are the distinctive features of the drone (UAV) images used in the study?
Response 6: Thank you for your question about the details of our manuscript. The main distinctive feature of the UAV data used in this study is its resolution, which at 0.05 m is higher than that of ordinary remote sensing images; otherwise it does not differ fundamentally from large-scale remote sensing image data. This also reflects the generality of the research method: it is not tied to any particular type of UAV imagery.
Comments 7: What spectral features are used in this technique and how do they improve extraction quality?
Response 7: Thank you for your question. In this study, UAV images were used to extract rural residential areas in alpine mountainous areas, using the R, G and B bands. In the proposed multi-feature fusion shadow processing method, shadow characteristics are captured from spectral attributes through channel features. By calculating the difference between the original RGB channels and their normalized ratios, the method captures the spectral stability of shadows via the proportional relationship between channels: the chromaticity of a shadow area remains consistent under changes in solar radiation, and the inter-channel numerical difference amplifies the discrimination between shadow and non-shadow areas, a typical channel spectral feature. Based on the human eye's sensitivity weights for the RGB channels, the "low brightness" characteristic of shadows is quantified directly from the channel brightness dimension, further strengthening the spectral difference at the channel level. The FF-SwinT model is likewise applied to the shadow-processed images on the basis of the RGB bands. The relevant content has been added at the corresponding position in the revised manuscript.
Comments 8: Can the model be adapted to other types of landscapes, e.g. desert or tropical? What will be the difference in training and results?
Response 8: Thank you for your question. The FF-SwinT model is built around problems common to remote sensing building extraction (blurred features, scale differences, etc.), giving it a foundation for cross-landscape transfer. For example, in desert or tropical environments, fine-tuning of the interference-feature extraction, loss parameters and data augmentation would be needed to address environment-specific issues such as sand and dust blur in deserts, or sparse samples and complex shadows in the tropics. In principle this can improve the detection rate of small-scale targets in deserts and the accuracy of vegetation-occluded boundaries in the tropics. Although extreme scenes remain a limitation, transfer can be realized through a "universal framework + scene-specific fine-tuning" strategy.
The above content has been added to the Discussion section.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
I have no more comments.
Reviewer 2 Report
Comments and Suggestions for Authors
Review Report for the Manuscript: Extraction of Rural Settlements in Alpine Mountainous Areas Based on the FF-SwinT Model
I would like to thank the authors for their great efforts in preparing this manuscript. It is evident that the authors have revised the manuscript and addressed most of the previously raised comments and suggestions.
At this stage, I recommend the manuscript be accepted for publication in its current form.