Peer-Review Record

Fine-Tuning-Based Transfer Learning for Building Extraction from Off-Nadir Remote Sensing Images

Remote Sens. 2025, 17(7), 1251; https://doi.org/10.3390/rs17071251

by Bipul Neupane^1,2,*, Jagannath Aryal^1,2 and Abbas Rajabifard^2

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Reviewer 3: Anonymous

Reviewer 4: Anonymous

Submission received: 23 February 2025 / Revised: 26 March 2025 / Accepted: 31 March 2025 / Published: 1 April 2025

(This article belongs to the Special Issue Applications of AI and Remote Sensing in Urban Systems II)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The authors present a manuscript titled "Transfer learning for building extraction from off-nadir remote sensing images". The off-nadir angle is an important parameter of optical image quality, and it influences image resolution and object identification. The authors propose a so-called fine-tuning-based transfer learning (FTL) method to improve segmentation accuracy in off-nadir images. The manuscript is well structured and does not need serious corrections.

Detailed comments:

Fig. 1.- Please label the images alphabetically and add a scale. Please clarify what the blue dashed and solid lines represent.

L. 105- Please clarify which gaps are meant.

Fig. 2.- Should be "samples...are provided".

Fig. 3(b).- Should be "samples of...".

Fig. 4(b, c).- Should be "an example of...", please correct.

L. 238- What is "fro"?

Ls. 268-269- Please add introductory text to section 2.4. The same applies to section 2.5 (L. 298), sections 3 and 3.1 (Ls. 336-337), and section 3.2 (L. 380).

L. 384- Please place Fig. 5 in the section where it is cited.

L. 404- Should be "Analysis of training time and computational efficiency". The same applies to Fig. 6.

Fig. 5.- Please label the radar charts alphabetically, both in the graphic and in the caption. Please also double-check the current caption, which is confusing.

L. 466- Please clarify this further.

Fig. 10.- "Result samples" is incorrect; please improve the English grammar. As written, it means that the "result takes or tries a small amount of something". Another remark: a "box" is cubical, whereas the authors used rectangles in the graphic; please improve the caption.

L. 698- Please double-check the References and provide a current access date.

Author Response

Thank you for your valuable comments. Please find the attached response document in PDF format.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

More relevant references from 2023-2024 should be added and considered.
Please check the references; some of them have formatting issues, such as a missing volume and issue number.
The source of the manual annotation could be mentioned.

Comments on the Quality of English Language

no comments

Author Response

Thank you for your valuable comments. Please find the attached response document in PDF format.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This paper proposes a Fine-Tuning-based Transfer Learning (FTL) method, aiming to address the label misalignment problem and improve the accuracy of building extraction. FTL pretrains the model on a large-scale noisy dataset and then fine-tunes it on a small clean dataset to adapt to the target dataset, thereby reducing the impact of label misalignment. The work is meaningful, and the experiments are well-conducted. My detailed comments are as follows:

1. In addition to comparisons with existing methods, more ablation studies could be added to further verify the contributions of each component in FTL. For example, the effects of using only intra-class variance weighting (without data augmentation) and using only data augmentation (without intra-class variance weighting) could be explored.
2. When describing the FTL method, more technical details could be provided to help readers better understand its implementation. For example, how is the approximation of intra-class variance calculated (specific implementation of EMA)? What are the specific parameter settings for data augmentation (e.g., the proportion of random erasing, the range of color jittering)? For key mathematical formulas (such as the calculation of intra-class variance and the definition of the loss function), more explanations could be added to clarify the physical meaning of each variable and the derivation process of the formulas.
3. Although FTL performs well on multiple datasets, its potential limitations could be discussed. For example, how does FTL perform on extremely low-quality images (such as those with severe occlusion or extreme lighting conditions), and is there room for improvement?
4. More visual results could be added to demonstrate FTL's performance on images of different qualities. For example, the distribution of pseudo-labels on low-quality and high-quality images could be shown, as well as how FTL corrects these pseudo-labels.
5. When summarizing the contributions, the aspects in which your method outperforms existing methods could be more explicitly pointed out. For example, the performance improvement percentage of FTL on specific datasets could be mentioned quantitatively.

6. More recent references should be cited and reviewed.
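
The two-stage scheme summarized at the start of this report (pretraining on a large noisy dataset, then fine-tuning on a small clean dataset) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: it uses a three-feature logistic regression in NumPy instead of a segmentation network, random label flips as a stand-in for misaligned labels, and hypothetical learning rates and epoch counts. It is not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(w, X, y, lr, epochs):
    """Plain gradient descent on the logistic loss; returns updated weights."""
    for _ in range(epochs):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / len(y)
        w = w - lr * grad
    return w

def accuracy(w, X, y):
    """Fraction of samples whose thresholded prediction matches the label."""
    return float(np.mean((sigmoid(X @ w) > 0.5) == (y > 0.5)))

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])  # hypothetical ground-truth model

# Large-noisy set: many samples, 30% of labels flipped (toy stand-in for label noise).
X_noisy = rng.normal(size=(5000, 3))
y_noisy = (X_noisy @ true_w > 0).astype(float)
flip = rng.random(5000) < 0.3
y_noisy[flip] = 1.0 - y_noisy[flip]

# Small-clean set: few samples, correct labels.
X_clean = rng.normal(size=(100, 3))
y_clean = (X_clean @ true_w > 0).astype(float)

# Stage 1: pretrain from scratch on the large-noisy set.
w_pre = train(np.zeros(3), X_noisy, y_noisy, lr=0.5, epochs=200)

# Stage 2: fine-tune the pretrained weights on the small-clean set,
# starting from w_pre and using a smaller learning rate.
w_ft = train(w_pre, X_clean, y_clean, lr=0.05, epochs=200)

print("pretrained accuracy on clean set:", accuracy(w_pre, X_clean, y_clean))
print("fine-tuned accuracy on clean set:", accuracy(w_ft, X_clean, y_clean))
```

The point of the sketch is only the control flow: the second call to `train` starts from the pretrained weights rather than from zeros, which is what distinguishes fine-tuning from training on the small clean set alone.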

Author Response

Thank you for your valuable comments. Please find the attached response document in PDF format.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

Dear authors, thank you for submitting your article. I find it interesting, because I work in the field of photogrammetry and remote sensing.

I have several remarks or questions:

Transfer learning for building extraction from off-nadir remote sensing images

Why transfer learning? And I think it is not building extraction, but building footprint extraction.

Introduction

You write about detecting building footprints on aerial or satellite images. Nowhere is there any mention of the reason for the displacement of the footprints. This should be mentioned at the beginning and not immediately citing deep learning, although I understand that it is modern and in almost every paper. Photogrammetry has been dealing with the issue of footprint displacements for a very long time and has produced, for example, true orthophoto output.
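
The geometric cause of the displacement mentioned here can be made concrete with a small sketch. It assumes flat ground, a vertical building, and locally parallel viewing rays, so that the roof appears shifted from the footprint by roughly the building height times the tangent of the viewing angle measured from the vertical at the building. The function names and the sample heights and angles are hypothetical and are not taken from the manuscript.

```python
import math

def roof_offset_m(building_height_m: float, view_angle_deg: float) -> float:
    """Horizontal roof-to-footprint offset on flat ground, in metres.

    view_angle_deg is the viewing angle from the vertical at the building
    (0 means looking straight down).
    """
    if not 0.0 <= view_angle_deg < 90.0:
        raise ValueError("view angle must be in [0, 90) degrees")
    return building_height_m * math.tan(math.radians(view_angle_deg))

def offset_in_pixels(building_height_m: float, view_angle_deg: float, gsd_m: float) -> float:
    """The same offset expressed in image pixels for a given ground sampling distance."""
    if gsd_m <= 0:
        raise ValueError("GSD must be positive")
    return roof_offset_m(building_height_m, view_angle_deg) / gsd_m

# Hypothetical 50 m building viewed at several angles from the vertical.
for angle in (0, 10, 25, 40):
    print(f"{angle:>2} deg: roof shifted by {roof_offset_m(50.0, angle):.1f} m")
```

Under these assumptions the offset grows with both building height and viewing angle, which is why roof outlines and true footprints diverge most for tall buildings in strongly off-nadir images.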

row 59 "The accuracy on the BONAI dataset...": define the dataset; what is BONAI?

row 65 ...other studies focus on distilling supervision from label noise in Teacher-Student learning framework. Teacher-Student learning to transfer the knowledge from a Teacher previously trained on large-noisy data (with misalignment) to a Student ‘distilled’ on small-clean data...

this is really hard to understand. You have to explain what you mean by distillation and what a teacher-student is. I can't understand it from your text. Teachers and students are in schools, and you seem to mean some kind of data-driven learning process here. Explain.

What do you mean by "Supervision from label noise"? Explain.

Fig. 2 "A sample of large-noisy and small-clean datasets." What is meant by this? Where is the noise? Do you mean the shift in the position of an object due to its spatial structure and central projection, which is typical in aerial photogrammetry (and occurs in VHR satellite data too, although there it is not a typical central projection, only within the rows of scanned data)?

Fig. 11 "The sample image, ground truth (GT) labels, and segmentation outputs highlight challenges posed by urban scenes with tall buildings." Should the ground truth be a precise footprint of the building? I see a shifted footprint. A real footprint is the intersection of the building with the ground.

I don't quite understand what you're looking for. If it's supposed to be a real intersection of the building with the ground, I don't see much there.

Conclusion

"These findings highlight the potential of FTL as a robust method for accurate building footprint extraction, particularly in off-nadir urban environments" ??? Suggested wording: "particularly in off-nadir images of urban environments".

References: please add relevant information such as DOIs, where they exist.

38.Iakubovskii, P. Segmentation Models Pytorch, 2019. accessed on 22-07-2023.

27. Ahn, S.; Kim, S.; Ko, J.; Yun, S.Y. Fine tuning pre trained models for robustness under noisy labels. arXiv preprint arXiv:2310.17668 2023.
28. Mnih, V. Machine Learning for Aerial Image Labeling. PhD thesis, University of Toronto, 2013. Accessed on: 12-06-2023.

...insufficient info, give the web address

Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.

I think it is not a relevant reference.

Author Response

Thank you for your valuable comments. Please find the attached response document in PDF format.

Author Response File: Author Response.pdf

Round 2

Reviewer 3 Report

Comments and Suggestions for Authors

Thank you to the authors for the revisions made to the paper based on the previous round of review comments. I have no other new comments.

Author Response

Comment: Thank you to the authors for the revisions made to the paper based on the previous round of review comments. I have no other new comments.

Response: Thank you for confirming that our responses met your expectations.

Reviewer 4 Report

Comments and Suggestions for Authors

Dear authors, thank you for submitting the revised version of your article.

Thank you for your explanatory notes and editing of the text. Sometimes it is hard to understand the goals of research, artificial intelligence and neural networks are used everywhere, but not always to the benefit of the cause.

You write: 2.4. Comparison Methods
The proposed FTL method is compared to existing methods of knowledge distillation and deep mutual learning from [24]. The two methods are described here.

I think you need to compare the results with something else, preferably a more accurate measurement and especially a different method. You are comparing neural networks and their results, but you should compare the neural network result with something accurate, e.g. a cadastral map or a direct geodetic measurement.

You write in conclusion:

The analysis of spatial resolution further highlighted that 60 cm imagery provided the best balance for building extraction, whereas 30 cm images introduced excessive noise for small buildings, and 120 cm images were more suitable for skyscrapers.

???

Do you mean 30, 60 and 120 cm GSD, i.e. pixel size?
Please edit this throughout the text.
And why do images with 30 cm GSD have more noise? Explain that.
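
As a reading aid for the GSD values discussed above, the following sketch shows how many pixels a building footprint covers at each ground sampling distance. It does not answer the reviewer's question about noise. It only illustrates the arithmetic, and the 100 m² footprint is a hypothetical value chosen for the example.

```python
def pixels_per_footprint(footprint_area_m2: float, gsd_m: float) -> float:
    """Approximate number of pixels covering a footprint at a given GSD.

    Each pixel covers gsd_m * gsd_m square metres on the ground.
    """
    if gsd_m <= 0:
        raise ValueError("GSD must be positive")
    return footprint_area_m2 / (gsd_m ** 2)

# Hypothetical 100 m^2 footprint at the three GSD values named in the conclusion.
for gsd_cm in (30, 60, 120):
    n = pixels_per_footprint(100.0, gsd_cm / 100.0)
    print(f"GSD {gsd_cm} cm: about {n:.0f} pixels")
```

Doubling the GSD divides the pixel count by four, so the same building is represented by far fewer pixels at 120 cm than at 30 cm.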

Author Response

Thank you for your comments. Please find the response in the attached document.

Author Response File: Author Response.pdf

Round 3

Reviewer 4 Report

Comments and Suggestions for Authors

Dear authors, thank you for your reply and for updating and improving your contribution.

I have no more questions on your research, only a remark: you should draw more on photogrammetry and not just use artificial intelligence all the time. The mathematical basics and geometry are still valid.
