Article
Peer-Review Record

Cross-Domain Land Surface Temperature Retrieval via Strategic Fine-Tuning-Based Transfer Learning: Application to GF5-02 VIMI Imagery

Remote Sens. 2025, 17(23), 3803; https://doi.org/10.3390/rs17233803 (registering DOI)
by Peyman Heidarian 1,2,3, Hua Li 1,*, Zelin Zhang 1, Yumin Tan 2,3, Feng Zhao 4, Biao Cao 5, Yongming Du 1 and Qinhuo Liu 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 26 September 2025 / Revised: 18 November 2025 / Accepted: 20 November 2025 / Published: 23 November 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The manuscript introduces a three-stage strategic fine-tuning transfer-learning (SFTL) framework for LST estimation from GF5-02 VIMI and validates the effectiveness of the method by integrating large-scale simulation and satellite imagery. Although the experimental results have been analyzed in detail, there are some comments for improvement, especially regarding the experimental analysis.

  1. In this work, five distinct machine learning models (TrF, CNN, DNN, RF and LGBM) were selected to offer comparative insights. Why did the authors choose these methods? Are there any other, newer methods for retrieving LST?
  2. From Figure 5 and Table 6, the RMSE increased from fine-tuning to generalization. Has there been any attempt to use a portion of the data from both regions for fine-tuning, with the remaining data used to evaluate generalization?
  3. The main research areas are Huailai and Heihe. Have other validation sites been considered, such as SURFRAD, ICOS, TERN and BSRN?
  4. Some figures in this manuscript are blurry; their quality should be improved, e.g., Figure 1.
  5. The caption of Figure 1 and the figure itself should be on the same page.
  6. It is recommended to highlight the innovations in the abstract.

 

Author Response

Reviewer 1:

The manuscript introduces a three-stage strategic fine-tuning transfer-learning (SFTL) framework for LST estimation from GF5-02 VIMI and validates the effectiveness of the method by integrating large-scale simulation and satellite imagery. Although the experimental results have been analyzed in detail, there are some comments for improvement, especially regarding the experimental analysis.

Reply: We are truly grateful for your kind regard as well as your insightful and encouraging comments.

 

  1. In this work, five distinct machine learning models (TrF, CNN, DNN, RF and LGBM) were selected to offer comparative insights. Why did the authors choose these methods? Are there any other, newer methods for retrieving LST?

Reply: We sincerely thank the reviewer for this insightful comment. Our selection of the five machine learning models, namely Transformer, Convolutional Neural Network (CNN), Deep Neural Network (DNN), Random Forest (RF), and Light Gradient Boosting Machine (LGBM), was intentional and grounded in methodological diversity, complementary representational strengths, and their demonstrated performance in recent remote sensing regression studies (updated in our literature review).

Our objective was not only to assess predictive accuracy but also to investigate how different, yet widely used, architectural families respond to transfer learning and fine-tuning for LST estimation, which remains insufficiently explored in previous work. Importantly, the chosen models match the characteristics of our available datasets (large simulated training set, limited in-situ samples, and multispectral TIR observations), allowing a fair and meaningful comparison. In summary:

  • The selected models span different levels of complexity, allowing us to evaluate performance trade-offs between nonlinear tabular learners (RF, LGBM) and deep neural architectures (DNN, CNN, Transformer).
  • More complex or data-hungry architectures such as U-Net, Vision Transformers (ViT), GANs, and Kolmogorov–Arnold Networks (KANs) were not used because they typically require dense pixel-wise supervision, very large labeled datasets, or long temporal sequences (e.g., LSTM), which are not available in our study due to the sparse in-situ observations and long revisit period of GF5-02 VIMI.
  • For these reasons, the chosen model suite is scientifically justified and optimal for the LST retrieval and transfer-learning scenario under real-world constraints, especially the limited availability of high-quality in-situ data.

We have added clarification in Section 3.1.1. Pre-training Process, explaining the rationale for selecting the five machine learning models and why more complex architectures were not included.

  2. From Figure 5 and Table 6, the RMSE increased from fine-tuning to generalization. Has there been any attempt to use a portion of the data from both regions for fine-tuning, with the remaining data used to evaluate generalization?

Reply: We thank the reviewer for this valuable suggestion.

In this study, we intentionally did not mix “Heihe” and “Huailai” data during fine-tuning because our objective was to evaluate true cross-domain generalization, i.e., how well a model adapted to one region performs on an entirely unseen region. Mixing data from both sites during fine-tuning would reduce the domain gap and obscure the ability to measure real cross-site transfer. Therefore, we kept the regions strictly separated to preserve the integrity of the cross-domain experiment. We will clarify this point in the revised manuscript.
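The site-separated design described in this reply can be illustrated with a minimal sketch. It is not the authors' code: the helper names (`split_by_site`, `rmse`), the sample site labels, and the temperature values are hypothetical and chosen only to show that fine-tuning samples and generalization samples never share a site.

```python
import numpy as np

def split_by_site(sites, finetune_site):
    """Return boolean masks: fine-tune on one site, evaluate on every other site."""
    sites = np.asarray(sites)
    finetune_mask = sites == finetune_site
    return finetune_mask, ~finetune_mask

def rmse(y_true, y_pred):
    """Root-mean-square error between two sequences of temperatures (K)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical site labels for five matched satellite/in-situ samples.
sites = ["Heihe", "Heihe", "Huailai", "Huailai", "Heihe"]
ft_mask, test_mask = split_by_site(sites, "Heihe")

# No sample is used for both fine-tuning and generalization testing.
overlap = bool((ft_mask & test_mask).any())

# Hypothetical in-situ and predicted LST values for a quick RMSE check.
example_rmse = rmse([300.0, 302.0], [301.0, 301.0])
```

Swapping `finetune_site` to "Huailai" gives the reverse transfer direction; a mixed split, as the reviewer suggests, would instead draw the fine-tuning mask at random across both sites.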

  3. The main research areas are Huailai and Heihe. Have other validation sites been considered, such as SURFRAD, ICOS, TERN and BSRN?

Reply: We appreciate the reviewer’s suggestion.

In this study, additional networks such as SURFRAD, ICOS, TERN, and BSRN were not included because “GF5-02/VIMI” currently has limited global coverage and revisit opportunities, making it difficult to obtain spatiotemporally matched satellite–in-situ pairs at those sites. The Huailai and Heihe stations were selected because they provide dense, high-quality TIR surface measurements with precise temporal alignment to available GF5-02 overpasses. We agree that expanding validation to international networks is valuable, and we highlight this as an important direction for future work.

  4. Some figures in this manuscript are blurry; their quality should be improved, e.g., Figure 1.
  5. The caption of Figure 1 and the figure itself should be on the same page.

Reply to 4 and 5: Thank you for pointing this out. We have replaced the blurry figures (including Figure 1) with high-resolution versions and ensured that each figure and its caption appear on the same page in the revised manuscript.

  6. It is recommended to highlight the innovations in the abstract.

 

Reply: We thank the reviewer for this suggestion.

The abstract has been revised to explicitly highlight the innovative aspects of our work. In particular, we now clearly emphasize the novelty of the proposed three-stage SFTL framework, including the integration of a large simulated dataset, an engineered humidity-sensitive feature, and multiple parameter-efficient fine-tuning strategies within a unified approach for cross-site LST estimation. These revisions strengthen the clarity of the contribution.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript presents a methodological study on retrieving LST from GF5-02 VIMI. The manuscript is well-organized, and the results are reliable. It can be accepted after a minor revision addressing the following comments:

  • An introduction to the auxiliary data is missing from the manuscript, such as the CWV used for LST retrieval, even though the ERA5 reanalysis data are mentioned in Figure 1. Additionally, the spatial resolution of the ERA5 reanalysis data is very coarse, about 25 km, while the spatial resolution of VIMI is 40 meters. How is the scale difference between the two datasets handled?
  • In Figure 5 and Figure 9, could you please explain why there is a large bias in some cases, such as at Corn5? Is the problem the ground station’s spatial representativeness or a geolocation mismatch between the pixel and the in-situ stations?

Author Response

Reviewer 2:

The manuscript presents a methodological study on retrieving LST from GF5-02 VIMI. The manuscript is well-organized, and the results are reliable. It can be accepted after a minor revision addressing the following comments:

Reply: We thank the reviewer for this important observation.

 

An introduction to the auxiliary data is missing from the manuscript, such as the CWV used for LST retrieval, even though the ERA5 reanalysis data are mentioned in Figure 1.

Reply: We thank the reviewer for pointing out this omission.

In the revised manuscript, we added a dedicated subsection “2.4 Auxiliary Meteorological and Reanalysis Data” describing the ERA5 reanalysis dataset, the total column water vapor, and their roles in both the operational SW algorithm and the SFTL models. We now explicitly explain how ERA5 WVC is temporally matched, interpolated to the study regions, and linked to the simulated and real-data experiments.

 

Additionally, the spatial resolution of the ERA5 reanalysis data is very coarse, about 25 km, while the spatial resolution of VIMI is 40 meters. How is the scale difference between the two datasets handled?

Reply: We thank the reviewer for this important observation.

Although ERA5 has a coarse resolution (~25 km), its WVC fields are used only as slowly varying atmospheric background. Given the small, flat Huailai and Heihe regions, WVC is assumed to vary smoothly within each ERA5 cell, so one value is applied to all underlying VIMI pixels. This introduces mainly systematic rather than pixel-scale differences, and any sub-grid humidity variation appears only as small residual biases in Section 4.
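As a rough illustration of applying one coarse-grid value to all underlying fine pixels, the sketch below assigns each pixel the value of its nearest grid cell. This is an assumption-laden example, not the authors' processing chain: the function name `nearest_cell_value`, the grid coordinates, and the water vapor values are hypothetical, and a simple nearest-cell lookup stands in for whatever matching the manuscript actually uses.

```python
import numpy as np

def nearest_cell_value(grid_lats, grid_lons, grid_values, pix_lats, pix_lons):
    """Give each fine pixel the value of the nearest coarse grid cell center."""
    grid_lats = np.asarray(grid_lats, dtype=float)
    grid_lons = np.asarray(grid_lons, dtype=float)
    grid_values = np.asarray(grid_values, dtype=float)
    pix_lats = np.asarray(pix_lats, dtype=float)
    pix_lons = np.asarray(pix_lons, dtype=float)
    # Broadcast pixel coordinates against grid centers and pick the closest index.
    i = np.abs(pix_lats[..., None] - grid_lats).argmin(axis=-1)
    j = np.abs(pix_lons[..., None] - grid_lons).argmin(axis=-1)
    return grid_values[i, j]

# Hypothetical 0.25-degree grid (2 x 2 cells) with made-up water vapor values.
grid_lats = [40.25, 40.50]
grid_lons = [115.50, 115.75]
wvc_grid = [[1.0, 1.2],
            [1.4, 1.6]]

# Three hypothetical pixels: the first two are neighbors about 40 m apart.
pix_lats = [40.3000, 40.3004, 40.4900]
pix_lons = [115.5500, 115.5504, 115.7400]

wvc_per_pixel = nearest_cell_value(grid_lats, grid_lons, wvc_grid, pix_lats, pix_lons)
```

Neighboring pixels inside the same cell receive an identical value, which is why any error from this step behaves as a shared offset rather than pixel-to-pixel noise.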

 

In Figure 5 and Figure 9, could you please explain why there is a large bias in some cases, such as at Corn5? Is the problem the ground station's spatial representativeness or a geolocation mismatch between the pixel and the in-situ stations?

Reply: We thank the reviewer for this central observation.

In the revised manuscript, we added an explicit explanation in the validation section (around Figures 5 and 9). Inspection of the high-resolution VIMI imagery and land-cover information shows that Corn5 is located near field boundaries and mixed surface conditions. As a result, the 40 m satellite pixel contains a mixture of crops and neighboring surfaces, whereas the ground radiometer samples only a small, homogeneous patch. We therefore attribute the larger bias at Corn5 primarily to a spatial representativeness mismatch between the station footprint and the VIMI pixel, with any residual geo-location offset (≤ one pixel) playing a secondary role. Although Corn5 could be treated as an outlier, it was retained in the analysis to reflect real observational conditions and to avoid biasing the evaluation toward only ideal station–pixel configurations.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

I consider that the authors have addressed all the issues well. I recommend publication in its present form.
