Article
Peer-Review Record

Unmixing-Guided Convolutional Transformer for Spectral Reconstruction

Remote Sens. 2023, 15(10), 2619; https://doi.org/10.3390/rs15102619
by Shiyao Duan 1, Jiaojiao Li 1,*, Rui Song 1, Yunsong Li 1 and Qian Du 2
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 6 April 2023 / Revised: 8 May 2023 / Accepted: 15 May 2023 / Published: 18 May 2023
(This article belongs to the Special Issue Remote Sensing Image Classification and Semantic Segmentation)

Round 1

Reviewer 1 Report

In this work, an unmixing-guided convolutional transformer network is proposed for hyperspectral reconstruction tasks. This work effectively integrates mathematical modeling with deep learning, using clear and concise language. Moreover, the authors employ graphics to elucidate the role of hyperspectral unmixing technology in image processing and employ parallel convolutional transformer modules to address the limitations of CNN modeling in the channel dimension. The reconstruction results on the dataset highlight the superiority and high interpretability of the model in spectral reconstruction. However, there are some issues that need to be resolved:

1. Figure 1 contains an inconsistency. Specifically, the component depicted in Figure 1 is referred to as the Unmixing Guided Convolutional Transformer Network (UGCT) in the Abundance Generator, while later it is named the Unmixing Guided Convolutional Transformer Abundance Generator (UGCA). Is there a distinction between the two? If not, it is advisable to standardize the terminology.

2. Eq. 2 lacks clarification of the "Soft" operation symbol. Please explain what this symbol represents.

3. On page 5, please check the word "visual" in the sentence "replacing linear projection and other components, and improved the accuracy of various visual tasks."

4. In the "Experiments and Results" section, please review the sentence "The success of incorporating LMM into SR tasks on the integration of the accurate spectral library as an a priori."

5. The novelty of this paper focuses on unmixing; please clarify in the "Introduction" section the reason for using the Linear Mixing Model rather than a Nonlinear Mixing Model.
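For context on the last point: the Linear Mixing Model (LMM) the reviewer refers to models each hyperspectral pixel as a nonnegative, sum-to-one combination of endmember spectra. The sketch below is a minimal numpy illustration of that standard model; the endmember values and abundances are made up for demonstration and are not taken from the paper under review.

```python
import numpy as np

# Illustrative endmember library: 3 endmembers, 5 spectral bands each.
# (Values are invented for demonstration; a real library would come
# from laboratory measurements or be extracted from the scene.)
E = np.array([
    [0.10, 0.20, 0.30, 0.40, 0.50],  # endmember 1
    [0.50, 0.40, 0.30, 0.20, 0.10],  # endmember 2
    [0.25, 0.25, 0.25, 0.25, 0.25],  # endmember 3
])

# Abundance vector: nonnegative and summing to one (the LMM constraints).
a = np.array([0.6, 0.3, 0.1])
assert np.all(a >= 0) and np.isclose(a.sum(), 1.0)

# Linear Mixing Model: observed pixel = abundance-weighted sum of endmembers.
pixel = a @ E
print(pixel)  # -> [0.235 0.265 0.295 0.325 0.355]
```

A nonlinear mixing model would replace the single matrix product with terms capturing multiple-scattering interactions between materials, which is one reason reviewers often ask authors to justify the linear choice.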

The English language should be improved to be clearer and more concise before publication.

Author Response

We would like to sincerely thank Reviewer 1 for the insightful comments. We have carefully revised the manuscript according to the comments, item by item.

Author Response File: Author Response.pdf

Reviewer 2 Report

This paper introduces a novel Convolutional Transformer Network based image processing technique that reconstructs high-spatial-resolution hyperspectral images by fusing low-spatial-resolution hyperspectral images with high-spatial-resolution multispectral-band images. More specifically, the Paralleled-Residual Multi-Head Self-Attention (PMSA) technique is included to facilitate fine feature extraction, and the Spectral-Spatial Aggregation Module (S2AM) is used to optimize the reconstruction performance. A hyperspectral unmixing framework is used to relate the endmember features to prior knowledge from the spectral library. An application to the 2018 IEEE GRSS Data Fusion Contest data demonstrates the advantage of the proposed methodology over other methods in spectral reconstruction performance.
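As background for the attention component the reviewer mentions: PMSA builds on standard multi-head self-attention. The sketch below is a generic numpy illustration of that standard building block only, not the authors' PMSA module; all dimensions, weights, and the `multi_head_self_attention` function are arbitrary choices for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Generic multi-head self-attention over a token matrix X of shape (seq, dim)."""
    seq, dim = X.shape
    d_head = dim // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Split into heads: (n_heads, seq, d_head).
    Q = Q.reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    K = K.reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    V = V.reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention per head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    out = softmax(scores) @ V
    # Merge heads back and apply the output projection.
    out = out.transpose(1, 0, 2).reshape(seq, dim)
    return out @ Wo

rng = np.random.default_rng(0)
dim, seq, heads = 8, 4, 2
X = rng.standard_normal((seq, dim))
Wq, Wk, Wv, Wo = (rng.standard_normal((dim, dim)) for _ in range(4))
Y = multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads=heads)
print(Y.shape)  # -> (4, 8)
```

The "paralleled-residual" variant described in the paper presumably arranges such attention alongside convolutional branches; for those specifics, readers should consult the manuscript itself.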

 

Major comments:

The paper is well organized, and the results look convincing. I would recommend this paper for publication. However, the implementation of the methodology (as shown in Figures 2 and 3) is a rather complicated scheme, which makes it hard for readers to see clearly what critical function each step provides and why each step is necessary. The discussion given in the paper is not sufficient for readers to fully grasp the technical details. I would strongly suggest that the authors add a more detailed discussion of the flow diagrams in Figures 2 and 3. For example, answering questions such as "Why is a two-stage 'DownSample-UpSample' process needed, and what is the benefit of such a two-stage process?" would greatly help readers understand the implementation scheme.

 

Minor issues:

Many abbreviations are used in the paper, and in some cases the full form is not provided before the abbreviation is first used; please double-check. Below are a few minor issues that need to be corrected or explained.

 

Line 158: please use "Deep Learning based techniques" instead of "DL-based techniques".

 

Line 176: please use "Vision Transformer block" instead of "ViT block".

 

Figure 3: What is "Conv2D+BN"? What is "ShoutCut"? Is it actually "ShortCut"?

 

Figures 6 and 8: Please add x-axis labels to the figures.

The paper is written in proper English. I had no difficulty understanding the content.

Author Response

We would like to sincerely thank Reviewer 2 for the insightful comments. We have carefully revised the manuscript according to the comments, item by item.

Author Response File: Author Response.pdf

Reviewer 3 Report

The study presents an Unmixing-Guided Convolutional Transformer Network (UGCT) approach for achieving interpretable spectral reconstruction in hyperspectral imaging. The proposed UGCT model exploits the combined advantages of convolutional neural networks (CNNs) and transformers, allowing it to extract both local and non-local features from the image.

This manuscript is well written, with a coherent structure and clear English in explicating the research. Hence, I posit that it meets the requisite standards for publication.

Author Response

We would like to sincerely thank Reviewer 3 for the insightful comments. We have carefully revised the manuscript according to the comments, item by item.

Author Response File: Author Response.pdf
