Peer-Review Record

From Video to Hyperspectral: Hyperspectral Image-Level Feature Extraction with Transfer Learning

Remote Sens. 2022, 14(20), 5118; https://doi.org/10.3390/rs14205118
by Yifan Sun, Bing Liu *, Xuchu Yu, Anzhu Yu, Kuiliang Gao and Lei Ding
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 31 August 2022 / Revised: 3 October 2022 / Accepted: 8 October 2022 / Published: 13 October 2022
(This article belongs to the Special Issue Computational Intelligence in Hyperspectral Remote Sensing)

Round 1

Reviewer 1 Report

I would like to thank the authors for the publication. The use of Transfer Learning to classify and extract features from hyperspectral images is a new topic and should be continuously developed.

The presented studies are described in an interesting way. The brief theoretical introduction presents the essential basics of classical image classification methods and a modern approach to classification and feature extraction.

The authors also clearly and correctly described their original approach.

The presented results are encouraging, but the tables should be edited because they are not readable and reading the values is difficult.

In the results section, however, I miss information about the accuracy of the classification of individual classes. The overall accuracy results are really promising. However, when comparing the visually presented results (images after classification), there are apparent differences between the classification results. If possible, I would ask you to add such analyses and reflect on why there may be such differences in classification.

Author Response

Response: Thank you very much for your affirmation and recognition of our work, which will motivate us to continue doing more in-depth research. At the same time, thank you for your valuable suggestions; we have made the corresponding modifications:

  • We have re-edited all the tables, increasing the spacing of rows and columns to make them easier to read and analyze.
  • We have added and improved the relevant discussion.

To focus the analysis on individual categories, we take IP as an example to discuss the differences in per-category accuracy. As can be seen in Table 2, Spe-TL achieves the highest accuracy on six categories of ground objects (2, 5, 11, 13, 14 and 16), which shows the advantage of Spe-TL in per-category accuracy. Besides, Spe-TL also achieves results close to the highest on the other categories. However, for categories 7 and 9, Spe-TL performs relatively poorly, because three methods (SSTN, FreeNet and CEGCN) achieve an accuracy of 100% on them. Interestingly, the other comparative methods also perform poorly on these two categories, which shows the difficulty of identifying them. Owing to the transformer architecture of SSTN and the patch-level input of FreeNet and CEGCN, these three methods capture long-distance spatial information well, which leads to better performance on categories 7 and 9, which are hard to distinguish relying merely on spectral features.

There are two main reasons for the apparent differences in the classification maps. First, an approach with high accuracy may not produce a high-quality full-domain classification map. Second, the accuracy is calculated only with the labeled samples, while the full-domain classification map is obtained by classifying all samples, both labeled and unlabeled. Therefore, the accuracy is not equivalent to the quality of the full-domain classification map, and a high-accuracy method only has practical value when it also produces high-quality classification maps. Regarding the quality of the full-domain classification maps, we have discussed it in detail previously, and our method is advantageous in producing fine classification maps. Spe-TL better combines the advantages of both sides: on the one hand, the classification pattern of SVM guarantees the refined restoration of details; on the other hand, the powerful discriminative capacity of the image-level feature effectively decreases the noise and yields a better visual effect.

To show the advantage of Spe-TL in restoring details more clearly, we enlarge some areas of the four scenes, as shown in Figs. 6-9. From these areas, we observe that Spe-TL can not only achieve accurate classification but also subtly reflect the authentic distribution of ground objects compared with the other methods. For example, on the Indian Pines scene, there are a planar stones area and a linear trees road located in the north and a linear grass lane located in the middle; on the Salinas scene, there are a linear vinyard trellis path located in the south and a linear romaine path and a planar rough plow area located in the west; on the Pavia University scene, there is a planar roof of metal and asphalt located in the north; on the Houston scene, there are the planar roof of a circular commercial mall, planar soil and grass areas, and a linear road, all located in the northwest.

Author Response File: Author Response.docx

Reviewer 2 Report

The work is interesting, supported by an extensive set of experiments and comparisons with existing solutions. The paper is well-structured, generally providing good explanations and detail. There is, however, a need for further elaboration in the methods and results sections. The manuscript has occasional spelling mistakes and sentences with incorrect structure.

Below are more detailed comments with reference to text, equations, figures and tables in the paper. 

Section 3.1, line 206: explain the third dimension of the input images R^(h × w × 3), i.e. what is/why the value 3. 

Line 207: elaborate on the value of L, how it is chosen, or derived from where, transferred from .... Further down in section 4.2 seems to be a choice for L=6. Is this value fixed, or did you try different values for the L level? 

Line 210, equation 1 (... throughout the section): mention what is the index t in Ctl

Eq. 1, eq. 2 ...: mention what are the arguments x, x1, x2 e.g. pixels of 2D images (?)

Line 218: explain the length N of vector Ctl, e.g. the number of frames in a video, number of bands for an HS image (?)

Lines 219-220: mention the norm/distance metric that is used for defining the range of pixels in the cost volume calculation. Also (quickly) elaborate on the choice of the default value d=3. 

Section 3.2: and source task Tt -> and target task Tt 

Eq. 3: is the argument p the same as argument x in eq. 1? If so, better keep the same notation. 

Line 234: explain what are u and v in the optical flow f={u, v}

Section 3.4: From the explanations provided in this section and in section 4.3 I infer that the set Fea expands with the increase of D, e.g. Fea10 contains the feature set Fea9, and the same holds for FeaconcatD. Performance results in Table 1 show oscillation or decrease of performance with the increasing number of features. This hints at the need for a feature selection optimisation procedure, e.g. forward or backward feature selection, or sequential floating forward selection (SFFS). SVM with a feature selection mechanism might prove to give better performance than voting. 

Section 4.3, lines 374-379: (related to the comment above) instead of searching for an optimal scale, you could look for optimal features at the largest scale, D=10. This would also reduce the load of processing.

Table 1: explain the figures in the rows, what are the intervals (how many sigmas, etc.)

Tables 2-5: explain what is in the rows; what metric is being reported for the performance per category, 1-16, 1-9 ...

Figures 6-9: the black background in sub-figures b (ground truth) is unlabelled pixels, right? Mention in the text and/or add in the legend. 

Figure 10: SpeMotion in the legend refers to your method Spe-TL, isn't it? 

Section 4.5, table 7: The values in the Spe-TL column are run times for one scale only (k=10), thus not the final classification including voting, right? It is interesting and important to know the run time for the complete classification procedure; do add another column for these values. 

Section 4.6, lines 493-496: the explanation for flow diagrams in Fig.11(b) is not clear, i.e. how are colours related to direction of variation and its degree. 

Lines 500-507: shortly explain how T-SNE visualisation works; otherwise, the results shown in Fig.12 cannot be understood. 

Author Response

Response to Reviewer 2 Comments

The work is interesting, supported by an extensive set of experiments and comparisons with existing solutions. The paper is well-structured, generally providing good explanations and detail. There is, however, a need for further elaboration in the methods and results sections. The manuscript has occasional spelling mistakes and sentences with incorrect structure.

Response: Thank you very much for your affirmation and recognition of our work, which will motivate us to continue doing more in-depth research. At the same time, thank you for your valuable suggestions; we have made the corresponding modifications:

  • We have further elaborated the methods and results sections point by point according to the comments.
  • We have checked the full paper in this revision and improved the language to guarantee its quality.

Below are more detailed comments with reference to text, equations, figures and tables in the paper. 

1. Section 3.1, line 206: explain the third dimension of the input images R^(h × w × 3), i.e. what is/why the value 3. 

Response: Thanks for your valuable advice. Because PWC-Net is designed for 3-channel RGB images, both input images have 3 channels. We have further stated this in this revision.

2. Line 207: elaborate on the value of L, how it is chosen, or derived from where, transferred from .... Further down in section 4.2 seems to be a choice for L=6. Is this value fixed, or did you try different values for the L level? 

Response: Thanks for your valuable advice. We have elaborated on the value of L in this revision. The value of L decides the number of layers of the pyramid extractor and thus the depth of the network. The setting also needs to consider the size of the input, because a smaller input cannot pass through a deeper pyramid extractor. Since the discussion of the network structure is not the focus of this paper, we select the optimal setting (L=6) according to the original literature to train the network and transfer its knowledge. We further state this in the network section and in Section 4.2.

3. Line 210, equation 1 (... throughout the section): mention what is the index t in Ctl

Response: Thanks for your valuable advice. The index t denotes the serial number of the two input images (t = 1, 2). We have mentioned this in this revision.

4. Eq. 1, eq. 2 ...: mention what are the arguments x, x1, x2 e.g. pixels of 2D images (?)

Response: Thanks for your valuable advice. Yes, x denotes a pixel in the image; x1 and x2 denote a pixel in the first image and the corresponding matching pixel in the second image, respectively. We have supplemented this in this revision.

5. Line 218: explain the length N of vector Ctl, e.g. the number of frames in a video, number of bands for an HS image (?)

Response: Thanks for your valuable advice. We have explained it in this revision. The length N of the column vector Ctl is actually the channel dimension of the feature maps at the l-th level, which varies with the layer from 16 to 32, 64, 96, 128 and 196.

6. Lines 219-220: mention the norm/distance metric that is used for defining the range of pixels in the cost volume calculation. Also (quickly) elaborate on the choice of the default value d=3. 

Response: Thanks for your valuable advice. We have improved the elaboration in this revision. First, the range d actually defines the size of the search window, which must be smaller than the size of the feature maps at the l-th level. Second, a too-large d leads to huge computational overhead with no performance improvement, and a too-small d leads to insufficient matching retrieval. The value d=3 is chosen according to the original PWC-Net literature, based on optimal value selection. Besides, we state the choice of d in Section 4.2.
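
For illustration only (not part of the manuscript), the following minimal NumPy sketch shows how a correlation-based cost volume with search radius d could be computed between two feature maps; the function and variable names are placeholders.

import numpy as np

def cost_volume(feat1, feat2_warped, d=3):
    """Correlation cost volume between two (H, W, N) feature maps.

    For every pixel of feat1, the normalized dot product with the feature
    vectors of feat2_warped inside a (2d+1) x (2d+1) search window is stored,
    giving an output of shape (H, W, (2d+1)**2).
    """
    H, W, N = feat1.shape
    padded = np.pad(feat2_warped, ((d, d), (d, d), (0, 0)), mode="constant")
    cv = np.zeros((H, W, (2 * d + 1) ** 2), dtype=feat1.dtype)
    idx = 0
    for dy in range(-d, d + 1):
        for dx in range(-d, d + 1):
            shifted = padded[d + dy:d + dy + H, d + dx:d + dx + W, :]
            cv[:, :, idx] = (feat1 * shifted).sum(axis=-1) / N  # normalized by channel length N
            idx += 1
    return cv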

7. Section 3.2: and source task Tt -> and target task Tt

Response: We sincerely thank you for your careful review. We have corrected this mistake, which we had overlooked.

8. Eq. 3: is the argument p the same as argument x in eq. 1? If so, better keep the same notation. 

Response: Thanks for your valuable advice. Yes, both denote the location of a pixel in the equations. We have improved this in this revision.

9. Line 234: explain what are u and v in the optical flow f={u, v}

Response: Thanks for your valuable advice. We have supplemented the details in this revision. The motion of an object in 3D space is reflected on the image as a motion of the image brightness pattern, and this visible motion of the brightness pattern produces the optical flow f = {u, v}, which contains a horizontal motion component u and a vertical motion component v for each pixel. Since the HSI is a static scene with no motion in the spatial dimension, we use the optical flow estimation method to calculate the variation information along the spectral dimension, which denotes the direction and degree of variation at the current point of the spectral curve.
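
For illustration only, a minimal sketch of estimating the flow f = {u, v} between two adjacent bands of an HSI cube is given below. It uses OpenCV's Farneback estimator purely as a stand-in for the pre-trained PWC-Net used in the paper, and all names are placeholders.

import numpy as np
import cv2  # OpenCV; Farneback is used here only as a stand-in for PWC-Net

def band_pair_flow(hsi, b):
    """Estimate the optical flow f = {u, v} between bands b and b+1 of an HSI cube.

    hsi: array of shape (H, W, B). Each band is rescaled to 8-bit grayscale
    before being fed to the dense flow estimator.
    """
    def to_u8(band):
        band = band.astype(np.float32)
        band = (band - band.min()) / (band.max() - band.min() + 1e-8)
        return (band * 255).astype(np.uint8)

    img1, img2 = to_u8(hsi[:, :, b]), to_u8(hsi[:, :, b + 1])
    flow = cv2.calcOpticalFlowFarneback(img1, img2, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    u, v = flow[..., 0], flow[..., 1]  # horizontal and vertical components of the flow
    return u, v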

10. Section 3.4: From the explanations provided in this section and in section 4.3 I infer that the set Fea expands with the increase of D, e.g. Fea10 contains the feature set Fea9, and the same holds for FeaconcatD. Performance results in Table 1 show oscillation or decrease of performance with the increasing number of features. This hints at the need for a feature selection optimisation procedure, e.g. forward or backward feature selection, or sequential floating forward selection (SFFS). SVM with a feature selection mechanism might prove to give better performance than voting. 

Response: Thanks for your valuable advice. We have thought carefully about your comment, but there may be a misunderstanding, and we have further explained it in this revision. In fact, FeaD is not a set of features; it is a single concatenated feature under a certain interval D, which determines the scale of the feature. For instance, Fea10 and Fea9 are independent and unrelated. This is why the performance results in Table 1 show oscillation or a decrease in performance with increasing interval D. A voting strategy is therefore utilized to better exploit the discriminative capacity of the features at different scales. For this purpose, a multi-scale feature set contains k + 1 groups of different concatenated features FeaD (D = 0, 1, ..., k), and k decides the number of features in the set. The different FeaD in the set are used to construct k + 1 training tasks with the same training samples. The results in Table 1 show that the voting result outperforms any single-scale feature, which illustrates the effectiveness of the proposed voting strategy: the relatively prominent discrimination advantages of features at different scales are exploited, and classification errors are smoothed, further improving the accuracy.
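
For illustration only, the following minimal sketch shows single-scale training followed by majority voting, with one SVM trained per concatenated feature FeaD; the hyper-parameters and names are placeholders, not the manuscript's code.

import numpy as np
from sklearn.svm import SVC

def vote_classify(features_per_scale, y_train, train_idx, test_idx):
    """Majority voting over k+1 single-scale SVM classifiers.

    features_per_scale: list of k+1 arrays, each of shape (num_samples, dim_D),
    i.e. one concatenated feature FeaD per interval D = 0, ..., k.
    Labels are assumed to be non-negative integers.
    """
    predictions = []
    for fea in features_per_scale:
        clf = SVC(kernel="rbf", C=100, gamma="scale")  # placeholder hyper-parameters
        clf.fit(fea[train_idx], y_train)
        predictions.append(clf.predict(fea[test_idx]))
    predictions = np.stack(predictions, axis=0)        # shape (k+1, num_test)
    # majority vote across the k+1 single-scale predictions
    voted = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, predictions)
    return voted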

11. Section 4.3, lines 374-379: (related to the comment above) instead of searching for an optimal scale, you could look for optimal features at the largest scale, D=10. This would also reduce the load of processing.

Response: Thanks for your valuable advice. As mentioned above, we have further explained the relevant description that may have led to ambiguity. In fact, FeaD is not a set of features; it is a single concatenated feature under a certain interval D, which determines the scale of the feature. For instance, Fea10 and Fea9 are independent and unrelated. When we set k=10, the 11 groups of different concatenated features FeaD (D=0,1,…,10) are used to construct 11 training tasks with the same training samples, and the corresponding classification results are shown in Table 1. These results prove that the voting result performs better than any single-scale feature, which illustrates the effectiveness of the proposed voting strategy. With the voting strategy, the relatively prominent discrimination advantages of features at different scales can be exploited, and classification errors can be smoothed, so as to further improve the accuracy.

12. Table 1: explain the figures in the rows, what are the intervals (how many sigmas, etc.)

Response: Thanks for your valuable advice. We are not sure whether we have understood your suggestion correctly, so we explain this question from two perspectives:

  • The meaning of the interval D has been illustrated in detail in Sections 3.3 and 3.4, and we briefly restate it in the analysis section in this revision.
  • The figures in the rows represent the mean ± standard deviation over ten runs of the experiment under each specific setting, and we further state this in this revision.
13. Tables 2-5: explain what is in the rows; what metric is being reported for the performance per category, 1-16, 1-9 ...

Response: Thanks for your valuable advice. As mentioned above, the results in the rows represent the mean ± standard deviation over ten runs of the experiment under each specific setting, and we further state this in this revision. The per-category metric is the percentage of correctly classified samples of a category relative to the total number of samples of that category. We have added this point in Section 4.1 in this revision.
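
For clarity, the per-category metric can be written as the following minimal sketch (placeholder names, not the manuscript's code).

import numpy as np

def per_class_accuracy(y_true, y_pred, num_classes):
    """Percentage of correctly classified samples of each category,
    relative to the total number of labeled samples of that category."""
    acc = np.zeros(num_classes)
    for c in range(num_classes):
        mask = (y_true == c)
        acc[c] = 100.0 * (y_pred[mask] == c).mean() if mask.any() else np.nan
    return acc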

14. Figures 6-9: the black background in sub-figures b (ground truth) is unlabelled pixels, right? Mention in the text and/or add in the legend. 

Response: Thanks for your valuable advice. Yes, the black background areas in the ground-truth maps are samples without labels. We have added this to the legends of Figures 6-9.

15. Figure 10: SpeMotion in the legend refers to your method Spe-TL, isn't it? 

Response: Thanks for your valuable advice. Yes, it is our method. To avoid ambiguity, we have revised the legend.

16. Section 4.5, table 7: The values in the Spe-TL column are run times for one scale only (k=10), thus not the final classification including voting, right? It is interesting and important to know the run time for the complete classification procedure; do add another column for these values. 

Response: Thanks for your valuable advice. We think you may be referring to the results in Table 8, which reports the running time of the different approaches on the different scenes. We must note that this time is the training and testing time after voting, i.e., k=10 means 11 rounds of training and testing (see our answers to questions 10 and 11). If we only use a single-scale feature for training and testing, the time consumption is less than a second. To make this clear, and to meet your request, we have added a column with the running time for k=0, where only the single-scale feature (D=0) is used for training and testing.

17. Section 4.6, lines 493-496: the explanation for flow diagrams in Fig.11(b) is not clear, i.e. how are colours related to direction of variation and its degree. 

Response: Thanks for your valuable advice. We have improved the explanation in this revision. The variation within a homogeneous region is similar, while the variations of heterogeneous regions are distinguishable. The optical flow diagrams use different colors to represent the direction of spectral variation, i.e., similar colors denote areas with approximately consistent variation at the current band. Besides, the brightness of the color represents the degree of variation within an area of similar variation, i.e., higher brightness denotes more prominent variation. Such variation information is used to construct the final image-level feature, which has more discriminative power.
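
For illustration only, the conventional way of rendering a flow field in color (hue for direction, brightness for magnitude) can be sketched as follows; this mirrors the visualization convention described above but is not taken from the manuscript's code.

import numpy as np
import cv2

def flow_to_color(u, v):
    """Render a flow field as a color image: hue encodes the direction of
    variation and brightness encodes its degree (magnitude)."""
    mag, ang = cv2.cartToPolar(u.astype(np.float32), v.astype(np.float32))
    hsv = np.zeros((*u.shape, 3), dtype=np.uint8)
    hsv[..., 0] = (ang * 180 / np.pi / 2).astype(np.uint8)  # hue: direction (0-180 in OpenCV)
    hsv[..., 1] = 255                                        # full saturation
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)  # brightness: magnitude
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)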

18. Lines 500-507: shortly explain how T-SNE visualization works; otherwise, the results shown in Fig.12 cannot be understood. 

Response: Thanks for your valuable advice. We have briefly explained how T-SNE visualization works and improved the explanation of the results in Fig. 12 in this revision. The dimensionality-reduction-based T-SNE method maps high-dimensional features to a two-dimensional space, such that features with high similarity in the high-dimensional feature space lie close to each other in the two-dimensional space. To further present the discriminative capacity of the image-level feature intuitively, we use T-SNE to process the features before and after extraction on the different scenes, as shown in Fig. 12. For a given scene, dots of different colors represent samples of different categories, and the distance between them in the two-dimensional space can be approximately interpreted as their feature similarity in the high-dimensional space. Except for the Indian Pines scene, which uses only 600 samples per class because of its relatively small number of samples, all scenes use 1000 samples per class for visualization. As can be observed, before feature extraction there are many heterogeneous samples that overlap in the low-dimensional space, which illustrates the high similarity of the original features and the difficulty of distinguishing them. After feature extraction, Spe-TL obviously enhances the discriminability of the features: homogeneous samples become more concentrated, and the distance between heterogeneous samples is enlarged in the low-dimensional space. Therefore, the image-level feature is more conducive to identification and classification.
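
For illustration only, a minimal sketch of the T-SNE visualization procedure (class-wise sub-sampling followed by a 2-D embedding) is given below; the sample counts and names are placeholders, not the manuscript's code.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def tsne_scatter(features, labels, n_per_class=1000, seed=0):
    """Project features to 2-D with t-SNE and scatter-plot them by class."""
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        keep.append(rng.choice(idx, size=min(n_per_class, idx.size), replace=False))
    keep = np.concatenate(keep)
    emb = TSNE(n_components=2, init="pca", random_state=seed).fit_transform(features[keep])
    plt.scatter(emb[:, 0], emb[:, 1], c=labels[keep], s=2, cmap="tab20")
    plt.show()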

Author Response File: Author Response.docx

Reviewer 3 Report

The authors propose a new method to classify HSI by leveraging image features extracted from adjacent HSI bands. To this aim, they transfer the optical flow estimation network pre-trained on video data. Their method has been validated on 4 publicly available HSI scenes. Results concerning classification accuracy and inference time against other methods are clearly reported.

Author Response

Response: Thank you for your careful review of our work, and for your affirmation and recognition, which will motivate us to continue doing more in-depth research. We checked the full paper in this revision and improved the language to guarantee its quality.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

Thank you for all corrections.
