Article
Peer-Review Record

Clustering-Based Representation Learning through Output Translation and Its Application to Remote-Sensing Images

Remote Sens. 2022, 14(14), 3361; https://doi.org/10.3390/rs14143361
by Qinglin Li 1,2, Bin Li 1,2, Jonathan M. Garibaldi 3 and Guoping Qiu 1,2,4,5,*
Submission received: 29 May 2022 / Revised: 4 July 2022 / Accepted: 8 July 2022 / Published: 12 July 2022

Round 1

Reviewer 1 Report

The authors’ goal is to develop a new technique for clustering/pseudo-labeling which outperforms existing state-of-the-art techniques by maximizing the spread of observations over possible cluster labels. The authors provide both a well-written mathematical/theoretical backbone and a deep, convincing empirical exploration of the performance of their method, but some concerns remain.

 

Comments/suggestions:

 

At a fundamental level, it seems the original D dimensions are being projected into a k-dimensional space, where k also equals the number of clusters. I think this is what makes it possible for the authors’ method to work. (If there were fewer than k dimensions, I think this would fail since the translations would have to be nonlinear; and if there were more than k dimensions, the chance of overfitting would increase, right?) Can the authors comment on this, perhaps in the results/discussion/conclusion?

 

Figure 2 seems misleading for a clustering algorithm that has k clusters in k dimensions, since the figure shows k = 4 clusters in 2 dimensions. Typical clustering algorithms do not produce linear boundaries orthogonal to the dimensional axes, because the number of dimensions is usually smaller than the number of clusters. If this is a notional diagram, please make it clearer to the reader what the axes represent. I believe that in a clustering diagram with k clusters in k dimensions, every cluster shares at least one boundary with every other cluster, right? This diagram is inadequate to represent k = 4, since you would need 4 axes, right?

Also, the concept of linear translation breaks down in this diagram, since a 4-dimensional object is being projected into a 2-dimensional space.

 

I am still a bit lost on how the authors measure the performance of a clustering/pseudo-labeling method on data that has known labels. How is accuracy determined for an unsupervised learning problem? Are the authors pretending not to know any labels, letting the clustering run, and then assigning each cluster the pseudo-label of its closest corresponding true label? More clarification is needed on how the authors adapt what are typically supervised learning activities (training on labeled truth data, evaluating accuracy at predicting labels where the correct labels are known) to the unsupervised setting of having no truth labels. In particular, the experiments in Section 4 all compare the performance of the authors’ method to supervised learning methods (KNN and linear classifiers).
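For reference, the usual protocol in the unsupervised-classification literature is to compute a “clustering accuracy” by finding the cluster-to-class mapping that maximizes agreement (Hungarian matching) and then scoring as in supervised accuracy. A minimal sketch of that protocol follows; it assumes scipy is available, and whether the authors follow exactly this protocol is precisely what the paper should state:

```python
# Sketch of the standard "clustering accuracy" metric: map each cluster to a
# class via the Hungarian algorithm, then count correctly matched samples.
# This is the common protocol, not necessarily the authors' implementation.
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, cluster_ids):
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    k = int(max(true_labels.max(), cluster_ids.max())) + 1
    # contingency[i, j] = number of samples in cluster i whose true class is j
    contingency = np.zeros((k, k), dtype=np.int64)
    for c, t in zip(cluster_ids, true_labels):
        contingency[c, t] += 1
    # Hungarian algorithm: one-to-one cluster-to-class assignment that
    # maximizes total agreement (negated because it minimizes cost).
    rows, cols = linear_sum_assignment(-contingency)
    return contingency[rows, cols].sum() / true_labels.size
```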

 

 

Other minor suggestions are as follows (primarily spelling/grammar):

 

There is an overuse of “we” and “our” throughout the paper. Where the comparison is to other work, such pronouns may be needed (ours vs. theirs), but not throughout the mathematical development; for example, it is not necessary to say “We can simplify (1) to…” when a better phrase is “Equation (1) simplifies to…”

 

Both “neighborhood” and “neighbourhood” are used in the paper.  Standardize.

 

Specific Comments

 

Line 45:  “based the neighborhood” -> “based on the neighborhood”

 

Line 51: “develope” -> “developed”

 

Line 92: “due to they can capture” -> “due to the fact that they can capture”

 

Lines 121-132: the phrase “…the distribution of the N samples in k clusters is the most uniform” is clunky; better: “the N samples are distributed as uniformly (evenly) as possible among the k clusters”.

 

Line 177, just before Equation (8): “D” has not been defined for the D-dimensional real vector space.

 

Equation (8): If O(s) is the output for input I(s), is O(s) the output of the softmax activation on the final layer of the CNN? Please clarify this if possible, since softmax is nonlinear, yet the last sentence at line 177 states that the fully connected layer linearly maps X(s) to a k-dimensional space.
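For concreteness, here is a toy sketch of the distinction being asked about (the names, shapes, and framework are illustrative assumptions, not the authors’ code): the fully connected layer is a linear map into the k-dimensional space, while any softmax applied afterwards is nonlinear, so the paper should state which of the two O(s) denotes.

```python
# Illustrative only: separates the linear FC map from the nonlinear softmax.
import torch
import torch.nn as nn

feature_dim, k = 512, 10              # assumed sizes for illustration
fc = nn.Linear(feature_dim, k)        # linear map of the feature into R^k

x_s = torch.randn(1, feature_dim)     # stand-in for the CNN feature X(s)
logits = fc(x_s)                      # linear output in R^k
o_s = torch.softmax(logits, dim=1)    # nonlinear; is O(s) this, or the logits?
```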

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

The paper addresses an important issue: the need for self-supervised learning methods for image categorization, especially in the field of remote sensing.

The paper is well organized, with a blend of theoretical content and explanations, followed by real applications on well-known general-purpose benchmark datasets and then tests on remote-sensing-specific datasets.

Ablation studies are reported to show the specific boost offered by the method.

 

Having recognized the merits of the paper’s methods, I should say there is room for improvement:

- Section 3.1 and Appendix B state facts that are very well known to anyone with a background in analysis, even if they have not been explicitly stated in this context before. Perhaps the authors should not present them as theorems original to this paper.

- Since binomial coefficients are used at a later stage, it is suggested to use them also in Section 3.1.

- The English is correct but somewhat cumbersome; some proofreading is necessary. There are some typos (line 51: “we develope” should be “we develop”; line 250: “MINIST” instead of “MNIST”).

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

In this manuscript, the authors proposed a clustering-based method for unsupervised learning. I have some questions.

1. In Table 1 and Table 2, what is OLT? The proposed method is named OTL, so are these typos? If they are typos, I think the authors should be more careful.

2. Lines 266-268: “especially in Linear Classifier where the accuracy is only 0.1% lower than the accuracy in supervised learning”. I wonder how the number “0.1%” was obtained: the Supervised result is 96.5% and OTL is 95.2%, so the gap appears to be 96.5% - 95.2% = 1.3%, not 0.1%.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

The authors have revised the paper as required.

Reviewer 3 Report

The authors have addressed my questions. I have one further question: the aim of the proposed method is to output evenly distributed pseudo-labels. In what situations would this aim fail? Could the authors explain in the paper under which situations the proposed method cannot output evenly distributed pseudo-labels?

Author Response

Thank you very much for the question.

 

There are two situations in which the aim would fail:

  • Many outputs are extremely close to one another, so that translations cannot create different pseudo-labels for them; for example, in the extreme case where half of the outputs are identical (with the cluster number much larger than 2), they would all receive a single pseudo-label.
  • For many outputs, the maximum component is not unique; for example, for half of the outputs (with the cluster number much larger than 2), half of the components are equal and all take the maximum value.

 

In general, these two situations will not occur, because the inputs are sufficiently different from one another (even when they contain objects of the same class) and the network is well randomly initialized.
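To make the first failure case concrete, here is a toy numerical sketch. It assumes that “translation” means adding a fixed offset vector to the network outputs before taking the argmax as the pseudo-label (one reading of the method, stated here as an assumption): identical outputs remain identical after any common translation, so they all collapse to one pseudo-label.

```python
# Toy illustration (assumed setup, not the authors' code): identical outputs
# plus the same translation vector share the same argmax, hence one pseudo-label.
import numpy as np

k = 4                                                # number of clusters
outputs = np.tile([0.90, 0.05, 0.03, 0.02], (6, 1))  # six identical outputs
translation = np.random.randn(k)                     # any fixed translation
pseudo_labels = np.argmax(outputs + translation, axis=1)
print(pseudo_labels)                                 # six identical pseudo-labels
```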
