In this section, we first report our experimental results on several datasets commonly used in computer vision, comparing with state-of-the-art research, and then we conduct experiments on various remote-sensing image datasets.
4.1. Comparison with State-of-the-Art Approaches
To facilitate comparison with other representation-learning methods, the experiments in this part were conducted on a large-scale dataset, ImageNet LSVRC-12 [51], and three smaller-scale datasets: CIFAR-10/100 and SVHN. ImageNet LSVRC-12 contains around 1.28 million training images of 1000 classes and 50,000 validation images. CIFAR-10/100 contains 50,000 training images and 10,000 test images, with 10/100 classes and a pixel resolution of 32 × 32. SVHN is a real-world image dataset similar to MNIST (images of digits), with 73,257 images for training and 26,032 images for testing. All the ground truths of these datasets were used only in the evaluation of representations. The implementation details are given in Appendix D.
For a fair comparison with previous work, we set the same cluster number k for these datasets as in [8]. For CIFAR-10, the cluster number was set to 128; therefore, the mean number of samples per cluster is 50,000/128 = 390.625. According to Remark 1, although this number is not an integer, setting it as the target still works: the number of samples assigned to each cluster is made as close as possible to this mean. The cluster number was set to 128 for SVHN and 512 for CIFAR-100 for the experiments conducted with AlexNet.
Table 1 gives kNN and linear-classifier evaluations of representations learned from CIFAR-10/100 and SVHN with our method, OTL (output translation). The kNN evaluations were performed on the fully connected layer before the final layer, with all layers except the final layer frozen. For the linear classifier, all fully connected layers were discarded and the retained CNN layers were frozen; a new randomly initialized fully connected layer was then attached to the last CNN layer, and its weights were updated by supervised learning. With simple augmentations, the performance of our method reaches the state of the art. When strong augmentations and LCT (label-consistent training) were employed, our method outperformed previous works in the literature. For SVHN, the performance of our method is very close to that of supervised learning, especially for the linear classifier, where the accuracy is only marginally lower than the supervised accuracy. When one assumes the actual number of classes in the data is known and sets k equal to it, this can be regarded as giving the model some prior knowledge. Such prior knowledge helps improve the performance of our OTL algorithm, especially for the linear classifier on SVHN, where the accuracy matches that of fully supervised learning.
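The kNN evaluation protocol described above can be sketched as follows (a simplified illustration assuming precomputed features from the frozen layer; `knn_evaluate` and its signature are our own, not the paper's code):

```python
import numpy as np

def knn_evaluate(train_feats, train_labels, test_feats, test_labels, k=5):
    """Evaluate frozen representations with a cosine-similarity kNN
    classifier; ground-truth labels are used only at this evaluation stage."""
    # L2-normalize so the dot product equals cosine similarity.
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sims = test @ train.T                        # (n_test, n_train)
    nearest = np.argsort(-sims, axis=1)[:, :k]   # k nearest neighbors
    preds = np.array([np.argmax(np.bincount(train_labels[row]))
                      for row in nearest])       # majority vote
    return float(np.mean(preds == test_labels))
```

If the learned features are discriminative, same-class samples cluster under cosine similarity and the vote recovers the correct class.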
In Table 2, the experiments were run on ResNet-50, and for a fair comparison with other methods, the same cluster number k was used for all experiments on ResNet-50. Both evaluations, kNN and linear classifier, were conducted on the last CNN layer. Note that Instance [53], SimCLR [6], and ISL [54] are all instance-discrimination methods. This table shows that clustering-based representation-learning methods can learn representations as good as, or even better than, those of instance discrimination. Compared to Table 1, we find that the performances on ResNet-50 are overall better than those on AlexNet, which is reasonable and expected.
To demonstrate the effectiveness of our method on a large-scale dataset and the speed of our algorithm in dealing with a huge number of images, i.e., over one million, we also conducted experiments on ImageNet with AlexNet. As can be seen from Table 3, our method, without strong augmentations and LCT, achieves state-of-the-art performance in the evaluations of both linear and kNN classifiers. Comparing the representations from the five CNN layers of our model, those from the last two layers perform better than those from the first three.
4.2. Impact of Even Pseudo-Label Distribution on Performances
We must emphasize, again, that pseudo-labels are evenly distributed in order to make the representations highly discriminative. Besides the even distribution, our algorithm can conveniently impose various uneven pseudo-label distributions, which is done by setting the target values in the frequency indicator C (11) to the desired distribution.
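For illustration, converting a desired (possibly uneven) distribution into integer per-cluster targets that sum exactly to the dataset size could look like the following sketch (the helper `target_counts` is hypothetical, not the paper's code):

```python
import numpy as np

def target_counts(n_samples, weights):
    """Turn relative per-cluster weights into integer target counts
    that sum exactly to n_samples (largest-remainder rounding)."""
    w = np.asarray(weights, dtype=float)
    raw = w / w.sum() * n_samples
    counts = np.floor(raw).astype(int)
    # Distribute the rounding remainder to the largest fractional parts.
    remainder = n_samples - counts.sum()
    order = np.argsort(-(raw - counts))
    counts[order[:remainder]] += 1
    return counts
```

Uniform weights recover the even targets used in the main experiments; skewed weights give the uneven targets studied below.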
Unevenly distributed pseudo-labels. To investigate the influence of the unevenness of the pseudo-label distribution, various uneven target distributions of pseudo-labels were set for training on CIFAR-10/100, as shown in Figure 3 (for more details, see Appendix E). The kNN evaluations for the different target distributions are given in Table 4 (these models were trained on AlexNet with fewer epochs than in Table 1 and without strong augmentations or the LCT technique). As one can see, the kNN accuracy decreases as evenness decreases. This is consistent with the argument made in the method section that evenly distributed pseudo-labels are good for learning representations.
Unevenly distributed datasets. The standard datasets used in this paper are evenly distributed, i.e., each class contains the same number of images. However, the effectiveness of our method does not depend on the data being evenly distributed. To show this, we considered five unevenly distributed datasets made from CIFAR-10 by deleting different numbers of images from different classes. As a comparison, five evenly distributed datasets were made, each with the same number of samples as the corresponding unevenly distributed dataset.
In Table 5 (again with fewer epochs and without strong augmentations or the LCT technique), for the unevenly distributed “D_100”, we deleted 0, 100, 200, ⋯, 900 images from classes 0, 1, 2, ⋯, 9, while the evenly distributed “D_100” has the same total number of images but an evenly distributed ground truth. Similarly, for the unevenly distributed “D_200”, we deleted 0, 200, 400, ⋯, 1800 images from classes 0 to 9. For the unevenly distributed “D_500”, the number of images in the first class is ten times the number in the last class.
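The construction of these subsets can be sketched as follows (CIFAR-10 has 5000 training images per class; the helper name is ours):

```python
def uneven_class_counts(step, per_class=5000, n_classes=10):
    """Per-class image counts of an uneven CIFAR-10 subset, built by
    deleting 0, step, 2*step, ... images from classes 0..9."""
    return [per_class - step * c for c in range(n_classes)]

d100 = uneven_class_counts(100)  # class 0 keeps 5000, class 9 keeps 4100
d500 = uneven_class_counts(500)  # class 0 keeps 5000, class 9 keeps 500
```

For “D_500” this reproduces the ten-to-one ratio between the first and last classes stated above.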
We can see from Table 5 that our method is almost unaffected by whether the actual data distribution is even or not. As expected, evenly distributed data leads to slightly better performance than unevenly distributed data. This experiment demonstrates that the success of our algorithm does not rely on the datasets being evenly distributed.
4.3. Application to Remote-Sensing Images
For all remote-sensing datasets, we divided the images into two groups, training and testing sets: we randomly picked around 80% of the images for training and used the rest for testing. All training involved only the training sets, without using any labels. For all datasets, we employed the same backbone, ResNet-18. All remote-sensing images were resized to 256 × 256 and then cropped to 224 × 224, and for the strong augmentations we used the strategy from [10] with Augment = 10, n = 1, and length = 16. In this part, we apply our method to seven remote-sensing image datasets: HMHR (How to Make High Resolution Remote Sensing Image dataset), EuroSAT, SIRI WHU Google, UCMerced LandUse, AID, PatternNet, and NWPU RESISC45.
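The random ~80/20 split can be sketched as follows (a minimal illustration; `train_test_split` and the fixed seed are our own choices, not the paper's code):

```python
import random

def train_test_split(paths, train_frac=0.8, seed=0):
    """Randomly split a list of image paths into training and testing
    sets; labels are never consulted, since training is unsupervised."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)  # deterministic for a fixed seed
    cut = int(len(paths) * train_frac)
    return paths[:cut], paths[cut:]
```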
For the very small datasets, HMHR, SIRI WHU Google, and UCMerced LandUse, we used pretrained models. For these three datasets we trained for 64 epochs, with an initial learning rate of 0.03 that dropped to 0.003 and 0.0006 at epochs 32 and 48. For training without a pretrained model, the total number of epochs was set to 300, and the learning rate started at 0.03 and dropped to 0.003 and 0.0006 at epochs 160 and 240. For all training, the momentum was set to 0.9, and for all datasets k was set to 128.
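The step schedule above can be expressed as a small helper (a sketch, with the no-pretraining milestones as defaults; pass milestones=(32, 48) for the three small datasets trained for 64 epochs):

```python
def learning_rate(epoch, milestones=(160, 240), rates=(0.03, 0.003, 0.0006)):
    """Return the learning rate for a given epoch under the step schedule:
    start at rates[0] and drop to the next rate at each milestone epoch."""
    for i, m in enumerate(milestones):
        if epoch < m:
            return rates[i]
    return rates[-1]
```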
HMHR is a dataset made from Google Maps via LocaSpace Viewer. It contains only 533 images in five classes: building, farmland, greenbelt, wasteland, and water. The pixel resolution and spatial resolution of the remote-sensing images are 256 × 256 and 2.39 m, respectively. To prevent the network from overfitting, we used the strong augmentations [10] and models pretrained on other remote-sensing image datasets: PatternNet, NWPU RESISC45, and EuroSAT. All the pretrained models were trained with our method discussed in the Methods section, and no labels were used in the training. For supervised training, we only considered the model pretrained on PatternNet. From Table 6, we see that our method works well on very small datasets with few classes; there is only a small gap between the supervised method and ours in the linear-classifier evaluation. Although different pretrained models yield different performances, all of them are effective and perform well.
EuroSAT [58,59] is a dataset for land-use and land-cover classification. It consists of 10 classes with 27,000 labeled images, including annual crop, forest, highway, river, sea/lake, and so on. The pixel resolution and spatial resolution of these images are 64 × 64 and 10 m, respectively. Table 7 shows that the gap between the supervised method and ours is only marginal in the linear-classifier evaluation. In the kNN evaluation, the performance of our method is also very close to that of supervised learning.
The SIRI WHU Google dataset [60] contains 2400 images in 12 classes, with 200 images per class. The classes include: agriculture, commercial, harbor, idle land, industrial, meadow, overpass, park, pond, river, water, and residential. The pixel resolution and spatial resolution of the images in this dataset are 200 × 200 and 2 m, respectively. As can be seen from Table 8, using the model pretrained on NWPU RESISC45 gives the best performance, which is even better than that of supervised learning based on the model pretrained on PatternNet. Representations learned from the model pretrained on PatternNet perform close to those of supervised learning.
UCMerced LandUse [61] provides images manually extracted from large images in the USGS National Map Urban Area Imagery collection, covering various urban areas around the United States. Each image measures 256 × 256 pixels, with a 0.3 m spatial resolution. There are 21 classes with 2100 images in total, 100 per class, including airplane, beach, buildings, freeway, golfcourse, tenniscourt, and so on. Table 9 demonstrates that the performance of our method based on models pretrained on PatternNet and NWPU RESISC45 is very close to that of supervised learning, and in the linear-classifier evaluation, our method even performs better.
AID [62] is an aerial image dataset with high resolution, 600 × 600 pixels, and a 0.5–0.8 m spatial resolution, consisting of sample images collected from Google Earth imagery. There are 30 classes with 10,000 images. This dataset is not evenly distributed; each class contains about 200 to 400 images. The scene classes include bare land, baseball field, beach, bridge, center, church, dense residential, desert, forest, meadow, medium residential, mountain, park, parking, playground, port, railway station, resort, school, and so on. This experiment shows that, for high-resolution remote-sensing images and an unevenly distributed remote-sensing dataset, our method also works well (see
Table 10).
PatternNet [63] is a remote-sensing dataset with 30,400 high-resolution images, whose pixel resolution is 256 × 256 and spatial resolution is 0.062–4.693 m. The images were collected for remote-sensing image retrieval from Google Earth imagery or via the Google Maps API for some US cities. There are 800 images for each of the 38 classes. From the results in Table 11, we see that, in the kNN evaluation, our method performs as well as supervised learning, while in the linear-classifier evaluation it performs even better.
NWPU RESISC45 [64] was created by Northwestern Polytechnical University (NWPU) for remote-sensing image scene classification (RESISC). It contains 31,500 images in total, with a pixel resolution of 256 × 256 and a 0.2–30 m spatial resolution. The dataset covers 45 scene classes with 700 images in each class, including baseball diamond, basketball court, bridge, chaparral, church, circular farmland, cloud, commercial area, dense residential, desert, intersection, island, lake, meadow, medium residential, mobile home park, wetland, and so on. The experiment in Table 12 shows that, for a dataset with more classes, our method still achieves excellent performance, very close to that of supervised learning.
From Table 6, Table 7, Table 8, Table 9, Table 10, Table 11 and Table 12, we see that, for pixel resolutions of remote-sensing images ranging from 64 × 64 to 600 × 600, i.e., from low to high resolution, spatial resolutions ranging from 0.062 m to 30 m, class numbers ranging from 5 to 45, and total numbers of images ranging from 533 to 31,500, our method can learn good representations that are close to, or even better than, those learned by supervised learning. These experiments demonstrate that our method can be widely applied to learning representations of remote-sensing images.
To investigate the effect of choosing different training and testing samples in the remote-sensing experiments, we conducted four more contrast experiments on the SIRI WHU Google dataset. This dataset contains only 2400 images, so random choices of training and testing samples may differ considerably. We randomly selected 80% of the images as training samples and the remaining 20% as testing samples four times, constructing four additional divisions of the SIRI WHU Google dataset. The performances on the five different divisions are given in Table 13, from which we see that different training and testing samples affect the performance of both supervised learning and our method. However, in all five cases, our method, without using any labels, can always learn good representations for remote-sensing images.