Clustering-Based Representation Learning through Output Translation and Its Application to Remote-Sensing Images

In supervised deep learning, learning good representations for remote-sensing images (RSI) relies on manual annotations. However, in the area of remote sensing, it is hard to obtain huge amounts of labeled data. Recently, self-supervised learning has shown an outstanding capability to learn image representations, especially via methods of instance discrimination. Compared with methods of instance discrimination, clustering-based methods view not only the transformations of the same image but also similar images as "positive" samples. In this paper, we propose a new clustering-based method for representation learning. We first introduce a quantity to measure the discriminativeness of representations, from which we show that an even distribution requires the most discriminative representations. This provides a theoretical insight into why evenly distributing the images works well. We notice that only the even distributions that preserve the representations' neighborhood relations are desirable. Therefore, we develop an algorithm that translates the outputs of a neural network to achieve the goal of evenly distributing the samples while preserving the outputs' neighborhood relations. Extensive experiments have demonstrated that our method can learn representations that are as good as or better than those of state-of-the-art approaches, and that our method performs computationally efficiently and robustly on various RSI datasets. The paper has been published in Remote Sensing: https://doi.org/10.3390/rs14143361.

Introduction
The self-supervised visual-representation learning paradigm, where "supervision" is created by making use of the intrinsic information contained in the data itself and prior knowledge about the world, provides us with an unsupervised way to learn representations of RSI. Recently, methods of instance discrimination [1][2][3][4][5][6] have achieved remarkable progress in unsupervised representation learning. These methods view every sample and its transformations as a "class", while, in clustering-based methods [7,8], samples grouped into the same cluster are considered a "class". In fact, data augmentations are often used in clustering-based methods and, thus, transformations of the same image are also viewed as the same "class". However, so far, methods of instance discrimination overall have better performances than clustering-based methods, which seems to imply that clustering-based methods could be further improved.
Degeneracy is a problem that nearly all clustering-related methods need to solve. The usual way to tackle it is to set constraints on the clustering process. Whether based on maximizing entropy [9,10] or adding regularization [11], all aim to distribute the samples evenly across the clusters. However, even distribution is not a necessary condition for preventing degeneracy; there is a lack of a clear theoretical interpretation for why even distribution works well in these clustering-related methods. On the other hand, clustering should be based on the neighborhood relations of the samples or those of their representations. To avoid degeneracy, an artificial intervention has to be introduced into the normal clustering process (in which clusters should be formed based on the relations amongst the samples). It is unclear whether the approaches used by previous methods to achieve even distribution completely preserve the neighborhood relations of the samples or those of their representations. By closely observing the Sinkhorn-Knopp algorithm used by [8], it is easy to see that the relations between neighboring samples, i.e., the distances between the outputs, have been changed by the intervention used to achieve even distribution.
In this paper, we first introduce a quantity that can be used to measure how discriminative representations are under a generic clustering setting where the clusters are formed based on the neighborhood relations, or similarity, of the images' representations. From this quantity, we show that evenly distributing the images into clusters requires the visual representations to be most discriminative. This provides an insight that clearly explains why evenly distributing the samples into clusters works well in previous related work [8,10]. However, we are only interested in the even distributions that preserve the model's smoothness, i.e., preserve the neighborhood relations or similarity of the images' representations, and, thus, we develop an algorithm that translates the outputs of a convolutional neural network to achieve the goal of evenly distributing the images while, at the same time, preserving the current model's smoothness. In summary, we make the following contributions:
• A quantity that can be used to measure the discriminativeness of visual representations is introduced, and a clear theoretical explanation for evenly distributing images into clusters is given from the viewpoint of this quantity.
• A new efficient and robust clustering-based representation learning algorithm, through translating the outputs of a convolutional neural network, is proposed.
• Our method performs as well as or better than the state-of-the-art approaches and works very well on a variety of RSI datasets.

Related Works
Generally, most self-supervised methods can be categorized into four categories: clustering-based methods, contrastive methods, pretext tasks and generative models.
Clustering-based methods learn visual representations by integrating network optimization and clustering. Clustering is a classical approach to unsupervised machine learning [12]. K-means, a standard clustering algorithm, was applied in DeepCluster [7] to group the features extracted from a deep neural network; subsequently, the cluster assignments were utilized as supervision to update the weights of the network. The Anchor Neighborhood Discovery (AND) method [13] exploits a class-consistent neighborhood for unsupervised learning, which combines the advantages of both sample-specificity learning and clustering while overcoming their disadvantages. By maximizing the information between the indices of inputs and labels, self-labeling [8] not only avoids the degeneracy problem but also creates pseudo-labels for training a deep neural network with a standard cross-entropy loss. Other earlier works combining clustering and deep learning can be found in [14][15][16][17][18].
The contrastive method [1][2][3][4][5][6] has recently shown remarkable performance on visual-representation learning. The contrastive method is a method of instance discrimination, which considers only the transformations of the same image as the "positive" samples, while all others are "negative" samples in the contrastive loss. Recently, some methods combining clustering and contrastive learning have appeared. In [19], the nearest neighbors are also considered as "positive" samples in the contrastive loss, and in [20], both instance-level and cluster-level contrastive losses are introduced. In the past year, contrastive methods have been extensively applied to the area of remote sensing [21][22][23][24].
A pretext task uses hand-crafted "supervision" to replace manual labels in supervised learning. This "supervision" is devised by exploiting the intrinsic information of the unlabeled data. Such pretext tasks include predicting context [25], solving jigsaw puzzles [26,27], image rotation and colorization [28][29][30][31], spatio-temporal consistency [32], and so on. Applications of this approach to RSI can be found in [33][34][35], which use domain knowledge and temporal prediction to supervise the training.
Generative models. The latent distributions in generative models can serve as visual representations, since they capture the distributions of the input data. These generative models mainly include Boltzmann machines [36][37][38], autoencoders [39] and generative adversarial networks (GAN) [40][41][42][43]. In [44], an improved generative adversarial network was applied to the super-resolution processing of RSI.

Method
The goal of this paper is to learn good visual representations for images through clustering them; that is, clustering is not the target but a means to learn visual representation. This section will introduce two essential aspects of good visual representations and, based on them, design our algorithm and training method.

Discriminativeness of Representations
In the representation feature space, the feature vectors should reveal, as much as possible, the differences between nonidentical samples; that is, good representations should be discriminative. To compare which representations are more discriminative, we need to be able to measure the discriminativeness of representations. In this section, based on the generic clustering model shown in Figure 1, we try to find a criterion that can be used to compare the discriminativeness of representations.
Consider a general clustering that groups N images, based on their representations in a feature space F, into k clusters by some method, for example, k-means. Samples in the same cluster are assigned the same pseudo-labels, which are then used with the standard cross-entropy loss to update the network. Since samples in the same cluster have closer representations and are treated indiscriminately, we can legitimately regard the representations of samples from the same cluster as indistinguishable. If the N representations in the space F are very discriminative, there will be few indistinguishable representations.
Therefore, the quantity used to compare the discriminativeness of representations should indicate the representations' degree of distinguishableness. To speak of representations as "distinguishable", at least two of them must be compared. Consequently, in this paper, we use the number of distinguishable pairwise comparisons of representations as the quantity to measure the discriminativeness of representations.
In different representation feature spaces, the clustering results will be different. Let n_i(F) be the number of samples assigned to cluster i in the feature space F; thus, N = n_1(F) + n_2(F) + · · · + n_k(F), where N is the total number of samples. The total number of pairwise comparisons of N representations is N(N − 1)/2, of which the number of indistinguishable pairwise comparisons is

N_ind(F) = Σ_{i=1}^{k} n_i(F)(n_i(F) − 1)/2, (1)

and the number of distinguishable pairwise comparisons is

N_dis(F) = N(N − 1)/2 − N_ind(F). (2)

Consider representations in two different feature spaces F_1 and F_2, and perform clustering in these two spaces. If N_dis(F_1) > N_dis(F_2), then the representations in F_1 ought to be more discriminative than those in F_2, because there are more distinguishable representations in F_1. Therefore, the quantity N_dis(F) can be used to measure the discriminativeness of the N samples' representations in a feature space F: the larger N_dis(F) is, the more discriminative the representations are. Several simple toy examples are given in Appendix A to elaborate the rationality of this quantification. From the viewpoint of N_dis(F), the degeneracy problem in clustering methods is caused by a small N_dis(F). For example, the extreme case N_dis(F) = 0 means that the representations are totally indistinguishable, which corresponds to the case where all samples are assigned to the same cluster. Therefore, increasing N_dis(F) can prevent the representations from degenerating. However, the question is: when will N_dis(F) reach its maximum?
Note that N_ind + N_dis = N(N − 1)/2. Thus, the maximal N_dis(F) corresponds to the minimal N_ind(F), since N and k are constants. Equation (1) can be rewritten as

N_ind(F) = (Σ_{i=1}^{k} n_i²(F) − N)/2.

Again, noting that N is a constant, N_ind achieves its minimal value when Σ_{i=1}^{k} n_i²(F) is minimal. According to the inequality of arithmetic and geometric means (AM-GM inequality) [45],

Σ_{i=1}^{k} n_i²(F) ≥ N²/k,

with equality if and only if n_1(F) = n_2(F) = · · · = n_k(F) = N/k, which is also the condition for N_dis to reach its maximum. For the cases when n̄ = N/k is not an integer, a rigorous proof is given in Appendix B that N_dis reaches its largest value when the standard deviation of the distribution of n_i(F) is minimal, i.e., when the N samples are distributed as uniformly (evenly) as possible among the k clusters.
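The two counting quantities above can be sketched in a few lines of Python (a minimal illustration; the function names are ours, not the paper's):

```python
from math import comb

def n_indistinguishable(counts):
    # N_ind(F): pairs of samples that fall into the same cluster
    return sum(comb(n, 2) for n in counts)

def n_distinguishable(counts):
    # N_dis(F) = N(N - 1)/2 - N_ind(F)
    N = sum(counts)
    return comb(N, 2) - n_indistinguishable(counts)

# 12 samples in k = 3 clusters: the even split gives the larger N_dis
print(n_distinguishable([4, 4, 4]))    # 48
print(n_distinguishable([10, 1, 1]))   # 21
```

As the example shows, the even split [4, 4, 4] yields more distinguishable pairs than the skewed split [10, 1, 1], in line with the AM-GM argument above.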
Remark 1. If N nonidentical images are most evenly distributed into k clusters according to their representations in feature space F, then N_dis(F) must be the maximum.
The fact stated in Remark 1 gives an explicit explanation for why even distribution works well in previous clustering-related methods, e.g., [8]. From the viewpoint of Remark 1, an even distribution corresponds to the most discriminative representations, which can be regarded as the extreme opposite of the degenerate cases. However, in previous works in the literature, even distribution was simply used as a tool to prevent the model from collapsing. In contrast, in this paper, we provide a fundamental insight that enables us to give a clear interpretation of even distribution and to use it to develop new techniques to improve the discriminativeness of representations.
Intuitively, "the most discriminative representations give the most uniform distribution" is correct only for special cases. It is important to emphasize that, in this paper, we measure the discriminativeness of representations by N_dis(F), which gives how many pairs of representations are distinguishable. Again, the rationality of this quantification can be seen from the simple examples in Appendix A.
For the given N and k, when n_i(F) is the most uniform distribution, the assignment of the input images to the k clusters has N_e possible combinations; assuming k divides N, this is the multinomial count

N_e = N!/((N/k)!)^k.

This means that the solution maximizing N_dis(F) is not unique. Therefore, the question now is how we shall approach the maximum of N_dis(F) in a way that is good for representation learning.
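The count N_e can be checked numerically (a sketch under the assumption above that the clusters are labeled and k divides N):

```python
from math import factorial

def even_assignments(N, k):
    # Ways to split N distinct images into k labeled clusters of N/k each
    # (assumes k divides N)
    nbar = N // k
    return factorial(N) // factorial(nbar) ** k

print(even_assignments(4, 2))    # 6
print(even_assignments(12, 3))   # 34650
```

Even for tiny N and k the count is large, which illustrates why the maximally discriminative assignment is far from unique.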

General-Purpose Prior: Smoothness
Good representations should be discriminative. However, at the end of Section 3.1, we showed that there are many ways to reach evenly distributed pseudo-labels that can be used to improve the discriminativeness of representations. We are only interested in the ones that are good for representation learning. Recall that the basic task of representation learning is to make similar samples have close representations while different samples have distinct representations. In other words, representation learning is the pursuit of a smooth learning model that satisfies

I(s) ≈ I(t) ⟹ f(I(s)) ≈ f(I(t)),

where f is the mapping function of the model to be learnt. Smoothness is a basic and general requirement of a learning model [46]. Therefore, only the pseudo-labels that are good for improving the smoothness of the model are desirable. Specifically, this requires that similar samples are more likely to be assigned the same pseudo-labels. Some model architectures have a certain inherent smoothness; for example, a randomly initialized AlexNet possesses some degree of smoothness [26]. Therefore, clustering based on the outputs or internal features extracted by the model is likely to group close samples together, thus assigning them the same pseudo-labels. Applying these pseudo-labels in the standard cross-entropy loss can enhance the smoothness of the model. Besides this, making the representations of transformations of the same image close to each other can also improve the smoothness of the model. Therefore, without any manual annotation, assigning pseudo-labels based on the neighborhood relations of the outputs or the internal representations of the network is a reasonable strategy. This provides us with an important constraint on approaching the even distribution of images in clusters: the neighborhood relations of the outputs or the internal representations of the network must be preserved.
Recall that evenly assigning the pseudo-labels is necessary because it can not only prevent the model from collapsing, but also improve the discriminativeness of representations.

Algorithm
This section proposes an effective algorithm that can produce evenly distributed pseudo-labels according to the outputs of the learning model while keeping the smoothness of the current model unaltered. We then use the pseudo-labels and a label-consistent-training (LCT) technique to improve the model's smoothness.
Model architecture. We consider a general convolutional neural network (CNN) model, as shown in Figure 1, for representation learning. One or several fully connected layers are attached at the end of the CNN, which output a k-dimensional vector indicating the assignment of the input samples to the k clusters. Specifically, the s-th input image I(s) is mapped to X(s) by the convolutional block φ, i.e., X(s) = φ(I(s)). By one or several fully connected layers, the feature X(s) in a D-dimensional space is then mapped to a k-dimensional space:

O(s) = (o_1(s), o_2(s), · · · , o_k(s)), (8)

where O(s) is the output of I(s) without the softmax operation (in this paper, "output" always refers to the final layer of the network before the softmax activation). The output dimension is set to be the same as the cluster number, such that images can be assigned to k clusters based on the rule: the input I(s) is assigned to cluster c_i if and only if o_i(s) is the maximal component,

I(s) → c_i  if and only if  o_i(s) = max_j o_j(s). (9)

Clearly, (9) is a winner-takes-all competition extensively used in unsupervised competitive learning [47]: the largest output wins the competition and the rest lose out.
Output translation. In most cases, if we perform clustering as in (9) based on the outputs (8), most input samples will be assigned to a few clusters. Consider a toy model: assigning eight points in a two-dimensional plane to four clusters based on the quadrant in which each point is located. Usually, based on the original locations of the points, they will not be evenly distributed across the four clusters; see, for example, the first graph of Figure 2 (each red dot represents a point; the boundaries in the diagram are the x and y axes). If the points need to be evenly distributed across the four clusters based on the same rule, the values of their coordinates have to be changed, and the question is how to change them.
If there is a constraint that their relative distances should be kept (the smoothness requirement), translating them as a whole is the most straightforward way to carry that out, as shown in the second graph of Figure 2. Considering N points in k dimensions and changing the clustering rule to Equation (9), the model then becomes what this paper needs to study; similarly, an even distribution of pseudo-labels can be reached by translating the outputs in k dimensions:

O'(s) = O(s) − T. (10)

Therefore, the core task of our algorithm is to find the translation T such that cluster assignment according to (9) using the new outputs O'(s) gives evenly distributed pseudo-labels. Obviously, T is a k-dimensional vector: T = (t_1, t_2, · · · , t_k). To find t_1, t_2, · · · , t_k, we borrow the basic idea from a competitive-learning (or self-organising-learning) algorithm called frequency-sensitive competitive learning (FSCL), widely used in the design of optimal vector quantisation [48,49]. If one component (or direction) "wins" too much in the competition (9), we move all output vectors away from this direction; if one direction "loses" the competition too often, we move all output vectors towards this direction. To perform this, we construct each direction's winning-frequency indicator in the feature space F. We count the number of samples assigned to every cluster c_i according to (9) and use this count as the cluster's winning-frequency indicator. For convenience, we subtract the mean n̄(F) = N/k from each cluster's count to obtain the frequency indicator

C = (n̄_1(F), n̄_2(F), · · · , n̄_k(F)), (11)

where n̄_i(F) = n_i(F) − n̄(F). C can be used as the winning-frequency indicator of all directions; e.g., the winning frequency of the first direction is n̄_1(F). Note that some n̄_i(F) are positive while others are non-positive. The vector of translation should be proportional to this indicator, T ∝ C. The frequency indicator determines the direction of translation.
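A single translation step can be sketched as follows (a toy illustration with made-up numbers; `alpha` here is an arbitrary step size, not the value that Equation (12) prescribes):

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 8, 4
outputs = rng.normal(size=(N, k))
outputs[:, 0] += 3.0                 # bias so that direction 0 "wins" too often

labels = outputs.argmax(axis=1)
C = np.bincount(labels, minlength=k) - N / k   # frequency indicator (11)

alpha = 0.5                          # illustrative step size
shifted = outputs - alpha * C        # move away from over-winning directions

# the common shift leaves all pairwise distances untouched
d_before = np.linalg.norm(outputs[0] - outputs[1])
d_after = np.linalg.norm(shifted[0] - shifted[1])
print(np.isclose(d_before, d_after))   # True
```

Because the same vector αC is subtracted from every output, pairwise differences between outputs are preserved exactly, which is the smoothness property the translation is designed to keep.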
The magnitude of the translation should be close to that of the output vectors and should change with the evenness of the frequency distribution: the more uniform the distribution, the smaller the translation magnitude should be. Therefore, the translation vector can be set as T = αC with

α = (o_max − o_min) · std/std_max, (12)

where "std_max" is the maximal standard deviation, corresponding to the case where all samples fall into the same cluster, and "std" is the standard deviation of the current distribution. o_max − o_min is the difference between the maximal and the minimal components over all output vectors, which gives the order of magnitude of α. The ratio "std/std_max" indicates the degree of unevenness: unevenness increases from 0 to 1 as "std" changes from 0 to its maximum. Noting that, in Equations (10) to (12), all quantities are derived from the outputs, no hyper-parameters are introduced.
To make the distribution of pseudo-labels as uniform as possible, the translation should be repeatedly performed, and α should decrease as iterations of translation increase and, thus, we introduce a decay rate β (see Algorithm 1). The pseudo-code of creating evenly distributed pseudo-labels by output translation is given in Algorithm 1.
Algorithm 1: Creating evenly distributed pseudo-labels by output translation
Input: N images I(1), I(2), · · · , I(N); cluster number k; decay rate β; α bound α_0
Output: Evenly distributed pseudo-labels
outputs = model(inputs);
labels = argmax(outputs);
C = count(labels) − N/k;
std = standard_deviation(C);
compute α by (12);
while α > α_0 and std > 0 do
    outputs = outputs − αC;
    labels = argmax(outputs);
    C = count(labels) − N/k;
    std = standard_deviation(C);
    α = α/β;
end

The translation (10) is just a simple linear transformation, that is, a smooth mapping. Therefore, translation (10) does not change the smoothness of the current model; that is, if O(s) ≈ O(t), then O'(s) ≈ O'(t). More algorithm analysis can be found in Appendix C, where we give a detailed comparison with the Sinkhorn-Knopp algorithm used in [8] in terms of motivation and technical details, and where we also give a clear mathematical illustration demonstrating how our algorithm preserves the distances between the outputs.
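A runnable NumPy sketch of Algorithm 1 is given below. The normalisation of C by its largest component is our own addition for numerical stability and is not part of the pseudo-code; otherwise the loop follows the algorithm (α computed from the outputs in the spirit of Equation (12), then decayed by β):

```python
import numpy as np

def translate_outputs(outputs, k, beta=1.5, alpha0=1e-15):
    """Iteratively translate outputs until the argmax cluster assignment
    is (nearly) even. Returns pseudo-labels and translated outputs."""
    out = outputs.copy()
    N = out.shape[0]

    def stats(o):
        labels = o.argmax(axis=1)
        C = np.bincount(labels, minlength=k) - N / k
        return labels, C, C.std()

    labels, C, std = stats(out)
    # std_max: the case where all N samples fall into one cluster
    worst = np.full(k, -N / k); worst[0] += N
    std_max = worst.std()
    alpha = (out.max() - out.min()) * std / std_max

    while alpha > alpha0 and std > 0:
        direction = C / np.abs(C).max()   # normalisation (our assumption)
        out -= alpha * direction          # translation step, Eq. (10)
        labels, C, std = stats(out)
        alpha /= beta                     # decay
    return labels, out

rng = np.random.default_rng(0)
raw = rng.normal(size=(200, 4))
raw[:, 0] += 3.0                          # heavily biased toward cluster 0
labels, shifted = translate_outputs(raw, k=4)
print(np.bincount(labels, minlength=4))   # far more even than before
```

Since every iteration subtracts one common vector from all outputs, the final translated outputs differ from the originals by a single overall shift, so pairwise differences are preserved exactly.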
Model update. When the translation of the outputs is completed, the s-th input sample I(s) is assigned to cluster c_i according to (9), based on the final translated outputs O'(s) = (o'_1(s), o'_2(s), · · · , o'_k(s)):

y(s) = argmax_i o'_i(s), (16)

where y(s) is the pseudo-label of I(s). After reaching uniform labelling, we can update the weights of the network as in standard supervised learning via the standard cross-entropy loss

ℓ = −(1/N) Σ_{s=1}^{N} log( exp(o_{y(s)}(s)) / Σ_{i=1}^{k} exp(o_i(s)) ),

which is established from the pseudo-labels y(s) and the outputs before translation, O(s).
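The update step can be sketched as follows (a NumPy stand-in for the network's loss; in practice the loss and gradients would be computed by the deep-learning framework):

```python
import numpy as np

def pseudo_label_cross_entropy(outputs, translated_outputs):
    """Mean cross-entropy between pseudo-labels taken from the
    translated outputs O'(s) and the un-translated outputs O(s)."""
    y = translated_outputs.argmax(axis=1)           # Eq. (16)
    z = outputs - outputs.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(y)), y].mean()

rng = np.random.default_rng(0)
O = rng.normal(size=(16, 4))
O_translated = O - np.array([0.8, -0.2, -0.3, -0.3])  # some translation T
loss = pseudo_label_cross_entropy(O, O_translated)
print(loss > 0)   # True
```

Note the asymmetry: the pseudo-labels come from the translated outputs, while the softmax probabilities being penalised come from the original outputs, exactly as in the model-update rule above.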
The labels generated by (16) from different versions of the same sample may not be consistent. For example, the grayscale image and the color image of the same sample will give different outputs, which may label the same image differently according to (16). To remove this discrepancy, we propose a label-consistent-training (LCT) method, in which we construct the loss function as a sum of cross-entropy losses over pairs of transformations, of the form

L_LCT = Σ_s [ ℓ( y^(a)(s), O^(b)(s) ) + ℓ( y^(b)(s), O^(a)(s) ) ],

where ℓ is the cross-entropy loss and a, b index different transformations of the same image I(s). If the labels from different versions of the same sample are not consistent, this loss is hard to decrease. Thus, decreasing this loss drives different versions of the same sample to give the same prediction, i.e., the same pseudo-label, which makes the model smoother.
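A self-contained sketch of the LCT idea (our reconstruction; the exact form of the loss should be checked against the published paper): each augmented view is scored against the pseudo-label derived from the other view, so inconsistent labels keep the total loss high.

```python
import numpy as np

def ce(outputs, labels):
    # standard cross-entropy on raw (pre-softmax) outputs
    z = outputs - outputs.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(labels)), labels].mean()

def lct_loss(out_a, out_b):
    """Sum of cross-entropies with swapped pseudo-labels for two
    augmented views a and b of the same batch of images."""
    y_a, y_b = out_a.argmax(axis=1), out_b.argmax(axis=1)
    return ce(out_a, y_b) + ce(out_b, y_a)

rng = np.random.default_rng(0)
view_a = rng.normal(size=(8, 4))        # outputs for one augmentation
view_b = rng.normal(size=(8, 4))        # outputs for another augmentation
print(lct_loss(view_a, view_b))
```

The loss is symmetric in the two views by construction, and it is minimised only when both views agree on the pseudo-label of every sample.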

Experimental Results
In this section, we first give our experimental results on several datasets often used in computer vision, to compare with the state of the art, and then we conduct experiments on various remote-sensing image datasets. Recall that our method is an unsupervised representation-learning method whose target is to learn good visual representations. Therefore, the ground-truth labels are never used in the training phase, only in the evaluation phase, and measuring the performance of our method amounts to measuring the quality of the representations it learns. To evaluate the learned representations, we consider a non-parametric and a parametric classifier: a weighted kNN (k-nearest neighbors) classifier and a linear classifier [50]. The evaluation process is supervised, that is, the labels are used, but only the CNN layers or the layers before the final layer are retained and frozen. For the weighted kNN evaluation, we take k = 50, σ = 0.1 and an embedding size of 128. For all experiments, we set the decay rate to β = 1.5 and α_0 = 10^−15 in Algorithm 1.
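For reference, the weighted kNN protocol (following the evaluation of [50]) can be sketched as below; the function and variable names are ours, and a tiny k is used for the synthetic example:

```python
import numpy as np

def weighted_knn_predict(train_feats, train_labels, test_feats,
                         n_classes, k=50, sigma=0.1):
    """Weighted kNN: cosine similarities of the top-k neighbours are
    turned into weights exp(sim / sigma) and accumulated per class."""
    # L2-normalise so the dot product is the cosine similarity
    tr = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    te = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sims = te @ tr.T                                  # (n_test, n_train)
    topk = np.argsort(-sims, axis=1)[:, :k]
    preds = []
    for i, idx in enumerate(topk):
        w = np.exp(sims[i, idx] / sigma)
        scores = np.zeros(n_classes)
        np.add.at(scores, train_labels[idx], w)       # weighted class vote
        preds.append(scores.argmax())
    return np.array(preds)

# two well-separated classes in a toy 2-D feature space
rng = np.random.default_rng(0)
train = np.vstack([rng.normal([5, 0], 0.1, (20, 2)),
                   rng.normal([-5, 0], 0.1, (20, 2))])
train_y = np.array([0] * 20 + [1] * 20)
test = np.array([[4.9, 0.1], [-5.1, -0.1]])
preds = weighted_knn_predict(train, train_y, test, n_classes=2, k=5)
print(preds)   # [0 1]
```

In the paper's setting, `train_feats` and `test_feats` would be the frozen embeddings (size 128), with k = 50 and σ = 0.1 as stated above.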

Comparison to the State-of-the-Art Approaches
To conveniently compare with other representation-learning methods, in this part our experiments were conducted on a large-scale dataset, ImageNet LSVRC-12 [51], and three smaller-scale datasets: CIFAR-10/100 and SVHN. ImageNet LSVRC-12 contains around 1.28 million training pictures of 1000 classes and 50,000 validation pictures. There are 50,000 training images and 10,000 test images in the CIFAR-10/100 datasets, whose numbers of classes are 10/100 and whose pixel resolution is 32 × 32. SVHN is a real-world image dataset similar to MNIST (images of digits), with 73,257 images (32 × 32) for training and 26,032 images (32 × 32) for testing. All the ground truths of these datasets were used only in the evaluations of the representations. The implementation details are given in Appendix D.
To compare fairly with previous work, we set the same cluster numbers k for these datasets as [8]. For CIFAR-10, the cluster number was set to 128; therefore, the mean number of samples in each cluster is 50,000/128. According to Remark 1, although this number is not an integer, setting it as the target still works: the number of samples assigned to each cluster is made as close as possible to the mean. The cluster number was set to 128 for SVHN and 512 for CIFAR-100 when the experiments were conducted with AlexNet. Table 1 gives the kNN and linear-classifier evaluations of representations learned from CIFAR-10/100 and SVHN through our method, OTL (output translation). The kNN evaluations were performed on the fully connected layer before the final layer, with all layers except the final layer frozen. For the linear classifier, all the fully connected layers were discarded and the retained CNN layers were frozen; the last CNN layer was then followed by a new, randomly initialized fully connected layer whose weights were updated by supervised learning. With simple augmentations, the performance of our method reached the state of the art. When strong augmentations and LCT (label-consistent training) were employed, our method outperformed the previous works in the literature. For SVHN, the performance of our method is very close to that of supervised learning, especially for the linear classifier, where the accuracy is only 0.1% lower than in supervised learning. When one assumes that the actual number of classes in the data is known and sets k equal to it, this can be regarded as giving the model some prior knowledge. Such prior knowledge can help improve the performance of our OTL algorithm, especially for the linear classifier on SVHN, where the accuracy is the same as that of fully supervised learning.
In Table 2, the experiments were run on ResNet-50, and, to compare fairly with other methods, k = 128 was set for all experiments on ResNet-50. Both evaluations, kNN and linear classifier, were conducted on the last CNN layer. Note that Instance [53], SimCLR [6] and ISL [54] are all methods of instance discrimination. This table shows that clustering-based representation-learning methods can learn representations that are as good as those of instance discrimination, or even better. Compared to Table 1, we find that the performances on ResNet-50 are overall better than on AlexNet, which is reasonable and predictable. To demonstrate the effectiveness of our method on a large-scale dataset and the speed of our algorithm in dealing with a huge number of images, i.e., over one million, we also conducted experiments on ImageNet with AlexNet. As can be seen from Table 3, our method, even without strong augmentations and LCT, achieves state-of-the-art performance in the evaluations of both the linear and kNN classifiers. In our model, comparing the representations from the five CNN layers, the representations from the last two layers perform better than those from the first three layers.

Impact of Even Pseudo-Label Distribution on Performances
We have to emphasize, again, that pseudo-labels are evenly distributed in order to make the representations highly discriminative. Besides the even distribution, our algorithm can conveniently produce various unevenly distributed pseudo-labels, which can be done by replacing the n̄(F) in the frequency indicator C (11) with the desired distribution.
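Concretely, the only change is the target vector subtracted from the cluster counts (a sketch; the target numbers here are made up):

```python
import numpy as np

N, k = 1000, 4
target = np.array([400, 300, 200, 100])   # desired uneven distribution, sums to N
labels = np.random.default_rng(1).integers(0, k, size=N)
counts = np.bincount(labels, minlength=k)

# even case:   C = counts - N/k       (Eq. 11)
# uneven case: C = counts - target
C = counts - target
print(C.sum())   # 0, since counts and target both sum to N
```

The translation loop is unchanged; it now drives the counts toward `target` instead of toward the uniform value N/k.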
Unevenly distributed pseudo-labels. To investigate the influence of the unevenness of the pseudo-labels' distribution, various uneven target distributions of pseudo-labels were set, as shown in Figure 3 (for more details, see Appendix E), for training on CIFAR-10/100. The kNN evaluations for the different target distributions are given in Table 4 (these models were trained on AlexNet with fewer epochs than in Table 1 and without strong augmentations or the LCT technique). As one can see, the accuracy of kNN decreases as evenness decreases. This is consistent with the argument made in the Method section that evenly distributed pseudo-labels are good for representation learning.

Table 3: Evaluation on ImageNet. "*" denotes training on a larger AlexNet. "3k × 10/1" denotes 3000 clusters and 10/1 heads (10/1 fully connected layers attached at the end of the architecture). The best results are highlighted in bold.

Unevenly distributed datasets. The standard datasets used in this paper are evenly distributed, i.e., each class contains the same number of images. However, the effectiveness of our method does not depend on the data being evenly distributed. To show this, we considered five unevenly distributed datasets made from CIFAR-10 by deleting different numbers of images from different classes. As a comparison, five evenly distributed datasets were made, each with the same number of samples as the corresponding unevenly distributed one. In Table 5 (again with fewer epochs and without strong augmentations or the LCT technique), for the unevenly distributed "D 100", we deleted 0, 100, 200, · · · , 900 images from classes 0, 1, 2, · · · , 9, while the evenly distributed "D 100" has the same total number of images but an evenly distributed ground truth. Similarly, for the unevenly distributed "D 200", we deleted 0, 200, 400, · · · , 1800 images from classes 0 to 9.
For the unevenly distributed dataset "D 500", the number of images in the first class is ten times the number of images in the last class.
We can see from Table 5 that our method is almost unaffected by whether the actual data distribution is even or not. As can be expected, evenly distributed data leads to slightly better performances than that on the unevenly distributed data. This experiment demonstrates that the success of our algorithm does not rely on the even distribution of the datasets.

Application to Remote-Sensing Images
For all remote-sensing datasets, we divided the images into two groups, the training set and the testing set; that is, we randomly picked around 80% of the images for training and the rest for testing. All training involved only the training sets, without using any labels. For all datasets, we employed the same backbone, i.e., ResNet-18. All the remote-sensing images were resized to 256 × 256 and then cropped to 224 × 224, and for the strong augmentations we used the strategy from [10] with Augment = 10, n_holes = 1, and length = 16. In this part, we apply our method to seven remote-sensing image datasets: HMHR (How to Make High Resolution Remote Sensing Image dataset), EuroSAT, SIRI-WHU google, UCMerced LandUse, AID, PatternNet, and NWPU-RESISC45.
For the very small datasets (HMHR, SIRI WHU google, UCMerced LandUse), we used pretrained models. The training for these three datasets ran for 64 epochs, with an initial learning rate of 0.03 that dropped to 0.003 and 0.0006 at epochs 32 and 48. For training without a pretrained model, the total number of epochs was 300, and the learning rate was initially 0.03 and dropped to 0.003 and 0.0006 at epochs 160 and 240. For all training, the momentum was set to 0.9, and for all datasets, k was set to 128.

HMHR is a dataset made from Google Maps via LocaSpace Viewer. It contains only 533 images in five classes: building, farmland, greenbelt, wasteland, and water. The pixel resolution and spatial resolution of the remote-sensing images are 256 × 256 and 2.39 m. To prevent the network from overfitting, we used the strong augmentations [10] and pretrained models trained on other remote-sensing image datasets: PatternNet, NWPU RESISC45, and EuroSAT. All the pretrained models were trained with our method as discussed in the Methods section, and no labels were used in the training. For supervised training, we only considered the model pretrained on PatternNet. From Table 6, we see that our method works well on very small datasets with few classes. There is only a small gap (2.8%) between the supervised method and ours in the linear-classifier evaluation. Although different pretrained models give different performances, all of them are effective and perform well.

EuroSAT [58,59] is a dataset for land-use and land-cover classification. It consists of 10 classes with 27,000 labeled images of annual crop, forest, highway, river, sea lake, and so on. The pixel resolution and spatial resolution of these images are 64 × 64 and 10 m. Table 7 shows that the gap between the supervised method and ours is only 1% in the linear-classifier evaluation. In the kNN evaluation, the performance of our method is also very close to that of supervised learning.
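The step schedule described above can be written as a small helper; the function name is ours, and the milestones follow the two settings just described (pretrained: 64 epochs with drops at 32 and 48; from scratch: 300 epochs with drops at 160 and 240):

```python
def lr_at_epoch(epoch, pretrained=True):
    """Step learning-rate schedule for the RSI experiments:
    0.03 initially, then 0.003 and 0.0006 at the two milestones."""
    milestones = (32, 48) if pretrained else (160, 240)
    if epoch < milestones[0]:
        return 0.03
    if epoch < milestones[1]:
        return 0.003
    return 0.0006
```

Note that the second drop (0.003 to 0.0006) is a factor of 5, not 10, exactly as stated in the text.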
The SIRI WHU google dataset [60] contains 2400 images in 12 classes, with 200 images per class. The classes include agriculture, commercial, harbor, idle land, industrial, meadow, overpass, park, pond, river, water, and residential. The pixel resolution and spatial resolution of the images in this dataset are 200 × 200 and 2 m. As can be seen from Table 8, using the model pretrained on NWPU RESISC45 gives the best performance, which is even better than that of supervised learning based on the model pretrained on PatternNet. Representations learned from the model pretrained on PatternNet perform close to those of supervised learning.

UCMerced LandUse [61] provides images that were manually extracted from large images in the USGS National Map Urban Area Imagery collection for various urban areas around the country. Each image measures 256 × 256 pixels, with a 0.3 m spatial resolution. There are 21 classes with 2100 images (100 per class), including airplane, beach, buildings, freeway, golfcourse, tenniscourt, and so on. Table 9 demonstrates that the performance of our method based on models pretrained on PatternNet and NWPU RESISC45 is very close to supervised learning, and in the linear-classifier evaluation, our method even performs better.

AID is a high-resolution aerial image dataset, with 600 × 600 pixels and a 0.5–0.8 m spatial resolution, consisting of sample images collected from Google Earth imagery. There are 30 classes with 10,000 images. This dataset is not evenly distributed; each class contains about 200 to 400 images. The scene classes include bare land, baseball field, beach, bridge, center, church, dense residential, desert, forest, meadow, medium residential, mountain, park, parking, playground, port, railway station, resort, school, and so on. This experiment shows that our method also works well for high-resolution remote-sensing images and an unevenly distributed remote-sensing dataset (see Table 10).
PatternNet [63] is a remote-sensing dataset with 30,400 high-resolution images, whose pixel resolution is 256 × 256 and spatial resolution is 0.062–4.693 m. These images were collected for remote-sensing image retrieval from Google Earth imagery or via the Google Maps API for some US cities. For each of the 38 classes, there are 800 images. From the results in Table 11, we see that, in the kNN evaluation, our method performs as well as supervised learning, while in the linear-classifier evaluation, our method performs even better.

NWPU RESISC45 [64] was created by Northwestern Polytechnical University (NWPU) and is available for remote-sensing image scene classification (RESISC). It contains 31,500 images in total, with a pixel resolution of 256 × 256 and a 0.2–30 m spatial resolution. The dataset covers 45 scene classes with 700 images in each class, including baseball diamond, basketball court, bridge, chaparral, church, circular farmland, cloud, commercial area, dense residential, desert, intersection, island, lake, meadow, medium residential, mobile home park, wetland, and so on. The experiment in Table 12 shows that, for a dataset with more classes, our method still performs excellently, very close to supervised learning.

From Tables 6 to 12, we see that, for remote-sensing images with pixel resolutions ranging from 64 × 64 to 600 × 600, i.e., from low to high resolution, spatial resolutions ranging from 0.062 m to 30 m, class numbers ranging from 5 to 45, and total numbers of images ranging from 533 to 31,500, our method can learn good representations that are close to, or even better than, the representations learned by supervised learning. These experiments demonstrate that our method can be widely applied to learn representations of remote-sensing images.
To investigate the effect of choosing different training and testing samples in the remote-sensing experiments, we conducted four more contrast experiments on the SIRI WHU google dataset. This dataset contains only 2400 images, and thus random choices of training and testing samples may differ considerably. We randomly selected 80% of the images as training samples and 20% as testing samples four more times, constructing four additional divisions of the dataset. The performances on the five different divisions are given in Table 13. From Table 13, we see that different training and testing samples affect the performances of both supervised learning and our method. However, in all five cases, our method, without any labels, can always learn good representations for remote-sensing images.

Figure 4: For k = 128, the performance of our method with various decay rates β in Algorithm 1. The x axis is the number of iterations and the y axis is the standard deviation of the distribution of pseudo-labels. In (a,c), the method is applied to the outputs of CIFAR-10 from a randomly weighted AlexNet; in (b,d), the method is applied to randomly created N × k matrices.
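A sketch of the repeated random divisions used for this contrast experiment; the helper name and the seeds are illustrative, since the paper does not specify its splitting code:

```python
import random

def random_split(n, train_frac=0.8, seed=0):
    """One random train/test division of n images,
    as used for SIRI WHU google (2400 images, 80%/20%)."""
    rng = random.Random(seed)
    idxs = list(range(n))
    rng.shuffle(idxs)
    cut = int(n * train_frac)
    return idxs[:cut], idxs[cut:]

# five divisions: the original split plus four contrast experiments
splits = [random_split(2400, seed=s) for s in range(5)]
```

Each division keeps the 80/20 ratio fixed; only the membership of the two sets changes with the seed.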

Discussion
To study the robustness of our algorithm, we consider how the two parameters, the cluster number k and the decay rate β (see Algorithm 1), affect its performance. For given N, k, and α0, the decay rate is the only free parameter in our algorithm; it has to be larger than one so that the magnitude of the translations decreases as iterations increase. We conducted two groups of experiments, with β ranging from 1.5 to 6 and from 5 to 50 (Figure 4), and the results show that our algorithm can effectively create evenly distributed pseudo-labels over these wide ranges of β values, indicating that it is insensitive to this free parameter. In Figure 5, we demonstrate that our method is not only effective but also efficient (it needs only around 20 iterations to converge) for various settings of the cluster number k. In Tables 14 and 15 (without LCT and strong augmentations), we show the kNN evaluations for the representations learned using different k, which demonstrates that our method works well for k ranging from 32 to 2048.

When k = N, our method becomes the same as instance discrimination. Increasing k makes representations more discriminative, while decreasing k brings representations closer together. Representation learning requires the representations of different samples to be as distinct as possible while the representations of similar samples get closer. Therefore, very large or very small k are not considered in this paper: a small k would not give highly discriminative representations, while a large k would only consider the nearest neighbors as the same class, or even only the transformations of the same sample, as instance discrimination does.

The comparisons between our method and the Sinkhorn-Knopp algorithm are given in Figure 6. The Sinkhorn-Knopp algorithm is a classical method that has been widely used, especially in transport problems.
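To make the role of the decay rate concrete, here is a simplified, illustrative version of the translation idea: every row has the same k-vector subtracted, and the step size decays by a factor of β each iteration. The proportional update rule and the parameter values are our simplification for illustration, not the paper's exact Algorithm 1:

```python
import numpy as np

def even_translate(outputs, n_iters=20, alpha0=2.0, beta=2.0):
    """Illustrative sketch of output translation: subtract the same k-vector t
    from every row so that argmax-based cluster sizes approach N / k.
    beta > 1 makes the translation magnitude decay over iterations."""
    M = np.asarray(outputs, dtype=float)
    N, k = M.shape
    target = N / k
    t = np.zeros(k)
    alpha = alpha0
    for _ in range(n_iters):
        labels = np.argmax(M - t, axis=1)
        counts = np.bincount(labels, minlength=k)
        # raise t_j for over-full clusters (lowering their scores), and vice versa
        t += alpha * (counts - target) / N
        alpha /= beta  # decaying step size controlled by the decay rate
    return M - t

rng = np.random.default_rng(0)
M = rng.standard_normal((2000, 10))
M[:, 0] += 2.0  # bias column 0 so cluster 0 is heavily over-full
before = np.bincount(np.argmax(M, axis=1), minlength=10)
after = np.bincount(np.argmax(even_translate(M), axis=1), minlength=10)
```

On this toy matrix, the translated outputs yield a visibly more uniform pseudo-label distribution (lower standard deviation of cluster sizes) than the raw argmax labels.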
For 20 different k values and different N × k matrices, our method always obtained more uniform pseudo-labels (see Figure 6).

Figure 6: For N = 50,000, comparisons between the Sinkhorn-Knopp algorithm and our algorithm. We consider 20 different k (50, 100, 150, · · ·, 1000) for two kinds of N × k matrices: one is the output of a randomly initialized AlexNet on CIFAR-10; the other is a randomly created N × k matrix. In both (a,b), for all k, our algorithm comes closer to the even distribution.
The main advantage of the Sinkhorn-Knopp algorithm is its speed of convergence: for creating pseudo-labels from the outputs of ImageNet, it converges within 2 min [8]. In Figure 7, we give the standard deviations of the pseudo-labels for ImageNet and the time our method takes to produce them at each epoch. As can be seen from Figure 7a, the pseudo-labels created by our method are very uniform at all epochs, with an average standard deviation of only 1.07. From Figure 7b, we see that our algorithm converges very quickly even for 1.28 million pictures and 3000 clusters: the average time for approaching the even distribution is only 12.68 s on a GPU (NVIDIA A100).

Conclusions
In this paper, under a clustering model where images are clustered based on the representations of the images, we found a quantity that can be used to compare the discriminativeness of the representations and, based on this quantity, we demonstrated that evenly distributing images into clusters requires their representations to be most discriminative. We developed an algorithm to translate outputs such that pseudo-labels can be created evenly according to representations of images while keeping the smoothness of the model unchanged. Extensive experimental results demonstrate that our method is very effective and can learn representations that are as good as or better than the state-of-the-art methods. We then applied our method to various remote-sensing image datasets and the representations learned by our method are very close to, or even better than, the representations learned by supervised learning.

A Examples for Measuring the Discriminativeness of Representations A.1 Four Images Clustering
In this section, we give a simple toy model: four images clustered through the model in Figure 1. If we set the cluster number to four, all five possible ways to cluster are given in Table 16. Case 1 in Table 16 (four images in cluster C1 and no images assigned to the other clusters) corresponds to the degenerate solution, which is given by the most indiscriminative representations. This case gives the most indistinguishable pairwise comparisons, namely all 6. The number of indistinguishable pairwise comparisons decreases from case 1 to case 5. The last case has the most distinguishable pairwise comparisons, which requires the representations to be the most discriminative. This is reasonable when we remember that all the images are different: the most discriminative representations should reveal the most differences between the images. In fact, the last case is exactly what instance discrimination does. If the cluster number is set to three or two, all the possible ways to group the images are given in Tables 17 and 18, respectively. Similarly, basing the clustering on the most indiscriminative visual representations gives Case 1, which has the most indistinguishable pairwise comparisons. If we use the number of distinguishable pairwise comparisons to measure the discriminativeness of the visual representations, the three tables are consistent: Cases 1 to 4 in Table 16 are the same as Cases 1 to 4 in Table 17; they have the same numbers of distinguishable pairwise comparisons, and these numbers all increase from Case 1 to Case 4. Likewise, Cases 1 to 3 in Table 17 are the same as Cases 1 to 3 in Table 18, with the same numbers of distinguishable pairwise comparisons increasing from Case 1 to Case 3. Therefore, this quantity, the number of distinguishable pairwise comparisons, clearly does not change with the setting of the cluster number; it depends only on how the images are grouped.
The cluster number only affects the maximum possible number of distinguishable pairwise comparisons. Note that, under a given rule (for example, k-means), the way the images are grouped is completely determined by their representations, so this quantity is in fact determined by the representations. Specifically, its value changes with the discriminativeness of the images' representations: the more discriminative the representations, the larger this quantity. Therefore, the number of distinguishable pairwise comparisons can be used to measure the discriminativeness of the images' representations through the model shown in Figure 1.
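The counting argument above is easy to state in code: for cluster sizes n_1, …, n_k over N images, the indistinguishable pairs are those within a cluster, so the distinguishable pairs number C(N,2) − Σ_i C(n_i,2). A minimal sketch (the function name is ours):

```python
from math import comb

def distinguishable_pairs(cluster_sizes):
    """Distinguishable pairwise comparisons for a grouping:
    all pairs minus the pairs that fall inside the same cluster."""
    n = sum(cluster_sizes)
    return comb(n, 2) - sum(comb(s, 2) for s in cluster_sizes)

# Four images, k = 4 (Table 16): case 1 vs. case 5
degenerate = distinguishable_pairs([4, 0, 0, 0])  # all in one cluster
instance = distinguishable_pairs([1, 1, 1, 1])    # one image per cluster
```

The even grouping [2, 2] gives more distinguishable pairs than the uneven [3, 1], consistent with the claim that the even distribution demands the most discriminative representations.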

A.2 Specific Toy Examples
Let us consider the clustering of the images in Figure 8 with the cluster number set to three. Semantically, the best clustering is to group these images into two clusters: the three cats c1, c2, c3 in one cluster and the three dogs d1, d2, d3 in another. However, according to Remark 1, to make the representations of these images the most discriminative, we should group them into three clusters, that is, assign two images to each cluster. Assume that, based on the representations in two feature spaces F1 and F2, the six images are clustered through k-means as: Note that images are assigned to the same cluster because their representations are close, i.e., their representations are not distinguishable enough. On the other hand, in the representation-learning model shown in Figure 1, images in the same cluster are assigned the same pseudo-labels, which makes the representations of images in the same cluster closer. Therefore, we regard the representations of images in the same cluster as indistinguishable. Based on this assumption, we can list the pairs of distinguishable representations in F1 and F2 as: To compare the representations' degrees of distinguishability in F1 and F2, we neglect the pairs common to both lists and have: These two lists indicate that the representations of c3 and d1 are distinguishable in F1 but indistinguishable in F2, while the representations of c1 and c3, c2 and c3, d1 and d2, and d1 and d3 are distinguishable in F2 but indistinguishable in F1. From the viewpoint of the two lists above, it is reasonable to assume that the representations in F2 are more discriminative than those in F1. We emphasize that we are not performing semantic classification and that every image is different, although some of them belong to the same semantic class.
The purpose of evenly distributed pseudo-labels is to make the representations of images as discriminative as possible for a fixed k. Theoretically, as all the images are different, setting the cluster number equal to the number of images (i.e., k = N) makes the representations the most discriminative. However, although discriminativeness is one of the most important targets in representation learning, it is not the only one. Good representations should also reveal the similarity of the images: close images should have close representations, and different images very different representations. Therefore, in methods of instance discrimination, both "positive" and "negative" samples are very important. In instance discrimination, only the transformations of the same image are regarded as "positive", but in clustering-based methods, i.e., k < N, similar images are also regarded as "positive", which usually cannot be obtained only from the transformations of a single image. Therefore, setting k = N may not be the best choice for visual-representation learning.
The example above only illustrates that grouping images into more clusters helps find more discriminative representations. Let us consider another example: clustering the three cats c1, c2, c3 and one dog d1 into two clusters. Consider the representations in F1 and F2 as: Although the representations of c3 and d1 in F2 are not distinguishable, the representations of c1 and c3, and of c2 and c3, are distinguishable. For this example, from the perspective of the two lists above, we can still reasonably assume that the representations in F2 are more discriminative. Again, every image is different; finding representations that can reveal the differences among images is one of the basic tasks of representation learning.

C Algorithm Analysis
Compared with similar work, e.g., self-labeling [8], our method has some differences. First of all, the motivation for introducing evenly distributed pseudo-labels is different. In self-labeling [8], even distribution is an assumption, introduced as a constraint to prevent models from degenerating. In this paper, we create uniform pseudo-labels to make the images' representations as discriminative as possible. This is a genuine difference: even distribution is not the only way to prevent degeneracy, but to make representations as discriminative as possible, evenly distributed pseudo-labels are the best choice according to Remark 1.
Secondly, the methods of approaching the even distribution are very different. The technique used in [8] comes from a classical algorithm for the optimal transport problem, the Sinkhorn-Knopp algorithm, while our method is based on the basic idea of frequency-sensitive competitive learning. In the Sinkhorn-Knopp algorithm, the manipulation of the N × k output matrix M_O is multiplication, while our algorithm involves only addition and subtraction, which makes it computationally much more efficient. Last but not least, our method treats every output as a whole when making changes, that is, it subtracts the same k-dimensional vector from every row (see (10)), which is very important since it preserves the neighborhood relation of any two outputs. To see this clearly, consider two arbitrary rows of M_O (i.e., two arbitrary outputs), say the s1-th and s2-th rows. Since the same vector is subtracted from both rows, their difference, and hence their Euclidean distance, remain unchanged; this is the content of the equalities (23) and (24). Thus, iteratively performing the translation (10) does not change the difference or the Euclidean distance between any two output vectors. This invariance is consistent with the smoothness condition (15).
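The invariance can be checked directly: subtracting the same vector from two rows leaves their difference, and hence their Euclidean distance, unchanged. A minimal sketch with toy numbers (cf. (23) and (24); the values are illustrative):

```python
import math

def translate(rows, t):
    """Subtract the same translation vector t from every row, as in (10)."""
    return [[x - ti for x, ti in zip(row, t)] for row in rows]

def dist(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

M = [[0.2, 1.5, -0.3], [1.1, -0.4, 0.7]]  # two outputs, k = 3
t = [0.5, -1.0, 2.0]
M2 = translate(M, t)
# Row difference and Euclidean distance are preserved; a Sinkhorn-style
# per-column rescaling would not preserve them in general.
```

This is exactly why the translation keeps the neighborhood relations of the outputs intact while still moving the pseudo-label distribution toward uniformity.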
To compare our algorithm clearly with the Sinkhorn-Knopp algorithm used in [8], we give the specific form of the outputs after performing both algorithms, where r1, · · ·, rk and c1, · · ·, ck are constants. Obviously, in general, transformations of the form (26) cannot preserve the equalities (23) and (24).

D Implementation Details
In the experiments conducted on AlexNet: All images were resized to 256 × 256 and then cropped to 224 × 224. Several augmentations were applied to the inputs, such as color jitter, random grayscale, and so on. We trained all datasets with a batch size of 128 and an initial learning rate of 0.05. We trained CIFAR-10/100 for 1600 epochs in total, with the pseudo-labels updated around every two epochs; the learning rate dropped twice, by a factor of 0.1, at epochs 960 and 1280. We trained SVHN for 400 epochs in total, with the labels updated nearly every epoch; the learning rate dropped twice, by a factor of 0.1, at epochs 240 and 320. We trained ImageNet for 450 epochs in total, with the labels updated nearly every epoch; the learning rate dropped three times, by a factor of 0.1, at epochs 160, 300, and 380. As the types of images in CIFAR-10 and CIFAR-100 are very similar, we used the same augmentation strategy for both, including random cropping, color jitter, horizontal flipping, and so on. For SVHN, we did not use horizontal flipping, since flip transformations are improper for digit learning. For the strong augmentations, we used the strategy of [10] with Augment = 8, n holes = 1, and length = 16. We used SGD to optimize the network.
In the experiments conducted on ResNet-50: All datasets kept their original image size, and all experiments used strong augmentations. For all experiments, k was set to 128. The total number of training epochs was 300; the learning rate was initially 0.01 and then dropped to 0.001 and 0.0005 at epochs 160 and 240. The other parameters were the same as in the experiments on AlexNet.

E Unevenly Distributed Pseudo-Labels
To observe the effects of the distribution of the pseudo-labels on the performance of representation learning, we consider several unevenly distributed pseudo-labels. To do this, we only need to replace the even target distribution mean n(F) in (11) with uneven distributions. In this paper, we create the uneven target distributions ñ of the pseudo-labels parameterized by x, where N is the number of images and k is the cluster number. For x = 0, 2, 4, 6, 8, 10, the resulting distributions of pseudo-labels for CIFAR-10/100, i.e., k = 128 for CIFAR-10 and k = 512 for CIFAR-100, are shown in Figure 3.