Improving Image Clustering through Sample Ranking and Its Application to Remote-Sensing Images

Image clustering is a widely used technique in many areas, including remote sensing. Recently, visual representations obtained by self-supervised learning have greatly improved the performance of image clustering. To further improve well-trained clustering models, this paper proposes a novel method that first ranks the samples within each cluster according to the confidence of their belonging to the current cluster and then uses the ranking to formulate a weighted cross-entropy loss for training the model. For ranking the samples, we develop a method that computes the likelihood of a sample belonging to its current cluster based on whether it is situated in a densely populated neighborhood, while for training the model, we give a strategy for weighting the ranked samples. We present extensive experimental results demonstrating that the new technique can improve State-of-the-Art image clustering models, achieving accuracy gains ranging from $2.1\%$ to $15.9\%$. Applying our method to a variety of remote-sensing datasets, we show that it can be effectively applied to remote-sensing images.


Introduction
Clustering remote-sensing images that contain objects of the same class is an essential topic in Earth observation. Without manual annotation, estimating the semantic similarity of images is difficult, and thus, image clustering remains a very challenging task. Images are usually described in very high-dimensional spaces, so to measure the similarities among images effectively, it is necessary to find low-dimensional representations.
However, as learning is performed without supervision, it is inevitable that samples from multiple classes will be grouped under the same cluster. Existing methods such as SCAN [31] address this problem by retaining only the highest-confidence predictions of the model as pseudo labels for training and using strong augmentations to prevent the model from over-fitting. Such solutions [31,36,37] throw away not only incorrectly clustered samples, but also correctly clustered ones. This hurts data diversity [38], which cannot be recovered from augmentations. Furthermore, samples near class boundaries are very valuable in clustering or classification, as is well known in the active learning [39,40] and support vector machine [41] literature. On the other hand, keeping inappropriate pseudo labels would mislead model training [42][43][44][45]. Instead of throwing away huge numbers of samples, we make use of most samples in training by developing a method to rank the samples in each cluster based on their confidence of belonging to the cluster: samples with higher confidence are assigned a larger weight in the cross-entropy loss, while those with lower confidence are assigned a smaller weight.
An image cluster formed by various clustering algorithms is likely to have a dominant class, i.e., one class will have many more samples than the other classes within each cluster. Within a cluster, samples from the dominant class can be regarded as the signal and samples from other classes as noise. It is reasonable to assume that signal samples are more likely to belong to the current cluster and, therefore, should be kept, while noise samples should be re-assigned to other clusters. We propose a statistical method to estimate the likelihood that a sample is signal or noise based on whether it is situated in a densely populated region. By ranking the samples within a cluster based on this likelihood, we propose a method to improve image clustering by manipulating the contributions of the pseudo labels to the self-training cross-entropy loss according to their likelihoods or, equivalently, the confidence of their belonging to the current cluster.
The organization of this paper is as follows. In the next section, the related works are given. Section 3 illustrates the methods, including how to rank images (or pseudo labels) based on their reliability, how to weight the cross-entropy loss, and how to evaluate the performance of clustering. Experimental results and a discussion are given in Sections 4 and 5, and a brief conclusion can be found in the last section.

Related Work
Representation learning. Supervised representation learning is limited by the availability of manually labeled data. Various unsupervised representation learning methods have been proposed and have achieved outstanding performance. Pretext-task methods learn image representations from pre-designed tasks; typical examples include inpainting patches [46], colorizing images [9,47], predicting the patch context [4,48], solving jigsaw puzzles [49], and predicting rotations [7]. Contrastive methods [18][19][20][21][22][23][24][25][26] use a contrastive loss to pull the representations of positive pairs closer while pushing those of negative pairs further apart. The latent distributions in generative models [22,26,50,51] are not only much simpler than the input distribution, but can also capture semantic variation in the distribution of the input data. Besides these methods, integrating clustering with the optimization of neural networks can also learn good representations [27][28][29][30]. In the past two years, self-supervised learning methods have been widely applied in remote sensing and have reached remarkable performance [52][53][54][55][56][57][58][59].
Image clustering. Since directly clustering images usually suffers from the curse of dimensionality, it is important to perform clustering on low-dimensional features or representations. The clustering-based representation learning methods [27,29,30] can also perform clustering directly using the features extracted from the neural network. Other clustering methods that group features include DEC [60], DAC [61], and deep clustering [62]. Maximizing the mutual information between images and their augmentations is another clustering approach, e.g., IIC [63,64]. SCAN [31] implements image clustering in two steps: first, obtaining semantically meaningful features from a self-supervised task; second, using those features as a prior in a learnable clustering method. RUC [36] proposes a robust learning method that improves clustering models pretrained by SCAN and TSUC [65] via cleansing and refining labels. SPICE [37] proposes two semantic-aware pseudo labeling algorithms, which provide accurate and reliable self-supervision for clustering.
Methods

A new method to Improve Clustering through Sample Ranking (ICSR) is introduced in this section. It builds on pretrained clustering models such as [31,37], which are trained on representations learned through self-supervised learning. ICSR ranks the samples in each cluster by a confidence criterion that estimates the likelihood of each sample belonging to its current cluster. The ranking of the samples is then used to weight their pseudo labels in a modified cross-entropy loss.

Sample Ranking
Consider K clusters of images, where the clusters are formed by deep learning models such as those trained by the clustering-based representation learning methods [29,30]. In each cluster, assuming there is a dominant class, i.e., one class with the largest number of samples, we regard samples from the dominant class as the signal and the others as noise. The first task of ICSR is to estimate the likelihood that an image in a cluster is noise.
Assume that a distance in the feature space X, such as the Euclidean distance (which can be replaced by any other similarity function, such as cosine similarity or the Kullback-Leibler divergence),

$$d(i, j) = \|x_i - x_j\|_2, \qquad (1)$$

indicates the similarity of two images $I^{(i)}, I^{(j)}$, where $x_i, x_j$ are their representations in the feature space X. For two arbitrary images $I^{(a)}, I^{(b)}$ in the same cluster, find their k nearest neighbors and compute the distances

$$\{d(a, a_1), d(a, a_2), \cdots, d(a, a_k)\}, \quad \{d(b, b_1), d(b, b_2), \cdots, d(b, b_k)\}, \qquad (2)$$

where $I^{(a_1)}, I^{(a_2)}, \ldots, I^{(a_k)}$ are the k nearest neighbors of $I^{(a)}$ and $I^{(b_1)}, I^{(b_2)}, \ldots, I^{(b_k)}$ are the k nearest neighbors of $I^{(b)}$. Based on these distances, if

$$\frac{1}{k}\sum_{s=1}^{k} d(a, a_s) > \frac{1}{k}\sum_{s=1}^{k} d(b, b_s), \qquad (3)$$

then it is reasonable to assume that $I^{(a)}$ has a higher probability of being noise than $I^{(b)}$. Note that we assume there is a dominant class in each cluster and that the distance (1) gives the similarities of the samples; the criterion in (3) then has an obvious interpretation: an image that is not noise has more close neighbors (it lies in a more densely populated region of the feature space). As the mean value is susceptible to outliers, we also use the median of the k distances, i.e., if

$$\operatorname{median}\{d(a, a_1), \cdots, d(a, a_k)\} > \operatorname{median}\{d(b, b_1), \cdots, d(b, b_k)\}, \qquad (4)$$

then we assume that $I^{(a)}$ is more likely to be noise than $I^{(b)}$. Again, the interpretation is that an image that is not noise has a higher chance of having close neighbors. For a given k and an arbitrary image $I^{(i)}$, we can find the k nearest neighbors $I^{(i_1)}, I^{(i_2)}, \ldots, I^{(i_k)}$ of $I^{(i)}$ and compute

$$M = \operatorname{median}\{d(i, i_1), \cdots, d(i, i_k)\}. \qquad (5)$$

M can then be regarded as a score indicating the likelihood that image $I^{(i)}$ is noise: the higher the score, the higher the probability that it is noise. Using M as a criterion to determine whether an image is noise within a cluster is reasonable; however, M is easily influenced by the choice of k.
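To make the criterion concrete, the noise score in (5) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code; the function name and the brute-force pairwise-distance computation are ours.

```python
import numpy as np

def noise_scores(features, k):
    """Noise score M for every sample in one cluster: the median
    Euclidean distance to the sample's k nearest neighbors.
    A higher score means a sparser neighborhood, i.e., more likely noise."""
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise distances, Eq. (1)
    np.fill_diagonal(dist, np.inf)             # exclude self-distance
    knn = np.sort(dist, axis=1)[:, :k]         # k smallest distances per sample
    return np.median(knn, axis=1)              # Eq. (5)
```

Sorting the cluster's samples by this score, low to high, yields one ranked list per choice of k, which is exactly the input to the majority voting described next.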
To reduce the effect of different selections of k, we introduce a robust majority voting technique, as illustrated in Figure 1. For each cluster and for every allowed k_i, where i = 0, 1, · · · , n−1, we grade every sample in the cluster according to the M computed in (5) and sort the samples by M from low to high, i.e., from low to high likelihood of being noise in the cluster. In this way, we obtain n sorted sample lists. We then use majority voting to obtain m groups of samples, as described in Algorithm 1. Samples in g_i are less likely to be noise than samples in g_{i+1}. Based on this grouping, we modify the cross-entropy loss to improve the clustering performance.
In Algorithm 1, to compute M in (5) conveniently, an N_c × (N_c − 1) matrix is constructed to store the distances among the image representations, where N_c ≈ N/K and N, K are the numbers of images and clusters, respectively. Clearly, the space needed to store this matrix is proportional to N^2, and thus, the space complexity of the algorithm is O(N^2). Similarly, the time complexity of the algorithm is also O(N^2) with respect to the number of images. Another factor affecting the time complexity is the voting number n; with respect to this variable, the time complexity of the algorithm is O(n).

Figure 1: Schematic of the robust majority voting technique (also see Algorithm 1). The first row shows the N_c image samples in a cluster c. The following n rows are the samples sorted for n different choices of k. Majority voting is then applied to group the samples based on the following procedure.
Step 1: Pick the first p 1 samples from each of the n sorted lists and count how many times they appear. The top p 1 samples receiving the most votes (appeared most frequent) are kept to form Group 1, g 1 .
Step 2: Follow the same procedure as Step 1, but replace p 1 with p 2 (p 2 > p 1 ) to obtain the top p 2 samples receiving the most votes, then remove those samples already included in Group 1 to form Group 2, g 2 .
Step 3: Follow the same procedure as Step 1, but replace p_1 with p_3 (p_3 > p_2 > p_1) to obtain the top p_3 samples receiving the most votes, then remove those samples already included in previous groups (g_1 and g_2) to form Group 3, g_3. Subsequent groups can be formed following this pattern, and in total, we obtain m groups of samples. It is not difficult to see that samples in group i are less likely to be noise than samples in group i + 1. Samples in the same cluster but different groups are assigned the same pseudo labels, but are weighted differently based on their likelihood of being noise.

Algorithm 1: Sample ranking and robust majority voting.
Input: N_c images in cluster c: I^(i), i = 1, 2, · · · , N_c.
Parameters: k_0, n, step k̄, m, p_1 < p_2 < · · · < p_m.
Output: m groups of images.
Step 1: For j = 0, 1, · · · , n − 1, set k_j = k_0 + j·k̄.
Step 2: For each k_j, compute M in (5) for every image and sort the images by M from low to high, giving n sorted lists.
Step 3: For j = 1, 2, · · · , m, count how many times I^(i) appeared in the top p_j of all the sorted lists; keep the top p_j images receiving the most votes, remove those already assigned to earlier groups, and form group g_j.
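Algorithm 1 can be sketched compactly in NumPy. The helper below is our own illustrative reconstruction (the names and the brute-force distance computation are ours): it builds one ranked list per choice of k and then forms the groups by majority voting as in Figure 1.

```python
import numpy as np
from collections import Counter

def rank_and_group(features, ks, fractions):
    """Rank one cluster's samples and group them by majority voting.
    ks        -- the neighbor counts k_0, ..., k_{n-1}
    fractions -- p_1 < ... < p_m, as fractions of the cluster size
    Returns m groups of sample indices, most to least reliable."""
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    sorted_dist = np.sort(dist, axis=1)

    # One ranking (low to high noise score M) per choice of k.
    rankings = [np.argsort(np.median(sorted_dist[:, :k], axis=1))
                for k in ks]

    groups, assigned = [], set()
    for frac in fractions:
        p = max(1, int(round(frac * len(features))))
        votes = Counter()                     # appearances in the top p
        for r in rankings:
            votes.update(r[:p].tolist())
        top_p = [s for s, _ in votes.most_common(p)]
        groups.append([s for s in top_p if s not in assigned])
        assigned.update(groups[-1])
    return groups
```

Because the last fraction covers the whole cluster, the groups partition the cluster, and samples that rank near the top for most choices of k end up in the earliest (most reliable) groups.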

Model Training
In this paper, we trained the model with the standard cross-entropy loss

$$L_{CE} = -\frac{1}{N}\sum_{s=1}^{N} \log f_\theta^{y(x_s)}(x_s), \qquad (6)$$

where $f_\theta(x_s)$ is the clustering model, normally implemented as in Figure 2 with a convolutional neural network architecture such as ResNet or another popular architecture, and $y(x_s)$ is the pseudo label given by the model's own prediction,

$$y(x_s) = \arg\max_{c} f_\theta^{c}(x_s). \qquad (7)$$

Figure 2: Using the predictions of the network as pseudo labels to train the network.
Based on the simple rationale that different augmentations of the same image should give the same prediction, i.e., the same pseudo label, we applied Label-Consistent Training (LCT) [30]:

$$L_{LCT} = -\frac{1}{N}\sum_{s=1}^{N} \log f_\theta^{y(x_s^{(i)})}(x_s^{(j)}), \qquad (8)$$

where i, j represent different augmentations and the augmentations used for creating the pseudo labels are simpler. Minimizing the loss L_LCT encourages an input's different augmentations to give the same prediction, which plays the same role as the consistency regularization [66] applied in FixMatch [67], RUC [36], and so on. In Section 3.1, we propose a method to group the data according to their likelihood of being noise. For the m groups of data, we should give different weights to their losses, since the reliability of the data differs. Therefore, we modify the loss function as

$$L = -\frac{1}{N}\sum_{i=1}^{m} w_i \sum_{x_s \in g_i} \log f_\theta^{y(x_s)}(x_s), \qquad (9)$$

where the hyper-parameter w_i is the weight of the loss produced from the data of group g_i. The values of these parameters must satisfy

$$w_1 \geq w_2 \geq \cdots \geq w_m, \qquad (10)$$

because the reliability of the data decreases from Group 1 to Group m. As the training proceeds, w_2, w_3, · · · , w_m should gradually increase, for two reasons: firstly, to explore good weights for them; secondly, as the accuracy increases, more samples become reliable. In this paper, we adopted the following strategy for computing the hyper-parameters w_i:

$$w_i(t) = e^{-(i-1)/\beta}, \qquad \beta = (1 + t)\beta_0, \qquad (11)$$

where i is the rank of the samples in group g_i obtained from the ranking algorithm and t is the training epoch index. β_0 is the (only) free parameter set manually, and we found that setting it to between 0.01 and 0.05 works well (see Section 5.2, ablation study). As β_0 in (11) increases, w_1 is not affected, while the values of w_2, w_3, · · · , w_m increase, but never cross 1.
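The weighting strategy can be sketched as follows. Note that the exponential form below is our reconstruction from the properties stated in the text (w_1 is always 1 and unaffected by β_0; the remaining weights grow with the epoch index t but never cross 1; β = (1 + t)β_0 as in Figure 8), so treat it as an illustration rather than the definitive schedule.

```python
import math

def group_weights(m, t, beta0):
    """Weights w_1, ..., w_m at epoch t: w_i = exp(-(i - 1) / beta),
    with beta = (1 + t) * beta0.  w_1 is always 1; for i > 1 the
    weight rises toward (but never reaches) 1 as training proceeds."""
    beta = (1 + t) * beta0
    return [math.exp(-(i - 1) / beta) for i in range(1, m + 1)]

early = group_weights(5, t=0, beta0=0.02)    # w_2..w_5 essentially 0
late = group_weights(5, t=200, beta0=0.02)   # w_2..w_5 much closer to 1
```

With β_0 around 0.02, the lower-ranked groups contribute almost nothing in the first epochs and are phased in gradually, which matches the remark later in the paper that w_2, · · · , w_5 are extremely small early in training.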

Evaluation
In this paper, we employ three approaches to evaluate clustering: clustering Accuracy (ACC), Normalized Mutual Information (NMI) [68,69], and the Adjusted Rand Index (ARI) [70][71][72]. ACC is analogous to classification accuracy, except that the labels given by clustering differ from the ground truth. The clustering accuracy is computed as

$$ACC = \max_{m} \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\{y_i = m(c_i)\},$$

where y_i is the ground truth, c_i is the clustering label, and m ranges over the one-to-one mappings from clustering labels c_i to ground-truth labels y_i. NMI evaluates clustering by measuring the mutual information between the ground-truth labels and the clustering labels. It is an information-theoretic metric normalized by the average entropy of the clustering assignments and the ground-truth labels:

$$NMI(A, B) = \frac{I(A, B)}{\frac{1}{2}\left[H(A) + H(B)\right]},$$

where I(A, B) is the mutual information of A and B, and H(A) and H(B) are their entropies. NMI is implemented in Scikit-learn as sklearn.metrics.normalized_mutual_info_score. Considering all pairs of samples and counting pairs with the same or different assignments in the predicted and true clusterings, the Rand Index (RI) computes a similarity measure between two clusterings. The corrected-for-chance version of the Rand Index is the Adjusted Rand Index,

$$ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]},$$

which is implemented in Scikit-learn as sklearn.metrics.adjusted_rand_score.

To fairly compare our method with others, we conducted our experiments on the datasets used by other methods. In this paper, we chose four datasets commonly used in image clustering: CIFAR-10/100, STL-10, and ImageNet-10. The basic information of these four datasets is given in Table 1. The CIFAR-10/100 datasets [73] consist of 60,000 color images: 50,000 training images and 10,000 testing images. CIFAR-10 has 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.
These classes are completely mutually exclusive, for example "automobile" includes SUVs, sedans, and things of that sort, while "truck" only includes big trucks.
For CIFAR-100, there are only 500 training images and 100 testing images per class. The 100 classes are grouped into 20 superclasses and, thus, each image has a "coarse" label and a "fine" label. The superclasses include aquatic mammals, fish, flowers, food containers, fruits and vegetables, household electrical devices, insects, and so on. Each superclass contains 5 classes; for example, "aquatic mammals" includes beaver, dolphin, otter, seal, and whale, and "flowers" includes orchids, poppies, roses, sunflowers, and tulips.
STL-10 [74] is a dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. It is a modified version of CIFAR-10: there are fewer labeled training images and a large number of unlabeled images drawn from a similar, but different, distribution than the labeled data. In this paper, however, we only used the labeled images, without using their labels in training. The 10 classes in this dataset are airplanes, birds, cars, cats, deer, dogs, horses, monkeys, ships, and trucks. Training and evaluation are performed on all labeled data: 5000 training images and 8000 testing images.
The ImageNet-10 dataset is a subset of ILSVRC2012 [75], which contains around 1.28 million images in 1000 classes. The pixel resolutions of the images in ILSVRC2012 vary, and in the experiments, we resized all images to 96 × 96 pixels. The 10 classes used in this paper are penguin, dog, leopard, airplane, airship, freighter, football, sports car, truck, and orange.
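The three metrics above can be computed with standard tools: NMI and ARI come directly from Scikit-learn, while ACC needs the best label mapping m, which the Hungarian algorithm (scipy.optimize.linear_sum_assignment) provides. The helper name below is ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: accuracy under the best one-to-one mapping between
    cluster labels and ground-truth labels (Hungarian algorithm)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    count = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1                       # co-occurrence matrix
    rows, cols = linear_sum_assignment(count.max() - count)
    return count[rows, cols].sum() / len(y_true)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 2, 2, 0, 0]        # a pure relabeling of the ground truth
acc = clustering_accuracy(y_true, y_pred)            # 1.0
nmi = normalized_mutual_info_score(y_true, y_pred)   # 1.0
ari = adjusted_rand_score(y_true, y_pred)            # 1.0
```

Since a clustering is only defined up to a permutation of its labels, the relabeled prediction above scores perfectly on all three metrics.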

Results
To evaluate the effectiveness of the proposed Image Clustering through Sample Ranking (ICSR) method, we applied it to models pretrained by State-of-the-Art image clustering and representation learning methods, with the aim of further improving their performance. It is important to note that even though the experimental datasets contain labels, the labels were never used in training.
We tested our method on three State-of-the-Art clustering models, SCAN [31], RUC [36], and SPICE [37], on the widely used standard datasets CIFAR-10/100, STL-10, and ImageNet-10. SCAN is a two-step method in which visual representation learning and clustering are decoupled: first, a self-supervised learning method (MoCo [16] or SimCLR [17]) is employed to learn semantically meaningful features; second, the obtained features are used as a prior to perform clustering (classifying each image and its nearest neighbors together). Additionally, a self-training technique is used to improve the clustering performance; self-training means training a model using the pseudo labels given by Equation (7). RUC is a robust learning method that is very similar to ours; the difference is mainly in how pseudo labels are treated. In RUC, three strategies are proposed to deal with pseudo labels: first, filtering out samples with low prediction confidence; second, detecting clean samples by checking whether a given sample has the same label as its top-k nearest neighbors; third, combining the above two strategies. In SPICE, there are 3 training stages: first, feature model training; second, cluster head training; third, joint training of the feature model and cluster head. SPICE designs a prototype pseudo labeling method in which the top confident samples are chosen to estimate the prototypes for each cluster, and the cluster indices are then assigned to the prototypes' nearest neighbors as pseudo labels.
We used ResNet-18 for CIFAR-10/100 and STL-10 and WRN-37-2 for ImageNet-10, as previous works did. The data augmentation used is similar to [31,36]. To test the effectiveness of our method on other network architectures, we also conducted experiments with AlexNet on CIFAR-10 and STL-10, where we established the clustering models through the clustering-based representation learning method [30]. We trained all models for 240 epochs with β_0 = 0.02, a learning rate of 0.005, a batch size of 128, and an SGD momentum of 0.9. We evaluated all models by clustering Accuracy (ACC), Normalized Mutual Information (NMI), and the Adjusted Rand Index (ARI).
For ranking samples (or pseudo labels), we used the Euclidean distance of the final layer of the network to compute M in Equation (5). We chose the distance of the final layer rather than the CNN layers because most models here underwent self-training, so the final layer is sufficiently representative and has lower dimensionality. k_0 in Algorithm 1 was set to 1/3 of the number of samples in the cluster, and the step k̄ was set to 5. The largest k_{n−1} was set close to 2/3 of the number of samples in the cluster. We ranked samples into 5 groups. In fact, settings with fewer than 5 groups are roughly included because, in the early training epochs, w_2, w_3, w_4, w_5 are extremely small (see the ablation study). Setting more groups helps a little, but needs much more computing resources. p_1, p_2, · · · , p_5 in Algorithm 1 take 15%, 35%, 55%, 75%, and 95% of the samples in each cluster and increase by 1% every 50 epochs.
The performance of our method is shown in Table 2. Applying our ICSR method to the pretrained models from SCAN, RUC, and SPICE markedly improved the clustering performance.
For CIFAR-10, ICSR improves all these clustering models. For example, the ACC, NMI, and ARI of the SCAN model improved by 4.7%, 5.6%, and 8.3%, respectively. For the best model on CIFAR-10, SPICE, which already achieved a remarkably high accuracy of 92.6%, ICSR still achieves a noticeable improvement of 2.1%. RUC is also a method proposed to improve image clustering; the results in Table 2 show that models trained by SCAN + RUC can be further improved by ICSR.
Although these State-of-the-Art clustering models have quite different performances on STL-10, ranging from 81.7% to 92.0%, our method works consistently well on all of them, achieving significant performance gains ranging from 6.0% to 13.6%. It is worth noting that our method improved the best clustering accuracy on STL-10 to 98%, close to a perfect 100%.
For ImageNet-10, SPICE had already reached a very high performance of 95.9%. Still, applying the new ICSR method further improves the clustering accuracy by 2.2%. For CIFAR-100, there are 20 superclasses. Although a few clusters have no dominant class, our method still improves the performance of the three clustering models.

Table 3 gives the evaluation results on the testing images, which were never involved in training. The accuracy of this evaluation demonstrates the models' generalization capability.
For CIFAR-10, our method boosts the performance of all three clustering models. Although SPICE achieved a very high performance, very close to the 93.8% accuracy of supervised learning, our ICSR method further closes the gap between supervised and unsupervised learning by 1.0%.
For CIFAR-100-20, for the model pretrained by SCAN + RUC, although our method improved its clustering performance on the training data (see Table 2), it did not do so on the testing data. However, for the other two pretrained models, the accuracies were improved by our method. For the model pretrained by SPICE, although SPICE + ICSR improved the clustering accuracy, it did not improve the NMI and ARI.

In Tables 2 and 3 above, we tested our method on models trained as classifiers via pseudo labels. To demonstrate that our method can also be applied to models trained differently, we considered two other kinds of models. First, we considered models trained by SCAN, but without self-labeling. From Table 4, we see that even on the models without the self-labeling step, our method still performed well, especially on the SCAN * models trained on STL-10, where there was a 15.9% improvement, even better than building on models pretrained by SCAN and RUC.
Secondly, we considered the models trained by clustering-based representation learning [30]. The method of [30] learns visual representations through clustering: its goal is the visual representation, while clustering is the means to achieve it. To make the visual representations as discriminative as possible, [30] evenly distributes images among clusters by translating the final layer of the network. Although [30] is not a clustering method, it can be used as the pretrained model for our method. In this experiment, we trained the models on AlexNet and then applied the ICSR method to improve the image clustering. Table 5 demonstrates that the ICSR method can be applied to improve models from the clustering-based representation learning method and that it also works well on AlexNet.

In this paper, we consider 7 datasets of remote-sensing images, as shown in Table 6. The How-to-make-high-resolution-remote-sensing-image-dataset (HMHR) was made from Google Maps through LocalSpace Viewer. There are only 533 images in this dataset, which contains 5 classes: building, farmland, greenbelt, wasteland, and water.
EuroSAT [82,83] is a dataset for land use and land cover classification. It consists of 27,000 remote-sensing images in 10 classes, including annual crop, forest, highway, river, sea/lake, and so on.
UCMerced LandUse [85] provides images manually extracted from large images in the USGS National Map Urban Area Imagery collection, covering various urban areas around the country. There are 21 classes in this dataset, including airplane, beach, buildings, freeway, golf course, tennis court, and so on, with 100 images per class.
AID [86] is an aerial image dataset with a high resolution of 600 × 600 pixels, collected from Google Earth imagery. This dataset is not evenly distributed; each class contains about 200 to 400 images. The scene classes include bare land, baseball field, beach, bridge, center, church, dense residential, desert, forest, meadow, medium residential, mountain, park, parking, playground, port, railway station, resort, school, and so on.
PatternNet [87] is a remote-sensing dataset collected for remote-sensing image retrieval, sourced from Google Earth imagery or via the Google Maps API for some U.S. cities. The 38 classes include airplane, cemetery, dense residential, forest, oil gas field, overpass, nursing home, parking lot, railway, closed road, chaparral, bridge, and so on.
NWPU-RESISC45 [88] was created by Northwestern Polytechnical University (NWPU) for REmote Sensing Image Scene Classification (RESISC). The 45 scene classes include baseball diamond, basketball court, bridge, chaparral, church, circular farmland, cloud, commercial area, dense residential, desert, intersection, island, lake, meadow, medium residential, mobile home park, wetland, and so on.

Results
In this part, we apply our method to the datasets introduced above. We first performed the clustering-based representation learning method [30] to learn the image representations and then, based on these representations, performed the clustering of the remote-sensing images.
For the very small datasets, HMHR, UCMerced LandUse, and SIRI-WHU Google, to prevent the network from overfitting, we used the strong augmentations of [31] and a pretrained model trained on another remote-sensing image dataset, PatternNet, by an unsupervised representation learning method [30]. In the ranking process, we used the Euclidean distance of the last CNN layer, which differs from the experiments in the previous part, where the distance was computed from the final layer of the network. This is because, in this part, we establish all pretrained models from a representation learning method whose goal is not clustering. For all remote-sensing datasets, we used ResNet-18 as the backbone of the network, and all images were resized to 256 × 256 and then cropped to 224 × 224. The total number of training epochs was 240, divided into two sets of 120 epochs: for the first 120 epochs, p_1, p_2, · · · , p_5 in Algorithm 1 take 15%, 35%, 55%, 75%, and 95% of the samples in each cluster, while for the second 120 epochs, they take 17.5%, 37.5%, 57.5%, 77.5%, and 97.5%. For each set of 120 training epochs, β_0 = 0.03, and the learning rate was initially 0.03, dropping to 0.003 at Epoch 90.

Although these datasets are very different in terms of source, size, spatial resolution, sample number, and class number (see Table 6), our method works well on all of them (see Table 7). For EuroSAT and PatternNet, the accuracies improved to 94.8% and 97.1% without using any labels, and for SIRI-WHU Google, the accuracy improved by 24.1%. For the very small datasets, such as HMHR and UCMerced LandUse, good ranking was difficult because the samples are very limited. However, our method still worked well on these datasets (the performance of ranking images in UCMerced LandUse can be found in the next part and Appendices A and B).
For the datasets AID, PatternNet, and NWPU-RESISC45, the numbers of classes are 30, 38, and 45, which are large for a clustering task. Nevertheless, our method still works on them. For AID, although the distribution of samples is not uniform, with 200-400 images per class, our method remains effective.

Performance of Sample Ranking Algorithm
The core of the ICSR technique is the sample ranking and robust majority voting algorithm (Figure 1 and Algorithm 1). ICSR assumes that each cluster has a dominant class, i.e., one class has more samples than the other classes; we call samples from the dominant class the signal and the other samples noise. The objective of the ranking algorithm is to place signal samples at a higher rank than noise samples; the more signal samples placed at a higher rank, the better the performance. We define the percentage of signal samples in the top-ranked p% of all samples in a cluster as the ranking success rate,

$$R_{sr}(p) = \frac{S(p)}{S(p) + N(p)},$$

where S(p) is the number of signal samples and N(p) is the number of noise samples in the top p% of the ranked samples within a cluster. A good ranking should have the following characteristics: for a given p, the larger R_sr(p) is, the better; and for a given cluster and p_i < p_j, R_sr(p_i) > R_sr(p_j). We illustrate the performance of ICSR's ranking algorithm in Figure 3. The graphs in the first column are consistent with the statistics, i.e., the percentage of signal samples in the sub-clusters is close to that of the whole cluster. The graphs in the second column demonstrate the effectiveness of our method: in all three cases, the ranking success rate R_sr(p) of the top 20% is generally higher than that of the top 40%, R_sr(p) of the top 40% is higher than that of the top 60%, and so on.

Figure 3: Sample ranking performance of Algorithm 1 (see also Figure 1). First column: R_sr(p) without applying our ranking algorithm, randomly selecting p = 20%, 40%, · · · , 80% of the samples in the clusters formed by 3 pretrained clustering models for CIFAR-10. Second column: after applying our ranking algorithm to rank the samples in the clusters formed by the same 3 pretrained clustering models, R_sr(p) for p = 20%, 40%, · · · , 80%.
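As a concrete check, the ranking success rate defined above is straightforward to compute from a ranked cluster once each sample has been tagged as signal or noise; the small helper below (names are ours) illustrates the two characteristics just described.

```python
def ranking_success_rate(is_signal, p):
    """R_sr(p): fraction of signal samples among the top-ranked p% of
    a cluster.  `is_signal` lists, in ranked order (most to least
    reliable), whether each sample belongs to the dominant class."""
    top = is_signal[: max(1, round(len(is_signal) * p / 100))]
    return sum(top) / len(top)

# A well-ranked cluster: signal concentrated at the top ranks.
ranked = [True] * 6 + [False, True, False, False]
print(ranking_success_rate(ranked, 20))    # 1.0 (top 2 are both signal)
print(ranking_success_rate(ranked, 100))   # 0.7 (7 signal out of 10)
```

A good ranking shows exactly this monotone pattern: R_sr decreases (or at least does not increase) as p grows, because the less reliable tail of the cluster is progressively included.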
To visualize the effectiveness of our ranking algorithm on remote-sensing images, we take the UCMerced LandUse dataset as an example. Let us focus on one class of remote-sensing images, for example, the tennis court. All the tennis court images in the dataset are shown in Figure 4.
When we apply the clustering-based representation learning method [30] to UCMerced LandUse, the cluster containing most of the tennis court images is shown in Figure 5. We see that, besides tennis courts, the cluster also contains some images of dense residential areas, buildings, and so on. There are 101 images in Figure 5, and 71 of them contain a tennis court.
Figure 5: The remote-sensing images of tennis courts clustered by the clustering-based method [30]. There are 101 images in total, and 71 images contain a tennis court.
Applying our ranking algorithm, we rank the images in Figure 5 into five groups, as shown in Figure 6. As can be seen from Figure 6, the images in the first, second, and third groups are all correctly clustered. There are still eight images of tennis courts in the fourth group and only three in the last group.
(e) The images are ranked as the fifth group.
Figure 6: In the first group, that is, the most reliable group of images containing a tennis court, all the images are correctly clustered. The images ranked in the second and third groups are also all correctly clustered. In the fourth group, there are 1 image of dense residential, 11 images of mobile home parks, and 8 images of tennis courts. In the last group, there are only 3 images of tennis courts.
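The confidence scores behind this five-group ranking are, in the paper, based on the neighborhood-density quantity $M$ of Equation (5) combined with majority voting. As a rough, hypothetical stand-in for that score, one can rank the samples of a cluster by the mean cosine similarity to their $k$ nearest neighbors within the cluster and then split the ranking into five groups:

```python
import numpy as np

def rank_cluster_by_density(features, k=5, n_groups=5):
    """Rank one cluster's samples by a simple density proxy (mean cosine
    similarity to the k nearest in-cluster neighbors), then split the
    ranking into n_groups groups, most confident first."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T                      # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)     # exclude self-similarity
    topk = np.sort(sim, axis=1)[:, -k:]  # each sample's k largest similarities
    scores = topk.mean(axis=1)           # density proxy per sample
    order = np.argsort(-scores)          # indices, descending confidence
    return np.array_split(order, n_groups)

rng = np.random.default_rng(0)
feats = rng.normal(size=(101, 32))     # stand-in for learned representations
groups = rank_cluster_by_density(feats)
print([len(g) for g in groups])        # -> [21, 20, 20, 20, 20]
```

Samples sitting in densely populated neighborhoods receive high scores and land in the first groups; isolated samples, which are more likely noise, fall into the last groups.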

Ablation Study
The ICSR technique has an important free parameter $\beta_0$ in Equation (11), which controls the weights $w_i(t)$.
During training, we expect $w_i(t)$ with $i > 1$ to gradually increase. One reason is to seek better weights; another is that, as training improves the clustering accuracy, more samples are clustered correctly, and thus more pseudo labels become reliable. In this paper, we rank the samples in each cluster into five groups, so there are five weights in Equation (9). To investigate the effect of $\beta_0$, Figure 7 shows the training behavior for three different values of $\beta_0$ on the STL-10 clustering models from SPICE (these accuracies were evaluated only on testing samples and are thus higher than the accuracies in Table 2). Although the convergence behavior of the models varied with the different values, they all converged to the same state after 200 epochs, demonstrating that our method does not depend heavily on the selection of this free parameter: different values lead to similar performance.
Figure 7: Training behavior of the ICSR technique for different values of the free parameter $\beta_0$ on the STL-10 clustering models from SPICE. For different selections of $\beta_0$, training converges to almost the same state after 200 epochs.
Figure 8: The values of $w_1, w_2, w_3, w_4, w_5$ as $\beta$ increases from 0 to 6, where $\beta = (1 + t)\beta_0$ (see Equation (11)).
Figure 8 shows how the values of the five weights vary with $\beta$ ($\beta = (1 + t)\beta_0$; see Equation (11)). As expected, $w_1$ always keeps the value 1, unaffected by $\beta$, and $w_2 > w_3 > w_4 > w_5$ holds for all $\beta$. When $\beta$ is very small, the values of $w_2, w_3, w_4, w_5$ are very close to zero, which approximately keeps only the data of the first group. From Figure 8, we see that $w_2$ grows much faster than $w_3, w_4, w_5$ as $\beta$ increases. In the phase where $w_2$ is much larger than $w_3, w_4, w_5$, approximately only two groups of data are left. Similarly, when $w_3$ grows large enough, approximately only three groups of images are left. Note that when $\beta_0 = 0.01$, it takes 100 epochs for $\beta$ to increase from 0.01 to 1, which means that for many training epochs, the effective number of groups is small. In other words, when the number of ranking groups is set to five, assigning weights as in Equation (11) roughly includes the cases with fewer than five groups.
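Equation (11) itself is not reproduced in this section, but a hypothetical weight schedule with exactly the properties described above ($w_1$ fixed at 1; $w_2 > w_3 > w_4 > w_5$; all lower weights near zero for small $\beta$ and rising toward 1 as $\beta = (1 + t)\beta_0$ grows) can be sketched as:

```python
import numpy as np

def group_weights(t, beta0=0.01, n_groups=5):
    """Hypothetical schedule matching the behavior described for Eq. (11):
    w_1 = 1 for all t; lower-ranked groups start near 0 and rise with beta."""
    beta = (1 + t) * beta0
    i = np.arange(n_groups)          # group indices 0..4 (groups 1..5)
    return np.exp(-i / beta)

print(group_weights(t=0))     # beta = 0.01: only the first group counts
print(group_weights(t=199))   # beta = 2.0: all five groups contribute
```

With $\beta_0 = 0.01$, the first ~100 epochs keep $\beta < 1$, so training effectively uses fewer groups early on and gradually brings the less-reliable groups into play, as the paragraph above describes.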
From Figure 7, we see that from epoch 100 to epoch 200 the three accuracies still increase, corresponding to $\beta$ increasing from 1 to 2, 2 to 4, and 3 to 6, respectively. Note that the five weights have grown large enough once $\beta > 1$, and thus the increasing accuracies after epoch 100 imply that the samples, or pseudo labels, that are not the most reliable also contribute to improving the clustering performance. Recall that adding such less-reliable samples and pseudo labels to model training is the essential difference between our method and previous ones such as SCAN [31], RUC [36], and SPICE [37].
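How these group weights enter training can be sketched as a per-sample cross-entropy against the pseudo labels, scaled by the weight of the group each sample is ranked into (a minimal illustration in the spirit of Equation (9); the probabilities, group assignments, and weight values below are toy assumptions):

```python
import numpy as np

def weighted_cross_entropy(probs, pseudo_labels, group_ids, weights):
    """Cross-entropy against pseudo labels, scaled per sample by the
    weight of the ranking group the sample belongs to."""
    ce = -np.log(probs[np.arange(len(pseudo_labels)), pseudo_labels] + 1e-12)
    w = weights[group_ids]            # one weight per sample
    return (w * ce).sum() / w.sum()   # weighted mean over the batch

probs = np.array([[0.9, 0.1],         # predicted class probabilities
                  [0.6, 0.4],
                  [0.2, 0.8]])
pseudo = np.array([0, 0, 1])          # pseudo labels from the clusters
groups = np.array([0, 4, 1])          # ranking group (0 = most reliable)
weights = np.array([1.0, 0.6, 0.37, 0.22, 0.14])
print(weighted_cross_entropy(probs, pseudo, groups, weights))
```

The second sample, ranked in the least reliable group, contributes far less to the loss than the confidently ranked first sample, so a wrong pseudo label there does little harm while still preserving data diversity.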

Conclusions
Unsupervised image classification is challenging but very important, as it can make use of abundant unlabeled data. This paper developed an effective image clustering technique that builds on pretrained clustering models and improves their performance. To develop this approach, we solved the following problems: first, how to estimate the likelihood of a sample belonging to its current cluster; second, how to improve the reliability of this estimate; third, how to dynamically determine the contributions of the pseudo labels according to their confidence, which varies with the training epoch. Based on these statistics, we introduced the quantity $M$ in Equation (5) to measure the confidence of pseudo labels and employed majority voting to enhance the reliability of this measurement. A scheme was then proposed in Equation (11) to weight the cross-entropy loss according to the confidence of the samples. We applied the proposed method to various remote-sensing datasets and compared its performance with SOTA methods; the results show that our method significantly outperforms them in most cases. Besides clustering, the methods proposed in this paper could be applied to data denoising or cleaning, automatic labeling, feature extraction, model pretraining, and so on.
Appendix A: The Performance of Ranking Samples of Freeway in UCMerced LandUse
Figure 9: Ground truth for remote-sensing images of freeway. There are 100 images of freeway in total.
Figure 10: The remote-sensing images of freeway clustered by the clustering-based method [30]. There are 93 images in total, and 68 images contain freeway.

(a) The images are ranked as the first group (the most reliable).
(b) The images are ranked as the second group.
(c) The images are ranked as the third group.
(d) The images are ranked as the fourth group.
(e) The images are ranked as the fifth group.
Figure 11: In the first, second, and third groups, all the images are correctly clustered. In the fourth group, there are 1 image of river and runway, 2 images of agricultural and overpass, and 12 images of freeway. In the last group, there are only 2 images of freeway.
Figure 12: Ground truth for remote-sensing images of beach. There are 100 images of beach in total.
Figure 13: The remote-sensing images of beach clustered by the clustering-based method [30]. There are 101 images in total, and 60 images contain beach.

(a) The images are ranked as the first group (the most reliable).
(b) The images are ranked as the second group.
(c) The images are ranked as the third group.
(d) The images are ranked as the fourth group.
(e) The images are ranked as the fifth group.