In this section, we first report our experimental results on several datasets commonly used in computer vision, comparing with state-of-the-art research, and then we conduct experiments on various remote-sensing image datasets.
4.1. Comparison with State-of-the-Art Approaches
To facilitate comparison with other representation-learning methods, the experiments in this part were conducted on a large-scale dataset, ImageNet LSVRC-12 [51], and three smaller-scale datasets: CIFAR-10/100 and SVHN. ImageNet LSVRC-12 contains around 1.28 million training images of 1000 classes and 50,000 validation images. CIFAR-10/100 contains 50,000 training images and 10,000 test images, with 10/100 classes and a pixel resolution of 32 × 32. SVHN is a real-world image dataset similar to MNIST (images of digits), with 73,257 images for training and 26,032 images for testing. All the ground truths of these datasets were used only in the evaluation of representations. The implementation details are given in Appendix D.
For a fair comparison with previous work, we set the same cluster number k for these datasets as in [8]. For CIFAR-10, the cluster number was set to 128; therefore, the mean number of samples per cluster is 50,000/128 = 390.625. According to Remark 1, although this number is not an integer, setting it as the target still works: the number of samples assigned to each cluster is made as close as possible to this mean. The cluster number was set to 128 for SVHN and 512 for CIFAR-100 for the experiments conducted with AlexNet.
Table 1 gives kNN and linear-classifier evaluations of representations learned from CIFAR-10/100 and SVHN with our method, OTL (output translation). The kNN evaluations were performed on the fully connected layer before the final layer, with all layers except the final layer frozen. For the linear classifier, all fully connected layers were discarded and the retained CNN layers were frozen; a new randomly initialized fully connected layer was then attached to the last CNN layer, and its weights were updated by supervised learning. With simple augmentations, the performance of our method reaches the state of the art. When strong augmentations and LCT (label-consistent training) were employed, our method outperformed previous works in the literature. For SVHN, the performance of our method is very close to that of supervised learning, especially for the linear classifier, where the accuracy is only marginally lower than the supervised accuracy. When one assumes the actual number of classes in the data is known and sets k equal to it, this can be regarded as giving the model some prior knowledge. Such prior knowledge helps improve the performance of our OTL algorithm, especially for the linear classifier on SVHN, where the accuracy matches that of fully supervised learning.
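The kNN evaluation protocol described above can be sketched as follows (a simplified illustration assuming precomputed features from the frozen layer; `knn_evaluate` and its signature are our own, not the paper's code):

```python
import numpy as np

def knn_evaluate(train_feats, train_labels, test_feats, test_labels, k=5):
    """Evaluate frozen representations with a cosine-similarity kNN
    classifier; ground-truth labels are used only at this evaluation stage."""
    # L2-normalize so the dot product equals cosine similarity.
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sims = test @ train.T                        # (n_test, n_train)
    nearest = np.argsort(-sims, axis=1)[:, :k]   # k nearest neighbors
    preds = np.array([np.argmax(np.bincount(train_labels[row]))
                      for row in nearest])       # majority vote
    return float(np.mean(preds == test_labels))
```

If the learned features are discriminative, same-class samples cluster under cosine similarity and the vote recovers the correct class.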
In Table 2, the experiments were run on ResNet-50, and for a fair comparison with other methods, the same cluster number k was used for all experiments on ResNet-50. Both evaluations, kNN and linear classifier, were conducted on the last CNN layer. Note that Instance [53], SimCLR [6], and ISL [54] are all instance-discrimination methods. This table shows that clustering-based representation-learning methods can learn representations as good as, or even better than, those of instance discrimination. Compared to Table 1, we find that the performances on ResNet-50 are overall better than those on AlexNet, which is reasonable and expected.
To demonstrate the effectiveness of our method on a large-scale dataset and the speed of our algorithm in dealing with a huge number of images, i.e., over one million, we also conducted experiments on ImageNet with AlexNet. As can be seen from Table 3, our method, without strong augmentations and LCT, achieves state-of-the-art performance in the evaluations of both linear and kNN classifiers. Comparing the representations from the five CNN layers of our model, those from the last two layers perform better than those from the first three.
4.2. Impact of Even Pseudo-Label Distribution on Performances
We must emphasize, again, that pseudo-labels are evenly distributed in order to make the representations highly discriminative. Besides the even distribution, our algorithm can conveniently impose various uneven pseudo-label distributions, which is done by setting the target values in the frequency indicator C (11) to the desired distribution.
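For illustration, converting a desired (possibly uneven) distribution into integer per-cluster targets that sum exactly to the dataset size could look like the following sketch (the helper `target_counts` is hypothetical, not the paper's code):

```python
import numpy as np

def target_counts(n_samples, weights):
    """Turn relative per-cluster weights into integer target counts
    that sum exactly to n_samples (largest-remainder rounding)."""
    w = np.asarray(weights, dtype=float)
    raw = w / w.sum() * n_samples
    counts = np.floor(raw).astype(int)
    # Distribute the rounding remainder to the largest fractional parts.
    remainder = n_samples - counts.sum()
    order = np.argsort(-(raw - counts))
    counts[order[:remainder]] += 1
    return counts
```

Uniform weights recover the even targets used in the main experiments; skewed weights give the uneven targets studied below.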
Unevenly distributed pseudo-labels. To investigate the influence of the unevenness of the pseudo-label distribution, various uneven target distributions of pseudo-labels were set for training on CIFAR-10/100, as shown in Figure 3 (for more details, see Appendix E). The kNN evaluations for the different target distributions are given in Table 4 (these models were trained on AlexNet with fewer epochs than in Table 1 and without strong augmentations or the LCT technique). As one can see, the kNN accuracy decreases as evenness decreases. This is consistent with the argument made in the method section that evenly distributed pseudo-labels are good for learning representations.
Unevenly distributed datasets. The standard datasets used in this paper are evenly distributed, i.e., each class contains the same number of images. However, the effectiveness of our method does not depend on the data being evenly distributed. To show this, we considered five unevenly distributed datasets made from CIFAR-10 by deleting different numbers of images from different classes. As a comparison, five evenly distributed datasets were made, each with the same number of samples as the corresponding unevenly distributed dataset.
In Table 5 (again with fewer epochs and without strong augmentations or the LCT technique), for the unevenly distributed “D_100”, we deleted 0, 100, 200, ⋯, 900 images from classes 0, 1, 2, ⋯, 9, while the evenly distributed “D_100” has the same total number of images but an evenly distributed ground truth. Similarly, for the unevenly distributed “D_200”, we deleted 0, 200, 400, ⋯, 1800 images from classes 0 to 9. For the unevenly distributed “D_500”, the number of images in the first class is ten times the number in the last class.
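The construction of these subsets can be sketched as follows (CIFAR-10 has 5000 training images per class; the helper name is ours):

```python
def uneven_class_counts(step, per_class=5000, n_classes=10):
    """Per-class image counts of an uneven CIFAR-10 subset, built by
    deleting 0, step, 2*step, ... images from classes 0..9."""
    return [per_class - step * c for c in range(n_classes)]

d100 = uneven_class_counts(100)  # class 0 keeps 5000, class 9 keeps 4100
d500 = uneven_class_counts(500)  # class 0 keeps 5000, class 9 keeps 500
```

For “D_500” this reproduces the ten-to-one ratio between the first and last classes stated above.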
We can see from Table 5 that our method is almost unaffected by whether the actual data distribution is even or not. As expected, evenly distributed data leads to slightly better performance than unevenly distributed data. This experiment demonstrates that the success of our algorithm does not rely on the datasets being evenly distributed.
4.3. Application to Remote-Sensing Images
For all remote-sensing datasets, we divided the images into two groups, training and testing sets: we randomly picked around 80% of the images for training and used the rest for testing. All training involved only the training sets, without using any labels. For all datasets, we employed the same backbone, ResNet-18. All remote-sensing images were resized to 256 × 256 and then cropped to 224 × 224, and for the strong augmentations we used the strategy from [10] with Augment = 10, n = 1, and length = 16. In this part, we apply our method to seven remote-sensing image datasets: HMHR (How to Make High Resolution Remote Sensing Image dataset), EuroSAT, SIRI WHU Google, UCMerced LandUse, AID, PatternNet, and NWPU RESISC45.
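The random ~80/20 split can be sketched as follows (a minimal illustration; `train_test_split` and the fixed seed are our own choices, not the paper's code):

```python
import random

def train_test_split(paths, train_frac=0.8, seed=0):
    """Randomly split a list of image paths into training and testing
    sets; labels are never consulted, since training is unsupervised."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)  # deterministic for a fixed seed
    cut = int(len(paths) * train_frac)
    return paths[:cut], paths[cut:]
```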
For the very small datasets, HMHR, SIRI WHU Google, and UCMerced LandUse, we used pretrained models. For these three datasets we trained for 64 epochs, with an initial learning rate of 0.03 that dropped to 0.003 and 0.0006 at epochs 32 and 48. For training without a pretrained model, the total number of epochs was set to 300, and the learning rate started at 0.03 and dropped to 0.003 and 0.0006 at epochs 160 and 240. For all training, the momentum was set to 0.9, and for all datasets k was set to 128.
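The step schedule above can be expressed as a small helper (a sketch, with the no-pretraining milestones as defaults; pass milestones=(32, 48) for the three small datasets trained for 64 epochs):

```python
def learning_rate(epoch, milestones=(160, 240), rates=(0.03, 0.003, 0.0006)):
    """Return the learning rate for a given epoch under the step schedule:
    start at rates[0] and drop to the next rate at each milestone epoch."""
    for i, m in enumerate(milestones):
        if epoch < m:
            return rates[i]
    return rates[-1]
```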
HMHR is a dataset made from Google Maps via LocaSpace Viewer. It contains only 533 images in five classes: building, farmland, greenbelt, wasteland, and water. The pixel resolution and spatial resolution of the remote-sensing images are 256 × 256 and 2.39 m, respectively. To prevent the network from overfitting, we used the strong augmentations [10] and models pretrained on other remote-sensing image datasets: PatternNet, NWPU RESISC45, and EuroSAT. All the pretrained models were trained with our method discussed in the Methods section, and no labels were used in the training. For supervised training, we only considered the model pretrained on PatternNet. From Table 6, we see that our method works well on very small datasets with few classes; there is only a small gap between the supervised method and ours in the linear-classifier evaluation. Although different pretrained models yield different performances, all of them are effective and perform well.
EuroSAT [58,59] is a dataset for land-use and land-cover classification. It consists of 10 classes with 27,000 labeled images, including annual crop, forest, highway, river, sea/lake, and so on. The pixel resolution and spatial resolution of these images are 64 × 64 and 10 m, respectively. Table 7 shows that the gap between the supervised method and ours is only marginal in the linear-classifier evaluation. In the kNN evaluation, the performance of our method is also very close to that of supervised learning.
The SIRI WHU Google dataset [60] contains 2400 images in 12 classes, with 200 images per class. The classes include: agriculture, commercial, harbor, idle land, industrial, meadow, overpass, park, pond, river, water, and residential. The pixel resolution and spatial resolution of the images in this dataset are 200 × 200 and 2 m, respectively. As can be seen from Table 8, using the model pretrained on NWPU RESISC45 gives the best performance, which is even better than that of supervised learning based on the model pretrained on PatternNet. Representations learned from the model pretrained on PatternNet perform close to those of supervised learning.
UCMerced LandUse [61] provides images manually extracted from large images in the USGS National Map Urban Area Imagery collection, covering various urban areas around the United States. Each image measures 256 × 256 pixels, with a 0.3 m spatial resolution. There are 21 classes with 2100 images in total, 100 per class, including airplane, beach, buildings, freeway, golfcourse, tenniscourt, and so on. Table 9 demonstrates that the performance of our method based on models pretrained on PatternNet and NWPU RESISC45 is very close to that of supervised learning, and in the linear-classifier evaluation, our method even performs better.
AID [62] is an aerial image dataset with high resolution, 600 × 600 pixels, and a 0.5–0.8 m spatial resolution, consisting of sample images collected from Google Earth imagery. There are 30 classes with 10,000 images. This dataset is not evenly distributed; each class contains about 200 to 400 images. The scene classes include bare land, baseball field, beach, bridge, center, church, dense residential, desert, forest, meadow, medium residential, mountain, park, parking, playground, port, railway station, resort, school, and so on. This experiment shows that, for high-resolution remote-sensing images and an unevenly distributed remote-sensing dataset, our method also works well (see
Table 10).
PatternNet [63] is a remote-sensing dataset with 30,400 high-resolution images, whose pixel resolution is 256 × 256 and spatial resolution is 0.062–4.693 m. The images were collected for remote-sensing image retrieval from Google Earth imagery or via the Google Maps API for some US cities. There are 800 images for each of the 38 classes. From the results in Table 11, we see that, in the kNN evaluation, our method performs as well as supervised learning, while in the linear-classifier evaluation it performs even better.
NWPU RESISC45 [64] was created by Northwestern Polytechnical University (NWPU) for remote-sensing image scene classification (RESISC). It contains 31,500 images in total, with a pixel resolution of 256 × 256 and a 0.2–30 m spatial resolution. The dataset covers 45 scene classes with 700 images in each class, including baseball diamond, basketball court, bridge, chaparral, church, circular farmland, cloud, commercial area, dense residential, desert, intersection, island, lake, meadow, medium residential, mobile home park, wetland, and so on. The experiment in Table 12 shows that, for a dataset with more classes, our method still achieves excellent performance, very close to that of supervised learning.
From Table 6, Table 7, Table 8, Table 9, Table 10, Table 11 and Table 12, we see that, for pixel resolutions of remote-sensing images ranging from 64 × 64 to 600 × 600, i.e., from low to high resolution, spatial resolutions ranging from 0.062 m to 30 m, class numbers ranging from 5 to 45, and total numbers of images ranging from 533 to 31,500, our method can learn good representations that are close to, or even better than, those learned by supervised learning. These experiments demonstrate that our method can be widely applied to learning representations of remote-sensing images.
To investigate the effect of choosing different training and testing samples in the remote-sensing experiments, we conducted four more contrast experiments on the SIRI WHU Google dataset. This dataset contains only 2400 images, so random choices of training and testing samples may differ considerably. We randomly selected 80% of the images as training samples and the remaining 20% as testing samples four times, constructing four additional divisions of the SIRI WHU Google dataset. The performances on the five different divisions are given in Table 13, from which we see that different training and testing samples affect the performance of both supervised learning and our method. However, in all five cases, our method, without using any labels, can always learn good representations for remote-sensing images.