A Benchmark Dataset for Performance Evaluation of Multi-Label Remote Sensing Image Retrieval

Abstract: Benchmark datasets are essential for developing and evaluating remote sensing image retrieval (RSIR) approaches. However, most existing datasets are single-labeled, with each image annotated by a single label representing the most significant semantic content of the image. This is sufficient for simple problems, such as distinguishing between a building and a beach, but multiple labels are required for more complex problems, such as RSIR. This motivated us to present a new benchmark dataset, termed "MLRSIR", that was labeled from an existing single-labeled remote sensing archive. MLRSIR contains a total of 17 classes, and each image has at least one of the 17 pre-defined labels. We evaluated the performance of RSIR methods, ranging from traditional handcrafted-feature-based methods to deep-learning-based ones, on MLRSIR. More specifically, we compared the performance of RSIR methods from both single-label and multi-label perspectives. The results present the advantages of multiple labels over single labels for interpreting complex remote sensing images, and serve as a baseline for future research on multi-label RSIR.


Introduction
With the rapid development of remote sensing technology, a considerable volume of remote sensing data becomes available on a daily basis. This huge amount of data provides new opportunities for various remote sensing applications; however, it also poses the significant challenge of searching large remote sensing archives.
Content-based image retrieval (CBIR), which aims to find images of interest in a large-scale image archive, is a promising solution to this problem. Content-based remote sensing image retrieval is a specific application of CBIR in the remote sensing field. Typically, an RSIR system has two main parts, feature extraction and a similarity measure, but the remote sensing community has focused mainly on developing powerful features, since retrieval performance depends greatly on the effectiveness of the extracted features.
A number of conventional RSIR approaches are available and have been evaluated on existing benchmark datasets, providing baseline results for RSIR research. However, these approaches assume that the query image, and the images to be retrieved, are single-labeled, since the images are annotated by single labels associated with their main semantic content. Such an assumption is reasonable and often sufficient for particular remote sensing applications, but it is inadequate for more complex ones. For example, single labels (broad classes) are sufficient to distinguish image categories like "building" and "grass land", but multiple labels (primitive classes) are needed to distinguish between categories like "dense residential" and "medium residential", since they are visually similar and the main difference lies in the density of buildings. From the perspective of RSIR, multiple labels are able to narrow the semantic gap between low-level features and the high-level semantic concepts present in remote sensing images, and thus further improve retrieval performance. However, the lack of multi-label benchmark datasets has restricted the development of RSIR research. In this paper, we first introduce a new multi-label dataset, named MLRSIR, which provides the remote sensing community with a benchmark for developing novel multi-label RSIR approaches. We then provide a review of traditional single-label RSIR, as well as multi-label RSIR approaches, ranging from handcrafted-feature-based methods to deep-learning-feature-based ones.
The main contributions of this paper are as follows:
- We construct a multi-label remote sensing benchmark dataset, MLRSIR, for multi-label RSIR. MLRSIR is a publicly available multi-labeled dataset, in contrast to the existing single-labeled RSIR datasets.
- We provide a brief review of the state-of-the-art methods for single-label and multi-label RSIR.
- We compare single-label and multi-label retrieval methods on MLRSIR, including traditional handcrafted features and deep learning features. This indicates the advantages of multi-label over single-label retrieval for complex remote sensing applications like RSIR, and provides baseline results for future research on multi-label RSIR.
The rest of this paper is organized as follows. We provide a brief review of the state-of-the-art single-label and multi-label retrieval methods for RSIR in Section 2. Section 3 introduces our multi-label benchmark dataset and the multi-label RSIR methods evaluated on the dataset including handcrafted features and deep learning features. The results and comparisons are shown in Section 4. We draw some conclusions in Section 5.

Remote Sensing Image Retrieval Methods
RSIR is a useful technique for the fast retrieval of images of interest from a large-scale remote sensing archive. In this section, we introduce the state-of-the-art RSIR methods including handcrafted features and deep-learning-based ones from the perspective of single-label and multi-label RSIR, respectively.

Single-Label RSIR
For single-label RSIR methods, the query image and the images to be retrieved are labeled by a single, broad class label. Early single-label RSIR methods extracted handcrafted low-level features to describe the semantic content of remote sensing images, which can be either global or local features. Color (spectral) features [1], texture features [2][3][4], and shape features [5] are commonly used global features extracted from the whole image, while local features like Scale Invariant Feature Transform (SIFT) [6], are extracted from image patches of interest.
Color and texture features are used more widely for RSIR than shape features. Remote sensing images usually have multiple spectral bands (e.g., multi-spectral imagery) or even hundreds of bands (e.g., hyper-spectral imagery); therefore, spectral features are significant for remote sensing images. Bosilj et al. employed pattern spectral features for the first time in a dense strategy and explored both global and local pattern spectral features for image retrieval [1]. The results indicated that the morphology-based spectral features achieved the best performance. Color features, however, sometimes fail due to the phenomenon that the same object/class varies in spectra, or different objects/classes share the same spectra. Texture features have therefore been used to capture the spatial variation of pixel intensity, and have achieved great performance in many tasks, including RSIR. Aptoula developed multi-scale texture descriptors, the circular covariance histogram and the rotation-invariant point triplets, for image retrieval, and also exploited the Fourier power spectrum to derive a couple of new descriptors [2]. Bouteldja et al. proposed a rotation- and scale-invariant representation of texture feature vectors by calculating statistical measures of decomposed image sub-bands [3]. However, most of these texture features are extracted from grayscale images, and thus the rich color information is ignored. Shao et al. proposed an improved texture descriptor that incorporates discriminative information among color bands [4], which outperforms texture features such as Gabor texture [7] and the local binary pattern (LBP) [8]. There are also other global features for RSIR, like simple statistics [9], GIST features [10], and Gray-Level Co-occurrence Matrix (GLCM) features [11].
Unlike global features, local features are generally extracted from image patches centered at points of interest, and often achieve better performance than global features. SIFT is the most popular local descriptor and has been used widely for various applications, including RSIR. Yang et al. released the first remote sensing benchmark dataset to the public and investigated the performance of local invariant features for RSIR [9]. The local features outperformed global features such as simple statistics, color histograms, and texture features. Özkan et al. investigated the performance of state-of-the-art representation methods for geographical image retrieval [12]. Their extensive experiments indicate the advantages of local features for RSIR. However, local features like SIFT are high-dimensional, and thus feature aggregation approaches, such as bag of visual words (BOVW) [13], the vector of locally aggregated descriptors (VLAD) [14], and the improved Fisher kernel (IFK) [15], are often used to encode local features into more compact global representations. Compared with VLAD and IFK, BOVW is not only an image representation widely used for RSIR [9,12], but also a framework that can be combined with other features to extract more powerful representations [16,17]. Other popular local features include the histogram of oriented gradients (HOG) [18] and its variant, the pyramid histogram of oriented gradients (PHOG) [19].
Deep learning has been demonstrated to be capable of extracting more powerful feature representations than handcrafted features. The remote sensing community, and more specifically RSIR, has benefited from these deep learning approaches, since retrieval performance is greatly dependent on the effectiveness of feature representations, as mentioned above. Zhou et al. proposed an unsupervised feature learning approach in which SIFT and a sparse auto-encoder are combined to learn sparse features for RSIR [20]. In a recent work, Wang et al. proposed a novel graph-based learning method for effectively retrieving remote sensing images based on a three-layer framework [21]. The improvements brought by these two unsupervised feature learning methods, however, are limited, since they are based on shallow networks that cannot learn higher-level information.
In addition to the unsupervised feature learning methods mentioned above, convolutional neural networks (CNNs) are supervised models that have proved to be the most successful deep learning approach, given their remarkable performance on benchmark datasets such as ImageNet [22]. However, a large number of labeled images is needed to train effective CNNs from scratch, which is impractical for some domains (e.g., remote sensing) due to the lack of large-scale labeled datasets. In practice, transfer learning is often used to remedy the lack of labeled data, by either treating pre-trained CNNs as feature extractors or fine-tuning pre-trained CNNs on the target dataset. Napoletano presented an extensive evaluation of visual descriptors, including global, local, and CNN-based features [23]. The results demonstrate that features extracted by treating pre-trained CNNs as feature extractors achieve the best performance. Zhou et al. proposed a low dimensional convolutional neural network (LDCNN) based on convolutional layers and a three-layer perceptron, which can learn low-dimensional features from limited labeled images [24]. The Visual Geometry Group (VGG) networks [25], including three CNN models, i.e., VGGF, VGGM, and VGGS, have been investigated as the basic convolutional blocks of LDCNN; among them, VGGM performs best on several benchmark datasets.

Multi-Label RSIR
The single-label RSIR methods mentioned above are effective in searching for remote sensing images of interest in a large-scale archive, but the primitive classes (multiple labels) present in images are ignored. This may result in poor performance due to the semantic gap between low-level features and high-level concepts. Multi-label RSIR differs from single-label RSIR in terms of the number of labels attached to each image, as well as the process of feature extraction. In addition, for multi-label RSIR, a two-step coarse-to-fine retrieval can be performed based on the multiple labels in each image. More specifically, in the coarse retrieval step, the images in the archive that have at least one overlapping label with the query image are returned to form a similar subset; later, in the fine retrieval step, the features extracted from the segmented image regions are used to perform exact retrieval of similar images from the subset. Figure 1 shows a basic comparison between single-label and multi-label RSIR.
To exploit multiple labels and further improve RSIR performance, multi-label learning has shown promising and effective performance in addressing multi-label image retrieval problems in the computer vision literature [26][27][28]. Nasierding et al. investigated multi-label classification methods for image annotation and retrieval to give a comparative study of these methods [26]. Li et al. proposed a novel multi-label image annotation method for image retrieval based on annotated keywords [27]. The results indicate that multiple labels can provide abundant descriptions of image content at the semantic level, thus improving the precision and recall of image retrieval. Ranjan et al. introduced multi-label canonical correlation analysis to address the cross-modal retrieval problem in the presence of multi-label annotations [28]. The proposed cross-modal retrieval method achieves state-of-the-art retrieval performance.
Inspired by the success of multi-label learning methods in the computer vision literature, the remote sensing community has shown increasing interest in multi-label learning for RSIR problems [29][30][31][32][33]. Wang et al. proposed a remote sensing image retrieval scheme using image scene semantic matching [29]; in a related work [30], image visual, object, and spatial relationship semantic features are combined to perform a two-stage coarse-to-fine retrieval of remote sensing images from multiple sensors. However, an object-based support vector machine (SVM) classifier is needed to produce classification maps of the query images and the images to be retrieved in the archive. To train an effective classifier, a reliable pixel-based training set is required, which is not efficient for RSIR applications. Chaudhuri et al. presented a novel unsupervised graph-theoretic approach for region-based retrieval of remote sensing images [31]. In this approach, the images are modeled by attributed relational graphs, and the graphs of the images in the archive are matched to that of the query image based on inexact graph matching. Dai et al. explored the use of multiple labels for hyperspectral image retrieval and presented a novel multi-label RSIR system combining spectral and spatial features [32]. Experimental results obtained using a benchmark archive of hyperspectral images show that the proposed method successfully adapts single-label classification to multi-label RSIR. In a recent work, Chaudhuri et al. proposed a multi-label RSIR method using a semi-supervised graph-theoretic approach [33], which improves on the region-based retrieval approach [31]. The proposed approach requires only a small number of pixel-wise labeled training images characterized by multiple labels to perform a coarse-to-fine retrieval process. This work provides not only a multi-label benchmark dataset but also baseline results for multi-label RSIR.

MLRSIR: A Pixel-Wise Dataset for Multi-Label RSIR
For single-label RSIR, a number of benchmark datasets are publicly available [34]. However, little work has been done to release datasets for multi-label RSIR in the remote sensing literature, which limits the development of novel approaches. Chaudhuri et al. released a multi-label RSIR archive [33], in which each image is manually labeled with one or more labels based on visual inspection. This is the first open-source dataset for multi-label RSIR. However, it is an image-level dataset, which is sufficient for unsupervised or semi-supervised multi-label RSIR but limits supervised deep learning approaches, such as fully convolutional networks (FCNs) [35]. More specifically, we only know the labels/primitive classes of the images but have no knowledge of the pixel-wise labels within each image.
As the initial step of the semi-supervised approach [33], an effective segmentation algorithm is required to obtain a number of semantically meaningful regions, since the retrieval performance heavily depends on the accuracy of the segmentation results. In the subsequent steps, a small number of training images are randomly selected and pixel-wise labeled to predict the label of each region in an image. These steps can be combined and replaced by an FCN, which has proved effective for addressing the semantic segmentation problem. Moreover, it is worth noting that pixel-wise labeling is also required in the semi-supervised multi-label RSIR approach. We therefore propose a new pixel-wise labeled dataset, termed MLRSIR, for multi-label RSIR, which can be used not only for unsupervised and semi-supervised approaches but also for supervised approaches like FCNs.

Description of MLRSIR
To be consistent with the multi-label RSIR archive [33], the total number of distinct class labels associated with MLRSIR is also 17. The eCognition 9.0 (http://www.ecognition.com/) software was used to segment each image in the UC Merced archive [9] into a number of semantically meaningful regions, and each region was then assigned one of the 17 pre-defined class labels.
MLRSIR contains 21 broad categories with 100 images per category, the same as the UC Merced archive. The following 17 class labels were considered: airplane, bare soil, buildings, cars, chaparral, court, dock, field, grass, mobile home, pavement, sand, sea, ship, tanks, trees, and water. Figure 2 shows some images with their corresponding pixel-wise labeling results, and the total number of images associated with each class label is shown in Table 1.
MLRSIR is a pixel-wise labeled dataset with each image containing multiple labels; therefore, it can also be used for other tasks, such as semantic segmentation (also called classification in remote sensing) and multi-label classification, i.e., predicting the classes contained in an image. MLRSIR is available at https://sites.google.com/view/zhouwx/dataset.

Table 1. Total number of images associated with each class label.

Class Label     Number of Images
airplane        100
bare soil       754
buildings       713
cars            897
chaparral       116
court           105
dock            100
field           103
grass           977
mobile home     102
pavement        1331
sand            291
sea             101
ship            103
tanks           100
trees           1021
water           208


Multi-Label RSIR Based on Handcrafted and CNN Features
Multi-label RSIR differs from single-label RSIR in that the multi-label information is considered and the features are extracted from the segmented regions instead of the whole image. This section introduces the handcrafted and CNN features that were evaluated on the presented MLRSIR dataset.

Multi-Label RSIR Based on Handcrafted Features
To extract handcrafted features, we first determined the number of connected regions in each image according to its corresponding labeling results. We then extracted features from each of the segmented regions and combined these region-based features to form a feature matrix, as shown in Figure 1. In detail, each region was represented by a feature vector concatenating color, texture, and shape features. We refer the readers to Section 4.1 for more details on handcrafted feature extraction.
Two schemes were proposed to evaluate the retrieval performance of handcrafted features. In the first scheme, multi-label RSIR was evaluated as single-label RSIR. More specifically, the similarity between the query image and the other images in the archive was obtained by calculating the distance between the corresponding feature matrices as follows:

Dis(v_q, v_r) = (1/n) ∑_{q=1}^{n} min_r D(q, r)    (1)

where v_q and v_r are the features of the query image and the other images in the archive, respectively, D(q, r) (D is a distance matrix) is the L1 distance between region q of the query image and region r of an archive image, and n is the number of regions in the query image. The first scheme is termed MLIR hereafter for conciseness.
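As a concrete sketch of this region-matching distance (hypothetical helper names, not the authors' code), the computation can be written with NumPy as follows:

```python
import numpy as np

def mlir_distance(query_regions, archive_regions):
    """Distance between two images represented as region-feature matrices.

    query_regions:   (n, d) array, one feature vector per region of the query.
    archive_regions: (m, d) array, region features of an archive image.

    For each query region q, D(q, r) is the L1 distance to archive region r;
    the image-level distance averages, over the n query regions, the distance
    to the best-matching archive region.
    """
    # Pairwise L1 distances: D[q, r] = sum_j |query[q, j] - archive[r, j]|
    D = np.abs(query_regions[:, None, :] - archive_regions[None, :, :]).sum(axis=2)
    # Average, over query regions, of the best (minimum) match
    return D.min(axis=1).mean()

# Toy example: a 2-region query against a 3-region archive image
q = np.array([[0.0, 1.0], [1.0, 0.0]])
r = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
print(mlir_distance(q, r))  # best matches have distances 0 and 1 -> mean 0.5
```

The broadcasting trick builds the full n x m distance matrix in one step; for very large region counts a loop over query regions would use less memory.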
In the second scheme, we performed a coarse-to-fine retrieval process. For the coarse retrieval step, a subset consisting of images which have at least one overlapped label with the query image was first obtained by comparing the label vectors (17-D vector) between the query image and other images in the archive. Then, in the later fine-retrieval step, we repeated the first scheme mentioned above on the subset to further improve retrieval results. The second scheme is termed MLIR-CF hereafter for conciseness.
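The coarse step of this scheme (label-overlap filtering) can be sketched as follows, assuming images are represented by binary label vectors (17-D in MLRSIR; the helper name is ours):

```python
import numpy as np

def coarse_filter(query_labels, archive_labels):
    """Return indices of archive images sharing at least one label with the query.

    query_labels:   (L,) binary vector over the L class labels.
    archive_labels: (N, L) binary matrix, one row per archive image.
    """
    overlap = archive_labels @ query_labels  # count of shared labels per image
    return np.nonzero(overlap > 0)[0]

query = np.array([1, 0, 1, 0])            # toy example with L = 4 labels
archive = np.array([[0, 1, 0, 0],         # no shared label -> filtered out
                    [1, 1, 0, 0],         # shares label 0
                    [0, 0, 1, 1]])        # shares label 2
print(coarse_filter(query, archive))      # -> [1 2]
```

The fine retrieval step would then apply the region-feature distance only to the returned subset.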

Multi-Label RSIR Based on CNN Features
For multi-label RSIR based on CNN features, the pre-trained CNNs were fine-tuned on the MLRSIR dataset to learn domain-specific features. It is worth noting that the label of each image is a 17-D vector with entries of 1s and 0s, where 1 indicates that the image contains the corresponding class, and 0 otherwise.
To evaluate CNN features for multi-label RSIR, we extracted the features from the fine-tuned fully-connected layers and proposed two schemes to investigate the performance. In the first scheme, the CNN-features-based multi-label RSIR was also evaluated as a single-label RSIR, which was the same as the first scheme in Section 3.2.1.
The second scheme relied on the label vectors to perform a coarse retrieval. Specifically, we first split the MLRSIR archive into two subsets, i.e., training and test sets, respectively, where the training set was used to fine-tune the pre-trained CNN, while the test set was used to perform coarse retrieval. Then we predicted the label vector of each image in the test archive by converting its corresponding label score (the output of the fine-tuned CNN) into binary values (0 and 1).
For binarization, a 17-D threshold vector was needed. Let L_i = [l_{i,1}, l_{i,2}, l_{i,3}, . . . , l_{i,k}] (i = 1, 2, . . . , n) and S_i = [s_{i,1}, s_{i,2}, s_{i,3}, . . . , s_{i,k}] (i = 1, 2, . . . , n) denote the label vectors and corresponding label scores of all the training images, respectively, where n and k are the number of training images and class labels, respectively. For class label k, the threshold t_k was determined by taking the average of the minimum label score over training images with l_{i,k} = 1 and the maximum label score over training images with l_{i,k} = 0. This process was repeated 17 times to obtain the 17-D threshold vector. The class label l_k of each test image was then set to 1 if its label score s_k ≥ t_k, and 0 otherwise. Once the label vectors of all the test images were obtained, the Hamming distance between the query image and the other images in the archive was calculated by comparing their corresponding label vectors, as shown in Equation (2):

d_H(l_q, l_r) = (1/L) ∑_{i=1}^{L} (l_{q,i} ⊕ l_{r,i})    (2)

where l_q and l_r are the label vectors of the query image and an archive image, respectively, ⊕ denotes the exclusive OR, and L is the number of class labels. The second scheme is termed CNN-HM hereafter for conciseness.
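The thresholding and Hamming-distance steps can be sketched in NumPy as follows (a simplified version under the definitions above; function names are ours):

```python
import numpy as np

def fit_thresholds(labels, scores):
    """Per-class threshold t_k: average of the minimum score among positives
    (l_ik = 1) and the maximum score among negatives (l_ik = 0)."""
    n, k = labels.shape
    t = np.empty(k)
    for j in range(k):
        pos_min = scores[labels[:, j] == 1, j].min()
        neg_max = scores[labels[:, j] == 0, j].max()
        t[j] = (pos_min + neg_max) / 2.0
    return t

def binarize(scores, thresholds):
    """Predicted label vector: 1 where the score reaches the threshold."""
    return (scores >= thresholds).astype(int)

def hamming_distance(lq, lr):
    """Fraction of the class labels on which two label vectors disagree."""
    return np.mean(lq != lr)

# Toy example with 3 training images and 2 classes
labels = np.array([[1, 0], [1, 1], [0, 1]])
scores = np.array([[0.9, 0.2], [0.7, 0.8], [0.3, 0.6]])
t = fit_thresholds(labels, scores)        # [(0.7+0.3)/2, (0.6+0.2)/2] = [0.5, 0.4]
pred = binarize(np.array([0.6, 0.1]), t)  # -> [1, 0]
print(t, pred, hamming_distance(pred, np.array([1, 1])))
```

Note that the sketch assumes every class has at least one positive and one negative training image, which holds for MLRSIR.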

Experiments and Results
In this section, we evaluate the single-label and multi-label RSIR methods on the proposed MLRSIR dataset.

Experimental Setup
In our experiments, simple statistics, the color histogram, Gabor texture, HOG, PHOG, GIST, and LBP were used for single-label RSIR based on handcrafted features and evaluated on the presented MLRSIR dataset. For multi-label RSIR based on handcrafted features, i.e., MLIR, each region was described by concatenating color (a histogram of each channel), texture (GLCM), and shape features (area, convex area, perimeter, extent, solidity, and eccentricity) to obtain a 110-D feature vector.
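As an illustration of this kind of region descriptor, a simplified sketch is given below (hypothetical helper; it covers only per-channel color histograms plus two shape statistics, whereas the full 110-D descriptor also includes GLCM texture and further shape measures, e.g. via scikit-image's region properties):

```python
import numpy as np

def region_features(image, mask, bins=8):
    """Simplified region descriptor: per-channel color histograms plus two
    shape statistics (area and bounding-box extent) of the binary mask.

    image: (H, W, C) array; mask: (H, W) boolean region mask.
    """
    feats = []
    for c in range(image.shape[2]):                 # color: histogram per channel
        hist, _ = np.histogram(image[..., c][mask], bins=bins, range=(0, 256))
        feats.append(hist / max(mask.sum(), 1))     # normalize by region area
    area = mask.sum()
    ys, xs = np.nonzero(mask)
    bbox_area = (np.ptp(ys) + 1) * (np.ptp(xs) + 1)
    extent = area / bbox_area                       # shape: fill ratio of the bbox
    return np.concatenate(feats + [[area, extent]])

# Toy 4x4 RGB image with a 2x2 region
img = np.zeros((4, 4, 3), dtype=np.uint8)
m = np.zeros((4, 4), dtype=bool)
m[1:3, 1:3] = True
v = region_features(img, m)
print(v.shape)  # 3 channels x 8 bins + 2 shape stats -> (26,)
```

The per-image feature matrix is then formed by stacking one such vector per connected region.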
For single-label and multi-label RSIR based on CNN features, we chose VGGM as the pre-trained CNN, since it achieves slightly better performance than the other two VGG networks, i.e., VGGF and VGGS, on the UC Merced archive. The VGGM network was fine-tuned with single labels and multiple labels, respectively. The convolutional architecture for fast feature embedding (Caffe) framework [36] was used for fine-tuning, and the parameters are shown in Table 2. The weights of the pre-trained VGGM were transferred to the network to be fine-tuned. To accelerate training and avoid overfitting, the weights of the convolutional layers were fixed during fine-tuning. The weights of the first two fully-connected layers were used as initial weights, and the weights of the last fully-connected layer were initialized from a Gaussian distribution (with a mean of 0 and a standard deviation of 0.01). We randomly selected 80% of the images from each broad category of MLRSIR as the training set, and the remaining 20% of the images were used to evaluate retrieval performance.

To be consistent with the recent work that presents a benchmark dataset for evaluating RSIR methods [34], we selected the L1 distance for the color histogram and the L2 distance for the other features. The average normalized modified retrieval rank (ANMRR), mean average precision (mAP), precision at k (P@k, where k is the number of retrieved images), and the precision-recall curve were used to evaluate retrieval performance. For ANMRR, lower values indicate better performance, while for mAP and P@k, larger values are better. It is worth noting that each image was taken as a query image, and the query image itself was also regarded as a similar image in the following experiments.
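For reference, precision at k and average precision (the per-query quantity behind mAP) can be computed from a ranked relevance list as follows (a generic sketch, not the authors' evaluation code; ANMRR additionally weights the ranks of relevant items):

```python
import numpy as np

def precision_at_k(relevant, k):
    """Fraction of the top-k retrieved items that are relevant.
    relevant: binary array in ranked order (1 = relevant)."""
    return np.mean(relevant[:k])

def average_precision(relevant):
    """Mean of P@k taken at each rank k where a relevant item occurs."""
    relevant = np.asarray(relevant, dtype=float)
    hits = np.flatnonzero(relevant)          # 0-based ranks of relevant items
    if hits.size == 0:
        return 0.0
    precisions = [(i + 1) / (r + 1) for i, r in enumerate(hits)]
    return float(np.mean(precisions))

ranked = [1, 0, 1, 1, 0]                     # toy ranked relevance judgments
print(precision_at_k(np.array(ranked), 3))   # 2/3
print(average_precision(ranked))             # mean(1/1, 2/3, 3/4)
```

mAP is then the mean of `average_precision` over all query images.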
To further evaluate the performance of the multi-label RSIR methods, three metrics, i.e., accuracy, precision, and recall, were computed as follows:

Accuracy = (1/N) ∑_{i=1}^{N} |Q ∩ R_i| / |Q ∪ R_i|    (3)

Precision = (1/N) ∑_{i=1}^{N} |Q ∩ R_i| / |R_i|    (4)

Recall = (1/N) ∑_{i=1}^{N} |Q ∩ R_i| / |Q|    (5)

where N and L are the number of returned images and labels, respectively, and Q and R_i are the label vectors of the query image and the ith returned image, respectively.
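Under the definitions above (Q and R_i as binary label vectors), the three metrics reduce to set-overlap ratios averaged over the returned images; a minimal sketch:

```python
import numpy as np

def multilabel_metrics(Q, R):
    """Accuracy, precision, and recall of N returned images against a query.

    Q: (L,) binary label vector of the query image.
    R: (N, L) binary label matrix of the returned images.
    Averages, over the N returned images, the Jaccard overlap (accuracy),
    |Q ∩ R_i| / |R_i| (precision), and |Q ∩ R_i| / |Q| (recall).
    """
    inter = np.minimum(Q, R).sum(axis=1).astype(float)  # |Q ∩ R_i|
    union = np.maximum(Q, R).sum(axis=1)                # |Q ∪ R_i|
    acc = np.mean(inter / union)
    prec = np.mean(inter / R.sum(axis=1))
    rec = np.mean(inter / Q.sum())
    return acc, prec, rec

Q = np.array([1, 1, 0, 0])
R = np.array([[1, 1, 0, 0],     # perfect match
              [1, 0, 1, 0]])    # one shared label, one spurious label
print(multilabel_metrics(Q, R))
```

For the perfect match all three ratios are 1; the second image contributes 1/3, 1/2, and 1/2, so the averages are 2/3, 3/4, and 3/4.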

Results of Single-Label and Multi-Label RSIR
The single-label and multi-label RSIR based on handcrafted features were evaluated using the whole MLRSIR dataset, and the results are shown in Table 3. It can be observed that the multi-label RSIR method MLIR outperforms most of the handcrafted features except the Gabor texture feature, which achieves the best performance in terms of ANMRR value. However, MLIR tends to achieve slightly better performance than Gabor texture in terms of P@k values as the number of returned images increases (k ≥ 1000), indicating that MLIR is more scalable than Gabor texture in a large-scale remote sensing archive. The results in Table 3 demonstrate the advantages of multi-label RSIR over single-label RSIR.

Table 4 shows the performance of single-label and multi-label RSIR based on CNN features. These results were obtained using the test set, i.e., 20% of the MLRSIR dataset, as mentioned in Section 3.2.2. We extracted features from the first two fully-connected layers and obtained four features, i.e., CNN-Fc6, CNN-Fc6ReLU, CNN-Fc7, and CNN-Fc7ReLU. The results indicate that the fine-tuned CNN features outperformed the pre-trained CNN features, and that multi-label RSIR performed slightly better than single-label RSIR for these four features except CNN-Fc7, whose ANMRR values were 0.3350 and 0.3440 for single-label and multi-label RSIR, respectively. It can also be observed that the activation function ReLU affects the performance of features extracted from the fully-connected layers for both single-label and multi-label RSIR. In addition, CNN-HM achieved the worst performance for all the evaluated metrics. This is because CNN-HM is essentially a coarse retrieval that relies only on the labels of the images. CNN-HM could be used as a first-stage retrieval to filter out images that do not contain the same classes as the query image.
Figure 3 shows the precision-recall curves for single-label and multi-label RSIR based on CNN features. The performance is consistent with the results in Table 4.
We selected the best performing features among the pre-trained CNN features, the single-label RSIR features, and the multi-label RSIR features, respectively, and plotted the ANMRR histogram for each broad class in MLRSIR, as shown in Figure 4. Multi-label RSIR, i.e., CNN-Fc7ReLU (ML) in Table 4, achieved the best performance for most of the broad classes except intersection and parking lot. For the intersection class, multi-label RSIR even achieved the worst performance. A possible explanation is that an intersection image usually contains more primitive classes, including pavement, cars, trees, grass, buildings, and bare soil. This makes it difficult to accurately represent such images, since the features were extracted from the regions and the spatial relationships between different regions were not considered.

Table 4. Performance of single-label and multi-label RSIR based on CNN features. "SL" and "ML" mean the CNNs are fine-tuned with single and multiple labels, respectively. "ReLU" means the feature is extracted with the use of the activation function. The bold values indicate the best result for each performance metric.

Comparisons of the Multi-Label RSIR Methods
We compared our multi-label RSIR method (MLIR-CF) with several state-of-the-art methods, including KNN, ARGMM, and MLIRM, on the presented MLRSIR dataset. The results are shown in Table 5. It can be seen that MLIR-CF outperformed KNN, but performed worse than the other two methods. This is because the graph matching strategy based on an attributed relational graph (ARG) was used for the similarity measure in both ARGMM and MLIRM.

Conclusions
In this paper, we proposed a benchmark dataset named MLRSIR for multi-label RSIR. We expect MLRSIR to help advance the development of RSIR approaches, particularly supervised-learning-based methods. We also compared the performance of single-label and multi-label RSIR on MLRSIR based on handcrafted and CNN features. MLRSIR was collected for RSIR, and particularly multi-label RSIR, but it can also be used for other problems, such as semantic segmentation.
Author Contributions: The research idea and design were conceived by K.Y. and W.Z. The experiments were performed by K.Y. and W.Z. The manuscript was written by K.Y. Z.S. helped revise the manuscript.