1. Introduction
Remote-sensing scene classification aims to predict the semantic category of an image block by mining the visual primitives in a remote-sensing scene (image block) and the spatial relationships between them [1,2]. It can greatly reduce the confusion of pixel-level or object-level ground-object interpretation and improve the stability and accuracy of remote-sensing image interpretation. It has important application value in content-based remote-sensing image retrieval and remote-sensing target detection [3,4].
In the field of remote sensing, data discrepancies between the source domain and the target domain are often caused by differences in imaging time, atmospheric conditions, imaging locations and imaging sensors [5,6]. In this case, a classifier trained directly on source-domain data cannot achieve the desired results in the target domain. Therefore, a model trained on a specific dataset (source domain) is often difficult to generalize to another image set (target domain) that the model has not seen during training [5]. Moreover, solving the domain adaptation problem by directly fine-tuning the model on the target domain requires collecting the corresponding labels, which is time-consuming and laborious and limits in-orbit intelligent applications. At present, with the increase in satellites and sensors, the number and types of available remote-sensing images are becoming increasingly diverse. It is unrealistic to build datasets and fine-tune models for massive new multisource data and tasks. An effective method is urgently needed to solve the generalization problem of remote-sensing images from source domain to target domain, which would not only provide efficient solutions for downstream tasks but also offer a feasible way to break through the barriers between existing datasets and achieve larger-scale applications.
The traditional approach is to label a small amount of new data and fine-tune the network trained in the source domain to adapt to the new data. This is not only time-consuming and laborious but also challenging because target-domain data are typically difficult to acquire, small in quantity and hard to label. Additionally, for problems with large domain differences, such as applying a model trained on visible data to shortwave infrared, thermal infrared or even synthetic aperture radar (SAR) data, the model often fails to achieve good results. For tasks with no labels in the target domain and great differences between the source and target domains, the mainstream method is domain adaptation, which improves performance in the target domain by reducing the difference in feature distributions between the two domains. The research focus of domain adaptation algorithms is how to align the feature distributions of the source and target domains without changing the important attributes of the data, so that a classifier trained only on source-domain data can be applied directly to target-domain data and achieve satisfactory classification results.
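To make this idea concrete, a widely used discrepancy measure of this kind is the maximum mean discrepancy (MMD), the statistic underlying methods such as DAN evaluated later in this paper. The following is a minimal single-kernel PyTorch sketch for illustration only; practical implementations typically combine multiple kernel bandwidths:

```python
import torch

def gaussian_kernel(x, y, sigma=1.0):
    # Pairwise RBF (Gaussian) kernel between two batches of feature vectors.
    sq_dist = torch.cdist(x, y) ** 2
    return torch.exp(-sq_dist / (2 * sigma ** 2))

def mmd_loss(source_feats, target_feats, sigma=1.0):
    # Maximum mean discrepancy: the distance between the mean kernel
    # embeddings of the source and target feature distributions.
    k_ss = gaussian_kernel(source_feats, source_feats, sigma).mean()
    k_tt = gaussian_kernel(target_feats, target_feats, sigma).mean()
    k_st = gaussian_kernel(source_feats, target_feats, sigma).mean()
    return k_ss + k_tt - 2 * k_st

# Placeholder 256-d features for an 8-image batch from each domain.
loss = mmd_loss(torch.randn(8, 256), torch.randn(8, 256))
```

Minimizing such a loss alongside the source-domain classification loss pulls the two feature distributions together without requiring target labels.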
Domain adaptation algorithms have become one of the popular research topics in the field of computer vision, while research in the field of remote sensing is relatively lagging. Datasets are an important driver and promoter of this development. Currently, most public remote-sensing scene classification datasets consist of optical remote-sensing images, while few are based on radar, short-wave infrared or thermal infrared images. Because of their special imaging mechanisms, radar, short-wave infrared and thermal infrared data have characteristics that are complementary to optical images. Therefore, it is of great significance and value to construct domain-adaptive remote-sensing scene classification datasets including visible light, radar, short-wave infrared and thermal infrared images.
In this paper, we extend the preliminary version of MRSSC, i.e., MRSSC 1.0 [7], to MRSSC2.0. Specifically, MRSSC2.0 collects 26,710 scene images from a wide-band imaging spectrometer (WIS) and an interferometric imaging radar altimeter (InIRA) on Tiangong-2 and contains 7 categories from 4 domains. Ten domain adaptation methods are evaluated, and three applications are presented, which helps to verify the domain transfer performance of data between different modalities and to explore innovative applications of domain adaptation in remote sensing. This study will provide a data source for researchers to carry out applied research on domain transfer learning based on artificial intelligence and remote-sensing images, provide strong support for transfer learning on remote-sensing scene classification datasets and promote deep fusion applications of multiple remote-sensing scene classification datasets.
3. Materials and Methods
3.1. Image Collection
Data with different imaging mechanisms were the basis for constructing domain-adaptive datasets. The Tiangong-2 space laboratory is equipped with a wide-band imaging spectrometer (WIS) and an interferometric imaging radar altimeter (InIRA), which can obtain visible near-infrared (VIS), short-wavelength infrared (SWI), thermal infrared (INF) and SAR images [26]. For VIS, SWI and INF, the spectral ranges were 0.4~1.0 μm, 1.0~1.7 μm and 8.0~10.0 μm; the channel numbers were 14, 2 and 2; the spatial resolutions were 100 m, 200 m and 400 m; and the absolute radiometric calibration accuracies were 10%, 10% and 2 K, respectively [10]. For InIRA, the working frequency was 13.58 GHz, the working bandwidth was 40 MHz, the backscatter sounding uncertainty was less than 2.0 dB and the spatial resolution of the two-dimensional images was 40 × 40 m. In addition, InIRA provides high-resolution two-dimensional images and DEM products that were not included in this dataset [27]. WIS and InIRA provide multilevel data products: the level-2 WIS product was processed by field-of-view spreading, interband registration, nonuniformity correction, radiometric correction, sensor correction and geometric correction, and the level-2 InIRA product was processed by imaging processing, azimuth multiview processing, radiometric correction and geometric correction to form a two-dimensional image product with map projection. The above data combined high quality with clear domain differences, providing rich data sources for MRSSC. In order to improve the diversity of the data, we carefully selected data from different regions, imaging times and imaging conditions.
3.2. Category Selection
Seven categories were chosen and annotated in our MRSSC2.0 dataset, including river, lake, city, farmland, mountain, coast and desert, as shown in Figure 1.
The categories were selected according to the characteristics of the data and their value for real-world applications. The spatial resolutions of VIS, SWI, INF and SAR were medium, so the scene categories were kept as consistent as possible with land-cover classifications. The advantage of this is that, in the application scenario of in-orbit intelligent analysis, relatively large-scale scene classification helps to quickly analyze and understand the semantics of in-orbit push-broom data. Combined with fine-grained target detection methods, it enables in-orbit real-time analysis at the subsecond level and supports application scenarios of intelligent processing, storage and downlink.
3.3. Annotation Method
Data Clipping. The dataset was obtained by professional remote-sensing experts through manual cropping. To ensure that the image label reflected the image content, the main object was located in the middle of the image and either accounted for a large proportion of it or was otherwise visually salient.
Considering the information content of the scenes, the spatial resolutions and the adaptability of the algorithms, the image size in MRSSC was 256 × 256 pixels for all four domains. Because the resolutions of the four domains differed, the scenes had scale differences. For coast, desert and mountain, the effect of scale on the images was small. For other scenes, objects of different scales were selected to compensate for this difference. For example, smaller lakes were selected in VIS, while larger lakes were selected in SWI and INF, to ensure that each cropped image represented the overall scene.
Band Selection. VIS was an RGB true-color image (R: 0.655~0.675 μm, G: 0.555~0.575 μm, B: 0.480~0.500 μm), SWI was a gray image (1.23~1.25 μm), INF was a gray image (8.125~8.825 μm), and SAR was a two-dimensional backscatter image. Different types of data had different imaging mechanisms, resulting in different data distributions: VIS truly reflected the surface state, SWI was more sensitive to soil moisture, INF reflected the surface temperature state and SAR reflected the degree of surface backscattering.
3.4. Dataset Splits
In order to test the transferability between different data, we took VIS as the source domain and SWI, INF and SAR as the target domains. The MRSSC2.0 dataset was divided into seven parts: one source domain, three target domains and three test sets. There was no data intersection between the target domains and the test sets. The source domain included VIS images, and the target domains were SWI, INF and SAR images, represented by the numbers 1, 2 and 3, respectively. Target and test sets were divided 4:1. The images in the source domain contained labels for training, while the images in the target domains were unlabeled and used for domain adaptation. The number of images in each of the seven categories is shown in Table 1.
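As a minimal sketch of the 4:1 target/test split described above (not the paper's own splitting code), a stratified split with scikit-learn, assuming the image paths and labels for one target domain have already been collected:

```python
from sklearn.model_selection import train_test_split

# Placeholder inputs: in practice these would be the file paths and
# category labels collected for one target domain (e.g., SWI).
image_paths = [f"swi_{i:05d}.png" for i in range(1000)]
labels = [i % 7 for i in range(1000)]  # 7 scene categories

# 4:1 target/test split, stratified so category proportions are preserved.
target_paths, test_paths, y_target, y_test = train_test_split(
    image_paths, labels, test_size=0.2, stratify=labels, random_state=0
)
```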
3.6. Properties of MRSSC Dataset
3.6.1. Large Domain Differences Due to Different Imaging Mechanisms
The imaging mechanisms of the wide-band imaging spectrometer (WIS) and the interferometric imaging radar altimeter (InIRA) are very different. The WIS obtains image data using visible-light and infrared-band sensors, while the InIRA uses microwave-band (Ku-band) sensors. WIS records gray-level information in multiple bands for target recognition and classification, while InIRA records only the echo information of one band and then extracts the corresponding amplitude and phase information through transformation. The amplitude information usually corresponds to the backscattering intensity of radar waves from ground targets and is closely related to the medium, water content and roughness. Additionally, the signal-to-noise ratio of SAR images is low, and there are unique geometric distortions such as layover, foreshortening and multipath false targets.
The three WIS data products, VIS, SWI and INF, also show certain differences due to their different imaging bands. SWI uses reflected-light imaging, which is similar in principle to visible-light imaging. The difference is that the SWI band can “bypass” small particles in smoke, fog and haze and has better detail resolution and analysis ability. INF uses radiant thermal imaging to reflect the temperature differences of ground-object surfaces and can be used for night imaging or imaging in smoke and fog.
These multimodal data have strong complementarity and domain differences in data distribution, which poses challenges for research on multimodal remote-sensing scene understanding.
3.6.2. Large Domain Differences Caused by Scale Differences
The resolutions of VIS, SWI, INF and SAR were 100, 200, 400 and 40 m, respectively, and the image sizes were almost all 256 × 256 pixels. Following the principles of dataset production, there were scale differences among the final cropped categories. For example, larger cities were selected in the low-resolution domains, while smaller cities were selected in the high-resolution domains, to ensure that the target category occupied the main part of the image.
In addition, in order to increase data richness, intra-class differences, classification difficulty and model robustness, we selected images of different scales within the same category; rivers, for example, include wide, medium-sized and narrow rivers.
3.6.3. Large Domain Differences Due to Different Imaging Time
The imaging times of VIS, SWI, INF and SAR were different. VIS and SWI could only be imaged in the daytime, while INF and SAR could be imaged throughout the day.
In addition, in order to increase data richness, we selected data from different phases, including different seasons and different imaging times. The statistics of acquisition time are shown in Figure 2; the resulting intra-class differences help to extract richer class features and improve model robustness.
3.6.4. Large Domain Differences Caused by Different Imaging Areas
MRSSC2.0 contains images of different regions within 42 degrees of north–south latitude, not limited to specific countries, cities or scene types, with rich spatial diversity. Due to the differences in spatial locations, the data distributions of the same scene differ among the four data domains of MRSSC2.0 in color, shape, texture, etc. In addition, there are large differences within the same scene category, such as farmland with large, regularly shaped blocks versus farmland with small pieces of varying shapes, as shown in Figure 3. This increases the challenge for domain adaptation algorithms, prevents overfitting and improves model performance.
4. Benchmark Results
4.1. Problem Setting
Due to the abundance of visible image data, using visible images for remote-sensing scene classification is a mature technique. How to use a classifier trained on visible images for scene classification of short-wave infrared, thermal infrared and SAR data is a research topic that remains to be solved.
Due to the domain differences caused by different sensors, the data distribution does not meet the consistency assumption in machine learning that training data and test data are drawn from the same feature space, so good classification results cannot be achieved in the target domain. To solve this problem, we can train a scene classifier using abundant optical images, adjust the knowledge learned in one domain (called the source domain) through domain adaptation (DA) and apply it to another related domain (called the target domain). By reducing the differences in data and feature distributions between the source and target domains, a model trained in the source domain can work normally in the target domain.
The source domain set was labeled VIS data, and the target domain set consisted of three unlabeled multimodal datasets: SWI, INF and SAR data. The corresponding labeled SWI, INF and SAR test sets were used for model evaluation. The remaining experiments in this section investigated the following questions: (1) Can DA increase scene classification accuracy compared with the source-only approach, which directly applies the classification model trained on optical images to unlabeled data? (2) Which DA methods perform better in the case of large appearance differences between source and target data? (3) What are the differences in the effects of DA between different source–target pairs?
4.2. Evaluation Task and Metrics
The labeled source-domain data and the unlabeled target-domain data were used to train the network, and the test set was then used to measure the classification accuracy of the network. We selected the confusion matrix, overall accuracy and kappa coefficient to characterize classification accuracy and used t-SNE and Grad-CAM to analyze the performance of DA.
The confusion matrix is used to analyze the number of correctly classified and misclassified samples for each category.
Overall accuracy (OA) is used to characterize classification accuracy, but for multiclass tasks with unbalanced category sample sizes, its value is strongly affected by categories with large sample sizes.
The kappa coefficient represents the proportion of error reduction of the classification relative to a completely random classification. Its value range is (−1, 1), and in practical applications it generally falls within (0, 1); the larger the value, the higher the classification accuracy of the model.
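The three metrics can be computed, for instance, with scikit-learn; the following is an illustrative sketch with placeholder predictions, not the evaluation code used in the experiments:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, cohen_kappa_score

# Placeholder predictions for a 7-class task; in practice these come
# from the trained classifier applied to the labeled test set.
y_true = np.array([0, 1, 2, 3, 4, 5, 6, 0, 1, 2])
y_pred = np.array([0, 1, 2, 3, 4, 5, 5, 0, 2, 2])

cm = confusion_matrix(y_true, y_pred)       # per-class correct/misclassified counts
oa = accuracy_score(y_true, y_pred)         # overall accuracy
kappa = cohen_kappa_score(y_true, y_pred)   # agreement beyond chance

print(cm)
print(f"OA = {oa:.4f}, kappa = {kappa:.4f}")
```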
t-SNE is a dimensionality reduction technique suitable for visualizing high-dimensional data [28]. It converts the similarities between data points into joint probabilities and tries to minimize the KL divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE is used here to visualize high-dimensional features after dimensionality reduction, intuitively reflecting the feature distributions before and after DA, and can thus be used to analyze the performance of DA.
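A minimal sketch of this visualization, assuming backbone features (e.g., 2048-dimensional ResNet-50 features) have already been extracted for both domains; the red/blue convention follows Figure 5:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder features; in practice these are backbone features
# extracted from source- and target-domain images.
source_feats = np.random.randn(500, 2048)
target_feats = np.random.randn(500, 2048)

feats = np.vstack([source_feats, target_feats])
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(feats)

n = len(source_feats)
plt.scatter(emb[:n, 0], emb[:n, 1], c="red", s=4, label="source")
plt.scatter(emb[n:, 0], emb[n:, 1], c="blue", s=4, label="target")
plt.legend()
plt.show()
```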
Grad-CAM is a model visualization method that obtains weighting coefficients through back-propagation and then produces a heatmap, which is used to visualize the activations of a specific network layer and the pixels in the image that strongly influence the output [29]. Through this analysis, we could intuitively see how DA methods optimize network feature extraction.
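A minimal hook-based Grad-CAM sketch for a ResNet-50 classifier (the backbone used in our experiments); this is an illustrative reconstruction, not the visualization code used for Figure 6:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

# In practice the trained, adapted model would be loaded here.
model = resnet50(num_classes=7).eval()
feats, grads = {}, {}

layer = model.layer4  # last convolutional stage
layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

x = torch.randn(1, 3, 224, 224)   # placeholder input image tensor
score = model(x)[0].max()         # score of the predicted class
model.zero_grad()
score.backward()

w = grads["a"].mean(dim=(2, 3), keepdim=True)            # channel weights
cam = F.relu((w * feats["a"]).sum(dim=1, keepdim=True))  # weighted activation map
cam = F.interpolate(cam, size=(224, 224), mode="bilinear")
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
```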
4.3. Implementation Details
We conducted three transfer tasks on the MRSSC2.0 dataset: VIS to SWI (VIS → SWI), VIS to INF (VIS → INF) and VIS to SAR (VIS → SAR). We selected 10 domain adaptation methods based on their performance on generic domain adaptation datasets and the availability and reproducibility of their source code: adversarial discriminative domain adaptation (ADDA) [21], adaptive feature norm (AFN) [19], batch spectral penalization (BSP) [24], conditional domain adversarial network (CDAN) [22], deep adaptation network (DAN) [17], domain-adversarial neural network (DANN) [21], joint adaptation network (JAN) [18], minimum class confusion (MCC) [20], maximum classifier discrepancy (MCD) [23] and margin disparity discrepancy (MDD) [25]. Additionally, we included a source-only baseline [30] for comprehensive comparison, in which only the source-domain data were used for training without any domain adaptation strategy.
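As an illustration of the adversarial family of methods listed above, the following is a minimal gradient reversal layer, the core component of DANN, reimplemented in PyTorch for exposition; it is not the DALIB code used in the experiments, and the 2048-d features and discriminator sizes are assumptions:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients
    in the backward pass, so the feature extractor is trained to
    *fool* the domain discriminator."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Domain discriminator on top of backbone features (e.g., ResNet-50, 2048-d).
domain_head = nn.Sequential(
    nn.Linear(2048, 256), nn.ReLU(), nn.Linear(256, 2)
)

# In the training loop, features from source and target batches pass
# through the reversal layer before domain classification.
feats = torch.randn(8, 2048)  # placeholder backbone features
domain_logits = domain_head(grad_reverse(feats, lambd=1.0))
```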
DALIB [31], a transfer learning library providing the source code of the selected algorithms, was used to implement the methods. The specific experimental settings are shown in Table 2.
In the training phase, the source and target images were randomly cropped to 224 × 224 and randomly horizontally flipped as input to the network. In the test phase, the test-set images were center-cropped to 224 × 224 for prediction.
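A minimal sketch of this preprocessing with torchvision; the text specifies only the crops and the flip, so any normalization step is left out:

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomCrop(224),          # random 224 × 224 crop from 256 × 256
    transforms.RandomHorizontalFlip(),   # random horizontal flip
    transforms.ToTensor(),
])

test_transform = transforms.Compose([
    transforms.CenterCrop(224),          # deterministic center crop
    transforms.ToTensor(),
])
```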
4.4. Benchmark Results
4.4.1. Overall Accuracy
We evaluated the 10 domain adaptation methods for scene classification on the 3 transfer tasks. Considering the instability of adversarial methods, we ran each method 10 times with different random seeds. We recorded the optimal OA over the 20 epochs of each run and calculated the mean and standard deviation. The overall results are reported in Table 3.
Table 3 summarizes the overall accuracy of the 10 DA methods. According to the source-only results (without any DA algorithm), VIS → SAR is the lowest, VIS → INF the second lowest and VIS → SWI the highest. The data distributions of the three target domains differ: VIS and SAR have the largest domain difference due to their different imaging mechanisms, while VIS and SWI are both optical images with different band ranges, so their domain difference is small. Comparing the DA methods shows that an appropriate DA algorithm helps to improve OA and that the best algorithm differs per domain. For VIS → SWI, the OA of MCD is 5.25% higher than source only, followed by AFN. For VIS → INF, the OA of AFN is 6.48% higher than source only, followed by MCD. For VIS → SAR, the OA of the best method, DANN, is 32.49% higher than source only. This shows that DA is more effective for scenarios with large differences in data distribution and slightly less effective when the differences are relatively small. It should be noted that, in contrast with source only, not all DA methods improved scene classification accuracy, and some even had adverse impacts on the results. The variances show that the DA methods are unstable: the final results differ with different random seeds. Although the existing DA methods improved the classification results to varying degrees, their stability and adaptability are poor. This suggests that future work could design a more robust DA algorithm suitable for remote-sensing scene classification.
4.4.2. Confusion Matrix
Confusion matrices obtained from the source-only prediction results and from the best DA method for the three transfer tasks can be used to study the per-class distribution in the experiments.
As shown in Figure 4, in the VIS → SWI task, compared with source only, MCD greatly improved the prediction accuracy of the various categories, with only a small amount of misclassification. In the VIS → INF task, the source-only prediction accuracy for most categories was not ideal; the predictions of each category improved to a certain extent with AFN, but the city and farmland categories were easily confused, and desert was also confused with other categories. In the VIS → SAR task, the prediction accuracy of DANN was significantly improved over source only for most categories, but river was sometimes misclassified as lake.
These experimental results demonstrate the effectiveness of DA, which was able to bridge the appearance differences in the different cross-domain tasks. Studying the essence of the performance differences between algorithms and designing applicable DA algorithms according to the data characteristics is a point for follow-up research.
4.4.3. t-SNE Analysis
We visualized the features from source only and from the best DA method for each task using t-SNE to show the feature distributions of the source-domain and target-domain images extracted by the network. In Figure 5, in the first and second columns, points of each color stand for a category; in the third and fourth columns, points of different colors indicate different domains, where red indicates the source domain and blue indicates the target domain. The paired columns together reflect the correspondence between categories and domains. Tighter clustering within the same category means better separability.
The third column of Figure 5 shows that the data from the source domain and the target domain are distributed in independent areas, which means that the ResNet-50 network trained only with source-domain data cannot align the features of the two domains well. Combined with the category visualization in the first column, it can be seen that the target-domain data do not form good category boundaries on the (b) VIS → INF and (c) VIS → SAR tasks, but they do on the (a) VIS → SWI task, indicating that the difference between the VIS and SWI data domains is not significant, so the source-only method can achieve better results in that target domain, which is consistent with the results in Table 3.
As shown in the fourth column of Figure 5, the networks trained with the best DA method for each of the three tasks align the features of the source and target domains well. Specifically, DANN produced well-separated inter-class and tight intra-class clusters on the (c) VIS → SAR task. However, the effects on the (a) VIS → SWI and (b) VIS → INF tasks are relatively poor, which is related to the imbalance of data categories; the evaluated DA methods could not solve this problem well. This visualization shows that networks trained with DA improve the performance of cross-domain scene classification on all three tasks to varying degrees.
4.4.4. Model Interpretability Analysis
The performance of the DA methods was further analyzed through Grad-CAM visualization. We selected examples that were easily confused in the three tasks and used Grad-CAM to visualize the prediction results of source only and DA. On the generated heatmaps, red indicates the positions that contribute the most to the predicted category, and blue indicates the positions that contribute the least.
In Figure 6, it can be clearly seen that the source-only method paid attention to features in the wrong locations, which led to wrong predictions. In the case of large domain differences between the training and test sets, the DA methods could still focus on the correct region and obtain correct predictions. Taking the first group of results in Figure 6c as an example, the source-only prediction is incorrect because the model focused on the land area, while the semantic label of the input image is coast. With DA, the model could focus on the location of the coastline and predict correctly, further proving the effectiveness of DA.
5. Discussion of Applications
5.1. Data Annotation
Cross-domain data annotation refers to making full use of the existing labeled datasets and pre-annotating the newly acquired data through domain adaptation.
Currently, many remote-sensing scene classification datasets have been proposed. The data sources, spatial distributions, temporal distributions and classification systems of these datasets have their own characteristics and regional differences. Making full use of these labeled data to pre-annotate new data can greatly reduce annotation costs and improve the automation level of dataset production in remote sensing.
In order to verify the feasibility of DA algorithms for data annotation, we selected five categories that are shared between the MRSSC dataset and the AID dataset, namely desert, mountain, farmland, river and beach, for the experiments. As shown in Figure 7, the AID dataset was used as the source domain, and the VIS and SAR images of the corresponding categories in the MRSSC dataset were used as the target domains. We conducted pre-annotation experiments and evaluated them against the ground truth, obtaining accuracies of 84.49% and 83.21%, respectively. Based on the pre-annotation results, labels with low confidence can be corrected through manual review, and the new dataset can then be constructed.
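A minimal sketch of this review step, assuming softmax outputs from the adapted classifier; the 0.9 threshold is an illustrative choice, not a value from the paper:

```python
import torch
import torch.nn.functional as F

def split_by_confidence(logits, threshold=0.9):
    """Route predictions: confident ones become pre-annotations,
    the rest are queued for manual review."""
    probs = F.softmax(logits, dim=1)
    conf, pred = probs.max(dim=1)
    auto_idx = (conf >= threshold).nonzero(as_tuple=True)[0]
    review_idx = (conf < threshold).nonzero(as_tuple=True)[0]
    return pred, auto_idx, review_idx

# Placeholder logits for 10 images over the 5 shared categories.
logits = torch.randn(10, 5)
pred, auto_idx, review_idx = split_by_confidence(logits)
print(f"{len(auto_idx)} auto-labeled, {len(review_idx)} sent to review")
```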
For the case in which the categories of the source and target domains are inconsistent, how to realize domain adaptation for the shared categories while reducing the interference of the inconsistent categories is a problem worth studying.
5.2. Weakly Supervised Object Detection
A weakly supervised localization task uses category information, which is weaker than what the target task requires, to realize target localization, that is, to predict both the category and the location of the target. Weakly supervised target localization based on domain adaptation raises the task difficulty: the training set consists of samples with category information, and the prediction task is to localize targets in samples from a different domain than the training set.
Currently, the problem encountered by weakly supervised localization in practical applications is that, although good accuracy has been achieved on known datasets, when an algorithm is applied to a new scene, domain differences in the data arising from different imaging times, imaging locations and even imaging sensors degrade the model's performance. Domain adaptation can solve the problem of decreased prediction accuracy caused by different data distributions and improve the robustness of the model.
We used MRSSC2.0 to realize the weakly supervised localization of lakes. The specific experimental setup was as follows: the source domain VIS was divided into two classes (lake and non-lake), and the unlabeled SAR data were divided into a training set and a test set at a ratio of 4:1. The training sets of the source domain VIS and the target domain SAR were input into the DA model for training, and the trained model and its parameters were saved. Based on class activation mapping, the heatmap of the output class was obtained for the test set; a reasonable threshold was then set, and the maximum circumscribed rectangle was computed as the target localization result. The experimental results are shown in Figure 8. In each image pair, the left image shows the predicted (green) and ground-truth (red) bounding boxes, and the right image shows the CAM, i.e., where the network focuses for the object.
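A minimal sketch of the localization step, assuming the CAM has already been computed and normalized to [0, 1] (e.g., as in the Grad-CAM sketch above); the 0.5 threshold is illustrative, as the text only specifies that a "reasonable threshold" is set:

```python
import cv2
import numpy as np

def cam_to_bbox(cam, threshold=0.5):
    """Threshold a normalized CAM and return the bounding box of the
    largest activated region as the localization result."""
    mask = (cam >= threshold).astype(np.uint8)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)
    x, y, w, h = cv2.boundingRect(largest)
    return x, y, x + w, y + h

# Placeholder CAM for a 256 × 256 image with one activated region.
cam = np.zeros((256, 256), dtype=np.float32)
cam[80:160, 60:200] = 1.0
print(cam_to_bbox(cam))  # (60, 80, 200, 160)
```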
The method achieved good weakly supervised localization results. This provides a feasible and effective implementation approach for in-orbit intelligent real-time target localization, in-orbit service and other application scenarios in which the training and test sets lie in different domains, and it promotes the development of in-orbit intelligent processing technology.
5.3. Domain Adaptation Data Retrieval
DA methods can also be used in data retrieval. Given a SAR image, how can the most similar image be identified in a massive optical image dataset? First, based on the scene classification results, optical images of the same scene are selected. Then the deep features of the SAR image and the optical images are extracted, a similarity measurement function is designed and the optical image with the highest similarity is taken as the final result. When there is no rough matching between the SAR image and the optical images, image registration based on data retrieval can be carried out. Research on the retrieval of optical and SAR images will support further fine-grained scene recognition and classification and can be extended to cross-modal data navigation and positioning.
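A minimal sketch of this retrieval step, assuming domain-aligned deep features have already been extracted; cosine similarity is one plausible choice for the measurement function, which the text leaves open:

```python
import numpy as np

def retrieve_most_similar(query_feat, gallery_feats):
    """Return the index of the optical image whose feature is most
    similar (by cosine similarity) to the SAR query feature."""
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = g @ q
    return int(np.argmax(sims)), float(np.max(sims))

# Placeholder features: one SAR query, 1000 optical candidates (2048-d).
query = np.random.randn(2048)
gallery = np.random.randn(1000, 2048)
idx, score = retrieve_most_similar(query, gallery)
print(f"best match: image {idx}, similarity {score:.3f}")
```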
6. Conclusions
In this paper, we proposed a remote-sensing cross-domain scene classification dataset including four domains and seven scene types. The effectiveness of domain adaptation methods among VIS, SWI, INF and SAR was verified. Although datasets from different domains differ greatly, the experimental results show that domain adaptation algorithms can reduce the differences in data distribution between domains and improve the accuracy of scene classification. However, there are differences in transferability between the data of different domains, which are related to differences in scale and imaging mechanism. The stability and accuracy of existing DA methods can still be improved. In addition, addressing unbalanced class distributions and solving the domain adaptation problem for multilabel scene classification tasks still face many challenges. We believe that MRSSC will not only promote the development of cross-domain remote-sensing scene classification but also inspire innovative research on weakly supervised object detection and domain adaptation data retrieval.