Open Access
ISPRS Int. J. GeoInf. 2018, 7(5), 182; doi:10.3390/ijgi7050182
Article
Semi-Supervised Ground-to-Aerial Adaptation with Heterogeneous Features Learning for Scene Classification
College of Electronic Science, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Received: 2 April 2018 / Accepted: 9 May 2018 / Published: 10 May 2018
Abstract
Currently, huge quantities of remote sensing images (RSIs) are becoming available. Nevertheless, the scarcity of labeled samples hinders the semantic understanding of RSIs. Fortunately, many ground-level image datasets with detailed semantic annotations have been collected in the vision community. In this paper, we attempt to exploit the abundant labeled ground-level images to build discriminative models for overhead-view RSI classification. However, images from the ground-level and overhead views are represented by heterogeneous features with different distributions; how to effectively combine multiple features and how to reduce the mismatch of distributions are two key problems in this scene-model transfer task. Specifically, a semi-supervised manifold-regularized multiple-kernel-learning (SMRMKL) algorithm is proposed for solving these problems. We employ multiple kernels over several features to learn an optimal combined model automatically. Multi-kernel Maximum Mean Discrepancy (MK-MMD) is utilized to measure the data mismatch. To make use of unlabeled target samples, a manifold-regularized semi-supervised learning process is incorporated into our framework. Extensive experimental results on both cross-view and aerial-to-satellite scene datasets demonstrate that: (1) SMRMKL has an appealing extension ability to effectively fuse different types of visual features; and (2) manifold regularization can improve the adaptation performance by utilizing unlabeled target samples.
Keywords:
remote sensing; scene classification; heterogeneous domain adaptation; cross-view; multiple kernel learning

1. Introduction
With the rapid advance of remote sensing imaging techniques over the past decade, a large number of very high-resolution (VHR) remote sensing images are now accessible, thereby enabling us to study ground surfaces in greater detail [1,2,3,4,5]. Recent studies often adopt the bag-of-visual-words (BOVW) [6,7,8] or deep convolutional neural network (DCNN) representation [9,10,11,12,13,14,15,16,17,18] associated with AdaBoost or support vector machine (SVM) classifiers to learn scene class models. The collection of reference samples is a key component of a successful classification of the land-cover classes. However, in real-world earth observation (EO) applications, the available labeled samples are not sufficient in number, which hinders the semantic understanding of remote sensing images. Directly addressing this problem is challenging because the collection of labeled samples for newly acquired scenes is expensive, and the labeling process involves time-consuming human photo interpretation that cannot keep pace with image acquisition. Instead of collecting semantic annotations for remote sensing images, some research has considered strategies of adaptation, which is a rising field of investigation in the EO community since it meets the need for reusing available samples to classify new images. Tuia et al. [19] provided a critical review of recent domain adaptation methodologies for remote sensing and divided them into four categories: (1) invariant feature selection; (2) representation matching; (3) adaptation of classifiers; and (4) selective sampling. Nevertheless, all these methods [20,21,22,23,24] are designed for annotation transfer between remote sensing images.
With an increasing amount of freely available ground-level images with detailed tags, one natural intuition is that we can train semantic scene models on ground-view images, which have already been collected and annotated, and hope that the models still work well on overhead-view aerial or satellite scene images. Here, "ground view" refers to natural scene images taken from the ground, and "overhead view" refers to remote sensing images taken from above, which include overhead aerial scene images and overhead satellite scene images.
Transferring semantic category models from the ground view to the overhead view has two advantages. First, ground-view and overhead-view images are classified under the same scene class despite being captured from two different views, leading to consistency in the underlying intrinsic semantic features. Second, large-scale ground-view image datasets such as ImageNet [25] and SUN [26] have been built with detailed annotations, which have fostered many efficient ways to describe images semantically. However, the generalization of classifiers pre-trained from ground-level annotations is not guaranteed, as training and testing samples are drawn from different probability distributions. To solve this problem, on the one hand, several works have addressed the cross-view (ground-to-aerial) domain adaptation problem in the context of image geolocalization [27]. On the other hand, the work of [28,29,30,31,32] must be mentioned, as the authors aim to transfer scene models from ground to aerial based on the assumption that scene transfer is a special case of cross-domain adaptation, where the divergences across domains are caused by viewpoint changes; this is somewhat similar in spirit to our work. However, all these methods are feature-learning-based adaptation approaches, where ground-view and overhead-view data are represented by a single kind of feature, such as the histogram of oriented gradients (HOG) feature. Nevertheless, multiple features should be considered because the elements of the same scene captured from two different views may appear at different scales and orientations. Because different types of features describe different visual aspects, it is difficult to determine which feature is better for adaptation. When considering heterogeneous types of features with different dimensions, scene model transfer becomes an even more challenging task. Figure 1 illustrates the considerable appearance discrepancy within the same residential class captured from four views.
Six types of features of each image are projected onto two dimensions using t-Distributed Stochastic Neighbor Embedding (t-SNE) [33] with different colors. The solid hexagram points represent the residential-class images captured from different views. The complexity of the different features and the distinct distributions across views pose great challenges to adaptive learning schemes.
Techniques for addressing the mismatched distributions of multiple types of features with different dimensions have been investigated under the name of heterogeneous domain adaptation (HDA). Most existing HDA approaches are feature-representation-based methods that aim to make the data distributions more similar across domains [21,34,35]. However, these methods are suitable for transfer tasks with limited deformations, whereas the differences between cross-view images are huge. With the rapid development of deep neural networks, more recent works use deep adaptation methods [36,37] to reduce the domain shift, which brings new insights into our cross-view scene model transfer task. However, deep-adaptation-based approaches require a large number of labeled samples to train the network in a reasonable time [38]. Generally, the ground-view domain contains a large amount of labeled data, so a classifier can be reliably built, while the labeled overhead-view data are often very few and are not sufficient on their own to construct a good classifier. Thus, based on the guidelines for choosing an adaptation strategy in [19], we focus on classifier adaptation methods that can utilize the source-domain models as prior knowledge to learn the target model. However, due to the huge domain mismatch between ground-view and overhead-view images, three problems need to be solved for better adaptation: (1) how to fuse multiple features for cross-view adaptation; (2) how to reduce the mismatch of multiple feature distributions between cross-view domains; and (3) how to effectively leverage unlabeled target data to improve the adaptation performance.
To address these issues, in this paper we propose a semi-supervised manifold-regularized multiple-kernel-learning (SMRMKL) algorithm to transfer scene models from ground to aerial. To fuse heterogeneous types of image features, we employ multiple kernels to map samples to the corresponding Reproducing Kernel Hilbert Space (RKHS), where the multi-kernel maximum mean discrepancy (MK-MMD) is utilized to reduce the mismatch of data distributions between cross-view domains. To make use of available unlabeled target samples, we incorporate a manifold-regularized local regression on the target domain to capture its local structure for scene model transfer. After iteratively optimizing the unified components by a reduced gradient descent procedure, we obtain an adapted classifier for each scene class; a new target sample's label can then be determined accordingly. Extensive experimental results on both aerial-to-satellite and ground-to-aerial or ground-to-satellite scene image datasets demonstrate that our proposed framework improves the adaptation performance by fusing different types of visual features and utilizing unlabeled target samples.
2. Semi-Supervised Manifold-Regularized Multiple Kernel Learning
We formulate the cross-view scene model transfer task as a classifier-adaptation-based HDA problem. To be more precise, many labels are available for the source domain, while only a few labels are provided for the target domain. Taking the ground-view image set as the source domain and the overhead-view image set to be learned as the target domain, we want to adapt the scene category models in the label-rich source domain to the label-scarce target domain. The main goal of SMRMKL is to bridge the cross-view domain gap by jointly learning adaptive classifiers and transferable features that minimize the domain divergence. As shown in Figure 2, three regularizers are jointly employed in our framework: the MK-MMD, which matches feature distributions for feature adaptation; the structural risk regularizer, which corresponds to an empirical risk minimization that gives the SVM good generalization; and the manifold regularizer, based on the intuition that unlabeled target samples that are close in the feature space should have similar decision values. In the following, we first introduce the notations used in this paper, then construct the three regularizers of SMRMKL, and finally provide the optimization strategy for the overall objective.
2.1. Notations
For simplicity, we focus on the scenario with one source domain ${D}^{S}$ and one target domain ${D}^{T}$. We take the ground-view scene image set with plenty of labels as the source domain ${D}^{S}={\{({x}_{i}^{S},{y}_{i}^{S})\}}_{i=1}^{{n}_{S}}$, where ${y}_{i}^{S}$ indicates the label of image ${x}_{i}^{S}$ and ${n}_{S}$ is the size of ${D}^{S}$. Similarly, let ${D}^{T}={D}_{l}^{T}\cup {D}_{u}^{T}$ denote the overhead-view remote sensing image set of the target domain, with a limited number of labeled data and a large number of unlabeled data, where ${D}_{l}^{T}={\{({x}_{i}^{T},{y}_{i}^{T})\}}_{i=1}^{{n}_{l}}$ and ${D}_{u}^{T}={\{{x}_{i}^{T}\}}_{i={n}_{l}+1}^{{n}_{l}+{n}_{u}}$ represent the labeled and unlabeled training images, respectively. The size of ${D}^{T}$ is ${n}_{T}={n}_{l}+{n}_{u}$ (${n}_{l}\ll {n}_{u}$). We define $N={n}_{S}+{n}_{T}$ and $n={n}_{S}+{n}_{l}$ as the sizes of all training data and of the labeled training data from both domains, respectively. It is assumed that both the ground-level images and the remote sensing images pertain to J categories, i.e., they share the same label space. Our goal is to learn from $\{{D}^{S},{D}^{T}\}$ a scene model decision function ${f}^{T}(x)$ that predicts the label of a novel test sample from the remote sensing domain.
2.2. Multi-Kernel Maximum Mean Discrepancy
In this section, we investigate how to bridge the source–target discrepancy in the feature space. The broad variety of cross-view images requires different types of features to describe different visual aspects, such as color, texture and shape. Furthermore, with the development of deep neural networks, the output features (i.e., deep features) of a convolutional or fully connected layer can represent an image in a hierarchical way. As shown in Figure 3, each image is represented by different features with different dimensions. To overcome this diversity, kernel methods have been extensively studied to minimize the mismatch of different distributions and to combine different data modalities. In this paper, we use the nonparametric criterion called MMD to compare data distributions, based on the distance between the means of samples from the two domains in a Reproducing Kernel Hilbert Space (RKHS), which has been shown to be effective in domain adaptation. The MMD criterion is:
where ${x}_{i}^{S}$ and ${x}_{i}^{T}$ are images from the source and target domains, respectively, and $\Vert \cdot \Vert$ denotes the ${l}_{2}$ norm. A kernel function K is induced from the nonlinear feature mapping function $\phi (\cdot)$, i.e., $K({x}_{i},{x}_{j})=\phi {({x}_{i})}^{\prime}\phi ({x}_{j})$. To simplify the MMD criterion, we define a column vector
to transform Equation (1) to:
where $Q=q{q}^{\prime}\in {\Re}^{N\times N}$, $K=[\begin{array}{cc}{K}_{SS}& {K}_{ST}\\ {K}_{TS}& {K}_{TT}\end{array}]\in {\Re}^{N\times N}$, and ${K}_{SS}$, ${K}_{TT}$ and ${K}_{ST}$ are the kernel matrices defined for the source domain, the target domain, and the cross-domain from the source images to the target images, respectively.
$$DIST({D}^{S},{D}^{T})={\left\Vert \frac{1}{{n}_{S}}\sum _{i=1}^{{n}_{S}}\phi ({x}_{i}^{S})-\frac{1}{{n}_{T}}\sum _{i=1}^{{n}_{T}}\phi ({x}_{i}^{T})\right\Vert}^{2}$$
$$q={[\underset{{n}_{S}}{\underbrace{1/{n}_{S},\dots ,1/{n}_{S}}},\underset{{n}_{T}}{\underbrace{-1/{n}_{T},\dots ,-1/{n}_{T}}}]}^{\prime}$$
$$DIST({D}^{S},{D}^{T})=tr(KQ)$$
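As a concrete illustration, the MMD computation in Equations (1)–(3) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation; the function name is ours.

```python
import numpy as np

def mmd_tr_kq(K, n_s, n_t):
    """Squared MMD between source and target as tr(KQ).

    K is the (n_s + n_t) x (n_s + n_t) kernel matrix over the stacked
    source and target samples; q weights source entries by 1/n_s and
    target entries by -1/n_t, so tr(KQ) with Q = qq' equals the squared
    RKHS distance between the two sample means.
    """
    q = np.concatenate([np.full(n_s, 1.0 / n_s), np.full(n_t, -1.0 / n_t)])
    Q = np.outer(q, q)
    return float(np.trace(K @ Q))
```

A quick sanity check: with a linear kernel $K=X{X}^{\prime}$, $tr(KQ)$ reduces to the squared Euclidean distance between the source and target sample means.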
To effectively fuse multiple types of features for the cross-view scene model transfer task, we employ a multiple kernel learning method that constructs the kernel matrix as a linear combination of different feature kernel matrices ${K}^{(m)}$.
where ${d}_{m}$ are the linear combination coefficients with ${\sum }_{m=1}^{M}{d}_{m}=1$, and ${K}^{(m)}$ $(m=1,2,\dots ,M)$ is a base kernel matrix over both source and target images derived from the corresponding feature mapping function ${\phi}_{m}(\cdot)$. Thus, the MK-MMD criterion simplifies to:
where $p={[tr({K}^{(1)}Q),\dots ,tr({K}^{(M)}Q)]}^{\prime}$ and $d={[{d}_{1},\dots ,{d}_{M}]}^{\prime}$ is the vector of kernel combination coefficients. When we minimize $DIS{T}_{K}({D}^{S},{D}^{T})$ toward zero, the data distributions of the two domains become close to each other.
$$K=\sum _{m=1}^{M}{d}_{m}{K}^{(m)}$$
$$DIS{T}_{K}({D}^{S},{D}^{T})=tr(\sum _{m=1}^{M}{d}_{m}{K}^{(m)}Q)={p}^{\prime}d$$
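The multi-kernel form can be evaluated the same way: collect $p_m = tr({K}^{(m)}Q)$ for each base kernel and take the inner product with the weight vector d. A hedged sketch (function name ours, self-contained):

```python
import numpy as np

def mk_mmd(base_kernels, d, n_s, n_t):
    """MK-MMD as p'd, where p_m = tr(K^(m) Q) for each base kernel
    and d holds the linear combination coefficients."""
    q = np.concatenate([np.full(n_s, 1.0 / n_s), np.full(n_t, -1.0 / n_t)])
    Q = np.outer(q, q)
    p = np.array([np.trace(Km @ Q) for Km in base_kernels])
    return float(p @ d)
```

By linearity of the trace, this equals the single-kernel MMD computed on the combined kernel $\sum_m d_m K^{(m)}$.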
2.3. Structural Risk
In this section, we investigate how to bridge the discrepancy between the source classifier ${f}^{S}(x)$ and the target classifier ${f}^{T}(x)$. Previous works [39] assume that ${f}^{T}(x)={f}^{S}(x)+\Delta f(x)$, where $\Delta f(x)$ is a perturbation function learned from the training data. In this paper, we learn a robust target decision function adapted from a combination of pre-learned classifiers and a perturbation function as follows [39]:
where ${f}_{p}(x)$ are the pre-learned classifiers with linear combination coefficients ${\beta}_{p}$, trained on the labeled data from both domains, and P is the total number of pre-learned classifiers. $\Delta f(x)={\sum }_{m=1}^{M}{d}_{m}{w}_{m}^{\prime}{\phi}_{m}(x)+b$ is the perturbation function with bias term b. ${w}_{m}$ and ${\phi}_{m}(x)$ are the $m$-th normal vector and feature mapping function, respectively. Therefore, we form the structural risk functional as follows:
$\beta ={[{\beta}_{1},\dots ,{\beta}_{P}]}^{\prime}$ is the vector of the ${\beta}_{p}$s, and $\lambda ,C>0$ are the regularization parameters. Denoting ${\tilde{v}}_{m}={d}_{m}{[{w}_{m}^{\prime},\sqrt{\lambda}{\beta}^{\prime}]}^{\prime}$, the optimization problem in Equation (6) can then be rewritten as follows:
$${f}^{T}(x)={f}^{S}(x)+\Delta f(x)=\sum _{p=1}^{P}{\beta}_{p}{f}_{p}(x)+\sum _{m=1}^{M}{d}_{m}{w}_{m}^{\prime}{\phi}_{m}(x)+b$$
$$\begin{array}{c}\underset{{w}_{m},\beta ,b,{\xi}_{i}}{min}\frac{1}{2}\left(\sum _{m=1}^{M}{d}_{m}{\Vert {w}_{m}\Vert}^{2}+\lambda {\Vert \beta \Vert}^{2}\right)+C\sum _{i=1}^{n}{\xi}_{i}\hfill \\ \mathrm{s}.\mathrm{t}.\phantom{\rule{4pt}{0ex}}{y}_{i}{f}^{T}({x}_{i})\ge 1-{\xi}_{i},\phantom{\rule{4pt}{0ex}}{\xi}_{i}\ge 0\hfill \end{array}$$
$$\begin{array}{c}\underset{{\tilde{v}}_{m},b,{\xi}_{i}}{min}\frac{1}{2}\sum _{m=1}^{M}\frac{{\Vert {\tilde{v}}_{m}\Vert}^{2}}{{d}_{m}}+C\sum _{i=1}^{n}{\xi}_{i}\hfill \\ \mathrm{s}.\mathrm{t}.\phantom{\rule{4pt}{0ex}}{y}_{i}\left(\sum _{m=1}^{M}{\tilde{v}}_{m}^{\prime}{\tilde{\phi}}_{m}({x}_{i})+b\right)\ge 1-{\xi}_{i},\phantom{\rule{4pt}{0ex}}{\xi}_{i}\ge 0\hfill \end{array}$$
Denoting ${\tilde{\phi}}_{m}({x}_{i})={[{\phi}_{m}{({x}_{i})}^{\prime},\frac{1}{\sqrt{\lambda}}f{({x}_{i})}^{\prime}]}^{\prime}$, where $f({x}_{i})={[{f}_{1}({x}_{i}),\dots ,{f}_{P}({x}_{i})]}^{\prime}$, a new kernel matrix $\tilde{K}=[{\tilde{\phi}}_{m}{({x}_{i})}^{\prime}{\tilde{\phi}}_{m}({x}_{j})]={\sum }_{m=1}^{M}{d}_{m}{\tilde{K}}^{(m)}=[\begin{array}{cc}{\tilde{K}}_{L}& {\tilde{K}}_{LU}\\ {\tilde{K}}_{UL}& {\tilde{K}}_{U}\end{array}]\in {\Re}^{N\times N}$ is defined over both the labeled and unlabeled training data from the two domains. ${\tilde{K}}_{L}=[\begin{array}{cc}{\tilde{K}}_{L}^{SS}& {\tilde{K}}_{L}^{ST}\\ {\tilde{K}}_{L}^{TS}& {\tilde{K}}_{L}^{TT}\end{array}]\in {\Re}^{n\times n}$ is the kernel matrix defined for the labeled samples from both domains. ${\tilde{K}}_{U}\in {\Re}^{{n}_{u}\times {n}_{u}}$ and ${\tilde{K}}_{LU}\in {\Re}^{n\times {n}_{u}}$ are the kernel matrices defined for the unlabeled samples and for the cross-domain from the labeled images to the unlabeled images, respectively. Motivated by the optimization problem of the SVM, Equation (7) can be solved via its dual problem:
where $y={[{y}_{1},\dots ,{y}_{n}]}^{\prime}$ is the label vector of the training samples, and $A=\{\alpha \mid {\alpha}^{\prime}y=0,{\mathbf{0}}_{n}\le \alpha \le C{\mathbf{1}}_{n}\}$ is the feasible set of the dual variables $\alpha $.
$$\underset{\alpha \in A}{max}\phantom{\rule{4pt}{0ex}}{\mathbf{1}}_{n}^{\prime}\alpha -\frac{1}{2}{(\alpha \circ y)}^{\prime}{\tilde{K}}_{L}(\alpha \circ y)$$
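The dual in Equation (8) is a standard SVM dual over the precomputed labeled-sample kernel ${\tilde{K}}_{L}$. In practice one hands it to an off-the-shelf SVM solver (the paper uses LIBSVM); as a minimal illustrative sketch, the snippet below only evaluates the dual objective and checks feasibility of a candidate $\alpha $. Function names are ours, not the authors' code.

```python
import numpy as np

def dual_objective(alpha, y, K_L):
    """1'alpha - 0.5 (alpha o y)' K_L (alpha o y): the objective of Eq. (8)."""
    ay = alpha * y  # elementwise product alpha o y
    return float(alpha.sum() - 0.5 * ay @ K_L @ ay)

def is_feasible(alpha, y, C, tol=1e-9):
    """Check the feasible set A: alpha'y = 0 and 0 <= alpha <= C."""
    return (abs(float(alpha @ y)) < tol
            and bool(np.all(alpha >= -tol))
            and bool(np.all(alpha <= C + tol)))
```

A solver then maximizes `dual_objective` over the feasible set; the optimal $\alpha $ determines the support vectors of the adapted classifier.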
2.4. Manifold Regularization
In this section, we investigate how to leverage unlabeled target data via manifold regularization, which has been shown to be effective for semi-supervised learning [40]. The basic intuition of this regularizer is that the predictive function should assign similar outputs to similar samples in the feature space. Inspired by Laplacian-based semi-supervised learning [41] and Manifold Regularized Least Square Regression (MRLS) [42], the manifold regularization can be estimated from the similarity of pairwise target samples. Specifically, it is given by
where $S\in {\Re}^{{n}_{T}\times {n}_{T}}$ denotes the affinity matrix defined on the target samples, whose element ${S}_{ij}$ reflects the similarity between ${x}_{i}^{T}$ and ${x}_{j}^{T}$. By setting the derivative of the Lagrangian obtained from Equation (7) to zero, we obtain ${\tilde{v}}_{m}={d}_{m}{\sum }_{i=1}^{n}{\alpha}_{i}{y}_{i}{\tilde{\phi}}_{m}({x}_{i})$. Thus, Equation (9) can be rewritten as follows:
$$\underset{{\tilde{v}}_{m}}{min}\sum _{i,j}^{{n}_{T}}{S}_{ij}{\left\Vert \sum _{m=1}^{M}{\tilde{v}}_{m}^{\prime}{\tilde{\phi}}_{m}({x}_{i}^{T})-\sum _{m=1}^{M}{\tilde{v}}_{m}^{\prime}{\tilde{\phi}}_{m}({x}_{j}^{T})\right\Vert}^{2}$$
$$\sum _{i,j}^{{n}_{T}}{S}_{ij}{\left\Vert \sum _{m=1}^{M}{d}_{m}{(\alpha \circ y)}^{\prime}(\tilde{K}(1:n,i+{n}_{S})-\tilde{K}(1:n,j+{n}_{S}))\right\Vert}^{2}$$
One way to compute the elements of the affinity matrix S is via Gaussian functions, i.e.,
where $\sigma $ is the bandwidth parameter. By defining the graph Laplacian $L=D-S$, where D is a diagonal matrix with ${D}_{ii}={\sum }_{j=1}^{{n}_{T}}{S}_{ij}$, the manifold regularization can be rewritten as:
$${S}_{ij}=\left\{\begin{array}{cc}{e}^{-\frac{{\Vert {x}_{i}^{T}-{x}_{j}^{T}\Vert}^{2}}{{\sigma}^{2}}}& \mathrm{if}\phantom{\rule{4pt}{0ex}}{x}_{i}^{T}\phantom{\rule{4pt}{0ex}}\mathrm{and}\phantom{\rule{4pt}{0ex}}{x}_{j}^{T}\phantom{\rule{4pt}{0ex}}\mathrm{are}\phantom{\rule{4pt}{0ex}}k\phantom{\rule{4pt}{0ex}}\mathrm{nearest}\phantom{\rule{4pt}{0ex}}\mathrm{neighbors}\\ 0& \mathrm{otherwise}\end{array}\right.$$
$$\left({(\alpha \circ y)}^{\prime}\sum _{m=1}^{M}{d}_{m}{\tilde{K}}^{(m)}(1:n,{n}_{S}:N)\right)\cdot L{\left({(\alpha \circ y)}^{\prime}\sum _{m=1}^{M}{d}_{m}{\tilde{K}}^{(m)}(1:n,{n}_{S}:N)\right)}^{T}$$
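The kNN Gaussian affinity and the graph Laplacian $L=D-S$ can be built directly from the target features. The sketch below (names ours; the paper's $k=5$ and $\sigma =0.1$ are defaults, not requirements) also symmetrizes the kNN graph, an implementation choice we assume:

```python
import numpy as np

def knn_gaussian_affinity(X, k=5, sigma=0.1):
    """S_ij = exp(-||x_i - x_j||^2 / sigma^2) for k nearest neighbors, else 0."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared dists
    n = X.shape[0]
    S = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(d2[i])[1:k + 1]          # k nearest neighbors, excluding self
        S[i, nn] = np.exp(-d2[i, nn] / sigma ** 2)
    return np.maximum(S, S.T)                    # symmetrize the kNN graph

def graph_laplacian(S):
    """L = D - S, with D_ii = sum_j S_ij."""
    return np.diag(S.sum(axis=1)) - S
```

For a symmetric S and any score vector f, ${f}^{\prime}Lf=\frac{1}{2}\sum_{ij}{S}_{ij}{({f}_{i}-{f}_{j})}^{2}$, which is exactly the pairwise smoothness penalty of Equation (9) for scalar outputs.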
2.5. Overall Objective Function
In this section, we integrate $DIS{T}_{K}({D}^{S},{D}^{T})$ in Equation (4) and structural risk functional in Equation (8) into the manifold regularization function in Equation (12) and then arrive at the overall objective function.
where $\theta $ and $\zeta $ are trade-off parameters. We propose an alternating update algorithm to obtain the optimal solution. Once we have initialized the linear combination coefficients ${d}_{m}$, the optimization problem can be solved by existing SVM solvers such as LIBSVM [43] to obtain the dual variables $\alpha $. Then, the dual variables $\alpha $ are fixed, and the linear combination coefficients ${d}_{m}$ are updated by a second-order gradient descent procedure [44] so that the value of the objective function in Equation (13) decreases. Thus, the alternating algorithm of SMRMKL is guaranteed to converge.
$$G(d)={p}^{\prime}d+\theta \left({\mathbf{1}}_{n}^{\prime}\alpha -\frac{1}{2}{(\alpha \circ y)}^{\prime}{\tilde{K}}_{L}(\alpha \circ y)\right)+\zeta \left({(\alpha \circ y)}^{\prime}\sum _{m=1}^{M}{d}_{m}{\tilde{K}}^{(m)}(1:n,{n}_{S}:N)\right)L{\left({(\alpha \circ y)}^{\prime}\sum _{m=1}^{M}{d}_{m}{\tilde{K}}^{(m)}(1:n,{n}_{S}:N)\right)}^{T}$$
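The update of the kernel weights must keep d on the simplex $\sum_m d_m = 1$. The paper uses a reduced-gradient procedure [44]; as an illustrative stand-in (an assumption, not the authors' exact update rule), the sketch below takes a projected-gradient step, additionally assuming the weights are kept nonnegative:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {d : d >= 0, sum(d) = 1} (standard sort-based routine)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def update_kernel_weights(d, grad, step=0.1):
    """One descent step on G(d), then back onto the simplex."""
    return project_simplex(d - step * grad)
```

Alternating between this step (with $\alpha $ fixed) and the SVM dual solve (with d fixed) monotonically decreases the objective under a suitable step size.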
3. Experimental Results
We conducted experiments on both ground-to-aerial scene model adaptation and aerial-to-satellite scene model adaptation.
3.1. Data Set Description and Experimental Configuration
Two pairs of source–target image sets were used to evaluate the proposed scene adaptation framework.
3.1.1. Cross-View Scene Dataset
We collected a cross-view scene dataset from two ground-level scene datasets, the SUN database (Source domain 1, S1) and Scene-15 [38] (Source domain 2, S2), and three overhead remote sensing scene datasets, the Banja Luka dataset [45] (Target domain 1, T1), the UC Merced dataset [46] (Target domain 2, T2), and the WHU-RS19 dataset [47] (Target domain 3, T3). The Banja Luka dataset consists of 606 RGB aerial images of size $128\times 128$ pixels. The UC Merced dataset is composed of 2100 aerial scene images measuring $256\times 256$ pixels, with a spatial resolution of 0.3 m per pixel in the red-green-blue color space. The WHU-RS19 dataset was extracted from a set of satellite images exported from Google Earth, with a spatial resolution of up to 0.5 m and red, green, and blue spectral bands. Our cross-view scene dataset consists of 2768 images of four categories (field/agriculture, forest/trees, river/water and industrial). Figure 4 shows an example of the cross-view scene dataset (one image per class per dataset). Table 1 gives the statistics of the image numbers in the dataset.
3.1.2. Aerial-to-Satellite Scene Dataset
We collected 1377 images of nine common categories from the UC Merced aerial scene dataset and the WHU-RS19 dataset. In this experiment, we use the aerial scene dataset as the source domain, while examples from the satellite scene dataset are used as the target-domain training data. In total, there are 900 source training images. The satellite scene dataset has 495 images across all nine categories. Figure 5 shows images from 9 of the 19 classes.
3.1.3. Base Features and Training/Testing Settings
For the images in our two pairs of source–target image sets, we extracted four types of global features: HOG (histogram of oriented gradients), DSIFT (dense SIFT), TEXTON and GeoColor. These heterogeneous base features describe different visual aspects of the images. In addition, we also take the outputs of the fc6 and fc7 layers computed with DeCAF [48] as image representations for comparison.
All the instances in the source domain are used as the source training data. The instances in the target domain are evenly split into two subsets: one is used as the target training data and the other as the target test data. Furthermore, to investigate the effect of semi-supervised learning in our proposed framework, we divide the target training data into a labeled set (we randomly select 1, 3, 5, 7, or 10 samples per class from the target domain), for which the labels are assumed known, and an unlabeled set containing the remaining instances. For all these datasets, the splitting process is repeated five times to randomly generate five source and target training/testing partitions, and the average performance over the five rounds is reported.
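The target-domain splitting protocol above can be sketched as follows. This is a hedged reconstruction of the described procedure, not the authors' script; the function name and the tie-breaking details (which samples become labeled) are ours.

```python
import numpy as np

def make_target_split(labels, n_labeled_per_class, rng):
    """Even train/test halves of the target domain, then n labeled samples
    per class inside the training half; the rest of the half stays unlabeled."""
    labels = np.asarray(labels)
    idx = rng.permutation(len(labels))
    half = len(labels) // 2
    train, test = idx[:half], idx[half:]
    labeled = np.concatenate([
        train[labels[train] == c][:n_labeled_per_class]
        for c in np.unique(labels[train])
    ])
    unlabeled = np.setdiff1d(train, labeled)
    return labeled, unlabeled, test
```

Calling this five times with different seeds reproduces the five-round averaging described above.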
3.1.4. Compared Approaches
We compare the following competing approaches for performance evaluation.
- SVM-ST: an SVM classifier trained using the labeled samples from both the source and target domains.
- SVM-T: an SVM classifier trained using only the labeled samples from the target domain.
- A-SVM [49]: the Adaptive SVM is adapted from ${f}^{S}(x)$ (the pre-learned classifier trained using only the labeled samples from the source domain). In detail, the samples from the target domain are weighted by ${f}^{S}(x)$, and these samples are then used to train a perturbation function $\Delta f(x)$. The final SVM classifier is a combination of the pre-learned classifier ${f}^{S}(x)$ and the perturbation function $\Delta f(x)$, as shown in Equation (5).
- CD-SVM [50]: the cross-domain SVM uses k-nearest neighbors from the target domain to define a weight for each source sample, and then trains the SVM classifier with the re-weighted source samples.
- KMM [51]: Kernel Mean Matching is a two-step approach to reducing the mismatch between two different domains. The first step diminishes the mismatch between the means of samples from the two domains in the RKHS by re-weighting the samples in the source domain. The second step then learns a classifier from the re-weighted samples.
- A-MKL [39]: Adaptive MKL can be considered an extension of A-SVM. First, the unlabeled target samples are used to measure the distribution mismatch between the two domains with the Maximum Mean Discrepancy criterion. Second, the final classifier is constrained to be a linear combination of a set of pre-learned classifiers and the perturbation function learned by multiple kernel learning.
- SMRMKL: our approach, described in Algorithm 1.
Six parameters in our proposed framework need to be set. We set $k=5$ in the kNN (k-nearest neighbors) algorithm used to find neighbors in the manifold regularizer and empirically set the bandwidth parameter $\sigma $ to 0.1. The trade-off parameters $\theta $, $\lambda $, and $\zeta $ and the regularization parameter C are selected from $\{{10}^{-3},{10}^{-1},1,10,{10}^{2},{10}^{4}\}$, and the optimal values are determined on the validation set. For the comparison algorithms, the kernel function parameter and trade-off parameter were optimized by a grid search on our validation set. Classification accuracy is adopted as the performance evaluation metric for scene classification. Following [39], four types of kernels, including the Gaussian kernel, Laplacian kernel, inverse square distance (ISD) kernel, and inverse distance (ID) kernel, are employed in our multiple kernel learning approach.
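The four kernel types can be instantiated as below. The exact parameterization used in [39] is not reproduced here; these functional forms and the shared bandwidth parameter `gamma` are common choices and should be treated as assumptions.

```python
import numpy as np

def base_kernels(X, gamma=1.0):
    """Gaussian, Laplacian, inverse-square-distance and inverse-distance
    kernels on pairwise distances (common forms; parameterization assumed)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    d = np.sqrt(d2)                                       # distances
    return {
        "gaussian":  np.exp(-gamma * d2),
        "laplacian": np.exp(-np.sqrt(gamma) * d),
        "isd":       1.0 / (gamma * d2 + 1.0),
        "id":        1.0 / (np.sqrt(gamma) * d + 1.0),
    }
```

Each of the four matrices has a unit diagonal and entries in (0, 1], so they can be mixed directly as the base kernels ${K}^{(m)}$ in the multiple kernel combination.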
Algorithm 1 Semi-Supervised Manifold-Regularized Multiple Kernel Learning (SMRMKL).
Input: source data with labels ${D}^{S}={\{({x}_{i}^{S},{y}_{i}^{S})\}}_{i=1}^{{n}_{S}}$, target data ${D}^{T}={D}_{l}^{T}\cup {D}_{u}^{T}$, regularization parameter C, trade-off parameters $\lambda $, $\theta $, $\zeta $, and bandwidth parameter $\sigma $. Output: predicted target labels ${\mathbf{Y}}_{T}$.

3.2. Ground-to-Overhead View Transfer
In this experiment, we focus on one-source-to-one-target domain adaptation. In each setting, we train scene models using one ground-view domain and the corresponding labels and test on one overhead-view domain. Six source–target domain pairs are generated from the aforementioned five domains, i.e., S1→T1, S1→T2, S1→T3, S2→T1, S2→T2 and S2→T3.
3.2.1. Performance Comparison
Traditional methods are single-feature-based; thus, we investigate the different approaches on individual features. Figure 6 shows the performance of the different approaches with different features for the S1→T3 transfer task in terms of overall accuracy (OA) against the number of target positive training samples. The curves represent the mean OA and the error bars the standard deviation; the smaller the deviation, the better the consistency of the algorithm. For multiple-kernel-based methods, such as A-MKL and SMRMKL, each subfigure shows the results of a single feature with multiple kernels. Figure 7 shows the distributions of the S1→T3 cross-view scenes for six types of features. Each image's features are projected into two dimensions using t-SNE [33]. The solid points and hollow points represent the source images and target unlabeled images, respectively, and the cross points represent the target labeled images. We observe the following from the results: (1) In most instances, the accuracy curves increase with the number of target labeled training images, which shows that the more information the target domain provides, the better the performance of transfer learning. When the number of target positive training samples exceeds 10, SVM-T performs similarly to the adaptation methods, such as SMRMKL, A-MKL and A-SVM. (2) A-MKL and SMRMKL outperform the other approaches, which demonstrates the superiority of multiple kernel learning. Compared with A-MKL, SMRMKL achieves higher accuracy in most cases, which demonstrates the successful utilization of the unlabeled training images. The exception is the HOG feature in Figure 6d. This observation is not surprising because the class separability of the HOG feature is worse than that of the other features (as shown in Figure 7d), which deteriorates the effect of the unlabeled target data in the local manifold regularization.
(3) The DeCAF and TEXTON features, whose distributions are better separated, perform better than HOG, DSIFT and GeoColor, which shows that the texture and DeCAF features are more suitable for cross-view transfer tasks.
3.2.2. Analysis of the Kernel Combination Coefficients ${d}_{m}$ of the Multiple Features
To investigate the performance of multiple kernel learning and its ability to fuse multiple features, we consider two scenarios of cross-view classification with respect to features and kernels: a single feature with multiple kernels, and multiple features with multiple kernels. Figure 8 shows the performance of SMRMKL on the six transfer tasks in terms of classification accuracy against the number of target positive training samples. Multi-Fuse denotes the fusion of six features with four types of kernels, and the other curves denote a single feature with four types of kernels. From the results, we observe the following: (1) The performance of the different features differs markedly across source–target domain pairs. In most instances, when the number of target positive training samples exceeds 3, the DeCAF features show a noticeable improvement over the other, hand-crafted features. This reveals that the DeCAF features generalize well to our cross-view datasets. (2) The TEXTON feature performs better than the DeCAF features for the S1→T3 and S2→T3 transfer tasks, whereas it performs poorly for the S1→T2 and S2→T2 transfer tasks. This result is possibly caused by the resolution of the image datasets: T3 is a high-resolution satellite scene dataset whose texture is more similar to that of the ground-level datasets. (3) Multi-Fuse generally leads to the highest accuracies in the S1→T1, S1→T2, S1→T3 and S2→T3 transfer tasks. For the S2→T1 and S2→T2 transfer tasks, Multi-Fuse performs better than the four single hand-crafted-feature-based methods but slightly worse than the single DeCAF-feature-based method. This is possibly caused by the gray level of the S2 dataset and the low resolution of the T1 and T2 datasets. The results demonstrate that our multiple-kernel-learning-based approach is able to fuse multiple features to improve the performance of cross-view scene classification.
Motivated by the noticeable improvement of the MultiFuse approach, we examined the learned linear combination coefficients ${d}_{m}$ of the multiple features with different types of kernels. The absolute value of each ${d}_{m}$ reflects the importance of the corresponding feature and kernel. Taking the six types of image features with the Gaussian kernel, Table 2 reports the combination coefficients ${d}_{m}$ for each class with a fixed number of three target-positive training samples on the six transfer tasks. We observe that the absolute values for DSIFT and HOG are generally larger than those of the other features on the S1→T1, S1→T2 and S1→T3 transfer tasks, which shows that DSIFT and HOG play dominant roles in those tasks, whereas the DeCAF features obtain the largest coefficients on the S2→T1, S2→T2 and S2→T3 transfer tasks. This is not surprising, because the DSIFT, HOG and DeCAF features are much more distinctive than the GeoColor and TEXTON features in Figure 7. In Table 2, we also observe that the coefficients of TEXTON are generally close to zero except for the industrial class, which indicates that texture is better suited to describing the industrial scenes in cross-view classification.
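The role of the learned coefficients can be illustrated with a short sketch. Assuming the coefficients ${d}_{m}$ lie on the simplex (non-negative and summing to one), the kernel used by the classifier is simply the weighted sum of the base kernel matrices; `combined_kernel` below is a hypothetical helper for illustration, not the authors' implementation:

```python
import numpy as np

def combined_kernel(kernels, d):
    """Combine M base kernel matrices with learned coefficients d_m.

    kernels: list of M (N, N) base kernel matrices (one per feature/kernel pair)
    d: M non-negative coefficients summing to 1 (simplex constraint)
    """
    d = np.asarray(d, dtype=float)
    assert np.all(d >= 0) and np.isclose(d.sum(), 1.0)
    # Weighted sum of base kernels; a coefficient near zero (e.g. TEXTON on
    # most classes in Table 2) effectively removes that feature/kernel pair.
    return sum(dm * K for dm, K in zip(d, kernels))
```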
3.2.3. Effect of Each Regularizer
Our proposed SMRMKL has three components, i.e., the multi-kernel maximum mean discrepancy (MKMMD) for minimizing the distribution mismatch (Section 2.2), the structural risk (SR) (Section 2.3), and the manifold regularization (MR) (Section 2.4). Here, we investigate the contribution of each component. Table 3 shows the performance of different combinations of regularizers (i.e., SR+MKMMD, SR+MR, and SR+MKMMD+MR) with a fixed number of three target-positive training samples. The results indicate that SR+MKMMD+MR achieves higher accuracy than SR+MKMMD and SR+MR, which demonstrates that combining the three regularizers effectively improves the adaptation performance. Furthermore, SR+MKMMD outperforms SR+MR, which suggests that the MKMMD regularizer contributes more than the MR regularizer.
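For reference, the (biased) empirical estimate of the squared MMD that MKMMD generalizes can be computed from precomputed kernel blocks; the multi-kernel variant evaluates the same expression with the combined kernel $\sum_m d_m K_m$. The function below is an illustrative sketch, not the paper's code:

```python
import numpy as np

def mmd_squared(Ks, Kt, Kst):
    """Biased empirical estimate of the squared MMD between two domains.

    Ks:  (ns, ns) kernel matrix over source samples
    Kt:  (nt, nt) kernel matrix over target samples
    Kst: (ns, nt) cross-domain kernel matrix
    """
    # mean() divides by ns^2, nt^2 and ns*nt respectively, matching the
    # standard biased estimator of MMD^2 in the RKHS induced by the kernel.
    return Ks.mean() + Kt.mean() - 2.0 * Kst.mean()
```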
3.2.4. Analysis on Parameters
To investigate the impact of each parameter, we consider the regularization parameter C and the three tradeoff parameters $\theta $, $\lambda $ and $\zeta $. Figure 9a–c shows the impact of the regularization parameter C and the tradeoff parameter $\lambda $ over a range of values on the S1→T1 transfer task. We can see that C has a dominant impact on classification accuracy, whereas the performance is not very sensitive to $\lambda $. Thus, we empirically set $C=100$ and $\lambda =10$ in the subsequent evaluations. Figure 9d,e shows the impact of the tradeoff parameters $\theta $ and $\zeta $ for different values on the S1→T1 transfer task; the performance of our method is not sensitive to either parameter.
Recall that SMRMKL iteratively updates the linear combination coefficients ${d}_{m}$ and the dual variables $\alpha $ (see Section 2.5). We now discuss the convergence of this iterative algorithm. Taking the S2→T1 transfer task as an example, Figure 10 plots the objective value for each class against the number of iterations. We observe that SMRMKL converges after about six iterations for all categories; similar behavior is observed on the other transfer tasks.
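The alternating update can be sketched as follows. Here `solve_dual` (an SVM dual solver) and `update_d` (e.g., a reduced-gradient step as in SimpleMKL) are placeholders rather than the authors' exact routines, and the stopping rule on the objective change mirrors the convergence behavior reported above:

```python
import numpy as np

def alternate_optimize(kernels, solve_dual, update_d, tol=1e-4, max_iter=20):
    """Alternating optimization skeleton for MKL training a la SMRMKL.

    solve_dual(K) -> (alpha, objective): dual solver with d fixed.
    update_d(d, alpha, kernels) -> d: coefficient update with alpha fixed.
    Both callables are assumptions standing in for the paper's routines.
    """
    M = len(kernels)
    d = np.full(M, 1.0 / M)               # start from a uniform combination
    prev_obj = np.inf
    for _ in range(max_iter):
        K = sum(dm * Km for dm, Km in zip(d, kernels))
        alpha, obj = solve_dual(K)        # fix d, solve for dual variables
        if abs(prev_obj - obj) < tol:     # small objective change: converged
            break
        d = update_d(d, alpha, kernels)   # fix alpha, update coefficients
        prev_obj = obj
    return d, alpha
```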
3.3. AerialtoSatellite Transfer
To demonstrate the robustness of our method, we evaluated its performance in transferring scene models from aerial scenes to satellite scenes. Figure 11 details the performance of different approaches with different features on the aerial-to-satellite transfer task in terms of classification accuracy against the number of target-positive training samples. SMRMKL improves the performance for all features, demonstrating that it is significantly better than the other approaches on this task. The exception is the TEXTON feature in Figure 11c; the poor separability of the TEXTON feature may weaken the effect of the unlabeled target data in the local manifold regularization, which in turn degrades the adaptation performance.
Figure 12 shows the performance of SMRMKL with different features on the aerial-to-satellite transfer task in terms of classification accuracy against the number of target-positive training samples. The DeCAF features show a noticeable improvement over the hand-crafted features, and GeoColor performs better than the other three hand-crafted features. In addition, MultiFuse generally achieves the highest accuracies on this task, which indicates that our multiple-kernel-learning-based approach can fuse multiple features to improve aerial-to-satellite scene classification. Furthermore, the classification accuracy is very low when no samples from the target domain are used (i.e., the number of target training samples is 0), and it increases significantly as the number of target training samples grows. As can be seen in Figures 11 and 12, the curves have not yet flattened, which shows that the participation of target-domain training samples is very important for improving the classification accuracy. However, due to the small size of the aerial-to-satellite scene dataset, at most 10 samples per class from the target domain participated in training, which limits the achievable accuracy. In future work, we will collect more samples for training, which we expect to further improve the accuracy.
To further examine the performance on individual categories, Table 4 reports the mean Average Precision (mAP) of different features for each class, and Table 5 shows the corresponding confusion matrices. Each feature responds differently to each class: for instance, “parking” and “industry” are better classified with TEXTON, whereas “residential” and “harbor” show better results with the DeCAF features. For the last five categories, MultiFuse-based SMRMKL successfully improves the mAP. In Table 5, most scene categories are correctly classified except “residential”, “harbor”, “industry”, “river”, and “beach”, whose visual appearance differs significantly between the aerial and satellite images. In addition, “residential” and “harbor” in the aerial images are easily confused with “parking” and “industry” in the satellite images due to their similar configurations in Figure 5. It is also difficult to distinguish “viaduct” from “river” due to their similar winding appearance.
3.4. Running Time and Memory Usage
In the following, we analyze the computational complexity of SMRMKL in Algorithm 1. We assume that the multiple types of features are precomputed before SMRMKL training. The calculation of the kernel matrix K in Step 1 and of $\tilde{K}$ in Step 3 each takes $O(M{N}^{2})$ time, where M is the number of base kernels and N is the number of training images in the source and target domains. Assuming the two-class classification takes $O({D}^{2}{N}^{2})$ time on average, where D is the dimensionality of each feature, the computational cost of Step 3 is $O(J(k{D}^{2}{N}^{2}+M{N}^{2}))$, where k is the number of iterations required for convergence and J is the number of categories. Regarding memory usage, for six types of image features with four kinds of kernels, the kernel matrices of the small transfer tasks (i.e., S1→T1, S1→T2 and S1→T3) occupy 40.6 megabytes on average, while those of the large transfer tasks (i.e., S2→T1, S2→T2 and S2→T3) occupy 348.5 megabytes on average. With the kernel matrices precomputed, our algorithm remains computationally efficient.
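As a rough sanity check on such memory figures, the footprint of M precomputed N×N double-precision kernel matrices is $8M{N}^{2}$ bytes; the helper below (an illustration with hypothetical inputs, not tied to the exact dataset sizes above) converts this to megabytes:

```python
def kernel_memory_mb(n_train, n_kernels, bytes_per_entry=8):
    """Rough memory footprint of precomputed kernel matrices in megabytes.

    n_train: total number of source + target training images (N)
    n_kernels: number of base kernels (M), e.g. features x kernel types
    bytes_per_entry: 8 for double precision
    """
    return n_kernels * n_train ** 2 * bytes_per_entry / 2 ** 20
```

For instance, 24 base kernels (six features times four kernel types) over a hypothetical 1000 training images would occupy about 183 MB.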
4. Conclusions
In this paper, we propose transferring scene models from ground-view images to very-high-resolution remote sensing images. Specifically, we present a semi-supervised manifold-regularized multiple kernel learning (SMRMKL) algorithm that jointly minimizes the mismatch of distributions between the two domains and leverages available unlabeled target samples to capture the local structure of the target domain. In addition, we conduct an in-depth investigation of various aspects of SMRMKL, including the effect of each regularizer, the combination coefficients of the multiple kernels, and the convergence of the learning algorithm. Extensive experimental results on both cross-view and aerial-to-satellite scene datasets show that: (1) SMRMKL can effectively fuse different types of visual features and improve classification accuracy, whereas traditional methods focus on a single type of feature. Moreover, SMRMKL indicates which type of feature plays a dominant role in each scene transfer task, which is important for feature selection. (2) Most previous cross-view scene model adaptation methods are unsupervised [28,29,30]; without target-domain samples, their classification accuracy is limited. SMRMKL is a semi-supervised method, and our results show that the participation of target-domain training samples is very important for improving the adaptation accuracy. (3) Manifold regularization can improve the adaptation performance by exploiting unlabeled target samples. Many unlabeled samples are available in practical applications, so effectively leveraging them is of significant practical importance. Nevertheless, the results in this paper remain limited in practical terms: the constructed dataset is simple and contains only a small number of samples.
In future work, we will extend this work to a larger cross-view dataset collected from web images and UAV (unmanned aerial vehicle) images. Furthermore, we expect our work to be applicable to visual attribute adaptation. Visual attributes can be considered a mid-level semantic cue that bridges the gap between low-level image features and high-level object classes; thus, they have the advantage of transcending specific semantic categories and of describing scene images across categories.
Author Contributions
Z.D. performed the experiments and wrote the paper. H.S. analyzed the data and contributed materials. S.Z. supervised the study and reviewed this paper.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant 61303186, and in part by the Fund of Innovation of the NUDT Graduate School (No. B150406). The authors would also like to thank the anonymous reviewers for their competent comments and helpful suggestions.
Conflicts of Interest
The authors declare no conflict of interest.
References
 Cheng, G.; Han, J.; Lu, X. Remote Sensing Image Scene Classification: Benchmark and State of the Art. Proc. IEEE 2017, 105, 1865–1883. [Google Scholar] [CrossRef]
 Qiu, S.; Wen, G.; Liu, J.; Deng, Z.; Fan, Y. Unified Partial Configuration Model Framework for Fast Partially Occluded Object Detection in HighResolution Remote Sensing Images. Remote Sens. 2018, 10, 464. [Google Scholar] [CrossRef]
 Tang, T.; Zhou, S.; Deng, Z.; Lei, L.; Zou, H. ArbitraryOriented Vehicle Detection in Aerial Imagery with Single Convolutional Neural Networks. Remote Sens. 2017, 9, 1170. [Google Scholar] [CrossRef]
 Luo, Y.M.; Ouyang, Y.; Zhang, R.C.; Feng, H.M. MultiFeature Joint Sparse Model for the Classification of Mangrove Remote Sensing Images. Int. J. GeoInf. 2017, 6, 177. [Google Scholar] [CrossRef]
 He, C.; Liu, X.; Kang, C.; Chen, D.; Liao, M. Attribute Learning for SAR Image Classification. Int. J. GeoInf. 2017, 6, 111. [Google Scholar] [CrossRef]
 Hu, F.; Xia, G.S.; Hu, J.; Zhong, Y.; Xu, K. Fast Binary Coding for the Scene Classification of HighResolution Remote Sensing Imagery. Remote Sens. 2016, 8, 555. [Google Scholar] [CrossRef]
 Chen, C. Remote Sensing Image Scene Classification Using Multiscale Completed Local Binary Patterns and Fisher Vectors. Remote Sens. 2016, 8, 483. [Google Scholar] [CrossRef]
 Yu, H.; Yang, W.; Xia, G.S.; Liu, G. A ColorTextureStructure Descriptor for HighResolution Satellite Image Classification. Remote Sens. 2016, 8, 259. [Google Scholar] [CrossRef]
 Hu, F.; Xia, G.S.; Hu, J.; Zhang, L. Transferring Deep Convolutional Neural Networks for the Scene Classification of HighResolution Remote Sensing Imagery. Remote Sens. 2015, 7, 14680–14707. [Google Scholar] [CrossRef]
 Liu, N.; Lu, X.; Wan, L.; Huo, H.; Fang, T. Improving the Separability of Deep Features with Discriminative Convolution Filters for RSI Classification. Int. J. GeoInf. 2018, 7, 95. [Google Scholar] [CrossRef]
 Wang, J.; Luo, C.; Huang, H.; Zhao, H.; Wang, S. Transferring PreTrained Deep CNNs for Remote Scene Classification with General Features Learned from Linear PCA Network. Remote Sens. 2017, 9, 225. [Google Scholar] [CrossRef]
 Ding, C.; Li, Y.; Xia, Y.; Wei, W.; Zhang, L.; Zhang, Y. Convolutional Neural Networks Based Hyperspectral Image Classification Method with Adaptive Kernels. Remote Sens. 2017, 9, 618. [Google Scholar] [CrossRef]
 Han, X.; Zhong, Y.; Cao, L.; Zhang, L. PreTrained AlexNet Architecture with Pyramid Pooling and Supervision for High Spatial Resolution Remote Sensing Image Scene Classification. Remote Sens. 2017, 9, 848. [Google Scholar] [CrossRef]
 Qi, K.; Yang, C.; Guan, Q.; Wu, H.; Gong, J. A Multiscale Deeply Described CorrelatonsBased Model for LandUse Scene Classification. Remote Sens. 2017, 9, 917. [Google Scholar] [CrossRef]
 Gong, X.; Xie, Z.; Liu, Y.; Shi, X.; Zheng, Z. Deep Salient Feature Based AntiNoise Transfer Network for Scene Classification of Remote Sensing Imagery. Remote Sens. 2018, 10, 410. [Google Scholar] [CrossRef]
 Liu, Y.; Huang, C. Scene Classification via Triplet Networks. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 220–237. [Google Scholar] [CrossRef]
 Deng, Z.; Sun, H.; Zhou, S.; Zhao, J.; Lei, L.; Zou, H. Multiscale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogramm. Remote Sens. 2018. [Google Scholar] [CrossRef]
 Deng, Z.; Sun, H.; Zhou, S.; Zhao, J.; Zou, H. Toward Fast and Accurate Vehicle Detection in Aerial Images Using Coupled RegionBased Convolutional Neural Networks. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 3652–3664. [Google Scholar] [CrossRef]
 Tuia, D.; Persello, C.; Bruzzone, L. Domain Adaptation for the Classification of Remote Sensing Data: An Overview of Recent Advances. IEEE Geosci. Remote Sens. Mag. 2016, 4, 41–57. [Google Scholar] [CrossRef]
 Persello, C.; Bruzzone, L. KernelBased DomainInvariant Feature Selection in Hyperspectral Images for Transfer Learning. IEEE Trans. Geosci. Remote Sens. 2015, 54, 1–12. [Google Scholar] [CrossRef]
 Yang, H.L.; Crawford, M.M. Domain Adaptation With Preservation of Manifold Geometry for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 9, 543–555. [Google Scholar] [CrossRef]
 Othman, E.; Bazi, Y.; Alajlan, N.; Alhichri, H. ThreeLayer Convex Network for Domain Adaptation in Multitemporal VHR Images. IEEE Geosci. Remote Sens. Lett. 2016, 13, 354–358. [Google Scholar] [CrossRef]
 Li, X.; Zhang, L.; Du, B.; Zhang, L.; Shi, Q. Iterative Reweighting Heterogeneous Transfer Learning Framework for Supervised Remote Sensing Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 2022–2035. [Google Scholar] [CrossRef]
 Wang, X.; Huang, W.; Cheng, Y.; Yu, Q.; Wei, Z. Multisource Domain Attribute Adaptation Based on Adaptive Multikernel Alignment Learning. IEEE Trans. Syst. Man Cybern. Syst. 2018, 1–12. [Google Scholar] [CrossRef]
 Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.F. ImageNet: A largescale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009), Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
 Patterson, G.; Xu, C.; Su, H.; Hays, J. The SUN Attribute Database: Beyond Categories for Deeper Scene Understanding. Int. J. Comput. Vis. 2014, 108, 59–81. [Google Scholar] [CrossRef]
 Workman, S.; Souvenir, R.; Jacobs, N. WideArea Image Geolocalization with Aerial Reference Imagery. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
 Sun, H.; Liu, S.; Zhou, S.; Zou, H. Transfer Sparse Subspace Analysis for Unsupervised CrossView Scene Model Adaptation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 9, 2901–2909. [Google Scholar] [CrossRef]
 Sun, H.; Liu, S.; Zhou, S.; Zou, H. Unsupervised CrossView Semantic Transfer for Remote Sensing Image Classification. IEEE Geosci. Remote Sens. Lett. 2016, 13, 13–17. [Google Scholar] [CrossRef]
 Sun, H.; Liu, S.; Zhou, S. Discriminative Subspace Alignment for Unsupervised Visual Domain Adaptation. Neural Process. Lett. 2016, 44, 1–15. [Google Scholar] [CrossRef]
 Sun, H.; Deng, Z.; Liu, S.; Zhou, S. Transferring ground level image annotations to aerial and satellite scenes by discriminative subspace alignment. In Proceedings of the Geoscience and Remote Sensing Symposium, Beijing, China, 10–15 July 2016; pp. 2292–2295. [Google Scholar]
 Deng, Z.; Sun, H.; Zhou, S.; Ji, K. Semisupervised crossview scene model adaptation for remote sensing image classification. In Proceedings of the Geoscience and Remote Sensing Symposium, Beijing, China, 10–15 July 2016; pp. 2376–2379. [Google Scholar]
 Van der Maaten, L.; Hinton, G. Visualizing Data Using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
 Patel, V.M.; Gopalan, R.; Li, R.; Chellappa, R. Visual domain adaptation: A survey of recent advances. IEEE Signal Process. Mag. 2015, 32, 53–69. [Google Scholar] [CrossRef]
 Volpi, M.; CampsValls, G.; Tuia, D. Spectral alignment of multitemporal crosssensor images with automated kernel canonical correlation analysis. J. Photogramm. Remote Sens. 2015, 23, 167–169. [Google Scholar] [CrossRef]
 Long, M.; Wang, J.; Jordan, M.I. Unsupervised Domain Adaptation with Residual Transfer Networks. arXiv, 2016. [Google Scholar]
 Long, M.; Wang, J.; Cao, Y.; Sun, J. Deep Learning of Transferable Representation for Scalable Domain Adaptation. IEEE Trans. Knowl. Data Eng. 2016, 28, 2027–2040. [Google Scholar] [CrossRef]
 Gueguen, L. Classifying Compound Structures in Satellite Images: A Compressed Representation for Fast Queries. IEEE Trans. Geosci. Remote Sens. 2015, 53, 1803–1818. [Google Scholar] [CrossRef]
 Duan, L.; Xu, D.; Tsang, I.W.; Luo, J. Visual event recognition in videos by learning from Web data. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 1667–1680. [Google Scholar] [CrossRef] [PubMed]
 Long, M.; Cao, Y.; Wang, J.; Jordan, M.I. Learning Transferable Features with Deep Adaptation Networks. arXiv, 2015. [Google Scholar]
 Melacci, S.; Belkin, M. Laplacian Support Vector Machines Trained in the Primal. J. Mach. Learn. Res. 2009, 12, 1149–1184. [Google Scholar]
 Belkin, M.; Niyogi, P.; Sindhwani, V. Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples. J. Mach. Learn. Res. 2006, 7, 2399–2434. [Google Scholar]
 Chang, C.C.; Lin, C.J. LIBSVM: A library for support vector machines. Acm Trans. Intell. Syst. Technol. 2011, 2, 389–396. [Google Scholar] [CrossRef]
 Rakotomamonjy, A.; Bach, F.R.; Canu, S.; Grandvalet, Y. SimpleMKL. J. Mach. Learn. Res. 2008, 9, 2491–2521. [Google Scholar]
 Risojevic, V.; Babic, Z. Aerial image classification using structural texture similarity. In Proceedings of the 2011 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), Bilbao, Spain, 14–17 December 2011; pp. 190–195. [Google Scholar]
 Yang, Y.; Newsam, S. Bagofvisualwords and spatial extensions for landuse classification. In Proceedings of the ACM Sigspatial International Symposium on Advances in Geographic Information Systems, AcmGis 2010, San Jose, CA, USA, 3–5 November 2010; pp. 270–279. [Google Scholar]
 Dai, D.; Yang, W. Satellite Image Classification via TwoLayer Sparse Coding With Biased Image Representation. IEEE Geosci. Remote Sens. Lett. 2011, 8, 173–176. [Google Scholar] [CrossRef]
 Donahue, J.; Jia, Y.; Vinyals, O.; Hoffman, J.; Zhang, N.; Tzeng, E.; Darrell, T. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. In Proceedings of the 31st International Conference on Machine Learning (ICML), Beijing, China, 21–26 June 2014; pp. 647–655. [Google Scholar]
 Yang, J.; Yan, R.; Hauptmann, A.G. Crossdomain video concept detection using adaptive svms. In Proceedings of the 2007 International Conference on Multimedia, Augsburg, Germany, 25–29 September 2007; pp. 188–197. [Google Scholar]
 Jiang, W.; Zavesky, E.; Chang, S.F.; Loui, A. CrossDomain Learning Methods for HighLevel Visual Concept Classification. In Proceedings of the 15th IEEE International Conference on Image, San Diego, CA, USA, 12–15 October 2008; pp. 161–164. [Google Scholar]
 Duan, L.; Tsang, I.W.; Xu, D. Domain transfer multiple kernel learning. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 34, 465–479. [Google Scholar] [CrossRef] [PubMed]
Figure 5.
Nine common categories of satellite scenes from the WHU-RS19 dataset (top row) and aerial scenes from the UC Merced dataset (bottom row): (1) residential; (2) parking lot; (3) port/harbor; (4) industry/building; (5) farmland/agriculture; (6) viaduct/overpass; (7) river; (8) forest; and (9) beach.
Figure 6.
The performance (means and standard deviations of overall accuracy) of different approaches with different features for the S1→T3 transfer task.
Figure 7.
2D visualization of the S1→T3 crossview scene dataset with different features. The solid points, hollow points, and cross points represent the source images, target unlabeled images and target labeled images, respectively.
Figure 8.
The performance (means and standard deviation of overall accuracy) of our approach with different features for six transfer tasks.
Figure 9.
The performance (classification accuracy) of MultiFuse-based SMRMKL for the S1→T1 transfer task with different tradeoff parameters.
Figure 11.
The performance (mean and standard deviation of overall accuracy) of different approaches using different features with respect to different numbers of target samples per class for the aerialtosatellite transfer task.
Figure 12.
The performance (mean and standard deviation of overall accuracy) of our approach using different features with respect to different numbers of target samples per class for the aerialtosatellite transfer task.
Table 1.
Number of images in the cross-view scene dataset. SUN database (S1) and Scene15 (S2) are the source domains, while Banja Luka (T1), UC Merced (T2), and WHU-RS19 (T3) are the target domains.
OverheadView Datasets  GroundView Datasets  

T1  T2  T3  S1  S2  
1 agriculture  178  100  50  84  410 
2 forest/trees  105  100  53  62  328 
3 river/water  77  100  56  125  360 
4 industrial  75  100  53  41  311 
Table 2.
The combination coefficients ${d}_{m}$ of the multiple features with a fixed number of three target-positive training samples.
Kernel Coefficients ${\mathit{d}}_{\mathit{m}}$  DSIFT  GeoColor  HOG  TEXTON  FC6  FC7  

S1→T1  agriculture  0.38  0  0.35  0  0.13  0.13
forest  0.21  0.09  0.19  0  0.24  0.27
river  0.23  0  0.28  0  0.26  0.23
industrial  0.34  0  0.39  0  0.13  0.14
S1→T2  agriculture  0.21  0.06  0.36  0  0.17  0.19
forest  0.07  0.17  0.12  0.16  0.25  0.23
river  0.23  0.05  0.18  0  0.27  0.27
industrial  0.33  0.06  0.42  0.02  0.09  0.08
S1→T3  agriculture  0.13  0  0.27  0  0.37  0.23
forest  0.25  0.06  0.19  0  0.23  0.27
river  0.16  0.06  0.31  0  0.22  0.25
industrial  0.29  0.06  0.32  0.03  0.16  0.14
S2→T1  agriculture  0.13  0  0.27  0  0.37  0.23
forest  0.11  0  0.24  0  0.39  0.26
river  0.13  0  0.33  0.01  0.37  0.17
industrial  0.23  0  0.28  0.44  0.05  0
S2→T2  agriculture  0.1  0.07  0.33  0  0.41  0.1
forest  0.04  0.22  0.22  0  0.42  0.11
river  0.11  0.02  0.33  0  0.43  0.1
industrial  0.18  0  0.26  0.52  0.04  0
S2→T3  agriculture  0.03  0.07  0.23  0  0.47  0.2
forest  0.01  0.34  0.15  0  0.38  0.13
river  0.04  0  0.32  0  0.45  0.18
industrial  0.06  0  0.28  0.51  0.15  0
Table 3.
The overall accuracy (percent) with different combinations of regularizers across the six transfer tasks.
Overall Accuracy  S1→T1  S1→T2  S1→T3  S2→T1  S2→T2  S2→T3

SR+MKMMD  70.65  81.04  81.00  75.18  90.04  65.08 
SR+MR  50.89  62.19  61.25  54.11  63.94  61.63 
SR+MKMMD+MR  81.19  95.21  84.50  77.14  93.08  65.27 
Table 4.
Per-class mAPs of different features with 10 target-positive examples for all nine categories.
mAPs  DSIFT  GeoColor  HOG  TEXTON  FC6  FC7  MultiFuse 

residential  0.42  0.46  0.44  0.47  0.48  0.5  0.42 
parking  0.47  0.37  0.45  0.61  0.5  0.48  0.5 
harbor  0.5  0.51  0.49  0.5  0.57  0.56  0.51 
industry  0.44  0.39  0.41  0.51  0.47  0.46  0.46 
farmland  0.66  0.63  0.63  0.56  0.67  0.65  0.72 
viaduct  0.42  0.44  0.44  0.49  0.46  0.45  0.53 
river  0.55  0.51  0.45  0.44  0.57  0.52  0.57 
forest  0.76  0.71  0.77  0.78  0.81  0.79  0.84 
beach  0.86  0.86  0.68  0.81  0.94  0.93  0.93 
Table 5.
The confusion matrices of MultiFuse-based SMRMKL with 10 target-positive examples per class for aerial-to-satellite classification. The overall accuracy is 56.79% and the Kappa is 0.5139.
Class  Residential  Parking  Harbor  Industry  Farmland  Viaduct  River  Forest  Beach  Total  User. Acc (%) 

residential  18  16  11  0  0  0  0  0  0  45  29.51 
parking  9  30  4  1  0  0  1  0  0  45  56.6 
harbor  16  6  19  1  0  0  1  2  0  45  51.35 
industry  10  1  3  14  16  1  0  0  0  45  70.00 
farmland  0  0  0  4  39  2  0  0  0  45  61.9 
viaduct  0  0  0  0  8  20  17  0  0  45  60.61 
river  0  0  0  0  0  10  24  9  2  45  55.81 
forest  0  0  0  0  0  0  0  32  13  45  69.57 
beach  8  0  0  0  0  0  0  3  34  45  69.39 
Total  61  53  37  20  63  33  43  46  49  405  
Prod. Acc (%)  40.00  66.67  42.22  31.11  86.67  44.44  53.33  71.11  75.56 
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).