Learning a Multi-Branch Neural Network from Multiple Sources for Knowledge Adaptation in Remote Sensing Imagery

Abstract: In this paper we propose a multi-branch neural network, called MB-Net, for solving the problem of knowledge adaptation from multiple remote sensing scene datasets acquired with different sensors over diverse locations and manually labeled by different experts. Our aim is to learn invariant feature representations from multiple source domains with labeled images and one target domain with unlabeled images. To this end, we define for MB-Net an objective function that mitigates the multiple domain shifts at both feature representation and decision levels, while retaining the ability to discriminate between different land-cover classes. The complete architecture is trainable end-to-end via the backpropagation algorithm. In the experiments, we demonstrate the effectiveness of the proposed method on a new multiple domain dataset created from four heterogeneous scene datasets well known to the remote sensing community, namely, the University of California (UC-Merced) dataset, the Aerial Image dataset (AID), the PatternNet dataset, and the Northwestern Polytechnical University (NWPU) dataset. In particular, this method boosts the average accuracy over all transfer scenarios up to 89.05%, compared to a standard architecture based only on cross-entropy loss, which yields an average accuracy of 78.53%.


Introduction
Over the last two decades, remote sensing has become a staple technology for monitoring urban, atmospheric, and ecological changes [1,2]. One of the most prominent and arguably most active areas in this context is scene classification, which enables pinpointing the semantic content of a geographical area of interest. This may come at the cost of processing large masses of data (e.g., multispectral and hyperspectral images) that often comprise voluminous spectral layers alongside a wide spatial context. The mainstream literature so far suggests that scene classification can be tackled from two perspectives. First, the earliest works tend to classify image pixels [3][4][5], typically by handling raw spectral values along with neighbouring attributes. The second approach is based on scene-level recognition [6][7][8], which has received interest recently thanks to the broader semantic information it offers.
In view of the two trends mentioned above, an efficient classification system hinges on mitigating the semantic gap between low-level image features on the one hand and their respective semantic attributes on the other. A typical classification pipeline extracts handcrafted features and feeds them into a classifier, which has been shown to address the problem to some extent [9,10]. For instance, a multiresolution bag of visual words (BOVW) representation model was presented in [11]. For improved classification, feature fusion by means of compressive sensing was introduced in [12]. Additionally, a correlation model that takes pixel homogeneity into consideration was developed in [13]. A feature extraction method relying on multi-scale completed local binary patterns was proposed in [14]. A pyramid-of-spatial-relations model was introduced to combine both relative and absolute spatial information into the BOVW model for scene classification [15]. In another work, the authors introduced a method based on Gabor filters and the completed local binary patterns operator [16].
The number of remote sensing images has been steadily increasing due to the technological improvement of satellite sensors [17]. Thus, massive volumes of images with different spectral channels and spatial resolutions can be obtained [18]. How to recognize and analyze such images has become a major challenge [19]. Nowadays, plenty of work concentrates on deep learning strategies based on convolutional neural networks (CNNs), which aim to learn representative and discriminative features automatically, in an end-to-end manner. Thanks to their sophisticated structure, these models can learn powerful generic image representations in a hierarchical fashion. The impressive results obtained on several remote sensing scene datasets clearly confirm their superiority over shallow methods based on handcrafted features [20][21][22][23][24][25][26][27][28][29].
In some domains there are sufficient labeled samples to train a classification model, whereas many new domains lack labeled samples [30]. Moreover, generating and collecting labeled data is often expensive and time consuming [31]. As a result, the idea of exploiting the availability of labeled data in one or more domains to predict unlabeled data in another domain has emerged, and is known as "domain adaptation". Unlike many machine learning algorithms, which assume that the training and testing samples are drawn from the same distribution [32], domain adaptation deals with training and testing data that follow different distributions: the training images are labeled and drawn from what is called the source domain, while the test images are unlabeled (or have few labels) and form the target domain [1]. The main goal of domain adaptation is to mitigate the distribution discrepancy between the source and target domains [33]. Domain adaptation has been applied in various fields such as computer vision [34], sentiment analysis [35], natural language processing [36,37], video concept detection [38], and Wi-Fi localization and detection [39].
In the computer vision literature, many works have shown that deep networks can learn more transferable features for domain adaptation. As a result, domain adaptation methods learn deep neural transformations that map both domains into a common feature space. In [40], a unified deep adaptation framework is proposed for jointly learning a transferable representation and classifier to enable scalable domain adaptation, by leveraging both deep learning and optimal two-sample matching. In [41], an adaptation layer is added to the deep convolutional neural network (CNN) to reduce the domain discrepancy and achieve lower transfer errors. In another work [42], the deep adaptation network (DAN) model is proposed, which gave state-of-the-art results. In [43], local patches of varying sizes are extracted and processed via CNNs to capture the commonality between the source and target distributions while accommodating the domain-specific parts that should not be aligned. In [44], the transformation between the source and target is learnt by a deep regression network. This approach presumes that, since CNNs can approximate highly non-linear functions, the source representation can be interpolated or regressed into the target. In another work [45], a deep domain adaptation network is presented for the problem of describing people based on fine-grained clothing attributes. Specifically, an improved version of the Region CNN body detector is introduced, which effectively localizes the clothing area. It consists of three sub-modules. First, a selective search is utilized to generate candidate region proposals. Then, a Network in Network model is used to extract features for each candidate region. Finally, linear support vector regression is exploited to predict the Intersection over Union overlap of candidate patches with ground-truth bounding boxes.
In [46], the authors trained a network to perform both feature and classifier adaptation. As in previous domain adaptation methods, feature adaptation is accomplished by matching the distributions of features across domains. Unlike prior works, however, the presented method allows classifier adaptation by adding a residual transfer module that bridges the source and target classifiers. The adaptation can be used in most feed-forward models by extending them with new residual layers and loss functions, which can be trained efficiently via back-propagation. In another work, the Deep CORAL technique was proposed to mitigate the domain discrepancy and enhance CNN classification by reducing the Euclidean distance between the covariance matrices of the source and target domains [47]. Yet it is not certain that the Euclidean distance is the most suitable choice for minimizing the distance between both domains. To deal with this issue, the work in [48] presented a new Deep LogCORAL method, which uses the geodesic distance rather than the Euclidean distance. The obtained accuracies demonstrate the effectiveness of minimizing the geodesic distance instead of the simple Euclidean distance on covariance matrices. Lately, adversarial approaches to unsupervised domain adaptation have been introduced to improve generalization performance by reducing the discrepancy between the training and test domain distributions. The authors in [49] adopted an inverted-label adversarial loss that divides the optimization into two independent objectives, one for the generator and one for the discriminator, and compared adversarial approaches according to the loss type and the weight-sharing strategy between the two streams.
With respect to domain adaptation with multiple sources, some contributions based on handcrafted features have been published. The work in [50] aims to maximize the unanimity of predictions from multiple sources, although not all source domains may be useful for knowledge adaptation. In [51], a kernel mean matching technique was adopted to match the means of different domains. The work in [52] proposed to combine multiple auxiliary classifiers trained on source data to classify target data. In [53], a smoothness regularizer is used to weight different source domains. In [54], it was assumed that the distributions of multiple sources are similar, but the labelled samples from different sources may differ from each other. In [55], a clustering-based scheme was proposed to divide a dataset into latent domains, which is further extended in [56] to multi-domain adaptation. The work in [57] presented an event recognition method for consumer videos that leverages web videos from YouTube, in order to handle the mismatch between the data distributions of two domains (i.e., the web video domain and the consumer video domain).
In the context of remote sensing, there are few works in the literature on single-source domain adaptation based on deep learning techniques, and they mainly address cross-scene classification. By cross-scene classification, we mean datasets acquired with different sensors and over different locations (i.e., training and testing images are taken from two different scene datasets). Under this assumption, the data-shift problem should be considered alongside the representation aspect to obtain satisfactory results. For example, Othman et al. [58] added regularization terms to the objective function of the neural network, besides the standard cross-entropy loss, in order to compensate for the distribution mismatch and alleviate the low accuracies resulting from approaches relying on pre-trained CNNs. In [59], the authors developed an approach based on adversarial networks for cross-domain classification in aerial vehicle images to overcome the data-shift problem. Finally, the authors in [60] addressed this issue by projecting the source domain samples to the target domain via a regression network, while keeping the discrimination ability of the source samples.
Although multisource domain adaptation has been shown to be very useful in the general computer vision literature, it is yet to find its way into remote sensing applications. To the best of our knowledge, there is still no contribution on multisource domain adaptation for the specific task of scene classification. This could be traced back to the fact that, years ago, remote sensing did not benefit from multiple scene classification datasets that shared an adequate number of object classes. However, thanks to recent efforts made by researchers, several benchmark scene datasets are now available to the remote sensing community, which opens the door to the development of advanced methodologies such as those related to multisource domain adaptation.
In this work, we propose a multi-branch neural network called MB-Net for solving the problem of knowledge adaptation from multiple scene datasets sharing the same number of object classes. Specifically, our aim is to learn invariant feature representations from multiple source scene datasets with labeled images and one target scene dataset with unlabeled images. For this purpose, we define for the network an objective function that mitigates the multiple domain shifts at both feature representation and decision levels, while keeping the discrimination ability between different object classes. The complete architecture is trainable end-to-end via the backpropagation algorithm. In the experiments, we demonstrate the effectiveness of the proposed method on a new multiple domain scene dataset created from four heterogeneous scene datasets: the University of California (UC-Merced) dataset [61], the Aerial Image Dataset (AID) [62], the PatternNet dataset [63], and the Northwestern Polytechnical University (NWPU) dataset [64].
The paper is organized as follows: Section 2 describes the proposed method. Section 3 shows the results obtained on the multiple source scene dataset. Section 4 analyzes the sensitivity of the method and presents comparisons with some recent state-of-the-art methods based on deep neural networks. Finally, Section 5 draws conclusions and outlines future developments.


Model Architecture
MB-Net is based on a pre-trained CNN coupled with additional branches. As a pre-trained CNN, we use the residual network (ResNet) [65], which is based on the idea of identity shortcut connections. In particular, we use ResNet50, which is a 50-layer network with a 3-layer bottleneck block (Figure 2). ResNet was introduced to address the vanishing-gradient problem in deeper networks. It introduces the idea of learning residual functions with reference to the layer inputs rather than learning unreferenced functions (see Figure 2). ResNet won first place in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2015 classification task. It has also been used to replace the VGG-16 layers in Faster R-CNN to improve detection results. In this work, we use ResNet50 pre-trained on the well-known ImageNet dataset as the first module in our network.

We remove the softmax layer and take the output of the average pooling layer (feature vector of dimension 2048) as input to the different branches. Each branch, related to a specific source dataset, is composed of two dense layers of size 128. Each dense layer is followed by batch normalization (BN), a rectified linear unit (ReLU) activation function, and dropout regularization. On top of these dense layers, we place a softmax classification layer (Figure 3). Additionally, to reduce the discrepancy between the source and target distributions, our network has average pooling layers $A_{v1}$ and $A_{v2}$ placed after the first dense layer and the softmax layer, respectively.
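The branch structure described above can be sketched as a plain NumPy forward pass. This is a minimal illustration under stated assumptions, not the authors' implementation: the 2048-dimensional ResNet50 features are replaced by random vectors, batch normalization and dropout are omitted (as they would be simplified at inference time), the weights are random, and the averaging layers are read here as element-wise means across the three branches. The names `init_branch`, `branch_forward`, and `mbnet_forward` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def init_branch(in_dim=2048, hidden=128, n_classes=12, scale=0.01):
    # Two dense layers of size 128 followed by a softmax layer (random weights).
    return {
        "w1": rng.normal(0.0, scale, (in_dim, hidden)), "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, scale, (hidden, hidden)), "b2": np.zeros(hidden),
        "w3": rng.normal(0.0, scale, (hidden, n_classes)), "b3": np.zeros(n_classes),
    }

def branch_forward(feats, p):
    # BN and dropout are left out of this sketch (assumption for brevity).
    h1 = relu(feats @ p["w1"] + p["b1"])   # first dense layer
    h2 = relu(h1 @ p["w2"] + p["b2"])      # second dense layer
    probs = softmax(h2 @ p["w3"] + p["b3"])
    return h1, probs

def mbnet_forward(feats, branches):
    outs = [branch_forward(feats, p) for p in branches]
    h_avg = np.mean([h for h, _ in outs], axis=0)  # assumed role of A_v1
    o_avg = np.mean([o for _, o in outs], axis=0)  # assumed role of A_v2
    return h_avg, o_avg

# Three branches, one per source dataset; 4 stand-in feature vectors.
branches = [init_branch() for _ in range(3)]
feats = rng.normal(size=(4, 2048))
h_avg, o_avg = mbnet_forward(feats, branches)
print(h_avg.shape, o_avg.shape)
```

Averaging softmax outputs keeps each row a valid probability vector, which is why the fused output can be used directly for the final class decision.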


Objective Function and Model Optimization
Let us define $\Theta_{s_k}$ as the weights and biases associated with the different branches of the network. To learn these parameters, we propose to minimize an objective function composed of three terms, $L = L_{ce} + \lambda_1 L_h + \lambda_2 L_o$, where $\lambda_1$ and $\lambda_2$ are two regularization parameters (set to 1 in the experiments). The term $L_{ce}$ represents the total cross-entropy loss computed for the $M$ labeled source datasets; in this term, $1(\cdot)$ is an indicator function that takes 1 if the statement is true and 0 otherwise, and $P(y_i = j \mid X^{(s_k)}, \Theta_{s_k})$ is the probability output vector provided by the softmax regression layer of the k-th source domain. The term $L_h$ is the distance between the source and target domains computed at the hidden representation layer of the network, as shown in Figure 1. The terms $h^{(s)}$ and $h^{(t)}$ refer to the feature representations of the source and target domains generated by the average layer $A_{v1}$. Similarly, the term $L_o$ is the distance between the source and target domains computed at the output of the network. Here, $O^{(s)}$ and $O^{(t)}$ refer to the outputs of the source and target domains generated by the average layer $A_{v2}$ placed on top of the softmax regression layers.
To optimize the above loss functions, one can use the backpropagation algorithm and the classical mini-batch stochastic gradient descent (SGD) method. The learning process starts by pre-training the network on the labeled source domains by optimizing the cross-entropy loss $L_{ce}$. This is done by dividing the source domains into several mini-batches of the same size, after which learning is performed by updating the weights for every mini-batch with a gradient descent step, $\Theta_{s_k} \leftarrow \Theta_{s_k} - \eta \, \partial L_{ce} / \partial \Theta_{s_k}$, evaluated on that mini-batch, where $\eta$ refers to the learning rate, and $b_{s_k}$ and $n_b^{s_k}$ refer to the size and number of mini-batches in the k-th source, respectively. Then, in the second phase, we fine-tune the network weights by minimizing the complete loss $L$ using both the source and target domains. The weights of the network are then updated in the same way with the gradient of $L$, where $b$ and $n_b$ refer to the size and the number of mini-batches in the target domain. During the learning process, random samples $X^{(s_k)}_{rand}$ are extracted from the different sources to reduce the discrepancy between the domains while keeping the discrimination ability of the network.
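As an illustration of how the three terms combine, the following NumPy sketch evaluates a loss of this form on toy arrays. The exact distance used for the two discrepancy terms is not recoverable from the text, so the squared Euclidean distance between mini-batch means is assumed here purely for illustration; the function names are hypothetical as well.

```python
import numpy as np

def cross_entropy(probs, labels):
    # Mean negative log-probability of the true class over a mini-batch.
    n = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))

def mean_discrepancy(a, b):
    # ASSUMPTION: squared Euclidean distance between the batch means.
    d = a.mean(axis=0) - b.mean(axis=0)
    return float(d @ d)

def mbnet_loss(src_probs, src_labels, h_src, h_tgt, o_src, o_tgt,
               lam1=1.0, lam2=1.0):
    # Total cross-entropy summed over the labeled source branches.
    l_ce = sum(cross_entropy(p, y) for p, y in zip(src_probs, src_labels))
    l_h = mean_discrepancy(h_src, h_tgt)   # hidden-representation level
    l_o = mean_discrepancy(o_src, o_tgt)   # decision (output) level
    return l_ce + lam1 * l_h + lam2 * l_o

# Toy example: one source branch, one sample, two classes.
probs = [np.array([[0.5, 0.5]])]
labels = [np.array([0])]
h_s, h_t = np.array([[1.0, 0.0]]), np.array([[0.0, 0.0]])
o_s, o_t = np.array([[0.5, 0.5]]), np.array([[0.5, 0.5]])
total = mbnet_loss(probs, labels, h_s, h_t, o_s, o_t)
print(total)
```

With both weights set to 1, as in the experiments, the two discrepancy terms contribute on the same scale as the cross-entropy term.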
It is worth noting that, for better performance, the experiments use a more advanced gradient-based update rule, the mini-batch adaptive moment estimation (Adam) method, for updating the parameters. The Adam method is an extension of the SGD method. While SGD maintains a single learning rate for all weights during the training process, Adam computes individual adaptive learning rates for different parameters from estimates of the first- and second-order moments of the gradients, which makes it very efficient.
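The per-parameter adaptation can be made concrete with a single Adam update written in NumPy. This is a generic sketch of the standard Adam rule, not code from the paper; the default values mirror the settings reported for the experiments (learning rate 0.001, decay rates 0.9 and 0.999, epsilon 1e-8), and the quadratic toy objective is an assumption chosen only to make the step easy to check.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Each parameter gets its own effective step size.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective f(theta) = theta^2 with gradient 2 * theta.
theta = np.array([1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
theta1, m, v = adam_step(theta, 2.0 * theta, m, v, t=1)
print(theta1)
```

On the first step the bias-corrected ratio has magnitude close to 1, so the parameter moves by roughly the learning rate regardless of the gradient scale.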
The following algorithm provides the main steps for training MB-Net with its nominal parameters: MB-Net method.

Description of the Multisource Dataset
To assess the performance of the proposed approach, we use four heterogeneous remote sensing scene datasets, collected and labeled by different experts, to build the multisource dataset. These are the Merced, AID, PatternNet, and NWPU datasets. This setting corresponds to three labeled source domains and one unlabeled target domain.
The Merced dataset is widely used for the task of aerial image classification. It is composed of 21 classes with 100 RGB images of size 256 × 256 pixels each, with 30 cm pixel resolution. This dataset was extracted from the United States Geological Survey (USGS) National Map Urban Area Imagery collection, from various urban areas pertaining to the following US regions: Birmingham, Boston, Buffalo, Columbus, Dallas, Harrisburg, Houston, Jacksonville, Las Vegas, Los Angeles, Miami, Napa, New York, Reno, San Diego, Santa Barbara, Seattle, Tampa, Tucson, and Ventura. The dataset amounts to 2100 images manually selected and labelled into 21 classes: agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium density residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, and tennis courts.
The AID dataset is made up of 10,000 large-scale aerial images of size 600 × 600 pixels with multiple resolutions (8 m to 0.5 m) within the following 30 aerial scene types: airport, bare land, baseball field, beach, bridge, center, church, commercial, dense residential, desert, farmland, forest, industrial, meadow, medium residential, mountain, park, parking, playground, pond, port, railway station, resort, river, school, sparse residential, square, stadium, storage tanks, and viaduct. Specialists in the field of remote sensing image interpretation annotated all the images, which are multi-source (from different remote sensing imaging sensors). Furthermore, the images of each class are carefully collected from various countries and areas around the world, including England, the United States, Germany, Italy, China, Japan, etc., and they are obtained at different times and seasons under different imaging circumstances.
The PatternNet dataset is collected from Google Earth imagery for remote sensing image retrieval, and consists of 38 classes: airplane, baseball field, basketball court, beach, bridge, cemetery, chaparral, Christmas tree farm, closed road, coastal mansion, crosswalk, dense residential, ferry terminal, football field, forest, freeway, golf course, harbor, intersection, mobile home park, nursing home, oil gas field, oil well, overpass, parking lot, parking space, railway, river, runway, runway marking, shipping yard, solar panel, sparse residential, storage tank, swimming pool, tennis court, transformer station, and wastewater treatment plant. Each class comprises 800 images of size 256 × 256 pixels.
The NWPU dataset is collected by Northwestern Polytechnical University (NWPU). It contains 31,500 images corresponding to 45 scene classes, with 700 images of size 256 × 256 pixels in each class. The spatial resolution of the images in each class varies from about 0.2 to 30 m. These 45 scene classes include airplane, airport, baseball diamond, basketball court, beach, bridge, chaparral, church, circular farmland, cloud, commercial area, dense residential, desert, forest, freeway, golf course, ground track field, harbor, industrial area, intersection, island, lake, meadow, medium residential, mobile home park, mountain, overpass, palace, parking lot, railway, railway station, rectangular farmland, river, roundabout, runway, sea ice, ship, snow berg, sparse residential, stadium, storage tank, tennis court, terrace, thermal power station, and wetland.
To make these datasets suitable for multisource domain adaptation, we consider in this work only the classes shared across them and discard the remainder. After this pre-processing step, we obtain twelve shared classes, as shown in Table 1: airfield, anchorage, beach, dense residential, farm, flyover, forest, game space, parking space, river, sparse residential, and storage tanks. Figures 4-7 show sample images related to these shared classes for the four datasets, respectively. In the experiments, we refer to the four possible transfer scenarios as follows: $(S_{Merced}, S_{NWPU}, S_{PatNet}) \rightarrow T_{AID}$, $(S_{AID}, S_{NWPU}, S_{PatNet}) \rightarrow T_{Merced}$, $(S_{AID}, S_{Merced}, S_{PatNet}) \rightarrow T_{NWPU}$, and $(S_{AID}, S_{Merced}, S_{NWPU}) \rightarrow T_{PatNet}$.
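The four transfer scenarios follow a leave-one-out pattern over the datasets: each dataset serves once as the unlabeled target while the other three act as labeled sources. A short Python sketch of this enumeration is given below; the helper name `transfer_scenarios` and the short dataset tags are illustrative choices, not part of the original work.

```python
DATASETS = ["AID", "Merced", "NWPU", "PatNet"]

def transfer_scenarios(datasets):
    # Each dataset is the target once; the remaining ones are the sources.
    scenarios = []
    for target in datasets:
        sources = tuple(d for d in datasets if d != target)
        scenarios.append((sources, target))
    return scenarios

scenarios = transfer_scenarios(DATASETS)
for sources, target in scenarios:
    src = ", ".join(f"S_{s}" for s in sources)
    print(f"({src}) -> T_{target}")
```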

Results
For training the network, we used the Adam optimization method with a mini-batch size of 100 images. We fixed the learning rate to 0.001, the exponential decay rates for the moment estimates to 0.9 and 0.999, and epsilon to $1 \times 10^{-8}$. Also, we set the regularization parameters as $\lambda_1 = \lambda_2 = 1$.
For performance evaluation, we present results on the unlabeled target datasets in terms of overall accuracy (OA) and per-class accuracy using confusion matrices. Experiments were performed on a laptop with a 2.9 GHz Intel Core i7 processor and 8 GB of memory.
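For reference, the two evaluation measures can be computed from predicted and true labels as in the plain Python sketch below. It is a generic illustration with made-up toy labels for three classes, not the evaluation code used in the experiments.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    # Rows index the true class, columns the predicted class.
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def overall_accuracy(cm):
    # Fraction of all samples that lie on the diagonal.
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total

def per_class_accuracy(cm):
    # Diagonal entry divided by the number of true samples of that class.
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]

# Toy labels (hypothetical): 3 classes, 6 samples.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, 3)
oa = overall_accuracy(cm)
pca = per_class_accuracy(cm)
print(cm, oa, pca)
```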
In the first phase, we pre-trained MB-Net on the labeled source datasets by optimizing the cross-entropy loss $L_{ce}$. Each time, we consider one dataset as the target and the three remaining ones as sources.
Table 2 shows the classification accuracies obtained for the four scenarios. In particular, this table shows the accuracy achieved by the k-th branch of the network related to the k-th source dataset, in addition to the final accuracy provided by the average fusion layer placed on top of the branches. For the scenario $(S_{Merced}, S_{NWPU}, S_{PatNet}) \rightarrow T_{AID}$, we observe that the second branch, related to NWPU, yields the highest OA of 91.46%, while the other two branches, related to Merced and PatternNet, yield lower accuracies of 58.13% and 61.50%, respectively. The average fusion layer gives a final accuracy of 80.42%. For $(S_{AID}, S_{NWPU}, S_{PatNet}) \rightarrow T_{Merced}$, the PatternNet dataset shows a better correlation than the other datasets, yielding an accuracy of 83.66%, while AID and NWPU deliver accuracies of 69.33% and 68.50%, respectively. The fusion layer allows the network to reach an OA of 82.16%. On the other hand, the case $(S_{AID}, S_{Merced}, S_{PatNet}) \rightarrow T_{NWPU}$ proves to be the most challenging, as the best branch, related to AID, reaches an OA of only 75.86%. The other branches, related to Merced and PatternNet, show weak correlations, as they deliver accuracies of 54.54% and 55.57%, respectively, while the average fusion layer results in an OA of 65.78%. Regarding the last scenario, $(S_{AID}, S_{Merced}, S_{NWPU}) \rightarrow T_{PatNet}$, we observe that the branch trained on Merced shows a better correlation than the other branches, as it yields an OA of 93.58%. The average fusion layer of the network gives an OA of 85.77%. These preliminary results show that solving only the representation aspect is not sufficient to obtain satisfactory results, due to the data-shift problem.
In the second phase, we optimized MB-Net by adding the losses dealing with the distribution discrepancy between the source and target domains at both representation and decision levels, as explained in the methodology.
From Table 2, we observe that the network yields significant improvements in terms of OA for all scenarios. For instance, for $(S_{\mathrm{Merced}}, S_{\mathrm{NWPU}}, S_{\mathrm{PatNet}}) \rightarrow T_{\mathrm{AID}}$, it gives an accuracy of 91.46%, corresponding to an increase of around 11%. For the case $(S_{\mathrm{AID}}, S_{\mathrm{NWPU}}, S_{\mathrm{PatNet}}) \rightarrow T_{\mathrm{Merced}}$, it yields an accuracy of 90.33%, corresponding to an increase of around 8%. Similarly, for $(S_{\mathrm{AID}}, S_{\mathrm{Merced}}, S_{\mathrm{PatNet}}) \rightarrow T_{\mathrm{NWPU}}$, it reaches an accuracy of 76.30%, an increase of around 10%. For the last scenario, $(S_{\mathrm{AID}}, S_{\mathrm{Merced}}, S_{\mathrm{NWPU}}) \rightarrow T_{\mathrm{PatNet}}$, it boosts its accuracy up to 98.05%, an improvement of around 12% over the 85.77% obtained with the cross-entropy loss alone. As an indication of the importance of the extra losses related to the domain discrepancy, we provide in Figure 8 a general view of the distributions in the 2D space of the hidden representation layer, obtained with the t-SNE (t-Distributed Stochastic Neighbor Embedding) method [66], for the transfer $(S_{\mathrm{AID}}, S_{\mathrm{Merced}}, S_{\mathrm{NWPU}}) \rightarrow T_{\mathrm{PatNet}}$. As can be seen, this feature visualization shows that MB-Net reduces the distance between the features of the source and target domains after adaptation. As additional information related to the different object classes, we report in Figures 9-12 the confusion matrices of all scenarios before and after adaptation. In detail, for the transfer $(S_{\mathrm{Merced}}, S_{\mathrm{NWPU}}, S_{\mathrm{PatNet}}) \rightarrow T_{\mathrm{AID}}$ (Figure 9), MB-Net improves class accuracies by 1% for dense residential, river, and sparse residential; 2% for arch-field and anchorage; and 5% for the farm class. For $(S_{\mathrm{AID}}, S_{\mathrm{NWPU}}, S_{\mathrm{PatNet}}) \rightarrow T_{\mathrm{Merced}}$ (Figure 10), there is an increase of 1% for anchorage, flyover, and storage cisterns, and 2% for dense residential. Among these classes, the river class seems difficult to classify, although its accuracy has been improved by 2%. For $(S_{\mathrm{AID}}, S_{\mathrm{Merced}}, S_{\mathrm{PatNet}}) \rightarrow T_{\mathrm{NWPU}}$ (Figure 11), we observe a significant improvement (up to 6%) for the farm class. For the last scenario, $(S_{\mathrm{AID}}, S_{\mathrm{Merced}}, S_{\mathrm{NWPU}}) \rightarrow T_{\mathrm{PatNet}}$, the network shows a significant improvement for the farm class (up to 7%), while the remaining classes improve by 1% or 2%.
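The per-scenario gains can be checked with a short calculation. The sketch below uses only the fused OA values quoted in the text (cross-entropy loss versus total loss); the variable names are hypothetical. The mean of the cross-entropy values reproduces the 78.53% average reported for the standard architecture, and the mean after adaptation lies within rounding of the reported 89.05%.

```python
# Fused OA (%) per target dataset, copied from the text above.
oa_ce = {"AID": 80.42, "Merced": 82.16, "NWPU": 65.78, "PatNet": 85.77}
oa_total = {"AID": 91.46, "Merced": 90.33, "NWPU": 76.30, "PatNet": 98.05}

# Gain of the total loss over the cross-entropy loss, in percentage points.
gains = {target: oa_total[target] - oa_ce[target] for target in oa_ce}

# Averages over the four transfer scenarios.
mean_ce = sum(oa_ce.values()) / len(oa_ce)
mean_total = sum(oa_total.values()) / len(oa_total)

for target, gain in gains.items():
    print(f"target {target}: +{gain:.2f} points")
print(f"mean OA: {mean_ce:.2f} -> {mean_total:.2f}")
```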

Discussion
To further investigate the importance of the additional losses $L_h$ and $L_o$, we repeated the above experiments by combining each of them independently with the cross-entropy loss $L_{ce}$. The results depicted in Figure 13 reveal that adding these two losses to $L_{ce}$ increases the classification accuracy on the unlabeled target data. By averaging the results over the four transfer scenarios, we obtain accuracies of 78.53%, 87.34%, 81.67%, and 89.05% for the cases $L_{ce}$, $L_{ce} + L_h$, $L_{ce} + L_o$, and $L = L_{ce} + L_h + L_o$, respectively. We notice that the loss $L_h$, related to the representation level, is more relevant than the loss $L_o$, computed at the decision level. Yet, including the loss $L_o$ is worthwhile, as it further boosts the model accuracy.

Finally, we present in Table 3 a comparison between MB-Net and some recent domain adaptation methods based on a single source. In particular, we compare our results to adversarial discriminative domain adaptation (ADDA) [49], which combines adversarial and discriminative learning, and to the Siamese-GAN (S-GAN) method [59], which reduces the discrepancy between the source and target domains using a Siamese encoder-decoder architecture. These two architectures have been proposed recently for single-source domain adaptation and have shown promising results compared to several state-of-the-art methods. Their extension to the multisource setting has not been explored yet. To make these methods applicable, we concatenate all sources into one source domain and run the experiments. As can be seen from Table 3, our method provides promising results in terms of classification accuracy. In particular, it provides an average accuracy of 89.05% versus 86.80% and 84.24% for the S-GAN and ADDA methods, respectively. Compared to these methods, MB-Net has the advantage of tackling the domain shift by reducing the discrepancy between the different source domains in addition to the discrepancy with the target domain. By contrast, the other approaches mitigate only the difference between a single source domain and the target domain, and ignore the shift between the different sources. The experimental results show that taking the discrepancy between the different sources into consideration further boosts the classification accuracy.
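The concatenation step used to run the single-source baselines can be sketched as follows. This is a minimal illustration under our own assumptions: each source is represented as a list of (image identifier, class label) pairs, and the function name, identifiers, and toy samples are hypothetical.

```python
def merge_sources(sources):
    """Concatenate several labeled source datasets into one source domain.

    `sources` maps a dataset name to a list of (image_id, label) pairs.
    The dataset name is prefixed to each identifier to keep them unique.
    """
    merged = []
    for name, samples in sources.items():
        for image_id, label in samples:
            merged.append((name + "/" + image_id, label))
    return merged


# Hypothetical toy datasets, for illustration only.
sources = {
    "AID": [("img_001", "farm"), ("img_002", "river")],
    "Merced": [("img_001", "farm")],
    "NWPU": [("img_001", "river"), ("img_002", "farm")],
}

single_source = merge_sources(sources)
print(len(single_source))  # number of labeled samples in the merged domain
```

After this step, the origin of each sample no longer influences training, which is why such baselines cannot account for the shift between the individual sources.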

Conclusions
In this paper, we have proposed a multi-branch neural network architecture for tackling the domain adaptation problem from multisource scene datasets. The method optimizes a loss function that, besides the standard cross-entropy loss, reduces the discrepancy between the source and target distributions at the representation and decision levels. In the experiments, we have validated the method on a multiple-source scene dataset built from four scene datasets well known to the remote sensing community. In particular, we have assessed the method using four transfer scenarios (from three source domains to one target domain). The results allow us to draw the following conclusions: (1) deep models based on the cross-entropy loss aim to solve the representation aspect; (2) they perform well when the source and target data come from the same domain; (3) they provide reduced accuracies when the distributions of the domains differ; (4) including suitable terms in the objective function that reduce the discrepancy between the domains helps mitigate this effect; and (5) the transfer from multiple sources can further improve accuracy compared to single-source domain adaptation by handling the multi-domain shifts. For future developments, we plan to propose advanced architectures that take into consideration the classes that are not shared between the different domains.
Let us consider $S_k = \{(X_i^{(s_k)}, y_i^{(s_k)})\}_{i=1}^{n_{s_k}}$, $k = 1, 2, \ldots, M$, as the dataset from the k-th source domain with $n_{s_k}$ labeled images, where $M$ represents the number of source domains. Here, $X_i^{(s_k)}$ and $y_i^{(s_k)}$ are the images in the k-th source domain and their corresponding class labels, with $y_i^{(s_k)} \in \{1, 2, \ldots, J\}$, where $J$ is the number of classes. Also, let us consider a single target dataset $T = \{X_j^{(t)}\}_{j=1}^{n_t}$ composed of $n_t$ unlabeled images. As mentioned in the introduction, the main contribution of this work is to develop an MB-Net architecture (Figure 1) that captures the shared knowledge across the different labeled source datasets $S_k$ and generalizes well on the new unlabeled target dataset $T$.

Remote Sens. 2018, 10, x FOR PEER REVIEW

Figure 1. Proposed MB-Net architecture: pre-trained convolutional neural network (CNN) coupled with additional branches, where each branch is related to a specific source dataset $S_k$.

Figure 3. One branch of MB-Net related to the k-th source domain.
ResNet won first place in the classification task of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2015. It has also been used to replace the VGG-16 layers in Faster R-CNN (region-based CNN), leading to better detection results. In this work, we use ResNet50 pre-trained on the well-known ImageNet dataset as the first module in our network.

Output: Target class labels
1: Set MB-Net parameters:
   • Regularization parameters: $\lambda_1 = \lambda_2 = 1$
   • Mini-batch size: $b = 100$
   • Adam parameters: learning rate $\eta = 0.001$, exponential decay rates for the first and second moments $\beta_1 = 0.9$ and $\beta_2 = 0.999$, and $\epsilon = 1 \times 10^{-8}$
2: Pre-train the network on the $M$ labeled source domains using the Adam method (i.e., estimate the parameters $\Theta_{s_k}$ by optimizing only the cross-entropy loss $L_{ce}$ in Equation (2))
3: Set the number of mini-batches: $n_b = n_t / b$
4: For epoch = 1 : num_epoch
   4.1 Shuffle randomly the unlabeled target images and organize them into $n_b$ groups, each of size $b$
   4.2 For r = 1 : $n_b$
       • Pick a mini-batch $T_r$ from the target domain
       • Feed this mini-batch to the different branches of the network and take the outputs $h_r^{(t)}$ and $O_r^{(t)}$ of the average layers $A_{v1}$ and $A_{v2}$, respectively
       • Pick randomly $M$ mini-batches $X_{rand}^{(s_k)}$, $k = 1, \ldots, M$, from the source domains
       • Feed each source mini-batch to its corresponding branch and take the outputs $h_{rand}^{(s)}$ and $O_{rand}^{(s)}$ of the average layers $A_{v1}$ and $A_{v2}$, respectively
       • Update the parameters $\Theta_{s_k}$ of the network by minimizing the total loss $L = L_{ce} + \lambda_1 L_h + \lambda_2 L_o$ (Equation (1)) on the current mini-batch
5: Classify the target domain $T$.
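The mini-batch schedule of steps 3 and 4 can be sketched in plain Python as follows. This is only an outline under stated assumptions: the forward pass, the individual losses, and the Adam update are omitted; $n_b$ is computed with integer division, so leftover target images are dropped within an epoch; and the dataset sizes, the number of epochs, and all function names are hypothetical.

```python
import random


def total_loss(l_ce, l_h, l_o, lam1=1.0, lam2=1.0):
    """Total loss L = L_ce + lam1 * L_h + lam2 * L_o."""
    return l_ce + lam1 * l_h + lam2 * l_o


def target_minibatches(n_t, b, rng):
    """Shuffle the target indices and split them into n_b groups of size b."""
    order = list(range(n_t))
    rng.shuffle(order)
    n_b = n_t // b  # assumption: integer division, leftovers are dropped
    return [order[r * b:(r + 1) * b] for r in range(n_b)]


def source_minibatches(source_sizes, b, rng):
    """Draw one random mini-batch of indices from each source domain."""
    return [rng.sample(range(n), b) for n in source_sizes]


rng = random.Random(0)
n_t, b = 1000, 100               # hypothetical target size; b as in step 1
source_sizes = [800, 1200, 900]  # hypothetical sizes of the M = 3 sources

steps = 0
for epoch in range(2):           # two epochs, for illustration only
    for target_idx in target_minibatches(n_t, b, rng):
        source_idx = source_minibatches(source_sizes, b, rng)
        # A real implementation would feed these batches to the branches,
        # compute L_ce, L_h and L_o, and take one Adam step on total_loss.
        steps += 1
print(steps)
```

In such a sketch, setting `lam1` or `lam2` to zero would correspond to training without the representation-level or the decision-level term, respectively.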

Figure 4. Example of images extracted from the Merced dataset.

Figure 5. Example of images extracted from the AID dataset.

Figure 6. Example of images extracted from the PatternNet dataset.

Figure 7. Example of images extracted from the NWPU dataset.

Figure 8. t-SNE representation of the source and target features obtained at the output of the hidden average layer for the scenario $(S_{\mathrm{AID}}, S_{\mathrm{Merced}}, S_{\mathrm{NWPU}}) \rightarrow T_{\mathrm{PatNet}}$ when using (a) the $L_{ce}$ loss, and (b) the total loss $L = L_{ce} + L_h + L_o$ (blue: source domain, red: target domain).

Figure 12. Confusion matrices for PatternNet using: (a) the $L_{ce}$ loss, and (b) the total loss $L = L_{ce} + L_h + L_o$.

Figure 13. Sensitivity analysis with respect to the losses $L_{ce}$, $L_h$, and $L_o$.

Table 2. Results in terms of overall accuracy (OA, %) obtained by MB-Net trained by optimizing the cross-entropy loss $L_{ce}$ and the total loss $L$: (a) AID, (b) Merced, (c) NWPU, and (d) PatNet.

Table 3. Comparisons with respect to state-of-the-art methods.