Benchmarking Domain Adaptation Methods on Aerial Datasets

Deep learning has grown in importance in recent years due to its versatility and excellent performance on supervised classification tasks. A core assumption of such supervised approaches is that the training and testing data are drawn from the same underlying data distribution. When this assumption does not hold, the performance of the model degrades. Domain adaptation aims to overcome the domain shift between the source domain used for training and the target domain data used for testing. Unsupervised domain adaptation deals with situations where the network is trained on labeled data from the source domain and unlabeled data from the target domain, with the goal of performing well on the target domain data at the time of deployment. In this study, we overview seven state-of-the-art unsupervised domain adaptation models based on deep learning and benchmark their performance on three new domain adaptation datasets created from publicly available aerial datasets. We believe this is the first study on benchmarking domain adaptation methods for aerial data. In addition to reporting classification performance for the different domain adaptation models, we present t-SNE visualizations that illustrate the benefits of the adaptation process.


Introduction
Deep neural networks have been used for a variety of tasks such as image classification, segmentation, image generation, and speech recognition [1,2]. These models require large amounts of labeled training data to generalize and scale well [3–5]. There are practical scenarios where labeled data from the test domain are scarce or not available, and supervised methods do not maintain their performance, especially when the training and test data are drawn from different distributions. An example is deploying a model trained on images from one sensor, corresponding to the source domain, and testing it on images obtained from another sensor, corresponding to the target domain. We expect a drop in accuracy due to the differences between the two domains [5], and thus domain adaptation (DA) algorithms are necessary to recover performance. A common setting is unsupervised domain adaptation (UDA), where labeled source data are available for training but the target data are unlabeled [4,6]. DA can reduce the need for costly labeling of the target domain data by adapting the labels from the source domain [7].
This benchmarking study focuses on unsupervised DA methods for classification without access to any labeled target domain data. Several surveys have been conducted on the topic of domain adaptation [5,7–14] or transfer learning where the target domain data are labeled [15–20]. All of these DA surveys make use of the common domain adaptation datasets, such as Office-31 (Amazon, Webcam, DSLR) [21], Digits (MNIST, SVHN, MNIST-M) [6,22,23], and Syn2Real [24,25]. However, to our knowledge there is no DA study based on aerial datasets. Aerial imagery is important to the remote sensing community and presents unique challenges due to changes in rotation, resolution, illumination, and noise depending on the sensor characteristics. Moreover, models trained on ground-based imagery do not generalize to aerial imagery due to some key differences. The viewpoints in aerial images are different from those in ground-based images, as the field of view of aerial cameras is larger and covers a greater distance. This often means that the objects in aerial images consist of fewer pixels and are more difficult to describe. Other challenging conditions in aerial images include lower between-class variation, weather-related disturbances such as cloud cover, and greater variations in the orientation of objects with respect to the background while still conveying the same contextual information. These attributes make aerial datasets more challenging for generalization across domains and motivate the need for domain adaptation. In our benchmarking study, we focus on creating suitable aerial datasets for domain adaptation and report the results from seven different unsupervised domain adaptation methods on these datasets.
The main contributions of this paper are as follows:
1. We present the first benchmarking study of domain adaptation on aerial datasets.
2. We construct six different aerial datasets for domain adaptation by carefully selecting the common classes and balancing the datasets.
3. We consider seven models for unsupervised domain adaptation and report their performance on the aerial datasets.
The remainder of this work is organized as follows. Section 2 discusses the notation and categories of domain adaptation. Section 3 explains each of the algorithms used in the benchmarking study. Section 4 describes the datasets and the process used to construct the aerial domain adaptation datasets used in this study. Section 5 provides comprehensive results from the study, along with confusion matrices and t-SNE [26] visualizations of the data before and after domain adaptation for each of the methods. Finally, Section 6 presents the final remarks and conclusions based on our evaluation.

Background
In unsupervised domain adaptation, the source domain where training takes place is described as $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$, where $n_s$ is the number of labeled samples, and the target domain, where testing takes place, is $D_t = \{x_j^t\}_{j=1}^{n_t}$, where $n_t$ is the number of unlabeled samples. The number of classes is $K$ in both source and target domains, and $C_s$ is the source classifier while $C_t$ is the target classifier. The feature extractor is denoted by $F$ and the discriminator by $D$ (when needed). Transfer learning (TL) was motivated by the insufficient training data problem, where the collection and labeling of data in some domains is expensive to do on a large scale. Here, the domains can be from the same modality [27] or different modalities [28,29]. TL refers to a broad category of algorithms in which knowledge transfer takes place, of which domain adaptation is a specific type.
Most DA methods can be categorized into divergence-based and adversarial-based methods [10]. Divergence-based DA works by minimizing the domain shift between the source and target distributions to obtain a domain-invariant feature representation [10]. The classifier can then perform comparably on both domains. The commonly used measures of distribution shift, and the works that use them, are: maximum mean discrepancy (MMD) [4,30–33], Kullback-Leibler (KL) divergence [34], correlation alignment (CORAL) [35,36], and $\mathcal{H}$-divergence. Class labels can also be used to transfer knowledge between different domains [37–40], and if target labels are not available, pseudo-labels are used [30,33,41]. Methods that adjust the architecture of the model, such as adaptive batch normalization (ABN) [42–44] and domain-guided dropout [45], are also divergence-based DA methods.
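As an illustration of a divergence-based criterion, the sketch below computes a biased estimate of the squared MMD between two small sets of feature vectors using an RBF kernel. The function names, the kernel bandwidth `sigma`, and the toy feature values are assumptions made for this example; it is not taken from any of the benchmarked implementations.

```python
import math


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def mmd_squared(source, target, sigma=1.0):
    """Biased estimate of the squared MMD between two lists of feature vectors."""
    def mean_kernel(first, second):
        total = sum(rbf_kernel(x, y, sigma) for x in first for y in second)
        return total / (len(first) * len(second))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


# Hypothetical 2-D features: the "shifted" set mimics a domain shift.
source_features = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.1]]
shifted_features = [[2.0, 2.1], [2.2, 2.0], [2.1, 2.1]]

print(mmd_squared(source_features, source_features))   # no shift
print(mmd_squared(source_features, shifted_features))  # clear shift
```

A divergence-based method would add a term of this kind, computed on deep features, to the classification loss so that minimizing it pulls the two feature distributions together.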
In adversarial-based DA, a domain discriminator is used to encourage domain confusion through an adversarial objective. This objective minimizes the distance between the empirical source and target mapping distributions. Adversarial-based DA is further categorized based on the use of generative versus nongenerative models. The generative approach makes use of generative adversarial networks (GANs) to create synthetic target data, which are in turn used to train the target model [46–48]. In the nongenerative approach, the feature extractor learns a discriminative representation using the source domain labels and a domain-confusion loss. The domain-confusion loss is added to the classification loss and tries to match the source and target distributions to confuse the higher layers [6,37,49–51]. DANN [6] is a popular example of adversarial-based DA.
In most cases of TL and DA, the source and target domains have the same classes, and this is called closed set DA. When the classes in the source and target are not identical, the problem is called open set DA [52] and is more challenging. Figure 1 illustrates the difference between open set and closed set DA. The work presented here considers closed set unsupervised DA methods for classification. All the DA algorithms that we have used for this benchmarking study were the top performing models with source code publicly available from each of the DA categories.

Domain Adaptation Algorithms
This section describes each DA method used in our study, in the following order: Domain Symmetric Networks (SymNets) [53], Robust Spherical Domain Adaptation (RSDA) [54], Gradually Vanishing Bridge as well as Conditional Adversarial Domain Adaptation with Gradually Vanishing Bridge [55], the unsupervised form of Universal Domain Adaptation [56], the source-free method Source Hypothesis Transfer (SHOT) [57], and finally Structurally Regularized Deep Clustering [58].

Domain-Symmetric Networks for Adversarial Domain Adaptation
The Domain Symmetric Network (SymNets) [53] utilizes a symmetric design of the source and target classifiers for adversarial unsupervised domain adaptation. SymNets makes use of a novel adversarial learning method that includes category-level and domain-level confusion losses, which encourage the learned features to be domain-invariant for the various classes. The proposed cross-domain confusion scheme makes the target classifier symmetric to the source classifier in terms of predicting the classes. The domain-level confusion scheme used for the domain adversarial training makes use of the convolutional layers in the network as a feature extractor $G$, and the fully connected (FC) layers as the task classifier $C$. The domain discriminator $D$, which is symmetric to $C$, is added on top of $G$ to distinguish between the features of the samples from the two domains.
The architecture of SymNets is described in Figure 2. In unsupervised domain adaptation, the source domain is described as $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$, containing $n_s$ labeled samples $x_i^s$ with labels $y_i^s$, and the target domain is $D_t = \{x_j^t\}_{j=1}^{n_t}$, containing $n_t$ unlabeled samples $x_j^t$.
Figure 2. Architecture of SymNets [53]. Blue and red arrows indicate the source and target domains, respectively, and the losses corresponding to them. Yellow refers to the feature extractor and its corresponding losses, and green represents the classifiers and their losses.
The SymNets design consists of two parallel task classifiers for $K$ classes, $C_s$ and $C_t$, each based on a single fully connected (FC) layer followed by a softmax operation. The source task classifier $C_s$ is trained using the cross-entropy loss over the labeled source samples, where $p(x) \in [0, 1]$ is the probability of the sample belonging to a class after softmax. Since the target samples $x^t$ are unlabeled, the labeled source samples are also used to train the target domain classifier $C_t$ with the cross-entropy loss. To make $C_s$ and $C_t$ distinguishable, a domain discrimination classifier $C_{st}$, formed from the concatenated outputs of $C_s$ and $C_t$, is trained with a two-way cross-entropy loss, where $\sum_{k=1}^{K} p_k^{st}(x)$ and $\sum_{k=1}^{K} p_{k+K}^{st}(x)$ are the probabilities of classifying an input sample $x$ as belonging to the source and target domains, respectively.
The category-level confusion loss makes use of the labeled source samples to learn the feature extractor $G$. For the domain-level confusion loss, the unlabeled target samples are used, as individual class labels are not required for domain-level confusion. The entropy minimization principle [59] is used in SymNets to enhance the discrimination among task categories by summing over the probabilities at each pair of category-corresponding neurons in $C_{st}$. The overall training objective is obtained by combining Equations (1)-(6), where $\lambda \in [0, 1]$ is a trade-off parameter to suppress noisy signals from $F^{st}_{domain}(G, C_{st})$ and $M^{st}(G, C_{st})$ during the early stages of training.
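The quantities described above can be sketched for a single sample as follows. The example assumes that the joint classifier output is a softmax over $2K$ logits whose first $K$ entries belong to the source classifier and whose last $K$ entries belong to the target classifier; the function names and the toy logits are hypothetical, and the code is an illustration rather than the SymNets implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def domain_probabilities(joint_logits):
    """Split a 2K-way softmax into (P(source), P(target)).

    Assumption: the first K outputs come from the source classifier and
    the last K outputs from the target classifier.
    """
    k = len(joint_logits) // 2
    p = softmax(joint_logits)
    return sum(p[:k]), sum(p[k:])


def domain_discrimination_loss(joint_logits, is_source):
    """Two-way cross-entropy on the summed source/target probabilities."""
    p_source, p_target = domain_probabilities(joint_logits)
    return -math.log(p_source if is_source else p_target)


def paired_category_entropy(joint_logits):
    """Entropy of q_k = p_k + p_{k+K}, the pairwise-summed class probabilities."""
    k = len(joint_logits) // 2
    p = softmax(joint_logits)
    q = [p[i] + p[i + k] for i in range(k)]
    return -sum(v * math.log(v) for v in q if v > 0.0)


# Hypothetical joint logits for K = 3 classes.
example_logits = [2.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(domain_probabilities(example_logits))
print(domain_discrimination_loss(example_logits, is_source=True))
print(paired_category_entropy(example_logits))
```

Minimizing the entropy of the paired probabilities pushes each target sample toward a single category without forcing a choice between the source and target halves of the output.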

Spherical Space Domain Adaptation with Robust Pseudo-Label Loss
Robust Spherical Domain Adaptation (RSDA) [54] proposes a novel adversarial domain adaptation approach by leveraging the spherical space and defining a spherical neural network. A robust pseudo-label loss is defined to make effective use of the pseudo-labels. This loss weights the importance of the estimated labels on target data by the posterior probability of the label being correct, which is modeled by a Gaussian-uniform mixture model in spherical space. Spherical ($L_2$-normalized) features have been shown to improve performance in recognition and domain adaptation [60–64]. RSDA extends this idea by defining all the operations in spherical feature space to leverage the advantages of the spherical space structure. The architecture of RSDA is described in Figure 3.
The source domain has a labeled dataset, $\{x_i^s, y_i^s\}_{i=1}^{N_s}$, and the target domain is unlabeled, $\{x_j^t\}_{j=1}^{N_t}$. The goal of RSDA is to transfer the knowledge obtained from the labeled source data to predict the target labels. The feature extractor $F$ is trained adversarially against a domain discriminator $D$: $D$ learns to distinguish between source and target features, while $F$ learns features that $D$ cannot tell apart. A CNN such as ResNet [65] is used as the feature extractor $F$, and its output is mapped onto a sphere. The classifier ($C$) and the discriminator ($D$) also operate in the spherical feature space. The spherical neural network consists of spherical perceptron and spherical logistic regression layers. The robust pseudo-label loss defined in spherical space makes use of the pseudo-labels of the target domain and the Gaussian-uniform mixture model.
Figure 3. Architecture of RSDA [54]. Blue and red arrows represent the computational flow of source and target domain samples, respectively. F is a feature extractor, a CNN that extracts features and embeds them onto a hypersphere. The spherical classifier predicts class labels and the domain discriminator predicts domain labels. The posterior probability of correct labels is obtained by feeding the target pseudo-labels and target features into a Gaussian mixture model. The posterior probabilities then weight the pseudo-label loss for robustness.
The spherical adversarial training loss takes into account the basic loss, the robust pseudo-label loss, and the conditional entropy loss, all of which are defined in the spherical feature space. $L_{bas}(F, C, D)$ is the basic loss, which is used to learn the classifier in the source domain and align features across domains. The basic adversarial domain adaptation loss takes either DANN [6] or MSTN [66] as the baseline, and $L_{bas}$ is the spherical version of that loss. The conditional entropy loss $L_{ent}(F)$ is used to reduce prediction uncertainty and is defined in terms of $H$, the entropy of the predicted distribution. To define the robust pseudo-label loss $L_{rob}(F, C, \phi)$, the pseudo-label of a target sample $x_j^t$ is $\tilde{y}_j^t = \arg\max_k [C(F(x_j^t))]_k$, where $[\cdot]_k$ denotes the $k$th element. A random variable $z_j \in \{0, 1\}$ indicates whether the sample is correctly labeled (1) or wrongly labeled (0). The probability of correct labeling is $P_\phi(z_j = 1 \mid x_j^t, \tilde{y}_j^t)$ with parameter $\phi$. The robust loss weights a per-sample loss $J(\cdot, \cdot)$, the mean absolute error (MAE) [67], by $w_\phi(x_j^t)$, which is defined based on this posterior probability of correct labeling. The posterior is learned using the Gaussian-uniform mixture model in spherical feature space.
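A minimal sketch of a posterior-weighted pseudo-label loss of this kind is given below. It assumes that samples whose posterior probability of correct labeling falls below a threshold receive zero weight and that the loss is normalized by the sum of the weights; the threshold value, the normalization, the function names, and the toy inputs are assumptions for illustration, and the exact form should be taken from [54].

```python
def mean_absolute_error(probabilities, label):
    """MAE between a predicted probability vector and the one-hot vector of `label`."""
    errors = [abs(p - (1.0 if k == label else 0.0))
              for k, p in enumerate(probabilities)]
    return sum(errors) / len(errors)


def robust_pseudo_label_loss(batch_probabilities, pseudo_labels, posteriors,
                             threshold=0.5):
    """Posterior-weighted MAE over a batch of target samples.

    Assumption: weights below `threshold` are set to zero so that
    unreliable pseudo-labels do not contribute to the loss.
    """
    weights = [g if g >= threshold else 0.0 for g in posteriors]
    total_weight = sum(weights)
    if total_weight == 0.0:
        return 0.0
    weighted = sum(w * mean_absolute_error(p, y)
                   for w, p, y in zip(weights, batch_probabilities, pseudo_labels))
    return weighted / total_weight


# Hypothetical batch: the second pseudo-label is judged unreliable.
probs = [[0.9, 0.1], [0.4, 0.6]]
labels = [0, 1]
posterior_correct = [0.95, 0.30]
print(robust_pseudo_label_loss(probs, labels, posterior_correct))
```

In this toy batch only the first sample contributes, so a noisy pseudo-label on the second sample cannot pull the classifier in the wrong direction.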
The spherical neural network (SNN) is an extension of the MLP from Euclidean to spherical space. The features in spherical space are obtained by normalizing the feature vectors, $f = r \frac{F(x)}{\|F(x)\|}$, where $r$ is the radius of the sphere. The classifier is constructed by stacking a few spherical perceptron (SP) layers and a final spherical logistic regression (SLR) layer. The SP layer consists of a linear transform and an activation function. The spherical linear transform ($g$) first projects the features from the former spherical surface onto its tangent plane, then applies a linear transformation that maps the projected features onto the tangent plane of the later spherical surface, and finally maps them from that tangent space back onto the later spherical surface. A nonlinear activation function is also defined in spherical space, and the SP layer composes the spherical linear transform with this activation. The SLR layer is defined analogously to Euclidean logistic regression, where $w_k^T z + b_k = 0$ is the classification boundary on the sphere.
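The normalization step that places a feature vector on the sphere can be sketched directly from the expression $f = r F(x) / \|F(x)\|$. The function name and the example values are illustrative assumptions; the tangent-plane transforms of the SP layer are not covered here.

```python
import math


def project_to_sphere(features, radius=1.0):
    """Scale a feature vector so that its L2 norm equals `radius`."""
    norm = math.sqrt(sum(v * v for v in features))
    if norm == 0.0:
        raise ValueError("cannot project the zero vector onto a sphere")
    return [radius * v / norm for v in features]


# Hypothetical 2-D feature vector projected onto a sphere of radius 10.
spherical = project_to_sphere([3.0, 4.0], radius=10.0)
print(spherical)
```

After this step, only the direction of the feature vector carries information, which is the property the spherical classifier and discriminator rely on.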

Gradually Vanishing Bridge for Adversarial Domain Adaptation
Gradually Vanishing Bridge for Adversarial Domain Adaptation [55] proposes a method to perform unsupervised adversarial domain adaptation that utilizes a vanishing bridge mechanism on the generator and the discriminator used in the network. The bridge is a measurement which models the difference between the existing and ideal representation of the domains. On the generator, the bridge reduces the overall transfer difficulty and reduces the influence of residual domain specific characteristics. It connects either the source or the target to an intermediate domain to enable domain alignment. On the discriminator, the bridge enhances the discriminating ability and balances the adversarial training process. During the training process, the range of the bridge is reduced gradually to reduce the influence of the domain characteristics of the bridge output which in turn reduces the discrepancy between the source and target. After the adaptation is done, more points from both the domains are covered in the intermediate domain. The points that are still outside of the intermediate domain are taken as hard examples. Applying the bridge to both the generator and discriminator is denoted as GVB-GD and it ensures that the two-player minmax game is balanced. The adversarial training process also ensures that the distribution of the intermediate representation is similar across both the domains.
The gradually vanishing bridge framework is an end-to-end network. The bridge layer on the generator outputs the domain-specific properties, denoted $\gamma_i$. The intermediate representation $r_i$ is found by subtracting $\gamma_i$ from the classifier response $c_i$, i.e., $r_i = c_i - \gamma_i$. The classification loss is then minimized on this intermediate representation.
The bridge layer associated with the discriminator represents the distance between the current discriminator function and the ideal decision boundary to be achieved. The discriminator also receives additional discriminative power from the bridge layer. The network is trained by minimizing the classification loss $L_{cls}$, the adversarial transfer loss $L_{trans}^{adv}$, and the reconstruction loss $L_{ext}$ for the generator, and by maximizing the adversarial transfer loss for the discriminator.
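The generator-side bridge can be sketched as a simple subtraction followed by a penalty that shrinks the bridge output over training. The penalty used here, the mean absolute bridge value, is an assumed form chosen for illustration, and the function names and toy numbers are hypothetical; the exact loss definitions are given in [55].

```python
def intermediate_representation(classifier_response, bridge_output):
    """r_i = c_i - gamma_i: remove the domain-specific part from the response."""
    return [c - g for c, g in zip(classifier_response, bridge_output)]


def bridge_magnitude(batch_bridge_outputs):
    """Mean absolute bridge value over a batch (assumed form of the bridge penalty)."""
    values = [abs(g) for row in batch_bridge_outputs for g in row]
    return sum(values) / len(values)


# Hypothetical classifier response and bridge output for one sample.
response = [2.0, 1.0]
gamma = [0.5, -0.5]
print(intermediate_representation(response, gamma))
print(bridge_magnitude([gamma, [0.0, 1.0]]))
```

Driving the penalty toward zero is what makes the bridge "vanish": as the bridge output shrinks, the intermediate representation approaches the raw classifier response.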

Conditional Adversarial Domain Adaptation with Gradually Vanishing Bridge
Gradually Vanishing Bridge [55] can be applied to other unsupervised domain adaptation methods such as Conditional Adversarial Domain Adaptation (CDAN) [68]. CDAN uses conditioning strategies such as multilinear conditioning, which captures the cross-covariance between feature representations and classifier predictions to improve the discrimination process. In the model CDAN-GD, the gradually vanishing bridge is applied on both the generator and the discriminator. The major contribution of CDAN is the domain discriminator, which is conditioned on the cross-covariance of the domain-specific feature representations and the predictions made by the classifier. This discriminator is also conditioned on the uncertainty of the classifier predictions, which helps the discriminator prioritize the easier samples. Backpropagation is then used to solve the system in linear time.
This method improves adversarial domain adaptation methods by addressing two shortcomings. First, when the joint distributions of features and classes across both domains are not identical, adapting only the feature representation is not enough, as deep representations usually transition from general to specific along the layers of deep networks. The second shortcoming occurs when the feature distribution is multi-modal, which is common in real-life scenarios. In this case, the adversarial network performs poorly. These challenges are overcome by training the generator and discriminator on information related to the domains, such as their respective labels or objects of similar modality. By conditioning the adaptation of the feature representation $f$ on the classifier prediction $g$, the model is able to capture multi-modal structures in adversarial domain adaptation. Domain variations in both the feature representation $f$ and the classifier prediction $g$ can be represented at the same time via conditioning. Thus, CDAN is presented as a minimax optimization problem consisting of two error terms: $E(G)$ from the source classifier $G$, which needs to be minimized for reduced source risk, and $E(D, G)$, which is related to the source classifier $G$ and the domain discriminator $D$ across both domains. Here, $E(D, G)$ is minimized over the discriminator but maximized over the feature representation $f = F(x)$ and classifier prediction $g = G(x)$. In the minimax game of the conditional domain adversarial network (CDAN), $\lambda$ is a hyper-parameter that trades off source risk against the domain adversary.
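Multilinear conditioning can be illustrated as the flattened outer product of the feature vector $f$ and the prediction vector $g$, which is what the discriminator receives as input. The sketch also includes an entropy-based weight of the form $1 + e^{-H(g)}$ that favors confident predictions; this weight, the function names, and the toy vectors are assumptions for illustration, not a reproduction of the CDAN code.

```python
import math


def multilinear_map(features, predictions):
    """Flattened outer product f (x) g used as the discriminator input."""
    return [f_i * g_j for f_i in features for g_j in predictions]


def entropy_weight(predictions):
    """Assumed weighting 1 + exp(-H(g)): confident predictions get larger weight."""
    entropy = -sum(p * math.log(p) for p in predictions if p > 0.0)
    return 1.0 + math.exp(-entropy)


# Hypothetical 2-D feature and 2-class prediction.
joint = multilinear_map([1.0, 2.0], [0.25, 0.75])
print(joint)
print(entropy_weight([0.5, 0.5]), entropy_weight([1.0, 0.0]))
```

The outer product has one entry per (feature, class) pair, so its size grows as the product of the two dimensions; this is why randomized variants are used when the feature dimension is large.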

Universal Domain Adaptation with Universal Adaptation Network
Most domain adaptation techniques require a definite class-wise relationship between the source and target domain classes. Universal Adaptation Network (UAN) [56] is an interesting modification of DANN [6] where the network does not need to know any specific class-wise relationship between the source and target domains. UAN can identify the target samples that do not belong to any of the classes of the source domain and categorize them as unknown samples, while classifying other target samples into one of the source domain classes.
In UAN, during training, a source domain $D_s = \{(x_i^s, y_i^s)\}$ with $n_s$ annotated samples from the label set $\mathcal{C}_s$, collected from a distribution $p$, and a target domain $D_t = \{x_i^t\}$ with $n_t$ nonannotated samples from the label set $\mathcal{C}_t$, collected from a distribution $q$, are given. The common label set shared by the source and target domains is denoted by $\mathcal{C} = \mathcal{C}_s \cap \mathcal{C}_t$. The label sets specific to the source and target domains are denoted by $\bar{\mathcal{C}}_s = \mathcal{C}_s \setminus \mathcal{C}$ and $\bar{\mathcal{C}}_t = \mathcal{C}_t \setminus \mathcal{C}$, respectively. The target data are entirely nonannotated, and the target label set is defined only for the purpose of explaining the setting. Commonness between the source and target domains is defined as the Jaccard index of the label sets, $\xi = \frac{|\mathcal{C}_s \cap \mathcal{C}_t|}{|\mathcal{C}_s \cup \mathcal{C}_t|}$. For the aerial datasets under consideration, in line with all other methods evaluated, we selected a closed-set setting with $\xi = 1$.
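The commonness measure can be computed directly from the two label sets. The sketch below uses hypothetical label sets; the closed-set case used in this study corresponds to identical sets and therefore to a value of 1.

```python
def commonness(source_labels, target_labels):
    """Ratio of shared labels to all labels across the two domains."""
    source_set, target_set = set(source_labels), set(target_labels)
    union = source_set | target_set
    if not union:
        raise ValueError("both label sets are empty")
    return len(source_set & target_set) / len(union)


# Hypothetical label sets: a closed-set pair and a partially overlapping pair.
closed_set = commonness({"beach", "forest", "river"}, {"beach", "forest", "river"})
partial = commonness({"beach", "forest", "river"}, {"forest", "river", "runway"})
print(closed_set, partial)
```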
The Universal Adaptation Network consists of a feature extractor $F$, a classifier $G$, an adversarial domain discriminator $D$, and a nonadversarial domain discriminator $D'$. Inputs $x$ from both domains are processed through the feature extractor $F$ to get features $z = F(x)$, which are then fed into the classifier $G$ to get the probability $\hat{y} = G(z)$ over the source classes. In addition to the domain gap, in a universal setting there is also a category gap between the source and target domains. To mitigate this discrepancy, UAN introduces the nonadversarial discriminator $D'$, which learns to estimate the domain similarity $d'$. This similarity is, in turn, used to compute the sample-level transferability criterion for training the network with the adversarial discriminator $D$. The weights $w^s$ and $w^t$ of the source and target samples, respectively, are calculated under the assumption that entropy expresses the uncertainty in predictions and that more confident predictions produce lower entropy.
In these weights, $H(\hat{y})$ is the entropy of the sample prediction and $d'(x)$ is the output of the nonadversarial discriminator $D'$. The weights $w^s$ and $w^t$ confine domain alignment via adversarial training to the label set shared by the two domains. The error functions are built on the cross-entropy loss $L$, and the resulting objective functions use a hyperparameter $\lambda$ to trade off domain transferability against discriminability. In the test phase, the adversarial discriminator is removed, while the nonadversarial discriminator, along with the classifier, produces the source similarity weights; depending on a source similarity threshold, each sample is assigned to one of the source classes or marked as unknown.
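One way to combine prediction entropy and domain similarity into sample weights is sketched below. It assumes that the entropy is normalized by the logarithm of the number of source classes, that the source weight is the normalized entropy minus $d'(x)$, and that the target weight is its negative; these forms, the function names, and the toy inputs are assumptions for illustration and should be checked against [56].

```python
import math


def normalized_entropy(probabilities):
    """Prediction entropy divided by log(number of classes), so it lies in [0, 1]."""
    entropy = -sum(p * math.log(p) for p in probabilities if p > 0.0)
    return entropy / math.log(len(probabilities))


def transferability_weights(probabilities, domain_similarity):
    """Return (source_weight, target_weight) for one sample.

    Assumption: `domain_similarity` is the nonadversarial discriminator
    output d'(x), close to 1 for source-like and 0 for target-like samples.
    """
    h = normalized_entropy(probabilities)
    return h - domain_similarity, domain_similarity - h


# Hypothetical sample: maximally uncertain prediction, target-like similarity.
w_source, w_target = transferability_weights([0.25, 0.25, 0.25, 0.25], 0.2)
print(w_source, w_target)
```

At test time, a weight of this kind compared against a threshold decides whether a target sample is assigned to a source class or marked as unknown.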

Source Hypothesis Transfer
Source Hypothesis Transfer (SHOT) [57] proposes a discrepancy-based domain adaptation technique using hypothesis transfer from the source domain $D_s$ to the target domain $D_t$. For the vanilla unsupervised domain adaptation task, $n_s$ labeled samples $\{x_s^i, y_s^i\}_{i=1}^{n_s}$ are given, where $x_s^i \in X_s$ and $y_s^i \in Y_s$, along with $n_t$ target samples $\{x_t^i\}_{i=1}^{n_t}$, where $x_t^i \in X_t$. The goal is to find $\{y_t^i\}_{i=1}^{n_t}$, where $y_t^i \in Y_t$, for the target samples by learning a mapping $f_t : X_t \rightarrow Y_t$. SHOT performs domain adaptation without access to the source data during the adaptation procedure. To achieve that, a network is first trained on the source data to learn the mapping $f_s : X_s \rightarrow Y_s$. During the adaptation procedure with the target data, the network is initialized from the source-trained model, the parameters of the classifier part of the model are kept frozen, and only the backbone part of the network is trained with the information maximization (IM) loss. The overall architecture of the SHOT method is shown in Figure 4. While training the network on the source domain $D_s$, the source feature extractor $g_s$ and the source hypothesis $h_s$ are both trained. Here, $g_s$ encodes the input image into a $d$-dimensional feature, i.e., $g_s : X_s \rightarrow \mathbb{R}^d$, and the hypothesis module takes the embedded feature and outputs $K$-dimensional logits, i.e., $h_s : \mathbb{R}^d \rightarrow \mathbb{R}^K$, where $K$ is the total number of classes. During the domain adaptation procedure on the target domain $D_t$, the target feature extractor $g_t$ is initialized with the source-trained backbone and learned during adaptation. The source hypothesis $h_s$ is transferred to the target hypothesis $h_t$ and remains unchanged throughout the training procedure.
The loss function for the source training is the categorical cross-entropy loss with label smoothing. The label smoothing component helps to achieve evenly separated clusters in the embedding space. In the source training loss, $\delta_k(a) = \frac{\exp(a_k)}{\sum_i \exp(a_i)}$ denotes the $k$th element of the softmax output for a vector $a$.
During training with the target data, the source data are not available, so distribution alignment based on features does not work. The authors argue that the ideal output probability for the target samples should be similar to one-hot vectors while differing across samples, i.e., the model should be confident and diverse. For this purpose, the information maximization (IM) loss is introduced as the sum of the entropy loss $L_{ent}$ and the KL-divergence loss $L_{div}$.
In the IM loss, $f_t(x) = h_t(g_t(x))$ is the $K$-dimensional output of the network and $\hat{p} = \mathbb{E}_{x_t \in X_t}[\delta(f_t(x_t))]$ is the mean output over the entire target domain. However, this does not ensure that the same classes match each other between the source and the target domain. To alleviate this, the authors propose self-supervised pseudo-labeling motivated by deep clustering. They compute a class centroid for each of the target classes and assign pseudo-labels based on the cosine distance of each sample to the centroids.
In the overall loss function for domain adaptation, $\hat{y}_t \in \hat{Y}_t$ denotes the estimated pseudo-labels. Recently, Taufique et al. [69] proposed a variant of SHOT utilizing an HRNet [70] backbone that further improved the performance of SHOT.
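The two target-side ingredients described above, the information maximization terms and centroid-based pseudo-labeling, can be sketched as follows. The sketch assumes that the diversity term is the negative entropy of the mean prediction and that centroids are probability-weighted feature means; the function names and the toy features are hypothetical, and the code is a simplified illustration rather than the SHOT implementation.

```python
import math


def information_maximization(batch_probabilities):
    """Return (entropy_term, diversity_term); both are minimized.

    entropy_term is the mean prediction entropy (low when confident);
    diversity_term is sum_k p_k log p_k of the mean prediction (low when diverse).
    """
    n = len(batch_probabilities)
    k = len(batch_probabilities[0])
    entropy_term = sum(-sum(p * math.log(p) for p in row if p > 0.0)
                       for row in batch_probabilities) / n
    mean = [sum(row[j] for row in batch_probabilities) / n for j in range(k)]
    diversity_term = sum(p * math.log(p) for p in mean if p > 0.0)
    return entropy_term, diversity_term


def cosine_distance(a, b):
    """One minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def class_centroids(features, probabilities):
    """Probability-weighted mean feature per class (assumed centroid definition)."""
    k, d = len(probabilities[0]), len(features[0])
    centroids = []
    for c in range(k):
        weight = sum(p[c] for p in probabilities)
        centroids.append([sum(p[c] * f[i] for p, f in zip(probabilities, features)) / weight
                          for i in range(d)])
    return centroids


def assign_pseudo_labels(features, centroids):
    """Label each sample with the class of its nearest centroid (cosine distance)."""
    return [min(range(len(centroids)), key=lambda c: cosine_distance(f, centroids[c]))
            for f in features]


# Hypothetical target features and (noisy) softmax outputs for two classes.
target_features = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 1.2]]
target_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.45, 0.55]]
centroids = class_centroids(target_features, target_probs)
print(assign_pseudo_labels(target_features, centroids))
print(information_maximization([[1.0, 0.0], [0.0, 1.0]]))
```

A batch of confident predictions that all fall into one class would give a low entropy term but a high diversity term, which is why the two terms are used together.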

Unsupervised Domain Adaptation via Structurally Regularized Deep Clustering
Structurally Regularized Deep Clustering (SRDC) [58] is a discrepancy-based unsupervised domain adaptation method that minimizes the risk of damaging the intrinsic discrimination of the target data. This damage usually occurs when domain-aligned features are explicitly learned. To avoid it, a source-regularized, deep discriminative clustering method is used to directly uncover the existing discrimination within the target data. The method is motivated by the assumption that structural similarity exists between the two domains. It uses a framework of deep-network-based discriminative clustering that minimizes the KL divergence between the predictive label distribution of the network and an introduced auxiliary one; for the source data, the auxiliary distribution is replaced with the ground truth labels. Thus, it implements the structural source regularization with joint network training. The target discrimination process is also enhanced by clustering of the intermediate network features. The structural regularization is further enhanced by selecting less divergent source examples.
The joint network consists of an embedding function and a classifier function. The method assumes that the classes are discriminative within a domain and that the same classes in different domains are geometrically close. After obtaining the softmax predictions of the target data, the network minimizes the KL divergence between the softmax outputs of the target samples and the auxiliary distribution. The method then alternates between updating the auxiliary distribution and using the updated distribution to train the network parameters, thereby optimizing the deep clustering objective. This objective has two terms. The first term is the KL divergence, where $P_t$ is the softmax output of the target samples and $Q_t$ is the auxiliary distribution. The second term balances the cluster assignment: it helps maintain balanced cluster sizes by performing entropy maximization of the label distribution on the target domain.
The above objective is optimized by alternating between the auxiliary distribution step and the network update step. Since both the domains share the same label space, joint training pushes instances from the same classes into the same regions, thus achieving feature alignment between the two domains. The clusters are also enhanced by selecting soft source samples, and this is done by reassigning the weights to the source after every epoch, based on their similarity to the target samples.
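The alternation between the auxiliary distribution step and the network update step can be illustrated with the KL term alone. The sketch assumes an auxiliary distribution that rescales each prediction by the inverse square root of its cluster's total mass and then renormalizes per sample, a form commonly used in deep discriminative clustering; this update rule, the function names, and the toy predictions are assumptions for illustration and are not taken from the SRDC code.

```python
import math


def auxiliary_distribution(batch_probabilities):
    """Assumed update: q_ik proportional to p_ik / sqrt(sum_i p_ik), normalized per sample."""
    k = len(batch_probabilities[0])
    cluster_mass = [math.sqrt(sum(row[j] for row in batch_probabilities))
                    for j in range(k)]
    auxiliary = []
    for row in batch_probabilities:
        scaled = [row[j] / cluster_mass[j] for j in range(k)]
        total = sum(scaled)
        auxiliary.append([v / total for v in scaled])
    return auxiliary


def kl_divergence(q, p):
    """KL(q || p) for two discrete distributions with p > 0 wherever q > 0."""
    return sum(q_i * math.log(q_i / p_i) for q_i, p_i in zip(q, p) if q_i > 0.0)


# Hypothetical target predictions in which the first cluster dominates.
predictions = [[0.7, 0.3], [0.6, 0.4], [0.8, 0.2]]
auxiliary = auxiliary_distribution(predictions)
clustering_loss = sum(kl_divergence(q, p)
                      for q, p in zip(auxiliary, predictions)) / len(predictions)
print(auxiliary[0], clustering_loss)
```

In this toy example the auxiliary distribution shifts mass toward the smaller cluster, which shows how the update step counteracts a collapse of all samples into one dominant cluster.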

Aerial Datasets
This section describes the aerial datasets used in this benchmarking study. Since there are no aerial datasets dedicated to domain adaptation, we considered publicly available aerial datasets for classification and utilized their shared classes for unsupervised domain adaptation.

AID
The Aerial Image Dataset (AID) [71] was developed for the task of aerial scene classification by procuring images from Google Earth and contains 30 classes of aerial or satellite imagery: beach, bridge, center, airport, bare land, baseball field, church, commercial, dense residential, forest, industrial, meadow, desert, farmland, medium residential, mountain, park, port, railway station, resort, parking, playground, pond, river, school, sparse residential, storage tanks, viaduct, square, and stadium. The annotations were made by experts in remote sensing image interpretation. The images in the dataset are termed multisource images, as the Google Earth images are collected from varying remote imaging sensors. The images were selected from many countries, including the United States, England, Italy, China, Japan, France, and Germany. The data are also considered diverse, as the images were captured under different imaging conditions, such as at different times of day and in various seasons throughout the year. There are a total of 10,000 aerial images of size 600 × 600 pixels. The images are obtained at multiple ground sampling distances (GSDs) ranging from 8 m to 0.5 m. The classes selected from AID for our experiments are airport, parking, storage tank, beach, forest, river, baseball field, medium residential, and sparse residential.

UCM
The UC Merced Land Use Dataset (UCM) [72] is a publicly available dataset of overhead land-use images meant for research purposes. It consists of 21 classes with 100 images per class, each measuring 256 × 256 pixels. The images are in the RGB color space with a pixel resolution of 1 foot (about 0.3 m). The dataset consists of these classes: beach, buildings, chaparral, agricultural, airplane, baseball diamond, dense residential, forest, freeway, golf course, mobile home park, overpass, parking lot, harbor, intersection, medium residential, river, runway, sparse residential, tennis courts, and storage tanks. The images were downloaded from the United States Geological Survey (USGS) National Map and cover different urban regions of the US. The selected images contain a wide variety of spatial patterns, textures, and colors, which makes the dataset well suited for scene classification. The classes selected from UCM for our experiments are airplane, parking lot, storage tank, beach, forest, river, baseball diamond, medium residential, and sparse residential.

NWPU
The NWPU-RESISC45 [73] dataset was created by Northwestern Polytechnical University (NWPU) for REmote Sensing Image Scene Classification (RESISC). It is known for its high within-class diversity and high between-class similarity. The dataset was collected from Google Earth. NWPU has a total of 31,500 high-resolution remote sensing images, which are divided into 45 scene classes. The classes are as follows: circular farmland, cloud, commercial area, dense residential, desert, forest, freeway, golf course, ground track field, airplane, airport, baseball diamond, basketball court, beach, bridge, chaparral, church, harbor, industrial area, intersection, island, lake, meadow, medium residential, mobile home park, roundabout, runway, sea ice, ship, snowberg, sparse residential, stadium, storage tank, mountain, overpass, palace, parking lot, railway, railway station, rectangular farmland, river, tennis court, terrace, thermal power station, and wetland. Each class has 700 RGB images of size 256 × 256 pixels. The spatial resolution ranges from around 30 m to 0.2 m per pixel. The classes selected from NWPU for our experiments are railway station, parking lot, bridge, runway, storage tank, and airplane.

CLRS
The Continual Learning Benchmark for Remote Sensing (CLRS) [74] dataset was designed for remote sensing image scene classification and for continual/lifelong learning. The authors defined criteria for three continual learning scenarios and divided the dataset accordingly. The CLRS dataset has 25 classes and a total of 15,000 images. The remote sensing images were procured from Google Earth, Bing Map, Google Map, and Tianditu, which rely on different remote imaging sensors, so the images are multi-source. Each class contains 600 images of size 256 × 256 pixels. The resolution of the data ranges from 0.26 m to 8.85 m. The 25 scene classes are highway, industrial, meadow, airport, bare-land, beach, bridge, mountain, overpass, park, parking, playground, commercial, desert, farmland, port, railway, railway-station, residential, river, runway, forest, golfcourse, stadium, and storage-tank. The classes of interest from CLRS for our work are railway station, parking, bridge, runway, storage tank, and airport.

xView
This dataset was created as part of the xView [75] 2018 Detection Challenge. It contains around 1 million object samples divided across 60 classes, with the option of using either 3-band or 8-band imagery. The images were captured by the WorldView-3 satellite at a ground sample distance of 0.3 m. The dataset is imbalanced: some classes have many instances and others have only a few. The objects in the images vary in size from 3 m to greater than 3000 m. The original images range from 2500 × 2500 to 4000 × 4000 pixels, and a single image often contains multiple objects from different classes, which makes accurate classification difficult. To overcome this, each image was cropped around the bounding boxes so that each cropped image contains a single object from a single class. The cropped image sizes range from 10 × 10 to 987 × 987 pixels. Further preprocessing was applied for our experiments. Cropped images smaller than 30 × 30 pixels were discarded. The number of images per class was restricted to 5000, and classes with fewer images were augmented by flipping the images horizontally and vertically. The final dataset contained the classes small vehicle, large vehicle, storage tank, plane, and ship. The augmented classes are plane, ship, and storage tank.

DOTA
The Dataset for Object deTection in Aerial images (DOTA) [76] is a benchmark dataset created for object detection in aerial images. The images are mainly gathered from Google Earth and from the JL-1 and GF-2 satellites, with the satellite data provided by the China Centre for Resources Satellite Data and Application. A total of 2086 images were collected from different areas of the world. The image sizes range from around 800 × 800 pixels to 6000 × 6000 pixels. The object categories in this dataset are ground track field, harbor, bridge, large vehicle, small vehicle, helicopter, roundabout, soccer ball field, swimming pool, plane, ship, storage tank, baseball diamond, tennis court, and basketball court. As with xView, a single image often contains multiple objects from different classes, which makes accurate classification difficult. To overcome this, each image was cropped around the bounding boxes so that each cropped image contains a single object from a single class. The cropped image sizes range from 10 × 10 to 904 × 904 pixels. Further preprocessing was applied for our experiments. Cropped images smaller than 30 × 30 pixels were discarded. The number of images per class was restricted to 5000, and classes with fewer images were augmented by flipping the images horizontally and vertically. The final dataset contained the classes small vehicle, large vehicle, storage tank, plane, and ship. The only augmented class is storage tank.
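The preprocessing described for xView and DOTA (crop around bounding boxes, discard small crops, cap each class at 5000 images, augment by flipping) can be illustrated with a minimal sketch. The sketch uses toy images stored as nested lists; the bounding-box layout, the function names, and the rule that a crop is discarded when either side is below 30 pixels are assumptions made for illustration and are not taken from the original pipeline.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]          # toy image: rows of pixel values
Box = Tuple[int, int, int, int]  # assumed layout: (x_min, y_min, x_max, y_max)

MIN_SIDE = 30      # crops smaller than 30 x 30 pixels are discarded
MAX_PER_CLASS = 5000  # per-class cap stated in the text


def crop_chip(image: Image, box: Box) -> Image:
    """Crop one object from the image, clipping the box to the image bounds."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    x0, y0 = max(0, x0), max(0, y0)
    x1, y1 = min(width, x1), min(height, y1)
    return [row[x0:x1] for row in image[y0:y1]]


def is_large_enough(chip: Image, min_side: int = MIN_SIDE) -> bool:
    """Assumption: a crop is kept only if both sides reach min_side."""
    if not chip or not chip[0]:
        return False
    return len(chip) >= min_side and len(chip[0]) >= min_side


def flip_horizontal(chip: Image) -> Image:
    """Mirror the crop left to right."""
    return [row[::-1] for row in chip]


def flip_vertical(chip: Image) -> Image:
    """Mirror the crop top to bottom."""
    return chip[::-1]


def balance_class(chips: Sequence[Image], target: int = MAX_PER_CLASS) -> List[Image]:
    """Cap a class at `target` crops, or add flipped copies until the
    target is reached or both flips of every crop have been used."""
    if len(chips) >= target:
        return list(chips[:target])
    balanced = list(chips)
    for flip in (flip_horizontal, flip_vertical):
        for chip in chips:
            if len(balanced) >= target:
                return balanced
            balanced.append(flip(chip))
    return balanced
```

With only horizontal and vertical flips, a class can grow to at most three times its original size, so a class with very few crops may stay below the cap in this sketch.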

Aerial DA Datasets
To the best of our knowledge, there are no existing datasets designed for aerial domain adaptation. For closed-set DA, the classes in the source and target domain datasets are the same. Using the datasets described in this section, we create three aerial DA datasets for our benchmarking study. Each of these datasets is made by taking the common classes between two of the aerial datasets. The image characteristics of the source and target domains differ because of their GSDs and the sensors that were used to collect the data.
The first DA dataset is between AID and UCM, with 9 classes in common between them. The shared classes and the number of samples in each class for the AID-UCM DA dataset are listed in Table 1. Sample images from the shared classes are shown in Figure 5, where the classes are (from left to right) baseball field/baseball diamond, beach, medium residential, sparse residential, parking/parking lot, airport/airplane, storage tank, forest, and river.
The next dataset is created by considering the common classes between NWPU and CLRS. The 6 shared classes and the number of samples in each class for the NWPU-CLRS DA dataset are listed in Table 2. Sample images from the shared classes are shown in Figure 6.
The third dataset is created by taking the common classes between DOTA and xView. The 5 shared classes and the number of samples in each class for the DOTA-xView DA dataset are listed in Table 3. Sample images from the shared classes are shown in Figure 7.
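As a minimal sketch of how a closed-set DA dataset can be assembled from two classification datasets, the snippet below pairs the AID and UCM class names listed in this section and maps both onto one shared label index. The function names and the (path, class name) sample format are hypothetical and only serve the illustration.

```python
from typing import Dict, List, Sequence, Tuple

# (AID class name, UCM class name) pairs, as listed in the text
AID_UCM_SHARED: List[Tuple[str, str]] = [
    ("airport", "airplane"),
    ("parking", "parking lot"),
    ("storage tank", "storage tank"),
    ("beach", "beach"),
    ("forest", "forest"),
    ("river", "river"),
    ("baseball field", "baseball diamond"),
    ("medium residential", "medium residential"),
    ("sparse residential", "sparse residential"),
]


def build_label_maps(
    pairs: Sequence[Tuple[str, str]],
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Give each shared class one index that is valid in both domains."""
    source_map = {src: idx for idx, (src, _) in enumerate(pairs)}
    target_map = {tgt: idx for idx, (_, tgt) in enumerate(pairs)}
    return source_map, target_map


def filter_samples(
    samples: Sequence[Tuple[str, str]], label_map: Dict[str, int]
) -> List[Tuple[str, int]]:
    """Keep only samples of shared classes and replace names by indices."""
    return [(path, label_map[name]) for path, name in samples if name in label_map]


aid_map, ucm_map = build_label_maps(AID_UCM_SHARED)
# Hypothetical file names, used only to show the filtering step
aid_samples = [("aid_0001.jpg", "airport"), ("aid_0002.jpg", "church")]
aid_subset = filter_samples(aid_samples, aid_map)
```

The same two functions apply to the NWPU-CLRS and DOTA-xView pairs once their shared class names are listed in the same way.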

Experiments
The code for the DA methods in our study was obtained from the URLs in Table 4. We considered the entire source domain for training and the entire target domain for adaptation. All the methods are unsupervised DA methods, so no label information from the target domain is used during adaptation. After adaptation, we consider the entire target domain for evaluation. SymNets, RSDA, SRDC, CDAN-GD, and GVB-GD were implemented on a Linux workstation with an NVIDIA Titan V GPU with 12 GB memory. SHOT and UAN were implemented on a Linux workstation with an RTX 2080 Ti GPU with 12 GB memory. The parameter selection for each method is described as follows.

Table 4. URLs for codes of implemented methods. All code is implemented in PyTorch.

SymNets: the backbone and implementation of SymNets follow the original described in [53]. The feature extractor G is a pretrained ResNet-50 [65] excluding the last FC layer. G is fine-tuned on each dataset, and the classifier $C^{st}$ is trained from scratch using back-propagation. The learning rate of $C^{st}$ is 10 times that of G. The optimizer is SGD with a momentum of 0.9 and a batch size of 128. The learning rate (λ) is adjusted using the update strategy described in [53].

RSDA: the backbone network architecture was kept the same as the original RSDA implementation in [54]. Training alternates between optimizing the network parameters F, C, and D and the parameters φ of the Gaussian mixture models, with one set of parameters frozen while the other is updated. The network is first trained with the basic loss $L_{bas}$ from DANN [6] to initialize F, C, and D. The following procedures are then run alternately:

1. Fix F, C, and D and estimate φ: the pseudo-labels $\tilde{y}_j^t$ are estimated with F, C, and D fixed, and the distance of a sample from the spherical class center of class $\tilde{y}_j^t$ is calculated using the cosine distance. Then, φ is estimated by the EM algorithm [59].

2. Optimize F, C, and D with φ fixed: with the current pseudo-labels and parameter φ, F, C, and D are trained as in standard domain adaptation training, using the progressive adversarial training strategy described in [6] with the objective function in Equation (8).
The feature extractor, F, is set to a ResNet-50 [65] pretrained on ImageNet, excluding the last FC layer. F, C, and D are optimized by SGD with a momentum of 0.9, and the learning rates of C and D are 10 times that of F. Following the method in [6], the learning rate η and hyperparameter γ are estimated by $\eta = \frac{0.01}{(1+\alpha p)^{\beta}}$ and $\gamma = \frac{2}{1+\exp(-\tau p)} - 1$, where α = 10, β = 0.75, τ = 10, and p is the optimization progress, which changes linearly from 0 to 1. The alternating iteration is performed 10 times, and each step runs SGD for 5000 steps.
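The two schedules used for RSDA can be written out as a short sketch, using the constants stated above (α = 10, β = 0.75, τ = 10, base learning rate 0.01). The function names and the way the progress p is derived from a step counter are assumptions made for illustration.

```python
import math


def learning_rate(p: float, eta0: float = 0.01,
                  alpha: float = 10.0, beta: float = 0.75) -> float:
    """Annealed learning rate: eta = eta0 / (1 + alpha * p) ** beta."""
    return eta0 / (1.0 + alpha * p) ** beta


def adaptation_weight(p: float, tau: float = 10.0) -> float:
    """Progressive weight: gamma = 2 / (1 + exp(-tau * p)) - 1."""
    return 2.0 / (1.0 + math.exp(-tau * p)) - 1.0


# Assumed bookkeeping: p grows linearly from 0 to 1 over the SGD steps.
total_steps = 5000
for step in (0, 1250, 2500, 5000):
    p = step / total_steps
    print(f"p={p:.2f}  eta={learning_rate(p):.5f}  gamma={adaptation_weight(p):.4f}")
```

The learning rate starts at 0.01 and decays as training progresses, while γ starts at 0 and approaches 1, so the adversarial term has little influence early in training, when the discriminator signal is still noisy.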
CDAN-GD: for these experiments, the backbone network is a ResNet-50 [65] pretrained on ImageNet, as in the original paper. The new layers and classifier layers are trained through back-propagation, where the classifier is trained from scratch with a learning rate 10 times that of the lower layers. As with GVB-GD, the bridge layer is built on the generator and discriminator using fully connected layers. The parameters λ and µ are both set to 1, which indicates that the bridge is applied on both networks. CDAN is also implemented with entropy conditioning, which makes the discriminator prioritize easy-to-transfer samples. For training, we employ mini-batch SGD with a momentum of 0.9 and a weight decay of 0.0005. The learning rate is 0.001 for all the experiments, with a batch size of 16.
GVB-GD: in these experiments, the backbone network is a ResNet-50 [65] pretrained on ImageNet, as in the original paper. The bridge layer is built on the generator and discriminator using fully connected layers. During adversarial training, a gradient reversal layer is applied to the network. In the early stages of training, a progressive training strategy is applied to suppress noisy signals from the discriminator. For training, we employ mini-batch stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 0.0005. The learning rate is 0.001 for all the experiments, with a batch size of 16. The parameters λ and µ are both set to 1 to balance the loss, which indicates that the bridge is implemented on both the generator and the discriminator.
UAN: the backbone or feature extractor is an ImageNet-pretrained ResNet-50 [65], following the original paper. The classifier is a 2-layer fully connected network whose final output layer size depends on the dataset. Both the adversarial and non-adversarial discriminators are 3-layer FC networks with a sigmoid activation function applied to the final output. During training, the learning rate of both discriminators and the classifier was set to 0.001, and that of the feature extractor was set to 0.0001. Training was done with a mini-batch SGD optimizer with a momentum of 0.9.
SHOT: in these experiments, the network architecture is the same as in the authors' GitHub implementation. The backbone is a pretrained ResNet-50 [65] where the classifier layer is replaced with an FC layer of 256 neurons followed by a BN layer. The classifier layer consists of an FC layer followed by a weight normalization (WN) layer, as shown in Figure 4. We consider a base learning rate of $\eta_0 = 10^{-2}$ for the backbone and 10 times that learning rate for the upper-level layers. Similar to the original implementation, we use the SGD optimizer with a momentum of 0.9. We use the learning rate scheduler $\eta = \eta_0 \cdot (1 + 10 \cdot p)^{-0.75}$, where p is the training progress from 0 to 1. We also set the hyperparameter β = 0.3.

SRDC: the backbone network is a ResNet-50 [65] pretrained on ImageNet, as in the original paper. In the backbone network, the last FC layer is replaced with a task-specific FC layer that parameterizes the classifier. The pretrained layers are fine-tuned and the new layers are trained from scratch, with the learning rate of the new layers set to 10 times that of the pretrained layers. For training, mini-batch SGD is used with a momentum of 0.9 and a weight decay of 0.0001. The batch size is 16. For each run, the best performing clustering model is used as the test model. Additional regularization is provided by performing discriminative clustering in the bottleneck feature space.

Results and Discussion
In this section, we compare the performance of the different DA models on the aerial datasets using the overall accuracy and F1 scores. We also report the confusion matrices and t-SNE plots for the xView to DOTA domain adaptation scenario.

Performance Comparison
Tables 5-7 present the performance comparison for all the benchmarked methods. Each method was executed five times, and the mean and standard deviation of the accuracy are reported. The F1 scores are also reported because the classes are not perfectly balanced. The F1 score is the harmonic mean of the precision and recall. Precision measures how many of the positive predictions are correct and is defined as the ratio of true positives to the sum of true positives and false positives. Recall measures how many of the actual positive cases are correctly identified and is defined as the ratio of true positives to the sum of true positives and false negatives. Every method except UAN achieved strong performance on most of the datasets. For AID to UCM, SymNets achieves the highest performance gain, with the classification accuracy rising from 81.75% (before adaptation) to 99.15% (after adaptation). The F1 scores follow a similar trend to the accuracy scores except for the DOTA to xView adaptation. This may be because training on a higher resolution dataset, such as DOTA, and then adapting to a lower resolution one, like xView, is more challenging and leads to a degradation in performance. Although it is source-free during adaptation, SHOT achieved performance competitive with the source-dependent methods.
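The metric definitions above can be made concrete with a small sketch. The text does not state how per-class F1 scores are averaged or which standard deviation estimator is used, so the macro average and the sample standard deviation below are assumptions, and the labels and accuracies are toy values rather than results from this study.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int],
                        cls: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Assumption: unweighted mean of the per-class F1 scores."""
    classes: List[int] = sorted(set(y_true))
    return mean(precision_recall_f1(y_true, y_pred, c)[2] for c in classes)


# Toy labels for three classes (illustrative only)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred)

# Toy accuracies from five hypothetical runs, summarized as mean and std
run_accuracies = [0.90, 0.92, 0.91, 0.89, 0.93]
acc_mean, acc_std = mean(run_accuracies), stdev(run_accuracies)
```

Because the macro average weights every class equally, it penalizes poor performance on small classes more than the overall accuracy does, which is why the two measures can diverge on imbalanced data.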

xView to DOTA Results
In this section, we discuss the results of the xView to DOTA adaptation in more detail with the help of confusion matrices (Figures 8-14) and t-SNE plots (Figures 15-21) of the various DA methods before and after adaptation. The custom xView to DOTA dataset was chosen for further analysis because there is a significant jump in accuracy from before DA to after DA for all the methods in this study, with SymNets showing the highest performance gain. A confusion matrix is used to quantify the performance of a classifier, where each row corresponds to the actual class and each column corresponds to the predicted class. It shows which classes are confused with each other for a given model, and ideal performance results in a diagonal matrix.
While the performance improves for all classes across all DA methods, the most confusion appears to be between small vehicle and large vehicle and between ship and storage tank. This could be because at low resolutions, small and large vehicles appear similar, as do ships and storage tanks.
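A confusion matrix with the row and column convention described above can be built as in the following sketch, which also ranks class pairs by how often they are mistaken for each other. The class order, the helper names, and the toy labels are assumptions for illustration and do not reproduce the reported figures.

```python
from typing import List, Sequence, Tuple

# Shared DOTA-xView classes; the index order is an assumption
CLASS_NAMES = ["small vehicle", "large vehicle", "storage tank", "plane", "ship"]


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Rows are actual classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


def overall_accuracy(matrix: List[List[int]]) -> float:
    """Fraction of samples on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def most_confused_pairs(matrix: List[List[int]], names: Sequence[str],
                        top_k: int = 2) -> List[Tuple[int, str, str]]:
    """Rank class pairs by the number of errors in both directions."""
    n = len(matrix)
    pairs = [(matrix[i][j] + matrix[j][i], names[i], names[j])
             for i in range(n) for j in range(i + 1, n)]
    pairs.sort(key=lambda item: -item[0])
    return pairs[:top_k]


# Toy labels (illustrative only)
y_true = [0, 0, 0, 1, 1, 2, 2, 4, 4, 3]
y_pred = [0, 1, 1, 1, 0, 2, 4, 4, 2, 3]
cm = confusion_matrix(y_true, y_pred, len(CLASS_NAMES))
top_pairs = most_confused_pairs(cm, CLASS_NAMES)
```

A perfectly adapted model would place every count on the diagonal, so the off-diagonal sums used in `most_confused_pairs` would all be zero.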

Feature Visualization
To visualize the features, we make use of t-Distributed Stochastic Neighbor Embedding (t-SNE) [26], which is a dimensionality reduction method used to visualize high-dimensional data on a 2D or 3D plot. t-SNE works in three steps. First, the similarity between points in the high-dimensional space is measured. Next, a distribution that measures the pairwise similarities between points in the low-dimensional embedding is calculated. Finally, the KL divergence between the probability distributions in the high- and low-dimensional spaces is minimized to provide the final 2D graph for visualization.
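For reference, the three steps can be written in the standard t-SNE formulation. The notation is introduced here for illustration and is not taken from the rest of this study: $x_i$ are the high-dimensional features, $y_i$ their low-dimensional embeddings, $\sigma_i$ a per-point bandwidth set by the perplexity, and $N$ the number of points.

```latex
% Step 1: pairwise similarities in the high-dimensional space
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

% Step 2: pairwise similarities in the low-dimensional embedding (Student-t kernel)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Step 3: embedding obtained by minimizing the KL divergence
\mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

The heavy-tailed kernel in the second step lets moderately distant points sit further apart in the embedding, which helps keep clusters visually separated in the 2D plot.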
To visualize features in this benchmarking study, we show t-SNE plots with the source and target domain features of the network before and after adaptation. Prior to the adaptation process, the t-SNE plot is obtained when the model is trained only on the source domain. The t-SNE plot after adaptation is used to visualize the improvement in the alignment of source and target domain features after the model is adapted to the target domain. In this section, we show the t-SNE plots for all the DA methods in Figures 15-21, before and after adaptation on the xView-DOTA dataset, which contains five classes, as outlined in Table 3.
When the domain adaptation process is successful, the source and target domains have near-perfect alignment. From the t-SNE plots of the DA methods, SymNets, RSDA, and SHOT seem to have the best source and target domain alignment after adaptation, as seen in Figures 15, 16, and 20.

Conclusions
We presented a benchmarking study of seven unsupervised domain adaptation methods (SHOT, RSDA, SymNets, UAN, GVB-GD, CDAN-GD, and SRDC) on three custom aerial DA datasets. These datasets were created by taking the common classes between AID-UCM, DOTA-xView, and NWPU-CLRS. We reported the accuracy for each of the methods considered on the aerial datasets to determine their efficacy. The confusion matrices and t-SNE plots of the methods on the xView-DOTA dataset were also reported.
For the AID to UCM adaptation, SymNets does the best, while SHOT reports the most improvement for the UCM to AID adaptation task. For CLRS to NWPU and NWPU to CLRS adaptation, SHOT shows the best performance. Finally, for the DOTA to xView adaptation, SHOT has the highest accuracy, while SymNets does best for the xView to DOTA adaptation. Overall, SHOT and SymNets are the best performing models for the task of aerial domain adaptation. This is further evidenced by the confusion matrix of SymNets for the xView to DOTA adaptation, where the diagonal gets stronger after adaptation. The t-SNE plots of SymNets and SHOT both show good alignment of the source and target domains after adaptation, which is further evidence of the efficacy of these methods.