Domain-Adversarial Training of Self-Attention Based Networks for Land Cover Classification using Multi-temporal Sentinel-2 Satellite Imagery

The increasing availability of large-scale labeled remote sensing data has prompted researchers to develop increasingly precise and accurate data-driven models for land cover and crop classification (LC&CC). Moreover, with the introduction of self-attention and introspection mechanisms, deep learning approaches have shown promising results in processing long temporal sequences in the multi-spectral domain with a contained computational cost. Nevertheless, most practical applications cannot rely on labeled data, and field surveys are a time-consuming solution that strictly limits the number of collected samples. Moreover, atmospheric conditions and specific geographical region characteristics constitute a relevant domain gap that prevents a model trained on the available dataset from being applied directly to the area of interest. In this paper, we investigate adversarial training of deep neural networks to bridge the domain discrepancy between distinct geographical zones. In particular, we perform a thorough analysis of domain adaptation applied to challenging multi-spectral, multi-temporal data, highlighting the advantages of adapting state-of-the-art self-attention based models for LC&CC to different target zones where labeled data are not available. Extensive experimentation demonstrates significant performance and generalization gains when applying domain-adversarial training to source and target regions with marked dissimilarities between the distributions of extracted features.


Introduction
In the past few decades, the launch of many satellite missions with short revisit times and comparatively high-resolution sensors has offered an extensive repository of remote sensing images. The availability of open-source data from many Earth-observation satellites has made remote sensing imagery easily accessible [1]. Open-source datasets are available free of cost from several satellite missions, such as Sentinel-2 and Landsat [2]. These satellites are equipped with multi-spectral sensors with short revisit times and good spatial and spectral resolution, allowing researchers to test modern image analysis techniques to extract more detailed information about the target object. It is now possible to monitor the dynamic processes on Earth [3,4]. Additionally, it has become easier to estimate and classify biophysical parameters using several data sources [5,6,7]. Overall, the new scenario has opened opportunities for land cover monitoring, change detection, image mosaicking, and large-scale processing using multi-temporal and multi-source images [1,8,9,10].
The most essential and critical remote sensing application is land cover and crop classification (LC&CC). It facilitates labeling cover types such as forest, ocean, and agricultural land. Mapping can also be done manually using satellite images, but the process is tedious, costly, and time-consuming. A detailed global land cover map is not yet available, but there is a land cover map named Corine Land Cover (CLC) [11], which provides land cover information at a resolution of 100 m per pixel. However, this map only covers the European area and is updated once every six years. There are several ways to perform land classification automatically. In general, classification involves creating a training dataset of samples annotated with the corresponding class labels, training a model on that dataset, and evaluating the resulting predictions. The number and quality of training samples play a pivotal role in defining the performance of the trained model. From a remote sensing perspective, training sample collection requires a ground survey or visual photo-interpretation by an expert [12]. Ground surveying involves GIS expert knowledge and human resources that are typically not economical, while visual interpretation is not appropriate for some applications, such as finding chlorophyll concentration [13] and classifying tree species [14]. Most machine learning (ML) algorithms, such as random forest, support vector machines, and logistic regression, perform well in the classification of remote sensing images. However, the performance of these ML algorithms is not satisfactory when learning features from different sources, such as active and passive sensors [15]. It was shown in [16,17] that Convolutional Neural Networks (CNN) are better than traditional land cover classification techniques. In the land segmentation section of the DeepGlobe challenge [18], Deep Neural Networks (DNN) completely dominate the leaderboards. The best examples of land cover classification using Deep Neural Networks are ResNet and DenseNet [19,20].
Since land covers differ from one location to another, a model trained in one area cannot be directly deployed in other areas. Additionally, the imagery of different satellites is not the same, owing to differences in their resolution, capture time, and other radiometric parameters. Due to these multiple changing variables, a dataset taken from a satellite covering one region and a dataset from another satellite covering the same or other regions exhibit a domain shift. One way to achieve a reliable outcome is to train a model with a huge number of training samples, so that it generalizes to all classes in all regions. However, that requires an enormous labeled dataset, whose collection is time- and labor-intensive.
Another method to deal with the shift between datasets is termed Domain Adaptation (DA), in which a model is trained on one dataset (source domain) and predictions are made on another dataset (target domain). The distribution shift between the target and source datasets is mainly due to temporal differences in the acquisition, differences in the acquisition sensors, and geographical differences such as variations of objects at the Earth's surface. The domain shift degrades the performance of a model trained on a source dataset and applied to the target dataset. Domain adaptation methods often rely on learning domain-invariant models that keep comparable performance on the two datasets. Existing domain adaptation techniques may be classified as supervised, unsupervised, and semisupervised. In supervised DA methods, it is presumed that labeled data are available for both source and target domains [21]. In semisupervised methods, the labeled data for the target domain are assumed to be scarce, while unsupervised methods rely on labeled data for the source domain only. For example, in [22], a semisupervised visual domain adaptation was proposed to address the classification of very high-resolution remote sensing images. To deal with the variation in feature distribution between the source and target domains, a multiple kernel learning domain adaptation method was employed. Another example is [23], in which domain adaptation based on semisupervised transfer component analysis was employed to extract features for knowledge transfer from a source image to a target image for land cover classification of remotely sensed images.
Tuia et al. divide the domain adaptation methodologies into four different categories: domain-invariant feature selection, adapting data distribution, adapting classifiers, and adaptive classifiers using active learning methods [12]. Many studies discuss unsupervised domain adaptation in the context of classification and segmentation of remotely sensed satellite and aerial imagery. For example, in [24], an unsupervised adversarial domain adaptation method was proposed based on a boosted domain confusion network (ADA-BDC), which focuses on feature extraction to enhance the transferability of a classifier trained on source domain images and tested on target domain images. In [25], unsupervised domain adaptation using generative adversarial networks (GANs) was applied to semantic segmentation of aerial images. A multi-source domain adaptation (MDA) method for scene classification was proposed to transfer knowledge from multiple source domains to the target domain [26]. Most of the studies in the literature related to DA-based classification have used single-date images of the source and target domains. However, [27] proposed the first approach to DA for the classification of multi-temporal satellite images, in which Bayesian classifier-based DA was employed with only two images from the Landsat-5 satellite.
This work investigates adversarial training of deep neural networks to bridge the domain discrepancy between distinct geographical zones. In particular, we perform a thorough analysis of domain adaptation applied to challenging multi-spectral, multi-temporal data, highlighting the advantages of adapting state-of-the-art self-attention-based models for LC&CC to different target zones where labeled data are not available. We choose to test our methodology on the BreizhCrops dataset, a large-scale time series benchmark dataset introduced in 2020 by Rußwurm et al. [28] for supervised classification of field crops from satellite data. Figure 1 shows the visual representation of the crop prediction performed on a sub-region of Brittany, highlighting the benefit provided by the proposed methodology.
This article is organized as follows. Section 2 covers the related work on domain adaptation and its developments in techniques for LC&CC. Section 3 describes the dataset. A detailed description of the proposed method is presented in Section 4. The experimental setup, the results, and the related discussion are reported in Section 5. Finally, Section 6 draws some conclusions and outlines future directions.

Land Cover and Crop Classification
LC&CC has been the subject of many studies in the past. A widely used classification method makes use of time series of vegetation indices (VI) derived from remotely sensed imagery to extract temporal features and phenological metrics. Thresholds and simple statistical techniques also help calculate the time of peak VI, the maximum VI, and other vegetation-related metrics [29,30]. Moranduzzo et al. and Hao et al. [31,32] illustrate the older image classification methods, which use handcrafted features for image representation and train classifiers such as support vector machines and random forests. With massive datasets available and improved computing devices, machine learning methods learn by themselves how to extract features from the data. Random Forest (RF)-based classifiers are another common approach for remote sensing applications [32], though it should be noted that multiple features need to be derived and fed to the RF classifier for more effective output.
One of the newest and most powerful concepts integrated into mapping is a branch of machine learning known as Deep Learning (DL). DL is a type of machine learning based on artificial neural networks in which multiple layers of processing are used to extract progressively higher-level features from data. DL can be used to solve a wide range of problems in signal processing, computer vision, image processing, and natural language processing [33]. DL has contributed significantly to remote sensing image classification due to its ability to represent features and its capacity for automated end-to-end learning. Autoencoders are a type of artificial neural network and are often used to represent features of data [34,35]. In the remote sensing field, object detection and image segmentation have been performed extensively using two-dimensional CNNs [36,37] to perform spatial feature extraction from high-resolution images. 2D CNNs proved better than 1D CNNs in crop classification [38]. In remote sensing, two-dimensional CNNs can be used effectively for image classification where a correlation exists between the morphological details and the target classes. For example, in [39], a 2D-CNN is used to obtain the spatial features of hyperspectral imagery (HSI), analyzing the continuity of land covers in the spatial domain. The relation among the spectral bands of HSI is often not linear; in that case, 2D-CNNs are normally used together with 1D-CNNs to incorporate both the spectral and spatial domains of features [40].
Indeed, the classification task becomes quite challenging when dealing with high-dimensional hyperspectral data with few labeled samples. Recently, generative adversarial networks (GANs) have been exploited for sample generation, though it is not easy to acquire high-quality, authentic samples. In this context, GANs aim to generate more labeled samples by mimicking labeled data and to provide high-quality realistic data to increase the number of training samples [41]. Generally, GANs comprise two adversarial modules: a generator that learns the original data distribution and a discriminator that differentiates between the generated labeled data and the original ones [42]. For this purpose, an unsupervised 1D GAN was designed to capture the spectral distribution while increasing the training samples for HSI classification [43]. It was trained on unlabeled samples and then transformed into a classifier in a semisupervised setting. Hence, it is difficult to learn class features during the training process. Modified versions of GANs have considered the label information, such as conditional GAN (CGAN) [44], InfoGAN [45], deep convolutional GAN (DCGAN) [46], and categorical GAN (CatGAN) [47].
The aforementioned versions of GANs are susceptible to noise and disregard the relationships between spectral bands. Additionally, the generated samples are usually very different from the original ones in the spectral domain and therefore fail to improve classification results. This problem has been addressed in [48], where the authors developed a self-attention generative adversarial adaptation network (SaGAAN) to produce high-quality labeled samples in the spectral domain for hyperspectral image classification.

Domain Adaptation
Domain adaptation aims to reduce the domain shift between source and target datasets. According to [49,50,51], there are three possible approaches to domain adaptation.
The first approach consists of reducing the difference in the feature space between the target and source data. For this purpose, maximum mean discrepancy (MMD) is often used as a cost function to minimize the distance or to check for consistent feature extraction in both source and target domains [51]. Other investigations focus on feature extraction; however, Nielsen et al. [52] performed change detection by aligning both domains using canonical correlation analysis (CCA). The work is extended with a semisupervised approach, where change detection is performed on multi-scale data obtained from different sensors [53]. In [54], the domain alignment is achieved through an eigenproblem aiming at preserving the mismatch of labels and the geometric structure.
The second approach uses Generative Adversarial Networks (GANs) [42] for adversarial domain adaptation. The purpose of the GANs is to make the spectral characteristics of the source and target datasets similar. Tzeng et al. [55] show an example where the target dataset is translated to the source dataset using GANs. The translation involves a discriminator that distinguishes the two datasets. Most of the studies employ a feature extraction network to generate feature sets for the source and target domains [56,57,58]. The feature extraction network acts as a generator, reducing the classification loss for the source domain while concurrently maximizing the loss of the discriminator. Based on these approaches, Adversarial Discriminative Domain Adaptation (ADDA) was employed to learn feature extraction networks for the source as well as the target domain [55]. In [57], an adversarial feature augmentation method was proposed to achieve DA, in which the encoder is trained for the source and target domains. Inspired by the concept used in ADDA, Mesay et al. [59] implemented GAN-based DA for object classification in remote sensing data.
The last approach to domain adaptation creates a shared representation of both domains. In this method, one domain can be translated to another, and both domains can be translated into a common space. The method also provides a transfer function that facilitates translating one domain to another and translating back to the original state. CycleGAN is an example of the third approach and involves two discriminators that are used to translate one domain to another and vice versa [60].
The general methods of domain adaptation do not translate well to semantic segmentation [61]. Thus, adversarial and reconstruction procedures are chosen. Adversarial and constraint-based adaptations are performed at pixel level using architectures that exploit adversarial domain adaptation with GANs to transform source-like images [62]. Then, the images are segmented using a network that has been trained on the source dataset. In [63], the Domain-Invariant Structure Extraction (DISE) framework was adopted to transform images into domain-invariant structure and domain-specific texture representations. The bidirectional method prevents the translation model from reaching a point where the discriminator fails to identify the image from the same distribution setup and fails to align correctly [64].

Study Area and Data
To promote reproducibility of our experimentation, we rely on BreizhCrops, a large-scale time series benchmark dataset introduced in 2020 by Rußwurm et al. [28] for supervised classification of field crops from satellite data. The dataset comprises multivariate time series examples from the region of Brittany, France, for the 2017 season, from January 1 to December 31. In particular, the authors of the dataset exploited all available Sentinel-2 images from Google Earth Engine [65] and farmer surveys collected by the French National Institute of Forest and Geography Information (IGN) to collect more than 600 k samples divided into 9 classes with 45 temporal steps and 13 spectral bands. Most importantly, as shown in Figure 2, acquired data are equally split into distinct regional areas. Indeed, as regulated by the Nomenclature des unités territoriales statistiques (NUTS), the overall dataset is divided into the four NUTS-3 regions Côtes-d'Armor, Finistère, Ille-et-Vilaine, and Morbihan. That, in conjunction with the challenging nature of the dataset, makes BreizhCrops an ideal benchmark to test domain adaptation for multi-spectral and multi-temporal data for LC&CC.
As summarized in Figure 3, even though the authors of the dataset avoided broad categories, a class imbalance can be observed in the collected parcels, due to the nature of agricultural production, which focuses on a few dominant crop types. That constitutes a challenge for every classifier type, but it reflects the strong imbalance in real-world crop-type-mapping datasets. On the other hand, sample classes in the different regions are balanced, making BreizhCrops a perfect bench for testing domain adaptation strategies. Finally, to disentangle the performed domain adaptation analysis from the influence of the random variation of the atmospheric conditions, we exclusively make use of L2A bottom-of-atmosphere imagery, where data acquired over time and space share the same reflectance scale. Adjacency and slope effects are corrected by the MAJA processing chain [66], which employs 60-meter spectral bands to apply atmospheric rectification and detect clouds. Therefore, only ten spectral features are available for each parcel. Table 1 summarizes the number of samples collected for the domain adaptation experimentation, divided into classes and regions. In conclusion, multi-spectral, multi-temporal pixels are individually extracted for each parcel and consist of 10 spectral bands and 45 temporal steps each. The class imbalance highlighted by the number of parcels in Figure 3 is reflected in the number of samples in Table 1 used for all experimentation.

Methodology
In this work, unsupervised domain adaptation is considered in the field of land cover classification from satellite images. The study aims to tackle the low generalization capability of classifiers trained only on the dataset of a particular geographical region. Moreover, the lack of rich available datasets of labeled satellite images increases the interest in this challenge. In particular, the proposed methodology is intended to investigate the application of domain-adversarial training to self-attention-based models for multi-spectral, multi-temporal data.
In this section, a thorough description of the methodology is provided. First, we frame domain adaptation with the DANN method. Then, we briefly explain the Transformer encoder structure with self-attention adopted for multi-temporal crop classification. Finally, we describe the resulting architecture of the attention-based DANN, which is used to train a classifier with improved domain generalization.

Domain-Adversarial Neural Networks
Classifiers obtained with Deep Neural Networks often suffer from a lack of generalization related to possible variations in the appearance of the same objects. This problem is usually identified as a domain gap. In the land cover classification task, this situation recurs frequently and can be associated with the spectral shift affecting data collected in different regions at different times. The shift is often related to photogrammetric distortion or visual differences in the appearance of lands. Furthermore, when dealing with satellite images, a dataset usually needs to be created by labeling images for a specific region to train a classification model. Despite this time-expensive procedure, standard training does not guarantee satisfying performance on images of different regions.
Given the three main elements of DANN, namely the feature extractor $G_f$ with parameters $\theta_f$, the label predictor $G_y$ with parameters $\theta_y$, and the domain discriminator $G_d$ with parameters $\theta_d$, the total loss used to train DANN is given by the following expression, according to the authors [67]:
$$E(\theta_f, \theta_y, \theta_d) = \frac{1}{n}\sum_{i=1}^{n} L_y^i(\theta_f, \theta_y) - \lambda\left(\frac{1}{n}\sum_{i=1}^{n} L_d^i(\theta_f, \theta_d) + \frac{1}{n'}\sum_{i=n+1}^{N} L_d^i(\theta_f, \theta_d)\right) \qquad (1)$$
The first term $L_y$ is the label predictor loss, while the second one involves the domain discriminator loss $L_d$. The hyperparameter $\lambda$ can be tuned to weigh the contribution of the two learning terms. A more detailed analysis of the choice of $\lambda$ is proposed in the experiments section. $n$ and $n'$ are respectively the numbers of samples from the source and the target domains. In total, we have $N = n + n'$ samples used in the training. The expression of the total loss function also describes the principal goals of DANN: first, we want to obtain a label predictor with low classification risk. Second, we add a regularization term for the domain adaptation. To this end, we aim to find a set of parameters $\theta_f$ of the feature extractor that can map a generic input sample from either the source or the target domain to a new latent space of features, where the domain gap is reduced. On the other hand, the classification performance must not be affected. For this reason, the extracted features should be discriminative as well as domain-invariant. According to this goal, the optimal choice of parameters $\theta_f$ and $\theta_y$ is the one that minimizes the total loss function, keeping $\theta_d$ unchanged. By contrast, the domain discriminator parameters $\theta_d$ are updated to maximize the loss while not changing the other ones:
$$(\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \hat{\theta}_d), \qquad \hat{\theta}_d = \arg\max_{\theta_d} E(\hat{\theta}_f, \hat{\theta}_y, \theta_d)$$
In the original paper of DANN, the parameters of each piece of the neural network model are updated with a classical Stochastic Gradient Descent (SGD) optimizer. Here we instead use Adam (adaptive moment estimation), another popular optimization algorithm, introduced in [68]. Parameters $\theta_f$, $\theta_y$, and $\theta_d$ are updated according to its rules.
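As an illustration of how the two learning terms combine, the following minimal sketch computes a total loss of the form used in the original DANN formulation: the mean label loss over source samples minus λ times the sum of the mean domain losses over source and target samples. The function name and the plain lists of per-sample losses are assumptions made for the example; they are not taken from the authors' code.

```python
def dann_total_loss(label_losses_src, domain_losses_src, domain_losses_tgt, lam):
    """Illustrative DANN objective (hypothetical helper, not the authors' code).

    label_losses_src:  per-sample label losses L_y on the n source samples
    domain_losses_src: per-sample domain losses L_d on the n source samples
    domain_losses_tgt: per-sample domain losses L_d on the n' target samples
    lam:               weight of the domain term (lambda)
    """
    n = len(label_losses_src)
    n_prime = len(domain_losses_tgt)
    if len(domain_losses_src) != n:
        raise ValueError("source label and domain losses must have equal length")

    # Mean classification loss, computed on labeled source samples only.
    label_term = sum(label_losses_src) / n
    # Domain loss averaged separately over the source and the target samples.
    domain_term = sum(domain_losses_src) / n + sum(domain_losses_tgt) / n_prime
    # The feature extractor and label predictor minimize this value,
    # while the domain discriminator maximizes it.
    return label_term - lam * domain_term
```

With `lam = 0` the objective reduces to standard supervised training on the source domain, which is the baseline the adversarial term is meant to improve on.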
As detailed in the original Adam paper, the first moment $m$ (mean) and the second moment $v$ (uncentered variance) are estimated as exponential moving averages computed with the gradients obtained from each mini-batch. For the specific case of DANN, the gradients used to estimate the Adam moments change for each element $G_f$, $G_y$, $G_d$ of the DANN structure. For example, the feature extractor gradients $\partial L_y^i / \partial \theta_f$ and $\partial L_d^i / \partial \theta_f$ are used to compute $m_{f,y}$ and $m_{f,d}$. Conversely, the gradients obtained from the label predictor, $\partial L_y^i / \partial \theta_y$, and from the domain discriminator are only used to update their respective moments $m_y$ and $m_d$.
The feature extractor and the domain discriminator play adversarial roles during the training process. A satisfying feature extractor can fool the domain discriminator by forwarding a vector of domain-invariant features. The role of the domain discriminator is to improve and evaluate this ability. A key intuition in the DANN method is to carry out the adversarial training with a standard backpropagation of the gradients, thanks to a custom Gradient Reversal Layer (GRL) between the feature extractor and the domain discriminator. This particular layer does not add other parameters to the model but changes the sign of the upstream gradients. The GRL operation can be formulated with $R(x)$ in the following mathematical expressions for the forward and backpropagation steps:
$$R(x) = x, \qquad \frac{dR}{dx} = -I$$
where $I$ is the identity matrix. Hence, by performing optimization steps on the resulting DANN architecture, we can update parameters to reach saddle points of the total loss function reported in (1).
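The behavior of the Gradient Reversal Layer can be sketched in a few lines: the forward pass is the identity, while the backward pass flips the sign of the incoming gradient. The class below is a framework-free illustration with hypothetical names; the optional scale factor `lam` is an assumption that follows the common practice of folding λ into the reversal, and in a real model the same logic would be written as a custom autograd operation of the chosen deep learning library.

```python
class GradientReversal:
    """Illustrative gradient reversal layer (hypothetical, framework-free)."""

    def __init__(self, lam=1.0):
        # lam scales the reversed gradient; lam = 1 gives a pure sign flip.
        self.lam = lam

    def forward(self, x):
        # Forward pass: features reach the domain discriminator unchanged.
        return list(x)

    def backward(self, upstream_grad):
        # Backward pass: the gradient coming from the domain discriminator
        # is negated (and scaled) before reaching the feature extractor.
        return [-self.lam * g for g in upstream_grad]


grl = GradientReversal(lam=0.2)
features = grl.forward([1.0, -2.0])       # identical to the input
reversed_grad = grl.backward([0.5, -1.0])  # sign-flipped and scaled
```

Because the layer has no trainable parameters, it changes only the direction in which the feature extractor is updated, which is what lets a single backpropagation pass train both adversaries.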

Classification of Multi-Spectral Time Series Data with Self-Attention
Self-attention, popularized by the Transformer model in 2017 [69], has provided a considerable boost in machine translation performance while being more parallelizable and requiring significantly less time to train. Nevertheless, the introspection capability behind the success of Transformers is not limited to natural language processing, but can be adapted to any time series analysis to filter data and focus on the more relevant aspects of the representation.
The i-th sample pixel of a multi-spectral, multi-temporal acquisition can be represented as a matrix $X^{(i)} \in \mathbb{R}^{t \times b}$, where $t$ is the temporal dimension and $b$ is the number of spectral bands. Therefore, it is a 1D sequence of tokens, $(x_1, \ldots, x_t)$, with each token in $\mathbb{R}^{b}$, that can easily be linearly projected to feed a standard Transformer encoder. The encoder maps a temporal input sequence $X \in \mathbb{R}^{t \times b}$ to a continuous representation $X^{L} \in \mathbb{R}^{t \times d_{\mathrm{model}}}$, where $L$ is the output layer of the Transformer model and $d_{\mathrm{model}}$ is the constant latent dimension of the projection space.
Self-attention, through local multi-head dot-product self-attention blocks, can easily manipulate the temporal sequence, finding correlations between different time-steps and completely avoiding the use of recurrent layers. The dot-product self-attention operation is based on a trainable associative memory with key and value vector pairs of dimension $d$. For a sequence of $t$ query vectors, arranged in a matrix $Q \in \mathbb{R}^{t \times d}$, the self-attention operation is described by the following expression:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$
where the Softmax function is applied over each row of the input matrix and $K \in \mathbb{R}^{t \times d}$ and $V \in \mathbb{R}^{t \times d}$ are the key and value matrices, respectively. Query, key, and value matrices are themselves computed from a sequence of $t$ input vectors with dimension $d_{\mathrm{model}}$ using linear transformations:
$$Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V$$
Finally, multi-head dot-product self-attention is defined by applying $h$ self-attention functions to the input $X$. Each head provides a sequence of size $t \times d$. These $h$ sequences are rearranged into a $t \times dh$ sequence that is linearly projected into $t \times d_{\mathrm{model}}$.
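To make the single-head operation concrete, the sketch below implements scaled dot-product attention, Softmax(QKᵀ/√d)V, with plain Python lists. The helper names are hypothetical and the code is meant as a readable illustration under that standard formulation, not as the implementation used in the experiments; a practical model would rely on the batched, multi-head layers of a deep learning framework.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def attention(q, k, v):
    """Scaled dot-product attention for t x d query, key, and value matrices."""
    d = len(q[0])
    scores = matmul(q, transpose(k))            # t x t similarity scores
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)                   # t x d weighted sum of values


# Two time-steps, d = 2: all-zero queries give uniform attention weights,
# so every output row is the average of the value rows.
values = [[1.0, 2.0], [3.0, 4.0]]
output = attention([[0.0, 0.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], values)
```

Each output row is a convex combination of the value rows, which is how one time-step gathers information from every other time-step without recurrence.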
Subsequently, after the Transformer encoder, the output representation $X^{L} \in \mathbb{R}^{t \times d_{\mathrm{model}}}$ can be exploited to perform a classification of the input sequence. Indeed, that can be achieved by further processing the output encoder matrix and feeding a classification head trained to map the hidden representation to one of the $k$ classes.
Several approaches have been proposed in the literature to obtain this result. In [70,71], a learnable embedding is prepended to the input sequence, and its state at the output of the Transformer encoder serves as a hidden representation of the membership class. In that case, only that output token is fed to the classification head to obtain the final prediction. Alternatively, the output sequence can be averaged or processed with a max operation over the temporal dimension [72]. Nevertheless, regardless of the type of processing applied to $X^{L}$, the encoder will adapt to process the sequence properly and embed the information needed for the classification task. In conclusion, a Transformer encoder can be repurposed to process a multi-spectral input sequence and find valuable correlations between the different time-steps to perform LC&CC with a high level of accuracy.
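The two pooling alternatives mentioned above can be illustrated on a toy encoder output with t = 3 time-steps and a latent dimension of 2. The function names are hypothetical and the values are invented for the example; the point is only that both operations collapse the temporal axis into a single vector that a classification head can consume.

```python
def max_over_time(sequence):
    """Element-wise maximum over the temporal axis of a t x d sequence."""
    return [max(column) for column in zip(*sequence)]


def mean_over_time(sequence):
    """Element-wise average over the temporal axis of a t x d sequence."""
    steps = len(sequence)
    return [sum(column) / steps for column in zip(*sequence)]


# Toy encoder output: 3 time-steps, latent dimension 2 (illustrative values).
encoded = [[1.0, 5.0],
           [3.0, 2.0],
           [2.0, 4.0]]

pooled_max = max_over_time(encoded)    # one vector of length 2
pooled_mean = mean_over_time(encoded)  # one vector of length 2
```

Max pooling keeps, for each latent feature, the strongest response along the season, whereas averaging smooths over all acquisitions.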

DANN for Land Cover and Crop Classification
We employ DANN in conjunction with self-attention-based models to bridge the domain gap between different geographical regions. The overall architecture of the adopted methodology is shown in Figure 4. First, an input sequence $X \in \mathbb{R}^{t \times b}$ is linearly projected to the constant latent dimension $d_{\mathrm{model}}$ of the Transformer model. Moreover, a Transformer encoder does not contain recurrence or convolution to make use of the order of the sequence. Therefore, some positional encoding is injected to provide information about the relative or absolute position of the tokens in the sequence. The positional encodings have the same dimension $d_{\mathrm{model}}$ as the projected sequence, so that the two can be summed. Guided by experimentation, as in [71], we adopt a learnable positional encoding instead of the sine and cosine functions with different frequencies of [69]. The resulting pre-processed input sequence $X^{l_0} \in \mathbb{R}^{t \times d_{\mathrm{model}}}$ feeds the Transformer encoder, parameterized by $\Theta_f$, which provides as output a continuous representation $X^{L} \in \mathbb{R}^{t \times d_{\mathrm{model}}}$. Subsequently, we make use of the max function over the temporal axis to extract a token $x^{L} \in \mathbb{R}^{d_{\mathrm{model}}}$ from the output sequence. The extracted representation constitutes the input for both the LC&CC and the domain multi-layer perceptron classifiers. The first network provides a probability distribution over the $k$ different classes, $\hat{y}_k$. On the other hand, the domain classifier outputs the probability, $\hat{d}_2$, that the extracted representation $x^{L}$ belongs to the target or source domain. Using the cross-entropy loss function for both classifiers, it is possible to compute the respective gradients and update the weights $\Theta_f$ of the feature extractor. Indeed, by inverting the sign of the gradients $\nabla L_d(\Theta_d)$ derived from the domain classifier and multiplying them by a scale factor $\lambda$, we can progressively reduce the distance between the latent spaces of the two domains while training the encoder on the classification task. Overall, the proposed training framework

Experiments and Discussion
We experiment with the proposed methodology on the four regions of the multi-temporal satellite BreizhCrops dataset presented in Section 3. As explained in the same section, we consider this dataset an optimal choice to train and test new domain adaptation methods exploiting labeled multi-temporal data. The first main objective of the conducted experimentation is to investigate how the classification performance of a state-of-the-art model for LC&CC is affected by a lack of generalization towards different geographical regions. Then, we highlight how adversarial training can mitigate the domain gap and significantly boost performance for source and target regions with marked distribution distance. It is important to remark that the method relies on the availability of samples of both source and target domains, whereas only source labels are required, not allowing direct applicability of transfer learning techniques. Finally, in the last part of the section, the obtained results are discussed and inspected through dimensionality reduction techniques, validating the proposed method for practical use.

Experimental Settings
We carried out a complete set of experiments to compare the Transformer encoder classifier performance with and without DANN. The standard classifier is trained separately on each of the single regions of the dataset, then tested on the other ones. By contrast, DANN models are trained on each source-target pair to gain the desired adaptation. In the final architecture, the classifier model comprises a Transformer encoder feature extractor and a final classification stage. In all experimentation, the Transformer encoder receives as input a batch of 256 tensors with t = 45 temporal steps and b = 10 spectral bands in the image samples. Moreover, to linearly project the temporal sequence to the constant latent dimension of the encoder, the input is first passed to a dense layer with 64 units. Therefore, d model is equal to 128. On the other hand, the multi-head attention Transformer encoder is defined with a number of layers and attention heads equal to n layers = 3 and n heads = 2. Finally, the dimension of the internal fully connected layers is d inner = 128. The rectified linear unit (ReLU) is the non-linear activation function used for all neurons of the encoder.
The LC&CC classification stage is a simple multi-layer perceptron head composed of a normalization layer, a fully connected layer with 128 units and ReLU activation, and a final layer with k = 9 neurons. For the DANN experiments, the domain predictor is identical to the multi-layer perceptron head of the LC&CC classifier, with 128 units and a ReLU activation. However, the number of neurons in its final layer is set to d = 2, since we always perform adaptation to a single target domain.
A cross-entropy loss function is chosen to train both classifiers. The parameters of both models are updated using the Adam optimizer with β_1 = 0.9, β_2 = 0.999 and ε = 1 × 10^−7. The number of epochs is always fixed to 250. The learning rate follows an exponential decay policy from a starting value of 0.001, with a decay factor of 0.99^epoch applied at each epoch. A key point in the experimental settings is the domain adaptation parameter λ. It acts as a regularization parameter, since it regulates the impact of the domain discriminator gradients on the feature extractor during training. Therefore, it can be considered the principal hyper-parameter to tune when using DANN. We always use a scheduling policy for λ, as suggested in the original publication of DANN:

$$\lambda = \lambda_{max} \left( \frac{2}{1 + \exp(-\gamma \, p)} - 1 \right)$$

where p is the training progress, increasing linearly from 0 to 1, and λ_max is the plateau value reached. This is effectively the value of λ used for the second half of the training, and it affects the final generalization performance of the model. The parameter γ = 10 defines the slope of the curve and is fixed to this value so that λ_max is reached in a suitable number of epochs. A scheduled value of λ allows the feature extractor to learn the basic features for classification during the first epochs. It then adjusts the mapping function so that the source and target feature distributions overlap by the end of the training process. As shown in Figure 5, different values of λ_max are tested to study the response of the model. Among the tested values, λ_max = 0.2 gave the most robust adaptation improvement of the classifier. As already explained at the beginning of the section, the classifiers are trained and tested on all possible combinations of regions to quantify the existing domain gap.
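The two schedules described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the λ formula is taken to be the one from the original DANN publication rescaled by λ_max, the training progress p is assumed to be epoch / (n_epochs − 1), and the function names `dann_lambda` and `learning_rate` are hypothetical.

```python
import math


def dann_lambda(progress: float, lambda_max: float = 0.2, gamma: float = 10.0) -> float:
    """Scheduled adaptation weight for a training progress in [0, 1].

    Assumed form: lambda_max * (2 / (1 + exp(-gamma * p)) - 1).
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    return lambda_max * (2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0)


def learning_rate(epoch: int, lr0: float = 1e-3, decay: float = 0.99) -> float:
    """Exponentially decayed learning rate: lr0 * decay ** epoch."""
    return lr0 * decay ** epoch


if __name__ == "__main__":
    n_epochs = 250
    for epoch in (0, 25, 125, 249):
        p = epoch / (n_epochs - 1)  # assumed mapping from epoch to progress
        print(epoch, round(dann_lambda(p), 4), learning_rate(epoch))
```

With γ = 10, the scheduled value is already within about 1.5% of λ_max at the midpoint of training, which is consistent with λ_max being the effective value for the second half of the run.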
The classification performance is evaluated using three metrics chosen among those proposed in the BreizhCrops dataset benchmarks: accuracy, F1-score and K-score. The last metric is Cohen's kappa [73], computed as κ = (p_o − p_e)/(1 − p_e), where p_o and p_e are the empirical and expected probabilities of agreement on a label. In addition, we use the Maximum Mean Discrepancy (MMD) metric, presented in Section 5.2, to quantitatively evaluate the distance between the source and target distributions.
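As an illustration of the K-score, the following sketch computes Cohen's kappa from a confusion matrix. The function name `cohen_kappa` and the convention of rows as true labels and columns as predictions are assumptions made for this example only.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: true, columns: predicted)."""
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    if n == 0:
        raise ValueError("confusion matrix is empty")
    # Empirical agreement: fraction of samples on the diagonal.
    p_o = sum(confusion[i][i] for i in range(k)) / n
    # Expected agreement: product of the marginal frequencies, summed over classes.
    row_tot = [sum(confusion[i]) for i in range(k)]
    col_tot = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    p_e = sum(row_tot[i] * col_tot[i] for i in range(k)) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when expected agreement equals 1")
    return (p_o - p_e) / (1.0 - p_e)
```

For the two-class matrix [[20, 5], [10, 15]], p_o = 0.7 and p_e = 0.5, so κ = 0.4.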

Maximum Mean Discrepancy
MMD is a statistical test originally proposed in [74] to measure the distance between two distributions. MMD is widely used in domain adaptation, since it suits the need to understand whether the features extracted from the source and target domains overlap. MMD can be directly exploited as a loss function for adversarial training of generative models or for domain adaptation purposes, as shown in [75,76]. However, in this work we limit its usage to showing the results of the Transformer encoder DANN in terms of reduction of feature distances. Formally, MMD is a kernel-based difference between feature means. Given a set of m samples X with a probability measure P, the feature mean can be expressed as:

$$\mu_P = \mathbb{E}_{X \sim P}\left[\varphi(X)\right]$$

where φ(X) is the feature map that maps X to a new feature space F. If it satisfies the necessary theoretical conditions, a kernel-based approach can be used to compute the inner product of the feature means of two distributions of samples X ∼ P and Y ∼ Q:

$$\langle \mu_P, \mu_Q \rangle_F = \mathbb{E}_{X \sim P,\, Y \sim Q}\left[\langle \varphi(X), \varphi(Y) \rangle_F\right] = \mathbb{E}_{X \sim P,\, Y \sim Q}\left[k(X, Y)\right] \tag{12}$$

At this point the MMD can be defined as the distance between the feature means of X ∼ P and Y ∼ Q:

$$MMD^2(P, Q) = \left\| \mu_P - \mu_Q \right\|_F^2$$

which can be expressed in more detail using Equation (12):

$$MMD^2(P, Q) = \mathbb{E}_{X, X' \sim P}\left[k(X, X')\right] - 2\, \mathbb{E}_{X \sim P,\, Y \sim Q}\left[k(X, Y)\right] + \mathbb{E}_{Y, Y' \sim Q}\left[k(Y, Y')\right]$$

However, an empirical estimate of MMD needs to be computed, since in a real case only samples are available instead of the explicit formulation of the distributions. It is possible to obtain the MMD expression by considering the empirical estimates of the feature means based on their samples:

$$\widehat{MMD}^2(X, Y) = \frac{1}{m(m-1)} \sum_{i} \sum_{j \neq i} k(x_i, x_j) - \frac{2}{m^2} \sum_{i} \sum_{j} k(x_i, y_j) + \frac{1}{m(m-1)} \sum_{i} \sum_{j \neq i} k(y_i, y_j)$$

where x_i and y_i are the image samples from the source and target domains, and m is the number of samples of the considered subsets. Finally, we specifically use a Gaussian kernel with the following expression:

$$k(x, x') = \exp\left(-\frac{\left\| x - x' \right\|^2}{2\sigma^2}\right)$$

where σ is the kernel bandwidth.
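As a hedged illustration, an empirical MMD estimate can be sketched in plain Python. The sketch assumes the unbiased estimator with equally sized sample sets and a Gaussian kernel exp(−‖x − y‖² / (2σ²)); the bandwidth `sigma = 1.0`, the synthetic data and the function names are assumptions for this example, not values reported in this work.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel between two feature vectors given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2_unbiased(xs, ys, sigma=1.0):
    """Unbiased estimate of squared MMD between two equally sized sample sets."""
    m = len(xs)
    if m != len(ys) or m < 2:
        raise ValueError("need two sample sets of equal size m >= 2")
    # Within-set terms exclude the diagonal (i == j).
    k_xx = sum(gaussian_kernel(xs[i], xs[j], sigma)
               for i in range(m) for j in range(m) if i != j)
    k_yy = sum(gaussian_kernel(ys[i], ys[j], sigma)
               for i in range(m) for j in range(m) if i != j)
    # Cross term uses all m * m pairs.
    k_xy = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return k_xx / (m * (m - 1)) + k_yy / (m * (m - 1)) - 2.0 * k_xy / (m * m)


if __name__ == "__main__":
    rng = random.Random(0)
    source = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
    target = [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(100)]
    print(mmd2_unbiased(source, target))
```

Note that the unbiased estimate can be slightly negative when the two sample sets come from the same distribution. The double loops cost O(m²) kernel evaluations, so for subsets of thousands of samples a vectorized implementation would be preferable.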

Results Discussion and Applicability Study
In this section, we compare the Transformer classifier with and without DANN, highlighting the scenarios in which adversarial training gives a definite advantage when training a classifier for LC&CC. From the results in Tables 2 and 3 and Figure 6, it can be seen that DANN adversarial training improves the transferability of the classifier's knowledge to other domains in most cases. Nonetheless, we investigate a potential criterion to decide whether the transfer of learning from source to target can be effectively improved by DANN. In more detail, since DANN aims to overlap feature distributions, we look at the features extracted from a subset of 10,000 samples of each zone, which is considered representative of the whole dataset of that zone.
We use the set of extracted features to compute a numerical evaluation of the distance decrease and to give a graphical visualization of the effect of DANN. From a quantitative perspective, we propose Maximum Mean Discrepancy as the feature distance metric to detect the conditions in which DANN is an appropriate methodology. Since MMD is computed without considering the clustering of classes, only unlabeled image samples are needed. We use the PCA algorithm to compute the principal components of the extracted features and exploit them to provide 2D and 3D visualizations of relevant cases.
First, we can look at the MMD values obtained from both the Transformer encoder and DANN in Table 2. DANN is always able to reduce the distance between feature distributions. However, this is not always associated with an increase in classification performance. We find that key information is contained in the MMD value computed between source and target features extracted by the standard classifier. This simple test is crucial and can be performed without labels. The best improvement with DANN is reached with zone 2 as the source domain and zone 3 as the target domain. The improvement shown in Table 3, an increase of more than 30% in accuracy, corresponds to an initial MMD value of 0.6700 for this specific case, reduced by DANN to 0.0104. What can be deduced from this observation is that high MMD values indicate a lack of generalization of the classifier and a domain gap. It should also be considered that the geographical zones of interest are close to each other; hence, it is reasonable to find small domain gaps. A clear example is the case of zone 1, when chosen as the source domain. This proximity can be considered an additional difficulty of the case study. Therefore, it is possible that the same methodology, applied to other regions of the planet sharing the same crop categories, would show greater improvements. Another peculiar case is zone 4 (source) and zone 3 (target). The initial MMD value is low, without the intervention of DANN. However, a classification boost is still achieved.
We report a visual representation of the extracted features to add meaning to the previous considerations. In particular, Figures 7-9 show the 2D principal components obtained for the peculiar cases defined below:
• case 1: zone 2 (source), zone 3 (target). In this case DANN shows the greatest improvement, with a high initial value of MMD. Features are reported in Figure 7: in (a,b) when extracted by the standard Transformer encoder trained on the source domain, in (c,d) when extracted by DANN. The difference is visually clear. Feature distributions are matched by DANN, resulting in overlapping shapes for the source and target domains.
• case 2: zone 1 (source), zone 2 (target). In this case DANN shows the smallest improvement, with a low initial value of MMD. Features are reported in (a,b) of Figure 8 when extracted by the standard Transformer encoder, and in (c,d) of the same figure when extracted by DANN. They already appear similar without DANN.
• case 3: zone 4 (source), zone 3 (target). In this case DANN shows noticeable improvements, despite a low initial value of MMD. Features are reported in (a,b) of Figure 9 when extracted by the standard Transformer encoder, and in (c,d) of Figure 9 when extracted by DANN. As with case 1, the difference is visually clear, and the effect of DANN can be easily appreciated.
Finally, case 1 and case 2 defined above are also considered for a 3D representation. Figure 10 shows the obtained results. In each subplot of the figure, both source and target domain features are scattered. This visual perspective highlights the effect of the DANN method in both the worst and the best application scenarios.
In case 2, the difference between source and target features is small even without DANN, as shown in (a). By contrast, in case 1 the situation changes significantly from (c) to (d) thanks to adversarial training.
The proposed discussion underlines some interesting insights into the correlation between reducing the domain gap and improving classifier performance. The isolated cases considered provide a good reference for deciding whether it is reasonable and convenient to adopt the proposed DANN methodology for multi-spectral temporal sequences in land cover classification.

Conclusions
In this paper, we investigated adversarial training for domain adaptation with state-of-the-art self-attention-based models for LC&CC. Domain gaps between distinct geographical regions prevent the direct reuse of a trained model on areas different from the training domain, and the practical difficulty of acquiring labeled data prevents the direct application of transfer learning techniques. Our extensive experimentation highlights the advantages of applying the proposed methodology to Transformer models trained on multi-spectral, multi-temporal data, and the considerable gain in performance when the distribution distance between target and source regions is large. In particular, the best improvement obtained with DANN is an increase of more than 30% in classification accuracy, associated with an evident reduction of the feature distance metric MMD from 0.6700 to 0.0104. Moreover, our investigations lead to a clear identification of the scenarios where it is advantageous to apply the DANN domain adaptation mechanism. In more detail, we identified three different cases that highlight the strategy for a correct adoption of the methodology. A graphical visualization of the effect of DANN on the crop classification task has also been proposed and discussed, exploiting the 2D class-wise and 3D principal components of the crop feature distributions.
Future work may investigate the advantages and disadvantages of different domain adaptation techniques applied to LC&CC and extend our study to further geographical regions.

Figure 1: Visual representation of crop classification on zone 3 (Ille-et-Vilaine) of the BreizhCrops dataset. For each sub-image we show the complete region and a sub-area to facilitate the visualization of the advantage obtained by the proposed methodology. In particular, the crop predictions without our domain adaptation mechanism are shown on the left, while the same predictions performed with DANN are shown in the center. On the right, the ground truth crop labels can be visualized. The improvement in the classification with DANN is evident, especially in the reduction of misclassification of wheat and meadows.

Figure 2: Magnified view of the four NUTS-3 regions of Brittany, located in the northwest of France and covering 27,200 km². The strict division of the supervised BreizhCrops dataset into the four regions allows a formal and controlled analysis of domain adaptation for LC&CC with multi-spectral and multi-temporal data.

Figure 3: Class frequencies divided among the four NUTS-3 regions of Brittany. The respective number of parcels highlights the strong class imbalance, reflecting the substantial imbalance in real-world crop-type-mapping datasets. However, the samples of each class are evenly divided among the four regions.
Domain-Adversarial Neural Networks (DANN) is a representation learning technique that allows a classifier to generalize better from a source domain to a target domain. This domain adaptation method consists of adding a branch to the original feed-forward architecture of the classifier and carrying out adversarial training. From a generic perspective, it is possible to identify three main components of the DANN: a feature extractor with parameters θ_f, a label predictor with parameters θ_y, and a domain classifier with parameters θ_d. The feature extractor is the first block of the DANN model. It is responsible for learning the function G_f : X → R^d, which maps the input samples X to a d-dimensional vector containing the extracted features. The label predictor function, G_y(G_f(X)), computes the label associated with the predicted class of the sample. The domain discriminator function, G_d(G_f(X)), distinguishes between source and target domains given the extracted features. The combination of feature extractor and label predictor gives the complete classifier model. The domain classifier is a secondary branch, similar to the label predictor, which receives the feature vector extracted by the first block of the network.
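The interplay between the three components can be illustrated with a minimal sketch of one parameter update in the gradient-reversal form of the original DANN publication. The function name `dann_step`, the flat lists of parameters and the pre-computed gradients are hypothetical and chosen only for illustration; in practice an automatic differentiation framework would supply the gradients.

```python
def dann_step(theta_f, theta_y, theta_d,
              grad_label_f, grad_label_y,
              grad_domain_f, grad_domain_d,
              lam, lr):
    """One gradient-descent update of the three DANN parameter groups.

    grad_label_*  : gradients of the label loss w.r.t. theta_f and theta_y
    grad_domain_* : gradients of the domain loss w.r.t. theta_f and theta_d
    lam           : adaptation weight applied to the reversed domain gradient
    lr            : learning rate
    """
    # Feature extractor: descends the label loss but ASCENDS the domain loss
    # (reversed gradient, scaled by lam), pushing features to be domain-invariant.
    new_f = [p - lr * (gy - lam * gd)
             for p, gy, gd in zip(theta_f, grad_label_f, grad_domain_f)]
    # Label predictor: ordinary descent on the label loss.
    new_y = [p - lr * g for p, g in zip(theta_y, grad_label_y)]
    # Domain classifier: ordinary descent on the domain loss.
    new_d = [p - lr * g for p, g in zip(theta_d, grad_domain_d)]
    return new_f, new_y, new_d
```

Setting `lam = 0` recovers standard supervised training of the feature extractor and label predictor, with the domain classifier trained on top of features it cannot influence.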

Figure 4: Overview of the overall framework to train a Transformer encoder with domain-adversarial training. The multi-spectral temporal sequence X of size t × b is first linearly projected and fused with a position encoding. Subsequently, the self-attention-based model processes the input series and, through a max operation applied to the last layer of the encoder, it is possible to extract a token x^L of dimension d_model from the output sequence. Finally, gradients derived from the LC&CC and domain classifiers train the network while keeping the distributions of the source and target domains close.
provides an effective solution to transfer the acquired knowledge of a model to a diverse region, exploiting only the original nature of the data.

Figure 5: λ scheduling: the value of the domain adaptation parameter λ is changed during training according to an exponentially growing trend that saturates at λ_max. This allows the feature extractor to learn basic features during the initial epochs. Different final λ_max values are tested to study the right level of adaptation required in the different cases: 1, 0.5 and 0.2. λ_max = 0.2 is the best choice for an overall adaptation improvement of the classifier in the different regions. The parameter γ influences the slope of the curve and is kept constant at 10 to let λ reach the desired value in a suitable number of epochs.

Figure 6: Class-wise comparison of classification results on zone 3 (target), selecting zone 2 as the source domain. The confusion matrix obtained with the Transformer encoder trained on zone 2 and tested on zone 3 is shown in (a) on the left. Panel (b) on the right shows the classification results of the DANN model tested on zone 3. DANN clearly mitigates the prediction errors, particularly for relevant classes such as Corn, Permanent Meadows and Temporary Meadows.

Figure 7: 2D feature visualization obtained with PCA, extracted with the Transformer encoder trained on the source domain and with the Transformer DANN model trained on the specific source-target domains. A comparison between the 2D feature distributions is proposed for the case of zone 2 (source) and zone 3 (target). Panels (a,b) show the features extracted with the Transformer encoder: (a) reports the features of the source domain (zone 2) and (b) those extracted from the target domain (zone 3). In this case, features are mapped poorly in the target domain, with a consequent low classification accuracy. Panels (c,d) show the same features extracted with the Transformer DANN model. The positive effect of DANN in terms of feature overlap is evident compared to (a,b).

Figure 8: 2D feature visualization obtained with PCA, extracted with the Transformer encoder trained on the source domain and with the Transformer DANN model trained on the specific source-target domains. A comparison between the 2D feature distributions is proposed for the case of zone 1 (source) and zone 2 (target). Panels (a,b) show the features extracted with the Transformer encoder from the source (a) and target (b) domains: a low MMD distance indicates no need for domain adaptation. Panels (c,d) show the same features extracted with the Transformer DANN model, with no substantial differences.

Figure 9: 2D feature visualization obtained with PCA, extracted with the Transformer encoder trained on the source domain and with the Transformer DANN model trained on the specific source-target domains. A comparison between the 2D feature distributions is proposed for the case of zone 4 (source) and zone 3 (target). Panels (a,b) show the features extracted with the Transformer encoder from the source (a) and target (b) domains: despite a low initial MMD, the classifier accuracy can still be improved by reducing the domain gap. Panels (c,d) show the same features extracted with the Transformer DANN model, with a clear improvement of the feature mapping, which results in very similar distributions for the source and target domains.

Figure 10: 3D feature visualization and comparison. Panels (a,b) show the features extracted from zone 1 (source) and zone 2 (target), obtained with the Transformer encoder and with DANN, respectively. The Transformer encoder alone already maps features correctly on both domains. By contrast, the improvement provided by the DANN model is evident in panels (c,d), representing the features extracted from zone 2 (source) and zone 3 (target), where the Transformer encoder alone presents both a high MMD value and a low classification accuracy on the target domain.

Table 1: Summary of the number of samples per class divided among the four NUTS-3 regions of Brittany. Instances are derived from L2A bottom-of-atmosphere parcels to disentangle our analysis from variations in atmospheric conditions.

Table 2: Results of crop classification for the Transformer encoder classifier trained with and without DANN using λ_max = 0.2. The two models are trained and tested on all possible combinations of source/target domains available in the BreizhCrops dataset. Accuracy, F1-score and K-score are the metrics used to compare the classification quality. Training accuracy is also reported for the Transformer encoder classifier. Maximum Mean Discrepancy, computed on a subset of the extracted features of the source and target domains, shows the successful reduction of feature distance obtained with DANN.

Table 3: Comparison between the Transformer encoder classifier with and without DANN, in terms of the classification metrics reported in Table 2. This run of experiments is conducted with a scheduling of the adaptation parameter λ, with λ_max = 0.2.