Object-based multi-temporal and multi-source land cover mapping leveraging hierarchical class relationships

European satellite missions Sentinel-1 (S1) and Sentinel-2 (S2) provide at highspatial resolution and high revisit time, respectively, radar and optical imagesthat support a wide range of Earth surface monitoring tasks such as LandUse/Land Cover mapping. A long-standing challenge in the remote sensingcommunity is about how to efficiently exploit multiple sources of information and leverage their complementary. In this particular case, get the most out ofradar and optical satellite image time series (SITS). Here, we propose to dealwith land cover mapping through a deep learning framework especially tailoredto leverage the multi-source complementarity provided by radar and opticalSITS. The proposed architecture is based on an extension of Recurrent NeuralNetwork (RNN) enriched via a customized attention mechanism capable to fitthe specificity of SITS data. In addition, we propose a new pretraining strategythat exploits domain expert knowledge to guide the model parameter initial-ization. Thorough experimental evaluations involving several machine learningcompetitors, on two contrasted study sites, have demonstrated the suitabilityof our new attention mechanism combined with the extend RNN model as wellas the benefit/limit to inject domain expert knowledge in the neural networktraining process.


Introduction
Remotely sensed data collected by modern Earth Observation systems, such as the European Sentinel programme [1], are getting increasing attention in recent years to cope with Earth surface monitoring. In particular, the Sentinel-1 and Sentinel-2 missions are of interest, since they provide publicly available multi-temporal radar and optical images respectively, with high spatial resolution (up to 10 m) and high revisit time (up to five days). Thanks to these unprecedented spatial and temporal resolutions, data coming from such sensors can be arranged in Satellite Image Time Series (SITS). SITS have been employed to deal with several tasks in multiple domains ranging from ecology [2], agriculture [3], land management planning [4], and forest and natural habitat monitoring [5,6]. using agglomerate statistics starting from the available image set, to be subsequently used as samples for training and classification. HOb2sRNN is therefore conceived to perform object-based LULC mapping given the radar and optical object SITS available on the study area, the relative ground truth data and the associated land cover class hierarchy. Building upon the preliminary work presented in [15], as a major further contribution, here, we propose an architecture that is based on an extended RNN model enriched via a modified attention mechanism capable of fitting the specificity of SITS data. In addition, we also introduce a pretraining strategy to get the most out of the information available under the shape of hierarchical relationships between land cover classes. Last but not least important, differently from previous works on multi-source and multi-temporal land cover mapping that exploits DL methods [14,15], we also provide, in this study, a contribution related to the interpretability of the proposed model. More specifically, we investigate and discuss how the side information provided by HOb2sRNN can be leveraged to draw some connections between the way in which the model takes its decision and the agronomic knowledge we have.
With the aim to provide an in-depth assessment of the HOb2sRNN behaviour, an extensive experimental evaluation is conducted on two study sites with diverse land cover characteristics, namely the Reunion island and a part of the Senegalese groundnut basin, the latter being dominated by small scale agriculture and limited in the amount of available data. The results have underlined the effectiveness of our proposal when compared to competitive and recent approaches commonly leveraged to deal with land cover mapping task, including the work presented in [15].
The remainder of this work is structured, as follows: first, the study sites and associated data are introduced in Section 2; and then Section 3 describes the proposed method while the experimental settings and the evaluations are carried out and discussed in Section 4 and Section 5, respectively, and finally, Section 6 draws the conclusion.

Materials
The analysis was carried out on 2 study sites characterized by different landscapes and land cover classes: the Reunion island, a French overseas department located in the Indian Ocean and a part of the Senegalese groundnut basin located in central Senegal. The Reunion island covers an area of a little over 3000 km 2 , while the Senegalese site area is about 500 km 2 . The former benchmark involves 26 Sentinel-1 (S1) and 21 Sentinel-2 (S2) satellite images acquired during the year 2017 while the latter consists of 16-S1 and 19-S2 images collected between May and October 2018 (see Figure 1 for acquisition date details). q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q SENEGAL REUNION Figure 1. Overview of the acquisition dates of Sentinel-1 (S1) and Sentinel-2 (S2) images over the two study sites. S2 acquisitions were sparsed due to the ubiquitous cloudiness.

Sentinel-1 Data
The radar images were acquired in C-band Interferometric Wide Swath (IW) mode with dual polarization (VH and VV) and in ascending orbit. All images as retrieved at level-1C Ground Range Detected (GRD) from the PEPS platform (https://peps.cnes.fr/) were first radiometrically calibrated in backscatter values (decibels, dB) using parameters that were included in the metadata file, then coregistered with the Sentinel-2 grid and orthorectified at the same 10-m spatial resolution. Finally, multi-temporal filtering was applied to the time series in order to reduce the speckle effect.

Sentinel-2 Data
The optical images were downloaded from the THEIA pole platform (http://theia.cnes.fr) at level-2A top of canopy reflectance. Only 10-m spatial resolution bands (i.e., Blue, Green, Red and Near Infrared spectrum) containing less than 50% of cloudy pixels were considered in this analysis. A preprocessing was performed over each band to replace cloudy observations as detected by the supplied cloud masks through a multi-temporal gapfilling [4]. Cloudy pixel values were linearly interpolated using the previous and following cloud-free dates. Finally, the Normalized Difference Vegetation Index (NDVI) [32] was calculated for each date. NDVI was considered as supplementary optical descriptor since it captures well the vegetation activity which is subject to change over time.

Ground Truth
Considering the Reunion island (Reunion island land cover dataset is available online on the CIRAD dataverse under doi:10.18167/DVN1/TOARDN), the ground truth (GT) was built from various sources: the Registre Parcellaire Graphique (RPG) (RPG is part of the European Land Parcel Identification System (LPIS), provided by the French Agency for services and payment) reference data for 2014, GPS land cover records from June 2017 and the visual interpretation of very high spatial resolution (VHSR) SPOT6/7 images (1.5-m) completed by a field expert with knowledge of territory. Additional information about this dataset can be found in [33]. As regards the Senegalese site, the ground truth was built from GPS land cover records that were collected during the 2018 field campaign with the same approach as for the Reunion site followed by a visual interpretation of a VHSR PlanetScope image (3-m). Both operations were conducted by a specialist of the study area. For each site, the GT was assembled in Geographic Information System vector file, containing a collection of polygons each attributed with the corresponding land cover class. The Reunion island GT includes 6265 polygons that were distributed over 11 classes while the Senegalese site includes 734 polygons distributed over nine classes (See Tables 1 and 2). In order to inject specific knowledge in the learning process, we obtained from field experts, for each study site, a taxonomy of land cover classes (See Figures 2 and 3), getting two levels of representation before the target classification level described in Tables 1 and 2.   To analyse data at object-level, a segmentation was performed for each study site in close collaboration with the field experts to provide a convenient object layer. To this end, we used the VHSR images at hand (i.e., SPOT6/7 and PlanetScope) which have been coregistered with the corresponding Sentinel-2 grid to ensure a precise spatial matching. The VHSR images were segmented using the Large Scale Generic Region Merging (LSGRM) module in the Orfeo Toolbox [34] obtaining 14,465 segments on the Reunion island and 116,937 segments on the Senegalese site, respectively. The segmentation algorithm parameters were adjusted by visual interpretation via several trial processes, so that the final obtained segments fit as closely as possible land cover units of the study sites. Figures 4 and 5 show some details about the segmentation outcomes. Subsequently, for each study site, the GT polygons were spatially intersected with the obtained segments to provide radiometrically homogeneous class samples. This process resulted in new labeled segments of comparable size: 7908 for the Reunion island and 3084 segments for the Senegalese site (see Tables 1 and 2). Finally, the average pixel values corresponding to each of these segments were extracted over the time series, giving 157 features per segment (26 time stamps × two bands for S1 + 21 time stamps × five bands for S2) for classification on the Reunion island and 127 features per segment (16 time stamps × two bands for S1 + 19 time stamps × five bands for S2) for classification on the Senegalese site.   Figure 6 depicts the proposed deep learning architecture, named HOb2sRNN, for the multi-source SITS classification process. The architecture involves two branches: one for the radar SITS (left) and one for the optical SITS (right). At the end of the per-source analysis, the results of the two branches are merged and a final decision is made. The model automatically combines the multi-source and multi-temporal information in an end-to-end process. The output of the model is a land cover classification for each pair of time series (radar and optical). Each branch of the HOb2sRNN architecture can be decomposed in two parts: (i) the time series analysis through the extended Gated Recurrent Unit cell, we named FCGRU and (ii) the multi-temporal combination to generate per-source features employing a modified attention mechanism. Moreover, the per branch FCGRU outputs are concatenated and the attention mechanism is employed again to extract fused features. Finally, the extracted per branch and fused features are leveraged to produce the final land cover classification. Such learned features, named f eat rad , f eat opt , and f eat f used , indicate, respectively, the output of the radar branch, the optical branch, and the source fusion. In addition, the architecture is trained leveraging domain knowledge represented under the shape of hierarchy that organizes land cover classes in a taxonomy with class/subclass relationships. The hierarchical information is exploited to pretrain the HOb2sRNN architecture considering tasks of increasing complexity. Section 3.4 details the process.   Figure 6. Overview of the HOb2sRNN method. The architecture is composed of two branches, one for each source (radar and optical) SITS. Each branch processes the SITS by means of an enriched RNN cell we named FCGRU and an attention mechanism is employed on its outputs to extract the per source features. Furthermore, the same attention mechanism is employed on the concatenation of the per source outputs allowing to extract fused features. Finally, the per-source and fused feature sets are leveraged in order to provide the final classification.

Fully Connected Gated Recurrent Unit (FCGRU)
The first part of each branch is constituted by an enriched Gated Recurrent Unit that extends standard GRU [35]. We name such enriched GRU as Fully Connected GRU (FCGRU). In Figure 7 we illustrate the standard GRU unit and the introduced FCGRU. The FCGRU cell extends the GRU unit by involving two fully connected layers namely FC 1 and FC 2 at the beginning of the cell pipeline. Such layers preprocess the input time series information before starting the standard GRU unit transformation. Therefore, they allow for the architecture to extract an useful input combination for the classification task, enriching the original representation of the object time series. More specifically, FC 1 takes as input the object time series (radar or optical) and its output is used to feed FC 2 . A Hyperbolic Tangent (tanh) non-linearity is associated to each of the fully connected layers for the sake of consistency with the GRU unit that is mainly based on Sigmoid and tanh functions. The tanh activation function has an S-shape and is delimited in the range [−1, 1]. Successively, the standard GRU unit transformation is employed over the enriched representation (output of FC 2 ). It is composed of a hidden state h t−1 , the reset gate r t , and the update gate z t . The gates regulate the information to be forgotten or remembered during the learning process and deal with the vanishing and exploding gradient problem. The output of the unit is the new hidden state h t . Dropout was employed in the FCGRU cell on the ongoing states and between the two fully connected layers to prevent overfitting. The following equations formally describe the extended GRU cell: The symbol indicates an element-wise multiplication while σ and tanh represent Sigmoid and Hyperbolic Tangent function, respectively. x t is the time stamp input vector and x t is the enriched input vector representation. The different weight matrices W * , W * * , and bias vectors b * are the parameters learned during the training of the model.

Modified Attention Mechanism
The second part of the branches consists of a modified neural attention mechanism on top of the output hidden states produced by the FCGRU cell. Attention strategies [36][37][38] are widely used in one-dimensional (1D) signal or natural language processing to combine RNN outputs at different time stamps through a set of attention weights. In the traditional attention mechanism, the set of weights is computed using a So f tMax function so that their values ranges in [0, 1] and their sum is equal to 1 providing at the same time a probabilistic interpretation. Due to the sum constraint, the So f tMax attention has the property to prioritize one instance over the others making it well suited for tasks such as machine translation where each target word is aligned to one of the source word [39]. However, in the remote sensing time series classification context, forcing the sum of weights to 1 may not be fully beneficial for the attention model. In fact, considering a specific time series classification task where almost all of the time stamps are relevant for the problem, the use of a SoftMax function to compute attention weights will squash towards zero the attention weights since their sum should be one and finally the attention combination may not be efficient as expected. Therefore, relaxing this constraint could help the model to better weight the relevant time stamps independently. In our attention formulation, we attempted to address this point by substituting the So f tMax function with a Hyperbolic Tangent to compute attention weights. The motivation behind the tanh attention, in addition to the sum constraint relaxation, is that the learned attention weights will be in a wider range i.e., [−1, 1], also allowing negative values. The equations below describe the tanh attention formulation that we introduced: where H ∈ R N,d is a matrix obtained by vertically stacking all hidden state vectors h t i ∈ R d learned at the N different time stamps by the FCGRU; λ ∈ R d is the attention weight vector traditionally computed by a So f tMax function that we replaced by a tanh function; matrix W ∈ R d,d and vectors b, u ∈ R d are parameters learned during the process. The described attention mechanism is employed over the FCGRU outputs (hidden states) in the radar and optical branches to generate per-source features ( f eat rad and f eat opt ). Such features encode the temporal information related to the input sources. Furthermore, the per-source hidden states are concatenated and an additional attention mechanism is employed over them to generate fused features ( f eat f used ). Such features encode both temporal information and complementarity of radar and optical SITS. Thus, the architecture involves learning three sets of attention weights: λ rad , λ opt and λ f used , which refers, respectively, to the attention mechanisms employed over the radar, optical and concatenated hidden states.

Feature Combination
Once each set of features has been yielded, they are directly leveraged to perform the final land cover classification. The combination process involves three classifiers: one main classifier on top of the fused features and two auxiliary classifiers, one for each source features. The main classifier is composed of two fully connected layers and a So f tMax layer. The fully connected layers are associated to a ReLU non linearity and followed by a dropout layer each. The auxiliary classifiers are composed of one So f tMax layer each. Auxiliary classifiers [10,14,40] are used to strengthen the complementarity as well as the discriminative power of the per-source learned features. Their goal is to stress the fact that the learned features need to be discriminative alone i.e., independently from each other [10,14,41]. Subsequently, the cost function associated to the optimization of the three classifiers is: where L( f eat) is the loss computed by the categorical Cross-Entropy function and associated to the classifier fed with the features f eat. The contribution of the auxiliary classifiers is weighted by the parameter α. The final land cover class is obtained by combining the three classifier outcomes with the same weighting schema employed in the loss computation: where score f used and score source are respectively the prediction scores of the fused classifier and the auxiliary classifier associated to one of the radar or optical branch. We empirically set the value of α to 0.5 with the aim to enforce the discriminative power of the per-source learned features while privileging the fused features in the combination.

Hierarchical Pretraining Strategy
With the aim to leverage specific domain knowledge about the LULC classes, the HOb2sRNN parameters were learned exploiting the taxonomic organization associated to the classes. The training of the model is repeated for each level of the taxonomy, from the more general to the most specific (the target classification level). Specifically, we start training the model from the highest level of the hierarchy and then, continue training at the next level by reusing the previously learned weights for the whole architecture, excepting those that are associated to the classifiers, since level specific (see Figure 8). New weights were learned for the classifiers at each level of the taxonomy. The process is performed until we reach the target classification level. In summary, the hierarchical pretraining strategy allows for the model to focus first on high level classification problems (i.e., crops vs non crops) and, step by step, to smoothly adapt its behaviour to deal with classification problems of increasing complexity. In addition, this process allows the model to tackle the classification at the target level integrating a kind of prior knowledge on the task (based on high level classes) instead of addressing it completely from scratch.

Experiments
In this section, we present and examine the experimental results obtained on the study sites introduced in Section 2. We carried out several experimental analysis to provide a deep assessment of the HOb2sRNN behaviour: • an in-depth evaluation of the quantitative performances of HOb2sRNN model with respect to several other competitors; • a sensitivity analysis of the α hyperparameter to weight the auxiliary classifier contributions and an ablation study of input sources and main components of the architecture in order to characterize the interplay among them; • a qualitative analysis of land cover maps produced by HOb2sRNN and its competitors; and, • an inspection of the attention parameters learnt by the HOb2sRNN model with the aim to investigate to what extent such side information contributes to the model interpretability.

Experimental Settings
To assess the quality of HOb2sRNN, we chose several approaches as competitors: a version of the Random Forest classifier that works on the early fusion of the radar/optical sources: the radar and optical information are concatenated together and the model is fed with this information [17] (RF); a version of the Random Forest classifier that works on the late fusion of the source information, as described in [42]: a RF model is trained for each of the input sources and, then, the combination is achieved via the product of per-source RF model outputs (RF POE ); a Support Vector Machine (SVM), a Multi Layer Perceptron (MLP) as the one employed as main classifier of the HOb2sRNN architecture, the Temporal Convolutional Neural Network (TempCNN) proposed by [43] that performs convolutions on the temporal dimension of the time series information and the OD2RNN model proposed in [15]. Like the RF, the SVM and MLP models were run on the concatenation of the multi-temporal and multi-source data. For the TempCNN model, we set up an architecture with 2 branches, one per source, sharing the same convolutional structure as described in [43]. The outputs of the two branches were successively concatenated and fed into the classifier module. The OD2RNN model was employed with the same structure and parametrization considered in [15]. Trainable parameters of neural network approaches i.e., MLP, TempCNN, OD2RNN, and HOb2sRNN models are reported in Table 3 . The values of the datasets were normalized per band, when considering the time series, in the interval [0, 1]. The datasets were split into training, validation and test set with a proportion of 50%, 20% and 30% of samples (labeled segments), respectively. We imposed that segments belonging to the same ground truth polygon before the spatial intersection (see Section 2.3) were exclusively assigned to one of the data partition (training, validation or test) with the aim to avoid possible spatial bias in the evaluation procedure. The models were optimized via training/validation procedure [28] (settings are reported in Table 4). Their assessment was done using test set and considering following metrics: Accuracy (global precision), F1 score (harmonic mean of precision and recall), and Cohen's Kappa (level of agreement between two raters relative to chance). Because the model performances may vary depending on the split of the data due to simpler or more complex samples involved in the different partitions, all metrics were averaged over ten random splits of the datasets following the strategy mentioned above. Experiments were carried out on a workstation with an Intel Xeon CPU, 256 GB of RAM and four TITAN X GPU. In such environment, the HOb2sRNN model takes approximately 16 h (resp. 4.5 h) to complete training on Reunion island (resp. Senegalese site) while testing needs around one minute on both study sites. The neural net models were implemented using the Python Tensorflow library, while other implementations were obtained from the Python Scikit-learn library [44]. The implementation of the HOb2sRNN model is available at https://github. com/eudesyawog/HOb2sRNN.

Comparative Analysis
In this evaluation, we compare the results that were obtained by the different competing methods, considering their overall and per-class behaviours. Table 5 reports the average performances obtained on the two study sites. We can note that the proposed method outperformed its competitors on both study sites, although the performance gap is more pronounced on the Reunion island dataset than on the Senegalese site. This behaviour may be due to the fact that the Reunion island benchmark has more ground truth samples (about eight times) than the Senegalese dataset. In fact, deep learning models are known to be effective when trained on huge volumes of data. Concerning the other competing methods, RF and SVM achieve similar scores on the Reunion island, while SVM surpasses RF on the Senegalese site. On this latter benchmark, the SVM algorithm demonstrates to be well suited for dataset characterized by a limited set of labeled samples. We also note that, on both study sites, the RF POE competitor was less effective than the RF variant which is fed with the concatenated sources. As regards the MLP and TempCNN competitors, both achieved lower scores than HOb2sRNN on the Reunion island, while the performance of the MLP is comparable to that of HOb2sRNN on the Senegalese site. Moreover, the OD2RNN model performances on both study sites indicate the added value of extensions provided by the HOb2sRNN model. It should be noted that the relatively better performance obtained on the Senegalese site compared to the Reunion island (90.78 vs. 79.66) may come from the topography of the two sites. In fact, Reunion island is characterized by a rugged topography while the Senegalese site is essentially flat. Relief effects, like shadow or orientation, can induce biases in the discrimination of land cover classes impacting much more the Reunion island [14].  Figures 9 and 10 show the per-class F1 scores obtained by the different methods on the Reunion island and the Senegalese site, respectively. Concerning the Reunion site, we can observe that HOb2sRNN achieves the best performances on the majority of land cover classes excepted some classes where other competing methods i.e., RF or MLP obtained slightly better scores that are still comparable to the ones achieved by our framework. It is worth noting how the proposed method outperforms its competitors particularly on agricultural/vegetation classes such as Sugarcane, Pasture and fodder, Market gardening or Orchards. This particular efficiency on such classes suggests that the HOb2sRNN architecture is well suited to deal with the temporal dependencies characterizing these land cover classes. As regards the Senegalese site, HOb2sRNN per-class scores are moderate. It achieved the best scores on 4 land cover classes over 9 namely Fallows, Ponds, Cereals and Legumes while other competing methods outperformed its results especially on the Valley class. Nonetheless, it should be remarked that also in this case, HOb2sRNN obtained the best results on land cover classes that exhibit a time-varying behaviour. It is common to observe natural vegetation activity on fallows areas; ponds appear during the rainy season while cereals and legumes follow crop growth cycle. These findings are inline with the previous observations made on the Reunion island and confirm the fact that the proposed method is capable to leverage temporal dependencies to made its decisions. To go further with the per-class analysis, we also investigated the confusions matrices of each method on the two study sites. Concerning the Reunion island (Figure 11), all of the methods exhibit similar behaviours. This is particularly evident between Greenhouse crops and Urbanized areas classes even if confusions between land cover classes are reduced from RF POE (Figure 11a) to HOb2sRNN (Figure 11b) as can be observed. Overall, the per-class analysis is coherent with the findings we got from the previous analysis. Apropos of the Senegalese site (Figure 12), confusions vary sensibly regarding the different methods. RF POE (Figure 12a) and RF (Figure 12b) exhibits more confusions on Bushes and Fallows classes that are highly misclassified with Cereals and a little bit less with Legumes, while Ponds are often confused with Wet areas. The other competitors tend to reduce these confusions, as underlined by their confusion matrix.    Table 1 for corresponding labels.   Table 2 for corresponding labels.

Sensitivity and Ablation Analysis
In this part of the evaluation, we conduct a sensitivity analysis regarding the influence of the weights of the auxiliary classifiers and, then, we perform several ablation studies to better characterize the behaviour of our framework. For the latter point, we considered the role of multi-source information (radar vs optical SITS), the role of the different architecture components disentangling their contributions and the interplay between spectral bands and radiometric indices (NDVI) regarding the optical signal.

Sensitivity Analysis on the Weights of Per-Source Auxiliary Classifiers
Here, we analyse how the α weights that are associated to the contribution of the auxiliary classifiers influence the classification performances of our framework. To this end, we vary α in the range [0.1, 0.7] with a step of 0.1 and we consider the F1 score measure. Figure 13 reports the results of such analysis. Regarding the Senegal study site, the F1 score varies from 89.35 when α is equal to 0.1 to 90.78 when α is equal to 0.5. Concerning the Reunion study site, the F1 score varies from 79.02 when α is equal to 0.4 to 79.87 when the weight is equal to 0.6.
Generally, we can note that all of the obtained F1 score are competitive w.r.t. the behaviour exhibited by the other methods (see Table 5) and, the performances of HOb2sRNN are quite stable when considering the different values on which α ranges.

Ablation on the Multi-Source Data
In this stage of experiments, we considered only one source time series (radar or optical) to perform the land cover classification. An unique branch was employed for the TempCNN, OD2RNN and HOb2sRNN models. Regarding the results that are reported in Tables 6 and 7, the radar time series has a specific behaviour for each of the considered study site. If radar signal is quite discriminating on the Senegalese site, this is not really the same for the Reunion island, considering how poorly the competing methods trained on the radar SITS performed, especially the SVM algorithm. As mentioned earlier (Section 4.2.1), the Reunion island is characterized by a rugged relief compared to the Senegalese site which is almost flat. Because radar signal is much more sensitive to high ground relief than optical, the performances of the competing methods are negatively impacted when trained with radar data only on the Reunion. Thus, the majority of the models i.e., (RF, SVM, MLP, TempCNN, and OD2RNN) performed slightly worst or equally when combining both sources due to the noise that seems to come from the radar signal. However, HOb2sRNN was able to better leverage the complementarity between radar and optical data to improve with both sources. This behaviour is also noticeable on the Senegalese site where HOb2sRNN achieved better performances than its competitors even though all the competitors improved with both sources. For the rest, regardless of the study site, there is no trend on which competing the method better deals with radar or optical time series. Nonetheless, we have observed on both sites that the SVM algorithm seems not well suited to exploit radar information. Table 6. F1 score, Kappa, and Accuracy of the different methods considering the per-source ablation analysis on the Reunion island (results averaged over ten random splits). We have highlighted the best scores in bold.

Ablation on the Main Components of the Architecture
In this part, we investigate the interplay among the different components of HOb2sRNN and we disentangle their benefits in the architecture. We considered both time series (radar and optical), but excluded one of the following components at a time: the three attention mechanisms involved in the architecture (naming NoAtt), the hierarchical pretraining process (naming NoHierPre) and the enrichment step involved the FCGRU cell which is equivalent to using a GRU cell (naming NoEnrich). We also investigated the use of traditional So f tMax attention mechanism instead of the modified one in the HOb2sRNN architecture. More in detail, this variant also involves the feature enrichment component in the FCGRU cell and the hierarchical pretraining process. We named it SoftMaxAtt. The results are reported in Table 8. Concerning the use of attention mechanisms or not (NoAtt, So f tMaxAtt and HOb2sRNN), we can observe how these components contribute to the final classification performances on both study sites, more on the Reunion island (about 2 points of improvement) than the Senegalese site (approximately one point). We can also note that the So f tMaxAtt variant performs similarly to the NoAtt variant and lower than the HOb2sRNN architecture confirming our hypothesis that relaxing the constraint that the attention weights may sum to 1 in the attention process could be more suitable for remote sensing context. As regards the use of the hierarchical pretraining process (noHierPre vs HOb2sRNN), we can note the added value of such step on both study sites obtaining more than 1 point of improvement. These results seem to underline that involving domain specific knowledge in the pretraining process of neural networks can improve the final classification performances. Finally, the enrichment step carried out in the FCGRU cell (noEnrich vs. HOb2sRNN) also demonstrates its usefulness in both study sites, however it seems to be more effective on the Senegalese site. Table 8. F1 score, Kappa, and Accuracy considering different ablations of HOb2sRNN on the study sites (results averaged over ten random splits). We have highlighted the best scores in bold.

Ablation on Optical Information
We evaluate here whether the NDVI index as additional optical descriptor has an impact on the final land cover classification obtained using the HOb2sRNN architecture. Indeed, considering NDVI index as additional feature in land cover classification task was obvious when training conventional machine learning algorithms, since such techniques cannot extract specialized features for a specific task at hand [46]. Nowadays, the new paradigm related to deep (representational) learning [46] is emerging and demonstrating to be more and more effective in the field of remote sensing [47]. Neural networks have the ability to extract features optimised for a specific task (when enough data are available) avoiding the necessity to extract hand-crafted features. Thus, employing spectral indices, like NDVI, as additional features to deal with land cover classification could not be necessary when using neural networks. Therefore, we evaluate on the two study sites our model performances while excluding the NDVI index from the input (optical) time series. We named such variant noNDVI. The results are reported in Table 9. Table 9. F1 score, Kappa, and Accuracy when considering the exclusion of Normalized Difference Vegetation Index (NDVI) index on the study sites (results averaged over ten random splits). We have highlighted the best scores in bold. We can note on both study sites that there is no significant difference in the model performance when NDVI is excluded. noNDV I performs slightly better than HOb2sRNN on the Reunion island and inversely on the Senegalese site. These small variations come from model properties such as kernel weight initialization or parameter optimization that can induce such kind of performance fluctuations.

Reunion
To conclude, this experiment underlines that our model, considering the two study sites involved in the experimental evaluation, is able to overcome the use of such common hand-crafted features achieving the same performances in the land cover classification task. Such a result makes a step further on the comprehension of which hand-crafted features are convenient (or not) to be extracted during the preprocessing step as well as save time, computation, and storage resources during the analysis pipeline.

Qualitative Analysis of Land Cover Maps
With the purpose to investigate some differences in the land cover maps produced by the competing methods, we highlight in Figures 14 and 15 some representative map details of the Reunion island and the Senegalese site, respectively. For each study site, we remind that land cover maps were produced by labeling each of the segments (14,465 on the Reunion island or 116,937 on the Senegalese site) obtained after the VHSR image segmentation (SPOT6/7 or PlanetScope). For each example, we supplied the corresponding VHSR image displayed in RGB colour as reference. Concerning Reunion island, we focused in the first example (Figure 14b-h), on the Saint-Pierre mixed coastal urban and agricultural area. In this example, we can note the confusions highlighted in the per-class analysis between urbanized areas and greenhouse crops. Visually, RF models (RF POE particularly) better classifies urbanized areas. The second example (Figure 14j-p) depicts a mixed agricultural area with natural vegetation neighboring. We can note here that HOb2sRNN detects a realistic amount of orchards with respect to other methods, according to field experts. In addition, we can also observe on the right of this extract that RF wrongly classified as sugar cane plantations some wooded areas, moor, and savannah objects. Moreover, at the bottom left of the OD2RNN map, we can highlight the misclassification that arises between Rocks and Water classes. Regarding the Senegalese site, the first example (Figure 15b-h) depicts a wet area near the Diohine village located in the east. While other competitors tend to provide the correct representation of the wet area, RF methods wrongly detect villages. As pointed out in previous map details concerning Sugarcane and Orchards on the Reunion island study site, RF predictions is sometimes biased towards most represented classes in the training data i.e., Sugarcane, Orchards in the case of Reunion island and Villages here. In fact, RF is known to be sensible to class imbalance [48]. The second example (Figure 15j-p) focused on a rural landscape, including built-up (villages) and agricultural activities. Here, RF models map more legumes than the other methods while the rest of the approaches detect fallows and cereals instead. To sum up, these visual inspections of land cover maps are consistent with the previously obtained quantitative results.

Attention Parameters Analysis
In this last part of our experimental results, we explore the side information provided by the attention mechanism introduced in Section 3.2 in order to get meaningful insights about how HOb2sRNN handles the multi-source time series for the land cover classification task. Attention weights have been successfully employed in the field of NLP [36,38,49] to explain which parts of the input signal contribute to the final decision of RNN models. With the aim to set up an analogous analysis in the remote sensing time series classification context, we considered attention weights on the Senegalese site with a particular interest on crops (cereals and legumes) motivated by the agronomic knowledge we obtained from discussions with field experts. Figure 16 depicts the distribution of the attention weights on cereals and legumes land covers. For the sake of simplicity, we only focused on the attention mechanism employed to support the fused classifier. At first glance, we can observe that the model weights quite similarly the radar and optical time stamps on both classes. We can also notice that some time stamps towards the end of each time series are highly weighted. It is interesting to note that correlation exists between these high attention values and the crops growth. In the Senegalese groundnut basin, vegetation reaches its peak activity, in fact, during the August month (middle of the time series) also characterized by heavy rain amount, all inducing more differentiation among land cover classes. However, on the two last optical time stamps (25/10/2018 and 30/10/2018), attention weights were differently attributed when considering the two crop types. Cereals get more important weights for these two timestamps than legumes. This behaviour seems to be associated to the agricultural practices adopted in the area at the end of the season. While cereals (mainly millet) are harvested cutting only the ears, legumes (mainly groundnut) are torn off. Thus, on these latter time stamps, cereal plots are covered by senescent plants while legume plots turn into bare soils. These practices are visible in the SITS and illustrated in Figure 17. To wrap up this analysis, we have observed that correlations exist between the attention weights learnt by HOb2sRNN and agricultural practices that characterized the considered study areas. As underlined by our findings, the exploration of the attention parameters can support the understanding of the decision made by our model and provides useful insights about the information that is contained in the SITS under the lens of agricultural field practices.

Discussion
To summarize, the proposed deep learning framework exhibits convincing performances in land cover mapping considering situations characterized by a realistic amount of available training samples. The comparison with other machine learning approaches underlines three points: (i) our approach clearly outperforms the RF classifier that is a common strategy employed to deal with SITS classification; (ii) other standard machine learning methods, i.e., SVM and MLP, exhibit competitive behaviours with respect to our method on a study site that involves a small amount of labeled samples; and, (iii) our proposal surpasses, on both study sites, the adaptation of the recent TempCNN competitor for the multi-source scenario. Such result points out again that, beyond modelling the temporal correlation exhibited by SITS data via RNN or CNN approaches, in a multi-source context other features be worthy to be considered: the hierarchical class relationships as well as the way in which radar/optical information may be combined.
The ablation study indicates that HOb2sRNN is capable to exploit the complementarity between the radar and optical information, always improving its performances with respect to using only one of the two sources. Our framework integrates background knowledge via hierarchical pretraining leveraging taxonomic relationships between land cover classes. The experiments highlight that such knowledge seems valuable for the model and it systematically ameliorates its behaviour. On the other hand, some other type of considered knowledge, i.e., the NDVI index, seems less effective due to the fact that, probably, the model is capable to overcome it. Such observation was also highlighted by [43]. These points clearly pave the way to further investigation about which and how knowledge can be injected to guide/regularize the learning process of such techniques. In addition, the analysis conducted on the α hyperparameter demonstrates that the integration of the auxiliary classifiers, in the training procedure, brings added value still considering small weighting values. When considering the model interpretability, we have also provided some qualitative studies about the side information that can be extracted from our framework. The qualitative results we obtain are in line with the agronomic knowledge we obtained from the considered study area. Make the black box grey is an hot topic today in the machine learning community [50] and we can state, with a certain margin of confidence, that solutions or answers associated to this question will be available in a near future.
We recall that our framework works under an OBIA setting. This means that its performances are tightly related to the quality of the underlying segmentation process that is sensitive to several elements [51], for example: the employed segmentation algorithm, since different algorithms show different pros and cons; the determination of the scale parameter for the particular segmentation algorithm, since useful information can be available at different scales and the co-registration among different sources when working with multi-source data. If on one side, all of these elements carrying in opportunities to properly model the underlying information available in the considered remote sensing data for the analysis at hand, on the other side they can constitute possible sources of errors that are propagated until the land cover mapping result. We underline that it is out of the scope of this work to investigate how such elements impact the subsequent analysis since they deserve a complete and extensive study also in light of the recent adoption of deep learning techniques in the remote sensing domain. Here, we have focused our effort on the multi-source land cover mapping task assuming that the object layer, derived by the segmentation, is an input of our framework. For this reason, we believe that progress in the OBIA field, improving the extraction of the object layer from remote sensing data, can only ameliorate downstream analysis, like the one proposed in this work.
Finally, we remind that operational/realistic constraints might be considered when dealing with remote sensing analysis. Constraints can be related to available resources, i.e., timely production of land cover maps or limited access to training samples. We are aware that, in operational/realistic scenario characterized by the almost real-time production of land cover maps (i.e., disaster management [52]), more computationally efficient solutions needs to be preferred (i.e., MLP or SVM) to deep learning approaches. On the other hand, in our work, we deal with (agricultural-oriented) land cover mapping where, land cover maps need to be provided with a relative low time frequency (once or twice per year). Due to this fact, here, the operational constraints are mainly intended regarding the limited amount of available labeled samples. In such data paucity setting, our approach clearly outperforms the Random Forest classifier, which is the de facto strategy involved in the operational classification of SITS data [53]. In addition, the experimental evaluation pointed out that, less explored machine learning techniques in the context of SITS analysis, i.e., SVM and MLP, deserve much attention, since they constitute valuable strategies to which compare future proposals.

Conclusions
In this work, we propose dealing with land cover mapping at object level, from multi-temporal and multi-source (radar and optical) data. Our approach is based on an enriched RNN cell and involved a modified attention mechanism devised to better suit the SITS data context. We also introduce a hierarchical pretraining approach for the architecture which integrates specific knowledge from land cover classes to support the land cover mapping task. Extensive quantitative and qualitative evaluations on two study sites demonstrate the effectiveness of our solution as compared to competitive approaches in the field of land cover mapping with SITS data. As future work, we plan to extend the current framework in order to also leverage spatial dependencies in multi-source SITS, which are often neglected, especially in the OBIA setting. In addition, the great popularity of Transformer models [54] in the field of Natural Language Processing for handling sequential data makes them a potential way of fusing multi-temporal and multi-source remote sensing data in subsequent research.