Towards a Multi-Temporal Deep Learning Approach for Mapping Urban Fabric Using Sentinel 2 Images

Abstract: The majority of the world's population lives in urban areas, and this share is expected to increase in the future. The main challenges cities face now and in the future are rapid urbanization, rising urban temperatures, and the urban heat island effect. Mapping and monitoring urban fabric (UF) to analyze the environmental impact of these phenomena is more necessary than ever. Coupled with the increased availability of Earth observation data and their growing temporal capabilities, this leads us to consider using temporal features to improve land use classification, especially in urban environments, where the spectral overlap between classes makes it challenging. Urban land use classification thus remains a central question in remote sensing. Although some research studies have successfully used multi-temporal images such as Landsat-8 or Sentinel-2 to improve land cover classification, urban land use mapping is rarely carried out using the temporal dimension. This paper explores the use of Sentinel-2 data in a deep learning framework, firstly by assessing the temporal robustness of four popular fully convolutional neural networks (FCNs) trained over single-date images for the classification of the urban footprint, and secondly by proposing a multi-temporal FCN. A performance comparison between the proposed framework and a regular FCN is also conducted. In this study, we consider four UF classes typical of many Western European cities. Results show that training the proposed multi-date model on Sentinel-2 multi-temporal data achieved the best results, with a Kappa coefficient increase of 2.72% and 6.40% for continuous UF and industrial facilities, respectively. Although a more definitive conclusion requires further testing, these first results are promising because they confirm that integrating the temporal dimension with a high spatial resolution into urban land use classification may be a valuable strategy to discriminate among several urban categories.


Introduction
In 2018, 55% of the world's population lived in urban areas, against only 30% in the 1950s. This value is expected to reach 68% in 2050 [1]. This growing urbanization makes urban areas very dynamic and changes the way people live, consume, and exploit resources. For many years, it was monitored through mapping the urban footprint, which includes the road network, buildings, vegetation, and impervious surfaces. These elements structure the spatial layout of cities, resulting in various urban fabrics (UF) [2]. UF can significantly affect the energy balance of the Earth's surface, resulting in the urban heat island effect [3], and can also affect natural ecosystems through land use and land cover fragmentation [4]. Mapping UF is thus crucial for understanding and simulating urban dynamics [5]. Integrating the temporal information into a classification framework remains challenging. In the literature, several approaches exist. The first technique consists in applying a machine learning classifier to a stack of multi-temporal images. The second approach involves handcrafting adequate temporal features and then feeding them to a machine learning classifier [6]. These features can either be pixel statistics (mean, minimum, etc.) or other metrics obtained from the time series [16].
This technique was used in [12] to distinguish between different land cover classes by computing various statistical and phenological values. Even though these methods deliver satisfactory results, they are data-dependent, as they require an adequate set of features for each classification. Other commonly used techniques are based on time series analytics [17]. They operate by using a classifier, for instance the nearest neighbor classifier (NN), along with a similarity measure like the dynamic time warping distance (DTW) [18]. These techniques are, however, computationally expensive.
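For context, the DTW distance used by such nearest-neighbor classifiers can be computed with the classic quadratic dynamic program sketched below (a textbook illustration, not code from the cited studies; the function name is ours):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D series,
    using the classic O(len(a) * len(b)) dynamic program."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

# Identical temporal profiles shifted in time stay close under DTW,
# which is why it suits seasonal spectral signatures.
print(dtw_distance([0, 1, 2, 1, 0], [0, 0, 1, 2, 1, 0]))  # 0.0
```

The quadratic cost per pair of series, multiplied by the number of training samples a 1-NN classifier must scan, is what makes this approach expensive at scale.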
The shortcomings of these techniques make them less competitive in comparison with approaches capable of learning high-level features end-to-end, like deep learning. In computer vision, convolutional neural networks (CNNs) are very good at analyzing images by learning high-level spatial features [19]. CNNs have been successfully applied to remote sensing data for tasks like scene classification [20], semantic segmentation [21], and object detection [22]. Ma et al. [23] offer a thorough review of deep learning applications in remote sensing. Encoder-decoder architectures are the current state-of-the-art semantic segmentation models due to their numerous advantages, such as easier implementation, higher accuracy, and lower computational complexity [24]. Numerous studies have explored the use of encoder-decoder architectures for the mapping of urban environments. Audebert et al. [25] trained a variant of the SegNet [26] architecture over an urban area using aerial images and other heterogeneous sensors. Zhang et al. [27] distinguished between buildings, roads, water, and vegetation by implementing a multi-scale deep learning model based on the well-known UNet architecture. Fu et al. [22] performed detailed urban land use mapping by improving FCN-8s with atrous convolutions and refining the model's output using conditional random fields. This approach outperformed standard object-based image analysis techniques.
Although CNNs are well suited to processing 2D data such as images, they are not able to manage sequential data like multi-temporal images or time series. Conversely, recurrent neural networks, and particularly LSTMs, can process such data. However, they are not adequate for our purposes, given that land use mapping aims at producing one classification map per series [28]. Various studies have investigated the use of CNNs for processing multi-temporal data. Benedetti et al. [29] proposed a deep learning framework for the fusion of multi-temporal Sentinel 2 imagery and very high-resolution satellite imagery to distinguish between various vegetation classes. Mauro et al. [30] used a multi-layer perceptron (MLP) and a CNN to classify land cover from multi-temporal Landsat images. Nogueira et al. [31] implemented a multi-branch CNN for vegetation mapping and confirmed its superiority to a traditional CNN operating on a temporal stack of images. Pelletier et al. [28] proposed a temporal CNN for crop classification where convolutions are applied in the temporal domain. The literature review reveals that few studies have investigated the temporal dimension of high-resolution satellite imagery such as Sentinel-2 (10 m) for mapping UF. The use of very high-resolution images is often preferred given the complexity of urban environments and the great spectral overlap.
In this context, the goal of this study is two-fold: (i) to investigate whether UF mapping benefits from integrating temporal information; and (ii) to explore the use of CNNs on multi-temporal satellite imagery. More precisely, this study evaluates the potential contribution of the temporal dimension of freely available high-resolution satellite imagery such as Sentinel-2 to map four urban classes: continuous UF, discontinuous UF, industrial or tertiary facilities, and the road and rail network. These classes all have different vegetation and building densities and are thus well suited for this study. In this paper, we work on a multi-temporal sampling of a time series as a first step; the proposed framework can later be extended to a full time series. This approach, which exploits the temporal dimension to compensate for the spatial resolution, explores the use of CNNs, which provide an end-to-end learning framework without the need for feature engineering (the often time-consuming process of using domain knowledge of the data to create features that make machine learning algorithms work). The paper is structured in five sections. Section 2 describes materials and methods. First, a comparison baseline from mono-temporal images is presented by analyzing the performance of four encoder-decoder architectures for mapping the urban footprint. Secondly, based on the best model, we perform UF mapping and design a multi-temporal classifier by altering the previous model. Results (Section 3) are then detailed and allow assessing the benefits of using temporal data to discriminate UF. A discussion and conclusion are then provided in Sections 4 and 5, respectively.

Study Site and Reference Data
The study area is the Alsace region located in the East of France (Figure 2a). It covers an area of 8329 km² and includes four main land cover classes: urban footprint, agricultural areas, forest areas, and water surfaces. The urban footprint displays some level of class imbalance (Figure 2b). Discontinuous UF is the most common class, followed by industrial or tertiary facilities and then the road and rail network. The key reason for choosing the Alsace region as a study site is its morphological characteristics, typical of many Western European cities.

The regional land use/cover vector database (BDOccsol©GeoGrandEst), produced by visual interpretation of aerial photography, is used as reference data [33]. The most up-to-date version (2012), available through the regional data infrastructure GeoGrandEst, maps the study area into 55 thematic classes. The legend is organized into four levels of nomenclature. The first level categorizes land cover into four classes: (1) artificial surfaces, (2) agricultural areas, (3) forest areas, and (4) water surfaces. Artificial surfaces (Level 1) are further subdivided into 12 classes at Level 3. Table 1 details the four generic urban thematic classes obtained after generalization.
Both continuous and discontinuous UF contain buildings and impervious surfaces. The main difference lies in their relative density and in the importance of vegetated areas and bare surfaces. In continuous UF, more than 80% of the surface is covered by impervious elements (buildings, roads, and artificially surfaced areas). For discontinuous UF, roads and artificially surfaced areas range from 30% to 80%. This is also the case for industrial facilities, due to industrial wastelands and quarries, and for the road and rail network, due to the presence of open areas. These variations make these classes well suited for this study, as we can already expect the densely built classes to have a more stable temporal response compared to the others (Figure 1). Reference data are rasterized at a 10 m spatial resolution. This process is applied at two levels: the first level produces a binary map of the urban footprint, and the second level results in a multi-class map with the four UF classes (Figure 3-Step 1).


Table 1. The four generic urban thematic classes obtained after generalization.

Final classes | Land cover classes (Level 3) BDOccsol©GeoGrandEst | Description
(1) Continuous UF | Continuous housing blocks | Areas mainly occupied by buildings and other impervious surfaces; vegetation and bare soil are exceptional.
(2) Discontinuous UF | Discontinuous housing blocks | Areas where buildings, artificial surfaces, vegetated areas and bare surfaces are present and occupy significant surfaces in a discontinuous spatial pattern.
(3) Industrial facilities | Commercial and industrial facilities, industrial wastelands, gravel quarries and sandpits, mining sites and construction sites | Areas mainly occupied by industrial activities (trade, manufacturing, service and tertiary or extractive activities).
(4) Road and rail networks | Primary road network and associated installations (stations, platforms and parking areas), railways and train stations, areas associated with airport and port activities and buildings intended for agricultural use | Rail and road networks, airport and port sites, and other open associated areas.


Optical Satellite Data and Pre-Processing Steps
The Alsace region is covered by three Sentinel 2 tiles (T32ULU, T32ULV, and T32TLT) downloaded from the Theia Land Data Center (Figure 1a). Sentinel 2 is a satellite mission of the European Earth observation program, Copernicus. It provides multispectral images with four bands at 10 m (blue, green, red, and near infrared), six bands at 20 m (four vegetation red edge bands and two short wave infrared bands), and three bands at 60 m (coastal aerosol, water vapor, and short wave infrared cirrus) [33]. In this paper, we use only the 10 m spectral bands, as they are better adapted for UF mapping. Depending on the cloud cover, all images are selected to be as representative as possible of the year 2017, i.e., by varying the sensing seasons in order to add seasonal variations to our training data. Table 2 shows the distribution of the sensing periods.

Image Normalization

All the images are normalized to make the learning process more stable by reducing the variability of the training data and thus improving the generalization power of the model (Figure 3-Step 1). This preprocessing step consists of computing the mean and the standard deviation of each channel separately and then normalizing across it:

z_i = (x_i − μ_i) / σ_i

where i represents the i-th channel, z_i the pixel radiometric value after normalization, x_i its value before normalization, μ_i the mean value of channel i, and σ_i its standard deviation.
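A minimal NumPy sketch of this per-channel standardization (array shapes and names are our own illustration, not the authors' code):

```python
import numpy as np

def normalize_channels(images):
    """Standardize each spectral channel to zero mean and unit variance.

    images: float array of shape (N, H, W, C) -- N patches, C channels.
    Returns the normalized stack plus the per-channel mean and std,
    so the same statistics can be reused on validation and test data.
    """
    mu = images.mean(axis=(0, 1, 2))        # one mean per channel
    sigma = images.std(axis=(0, 1, 2))      # one std per channel
    z = (images - mu) / sigma               # broadcast over the channel axis
    return z, mu, sigma

# Random data standing in for Sentinel 2 10 m patches (4 bands).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10000, size=(8, 256, 256, 4)).astype("float32")
z, mu, sigma = normalize_channels(x)
print(z.mean(axis=(0, 1, 2)))  # close to 0 for every channel
```

Computing the statistics once on the training set and reusing them elsewhere avoids leaking test-set information into preprocessing.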

Problem Formulation
The purpose of this work is to study the benefit of using multi-temporal imagery for the classification of the urban fabric. For the classes of interest, the temporal dimension can supply additional information, as certain classes experience seasonal variations and others do not. In this context, temporal instability is as relevant as stability for differentiating between the classes.
The first phase of our experiments aims at providing a comparison baseline from mono-temporal images. The urban footprint is mapped (Figure 3-Step 2) using four CNN models, and the best architecture is selected to perform UF classification. Models are evaluated on test sets covering different geographical regions and different dates to assess their spatial robustness and temporal variations. The second step of our experiments (Figure 3-Step 3) concerns mapping UF using multi-temporal imagery. Table 3 summarizes the different datasets used for training the three models: (1) a mono-temporal model for the classification of the urban footprint, (2) a mono-temporal model for the classification of UF, and (3) a multi-temporal model for the classification of UF. The flowchart in Figure 3 details the proposed methodology to assess the best (mono- or multi-temporal) model to map UF. Table 3. The data used for training each model.

Model | Training Data

Data Sampling
It consists of subdividing the image into 256 × 256 pixel patches (Figure 3-Step 1). The patch size should be sufficiently large so as not to lose spatial context, but should also take into account the available computational resources. Patches with no spectral information are then removed. We perform a random shuffling of the dataset before splitting it into training and validation sets, using 90% of the patches for training and 10% for validation. Data augmentation is also applied to the training set through random rotations, translations, and zooms to make our models more robust to variations and to prevent overfitting. Table 4 shows the number of image patches obtained for each model. The number of patches is proportional to the distribution of the defined classes on the study site.
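The sampling procedure can be sketched as follows (a minimal illustration assuming in-memory arrays; the tile size and the 90/10 split come from the text, while the function names are ours):

```python
import numpy as np

def sample_patches(image, mask, patch=256):
    """Cut an image and its label mask into non-overlapping patches,
    dropping patches that contain no spectral information (all zeros)."""
    h, w = image.shape[:2]
    pairs = []
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            tile = image[r:r + patch, c:c + patch]
            if tile.any():                       # keep only informative tiles
                pairs.append((tile, mask[r:r + patch, c:c + patch]))
    return pairs

def train_val_split(pairs, val_ratio=0.1, seed=42):
    """Shuffle, then hold out val_ratio of the patches for validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(pairs))
    n_val = int(len(pairs) * val_ratio)
    val = [pairs[i] for i in idx[:n_val]]
    train = [pairs[i] for i in idx[n_val:]]
    return train, val
```

Shuffling before splitting keeps the validation set geographically mixed rather than concentrated in one corner of a tile.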

Choice of Model Architectures
It is essential to discuss how to approach multi-label classification. Two common strategies are often used: the first consists in implementing a global model that outputs membership probabilities for all the classes of interest, whereas the second, commonly referred to as one-vs-all classification, consists in implementing a separate binary classifier for each class. Several factors influence this choice, such as the class distribution, the number of target classes, and the available computing resources. As Figure 2b illustrates, the dataset is highly class-imbalanced. Consequently, we use the one-vs-all approach [34]. As a training strategy, we use transfer learning, which allows us to take advantage of existing pre-trained models. Transfer learning is also ideal when the number of training samples is limited [35].
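Under the one-vs-all strategy, the multi-class reference map is turned into one binary target per class, each feeding its own binary segmentation model. A minimal sketch (class codes are hypothetical placeholders for the four UF classes):

```python
import numpy as np

def one_vs_all_masks(label_map, classes):
    """Turn a multi-class label map into one binary target per class,
    so each class can be learned by its own binary segmentation model."""
    return {c: (label_map == c).astype("uint8") for c in classes}

labels = np.array([[1, 2],
                   [3, 1]])                    # toy 2x2 label map
masks = one_vs_all_masks(labels, classes=[1, 2, 3, 4])
print(masks[1])  # 1 where the pixel belongs to class 1, else 0
```

A class absent from a patch simply yields an all-zero mask, which is why class imbalance must still be handled by the loss or sampling strategy.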

• Supervised classification of the urban footprint using mono-temporal Sentinel 2 data.
Four distinct encoder-decoder architectures are implemented: FCN-32s [36], FCN-16s [36], SegNet [26], and UNet [37]. Unlike the original paper, we use VGG-19 [38] as the basis of the first two architectures. We initialize all encoders with ImageNet weights and all decoders using Xavier initialization [39]. According to [40], transfer learning can fail when changing sensors. We therefore employ the Sentinel 2 RGB bands, analogously to the ImageNet dataset. We nonetheless train from scratch a variant of the SegNet model on multispectral data to explore the impact on classification performance. As a loss function, we use the dice loss (DSC) defined below. We then conduct a performance comparison between the implemented models to select the best model for UF mapping.
DSC(Y, Ŷ) = −2 |Y ∩ Ŷ| / (|Y| + |Ŷ|)

where DSC is the dice loss, Y is the ground truth, and Ŷ is the model prediction. We use mini-batch gradient descent to optimize the objective function. To mitigate overfitting, we use several regularization techniques, including dropout and real-time data augmentation through random rotations, translations, and zooms. We tune the dropout value and the batch size on the validation set, and we set an initial value for the learning rate using the method of [41]. We then reduce it by 10% whenever the validation loss stops improving for more than five epochs of training. We train all models for 150 epochs. The outputs are class membership probabilities in the range 0 to 1. At this stage, we use 50% as a threshold to transform the results into a classified map.
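A soft, differentiable variant of this loss, as commonly implemented for binary segmentation, can be sketched as follows (a standard formulation consistent with the negative loss values reported in the results, not the authors' exact code):

```python
import numpy as np

def dice_loss(y_true, y_pred, eps=1e-7):
    """Negative soft Dice coefficient: -1 for perfect overlap, 0 for none.

    y_true: binary ground-truth mask; y_pred: predicted probabilities.
    The eps term keeps the ratio defined when both masks are empty.
    """
    intersection = np.sum(y_true * y_pred)
    return -(2.0 * intersection + eps) / (np.sum(y_true) + np.sum(y_pred) + eps)

y = np.array([1.0, 1.0, 0.0, 0.0])
p = np.array([0.9, 0.8, 0.1, 0.2])
print(dice_loss(y, p))  # close to -0.85: a good but imperfect prediction
```

Using predicted probabilities rather than hard labels keeps the loss differentiable, so it can be minimized by gradient descent.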
• Supervised classification of UF using mono and multi-temporal Sentinel 2 data.
The previous experiment showed that the UNet model offers the best performance; this is further detailed in the results section. The model is then trained on mono-temporal data to map the four classes of UF. Afterward, we design a new architecture, taking the previous one as a template, to analyze multi-temporal data. The proposed architecture (Figure 4) takes as input a multi-temporal patch, outputs a classified map, and is capable of learning spatio-temporal features. The encoder consists of three VGG-16 branches carrying out the separate learning of spatial features for each timestamp. Then, a temporal-wise concatenation of the last convolutional layer is performed along the three branches for each VGG-16 block. The spatio-temporal features are then passed to the decoder via skip connections. The result of the last concatenation is up-sampled and further processed by the decoder. Finally, the output is fed to a sigmoid classifier to produce class membership probabilities. Thus, starting from a multi-temporal patch, our model outputs a single classification map. We initialize and freeze the encoders using the corresponding mono-temporal weights learned previously. This aims at reducing the number of trainable parameters and the risk of overfitting. We use a combination of the dice loss and the binary cross-entropy to train the model [43]:
L(Y, Ŷ) = α · DSC(Y, Ŷ) + (1 − α) · H(Y, Ŷ)

where DSC is the dice loss, Y is the ground truth, Ŷ is the model prediction, α is set to a value of 0.7, and H is the binary cross-entropy function. We train the models for 300 epochs each. To define a threshold, we adopt a systematic approach that consists in retrieving the break-even point, defined as the point where the recall of the model equals its precision. This value represents an adequate choice given the trade-off between recall and precision; a good model should be able to compromise between these two quantities. Figure 5 shows the precision-recall curve for each classification model. Due to computational constraints, we train the multi-temporal UNet on two classes. To best assess the potential of temporal information for UF mapping, we chose a class with a large percentage of land covered by artificially surfaced areas, and another one with a small percentage. This is the case for continuous UF, where the ground is almost entirely covered by buildings, and for industrial facilities, where vegetation is present.
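The break-even point can also be located programmatically by scanning candidate thresholds and keeping the one where precision and recall are closest (a sketch of the idea; in the paper this point is read from the precision-recall curve):

```python
import numpy as np

def break_even_threshold(y_true, y_prob,
                         thresholds=np.linspace(0.05, 0.95, 181)):
    """Return the probability threshold where precision is closest to recall."""
    best_t, best_gap = 0.5, float("inf")
    for t in thresholds:
        y_hat = y_prob >= t
        tp = np.sum(y_hat & (y_true == 1))
        fp = np.sum(y_hat & (y_true == 0))
        fn = np.sum(~y_hat & (y_true == 1))
        if tp + fp == 0 or tp + fn == 0:   # undefined precision or recall
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        if abs(precision - recall) < best_gap:
            best_gap, best_t = abs(precision - recall), t
    return best_t

t = break_even_threshold(np.array([0, 0, 1, 1]),
                         np.array([0.1, 0.3, 0.7, 0.9]))
print(t)
```

On the held-out validation set, this gives one threshold per binary classifier, replacing the fixed 50% cut-off used in the mono-temporal baseline.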
To measure model accuracy, we use the intersection over union/Jaccard index (IoU) [43], overall accuracy (OA), and confusion-matrix metrics including precision, recall, and the F1 score. We also use the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUROC) [44] as additional performance indicators.
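For a binary map, all of these confusion-matrix metrics reduce to four pixel counts; a minimal sketch (assuming non-degenerate masks, so no division by zero occurs):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """IoU, overall accuracy, precision, recall and F1 from two binary maps."""
    y_true, y_pred = y_true.astype(bool), y_pred.astype(bool)
    tp = np.sum(y_true & y_pred)       # true positives
    fp = np.sum(~y_true & y_pred)      # false positives
    fn = np.sum(y_true & ~y_pred)      # false negatives
    tn = np.sum(~y_true & ~y_pred)     # true negatives
    iou = tp / (tp + fp + fn)
    oa = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"IoU": iou, "OA": oa,
            "precision": precision, "recall": recall, "F1": f1}
```

Note that OA counts true negatives while IoU does not, which is why IoU is the stricter score for rare classes such as the road and rail network.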

Experimental Results
Section 3.1 presents and discusses the results of the comparison between the four models for urban footprint classification, and Section 3.2 presents the results of UF classification using mono and multi-temporal images.

Results of Urban Footprint Classification
The loss function is plotted against time for each model and for both training and validation sets to monitor any possible overfitting (Figure 6). It appears that for both training and validation sets, UNet and RGB SegNet reached approximately a dice loss of −85% compared to a value approaching −75% for the other models. Furthermore, the FCN-32s and the multi-spectral SegNet are prone to overfitting, which is shown by the increasing gap between the training loss and the validation loss. To make a more meaningful comparison, we plot the learning curves for each model on the same graph (Figure 7).
UNet and RGB SegNet outperformed their counterparts in that their loss decreases more steeply and reaches lower values. Furthermore, it appears that the UNet model holds a slight advantage. Another interesting remark is that the performance of the SegNet model trained on multi-spectral data is reduced compared with its RGB counterpart. Given the wide gap between the two variants of SegNet, we will no longer consider the second one for the rest of the experiments. These findings support the importance of transfer learning for small datasets.
To further analyze the models' performance, we examined in more depth the variations of their sensitivity and specificity using the ROC curve (Figure 8). UNet and SegNet outperformed the other models, with a slight edge for the UNet. As an example, to reach 86% of true positives, UNet and SegNet only produce 1% of false positives, compared with 7% and 10% for the FCN-16s and FCN-32s, respectively. The area under the ROC curve (AUROC) is also a meaningful indicator of a model's performance. Given that each ROC curve must depart as much as possible from the diagonal, AUROC must be as close as possible to 1 and as far as possible from 0.5. Table 5 below shows the AUROC values for each model. These results are very much in line with those in Figures 8 and 9, with an ongoing advantage for UNet and SegNet.
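For reference, the AUROC can be computed directly from per-pixel scores. The NumPy sketch below uses the rank-statistic formulation (the probability that a randomly chosen positive pixel scores higher than a randomly chosen negative one); this implementation choice is an assumption, not the authors' code.

```python
import numpy as np

def auroc(y_true, y_score):
    """Area under the ROC curve from rank statistics.

    Equivalent to the probability that a randomly chosen positive
    receives a higher score than a randomly chosen negative.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Count pairwise wins (ties count 0.5); fine for modest array sizes.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

A perfect classifier yields 1.0 and a random one about 0.5, which is why curves departing from the diagonal in Figure 8 translate into AUROC values close to 1 in Table 5.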
Next, we examined the performance of each model on the validation set in terms of various evaluation metrics (Table 6). UNet achieved the best results with an F1 score of over 91%. As a last experiment, we conducted more rigorous tests enabling us to assess each model's generalization ability. For this purpose, we constructed two test sets. In test set 1 (DiffT), we kept the same geographical zone while changing the sensing time. In contrast, we did the opposite for test set 2 (DiffZ), i.e., a different geographical zone while keeping the sensing time unchanged. Results are reported in Table 7. The performance is significantly better for the DiffZ test set for all the classification models, with up to 9% difference for metrics like the F1 score and the IoU. This suggests that training a model over single-date images might make it less robust to temporal variations and less efficient over images taken during another time of the year. The second observation is that, regardless of the test set, the UNet model achieves the best results. For a more in-depth analysis, we plotted, for each model, the differences in IoU, F1 score, and Kappa coefficient (Figure 9). The UNet model shows the smallest differences in the three evaluation metrics, so it is the least sensitive to seasonal variations. We finally conduct a visual examination of the results (Figure 10).
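The Kappa coefficient compared in Figure 9 corrects observed agreement for the agreement expected by chance. A minimal NumPy sketch for the binary case (illustrative only; the function name is hypothetical):

```python
import numpy as np

def kappa(y_true, y_pred):
    """Cohen's Kappa for a binary classification map.

    Agreement between prediction and ground truth, corrected for
    the agreement expected by chance from the class marginals.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    po = np.mean(y_true == y_pred)  # observed agreement
    # Chance agreement from the marginal class frequencies.
    pe = (np.sum(y_true == 1) * np.sum(y_pred == 1) +
          np.sum(y_true == 0) * np.sum(y_pred == 0)) / n**2
    return (po - pe) / (1 - pe)
```

A value of 1 indicates perfect agreement, while 0 indicates agreement no better than chance, which is why small Kappa differences between DiffT and DiffZ signal a robust model.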
Overall, all the predictions closely resemble the ground truth. However, the UNet model does better when it comes to linear features like roads. If we take a closer look at each prediction (Figure 11), UNet's and SegNet's predictions are very refined compared with those of FCN-32s and FCN-16s, which are coarser.
Each model's architecture can explain these results. FCN-32s only carries out one upsampling operation. Because spatial information is lost as we go deeper into the network, the final prediction of FCN-32s is coarse. The improved performance of FCN-16s is due to the presence of a skip connection that makes spatial information flow from earlier layers to deeper ones. For SegNet, the use of a more substantial decoder allows a more progressive reconstruction of spatial information and, thus, a higher performance. As for the UNet model, the use of several layers in the decoder, along with skip connections, makes it the best model. Therefore, we chose the UNet to form the mono-temporal comparison baseline.

Results of UF Classification
Table 8 below shows the training duration for each UF class. The ROC curves (Figure 12) reveal a comparable performance for all classes, with AUCs between 0.97 and 0.99. Table 9 presents the evaluation metrics computed over the validation set, while Figure 13 provides visual results. Furthermore, we tested each model over four multi-temporal Sentinel 2 images. To span the entire year, we chose the months of November, March, and August.
F1 scores were then plotted along with those of the training dataset (Figure 14). As we move away from the training dataset's sensing period, the model's performance gets worse. Moreover, the performance of discontinuous UF does not degrade as quickly as that of the other classes. This can be explained by the number of available samples per class: a model trained over a large dataset has a better generalization capability. Nevertheless, it is still unclear why, even for discontinuous UF, performance degrades as we move further from the training time. Afterward, we trained the multi-temporal UNet on two UF classes: continuous UF and the class of industrial facilities. Results over the validation set are reported in Table 10. To answer the research question, we developed a testing methodology using multi-temporal Sentinel 2 imagery of the year 2018: we first train the two classes separately, using the UNet, and then together using our multi-temporal framework. The results are listed in Table 11. It appears that for both categories of UF, all evaluation metrics show some degree of improvement. Continuous habitat has seen its Kappa coefficient go up by 2.72% when comparing the multi-temporal model with the most successful testing period (February 2018). The same goes for the class of industrial facilities, for which an improvement of 6.4% was registered. The improvement further increases if we conduct the comparison with the least successful testing period (August 2018).
To examine in more depth how these quantitative improvements translate visually, we plot alongside the ground truth the predictions of the multi-temporal model and those of the mono-temporal model for the most successful testing period (Figures 15 and 16). Predictions made by the multi-temporal model are better than those produced by its mono-temporal counterpart for both classes. Improvements were more pronounced for the class of industrial facilities, where the multi-temporal model was able to accurately classify previously undetected portions of the image. Similarly, for the class of continuous habitat, predictions made by the multi-temporal model are more refined and resemble the ground truth more closely.

Discussion
These experiments evaluate whether multi-temporal imagery might help discriminate between UF classes with different occurrences of urban structures, vegetation, and bare surfaces. First, we highlight that a mono-temporal model systematically performed better on data covering a different geographical zone than on data taken at a different time. This suggests that seasonal variations prevented the models from recognizing certain UF classes. Moreover, when testing each class's model on a multi-temporal test set, performance decreases as we move away from the training set's sensing time. These two findings are complementary: a mono-temporal model will not only do worse on multi-temporal data, but its performance will also continue to degrade as we move away from the training period. The proposed multi-temporal model did better than its mono-temporal counterpart; namely, using three test images of the year 2018 simultaneously proved more effective than exploiting them separately. The quantitative as well as the visual improvements were more pronounced for the class of industrial facilities compared with continuous UF. This is consistent with our general hypothesis: a multi-temporal approach can be fruitful for differentiating between classes with distinct temporal characteristics. All of these results are consistent with each other, as they all point to the temporal domain as a possibly useful input for urban land use classification. All things considered, a reasonable conclusion is that learning new features using multi-temporal data and deep learning can improve the classification accuracy for both continuous UF and industrial facilities. Another sound conclusion is that Sentinel 2 images, despite a spatial resolution of 10 m, can be used for urban land use mapping.
Nevertheless, the contribution of the temporal dimension to UF mapping requires further testing with larger datasets and longer time series. With this in mind, we think the positive results achieved in this research are worth building upon in future studies to further investigate the temporal domain as an input to urban land use classification models. Conducting experiments on a wider range of UF classes, using longer time series, and using bigger datasets all represent good avenues for future research.

Conclusions and Perspectives
In conclusion, this work explored the use of multi-temporal Sentinel-2 images for UF mapping. To this end, a novel multi-temporal fully convolutional network based on the well-known UNet architecture was implemented. The proposed model learns a spatio-temporal representation of the data without the need for extensive feature engineering or heavy preprocessing and is trained in an end-to-end fashion. To investigate whether the temporal dimension can improve the classification performance of UF, we devised various testing methodologies using mono-temporal and multi-temporal images. We first assessed the robustness of four popular deep learning benchmarks to spatial and temporal variations and then conducted a comparative analysis between the performance of a mono-temporal model and the proposed multi-temporal model. Experimental results showed that a mono-temporal model's performance tends to decrease as the test set's sensing period departs from that of the training set. Furthermore, the proposed multi-temporal framework yielded the best results, with a Kappa coefficient increase of 2.72% and 6.4% for the classes of continuous UF and industrial facilities, respectively. These findings highlight the benefit of exploiting the spatio-temporal properties of satellite images for urban land use mapping and pave the way for new approaches that leverage the power of multi-temporal data to discriminate among multiple urban categories.

Funding:
This research was funded by the French funded project ANR TIMES 'High-performance processing techniques for mapping and monitoring environmental changes from massive, heterogeneous and high frequency data times series' (ANR-17-CE23-0015-01) and the CNES TOSCA program 2018.

Conflicts of Interest:
The authors declare no conflict of interest.