1. Introduction
In 2018, 55% of the world’s population lived in urban areas, compared with only 30% in the 1950s. This share is expected to reach 68% by 2050 [
1]. This growing urbanization makes urban areas very dynamic and changes the way people live, consume and exploit resources. For many years, it was monitored through mapping the urban footprint, which includes the road network, buildings, vegetation, and impervious surfaces. These elements structure the spatial layout of cities resulting in various urban fabrics (UF) [
2]. UF can significantly affect the energy balance of the Earth’s surface, resulting in the urban heat island effect [
3], and can also affect natural ecosystems through land use and land cover fragmentation [
4]. Mapping UF is thus crucial for understanding and simulating urban dynamics [
The increasing availability of remote sensing data has greatly facilitated land use mapping. However, it is still challenging to ensure its accuracy [
6], especially in urban environments where impervious materials have similar spectral properties. Spectral information alone is insufficient for this purpose. Consequently, object-based image analyses are preferred because of their ability to include contextual information. Numerous studies have used supervised object-based methods to map land cover and land use [
7]. However, these approaches usually require very high-resolution images since they use segments as processing units to compute spectral, geometrical, and spatial features to classify [
8]. Even if these methods deliver satisfactory results, they are still limited as they are data-dependent and require an adequate set of features for each application.
Recently, several papers have shown that multi-temporal imagery can improve classification, provided that land use classes have time-variant properties [
14]. In an urban context, the spectral properties of each urban fabric class vary across seasons because of the different occurrences of each land cover element (vegetation, bare soil, buildings, road networks, and impervious surfaces). These patterns are already visible in high spatial resolution (10 m) images, and the spectral variation can thus be characterized by temporal features [
15]. However, unlike in crop mapping, the temporal properties we want to exploit are not seasonal patterns specific to each class but are instead related to a class’s temporal stability, which depends on its land cover elements. The presence or absence of vegetation, together with the relative density of buildings, makes some classes more susceptible to change and others more stable. For instance, the class of discontinuous UF (
Figure 1a), composed of buildings and other impervious surfaces with green areas and bare soil, is more susceptible to seasonal changes. Other types of UF, such as dense and continuous UF (
Figure 1b), are less prone to change and have more stable temporal properties. This is the case when urban structures and transport networks dominate the surface area. Based on this qualitative expert analysis, we hypothesize that a classification algorithm capable of retrieving spatio-temporal features could map UF effectively compared to a classical supervised approach based on very high spatial resolution imagery.
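The notion of temporal stability can be illustrated with a simple per-pixel indicator. The sketch below is only an illustration and not the method used in this study: it assumes that red and near-infrared reflectances are available for each date and uses the standard deviation of NDVI over time as a hypothetical stability measure. The function names and the reflectance values are invented for the example.

```python
import numpy as np

def ndvi(red, nir, eps=1e-6):
    """Normalized difference vegetation index from red and near-infrared reflectance."""
    return (nir - red) / (nir + red + eps)

def temporal_variability(red_series, nir_series):
    """Per-pixel standard deviation of NDVI over time.

    Both inputs have shape (T, H, W): T acquisition dates of an H x W image.
    Low values indicate temporally stable pixels.
    """
    return ndvi(red_series, nir_series).std(axis=0)

# Hypothetical reflectances for two pixels over three dates (T=3, H=1, W=2).
# Pixel 0 mimics a sealed surface, pixel 1 a surface with seasonal vegetation.
red = np.array([[[0.20, 0.10]], [[0.20, 0.08]], [[0.20, 0.15]]])
nir = np.array([[[0.22, 0.30]], [[0.22, 0.45]], [[0.22, 0.18]]])

variability = temporal_variability(red, nir)
```

Under these assumed values, the sealed pixel shows almost no NDVI variation while the vegetated pixel varies strongly, which is the kind of contrast expected between continuous and discontinuous UF.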
Integrating the temporal information into a classification framework remains challenging. In the literature, several approaches exist. The first technique consists in applying a machine learning classifier to a stack of multi-temporal images. The second approach involves handcrafting adequate temporal features and then feeding them to a machine learning classifier [
6]. These features can either be pixel statistics (mean, minimum, etc.) or other metrics obtained from the time series [
16]. This technique was used in [
12] to distinguish between different land cover classes by computing various statistical and phenological values. Even though these methods deliver satisfactory results, they are data-dependent as they require an adequate set of features for each classification. Other techniques commonly used are based on time series analytics [
17]. They operate by using a classifier, for instance, the nearest neighbor classifier (NN), along with a similarity measure like the dynamic time warping distance (DTW) [
18]. These techniques are, however, computationally expensive.
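The three families of approaches described above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the implementation of any cited study: the array layout (T dates, H x W pixels, B bands), the function names, and the toy sequences are all hypothetical, and the DTW routine is the textbook dynamic-programming formulation.

```python
import numpy as np

def stack_features(series):
    """Approach 1: flatten a (T, H, W, B) series into one (H*W, T*B) feature matrix."""
    t, h, w, b = series.shape
    return series.transpose(1, 2, 0, 3).reshape(h * w, t * b)

def temporal_statistics(series):
    """Approach 2: per-pixel, per-band mean, min and max over time -> (H*W, 3*B)."""
    t, h, w, b = series.shape
    stats = [series.mean(axis=0), series.min(axis=0), series.max(axis=0)]
    return np.concatenate(stats, axis=-1).reshape(h * w, 3 * b)

def dtw_distance(a, b):
    """Textbook O(len(a)*len(b)) dynamic time warping between two 1-D sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def nn_dtw_predict(query, train_series, train_labels):
    """Approach 3: label of the training sequence closest to `query` under DTW."""
    distances = [dtw_distance(query, s) for s in train_series]
    return train_labels[int(np.argmin(distances))]

# Hypothetical data: 3 dates, 4 x 5 pixels, 2 bands.
rng = np.random.default_rng(0)
series = rng.random((3, 4, 5, 2))
X_stack = stack_features(series)
X_stats = temporal_statistics(series)

# Hypothetical reference profiles of different lengths than the query.
train = [np.array([0.2, 0.2, 0.2, 0.2]), np.array([0.2, 0.6, 0.8, 0.3])]
labels = ["stable", "seasonal"]
pred = nn_dtw_predict(np.array([0.2, 0.2, 0.7, 0.8, 0.3]), train, labels)
```

The feature matrices from the first two approaches would be passed to any standard classifier. The nested loop in `dtw_distance` runs once per pair of sequences, which shows why nearest-neighbor search with DTW becomes expensive when applied to every pixel of an image.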
The shortcomings of these techniques make them less competitive in comparison with approaches capable of learning high-level features end-to-end like deep learning. In computer vision, convolutional neural networks (CNNs) are very good at analyzing images by learning high-level spatial features [
19]. CNNs have been successfully applied to remote sensing data for tasks like scene classification [
20], semantic segmentation [
21], and object detection [
22]. Ma et al. [
23] offer a thorough review of deep learning applications in remote sensing. Encoder-decoder architectures are the current state-of-the-art semantic segmentation models owing to advantages such as easier implementation, higher accuracy, and lower computational complexity [
24]. Numerous studies explored the use of encoder-decoder architectures for the mapping of urban environments. Audebert et al. [
25] trained a variant of the SegNet [
26] architecture over an urban area using aerial images and other heterogeneous sensors. Zhang et al. [
27] distinguished between buildings, roads, water, and vegetation by implementing a multi-scale deep learning model based on the well-known UNet architecture. Fu et al. [
22] performed detailed urban land use mapping by improving FCN-8s, introducing atrous convolutions, and refining the model’s output using conditional random fields. This approach outperformed standard object-based image analysis techniques.
Although CNNs are well suited to processing 2D data, such as images, they are not able to handle sequential data like multi-temporal images or time series. Conversely, recurrent neural networks, and particularly LSTMs, can process such data. However, they are not adequate for our purposes, given that land use mapping aims at producing one classification map per series [
28]. Various studies investigated the use of CNNs for processing multi-temporal data. Benedetti et al. [
29] proposed a deep learning framework for the fusion of multi-temporal Sentinel-2 imagery and very high-resolution satellite imagery to distinguish between various vegetation classes. Mauro et al. [
30] used a multi-layer perceptron (MLP) and a CNN to classify land cover from multi-temporal Landsat images. Nogueira et al. [
31] implemented a multi-branch CNN for vegetation mapping and confirmed its superiority to a traditional CNN operating on a temporal stack of images. Pelletier et al. [
28] proposed a temporal CNN for crop classification where convolutions are applied in the temporal domain. The literature review reveals that few studies have investigated the temporal dimension of high-resolution satellite imagery such as Sentinel-2 (10 m) for mapping UF. The use of very high-resolution images is often preferred given the complexity of urban environments and the great spectral overlap.
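To make the idea of a convolution applied in the temporal domain concrete, the following sketch slides a single filter along the time axis of one pixel's series. It is a simplified illustration under our own assumptions, not the architecture of the cited work: the function name, the ReLU activation, and the numerical values are hypothetical, and a real network would learn many such filters.

```python
import numpy as np

def temporal_conv(series, kernel, bias=0.0):
    """Valid 1-D convolution (cross-correlation, as in deep learning) along time.

    series: (T, B) reflectance of one pixel over T dates and B bands.
    kernel: (K, B) weights of a single filter.
    Returns a (T - K + 1,) activation vector after ReLU.
    """
    t = series.shape[0]
    k = kernel.shape[0]
    out = np.array([np.sum(series[i:i + k] * kernel) + bias for i in range(t - k + 1)])
    return np.maximum(out, 0.0)

# Hypothetical single-band series and a filter that responds to increases over time.
series = np.array([[0.1], [0.1], [0.4], [0.7], [0.3]])
kernel = np.array([[-1.0], [1.0]])
activation = temporal_conv(series, kernel)
```

With this hand-set filter, the activation is non-zero only between dates where the signal increases, which shows how a temporal filter can respond to change rather than to a single-date value.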
In this context, the goal of this study is two-fold: (i) to investigate whether UF mapping benefits from integrating temporal information; and (ii) to explore the use of CNNs on multi-temporal satellite imagery. More precisely, this study evaluates the potential contribution of the temporal dimension of freely available high-resolution satellite imagery such as Sentinel-2 to mapping four urban classes: continuous UF, discontinuous UF, industrial or tertiary facilities, and the road and rail network. These classes differ in vegetation and building densities and are thus well suited for this study. In this paper, we work on a multi-temporal sampling of a time series as a first step. The proposed framework can later be extended to a full time series. This approach, which exploits the temporal dimension to compensate for the spatial resolution, explores the use of CNNs to obtain an end-to-end learning framework that does not require feature engineering (the process of using knowledge of the data to create features that make machine learning algorithms work), a task that is often time-consuming for experts. The paper is structured in five sections.
Section 2 describes the materials and methods. First, a comparison baseline from mono-temporal images is presented by analyzing the performance of four encoder-decoder architectures for mapping the urban footprint. Second, based on the best model, we perform UF mapping and design a multi-temporal classifier by altering the previous model. Results (
Section 3) are then detailed and used to assess the benefits of using temporal data to discriminate UF. A discussion and a conclusion are then provided in
Section 4 and
Section 5, respectively.
4. Discussion
These experiments evaluate whether multi-temporal imagery can help discriminate between UF classes with different occurrences of urban structures, vegetation, and bare surfaces. First, we highlight that a mono-temporal model performed systematically better on data covering a different geographical zone than on data acquired at a different time. This suggests that seasonal variations prevented models from recognizing certain UF classes. Moreover, when testing each class’s model on a multi-temporal test set, performance decreased as we moved away from the training set’s sensing time. These two findings are complementary: a mono-temporal model not only does worse on multi-temporal data, but its performance also continues to degrade as we move away from the training period. The proposed multi-temporal model did better than its mono-temporal counterpart. Namely, using three test images from 2018 simultaneously proved more effective than exploiting them separately. The quantitative and visual improvements were more pronounced for the class of industrial facilities than for continuous UF. This is consistent with our general hypothesis: a multi-temporal approach can be fruitful for differentiating between classes with temporal characteristics. All of these results are consistent with each other, as they all point to the temporal domain as a potentially useful input for urban land use classification. All things considered, a reasonable conclusion is that learning new features using multi-temporal data and deep learning can improve the classification accuracy for both continuous UF and industrial facilities. Another sound conclusion is that Sentinel-2 images, despite a spatial resolution of 10 m, can be used for urban land use mapping.
Nevertheless, the contribution of the temporal dimension to UF mapping requires further testing with larger datasets and longer time series. With this in mind, we think the positive results achieved in this research are worth building upon in future studies to further investigate the temporal domain as an input to urban land use classification models. Conducting experiments on a wider range of UF classes, using longer time series, and using larger datasets all represent good avenues for future research.