Neural Network-Based Urban Change Monitoring with Deep-Temporal Multispectral and SAR Remote Sensing Data

Abstract: Remote-sensing-driven urban change detection has been studied in many ways for decades for a wide field of applications, such as understanding socio-economic impacts, identifying new settlements, or analyzing trends of urban sprawl. Such analyses are usually carried out manually by selecting high-quality samples, which binds them to small-scale scenarios that are either temporally limited or of low spatial or temporal resolution. We propose a fully automated method that uses a large amount of available remote sensing observations for a selected period without the need to manually select samples. This enables continuous urban monitoring in a fully automated process. Furthermore, we combine multispectral optical and synthetic aperture radar (SAR) data from two eras as two mission pairs with synthetic labeling to train a neural network for detecting urban changes and activities. As pairs, we consider European Remote Sensing (ERS-1/2) and Landsat 5 Thematic Mapper (TM) for 1991–2011 and Sentinel 1 and 2 for 2017–2021. For every era, we use three different urban sites (Limassol, Rotterdam, and Liège) of at least 500 km² each, and deep observation time series with hundreds and up to over a thousand samples. These sites were selected to represent different challenges for training a common neural network due to atmospheric effects, different geographies, and observation coverage. We train one model for each of the two eras using synthetic but noisy labels, which are created automatically by combining state-of-the-art methods, without any existing ground truth data. To combine the benefits of both remote sensing types, the network models are ensembles of optical- and SAR-specialized sub-networks. We study the sensitivity to urban and impervious changes and the contribution of optical and SAR data to the overall solution. Our implementation and trained models are publicly available to enable others to utilize fully automated continuous urban monitoring.


Introduction
Digital remote-sensing-driven land cover and land use (LCLU) change detection with multi-temporal satellite data has been studied for decades, dating back to the early 1970s [1]. Thanks to the repeat cycles and sensor technologies of satellite-based remote sensing, it is sought after for urban change detection to understand socio-economic impacts and effects, identify new settlements, monitor urban growth, or analyze trends of urban sprawl, to name just a few applications. For these fields, knowledge of how urban and human-made structures change over time is crucial and, depending on the exact application, can require high spatio-temporal resolution. In addition, automatic processing is desired if large-scale analysis is required to monitor larger areas over a longer time and to make the methods and results available to a wider range of interdisciplinary users, allowing remote-sensing-based urban change detection and monitoring to thrive.
In our work, we propose an ensemble neural network architecture called the Ensemble of Recurrent Convolutional Neural Networks for Deep Remote Sensing (ERCNN-DRS).
The change detection is tuned for impervious and built-up land covers, and the network is trained in two variants to detect urban-related changes and activities for two eras. The first era spans 1991–2011 with the use of the ERS-1/2 and Landsat 5 TM missions as a pair. The second era is 2017–2021 with the Sentinel 1 and 2 pair. For the former, we consider analysis time windows of one year (1y). Six months (6m) are used for the latter due to the higher temporal resolution between observations. We avoid the need for manual ground truth generation by automatically creating synthetic but noisy labels using a combination of state-of-the-art methods. These labels are used for training the ERCNN-DRS in a supervised setting. For every era, we combine SAR and optical missions as pairs to combine their remote sensing advantages, such as distinguishing impervious land covers with optical data and achieving higher temporal resolution with reduced atmospheric disturbance with SAR observations. Our method does not require the manual selection of observation samples. We use all available observations as long as they can be properly co-registered and do not contain corrupt information. This can usually be ensured with little effort, which paves the way for automatic processing. For optical data, we remove clouds using information already available directly at the optical data sources. Our solution aims at a large degree of automation to either train or use the neural network models for different Areas of Interest (AoI) without the need for manual ground truth generation or selecting high-quality observation samples. We use two AoIs (Rotterdam and Limassol) for training and validation and one (Liège) for testing. Each AoI covers at least 500 km² with hundreds to thousands of combined SAR and optical observations (so-called deep-temporal).
We consider 1y/6m windows for detecting urban changes that can be continuously carried out over longer periods, which we refer to as continuous urban change and activity monitoring with deep-temporal remote sensing data.
ERCNN-DRS is trained to detect urban changes with a varying number of optical or SAR observations of different qualities due to atmospheric effects, different solar irradiance and seasons, and partial updates due to out-of-swath conditions. The required analysis windows were intentionally kept short to allow flexibility in selecting time ranges of interest. Our goal is to offer a trained network that can be applied with minimal effort to an AoI for a short period of 1y for ERS-1/2 and Landsat 5 TM, or 6m for the Sentinel 1 and 2 mission pair.
Our work first summarizes related work in Section 2 and differentiates it from our study. In Section 3, we describe the data types used and their pre-processing to retrieve trainable data. Our proposed neural network architecture is discussed in Section 4, including the hyper-parameters for the two different eras and the training and validation methodology. Section 5 contains two ablation studies to document the performance of the trained models with respect to sensitivity to changes and the impact of SAR and optical data types. Further, we show some examples from the Liège test site for both eras. A final discussion with current limitations and further improvements is carried out in Section 6. Section 7 concludes our work.

Related Work
The detection of urban changes has become one of the most important applications in remote sensing science. Urban objects include man-made structures such as airports, buildings, or roads. Urban structure types (UST) were proposed by Lehner et al. [2], and we use the term urban as an alias for the built-up types defined there. These artificial structures vary dynamically in time and space, and it is therefore challenging to detect them accurately. The public availability of various high-resolution remote sensing data sets has provided a large number of possibilities for extracting urban information.
As mentioned earlier, (urban) change detection and monitoring with remote sensing data has been of interest for over half a century. Table 1 gives a non-exhaustive overview of other works with different properties, from which we differentiate our work. Much more work has been carried out [3]; we name only some as reference. Our work is listed at the bottom of the table for direct comparison. We consider four important properties: change detection method, missions used, number of observations, and scaling of the method.
Throughout the last decade, change detection for remote sensing data has increasingly embraced neural networks, including deep neural networks, with the Multilayer Perceptron (MLP) [4,5] as one of their simplest representatives. Deep learning is able to fulfill the needs of remote sensing image processing, such as classification or segmentation [6]. Before the shift to deep neural networks, change detection was driven by Principal Component Analysis (PCA) [7], Decision Trees [8], Change Vector Analysis (CVA) [9], Difference or Ratio methods [10], Markov chains [11], and other methods [12]. Some neural network-based approaches existed before 2010. However, with the recent advent of deep neural networks, many architectures have been invented that result in better performance and methods designed for processing large amounts of data. This is fueled by the availability of dedicated and optimized hardware (e.g., GPU accelerators), large storage capacities (big data), and easy-to-use software solutions, such as the training frameworks Tensorflow (https://www.tensorflow.org/; accessed on 29 July 2021) or Pytorch (https://pytorch.org/; accessed on 29 July 2021). Many studies in recent years have shown interest in deep learning technologies [13][14][15]. Deep neural networks also move feature extraction and mapping into the neural network architecture and its training process, which is easier to handle and can lead to unprecedented performance [16].
Another property is related to the mission types. Some works consider either optical multispectral or SAR data solely to identify changes. The most commonly used satellite data for analyzing urban changes are Landsat images [14,17]. The Landsat program has monitored the Earth's surface since 1972 and provided data from three different instruments: the Thematic Mapper (TM), the Enhanced Thematic Mapper Plus (ETM+), and the Multispectral Scanner System (MSS). However, Landsat data, like data from other multispectral optical sensors, are extremely sensitive to weather conditions and the time of day, which requires further pre-processing. Research on urban changes from Landsat data has attracted the interest of many scholars. For example, a novel approach for extracting high-frequency urban land cover changes by using clear-sky Landsat observations was developed by Jing et al. [18].
For optical multispectral observations, a large range of index methods exists, which are able to extract urban and impervious areas. To recognize urban objects, scientists have traditionally relied on built-up indices [19,20]. Built-up indices are often fast and accurate to extract urban information, and therefore, their usage is conventional. Several studies have used different indices to extract urban areas [21]. These are, for example, the Normalized Difference Built-up Index (NDBI) [22], the Normalized Difference Impervious Surface Index (NDISI) [23], the Enhanced Built-up and Bareness Index (EBBI) [24], Combinational Biophysical Composition Index (CBCI) [25], or the Enhanced Normalized Difference Impervious Surface Index (ENDISI) [26].
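To illustrate how such index methods work, the NDBI from [22] can be computed directly from the SWIR1 and NIR bands. The following is a minimal sketch with a hypothetical `ndbi` helper; the `eps` guard against division by zero is our addition and not part of the original index definition:

```python
import numpy as np

def ndbi(swir1: np.ndarray, nir: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Normalized Difference Built-up Index: (SWIR1 - NIR) / (SWIR1 + NIR).

    Built-up pixels tend to reflect more in SWIR1 than in NIR, yielding
    positive values; vegetation yields negative values. `eps` guards
    against division by zero for fully masked (zero) pixels.
    """
    swir1 = np.asarray(swir1, dtype=np.float64)
    nir = np.asarray(nir, dtype=np.float64)
    return (swir1 - nir) / (swir1 + nir + eps)

# Toy example: one built-up pixel (SWIR1 > NIR), one vegetated pixel (NIR > SWIR1)
print(ndbi(np.array([0.30, 0.10]), np.array([0.20, 0.40])))
```

The other indices listed above follow the same normalized-difference pattern with different band combinations and correction terms.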
Even for SAR data, some works propose methods to identify urban changes [27,28], which demonstrates that the information needed to designate urban changes is inherent in SAR observations. The advantages of microwave remote sensing over optical sensing are its weather and atmospheric independence, the possibility to monitor during night time, and the high sensitivity to fine changes [29]. The potential of radar data has been investigated much less due to the often difficult data pre-processing phase [30]. Other works [8,31,32] have identified positive synergies by combining optical and SAR observations. Nevertheless, different studies have shown that radar technology is an excellent approach for extracting urban information. For instance, Ansari et al. [33] used multi-resolution texture features from polarimetric SAR images to analyze urban changes. In another work, a simple and fast method was developed to identify structural changes in urban environments using SAR imagery [34]. Urban mapping and delineation with the help of SAR data were presented in other studies as well [30,35,36].

The number of observations used for identifying urban change is the third property. Most of the existing work uses only a few high-quality observations (i.e., image pairs) to indicate changes. These samples are carefully identified, usually from the same season to mitigate the effect of seasonal changes and, if optical, with best-case atmospheric conditions. Some studies attempt to use more observations, which are either averaged [14] or used for statistical analysis [8]. Both methods are less sensitive to changes, as either the sampling interval is too coarse or the averaging filters suppress changes. In addition, benchmark data sets, such as OSCD [37] or DeepGlobe 2018 [44], exist. Despite their high quality, these share the same limitation of including only a few multi-temporal samples from a single sensor.
Access to observations and high-quality products is also made available through platforms such as the Thematic Exploitation Platform (TEP, https://eo4society.esa.int/platform-services/; accessed on 29 July 2021) or the Data and Information Access Services (DIAS, https://earsc.org/dias-comparison/; accessed on 29 July 2021). The latter provides access to Copernicus (https://www.copernicus.eu/; accessed on 29 July 2021) and Sentinel data, information products from six operational services of Copernicus, and cloud-based tools for analysis. The former also provides the Urban TEP [45] (U-TEP), which focuses on effective and efficient urban management and the safeguarding of livable cities. It offers its users high-performance data access and processing, analysis and visualization services, a customized workplace for communication, and the means to share algorithms, data, and services. The platform offers a number of globally processed data products, such as the World Settlement Footprint 2015 [46] (WSF) derived from Landsat 8 and Sentinel 1 data, the Global Urban Footprint [47] (GUF) derived from TerraSAR-X and TanDEM-X radar data, and the TimeScan product derived from Landsat and Sentinel 1 and 2 imagery, providing a cloud-free representation of spectral and temporal characteristics of the land surface. The products on these platforms are of high quality but show similar temporal limitations as the aforementioned data and studies.
The last property is the applicability and demonstration of large-scale urban change detection. Studies commonly demonstrate their applicability over smaller areas, limited time frames, or at lower temporal resolution. In addition, their requirement for high-quality data induces time-consuming and labor-intensive pre-processing that is difficult to automate. Exceptions are works such as [8,14,18], which attempt larger-scale demonstrations. Furthermore, the transfer to different remote sensing missions is underrepresented, resulting in the use of either different missions of the same type (optical or SAR) or only one optical/SAR pair. The applicability to a broader range of sensors is not directly answered.
Compared to other studies, our work does not require high-quality samples, level 2 products, averaging over larger periods, or observations of matching seasons. Instead, we intend to use most of the available data as long as it can be properly co-registered, which gives access to higher-frequency information. Furthermore, we extend the idea of combining SAR and multispectral optical observations in an ensemble neural network for overall better performance. We also demonstrate the flexibility of our solution across different sensor generations over two eras spanning almost 25 years combined. This is an important prerequisite for long-term urban monitoring, as it is not limited to a single remote sensing mission. Finally, since no ground truth data with the required temporal resolution, mission pairs, and temporal extent is available, we create synthetic noisy labels to train the neural network.

Data Pre-Processing
For the two eras, we use two different remote sensing mission pairs to include both optical multispectral and SAR observations: ERS-1/2 and Landsat 5 TM for 1991-2011, and Sentinel 1 and 2 for 2017-2021. Three different AoIs are used, as shown in Table 2, with different resolutions (m/pixel) across every era due to sensor capabilities. The selection of the AoIs, the data acquisition and pre-processing methods, and time series preparation we applied are described below.

Selection of AoIs
The three AoIs (Rotterdam, Liège, and Limassol; see Table 2) were selected due to their different weather and seasonal conditions. Rotterdam is subject to high cloud coverage [48] and snow/ice, whereas Limassol has a lower probability of these but also offers fewer observations over time. In addition, the sites lie in different geographies and show different characteristics of impervious areas. As we discuss in Section 3.3.2, they require different parameters for creating the synthetic labels. Furthermore, the inclusion of two different eras gives access to a large time horizon and demonstrates the utility of our approach for different remote sensing missions. We intentionally use L1 products due to their wide temporal and spatial availability; our solution, however, is not bound to L1 products alone and can encompass other, higher quality levels. We used different reference systems due to our existing processing pipelines, which is not crucial for our study.

Figure 1 shows the available observations over time, within each possible observation window of one year or six months, as we discuss later. Usually, remote sensing missions are configured with fixed repeat cycles so that the sampling of observational data follows certain patterns. However, we consider larger time ranges where transition phases of missions and processing changes are common. In addition, further observations are discarded due to failed co-registration caused by atmospheric effects, such as clouds, dust, and water vapor, or by anomalies, such as banding, detector failure, and data loss. As a result, usable observations are irregularly spaced, and hence, every fixed observation window can include a different number of observations. These variations add an additional level of complexity when it comes to detecting changes. In some cases, the number of available observations can even drop below a meaningful threshold.
We combine both optical and SAR observations to provide enough information for detecting changes, even for shorter time frames of up to one year.
Liège was selected as the testing AoI, with different observation frequencies and a higher likelihood of cloud coverage degrading the optical information quality. The same window properties as for the other two AoIs are used (as described later).

Data Acquisition and Pre-Processing
The four different mission products used were obtained from three different data sources. Table 3 summarizes the sources, the products used, and the number of observations available. The number of observations removed due to failed co-registration or corrupted data is given in parentheses. In the following, we briefly describe the pre-processing applied to the products.

ESA's Earth Online SAR_IMP_1P (https://earth.esa.int/web/guest/data-access/browse-data-products/-/article/sar-precision-image-product-1477; accessed on 29 July 2021) product is used for ERS-1/2 observations. They are single-polarization (VV) backscatter intensity in Ground Range Detected (GRD) format with a pixel size of 12.5 × 12.5 m and a number of looks of three. Orbits are separated into ascending and descending directions throughout the processing pipeline. Further processing is applied to geometrically terrain correct and co-register with the SRTM 1Sec HGT Digital Elevation Model (DEM) using the ESA SNAP (https://step.esa.int/main/; accessed on 29 July 2021) toolset, with a final clipping to the AoI and re-projection to EPSG:3035. All transformations use nearest neighbor interpolation to retain the values of the original data. Values are retained as Digital Numbers (DN) and stored as a single-precision floating point data type.

Landsat 5 TM
Landsat 5 TM L4-5 TM C1 L1 data are obtained from USGS (https://earthexplorer.usgs.gov/; accessed on 29 July 2021) with all seven bands and the 16-bit quality band (BQA). We apply top-of-atmosphere (TOA) reflectance correction, co-registration, and a final re-projection to EPSG:3035 with nearest neighbor interpolation. We use the 16-bit quality band to filter out clouds, cloud shadows, snow/ice, and over-saturation (value 0b10101011100 for masking clouds, cloud shadows, and snow/ice with medium confidence, and masking saturation for five or more bands). All bands are used and super-sampled with nearest neighbor interpolation to 12.5 × 12.5 m to fit the resolution of ERS-1/2. Values are stored as single-precision floating point numbers and normalized to the range (0.0, 1.0).
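The quality-band filtering described above can be sketched with NumPy bitwise operations. This is a minimal illustration with a hypothetical `invalid_pixel_mask` helper; matching the exact medium-confidence semantics would require testing the per-category bit pairs of the Collection 1 BQA layout rather than a single bitwise AND:

```python
import numpy as np

# Bit pattern from the text (Landsat Collection 1 BQA): clouds, cloud
# shadows and snow/ice at medium confidence, plus radiometric saturation.
MASK_BITS = 0b10101011100

def invalid_pixel_mask(bqa: np.ndarray) -> np.ndarray:
    """Return True for pixels where any of the selected quality bits are set."""
    return (bqa & MASK_BITS) != 0

bqa = np.array([[0b00000000000, 0b00000010000],
                [0b10000000000, 0b00000000010]], dtype=np.uint16)
print(invalid_pixel_mask(bqa))
```

Pixels flagged by this mask would then be excluded from further processing (or substituted, as described in Section 3.3.1).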

Time Series Preparation
The observations are further processed by temporal stacking, assembling, tiling, and windowing into periods used for the change detection with the assignment of a label. Hence, each window is a data sample containing spatial and temporal constraint observations. These steps are carried out for every era separately. Figure 2 shows the basic steps, which we explain in detail in the following.
You can find all data in our GitHub repository: https://github.com/It4innovations/ERCNN-DRS_urban_change_monitoring

Figure 2. Two-step procedure for data preparation: from the series of available observations to the windows used for training/inference.

Temporal Stacking, Assembling, and Tiling
Optical observations $o_t \in \mathbb{R}^{w \times h \times (b_{OPT}+1)}$ at time $t$ over an AoI with width $w$ and height $h$ contain the available optical multispectral bands $b_{OPT}$ plus one mask band. The mask band indicates invalid pixels, e.g., due to cloud coverage, over-saturation, snow/ice, or being outside of swath. SAR observations are separated by orbit direction, such that $s^{asc}_t \in \mathbb{R}^{w \times h \times (b_{SAR}+1)}$ are the intensity backscatter polarizations ($b_{SAR}$) plus a mask band for the ascending orbit direction. SAR observations from descending orbit directions $s^{dsc}_t$ are defined accordingly. For each era, the aforementioned data are temporally stacked independently to form a time series of observations. This results in $o \in \mathbb{R}^{t_{OPT} \times w \times h \times b_{OPT}}$, $s^{asc} \in \mathbb{R}^{t^{asc}_{SAR} \times w \times h \times b_{SAR}}$, and $s^{dsc} \in \mathbb{R}^{t^{dsc}_{SAR} \times w \times h \times b_{SAR}}$, with a total of $t_{OPT}$, $t^{asc}_{SAR}$, and $t^{dsc}_{SAR}$ observations, respectively. The masks denote the pixels of each observation that are to be ignored; their values are substituted by those of the previous observation (or zero if not yet available). This guarantees full coverage of the AoI at every observation with acceptable approximations. However, the more pixels are masked out over a longer period (e.g., due to persistent cloud conditions or over-saturation), the more the uncertainty increases.
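The mask-based substitution with the previous observation amounts to a temporal forward fill. A minimal sketch with a hypothetical helper name, assuming NumPy arrays:

```python
import numpy as np

def stack_with_forward_fill(observations, masks):
    """Temporally stack observations (each w x h x b); pixels flagged as
    invalid by the corresponding mask (True = invalid) are substituted by
    the previous observation, or by zero if no prior observation exists."""
    filled = []
    previous = np.zeros_like(observations[0])
    for obs, mask in zip(observations, masks):
        # Broadcast the 2D mask over the band axis and pick old vs. new values.
        current = np.where(mask[..., None], previous, obs)
        filled.append(current)
        previous = current
    return np.stack(filled)  # shape: (t, w, h, b)
```

Note that a pixel masked in many consecutive observations keeps carrying an increasingly stale value, which is exactly the growing uncertainty mentioned above.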
The remote sensing kinds are combined into one time series, with every observation aggregating optical and ascending and descending SAR data. This introduces two risks: at an observation time point, inconsistencies between the remote sensing kinds can occur, and the amount of redundant data is increased (i.e., a non-sparse representation). However, urban changes are expected to span a longer time frame, which mitigates these effects. We also expect the neural network to become tolerant to smaller inconsistencies. Even though redundant data are introduced, this simplifies ingestion during the training phase, and no further online processing is required. All three time series $o$, $s^{asc}$, and $s^{dsc}$ are assembled into one time series $a \in \mathbb{R}^{t_{all} \times w \times h \times b}$, with $t_{all} = t_{OPT} + t^{asc}_{SAR} + t^{dsc}_{SAR}$ as the accumulated observations of all three kinds and $b = b_{OPT} + 2\,b_{SAR}$ as the accumulated bands. We assume that no two observations happen simultaneously; otherwise, the total number of accumulated observations would be lower, as some individual observations would share the same time stamp.
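The assembly of the three independently stacked series into one time-ordered series can be sketched as a tagged merge by timestamp. This is a simplified illustration with a hypothetical `assemble` helper; it assumes, as stated above, that no two observations share a timestamp:

```python
def assemble(optical, sar_asc, sar_dsc):
    """Merge three lists of (timestamp, observation) into one time-ordered
    series. Each entry keeps its kind so that its bands can later be placed
    into the correct slice of the combined band axis b = b_OPT + 2 * b_SAR.
    Assumes no two observations share the same timestamp."""
    tagged = ([(t, "OPT", o) for t, o in optical]
              + [(t, "SAR_ASC", s) for t, s in sar_asc]
              + [(t, "SAR_DSC", s) for t, s in sar_dsc])
    return sorted(tagged, key=lambda entry: entry[0])
```

In the actual pipeline, each merged entry would carry the forward-filled bands of all three kinds, yielding the non-sparse representation discussed above.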
The accumulated time series of each AoI is tiled into $x \times y$ pixel blocks with consecutive, non-overlapping tiles. Only full tiles are considered, with no padding applied. In the following, we continue with each tile, so that we can use $a_{i,j} \in \mathbb{R}^{t_{all} \times x \times y \times b}$ with $x$ and $y$ as the spatial tile extent and $i \in \mathbb{N}: i \in [0, \lfloor w/x \rfloor)$, $j \in \mathbb{N}: j \in [0, \lfloor h/y \rfloor)$. One aggregated observation at time $t$ for a tile $(i, j)$ is then $a^t_{i,j} \in \mathbb{R}^{x \times y \times b}$, which contains all bands, optical and SAR. Our method does not depend on the tile size, which can be smaller or larger, as long as the applied filters of the neural network do not exceed these tiles. Based on memory considerations of our training platform, a size of $x = y = 32$ was selected and is used in the following to facilitate reading.
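The non-overlapping tiling can be sketched as follows, with a hypothetical `tile` helper operating on the assembled (t, w, h, b) array:

```python
import numpy as np

def tile(a, x=32, y=32):
    """Cut a (t, w, h, b) time series into consecutive, non-overlapping
    (t, x, y, b) tiles keyed by (i, j); partial border tiles are dropped."""
    t, w, h, b = a.shape
    return {(i, j): a[:, i * x:(i + 1) * x, j * y:(j + 1) * y, :]
            for i in range(w // x) for j in range(h // y)}

tiles = tile(np.zeros((5, 70, 40, 3)))
print(len(tiles))  # 70//32 = 2 tiles along w, 40//32 = 1 along h
```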

Windowing and Labeling
The tiled time series is further windowed into partially overlapping periods to provide all data for the change detection. Each window tile $w^t_{i,j}$, starting at time $t$, spans a fixed observation period $\Delta$. We use observation periods of one year for ERS-1/2 and Landsat 5 TM and six months for Sentinel 1 and 2. The higher repeat cycles of the Sentinel missions (ca. 6/5 days for Sentinel 1 and 2 versus ca. 35/16 days for ERS-1/2 and Landsat 5 TM) allow shorter periods while still covering the usual time frame to detect urban changes. We receive $w^t_{i,j} \in \mathbb{R}^{l \times 32 \times 32 \times b}$ with $l$ as the number of observations within the observation period. As window sizes are based on calendar months, the number of observations $l \in \mathbb{N}: l \in [\omega, \Omega]$ varies, with $\omega$ as the lower limit and $\Omega$ as the upper limit. Windows with fewer than $\omega$ observations are less likely to contain enough information for change detection and are discarded. In Figure 1, these discarded windows are in the highlighted (red/orange) time frames. The upper bound $\Omega$ is added to limit the maximum memory footprint of each window and is defined to hold all observations of every possible window. This is controlled with a parameter $\delta$ that specifies the minimum difference between two observations. For the ERS-1/2 and Landsat 5 TM era, it is set to $\delta = 1$ second/observation, which considers all observations (one second is the highest temporal resolution for observation data; we assume no two observations occur at the same second). For Sentinel 1 and 2, it is set to $\delta = 2$ days/observation. For the latter, this significantly reduces the potential maximum window size of over 300 (see Figure 1) down to a maximum of $6~\text{months} / (2~\text{days/observation}) = 92$ observations. This is acceptable because urban changes occur at a much larger time scale, and the originally high number of observations per window is a result of nearby paths/rows where only a subset of the AoI is updated.
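The windowing with the lower limit ω and the thinning parameter δ can be sketched as follows. The helper name `select_window` and the return convention (None for discarded windows) are our illustration, not the paper's implementation:

```python
from datetime import datetime, timedelta

def select_window(timestamps, start, period, delta, omega):
    """Pick observation indices falling into [start, start + period),
    thinned so that two kept observations are at least `delta` apart;
    windows with fewer than `omega` observations are discarded (None)."""
    kept, last = [], None
    for idx, ts in enumerate(timestamps):
        if not (start <= ts < start + period):
            continue
        if last is None or ts - last >= delta:
            kept.append(idx)
            last = ts
    return kept if len(kept) >= omega else None
```

With δ = 1 second every observation is kept, while δ = 2 days thins dense Sentinel acquisitions, matching the bounds discussed above.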
These partial observations are nevertheless largely taken into account thanks to the temporal stacking; only overlapping swaths are discarded.
In the following, we consider windows of the form $w^t_{i,j} \in \mathbb{R}^{\Omega \times 32 \times 32 \times b}$. As the number $l$ of observations varies for each window, we pad the remaining $\Omega - l$ observations. As we show in Section 4.1, we apply masking in the neural network to discard these padded observations. Moreover, for each window, a synthetic label is created for the training sites Rotterdam and Limassol to provide a guideline for the supervised training. A synthetic label $\hat{y}^t_{i,j} \in \mathbb{R}^{32 \times 32}$ is created for each tile and window at time $t$ as

$\hat{y}^t_{i,j} = s_{cm}(w^t_{i,j}[b_{SAR}]) \cdot o_{cm}(w^{t-\Delta}_{i,j}[b_{OPT}], w^{t+\Delta}_{i,j}[b_{OPT}])$,

i.e., the element-wise product of a SAR change map and an optical change map. The operator $[\cdot]$ returns the subset of bands from the window; hence, $w^t_{i,j}[b_{OPT}] \in \mathbb{R}^{\Omega \times 32 \times 32 \times b_{OPT}} \subset \mathbb{R}^{\Omega \times 32 \times 32 \times b}$, and accordingly for the SAR bands in ascending and descending orbit directions. The variable $\Delta$ is the observation period for the windows and is fixed for each era. Hence, the windows $w^{t-\Delta}_{i,j}$ and $w^{t+\Delta}_{i,j}$ denote the windows of the immediate preceding and subsequent observation periods relative to the window starting at time $t$.
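The padding of each window to the fixed capacity Ω, together with a validity mask for the network-side masking, can be sketched as follows (hypothetical `pad_window` helper):

```python
import numpy as np

def pad_window(window, capacity):
    """Zero-pad a window of l observations (l, 32, 32, b) up to `capacity`
    (the upper bound Omega) and return a per-observation validity mask so
    that the padded entries can be masked out inside the network."""
    l = window.shape[0]
    padded = np.zeros((capacity,) + window.shape[1:], dtype=window.dtype)
    padded[:l] = window
    valid = np.zeros(capacity, dtype=bool)
    valid[:l] = True
    return padded, valid
```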
The function $s_{cm}$ creates a change map using the omnibus test statistic, as proposed by Conradsen et al. [12], herein denoted as $OMNI(\cdot, \sigma, \eta)$. Its parameters $\sigma$ and $\eta$ are the significance and the ENL of the multi-looked SAR data, respectively. The ENL has been rounded to the closest integer. The significance was identified empirically (see Table 4), and the SAR-related changes from the omnibus test statistic are averaged over the two orbit directions.

Table 4. Parameters of our proposed method used for creating the windows $w^t_{i,j}$ and synthetic labels $\hat{y}^t_{i,j}$ as needed for the training and validation processes.

It is important to note that we did not clip the DN values of ERS-1/2 or Sentinel 1 and used single-precision floating point numbers, as recommended for the omnibus test statistic in [49]:

Saturating the extreme pixel values [...] is unfortunate in our situation where the dominating changes detected are due precisely to those strongly reflecting human-made objects [...]. Pixels that are saturated at several timepoints may not be detected as change pixels, which is potentially wrong. The best way to handle this is to store the data as floats [...].
An optical change map is created with $o_{cm}$, which takes the preceding and subsequent observation periods. The operation $(\cdot)^{mean_t} \in \mathbb{R}^{32 \times 32 \times b_{OPT}}$ is the arithmetic mean over time, applied individually to each band. A clipped index $ENDISI_c$ is computed thereof, which is an extended version of the Enhanced Normalized Difference Impervious Surfaces Index (ENDISI) that was originally proposed in [50]. According to a recent study [21]: "The ENDISI algorithm [...] effectively distinguished impervious surface from the background. It not only eliminated the impacts of water bodies, but also suppressed the impacts of most bare soil and arid land". $ENDISI_c \in \mathbb{R}^{32 \times 32}$ is constructed as

$ENDISI_c = \min(\max(ENDISI_e, 0), 1)$

with

$ENDISI_e = \gamma \, (ENDISI + \alpha - 2\,MNDWI^{+} - MNDBI^{+})$.

For better readability, we omit the function arguments and consider $ENDISI_c$ and the subsequent functions to operate element-wise on each pixel. The function $ENDISI_c$ clips the values of $ENDISI_e$ to $(0, 1)$, with values closer to one denoting impervious areas that are urban due to their built-up/impervious reflectance characteristic. Values closer to zero indicate non-urban pixels. The parameter $\alpha$ shifts the values of ENDISI depending on the site to separate urban and non-urban pixels. The shifting values were empirically identified once for each site (see Table 4). Our values are in accordance with the study [21], where dark impervious surfaces (IS) have the lowest ENDISI values, between ca. $(-0.5, 0)$. To increase the magnitude of differences, $ENDISI_e$ is scaled by the factor $\gamma$, which ensures larger change detection values.
$ENDISI_e$ further incorporates MNDWI [51] and MNDBI [52]. The use of MNDWI reduces false positives from water bodies, while MNDBI mitigates effects of farming, where bare soil occurs frequently and can appear as an impervious area. The use of MNDWI also removes false positives due to incomplete cloud removal, where clouds are erroneously detected as urban by ENDISI. Only their respective positive values are used, as $MNDWI^{+}$ and $MNDBI^{+}$, since they denote water or farming areas, and values below zero appear for other types of land cover, including impervious areas. The index $MNDWI^{+}$ is subtracted with a factor of two because ENDISI tends to assign higher values to cloud fragments. The index methods ENDISI, MNDBI, and MNDWI follow their original definitions [50–52]. The individual bands $\rho$ are assigned as shown in Table 5. The correction factor $\beta$, which is defined by ENDISI, is determined for the entire AoI for every observation instead of individually for every tile. This avoids biases due to limited spatial context, such as impervious surfaces covering most of a tile. The operation $(\cdot)^{mean} \in \mathbb{R}$ hence computes the spatial mean over all pixels.
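Assuming the clipped index combines the terms exactly as described above (shift by α, subtract 2·MNDWI⁺ and MNDBI⁺, scale by γ, clip to (0, 1)), a sketch could look like this. Note that this combination is our reading of the text, not a verbatim reproduction of the paper's equations:

```python
import numpy as np

def endisi_c(endisi, mndwi, mndbi, alpha, gamma):
    """Clipped impervious-surface index per the description in the text:
    shift ENDISI by the site parameter alpha, subtract only the positive
    parts of MNDWI (weighted by two, to suppress cloud fragments) and
    MNDBI (to suppress bare soil), scale by gamma, and clip to (0, 1)."""
    mndwi_pos = np.maximum(mndwi, 0.0)   # only positive values flag water
    mndbi_pos = np.maximum(mndbi, 0.0)   # only positive values flag farming
    endisi_e = gamma * (endisi + alpha - 2.0 * mndwi_pos - mndbi_pos)
    return np.clip(endisi_e, 0.0, 1.0)
```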

Expected Noise of Synthetic Labels
The generation of synthetic labels on unfiltered data results in outliers. We clarify the expectations and risks of our proposed approach to later guide the discussion of the obtained results.
The multiplication of the SAR and optical change maps serves multiple purposes. First, the omnibus test statistic-derived changes do not distinguish between urban and non-urban pixels. Hence, the multiplication with optically detected urban changes reduces the false positives, e.g., by detecting changes due to farming or in water bodies. Second, due to atmospheric conditions, outliers are quite common in optical data (e.g., clouds, cloud shadows, changes of irradiation). This can lead to false positives where urbanized pixels are falsely detected. An advantage of the omnibus test statistic-based change detection is that urban areas receive a lower count of detected changes compared to non-urban pixels due to rigid structures. The multiplication further reduces false positives thereof.
On the other hand, our method relies on a certain number of optical observations to distinguish between urban and non-urban pixels. False negatives occur when there are not enough observations, either due to too few SAR observations, which leads to a low multiplication factor through s_cm, or due to missing optical data caused by cloud coverage or incorrect cloud removal. We observed the latter for some parts of the Rotterdam AoI, where the cloud mask removed some buildings with highly reflective white roofs. False-positive scenarios are also possible where urban pixels are erroneously detected optically in non-urban areas. This can happen for bare and arid land that appears impervious for a longer time due to droughts. Depending on the SAR-detected changes, these can result in falsely labeled urban changes. To mitigate these errors, we use optical means over the periods before and after and ensure that at least ω observations are available per window. Windows with fewer than ω observations are discarded for training. The use of means, however, causes changes from previous and subsequent observation periods to be attributed to some degree to the window of interest. As the network under training shall only see the window of interest, it will not be able to exactly match the label but will find a general approximate solution.
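The label combination and the ω threshold described above can be sketched as follows. This is a minimal illustration with assumed names (s_cm as a normalized SAR change count, OMEGA_MIN as ω) rather than the authors' pipeline:

```python
import numpy as np

OMEGA_MIN = 10  # assumed minimum observations per window (omega in the text)

def synthetic_label(sar_change_count, optical_urban_change,
                    n_obs_before, n_obs_after):
    """Multiply the SAR-derived change count with the optically detected
    urban changes so that non-urban SAR changes are suppressed. Windows
    with too few observations are discarded (returns None)."""
    if n_obs_before < OMEGA_MIN or n_obs_after < OMEGA_MIN:
        return None  # not enough observations in preceding/subsequent window
    # Normalize the per-pixel SAR change count to [0, 1] (assumption)
    s_cm = sar_change_count / max(sar_change_count.max(), 1)
    # Negative optical values indicate no urban change and are clipped
    return s_cm * np.clip(optical_urban_change, 0.0, None)
```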
Our framework is not tied to the described labeling method. Other indices, change detection, and even prediction methods could be used. We, however, focus on long-range changes as they happen in urban development. Shorter-term urban changes will be more dependent on the quality of optical observations.

Proposed Neural Network
In the following, we describe our proposed neural network architecture to detect urban changes using the data sets discussed earlier for the two eras of interest. Furthermore, we elaborate on the training and validation methodology using the synthetic labels to train the neural network.

Architecture
Semantic labeling of remote sensing images with high spatial resolution is of great interest to a wide range of researchers. This task consists of assigning a category to every pixel in an image. Pixel classification in urban environments is still a challenging task because extracting spatially consistent information requires quality data processing and high model accuracy. Diakogiannis et al. [53] named two main reasons why urban features are hard to extract/classify: (i) high variability within a class, as objects with similar spectral reflectance can belong to completely different classes; (ii) interactions between urban features (e.g., the 3D character of buildings, shadows).
Over the last decade, Deep Neural Network (DNN) research has developed rapidly, and DNN models have become very popular as classifiers [54]. The main advantage of these models is their high performance and modular design, which allows their architectures to be easily adapted to specific needs. In our work, we apply classification principles using a Convolutional Neural Network (CNN) [55] in combination with Recurrent Neural Network (RNN) layers [56] to add spatio-temporal awareness. This awareness is needed to identify changes of a certain type, which are urban/impervious-related changes in our case.
We propose a novel neural network architecture that is a concatenation ensemble of two sub-models: the Ensemble of Recurrent Convolutional Neural Networks for Deep Remote Sensing (ERCNN-DRS). The sub-models are designed to ingest multispectral optical and SAR data. Figure 3 shows the ensemble architecture. We denote the network as a function f_NN returning a prediction y^t_{i,j} := f_NN(w^t_{i,j}) of urban changes that happened in window w^t_{i,j}, or, with the data types stated separately for clarity, y^t_{i,j} := f_NN(OPT w^t_{i,j}, SAR w^t_{i,j}). As we discuss in Section 5, both sub-models can also be used individually when only one kind of data type is available, but with reduced performance.
The sub-models are composed of different convolutional layers [55] and one RNN layer (i.e., a convolutional LSTM), which is the core of our architecture. The convolutional LSTM (ConvLSTM) layer learns the deep-temporal dependencies between the observations. This layer type was introduced by Shi et al. [56] and solves the problem of applying Long Short-Term Memory (LSTM) layers [57] to spatial data. Instead of flattening the input, it retains the spatial information by accepting multi-dimensional inputs that are fed to the gates using further 3 × 3 convolutions. The ConvLSTM layer is fed from one (SAR) or two (optical) 3 × 3 convolutional layers. These apply the same set of filters to all observations of a window, which is referred to as time-distributed convolution. As a result, these layers tend to learn filters that retrieve useful features for the following RNN layer from all or most of the observations in a window. In addition, they can imperfectly compensate for co-registration errors caused by nearest-neighbor interpolation, which can shift pixels by one in each direction. After the RNN layers, the activations are concatenated to represent time-invariant features. The concatenation of sub-networks can lead to better performance, as demonstrated in [58,59], and natively facilitates the use of different data types. Two more 3 × 3 convolutions apply spatial filters after the time series is mapped to a set of representative samples. In the last layer, the activations are fed to a 1 × 1 convolutional layer to correct each pixel of the prediction independently without using any spatial context. All 3 × 3 convolutions apply batch normalization and subsequently a ReLU activation function [60] on their outputs. The ConvLSTM layers instead apply tanh activations and a hard sigmoid for the recurrent step. In addition, they apply a dropout to the linear transformations of the recurrent states [61].
This is a regularization method used to minimize the impact of frequent window sizes biasing the predictions. The 1 × 1 layer does not apply batch normalization and uses a sigmoid activation instead of ReLU to generate a heat map with values in the range of (0, 1) for every pixel of the input tile. Values close to one denote likely urban changes, whereas values closer to zero indicate no urban-related change. All layers also train biases in addition to their kernels.
Throughout the network, each convolutional layer produces a different number of filters, depending on the complexity of the input data. The model for the ERS-1/2 and Landsat 5 TM era requires fewer filters due to fewer bands, whereas the Sentinel 1 and 2 era provides more bands containing more information. In addition, the higher spatial resolutions show finer spatial structures for which more filters can be useful. We intentionally kept the convolutional kernels at a size of 3 × 3. Given the amount of data available per observation period, larger kernel sizes would result in an approximately quadratic increase in the number of parameters of the neural network. We traded learning complex spatial structures for learning deeper temporal correlations. This is especially important as changes over time should carry more weight than identifying urban micro-structures. As such, our trained network aims at applicability to a wider range of AoIs, as it learns less about the structure of urban (impervious) surfaces, which can differ between geographies. The limitation of spatial filters helps to avoid memorization of certain urban structures, such as reflective characteristics of rooftops, parking lots, or roads. Instead, general characteristics of impervious surfaces should be learned through their changes over time. Table 6 summarizes the important parameters used for the respective layers.
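The layer stack described above can be sketched with Keras. This is a simplified reading of Figure 3 and the text, with assumed filter counts, dropout rate, and band numbers; batch normalization before the ReLU of the 3 × 3 convolutions and the masking of padded observations are omitted for brevity:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_ercnn_drs(omega=32, tile=32, opt_bands=10, sar_bands=2, f=16):
    # Optical sub-network: two time-distributed 3x3 convolutions feed a ConvLSTM
    opt_in = layers.Input(shape=(omega, tile, tile, opt_bands), name="optical")
    x = layers.TimeDistributed(layers.Conv2D(f, 3, padding="same", activation="relu"))(opt_in)
    x = layers.TimeDistributed(layers.Conv2D(f, 3, padding="same", activation="relu"))(x)
    x = layers.ConvLSTM2D(f, 3, padding="same", recurrent_dropout=0.2)(x)

    # SAR sub-network: one time-distributed 3x3 convolution feeds a ConvLSTM
    sar_in = layers.Input(shape=(omega, tile, tile, sar_bands), name="sar")
    y = layers.TimeDistributed(layers.Conv2D(f, 3, padding="same", activation="relu"))(sar_in)
    y = layers.ConvLSTM2D(f, 3, padding="same", recurrent_dropout=0.2)(y)

    # Concatenate time-invariant features, apply two more 3x3 convolutions,
    # then a 1x1 convolution with sigmoid producing the per-pixel heat map
    z = layers.Concatenate()([x, y])
    z = layers.Conv2D(f, 3, padding="same", activation="relu")(z)
    z = layers.Conv2D(f, 3, padding="same", activation="relu")(z)
    out = layers.Conv2D(1, 1, activation="sigmoid")(z)
    return tf.keras.Model(inputs=[opt_in, sar_in], outputs=out)
```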
Due to the deep-temporal series with varying numbers of observations, we apply masking to the ConvLSTM layers to indicate that only the first l observations should be considered. This differs for each observation window, resulting in an individual mask. There is an upper bound Ω on the number of observations in the model architecture. To avoid losing observations, this bound needs to be set to the largest number of observations expected (see Table 4). As such, only the first l observations are used, and the remaining Ω − l observations, which were padded and do not contain valid information, are masked out. Masking also helps to exploit the capacity of the network more efficiently, as the network does not need to learn how to identify the valid observations and ignore the padded ones. Since l is known a priori, it does not need to be learned.
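Padding windows to the upper bound Ω and deriving the corresponding mask for the first l valid observations might look as follows (names and the default bound are assumptions):

```python
import numpy as np

OMEGA = 32  # assumed model-wide upper bound on observations per window

def pad_and_mask(window, omega=OMEGA):
    """Pad a window of l observations, shaped (l, H, W, B), to omega
    observations and return the boolean mask of valid entries."""
    l = window.shape[0]
    assert l <= omega, "window exceeds the model's observation bound"
    pad = np.zeros((omega - l,) + window.shape[1:], dtype=window.dtype)
    padded = np.concatenate([window, pad], axis=0)
    mask = np.arange(omega) < l  # True for the first l (valid) observations
    return padded, mask
```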

Training and Validation Methodology
The aforementioned data are separated into training and validation sets. For each era, the AoIs Rotterdam and Limassol were used during the training process. For the training data set, tiles are drawn from {(i, j) ∈ N × N | j ≡ i/2 (mod 2)}, and for the validation data set, they are drawn from {(i, j) ∈ N × N | j ≡ (5 − i)/2 (mod 4)}. This amounts to approximately one-fourth of all tiles for training and one-eighth for validation, with both sets being disjoint. The tile selections ensure non-adjacent tiles in the respective sets to avoid recurring and redundant patterns along the borders.
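Under one reading of these selection sets (i/2 and (5 − i)/2 must be integers in N, so training tiles require even i and validation tiles odd i), the stated fractions of roughly one-fourth and one-eighth, as well as disjointness, hold:

```python
def is_training_tile(i, j):
    # {(i, j) | j = i/2 (mod 2)}: only even i yield an integer i/2
    return i % 2 == 0 and j % 2 == (i // 2) % 2

def is_validation_tile(i, j):
    # {(i, j) | j = (5 - i)/2 (mod 4)}: only odd i yield an integer (5 - i)/2
    return (5 - i) % 2 == 0 and j % 4 == ((5 - i) // 2) % 4
```

On a 100 × 100 tile grid, this selects 2500 training and 1250 validation tiles, i.e., one-fourth and one-eighth, with no tile in both sets and no two selected tiles adjacent within a set.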
For each tile assigned to the training or validation data set, a subset of its windows w^t_{i,j} is selected. Leaving aside corner cases where preceding (w^{t−∆}_{i,j}) or subsequent (w^{t+∆}_{i,j}) windows with more than ω observations are required for the synthetic labels, a window can be created for every observation that occurred at t. We select a random subset using a uniform distribution with p = 1/10. Therefore, we expect approximately one-tenth of the overall windows to be selected. The random uniform distribution function uses different seeds for every tile for reproducibility and to avoid all tiles of the same AoI having the same window starting times, which the neural network could learn. This is especially a concern as only two different AoIs are used for training, of which all tiles would have the same sequence of observations within each AoI. If windows are identical for tiles of the same AoI, the sequence can be learned, but applying it to other AoIs with different sequences would result in bad prediction performance. This random selection also reduces redundancies, as subsequent windows would almost fully overlap except for one or a few observations.
The selection of the loss function for the supervised learning has a significant effect on the results, as it guides the network towards a solution during training. Several (region-based) loss functions for semantic segmentation and prediction problems have been developed previously; a few examples are the Dice, Focal Tversky, and Intersection over Union (IoU) losses [62]. In a recent study, Diakogiannis et al. [53] demonstrated a new loss that extends the Tanimoto coefficient, the so-called Tanimoto with complement coefficient. It shows improved gradients of first- and second-order derivatives. As our labels usually contain small values, a loss function that provides stronger and higher-quality gradients is preferred. In addition, the Tanimoto coefficient is an ideal candidate for segmentation tasks as it shares similarities with the Dice coefficient and IoU and ignores true negatives. Such true negatives would dominate the number of pixels, as changes are less likely than no-changes. This imbalance would lead the network to bias towards the no-change background class.
The Tanimoto coefficient T is based on the Sorenson-Dice coefficient. T and its complement T̄ are defined as

T(ŷ, y) = Σ ŷ·y / (Σ (ŷ² + y²) − Σ ŷ·y)

T̄(ŷ, y) = T(1 − ŷ, 1 − y)

where the sums run over all pixels, and ŷ and y are the ground truth label and prediction, respectively. We use L(ŷ, y) := 1 − (T(ŷ, y) + T̄(ŷ, y))/2, i.e., the Tanimoto with complement, as the loss function.
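A minimal NumPy sketch of the Tanimoto with complement loss follows; the small epsilon for stabilizing the all-zero case is our implementation choice, not part of the original definition:

```python
import numpy as np

def tanimoto(p, l, eps=1e-7):
    # Tanimoto coefficient based on the Sorenson-Dice coefficient
    num = np.sum(p * l)
    return (num + eps) / (np.sum(p ** 2 + l ** 2) - num + eps)

def tanimoto_with_complement_loss(y_pred, y_true):
    # Average of the Tanimoto coefficient and its complement
    # (Diakogiannis et al.), used as 1 minus that average
    t = 0.5 * (tanimoto(y_pred, y_true)
               + tanimoto(1.0 - y_pred, 1.0 - y_true))
    return 1.0 - t
```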
Other studies [63][64][65][66] have found and discussed the adverse effects of tiling with respect to increased errors, especially towards the tile borders. As a result, we applied the Tanimoto with complement loss only to the center 30 × 30 pixels of every tile. The use of 3 × 3 convolutional filters in the neural network requires, for every pixel, the information of its directly adjacent pixel neighbors. This is only partially possible for pixels at the tile boundaries, i.e., at the edges and even more pronounced at the corners. As a result, the prediction quality at the boundaries would be degraded, and some capacity of the network would be spent on such corner cases. Fully convolutional neural networks are not bound to any tile size: already trained fully convolutional networks can use varying tile sizes during inference, irrespective of the tile size used during training. This enables an entire scene (AoI) to be used as one tile, as long as it fits into memory, to speed up prediction and avoid the boundary effects. We only constrain the tile size to 32 × 32 pixels to avoid excessive memory requirements during training.

Results
The neural network proposed in Section 4 was trained with different hyper-parameters and training data from the Rotterdam and Limassol AoIs for each era, as previously described in Section 3. In the following, we describe our training environment and configurations. Due to the underlying methods and training objective, a detailed evaluation of prediction correctness is hard to accomplish: there are no second sources that provide enough data with similar sampling intervals and sufficient spatial resolutions to examine the performance of the neural network predictions. Instead, we carry out two ablation studies to evaluate the impact of urban changes over time and the contribution of optical and SAR observations. Finally, we show some example predictions for both eras for which we found historic Google Earth™ very high-resolution imagery that explains (most of) the changes.

Training
We trained two models, one for every era, with the hyper-parameters shown in Table 6. The training was carried out on one compute node with four NVIDIA Tesla V100-SXM2 GPUs on the Barbora cluster at IT4Innovations (https://docs.it4i.cz/barbora/hardware-overview/; accessed on 29 July 2021). Tensorflow 2.4 with Keras and Horovod [67] was used for data-parallel training, utilizing all four GPUs. We used a batch size of 32 per GPU (effective batch size of 128), a learning rate of 0.004 (fixed over all epochs), and the distributed synchronous mini-batch stochastic gradient descent (Sync-SGD) solver [68,69] with a momentum of 0.8.
The development of the losses over the training epochs for the two models is shown in Figure 4. The ERS-1/2 and Landsat 5 TM model was trained for 126 epochs with a walltime of 40:23 h, and the Sentinel 1 and 2 model for 115 epochs with a walltime of 23:30 h. For the former, we used the trained weights at epoch 86, and for the latter at epoch 98 (both zero-based), which showed the best validation loss before overfitting started to become visible. Figure 4 marks the start of overfitting with a green dashed vertical line. It also shows only one training session but with validation data sets split and assigned individually to each of the four GPUs (shuffling was only applied to training data). Hence, the validation losses give insight into how well balanced the found solution is. In the ERS-1/2 and Landsat 5 TM model, it is visible that one loss (val_1) shows higher values, which likely stems from the Limassol AoI, where there is a slight imbalance regarding the (validation) tiles containing water areas. Such tiles contain larger errors. We have rerun the training sessions multiple times and repeatedly observed the same behavior. This was also true during some limited hyper-parameter searches; no exhaustive search has yet been applied. The data sizes are 58 GB (training) and 7 GB (validation) for the ERS-1/2 and Landsat 5 TM model, and 187 GB (training) and 24 GB (validation) for Sentinel 1 and 2. All are used as GZIP-compressed TFRecord files from the training sites Rotterdam and Limassol. To reduce the resource needs, only one-fourth of the validation data was used, and the majority of tiles covering sea for the Limassol AoI have been removed.
In addition, we trained only a subset of the ensemble by using either of the two sub-networks, OPT f_NN or SAR f_NN. These correspond to the ensemble network with one input data type removed, e.g., SAR f_NN does not consider any optical input. All hyper-parameters were left unchanged. Figure 4 shows the results for OPT f_NN in the middle and SAR f_NN on the right of each era. It is visible that neither reaches the same low loss as the ensemble configuration. In particular, SAR f_NN shows degraded performance. This likely stems from less information to distinguish urban-related changes and suggests that the overall solution is mostly driven by optical data. However, in the ensemble case, SAR can add additional information, leading to lower generalized losses. This is more visible for the Sentinel 1 and 2 era than for ERS-1/2 and Landsat 5 TM but is consistent across both eras. Furthermore, the training requires fewer epochs to reach the best general solution. We have to stress, however, that the hyper-parameters were not changed. A change of batch size or SGD momentum could result in fewer epochs, or more filters could result in an overall improved performance for the non-ensemble cases as well. The ERS-1/2 and Landsat 5 TM model reaches lower loss values. This is a result of the lower spatial resolution of the remote sensing sensors: the observations contain less detailed structures that are easier to learn. The model for Sentinel 1 and 2, for example, is able to identify finer-grained urban structures, such as roads and mid-sized buildings, whereas Landsat 5 is only able to detect very large urban structures, such as highways or logistics buildings, at best.
We would like to briefly elaborate on the aforementioned parameter Ω, the tile size, and the selection of their values. In theory, neither would need to be artificially bounded, leaving the maximum possible number of observations for a given window period to be defined only by the remote sensing mission parameters. Furthermore, any AoI could be considered as one single tile. In practice, however, we limit both parameters to accommodate the neural network model and data in the available (GPU) memory; in our case, 4 × 16 GB. Similar to other works, we consider the use of tiles (also called patches) "under a given GPU memory constraint" [66], where "the input to the network [is] fixed to a given size [...] mainly influenced by considerations of available memory size on the GPU" [70], and we also regard tiling as a compromise to enable deeper neural networks and data rather than a principled method. Similarly, we treat the limitation of observations per window with Ω as a compromise as well. Tiling can increase the prediction errors due to reduced spatial information, whereas the use of Ω and the related step size δ contribute to a reduction of temporal information.

Ablation Studies
We attempt two ablation studies to understand the quality of responses to urban changes and the contribution of each data type, optical and SAR, to the overall solution of the proposed ensemble network. Both document the performance of the neural network using deep time series and help to set the expectations of the predictions.

Urban Change Sensitivity
To analyze the sensitivity of the model predictions to urban changes over time, we study them using an artificial but well-controlled test. In this test, we select one 32 × 32 pixel tile for one window that does not cover an urban area and substitute the 16 × 16 pixel center with a tile that contains an urban area. More specifically, we use a tile fully covered by forest as the reference and one from the airport of Liège for the center. The resulting perturbed tile is called the hybrid, *w^t_{i,j}. The window covering the forest was selected so that it contained the maximum number of observations among all windows. Otherwise, windows with fewer observations would not allow studying the substitution effect towards Ω but only over the first l observations. Figure 5 shows the substitutions for both eras. The substitution in a confined space in the center is deliberate, as CNNs tend to show performance decay towards the edges of tiles, as mentioned earlier.
By that, we minimize such errors. Both locations are from the same time period, but the urban replacement is constant. As such, changes occur only at the beginning and end of the substitution length. The substitution is controlled via two parameters, ζ and ψ, for the start and length of the substitution, respectively. Both parameters are evaluated with single increments, and a prediction of the perturbed tile is computed. It is expected that the insertion of the urban sub-tile results in a positive prediction independent of when the substitution happens (ζ). The longer the substitution persists, the stronger the predicted change should be, falling off after l/2. Of the predictions, we compute the mean over the 30 × 30 pixel center of the tile as a continuous representation of sensitivity. Figure 6 shows the bivariate means and their intensity values depending on ζ and ψ; for better readability, we only used every second value of ψ. The curves with ψ = 0 (i.e., no substitution) show constant values close to zero. Forests are neither urban nor impervious, and there is no such change in the reference tile for the window considered. As a result, every substitution with an impervious area should show values significantly larger than zero. What is more, we only consider sizes of ψ that fit into the window. Therefore, growing substitution lengths ψ delimit the start of substitution ζ towards the end. With the substitutions carried out, different effects become visible. First, changes close to the beginning of the windows show a higher change intensity and tail off the later the changes happen. These later scopes represent a dead spot of the predictions, as any change there is attenuated. The reason for that behavior is the training with different periods of varying numbers of observations. With ω, a minimum number of required observations was set. Periods with more observations are less likely and, hence, result in a lower weighting for the final predictions.
At the end, the intensities rise again, putting more weight on changes at the end of the windows. The ERS-1/2 and Landsat 5 TM model follows that pattern very clearly. For the Sentinel 1 and 2 model, the response is a bit more complex but with the same underlying pattern. The differences are mostly the strong attenuation towards the beginning of the window for smaller ψ, with more pronounced plateauing up to the first ω observations. Second, short changes are attenuated, and the mean intensities increase with growing ψ. This likely stems from errors in the observations where short outliers need more (temporal) filtering. For example, co-registration errors can lead to some pixels not being detected as a change unless there are enough observations (with the same error) increasing confidence in the change.
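The substitution controlled by ζ and ψ can be expressed compactly. Tile and center sizes follow the text, while the array layout (observations, height, width, bands) and the function name are assumptions:

```python
import numpy as np

def hybridize(ref_window, urban_window, zeta, psi):
    """Replace the 16x16 center of a 32x32 reference tile with the urban
    tile for the psi observations starting at index zeta."""
    w = ref_window.copy()
    c0, c1 = 8, 24  # bounds of the 16x16 center within a 32x32 tile
    w[zeta:zeta + psi, c0:c1, c0:c1, :] = \
        urban_window[zeta:zeta + psi, c0:c1, c0:c1, :]
    return w
```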

We hence can conclude that only changes that occur for a longer time and close to the beginning of the observation period (or right at the end) are well detected. The former is acceptable as urban changes typically occur over a longer time. The latter can be compensated by using repeated and overlapping incremental predictions to evenly cover a selected period.

Correlations of SAR and Optical Predictions
Since an ensemble of networks is trained, we analyze the contribution of these subnetworks to the overall solution. More specifically, we try to understand how much their ingested data types (SAR or optical) influence the result. We create two synthetic configurations; for each, we hold one data type constant while the other remains unchanged. The constant data type is the mean over the entire period to avoid outliers biasing the predictions under analysis. As a result, one configuration contains a fixed mean of all SAR observations (ascending and descending), while optical is unchanged, denoted as OPT y^t_{i,j}. The second variation is vice versa, using the averaged optical observations while the SAR observations are unchanged, denoted as SAR y^t_{i,j}. As the network ensemble applies most of the spatial and all temporal operations individually for each data type, we expect to receive results close to each other if both data types contribute equally to the overall prediction. However, as we apply three more convolutions, two with ReLU and one with sigmoid activation functions after the concatenation layer, the two perturbed subnetwork outputs are not directly comparable quantitatively. To correlate such non-linear continuous variables, we compute Kendall's τ coefficient [71] as a measure of closeness over all windows and tiles. Tiles are aggregated to one large prediction for each window, OPT y^t and SAR y^t, respectively. We apply two correlations, of OPT y^t and SAR y^t against y^t. The latter are the tile-aggregated predictions with both data types unchanged. As before, we ignore the border pixels and only consider the center 30 × 30 pixels of every tile.
As shown in Figure 7, both perturbations show a high positive correlation, with the correlations of OPT y^t being the highest. This indicates that it is harder for the network to identify changes as urban or impervious related from SAR data alone (SAR y^t). Other work [27,28] demonstrated that urban areas can be detected to some extent from SAR observations only. Similarly, the SAR sub-network was also able to identify urban changes to a good degree, as indicated by the positive correlations. Seasonal changes are visible, resulting in recurring patterns of correlation values for SAR y^t. The higher correlation to optical data suggests that the overall prediction y^t relies more on optical data, which are affected by seasonal changes. As SAR is less affected by seasonal changes, these patterns become visible, but the correlation remains good. The outlier in the ERS-1/2 and Landsat 5 TM correlations between 2002 and 2003 stems from the lack of optical data in that period (see Figure 1). As expected, the SAR observations almost fully correlate with (i.e., contribute to) the final solution due to the loss of optical data. On a related note, the positive correlation between the two data types and, in turn, of the sub-network ensembles justifies the concatenation we applied. This is suitable for the task at hand instead of applying an averaging ensemble, which would be preferable for uncorrelated cases [59].
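Computing Kendall's τ between a perturbed and the unchanged prediction over the tile centers can be sketched with SciPy; the array layout and function name are assumptions:

```python
import numpy as np
from scipy.stats import kendalltau

def window_correlation(pred_perturbed, pred_full, border=1):
    """Kendall's tau between a perturbed prediction (e.g., with the SAR
    input held at its mean) and the full prediction, ignoring a 1-pixel
    border of each tile (i.e., the center 30x30 of a 32x32 tile)."""
    a = pred_perturbed[..., border:-border, border:-border].ravel()
    b = pred_full[..., border:-border, border:-border].ravel()
    tau, _ = kendalltau(a, b)
    return tau
```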

Qualitative Analysis
In the following, we present some examples of predictions for both trained models. We use as predictions y^max_{i,j} := (f_NN(w^t_{i,j}))_max_t, with (·)_max_t ∈ R^(32×32) denoting the maximum over every window's prediction throughout each era. We would like to note that this promotes outliers of the predictions. Other filtering methods could be applied in post-processing to mitigate these effects, such as computing the mean over time, (·)_mean_t, to only highlight changes over a longer period (i.e., constant background activities), or the difference (·)_max_t − (·)_mean_t to suppress these background activities. Figure 8 shows example predictions y^max_{i,j} (i.-iv.), each for one tile, using the model trained for ERS-1/2 and Landsat 5 TM. As mentioned above, the prediction is the maximum over all windows, with a total of 843 windows whose starting times range from 7 February 1994 to 19 September 2009. Due to our processing pipeline, we do not consider the first and last window period (1y/6m) of each era. The predictions are shown in the top middle. We used Google Earth's historic very high-resolution imagery (bottom row) to document the changes and show observations before and after the respective change. Due to the limitations of this resource, only a few samples are available, and not all changes can be investigated thoroughly. However, we selected samples that explain the detected changes concisely. In addition, we show the true color images from Landsat 5 TM, at an approximately similar time as the Google Earth imagery, as a comparison and as a scale for the resolution the network works with. The very high-resolution imagery at the bottom is overlaid with a red mask derived from y^max_{i,j} to help compare the images with the predictions. Examples i. and iii. from Figure 8 show changes over a larger area with (de-)constructions of buildings. In example i., buildings were deconstructed. A series of predictions of this example is shown in v. Example ii.
shows one hotspot (bottom right), which was used for longer ongoing construction purposes in the surrounding area. It also shows the limitation of sensor resolution, as the new road on the top right is not detected. Example iii. shows exceptionally high prediction values due to a longer construction period. In iv., the construction of a parking lot next to an existing building is shown. In the same example, the changes found on the top left of the prediction were due to SAR-detected changes in the time frame from approximately 2003-08 to 2003-10. We were not able to obtain other data to further investigate the root cause. As for the ERS-1/2 and Landsat 5 TM era, we provide four example predictions y^max_{i,j} (i.-iv.) in Figure 9, but with the model trained for Sentinel 1 and 2 data. For Liège, a total of 381 windows were used, ranging from 3 July 2017 to 15 April 2020. All intensities in y^max_{i,j} are higher due to the temporal change characteristics, as discussed in Section 5.2.1. It is immediately visible that the higher sensor resolutions provide finer detailed changes, e.g., the construction of a single house in i. Larger constructions and deconstructions in ii. and iii. are also detected, with hotspots where changes happened more often. Example ii. shows a hotspot on the bottom left that is caused by a (temporary) construction started in 2020-09. We were not able to obtain other data for further investigation. In v., this is shown with a series of predictions. In iii., the building on the top right is not fully covered by the predictions because the center of the building is unchanged, and co-registration variances lead to "smearing" of this unchanged area. Example iv. shows the beginning of open pit mining with hotspots where most activities happened. Example v. shows a series of predictions from ii.
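The temporal aggregations used here, (·)_max_t, (·)_mean_t, and their difference, reduce to simple per-pixel reductions over the window axis; this sketch assumes predictions stacked as (windows, height, width):

```python
import numpy as np

def aggregate(predictions, mode="max"):
    """Aggregate per-window predictions of shape (T, H, W) over time.
    'max' promotes any detected change (and outliers); 'mean' highlights
    constant background activities; 'max-mean' suppresses them."""
    if mode == "max":
        return predictions.max(axis=0)
    if mode == "mean":
        return predictions.mean(axis=0)
    if mode == "max-mean":
        return predictions.max(axis=0) - predictions.mean(axis=0)
    raise ValueError(f"unknown mode: {mode}")
```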

Discussion
In the following, we discuss limitations of our proposed method and elaborate on improvements to mitigate or solve these limitations.

Current Limitations
Our proposed method uses the available observations as long as they can be co-registered. No further quality assessment was applied. On one hand, this enables a highly automated process of training and inference, but it also introduces uncertainties and errors. The GUF [47] and WSF [46] clearly show the efforts needed for high-quality detection of urban areas. However, they only provide a single snapshot due to the high amount of resources needed. Our solution is designed for easy access to urban change detection for monitoring up to the immediate past, but with higher uncertainties. We have inspected the predictions of the Liège AoI for both eras and discuss our observations with respect to the limitations of our method in the following. Due to the higher available resolution of Sentinel 1 and 2, we show examples only for this era; the limitations are the same for the ERS-1/2 and Landsat 5 TM era.
The remote sensing data are sampled at irregular intervals and with varying quality due to clouds, shadows, irradiance changes, co-registration errors, atmospheric effects, snow/ice, or saturation. Hence, not all changes on the ground can be detected at the intensity they should be, and sampling gaps exist due to no or sparse observations. Furthermore, the imbalance of remote sensing types can cause degraded prediction quality, as shown in Figure 7 for Liège in the years 2002-2003. All of this is inherent to the original data quality and could only be mitigated with down-sampling (e.g., moving window averaging), as done in other work [14]. This, however, incurs a loss of temporal resolution and dampens high-frequency changes.
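The moving-window averaging mentioned as a possible mitigation could be sketched as follows; the window length k is an assumed parameter, and the stack layout (time, height, width) is an assumption:

```python
import numpy as np

def moving_average(series, k=5):
    """Temporal moving-window average over a (T, H, W) observation stack.
    Reduces outliers at the cost of temporal resolution: the output has
    T - k + 1 time steps."""
    kernel = np.ones(k) / k
    return np.apply_along_axis(
        lambda t: np.convolve(t, kernel, mode="valid"), 0, series)
```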
Artefacts of clouds, snow, and ice can become visible in the predictions, as shown in Figure 10i. This example shows incomplete cloud removal where the dense core of the cloud reflects a spectrum that appears as a sudden impervious/urban change. We have observed these cases very rarely, and light clouds do not seem to affect the predictions noticeably, as can also be seen in the same example.
Bright or reflective surfaces (optical) or metal objects (SAR) can cause outliers and saturation at the sensors. This leads to large gradients that are erroneously detected as changes. For SAR, Nielsen et al. [49] state that these "outliers are usually due to strong reflections from sharp angles on antennas and other human-made objects". The same is true for optical observations of highly reflective objects (e.g., with white roofs or solar panels), which can change their reflectance values quite significantly due to weather, irradiance, and atmospheric effects. Figure 10ii,iii show optical saturation caused by the reflection of sunlight from solar panels and the combination of optical and SAR saturation caused by metallic roofs, respectively. These effects could be mitigated by filtering over time to reduce outliers, which trades off temporal resolution and results in a less sensitive change detection. Alternatively, saturated pixels can be masked, which we applied to Landsat 5 TM data using the 16-bit BQA band. Such a band is not directly available for the other data sources.
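As an illustration of the masking step, the sketch below flags saturated pixels via bitwise tests on a 16-bit QA band. The bit positions are an assumption on our part (in the Landsat Collection 1 Level-1 BQA layout, bits 2-3 encode radiometric saturation) and should be verified against the product documentation; the function name is hypothetical:

```python
import numpy as np

# Assumed bit layout (Landsat Collection 1 Level-1 BQA): bits 2-3 encode
# radiometric saturation. Verify against the QA band documentation before use.
SATURATION_BITS = 0b1100

def mask_saturated(band, bqa, saturation_bits=SATURATION_BITS):
    """Return a copy of `band` with saturated pixels replaced by NaN.

    band: reflectance/DN array of shape (H, W).
    bqa:  16-bit quality assessment band of the same shape.
    """
    saturated = (bqa & saturation_bits) != 0
    out = band.astype(float).copy()
    out[saturated] = np.nan
    return out
```

Masked (NaN) pixels can then be excluded from gradient computation so that sensor saturation no longer masquerades as urban change.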
Distinguishing uncultivated farmland from impervious and construction areas can at times be difficult, especially due to phenological changes in combination with dry periods. We noticed that, especially during summer (e.g., the drought in central Europe in June-July 2018), some fields are more likely to be detected as urban changes due to their arid appearance. Figure 10iv. shows this effect and a false detection where arid land was present only for a short time, which resulted in a significant response. We also experimented with other index methods that are more resilient to such effects, such as the Combinational Biophysical Composition Index (CBCI) [25] or a different shift parameter α for the used ENDISI, which resulted in less sensitivity to early construction stages. We also applied the EBBI [24] to Landsat 5 TM with good results, but it cannot be applied to Sentinel 2, which lacks the required thermal band. As a consequence, open pit mining is recognized as urban change (see Figure 9iv). In our proposed solution, we decided to accept these false positives in favor of a more rigorous detection. Farming and mining are usually a series of constant changes over time and can be filtered out if longer observation time series exist. As an example, the mean prediction intensities of the year(s) of interest can be subtracted from the maximum of all predictions, as mentioned earlier in Section 5.3.
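The max-minus-mean filtering mentioned above can be sketched in a few lines of numpy (the function name is our own; the paper only describes the idea, not an implementation). Pixels with constant background activity such as farming, mining, or traffic have a high mean close to their maximum, so the difference suppresses them while one-off changes survive:

```python
import numpy as np

def background_subtracted_change(preds):
    """Suppress constant background activity in change predictions.

    preds: stack of window predictions of shape (N, H, W) covering the
    year(s) of interest. For a pixel with steady activity, the mean is
    close to the maximum and the difference is near zero; a one-off
    change keeps a strong positive response.
    """
    return np.clip(preds.max(axis=0) - preds.mean(axis=0), 0.0, None)
```

For example, a pixel predicted at 0.8 in every window is suppressed to 0, while a pixel with a single 0.9 spike among otherwise quiet windows retains most of its response.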
The networks were trained to highlight urban-related activities (i.e., changes involving impervious areas). The sensors' coarse resolution does not allow identifying mobile objects (such as cars, trucks, planes, or containers) or distinguishing them from construction changes due to the mixed pixel problem [72]. However, such objects do change the reflectance of individual pixels. As a result, the network detects changes of urban activity in general. Urban centers with high activity show higher intensities than rural areas. The same is true for highly utilized roads and harbors, where activities are more likely to be identified, as shown in Figure 10v.,vi., respectively. Since these changes are typically also constant background activities, the same mitigation strategy as for farmland can be applied.

The mixed pixel problem and the limits of the sensor resolutions also bound the detection capabilities of our method. For example, in Figure 9ii, a new road on the top right was not detected: the road is spread across multiple mixed pixels, in which the surrounding vegetation dilutes the impervious reflectance characteristics. As investigated earlier, our solution has a bias towards optical-based detection of changes. Even though ERS-1/2 has a higher resolution than Landsat 5 TM, it did not contribute to finding such small changes.

Further Improvements
For inference, we have retained the same tile sizes of 32 × 32 and only used the center 30 × 30 pixels for the predictions. As our proposed neural network architecture is fully convolutional, any tile size can be used without retraining. In the ideal case, the entire AoI can be predicted at once. However, the deep time series induces significant memory requirements, and depending on the inference platform used, practical limits on tile sizes exist. In the work of Isensee et al. [66], a sliding-window approach is proposed to enable the use of tiles (patches) while mitigating stitching errors: images are predicted with a sliding window whose size equals the patch size used during training, and adjacent predictions overlap by half a patch. Since segmentation accuracy decreases towards the borders of a window, a Gaussian importance weighting is applied to suppress stitching artifacts and reduce the influence of positions close to the borders, increasing the weight of the center voxels in the softmax aggregation. Test-time augmentation by mirroring along all axes is also applied.
This solution can be applied to our two-dimensional predictions, using pixels instead of voxels, applied directly after the final sigmoid activation.
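A minimal two-dimensional sketch of this scheme is given below (function names and the Gaussian width are our own choices, not taken from [66] or our implementation; the test-time mirroring augmentation is omitted for brevity). Overlapping windows are predicted, weighted by a Gaussian importance map, accumulated, and normalized:

```python
import numpy as np

def gaussian_importance(size, sigma_scale=0.125):
    """2D Gaussian weight map emphasizing the window center.

    sigma_scale is an assumed hyperparameter (sigma = size * sigma_scale).
    """
    coords = np.arange(size) - (size - 1) / 2.0
    sigma = size * sigma_scale
    g = np.exp(-(coords ** 2) / (2 * sigma ** 2))
    return np.outer(g, g)

def sliding_window_predict(image, predict_fn, win=32):
    """Predict a full scene from overlapping win x win tiles (50% overlap).

    image:      (H, W) input, with H and W assumed to be multiples of win//2.
    predict_fn: maps a (win, win) tile to a (win, win) prediction, e.g. the
                output of the final sigmoid activation.
    """
    h, w = image.shape
    step = win // 2
    weights = gaussian_importance(win)
    acc = np.zeros((h, w))
    norm = np.zeros((h, w))
    for y in range(0, max(h - win, 0) + 1, step):
        for x in range(0, max(w - win, 0) + 1, step):
            tile_pred = predict_fn(image[y:y + win, x:x + win])
            # Center pixels of each tile dominate the aggregation.
            acc[y:y + win, x:x + win] += tile_pred * weights
            norm[y:y + win, x:x + win] += weights
    return acc / np.maximum(norm, 1e-12)
```

Because every output pixel is a weighted average of several overlapping tile predictions, abrupt stitching seams between adjacent tiles are smoothed away.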
Furthermore, the training process can be improved by considering the real number of observations per pixel. In our current work, we have temporally stacked the observations to fill the gaps caused by unavailable data due to anomalies, cloud coverage, or out-of-swath conditions. As a result, every pixel in a window can have a different number of real (effective) observations, since gaps are covered by replications of the same value. This can be taken into account by regularizing the loss function: pixels with the fewest effective observations should only minimally influence the loss. This can further be extended to inference by combining the prediction with a confidence map. Change predictions of individual pixels indicate lower confidence if not enough effective observations back the prediction. This can help in identifying real changes by lowering the error noise.
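One possible form of such a regularization is sketched below (the weighting scheme and function name are our own proposal, not part of the published training code): per-pixel squared errors are weighted by the fraction of effective observations, so gap-filled pixels barely contribute to the loss.

```python
import numpy as np

def weighted_mse(pred, label, n_eff, t_total):
    """Per-pixel MSE weighted by the share of real (effective) observations.

    pred, label: (H, W) prediction and synthetic label.
    n_eff:       (H, W) count of real observations per pixel, excluding
                 replications used to fill temporal gaps.
    t_total:     length of the stacked time series.
    A pixel observed t_total times gets full weight; a pixel consisting
    only of gap fills gets weight zero.
    """
    w = n_eff / float(t_total)
    return np.sum(w * (pred - label) ** 2) / np.maximum(np.sum(w), 1e-12)
```

The same per-pixel weights `w` could serve directly as the confidence map mentioned above, flagging predictions that rest on few effective observations.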
Furthermore, the training with noisy labels can be improved. Many approaches [73] exist for learning from noisy labels with deep neural networks. We would consider the SELF method by Nguyen et al. [74] a viable approach. It could delay the onset of overfitting and lower the training losses further, leading to more accurate predictions. A downside of SELF is that it was conceived for classification rather than per-pixel prediction tasks; further work would be required to adapt it to our solution.

Conclusions
In our study, we trained two ensemble neural networks of the same architecture for the eras of ERS-1/2 and Landsat 5 TM, as well as Sentinel 1 and 2, using deep time series of multispectral optical and SAR data. The supervised training with synthetic but noisy labels has shown good utility and avoids the need to manually create ground truth labels for training the neural networks. The ensemble mixing of the two data types showed synergies, but with a tendency to rely more on optical observations. We also analyzed the sensitivity to changes over time, which indicates that changes closer to the beginning of the observation window show a stronger response. This suggests sampling multiple window predictions, preferably over a longer period, in order to balance change responses.
Our fully automated and parametric environment allows training and inference with little manual work. In combination with services such as the Sentinel Hub, the necessary data are directly accessible and can be used for training or inference without further processing. We make our work available on GitHub (https://github.com/It4innovations/ERCNN-DRS_urban_change_monitoring; accessed on 29 July 2021), including the training environments, trained models, and data samples.

Acknowledgments: The authors would like to thank ESA for funding the study as part of the BLENDED project [75] and IT4Innovations for funding the compute resources via the Open Access Grant Competition (OPEN-21-31). Furthermore, the authors would like to thank the data providers (USGS, ESA, Sentinel Hub and Google) for making remote sensing data freely available. The authors would finally like to thank the BLENDED project partners for supporting our work as a case study of the developed platform.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.