A Deep Learning Approach for Burned Area Segmentation with Sentinel-2 Data

: Wildﬁres have major ecological, social and economic consequences. Information about the extent of burned areas is essential to assess these consequences and can be derived from remote sensing data. Over the last years, several methods have been developed to segment burned areas with satellite imagery. However, these methods mostly require extensive preprocessing, while deep learning techniques—which have successfully been applied to other segmentation tasks—have yet to be fully explored. In this work, we combine sensor-speciﬁc and methodological developments from the past few years and suggest an automatic processing chain, based on deep learning, for burned area segmentation using mono-temporal Sentinel-2 imagery. In particular, we created a new training and validation dataset, which is used to train a convolutional neural network based on a U-Net architecture. We performed several tests on the input data and reached optimal network performance using the spectral bands of the visual, near infrared and shortwave infrared domains. The ﬁnal segmentation model achieved an overall accuracy of 0.98 and a kappa coe ﬃ cient of 0.94.


Introduction
Vegetation fires are a natural phenomenon in several ecosystems and can have positive effects on biodiversity and natural regeneration [1]. However, in the last centuries, the number of fires has increased significantly due to human activities [2], which results in notable adverse consequences at multiple scales. From an ecological point of view, they lead to a decline of forest stands [3], impair forest health and biodiversity [4] and emit aerosols and other greenhouse gases, which have impacts on the global carbon budget [5,6]. From a socioeconomic point of view, fires threaten human life, destroy property and have a negative impact on sources of income [7]. As climate change presents new, dynamic scenarios, fire seasons have been observed to become longer and to affect larger areas than before [8]. In order to coordinate emergency efforts on site, to determine the economic and ecological losses and to assess the recovery process, information about location, extent and frequency of fire events is essential. Satellite data are well suited to provide this information. Compared to field methods, they are time and cost efficient and provide synoptic information with global coverage at high temporal frequency.
As reviewed by Chuvieco et al. [2], satellite sensors in the Middle Infrared (MIR) and the thermal domains are particularly suitable for the detection of active fires. To derive the extent of burned areas, optical sensors in the solar domain are commonly used. This is because biomass incineration has clear effects that can be detected in this domain: burning results in a strong decrease of reflectance in the near infrared (NIR) and the resulting dryness leads to a moderate increase in the shortwave infrared burn index (MIRBI) [27] and the modified burned area index (BAIM) [28]. Frequently, not only a single post-event image is used, but also a pre-event image. In such a multi-temporal approach, spectral indices are calculated for both images and the burned areas are derived by change detection. As a result, burned areas can be better distinguished from other areas with a similar spectral response. However, an inadequate choice of a pre-event scene may also lead to misclassifications. Furthermore, a larger amount of cloud free data is required and needs to be preprocessed. However, this is a major drawback in a time-critical rapid mapping context.
In contrast to rule-based approaches, machine learning approaches learn the characteristics of burned pixels from a set of labelled samples. Examples of applied machine learning algorithms are random forests (RF) [29,30] and support vector machines (SVM) [31,32]. Usually, instead of using the raw image data, previously extracted features derived from spectral indices or band differences of pre-and post-fire images serve as inputs for these algorithms. Furthermore, auxiliary information is included in certain approaches, e.g., a digital elevation model [29].
Another area of machine learning that is used increasingly more often in remote sensing is neural networks. First developments in the context of burned area mapping were made in 2001, when a recurrent neural network based on adaptive resonance theory was used to classify satellite images pixel by pixel as burned or unburned [33]. However, in the following years, another kind of neural network prevailed: the multilayer perceptron (feedforward) network [34][35][36][37][38][39]. In this network architecture, several spectral characteristics of each pixel are passed through a layer-structured shallow network to determine whether the pixel is burned or not. More recently, a further development of neural networks, convolutional neural networks (CNNs), showed enormous success in remote sensing applications. CNNs do not connect all neurons of successive layers, but only those neurons within a receptive field (neighborhood) and, thus, allow for the integration of contextual information and reduce the required computational time. Furthermore, their deep layer architecture enables the network to map any nonlinear function and to generalize learned features. The multiple layers within the network are also the reason CNNs are often referred to as deep learning architectures. CNNs have successfully been used in different classification tasks in remote sensing such as the segmentation of buildings [40], slums [41], clouds [42], water [43] and the specification of land use [44] and crop types [45]. Although these examples exemplify the high potential of deep learning in image segmentation, this technique has had limited application specifically to the context of burned area segmentation. This is probably due to the limited availability of corresponding training data, which is required for supervised learning. So far, only the authors of [46] and [47] used a deep learning approach for burned area mapping. Pinto et al. [46] trained a CNN with long short term memory (LSTM) on a time series of coarse resolution VIIRS reflectance and active fire data to segment burned areas and to derive the fire date. Ban et al. [47] used deep learning to monitor wildfire progression in Sentinel-1 SAR time series.
A major drawback of current burned area segmentation methods is that they frequently follow complex rulesets and are not only based on a post-event, but also a pre-event scene, which increases preprocessing times and the probability of cloud cover in the final segmentation result. Furthermore, machine learning algorithms like RF or SVM often require extensive feature engineering of the input data and include the use of auxiliary data. Deep learning in the form of CNNs, which is another field of machine learning, does not require previous feature engineering, is able to map any nonlinear function and can generalize learned features. Therefore, the use of CNNs for burned area segmentation seems promising. While preliminary studies have been conducted, deep learning has not yet been performed with the use of high resolution Sentinel-2 data. Thus, our research objective is to combine sensor and algorithm developments of the recent years and to train a CNN best suited for mono-temporal burned area segmentation with Sentinel-2 data. The high spatial resolution and large swath width of Sentinel-2 imagery allows for the mapping of both extensive and small fragmented fires. Consequently, this reduces omission errors. Compared to Landsat imagery that is already frequently used in burned area segmentation, Sentinel-2 has a higher revisit frequency of only five days. Thus, there is a higher probability of acquiring cloud-free images. We suggest an approach based on a post-fire scene only, without the need for atmospheric correction or the inclusion of any auxiliary data. These characteristics ensure rapid processing of a large number of satellite scenes and make our method particularly suited for time-critical applications.
A major challenge of generating a neural network model for burned area segmentation is the availability of training data. Most global burned area products have a coarse spatial resolution and creating synthetic training data would not take into account contextual information. Thus, the first step involves creating a training dataset for burned area segmentation with Sentinel-2 data. To better understand the impact of the training data on the network's performance, we conduct several experiments before training a final model. These include the identification of the best suited spectral band combination for the CNN. Finally, we integrate the burned area segmentation model into a fully automatic processing chain including data ingestion, data preprocessing and burned area segmentation and mapping. This facilitates the transfer and use of the proposed method by any institution tasked with mapping activities in the context of forest fire management. We adjust an existing modular Sentinel-2 flood processor [43] in a way that it can also be used for burned area mapping to support these efforts. We apply the processing chain to three selected fire disasters to validate our results and to illustrate the advantages and disadvantages of the proposed method.

Data
In a supervised machine learning setup, a model is fit on a training dataset, which is comprised of labeled samples with associated feature vectors. For burned area segmentation, these are binary pixel masks indicating burned and unburned pixels and corresponding multi-band satellite images. To match the high resolution Sentinel-2 imagery, reference burned area masks with similar spatial resolution must be provided. Global databases, as described in Section 1, comprise of only medium resolution products, so we compiled a new dataset for training, validation and testing. The reference data used in this study originates from three different data sources: the fire incident database of the California Department of Forestry & Fire Protection (CAL FIRE) [48,49], the burned area database of the Portuguese Institute for Nature Conservation and Forests (ICNF) [25] and burned areas that have been processed in the past at the German Aerospace Center (DLR).
CALFIRE and ICNF provide fire perimeters as vector files. However, there is no detailed information available about how the features are delineated nor on which satellite images they are based. A visual inspection of burned areas larger than 1 ha from 2017 and 2018 shows that most vector files capture the outermost extent of larger burned areas (i.e., often without accounting for smaller unburned parts therein), rather than the results of a pixel-based segmentation. To ensure a sufficient spatial resolution, we acquired Sentinel-2 scenes captured over the burned areas with a sensing time as close as possible to the generation date of the reference masks and with cloud coverage of less than 10%. We performed visual quality control of the fire perimeters with the geographic information system QGIS and defined areas of interest with respect to observations within the Sentinel-2 scene. The corresponding vector files of the fire perimeters were converted to a 10 m pixel mask.
DLR generated burned area masks based on Sentinel-2 imagery in the past according to [17]. Processing requires a pre-and a post-fire scene as input and includes coregistration, atmospheric correction and cloud masking as preprocessing steps. Segmentation is performed with a two-phase pixel-based approach. In the first phase, burned pixels are detected by thresholding several monoand multi-temporal features. In the second phase, region growing based on the burned pixels of the first phase is performed. The result is a 10 m resolution pixel mask. Similarly, with CALFIRE and ICNF data, we visually quality controlled all of the burned area masks and defined areas of interest with high segmentation quality. As the segmentation results are prone to a "salt-and-pepper effect" that often occurs when pixel-based methods are applied to high-resolution images [50], we applied morphologic operators (closing and opening) with a 3 × 3 kernel to the DLR burned area masks in order to improve the quality.
Besides creating labeled samples, we acquired corresponding Sentinel-2 satellite imagery. Each of the two Sentinel-2 satellites has a multispectral instrument on board covering 13 spectral bands with spatial resolutions from 10 to 60 m (see Table 1). We resampled all bands to 10 m resolution, stacked them, created subsets according to the previously defined areas of interest and converted digital numbers (DN) to top of atmosphere (TOA) reflectance (details are provided in Section 2.2). The final reference dataset comprises of subsets of different dimensions that are derived from 184 Sentinel-2 scenes, which include burned areas, mostly acquired between 2017 and 2018. The data covers different biomes and seasons, with the aim to generate a representative set that captures different types of climatic, atmospheric and land-cover conditions. Figure 1 gives an overview of the spatial distribution. Most the reference data (70%) originates from the regional databases of ICNF, located in Portugal. The CAL FIRE and DLR data make up 9% and 21% of the total pixel number, respectively. About two thirds of the total reference data are obtained over Mediterranean forests, woodlands and scrubs. The rest of the reference data largely covers temperate forests, tropical and subtropical grasslands, savannas and shrublands. Burned areas of other biomes are limited in the reference data. Montane grasslands and shrubs, tundra and mangrove are not covered at all.  The neural network presented in Section 2.3 requires raster images of 256 × 256 pixels as input. Thus, we split up all satellite image subsets and burned area masks into non-overlapping tiles with the aforementioned dimension. To enable tiling of smaller subsets, all tiles were mirror-padded beforehand. In total, 2637 tiles were created and divided into training, validation and test sets, with an approximate ratio of 60/20/20% and aiming at an even distribution over the represented biomes. The neural network presented in Section 2.3 requires raster images of 256 × 256 pixels as input. Thus, we split up all satellite image subsets and burned area masks into non-overlapping tiles with the aforementioned dimension. To enable tiling of smaller subsets, all tiles were mirror-padded beforehand. In total, 2637 tiles were created and divided into training, validation and test sets, with an approximate ratio of 60/20/20% and aiming at an even distribution over the represented biomes.
To further increase the amount of training data and strengthen the network against atmospheric changes and different sensor resolutions, we augmented each training tile three times by applying a sequence of random brightness, contrast, shift, scale and rotation changes. Each transformation was applied with a probability of 75%. We randomly changed brightness values within a factor range of [−0. . We used border reflection and nearest neighbor interpolation to bring the augmented tiles back to the required 256 × 256 pixel tile size. Application of the aforementioned data augmentation produced a training dataset of 6272 tiles. The percentage of burned pixels within these tiles was on average 20%, including tiles completely covered by burned areas and tiles without any burned areas.
To independently test the burned area mapping module, we selected three fire events located in Pantelleria, Italy (study area A), Mossley, United Kingdom (study area B) and Torre de l'Espanyol, Spain (study area C), which were mapped as a part of Copernicus EMS. As these products have been generated in rapid mapping mode by different operators, burned area masks were manually refined to ensure high quality and to harmonize variable mapping styles. The Sentinel-2 scenes were clipped to the area of interest defined by Copernicus EMS. An overview of the three fire events can be found in Figure 2.  In June 2016, several fires broke out in the region of Sicily caused by high temperatures of up to 40 °C and strong southern winds [53]. One of the fires occurred on the island of Pantelleria (study area A) and destroyed about 600 ha of croplands, scrubs and woodlands. In response to this event, the EMS rapid mapping component was activated on June 23, 2016 (EMSR 169). Emergency maps were produced and released by e-GEOS, based on a Sentinel-2 pre-event image (10 m ground sampling distance (GSD)) acquired on April 16, 2017 and a Sentinel-2 post-event image (10 m GSD) acquired on June 25, 2016. We used the post-event Sentinel-2 image (tile 33STA) as input for testing.
In June 2018, a fire broke out in Saddleworth Moor to the east of Manchester, England (study area B) [54]. The fire affected an area of about 1000 ha, mainly characterized by inland wetlands, but In June 2016, several fires broke out in the region of Sicily caused by high temperatures of up to 40 • C and strong southern winds [53]. One of the fires occurred on the island of Pantelleria (study area A) and destroyed about 600 ha of croplands, scrubs and woodlands. In response to this event, the EMS rapid mapping component was activated on June 23, 2016 (EMSR 169). Emergency maps were produced and released by e-GEOS, based on a Sentinel-2 pre-event image (10 m ground sampling distance (GSD)) acquired on April 16, 2017 and a Sentinel-2 post-event image (10 m GSD) acquired on June 25, 2016. We used the post-event Sentinel-2 image (tile 33STA) as input for testing.
In June 2018, a fire broke out in Saddleworth Moor to the east of Manchester, England (study area B) [54]. The fire affected an area of about 1000 ha, mainly characterized by inland wetlands, but also with shrub/herbaceous vegetation and pastures. In response to this event, the EMS rapid mapping component was activated on June 27, 2018 (EMSR 291). Emergency maps were produced by e-GEOS and released by SERTIT, based on a Sentinel-2 pre-event image (10 m GSD) acquired on June 24, 2018 and a SPOT-7 post-event image acquired on July 4, 2018 (1.5 m GSD). We used a Sentinel-2 scene (10 m GSD) from July 4, 2018 (tile 30UWE) for testing.
In June 2019, a wildfire broke out in the municipality of Torre de l'Espanyol in the province of Catalonia (study area C) [55]. Subsequently, nearly 5000 ha were devastated, including forests, agricultural areas and shrub/herbaceous vegetation. In response to this event, the EMS rapid mapping component was activated on June 27, 2019 (EMSR 365). Emergency maps were produced by SERTIT and released by e-GEOS based on a SPOT6/7 pre-event image (1.5 m GSD) acquired on June 6, 2019 and a SPOT-6/7 post-event image (1.5 m GSD) acquired on June 30, 2019. We used a Sentinel-2 scene (10 m GSD) from June 30, 2019 (tile 31TBF) for testing.

Fully Automatic Processing Chain
The fully automatic burned area mapping workflow is presented in Figure 3.

Fully Automatic Processing Chain
The fully automatic burned area mapping workflow is presented in Figure 3. It is based on a modular Sentinel-2/Landsat processing chain developed at DLR, which was originally designed for flood mapping. The processing chain is fully implemented in Python using Keras with TensorFlow backend [56] as the deep learning framework. The principle components are explained in the following paragraph, for a detailed description see [43].
Two input arguments are mandatory to initiate the processing chain: an area of interest and a time range to search for satellite imagery. Optionally, further parameters can be defined such as maximum cloud coverage or the download of a quick look. The first step checks for and downloads Sentinel-2 scenes at processing level L1C that meet the defined criteria. The level L1C data product is It is based on a modular Sentinel-2/Landsat processing chain developed at DLR, which was originally designed for flood mapping. The processing chain is fully implemented in Python using Remote Sens. 2020, 12, 2422 8 of 22 Keras with TensorFlow backend [56] as the deep learning framework. The principle components are explained in the following paragraph, for a detailed description see [43].
Two input arguments are mandatory to initiate the processing chain: an area of interest and a time range to search for satellite imagery. Optionally, further parameters can be defined such as maximum cloud coverage or the download of a quick look. The first step checks for and downloads Sentinel-2 scenes at processing level L1C that meet the defined criteria. The level L1C data product is already radiometrically calibrated and orthorectified but is not atmospherically corrected. Data can be downloaded via the ESA Copernicus Open Access Hub, using the custom Application Programming Interface (API), and is provided as single band raster files in 100 × 100 km tiles. In the following step, the downloaded data are preprocessed. Images are converted from DN to TOA reflectance (DN2TOA). The single band raster files are resampled to the same spatial resolution (10 m) and those bands required for the neural network model are stacked to produce a single dataset. In this work, the spectral bands in the following domains are used: visual (Blue, Green, Red), near infrared (NIR) and short-wave infrared (SWIR1, SWIR2). This band combination was selected for based on the outcomes of comprehensive performance evaluation described in Section 3.
The preprocessed images are passed through a first CNN to segment invalid pixels (cloud, cloud shadows, snow and ice), following the procedure described in [42], which results in a valid pixel mask. Subsequently, the same image is passed through a second CNN that has been trained specifically for a corresponding mapping task, in this case burned area segmentation. The network assigns each pixel to a corresponding class, either burned or unburned. Finally, the segmentation result is clipped to the extent of the valid pixel mask and the final burned area mask is generated.

Semantic Segmentation of Burned Areas with a U-Net CNN
The key part of this work is the development of a deep learning model based on a CNN specifically suited for burned area segmentation. A neural network is a system of interconnected neurons that is used to model a complex function, in our case, a function to map a set of pixel values to labels. A defined set of weights serves to define the function. Neurons are arranged as multiple cascading layers allowing features to be learned in a hierarchical manner. During training, input tiles are propagated through the network and a loss function is calculated to evaluate the quality of the mapping function. Gradients of the loss function are then back-propagated through the network and weights are adjusted accordingly to minimize the loss.
A critical point that affects the network's performance is the network architecture. The architecture used in this work is based on a U-Net architecture [57]. The U-Net architecture was originally designed for biomedical image segmentation, but already achieved promising accuracies in other segmentation tasks with satellite imagery, such as cloud/cloud shadow [42] and water segmentation [43]. Further advantages of this architecture are that it can learn from a limited amount of data and has considerably less trainable parameters than other CNNs. We decided against using a pre-trained model, which is a common approach to reduce computational effort, because any new dataset would have to be similar to the one used in pre-training. This would have otherwise limited the choice of input data.
The U-Net is a fully convolutional network with encoder-decoder architecture. This means that in addition to the contracting part of a conventional CNN, the network has a more or less symmetric upsampling part. This leads to the typical u-shape of the network (see Figure 4). The U-Net presented in this work requires tiles with 6 spectral bands and a width and height of 256 pixels each. This tile size is frequently used for U-Net architectures [42,43] and has been observed to ensure that a significantly large batch size can be trained with a standard desktop PC. Images to be processed are usually larger than the specified dimension and must be tiled. We apply mirror-padding to minimize higher prediction errors towards the image border during inference and split the image into 256 × 256 pixel tiles with an overlap of 10%. These tiles are passed through the network, resulting in tiles that indicate the probability of being burned (ranging from 0 to 1). This tile size is frequently used for U-Net architectures [42,43] and has been observed to ensure that a significantly large batch size can be trained with a standard desktop PC. Images to be processed are usually larger than the specified dimension and must be tiled. We apply mirror-padding to minimize higher prediction errors towards the image border during inference and split the image into 256x256 pixel tiles with an overlap of 10%. These tiles are passed through the network, resulting in tiles that indicate the probability of being burned (ranging from 0 to 1).  The final binary output is calculated using a fixed threshold of 0.5. Finally, the tiles are reconstructed to the original image size. We use a tapered cosine window function w (Equation (1)) to blend the overlapping prediction tiles in which n represents the samples, N the number of pixels in the output window and α the fraction of the window inside the cosine tapered region.
In the following paragraph, the architecture of the U-Net is explained more in detail. In the encoder part, input data are passed through five convolutional blocks where features are extracted at different scales. Each convolutional block consists of two 3 × 3 convolutions with ReLu activation, each followed by batch normalization of the activations of the previous layer and ends with a max pooling operation with stride 2 that downsamples the feature maps. While the size of the feature maps is reduced by factor 4 after each convolutional block, the number of feature channels is doubled. In the decoder part, the feature maps are upsampled to the original image size by passing through five blocks as well. Each of these blocks consists of a 2 × 2 transpose convolution that halves the number of feature channels, a concatenation with the corresponding feature map from the encoder part and two 3 × 3 convolutions with ReLu activation, each followed by batch normalization. The final layer is a 1 × 1 convolution with sigmoid activation in order to calculate the probability for each pixel of being burned.

Training of the U-Net CNN
To define the weights of the U-Net, the network was trained on the augmented dataset presented in Section 2.1. Augmentation was applied in order to make the network invariant to specific transformations: the geometric transformations simulate different sensor characteristics such as spatial resolution. Thus, the network could potentially be used to process image data from other satellite sensors (e.g., Landsat). The radiometric transformations aim to capture variations with seasonality, land use, land cover and under different atmospheric conditions. The last point is especially important, as image data are not atmospherically corrected before entering the U-Net. To make the samples of the augmented dataset more comparable, all data values were standardized using the statistics of the training samples. Standardization was performed independently on each spectral band and included data-centering by subtraction of the mean and scaling to unit variance by division by the standard deviation.
During training, weights were adjusted after each epoch in a way that minimized the binary cross entropy loss H of the training data, between the true probability p and the predicted probabilityp Equation (2). Training data were shuffled after each epoch.
We used the adaptive moment estimation (Adam) optimization algorithm [58] with an initial learning rate of 0.001 and default hyper-parameters β 1 = 0.9 and β 2 = 0.999. Validation data were also passed through the network during training and its loss was calculated and monitored. If validation loss did not improve within five epochs, the learning rate was reduced by factor 0.5. To prevent overfitting, training was stopped if validation loss has not improved within 20 epochs. At the end of the training process, the model associated with the lowest validation loss was saved. Training was carried out on a standard desktop PC with Intel XEON W-2133 CPU @ 3.60 GHz, 6 cores, 64 GB RAM and NVIDIA Quadro P4000 graphic card. The network was trained until convergence occurred in batches of size 16, using Keras with TensorFlow backend [56] as the deep learning framework.

Accuracy Assessment
For model evaluation, we forwarded the testing data through the network and calculated overall accuracy, Cohen's kappa coefficient, inference speed as well as precision and recall of the pixels classified as burned. Overall, accuracy OA is defined as the number of correctly classified pixels over the total number of pixels, see Equation (3) (tp = true positive, tn = true negative, fp = false positive, fn = false negative). Cohens' kappa coefficient K is an indicator of interrater reliability and is calculated with Equation (4). The variable p 0 describes the observed overall accuracy and p e is the expected probability, when both raters assign labels randomly.
As overall accuracy is mainly influenced by the quantity of unburned pixels due to the unbalanced distribution of classes, we also calculated precision and recall of the burned class, which are also referred to as user's accuracy and producer's accuracy. Precision describes how many detected burned pixels are in fact burned Equation (5), whereas recall describes how many of the truly burned pixels have been detected as such Equation (6). Additionally, we calculated the F1 score, which is the mean of precision and recall.
We additionally trained a random forest model for benchmarking. We used the Scikit-learn implementation [59] with standard parameters, which fits a number of optimized CART decision trees [60] on subsamples of the training data and uses averaging to prevent overfitting and to improve the accuracy of the prediction.
To test the complete processing chain and to evaluate the performance of our network model in real-world emergency response applications, we processed the Sentinel-2 scenes of the study areas A-C and compared the results to the reference data.

Results
In order to find an optimal input data configuration for the CNN, we tested different band combinations before training a final network. The objective was to achieve best results with as few spectral bands as possible. This approach makes the model lighter, reduces inference and processing time and minimizes storage space requirements. Table 2 gives an overview of the spectral band combinations tested, starting with a combination of all spectral bands (BC1) and then gradually reducing the number of bands. First, we removed all atmospheric bands (BC2), then the narrow bands of the red edge domain (BC3). Afterwards, we omitted the bands of the SWIR domain (BC4) and finally trained a network with the spectral bands of the visual domain only (BC5). All networks were trained on the non-augmented dataset until convergence occurred. Inference time was averaged over five runs and refers to a calculation on the graphics processing unit (GPU). Table 2. Summary of the accuracy assessment and graphics processing unit (GPU) inference time for the 545 testing data tiles using different band combinations as input for the U-Net. The networks were trained on the non-augmented dataset until convergence occurred.  Table 2 shows that the accuracy assessment for BC1-BC4 varies only slightly. Changes cannot be clearly traced back to different band combinations, as they could also be based on different randomly set initial weights. Nevertheless, accuracy and kappa show a downward trend with lower numbers of spectral bands used for training. Training the network exclusively with the blue, green and red channels clearly results in worse outcomes. In the end, we selected BC3, a tradeoff between high accuracy and low inference time with balanced outcomes for precision and recall. Furthermore, this band combination can also be found in other multispectral satellite sensors, such as Landsat, which allows testing the transferability of the Sentinel-2 trained model to other sensor data.
The final model was trained on the augmented dataset using the spectral bands of BC3. Training lasted about 4 hours using the hardware described in Section 2.4. Figure 5 shows the loss and accuracy of the training and validation datasets over the epochs.  The final model was trained on the augmented dataset using the spectral bands of BC3. Training lasted about 4 hours using the hardware described in Section 2.4. Figure 5 shows the loss and accuracy of the training and validation datasets over the epochs. The results of the accuracy assessment for the testing tiles are summarized in Table 3. For the U-Net model, we achieved an overall accuracy of 0.98 and a precision and a recall of the burned class of 0.95 each. The central processing unit (CPU) inference speed computed over 545 test image tiles is 2.20 s/megapixel and the corresponding GPU inference speed is 0.26 s/megapixel. Compared to the RF model, the U-Net model resulted in significantly better metrics: overall accuracy is 0.02 higher and the kappa coefficient is 0.07 higher. In particular, the inference time of the RF model is almost twice as long. Calculating on the GPU even makes the U-Net 15 times faster. The results of the accuracy assessment for the testing tiles are summarized in Table 3. For the U-Net model, we achieved an overall accuracy of 0.98 and a precision and a recall of the burned class of 0.95 each. The central processing unit (CPU) inference speed computed over 545 test image tiles is 2.20 s/megapixel and the corresponding GPU inference speed is 0.26 s/megapixel. Compared to the RF model, the U-Net model resulted in significantly better metrics: overall accuracy is 0.02 higher and the kappa coefficient is 0.07 higher. In particular, the inference time of the RF model is almost twice as long. Calculating on the GPU even makes the U-Net 15 times faster. Table 3. Accuracy assessment and central processing unit (CPU) inference time for the testing tiles using the final network model (U-Net) and a random forests (RF) model. For the U-Net GPU inference time is provided in brackets additionally.   To test the complete processing chain, we processed the Sentinel-2 scenes that include study areas A-C. After downloading the scenes, the average processing time for data preparation, segmentation and final mapping was 193 s. Figure 7 shows the corresponding results and supports an initial qualitative analysis. Clouds, cloud shadows and no data values (e.g., in Figure 7a) are identified and classified as invalid pixels. The burned areas of study sites A-C are detected, as well the two other burned areas that are present in tiles 30UWE and 31TBF. These areas are indicated To test the complete processing chain, we processed the Sentinel-2 scenes that include study areas A-C. After downloading the scenes, the average processing time for data preparation, segmentation and final mapping was 193 s. Figure 7 shows the corresponding results and supports an initial qualitative analysis. Clouds, cloud shadows and no data values (e.g., in Figure 7a) are identified and classified as invalid pixels. The burned areas of study sites A-C are detected, as well the two other burned areas that are present in tiles 30UWE and 31TBF. These areas are indicated with circles. In general, there are only limited instances of false detections. In tile 31TBF, open-cast mining is frequently incorrectly identified as burned area, while in tile 30UWE, false detections mainly occur over agricultural fields, especially in the left side of Figure 7e.  To avoid an over-representation of background pixels within the full Sentinel-2 scenes in our accuracy metrics, we focus on study areas A-C for a more detailed accuracy assessment. Table 4 presents the quantitative results associated with the three study areas. The overall accuracy is 99% in all three cases, while kappa varies between 86% and 97%. The precision of the predicted burned pixels shows the largest variability: while it is just 77% in study area A, it is 96% in study area B. In contrast, the recall of the burned class is consistently high at an average of 0.98% and differs only marginally in the three cases. Table 4. Overview of the accuracy assessment for the three study areas including overall accuracy (OA), Cohen's kappa coefficient (K), as well as precision, recall and the F1 score of the burned areas. For a qualitative accuracy assessment, we visualized the U-Net prediction on an image composite with B12, B8 and B2 spectral bands and identified hit and false alarms (see Figure 8). Misclassifications mainly occur within or directly connected to the burned areas. Only in case of study area A there are also false burned alarms in the coastal area of the island. Thin structures, such as roads, are usually not detected as unburned. In general, false alarms in the predicted burned class clearly outweigh those in the unburned class. To avoid an over-representation of background pixels within the full Sentinel-2 scenes in our accuracy metrics, we focus on study areas A-C for a more detailed accuracy assessment. Table 4 presents the quantitative results associated with the three study areas. The overall accuracy is 99% in all three cases, while kappa varies between 86% and 97%. The precision of the predicted burned pixels shows the largest variability: while it is just 77% in study area A, it is 96% in study area B. In contrast, the recall of the burned class is consistently high at an average of 0.98% and differs only marginally in the three cases. For a qualitative accuracy assessment, we visualized the U-Net prediction on an image composite with B12, B8 and B2 spectral bands and identified hit and false alarms (see Figure 8). Misclassifications mainly occur within or directly connected to the burned areas. Only in case of study area A there are also false burned alarms in the coastal area of the island. Thin structures, such as roads, are usually not detected as unburned. In general, false alarms in the predicted burned class clearly outweigh those in the unburned class.

Discussion
One of the most important aspects that can influence the performance of any supervised machine learning algorithm is the choice of appropriate training data. Thus, in the first part of the discussion, we review the impact of several choices we made on the network's performance.
As described in Section 2.1, we did not only use those tiles for training that included both classes (burned and unburned), but also tiles completely covered by one class. This enables the network to learn more about unburned areas such as harvested cropland, which had caused major problems in other segmentation approaches [17]. To show the impact of this single class tile integration, we trained an additional network excluding these tiles and processed the Sentinel-2 data of the three study areas in Italy (A), UK (B) and Spain (C). Figure 9 presents the respective probability maps.

Discussion
One of the most important aspects that can influence the performance of any supervised machine learning algorithm is the choice of appropriate training data. Thus, in the first part of the discussion, we review the impact of several choices we made on the network's performance.
As described in Section 2.1, we did not only use those tiles for training that included both classes (burned and unburned), but also tiles completely covered by one class. This enables the network to learn more about unburned areas such as harvested cropland, which had caused major problems in other segmentation approaches [17]. To show the impact of this single class tile integration, we trained an additional network excluding these tiles and processed the Sentinel-2 data of the three study areas in Italy (A), UK (B) and Spain (C). Figure 9 presents the respective probability maps. The probability maps resulting from both models generally show segmentation difficulties in the same areas: the rocky coastline in study area A and the lakeshore in the east and agricultural fields in the north of the burned area in study area B. However, as can be seen in Figure 9e, the model trained without single class tiles assigns a lot higher probability of being burned to those misclassifications. Furthermore, not integrating single class tiles results in further misclassifications such as predicting the airport site in study area A as burned. These findings justify our decision to integrate single class tiles in the training dataset for our final network model.
Besides the integration of single class tiles, the kind of reference data used for training also has influence on the network's results. In this work, we included pixelwise burned area masks produced by a DLR processor [17] and burned area perimeters of ICNF and CAL FIRE. As previously described, the latter are characterized as the outermost extent of burned areas, rather than a more detailed pixel-by-pixel segmentation. To show the impact on the network's performance, we trained two additional networks: the first one using only the pixelwise burned area masks of DLR (798 tiles) and the second one using the areal burned area masks of ICNF and CAL FIRE (3906 tiles). Figure 10 The probability maps resulting from both models generally show segmentation difficulties in the same areas: the rocky coastline in study area A and the lakeshore in the east and agricultural fields in the north of the burned area in study area B. However, as can be seen in Figure 9e, the model trained without single class tiles assigns a lot higher probability of being burned to those misclassifications. Furthermore, not integrating single class tiles results in further misclassifications such as predicting the airport site in study area A as burned. These findings justify our decision to integrate single class tiles in the training dataset for our final network model.
Besides the integration of single class tiles, the kind of reference data used for training also has influence on the network's results. In this work, we included pixelwise burned area masks produced by a DLR processor [17] and burned area perimeters of ICNF and CAL FIRE. As previously described, the latter are characterized as the outermost extent of burned areas, rather than a more detailed pixel-by-pixel segmentation. To show the impact on the network's performance, we trained two additional networks: the first one using only the pixelwise burned area masks of DLR (798 tiles) and the second one using the areal burned area masks of ICNF and CAL FIRE (3906 tiles). Figure 10 presents the resulting probability maps. It is noticeable that the model trained with DLR tiles only results in a sharp delineation of the burned areas from their surroundings, while the model trained with ICNF and CAL FIRE data produces gradually decreasing probabilities at the borders of the burned areas. Furthermore, misclassifications occur more frequently and more clearly in the model trained with DLR data only. This is probably due to the fact that ICNF and CALFIRE provide their data as vector files representing the extent of the burned area, rather than a per pixel segmentation. The network learns this characteristic and takes the neighborhood much more into account to create homogenous burned areas. A further aspect that can be observed in Figure 10 is that the model trained with DLR data only is very prone to detecting water surfaces as burned areas, such as the ocean around the island in study area A, lakes in study area B and the river in study area C. Training a neural network with both datasets, as with the final model, constitutes a compromise between the advantages and disadvantages of the datasets. The corresponding probability maps are illustrated in Figure 9a-c. While there are hardly any misclassifications in water areas, however, burned areas do not have a sharp outline.
Remote Sens. 2020, 12, x FOR PEER REVIEW 17 of 23 presents the resulting probability maps. It is noticeable that the model trained with DLR tiles only results in a sharp delineation of the burned areas from their surroundings, while the model trained with ICNF and CAL FIRE data produces gradually decreasing probabilities at the borders of the burned areas. Furthermore, misclassifications occur more frequently and more clearly in the model trained with DLR data only. This is probably due to the fact that ICNF and CALFIRE provide their data as vector files representing the extent of the burned area, rather than a per pixel segmentation. The network learns this characteristic and takes the neighborhood much more into account to create homogenous burned areas. A further aspect that can be observed in Figure 10 is that the model trained with DLR data only is very prone to detecting water surfaces as burned areas, such as the ocean around the island in study area A, lakes in study area B and the river in study area C. Training a neural network with both datasets, as with the final model, constitutes a compromise between the advantages and disadvantages of the datasets. The corresponding probability maps are illustrated in Figure 9a-c. While there are hardly any misclassifications in water areas, however, burned areas do not have a sharp outline. This extensive discussion of the influence of the input data on the network's performance highlights the importance of high quality training data since the neural network always learns by imitation. Consequently, misclassifications in the inputs will be reproduced. Thus, the neural network should be ideally trained with a large amount of manually digitized burned areas and not with automatically derived products that are subject to a high degree of uncertainty. The training data used in this work do not fully meet these requirements. Nevertheless, we still achieve very good results in our final network model, even training with a freely available or automatically derived dataset without further modifications.
After discussing the impact of input data on the network's results, we now focus on the segmentation results of the final network model. In Section 3, we already mentioned that major This extensive discussion of the influence of the input data on the network's performance highlights the importance of high quality training data since the neural network always learns by imitation. Consequently, misclassifications in the inputs will be reproduced. Thus, the neural network should be ideally trained with a large amount of manually digitized burned areas and not with automatically derived products that are subject to a high degree of uncertainty. The training data used in this work do not fully meet these requirements. Nevertheless, we still achieve very good results in our final network model, even training with a freely available or automatically derived dataset without further modifications.
After discussing the impact of input data on the network's results, we now focus on the segmentation results of the final network model. In Section 3, we already mentioned that major misclassifications occur at the rocky coastline of the volcanic island of Pantelleria in study area A. In previous assessments of other fire events, we also observed misclassifications of rocks of volcanic origin as burned areas. Furthermore, the probability map of study area B (Figure 9b) points to a likelihood of confusion with dark water areas and agricultural fields. However, a far greater proportion of overclassification lies within the burned areas themselves. In most instances, these are characteristically narrow areas such as roads. An explanation may be that only four of the spectral bands received by the network as input have an original spatial resolution of 10 m. The two SWIR bands only have an original resolution of 20 m, which makes it difficult to detect structures with a smaller width. However, in study area A, even islands of unaffected forest of up to four ha were included into the burned area segmentation result of the U-Net. This can be attributed to the large proportion of ICNF and CAL FIRE data within the training tiles (79%). As learned from these data, the network aims at creating homogenous burned areas even under circumstances where there are smaller unburned areas therein. Figure 11 shows an example of this observation with an inset over a burned area in study area A.
Remote Sens. 2020, 12, x FOR PEER REVIEW 18 of 23 misclassifications occur at the rocky coastline of the volcanic island of Pantelleria in study area A. In previous assessments of other fire events, we also observed misclassifications of rocks of volcanic origin as burned areas. Furthermore, the probability map of study area B (Figure 9b) points to a likelihood of confusion with dark water areas and agricultural fields. However, a far greater proportion of overclassification lies within the burned areas themselves. In most instances, these are characteristically narrow areas such as roads. An explanation may be that only four of the spectral bands received by the network as input have an original spatial resolution of 10 m. The two SWIR bands only have an original resolution of 20 m, which makes it difficult to detect structures with a smaller width. However, in study area A, even islands of unaffected forest of up to four ha were included into the burned area segmentation result of the U-Net. This can be attributed to the large proportion of ICNF and CAL FIRE data within the training tiles (79%). As learned from these data, the network aims at creating homogenous burned areas even under circumstances where there are smaller unburned areas therein. Figure 11 shows an example of this observation with an inset over a burned area in study area A. Figure 11. U-Net prediction and reference burned area highlighting details within a burned area in study area A. The background image is a composition of the B12, B8 and B2 spectral bands and burned areas appear in red.
Compared to the reference burned area, which had been manually digitized based on Copernicus EMS rapid mapping products, the U-Net returns a coarse extent of the burned area rather than a pixel-based segmentation. Furthermore, the U-Net not only considers completely burned areas, but also those only partially affected, for example, the coniferous forest area located in the east of the inset. The forest here appears in a significantly darker green than in the surrounding areas. This is probably due to the fact that the forest was also affected by the fire, but the tree crowns did not burn completely. Another example of areas that are only partially affected, but still have been predicted as burned by the U-Net can be found in the central part of the inset in Figure 11. Pixels appear in light red color and change frequently to green pixels. This indicates mixed pixels, which contain both burned and unburned areas within the 100 m 2 area that each pixel covers. Even when performing a visual interpretation of these two examples, it is difficult to determine whether the pixel should belong to either the burned or unburned class. This unclear definition is a general problem in burned area segmentation in earth observation data and not only refers to partially affected areas. Old burned areas are another problem. In particular, the degree to which a previously burned area has recovered needs to be defined so that under this set of conditions, such an area would no longer be considered a part of the burned class. It must be defined beforehand whether a Figure 11. U-Net prediction and reference burned area highlighting details within a burned area in study area A. The background image is a composition of the B12, B8 and B2 spectral bands and burned areas appear in red.
Compared to the reference burned area, which had been manually digitized based on Copernicus EMS rapid mapping products, the U-Net returns a coarse extent of the burned area rather than a pixel-based segmentation. Furthermore, the U-Net not only considers completely burned areas, but also those only partially affected, for example, the coniferous forest area located in the east of the inset. The forest here appears in a significantly darker green than in the surrounding areas. This is probably due to the fact that the forest was also affected by the fire, but the tree crowns did not burn completely. Another example of areas that are only partially affected, but still have been predicted as burned by the U-Net can be found in the central part of the inset in Figure 11. Pixels appear in light red color and change frequently to green pixels. This indicates mixed pixels, which contain both burned and unburned areas within the 100 m 2 area that each pixel covers. Even when performing a visual interpretation of these two examples, it is difficult to determine whether the pixel should belong to either the burned or unburned class. This unclear definition is a general problem in burned area segmentation in earth observation data and not only refers to partially affected areas. Old burned areas are another problem. In particular, the degree to which a previously burned area has recovered needs to be defined so that under this set of conditions, such an area would no longer be considered a part of the burned class. It must be defined beforehand whether a segmentation model should be applied to only recently burned areas or also to older ones. When refining the Copernicus EMS rapid mapping products, we decided to consider only freshly burned areas that are completely destroyed. However, our training dataset contains tiles of different databases, which in turn (except for DLR data), are a collection of mapping products from different providers, each of which may have a different definition of a burned area. Therefore, the misclassifications in our three study sites are not only due to the performance of the network, but above all due to the unclear definition of burned areas. This is attributed to the network's reception of heterogeneous training data from the three different sources and, therefore, it is unable to learn clear rules.
An advantageous aspect of our approach addresses this problem. In particular, the operator can decide whether to receive a segmentation of the clearly burned areas or an extent of possibly damaged areas. This is done by changing the probability threshold to predict a pixel as burned. The default threshold is set to 0.5. While increasing the threshold results in a more precise delineation of the burned area, it can also lead to the omission of partially affected areas. The operator can choose a threshold that best fits the desired mapping style and that achieves optimal results with respect to the data. Figure 12 illustrates the results based on the application of different thresholds over the same aforementioned inset in study area A.
Remote Sens. 2020, 12, x FOR PEER REVIEW 19 of 23 segmentation model should be applied to only recently burned areas or also to older ones. When refining the Copernicus EMS rapid mapping products, we decided to consider only freshly burned areas that are completely destroyed. However, our training dataset contains tiles of different databases, which in turn (except for DLR data), are a collection of mapping products from different providers, each of which may have a different definition of a burned area. Therefore, the misclassifications in our three study sites are not only due to the performance of the network, but above all due to the unclear definition of burned areas. This is attributed to the network's reception of heterogeneous training data from the three different sources and, therefore, it is unable to learn clear rules.
An advantageous aspect of our approach addresses this problem. In particular, the operator can decide whether to receive a segmentation of the clearly burned areas or an extent of possibly damaged areas. This is done by changing the probability threshold to predict a pixel as burned. The default threshold is set to 0.5. While increasing the threshold results in a more precise delineation of the burned area, it can also lead to the omission of partially affected areas. The operator can choose a threshold that best fits the desired mapping style and that achieves optimal results with respect to the data. Figure 12 illustrates the results based on the application of different thresholds over the same aforementioned inset in study area A. In addition to this flexibility in defining what should be detected as burned, our approach has several other benefits. Compared to other burned area segmentation algorithms, a major advantage is that no previous feature engineering needs to be performed. This is generally not the case in RF approaches, for example. The network does not require any auxiliary data and receives as input Sentinel-2 data at level L1C, which have not been atmospherically corrected. The omission of this step significantly reduces the satellite data preprocessing time. Additionally, the overall processing time is shortened by the fact that only a post-event image is required. Compared to bi-/multi-temporal approaches, such as the one pursued so far by DLR [17] or the authors of [18], the final segmentation is less influenced by clouds because the invalid pixel mask is not composed of clouds from both a pre-and a post-fire scene. Pinto et al. [46] solved the problem attributed to the presence of clouds by using a time-series of satellite data. However, a large amount of data would have to be downloaded or be permanently provided to use this method in a rapid mapping context, for example, to provide timely burned area extents for an initial damage assessment. In contrast, our approach presents a simple and rapid end-to-end solution that includes data ingestion, preprocessing and burned area segmentation targeted towards operational use in forest fire management. For instance, the approach could be used to continuously monitor a specific region of interest. Furthermore, the use of high resolution Sentinel-2 data in our approach clearly improves the temporal availability and resolution of burned area delineations compared to methods based on Landsat or MODIS data. In addition to this flexibility in defining what should be detected as burned, our approach has several other benefits. Compared to other burned area segmentation algorithms, a major advantage is that no previous feature engineering needs to be performed. This is generally not the case in RF approaches, for example. The network does not require any auxiliary data and receives as input Sentinel-2 data at level L1C, which have not been atmospherically corrected. The omission of this step significantly reduces the satellite data preprocessing time. Additionally, the overall processing time is shortened by the fact that only a post-event image is required. Compared to bi-/multi-temporal approaches, such as the one pursued so far by DLR [17] or the authors of [18], the final segmentation is less influenced by clouds because the invalid pixel mask is not composed of clouds from both a preand a post-fire scene. Pinto et al. [46] solved the problem attributed to the presence of clouds by using a time-series of satellite data. However, a large amount of data would have to be downloaded or be permanently provided to use this method in a rapid mapping context, for example, to provide timely burned area extents for an initial damage assessment. In contrast, our approach presents a simple and rapid end-to-end solution that includes data ingestion, preprocessing and burned area segmentation targeted towards operational use in forest fire management. For instance, the approach could be used to continuously monitor a specific region of interest. Furthermore, the use of high resolution Sentinel-2 data in our approach clearly improves the temporal availability and resolution of burned area delineations compared to methods based on Landsat or MODIS data.
The testing of our final segmentation model shows an overall accuracy of 0.98 and a kappa coefficient of 0.94. However, the three use cases show that there are still problems with misclassifications. In the future, segmentation results could be further improved by integrating additional modules to the processing workflow. As it is common in burned area mapping, the segmentation results of the U-Net could be used to determine pixels having a high probability of being burned. In a second phase, region growing could be performed to define the final burned area mask. Furthermore, areas in which fires do not occur (such as water bodies) could be masked out in advance to reduce instances of misclassifications.
While the results of this work are promising, it should be noted that our segmentation approach has only been tested on burned areas without active fires and that were not covered by clouds. However, cloud cover is a major issue, especially in tropical areas. To address this topic, the model should be tested with other optical data sources, such as Landsat and adjusted if necessary. Integration of radar data should also be considered. Furthermore, especially for long lasting fires like the Australian bush fires of 2019/20, it is likely that smoke partially covers the satellite scenes, which may impede accurate burned area mapping and, thus, needs to be specifically addressed.
To date, we performed an extensive validation of our model on three selected fire events. However, as the qualitative analysis of the corresponding Sentinel-2 scenes has shown, there are specific areas that are prone to be falsely detected as burned areas, such as agricultural fields or open-cast mining areas. In the future, the model will be tested over larger regions to identify any additional, problematic area. Furthermore, we expect to improve our training data so that all biomes prone to fire events are fully represented, to realize the long-term goal of developing a model for global application.

Conclusions
In this work, we aimed at combining sensor and method developments of the past few years. We proposed a fully automatic processing chain to segment burned areas in Sentinel-2 imagery, including data ingestion, data preprocessing and burned area mapping. Hence, rapid mapping agencies tasked with burned area mapping could apply the methodology without having to manually download and preprocess data and can, thus, benefit from the available high-resolution Sentinel-2 data. Our segmentation model is based on a deep neural network and achieves an overall accuracy of 0.98 and kappa coefficient of 0.94. However, as our independent tests have shown, there are still some problems with certain types of misclassifications, especially at the shores of dark water bodies, in agricultural fields and for rocks of volcanic origin. Furthermore, our model tends to create a homogenous burned area mask closing small unaffected areas. However, the operator can counteract this by raising the probability threshold to detect a pixel as burned. To verify the global applicability of our model, further testing must be conducted, especially over a more comprehensive range of biomes. Furthermore, to ensure the utility of the model during ongoing fire events, the proposed processing chain should be extended with the inclusion of a smoke detection module.