Double-Step U-Net: A Deep Learning-Based Approach for the Estimation of Wildﬁre Damage Severity through Sentinel-2 Satellite Data

: Wildﬁre damage severity census is a crucial activity for estimating monetary losses and for planning a prompt restoration of the affected areas. It consists in assigning, after a wildﬁre, a numerical damage/severity level, between 0 and 4, to each sub-area of the hit area. While burned area identiﬁcation has been automatized by means of machine learning algorithms, the wildﬁre damage severity census operation is usually still performed manually and requires a signiﬁcant effort of domain experts through the analysis of imagery and, sometimes, on-site missions. In this paper, we propose a novel supervised learning approach for the automatic estimation of the damage/severity level of the hit areas after the wildﬁre extinction. Speciﬁcally, the proposed approach, leveraging on the combination of a classiﬁcation algorithm and a regression one, predicts the damage/severity level of the sub-areas of the area under analysis by processing a single post-ﬁre satellite acquisition. Our approach has been validated in ﬁve different European countries and on 21 wildﬁres. It has proved to be robust for the application in several geographical contexts presenting similar geological aspects.


Introduction
In recent years, European countries experienced an increase of wildfire events, causing extensive damage from environmental, humanitarian, and economical perspectives [1]. After a wildfire extinction, competent offices of public bodies report the perimeter of the burned areas and the severity of the damage in each burned sub-area. The census of the burned areas is usually adopted (i) for estimating the economical damage, and (ii) for planning a full environment restoration.
Europe actively supports wildfire census operations through the Copernicus Emergency Management Service (Copernicus EMS) [2]. The service provides certified information about the hit areas through delineation and grading maps. More precisely, delineation maps determine the boundaries of areas hit by wildfires, while grading maps estimate the severity of environmental damages. The damage level estimation activity, which is used to create the grading maps, is performed by the EMS experts and it consists in assigning to each sub-area of the area hit by a wildfire a numerical severity level between 0 and 4, where 0 is assigned to unburned sub-areas with no damages, while 4 is assigned to the burned areas that have been completely destroyed by the wildfire. The intermediate levels are used to represent burned sub-areas associated with negligible (1), moderate (2), or high (3) and other preprocessing operations, are not needed when our approach is used because only the post-wildfire image is needed.
A recent work of Sauliano et al. [25] studied the relatedness between the dNBR computed from Landsat-8 and Sentinel-2 satellites acquisitions with the CBI of an area affected by fire in Italy. The dNBR computed from Sentinel-2 acquisitions resulted to be a valid estimator of the CBI and therefore very useful for the estimation of severity levels. However, Franco et al. [26] adjusted the thresholds of both dNBR and the delta Normalized Difference Vegetation Index (dNDVI) values to evaluate the severity of two wildfires occurred in Patagonia between December 2013 and January 2014. Like the dNBR, the dNDVI is computed as the temporal difference of two indices-NDVI pre-fire and NDVI post-fire-sensible to the presence of vegetation. The work proves that, under an accurate fine-tuning, those indices are valid estimators of the more accurate CBI. However, the complex (non-automatic) fine-tuning of a set of thresholds is needed. Moreover, Xu et al. [27] evaluated the damage severity of a single wildfire ignited in the Great Hinggan Mountains (China) using fine-tuned thresholds on NBR and dNBR. Moreover, in that case, the time-consuming and complex fine-tuning operation is needed to obtain good results.
Other works imply the adoption of machine learning techniques. Zheng et al. [28] proposed a transfer learning approach based on a Support Vector Regressor (SVR) to predict the CBI index value, which outperforms the aforementioned approaches. However, the technique proposed by Zheng et al. was validated on pre-and post-fire satellite acquisitions on a unique fire event occurred in the Southwest of United States, in 2002 and therefore its generality has not been validated. Finally, another work worth mentioning is proposed by Gibson et al. [29], which examines eight wildfires affecting New South Wales (eastern Australia). It runs several tests using the Random Forest approach, trained on several sets of spectral indices (dNBR, RdNBR, dFCB, dNDVI, dBSI, etc.), highlighting the contribution of each one to the final prediction of the CBI index value. Also, in this case, the features are computed from pre-and post-wildfire acquisitions and the proposed approach has been validated on a single event.
Overall, the state-of-art techniques leverage on dNBR and other indices derived from spectral bands acquired from satellites (i.e., RdNBR, NBR, and dNDVI), either computing thresholds or applying state of the art machine learning approaches. Generally, they are based on the comparison of pre-and post-wildfire satellite imagery, which is not always possible, and on region-dependent empirical thresholds (i.e., on non-generalized approaches).
To address the automation of the severity damage estimation problem, we propose a generalized and an automatic deep learning-based system that, given a set of training grading maps related to past wildfires, provides a severity map for a new area by analyzing Sentinel-2 satellite data acquired after a new wildfire. The automatically returned grading map uses the same severity scale used by the EMS experts. Therefore, our system can be used to automatize the EMS grading map generation after a wildfire.
The main contributions of this paper are as follows.
• An automatic system, called Double-Step U-Net, for the estimation of wildfire damage severity by means of deep-learning models inferred from Sentinel-2 satellite images.

•
An automatic solution based only on post-wildfire imagery. • A land-type-independent approach that can be applied to European regions.
The paper is organized as follows. Section 2 introduces (i) the data sources used to fetch data and annotations (Section 2.1); (ii) data acquisition, processing, and analysis (Section 2.2); (iii) the problem statement (Section 2.3), (iv) the proposed methodology (Section 2.4); (v) the framework and tools adopted to accomplish this paper (Section 2.5); and (vi) the experimental setup, the hyperparameter adopted for the used machine learning algorithms, and the evaluation metrics used to compare the performances (Section 2.6). Section 3 reports and analyses the results, discussing the pros and cons of the proposed method with respect to current approaches. Finally, Section 4 draws conclusions and describe future work.

Materials and Methods
This section describes the proposed system, called Double-Step U-Net, which is based on the analysis of Sentinel-2 satellite data through deep learning models. Double-Step U-Net is based on the joint exploitation of a two-stage approach that combines a classification and a regression algorithm to address the post-wildfire severity estimation process. In this section, we initially describe the used data (Section 2.1) and how they were collected (Section 2.2). Then, we formalize the addressed problem (Section 2.3) and we describe in detail the proposed system (Section 2.4) and how we implemented it (Section 2.5).

Data Sources
In this work, two different sources of information have been adopted: Copernicus Sentinel-2 [30], which provides satellite imagery, and Copernicus Emergency Management Service (EMS) [2], which provides manually generated damage severity maps of burned regions hit by past wildfires. The EMS damage severity maps where used as ground truth grading maps.
Copernicus is the European Union's Earth Observation Programme, implemented in partnership with the Member States and the European Space Agency (ESA), offering information services based on satellite Earth Observation and in situ (non-space) data. Sentinel-2 is the second mission of the Sentinel spatial program, concerning satellites able to gather spectral data. In particular, the service can provide two kinds of products: Level-1C and Level-2A. Level-1C products provide raw data [31,32], composed of 100 × 100 km 2 tiles (ortho-images in UTM/WGS84 projection), resulting from the use of a Digital Elevation Model (DEM) to project the image in cartographic geometry. The tiles are acquired as 13 different spectral wavelengths and spatial resolutions, summarized in Table 1. Level-2A products [33,34] are generated applying an algorithm for atmospheric reflectance correction on Level-1C products, resulting in an orthoimage Bottom-Of-Atmosphere (BOA) corrected product. The applied correction reduce (i) the noise brought by natural conditions, like air turbulence and fog, and (ii) the influence of aerosols, producing a more qualitative image which highlights ground information. It is applied on every spectral band listed in Table 1, except for band 10 which does not contain any surface information and, therefore, it is omitted. Given the goal of this paper to estimate burning severity levels, it is crucial to avoid atmospherical noise, which could lead to erroneous analyses and predictions. Therefore, in this paper, Level-2A products were used as source data. Currently, those products are very large (about 600 MB for each tile) and thus hard to manage in raw format. Therefore, we used Sinergise Sentinel-Hub Service [35], a cloud based engine that handles the complexity of management of raw data internally, making Earth observation imagery easily accessible through Application Programming Interfaces (APIs).
Through the Emergency Management Service, Copernicus provides annotations on natural disasters made by domain experts who, depending on the situation, can use in situ measurements, aircrafts, and Sentinel satellite imagery to record the effects of the hazard and the severity of the damage. In the wildfire context, the EMS severity level ranges from 0 (unburned sub-area with "No damages") to 4 ("Completely Destroyed" sub-area). Intermediate values are used to represent the following situations; "Negligible to slight damage", "Moderately Damaged", and "Highly Damaged".
In this paper, we opted for the Copernicus EMS annotations because they are freely available for many past wildfire events.

Data Acquisition, Preprocessing, and Analysis
As introduced in the previous section, Copernicus EMS provides grading maps indicating the severity of damage caused by wildfires. They refer to (i) an Area of Interest (AoI) and (ii) a reference date (marked as "Situation as of" in the map's cartouche). The AoI is a squared region which includes the area/s hit by the wildfire. It is composed of two tuples of coordinates <Longitude, Latitude>, which indicate the top-left and bottom-right edges of the region. The reference date is the post-wildfire date used as a reference for the analysis by the domain experts.
Sentinel-2 data was downloaded referring to both the AoI and the reference date specified in the EMS grading map. However, it is quite common that, under those constraints, the Sentinel-2 data of the area of interest could not be available. Commonly, the reason can be twofold: (i) AoI was not explored or partially explored by satellites in the reference date, or (ii) the AoI is mostly covered with clouds. Therefore, to be used in this paper, Sentinel-2 acquisitions were subjected to three constraints: (i) the satellite acquisition must be equal to the reference date, (ii) data must be available for at least the 90% of the AoI, and (iii) cloud coverage must not exceed the 10% of the AoI. While the data availability is given by the Sentinel-Hub service, the cloud coverage value was estimated according to the method proposed by Braaten, Cohen and Yang [36].
In this paper, 21 Copernicus EMS grading maps have been collected from 5 European regions: Portugal, Spain, France, Italy, and Sweden. The Sentinel-2 data have been downloaded at the highest resolution. Then, the AoIs were split into 7 folds, according to two different constraints: (i) a fold must include at least two AoIs, and (ii) areas of interest must be geographically close. A representation of the geographical distribution of the wildfire-affected regions and their categorization in folds is shown in Figure 1.
Once downloaded, satellite acquisitions are images of dimensions (W × H × D). W and H are the acquisition Width and Height, respectively, and are up to 5000 × 5000 pixels. D, the Depth, is the number of spectral bands, which is 12 for Sentinel-2 L2A imagery. Copernicus EMS grading maps are images of size W × H, having the same dimensions of the satellite acquisitions. Each pixel of the Copernicus annotation determines the damage severity level of the corresponding pixel-at the same row and column-of the Sentinel-2 acquisition. Details about the collection, especially the Copernicus EMS annotations, the dates in which Sentinel-2 data were acquired, and the fold they were assigned to are reported in Appendix A. Moreover, in order to allow the proposed method to generalize among different kinds of vegetation, another aspect considered in the dataset is the heterogeneity of land use, which includes inland areas with dense vegetation (i.e., red fold), areas characterized by cropland and small or sparse trees (i.e., fucsia fold), coastal areas (i.e., blue fold) and rural areas with little or no vegetation (i.e., yellow fold). A detail of the land use distribution for every AoI included in the dataset is reported in the Appendix B. In order to assess the spectral bands that can provide useful information for the detection of the damage severity, the Pearson's correlation between the bands of the collected AoIs, the dNBR, and the ground truth EMS damage level have been investigated. We considered also the dNBR index because it is widely used in previous papers and we want to understand if that index, based on pre-and post-imagery, is more correlated to the target variable than the Sentinel-2 bands. The Pearson's correlation is a measure computed between two variables and it ranges between −1 and 1. High correlation is determined at the extremes of its domain: a highly positive or a highly negative value means the two variables are directly or indirectly related, respectively. More in detail, no correlation is expressed by 0, it is low for values between −0.35 and 0.35, and medium to strong for the remaining values [37]. In order to compute the correlation coefficient, a transformation was applied to each image and annotation. Each spectral band, the dNBR, and the EMS severity grading maps (ground truth, GT), all having dimensions (W × H), have been flattened into a vector of length (W × H), in order to resemble statistical variables. In the following, we use GT (Ground Truth) to refer to the target variable of this analysis, i.e., the damage severity level. As shown in Figure 2, the spectral bands presenting noticeable correlations (medium or high) with both the dNBR index and the target variable GT (i.e., the damage severity level) are B06, B07, B08, B8A, and B09. Except for B09, the other bands are known in the literature for the computation of the most used indexes for fire detection, such as the Burned Area Index (BAI) [17], the Burned Area Index for Sentinel2 (BAIS2) [38], the Normalized Burn Ratio (NBR) [39], the Normalized Burn Ratio 2 (NBR2) [16], and the Mid-Infrared Burned Index (MIRBI) [40]. Moreover, B07 and B08 are used for the computation of (i) the vegetation index, Normalized Difference Vegetation Index (NDVI) [41], and (ii) the water index, Normalized Difference Water Index (NDWI) [42], respectively. Moreover, it is worth considering the high correlation between dNBR and GT, which is coherent with the literature and confirms the dNBR to be a good estimator of the GT. However, also the Sentinel-2 bands are characterized by high correlation values with respect to the target variable and they can be obtained by considering one single post-fire image, whereas two imagery (pre-and post-fire) are needed to compute the dNBR index.  In Section 2.4, the new proposed approach, which leverages on Convolutional Neural Networks (CNNs), will be introduced. Given the huge interest from the scientific community in such algorithms, a common practice is to use consolidated approaches by adapting the general methodology to the investigated domain. As most CNNs, the input size is reduced due to computational performances and hardware limitations during the training phase. The high-resolution images retrieved from Sentinel-Hub (and consequently, the grading maps) have dimensions up to 5000 × 5000 pixels, but the image size is too big to be processed by a CNN in one shot (due to GPU memory limitations), therefore it needs to be re-adapted. In this paper, we opted for preserving all the provided information by tiling the original acquisitions in smaller crops of size 480 × 480 pixels, maintaining the spectral information as provided by Sentinel-Hub. Moreover, only the crops containing at least one pixel classified as burned (a damage severity level between 1 and 4) have been included in the dataset. In the end, the dataset contains a total of 135 crops, distributed in folds as follows; blue fold: 8, brown fold: 9, fucsia fold: 30, green fold: 16, orange fold: 18, red fold: 12, yellow fold: 42. As easily predictable, the dataset's folds present unbalanced EMS damage severity levels, as shown in Figure 3.

Problem Statement
This work focuses on the estimation of the EMS damage severity levels for the sub-areas of an area hit by a new wildfire. More formally, given a post-fire Sentinel-2 L2A satellite acquisition (an image of 12 spectral bands) of an area that has been hit by a new wildfire the goal is to predict a continuous value in the range [0, 4] ∈ for each pixel of the input post-fire image, in order to approximate the Copernicus EMS damage severity grading map values, whose severity levels are natural numbers within the same range. The problem is configured as a regression task because the target variable is a numerical feature, which is used to represent ordered severity values. A set of training post-wildfire imagery related to past wildfires, for which the value of the target variable is known, is used to train a predictive model that is then applied to perform the prediction for the pixels of the new post-wildfire imagery.

Methodology
As briefly introduced in Section 2.2, one of the most recent and promising approaches developed in the computer vision field is U-Net [43]. The model, inspired from a Convolutional Neural Network presented to the Computer Vision for Pattern Recognition conference [44], was originally employed in the medical field for the segmentation of biological cells and for the analysis of MRI scans for the detection of a number of pathologies [45][46][47].
The network concatenates two branches: a contracting path and an expansive path, giving it the u-shaped architecture. The contracting path is a classical convolutional network, consisting of repeated application of convolutions, each followed by a rectified linear unit (ReLU) and a max-pooling operation. During the contraction, the spatial information is reduced while feature information is increased. The expansive path combines both the features and the spatial information through a sequence of up-convolutions and concatenations with high-resolution features from the contracting path. The original version takes as input one grayscale image of dimension 572 × 572 pixels and it outputs a binary mask of 388 × 388 pixels. The input is oversized compared to the output: that choice was made to give some extra context to include information of the biological cells outside exceeding the border of the considered tile. In the context of this paper, the dimension of a burned area is not predictable, therefore the authors opted for setting the network's width and height input and output dimensions to 480 × 480 pixels for each of the 12 spectral bands. In the end, the U-Net's input dimension is set to 480 × 480 × 12 pixels.
One of the main contributions of this study is the modification of the original U-Net, empowering its ability to distinguish between ordered classes, as in the case of damage severity. The classical U-Net was proposed for solving a segmentation task, being able to identify a specified entity in an image. Therefore, it is able to recognize relations and features (i.e., borders and gradients) among the pixel values belonging to the searched entity. Conversely, in the context of this work, the goal appears to be more complex and it can be split into two sub-tasks: (i) to identify areas affected by fire, and (ii) to determine damage severity in the burned areas. In the first sub-task, the goal is to distinguish burned areas from unburned regions, like in a classical segmentation task (classification task). The second sub-task takes into account the areas affected by the fire and discriminates between four consequent levels the severity of the damage (regression task). The two sub-tasks are solved with two different building blocks: the "Binary Classification U-Net" and the "Regression U-Net". In both the building blocks, the output map is a matrix of dimensions 480 × 480, where each element refers to the pixel at the same position of the input tile. In the proposed solution, the two building blocks are combined together to outperform the prediction quality of the approach based only on the Regression U-Net block.
First, the Binary Classification U-Net is trained for segmentation purposes: given a satellite image of 12 channels and a resolution of 480 × 480, the network assigns to each pixel the probability of belonging to a burned area thanks to the application of the softmax activation function. Thus, the generated output is a binary segmentation map of size 480 × 480 with values {0, 1} (i.e., unburned or burned), where each pixel is assigned to the class with the highest probability. Second, the Regression U-Net is used to provide the severity level estimation. Given the input satellite image, the model generates as output a map of the same size as the input with values in range [0,4]. Both architectures are based on the U-Net model, with the only difference on the softmax activation function [48] at the output layer of the Binary Classification U-Net, which is absent in the Regression U-Net, and the loss functions used during the training process, as described in Section 2.6.  By combining the two mentioned building blocks differently, we have considered three different approaches: • A regression-only approach, in which only the Regression U-Net is used.

•
A parallel approach, namely, Parallel U-Net, in which the two building blocks are used separately in parallel. The final output is obtained by the mathematical multiplication of the two outputs as shown in Figure 4. • A two-step approach, namely, Double-step U-Net, in which the building blocks are concatenated. First, the binary segmentation model is used to predict the burned area regions in the input tile.
The binary prediction is used to filter the input and isolate the burned regions, thus generating the input tile for the Regression U-Net, which provides the damage severity estimation. Figure 5 shows the simplified architecture of the proposed solution with sample images, where for satellite imagery only the RGB channels are shown for simplicity.
A detailed version of the Double-step U-Net architecture is shown in Appendix C.

Frameworks and Tools
In this section, the hardware components of the workstation used to run the experiments are introduced, as well as a detail of the software packages used to develop this work.
Geospatial data from Copernicus EMS was processed using the GDAL software library [49] to determine AoIs' coordinates and to identify the precise regions affected by forest wildfires and the corresponding severity levels. Sentinel-Hub services were used to collect Sentinel-2 data.
Satellite imagery was processed through Python with OpenCV and scikit-image [50] libraries. Data analysis was performed through scikit-learn [51], whereas neural network models were developed and trained using PyTorch framework [52]. All the software packages and versions are specified in Table 2. The experiments were run on a workstation with an Intel Core i9-7940X @ 3.10GHz with 128GB of RAM and 4x GTX 1080Ti.

Experiments
Training neural networks is a complex and computationally intensive task, which requires proper data preparation steps and hyperparameters selection. This section presents the experiments performed, detailing the training and the evaluation procedures adopted.
As introduced in Section 2.2, the satellite acquisitions were divided into seven folds. The grouping criteria was chosen by considering geographical distance among the position of the acquisitions, in order to include in the same fold geographically close regions that could share similar morphology and land cover aspects, such as vegetation types, infrastructures, and agricultural areas. The models' performances were evaluated through a cross-validation approach, using the Root Mean Squared Error as the evaluation metric. At each iteration, five folds are used as the training set, one as the validation set, and the remaining fold as the test set. The validation set is used to assess the model's performance for the early stopping regularization criteria, discussed later in this section. A common prerequisite in supervised learning algorithms is that the training, validation and test data arise from the same distribution and are independent and identically distributed [53]. Therefore, in order to work properly, a validation set should resemble the data distribution of the test set. However, as shown in Figure 1, each fold presents a unique distribution of severity levels. In a real situation, there is no chance to know the distribution of severity levels a priori. Therefore, the choice fell on a fold which contains all severity levels and which could generalise the most, presenting a distribution of severity levels which tends mostly to a uniform distribution. Considering all those aspects, we chose the "fucsia" fold as the validation set for each test set, except for itself: in that case, we chose the "green" fold.
To improve models generalization, data augmentation was applied on the training set of each fold. During the training process, synthetic data was generated at each epoch from the 480 × 480 tiles by randomly applying four transformations in sequence to each image: random rotation, random horizontal flip, random vertical flip, and random shear. Each transformation has a specific probability to be applied. Rotation and shear effects were performed with randomly generated angles within specific ranges, differently for each image in each epoch. For reproducibility purposes, random generation operations were performed using the same seed in each training.
The transformations and their respective parameters are shown in Table 3. All the training phases were performed using Adam optimizer with a learning rate of 1 × 10 −4 , 50 epochs and a batch size of 8. Two different models were considered in this study: • the single U-Net: it is trained for the regression task. The Mean Squared Error (MSE) was used as a loss function. This model will be referred to as the baseline; • the Double-Step U-Net: it is trained in two steps: First, only the Binary Classification U-Net is trained with Dice loss function [54], keeping the weights of the regression model frozen. Second, the Regression U-Net was trained with Mean Squared Error (MSE) loss function. In this second step, the weights of the binary classification model are frozen.
To validate the approach, an ablation study will be presented. First, the performance obtained by the Binary U-Net on each fold will be reported. Then, the choice to link the two networks consecutively will be justified comparing Double-Step U-Net with the simplified Parallel U-Net model.
During the training process, three techniques for regularization were adopted: early stopping, dropout, and batch normalization. Early stopping was implemented to avoid overfitting and to stop the training process in case no further improvements were seen in the validation loss. A patience of 5 epochs was used with minimum improvements of 1 × 10 −2 on the validation loss. At the end of each training process, the model's best weights determined by the early stopping mechanism were restored. Dropout layers were enabled during the training process before each transposed convolution with a probability of 25%. Moreover, after each convolutional layer, batch normalization was performed. Moreover, to guarantee the reproducibility of the tests, in each training of the cross-validation process the networks were initialized with the same weights generated from the same seed number, using a normal distribution and the Glorot initialization [55]. The two subtasks introduced in Section 2 are evaluated as follows.

•
The binary classification between burned and unburned areas is evaluated with Precision, Recall, and F1-Score metrics [56]. Precision considers the purity of the predictions: among the pixels predicted as belonging to a certain class, e.g., belonging to a burned region, it indicates the percentage of matches with the GT. Recall verifies the ability of the estimator to recognize all the pixels belonging to a certain class, specified in the GT. Therefore, given the whole set of pixels belonging to a certain class (referring to the GT), the recall is the percentage of correctly predicted pixels among the whole set. The F1-Score is the harmonic mean between Precision and Recall. It is as a measure of accuracy with the property to consider the class imbalance.

•
The estimation of damage severity, concerning the distinction between 5 severity levels, is evaluated with Root Mean Squared Error (RMSE) metric. Given the ordinal relationship between damage severity levels, the RMSE gives a measure of distance between the prediction and the ground truth.

Results
This section shows and comments on the results of the experiments described in Section 2.6. First, U-Net's ability to distinguish between burned and unburned areas will be assessed. Then, the performance achieved to predict the damage severity level by the Double-Step U-Net, the Parallel U-Net and the Simple U-Net will be compared. Furthermore, their scores will be compared with the performance obtained by the dNBR-based solution, the reliable estimator largely adopted in the literature. Finally, the approaches will be discussed analyzing their average prediction error for each severity level.
In the binary classification task, the Binary U-Net achieved good performances, as shown in Table 4. According to the F1-Score, the approach was on the average able to classify correctly burned and unburned pixels (sub-areas) for the wildfires in every fold, except for the brown one, which contains data acquired from Sweden. In detail, in that fold, the Binary U-Net showed high Recall, but poor Precision: this means that the network overestimated the burned areas, predicting many false positives. The choice to use the same fold (fucsia, containing data acquired in Spain) as the validation set, helped us to understand the limits of applicability of the approach for data acquired far from the test set. Therefore, training and test areas with strong differences, in terms of geology or land cover aspects, can prevent the neural network from working properly in the binary classification task. However, this limitation is mitigated in the regression problem, discussed below in this section. A detailed performance report for the Single U-Net, the Parallel U-Net, and the Double-Step U-Net in every fold for the estimation of damage severity task is shown Table 5. As mentioned in Section 2.4, the three approaches use solely post-fire satellite acquisition for making their predictions. The table presents the RMSE evaluated on every fold, for every severity level, reported as an ordinal number for the sake of space. Severity levels are mapped as follows; 0 stands for No damage, 1 stands for Negligible to slight damage, 2 stands for Moderately Damaged, 3 stands for Highly damaged, and 4 stands for Completely destroyed. In a first analysis, we do not consider the dNBR column, but we focus only on the three networks performances. The best score for each row, considering only the U-Net-based approaches, is marked with the star symbol ( ). Compared to the Single U-Net, the approaches in which the outputs of the Binary U-Net and Regression U-Net are combined showed better overall performances. The Double-Step U-Net is the most accurate in the discrimination between severity levels (1 to 4), achieving best results in 5 folds out of 7 (blue, fucsia, green, orange, and yellow). The only exception is for the brown fold because the Regression U-Net is strongly dependent on the Binary U-Net's performance. However, the RMSE values of the brown fold for the Double-Step U-Net result to be comparable to the RMSE values of other folds (i.e., orange and yellow), and they are better than the ones of the consolidated approach based on dNBR. Therefore, Double-Step U-Net result to be also robust for regions presenting strong differences from the ones that are used to train the model. Considering the dNBR in the evaluation, the best performances per row in Table 5 are marked with the dagger symbol ( † ). In order to compare the dNBR with the GT, its values were thresholded according to the default values [23]. In this case, best results vary from fold to fold, but generally, they are matched by U-Net based approaches. It must be considered that the dNBR is computed using both pre-and post-fire acquisitions, whereas U-Net's approaches consider only post-fire acquisitions. In order to summarize the performance, the average RMSE value for each severity level is shown in Table 6. Compared to the Single U-Net, the Double-Step U-Net results to be a better approach, achieving the best RMSE on each severity level. Moreover, with reference to the dNBR, the Double-Step U-Net achieves comparable performance, with a noticeable improvement for the detection of the unburned area (severity 0), but using only half of the information. The reason behind the success of Double-Step U-Net is hidden in the problem split. First, the neurons of the Binary U-Net are employed to identify burned regions. Its prediction will mask the spectral values of unburned regions, leaving only the information related to burned areas to the Regression U-Net. Therefore, the latter network will employ its neurons in finding differences between correlated values (severity levels 1 to 4). It is worth mentioning that the masking operation performed by the Binary U-Net prediction introduces a new and uncommon value in the spectral information fed as input to the Regression U-Net: the 0 value. Areas identified as unburned will be "cancelled" by replacing their original value with 0, that is not present in nature. Therefore, a bad classification from the Binary U-Net can lead the Regression U-Net to make more mistakes because it will consider 0-valued-regions as unburned and every other unburned region not detected by the Binary U-Net will be considered as burned. In Figure 6, a comparison between predictions of dNBR, Single U-Net, and Double-Step U-Net is shown in two areas of the green fold. At a first glance, delineating the wildfire contours just looking at the RGB acquisition (pictures a1 and b1) seems feasible, but assigning different severity levels appears to be more challenging. In both the acquisitions, the Binary U-Net predictions resulted to be highly accurate (pictures a3 and b3), compared to the Copernicus EMS annotation (GT, pictures a2 and b2). The dNBR (pictures a4 and b4) show a good match with the GT, except for some noise in the vast unburned regions. The Single U-Net (pictures a5 and b5) correctly identifies the burned region and the contours of different burned areas, but it tends to underestimate the severity. In the end, the Double-Step U-Net (pictures a6 and b6) improves the prediction of the Single U-Net, resulting to be more similar to the GT. The time required by the Double-Step U-Net to perform the damage severity evaluation using one NVIDIA GTX 1080 Ti on an input tile of size 100 km × 100 km is, considering the highest spatial resolution available (reported in Table 1), 28 s on average. The time required to predict the severity level estimation by the proposed deep learning model for an image of 480 × 480 resolution with 12 spectral bands is inferior to 1 s.

Conclusions
This work introduces the application of a convolutional neural network, namely, U-Net for the estimation of the damage severity of regions affected by wildfires from satellite imagery. Compared to the literature, which commonly uses pre-fire and post-fire satellite acquisitions, this approach only makes use of post-fire data. Moreover, our approach results to be location-independent for the assessed European regions, and possibly for all the areas presenting common land-use as the AoIs presented in this work), being able to process geographically distributed satellite imagery. Furthermore, a modified version of U-Net named Double-Step U-Net, is created and introduced to improve the performance of the standard method. The approaches have been validated across five European regions, within 21 manually annotated wildfire events. As a result, Double-Step U-Net outperformed U-Net and achieved comparable performance to the thresholded dNBR, which is computed using both pre-and post-wildfire satellite acquisitions.
Future works will assess the contribution of other information, like Synthetic-aperture radar (SAR) data provided by Sentinel-1 data, or the Digital Elevation Maps (DEM) to improve the performances of Double-Step U-Net for distinguishing between wildfire severity levels. Moreover, the approach can be adapted and evolved to analyse the daily wildfire expansion, in order to provide a risk map for the potential areas that could be affected in the near future. Therefore, areas more likely to be affected by severe damage can be identified earlier, in order to let decision-makers to conduct better disaster response management and to limit overall damages. Author Contributions: Conceptualization, A.F. and L.C.; methodology, A.F. and L.C.; software, L.C. and A.F.; validation, A.F., L.C. and P.G.; formal analysis, A.F., L.C., and P.G.; investigation, A.F. and L.C.; resources, A.F. and L.C.; data curation, L.C. and A.F.; writing-original draft preparation, A.F., L.C. and P.G.; writing-review and editing, P.G.; visualization, A.F. and L.C.; supervision, P.G.; funding acquisition, A.F. and P.G. All authors have read and agreed to the published version of the manuscript. FOLD indicates in which fold the product belongs to, according to the colors represented in Figure 1.   Tables A2 and A3. They are specified in the grading maps cartouche, using the hectare (ha) as unit of measurement. The land use types are specified as follows (Land use types used in this work refer to the official Copernicus EMS notation, available at https://emergency.copernicus.eu/mapping/ems/ domains): • Residential/Industrial: urban areas involving residential or industrial buildings; • Arable land: also specified as cropland, non-irrigated arable land areas, permanently irrigated land, and rice fields;