Uni-Temporal Multispectral Imagery for Burned Area Mapping with Deep Learning

: Accurate burned area information is needed to assess the impacts of wildﬁres on people, communities, and natural ecosystems. Various burned area detection methods have been developed using satellite remote sensing measurements with wide coverage and frequent revisits. Our study aims to expound on the capability of deep learning (DL) models for automatically mapping burned areas from uni-temporal multispectral imagery. Speciﬁcally, several semantic segmentation network architectures, i.e., U-Net, HRNet, Fast-SCNN, and DeepLabv3+, and machine learning (ML) algorithms were applied to Sentinel-2 imagery and Landsat-8 imagery in three wildﬁre sites in two different local climate zones. The validation results show that the DL algorithms outperform the ML methods in two of the three cases with the compact burned scars, while ML methods seem to be more suitable for mapping dispersed burn in boreal forests. Using Sentinel-2 images, U-Net and HRNet exhibit comparatively identical performance with higher kappa (around 0.9) in one heterogeneous Mediterranean ﬁre site in Greece; Fast-SCNN performs better than others with kappa over 0.79 in one compact boreal forest ﬁre with various burn severity in Sweden. Furthermore, directly transferring the trained models to corresponding Landsat-8 data, HRNet dominates in the three test sites among DL models and can preserve the high accuracy. The results demonstrated that DL models can make full use of contextual information and capture spatial details in multiple scales from ﬁre-sensitive spectral bands to map burned areas. Using only a post-ﬁre image, the DL methods not only provide automatic, accurate, and bias-free large-scale mapping option with cross-sensor applicability, but also have potential to be used for onboard processing in the next Earth observation satellites.


Introduction
Wildfires are widely recognized as one of the most critical ecosystem disturbances, as they not only result in the significant loss of human lives and properties, but also affect biodiversity and the carbon cycle [1]. Accurate and timely mapping of burned areas is, therefore, needed for the assessment of economic losses caused by the wildfires, managing post-fire hazards such as landslides or mudflows, and planning of remediation and revegetation efforts. Historically, ground-based estimates were used to collect burned area information [2]. With the launch of Earth observation satellites, remote sensing has become a more efficient alternative to monitor wildfire extent due to its timely coverage of fire occurrences regionally and globally [3,4].
Over the past decades, coarse-resolution satellite sensors such as the Moderate Resolution Imaging Spectroradiometer (MODIS) have been used to identify the burned areas globally based on the thermal emission of burned vegetation [4]. For instance, some MODIS monthly burned area products, such as MCD64A1 [5] at 500 m resolution and FireCCI51 [6] at 250 m resolution, are currently available online, e.g., Google Earth Engine (GEE) platform. However, these burned area products often miss small burned areas that account for a significant proportion of the total burned area [7]. Therefore, it is desirable to evaluate effective methods for accurate burned area mapping using medium-to-high resolution satellite images.
Open access to Landsat-8 and Sentinel-2 satellite data with a global median average revisit at 2.9 days provides an excellent opportunity for mapping burned areas in nearreal-time (NRT) [8]. Various algorithms have been developed using Sentinel-2 and/or Landsat data, but most of the studies require a pre-fire image, dense time-series data, or an empirical threshold [9][10][11]. To address the main challenge of estimating burn scar from a single post-fire scene, the objective of this study is to investigate DL models for burned area mapping in two climate zones in comparison to other commonly used classical methods.

Related Studies
Sensors like Sentinel-2 and Landsat-8 have multispectral channels in the visible/nearinfrared (NIR) and short-wave infrared (SWIR) that are sensitive to fire disturbance. Removal or alteration of leaf structure and canopy cover vegetation caused by wildfire changes the land vegetated surface on the ground even to soil exposure in some extreme events. The amount of radiation reflectance from these changes can be represented as a function of the spectral wavelength [12]. For instance, more radiation in NIR is absorbed by fire-disturbed areas than unburned ones, while burned areas would reflect more radiation in the visible and SWIR bands [13]. Therefore, many well-known spectral indices (covering the visible/NIR, the NIR/short SWIR, and the short SWIR/long SWIR spectral spaces) have been proposed for burned area detection, e.g., MIRBI, EVI, NDVI, BAIM, NBR, NBR2, and BAIS2 [7,[14][15][16][17][18][19][20]. These indices can be further differentially normalized by the corresponding bitemporal pair (i.e., pre-fire and post-fire scenes) to enhance burned area discrimination (e.g., dNBR as a typical index), which additionally requires the cloud-free pre-fire satellite imagery.
Index thresholding is commonly used for burned area mapping, and the threshold values are often set empirically depending on visual interpretation, biome types, and tree cover percentages [21]. On the other hand, automated methods on indices such as OTSU [22] might have difficulties determining an optimal threshold for indices when the distributions of scene intensity and land cover are complex. Recently, an automatic thresholding chain was proposed for NRT burned area mapping at a national scale using Sentinel-2 data based on the dNDVI and RdNBR with mapping unit 1 hectare (ha) [23]. However, these bitemporal indices increase preprocessing time and limit future applications, especially for the next Earth observation satellite with onboard data processing and limited storage for previous scenes. Some automated methods based on spectral signatures require retraining when applied to diverse landscapes despite tuning thresholds to local conditions [24]. Therefore, the nonparametric machine learning (ML) algorithms have received much attention for their better performance in burned area detection than traditional threshold-based methods using spectral indices [25].
As ML algorithms depend on the distribution of the training data without any assumptions, automated burned area mapping becomes achievable. Various ML methods have been widely used in wildfire science including random forests (RF), Support Vector Machines (SVM), Artificial Neural Networks (ANN), decision trees, and MaxEnt [25]. Furthermore, dense harmonic time-series of Landsat data were used to identify burned areas [26,27]. An automated burned area mapping algorithm using paired Sentinel-2 images was introduced in [28], with an SVM for an initial pixel-based classification and a multiple spectral-spatial classification approach for smoothing the final burned area delineation. Moreover, a global burned area mapping approach was implemented using RF and a seed-growing approach based on time-series Landsat-8 images and GEE [29].
Automated burned area mapping with a uni-temporal post-fire image is promising but relatively challenging because of spectrally similar effects to burned areas caused by unrelated phenomena and disturbances such as shadows, agricultural harvesting, or plowing in the absence of ancillary information [10,30,31]. Various attempts have been made to map burned areas using a post-fire image. For example, logistic regression (LR) was applied to a single post-fire Landsat-5 Thematic Mapper (TM) image for burned land mapping [30]. Mitrakis et al. [32] compared a variety of ML algorithms including ANN, SVM, and Ada Boost Classifier (AdaBoost) for burned-area mapping in the Mediterranean region with one post-fire Landsat-5 TM image and found that all methods displayed similar accuracy. Likewise, Mallinis and Koutsias [33] compared 10 classification methods and concluded that the variance imposed by the methods is less than the variance imposed by factors differentiated locally in the study sites. Further, Pu and Gong [34] used LR to calculate probabilities of burned scars from a single post-fire Landsat 7 ETM+ image with acceptable overall accuracy. Stroppiana et al. [35] proposed a (semi)-automated multicriteria method for burned area mapping from uni-temporal Landsat TM images in the Mediterranean environment. This soft aggregation approach could reduce omission errors less than 3% but resulted in high commission errors of about 21%.
Recent studies demonstrated that deep learning (DL) algorithms have the capacity to automatically capture object features at multiple scales without the requirement for an extra user input but with a few specific hyperparameters [36]. Semantic segmentation architectures with convolutional neural networks (CNNs) have unique characteristics to extract the contextual information in multiple scales and then label each pixel of an image [37]. It is now playing a significant role in many image analysis tasks ranging from autonomous vehicles, human-computer interaction, robotics, medical research, precision farming, and so on [37][38][39][40][41][42][43]. For remote sensing, semantic segmentation algorithms have recently been applied to 2D satellite images and even 3D scenes [44,45]. For instance, automatic extraction of snow cover from high spatial resolution optical images was proposed using DL on a small dataset [46]. A fully convolutional networks model trained on very high resolution (VHR) optical satellite imagery was transferred to Sentinel-2 and SAR data in slum mapping [47]. CloudNet was presented to classify cloud and haze from Sentinel-2 imagery based on deep residual learning, semantic image segmentation, and the concept of atrous convolution [48]. A revised U-Net network structure named DeepUNet was explored for pixel-level sea-land segmentation with images collected from Google Earth and handicraft labeled ground truth images [49]. A similar U-Net architecture with residual units was employed for road area extraction with relatively high accuracy [50]. A Mask Region-Based CNN(R-CNN) model was applied to automatically mapping applications such as ice-wedge polygons [41,42] and archaeological sites [43] with high-resolution or VHR remote sensing imagery.
As such, there exist opportunities to employ DL-based models in burned-area detection, particularly in cases involving large multivariate datasets [25]. Recently, a DL approach achieved competitive results with low spatial resolution observations (0.01 • spatial resolution grid) for mapping and dating of burned areas [51]. Langford et al. [52] took a weight selection strategy to tackle the imbalanced classification in deep neural networks training for binary wildfire classification across Alaska with MODIS variables. Concerning the medium-and high-resolution satellite images, an implicit Radar Convolutional Burn Index was proposed based on multitemporal Sentinel-1 SAR data and InSAR technique for mapping the burned areas under a convolutional network-based classification framework [53]. Sentinel-1 SAR backscatter was further proved effective to detect burned areas with a CNN-based DL framework [54]. Bermudez et al. [55] used a conditional Generative Adversarial Network to synthesize missing remote sensed optical data from Sentinel-1 SAR data for a region with the presence of burned area. Recently, de Bem et al. [56] analyzed the performance of deep convolutional autoencoders (U-Net and ResUnet) using bitemporal image pair of the Landsat scenes and recommended the sampling window size of 256 by 256 pixels in DL model training.
Using uni-temporal Sentinel-2 imagery, Knopp et al. [57] proposed an automatic processing chain for burned area segmentation based on U-Net. It successfully reaches high overall accuracy, but lacks the test on transferability of the Sentinel-2 trained model to other sensor data. Due to the large proportion of coarse perimeters as reference data, the network in [57] attempts to create homogeneous burned area delineation with false positives therein. In addition, more DL-based models need to be involved in comprehensive evaluation along with other ML approaches in different landscapes. Considering the spectral consistency between the multispectral Sentinel-2 and Landsat-8 data [58], it implies the potential to make use of transferability and generalization of DL-based models to map burned areas with cross-sensor multispectral data. In our study, several semantic segmentation networks in different classes [37] are compared, including the widely used U-Net [59], region proposal-based CNN architectures like Fast-SCNN [60], increase the resolution of the feature-based DeepLabv3+ [61], and state-of-art enhancement of feature-based highresolution networks (HRNet) [62].
The overall objective of this study is to specifically explore the capabilities of DL-based models for mapping burned areas in various landscapes including the Mediterranean regions and the boreal forests in Sweden. The specific aims are to address the following limitations of the previous studies by using state-of-the-art DL techniques: (1) A poor generalization of spectral indices in heterogeneous regions.
(2) Additional pre-fire image acquisitions for bitemporal indices.  Figure 1 provides an overview of the training and testing study sites highlighting the different biome types. The Mediterranean forests, woodlands, and scrubs dominates the vegetation in Portugal, Spain (a) and Greece (d). In Figure 1a, four large wildfire events-two in Portugal (P1 in Leiria District and P4 in Castelo Branco) and two in Spain (P2 in Donana and P3 in Encinedo)-are selected as training sites for models (see Table 1 for details). Historically, the Mediterranean Basin is extremely vulnerable to wildfires, frequently happening in the summer period (between June and September). For instance, the Castelo Branco fire covers approximately 9646 ha near two villages with 1471 people affected among the total of 15,596 living nearby. In addition, one large Elephant Hill fire (P5 in Figure 1b) in BC, Canada, is added as a training site to the temperate conifer forests biome. Started on 6 July 2017, the Elephant Hill fire was the largest wildfire in BC during the record-breaking wildfire season in 2017, causing nearly 191,865 ha of land to be burned (see Table 1 for details). The other boreal conifer region of interest is central Sweden in Figure 1c. Due to the extreme and longlasting drought in forests and windy weather in summer 2018, Sweden suffered many large wildfires, with more than 25,000 ha burned and almost 3 million cubic meters of wood destroyed. One large fire in Enskogen is added for training (P6 in Figure 1c). Three fire events are selected for independent testing, including two small boreal forests fires in Fågelsjö-Lillåsen and Trängslet (T2 and T3 in Figure 1c) and one wildfire near Corinthia on the Peloponnesian peninsula in Greece (T1 in Figure 1d).  [63] in regions or countries such as Portugal, Spain, British Columbia in Canada, Sweden, and Greece. Fire events used for training of the burned area mapping methods are marked with points from P1 to P6, while events for independent testing are marked with points of T1-T3. These 3 testing study areas are displayed in false color composites of Sentinel-2 B12, B8A, and B11 bands at the bottom, respectively; burned areas appear in dark red. (T1) Study area in Corinthia, Greece; (T2) study area in Fågelsjö-Lillåsen, Sweden; and (T3) study area near Trängslet, Sweden.
Although Sentinel-2 MSI has several similar spectral wavelengths with Landsat 8 OLI, two Sentinel-2 satellites can provide higher temporal resolution together (5 days vs. 16 days) and higher spatial resolution (10/20 vs. 30 m) than Landsat-8 OLI. The multispectral scenes of Sentinel-2 MSI L1C TOA and Landsat-8 OLI TOA in this study were downloaded from GEE datasets (i.e., Image Collection "COPERNICUS/S2" and "LAND-SAT/LC08/C01/T1_TOA", respectively). The data collection process included date filtering, bounds filtering to the regions of interest, and subsets clipping. All images were resampled into 20 m using the nearest neighbor resampling method. The acquisition dates and image size of post-event Sentinel-2 and Landsat-8 images used in this study are listed in Table 1.

Reference Data
Copernicus Emergency Management Service (EMS) provides us with delineation products and grading products (https://emergency.copernicus.eu/mapping/list-of-activationsrapid, accessed on 9 April 2021) as precise annotation masks for corresponding training and testing images. These products have been used as reference data in burned area detection or burn severity estimation in previous studies [16,18,23,33,57,65,66]. Most EMS delineation or grading maps are derived from VHR post-fire images using WorldView-2 and/or SPOT6/7 with 1.5-2.0 m resolution under approximately 0% cloud coverage. For instance, the data source of post-fire images used for test sites is SPOT6/7 with 1.5 m ground sampling distance as presented in Table 2. Based on these VHR images, a semiautomatic strategy helps deliver the EMS thematic layer through visual interpretation (https://emergency.copernicus.eu/mapping/ems/detection-methods, accessed on 9 April 2021). This approach is common for the analysis of forest fires based on optical satellite data to identify and classify the burned areas in product delivery. Although we aim to map the extent of burned areas in this study, the burn severity provided by EMS grading maps also helps us analyze the characteristics of spatial heterogeneity of the fire scar.
The burned area masks of fire events in Spain and Portugal are directly derived from EMS products whose Activation ID numbers are EMSR207 (Leiria District), EMSR209 (Donana), EMSR227 (Encinedo), and EMSR372 (Castelo Branco), respectively. Moreover, the EMSR447, EMSR298_05, and EMSR298_03 products provide the reference data for the test sites in Corinthia, Fågelsjö-Lillåsen, and Trängslet, respectively. Differently, for the Elephant Hill and Enskogen fires, their dNBR images, calculated from cloud-free pre-fire and post-fire Sentinel-2 images, were empirically thresholded to elaborate the precise ground truth mask within the official perimeters from the Copernicus EMS (EMSR298_01) and BC Wildfire Service (K20637) [67] as de Bem et al. [56] did. Furthermore, we manually refined all the burned area annotations based on visual analysis of VHR post-event optical images (i.e., Google Earth Map). Some reference dates in Table 1 are later than those of the post-fire image acquisition dates (e.g., one month for the Spanish fire and 7 months for the Canadian fire). For Sweden, the EMS reference dates are earlier than post-fire image acquisition dates (e.g., 2 months for Enskogen, 1 month for Fågelsjö-Lillåsen, and 2 months for Trängslet). One reason for this is the different data sources of the reference images and post-fire images. The acquisition of Sentinel-2 or Landsat-8 testing images depends on the cloud coverage, while the delivery of reference products mainly relies on the available VHR images and interpretation time. On the other hand, the reference data acquisition would be earlier or later depending on the official emergency applications (rapid activation or grading severity mapping). The Elephant Hill Fire has the official perimeters after the fire event but it would not affect the quality of burned area perimeters due to the slow regrowth of boreal forests. The time gap between multispectral images and reference data has little influence on the detection of burned area extent in this study, especially for boreal forest fires in Sweden and Canada. However, it would be important to use the images with as little time gap as possible to ensure the reliability of the burn severity accuracy assessment.

Test Sites Characteristics
In EMS reports, land use is specified in the grading maps, which refers to the official Copernicus EMS definition (https://emergency.copernicus.eu/mapping/ems/domains, accessed on 9 April 2021) according to Corine Land Cover (CLC) [68]. The first subtable in Table 2 reports the transportation affected by fire in length (km) including primary road, secondary road, local road and cart track, the number of inhabitants (population) affected by fire, and the digital elevation range (meter). The second subtable reports the land use affected by the fire in ha, for residential/industrial areas (residential buildings, non-residential farm buildings, and other buildings not elsewhere classified), forests, heterogeneous agricultural areas, permanent croplands, shrubs or herbaceous vegetation areas, and inland wetlands areas.
The Corinthia fire consists of the various heterogeneous landscapes with crops, agricultural areas, forests, and shrubs, while the two Swedish fire sites mainly consist of forests, shrubs, and wetlands. Except for the land use, other artificial impacts and locations of residence are different in these test sites. For example, the Corinthia fire affected more than 1500 people nearby and damaged 14.3 ha of residential or industrial areas, while the Swedish fire did no damage to the residence. Interestingly, the fire in Corinthia has various elevations and hill shade which might result in fluctuating mountain slope. In contrast, Swedish fires are located in relatively flat regions with rolling hills. From the EMSR447 grading map of the Corinthia fire, over 98.9% burned areas are destroyed or damaged, while the Fågelsjö-Lillåsen fire has over 10% burned areas (395.5 ha) that are possibly damaged (i.e., with low burn severity).

Spectral Feature Selection
Spectral bands play different roles in burned area discrimination. A previous study concluded that Sentinel-2 NIR (B8 and B8A), red-edge (B5-B7), and SWIR bands (B11 and B12) are most sensitive to the change in spectral reflectance caused by fire [17]. Fire scar would cause a decrease in NIR and might result in an increase in SWIRs depending on the ecosystem. Therefore, SWIR and NIR bands are widely used for index like NBR and NBR2 in wildfire science [69]. For Sentinel-2 MSI channels, B8A at 20 m resolution is a more suitable NIR band for vegetation monitoring applications than B8 due to a narrower spectral width [70]. Furthermore, Sentinel-2 MSI B8A rather than B8 has similar characteristics to Landsat-8 OLI NIR band (B5) [71]. To facilitate the potential transferring to Landsat-8 data, the combination of bands B8A, B11, and B12 becomes the most suitable input channels for model training.
To further comprehensively support our assumption, the feature selection approaches including AdaBoost [72] and Light Gradient Boosting Machine (LightGBM) [73] were performed on the Castelo Branco Fire dataset for ranking the spectral feature importance (cf. Appendix A). Three out of 10 bands of Sentinel-2 stood out, as expected, namely, B12, B11, and B8A, as the three most important features on the burned area target. Knopp et al. [57] assessed the accuracy for different band combinations of Sentinel-2 data as input for U-Net to support the channel selection. It was demonstrated that only blue, green, and red channels as input data result in worse outcomes. If B8 and two SWIRs bands are additionally involved, the kappa coefficient of the burned area mapping would increase from 0.75 to 0.90 in comparison with three visible channels as input.
Although Sentinel-2 imagery contains 10 bands, supporting the design of distinct architectures, our DL models take as input a 3-channel (in a color composite: B12, B11, and B8A) image patch, where we replace the B8 in [57] with B8A as one of input channels. One benefit of choosing the most representative three bands is that more channel input will increase the computing operations dramatically and limit the future application in some DL models where only three channels are restricted. A three-channel input can be visualized easily and saved as standard image format (e.g., PNG or JPEG) for the webbased application of graphical user interface. The other advantage is that these three bands match the corresponding B7, B6, and B5 of Landsat 8 imagery in the spectrum with high transferability, allowing further cross-sensor application. Abundant Sentinel-2 data can be used to train the DL-based model which can be transferred directly to Landsat-8 data so that we do not need to train a separate model for the Landsat-8 image. Figure 2 provides an overview of the experimental design of this research. The feature maps extracted from representative layers are visualized in a proper way for each DL model. To evaluate the performance of DL-based models in burned area mapping, several typical supervised ML methods and traditional NBR-based thresholding approaches are carried out as comparisons. These comparison experiments contribute to the selection of the method suitable for detection of burned areas for specific landscapes.

Methods
As shown in Figure 3, a fully automatic workflow for burned area mapping using uni-temporal Sentinel-2 and Landsat-8 multispectral data is demonstrated in comparison. Training data from Sentinel-2 are used to train the DL model (U-Net, HRNet, Fast-SCNN, and DeepLabv3+) after data augmentation. Trained DL models are then used to inference the testing data in an end-to-end processing scheme. On the other hand, feature vectors of spectral characteristics are sorted out to train the ML estimators (i.e., LightGBM, RF, and k-Nearest Neighbors (KNN)) separately to get the predictive models for future estimation.

Threshold-Based Approaches
NBR has been widely used in fire-related research in a uni-temporal way as in Equation (1): where N IR and SW IR are the reflectance of B8A and B12 for Sentinel-2 data (in accordance with B5 and B7 for Landsat-8 data), respectively. One drawback using NBR-based methods is that the common false positives like water body have to be removed manually. The further bitemporal difference (i.e., dNBR) can be computed to highlight the burned areas as Equation (2).
where pre and post denote the pre-fire and post-fire images, respectively. A dNBR index between −0.1 and 0.1 indicates that few changes over time interval are accounted for [74].
If there exists fire disturbance between the pre and post images, dNBR might have a bimodal distribution which usually needs a further thresholding process (in an empirical or automatic way) to filter out burned areas pixels. The empirical thresholding approach is used to generate reference data for training images of Elephant Hill and Enskogen as described in Section 3.2.2. Nobuyuki Otsu [22] proposed a classical threshold selection method for the gray image in an automatic way to define the threshold in a bimodal distribution of NBR, which has been used for burned area mapping [75,76]. In this study, the performance of the OTSU method is compared with ML-based and DL-based methods. Specifically, the pixel values less than the NBR thresholds (empirical or OTSU-based) are clustered as burned area pixels.

ML-Based Approaches
To select suitable ML methods, the commonly used ML algorithms were compared for the classification of burned areas based on the PyCaret implementation [77] with standard parameters; 100,000 pixel data sampled from training imageries with feature vectors of B12, B11, and B8A were fed as estimator inputs. The ratio between the pixel amounts of the burned areas and unburned areas is around two-thirds in the whole training data. Seventy percent of samples were used for training based on a 10-fold cross-validation approach [78], while 30% were used for validation to assess the performance of the various ML methods. The comprehensive results can be found in Table A1. LightGBM, KNN, and RF performed better than others, thus were selected (Note: the top two methods are both boosting-based approaches so the first one is selected.).
The LightGBM is a new gradient boosting framework that employs tree-based learning algorithms. Different from traditional boosting tools applied to fire-related work (e.g., XGBoost [79], AdaBoost [32], and LogitBoost [32]) that use pre-sort-based algorithms, LightGBM is a histogram-based algorithm, which buckets continuous feature (attribute) values into discrete bins. This speeds up training and reduces memory usage. We consider the Gradient Boosting Decision Tree boosting method with 170 iterations, 60 leaves, and 0.1 as a learning rate.
KNN is a data classification algorithm based on the premise that similar data exist in close proximity to each other according to some metric [80]. It has been used for burned area mapping in France [81], fire occurrence prediction [82], and wildfire damage assessment [83]. The parameters used in the experiments are 49 neighbors, Minkowski metric, and uniform weight.
RF is one of the most popular classifiers within the remote sensing community, vigorously handling dimensionality and multicollinearity in high data [84,85]. Ramo and Chuvieco [86] developed a global burned area mapping algorithm with MODIS data based on the RF classifier. Ramo et al. [87] evaluated the ability of four algorithms, including RF, SVM, Neural Networks, and a decision tree algorithm, for classifying burned areas at a global scale using MODIS data. It was shown that RF offered the best performance. The RF method consists of an ensemble of individual decision trees and is more robust than single classifiers. Each tree employs a set of decision rules to spit out a class prediction. The ensemble votes are used to determine the final model's prediction. A larger number of trees can help the generalization error converge [88]. In this study, we consider the maximum depth as 5, 150 trees in the forest, and 5 as the minimum number of instances in leaves.

DL-Based Approaches
Semantic segmentation can be categorized into several classes according to the common concept that underlays their architectures [37]. The architecture would usually determine the network's performance. In general, combining the low-resolution feature with larger receptive fields and the high-resolution feature with smaller receptive fields can help extract contextual information. To compare the ability of different DL architectures for burned area mapping, we evaluate four typical models among popular categories: HRNet, U-Net, Fast-SCNN, and DeepLabv3+.

U-Net
Upsampling/deconvolution-based methods are dominant approaches to extract feature maps based on stacked convolutional layers, ReLu layers, and pooling layers [37]. The U-Net connects low-level details and high-level information, achieving better performance than classical ML classification methods [49,59]. Recent papers [56,57] mostly employed U-Net series architecture for burned area mapping with multispectral imagery. The U-Net structure applied here is shown in Figure 4, which is adapted from [59]. The encoder part could extract the features at different scales using five convolutional blocks from input data. In each convolutional block, there exist two 3 × 3 convolutions with ReLu activation layers. The output is followed by batch normalization and a max-pooling operation to downsample the feature maps. Therefore, the size of feature maps is divided by 4 after the convolutional block, while the number of feature channels will be doubled. Through the other five blocks in the decoder part, these feature maps are upsampled to the input size. Each block has 2 × 2 transpose convolution, a concatenation of feature maps from the corresponding encoder part, and two 3 × 3 convolutions with ReLu activation and batch normalization, with the feature maps in A-D after bilinear interpolation (bilinear_interp), respectively. The final layer additionally has a 1 × 1 convolution to calculate the probability for burned area prediction.

HRNet
As a typical enhancement of feature-based methods, state-of-the-art HRNet maintains stronger high-resolution representations through high-to-low resolution parallel convolu-tions [62]. According to the semantic segmentation results on PASCAL-context [89] and LIP [90], HRNet outperforms DeepLabv3+ and U-Net++ with lighter computation cost and fewer parameters [62].
There are four stages in HRNet to connect the high-to-low resolution convolutions in parallel, whose final Stage 4 is mainly illustrated in Figure 5. It comprises a horizontal multi-resolution parallel convolution and a crossing multi-resolution fusion. The multiresolution parallel convolution is adapted from the group convolution that conducts a regular convolution over a subset clipped from input channels in various spatial resolutions separately. The multi-resolution fusion resembles the fully connected multi-branch convolutions. The input and output channels are both divided into several subsets along with the stages deepening. The four output representations are mixed from four resolutions using a 1 × 1 convolution followed by a classifier with softmax loss to obtain the segmentation maps. Furthermore, these maps are upsampled into the size same as the original input image using bilinear interpolation. Four outputs representations from low to high resolutions are named bilinear_interp_33, bilinear_interp_32, bilinear_interp_31, and relu_152, respectively.

Fast-SCNN
In addition, one of the feature encoder-based methods is Fast-SCNN [60], a realtime semantic segmentation model, which extracts multiple low-level resolution features simultaneously, making it suitable on embedded devices offline. As Figure 6 demonstrated, Fast-SCNN has four parts (details can be found in [60]): (1) A learning to downsample module with a standard convolutional layer (Conv2D) and two depthwise separable convolutional layers (DSConv). The output of feature maps is named relu_4 in A. (2) A coarse global feature extractor to capture the contextual information for segmentation using a bottleneck block that employs the depthwise separable convolution. It can reduce the number of parameters to train floating-point operations. The end of the extractor is a pyramid pooling module that aims to aggregate the differentregion-based context. The context information from the extractor part is given in relu_6 before the fusion operation. (3) A feature fusion module with simple addition of the features with high-level and low-level representations (relu_7 in C). (4) A standard classifier consists of two DSConv (relu_11 in D), one Conv2D to boost the accuracy, and a softmax layer to get the segmentation results.

DeepLabv3+
The last one is the DeepLabv3+ model, an increased resolution feature-based method [61]. A fully connected Conditional Random Field on the final layer of DeepLabv3+ improved localization performance both quantitatively and qualitatively. We mainly introduce the DeepLabv3+ framework in Figure 7 (more details can be referred to in [61]). Multi-scale contextual information is extracted by atrous convolution at an arbitrary resolution in the encoder module. In the decoder module, first, the features from the encoder atrous convolution (relu_8 in A) are bilinearly upsampled by 4 and then concatenated with low-level features (relu_9 in B) from the backbone with the same resolution after the 1×1 convolution to obtain the multi-level feature fusion (relu_10 in C). Further, a few 3 × 3 convolutions are applied to refine the features (relu_12 in D). They are upsampled by a factor of 4 in a simple bilinear way.

Data Augmentation
Training images were cropped into patch tiles of 256 × 256 pixels randomly to raise the volume of this dataset and to reduce classification problems around edges. There is a total of 1837 training patches and 197 validation patches. Data augmentation was employed to enhance the data set and avoid overfitting as listed in Table 3. We adopted step-scaling resize between 0.7 and 1.2 with step 0.1; flip operation with a probability of 0.5; mirror operation with a probability of 0.5; and additional operations including rotation, area crop, aspect, and color jitter (brightness, saturation, and contrast with a probability of 0.2) on the input images. These augmented images were standardly normalized between 0 and 1 by removing the mean and scaling to unit variance. ASPECT min = 0.2 A random aspect ratio of the original aspect ratio.
COLOR JITTER brightness jitter ratio = 0.5; saturation jitter ratio = 0.5; contrast jitter ratio=0.5 Randomly change the brightness, contrast, saturation of an image with a given probability.

Accuracy Assessment
For model validation and accuracy assessment, overall accuracy (OA) and the mean intersection over union (mIoU) of two classes (unburned pixels and burned pixels) are used to measure the comprehensive performance of various methods, see Equations (3) and (4) (fp = false positive, tp = true positive, tn = true negative, fn = false negative in confusion matrix). The Cohens' kappa coefficient [91], commission errors (Ce), and omission errors (Oe) are used to assess the performance of testing process, see Equations (5) and (6).
To compensate for the shortcomings of kappa in map comparison, allocation disagreement (AD) and quantity disagreement (QD) [92] are also evaluated in this study. AD denotes the proportion of the disagreement (i.e., error) associated with pixels in the wrong spatial location, as Equation (5) in [92]. QD means the proportion of the disagreement associated with the amount classified, presented by Equation (3) in [92]. Overall, the sum of OA, AD, and QD should be equal to 1. The way to interpret these two indexes is that if AD is high but QD is low, then this reveals that the area classified is correct but the locations classified are incorrect; conversely, if AD is low but QD is high, then this represents that the locations classified are correct but the amounts of pixels classified are incorrect.

Results
The capabilities of DL-based models for burned area mapping with Sentinel-2 and Landsat-8 data are presented and analyzed in the following sections, subject to the DL network evaluation and quantitative assessment in comparison with ML-based models and threshold-based approaches.

Test Results and Analysis
DL models were trained on an Nvidia Tesla V100 GPU using the Adam optimizer [93] and a batch size of eight patches. The training speed is illustrated in Figure 8a. The HRNet has an average lower speed (5.8 step/s) with a total of 4 h and 48 min in the training of 400 epochs, but other DL models take approximately 2 h and 30 min. All models use a polynomial decay learning rate policy with a gradual warmup strategy [94] based on the PaddlePaddle backend DL framework (https://github.com/PaddlePaddle/Paddle, accessed on 9 April 2021). The initial learning rate (0.000005) increases to the final learning rate (0.001) after 2000 warmup steps (one of the super parameters) and then decreases to the initial learning rate gradually when the training process ends with a total of 91,600 steps. The loss is a summation of the errors produced by each batch in training or validation sets, which indicates how properly or badly a trained model performs after each iteration of optimization. A weighted sparse softmax cross-entropy loss function [95] is employed to track the performance in the training stage. It particularly picks the weights according to the current labels and applies them as batch weights, and then combines the calculation of the softmax operation and the cross-entropy loss function [96] to provide a more numerically stable gradient. The respective curves with motion average of the training loss are presented in Figure 8b. It can be observed that the training loss stably decreases with an increase in training steps. All networks show convergence towards zero with some minimal jitter between 0.01 and 0.05. U-Net reaches a low loss value much faster than the other models but seems to cause quick overfitting, whereas the HRNet seems to converge reasonably.
We evaluated the segmentation accuracy of DL models on the test dataset in every 50 epochs. The accuracy curves are shown in Figure 8c,d, where HRNet in green obtains the highest mIoU after the training of 200 epochs rapidly, and the accuracy of U-Net increases gradually and peaks in 350th epochs. The other two models (DeepLabv3+ and Fast-SCNN) represent a fluctuation in curves during training, and they achieve a high value in the 250th epoch. With an increase in the number of training epochs, the accuracy of the DL network decreases after some epochs and even the training loss continues decreasing in Figure 8b, and therefore we selected the trained models (parameters and weights) with the best accuracy rather than the final model to avoid overfitting.

Feature Analysis
The test images are input into the trained DL models with an arbitrary size, and then features extracted from a few representative layers can be visualized. The visualization results with Sentinel-2 and Landsat-8 images as input for each test fire event are listed in Figure 9. Four outputs of feature maps are marked with A-D in each DL architecture in Section 4.3. The features selected in each DL model can reflect different levels of semantic information from low to high. The deep convolution layers during the encoder branch gradually extract the low-resolution representations (or high-level features) such as contour features and hotspot distribution caused by burned areas. The decoder branch with upsampling subnetwork aims to recover high-resolution representations (i.e., precise segmentation) such as burned confidence, accurate burned delineation, and unburned areas around burned areas. Note that not all the features can be visualized well or can be understood intuitively, so here we select the most important feature in some typical layers (upsampling layer or ReLu activation layer) in the feature visualization results, which are normal phenomena in the DL field.
Regarding the U-Net in Figure 9, the first bilinear_interp_0 is the name of the feature outputs from the bilinear upsampling layer after the convolution layer with 512 filters at the end of the encoder branch. It perceives a high-level representation that contains general characteristics regarding the semantic information of the input image. Therefore, we observed the heatmap from one typical feature matrix that implies the general distribution of burned areas. Similarly, the bilinear_interp_1 of the decoder branch fuses the low-level spatial features from the encoder branch, and one of its feature maps showed the contour features (i.e., fire delineation) of burned areas for each fire event. One feature channel of the further bilinear_interp_2 indicates the inner burned areas independently. Finally, the last one of the upsampling layer (i.e., bilinear_interp_3) provides us with reliable visualization results close to the burned area segmentation, even reflecting the burned severity to some extent that needs to be investigated in the future study.
HRNet contains multi-resolution group convolution with four outputs as presented in Figure 5. We visualized these four representations from low resolution to high resolution in Figure 9. The bilinear_interp_33 denotes the output features of low-resolution subsets. It learns the low-resolution representations successfully with the raw spatial distribution of possible burned areas in Figure 9. Furthermore, one of the feature maps in the bilinear_interp_32 output shows an obvious highlight in the unburned regions, which conversely facilitate the unburned area segmentation. This group convolution could capture mid-level representations. In parallel, high-resolution representations are maintained through the whole process (i.e., bilinear_interp_31 and relu_152). The repeatedly conducting multi-scale fusions across parallel convolutions connects high-to-low resolution convolutions. Therefore, it can be observed that one feature of relu_152 can capture burned areas with more high-resolution details, compared to one of the feature maps from the bilinear_interp_31 outputs. Fast-SCNN learns the features through downsampling, and its output in relu_4 preserves some semantic information but lacks the contextual relationship, due to the missing of high-level features with large receptive fields. Then, the global feature extractors produce the feature maps in the ReLu activation layer (i.e., relu_6). However, it fails to learn the high-level features, as Figure 9 showed. Therefore, the feature maps in relu_7 have similar representations to relu_4. As a real-time semantic segmentation model, Fast-SCNN is more suitable for embedded devices with low memory and power. DeepLabv3+ shows better visualization results in low-level feature extraction from the relu_9 activation layer than that of relu_4 in the Fast-SCNN model. In addition, its high-level features in the relu_8 passing ASPP module also look more reasonable than the relu_6 of Fast-SCNN. Therefore, the combination of low-level and high-level features produces contextual information. On the other hand, one of the features in the final ReLu activation layer (relu_12) represents obvious burned areas but with some overestimation.
Overall, HRNet shows more promising visualization results, while U-Net looks at overfitting in the training process taking the feature maps of from bilinear_interp_2 output as an example. Fast-SCNN performs the worst among the four DL models due to poor representations from low resolution to high resolution. DeepLabv3+ has reasonable results, but they are not so precise to some extent. Table 4 summarizes a quantitative comparison of different methods for burned area detection in Corinthia area (Mediterranean forests). DL algorithms (except for Fast-SCNN) generally outperform ML-based models. U-Net results in significantly better metrics than LightGBM: mIoU is 0.04 higher and the kappa coefficient is 0.06 higher. Table 4. Testing accuracies of burned area detection for the Corinthia fire with Sentinel-2 data. The lowest omission errors (Oe), commission errors (Ce), alllocation disagreement (AD), and quantity disagreement (QD) and the highest mIoU and Kappa in each model group are highlighted in bold.

Model
Oe ( HRNet and U-Net display similar results, with higher kappa and mIoU values than Fast-SCNN and DeepLabv3+. On the other hand, the U-Net model provides a good balance between AD and QD, indicating that it classifies large amounts of pixels and locations correctly. The overall performance of U-Net shows high agreement with recent literature in [57], whose U-Net could reach kappa of 0.86 for mapping Mediterranean forest fires in Pantelleria, Italy, with uni-temporal Sentinel-2 data, but resulted in a poor balance between commission errors (23%) and omission errors (3%).
U-Net reconstructs more accurate delineations around the mountain valleys as it learned the features about hillshade in convolution feature maps as Figure 9 showed. The HRNet model produces lower omission errors (false negative) but higher commission errors (false positive) than U-Net, as presented in Figure 10. The highest omission errors observed by using the Fast-SCNN are mostly related to the underestimation of burned areas in complex terrain surfaces in the northern valleys. The pixel-based ML classification methods seem to systematically underestimate the burned area within the fire perimeter as Figure 10 presented, resulting in higher omission error (see Table 4). These phenomena are in agreement with the classification results with uni-temporal Landsat TM imagery on three Mediterranean test sites [28]. Within the three ML methods in our study, LigthGBM has higher mIoU than RF and KNN. The low kappa is mainly related to high omission errors in the northern regions. Moreover, several false positives outside of the perimeters are mainly related to the croplands, as they show very similar spectral behavior to burned areas. Automated threshold-based methods (NBR otsu ) yield lower accuracy than all DL and ML methods with huge omission errors due to relatively strict thresholds being applied. In general, NBR-based methods result in false negatives and false positives that occur in the burned class and unburned class, respectively (see Figure 10). Table 5 demonstrates a quantitative algorithm comparison for burned area detection in Fågelsjö-Lillåsen and Trängslet with Boreal conifer trees. Each DL algorithm significantly outperforms the ML methods and NBR-based approaches for the Fågelsjö-Lillåsen fire. Interestingly, Fast-SCNN results in the best metrics: mIoU is 0.81 and the kappa value is 0.79 as it can highlight most low burned severity areas (covered over 10% burned areas from EMSR 298_05 grading map), which are greatly underestimated by other DL models and ML methods as shown in Figure 10. Fast-SCNN also keeps the lowest omission errors but the largest commission errors among all methods due to the compact consistency of the burned area delineation in mapping burned areas. ML methods would misclassify the shoreside soils as burned areas due to their similar spectral behaviors to burned areas. Different from the Fågelsjö-Lillåsen fire, the Trängslet fire caused the disperse heterogeneous burned areas due to several wetlands within the study areas rather than homogeneous land cover. DL models tend to consider the connectivity between different burned area patches based on the convolutional feature extraction in low-resolution. Regarding the Trängslet fire, the dispersed burned areas mislead the general contextual information extraction of DL models. A similar conclusion also applies to the work in [57]. Most of the commission errors are located over the unburned parches close to the perimeters that are related with the wetland and bare soils in Figure 11. Importantly, all ML methods surpass all DL methods greatly with a kappa value of approximately 0.88. Every ML method represents similar visual results in Figure 10, which can inevitably misclassify the burned areas in the subset with wetland and bare soils. The empirical NBR-based thresholding method can reach a high accuracy in detecting Trängslet fire (kappa over 0.86).

Transferring Phase with Landsat-8 Data
DL and ML models trained with Sentinel-2 data are then transferred to the corresponding Landsat-8 images on the same test sites so that we could assess the prediction performance and similarity with identical references. The performance evaluation with Landsat-8 data in Table 6 keeps the same tendency as Sentinel-2 data. DL models outperform other methods in Corinthia and Fågelsjö-Lillåsen fires, whereas ML methods (i.e., LightGBM) perform best in the Trängslet fire. The accuracy of HRNet dominates all DL models in the three test sites with Landsat-8 data, rather than U-Net and Fast-SCNN in Tables 4 and 5, respectively. DL models indicate good generalization ability in cross-sensor satellite images with acceptable accuracy as listed in Table 6. The performance of ML models seems to be unstable. Interestingly, the segmentation accuracies of DL models (except Fast-SCNN) are increased at the Trängslet site. On the other hand, the classification accuracies of ML methods decrease greatly in Corinthia and Fågelsjö-Lillåsen, even though they still perform well in the Trängslet fire with kappa over 0.85. In detail, kappa values decrease by 0.04 (HRNet) and 0.09 (LightGBM) in comparison to Table 4 for the Corinthia fire. Regarding the Fågelsjö-Lillåsen fire, it only sees a slight decrease of 0.02 in kappa values using the Fast-SCNN model while RF drops from 0.56 to 0.48. Finally, for the Trängslet fire, LightGBM has a small decrease of 0.03 in kappa, while HRNet conversely shows a little improvement in kappa from 0.82 to 0.83. The reason might be that the commission errors caused by the Landsat-8 data are lower than the Sentinel-2 data.
Compared to other methods for mapping burned areas from Landsat-8, our results for the Corinthia fire reached a comparative accuracy with the results in [97], which used a two-phase algorithm (i.e., spectral-temporal rule and region-growing) to balance omission and commission errors (11.1% and 16.5%, respectively) with a kappa value of 0.85 for a wildfire in Portugal. HRNet in our study also reaches a kappa of approximately 0.85 but much lower commission errors (4.08%).
From the graphical results in Figure A2, DL models show more robust segmentation on burned areas with accurate delineation for the large burned patch in the Corinthia and Fågelsjö-Lillåsen fires. Due to the abundant lightly burned areas, ML methods fail to highlight the burned areas in Fågelsjö-Lillåsen, while DL models like HRNet and Fast-SCNN could still depict the perimeters of burned areas. On the other hand, disperse burned patches in Trängslet also affect the performance of DL with high commission errors due to their similar contextual connectivity in the spatial domain.

Discussion
Test results have shown that DL models achieve acceptable performance in mapping burned areas. They make full use of the background-foreground context in single-date imagery and capture the object features in multiple scales, making discrimination of the burned area possible. DL models can successfully keep a strong feature representation within neighboring burned pixels. Overall, DL models show good results on the compact Mediterranean mountain forests and Swedish boreal forests, while ML methods can obtain more accurate results in the dispersed Swedish site.
Although U-Net and Fast-SCNN perform best with Sentinel-2 data in the testing sites of the Corinthia fire and Fågelsjö-Lillåsen fire, respectively, HRNet stands out with higher mIoU and kappa than other methods when the testing data are from Landsat-8. In other words, all DL models need to be treated with some caution as their selection may depend on the data source, what landscapes and terrain the study areas are located in, and what kinds of biomass was burned.
ML methods employ pixel-wise classification that might lack spatial contextual information. They would produce some commission errors outside perimeters and huge omission errors within delineations, leading to the lack of balance between the commission and omission errors. Using a uni-temporal image, NBR can reach good performance in some specific sites (e.g., Trängslet fire) but is not recommended to map burned areas on a large scale with various landscapes. ML methods show similar results, and their performance is deeply affected by variance imposed by local study sites rather than the selection of methods as reported in [33]. Moreover, the parameters of ML methods in this study can be further optimized to improve the accuracy to some extent.
On the other hand, the bitemporal index, namely, dNBR, requires the pre-fire cloudfree image; thus, it was expected to perform better than NBR but was somewhat unexpected to be worse if using OTSU-based automated method, taking the fire in Corinthia as an example (NBR otsu in Table 4 vs. dNBR otsu in Table 7). That might be caused by the sub-optimal threshold computed using the OTSU approach (see the Figure 12c). Not surprisingly, empirically thresholding dNBR in Table 7 can obtain a similar kappa value (0.90) to the U-Net model in Table 4. These comparative experiments might imply that some DL models based on one uni-temporal post-event image could reach the same high accuracy as bitemporal index. However, we aim to avoid the usage of bitemporal images and manual assistance, as the advantages of automated algorithms are easy to deploy in practice, widely used, and without any human intervention.
The pretrained DL models on Sentinel-2 were successfully transferred to map the burned area on Landsat-8 data with acceptable accuracy. We relate this observation to the fact that DL network (e.g., HRNet) has a high level of generalization on multi-scale representative features to keep consistency between Sentinel-2 and Landsat-8 data. By fusing available multispectral data, DL models show promise for use in future NRT applications.  Figure 12. Burned area mapping results of Corinthia using the threshold-based methods on dNBR with Sentinel-2 data. Two thresholds are determined by human intervention and OTSU automated approaches, respectively, as shown in the (c) histogram map. (a) The NBR map to threshold the burned area of the Corinthia fire in Figure 10. (b) dNBR map that is binarized to get the burned areas of dNBR em (d) and dNBR otsu (e) by empirical and automatic approaches, respectively.

Conclusions
In this study, a series of experiments is conducted to evaluate the capabilities of several DL models for mapping burned areas with uni-temporal Sentinel-2 and Landsat-8 images. These DL approaches achieve acceptable performance in comparison to ML algorithms and NBR-based methods, especially when the burned areas are compact. It can be observed that the DL network model tends to increase the mapping accuracy and thematic consistency of the final burned area delineation due to the fusion of multi-scale features rather than a pixel-based classification. This research highlights that DL models have several advantages: (i) automated mapping with high overall accuracy without the need for a pre-fire cloud-free image or the use of fixed or empirical thresholds and (ii) cross-sensor ability to detect the burned area in the various biomes.
From an operational viewpoint, although the present results show a very promising potential using DL models for burned area mapping, surely further study is needed towards the extension of creating larger data sets in more diverse fire-disturbed regions around the globe. An important direction of future work would be a specific investigation of the hybrid approaches to fuse DL and ML methods to improve the accuracy. Furthermore, fusion of optical and SAR imagery could be explored to improve burned area mapping.   Figure A2. Burned area detection results with Landsat-8 data.