Performance Analysis of Deep Convolutional Autoencoders with Different Patch Sizes for Change Detection from Burnt Areas

Fire is one of the primary sources of damages to natural environments globally. Estimates show that approximately 4 million km2 of land burns yearly. Studies have shown that such estimates often underestimate the real extent of burnt land, which highlights the need to find better, state-of-the-art methods to detect and classify these areas. This study aimed to analyze the use of deep convolutional Autoencoders in the classification of burnt areas, considering different sample patch sizes. A simple Autoencoder and the U-Net and ResUnet architectures were evaluated. We collected Landsat 8 OLI+ data from three scenes in four consecutive dates to detect the changes specifically in the form of burnt land. The data were sampled according to four different sampling strategies to evaluate possible performance changes related to sampling window sizes. The training stage used two scenes, while the validation stage used the remaining scene. The ground truth change mask was created using the Normalized Burn Ratio (NBR) spectral index through a thresholding approach. The classifications were evaluated according to the F1 index, Kappa index, and mean Intersection over Union (mIoU) value. Results have shown that the U-Net and ResUnet architectures offered the best classifications with average F1, Kappa, and mIoU values of approximately 0.96, representing excellent classification results. We have also verified that a sampling window size of 256 by 256 pixels offered the best results.


Introduction
Deep Learning (DL) is the term that refers to the use of multilayered neural networks to solve complex problems. It is one of the fastest-growing trends in machine learning, data science, and computer vision. The DL field has become more accessible in the latest years due to improvements in the understanding of machine-learning theory, coupled with the increased processing power from the better consumer-grade computer hardware. Within the full range of deep network architectures, Convolutional Neural Networks (CNN) gained the focus in recent DL studies. CNNs are multilayer neural networks capable of identifying patterns in data, both spatial and spectrally wise, and using these learned patterns for inference to classify data. Studies have shown they possess the ability to solve a wide variety of problems, such as text classification [1], image recognition [2], video analysis [3], and speech recognition [4]. different window sizes (5, 10, 15, 20, 25, 30, 35, and 40) and determined the dimension of 30 × 30 pixels as the ideal size. Another study evaluated different window sizes (60, 80, 100, 120, 140, 160, 180, and 200) to locate cars in unmanned aerial vehicle (UAV) images [59], concluding that a patch size of 160 × 160 pixels provided the best total accuracy. Additionally, studies show that some degree of overlap between windows is beneficial to the classification as it reduces the loss of contextual information along image patch borders [60].
This study aimed to investigate the use of DL algorithms to map burnt area changes within the Cerrado region to provide an accurate automated classification method. This research evaluated three CNN models based on the concept of the Autoencoder architecture: (a) the basic Autoencoder, (b) the U-Net, and (c) the ResUnet architectures, which propose improvements over the basic Autoencoder. Furthermore, we tested four sampling strategies to find optimal sampling window sizes for this specific classification task. In the following sections, we describe the study area; our dataset structure, how the models were built, how they were evaluated and lastly, we present the results found and a brief discussion over them.

Landsat Data
Our training and testing datasets were created by collecting Tier 1 atmospherically corrected reflectance data detected by the Landsat 8 OLI+ sensors and pre-processed by the United States Geological Survey (USGS) agency. This study used bands 2 to 7, which offer the majority of the spectral information relevant to the detection of burnt lands for the same 30-meter spatial resolution. The training used the Landsat scenes (path-row) 221-71 and 221-70 (sites A and B), while the validation used scene 221-69 (site C) ( Figure 1). The areas fit into the Cerrado biome and offer detection dates on the same day. The overlapping region between scenes B and C was excluded from scene B to avoid sharing data between training and validation.
To detect changes, we selected four different dates in August and September 2017 (August 9 and 25, September 10 and 26). The choice of date was based on the more significant occurrence of fires in the region during the end of the dry season [37]. Additionally, these dates offered the least amount of cloud cover throughout the year.  To detect changes, we selected four different dates in August and September 2017 (August 9 and 25, September 10 and 26). The choice of date was based on the more significant occurrence of fires in the region during the end of the dry season [37]. Additionally, these dates offered the least amount of cloud cover throughout the year.

Burnt Area Change Mask
The elaboration of the ground truth mask used the Normalized Burn Ratio (NBR) spectral index (Equation (1)), which has been extensively used in research to highlight burnt areas and assess burn severity [38,61,62].
where NIR and SWIR are the near and shortwave infrared bands, respectively. The NBR temporal difference (∆NBR) can then be calculated to further highlight the burnt areas (Equation (2)).
where T1 and T2 are the pre-fire and post-fire images, respectively. Specific ∆NBR threshold values allow assessing the severity of the burn. In this study, we classified pixels with ∆NBR values above 0.1 as burnt areas, regardless of the severity ( Figure 2). Common false positives such as bodies of water and shadows, which are often also highlighted by this approach, were manually removed from the masks to guarantee that only burnt areas were present. This approach only detected new-burnt areas between two consecutive images without accounting for the accumulated burnt area.

Burnt Area Change Mask
The elaboration of the ground truth mask used the Normalized Burn Ratio (NBR) spectral index (Equation (1)), which has been extensively used in research to highlight burnt areas and assess burn severity [38,61,62].
where NIR and SWIR are the near and shortwave infrared bands, respectively. The NBR temporal difference (∆NBR) can then be calculated to further highlight the burnt areas (Equation (2)).
where T1 and T2 are the pre-fire and post-fire images, respectively. Specific ∆NBR threshold values allow assessing the severity of the burn. In this study, we classified pixels with ∆NBR values above 0.1 as burnt areas, regardless of the severity ( Figure 2). Common false positives such as bodies of water and shadows, which are often also highlighted by this approach, were manually removed from the masks to guarantee that only burnt areas were present. This approach only detected new-burnt areas between two consecutive images without accounting for the accumulated burnt area.

Data Structure
In this study, we used a bi-temporal approach in order to detect burnt area change. Therefore, images on two consecutive dates were paired and stacked depth-wise, generating a 12-band file associated with the respective change mask. Given our available images, this process generated three sets of bi-temporal images for each of the Landsat scenes.
Additionally, since data are structured in a batch by batch basis for deep learning models, our images had to be restructured and sampled as a 4D tensor containing multiple image patches and with shape [S × H × W × B] where S is the number of samples, H and W the height and width of the patches in number pixels and B the number of bands in the bi-temporal image pair. In this study, we sampled the images through a sliding window of four different sizes based on power of two (2 n ) numbers: (a) 512 by 512, (b) 256 by 256, (c) 128 by 128, and (d) 64 by 64 pixels (Figure 3).

Data Structure
In this study, we used a bi-temporal approach in order to detect burnt area change. Therefore, images on two consecutive dates were paired and stacked depth-wise, generating a 12-band file associated with the respective change mask. Given our available images, this process generated three sets of bi-temporal images for each of the Landsat scenes.
Additionally, since data are structured in a batch by batch basis for deep learning models, our images had to be restructured and sampled as a 4D tensor containing multiple image patches and with shape [S × H × W × B] where S is the number of samples, H and W the height and width of the patches in number pixels and B the number of bands in the bi-temporal image pair. In this study, we sampled the images through a sliding window of four different sizes based on power of two (2 n ) numbers: (a) 512 by 512, (b) 256 by 256, (c) 128 by 128, and (d) 64 by 64 pixels (Figure 3).
Furthermore, we used a 12.5% overlap between sampling windows to reduce the loss of predictive power near sample edges, an effect that is induced by the padding operation in convolutional layers and by the lack of contextual information near the patch edges. Incomplete windows, i.e., with empty pixel values, were discarded. The total number of samples per image generated through this process was 168, 747, 3140, and 12,860, respectively.

Deep Learning Models
The basis for the models used in this study was the Autoencoder model, which consists of an architecture that downsamples (decodes) the feature maps generated through convolutional layers to learn features compactly and then upsamples (encodes) them back to the desired output size. This process usually leads to the loss of spatial information as the feature maps are downsampled. The U-Net architecture [63] can be considered an evolution of the basic Autoencoder model, which tries to correct the loss of spatial information through the introduction of residual connections that propagate the information before being downsampled towards the upsampling layers. This allows the model to learn low-level detail while also keeping the high-level semantic information. A further enhancement of the architecture has been proposed through the insertion of residual connections within the architecture's blocks, resulting in what has been called ResUnet [51,64]. These three Furthermore, we used a 12.5% overlap between sampling windows to reduce the loss of predictive power near sample edges, an effect that is induced by the padding operation in convolutional layers and by the lack of contextual information near the patch edges. Incomplete windows, i.e., with empty pixel values, were discarded. The total number of samples per image generated through this process was 168, 747, 3140, and 12,860, respectively.

Deep Learning Models
The basis for the models used in this study was the Autoencoder model, which consists of an architecture that downsamples (decodes) the feature maps generated through convolutional layers to learn features compactly and then upsamples (encodes) them back to the desired output size. This process usually leads to the loss of spatial information as the feature maps are downsampled. The U-Net architecture [63] can be considered an evolution of the basic Autoencoder model, which tries to correct the loss of spatial information through the introduction of residual connections that propagate the information before being downsampled towards the upsampling layers. This allows the model to learn low-level detail while also keeping the high-level semantic information. A further Remote Sens. 2020, 12, 2576 6 of 19 enhancement of the architecture has been proposed through the insertion of residual connections within the architecture's blocks, resulting in what has been called ResUnet [51,64]. These three architectures have been used to classify remote sensing data before with good results [48][49][50][51]64]. We adapted and evaluated these three architectures to describe possible differences when used to detect burnt area changes. Figure 4 describes the general structure of the models used.
Remote Sens. 2020, 12, x FOR PEER REVIEW 6 of 20 architectures have been used to classify remote sensing data before with good results [48][49][50][51]64]. We adapted and evaluated these three architectures to describe possible differences when used to detect burnt area changes. Figure 4 describes the general structure of the models used. In this study, the architecture of the three models have the same number of layers and basic structure. Still, they differ in the way the residual connections are used: (a) the Autoencoder uses no connections at all, (b) the U-Net architecture uses connections only between blocks from both sides of the structure, and (c) the ResUnet uses connections between and within the blocks. This allowed us to evaluate the effect of the addition of the residual connections. The Keras [65] Python framework was used to build and train the models and to classify the images.

Model Training
The three models shared most of the training parameters. To compute loss, we used the sum of the Binary Cross Entropy (BCE) loss and the Dice loss [66] functions (Equations (3)-(5)).
where: In this study, the architecture of the three models have the same number of layers and basic structure. Still, they differ in the way the residual connections are used: (a) the Autoencoder uses no connections at all, (b) the U-Net architecture uses connections only between blocks from both sides of the structure, and (c) the ResUnet uses connections between and within the blocks. This allowed us to evaluate the effect of the addition of the residual connections. The Keras [65] Python framework was used to build and train the models and to classify the images.

Model Training
The three models shared most of the training parameters. To compute loss, we used the sum of the Binary Cross Entropy (BCE) loss and the Dice loss [66] functions (Equations (3)-(5)).
Final Loss = BCE loss + Dice loss, where: where m is the number of mini-batches, y are the ground truth class values, y are the class scores from the sigmoid activation, and p are the predicted class values. The Dice loss comes from the Dice coefficient, also known as F1, which is especially useful for classifications with uneven class distributions (as in this study). This coefficient equally values positive and negative cases without the need to set arbitrary weights. The gradient descent optimization used the RMSprop algorithm with a learning rate of 10 −4 that automatically decreased by a magnitude of 10 every time the loss reached a plateau, to a minimum of 10 −6 . The models were trained with the data from scenes A and B for a total of 200 epochs, which was enough to stabilize model loss and error for every model instance. The main differing parameter for model training was the batch size, which varied depending on the sample size as the available hardware memory constrained it. The batch sizes used were of 4, 8, 16, and 32, respectively, for the window sizes of 512, 256, 128, and 64. Those were the largest possible batch sizes that allowed us to fit the samples into memory for their respective window sizes without reducing the number of samples. The model training used a computer equipped with Nvidia GeForce RTX 2080 TI graphics card with 11 GB of GPU memory, 16 GB of RAM memory, and an Intel Core i7-4770K CPU.

Model Evaluation
As mentioned before, model validation used the data from scene C. The three main metrics used to evaluate the models were the F1 measure, the Kappa coefficient, and the mean Intersection Over Union (mIoU) value, represented by Equations (6), (9), and (12), respectively.
where: Precision = True Positives True Positives + False Positives Recall = True Positives True Positives + False Negatives (8) where p o is the rate of agreement between the ground truth and the classification, and p e is the expected rate of random agreement (Equations (10) and (11) (TP + FN + TF + FP) 2 (11) mIoU = IoU 1 + IoU 2 · · · + IoU n n (12) where IoU is the area of the intersection divided by the area of the union of the classification and ground truth for a class, and n is the total number of classes. All three of these measures range from 0 to 1, where a result of 1 would represent a perfect classification. In this study, they provide a better quantitative assessment over the traditional accuracy value, which tends to be misleadingly optimistic in classifications with an imbalanced number of observations and a large number of background (negative) cases relative to the foreground (positive) cases [67,68].
In addition, we employed McNemar's test [69] to evaluate whether the models were significantly different between each other. This test is a non-parametric test that mainly evaluates whether the error distribution between two classifications is similar. In this study, we used the variation of the test based on a chi-square distribution with a single degree of freedom and continuity correction [70] (Equation (13)).
where f 12 and f 21 are the frequency of observations in disagreement between two classifications in a contingency table. A p-value of 0.05 was used as the threshold value, where lower values indicate that the distributions between two compared models are significantly different. Table 1 lists the detailed validation results for the F1, Kappa and mIoU measures for each model, while Figure 5 shows a visual comparison of these measures. The basic Autoencoder architecture showed the worst results overall, although it showed good F1, Kappa, and mIoU values.

Results
The U-Net and ResUnet architectures showed similar results. Despite performing better on average, the U-Net showed worse results in the time sequences of 08-25 to 09-10 and 09-10 to 09-26.
The models trained with samples with a size of 64 by 64 showed the worst results overall. The basic Autoencoder was the only model to show an improvement still when the window size was increased to 512 by 512. The ResUnet model showed a more marked loss of performance, increasing the window size to 512 by 512. Using a window size of 256 by 256 resulted in the best F1, Kappa, and mIoU values for both the U-Net and ResUnet models.
In most cases, the models produced more false positives than false negatives ( Figure 6). Improvements in the performance measures seemed to stem mainly from decreases in the number of false-positive predictions as the window sizes grew to 256 by 256. Comparatively, the number of false negatives varied little. However, the ResUnet model showed a noticeable increase in false negatives with the window size of 512 by 512. The time sequence between 08-25 and 09-10 showed the lowest amount of incorrectly classified pixels, which is explained by the fact that this sequence also showed the lowest extent of burnt areas overall.  In most cases, the models produced more false positives than false negatives ( Figure 6). Improvements in the performance measures seemed to stem mainly from decreases in the number of McNemar's test shows that when compared, most models have significantly different error distributions, which means the observed differences in the results did not occur at random ( Table 2). The only models found to be statistically similar were the 256-window U-Net and ResUnet. The differences in error distributions are also visually noticeable in the classification maps, particularly at the edges of burnt area patches. While the basic Autoencoder misclassified large groups of pixels, the U-Net and ResUnet models mostly showed misclassifications as very small groups or single pixels.  There were no noticeable differences between time sequences despite the physical and phenological changes (Figures 7-9). The models automatically masked water bodies and most shadows, which are spectrally similar to burnt areas. However, burnt area patches with cloud cover in either image in a sequence were still sources of error in the classifications. Despite that, cloud shadows in unburnt lands were still correctly classified as negatives in most cases. Both the U-Net and ResUnet models were able to classify unclouded patches of burnt land with a low occurrence of errors. Remote Sens. 2020, 12, x FOR PEER REVIEW 11 of 20

Discussion
Results have shown that the DL models evaluated offer excellent classification results when used to detect burnt area changes. Even the worst model using the Autoencoder architecture with a sample window of 64 by 64 resulted in F1, Kappa, and mIoU values over 0.8, which can already be considered a good result. The addition of residual connections between the decoding and encoding layers in the U-Net significantly improved the results. In contrast, the ResUnet's addition of connections within individual blocks gave marginal improvements and only in some cases. Visually, the Autoencoder's lack of connections translated to a noticeable loss of spatial information in the form of less detailed contours between the positive and negative classes. Overall, the U-Net architecture showed the best results, although not much higher than the ResUnet architecture, which was superior in the time sequences with greater extensions of burnt land. Regions with clouded patches of burnt land were among the primary sources of errors in the classifications as occasionally, the models misclassified the cloud shadows as extensions of the burnt land patches, generating false positives. Despite that, the models correctly classified regions without mixed burnt areas and cloud shadows and automatically masked objects commonly detected as false positives through ∆NBR thresholding, therefore reducing the need for human intervention. The presence of cloud cover is one of the main limitations when using Landsat data for change detection as it is a common occurrence that impacts both the creation of a ground truth mask and the training of the models. Radar data can be used instead but at the cost of a significant loss of spectral information and possibly accuracy [71]. Studies have been carried using CNNs and Synthetic Aperture Radar (SAR) data to detect burnt areas with results similar to those found in this study [46,72] although to much smaller extents. Our bi-temporal approach was similar to that used by the Brazilian Institute of Space Research (INPE) to produce official burnt area reports [73]. However, our use of DL architectures instead of a thresholding process produced a much lower rate of false positives (commission errors) and false negatives (omission errors). Furthermore, DL models can be trained incrementally with new training data and further improve results, although up to a specific limit.
Increasing the sample window size improved the results despite simultaneously decreasing the training batch sizes, although only up to 256 by 256 in the case of the U-Net and ResUnet models. The size of 512 by 512 worsened the results, particularly for the ResUnet model, which showed results close to the same model using the 64 by 64 samples. The cause can be attributed to the lower batch size used. Studies have shown that smaller batch sizes can introduce more noise in the training gradients, leading to a loss of generalizability and, therefore, less accuracy [74,75]. However, the window size of 64 by 64 showed the worst results overall, even using the largest batch size, which shows that there is possibly a balance between sample window size and batch size. This problem is ultimately limited by the quantity of memory available in the graphics card, which determines the number of samples, the size of samples, and the batch size that can be used in the same training process. In addition, increasing the model complexity (e.g., by increasing the number of layers or filters) can exponentially increase the memory required for training. Remote sensing data can be highly memory intensive, especially at higher spatial and spectral resolutions, making the process of optimizing the training parameters for DL models challenging with consumer-grade hardware.
The loss of performance with smaller window sizes is also related to the size of the object at hand. While small window sizes might cover the full extent of small objects, they cannot fully cover larger objects, leading to less information about the relationship between the object of study and its surroundings. Given the way convolutional networks function, the information within each image patch is highly important. In this study, the extent of burnt areas ranged from single pixels (900 m 2 ) to several square kilometers, and, as seen in Figure 3, the smaller window sizes created several image patches with low or no background-foreground context, i.e., without enough information about the burnt area border dynamics. Despite that, the results show that the sampling window does not necessarily need to cover the full extent of the object of detection, corroborating with results found in other studies [59].

Conclusions
In this study, we evaluated three Deep Learning models to detect burnt area changes in three bi-temporal Landsat image pairs: a basic Autoencoder, U-Net, and ResUnet. All three networks were based on the same principles but with differences in the use of residual connections. The training and validation of the models used Landsat data from scenes within the region of the Brazilian Cerrado. The models were trained with four different sample window sizes in pixels to verify performance differences: 64 by 64, 128 by 128, 256 by 256, and 512 by 512.
Results have shown that the architectures used are a reliable automated way to map burnt area changes between bi-temporal image pairs in the Cerrado. However, the U-Net and ResUnet models were superior to the basic Autoencoder as the introduction of residual connections significantly improved the results. The sample window size of 256 by 256 pixels showed the best results for the U-Net and ResUnet models, and further increasing it produced worse results for both of these models. The model evaluation considered the F1, Kappa, and mIoU measures, of which the 256 by 256 window U-Net model achieved the best overall results with average values of 0.960, 0.961, and 0.962, respectively. The ResUnet model had slightly worse results on average, but slightly better results in two of the three time sequences evaluated. McNemar's test verified the possibility that the differences between classifications were not statistically significant, and only the U-Net and ResUnet models using 256 × 256-pixel samples were found to be similar, while every other model was statistically unique.
The Cerrado biome is an important region given its biodiversity, but it is constantly under the threat of destruction through fires and deforestation. We recommend that future studies investigate more uses of current Deep Learning techniques to provide better solutions for the detection and mitigation of these threats. In addition, certain spectral vegetation and burn indexes that have been shown to possibly improve the detection of burnt land [76,77] were not used in this study and could be investigated in future works. A few other suggestions for further studies of the theme arose from certain limitations found in this study. We recommend an investigation of the effect of training batch sizes along with sample window sizes. The batch size is an essential factor in the model performances but is limited by memory, along with several other parameters relevant to DL models. Secondly, the presence of cloud cover in Landsat images is a source of error. Radar data partly solves this problem at the expense of spectral information. Therefore, the possibility of the use of mixed sensor data for burnt area mapping should be investigated. Another future investigation should be to compare the performance of DL algorithms with shallow ML algorithms to highlight differences in performance.