Change Detection of Deforestation in the Brazilian Amazon Using Landsat Data and Convolutional Neural Networks

Mapping deforestation is an essential step in the process of managing tropical rainforests. It lets us understand and monitor both legal and illegal deforestation and its implications, which include the effect deforestation may have on climate change through greenhouse gas emissions. Given that there is ample room for improvement in mapping deforestation with satellite imagery, in this study we tested and evaluated algorithms from the growing field of deep learning (DL), particularly convolutional neural networks (CNNs), for this task. Although studies have applied DL algorithms to a variety of remote sensing tasks over the past few years, they remain relatively unexplored for deforestation mapping. We attempted to map the deforestation between images approximately one year apart, specifically between 2017 and 2018 and between 2018 and 2019. Three CNN architectures available in the literature—SharpMask, U-Net, and ResUnet—were used to classify the change between years and were then compared to two classic machine learning (ML) algorithms—random forest (RF) and multilayer perceptron (MLP)—as points of reference. After validation, we found that the DL models performed better on most performance metrics, including the Kappa index, F1 score, and mean intersection over union (mIoU) measure, while the ResUnet model achieved the best overall results with a value of 0.94 in all three measures in both time sequences. Visually, the DL models also provided classifications with better-defined deforestation patches and did not need any post-processing to remove noise, unlike the ML models, which required some noise removal to improve their results.


Introduction
Deforestation is one of the primary sources of concern regarding climate change, as it is one of the largest sources of greenhouse gas emissions in the world, second only to the burning of fossil fuels [1]. Within the Brazilian Amazon, studies have shown that deforestation, in conjunction with forest fires, can account for up to 48% of total emissions [2]. It also bears substantial implications for the conservation of ecosystems and their biodiversity in the region, and it has been linked to the loss of species [3] and a general loss of ecosystem stability through fragmentation [4]. Locally, estimates also show that unchecked deforestation could lead to reductions in seasonal rainfall [5] and to the savanization of the environment [6].
Remote sensing imagery has been instrumental in the process of keeping track of deforestation in the Amazon. The Brazilian National Institute for Space Research (INPE) releases annual deforestation estimates through its PRODES project. Deforestation patches in the region typically show geometric or regular configurations and usually develop around official or unofficial roads, forming a dendritic or "fishbone" distribution [65]. Although deforestation mapping is a prime target for the application of DL algorithms, the number of related studies is still small given the variety of algorithms available. In order to investigate the use of DL for deforestation detection, three different CNN architectures were used to classify deforested areas yearly and were then compared to two classical ML algorithms as points of reference.

Training and Test Sites
In this study, we selected three regions within the Brazilian Amazon as study sites. These scenes encompass major deforestation centers that have developed along the "TransAmazon" (BR-230) [66-68] and "Cuiabá-Santarém" (BR-163) [69,70] highways (Figure 1). Roads are the driving forces for the spatial distribution of deforestation in the Amazon, where most deforestation occurs in the neighborhood of the main highways [71,72]. Widely discussed in the literature, the opening of roads in the Amazon forest favors the establishment of settlements, attracts migrants, facilitates the extraction of resources, increases the profitability of livestock and agriculture, and establishes access to wood [73-77].

The training used two scenes (Sites A and B), and validation utilized the remaining scene (Site C). We defined a bi-temporal approach for modeling and obtained Landsat 8/OLI imagery for each site for the years of 2017, 2018, and 2019, with approximately one year between each observation. Multitemporal images from similar periods of the year reduce variations in phenology and sun-terrain-sensor geometry. The images acquired were from the dry season to minimize cloud cover and reduce noise (Table 1). Tier 1 Landsat images were used as they offer consistent georegistration within prescribed image-to-image tolerances of less than 12-meter radial root mean square error (RMSE) and are therefore appropriate for time-series analysis (https://www.usgs.gov/landresources/nli/landsat/landsat-collection-1) [78].

Deep Learning Models
This research used three different DL architectures available in the literature: U-Net [79], SharpMask [80], and ResUnet [81]. While the U-Net and SharpMask algorithms were not developed with classification of remote sensing data in mind, studies have found that they are not only suitable but offer state-of-the-art results [39,82]. These algorithms share similarities, being based on architectures known as autoencoders with the addition of bridges or residual connections. Autoencoders downsample the feature maps generated through convolutional filters while incrementally increasing their number to learn low-level features compactly, and then upsample them back to the original input shape for inference. This process can be further enhanced using connections bridging the downsampling and upsampling steps (Figure 2) to propagate information. These connections help speed up training and reduce the degradation of data by combining both low-level detail and high-level contextual information. Low-level spatial detail is essential for change detection and land cover classification, and that is the main reason behind the choice of this specific type of architecture for this study.

While similar in principle and structure, the chosen architectures differ in depth and complexity. Table 2 summarizes the number of layers and the total number of parameters in each model after adapting them for this study. Some of the inner workings of each model also differ. For example, the U-Net and SharpMask algorithms downsample the feature maps through a pooling operation, whereas the ResUnet architecture downsamples by using a stride of two between convolutional filter windows. Another example is how the models use skip connections: both U-Net and SharpMask use exclusively long connections (linking the downsampling and upsampling sides of the architecture), while ResUnet makes use of both long and short connections (between convolutional blocks).
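To make this shared structure concrete, the sketch below shows a minimal U-Net-style autoencoder with long skip connections in Keras. It is an illustrative simplification under assumed names and sizes, not the exact U-Net, SharpMask, or ResUnet configurations used in this study; the 200 × 200 × 14 input shape anticipates the patch format described in the Data Structure section.

```python
# Minimal sketch of a U-Net-style autoencoder with long skip connections.
# Illustrative only: the actual U-Net, SharpMask, and ResUnet architectures
# are deeper and differ in detail (e.g., ResUnet replaces pooling with
# stride-2 convolutions and adds short residual connections).
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # Two 3 x 3 convolutions, as in a typical U-Net encoder/decoder block
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_unet_sketch(input_shape=(200, 200, 14)):
    inputs = layers.Input(shape=input_shape)

    # Downsampling path: spatial size shrinks while the filter count grows
    c1 = conv_block(inputs, 32)
    p1 = layers.MaxPooling2D(2)(c1)
    c2 = conv_block(p1, 64)
    p2 = layers.MaxPooling2D(2)(c2)

    # Bottleneck: the most compact representation of the input
    b = conv_block(p2, 128)

    # Upsampling path: long skip connections re-inject low-level detail
    u2 = layers.UpSampling2D(2)(b)
    c3 = conv_block(layers.Concatenate()([u2, c2]), 64)
    u1 = layers.UpSampling2D(2)(c3)
    c4 = conv_block(layers.Concatenate()([u1, c1]), 32)

    # 1 x 1 convolution yields a per-pixel change probability map
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(c4)
    return Model(inputs, outputs)
```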

Data Structure
The Landsat dataset consisted only of bands 1 through 7, as they share the same spatial resolution and contain most of the spectral information. Our initial training data was a bi-temporal cube stacking the base image and next year's image, constituting 14 bands. We maintained this data structure for the RF and MLP algorithms, where each pixel is an observation, and each band is a variable. The datasets had to be restructured for the DL algorithms due to the inner workings of the CNNs and due to hardware memory constraints. To build and train the models in this study, we used the Keras [83] python library, a high-level wrapper for the well-known Tensorflow library [84]. When working with three-dimensional image data, Keras accepts inputs in the form of a four-dimensional array with shape (samples, sample rows, sample columns, channels). To convert our images to the correct format, we extracted patches through 200 × 200 pixel windows with a 10-pixel overlap on each side (Figure 3). This process generated a total of 844 samples per site per time sequence, with a total of 3376 training samples and 1688 test samples.
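As an illustration of this restructuring, the sketch below stacks two co-registered 7-band scenes into a 14-band cube and extracts overlapping patches with numpy. The window size and overlap come from the text; the function names, and the reading of "10-pixel overlap on each side" as a stride of 180 pixels, are our own assumptions.

```python
# Sketch of the bi-temporal stacking and patch extraction. Assumes the two
# co-registered Landsat scenes are numpy arrays of shape (rows, cols, 7);
# names and the exact overlap handling are illustrative assumptions.
import numpy as np

def stack_bitemporal(img_t0, img_t1):
    # Stack the base image and the next year's image into a 14-band cube
    return np.concatenate([img_t0, img_t1], axis=-1)

def extract_patches(image, window=200, overlap=10):
    # One plausible reading of "10-pixel overlap on each side":
    # consecutive windows advance by window - 2 * overlap = 180 pixels
    stride = window - 2 * overlap
    rows, cols, _ = image.shape
    patches = [
        image[r:r + window, c:c + window, :]
        for r in range(0, rows - window + 1, stride)
        for c in range(0, cols - window + 1, stride)
    ]
    # Keras expects (samples, sample rows, sample columns, channels)
    return np.stack(patches)

# Usage: cube = stack_bitemporal(scene_2017, scene_2018)
#        x_train = extract_patches(cube)  # shape (n, 200, 200, 14)
```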

Ground Truth
To create our ground truth masks, we used the Brazilian Institute of Space Research's Project for Deforestation Mapping (INPE's PRODES) data (available at http://www.dpi.inpe.br/prodesdigital/dadosn/) [7] for the years of 2018 and 2019 as a visual guide and then refined it by remapping the deforestation polygons at a finer scale. PRODES data are commonly used for deforestation reports, and studies have used it before when modeling and studying deforestation dynamics [85,86]. The changes were mapped using digitizing tools from the QGIS software [87] at 1:30,000 scale and subsequently transformed into binary raster files with 0 and 1 as absence-presence codes, respectively. In this process, we mapped changes exclusively to the natural forest, regardless of the land cover type in the following year (Figure 4).
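For reference, the conversion from digitized polygons to a binary raster aligned with the Landsat grid can be done as in the sketch below, which uses the rasterio and fiona libraries. This is a generic illustration with hypothetical file names, not the study's actual QGIS-based workflow.

```python
# Sketch: burn digitized deforestation polygons into a 0/1 raster aligned
# with a reference Landsat scene. Assumes polygons exported from QGIS as a
# shapefile in the same CRS as the raster; file names are hypothetical.
import fiona
import rasterio
from rasterio.features import rasterize

with rasterio.open("landsat_scene.tif") as ref:
    meta = ref.meta.copy()
    shape = (ref.height, ref.width)
    transform = ref.transform

with fiona.open("deforestation_2018.shp") as shp:
    geoms = [feature["geometry"] for feature in shp]

# Pixels inside a polygon get 1 (change); everything else stays 0 (no change)
mask = rasterize(geoms, out_shape=shape, transform=transform,
                 fill=0, default_value=1, dtype="uint8")

meta.update(count=1, dtype="uint8")
with rasterio.open("ground_truth_2018.tif", "w", **meta) as dst:
    dst.write(mask, 1)
```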

Hyperparameters
The RF model only needed two hyperparameters set: the number of trees to build (ntree) and the number of variables randomly sampled as candidates at each split (mtry). These were set to 500 trees and three variables, respectively. The structure of the MLP algorithm consisted of a simple 3-layer network containing an input layer, a hidden layer with 256 nodes, and an output layer. The DL algorithms and MLP shared the same hyperparameters for training. Focal loss [88] was used as the loss function as it excels in classification problems with an uneven number of observations in each class, as is the case of our object of study. For gradient descent optimization, we used the adaptive moment estimation (ADAM) algorithm [89] with incorporated Nesterov momentum (NADAM), with a learning rate of 2e-3, β1 of 0.9, and β2 of 0.999. The number of epochs was set to 250 and the batch size to 16 to fit the training process into memory.
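Translated into Keras, this shared training configuration would look roughly like the sketch below. The binary focal loss follows the standard formulation of [88] with commonly used gamma and alpha defaults, which the study does not report; the model and training arrays are assumed to come from the earlier sketches.

```python
# Sketch of the shared training configuration. The focal loss is a common
# binary formulation of [88]; gamma = 2.0 and alpha = 0.25 are the usual
# defaults and are assumptions, not values reported by the study.
from tensorflow.keras import backend as K
from tensorflow.keras.optimizers import Nadam

def binary_focal_loss(gamma=2.0, alpha=0.25):
    def loss(y_true, y_pred):
        y_true = K.cast(y_true, "float32")
        eps = K.epsilon()
        y_pred = K.clip(y_pred, eps, 1.0 - eps)
        # p_t: predicted probability of the true class for each pixel
        p_t = y_true * y_pred + (1 - y_true) * (1 - y_pred)
        alpha_t = y_true * alpha + (1 - y_true) * (1 - alpha)
        # (1 - p_t)^gamma down-weights abundant, easy no-change pixels
        return -K.mean(alpha_t * K.pow(1 - p_t, gamma) * K.log(p_t))
    return loss

model = build_unet_sketch()  # hypothetical model from the earlier sketch
model.compile(
    optimizer=Nadam(learning_rate=2e-3, beta_1=0.9, beta_2=0.999),
    loss=binary_focal_loss(),
    metrics=["accuracy"],
)
# x_train, y_train: patch arrays as described in the Data Structure section
model.fit(x_train, y_train, epochs=250, batch_size=16)
```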

Modeling Approach
Given the context of our main methodological steps described in the previous sections, a top-down view of our modeling approach is shown in Figure 5. In addition to the DL algorithms, two classical ML algorithms, random forest (RF) and a simple multilayer perceptron (MLP) architecture, were used as a reference point for the assessment of the DL models. Both models have been extensively researched for land cover classification and change detection in remote sensing data, and their performance is well documented [90,91].

Accuracy Assessment
The accuracy metrics were calculated using the test site data exclusively to avoid the possibility of biased results due to overfitting. The classification results were compared to the ground truth mask for the test site in order to calculate the accuracy measures. Given that deforestation-related change is typically a rare phenomenon, the change/no-change ratio is highly imbalanced [92,93]. Therefore, change detection research usually shows a predominance of invariant areas, causing a bias in some accuracy metrics; for example, overall accuracy is relatively high on most change maps [94]. The Precision and Recall measures (Equations (1) and (2)) were used to offer more insight into the distribution of errors in the classifications, along with three other measures besides accuracy: the F1 score (also known as the Dice coefficient), the Kappa index, and the mean intersection over union (mIoU) measure (Equations (3)-(5), respectively). These measures are often used to evaluate DL and ML classifications and are better suited for classifications with imbalanced datasets than overall accuracy as they equally weight class distributions.

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \quad (1)$$

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \quad (2)$$

$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \quad (3)$$

$$\text{Kappa} = \frac{p_o - p_e}{1 - p_e} \quad (4)$$

where $p_o$ is the rate of agreement between the ground truth and the classification, and $p_e$ is the expected rate of agreement due to chance.

$$\text{mIoU} = \frac{\text{IoU}_1 + \text{IoU}_2 + \cdots + \text{IoU}_n}{n} \quad (5)$$

where IoU is the area of intersection divided by the area of union between the classification and the ground truth for a class, and $n$ is the total number of classes. Finally, we used McNemar's test [95] to evaluate the statistical significance of differences between the classifications.
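For the binary change/no-change case, all five measures follow directly from the confusion matrix; the sketch below is a minimal numpy implementation mirroring Equations (1)-(5) (our own helper, assuming 0/1 arrays and non-degenerate denominators).

```python
# Sketch: computing the measures of Equations (1)-(5) for the binary
# change/no-change case from 0/1 ground truth and prediction arrays.
import numpy as np

def change_detection_metrics(y_true, y_pred):
    y_true, y_pred = y_true.ravel(), y_pred.ravel()
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    n = tp + fp + fn + tn

    precision = tp / (tp + fp)                           # Equation (1)
    recall = tp / (tp + fn)                              # Equation (2)
    f1 = 2 * precision * recall / (precision + recall)   # Equation (3)

    # Kappa: observed agreement p_o vs. chance agreement p_e, Equation (4)
    p_o = (tp + tn) / n
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (p_o - p_e) / (1 - p_e)

    # mIoU averaged over the change and no-change classes, Equation (5)
    iou_change = tp / (tp + fp + fn)
    iou_no_change = tn / (tn + fn + fp)
    miou = (iou_change + iou_no_change) / 2

    return {"precision": precision, "recall": recall, "f1": f1,
            "kappa": kappa, "miou": miou}
```

For McNemar's test, an off-the-shelf implementation such as statsmodels' `mcnemar` function (in `statsmodels.stats.contingency_tables`) can be applied to the 2 × 2 table of pixels that each pair of models classified correctly or incorrectly.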

Results
Quantitatively, the DL models showed a clear advantage over RF and MLP (Table 3). The ResUnet model had the best results with regard to every measure, with the exception of Precision in the 2017-2018 time frame. The SharpMask and U-Net models showed similar but slightly inferior results. In comparison, the RF model showed the worst results in most measures, although the performance measures still indicated a good classification. It should be noted that the RF and MLP classifications exhibited a considerable amount of impulse noise ("salt-and-pepper" type), and a majority filter was applied to reduce the noise and improve the classification both visually and quantitatively. The DL models did not require any post-processing steps as they produced classifications with virtually no noise. All models showed very high overall accuracy, but, as explained previously, this measure should be carefully considered as the ratio between the change and no-change classes is highly imbalanced and is mostly explained by the larger no-change class. McNemar's test results indicate that, despite the seemingly similar results, the model classifications were all significantly different from each other (Table 4).

RF's higher precision in the 2017-2018 frame can be explained by the low number of false positives produced. Conversely, however, it produced a very high number of false negatives within the same frame (Figure 6). The ResUnet model had the lowest number of misclassified pixels in both time sequences. It also produced the fewest false negatives of all the models. When looking at the number of false-positive cases, the DL algorithms did not show a large difference over the ML models. With regard to false negatives, however, they showed a clear advantage. The reduction of false-negative classifications is a considerable advantage of the DL models over the classic ML algorithms, given that underestimating the extent of deforestation is a less desirable outcome than its overestimation.

The models detected roughly the same deforestation sites at the validation site across both time sequences (Figures 7 and 8). However, the DL models provided more detailed classifications at smaller scales, particularly around feature edges. Moreover, all models were able to classify "easy" deforestation patches with less complex spectral mixtures (Figure 9), but the classification of the ML algorithms degraded as the spectral signatures within the patches increased in complexity (Figure 10). RF showed a higher tendency to produce false negatives both visually and quantitatively.
The total deforested area was slightly higher than the ground truth in the SharpMask and ResUnet predictions in both time sequences (Table 5). The opposite was true for the MLP prediction, which slightly underestimated the total area in both time spans. The RF model underestimated the total deforested area by a very large portion (almost 40 km², or a 26% decrease in area) in the 2017-2018 sequence due to a large number of false-negative predictions; despite this, it came closest to the ground truth area in the 2018-2019 sequence, although that does not necessarily mean the predicted areas coincided spatially with the ground truth.

Processing times varied from model to model, but the MLP and DL models offered faster training and prediction times than RF, mainly because the Tensorflow framework uses the computer's graphical processing unit (GPU) for parallel processing instead of the central processing unit (CPU), which is traditionally used for ML. Using an NVIDIA GTX 1070 GPU and a batch size of 16, the total training time ranged from approximately 40 minutes for the simpler MLP model (around 10 seconds per epoch) to almost three hours for the more complex ResUnet model (approximately 40 seconds per epoch). Given the size of the datasets, RF took approximately six hours to train using parallel processing with an Intel Core i5-4690k processor. The difference in processing times was particularly considerable when using the models to classify the images after training. The DL models and MLP classified the test scene within seconds, whereas RF took almost an hour to complete the task.

Figure 10. The yellow rectangle highlights an example of a "hard-to-classify" deforestation patch.

Discussion
The CNN architectures used in this study showed a clear advantage over the classic ML algorithms, both quantitatively and visually, regarding deforestation mapping. Similarly, a comparative study of methods for wetland mapping developed by [96] found that deep learning methods (fully convolutional networks and patch-based deep CNNs) obtained better accuracy than RF and support vector machines. The authors found that a CNN may produce inferior performance when the training sample size is small, but it tends to show substantially higher accuracy than conventional classifiers when a larger training sample size is used. We assume that the difference in performance between the DL and traditional ML methods stems from the former's capability to understand both the spatial and spectral context, whereas the regular ML models inherently only see the spectral information.
Although the current methodologies to detect deforestation with DL architectures vary widely, studies agree that they produce excellent classification results [97-99]. Other analogous studies that have investigated the use of DL for single-class classification seem to corroborate this trend, although there is a large variation in the choice of targets and architectures [34,100-102]. While the choice and development of architectures for certain targets is a relevant topic for future research, we have found that autoencoder networks with residual connections seem to be a good starting point for classifications in remote sensing imagery as they can take advantage of spatial and spectral information in a very efficient manner.
Despite their advantages, DL algorithms are still not as accessible or as easy to use as classic ML models. Besides needing specific hardware for training, they require a relatively large quantity of samples, and developing ground truth masks for specific targets can be challenging and time-consuming in large extents as both spatial and spectral context are strictly needed, whereas the traditional ML algorithms work with simpler sampling schemes and can produce reasonably good results with a much smaller sample size. Therefore, the process of building a model for broader use (i.e., country-wide monitoring) can be complicated. However, these models have another advantage in the fact that they can be incrementally trained, meaning they could be gradually provided with new samples to update the model weights and improve their classifications with time. With that said, the "black box" nature of these networks can make them undesirable for those who might wish to know and disclose their internal workings such as public and governmental entities. Despite that, through our findings, we believe that with enough development, DL algorithms can provide a viable automatic solution for mapping deforestation in the Amazon alongside projects such as INPE's PRODES and TerraClass.
It should be noted that while the models showed good capability for generalization within our region of study, we cannot assert that they would achieve the same results in different areas where deforestation is a common occurrence. A broader-reaching model would necessarily require samples from different regions to account for the possible spatial and spectral variability from one region to another. Further research should be carried out to study the applicability of the models for similar targets in different areas. In addition, while Landsat data are enough for annual deforestation mapping between dry seasons, more frequent monitoring is virtually impossible as clouds are present above the forest canopy during most of the year, and the ground reflectance cannot reach the satellite's optical sensors. One solution would be the use of radar data, which can penetrate the cloud cover. As such, we also recommend investigating the use of radar data and DL algorithms to detect deforestation within a shorter time frame.

Conclusions
In this study, we proposed the use of existing DL architectures to detect yearly changes in the vegetation cover of the Brazilian Amazon, successfully achieving our goal. The results show that these algorithms are a viable alternative to classical ML algorithms, improving all performance measures and offering clear advantages such as faster prediction times and a lack of noise in the classifications. The SharpMask, U-Net, and ResUnet models showed similar results. However, ResUnet achieved the best values of accuracy, Kappa, F1, and mIoU, and the fewest errors overall. Visually, the DL algorithms also produced classification masks with well-defined deforestation patches, while the ML models showed an evident loss of quality in harder-to-classify patches, with a tendency to produce false negatives and impulse noise that needed to be filtered. One of the main shortcomings of CNNs seems to be the necessity of a 1:1 ground truth covering the extent of the area of study, as the spatial context is critical. In contrast, simpler ML models can be trained on a point-by-point basis (e.g., random sampling points within the extent). Developing a whole ground truth can be an extensive process. However, we achieved very good results with a relatively small sample size and very little augmentation in the form of overlapping sample patches. The additional bands in remote sensing data may facilitate the detection of targets with fewer samples, but that supposition needs further research. Furthermore, considering the models were validated by being applied to an independent dataset, the performance measures show that they have very good potential for generalization.

DL is still a growing technology, particularly in the remote sensing field, as not even popular libraries such as Keras and Tensorflow have built-in tools for dealing with multi-band satellite imagery, but researchers are slowly adapting and developing better architectures specific to remote sensing data. The architectures used in this study performed well in our specific task even though they were developed for entirely different targets. Therefore, these algorithms do not necessarily need to be tailored for specific cases and can even work interchangeably between fields of research.