Deforestation Detection with Fully Convolutional Networks in the Amazon Forest from Landsat-8 and Sentinel-2 Images

Abstract: The availability of remote-sensing multisource data from optical-based satellite sensors has created new opportunities and challenges for forest monitoring in the Amazon Biome. In particular, change-detection analysis has emerged in recent decades to monitor forest-change dynamics, supporting some Brazilian governmental initiatives such as the PRODES and DETER projects for biodiversity preservation in threatened areas. In recent years, numerous fully convolutional network architectures have been proposed and adapted for the change-detection task. This paper comprehensively explores state-of-the-art fully convolutional networks such as U-Net, ResU-Net, SegNet, FC-DenseNet, and two DeepLabv3+ variants on monitoring deforestation in the Brazilian Amazon. The networks’ performance is evaluated experimentally in terms of Precision, Recall, F1-score, and computational load using satellite images with different spatial and spectral resolution: Landsat-8 and Sentinel-2. We also include the results of an unprecedented auditing process performed by senior specialists to visually evaluate each deforestation polygon derived from the network with the highest accuracy results for both satellites. This assessment allowed estimation of the accuracy of these networks simulating a process “in nature” and faithful to the PRODES methodology. We conclude that the high resolution of Sentinel-2 images improves the segmentation of deforestation polygons both quantitatively (in terms of F1-score) and qualitatively. Moreover, the study also points to the potential of the operational use of Deep Learning (DL) mapping as products to be consumed in PRODES. [...] of 62.8% of Landsat-derived polygons matched with Sentinel ones and 71.5% of the Sentinel polygons matched with Landsat ones. These results showed that Landsat-based [...]


Introduction
Deforestation is one of the most serious environmental problems today. The devastation of forests and natural resources compromises the ecological balance and seriously affects the economy and quality of life across the planet.
As the largest tropical forest in the world, the Amazon Rainforest has particular importance. It plays an essential role in carbon balance and climate regulation, provides numerous ecosystem services, and is among the most biodiverse biomes on earth [1]. With about 5 million km², the Brazilian Amazon occupies the largest area of the Amazon Forest, covering about 65% of the total area.
Until 1970, deforestation in the Brazilian Amazon comprised about 98,000 km², while in the last 40 years, the deforested area covered 730,000 km², which corresponds to twice the German territory and comprises nearly 18% of the area formerly covered by vegetation [2,3].
Moreover, the destruction of the Brazilian Amazon rainforest continues at an accelerated rate. Satellite monitoring data published by the Brazilian National Institute for Space Research (INPE) show that the deforestation rate in the Amazon increased by 34% in 2019 compared to the previous year, and the 2020 annual increment of deforestation reached the highest value in the last ten years [4].
INPE has monitored the deforestation rate in the Amazon region since the 1980s through the PRODES (Brazilian Amazon Rainforest Monitoring Program by Satellite; http://www.obt.inpe.br/OBT/assuntos/programas/amazonia/prodes, accessed on 18 July 2021) and the DETER (Real-time Deforestation Detection System; http://www.obt.inpe.br/OBT/assuntos/programas/amazonia/deter, accessed on 18 July 2021) projects [5]. The DETER project is a rapid monitoring system based on medium-resolution satellite images, which aims at the early detection of deforestation activities in almost real time to enable the intervention of security agents before the damage reaches large proportions [6]. The PRODES project, on the other hand, is in charge of accurately mapping deforestation and computing annual deforestation rates. It is based on Landsat-8 images to map the new forest loss that occurred over a year in the Brazilian Amazon biome.
Both the PRODES and DETER protocols involve a great deal of visual interpretation and are consequently time-consuming and subject to a significant degree of subjectivity. Although automatic approaches allow for faster and less subjective analysis, the PRODES and DETER programs still involve considerable human intervention because no entirely automated procedure has thus far been able to meet the accuracy requirements.
The literature abounds with studies aimed at monitoring changes in forest areas based on satellite images [7][8][9]. Many change-detection techniques have been proposed (e.g., [10,11]). Some of the early strategies include techniques based on image algebra [10,12]. Nelson et al. [13] studied image differencing, image ratioing, and vegetation index differencing (VID) for detecting forest canopy alteration. Other change-detection methods proposed thus far rely on machine learning classifiers such as artificial neural networks [14], decision trees [15], fuzzy theory [16], and support vector machines [17].
These change-detection methods present good results for identifying changes in medium- to coarse-resolution imagery but fail when dealing with high-resolution images, tending to produce a salt-and-pepper pattern on the resulting maps [15]. This effect occurs due to the high-frequency components and high contrast of high-resolution images. Variations in image acquisition often result in too many changes being detected [15]. Indeed, the main limitation of these methods is the difficulty of modeling contextual information, as the classes of neighboring pixels are often ignored [18].
In recent years, DL has emerged as the dominant trend in image analysis, performing even better than humans in some complex tasks. Indeed, DL methods have shown great potential in change-detection applications, outperforming traditional machine learning methods (e.g., [19][20][21]). Some approaches in this category involve patch-based classification, including Early Fusion and Siamese CNNs [22,23]. An essential disadvantage of patch classification with CNNs is the redundant operations that imply a high computational cost. The fully convolutional neural networks (FCNs), first proposed in [24], are computationally more efficient because they perform pixel-wise instead of patch-based classification. Daudt et al. [25] presented one of the first works using FCNs for change detection. This work adapted patch-based Early Fusion and Siamese Networks into an FCN architecture, achieving better accuracy and faster computation. Few studies published so far have used FCNs for deforestation mapping. De Bem et al. [26] evaluated different machine learning techniques for deforestation mapping in Brazilian biomes based on Landsat-8 images. The comparative study showed a clear advantage of the FCNs over the classic machine learning and CNN algorithms, both quantitatively and qualitatively.
Most works published thus far pick the class with the highest posterior probability. In deforestation mapping, this means assigning a pixel to the class deforestation whenever the corresponding posterior probability exceeds 50%, which implies weighting false negatives and false positives equally. However, depending on the purpose, false positives may be more detrimental than false negatives. Consider, for example, a decision whether or not to send a team for inspection and possible infraction notice to a spot suspected of ongoing deforestation activities. The decision to trigger such an action requires confidence in the prediction, for instance, a deforestation probability higher than 50%. In other operational scenarios, the opposite may occur, i.e., false negatives may be more harmful than false positives. Therefore, it is worth investigating how the various accuracy metrics behave for different probability thresholds.
This study arises in this context and investigates alternative FCN architectures in the Early Fusion configuration for deforestation mapping in the Brazilian Amazon. We further investigate how the different spatial and spectral resolutions of Landsat-8 and Sentinel-2 images, two currently freely available optical data sources, may impact the performance of said network designs.
Considering the legend maps are derived from coarse spatial resolution Landsat satellite sensors, we intend to test whether the high resolution of Sentinel-2 images helps to better preserve the spatial structure of the deforestation polygon in comparison with Landsat images.
Specifically, the contributions of this work are three-fold:
The rest of the text is structured as follows: Section 2 presents the FCN architectures investigated in this work. Section 3 describes the study area, the datasets used in our experimental analysis, the experimental setup, the networks' implementation, and the adopted performance metrics. Next, Section 4 shows the results recorded and, finally, Section 6 summarizes the conclusions derived from the experiments carried out and provides guidance for the continuation of this research.

Methods
Fully convolutional networks are the most successful approaches for the semantic segmentation task [24]. Typically, these networks consist of an encoder stage, which reduces the spatial resolution by convolution and pooling operations through consecutive layers, followed by a decoder stage that retrieves the original spatial resolution.
The change-detection algorithm adopted in this work follows the "early fusion" strategy that has been adopted in several related works [23,25,27]. In short, an FC network seeks to identify the pixels where deforestation occurred in the time interval between the acquisition dates of two co-registered images. The input to the FC network is the tensor formed by the concatenation of the two images along the spectral dimension.
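As a minimal sketch of the early-fusion input described above, the snippet below stacks two co-registered patches along the spectral axis with NumPy. The function name `early_fusion`, the channels-last layout, and the random 7-band patches are illustrative assumptions, not the authors' code.

```python
import numpy as np

def early_fusion(image_t1: np.ndarray, image_t2: np.ndarray) -> np.ndarray:
    """Stack two co-registered (H, W, C) images along the spectral axis."""
    if image_t1.shape != image_t2.shape:
        raise ValueError("co-registered images must share the same shape")
    return np.concatenate([image_t1, image_t2], axis=-1)

# Hypothetical pair of 128 x 128 patches with 7 bands each (random values).
rng = np.random.default_rng(0)
patch_t1 = rng.random((128, 128, 7))
patch_t2 = rng.random((128, 128, 7))
fused = early_fusion(patch_t1, patch_t2)
print(fused.shape)  # (128, 128, 14)
```

The network then sees a single tensor with twice the number of bands, so any single-image segmentation architecture can be reused without structural changes.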
In this section, we present a short description of the main characteristics of the FCN architectures on which this study is based.

SegNet
A distinctive feature of the SegNet architecture [28] is its decoding technique, as shown in Figure 1. The max-pooling operations in the encoder stage reduce spatial resolution and computational complexity, but they cause a loss of spatial information that negatively impacts the outcome, especially at object borders. SegNet seeks to overcome this downside by storing the max-pooling indices in the encoder and using them to recover the location of fine details in the upsampling operation at the corresponding decoder stage. Compared to U-Net's skip connections (see next subsection), this strategy allows for a faster training process, as the network does not need to learn weights in the upsampling stage [28].
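The index-based upsampling can be illustrated on a single feature map. The following NumPy sketch assumes 2 × 2 pooling windows with stride 2 and an even-sized input; the helper names `max_pool_with_indices` and `max_unpool` are hypothetical and only mimic the idea, not any particular framework layer.

```python
import numpy as np

def max_pool_with_indices(x: np.ndarray):
    """2 x 2 max pooling (stride 2) on an (H, W) map with even H and W.

    Returns the pooled map and, per window, the position (0..3) of its maximum.
    """
    h, w = x.shape
    windows = (x.reshape(h // 2, 2, w // 2, 2)
                .transpose(0, 2, 1, 3)
                .reshape(h // 2, w // 2, 4))
    return windows.max(axis=-1), windows.argmax(axis=-1)

def max_unpool(pooled: np.ndarray, indices: np.ndarray) -> np.ndarray:
    """Put each pooled value back at its stored position; zeros elsewhere."""
    hp, wp = pooled.shape
    windows = np.zeros((hp, wp, 4), dtype=pooled.dtype)
    np.put_along_axis(windows, indices[..., None], pooled[..., None], axis=-1)
    return (windows.reshape(hp, wp, 2, 2)
                   .transpose(0, 2, 1, 3)
                   .reshape(hp * 2, wp * 2))

feature_map = np.array([[1, 2, 0, 0],
                        [3, 4, 0, 5],
                        [0, 0, 9, 0],
                        [7, 0, 0, 0]])
pooled, indices = max_pool_with_indices(feature_map)
restored = max_unpool(pooled, indices)
print(pooled)
print(restored)
```

No weights are involved in `max_unpool`, which is the reason the upsampling step itself has nothing to learn.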

U-Net
The U-Net [29] is probably the most widely used network architecture for semantic segmentation (see Figure 2). Similar to SegNet, it consists of two sequential stages. First, the so-called encoder successively reduces the spatial resolution as it extracts increasingly coarse features. Then, the decoder continues extracting features at increasingly higher resolutions until it reaches the original resolution and, finally, associates one class to each input pixel position. A characteristic of the U-Net architecture is its skip connections, which concatenate the features captured in the downsampling path with the features computed by the corresponding layers of the upsampling path. In this way, it recovers small details lost through the pooling operations and allows a faster convergence of the model [29].

ResU-Net
ResU-Net [30] is a fully convolutional neural network that takes advantage of both the U-Net architecture and the so-called residual block introduced by the ResNet architecture, first conceived for image classification [31]. Residual blocks mitigate the vanishing or exploding gradient problem and thereby help to create deeper networks. In sum, instead of the input-to-output mapping, a residual block learns the residual to be added to the input to produce the output (see Figure 3 on the right). The ResU-Net consists of stacked layers of residual blocks in an encoder-decoder structure. In the encoder stage, a 1 × 1 convolution with a stride of 2 downsamples the output of each residual block. The layers within residual blocks are shown in detail in Figure 3. In the decoder stage, the upsampling layers increase the spatial resolution until reaching the original image size. The ResU-Net inherits from the U-Net architecture the skip connections to preserve fine details lost in the encoder stage.
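In symbols, and following the standard ResNet formulation rather than anything specific to this paper, a residual block with input $x$ and stacked layers $F$ (with weights $W$) can be written as below. The second line is a brief reminder of why the identity path eases gradient flow.

```latex
\begin{aligned}
y &= x + F(x; W) \\
\frac{\partial y}{\partial x} &= I + \frac{\partial F(x; W)}{\partial x}
\end{aligned}
```

Because of the identity term $I$, the gradient reaching $x$ does not vanish even when the contribution of $F$ is small.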

FC-DenseNet
Jégou et al. [32] extended the Densely Connected Convolutional Network (DenseNet) used for image classification [33] by adding an upsampling path, thereby obtaining a fully convolutional network for semantic segmentation. The main characteristic of this architecture is its ability to reuse at each layer the information of the preceding layers. The FC-DenseNet concatenates the feature maps computed at each layer with the features generated in prior layers, forming so-called dense blocks (DB in Figure 4). Therefore, the number of feature maps increases at each new layer. Each block consists of several layers, whereby each layer is composed of batch normalization, a ReLU activation, a 3 × 3 convolution, and dropout. FC-DenseNet uses dense blocks and transition-up modules in the upsampling path, which apply transposed convolutions to upscale the feature maps up to the original image resolution. The skip connections introduced in the U-Net are also present in FC-DenseNet to recover small details that may be lost along the downsampling path.
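The growth in the number of feature maps inside a dense block can be made concrete with a few lines of arithmetic. In this sketch the number of new maps produced per layer (the growth rate) and the input width are hypothetical values chosen for illustration; they are not taken from the configurations used in this paper.

```python
from typing import List

def dense_block_channels(in_channels: int, num_layers: int, growth_rate: int) -> List[int]:
    """Feature maps available before each layer, followed by the total after the last one."""
    channels = [in_channels]
    for _ in range(num_layers):
        # Each layer produces `growth_rate` new maps that are concatenated
        # to all feature maps computed so far.
        channels.append(channels[-1] + growth_rate)
    return channels

widths = dense_block_channels(in_channels=48, num_layers=4, growth_rate=16)
print(widths)  # [48, 64, 80, 96, 112]
```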

DeepLabv3+
The DeepLabv3+ model (see Figure 5) has an encoder stage that extracts a compact image representation and a decoder stage that recovers the original image resolution and delivers pixel-wise posterior class probabilities. Compared with previous DeepLab variants, the decoder stage in DeepLabv3+ allows for improved segmentation outcomes along object boundaries [35]. DeepLabv3+ uses Xception-65 [36] as a backbone, a deep module based entirely on depthwise separable convolutions [35,36] with different strides. Conceptually, the depthwise separable convolution breaks down the convolution into two separate operations: a depthwise and a pointwise convolution [36]. To increase the field of view without increasing the number of parameters, DeepLab uses Atrous Spatial Pyramid Pooling in the bottleneck by applying atrous convolution with multiple rates. Atrous convolution, also known as dilated convolution, operates on an input feature map $x$ as follows:
$$y[i] = \sum_{k} x[i + r \cdot k]\, w[k] \qquad (1)$$
where $i$ is the location in the output feature map $y$, $w$ is a convolution filter, and $r$ is the dilation rate that determines the stride with which the input signal is sampled [35]. The basic idea consists of expanding a filter by including zeros between the kernel elements. In this way, we increase the receptive field of the output layers without increasing the number of learnable kernel elements and the computational effort.
Atrous Spatial Pyramid Pooling (ASPP) involves employing atrous convolution with different rates in parallel on the same input as a strategy to extract features at multiple scales. This technique alleviates the loss of spatial information intrinsic to pooling or strided convolutions [37]. It allows for a larger receptive field by increasing the rate value while maintaining the number of parameters.
The decoder is a simple structure that uses bilinear upsampling to recover the original spatial resolution [35].
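To make the effect of the dilation rate tangible, the sketch below applies a one-dimensional atrous convolution, y[i] = sum over k of x[i + r·k]·w[k], with the same three-tap filter at several rates, as ASPP does in parallel. The function name `atrous_conv1d`, the "valid" border handling, and the toy signal are assumptions made for illustration only.

```python
import numpy as np

def atrous_conv1d(x: np.ndarray, w: np.ndarray, rate: int) -> np.ndarray:
    """y[i] = sum_k x[i + rate * k] * w[k], evaluated only where the filter fits."""
    span = rate * (len(w) - 1)  # distance covered by the dilated filter
    out_len = len(x) - span
    if out_len <= 0:
        raise ValueError("input is shorter than the dilated filter")
    return np.array([
        sum(x[i + rate * k] * w[k] for k in range(len(w)))
        for i in range(out_len)
    ])

signal = np.arange(10, dtype=float)
kernel = np.array([1.0, 0.0, -1.0])  # same three weights for every rate
for rate in (1, 2, 3):
    receptive_field = 1 + rate * (len(kernel) - 1)
    print(rate, receptive_field, atrous_conv1d(signal, kernel, rate))
```

The number of weights stays at three while the receptive field grows from 3 to 5 to 7 samples, which is the property the ASPP module exploits.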
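The saving obtained by the depthwise separable convolutions of the backbone can be checked with simple parameter counts. The sketch below ignores biases, and the kernel size and channel numbers are hypothetical values, not those of the networks evaluated here.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights of a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Weights of a depthwise k x k convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise

standard = standard_conv_params(3, 128, 256)
separable = depthwise_separable_params(3, 128, 256)
print(standard, separable)  # 294912 33920
```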

MobileNetV2
Most of the best-performing FCN architectures require computational resources not available in many mobile devices. This fact has led some researchers to design neural network architectures tailored to mobile and resource-constrained hardware platforms with low accuracy loss. One example is the DeepLabv3+ variant that uses MobileNetV2 as its backbone. The architecture's building block is the so-called inverted residual structure [38] (see Figure 6). This module first expands a low-dimensional input to a higher dimension, filters it with a lightweight depthwise convolution, and then projects it back to a low-dimensional representation with a linear convolution. Finally, inspired by traditional residual connections, shortcuts speed up training and improve accuracy.

Study Area
We selected a portion of the Amazon forest in the Acre and Amazonas states, Brazil, as a study site (see Figure 7). This area extends over approximately 12,065 km², covering around 0.3% of the total Brazilian Amazon forest. It intersects the Landsat path/row 003/066 scene, and its coordinates range from 08°08′28″S to 09°08′07″S in latitude and from 68°54′40″W to 69°54′29″W in longitude. The area is characterized by typical Southwest Amazon moist forests, with a lesser presence of flooded forests. Most of the study area has no specific protection status; extractive reserves and indigenous lands cover less than 5% of it, in its southern part. Two aspects support the choice of this area to conduct our experiment: (i) the diversity of deforestation patterns, based on landscape metrics and polygon area [39], represented by multidirectional, geometric regular, linear, and diffuse occupation forms; and (ii) the deforestation dynamics in the region, characterized by intense occupation along the BR-364 federal road, but also deep inside the forest, and induced by the expansion of the agricultural frontier from Rondônia State to western territories. More than 7% of the study area had already been deforested by 2020.

Datasets
We downloaded all datasets used in our experiment from the PRODES and DETER websites, which are accessible from the TerraBrasilis portal (http://terrabrasilis.dpi.inpe.br/en/home-page/, accessed on 18 July 2021), where all deforestation reports produced by both programs since they came into operation are available for free.
The inputs to the deep learning models were Landsat-8 Collection 1 Tier 1 (https://developers.google.com/earth-engine/datasets/catalog/LANDSAT_LC08_C01_T1, accessed on 18 July 2021) and Sentinel-2 L1C (https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S2, accessed on 18 July 2021) data. These products include radiometric and geometric corrections to generate highly accurate geolocated images without involving secondary preprocessing such as atmospheric corrections. Before feeding the network, we normalized the input data channel-wise to zero mean and unit variance.
The two Landsat-8 images of size 2145 × 3670 and the two Sentinel-2 L1C images of size 6435 × 11,010 were acquired between 1 July 2017 and 21 August 2018 (see Figure 8a,b). It is not easy to obtain cloudless optical orbital images of the Amazon rainforest for most of the year. That is why PRODES reports refer to deforestation from the dry season of one year to the dry season of the following year. The public PRODES database from which we extracted the data used in this study relies on images acquired around July/August of each year. We used all seven Landsat-8 bands with 30 m spatial resolution. As for the Sentinel-2 dataset, we used four bands (Blue, Green, Red, and NIR) with 10 m spatial resolution (see Figure 8c,d).
Experienced professionals annotated all images by visual photo-interpretation. They identified change patterns based on three main observable image features: tone, texture, and context. Additionally, they only annotated deforestation polygons with an area greater than 6.25 hectares.
PRODES adopts an incremental mapping methodology for building the deforestation maps. It involves an exclusion mask (see Figure 8e), which covers the areas deforested up to the current date and the residuals (deforestation detected in a given year but referring to the image of the previous year), and a second mask (see Figure 8f), which corresponds to areas deforested in the reference year. The exclusion mask helps the photo-interpreters to delineate exclusively polygons of recent deforestation [2]. Following the PRODES and DETER methodology to calculate accuracy, we disregarded the classification results within a two-pixel-wide buffer around the prediction and reference polygons, where both are not reliable. To be consistent, we did not consider pixels in these regions for either training or testing.
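A minimal NumPy sketch of the channel-wise normalization mentioned above follows. Whether the statistics were computed per image, per tile, or over the training set is not stated in the text, so computing them over the array passed in is an assumption; `normalize_channelwise` and the synthetic input are illustrative.

```python
import numpy as np

def normalize_channelwise(image: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale each band of an (H, W, C) image to zero mean and unit variance."""
    mean = image.mean(axis=(0, 1), keepdims=True)
    std = image.std(axis=(0, 1), keepdims=True)
    return (image - mean) / (std + eps)  # eps guards against constant bands

# Synthetic 4-band image standing in for real reflectance values.
rng = np.random.default_rng(0)
image = rng.normal(loc=50.0, scale=10.0, size=(64, 64, 4))
normalized = normalize_channelwise(image)
print(normalized.mean(axis=(0, 1)).round(6), normalized.std(axis=(0, 1)).round(6))
```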

Experimental Setup
To model the deforestation dynamics in the study area between two consecutive years, we took pairs of co-registered images acquired in 2017 and 2018, as specified in Section 3.2. We concatenated the co-registered images of each bi-temporal pair along the spectral dimension following the Early Fusion method [23], both for Landsat-8 and Sentinel-2.
We split the images into 15 non-overlapping tiles of 715 × 734 and 2145 × 2202 pixels for the Landsat-8 and Sentinel-2 datasets, respectively, and separated the tiles into three groups: 20% for training, 5% for validation, and 75% for testing.
Each tile was further split into equal-sized patches of 128 × 128 pixels yielding a total of 5824 and 17,473 patches from Landsat and Sentinel-2 datasets, respectively.
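The tile sizes quoted above are consistent with a regular 3 × 5 grid (2145 / 3 = 715 and 3670 / 5 = 734). The sketch below splits an image of the Landsat-8 size accordingly; the grid layout is inferred from that arithmetic, and the helper `split_into_tiles` is hypothetical rather than the authors' implementation.

```python
import numpy as np

def split_into_tiles(image: np.ndarray, rows: int, cols: int) -> list:
    """Split an (H, W, ...) array into rows x cols non-overlapping tiles."""
    h, w = image.shape[:2]
    if h % rows or w % cols:
        raise ValueError("image size must be divisible by the tile grid")
    th, tw = h // rows, w // cols
    return [image[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
            for r in range(rows) for c in range(cols)]

# Single-band placeholder with the Landsat-8 image size reported in the text.
landsat_like = np.zeros((2145, 3670), dtype=np.uint8)
tiles = split_into_tiles(landsat_like, rows=3, cols=5)
print(len(tiles), tiles[0].shape)  # 15 (715, 734)
```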
Both datasets are highly unbalanced. The deforestation class corresponds to less than 1% of the pixels in both datasets. To alleviate the class imbalance, we applied data augmentation using 90° rotations and horizontal and vertical flip transformations to the patches containing deforestation spots.
To further compensate for the class imbalance, we adopted the Weighted Cross-Entropy Loss. The objective was to force the FCNs to focus on the weakly represented instances by assigning them a larger weight. Equation (2) represents the Weighted Cross-Entropy Loss for binary classification, where $N$ and $M$ stand for the total number of training pixels in rows and columns, respectively, while $w_d$ stands for the weight of the deforestation class. The weight of the deforestation class relative to the forest class was empirically set to 5. Moreover, $y_{i,j}$ and $\hat{y}_{i,j}$ represent the target and predicted labels at pixel $(i, j)$.
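The augmentation described above can be sketched with NumPy as follows. The function name `augment_patch` is illustrative; in practice the same transformation has to be applied to the image patch and to its reference mask so that they stay aligned.

```python
import numpy as np

def augment_patch(patch: np.ndarray) -> list:
    """Return the three 90-degree rotations plus the horizontal and vertical flips."""
    rotations = [np.rot90(patch, k=k, axes=(0, 1)) for k in (1, 2, 3)]
    flips = [np.flip(patch, axis=1),   # horizontal flip (mirror columns)
             np.flip(patch, axis=0)]   # vertical flip (mirror rows)
    return rotations + flips

toy_patch = np.arange(4).reshape(2, 2, 1)  # tiny stand-in for a 128 x 128 patch
augmented = augment_patch(toy_patch)
print(len(augmented))  # 5
```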
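For illustration, one common form of a weighted binary cross-entropy that is consistent with the symbols defined above is L = −(1 / (N·M)) · Σ_i Σ_j [ w_d · y_{i,j} · log ŷ_{i,j} + (1 − y_{i,j}) · log(1 − ŷ_{i,j}) ]. The unit weight on the forest class and the averaging over pixels are assumptions of this sketch, which may differ in detail from the loss actually used.

```python
import numpy as np

def weighted_bce(y_true: np.ndarray, y_prob: np.ndarray,
                 w_d: float = 5.0, eps: float = 1e-7) -> float:
    """Weighted binary cross-entropy averaged over all pixels.

    w_d weights the deforestation (positive) class; the forest class is
    assumed to have weight 1.
    """
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    per_pixel = -(w_d * y_true * np.log(y_prob)
                  + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(per_pixel.mean())

labels = np.array([[1.0, 0.0], [0.0, 0.0]])
probs = np.full((2, 2), 0.5)
loss = weighted_bce(labels, probs, w_d=5.0)
print(loss)
```

With every probability at 0.5, the single deforestation pixel contributes five times as much as each forest pixel, so the mean equals (5 + 3)·ln 2 / 4 = 2·ln 2.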

Networks' Implementation
We conducted an exploratory analysis to select the hyperparameter values for each tested network, including the number of layers, the operations per layer, and the kernel sizes. The source codes are available at https://github.com/DLoboT/Change_Detection_FCNs (accessed on 18 July 2021). Table 1 presents the network configurations used in our experiments. Table 2 shows the parameters' setup in each case.
We implemented the networks using the Keras deep learning framework [40] on a hardware platform with the following configuration: Intel(R) Core(TM) i7 processor, 64 GB of RAM, and NVIDIA GeForce RTX 2080Ti GPU. We trained the models from scratch for up to 100 epochs using the Adam optimizer with a learning rate of $10^{-4}$. Training stopped when the performance measured on the validation set degraded over ten consecutive epochs. In the end, the model that exhibited the best performance on the validation set across all executed epochs was kept for the test phase. The batch size was experimentally set to 16 for all tested networks.
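The stopping rule can be sketched in plain Python. Here it is read as a patience of ten epochs without improvement of the validation loss, which is an interpretation of the text; the function name and the synthetic loss curve are illustrative.

```python
def early_stopping_epochs(val_losses, max_epochs=100, patience=10):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    stop_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        stop_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # this model would be kept
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, stop_epoch

# Synthetic curve: improves for three epochs, then stagnates.
curve = [1.0, 0.8, 0.7] + [0.75] * 20
result = early_stopping_epochs(curve)
print(result)  # (2, 12)
```

In Keras, a comparable behavior is usually obtained with the `EarlyStopping` callback (`patience=10`, `restore_best_weights=True`), although the text does not say which mechanism was used.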

Performance Metrics
The accuracy metrics adopted in our analysis were Overall Accuracy, Precision, Recall, and F1-score, as defined in Equations (3), (4) and (5).
The overall accuracy is defined as:
$$OA = \frac{tp + tn}{tp + tn + fp + fn} \qquad (3)$$
where $tp$, $tn$, $fp$, and $fn$ stand for the number of true-positive, true-negative, false-positive, and false-negative samples, respectively. The terms positive and negative refer to the deforestation and no-deforestation classes, respectively.
The F1-score is defined as:
$$F1 = \frac{2 \cdot P \cdot R}{P + R} \qquad (4)$$
where $P$ and $R$ stand for precision and recall, respectively, and are given by the ratios [41]:
$$P = \frac{tp}{tp + fp}, \qquad R = \frac{tp}{tp + fn} \qquad (5)$$
Another metric to report accuracy in our experiments is the Alarm Area (AA) [23]. It gives the proportion of the test area whose deforestation probability delivered by the network exceeds a given threshold, formally:
$$AA = \frac{tp + fp}{tp + tn + fp + fn}$$
AA occurs in our analysis in combination with Recall. By varying the deforestation probability threshold, we obtain corresponding pairs of AA and R values that can be plotted in an AA vs. R curve for each tested architecture. AA is related to the amount of human or material resources needed to more closely inspect the areas with deforestation probability above a certain threshold. R, on the other hand, gives the proportion of all deforested areas whose deforestation probability estimated by the automatic method exceeds that threshold. Therefore, the AA vs. R curve provides different tradeoffs between these metrics and supports a decision on where to direct the inspection resources in different operational scenarios.
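As a sketch of how these metrics respond to the probability threshold, the snippet below computes OA, P, R, F1, and AA from a toy set of labels and probabilities. The function name `metrics_at` and the toy data are illustrative; AA is taken here as the share of pixels whose probability exceeds the threshold, i.e., (tp + fp) over all pixels.

```python
import numpy as np

def metrics_at(y_true: np.ndarray, y_prob: np.ndarray, threshold: float) -> dict:
    """OA, precision, recall, F1 and alarm area for one probability threshold."""
    pred = y_prob > threshold
    truth = y_true.astype(bool)
    tp = int(np.sum(pred & truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    tn = int(np.sum(~pred & ~truth))
    total = tp + tn + fp + fn
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return {"OA": (tp + tn) / total, "P": p, "R": r, "F1": f1,
            "AA": (tp + fp) / total}

y_true = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.4, 0.6, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
at_50 = metrics_at(y_true, y_prob, 0.5)
at_30 = metrics_at(y_true, y_prob, 0.3)
print(at_50)
print(at_30)
```

Lowering the threshold from 0.5 to 0.3 raises Recall from 0.5 to 1.0 while the alarm area grows from 20% to 30% of the pixels, which is the kind of tradeoff an AA vs. R curve summarizes.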
We also report the networks' accuracy in terms of the Average Precision (mAP), defined as the area under the P vs. R curve.

Segmentation Accuracy for Deforestation Detection
The first bar group of Figures 9 and 10 reports the overall accuracies. All networks achieved similar scores, which reveals little about the relative performance of the tested network architectures. Those values are close to the share of the no-deforestation class, which accounts for more than 99% of the pixels in each pair of images. This may raise the question of whether the classifiers merely associated all pixels with the class no-deforestation. Figure 11a,b clarifies this issue. They show the proportion of false positives and false negatives, relative to the total number of pixels in the test set, produced by each network on the Landsat-8 and the Sentinel-2 datasets, respectively. The analysis of these plots places ResU-Net as the best-performing network in terms of the total number of classification errors, followed by FC-DenseNet. This ranking remains practically unchanged in the experiments with Landsat-8 and Sentinel-2 data. A slight reduction of classification errors for all the networks on the Sentinel-2 dataset in comparison with the Landsat-8 dataset is noteworthy.
The second and the third bar groups of Figures 9 and 10 refer to Recall and Precision. According to the Recall metric, FC-DenseNet and U-Net were consistently among the best-performing networks, followed by Xception in Figure 9 and surpassed by ResU-Net in Figure 10. MobileNetV2 and SegNet always ranked fifth and sixth in terms of Recall for both datasets. It is also remarkable that, except for Xception, the networks achieved their best Recall results on the Sentinel-2 data; MobileNetV2 and SegNet benefited the most, with gains over Landsat-8 of 3% and 2.7%, respectively.
FC-DenseNet and U-Net were among the three networks with the lowest Precision for both datasets, indicating a tendency to label non-deforested areas as deforestation (false positives). In comparison, ResU-Net was always among the two best-performing networks, at a short distance from SegNet on the Sentinel-2 dataset. Please note that SegNet and MobileNetV2 were consistently the two networks whose results varied the most over the ten runs. Comparing the datasets, Xception and SegNet benefited more than the other networks from the Sentinel-2 data, with gains of 12.4% and 8.7%, respectively.
The F1-score in the fourth bar groups of Figures 9 and 10 summarizes in a single value the Recall and Precision for each network and dataset. From the F1-score perspective, the results obtained for both datasets lead to the same conclusion: ResU-Net and FC-DenseNet achieved the highest F1-scores, followed by U-Net and Xception, with SegNet and MobileNetV2 as the worst-performing networks. Additionally, in terms of variation over the ten runs, SegNet and MobileNetV2 presented the worst behavior.
The inferior performance of SegNet could lie in the manner in which it recovers the spatial resolution in the decoder stage. Specifically, the upsampling in SegNet employs interpolation, while ResU-Net, FC-DenseNet, and U-Net use transposed convolution. Furthermore, the essential spatial information is better preserved by the skip connections of ResU-Net, FC-DenseNet, and U-Net than by the pooling indices of SegNet.
On the other hand, the DeepLabv3+ variants can encode greater contextual information; nevertheless, along with SegNet, they ranked in the last three positions in the experiments with both datasets. It would seem that the use of more contextual information did not impact these results, given the low scale variability of the deforestation polygons in our application.
As stated before, the Xception variant outperformed the MobileNetV2 variant in all cases. Since MobileNetV2 is a lightweight backbone compared with Xception, we argue that the latter produced better results due to its greater complexity. This is attested by all the quantitative results shown in Figures 9 and 10.
Comparing the two datasets, all the networks except ResU-Net performed better on Sentinel-2 in terms of F1-score. Again, we noticed that MobileNetV2 and SegNet benefited more than all other architectures.
For the results reported in Figures 9-11, we assigned to the class deforestation all pixels whose probability delivered by the corresponding network exceeded a threshold equal to 50%.
We applied McNemar's test to each pair of network models evaluated in this work, taking as the null hypothesis that the tested models presented similar performance. Each model was represented by an ensemble composed of the networks resulting from the ten training sessions. We applied "majority voting" to the results produced by the ten trained networks and thus obtained the consensus result that represented the model in the tests. Next, we applied McNemar's test to each pair of ensembles and datasets. The lowest among all computed p-values was 0.82, for SegNet and MobileNetV2 on the Sentinel-2 dataset. Thus, notwithstanding the similar OA values, the high p-values allow rejecting the null hypothesis in favor of the hypothesis that all models tested in this study performed differently.
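The two steps of this procedure, consensus by majority voting and a pairwise McNemar's test, can be sketched in plain Python. The chi-square variant with continuity correction and the tie rule for an even number of voters are assumptions, since the text does not specify them; all names and toy labels are illustrative.

```python
import math

def majority_vote(predictions):
    """Consensus of several 0/1 label lists; ties are resolved as 0 (assumption)."""
    n_models = len(predictions)
    return [int(2 * sum(votes) > n_models) for votes in zip(*predictions)]

def mcnemar(y_true, pred_a, pred_b):
    """McNemar's chi-square test with continuity correction: (statistic, p_value)."""
    only_a = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a == t and b != t)
    only_b = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a != t and b == t)
    if only_a + only_b == 0:
        return 0.0, 1.0  # the two models never disagree on correctness
    stat = (abs(only_a - only_b) - 1) ** 2 / (only_a + only_b)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

consensus = majority_vote([[1, 0, 1], [1, 1, 0], [0, 0, 1]])
truth = [1] * 10 + [0] * 10
model_a = list(truth)                # correct everywhere
model_b = [0] * 6 + truth[6:]        # wrong on the first six samples
stat, p_value = mcnemar(truth, model_a, model_b)
print(consensus, stat, p_value)
```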
We can evaluate the networks' performance for different confidence levels of deforestation alarms by changing the probability threshold. Therefore, we built the Alarm Area versus Recall curves (Figures 12a and 13a) and the Precision versus Recall curves (Figures 12b and 13b) for the experiments on the Landsat-8 and Sentinel-2 datasets, respectively. The closer each curve is to the coordinates (1,0) and (1,1), respectively, the better the performance. As shown in Figure 12a, all methods except SegNet and MobileNetV2 flagged less than 10% of the total imaged area as suspected deforestation for a Recall value around 90%. This means that all approaches but SegNet and MobileNetV2 managed to identify regions that correspond to less than 10% of the imaged area and contain 90% or more of the total deforestation spots. Hence, based on these results, a photo-interpreter could restrict the inspection to less than 10% of the input image, which would concentrate more than 90% of all deforestation occurrences.
FC-DenseNet and ResU-Net achieved the best results among all tested network architectures, with a Recall higher than 95% for a 10% Alarm Area. By this criterion, U-Net and Xception ranked between the worst and the best-performing architectures. Figure 12b also reveals that SegNet and MobileNetV2 achieved the poorest performance among the evaluated architectures. ResU-Net and FC-DenseNet stood out as the best performers, followed closely by U-Net, while Xception ranked behind the first three architectures. In comparison with Figure 12a, the inferior Xception performance can be explained by the high number of false positives produced.
The mAP values placed the networks in a similar ranking. ResU-Net, FC-DenseNet and U-Net reached mAP values above 70%, with Xception 1.3% further behind, followed by MobileNetV2 and SegNet.
The profile in Figure 13a is similar to that of Figure 12a. Again, U-Net, FC-DenseNet, ResU-Net and Xception managed to correctly identify more than 90% of the samples when looking at 10% of the image. It would seem that SegNet and MobileNetV2 also guarantee a low AA (less than 10%) when Recall values reach 90% for the Sentinel-2 dataset. This result indicates that for all the FCNs the omission errors were low and remained almost unchanged with an increase in the threshold. Figure 13b is the counterpart of Figure 12b for the experiments conducted on the Sentinel-2 dataset. It is noteworthy that, except for ResU-Net, the mAP values were slightly higher than those produced for the Landsat data, with SegNet being the network that benefited the most.

Computational Complexity
Tables 3 and 4 present the training and inference times measured on the hardware infrastructure described in Section 3.4 for the Landsat-8 and Sentinel-2 datasets, respectively. The reported training time is the median value of over ten runs. The reported inference time is each model's median prediction time for the whole image.
We trained all the architectures described in Table 2 with the same learning rate, batch size, optimizer, and patch size on both datasets. We adopted the same basic design for each architecture, as presented in Section 3.3; the only difference was the number of spectral bands in each dataset. The training and inference times for Sentinel-2 were longer because the input image was about nine times larger than the Landsat-8 image. We worked with 128 × 128 patches with 50% overlap, which contributed to the high inference time on both datasets. The results in Tables 3 and 4 place the networks in the following increasing order of processing time: (a) MobileNetV2, (b) U-Net, (c) ResU-Net, (d) FC-DenseNet, followed by Xception and SegNet in positions (e) and (f), with the longest processing times.
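The effect of patch overlap and image size on the inference workload can be illustrated with a short sketch that counts the patches a sliding window produces. The image sizes, the border handling (a final patch placed flush with the image edge) and the function names are assumptions made for illustration; the tiling code actually used in the experiments may differ.

```python
def patch_origins(length, patch=128, overlap=0.5):
    """Top-left offsets of the patches along one image axis."""
    if length < patch:
        raise ValueError("image axis is shorter than the patch size")
    stride = int(patch * (1 - overlap))
    if stride <= 0:
        raise ValueError("overlap must be smaller than 1")
    origins = list(range(0, length - patch + 1, stride))
    # Add a last patch flush with the border if the stride does not reach it.
    if origins[-1] != length - patch:
        origins.append(length - patch)
    return origins


def count_patches(height, width, patch=128, overlap=0.5):
    """Number of patches needed to cover a height x width image."""
    rows = len(patch_origins(height, patch, overlap))
    cols = len(patch_origins(width, patch, overlap))
    return rows * cols


# Hypothetical image sizes: a 1024 x 1024 scene and one three times larger
# per side, i.e., with about nine times as many pixels.
print(count_patches(1024, 1024, overlap=0.0))  # non-overlapping tiling
print(count_patches(1024, 1024))               # 50% overlap
print(count_patches(3072, 3072))               # 50% overlap, nine times the area
```

For the hypothetical 1024 × 1024 scene, a 50% overlap raises the number of forward passes from 64 to 225, and the scene with nine times the area needs 2209 patches, which is consistent with the longer inference times reported for the larger image.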
It is not surprising that MobileNetV2 was the fastest network for training and inference on both datasets, since this architecture was designed for lightweight devices. What draws attention is that U-Net showed training times close to those of MobileNetV2, only around 10% higher. We also observed that ResU-Net converged faster during training despite its high computational complexity. This is attributable to the residual connections in its architecture, which facilitate information propagation and speed up convergence. The other networks required, on at least one dataset, more than twice the training time of MobileNetV2. FC-DenseNet followed in the ranking, with SegNet and Xception alternating as the slowest architecture in terms of training time.
As for inference time, MobileNetV2 stood out even more from its counterparts. The ranking observed for training times was maintained, but the differences in inference times among the other networks were not as large. Nevertheless, SegNet and Xception took about twice as long as MobileNetV2 for inference.

Discussion
The DL-based techniques investigated in the present paper are promising for Earth Observation projects such as PRODES, which has provided the official annual deforestation maps of the Brazilian Amazon since 1988. Although they diverge from PRODES in several methodological aspects, these techniques can provide deforestation classification products to be consumed in a traditional PRODES auditing process, thus fitting the DL maps to the methodological requirements of the monitoring project.
The PRODES methodology is not based on a pixel-by-pixel classification. Instead, it uses (i) visual interpretation and manual vector editing, for historical reasons and to provide high-accuracy products; (ii) a minimum mapping area, to keep the historical series consistent; and (iii) a mapping scale of 1/75,000, because of the large extent of the Brazilian Amazon, the time-consuming visual interpretation, and because a finer scale would not significantly improve the detection of deforestation polygons larger than the minimum mapping area.
As a result, PRODES may classify small patches of remnant vegetation amid deforested areas as deforestation, and small deforested areas may remain in the original forest class. This observation extends to the edges of polygons, which interpreters delineate manually through vector editing at a fixed mapping scale of 1/75,000.
Although assessing the accuracy of DL models against the original PRODES map is relevant, the methodological differences between the two mapping techniques may lead to underestimated accuracy metrics, because such an assessment assumes that the reference map represents the absolute truth at the pixel level.
Another consequence of the methodological divergences is an artificial increase in false negatives when the PRODES minimum mapping unit is used as a threshold to filter Deep Learning polygons. The pixel-by-pixel approach tends to increase polygon fragmentation compared to the PRODES reference. A PRODES polygon larger than the minimum area might be modeled as several fragments, each smaller than the minimum area, so that the filter would improperly exclude polygons that were correctly detected.
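The fragmentation effect described above can be reproduced on a toy raster. The sketch below labels 4-connected components in a binary mask and drops those below a minimum area. It is a simplified illustration, not the PRODES procedure: the one-hectare pixel, the 4-connectivity and the function names are assumptions (for nominal 30 m and 10 m pixels, the pixel area would be 0.09 ha and 0.01 ha, respectively).

```python
from collections import deque


def components(mask):
    """4-connected components of a binary mask given as a list of lists."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not seen[r][c]:
                # Breadth-first search collects one connected fragment.
                queue = deque([(r, c)])
                seen[r][c] = True
                pixels = []
                while queue:
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                comps.append(pixels)
    return comps


def mmu_filter(mask, pixel_area_ha, min_area_ha=6.25):
    """Keep only the components larger than the minimum mapping unit."""
    kept = [[0] * len(mask[0]) for _ in mask]
    for pixels in components(mask):
        if len(pixels) * pixel_area_ha > min_area_ha:
            for y, x in pixels:
                kept[y][x] = 1
    return kept


# Toy example with a hypothetical pixel area of 1 ha.
reference = [[1, 1, 1, 1, 1, 1, 1, 1]]  # one 8 ha polygon
predicted = [[1, 1, 1, 1, 0, 1, 1, 1]]  # same polygon split into 4 ha + 3 ha
print(sum(map(sum, mmu_filter(reference, 1.0))))  # pixels kept in the reference
print(sum(map(sum, mmu_filter(predicted, 1.0))))  # pixels kept in the prediction
```

A single missed pixel splits the predicted polygon into two fragments that both fall below 6.25 ha, so the filter discards all seven correctly detected pixels while the reference polygon is kept in full.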
To neutralize the impact of these divergences on the accuracy metrics, the PRODES team of senior analysts conducted an unprecedented experiment: a complete audit of the DL maps produced by the ResU-Net network from the data of both satellites. All PRODES protocol requirements were respected. This audit made it possible (i) to estimate the accuracy of the DL maps using the resulting audited PRODES map as the reference, and (ii) to compare these estimates with the accuracy metrics reported in Section 4.1, which are based on the classical pixel-by-pixel comparison. In addition, this assessment helped to evaluate the potential of the operational use of DL mapping as products to be consumed in PRODES.
The analysts segmented the pixels that ResU-Net classified with a deforestation probability greater than or equal to 0.5 and selected the polygons larger than 1 hectare. Visual evaluation, editing and reclassification were performed by one PRODES senior analyst and systematically checked by another analyst, over the entire observed area and for all mapped polygons. During this auditing process, the analysts used the images of the mapping year and of the two preceding years to evaluate the cover change based on tone, color, form, texture and context. This allowed them to keep, add or delete deforestation polygons in a free editing process. The analysts followed the rigorous PRODES protocol to meet the same mapping requirements. After auditing, only polygons larger than 6.25 ha were kept in the analysis to meet the PRODES standards. The overall accuracy, F1-score, Recall and Precision were estimated for the DL maps, considering the audited maps as the new PRODES reference.
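For reference, the four metrics named above follow from the pixel-wise confusion counts between a DL map and the audited map. The sketch below uses the standard definitions; the function names and the toy masks are hypothetical and do not reproduce the values of Table 5.

```python
def confusion_counts(predicted, reference):
    """Pixel-wise TP, FP, FN, TN between two flat 0/1 masks."""
    tp = fp = fn = tn = 0
    for p, r in zip(predicted, reference):
        if p and r:
            tp += 1
        elif p and not r:
            fp += 1
        elif not p and r:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy_metrics(tp, fp, fn, tn):
    """Precision, Recall, F1-score and overall accuracy from the counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    overall = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "overall": overall}


# Toy masks (hypothetical): DL map versus audited map, 8 pixels each.
predicted = [1, 1, 1, 0, 0, 0, 0, 0]
audited = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = accuracy_metrics(*confusion_counts(predicted, audited))
print(metrics)
```

Because omissions enter only through the false negatives and commission errors only through the false positives, a map can combine a very high Precision with a modest Recall, which is the situation reported for the Landsat-based product.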
The estimated accuracy metrics for each satellite are presented in Table 5. As expected, the values increased or remained stable. Although its Recall was unchanged at 62.2%, the Landsat-based map reached a Precision higher than 99% according to the PRODES audit. Since PRODES considers false positives the most costly error in deforestation monitoring, this is a positive result. The accuracy estimates for the Sentinel-based map showed a substantial increase in both Precision and Recall, reaching 82.3% and 74.2%, respectively, which positively impacted the F1-score. The omission area mapped during the auditing process reached 12.5 km² and 6.2 km² for the Landsat and Sentinel maps, respectively. These results mean that the effort of editing the original DL maps was mainly related to mapping omissions in the Landsat-based product, while removing false positives also required substantial effort in the Sentinel-based product. This might be explained by the fact that the samples used in model training were collected from the official PRODES deforestation map, which is based on Landsat images. Such samples would thus present some class divergences when used with images of finer spatial resolution.
The divergences between the DL and audited maps, after filtering by the minimum mapping unit (>6.25 ha), fall into two main categories. The first is detection errors, which are model-dependent and encompass false positives and false negatives. The false positives were observed almost exclusively in the Sentinel audited map and mainly concerned forest polygons that had suffered degradation (88.8%), a process that generally precedes deforestation (Figure 14a). The larger area of false positives in Sentinel might result from the finer spatial resolution, which could have enhanced vegetation changes for the model but not for visual inspection at a 1/75,000 scale. After checking, we noticed that more than 90% of these Sentinel false positives had been classified as deforestation on the Landsat DL map, confirming that the higher resolution could have led to this divergence. The false negatives were mainly located just inside polygon borders, where the estimated probability did not reach the 0.5 threshold, unlike in the core area of the polygon (Figure 14b). This might be explained by the conservatism of the model in detecting deforestation and by delineation divergences introduced by the analyst during visual detection and manual mapping at the 1/75,000 scale. Additionally, we noticed that the smaller the deforested polygon, the higher the omission rate of the model. Recall dropped to 21.9% and 42.1% for the Landsat and Sentinel maps, respectively, when only polygons smaller than 6.25 ha were considered.
The second category of errors derives from shifts between the images and the PRODES deforestation mask data. It often involves at most two pixels along part of the polygon border (Figure 14c).
To estimate the effective impact of the satellite data source on the final PRODES maps derived from DL techniques, we compared the total deforestation area detected in the two maps after auditing. The area reached 32.9 km² and 23.7 km² for Landsat and Sentinel, respectively. This difference is probably related to the higher spatial resolution of Sentinel, which would tend to increase polygon fragmentation and, when the minimum mapping unit filter is applied, improperly exclude part of the correctly detected polygons, compared to the Landsat map. A spatial union of the two audited maps showed that 50.2% of the total area of the Landsat and Sentinel final deforestation polygons matched. A total of 62.8% of the Landsat-derived polygons matched Sentinel ones, and 71.5% of the Sentinel polygons matched Landsat ones. These results showed that Landsat-based DL detection has a higher potential than the Sentinel product to be used as a draft in PRODES mapping when Landsat-based sampling and the PRODES minimum mapping unit are used.
This unprecedented experiment showed a substantial improvement in the accuracy metrics of both the Landsat and Sentinel ResU-Net maps under PRODES auditing, relative to the classical estimation in Section 4.1. These results indicate that fully convolutional networks, ResU-Net in particular, are promising tools to provide a first draft for the PRODES project, but auditing efforts should focus on omissions, which occurred at a significantly higher rate than false positives, mainly in Landsat-based maps.
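The agreement between two audited maps can be sketched as the overlap between two binary masks. The example below assumes that both maps have been resampled to a common grid and that "matched" area is measured as intersection over union; the exact matching rule behind the percentages reported above is not specified in this section and may differ (for instance, it may count polygons rather than pixels). The function name and the toy masks are hypothetical.

```python
def area_agreement(mask_a, mask_b):
    """Overlap statistics between two binary masks on the same grid."""
    a = {(r, c) for r, row in enumerate(mask_a) for c, v in enumerate(row) if v}
    b = {(r, c) for r, row in enumerate(mask_b) for c, v in enumerate(row) if v}
    inter = a & b
    union = a | b
    return {
        # Share of the combined mapped area found in both maps.
        "matched_over_union": len(inter) / len(union) if union else 0.0,
        # Share of each map that is also present in the other one.
        "a_matched": len(inter) / len(a) if a else 0.0,
        "b_matched": len(inter) / len(b) if b else 0.0,
    }


# Toy masks (hypothetical): two maps that overlap on two of four mapped pixels.
map_a = [[1, 1, 1, 0], [0, 0, 0, 0]]
map_b = [[0, 1, 1, 1], [0, 0, 0, 0]]
agreement = area_agreement(map_a, map_b)
print(agreement)
```

The union-based figure is always the lowest of the three, since it is penalized by the areas mapped in only one of the two products.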

Conclusions
In this work, we compared six fully convolutional architectures (U-Net, ResU-Net, SegNet, FC-DenseNet and the Xception and MobileNetV2 variants of DeepLabv3+) for detecting deforestation in the Brazilian Amazon rainforest from Landsat-8 and Sentinel-2 image pairs.
We evaluated the networks' performance based on different accuracy metrics, computational complexity, and visual assessment. The analysis also considered different confidence levels of deforestation alarms and their implications in terms of false negatives. We did so by varying the probability threshold above which we regarded a pixel as belonging to the deforestation class and recording the corresponding false-positive vs. false-negative values.
The study revealed the potential of the tested networks as an automatic alternative to deforestation mapping programs for the Amazon region that still involve a lot of visual interpretation.
ResU-Net consistently presented the best accuracy among all tested networks, being closely followed by FC-DenseNet. The assessment of how the relation of false-positives vs. false-negatives behaved for different probability thresholds also indicated similar performances of these two networks, but again with ResU-Net's slight superiority.
As for the associated computational complexity, U-Net was surpassed only by MobileNetV2, which, on the other hand, together with SegNet, achieved the worst accuracy values throughout our experiments.
The experiments on Landsat-8 and Sentinel-2 data led to similar conclusions regarding the networks' relative performance.
In sum, throughout our experiments, ResU-Net consistently presented the best tradeoff between accuracy and training/inference times. On the other hand, MobileNetV2 and especially SegNet presented the worst results among the evaluated networks.
The Brazilian Amazon biome covers 4.2 million km², has around 2500 tree species and another 30,000 plant species, and is far from being a homogeneous forest. The study presented here constitutes a step towards making the monitoring of the Amazon rainforest more agile, less subjective, and more accurate. Definitive and generally valid conclusions regarding the pros and cons of automatic mapping methods require further studies using data that represent all this diversity. We will be moving towards this goal in the continuation of this research.

Data Availability Statement:
The data presented in this study are openly available in the USGS archives and Copernicus Open Access Hub at https://earthexplorer.usgs.gov/ (accessed on 18 July 2021) and https://scihub.copernicus.eu/dhus/#/home (accessed on 18 July 2021), respectively.