Evaluation of Deep Learning Techniques for Deforestation Detection in the Brazilian Amazon and Cerrado Biomes From Remote Sensing Imagery

Deforestation is one of the major threats to natural ecosystems. This process contributes substantially to climate change and biodiversity loss. Therefore, the monitoring and early detection of deforestation are essential for preservation. Techniques based on satellite images are among the most attractive options for this application. However, many approaches involve some human intervention or depend on a manually selected threshold to identify regions that suffer deforestation. Motivated by this scenario, the present work evaluates Deep Learning-based strategies for automatic deforestation detection, namely, Early Fusion (EF), Siamese Network (SN), and Convolutional Support Vector Machine (CSVM), as well as a Support Vector Machine (SVM) used as the baseline. The target areas are two regions with different deforestation patterns: the Amazon and Cerrado biomes in Brazil. The experiments used two co-registered Landsat 8 images acquired at different dates. The Deep Learning-based strategies achieved the best performance in our analysis in comparison with the baseline, with SN and EF superior to CSVM and SVM. Likewise, the salt-and-pepper effect in the generated probabilistic change maps decreased as the number of training samples increased. Finally, the work assesses how the methods can reduce the time invested in the visual inspection of deforested areas.


Introduction
Deforestation is one of the largest sources of anthropogenic CO₂ emissions. It is a wide-reaching problem that involves reduced carbon storage, greenhouse gas emissions, and other environmental issues such as biodiversity loss [1]. One of the highest deforestation rates currently occurs in South America [2], where the most significant tree-loss statistics are concentrated in Brazil [3,4]. Brazil contains most of the Amazon rainforest, about 60% of its total area [5]. In particular, the Amazon and Cerrado biomes cover the largest portions of the Brazilian territory, about 49% and 24%, respectively, together comprising around 6.2 million square kilometers. In both biomes, deforested areas are predominantly converted to pasture [6,7], in addition to a strong expansion of soy in the Cerrado biome [8]. These biomes have different characteristics and host a rich biodiversity of endemic species, many of them vulnerable [9]. Therefore, the conservation of the Amazon and Cerrado biomes is essential for the future of our planet.
For several decades, the Amazon ecosystems have been threatened by disorderly economic growth. The main causes are the growth of agribusiness and certain mining and logging activities, [...] mapping area of 1 ha is considered. The process is entirely manual and is carried out through visual interpretation, taking into account five main elements: color, tonality, texture, shape, and context. TerraClass Cerrado is another project, coordinated by Brazil's environment ministry in cooperation with the Brazilian Agricultural Research Corporation and INPE. It maps the land use and land cover of deforested areas in the Brazilian Cerrado, enabling the characterization of the areas mapped by PRODES, using satellite images from 2013 (http://www.dpi.inpe.br/tccerrado/). MapBiomas (http://mapbiomas.org/) is another initiative, which maps land use and land cover across the Brazilian territory from 1985 onward. The methodology adopted by MapBiomas is presented in [38]. Its dataset comprises images of all Brazilian biomes collected by Landsat 5-6-7 sensors. The methodology also involves extracting statistical features to train a Random Forest (RF) classifier. In a post-classification stage, spatial and temporal filters remove classification noise and fill information gaps caused by clouds. All procedures are performed using Google Earth Engine. However, the final validation stage is based on visual interpretation, and the resulting forest area is overestimated.
Further work has been developed by the remote sensing (RS) community. Bueno et al. [9] presented a study on detecting deforestation in areas of the Cerrado biome. They adopted the object-based image analysis (OBIA) methodology, applied to Landsat Operational Land Imager (OLI) time series, to identify the spectral bands and vegetation indices that best discriminate true deforestation from seasonal changes, using an RF classifier. Likewise, Machado et al. [39] mapped deforested areas using images from the Moderate Resolution Imaging Spectroradiometer (MODIS) sensor, with a maximum likelihood classifier implemented in the ERDAS software. Moreover, Picoli et al. [40] presented a land use and land cover classification based on satellite image time series, in which an SVM model discriminated natural and human-transformed land areas in the state of Mato Grosso.
More recently, Deep Learning (DL) techniques have become the state of the art in many application fields, including RS. Deep Neural Networks (DNNs) can extract representations at multiple levels, which usually yields more informative features and often better results than domain-specific handcrafted features. Zagoruyko and Komodakis [41] introduced several CNN models that learn the similarity between pairs of images subject to geometric transformations, illumination changes, and other variations. The authors reported promising results in comparison with traditional methods that rely on handcrafted features. Among these models are the Early Fusion, Siamese, and Pseudo-Siamese CNNs, which were later used by Caye et al. [42] for urban change detection. In that work, the authors compared the Siamese CNN and Early Fusion CNN techniques and evaluated the impact of using different spectral channels as inputs.
In a similar work [43], a Siamese CNN was successfully applied to detect changes in objects such as buildings and trees and to discriminate real changes from false changes caused by inaccurate registration or alignment. To this end, the patches labeled as "change" were grouped and verified as individual object changes.
Zhan et al. [27] proposed a supervised change detection method based on a deep Siamese CNN for optical aerial images. As in previous approaches, the authors refined the preliminary classification results in a post-processing stage: the score map produced by the SN is segmented by thresholding, and the resulting segments are then classified with a k-nearest neighbor (k-NN) approach. Similarly, the authors of [44] integrated the advantages of CNNs and recurrent neural networks (RNNs) to learn joint spectral-spatial-temporal features for multispectral image change detection, achieving encouraging results.

Goals and Contributions
This work evaluates Deep Learning-based techniques applied to deforestation detection in two tropical regions with different deforestation patterns: the Amazon and Cerrado biomes.
For comparison purposes, we used an SVM classifier as the baseline. The main objective is to reduce the human effort involved in monitoring programs such as the Amazon Deforestation Monitoring Project and the Cerrado Monitoring Project. In addition, this work aims to improve accuracy and reduce the subjectivity inherent to human photointerpretation, as well as the cost and time of monitoring the vegetation of these biomes, which meets the need for constant improvement of monitoring instruments [45].
The main contributions of this work are:
• An evaluation and comparison of three Deep Learning techniques for automatic deforestation detection in the Brazilian Amazon and Cerrado biomes, namely, Early Fusion (EF), Siamese Network (SN), and Convolutional SVM (CSVM).
• An assessment of these methods' accuracy under scarce training samples.
• An estimation, for each method, of the relation between the area flagged as deforestation and the area of true deforestation.
The rest of the paper is structured as follows. Section 2 describes the assessed methods for deforestation detection, the study areas, and the experimental protocol. Next, the experimental results are presented and discussed in Section 3. Finally, conclusions and future work are summarized in Section 4.

Materials and Methods
This section explains the three DL-based approaches investigated in the present analysis for detecting deforestation from optical images: Early Fusion (EF), Siamese Network (SN), and Convolutional SVM (CSVM).
For all methods, the inputs are pairs of co-registered optical images acquired at two different dates, denoted T1 and T2 henceforth. The classifier receives an image patch as input, and its output is assigned to the patch central pixel. A sliding-window approach is adopted to classify all pixels of the target site.
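The sliding-window scheme can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: `classify_all_pixels` and the callable `classify_patch` are hypothetical names, and reflect padding at the image borders is an assumption, since the text does not say how border pixels are handled.

```python
import numpy as np


def classify_all_pixels(image, classify_patch, h=15):
    """Slide an h-by-h window over every pixel of an (H, W, C) image.

    `classify_patch` is a hypothetical callable mapping one (h, h, C) patch
    to a scalar (e.g. a deforestation probability); the result is written
    to the pixel at the patch centre.
    """
    assert h % 2 == 1, "an odd patch size gives a well-defined central pixel"
    r = h // 2
    # Assumption: reflect-pad so that border pixels also get a full patch.
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="reflect")
    H, W = image.shape[:2]
    out = np.zeros((H, W), dtype=float)
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + h, j:j + h, :]   # centred on image[i, j]
            out[i, j] = classify_patch(patch)
    return out
```

In practice, patches would be classified in batches rather than one at a time; the double loop is kept only to make the centre-pixel assignment explicit.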

Early Fusion (EF)
This model can be regarded as an extension of a regular CNN. It is composed of a series of convolutional and pooling layers followed by fully connected (FC) layers, the last of which is a softmax layer with as many outputs as there are classes. The softmax layer assigns a posterior probability to each class, and these probabilities sum to 1. For deforestation detection, this layer has two outputs, corresponding to the "deforestation" and "no-deforestation" classes, and the final label is the class with the maximum probability.
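The softmax decision rule described above can be illustrated with a short NumPy sketch; the logits are made-up values, and the order of the two classes is an assumption.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis; each row sums to 1."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


# Assumption: output order of the two-unit softmax layer.
CLASSES = ("no-deforestation", "deforestation")

# Made-up final-layer outputs (logits) for two patches.
logits = np.array([[2.0, -1.0],
                   [0.3, 1.5]])
probs = softmax(logits)
labels = [CLASSES[k] for k in probs.argmax(axis=-1)]  # maximum-probability class
```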
The EF architecture used in this work was inspired by the CNN model proposed in [42] for detecting changes in urban areas, for which good performance was reported. This CNN takes as input the stack formed by concatenating both images (T1 and T2) along the spectral dimension. Each patch is therefore a tensor of size h-by-h-by-2c, where h is the patch height and width and 2c is its depth, c being the number of spectral channels of each image. Figure 1 outlines the procedure.

Siamese Network (SN)
A Siamese Network can also be regarded as an extension of a conventional CNN. The particular network design used in this analysis was adapted from [42]. It consists of two subnetworks that share the same hyperparameters and weights [43]. Each patch of a pair of corresponding patches feeds one subnetwork; consequently, the descriptor vectors of the two patches are computed by the same model [27]. The descriptor vectors delivered by the subnetworks are concatenated to produce the final feature vector, which is forwarded to a two-layer decision network [46] that assigns the label "deforestation" or "no-deforestation" to the central pixel of the input patch pair. The whole procedure is illustrated in Figure 2.
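The defining property of the SN, weight sharing between the two branches, can be illustrated with a toy NumPy forward pass. This is only a sketch: each branch is reduced to a single linear layer with ReLU standing in for the convolutional subnetwork, the weights are random, and the descriptor length `d` and the hidden width of the decision network are hypothetical values not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
h, c = 15, 8   # patch size and bands per date, as in the experimental setup
d = 32         # descriptor length: hypothetical

# One weight matrix used by BOTH branches: the weight sharing that makes the
# network Siamese. Flatten + linear + ReLU stands in for the conv subnetwork.
W_shared = rng.normal(scale=0.01, size=(h * h * c, d))


def branch(patch):
    """Toy subnetwork mapping an (h, h, c) patch to a d-dim descriptor."""
    return np.maximum(patch.reshape(-1) @ W_shared, 0.0)


# Hypothetical two-layer decision network on the concatenated descriptors.
W1 = rng.normal(scale=0.1, size=(2 * d, 16))
W2 = rng.normal(scale=0.1, size=(16, 2))


def siamese_forward(patch_t1, patch_t2):
    """Class probabilities for the central pixel of a T1/T2 patch pair."""
    feat = np.concatenate([branch(patch_t1), branch(patch_t2)])  # shape (2 * d,)
    hidden = np.maximum(feat @ W1, 0.0)
    logits = hidden @ W2
    e = np.exp(logits - logits.max())
    return e / e.sum()  # assumed order: [no-deforestation, deforestation]
```

Because both dates pass through `W_shared`, swapping T1 and T2 only swaps the two halves of the concatenated feature vector; the decision network is what makes the final output order-dependent.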

Convolutional SVM (CSVM)
The CSVM, proposed by [47], is an alternative DL approach based on SVMs. This method was tested for object detection from Unmanned Aerial Vehicle (UAV) imagery and performed well in the task of discerning between instances of the object of interest and the background. Analogous to a traditional CNN, a CSVM architecture is composed of a set of convolutional and pooling layers followed by a classification layer at the end [47]. In contrast to a CNN, however, CSVM does not use the backpropagation algorithm during training; instead, it trains sets of linear SVMs in a layerwise fashion. The intended advantage of this method is its performance in classification tasks where very few training samples are available.
Similar to the EF method, the two input images (T1 and T2) are concatenated along their spectral dimension. Again, as in EF, in the CSVM approach, we classify patches of size h-by-h-by-2c whose classification output is assigned to the patch central pixel. Next, we describe how the method proposed by Bazi and Melgani [47] for image classification was adapted in this work for pixel-wise deforestation detection.

Construction of Training Set
After patch extraction, the training set for learning the SVM filters is created. The extracted input patches are split into non-overlapping rectangular sections, called mini-patches, of size h1-by-h1-by-2c, which are vectorized to form the global training set. This procedure is illustrated in Figure 3a.
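The mini-patch construction can be sketched in NumPy as below. The function name is hypothetical, and dropping leftover rows and columns when h is not a multiple of h1 is an assumption; with the 15-by-15-by-16 patches and 3-by-3 mini-patches used later in the experiments, the split is exact.

```python
import numpy as np


def mini_patches(patch, h1):
    """Split an (h, h, d) patch into non-overlapping h1-by-h1-by-d blocks and
    vectorize each one; returns an array of shape (num_blocks, h1 * h1 * d).

    Assumption: rows/columns left over when h is not a multiple of h1 are
    dropped (the text does not say how remainders are handled).
    """
    h, w, d = patch.shape
    nb_r, nb_c = h // h1, w // h1
    cropped = patch[:nb_r * h1, :nb_c * h1, :]
    # (nb_r, h1, nb_c, h1, d) -> (nb_r, nb_c, h1, h1, d): one block per (row, col)
    blocks = cropped.reshape(nb_r, h1, nb_c, h1, d).swapaxes(1, 2)
    return blocks.reshape(nb_r * nb_c, h1 * h1 * d)
```

For a 15-by-15-by-16 input and h1 = 3, this yields 25 vectors of length 144 per patch, which are pooled over all training patches to form the global training set.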

Training the SVM Filter Bank
After the global training set is built, m subsets of randomly selected samples are created to train m SVM filters. Each subset is composed of n samples per class, randomly selected from the global training set. The weights of the SVM filters are learned layer by layer through conventional forward supervised learning in a greedy fashion. To make the most of the available training samples and to avoid data duplication across subsets, in our study, n was set to the ratio between the number of training samples (N) and the number of SVM filters (m), i.e., n = N/m.
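The subset construction can be sketched as follows. The function name is hypothetical, and we assume that N counts the samples of the smaller class, so that each of the m disjoint subsets holds n = N // m samples of every class. Each subset would then train one linear SVM filter (the experiments below use Liblinear for this step).

```python
import numpy as np


def filter_bank_subsets(labels, m, seed=0):
    """Split sample indices into m disjoint subsets, one per SVM filter.

    Assumption: N is the number of samples of the smallest class, and every
    subset gets n = N // m samples of each class. Sampling without
    replacement keeps the subsets free of duplicated samples.
    """
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    per_class = {k: rng.permutation(np.flatnonzero(labels == k)) for k in classes}
    N = min(len(idx) for idx in per_class.values())
    n = N // m
    subsets = []
    for s in range(m):
        parts = [per_class[k][s * n:(s + 1) * n] for k in classes]
        subsets.append(np.concatenate(parts))
    return subsets, n
```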

Generation of Feature Maps
In this stage, the input patch pairs are convolved with the learned SVM filters to generate the feature maps, which are fed to a pooling layer followed by a non-linear activation function. The output is the input to the next convolution layer (see Figure 4). The procedure is repeated until the desired number of layers is reached.
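One CSVM layer can be sketched in NumPy as follows: each learned linear SVM (weights w, bias b) acts as a convolution filter whose response at every position is its decision value w·x + b, followed by max pooling and a ReLU. The stride of one matches the experimental setup; the 2-by-2 pooling window is an assumption, and the function name is hypothetical.

```python
import numpy as np


def svm_conv_layer(x, W, b, h1=3, pool=2, pool_stride=1):
    """One CSVM layer: 'valid' convolution with SVM filters, max pooling, ReLU.

    x: (H, W, d) input; W: (m, h1 * h1 * d) weights of m linear SVMs;
    b: (m,) biases. Convolution stride is 1. Assumption: pooling window of
    size `pool` (the text only states a pooling stride of one).
    """
    H, Wd, _ = x.shape
    Ho, Wo = H - h1 + 1, Wd - h1 + 1
    conv = np.empty((Ho, Wo, W.shape[0]))
    for i in range(Ho):
        for j in range(Wo):
            v = x[i:i + h1, j:j + h1, :].reshape(-1)  # same ordering as the mini-patch vectors
            conv[i, j] = W @ v + b                    # decision values of all m SVMs
    Hp = (Ho - pool) // pool_stride + 1
    Wp = (Wo - pool) // pool_stride + 1
    pooled = np.empty((Hp, Wp, W.shape[0]))
    for i in range(Hp):
        for j in range(Wp):
            r, c = i * pool_stride, j * pool_stride
            pooled[i, j] = conv[r:r + pool, c:c + pool].max(axis=(0, 1))
    return np.maximum(pooled, 0.0)  # ReLU
```

Stacking layers simply feeds the output back in: with 12 filters, a 15-by-15-by-16 patch becomes a 12-by-12-by-12 map, whose 3-by-3-by-12 mini-patches then train the next layer.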

Classification
As mentioned before, the feature maps obtained in the last layer are fed to a final binary SVM classifier that assigns the class label of each patch central pixel, either "deforestation" or "no-deforestation". In the original CSVM, the final descriptor is a vector containing the means of the four quadrants of the input feature map. In contrast, in our approach, the feature descriptor is the vector obtained by flattening the output of each convolutional layer. This choice was made after an experimental analysis in which it yielded better performance as well as shorter inference time.
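To make the difference between the two descriptors concrete, here is a small NumPy sketch: `quadrant_means` follows the description of the original CSVM descriptor given above (per-channel means of the four quadrants), and `flattened` is the variant adopted here. Both names are hypothetical, and splitting odd-sized maps at `H // 2` is an assumption.

```python
import numpy as np


def quadrant_means(fmap):
    """Original-CSVM-style descriptor: per-channel mean of each of the four
    spatial quadrants (length 4 * m for an (H, W, m) map).
    Assumption: odd sizes are split at H // 2 and W // 2."""
    H, W, _ = fmap.shape
    hr, wc = H // 2, W // 2
    quads = [fmap[:hr, :wc], fmap[:hr, wc:], fmap[hr:, :wc], fmap[hr:, wc:]]
    return np.concatenate([q.mean(axis=(0, 1)) for q in quads])


def flattened(fmap):
    """Descriptor used in this work: the whole feature map flattened (H * W * m)."""
    return fmap.reshape(-1)
```

For instance, a 9-by-9-by-12 map gives 48 values with quadrant means but 972 with flattening, so the flattened descriptor keeps the spatial layout that the quadrant means average away.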

Study Areas
Two study areas in Brazilian biomes were selected: the first is located in the Amazon biome, and the second in the Cerrado biome. Each is described below.

Amazon Biome
The first study area is a region of the Amazon biome located in Pará State, Brazil, centered on 03°17'23" S and 050°55'08" W (Figure 5). Pará State comprises 26% of the Brazilian Amazon [48], and most of it is covered with dense tropical rainforest. This area has undergone a continuous degradation process, as indicated by PRODES and DETER reports [33].
The reference change map used in our experiments refers to the deforestation that occurred between August 2016 and July 2017. It was downloaded from the INPE website and is freely available in the PRODES database (available at http://terrabrasilis.dpi.inpe.br/map/deforestation). For this reference, the following considerations were taken into account:
• Polygons of areas deforested in previous years (before August 2016) were disregarded.
• An external buffer of two pixels around the polygons of class "deforestation" was not considered for training, validation, or test, to avoid the impact of variations among the photointerpreters' delineations.
• Areas smaller than 6.25 ha (69 pixels) were not considered in our evaluation because PRODES data does not record deforested areas smaller than that for the Amazon biome.
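The pixel counts follow from the 30 m resolution: one pixel covers 900 m², i.e., 0.09 ha, so 6.25 ha corresponds to about 69 pixels and 1 ha (the Cerrado threshold) to about 11. The sketch below does this conversion and marks reference polygons below the threshold so they can be excluded from evaluation. The function names are hypothetical, and 4-connectivity for delimiting polygons is an assumption.

```python
from collections import deque

import numpy as np

PIXEL_AREA_HA = 30 * 30 / 10_000  # one 30 m x 30 m Landsat pixel = 0.09 ha


def min_pixels(area_ha):
    """Convert a minimum mapping area in hectares to whole 30 m pixels (truncated)."""
    return int(area_ha / PIXEL_AREA_HA)


def small_polygon_mask(reference, min_px):
    """Mark pixels of 'deforestation' components with fewer than min_px pixels.

    Assumption: polygons are 4-connected components of the reference mask.
    Marked pixels are meant to be ignored, not relabeled.
    """
    ref = np.asarray(reference, dtype=bool)
    H, W = ref.shape
    seen = np.zeros((H, W), dtype=bool)
    ignore = np.zeros((H, W), dtype=bool)
    for si in range(H):
        for sj in range(W):
            if not ref[si, sj] or seen[si, sj]:
                continue
            component, queue = [], deque([(si, sj)])
            seen[si, sj] = True
            while queue:  # breadth-first flood fill of one component
                i, j = queue.popleft()
                component.append((i, j))
                for a, b in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                    if 0 <= a < H and 0 <= b < W and ref[a, b] and not seen[a, b]:
                        seen[a, b] = True
                        queue.append((a, b))
            if len(component) < min_px:
                for i, j in component:
                    ignore[i, j] = True
    return ignore
```

Where SciPy is available, `scipy.ndimage.label` can replace the hand-written flood fill.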
The dataset is composed of two Landsat 8-OLI scenes with a spatial resolution of 30 m, obtained from the United States Geological Survey (USGS). After atmospheric correction, the images were clipped around the selected area. The resulting images had a size of 1100 × 2600 pixels and seven spectral bands: Coastal/Aerosol, Blue, Green, Red, NIR, SWIR-1, and SWIR-2. The acquisition dates were 2 August 2016 and 20 July 2017 (see Figure 5). They were selected based on the PRODES reference date: PRODES computes the annual deforestation rate from August 1st of each year, during the dry season (June to September), when cloud cover, a major problem over the whole Brazilian Legal Amazon (BLA) region, is minimal.

Cerrado Biome
The second study area belongs to the Brazilian Cerrado biome and is located in Maranhão State, Brazil, centered on 04°58'53" S and 043°49'41" W. Figure 6 illustrates this study area. The state of Maranhão lies in a transition area among three biomes: Cerrado (64%), Amazon (35%), and Caatinga (1%), with a predominance of savanna formations in the Cerrado. Because of this transition, the Cerrado of Maranhão (Cerrado Maranhense) ranges from dense tree formations, known as "Cerradão", to more open formations with low shrubs and vegetation with twisted trunks and thick bark, typical of a "stricto sensu" savanna. This Cerrado vegetation has undergone significant agricultural expansion, mostly over native vegetation [49], and deforestation in this biome has also been monitored by PRODES (available at http://www.obt.inpe.br/cerrado/).
The dataset is also composed of two Landsat 8-OLI images with seven spectral bands, pre-processed in the same way as the Amazon biome dataset. The images have a size of 1719 × 1442 pixels. For this database, the first image is from 3 September 2017 and the second from 22 September 2018. Since the reference provided by PRODES is also from the dry season, it does not contain all the deforested areas; the reference was therefore adapted as follows:
• Some areas deforested after the PRODES report were included in the reference. The added polygons were reviewed and approved by an expert photointerpreter. The final reference change map of the Cerrado is presented in Figure 6.
• An external buffer of two pixels around the samples of class "deforestation" was not considered in our evaluation, to avoid the aforesaid inaccuracy problem along the borders.
• Areas smaller than 1 ha (11 pixels) were not considered in the computation of the accuracy metrics because PRODES data does not consider deforested areas smaller than this value for the Cerrado biome.

Experimental Setup
For all methods, two optical images acquired at different dates were used. In addition, the Normalized Difference Vegetation Index (NDVI), an indicator of the quantity and condition of vegetation, was computed from Landsat 8 bands 5 and 4, which correspond to the near-infrared (NIR) and red (Red) ranges, respectively:
$$\mathrm{NDVI} = \frac{NIR - Red}{NIR + Red} \qquad (1)$$
This index was appended to each image along the spectral dimension, so that each final image formed a tensor with a depth of eight. Then, each spectral band was normalized to zero mean and unit variance.
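The preprocessing above can be sketched in NumPy as follows. The band order (Landsat 8 OLI bands 1 to 7, so Red and NIR at indices 3 and 4) and per-image normalization statistics are assumptions; the text does not say whether the statistics were computed per image or jointly over both dates. The function name is hypothetical.

```python
import numpy as np

# Assumption: band order Coastal/Aerosol, Blue, Green, Red, NIR, SWIR-1, SWIR-2
# (Landsat 8 OLI bands 1-7), so Red (band 4) and NIR (band 5) are at 3 and 4.
RED, NIR = 3, 4


def add_ndvi_and_normalize(img, eps=1e-12):
    """(H, W, 7) reflectances -> (H, W, 8) stack with NDVI appended, each band
    z-scored with its own mean and standard deviation (per-image: an assumption)."""
    red, nir = img[..., RED], img[..., NIR]
    ndvi = (nir - red) / (nir + red + eps)          # Equation (1); eps avoids 0/0
    stack = np.concatenate([img, ndvi[..., None]], axis=-1)
    mean = stack.mean(axis=(0, 1), keepdims=True)
    std = stack.std(axis=(0, 1), keepdims=True)
    return (stack - mean) / (std + eps)
```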
The patch size was selected empirically as 15-by-15. Thus, the input of EF and CSVM was a tensor of size 15-by-15-by-16, the input of each SN subnetwork a tensor of size 15-by-15-by-8, and the input of SVM a vector of length 15 × 15 × 16. Patches were extracted with an overlapping sliding window with a stride of three, also selected empirically. In all methods, the input was an image patch, and the classification outcome was assigned to the patch central pixel.
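A sketch of how the training patch grid and the per-method inputs could be built from two 8-band images, under the assumption that only windows lying fully inside the image are used; function names are hypothetical.

```python
import numpy as np


def patch_centres(H, W, h=15, stride=3):
    """Centres of all h-by-h windows fully inside an H-by-W image, on a
    grid with the given stride (windows overlap whenever stride < h)."""
    r = h // 2
    rows = range(r, H - r, stride)
    cols = range(r, W - r, stride)
    return [(i, j) for i in rows for j in cols]


def method_inputs(t1, t2, i, j, h=15):
    """Inputs for one centre pixel, with c = 8 bands (7 OLI bands + NDVI) per date."""
    r = h // 2
    p1 = t1[i - r:i + r + 1, j - r:j + r + 1]   # (15, 15, 8)
    p2 = t2[i - r:i + r + 1, j - r:j + r + 1]   # (15, 15, 8)
    ef = np.concatenate([p1, p2], axis=-1)       # EF and CSVM: (15, 15, 16)
    sn = (p1, p2)                                # SN: one (15, 15, 8) patch per branch
    svm = ef.reshape(-1)                         # SVM: vector of 15 * 15 * 16 = 3600
    return ef, sn, svm
```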
As the number of available samples of class "no-deforestation" was significantly higher (97% for the Amazon biome and 95% for the Cerrado biome) than that of class "deforestation" (3% for the Amazon biome and 5% for the Cerrado biome), data augmentation was applied to the "deforestation" samples: 90° rotation and horizontal and vertical flips. To balance the number of samples per class, we further undersampled the class "no-deforestation".
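The balancing step can be sketched as follows: the minority "deforestation" patches are augmented with the three listed transforms (keeping the originals is an assumption), and the majority "no-deforestation" class, which is the one that remains larger given the class proportions above, is randomly undersampled to the same count. Function names are hypothetical.

```python
import numpy as np


def augment(patch):
    """Original patch plus the three transforms listed in the text
    (keeping the original is an assumption)."""
    return [patch,
            np.rot90(patch, k=1, axes=(0, 1)),   # 90-degree rotation
            np.flip(patch, axis=1),              # horizontal flip (left-right)
            np.flip(patch, axis=0)]              # vertical flip (top-bottom)


def balance(pos_patches, neg_patches, seed=0):
    """Augment the minority 'deforestation' class, then randomly undersample
    the majority 'no-deforestation' class to the same number of patches."""
    rng = np.random.default_rng(seed)
    pos = [a for p in pos_patches for a in augment(p)]
    size = min(len(pos), len(neg_patches))
    keep = rng.choice(len(neg_patches), size=size, replace=False)
    neg = [neg_patches[k] for k in keep]
    return pos, neg
```

Changing the seed draws a different "no-deforestation" subset, which is one way to obtain runs that differ only in the selected majority-class samples.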
Tables 1 and 2 present the number of available patch pairs for training, validation, and test for the Amazon and Cerrado databases. These tables also present the number of patches of both classes, "deforestation" and "no-deforestation", obtained after applying the balancing procedure. The EF network architecture consisted of three convolutional layers (Conv) with Rectified Linear Unit (ReLU) activation, two Max-pooling layers (MaxPool), and two Fully Connected layers (FC), with a softmax layer at the end with two outputs, corresponding to the "deforestation" and "no-deforestation" classes. The filter and output size of each layer are summarized in Table 3.
Figure 8. Distribution of the Cerrado database. The region was divided into fifteen tiles. Four tiles were used for training (1, 5, 12, 13) and two for validation (6, 10). The remaining tiles were used for testing (2, 3, 4, 7, 8, 9, 11, 14, 15). The polygons indicate deforested areas.
Regarding the SN architecture, each subnetwork was also composed of three convolutional and two Max-pooling layers. The output of each subnetwork was fed to an FC layer, and the resulting vectors were concatenated into a single vector. At the end, a softmax layer generates the posterior probabilities for the classes "deforestation" and "no-deforestation". Table 4 details the SN architecture with the filter and output size of each layer. For training the EF and SN models, we empirically selected the following setup: a batch size of 32, a maximum of 100 epochs, early stopping after 10 epochs without improvement on the validation set, and a dropout rate of 0.2 in the final FC layer. Additionally, the Adam optimizer was selected empirically, with weight decay equal to 0.9 and a learning rate of 10⁻³. The binary cross-entropy was used as the loss function.
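The stopping rule and the loss can be sketched framework-independently in plain Python. `train_epoch` and `val_loss` are hypothetical callables standing in for framework-specific code; the sketch only shows the patience logic (10 epochs) and the per-sample binary cross-entropy.

```python
import math


def binary_cross_entropy(p, y, eps=1e-7):
    """Loss for one sample: p = predicted P(deforestation), y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def train_with_early_stopping(train_epoch, val_loss, max_epochs=100, patience=10):
    """Run up to max_epochs; stop after `patience` consecutive epochs without
    improvement of the validation loss. Returns (best_epoch, best_loss)."""
    best_loss, best_epoch, waited = math.inf, 0, 0
    for epoch in range(1, max_epochs + 1):
        train_epoch()
        loss = val_loss()
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_loss
```

If the models were built in Keras (the text does not say), `tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10)` provides an equivalent stopping rule.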
For the CSVM approach, the architecture comprised three convolutional layers with ReLU activation, each followed by a Max-pooling layer. The output size of each layer is shown in Table 5. For this method and for the baseline, the validation samples were added to the training set. The weights of the SVM filters were computed with the multicore Liblinear software package [50]. The CSVM parameters were set as follows: a stride of one for the Conv and MaxPool layers and 12 SVMs in each Conv layer. The training set was split so that each SVM had the same number of samples for both classes. The mini-patches used for learning the SVMs had a size of 3 × 3 × 16 for the first convolutional layer and 3 × 3 × 12 for the second and third layers. The regularization parameter C of each SVM was estimated by three-fold cross-validation within the range [10⁻¹, 10³].
The buffer of both references was obtained by applying morphological dilation with a disk of radius 2 as the structuring element, which expanded the boundaries of the deforested polygons. The original reference was then subtracted from the dilated one, yielding the outer edge, and the patches whose central pixel fell in this edge were not considered for training, validation, or test.
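The buffer computation can be sketched in NumPy as below: a disk of radius 2 (pixels with offset di² + dj² ≤ 4), a plain binary dilation, and the difference between the dilated and original masks. Function names are hypothetical.

```python
import numpy as np


def disk(radius):
    """Disk-shaped structuring element: offsets with di^2 + dj^2 <= radius^2."""
    r = radius
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    return (x ** 2 + y ** 2) <= r ** 2


def binary_dilation(mask, se):
    """Binary dilation of a 2-D boolean mask by a centred, symmetric element."""
    r = se.shape[0] // 2
    H, W = mask.shape
    padded = np.pad(mask, r, mode="constant", constant_values=False)
    out = np.zeros((H, W), dtype=bool)
    for di, dj in zip(*np.nonzero(se)):
        out |= padded[di:di + H, dj:dj + W]   # shift the mask by each offset
    return out


def outer_buffer(reference, radius=2):
    """Pixels in the dilated reference but not in the original: the outer edge
    whose patches are excluded from training, validation, and test."""
    ref = np.asarray(reference, dtype=bool)
    return binary_dilation(ref, disk(radius)) & ~ref
```

Library equivalents exist, e.g. `scipy.ndimage.binary_dilation` with a structuring element from `skimage.morphology.disk`; the loop above just makes the operation explicit.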

Influence of the Number of Training Samples
To evaluate the influence of the number of training samples, four scenarios were considered, in which training samples were collected from one, two, three, and four tiles, denoted N_i, where i is the number of tiles used in each scenario. For the EF and SN methods, the validation set (val) was used to stop training once the validation loss had not improved for 10 consecutive epochs (early stopping). As mentioned before, for CSVM and SVM, the samples in this set were added to the training set (tr). The number of training samples in each scenario for the Amazon and Cerrado databases is presented in Tables 6 and 7, respectively.

Accuracy Assessment
The performance of the evaluated methods was expressed in terms of Overall Accuracy (OA), F1-Score, and Alarm Area (AA).

• Overall Accuracy (OA) is a global metric that indicates the proportion of samples correctly classified relative to the total number of samples:
$$\mathrm{OA} = \frac{tp + tn}{P + N} \qquad (2)$$
where true positives (tp) is the number of samples correctly assigned to the class "deforestation" and false positives (fp) is the number of samples erroneously assigned to the class "deforestation". Analogously, true negatives (tn) and false negatives (fn) correspond to the number of samples correctly and incorrectly assigned to the class "no-deforestation", respectively. P = tp + fn and N = tn + fp denote the total number of positive and negative samples in the test set.
• Precision, also known as Correctness, is the proportion of samples assigned by the classifier to the class "deforestation" that truly belong to that class:
$$\mathrm{Precision} = \frac{tp}{tp + fp} \qquad (3)$$
• Recall, also known as Completeness, is the proportion of all "deforestation" samples recognized by the classifier as such:
$$\mathrm{Recall} = \frac{tp}{tp + fn} \qquad (4)$$
• F1-score is the harmonic mean of Precision and Recall; like them, it ranges from 0 to 1:
$$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (5)$$
• Alarm Area (AA) is the portion of the monitored area classified as "deforestation", defined as the ratio between the flagged samples (tp + fp) and all test samples (P + N):
$$\mathrm{AA} = \frac{tp + fp}{P + N} \qquad (6)$$
This metric is important in an operational scenario in which an automatic system highlights areas suspected of deforestation (alarms), which are subsequently evaluated visually by a human analyst to eliminate false positives. The lower the Alarm Area, the lower the human effort.
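The metrics defined above can be computed directly from the confusion-matrix counts of the "deforestation" class. The following plain-Python sketch uses a hypothetical function name and made-up counts in the usage check.

```python
def change_metrics(tp, fp, tn, fn):
    """OA, Precision, Recall, F1 and Alarm Area from confusion counts,
    with 'deforestation' as the positive class."""
    P, N = tp + fn, tn + fp          # positives and negatives in the test set
    oa = (tp + tn) / (P + N)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    alarm_area = (tp + fp) / (P + N)  # share of the area flagged for inspection
    return {"OA": oa, "Precision": precision, "Recall": recall,
            "F1": f1, "AA": alarm_area}
```

With made-up counts tp = 90, fp = 30, tn = 870, fn = 10, OA is 0.96 while F1 is only about 0.82, illustrating why OA stays high when most test samples belong to "no-deforestation"; the Alarm Area is 0.12.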

Results and Discussion
In this section, we present and discuss the results obtained by the methods described in Section 2 for the Amazon and Cerrado databases. First, we report the average Overall Accuracy (OA) and F1-score computed over ten runs, each with a different selection of training samples for the class "no-deforestation". Next, we present the probability maps generated in each experiment, and finally, we analyze how semi-automatic approaches could be designed based on these methods to reduce human intervention with minimal accuracy loss.

Amazon Biome
Figure 9 summarizes the results of the experiments on the Amazon biome in terms of the F1-score for the class "deforestation". SN and EF achieved the best performance in most cases. As expected, the performance of all methods improved as more patches were used for training. CSVM, however, attained lower scores than the other methods and only reached the baseline performance when four tiles were used for training. With a single training tile, the CSVM performance was similar for two and three layers; with two and four training tiles, the best performance was obtained with two layers; with three training tiles, the performance decreased as more layers were added.
With a single training tile, SN was the best performing method, followed by SVM and then EF. This was not surprising, given the well-known generalization capacity of SVM with scarce training data. In contrast, SN and EF consistently outperformed SVM by about 10% and 13%, respectively, when data from two, three, and four tiles were used for training.
The results in terms of Overall Accuracy (OA) are presented in Figure 10. As with the F1-score, OA improved as more training samples became available, and OA scores above 90% were obtained in all cases. Again, CSVM scored lower than the other methods, following the same trend across scenarios as its F1-score. The high OA scores are explained by the class imbalance: about 97% of the test samples belonged to the class "no-deforestation".
Figures 11 and 12 show the NIR-G-B composition (Near Infrared, Green, and Blue bands) at both dates (T1 and T2), the reference, and the probability maps for tiles 2 and 14, respectively, both of which belong to the test set. Columns correspond to the methods SVM, EF, SN, and CSVM (after layer 2), and rows correspond to using one, two, three, and four tiles for training. Blue represents the lowest probability of deforestation, and red the highest.
As in the F1-score and OA plots, the probability maps improved and the salt-and-pepper effect decreased as the number of training tiles increased. In the first scenario, with a single training tile, SVM delivered the most false positives, followed by CSVM, which caused a more noticeable salt-and-pepper effect.
SVM was the least confident of the tested methods: it produced comparatively many intermediate probability values, whereas its counterparts produced probabilities concentrated close to 0 and 1. All methods presented intermediate probability values mainly around polygon borders, to which inaccuracies in the reference delineation of deforested polygons might have contributed.

Alarm Area vs. Recall for Amazon Biome
Next, we evaluate the methods as part of an alarm scheme, in which the underlying classifier indicates areas where deforestation has likely occurred. A photointerpreter then visually analyzes these areas in the image, or an inspector is sent to them, to check which detections are real deforestation and which are false alarms. The main benefit of this scheme is to restrict human effort to just a portion of the monitored area.
On the other hand, in this scheme, some deforested areas may be missed by the classifier and go unnoticed. Two metrics are critical in this analysis: first, the proportion of the monitored area flagged as potentially deforested and, second, the proportion of total deforestation contained in the areas indicated by the classifier. The first metric is the Alarm Area defined in Equation (6), whereas the second is the Recall defined in Equation (4). Both metrics depend on a threshold on the deforestation probability assigned by the classifier, above which a site deserves attention. The higher this threshold, the smaller both the Alarm Area and the Recall. The threshold value expresses a tradeoff between accuracy and human effort and will be determined by the operational demands at each time and in each region. Therefore, the following analysis focuses on the behavior of these two metrics as the deforestation probability threshold varies.
Specifically, we present the Recall versus Alarm Area curves for each method. Each point on a curve corresponds to a threshold imposed on the deforestation probability produced by the method. The desired profile is a small area to be checked at a high Recall.
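The construction of these curves can be sketched in NumPy: for each threshold, pixels whose deforestation probability reaches it are flagged, the flagged fraction gives the Alarm Area, and the fraction of reference deforestation covered gives the Recall. Function names are hypothetical, and both inputs are assumed to contain only evaluated pixels (buffers and small polygons already removed).

```python
import numpy as np


def recall_alarm_curve(prob, ref, thresholds):
    """(alarm_area, recall) for each threshold t, flagging pixels with prob >= t.

    prob: deforestation probabilities; ref: boolean reference. Both are
    assumed to hold only the evaluated (non-ignored) pixels.
    """
    prob = np.asarray(prob, dtype=float).ravel()
    ref = np.asarray(ref, dtype=bool).ravel()
    total_def = ref.sum()
    curve = []
    for t in thresholds:
        flagged = prob >= t
        alarm_area = flagged.mean()                  # (tp + fp) / (P + N)
        recall = (flagged & ref).sum() / total_def   # tp / (tp + fn)
        curve.append((float(alarm_area), float(recall)))
    return curve


def min_area_for_recall(curve, target):
    """Smallest Alarm Area on the curve that reaches the target Recall."""
    feasible = [aa for aa, rec in curve if rec >= target]
    return min(feasible) if feasible else None
```

Lowering the threshold can only add flagged pixels, so both coordinates grow together; `min_area_for_recall` reads off the operating point an analyst would choose for a required Recall.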
With one tile for training, all methods achieved Recall values of about 90% while flagging less than 10% of the whole imaged area. In other words, 90% of the actual deforestation is contained in 10% of the image. Hence, instead of inspecting the entire image, the analyst would focus on 10% of it, reducing human work by 90%. As expected, the area to be inspected grew as Recall increased; in this particular case, CSVM (with three layers) presented the best results (see Figure 13a). For Recall beyond 96%, the threshold values were very close to zero, most pixels tended to be classified as deforestation, and the area to be inspected rose to 100%, as can be observed in Figure 13b. Using two, three, and four tiles for training, all methods presented a similar profile up to 96% Recall; beyond this value, SVM presented the best performance, classifying more deforested samples correctly with a minimal increase in the area to be inspected. As with one training tile, when the threshold was set very close to zero, all samples tended to be classified as deforestation, and both Recall and the area to be inspected quickly approached 100%. A somewhat surprising conclusion from Figures 13-16 is that CSVM performed close to the other methods. Likewise, although SVM performed significantly worse than EF and SN in the analysis reported in the preceding section, the other methods showed no significant superiority over SVM in the present analysis. Notice that the preceding analysis implicitly set the probability threshold to 50% to discriminate the classes, whereas here the threshold varies. The experiments therefore indicate that SVM might be an attractive option for an alarm system, given its lower demand for training samples compared with DL-based methods.

Cerrado Biome
The results for Cerrado in terms of F1-score and OA are summarized in Figures 17 and 18, respectively. As with the Amazon database, EF and SN presented the best performance in all experiments. Using a single tile for training, EF and SN outperformed SVM by 2% and 3%, respectively. The best F1-score achieved by CSVM was 51%, obtained after the first layer, which did not reach the baseline. Using two tiles for training, EF and SN outperformed SVM by about 2%, and SVM outperformed CSVM by 9%. Using three tiles for training, EF and SN outperformed SVM by 3% and 2%, respectively, and CSVM (after the second layer) came very close to SVM. Using four training tiles, the DL-based methods were better than SVM: EF, SN, and CSVM (one layer) outperformed SVM by 2%, 3%, and 1%, respectively. In terms of OA, the results followed a trend similar to that of the F1-score. Scores above 90% were achieved in all scenarios, and EF and SN again obtained the best performance in all experiments. As on the Amazon database, CSVM presented lower scores than the other methods; only in the last case, using four training tiles, did CSVM match SVM, at 97%. As in the Amazon experiments, the high OA values arise because the vast majority of samples belong to the class "no-deforestation" and were correctly classified.
Figures 19 and 20 show the NIR-G-B composition (Near Infrared, Green, and Blue bands) at both dates (T1 and T2), the reference, and the probability maps of tiles 2 and 8, respectively. Again, columns correspond to the methods SVM, EF, SN, and CSVM (after layer 1), and rows correspond to the results for one, two, three, and four training tiles. Blue represents the lowest probability of deforestation, while red represents the highest. As with the Amazon database, the probability maps improved and the salt-and-pepper effect decreased as the number of training tiles increased.
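Why OA stays high while the F1-score is much lower can be illustrated with a small synthetic example. The sketch below assumes pixel-wise evaluation with a 50% threshold, as in the preceding analysis; the helper `oa_and_f1` and the toy numbers are hypothetical and do not reproduce the paper's data.

```python
import numpy as np

def oa_and_f1(prob, ref, threshold=0.5):
    """Overall Accuracy and F1-score of the "deforestation" class."""
    pred = np.asarray(prob, dtype=float).ravel() >= threshold
    ref = np.asarray(ref, dtype=bool).ravel()
    tp = np.sum(pred & ref)
    fp = np.sum(pred & ~ref)
    fn = np.sum(~pred & ref)
    tn = np.sum(~pred & ~ref)
    oa = (tp + tn) / ref.size
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return oa, f1

ref = np.zeros(1000, dtype=bool)
ref[:20] = True            # 2% deforestation, 98% no-deforestation
prob = np.zeros(1000)
prob[:10] = 0.9            # half of the deforestation is detected
prob[20:30] = 0.9          # plus 10 false alarms
oa, f1 = oa_and_f1(prob, ref)
print(f"OA = {oa:.2f}, F1 = {f1:.2f}")  # prints "OA = 0.98, F1 = 0.50"
```

OA is dominated by the 970 correctly rejected no-deforestation pixels, whereas F1 ignores true negatives and depends only on TP, FP, and FN.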
In the first scenario, where a single tile was used for training, all maps present a large number of false positives and a notable salt-and-pepper effect. Nevertheless, EF, SN, and CSVM are more confident, assigning values close to one to pixels of class "deforestation" and values close to zero to pixels of class "no-deforestation". In contrast, the probability maps delivered by SVM contain comparatively many pixels with intermediate probability values.
As observed in the previous experiment series, all methods were less confident close to the borders of the reference polygons, where they assigned probability values around 50%. As mentioned before, this is possibly related to inaccuracies in the delimitation of the deforestation polygons in the reference.
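The two qualitative observations above (SVM producing more intermediate posteriors, and all methods being less confident near polygon borders) could also be quantified from the probability maps. The sketch below is one possible way to do so, not an analysis performed in this work: the confidence score |2p - 1|, the 0.2-0.8 "intermediate" range, and the one-pixel, 4-connected border band are our assumptions, and all function names are hypothetical.

```python
import numpy as np

def border_mask(ref):
    """True for pixels with a 4-neighbour of the other class in the reference."""
    ref = np.asarray(ref, dtype=bool)
    p = np.pad(ref, 1, mode="edge")   # replicate edges so image borders are not flagged
    center = p[1:-1, 1:-1]
    border = np.zeros_like(center)
    for shifted in (p[:-2, 1:-1], p[2:, 1:-1], p[1:-1, :-2], p[1:-1, 2:]):
        border |= shifted != center
    return border

def confidence(prob):
    """0 where p = 0.5 (most uncertain), 1 where p is 0 or 1 (most confident)."""
    return np.abs(2 * np.asarray(prob, dtype=float) - 1)

def intermediate_fraction(prob, low=0.2, high=0.8):
    """Share of pixels whose probability lies strictly between low and high."""
    prob = np.asarray(prob, dtype=float)
    return float(np.mean((prob > low) & (prob < high)))

# Toy 4x4 tile: left half deforested, probabilities drop to 0.5 along the edge.
ref = np.zeros((4, 4), dtype=bool)
ref[:, :2] = True
prob = np.where(ref, 0.95, 0.05)
prob[:, 1:3] = 0.5
b = border_mask(ref)
conf = confidence(prob)
# Border confidence is 0.0; away from the border it is about 0.9.
print(conf[b].mean(), conf[~b].mean(), intermediate_fraction(prob))
```

Comparing `conf[b].mean()` with `conf[~b].mean()` per method would turn the visual impression of uncertain borders into a number, and `intermediate_fraction` would do the same for the SVM maps.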

Alarm Area vs. Recall for Cerrado Biome
The analysis from the perspective of an alarm system is presented in Figures 21-24 for one, two, three, and four training tiles, respectively. For this database, EF obtained the best performance in all four scenarios. With a single training tile, the performance was similar for all methods, although EF was slightly superior. Using two tiles for training, EF was again the best-performing method, followed by CSVM and SVM, which presented very similar results. Finally, using three and four tiles for training, EF and CSVM achieved better results: they correctly classified more samples of class "deforested" while requiring a smaller area to be inspected. According to the graphs, at 95% Recall the area to be inspected is reduced to 10% of the entire image (see Figures 21a, 22a, 23a and 24a). As with the Amazon database, for threshold values close to zero all samples are classified as deforestation, Recall reaches about 100%, and the area to be inspected is the entire image, as can be seen in Figures 21b, 22b, 23b and 24b. Compared with the analysis conducted on the Amazon biome, the superiority of EF over its competitors was more pronounced here, especially for Recall values from 95% upward, even when only one training tile was used.
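Reading an operating point such as "95% Recall at 10% of the image" off the curves amounts to finding the smallest Alarm Area whose threshold reaches a target Recall. A minimal vectorized sketch follows, assuming again that a pixel is flagged when its probability is at least the threshold. `alarm_area_at_recall` is a hypothetical helper; tied probabilities are flagged together because a single threshold cannot separate them.

```python
import numpy as np

def alarm_area_at_recall(prob, ref, target_recall):
    """Return (alarm_area, threshold): the smallest fraction of pixels that a
    threshold on prob must flag to capture target_recall of the reference."""
    prob = np.asarray(prob, dtype=float).ravel()
    ref = np.asarray(ref, dtype=bool).ravel()
    order = np.argsort(-prob, kind="stable")     # most probable pixels first
    p_sorted = prob[order]
    cum_tp = np.cumsum(ref[order])               # deforested pixels captured so far
    # Last position of each group of tied probabilities (a valid cut point).
    ends = np.r_[np.nonzero(np.diff(p_sorted))[0], p_sorted.size - 1]
    hits = cum_tp[ends] >= target_recall * ref.sum()
    if not hits.any():
        return 1.0, float(p_sorted[-1])
    k = ends[np.argmax(hits)]                    # first cut point reaching the target
    return (k + 1) / prob.size, float(p_sorted[k])

prob = np.array([0.9, 0.8, 0.7, 0.6, 0.1, 0.05, 0.05, 0.05, 0.05, 0.05])
ref = np.array([1, 1, 0, 1, 0, 0, 0, 0, 0, 0], dtype=bool)
area, thr = alarm_area_at_recall(prob, ref, target_recall=0.6)
print(f"inspect {area:.0%} of the image (threshold {thr}); work saved: {1 - area:.0%}")
```

Sorting once keeps the cost at O(N log N), which matters for full Landsat scenes; the reduction in human work discussed in the text corresponds to `1 - area` at the chosen Recall.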

Conclusions
This work evaluated three state-of-the-art deep learning techniques for deforestation detection: Early Fusion (EF), Siamese Network (SN), and Convolutional SVM (CSVM). Additionally, the performance of these methods was compared against a baseline based on a probabilistic Support Vector Machine (SVM), one of the most popular machine learning techniques for change detection.
Experiments were carried out on two areas of Brazilian biomes: the first corresponds to a region of the Amazon biome, and the second to a region of the Cerrado biome. The references used in this work were collected from the PRODES Project, developed by the National Institute for Space Research (INPE), whose mapping methodology involves significant human intervention. This work has great potential for reducing such intervention, and it assesses state-of-the-art methods towards more automatic deforestation detection. With the improvements offered by the evaluated techniques, the mapping could be performed in less time, at lower cost, and with a lower degree of subjectivity.
The experimental analysis relied on two LANDSAT 8/OLI optical images acquired about one year apart. Four different scenarios were considered, using one, two, three, and four training tiles. As expected, the performance of all methods increased with the number of training samples, a trend that was particularly pronounced for EF and SN.
EF and SN presented the best performance in most experiments. In a few cases, CSVM outperformed SVM. EF and SN reached up to 95% Overall Accuracy (OA) and 63% F1-score for the Amazon, and up to 97% OA and 78% F1-score for the Cerrado, so the results on the Cerrado database were higher than those on the Amazon database. The reason lies in the deforestation pattern of the Cerrado biome, which is comparatively more intense: the vegetation is completely removed and most of the soil is exposed. In the Amazon, by contrast, vegetation remains are common in the deforestation process, which hinders detection.
In addition, the probability maps indicated that EF, SN, and CSVM were more confident in their outcomes. Most posterior probabilities delivered by these methods were concentrated close to one and zero, for deforestation and no-deforestation, respectively, whereas the posteriors computed by SVM took intermediate values over comparatively large areas.
The main motivation for including CSVM in this study was the good performance reported for it in a recent paper under small training sample sizes. The experimental analysis did not confirm this expectation: in our experiments, CSVM was consistently outperformed by EF and SN and, in most cases, also by SVM.
Regarding CSVM, some additional experiments on the final classification layer were performed. The first used the flattened feature maps obtained after each convolutional layer to train the binary SVM, instead of pooling them over four quadrants and computing the means. The second involved the choice of the final classifier: we tested a Softmax layer and SVMs with an RBF (Radial Basis Function) kernel and with a linear kernel. The best results were obtained with the linear kernel.
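The two feature-extraction options compared for the final CSVM layer differ only in how each stack of feature maps is turned into a vector for the binary SVM. The NumPy sketch below illustrates them under our assumptions: maps of shape (C, H, W), quadrants obtained by splitting at H//2 and W//2, and per-quadrant means concatenated quadrant by quadrant. The exact layout used in the CSVM implementation may differ, and both function names are hypothetical.

```python
import numpy as np

def quadrant_mean_features(fmaps):
    """Split each (H, W) map into four quadrants and average each one.

    fmaps: array of shape (C, H, W). Returns a vector of length 4 * C,
    ordered [top-left, top-right, bottom-left, bottom-right], each block
    holding one mean per channel. With odd H or W, the bottom/right
    quadrants receive the extra row/column.
    """
    fmaps = np.asarray(fmaps, dtype=float)
    _, h, w = fmaps.shape
    hh, hw = h // 2, w // 2
    quads = (
        fmaps[:, :hh, :hw], fmaps[:, :hh, hw:],
        fmaps[:, hh:, :hw], fmaps[:, hh:, hw:],
    )
    return np.concatenate([q.mean(axis=(1, 2)) for q in quads])

def flattened_features(fmaps):
    """Alternative tested in the text: use every activation (length C * H * W)."""
    return np.asarray(fmaps, dtype=float).reshape(-1)

fmaps = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)  # 2 channels, 4x4 maps
print(quadrant_mean_features(fmaps))     # 8 values (4 quadrants x 2 channels)
print(flattened_features(fmaps).shape)   # (32,)
```

Quadrant pooling yields 4C values per sample regardless of map size, whereas flattening yields C x H x W values and therefore a much larger SVM input. Either vector could then be passed to a linear SVM, for instance scikit-learn's `SVC(kernel="linear")` or `LinearSVC`.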
It is worth mentioning that, although CSVM did not outperform the baseline in most cases, it has an advantage over EF and SN: it is a CPU-based method and thus does not require a GPU to carry out the experiments. GPUs are considerably more expensive and require powerful supporting hardware.
An additional evaluation was performed to verify how much the methods can reduce the time invested in the visual inspection of deforested areas. The Alarm Area (AA) metric was computed to evaluate how the methods can reduce the human analyst's effort for visual classification by restricting inspection from the whole image to just a portion of the monitored area. According to the experiments, for the Amazon biome, human work could be reduced by 90%, since 90% of the deforestation occurrences were contained in 10% of the whole imaged area. For the Cerrado biome, 95% of the deforestation occurrences were contained in a similar portion of the image.
Although the evaluated methods were tested on deforestation detection, they can be easily adapted to other change detection applications. The present study shows that these methods are a promising research direction for monitoring and controlling environmental issues of paramount importance today.
Further studies should test other hyperparameter combinations for the assessed methods, with a focus on decreasing the number of false deforestation detections, and evaluate other deep architectures, such as Fully Convolutional Networks (FCN) and Recurrent Neural Networks (RNN). Another research direction relates to the use of freely available data from other sensors. An attractive alternative is Sentinel-2 data, which offers better temporal and spatial resolution than LANDSAT-8. Furthermore, the use of Synthetic Aperture Radar (SAR) data would allow monitoring deforestation almost independently of weather conditions. Indeed, the Brazilian biomes are subject to cloud cover for most of the year, which severely limits the use of optical imagery. Given these circumstances, SAR images are a promising option.
A critical issue remains the number of training samples required by deep learning-based methods to achieve their full potential. Techniques based on domain adaptation appear to be another promising research direction for mitigating this limitation.