Forest Disturbance Detection with Seasonal and Trend Model Components and Machine Learning Algorithms

Abstract: Forest disturbances reduce the extent of natural habitats, biodiversity, and carbon sequestered in forests. With the implementation of the international framework Reducing Emissions from Deforestation and forest Degradation (REDD+), it is important to improve the accuracy of estimates of the extent of forest disturbances. Time series analyses, such as Breaks for Additive Season and Trend (BFAST), have been frequently used to map tropical forest disturbances with promising results. Previous studies suggest that, in addition to the magnitude of change, disturbance detection accuracy could be enhanced by using other components of BFAST that describe additional aspects of the model, such as its goodness-of-fit, NDVI seasonal variation, temporal trend, historical length of observations, and data quality, as well as by using separate thresholds for distinct forest types. The objective of this study is to determine whether the BFAST algorithm can benefit from using these model components in a supervised scheme to improve the accuracy of forest disturbance detection. Random forests and support vector machines algorithms were trained and verified using 238 points in three different datasets: all-forest, tropical dry forest, and temperate forest. The results show that the highest accuracy was achieved by the support vector machines algorithm using the all-forest dataset. Although the increase in accuracy of the latter model vs. a magnitude threshold model is small, i.e., 0.14% for sample-based accuracy and 0.71% for area-weighted accuracy, the standard error of the estimated total disturbed forest area was 4352.59 ha smaller, while the annual disturbance rate was also smaller by 1262.2 ha year⁻¹. The implemented approach can be useful for obtaining more precise estimates of forest disturbance, as well as its associated carbon emissions.


Introduction
Forest disturbance has been studied extensively, especially after the recognition of its important role as a source of greenhouse gas (GHG) emissions and climate change, since approximately 12-20% of global anthropogenic GHG emissions have been attributed to this process [1,2]. Forest disturbances modify forest canopy structure and biomass content and can result from deforestation and forest degradation. Deforestation usually consists of the complete removal of forest cover at a large scale, while forest degradation involves subtle changes in forest structure and canopy cover [3]. Given future projections of climate change, improving accuracy and reducing uncertainty in mapping forest disturbances is critical, since it ensures accurate estimation of forest carbon and sound climate change modeling [4]. In this study, forest disturbance was defined as "a relatively discrete event causing a change in the physical structure of the vegetation and surface soil" [5]. We focus on identifying disturbances caused by both natural events, such as fires, and anthropogenic activities, such as slash-and-burn agriculture and logging, that produce changes in forest structure detectable by remote sensing.

Study Site
The study area is in the Ayuquila river basin, western Mexico, with elevations ranging from approximately 250 m to 2500 m above mean sea level [37] (Figure 1). The elevation variability in the study area translates into a range of climatic conditions that include tropical semi-dry and sub-humid climates at the lower elevations, and temperate sub-humid and humid conditions at the higher altitudes. Annual precipitation ranges from 800 to 1200 mm and the average annual temperature is between 18 and 22 °C. The study area shows a clear seasonal rainfall pattern, with most of the precipitation falling from June to October [37].
There are two types of natural forest: temperate forest (TF) and tropical dry forest (TDF). TF covers about 12% of the watershed and is found mostly at higher elevations, while TDF occupies around 24% of the basin and is usually distributed in the lower areas. The main composition of TF includes pines (Pinus spp.), firs (Abies spp.), and oaks (Quercus spp.), and it is exploited mainly for timber, although recently, avocado (Persea americana) plantations have been established in areas previously occupied by TF. The area covered by TDF is mainly used for shifting cultivation, although residents also use TDF as a source of natural resources for fuelwood extraction, cattle grazing, and pole extraction for constructing fences [38].


General Workflow
In a previous study, forest disturbances were identified by applying a BFAST model to an NDVI time series (1994-2018) derived from Landsat 5, 7, and 8 [30]. The time series were divided into a historical period and a change monitoring period (2016-2018). We applied a threshold value of |0.2| for the magnitude of change, which was derived from field verification, to optimize the detection of disturbances. Finally, we used 624 stratified random points to evaluate the detected disturbances and obtained 96.52% overall accuracy with an area-weighted error matrix [30]. These results represent the baseline model to which the ML models will be compared. The general workflow of the methodology is presented in Figure 2. In the present study, to balance the number of observations of true and false disturbances in the training and validation datasets, we sampled 238 points based on a gamma distribution of magnitude values centered at |0.2|. We manually labeled those points as disturbed and non-disturbed forest using monthly median reflectance composites of Sentinel-2 images (2016-2018) as reference. Then, we trained two machine learning algorithms, random forests (RF) and support vector machines (SVM), to classify forest disturbances using the following components derived from BFAST: magnitude of change, trend, amplitude, goodness-of-fit, model fitting period, and data quality. In the following sections, we provide detailed descriptions of the methods.


Training and Validation Datasets
Although all the breakpoints detected by the BFAST spatial model with negative magnitude values are potential forest disturbances, those with magnitude values close to zero most probably correspond to false disturbances caused by noise or interannual variations, while those with magnitude values farther from zero tend to correspond to true disturbances [29,39]. To balance the number of observations of true and false disturbances used to train and validate the models, the raw magnitude output of BFAST was used to establish a stratified random sampling design. In this procedure, we used a gamma distribution centered at magnitude |0.2| to determine the number of observations for the complete range of magnitude values (0-|0.5|) in 0.05-value steps. A total of 238 points were selected following this design and distributed proportionally by the extent of the two types of forest. Thus, most of the visually interpreted points were close to a magnitude of |0.2|, while the more distant magnitude classes had fewer points (0-|0.05|: 10 pts, |0.45|-|0.5|: 6 pts; Table A1).
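The allocation of points across magnitude strata can be sketched as follows. This is an illustrative Python sketch, not the authors' R code: the gamma shape and scale (chosen so the density peaks at |0.2|) and the per-bin rounding are assumptions for illustration, since the study's exact parameterization is not reported here.

```python
import math

# Bin centers for the magnitude classes 0-|0.5| in 0.05-value steps
edges = [round(0.05 * i, 2) for i in range(11)]
centers = [(a + b) / 2 for a, b in zip(edges[:-1], edges[1:])]

def gamma_density(x, shape=5.0, scale=0.05):
    """Unnormalized gamma pdf; mode = (shape - 1) * scale = 0.2 (assumed values)."""
    return x ** (shape - 1) * math.exp(-x / scale)

dens = [gamma_density(c) for c in centers]
weights = [d / sum(dens) for d in dens]
n_points = [round(w * 238) for w in weights]  # 238 = total sample size in the study
```

With these assumed parameters, the bins nearest |0.2| receive the most points and the extreme bins the fewest, mirroring the design described above.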
These points were labeled manually as disturbed and non-disturbed. For this, we consulted Sentinel-2 images from 2016-2018 in GEE at level 1C, which were orthorectified and radiometrically calibrated. Sentinel-2 images provided by the European Space Agency (ESA) have spatial resolutions ranging from 10 m to 60 m, multi-spectral bands covering the visible, red edge, and infrared spectral ranges, and a temporal resolution of 2-5 days [40]. We first applied a cloud mask using the data quality layer 'QA60' and then constructed monthly time series images using the median value. For each image, we used the blue, green, and red channels with a spatial resolution of 10 m to construct natural color composites. Using the monthly natural color composites, we visually interpreted the sample points and labeled them as either disturbed or non-disturbed forest. When the interpretation was not clear with the Sentinel-2 images, we consulted high spatial resolution images in Google Earth. Some examples of Sentinel-2 images used for interpreting sample points as disturbance and non-disturbance in TDF and TF are shown in Figure 1B.
After visually interpreting the 238 points, 143 points corresponded to non-disturbed forest (TF: 53 and TDF: 48), while 95 points corresponded to disturbed forest (TF: 90, TDF: 47). This distribution confirmed that the stratified random sampling helped in obtaining a more balanced dataset in terms of the number of observations of the disturbance/non-disturbance classes. Afterward, the verified points were split at random into training and validation datasets with a proportion of 0.7/0.3, respectively. It has been reported that forest type is an important factor in classifying forest disturbance using BFAST [30][31][32]; therefore, three models were constructed with different forest strata: all-forest, only TDF, and only TF.

Variables: BFAST Components and Time Series Landsat Data Quality
BFAST has been successfully applied in characterizing spatial-temporal vegetation dynamics and detecting changes in a time series [7,11,36]. The BFAST model focuses on near real-time change detection, in which the data are divided into a historical and a monitoring period. The stable trend, modeled in the historical period, is projected into the monitoring period, and a breakpoint of change is detected when the difference between the projected and observed values is larger than a threshold [29]. Usually, magnitude of change is the variable that is adjusted to optimize the detection [29,39,41]. However, a range of other variables can be obtained from the BFAST model, such as trend, amplitude of the seasonal model, goodness-of-fit, model fitting period, and data quality in the fitted model. These variables have been shown to be relevant for vegetation change detection. For example, Grogan et al. [31] reported that the difference in the slope of the model is important in the disturbance detection of different forest types. De Vries et al. [39] found enhanced disturbance detection when using the breakpoints along with magnitude, while Dutrieux et al. [7] and Schultz et al. [42] reported data availability as an important factor in disturbance detection.
In this work, the following BFAST model components were extracted: magnitude, trend, amplitude, goodness-of-fit, model fitting period, and data quality (Table 1). The abbreviations in Table 1 are the following: i stands for each observation in a time series; y_i and ŷ_i for the observed and predicted NDVI values of observation i; ȳ for the mean NDVI; x for the independent variable (i.e., days); y_break and ŷ_break for the observed and predicted values at the detected breakpoint, respectively; ŷ_max and ŷ_min for the maximum and minimum predicted values of the detrended model, respectively; n for the number of valid observations and N for the total number of observations in the stable historical period. We use data quality as a measure of the percentage of valid observations over the total observations, which are 365 per year. The valid observations correspond to cloudless observations, while the total observations include valid observations, no data, cloud pixels, and missing data.

Table 1. Definition and equation of the BFAST components that were used as predictive variables in the machine learning algorithms.

Magnitude
The absolute value of the difference between the observed and projected NDVI at the breakpoint: magnitude = |y_break − ŷ_break|

Trend
The slope of the linear model that was fitted in the stable historical data period.

Amplitude
The difference between the maximum and minimum NDVI of the detrended model: amplitude = ŷ_max − ŷ_min

Goodness-of-fit
The goodness-of-fit of the fitted model, measured by the coefficient of determination: R² = 1 − Σ_i (y_i − ŷ_i)² / Σ_i (y_i − ȳ)²

Model fitting period
The length in years of the stable historical data period, corresponding to the time window to which the BFAST model is fitted.

Data quality
The number of valid observations over the total observations in the stable historical data period: data quality = n/N
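Assembled into code, the component definitions above can be computed from a fitted model's observed and predicted NDVI series. This is a hedged sketch: the function signature and argument names are illustrative (the study derived these components from BFAST output in R), and the trend is treated here as a per-step slope over an evenly indexed series.

```python
def bfast_components(y, y_hat, trend, y_break, y_hat_break, n_valid, n_total):
    """Illustrative computation of the Table 1 components (names are assumptions).

    y, y_hat: observed and predicted NDVI series over the stable historical period
    trend: slope of the fitted linear model (per observation index)
    y_break, y_hat_break: observed and predicted NDVI at the detected breakpoint
    n_valid, n_total: valid (cloudless) and total observations in the period
    """
    magnitude = abs(y_break - y_hat_break)          # |y_break - y_hat_break|
    detrended = [yh - trend * i for i, yh in enumerate(y_hat)]
    amplitude = max(detrended) - min(detrended)     # y_hat_max - y_hat_min
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot                        # goodness-of-fit (R squared)
    data_quality = n_valid / n_total                # valid over total observations
    return magnitude, amplitude, r2, data_quality
```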

Baseline Model
The baseline model for forest disturbance detection uses only the threshold (|0.2|) of the magnitude of change as a predictive variable, which was determined by field verification. With this model, an area with an absolute magnitude value higher than this threshold is classified as a disturbance. This model is applied to all three datasets with different forest strata. The purpose of this baseline model is to provide a point of comparison for the ML models that use additional BFAST components as variables.
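As a minimal sketch, the baseline rule reduces to a single comparison against the field-verified threshold:

```python
THRESHOLD = 0.2  # field-verified magnitude threshold from the study

def baseline_classify(magnitude):
    """Flag a breakpoint as disturbance when |magnitude| exceeds the threshold."""
    return "disturbance" if abs(magnitude) > THRESHOLD else "non-disturbance"
```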

Machine Learning Algorithm
Two ML algorithms, RF and SVM, were applied in a supervised classification scheme using variables of BFAST components and time series data quality to test if they can improve upon the baseline model. These two algorithms use non-parametric models that have been applied previously for land cover classification and change detection [18].
RF is a classification and regression algorithm based on an ensemble of multiple decision trees, trained using random samples of the data and predictor variables [43]. Predictions are made by aggregating the estimates of each tree using a majority vote. RF is usually trained with a proportion of two-thirds of the training data and validated with the remaining one-third, also referred to as the out-of-bag dataset [43]. Previous studies have reported that using a high number of decision trees makes the generalization error converge and reduces the overfitting of the model [19,20,43]. Therefore, we used 500 random trees to perform the classification, while the number of predictive variables used in each split of each tree was the square root of the number of predictive variables. Finally, the out-of-bag ratio was one-third.
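An RF configured as described (500 trees, sqrt(p) predictors per split) can be sketched with scikit-learn; the study itself used R, so this Python version and the toy predictor values are assumptions for illustration only.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy predictors: [magnitude, trend, model fitting period]; values are invented.
X = [[0.30, -0.010, 10], [0.05, 0.002, 18], [0.41, -0.020, 8],
     [0.02, 0.001, 20], [0.35, -0.015, 9], [0.04, 0.000, 19]]
y = [1, 0, 1, 0, 1, 0]  # 1 = disturbance, 0 = non-disturbance

# 500 trees; sqrt of the number of predictors tried at each split, as in the text
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)
rf.fit(X, y)
```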
SVM is also a classification and regression algorithm that tries to find a hyperplane that minimizes the error in the predictions [44]. The hyperplane for the classification is determined by the subset of the data that fall on the decision boundary (these points are referred to as support vectors). SVM assumes that the data are linearly separable in a multidimensional space; since this is often not the case, kernel functions are used to transform the feature space to facilitate the classification [17]. SVM can use different types of kernel functions; in this study, we applied the radial basis function, which has been frequently used in remote sensing applications [45,46]. In addition, a key concept in SVM is the margin, which refers to the distance between the decision hyperplane and the closest observation. Usually, a large margin allows clear discrimination by the hyperplane. Thus, the cost parameter is a positive value indicating the penalization for predicting a sample within or on the wrong side of the margin, while sigma is the precision (inverse kernel width) parameter. In this case, we used cost = 1 and sigma was calculated automatically from the data in a heuristic procedure [47].
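The corresponding SVM configuration can be sketched with scikit-learn; again, the study used R (where kernlab estimates sigma heuristically), so `gamma="scale"` below is an assumed stand-in for that heuristic, not the same formula, and the toy data are invented.

```python
from sklearn.svm import SVC

# Toy predictors: [magnitude, trend]; values are invented.
X = [[0.30, -0.010], [0.05, 0.002], [0.41, -0.020],
     [0.02, 0.001], [0.35, -0.015], [0.04, 0.000]]
y = [1, 0, 1, 0, 1, 0]  # 1 = disturbance, 0 = non-disturbance

# RBF kernel with cost C = 1 as in the study; gamma="scale" is sklearn's
# data-driven kernel-width estimate (an assumption standing in for sigma)
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X, y)
```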

Variable Importance
The importance value of each predictive variable was calculated for both ML algorithms using models that included all the predictive variables. Based on the importance values, multiple models were trained sequentially, including the most important variables first. This sequential model training was applied for the three datasets: all-forest, TDF and TF. The all-forest dataset included forest-type as an additional predictor, besides the BFAST components; while the forest-type specific datasets included only the BFAST components as predictive variables. In total, seven models were trained for the all-forest dataset and six models for each of the forest-type specific datasets.
For RF models, the variable importance was calculated as a linear transformation of the Mean Decrease Gini into a 0-100 range, which measures the total decrease in node impurities resulting from splitting the data according to each predictive variable, averaged over all the constructed trees [43]. For SVM models, the variable importance was calculated by permuting the independent variables and fitting a single-predictor model on the complete data [48]. Afterward, the predictor variable that obtained the highest value of the area under the curve (AUC) of the receiver operating characteristic (ROC) curve was ranked as 100 and the variable with the lowest AUC value as zero, while all the other importance values were scaled to this range.
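The AUC-based ranking and 0-100 rescaling used for the SVM models can be sketched in plain Python. The rank-sum AUC below is a standard way to score a single predictor against binary labels; it is an illustrative equivalent, not the caret implementation the study used.

```python
def auc(scores, labels):
    """AUC of one predictor via the rank-sum (Mann-Whitney) statistic."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def scale_importance(aucs):
    """Linearly rescale AUC values so the best predictor scores 100, the worst 0."""
    lo, hi = min(aucs), max(aucs)
    return [100 * (a - lo) / (hi - lo) for a in aucs]
```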

Best Model Selection
To find the best model configuration in terms of the predictive variables, different models were trained using only the training dataset (n = 168), while leaving the validation data untouched (n = 70). The model training was based on a 3-fold cross-validation with 40 repeats. In this way, each model ran 120 iterations from which the average accuracy and standard error were calculated. Afterward, the best configuration was selected based on a balance between the overall accuracy and the number of variables included, such that models with fewer predictive variables and a non-significant difference with the highest achieved accuracy were prioritized (i.e., within the range of mean accuracy ± 1.96 standard error).
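The parsimony rule described above (prefer the fewest predictors among models within mean ± 1.96 SE of the best accuracy) can be sketched as follows; the candidate tuples in the test are hypothetical, not the study's actual cross-validation results.

```python
def select_model(candidates):
    """Pick the most parsimonious model not significantly worse than the best.

    candidates: list of (n_vars, mean_accuracy, se_accuracy) tuples from CV.
    """
    best = max(candidates, key=lambda c: c[1])
    lower = best[1] - 1.96 * best[2]           # lower bound of the best model's band
    eligible = [c for c in candidates if c[1] >= lower]
    return min(eligible, key=lambda c: c[0])   # fewest predictive variables wins
```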
An identical procedure was followed for the three datasets: all-forest, TDF, and TF. Although the split ratio between the training and validation datasets was the same for all three datasets, for the TDF and TF datasets, the total number of observations in each split was smaller than in the all-forest dataset. In the case of TDF, the training and validation datasets had 97 and 41 observations, respectively; for TF, the same datasets had 72 and 29 observations, respectively.
The overall accuracy is defined as a metric that summarizes the number of correct predictions over the total predictions [49]. It is calculated as the ratio of the sum of the true positive and true negative cases over the total predictions, as summarized in an error matrix (Table 2, Equation (1)).

Overall accuracy = (TP + TN)/(TP + TN + FP + FN) (1)

Table 2. TP (true positive), FP (false positive), FN (false negative), and TN (true negative) in an error matrix for a binary classification of disturbance and non-disturbance. The ground data are located in the columns, and the predicted data are located in the rows.

                   Disturbance    Non-Disturbance
Disturbance        TP             FP
Non-disturbance    FN             TN
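Equation (1) translated directly into code, using the Table 2 cell names:

```python
def overall_accuracy(tp, tn, fp, fn):
    """Correct predictions (TP + TN) over all predictions, as in Equation (1)."""
    return (tp + tn) / (tp + tn + fp + fn)
```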

Model Validation
The trained models were tested using the validation data, which were kept untouched. Several metrics were calculated, including overall accuracy, F1-score, and ROC AUC. Overall accuracy evaluates the correct detection of both disturbance and non-disturbance samples, while F1-score and ROC AUC focus on the class of interest (i.e., disturbance). Therefore, each metric evaluates a different aspect of the results. F1-score corresponds to the harmonic mean of precision and recall, calculated as Equation (2), where p stands for precision (or user's accuracy) and r for recall (or producer's accuracy) [50,51]:

F1 = 2pr/(p + r) (2)

Precision is calculated as the ratio of true positives over the total of positive predictions and recall is calculated as the ratio of the true positives over the sum of true positives and false negatives [51].
The ROC curve is constructed by plotting the true positive rate against the false positive rate of a classification model, using different probability thresholds. True positive rate is a synonym for recall (see Equation (2)), while the false positive rate is calculated as 1 − TN/(TN + FP).
ROC curves can be used to interpret the skill of a model to correctly classify a binary outcome (true or false). In our case, it shows the ability of the model to classify the disturbance class. In turn, the AUC of a ROC plot is calculated as the area under the ROC curve. For a random guess, the AUC is 0.5, while a perfect model will have an AUC of 1. Models with AUC between 0.5 and 1 represent a better prediction than a random guess, while higher AUC values are considered better models.
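The validation metrics above can be written out from the same error-matrix cells; the counts in the test are hypothetical.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, as in Equation (2)."""
    p = tp / (tp + fp)  # precision (user's accuracy)
    r = tp / (tp + fn)  # recall (producer's accuracy)
    return 2 * p * r / (p + r)

def tpr_fpr(tp, tn, fp, fn):
    """True and false positive rates, the two axes of the ROC curve."""
    return tp / (tp + fn), fp / (fp + tn)
```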

Post Classification Processing and Accuracy Assessment
The best model (SVM all-forest) was used to predict forest disturbance in the complete study area. Although the validation dataset was used to verify the models using an independent set of observations, this dataset was created following a gamma distribution of magnitude. Thus, it was not considered a representative sample of the entire study area. To assess the accuracy of this classification using an independent set of validation data, a stratified random sampling was implemented based on the area occupied by each class of interest in the classification map, following [52]. The number of samples was calculated using Equation (3):

n = (Σ_i W_i S_i / S(Ô))² (3)

where S(Ô) is the standard error of the estimated overall accuracy, for which a value of 0.01 was used as suggested by Olofsson et al. [52]; W_i is the mapped proportion of the area of class i; and S_i is the standard deviation of class i, S_i = √(U_i(1 − U_i)), with U_i the user's accuracy of class i. This formula estimated a sample size of 624 observations. For the rare classes (i.e., disturbances), 50 random points were assigned for each forest type. The rest of the validation points were distributed according to the predicted proportional area of the remaining strata (i.e., non-disturbed forest in both TF and TDF). Afterward, these points were verified by visual interpretation in Google Earth Engine (GEE) using Sentinel-2 images.
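The Olofsson-style sample-size calculation can be sketched numerically; the class weights and user's accuracies below are hypothetical placeholders, not the study's values (which yielded n = 624).

```python
import math

def sample_size(weights, users_acc, se_target=0.01):
    """Stratified sample size n = (sum_i W_i * S_i / S(O_hat))^2,
    with S_i = sqrt(U_i * (1 - U_i)).

    weights: mapped area proportions W_i per class
    users_acc: anticipated user's accuracies U_i per class
    se_target: target standard error of overall accuracy, S(O_hat)
    """
    s = [math.sqrt(u * (1 - u)) for u in users_acc]
    return (sum(w * si for w, si in zip(weights, s)) / se_target) ** 2
```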

R Packages
The complete method was carried out in an R 4.0.5 environment. Variable importance was calculated using the caret package [48], the training and validation of the ML algorithms were performed using the tidymodels [53] and yardstick packages [54], the spatial information operations were done using the sf [55] and raster packages [56], while for data wrangling and plots the tidyverse package [57] was used.

Variable Importance
The variable importance values show that the essential variables to correctly identify disturbances vary with the algorithm and dataset used (all-forest, TDF, or TF). Nevertheless, regardless of the dataset and ML algorithm, magnitude was always ranked as the most important predictor, while other important variables included trend, data quality, model fitting period, and amplitude. On the contrary, forest-type and R² were ranked as the least important (Figure 3).


Best Model Selection
The exploration of the models with different numbers of predictive variables indicated that the accuracy ranged from 61.48% to 87.89% (see Figure 3 and Table A2). From these results, the models that were not significantly different from the highest achieved accuracy and had fewer predictive variables were selected and verified on the validation set (Table 3). For the all-forest model, both SVM and RF favored three identical predictive variables: magnitude, trend, and model fitting period, in the same order of importance; however, SVM obtained higher overall accuracy than RF on the validation dataset. In turn, for the TDF dataset, the best RF model included magnitude, trend, and amplitude, while the best SVM model included magnitude, trend, model fitting period, and data quality. In the TF model, RF and SVM used fewer predictive variables: magnitude and data quality for RF, and magnitude and model fitting period for SVM (Table 3).


Model Validation
Regardless of the ML algorithm used, the validation of these models showed the same accuracy or an increase in accuracy of up to 21.95% compared to the baseline models (∆ Best vs. Baseline, Table 4). The largest difference was observed in the TDF dataset, where the baseline model achieved an accuracy of 58.54% on the validation dataset, while the RF model had an accuracy of 80.49%. On the contrary, the smallest difference was observed in the TF dataset, where the baseline model and SVM model obtained the same accuracy of 79.31%. When comparing the two ML algorithms, SVM showed a higher accuracy on the all-forest dataset, while RF achieved higher accuracies on the TDF and TF datasets (Table 4). Finally, the model that enabled the highest accuracy was SVM with the all-forest dataset.

ROC Curve and AUC
The ROC curves of the models with the highest accuracy indicate that in all cases the models showed a higher AUC than a random guess and less than the perfect model. Additionally, both ML algorithms have a higher AUC than the baseline model. The plot shows the possible combinations of true positive rate and false positive rate given the data. For example, when the true positive rate and the false positive rate are both at one (1, 1), all the validation data are classified as disturbed forest; while the opposite combination of (0, 0) happens when none of the validation data is classified as disturbed forest. All the points between these two extremes show a combination of different true and false positive rates in the detection (Figure 4).



Forest Disturbance Prediction in the Study Area
The SVM model with the all-forest dataset was used to predict forest disturbance for the complete study area (Figure 5). Evaluated with 624 random points, the prediction obtained an overall accuracy of 94.87%, while the user's and producer's accuracies for the disturbance and non-disturbance classes were between 82.69% and 97.31%. When weighted by the area proportion of each class, the overall accuracy increased to 97.23%, while the producer's accuracy for forest disturbance dropped from 86% to 13.77% (Table 5). Overall accuracy can be biased by the imbalance in the number of observations between the classes. Therefore, high overall accuracy values can disguise low classification capabilities for rare classes. Other metrics, such as the F1-score, can give further insight into the classification capabilities for both common and rare classes. Although the F1-score can give a better idea of the classification capabilities for all the classes, the overall accuracy reports the precision of the entire map. Thus, the two metrics can be used to describe different aspects of the classification performance.
When the accuracy assessment was weighted by area proportion, due to the extremely large area occupied by the non-disturbance class, a small commission error can have a detrimental effect on the area estimation of the disturbance class [48]. In our case, the commission error weighted by the area of the non-disturbance class (99.48%) reduces the producer's accuracy of the disturbance class from 86% to 13.77% when weighted by area proportion.

Table 5. Confusion matrix for the accuracy assessment of the classification by the SVM all-forest model for the complete study area; the area-weighted accuracy indices, including the producer's, user's, and overall accuracy, as well as area estimates for each class, are shown. The actual disturbance and non-disturbance are located in the columns, and the predicted disturbance and non-disturbance are located in the rows.
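The area-weighting effect described above can be illustrated with a short sketch: when the non-disturbance class covers nearly all of the area, even a small commission error translates into a large area of omitted disturbance, which collapses the rare class's area-weighted producer's accuracy. The proportions below are hypothetical, chosen only to mimic the magnitude of the effect.

```python
def producers_accuracy(p_correct, p_omitted):
    """Area-weighted producer's accuracy of the disturbance class.

    p_correct: area proportion mapped as and actually disturbance
    p_omitted: area proportion actually disturbance but mapped non-disturbance
               (the rare class's omission, driven by the majority class's
               commission error over a very large area)
    """
    return p_correct / (p_correct + p_omitted)
```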

Improvement in Disturbance Detection Using Machine Learning Algorithms
This study demonstrates that using ML algorithms with BFAST model components (ML+BFAST) enabled higher accuracy in forest disturbance detection than the baseline model, which relies only on a magnitude threshold. In all datasets (i.e., all-forest, TDF, TF), the ML algorithms (i.e., SVM or RF) improved the accuracy achieved on the validation dataset, except for SVM with the TF dataset, which obtained the same accuracy as the baseline model (79.31%). Other evaluations reached similar conclusions when using additional BFAST components or ML algorithms to improve disturbance detection [28,31].
In the all-forest dataset, SVM outperformed RF by a small margin (4.29%). In turn, in both the TDF and TF models, RF obtained 4.88-6.90% higher accuracy than SVM. Similar performance of SVM and RF in remote sensing-based classifications has been commonly reported [18,45]. However, it is unclear whether one algorithm holds a particular advantage over the other, as both can deal with heterogeneous data, high dimensionality, and a limited number of observations [18,45].
Compared with the baseline model, the ML approaches did not show an advantage when using magnitude as the single predictive variable (Table A2). This suggests that when magnitude is the only variable used, expert knowledge enables higher accuracy than an ML approach; however, as the number of variables increases, the ML algorithms achieve higher accuracy (Table A2). A prior report on the use of ML and BFAST has shown the advantages of ML over a simple threshold procedure to identify disturbances [28]. In our case, we used a small training sample; therefore, the advantage of expert knowledge vs. ML has yet to be tested with a larger dataset.
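The comparison between an expert threshold and ML classifiers trained on several BFAST components can be sketched as follows. Everything here is a synthetic assumption for illustration: the features, the labelling rule, and the threshold value (-0.08) are invented and are not our actual training points or expert threshold.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the reference points: each row holds BFAST model
# components (magnitude, trend, data quality, model fitting period).
n = 238
magnitude = rng.normal(0.0, 0.1, n)
trend = rng.normal(0.0, 0.01, n)
quality = rng.uniform(0.0, 1.0, n)
fit_period = rng.uniform(1.0, 10.0, n)
# Hypothetical labelling rule: strongly negative magnitude marks a disturbance.
y = (magnitude + rng.normal(0.0, 0.03, n) < -0.08).astype(int)

X = np.column_stack([magnitude, trend, quality, fit_period])
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Baseline: an expert-chosen magnitude threshold (assumed value).
baseline_pred = (X_te[:, 0] < -0.08).astype(int)

# ML+BFAST: RF and SVM trained on all model components.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)).fit(X_tr, y_tr)

for name, pred in [("threshold", baseline_pred),
                   ("RF", rf.predict(X_te)),
                   ("SVM", svm.predict(X_te))]:
    print(name, round(accuracy_score(y_te, pred), 3))
```

With magnitude alone, a well-placed threshold is hard to beat; the ML models gain ground as additional components carry information the single threshold cannot use.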
Interestingly, the most accurate model was the one trained on the all-forest data instead of the forest-type specific models. This seems to contradict previous findings [28,30,31]. However, the smaller sample size in the forest-type specific models may have caused the lower observed accuracies, since reducing the size of an already small training dataset can lower accuracy further [58,59]. Further research is needed to separate the confounding effects of forest type, sample size, and data heterogeneity on the accuracy of the tested models. On the other hand, the importance of forest type as a predictive variable is lower than that of magnitude, trend, and model fitting period; nonetheless, the importance of this variable is relative to the number of predictive variables included in the models. For example, if fewer predictive variables were used, the importance of forest type would be enhanced. Additionally, because our study only included two types of forest over a relatively small area, it remains unclear whether forest type might act as an important variable when a larger variety of forests [31] or other forest types [60] are included.
In comparison with similar studies, our model had slightly lower accuracy [28,31], which may be related to data availability and quality. Experiments with better results have used datasets with higher observation density (e.g., MODIS), harmonized collections from different sensors, or data with quality improved through pre-processing and filtering [31,61,62]. Better results have also been reported when using different indices (e.g., NDMI, NDFI, or multiple spectral indices; [7,28,39,42]), or when combining SAR with optical images [60,63].
Reducing the dimensionality of the feature space before training ML algorithms has been suggested as a necessary preprocessing step to improve model performance [64]. However, because our study used a limited number of predictors, we instead used the variable importance analysis to determine which predictors to include in the simplest models, as they offered the most critical information for the classification. Additionally, to avoid overfitting, the best model was selected as the one with the fewest predictive variables whose accuracy was not significantly different from that of the model with the highest accuracy.
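A minimal sketch of this selection procedure is given below, assuming synthetic data and a one-standard-error parsimony rule as a stand-in for the significance comparison actually used; the variable names only echo the kinds of BFAST components discussed in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
# Synthetic predictors; only the first two actually drive the label.
X = rng.normal(size=(238, 5))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(0.0, 0.3, 238) > 0).astype(int)
names = ["magnitude", "trend", "quality", "fit_period", "forest_type"]

# Rank predictors by RF importance.
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
order = np.argsort(rf.feature_importances_)[::-1]  # most important first

# Refit on the top-k predictors and keep the smallest k whose mean CV
# accuracy stays within one standard error of the best model (parsimony).
scores = {}
for k in range(1, len(names) + 1):
    cv = cross_val_score(
        RandomForestClassifier(n_estimators=300, random_state=0),
        X[:, order[:k]], y, cv=5)
    scores[k] = (cv.mean(), cv.std() / np.sqrt(len(cv)))

best_k = max(scores, key=lambda k: scores[k][0])
best_mean, best_se = scores[best_k]
chosen = min(k for k in scores if scores[k][0] >= best_mean - best_se)
print("kept predictors:", [names[i] for i in order[:chosen]])
```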

Variable Selection
Our results show that magnitude and trend were ranked as the most important variables in all the models. Similar results have been reported previously [30,31]. Since magnitude represents the difference between the observed and predicted NDVI, disturbances with higher magnitude values represent a larger deviation from a stable trend and are therefore more likely to correspond to a correct disturbance detection [29,39]. On the contrary, lower magnitude values are often related to interannual variations or noise and correspond more often to false detections. As for trend, we found that small trend values usually correspond to a long stable period, which facilitates a correct detection, while high trend values (negative or positive) were often fitted either to a short historical data period or to very noisy data (Figure 6). Grogan et al. [31] applied a variable of difference in slope (diff.slope) before and after the breakpoint, together with amplitude and other variables, and found that diff.slope enabled an increase in ROC AUC of 4-0.14% when used with magnitude, depending on forest type. Thus, trend seems to be the second most important BFAST component to aid in disturbance detection, just after magnitude of change.
Data quality and model fitting period were ranked as either the second or third most important variables. Data quality contributed to the accuracy improvement of the RF model with the TF dataset in comparison with the baseline model (86.21% vs. 79.31%). The importance of data quality for disturbance detection has also been reported in [7,60]. In turn, the model fitting period is related to model fitting quality, since models with longer fitting periods are more robust to interannual variation or noise (e.g., Figure 6A,C). Two time series can have the same percentage of 'no-data' values, implying similar data quality, but the one with a longer model fitting period would normally render better results. We interpret that this is the reason why model fitting period, rather than data quality, was selected in the SVM all-forest model.
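The intuition about trend can be illustrated with a simplified stand-in for the BFAST trend component: the OLS slope of NDVI over the model fitting period. The series below are invented, but they show why a long, stable history yields a near-zero slope, whereas a short, noisy fitting period need not.

```python
import numpy as np

def history_trend(t_years, ndvi):
    """OLS slope (NDVI units per year) over the model fitting period.

    A simplified stand-in for the BFAST trend component: a near-zero
    slope suggests a long, stable history, while large slopes (positive
    or negative) often come from short or noisy fitting periods.
    """
    slope, _intercept = np.polyfit(t_years, ndvi, deg=1)
    return slope

# Invented series: 8 years of a stable seasonal signal vs. 1.5 noisy years.
t_long = np.arange(0.0, 8.0, 1 / 12)
stable = (0.7 + 0.1 * np.sin(2 * np.pi * t_long)
          + np.random.default_rng(2).normal(0.0, 0.02, t_long.size))
t_short = np.arange(0.0, 1.5, 1 / 12)
noisy = 0.7 + np.random.default_rng(3).normal(0.0, 0.15, t_short.size)

print(history_trend(t_long, stable), history_trend(t_short, noisy))
```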

Disturbance Area and Rate Estimation
The increase in accuracy of the ML+BFAST compared with the baseline model is small (overall accuracy: 0.14%; area-weighted accuracy: 0.71%). In addition, both the estimated area of forest disturbance and disturbance rate of the ML+BFAST model overlaps with the previous estimate using the baseline model (baseline: 9136.43 ± 5599.49 ha vs. ML+BFAST: 5477.49 ± 1246.90 ha; baseline: 3654.7 ± 2239.8 ha year −1 vs. ML+BFAST: 2191.0 ± 977.6 ha year −1 ) [30]. However, in both cases, the ML+BFAST model has a smaller standard error, which translates to less uncertainty in the estimated disturbed area and disturbance rate. This is particularly relevant for forest disturbance quantification since it helps reduce uncertainties in estimating carbon and GHG emissions and therefore is significant for REDD+ implementation [1,65].
On the other hand, the results show that regardless of the method (i.e., magnitude threshold or ML), most of the forest disturbances remain undetected in the final map (see Table 4, mapped disturbance vs. unbiased disturbance area estimate). Therefore, validation with stratified sampling is essential to estimate the total area of forest disturbance in the study area.
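The unbiased area estimate and its standard error follow the stratified, design-based estimators described in [48]. A minimal sketch, with hypothetical stratum areas and sample counts rather than our actual values:

```python
import numpy as np

def stratified_area_estimate(counts, strata_areas):
    """Unbiased class-area estimates and standard errors from a
    stratified sample (design-based estimator, as in [48]).

    counts: sample confusion matrix, rows = map stratum, cols = reference class.
    strata_areas: mapped area of each stratum (e.g., in ha).
    """
    counts = np.asarray(counts, dtype=float)
    A = np.asarray(strata_areas, dtype=float)
    W = A / A.sum()                           # stratum area weights
    n_i = counts.sum(axis=1)
    phat = counts / n_i[:, None]              # n_ij / n_i
    p_class = (W[:, None] * phat).sum(axis=0)                       # class proportions
    var = ((W[:, None] ** 2) * phat * (1 - phat) / (n_i - 1)[:, None]).sum(axis=0)
    return A.sum() * p_class, A.sum() * np.sqrt(var)                # areas and SEs

# Hypothetical strata: disturbance mapped over 3000 ha, non-disturbance 570,000 ha.
areas, se = stratified_area_estimate([[43, 9], [5, 567]], [3000.0, 570000.0])
```

With these invented numbers, the estimated disturbance area exceeds the mapped disturbance stratum, mirroring the mapped vs. unbiased gap noted above: omission errors sampled inside the huge non-disturbance stratum add substantially to the estimate.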

Study Limitations
Admittedly, the spatial resolution of the Sentinel-2 images might have been too coarse to distinguish fine-scale degradation in the validation process. In these cases, other available imagery, such as Google Earth, could have been more informative; however, the available images were frequently from very different dates, which introduced uncertainty when distinguishing seasonal changes from actual disturbances. Thus, we prioritized temporal agreement between the comparisons over spatial resolution. A direct consequence of this decision is that fine-scale degradation could remain undetected in the validation process; most of the detected disturbances therefore probably correspond to deforestation. Particularly for highly seasonal systems, such as the one studied here, free imagery with high spatial and temporal resolution will be critical to help identify and verify fine-scale disturbances.
Another limitation of the followed approach is that it lacks a temporal label for each detected disturbance, as the final map only classifies areas into disturbance and non-disturbance within the monitoring period. One way to identify the date of each disturbance is to consult the breakpoint date of the corresponding change detected by BFAST.

Conclusions
This study demonstrates that BFAST can benefit from a machine learning approach to increase its accuracy for disturbance detection, particularly using magnitude of change, trend, and model fitting period. Support vector machines achieved the highest accuracy using the all-forest dataset, while random forests outperformed support vector machines in the forest-type specific datasets. We found that expert knowledge outperformed the machine learning algorithms when only magnitude was used as a predictive variable; however, when the number of variables increased, the machine learning algorithms performed better than expert knowledge. Since we used a rather small sample size (238 points), the advantage of expert knowledge vs. machine learning algorithms needs to be tested with larger training datasets to be conclusive. Nevertheless, our results suggest that ML algorithms can be used with BFAST to enhance the accuracy of disturbance detection and to obtain disturbance area estimates with lower errors. Future studies should address whether the advantage of ML+BFAST for disturbance detection depends on sample size or the scale of disturbances.
Author Contributions: Conceptualization, methodology, and data validation: J.V.S. and Y.G., Original draft preparation, review and editing: J.V.S. and Y.G. All authors have read and agreed to the published version of the manuscript.
Acknowledgments: The first author wishes to thank the Consejo Nacional de Ciencia y Tecnología (CONACyT) for providing a PhD scholarship.

Conflicts of Interest:
The authors declare no conflict of interest.
Appendix A

Table A1. The extracted BFAST model components for the 238 training and validation points for tropical dry forest (TDF) and temperate forest (TF). The column "Change" codes disturbances as 1 and non-disturbances as 0.

Table A2. Mean accuracy and 95% confidence interval obtained for each of the constructed models using the RF and SVM algorithms. The models were trained using only random subsets of the training data (not the validation data), with 120 iterations. The models were applied to three datasets (all-forest, TDF, TF), and the variables for each dataset are indicated. The best model, marked by underlined bold font, was selected by balancing accuracy against the number of variables; given similar accuracy, the models with fewer predictive variables were preferred.