Automated Plantation Mapping in Southeast Asia Using MODIS Data and Imperfect Visual Annotations

Expansion of large-scale tree plantations for commodity crop and timber production is a leading cause of tropical deforestation. While automated detection of plantations across large spatial scales and with high temporal resolution is critical to inform policies to reduce deforestation, such mapping is technically challenging. Thus, most available plantation maps rely on visual inspection of imagery, and many of them are limited to small areas for specific years. Here, we present an automated approach, which we call Plantation Analysis by Learning from Multiple Classes (PALM), for mapping plantations on an annual basis using satellite remote sensing data. Due to the heterogeneity of land cover classes, PALM utilizes ensemble learning to simultaneously incorporate training samples from multiple land cover classes over different years. After the ensemble learning, we further improve the performance by post-processing using a Hidden Markov Model. We implement the proposed automated approach using MODIS data in Sumatra and Indonesian Borneo (Kalimantan). To validate the classification, we compare plantations detected using our approach with existing datasets developed through visual interpretation. Based on random sampling and comparison with high-resolution images, the user’s accuracy and producer’s accuracy of our generated map are around 85% and 80% in our study region.


Introduction
In Southeast Asia, expansion of tree crops and managed forests that help meet demand for global commodities has resulted in substantial tropical deforestation [1][2][3]. Replacement of natural forest cover with tree plantations, including tree crops for food (e.g., oil palm), fiber (e.g., pulp and paper), and materials (e.g., rubber), leads to net greenhouse gas (GHG) emissions to the atmosphere [3,4] and negatively impacts local environments, including degradation of biodiversity [5] and water quality [6]. These tree crops are frequently grown at the industrial plantation scale across large contiguous patches by capitalized companies [3], but may also be produced by smallholder farmers in small patchy areas [7].
Due to concerns over these large trade-offs between the environment and human material needs [8], several initiatives across government, industry, and non-governmental organization (NGO) sectors aim to ensure that tropical commodities meet rigorous sustainability standards [9]. The United Nations Collaborative Programme on Reducing Emissions from Deforestation and Forest Degradation in Developing Countries (REDD+), launched in 2008, aims to mitigate climate change through enhanced forest management. Since the mid-2000s, many large international corporations that produce, trade, and sell tropical commodities have committed to zero-deforestation supply chains [10]. The companies have also sought to demonstrate the sustainability of products and supply chains through third-party certifications such as Roundtable on Sustainable Palm Oil (RSPO) certification [11,12], and International Sustainability and Carbon Certification [13]. Concurrently, governments have begun to address commodity-driven deforestation through initiatives such as the Indonesian moratorium on allocation of new commodity concession in forested areas [14]. However, evaluating the effectiveness of these diverse policies depends on the ability to monitor land cover change due to tree commodity expansion [15]. Hence, scalable and timely monitoring of these land uses is essential for understanding whether programs and policies are meeting their stated goals [16,17].
Recent advances in access to and processing of remote sensing data have enabled semi-automated monitoring of high resolution changes in tree canopy cover over regional to global areas [18][19][20]. Concurrent progress has been made in classifying individual crop types, such as soybean and sugarcane in Brazil, using remote sensing data [21,22]. However, differentiating natural forests from tree crops remains a major challenge. Tree crops frequently have spectral properties (e.g., greenness) similar to natural forests [23,24], and can thus be difficult to discern from natural forests based on single-date spectral properties alone. For instance, the most widely used global tree cover loss product [18] defines forests on the basis of structure (tree height and percent canopy cover) and therefore does not differentiate between forest and plantations [25]. Moreover, they often have long rotation times (e.g., oil palms are replanted every 25 to 30 years), so that clearing and replanting may not be detected in a relatively long time-series of remote sensing imagery.
For these reasons, many plantation mapping studies have relied on visual interpretation-which makes use of human expertise to recognize patterns-in the plantation detection process [26][27][28][29]. For instance, Miettinen et al. [30] manually delineated industrial palm plantations based on Landsat 7 and Landsat 8 images. Similarly, Petersen et al. [29] divided Landsat images into a grid of 20 × 20 km and visually scanned each gridded structure for multiple types of plantations with the assistance of forest gain and loss information [18]. Miettinen and Hooijer et al. [27] first conducted clustering on Moderate Resolution Imaging Spectroradiometer (MODIS) satellite images and then manually checked the properties of each cluster. Koh et al. [26] clustered Advanced Land Observation Satellite (ALOS) images, visually interpreted each cluster as basic land cover types, and then classified them into finer-grained land cover types.
While such a manual approach is useful for creating maps for tree crops within small regions or for specific dates, this approach does not scale well to regional or global areas and also has limitations for creating maps at high temporal frequency (e.g., annually). This approach may require multiple observers to delineate tree crop plantations, and observers are likely to be inconsistent with one another [31,32]. Most critically, the human resources needed for large-scale annual digitizing are substantial. Given the challenges of time expense and likelihood of inconsistency of observations, mapping methods heavily reliant on visual interpretation are not feasible for large regions with annual repetition.
As an alternative, automated machine learning-based methods have been applied successfully to map plantation agriculture in the humid tropics. For example, Gutierrez Velez et al. [33] used a combination of MODIS Enhanced Vegetation Index (EVI) time-series and Landsat to detect forest conversion to oil palm in Peru. Automated approaches applied to detect tree crops have commonly utilized thresholding-based approaches [33][34][35], and the nearest neighbor method [36]. These approaches are limited in their capabilities, however, because they do not fully exploit crop-specific changes in vegetation characteristics over time (e.g., land clearing, growth cycles, and phenology specific to each tree crop species).
While the complex high-dimensional feature space in remote sensing data poses a challenge for learning processes [37][38][39], automated approaches have great potential to provide annual tree crop maps over large scales especially if several methods are implemented in these approaches. Specifically, a high temporal approach has been shown to be effective at reducing the fuzziness of boundaries by examining the vegetation's phenology and eliminating cloud-cover pixels [40][41][42].
In addition, there exist varying levels of similarity between different land cover types [43][44][45]. Forests are more like plantations than urban areas. If a simple binary classifier is used to distinguish between plantation and "non-plantation", the classifier is highly likely to confuse plantations with similar land covers (e.g., forests) [46,47]. One method to address this issue is aggregation of similar land cover types into subgroups (aggregated classes) and training of separate classifiers to distinguish between each pair of subgroups. Moreover, the distribution of multiple land covers is often skewed, which necessitates a careful sampling design to collect training data from each subgroup.
Our objective is to automatically generate annual large-scale tree plantation maps. To do so, we introduce an automated approach for annually mapping tree crop plantations, which we call Plantation Analysis by Learning from Multiple Classes (PALM). The method uses a combination of remote sensing data, existing visually-delineated tree plantation maps, and machine learning techniques (ensemble learning, Deep Belief Networks (DBN), Hidden Markov Model (HMM)) to distinguish plantations from other land cover types in Indonesia (Sumatra and Kalimantan).
We use two existing plantation products, the Tree Plantation (TP) and Roundtable on Sustainable Palm Oil (RSPO) datasets, to create training data as an input to PALM. We validate the proposed approach by conducting random sampling to evaluate plantation maps generated by our approach and existing plantation maps available from TP and RSPO using a variety of metrics. Our validation process is based on an independent data source, i.e., high-resolution DigitalGlobe images, which are not used in the training process.

MODIS Data
To map tree plantations, we use the 500 m resolution MODIS Surface Reflectance product MOD09A1 [48], which consists of seven reflectance bands collected by MODIS instruments onboard Terra satellites. The MODIS product is defined on global sinusoidal grids in fixed geolocated 10 × 10 degree tiles and is publicly available from the Land Processes Distributed Active Archive Center [49]. In this product, 8-day composite images are generated from daily images by selecting the per-pixel reflectance values with the least noise (i.e., clouds and missing values) from every 8-day interval. The product provides 46 composite images per year since year 2000. Because the MODIS level-3 data product has filtered multiple factors including aerosoles and cloud, each image contains pixels with no data. We interpolate these missing values using temporally adjacent reflectance values. For example, if the reflectance value for band d is missing at time t, we will look for the available time for this band at previous time steps and following steps and conduct a linear interpolation for the missing value.

Study Region
Our study region includes the land area of the three MODIS tiles h29v09, h29v08 and h28v09 in Kalimantan and Sumatra, as shown in Figure 1. Tile h29v09 ("southern Kalimantan") covers the province of South Kalimantan, as well as parts of West, Central, and East Kalimantan. Tile h29v08 ("northern Kalimantan") covers North Kalimantan and parts of East and Central Kalimantan. Tile h28v09 ("southern Sumatra") covers most of the southern part of Sumatra. These regions are extensively planted with oil palm, rubber, pulp and paper, and coconut palm [50,51]. In Table 1, we summarize the study area in these three MODIS tiles.

Training Data
Our classification method requires training samples to define the locations and the spectral features of various land cover types. Here, we use two training datasets developed by visual interpretation of remote sensing data. We refer to these as Tree Plantation ("TP") [29] and Roundtable on Sustainable Palm Oil ("RSPO") [12] datasets. Both datasets provide complete coverage in our region of study.

Tree Plantation Dataset
The Tree Plantation dataset (TP) was developed through visual interpretation of moderate-and high-resolution satellite imagery, and provides the location of tree plantations in selected tropical countries circa 2013-2014 [29]. In this dataset, plantations are defined as "tree established through planting and/or deliberate seeding of native or introduced species", and include tree crops. In this work, we consider only the major plantation species in our study region, comprised of oil palm, acacia, rubber, and coconut palm [50]. Plantations are further categorized as industrial, medium-sized mosaic, small-sized mosaic, or very young. Random stratified sampling conducted by Perterson et al. [29] found that the user's accuracy of this dataset is 79% (i.e., 79% of the identified tree plantation locations are in fact plantations) and the producer's accuracy is 94% (i.e., 94% of all plantations are identified). This dataset identifies around 60,804 km 2 , 26,242 km 2 and 112,062 km 2 of plantation area in MODIS tile h29v09 (southern Kalimantan), h29v08 (northern Kalimantan), and h28v09 (southern Sumatra), respectively circa 2013-2014.

RSPO Dataset
This dataset was produced by the 2nd Greenhouse Gas Working Group of the Roundtable on Sustainable Palm Oil (RSPO) [12] via visual interpretation of Landsat satellite images to delineate industrial oil palm plantations across Indonesia, Malaysia, and Papua New Guinea in 1990Guinea in , 2000Guinea in , 2005Guinea in , and 2010. In addition, the study categorized each land pixel into 19 land cover types, which were further aggregated into nine higher level classes (Table 2, columns 2 and 3). We also show the estimated area and the number of MODIS pixels for each land cover in southern Kalimantan in Table 3. Although this study did not provide an accuracy assessment, our validation in Section 4.2 reports that the RSPO approach rarely classifies a non-plantation area as plantation (high user's accuracy) but misses many real plantation areas (low producer's accuracy).

PALM Framework
The PALM framework uses samples derived from manually-digitized plantation datasets available for specific years. In particular, we employ an ensemble learning method to identify tree crop plantations. Ensemble learning refers to a set of approaches that jointly utilize multiple models to classify data [52]. Similar ensemble learning techniques have been used to identify land cover types from heterogeneous data, and these studies suggest that such techniques perform better than directly trained individual classifiers [35,53,54]. We use a deep learning model, Deep Belief Networks (DBN), as the classifier in ensemble learning to discover discriminative information from multi-spectral satellite data collected from multiple dates. The success of deep learning [55][56][57][58][59][60][61] in a wide range of applications demonstrates its capacity to learn discriminatively from complex feature space and extract representative features. In ensemble learning, we simultaneously train multiple DBNs to differentiate between each pair of aggregated land cover classes. However, even after using the most suitable classification machinery, classification errors in individual annual maps have a compounding effect in the context of land cover change detection. Therefore, to further improve classification accuracy, we use a post-processing method based on transition characteristics and spatial contiguity of land cover types. In addition, we distinguish different plantation species by conducting a hierarchical classification on the obtained plantation maps. The flow of the PALM framework is shown in Figure 2. There are many possible implementations of these modules. See [62] for the evaluations of different possible implementations for some of these modules (collecting training samples, comparison between ensemble learning and other learning strategies, filtering, post-processing strategies, and different amounts of training data) using imperfect data from TP and RSPO. In the following section we describe the specific implementations that were tested in the context of this study.

Ensemble Learning Method
The RSPO dataset provides training data that are suitable for our ensemble learning method because the dataset has many classes, multiple time points, complete coverage, and high user's accuracy (see Table 11). We first aggregated the RSPO land cover types into several classes to train a learning model. The number of aggregated classes needs to be carefully chosen since both detailed taxonomy (a large number of aggregated classes) and binary classification (aggregated "plantation" and "non-plantation" classes) have limitations. Although a detailed land cover taxonomy can provide the maximum information about land cover change, the existing RSPO maps do not achieve high accuracy on fine-grained land covers, as described in the RSPO report [12]. Conversely, if all the land cover types other than plantation are aggregated into a "non-plantation" class, the high heterogeneity within the "non-plantation" class will reduce learning performance and increase misclassification risk [47]. While distinguishing between plantation and forest is challenging due to similar per-pixel spectral properties, other land cover types such as urban areas and annual crops are typically dissimilar from forests and tree plantations, and are therefore easier to distinguish from plantation. Based on these considerations, we aggregate all land cover types into three classes: "forest", "plantation", and "other" ( Table 2, column 1).
The aggregation facilitates the classification process in that the classifier can capture specific discriminative patterns to distinguish tree plantations with similar classes ("forest" class), and also with dissimilar classes ("other" class). To determine which RSPO land cover types are relatively similar to plantations and belong to the aggregated "forest" class, we conduct pairwise binary classification between plantation and every land cover type within "disturbed forest" and "undisturbed forest". Then, we aggregate the land cover types which result in a relatively low classification accuracy into the "forest" class.
Given the aggregated classes, we train three binary classifiers using a one vs. one strategy: plantation versus forest (P-F), forest versus other (F-O), and other versus plantation (O-P). In this way, each classifier focuses on exploiting the discriminative knowledge between a specific pair of classes while ignoring the other class. This learning strategy [63] is thought to reduce data heterogeneity from multiple classes and improve the learning performance. After training the three classifiers separately, we then aggregate the predicted results from each classifier to make a final classification decision based on majority voting. Since each binary classifier focuses on differentiating between a specific pair of classes, eight possible combinations represent potential outcomes from the three classifiers. Based on the separate predictions from each classifier, we assign the aggregated prediction result as the majority class label. For instance, if both P-F and O-P classifiers predict a test location as "plantation", then we will label this test location as "plantation" regardless of the prediction of F-O classifier. When the three binary classifiers generate mutually different labels, the test sample will be assigned to the "Unknown" (U) class. This "Unknown" class is handled through our post-processing step as discussed in Section 3.1.5.

Collecting Training Samples
Aggregated land cover classes derived from the RSPO dataset contain multiple land cover types, and relative area under each land cover type varies widely. For instance, in the RSPO dataset, settlement area is much smaller than grassland area. Classification algorithms may over-emphasize the classes with larger population of samples [64]. If we uniformly sampled from each aggregated non-plantation land cover class for supervised classification, the resulting map would likely be dominated by the land cover types with the largest relative area. As a result, minority classes might be misclassified. To address this issue, we randomly sample equal numbers of pixels from each sub-class within the aggregated "forest" and "other" classes.
The spectral features of each land cover can be different in different years because of changing climate conditions (e.g., precipitation, sunlight). Due to such temporal heterogeneity, a model developed for a specific year can result in a poor prediction when applied to another year. To overcome this limitation, we sample each pixel for every year for which we can be confident about its label. This enables the training of a global model that can be applied over an entire observation period. Specifically, according to the availability of RSPO data, we first divide the 2001 to 2014 period into three intervals. Since the quality of labeled data can impact the performance of the machine learning algorithm, we further improve the confidence of selected samples used for training. We utilize the TP dataset to prune less confident samples from this set. If the RSPO dataset identifies a plantation pixel that is not part of the TP dataset, this indicates a potential false positive in the RSPO dataset since the TP dataset includes the clear majority of true plantations. Hence, we keep only those plantation samples that are also labeled as "plantation" in the TP dataset. We also use the TP dataset to prune the confident samples for the "forest" class and "other" class, keeping only those confident samples that are not identified as plantations in the TP dataset.
Since the TP dataset and RSPO dataset were created using satellite imagery at a higher spatial resolution than MODIS pixel, there exist mixed MODIS pixels along the boundaries of each land cover. To ensure that the training process only involves high-quality training data, we exclude these mixed pixels from the classification model.

Learning Model
While ensemble learning assists with distinguishing tree plantations with multiple other land cover types, we still need to carefully choose the learning model that distinguishes between aggregated classes to ensure high classification performance. After gathering training samples, we train a Deep Belief Network (DBN) [60] to learn each of these three binary classifiers. DBNs are effective in exploiting the latent relationship among input features and extracting high-level representative features via the unsupervised pre-training phase and the supervised fine-tuning process, and thus are an excellent tool for such land cover classification. As mentioned in Section 3.1.1, the samples of each class can be associated with multiple land cover types, which requires that the model effectively transforms the spectral features into a subspace where the land cover types within each aggregated class are close and those from different classes are distinct from one another. We take advantage of the spectral and temporal richness of our data by exploiting reflectance values from all seven MODIS bands, and multiple available images each year [58,59], to determine land cover.
The main components of a DBN are Restricted Boltzmann Machines (RBMs). An RBM is an undirected graphical model structured as a fully connected bipartite graph between two layers of binary variables, visible variables V ∈ R N , and hidden variables H ∈ R M (Figure 3). In our problem, V represents concatenation of reflectance values over multiple composite images in a given year, which has the length of N = N b × N d . N b and N d denote the number of bands and the number of composite images in a year, respectively. The joint distribution of V and H is defined by an energy-based distribution [65]. While RBMs are capable of extracting latent features, a single RBM may fail to reveal truly discriminative features due to the limited ways of combining input features [66]. Therefore, multiple RBM layers are usually stacked to form a DBN. The DBN enables the learning of more representative features since the number of combinations increases exponentially with the number of layers. The DBN model can be trained using Gibbs sampling (alternating sampling of visible and latent variables) in a greedy (layer-wise) fashion [60]. The objective of DBN training is to automatically extract representative features from a large volume of high-dimensional data. After training, the learned DBN model can be folded into standard deep neural networks for fine-tuning processes to fit the training data [60].
Here, we train three DBN models separately to discriminate between each pair of aggregate classes (i.e., plantation-forest, forest-other, other-plantation). We feed each DBN model with the concatenation of seven-band spectral features collected for 46 dates of a year, and it outputs a class label for every pixel every year. For each DBN model, we adopt a stacked four-layer structure with 158, 64, and 20 hidden variables, respectively (The last layer outputs the class label).

Filtering
During the analysis of our ensemble classification machinery, we found a set of locations that were labeled as bare soil by RSPO but that were mistaken for plantation by our classification approach. This bare soil class refers to bare rock, gravel, sand, silt, clay, or other exposed soil, and often includes recently cleared (deforested) areas, landscapes impacted by fire and portions of estates undergoing replanting procedures [12]. Some of these locations may potentially be converted into plantations in future. Therefore, they can be confused with true new plantations as they share similar characteristics for early stage of plantations, e.g., road networks in plantations.
While most bare soil locations and plantations in Southeast Asia can be distinguished based on phenology, another important reason for the classification error is the lack of such training data (e.g., deforested area with road networks) used for three-class classification (see Section 3.1.1).
Even though we collect equal numbers of samples from each land cover type to obtain a rich training set, samples within a given land cover type are selected randomly. Although these locations allow better class separability, they are rarely present in the bare soil class. When these limited samples are mixed with other land cover types in the "other" class, the training process can be dominated by the other samples that are more distinguishable from plantations.
To solve this problem, we train an additional classifier that differentiates between bare soil and plantation (B-P). The separate training of this B-P classifier ensures that we have sufficient samples for both classes, which allows deep learning models to automatically extract proper spectral metrics that reflect temporal differences in reflectance. Then, we utilize this classifier to filter the detected plantation locations obtained from the three-class ensemble classification step. Specifically, we apply the B-P classifier only to locations that are classified as plantation by the three-class ensemble classification model. In this way,we remove bare soil locations that look similar to plantations.

Post-Processing
Due to the natural disturbance and the temporal variation in remote sensing data, conducting separate annual predictions results in classification errors and inconsistencies among years. For instance, a pixel may be classified as {plantation, plantation, plantation, forest, plantation, plantation} for six consecutive years. Here, the "forest" label is highly likely to be a classification error because plantations in Southeast Asia are rarely converted back to forest. Indeed, every location labeled as plantation in 2001 and 2005 in the RSPO dataset remains as a plantation in following plantation maps.
In general, different land cover types have different transition or conversion characteristics across the years. To capture this transition relationship and reduce the interannual classification errors, we utilize a Hidden Markov Model (HMM), which models the transition probability among latent states by a transition matrix T and the mapping between the latent state and the observed class by an emission matrix E. In particular, T ij represents the transition probability from state i to state j, and E ik denotes the emission probability from state i to the observed class k.
In our post-processing method, each latent state in the HMM represents a real land cover type from {"plantation", "forest", "other"}, and each observed class k belongs to plantation, forest, other, or unknown class. We initialize the transition matrix and emission matrix using the visually delineated RSPO land cover dataset. We first interpolate maps in each year from the RSPO dataset which is T ij is initialized as the proportion of locations in land cover i at any year to be converted to land cover j at the next year. E ik is initialized as the proportion of pixels in land cover type i that are classified as class k ∈ {"plantation", "forest", "other", "unknown"}. In this way, the emission matrix E can capture the confusion between land cover classes. Since our assumption of constant conversion rate is not accurate, this initialization can only serve as an approximation to the transition probability between consecutive years and the emission probability. However, the transition matrix T and emission matrix E are further tuned in the training process of HMM [67] so that they can precisely capture the yearly transition relationships. With the obtained transition matrix and emission matrix, we improve the yearly prediction for each location via the Viterbi algorithm [68].
We also conduct spatial post-processing. New plantations are frequently developed near existing plantations due to improved access and other factors provided by previous developments. Therefore, we implement a spatial filtering process using the following steps. First, we conduct spatial clustering by finding all the connected components of plantation locations. For each cluster, if the cluster size is less than a threshold δ, we will consider this cluster as false positive and remove it. In our implementation, we set δ = 30, which is equivalent to a minimum area of 7.5 km 2 . To determine this threshold, we select a set of confident locations in 2014 which are labeled as plantations by both TP and RSPO. Since these locations are selected conservatively, the failure to detect any of these locations can lead to omission errors. At the same time, since the TP dataset has high producer's accuracy [29], any locations that are labeled as plantations by PALM but not by the TP dataset can be commission errors. As we increase the threshold δ, we keep track of these estimated omission errors and commission errors. The selected threshold of δ = 30 significantly reduces commission errors with a limited increase of omission errors. Because this procedure removes small (<750 ha) patches, while mean smallholder farm size in Indonesia is <1 ha [69], this post processing likely removes most independent smallholder farms that are not contiguous with industrial scale plantations.

Hierarchical Classification
Having obtained annual plantation maps using the method described above, we further conduct hierarchical classification using another DBN model to distinguish major plantation species in our study region (i.e., oil palm, acacia, rubber, and coconut palm). We take the manually annotated locations for these four species from TP dataset. The hierarchical classification is only conducted for the last year (i.e., 2014) of our study period. We utilize the yearly plantation maps to infer the locations of different species in previous years. For example, if a location converts to plantation in 2006 according to our generated annual plantation maps, and it is classified as oil palm in 2014, then it is identified as oil palm for every year from 2006-2014. Our method does not account for the rare situations where producers switch plantation species across years.

Validation of Plantation Maps
To evaluate the quality of our maps, we visually inspect a set of randomly selected samples using DigitalGlobe high-resolution imagery. We propose a random sampling-based validation approach which is beyond the sampling approaches typically used to measure user's and producer's accuracy.
Producer's accuracy validation: Existing studies have commonly sampled pixels outside of classified plantations to measure producer's accuracy. However, this validation approach is inefficient in that only an extremely small portion of selected samples can be true plantations assuming that the proposed method performs reasonably well in plantation mapping. For this reason, we measure producer's accuracy following the method proposed in [70], which biases the sampling process towards locations that are more likely to be true plantations. This is helpful in obtaining a sufficient number of true plantation samples to robustly estimate omission errors in generated plantation maps.
We randomly sample a set of locations around plantation mill locations (http://data. globalforestwatch.org/datasets/ed8d5951b2a4482a9e62c4fe0bc23b5f$_$27) in each MODIS tile, and then we check the real plantation locations from the sampled locations using DigitalGlobe in 2014 via visual interpretation [1]. Within selected real plantations, we then measure the fraction of locations detected by PALM, TP, and RSPO. This analysis provides insight into the producer's accuracy of these products.
User's accuracy validation: To better understand the user's accuracy of our product relative to TP and RSPO, we individually sample each major difference region, as described below (Figure 4): • R0-locations labeled as plantations by PALM, TP, and RSPO.
• R1-locations labeled as plantations by PALM and TP, but not by RSPO.
• R2-locations labeled as plantations by TP, but not by PALM and RSPO.
• R3-locations labeled as plantations by PALM, but not by TP. This includes both the locations detected by RSPO and the locations not detected by RSPO. Only a few locations in R3 are detected by RSPO.
• R4-locations labeled as plantations by RSPO, but not by PALM. This includes both the locations detected by TP and the locations not detected by TP.
This sampling strategy assists in better investigating the difference between our detected plantation locations (from 2001 to 2014) and these two available datasets. Here, we use on the RSPO map on 2010, which is almost a subset of TP (91% of the RSPO map is included in TP in our entire study region). We quantify the accuracy of our analysis by comparing randomly selected samples from each difference region to high-resolution images DigitalGlobe (https://www.digitalglobe.com/) in 2014. We take 200-300 samples from each difference region, and visually interpret high-resolution DigitalGlobe images [1]. We then compare the results of our visual interpretation with plantation maps to obtain the fraction of samples that are real plantations, which is also referred to as the confidence.
Based on the quantified results of {R0, R1, R2, R3, R4}, we then estimate the user's accuracy for the PALM, TP, and RSPO. According to the obtained user's accuracy and producer's accuracy, as well as the area of detected plantations, we estimate the overall accuracy and Cohen's κ coefficient for PALM, TP, and RSPO. Finally, to illustrate differences between PALM, TP, and RSPO plantation maps, we conduct case studies by overlaying regions R0-R4 on high spatial resolution DigitalGlobe imagery.

Plantation Map and Basic Statistics
Our classification suggests that, in 2001, southern and eastern Kalimantan and southern Sumatra contained around 81 × 10 3 km 2 of tree plantations, which include oil palm, pulp and paper, rubber, and coconut palm plantations. By 2014, plantations extent had almost doubled to 152 × 10 3 km 2 (Figure 5a). We also show the spatial distribution of PALM, TP, and RSPO in Figure 5b. In Figure 6, we show the growing area of plantations in different MODIS tiles. Kalimantan had substantially greater plantation expansion rates (9.57% and 7.29% per year in southern Kalimantan and northern Kalimantan, respectively) than southern Sumatra (2.27% per year, Figure 6).
(a)  Across the study region, in 2014, our maps suggest that around 59% of all plantation area was oil palm, 18% rubber, 16% timber (acacia), and 7% coconut palm. Here, we show the major species in each tile in Tables 4-6. According to our results, almost all the detected plantations in northern Kalimantan are oil palm. Oil palm and acacia expanded most rapidly across all regions, increasing by 6% and 5% per year, respectively from 2001 to 2014.

Validation for Producer's Accuracy
Our producer's accuracy assessment, generated from a random sample of around 1000 pixels around oil palm plantation mills in each MODIS tile, suggests that both PALM and TP have reasonably good coverage of true plantations, whereas RSPO has much lower plantation detection for all three MODIS tiles (Table 7). RSPO accounts for only around 25% of sampled true plantations for MODIS tile h28v09 (southern Sumatra), which has the largest plantation area. Table 7. Producer's accuracy for PALM, TP, and RSPO derived from a sample of detected plantations around plantation mills in Indonesia. We report the total number of locations in each category, as well as the percent of true plantations detected by each data product (i.e., the producer's accuracy).

Validation for User's Accuracy
To estimate user's accuracy for PALM and existing products TP and RSPO, we analyze the five difference regions {R0, R1, R2, R3, R4} defined in Section 3.2. Our user's accuracy assessment (Tables 8-10) suggests that our approach outperforms the TP dataset, but has slightly lower performance than the RSPO dataset (Table 11). All sampled locations from R0 are identified as real plantations from high-resolution imagery. About 81%-84% of locations labeled as plantations by PALM and TP, but not by RSPO (R1) are real plantations as identified from high-resolution optical imagery. In contrast, just 20%-40% of pixels labeled as plantations by TP, but not by PALM and RSPO (R2) are real locations. Moreover, although RSPO has high user's accuracy and TP has high producer's accuracy, 40%-58% of pixels labeled as plantations by PALM, but not by TP (R3) are real plantations while 52%-63% of pixels labeled as plantations by RSPO, but not by PALM (R4) are false positives.

Overall Accuracy
Given the estimated user's accuracy and producer's accuracy, we can estimate the overall accuracy based on the area of detected plantations and the area of the entire study region. Across all study regions, PALM had a 94% overall accuracy, compared to 89% for TP and 84% for RSPO (Table 12). PALM also had higher κ coefficient than TP and RSPO (Table 13). These differences were especially pronounced in southern Sumatra. There commonly exists a trade-off between user's accuracy and producer's accuracy. For instance, the RSPO's conservative classification strategy resulted in higher user's accuracy but missed many real plantations, and therefore had lower producer's accuracy. In contrast, the TP classification covered a much larger plantation area and had higher producer's accuracy, but mislabeled many non-plantation area as plantations, and yielded lower user's accuracy. Compared with the TP and RSPO datasets, PALM leads to a better balance between the user's accuracy and producer's accuracy

Case Studies of Model Performance
Our case studies provide visual illustration of differences between datasets presented here (Figure 7), and cases where PALM outperforms other approaches and datasets.
In Figure 7a, the red region (R1) detected by PALM and TP is a real plantation as confirmed by the DigitalGlobe image, but is missing from the RSPO dataset. In Figure 7b, the blue region (R2) included as plantation by the TP dataset was not a real plantation, and was not picked up by PALM. From the high-resolution image in Figure 7c, we observe that PALM detects the boundary between real plantation and non-plantation area (red), while the TP dataset mislabels nearby non-plantation area as plantations (blue). Figure 7d provides an example of plantations that are detected by our approach but are missed by the TP dataset (yellow, R3). Finally, Figure 7e shows a case where RSPO incorrectly identifies a plantation but PALM correctly indicates no plantation (green, R4), adjacent to plantations correctly identified by all three datasets (magenta, R0).

Discussion
Our study evaluated the accuracy of an automated plantation mapping approach (PALM) using MODIS satellite data in Southeast Asia. Compared to manually-digitized plantation datasets, PALM produced plantation maps with greater overall accuracy (5% higher overall accuracy than the TP dataset and 10% than the RSPO dataset) and higher temporal resolution (annual). Our maps suggest that the extent of industrial-scale tree plantations in the study region is 58% greater than the RSPO dataset in 2010 and 36% smaller than the TP dataset in 2014. The proposed method focuses on capturing contiguous or large-scale plantations, which account for over half of oil palm plantations and almost all of pulp and paper according to [29]. These generated maps are useful for several purposes, including annual assessments of tree plantation expansion.

Smallholder Tree Plantations
The RSPO dataset that we used to train our model focused on mapping large-scale oil palm plantations and excluded most smallholder oil palm farms [12]. Indeed, validation suggests that the RSPO dataset omitted the most plantation area in southern Sumatra, where smallholder oil palm is more prevalent than in Kalimantan. In addition, we chose to threshold our plantation maps such that the minimum patch size was 7.5 km 2 (750 ha), substantially larger than any definition of a "smallholder" farm in Indonesia. For instance, the Indonesian Ministry of Agriculture sets this limit at 25 hectares for farmers growing plantation crops [71]. Thus, our classification methodology likely excluded a portion of the independent smallholder tree plantation area that was visually dissimilar to the RSPO-identified plantation classes, or that occurred in patches <750 ha. According to the Indonesian Central Bureau of Statistics (BPS), in 2015 around 40% of oil palm and 85% of rubber plantation area is under smallholders, such that around 50% of the total area of these crops is held by smallholders [72]. Figures are not available for coconut palm or acacia, but most coconut palm is likely smallholder-run, while most pulp and paper is expected to be grown on an industrial scale. In contrast, the methods used to detect plantations in the TP dataset included identification of "small" plantations, with patches less than 10 hectares [29]. While the 30-meter resolution Landsat data used for their plantation identification are not always able to distinguish small patchy tree plantations from other land uses, the relatively greater producers' accuracy of the TP dataset may be partially explained by the greater focus on smallholder tree plantations. Given increased focus on smallholder land use practices by the Indonesian government (e.g., ISPO) as well as corporations trading tree crop commodities (e.g., zero-deforestation pledges that reference smallholders), adapting PALM to better capture smallholder plantation expansion would be a critical next step to evaluate smallholder-linked land use dynamics.

Tree Plantation Species-Specific Mapping
To our knowledge, this is the first automated species-specific mapping effort for tree plantations in Indonesia across large spatial and temporal scales. Species-specific mapping is important because it allows identification of the commodity driving land cover change, and enables additional research into questions of legality, forest conversion, yields, land tenure, and effects of changing climate on tree plantations. However, our species-specific mapping was heavily limited by our inability to accurately validate our species-level maps. Distinguishing between different plantation species can be challenging even with high spatial resolution DigitialGlobe data when trees are recently planted, and requires considerable expertise in regional land use or an extensive field dataset on the locations of the tree plantations in question. For this research, we therefore chose not to validate our species mapping, but this is a clear next step for this research.
Our sampling process for producer's accuracy validation is biased towards oil palm plantations because the mills dataset includes only palm oil facilities. Given that rubber, acacia, and coconut palm represent major land covers in Indonesia (BPS Indonesia [50]), a sampling process that makes use of other datasets such as concession boundaries for oil palm, timber, and rubber plantations, or government census or survey data on the cultivated land area for each of these species, would be needed to address this limitation. Our approach was also limited in its ability to map changes in species over time. In the hierarchical classification process, we assume that plantation species are constant at the same location over years. Since 2012, the price of rubber has declined in comparison to that of oil palm [73], suggesting that conversion from rubber to oil palm may be occurring, a dynamic that we do not track in the current PALM iteration.
Looking forward, a potential solution is to utilize tree structural information to distinguish different plantation species (e.g., rubber as a broad-leaved species may be structurally different with other species, while coconut palm trees usually grow taller than oil palm trees). Indeed, previous research has had success in mapping mature oil palm using structural information from radar data [74]. In addition, the spectral information available at higher resolution, e.g., Landsat 8 or Sentinel-2 satellites, could also help to better distinguish among species. These datasets could be incorporated into PALM by merging the extracted features from deep learning models at intermediate layers, as discussed in previous work [75].

Classification Model
Compared with other classification models, our selected deep learning approach has two main advantages. First, deep learning methods have shown superior performance in learning from high-dimensional features given large volume of data [76]. The combination of MODIS data with TP and RSPO datasets generates large volume of MODIS multi-spectral data collected on multiple dates, and therefore has high dimensionality. Second, training the DBN enables learning representative features (e.g., spectral properties) from unlabeled data (i.e., years for which training data are not available). Since the TP and RSPO datasets are only available in specific years, we are not certain about the land use classes for many locations in other years even after using the sampling method mentioned in Section 3.1.2. However, the training of the DBN can still make use of these data to extract representative features while the fine-tuning process only focuses on the selected labeled samples.
We do not adopt popular ensemble methods including bagging-based algorithms such as Random Forests [77] or boosting-based algorithms such as Adaboost [78] because these methods cannot use unlabeled data to learning representative features. Deep learning methods also require less feature engineering [79] than these approaches. Moreover, the boosting methods are sensitive to mislabeled samples in training data [80], which can be caused by observers' mistakes in the manual labeling process.
Another promising approach to land cover classification problems like the one presented here is Fully Convolutional Networks (FCN), which have been used for pixel-wise labeling in computer vision applications [81,82]. FCN is directly trained on the entire image and is thus capable of capturing the spatial correlations in the classification process. In contrast, the PALM approach utilizes a spatial post-processing step to filter potential commission errors, but may misclassify individual pixels with noisy data. However, it is extremely inefficient to train an FCN model directly across a large spatial extent with high dimensional features. One potential solution is to extract spatial context information as a separate step, as discussed in [58,59]. The extracted features could then be combined with PALM for the pixel-wise training.
It is noteworthy that the model training process focuses on the difference region R0 and R4, since we use the RSPO and TP datasets to select samples. However, validation comprised a much larger area including R0-R4. Moreover, the labeling information used in training process is based on the RSPO and TP datasets, while the classification is validated through visual inspection using high-resolution images. Therefore, the overall classification performance provided in Tables 12 and 13 is not directly impacted by the potential overlap between training and testing data.

Limitations
While this method performs better than manual plantation delineation, it remains limited in terms of validation, classification, and imagery inputs, which need to be addressed in future work. The first limitation lies in the visual validation process using DigitalGlobe. In this study, we determine a pixel to be plantation if it shows obvious plantation characteristics (e.g., the context of surrounding pixels, alignment structure, crossing road networks). However, some plantations may not be easily identified visually (e.g., due to their advanced age and associated high tree cover) or due to limitations of our expertise. Another assumption that underlies our evaluation relative to RSPO (as shown in Figure 7e) is that a location that is a plantation in 2010 is not converted back to other land cover (e.g., forests) in 2014. This assumption is consistent with the fact that each location that has been classified as plantation in any of the RSPO maps is also labeled as a plantation in each of the future RSPO maps. However, this assumption, if incorrect for some locations, could inflate the errors reported for RSPO.
Since the RSPO dataset is only available through 2010, the selection of training data after 2010 depends on the RSPO land cover class in 2010 and subsequent (lack of) changes in EVI. The assumption of stable land cover under a stable EVI potentially introduces noisy training data and consequently poor classification after 2010. One potential solution is to incorporate the Global Forest Change dataset [18] to further increase certainty about locations with stable post-2010 land covers which can be used as samples.
Since the training is directly conducted on a collection of samples from multiple years, the classification performance can be degraded for the years that have different weather conditions to other years. A potential solution is to study the weather conditions (e.g., rainfall) for each year and train classifiers separately for each type of weather conditions (e.g., wet years vs. dry years).
According to our validation, the RSPO dataset has low producer's accuracy. One potential reason is that many new plantations were developed after 2010, which conforms to our classification results ( Figure 6). We also manually verify this source of error using Google Timelapse, which mainly relies on Landsat images [83]. We randomly select 20 plantation patches newly detected from 2011 to 2013 by PALM but that are not detected by the RSPO dataset. According to this analysis, every selected plantation patch was developed after 2010.
Our analysis was also limited by the resolution of the MODIS data. While the high resolution of Landsat data (30 m) and Sentinel data (10 m) offer potential to map plantations more accurately, the low temporal frequency of Landsat (16 day) and Sentinel (10 day) makes it hard to find images with little noise (e.g., clouds). Learning approaches that are more robust to the noise and the multi-scale learning framework that combines Landsat, Sentinel and MODIS can help address this limitation.

Conclusions
Here, we present an ensemble learning method to map plantation areas. We aggregate the land cover types defined by Roundtable on Sustainable Palm Oil and propose to simultaneously sample from multiple land cover types. To learn each individual classifier, we utilize Deep Belief Networks to extract representative information from spectral features over multiple days. Furthermore, we utilize HMM and spatial information to post-process the result. The use of large-scale remote sensing data provides an advantage for the learning process, and our test in Indonesia shows that (1) the method can automatically generate high-quality annual plantation maps, and (2) the detected plantation map achieves a better balance of user's accuracy and producer's accuracy than either TP or RSPO maps. Besides the detection of large-scale tree plantations, PALM also has the potential to discern specific plantation species. Since this method generates annual plantation maps, it has great potential to support mapping and monitoring of commodity crop expansion in the tropics.