Analyzing the Effects of Temporal Resolution and Classification Confidence for Modeling Land Cover Change with Long Short-Term Memory Networks

Land cover change (LCC) is typically characterized by infrequent changes over space and time. Data-driven methods such as deep learning (DL) approaches have proven effective in many domains for predictive and classification tasks. When applied to geospatial data, sequential DL methods such as long short-term memory (LSTM) have yielded promising results in remote sensing and GIScience studies. However, the characteristics of geospatial datasets selected for use with these methods have demonstrated important implications on method performance. The number of data layers available, the rate of LCC, and inherent errors resulting from classification procedures are expected to influence model performance. Yet, it is unknown how these can affect compatibility with the LSTM method. As such, the main objective of this study is to explore the capacity of LSTM to forecast patterns that have emerged from LCC dynamics given varying temporal resolutions, persistent land cover classes, and auxiliary data layers pertaining to classification confidence. Stacked LSTM modeling approaches are applied to 17-year MODIS land cover datasets focused on the province of British Columbia, Canada. This geospatial data is reclassified to four major land cover (LC) classes during pre-processing procedures. The evaluation considers the dataset at variable temporal resolutions to demonstrate the significance of geospatial data characteristics on LSTM method performance in several scenarios. Results indicate that LSTM can be utilized for forecasting LCC patterns when there are few limitations on temporal intervals of the datasets provided. Likewise, this study demonstrates improved performance measures when there are classes that do not change. Furthermore, providing classification confidence data as ancillary input also demonstrated improved results when the number of timesteps or temporal resolution is limited. This study contributes to future applications of DL and LSTM methods for forecasting LCC.


Introduction
Land cover changes (LCCs) are typically slow changes occurring across Earth's surface over long periods of time [1].The aggregated effects of changes have global implications, contributing to local, regional, and global climate changes, loss of biodiversity, and, ultimately, disturbing the capacity of systems to sustain people [2,3].Therefore, analyzing and representing LCC processes are important tasks in various disciplines such as geography [4], hydrology [5], and climatology [6].
LCC has been previously assessed at local, regional, and global extents [7].LCC has also been linked to changes in precipitation, air temperature, and ecology at local and regional scales.Furthermore, LCC studies conducted at a global scale have assessed the cumulative implications of Remote Sens. 2019, 11, 2784 2 of 27 land change processes such as urban growth and deforestation [8].Addressing LCC from a top-down perspective, data-driven modeling tactics enable the extraction and detections of patterns that have resulted from local interactions [9].Top-down approaches are primarily focused on overall patterns that result from processes, utilizing satellite and aggregated data sources such as Census data to obtain rates of land change over time.
Previous studies have introduced and assessed methodologies for forecasting LCC.Such methods have included cellular, agent-based, data-driven, and hybrid modeling approaches [10].In particular, data-driven methods including machine learning (ML) approaches have been increasingly considered for LCC forecasting [11].The goal of data-driven methodologies is for models to "learn" patterns existing in datasets while reducing the manual operations required to utilize the method [12].Applied to LC datasets, the aim is to use these automated, statistical methods to identify and analyze spatial patterns that have resulted from underlying processes of LCC over time.ML methods employed in LC simulations and assessments have previously included neural networks (NNs) [13], decision trees (DTs), and support vector machines (SVMs) [14].
While ML approaches have demonstrated promising results for forecasting and detecting LCC, challenges include the infrequence of LC changes and obtaining appropriate labeled training datasets [15].Recent advances in a subfield of ML called deep learning (DL) have demonstrated the capacity of these increasingly complex models.Such models exhibit aptness for learning more complicated relationships existing in training datasets while simultaneously achieving substantial improvements in predictive performance measures.DL approaches such as recurrent neural networks (RNNs) are best suited for sequential or timeseries data [16].An improved RNN architecture called long short-term memory (LSTM) has garnered increased attention for geospatial applications, with demonstrated aptitude to forecast LCC [17] and to classify LC [18] by leveraging patterns obtained from timeseries data.
LSTM approaches have been previously utilized for simulating patterns resulting from dynamic geospatial systems [19][20][21][22].The method has been leveraged to learn LC changes in transfer learning applications [19].It proved effective in learning changes in binary and multi-class LC datasets created for three variable-sized study areas focused on different cities in China.LSTM has also been employed to reveal natural disturbances such as fires and floods, or human disturbances such as deforestation and urban growth, using satellite image timeseries datasets from a moderate-resolution imaging spectroradiometer (MODIS) data product [20].LSTM has also been effectively utilized in shortand long-term forecasts of sea surface temperature [21].Using geographic datasets pertaining to weather, pollution, and influenza spread, LSTM has been successfully applied in prediction of influenza propagation in the state of Georgia in the United States as well [22].
Though prior research has made evident that LSTM is useful for geospatial applications, the implications of varying temporal resolution have not yet been applied to real-world LC datasets.The goal of this study is to evaluate the effectiveness of LSTM networks for LCC forecasting considering actual and hybrid datasets, where hybrid datasets are created by integrating an actual dataset with a hypothetical persistent class.Considering variable geospatial properties such as temporal resolution and the number of data layers obtainable to train LSTM models, the aim is to quantify the repercussions of changing temporal resolution of LC datasets on quality of forecasts produced.Real-world LC datasets feature inevitable classification errors, further impacting method performance.In addition to changing temporal resolution, this study intends to demonstrate the impacts of persistent LC classes and to determine if improved performance can be obtained using available classification confidence layers.
However, the effectiveness of these methods to forecast or classify change is impacted by the characteristics of the geospatial dataset selected.In this research study, a sensitivity analysis is used to explore LSTM response to changing geospatial data characteristics to evaluate the impact of increasing temporal resolution.It is hypothesized that method performance will be impeded by coarser temporal resolutions.Thus, the main objective of this study is to evaluate the effectiveness of data-driven approaches such as LSTM for LCC applications where the amount of change is typically small or occurring at slow rates over long periods of time.Additionally, this study demonstrates the implications of the characteristics of geospatial datasets selected for use with this method, namely temporal resolution and influence of persistent classes.Such an assessment of the method's sensitivity to temporal resolution has not yet been conducted in the presence of classification errors that exist in real-world datasets.
It is expected that results will demonstrate that coarser temporal resolutions will have negative repercussions on method performance.Likewise, it is expected that erroneous values due to classification procedures will impact the results.To further observe the effects of potentially inaccurate values, hybrid data scenarios are also created by adapting the original LC dataset to maintain one persistent class across all data layers.The classification confidence layer associated with the original LC data product will be incorporated to a secondary modeling approach applied to scenarios involving real-world and hybrid datasets.By including this ancillary data layer, the aim is to reduce epistemic imperfection and determine if improvements can be obtained by indicating potential at each cell to contain erroneous values [23].Thus, the three main objectives of this study are; (1) to evaluate the sensitivity of the method to varying temporal resolutions, (2) to assess the implications of persistent LC classes in hybrid scenarios, and (3) to examine potential improvements by providing classification confidence layers as model input in scenarios utilizing real-world and hybrid datasets.

Recurrent Neural Networks
RNNs were introduced as a variant of traditional NN modeling approaches, devised for sequential data [16].To do this, a recurrent connection was added to traditional neurons, allowing information from previous elements in timeseries data to be propagated when weight updates occur for subsequent elements observed by the network.The recurrent connection allowed information to be propagated through the entire timeseries input, with weights being updated with respect to the entire timeseries input.However, a major problem arose when these networks were utilized to learn dependencies across long input sequences.For instance, information pertaining to critical features from early in a sequence could not be connected to data elements occurring later in the sequence.This phenomenon ensues from the vanishing (or exploding) gradient problem [24].This implies that network weights tend toward either very small (vanishing) or very large (exploding) values, negating the ability of the network to learn important information as weight updates become infinitesimal or massive.

Long Short-Term Memory
LSTM is an improved variation of the traditional RNN architecture [16].With internal memory cells and gating functions controlling the propagation of information through a unit, LSTMs have proven capable in mitigating the effects of the vanishing gradient problem that was detrimental to earlier RNN implementations [25].The input gate (i t ), forget gate (f t ), output gate (o t ), and input modulation gate (g t ) control the propagation of new information in, within, or out of the unit to be considered with the next element in the input sequence (Figure 1).While other LSTM variants exist, it has been determined in large-scale studies that the standard LSTM architecture performance remains most effective [26].The following equations have been obtained from Donahue et al. [27].[28] and [29]).
When an input sequence is provided, (x0, x1, …, xt-1, xt), a hidden state, ht-1, is either initialized at the start of the sequence or propagated from a computation considering a previous input sequence element.The amount of data from the next element in the input sequence, xt, and how much of the hidden state from the previous timestep, ht-1, should be committed to the internal memory cell ct is determined by the input gate (it).How much information propagated through this gate is specified as follows: where the sigmoid function () limits resulting values to the range of (0,1).Wxi and Whi are learnable weight matrix parameters and bi is a learnable bias parameter [30].
Next, a critical component of the LSTM unit is called a forget gate (ft) [25].By using gating functions to "forget" information, this component allows the network to ignore non-critical information during training procedures [30,31].The vector resulting from the forget gate is denoted as follows: where the sigmoid function () limits resulting values to the range of (0,1).Wxf and Whf are learnable weight matrix parameters and bf is a learnable bias parameter.The next component of the LSTM cell is called the output gate (ot).This mitigates the degree to which the value or state stored in the cell's internal memory, ct, should be propagated to the new hidden state, ht, to be computed.The behavior of the output gate is represented as follows: where Wxo and Who are learnable weight matrix parameters and b0 is a learnable bias parameter.Utilizing a hyperbolic tangent function, the input modulation gate (gt) scales input values xt and ht before all or part of the resulting value is committed to the cell's internal memory in (5).The output of the input modulation gate is computed as: where Wxc and Whc are learnable weight matrix parameters and bc is a learnable bias parameter.The cell's internal memory is then updated, using the outputs of the forget gate (ft), the previous internal memory state (ct-1), input gate (it), and input modulation gate (gt): where element-wise multiplication is symbolized by ⊙.
Finally, the new hidden state is computed as follows: When an input sequence is provided, (x 0 , x 1 , . . ., x t−1 , x t ), a hidden state, h t−1 , is either initialized at the start of the sequence or propagated from a computation considering a previous input sequence element.The amount of data from the next element in the input sequence, x t , and how much of the hidden state from the previous timestep, h t−1 , should be committed to the internal memory cell c t is determined by the input gate (i t ).How much information propagated through this gate is specified as follows: where the sigmoid function (σ) limits resulting values to the range of (0,1).W xi and W hi are learnable weight matrix parameters and b i is a learnable bias parameter [30].
Next, a critical component of the LSTM unit is called a forget gate (f t ) [25].By using gating functions to "forget" information, this component allows the network to ignore non-critical information during training procedures [30,31].The vector resulting from the forget gate is denoted as follows: where the sigmoid function (σ) limits resulting values to the range of (0,1).W xf and W hf are learnable weight matrix parameters and b f is a learnable bias parameter.The next component of the LSTM cell is called the output gate (o t ).This mitigates the degree to which the value or state stored in the cell's internal memory, c t , should be propagated to the new hidden state, h t , to be computed.The behavior of the output gate is represented as follows: where W xo and W ho are learnable weight matrix parameters and b 0 is a learnable bias parameter.Utilizing a hyperbolic tangent function, the input modulation gate (g t ) scales input values x t and h t before all or part of the resulting value is committed to the cell's internal memory in (5).The output of the input modulation gate is computed as: where W xc and W hc are learnable weight matrix parameters and b c is a learnable bias parameter.The cell's internal memory is then updated, using the outputs of the forget gate (f t ), the previous internal memory state (c t−1 ), input gate (i t ), and input modulation gate (g t ): where element-wise multiplication is symbolized by .Finally, the new hidden state is computed as follows: where h t becomes the next h t−1 , and the next element in the input sequence is considered.

Study Area and Datasets
The "MODIS Terra+Aqua Combined Land Cover product" featuring global annual LC data was first obtained [32].This dataset features 13 scientific data layers, including land cover, surface hydrology, and classification confidence layers.This dataset features 17 LC classes using the "MCD12Q1 International Geosphere-Biosphere Programme (IGBP)" classification system [33].It features 500-m spatial resolution with data layers available annually from 2001 to 2017.Additional details regarding this dataset are shown in Table 1.The number of timesteps, temporal resolution, and numerous geospatial data layers available in the MODIS dataset motivated the selection of this data for this study, despite coarse spatial resolution inhibiting studies at smaller spatial scales.Thus, the data is further processed to consider the province of British Columbia, Canada.Located in the Pacific Northwest region, the province features diverse ecosystems including coastal forests, sub-boreal forests, and alpine tundra [34,35].Dense coniferous forests are the most prevalent vegetation cover found in British Columbia [35].Land Cover Layer "Land Cover Type 1: Annual International Geosphere-Biosphere Programme (IGBP) classification" [32] Land Cover Confidence Layer "LCCS1 land cover layer confidence" [32], with assessments recorded as percentages for each cell Number of Land Cover Classes 17 LC classes using the MCD12Q1 International Geosphere-Biosphere Programme (IGBP) classification system Data Format HDF-EOS Data Acquisition Tools LP DAAC2Disk used to directly download MODIS data [41] First, using the Geospatial Data Abstraction Library (v2.2.4), the desired 17-class LC data layer and respective classification confidence layer were obtained and combined to create a multidimensional mosaic dataset for each timestep [36].The raster mosaics were then extracted and re-projected to the NAD 1983 BC Environment Albers projected coordinate system with the provincial boundary of British Columbia obtained from the 2016 Canadian Census Boundary Files [37].These operations were conducted using the data management tools available in Esri's ArcGIS Pro (v2.4.0) and applied to both the LC and classification confidence layers [38].When using the "Project Raster" tool available in ArcGIS Pro [39], the "Nearest Neighbor" resampling technique was used as "it is suitable for discrete data, such as land cover" [39].The nearest neighbor resampling technique is utilized for the discrete and continuous datasets to maintain the related confidence layer data with the respective LC class for each cell [40].Different resampling techniques including "bilinear interpolation" and "cubic convolution" are only relevant for continuous datasets [40].
Classes are aggregated to form four LC classes to improve model performance and to reduce computation time.For this study, the 17 available LC classes have been aggregated to forest, anthropogenic areas, non-forest areas, and water as presented in Table 2.This reclassification procedure was also conducted in Esri's ArcGIS Pro [38].The new aggregated class labels were named as per [42].The aggregated "forest" class includes areas with forest cover greater than 60%, decided due to the predominant vegetation type in the province being coniferous trees [35].Datasets were subsequently resampled to 1 km spatial resolution to reduce the computation time required.Confidence data layers have been processed to this spatial resolution as well, featuring continuous percentage values at each cell (Figure 2).as per [42].The aggregated "forest" class includes areas with forest cover greater than 60%, decided due to the predominant vegetation type in the province being coniferous trees [35].Datasets were subsequently resampled to 1 km spatial resolution to reduce the computation time required.Confidence data layers have been processed to this spatial resolution as well, featuring continuous percentage values at each cell (Figure 2).For the purpose of investigating the performance of the LSTM method given a persistent class, a hybrid dataset has been created.To create the hybrid dataset, the actual LC data is amalgamated with a hypothetical persistent water class used for persistent water dataset experiments.The water class has been chosen to become a persistent class given the observed fluctuation of its boundaries through all data layers that potentially are due to classification errors.For example, the number of cells denoted as water abruptly changed from 90,963 to 94,290 between 2016 to 2017 (Figure 3).To make the water class persistent through time, all water cells present in the study area from 2002 to 2017 were converted to "No Data."Next, the cells containing water in the study area in 2001 were overlaid to the same location from 2001 through 2017.Cells that were occupied by water in 2002 to 2017 that were not water cells in 2001 remain "No Data" and are thus excluded in the model evaluation procedure.These procedures were enabled by reading data using the Geospatial Data Abstraction Library and modifying the data programmatically using the Python programming language (v3.6.5)[43].For the purpose of investigating the performance of the LSTM method given a persistent class, a hybrid dataset has been created.To create the hybrid dataset, the actual LC data is amalgamated with a hypothetical persistent water class used for persistent water dataset experiments.The water class has been chosen to become a persistent class given the observed fluctuation of its boundaries through all data layers that potentially are due to classification errors.For example, the number of cells denoted as water abruptly changed from 90,963 to 94,290 between 2016 to 2017 (Figure 3).To make the water class persistent through time, all water cells present in the study area from 2002 to 2017 were converted to "No Data."Next, the cells containing water in the study area in 2001 were overlaid to the same location from 2001 through 2017.Cells that were occupied by water in 2002 to 2017 that were not water cells in 2001 remain "No Data" and are thus excluded in the model evaluation procedure.These procedures were enabled by reading data using the Geospatial Data Abstraction Library and modifying the data programmatically using the Python programming language (v3.6.5)[43].

Training and Testing Procedures
Following the data preparation procedures, datasets must be further processed to create training and testing datasets for use with the LSTM method.Input sequences composing the training and testing datasets are created using a moving-window approach [20].Training sets are thus denoted as (x0, x1, x2, …, xT-3), while the target LC class is denoted by (yT-2).Input sequences in the test set are

Training and Testing Procedures
Following the data preparation procedures, datasets must be further processed to create training and testing datasets for use with the LSTM method.Input sequences composing the training and testing datasets are created using a moving-window approach [20].Training sets are thus denoted as (x 0 , x 1 , x 2 , . . ., x T−3 ), while the target LC class is denoted by (y T−2 ).Input sequences in the test set are denoted as (x 1 , x 2 , x 3 , . . ., x T−2 ), while the target LC class is denoted by (y T−1 ).The training and test sets are dependent on the temporal resolution being utilized with the currently trained model.Table 3 presents the years used in training and testing procedures with respective temporal resolutions.The subsequent timestep to be forecasted, or the training targets (y T−2 ) and testing targets (y T−1 ), have been underlined (Table 3).Both training and testing datasets are one-hot encoded to represent LC classes at each timestep [19,44].The LC classification confidence layer is processed in the same way, without the one-hot encoding procedure.Comprising the confidence layers are continuous values representing percentages ranging from 0 to 100.Zero indicates no confidence in the LC classification result for a cell, while values nearing 100 indicate high confidence in LC classification algorithm results for a given cell [33].2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016 The overall number of cells per class considering the real LC dataset has been shown in Figure 3.Although there are many ways of composing the training sets, due to the scarcity of cells undergoing change [15], it is necessary to apply an improved sampling strategy to form the training sets used in each scenario.This means that strategies such as random sampling or obtaining equal counts from each class are disadvantageous for phenomena such as LCC.Samardžić-Petrović et al. [45] propose a balanced sampling strategy, demonstrating the benefits of equal counts of persistent cells and changed cells used to compose the training set.Changed cells are denoted as such if they have undergone one or many changes between x 0 and y T−2 .However, due to the large discrepancy between changed and persistent cell counts, persistent cells cannot be simply randomly sampled.Therefore, all changed cells are added to the training data set, while persistent cells are sampled at random across the entire study area while maintaining the original distributions of classes found in the entire set of persistent cells [45,46].This technique impacts the number of inputs available for training for each temporal resolution.That is, if changes occurred during timesteps unavailable in the creation of the training set for the temporal resolution being considered, the cell would be marked as persistent.This sampling procedure is also applied to the scenarios considering the hybrid data, ignoring the additional cells marked as "No Data."Due to the increased number of cells ignored in the hybrid datasets, training and testing datasets are smaller in scenarios in which this dataset is used.
Following model training procedures, testing or validation can occur with a dataset withheld from the training dataset [47].The "one-step" ahead forecasting convention used to test or validate respective models is adopted from [20,47].Previous methods have utilized subsets of available datasets for testing, following suit with training dataset creation procedures [45,46].However, in this study, the balanced sampling strategy has been employed for only the training dataset creation procedure.To create the test set, all cells (except for those containing "No Data") are considered in the evaluation of the entire forecasted map.That is, the number of cells forecasted correctly as changed or persistent are calculated for all cells available in the forecasted map in all scenarios.

Model Specifications
The LSTM-based approach for forecasting LCC includes a stacked LSTM model (Figure 4). Figure 4 depicts the model considering training input sequences x 0 to x 14 , considering one-year temporal resolution.The "Dense" layer refers to a fully-connected neural network layer used as the output layer.By stacking LSTM layers, these models have been demonstrated to have improved capacity to capture increasingly complex relationships subsisting in input datasets [48,49].In this study, three LSTM layers are used in a stacked modeling approach accepting categorical LC input sequence data.Each LSTM layer is composed of 128 neurons per layer.The input layer is compatible with one-hot encoded input sequences.Between each layer, various dropout regularization factors were also tested before settling on a factor of 0.5 between each layer [20,50].The dropout factor controls the probability of neurons being discarded during the training procedure.For instance, if a factor of 0.5 is applied between each layer, the probability that the information from a neuron is "dropped" becomes 50%.This simple tactic has proven effective in preventing overfitting and improving representations learned by RNNs [51].
Models were developed using Python (v3.6.5) and the Keras API (v2.2.0) [54].The Keras API aids in simplifying DL model prototyping workflows while affording users the functionality of Google's TensorFlow GPU implementation (v1.8.0) [54].TensorFlow is "an open-source machine learning framework" that provides users advanced functionality to fine-tune ML and DL models [55].The optimization method used was the adaptive moment estimation or "Adam" algorithm [56] due to its demonstrated success and robustness to model hyperparameters.The learning rates used are those determined by Kingma and Ba [56], available as default values with the Keras API [57].Categorical cross-entropy is utilized as the objective function to accommodate the multi-class data being used with this method [28].To incorporate the classification confidence layer, this configuration is concatenated with an additional input layer and LSTM layer.This branched configuration is required to provide mixed input types to models, a prevalent approach in applications such as image captioning [52,53].The model branch accepts the confidence layer as input.The output of this branch is concatenated with output resulting from the stacked LSTM used to consider the LC class sequences.This configuration is demonstrated in Figure 5.The application of the dropout regularization terms remains the same in the model branch considering LC input sequences.
Models were developed using Python (v3.6.5) and the Keras API (v2.2.0) [54].The Keras API aids in simplifying DL model prototyping workflows while affording users the functionality of Google's TensorFlow GPU implementation (v1.8.0) [54].TensorFlow is "an open-source machine learning framework" that provides users advanced functionality to fine-tune ML and DL models [55].The optimization method used was the adaptive moment estimation or "Adam" algorithm [56] due to its demonstrated success and robustness to model hyperparameters.The learning rates used are those determined by Kingma and Ba [56], available as default values with the Keras API [57].Categorical cross-entropy is utilized as the objective function to accommodate the multi-class data being used with this method [28].
The Softmax activation function [12] was employed in the output layer.This ensures a vector of probabilities corresponding to each class label for each cell is output from the model.This activation function is commonly used with multi-class classification and predictive models [12,58].In this study, there are four LC classes.The fifth class (y 0 in Figures 4 and 5) denotes a "No Data" option which is unrepresented in the training and test sets.The maximal number of epochs the models are enabled to train for is 1000 epochs, with early-stopping callbacks utilized to prevent overfitting [59].Early stopping terminates model fitting when there have been no improvements to a model's objective function within a specified number of epochs.The batch size was set to 32.Considering the actual data, models to be trained featured 332,421 internal parameters.When utilizing the additional confidence input layer, the number of internal parameters increased to 399,621 with the addition of the secondary input branch (Figure 5).Models have been trained and tested using a NVIDIA GeForce GTX 1080 Ti GPU.
stopping terminates model fitting when there have been no improvements to a model's objective function within a specified number of epochs.The batch size was set to 32.Considering the actual data, models to be trained featured 332,421 internal parameters.When utilizing the additional confidence input layer, the number of internal parameters increased to 399,621 with the addition of the secondary input branch (Figure 5).Models have been trained and tested using a NVIDIA GeForce GTX 1080 Ti GPU.

Experiment Overview
Using the original datasets that have undergone pre-processing procedures described in Section 2.1, two experiments are performed, including (1) training and evaluating the stacked LSTM model considering only the LC dataset (Figure 4) and ( 2) training and evaluating the mixed input stacked LSTM model considering both the LC dataset and the ancillary classification confidence layer (Figure 5).In the evaluation procedure, all cells apart from those assigned "No Data" are involved.
The experiments using the hybrid datasets focus on assessing whether performance may increase or decrease when a class remains persistent through all timesteps.Considering the hybrid dataset, the two experiments conducted to assess the repercussions of the persistent water class include (1) training and evaluating the stacked LSTM model (Figure 4) and ( 2) training and evaluating the mixed input stacked LSTM model considering both the hybrid dataset and the classification confidence layer (Figure 5).In the evaluations considering the hybrid dataset, a greater number of cells will be marked as "No Data" due to data creation procedures described in Section 2.3, implying more cells will be omitted during the evaluation procedure.

Experiment Overview
Using the original datasets that have undergone pre-processing procedures described in Section 2.1, two experiments are performed, including (1) training and evaluating the stacked LSTM model considering only the LC dataset (Figure 4) and ( 2) training and evaluating the mixed input stacked LSTM model considering both the LC dataset and the ancillary classification confidence layer (Figure 5).In the evaluation procedure, all cells apart from those assigned "No Data" are involved.
The experiments using the hybrid datasets focus on assessing whether performance may increase or decrease when a class remains persistent through all timesteps.Considering the hybrid dataset, the two experiments conducted to assess the repercussions of the persistent water class include (1) training and evaluating the stacked LSTM model (Figure 4) and ( 2) training and evaluating the mixed input stacked LSTM model considering both the hybrid dataset and the classification confidence layer (Figure 5).In the evaluations considering the hybrid dataset, a greater number of cells will be marked as "No Data" due to data creation procedures described in Section 2.3, implying more cells will be omitted during the evaluation procedure.

Results
Constructing four models for each experiment, each model was trained with training sets featuring one-, two-, four-, and eight-year temporal resolutions, respectively.Computation time required to run each model configuration (Figures 4 and 5) with all temporal resolution options and both the actual and hybrid datasets exceeded one day.Training set composition has been shown in Table 4 for experiments involving the actual datasets (Table 4a) and hybrid datasets (Table 4b).The number of training samples for each temporal resolution option in Table 4 indicate that as the temporal resolution becomes coarser, the number of cells changed becomes smaller with the lesser number of years available.Likewise, the number of persistent cells in the respective training sets becomes smaller to uphold the balanced sampling strategy posed in Section 2.3.Though the water class is persistent through all timesteps (Table 4b), the percentage of persistent water cells obtained for the respective training sets maintains the sampling strategy when considering the hybrid dataset.The years composing the test set will vary for each temporal resolution considered.The timesteps involved in training and testing each model in all experiments is demonstrated in Table 3.To examine differences in forecasted outputs, metrics and maps are produced at the provincial extent.Additionally, a smaller spatial extent focused on the Central Okanagan region is selected for verification and visual assessment (Figure 6).This data was extracted from the "Regional District Boundaries" file available for the province of British Columbia [60].
differences in forecasted outputs, metrics and maps are produced at the provincial extent.Additionally, a smaller spatial extent focused on the Central Okanagan region is selected for verification and visual assessment (Figure 6).This data was extracted from the "Regional District Boundaries" file available for the province of British Columbia [60].
To evaluate the method, cells are marked as changed if they have undergone a transition between 2001 and 2017.Likewise, the evaluation does not consider cells marked as "No Data."This operation considers all timesteps and is then used to compare to the forecasted output generated from each of the models.Each model in every experiment is tested to forecast the LC geospatial data available for 2017 when using actual and hybrid datasets, respectively.Evaluation metrics are considered per category by "cell-by-cell comparison," indicating which cells were forecasted as persistent or changed [61].Instead of utilizing the traditional kappa metric, this study employs a set of metrics from [46] used to highlight trends in the method's capacity to forecast LCC.To evaluate the method, cells are marked as changed if they have undergone a transition between 2001 and 2017.Likewise, the evaluation does not consider cells marked as "No Data."This operation considers all timesteps and is then used to compare to the forecasted output generated from each of the models.Each model in every experiment is tested to forecast the LC geospatial data available for 2017 when using actual and hybrid datasets, respectively.Evaluation metrics are considered per category by "cell-by-cell comparison," indicating which cells were forecasted as persistent or changed [61].Instead of utilizing the traditional kappa metric, this study employs a set of metrics from [46] used to highlight trends in the method's capacity to forecast LCC.
Analyses are conducted using the results produced with actual LC datasets depicted in Figure 7 and Table 5. Considering both the actual LC and classification confidence layers with the model structure shown in Figure 5, obtained results in 2017 for the study area are shown in Figure 8 and Table 6.Centered on the Central Okanagan region, Figure 9 showcases a comparison of forecasts obtained using the respective model specifications.The number of correctly changed cells produced in respective forecasts decreases for two non-majority classes (forest and anthropogenic areas) for both modeling scenarios as temporal resolution becomes coarser (Tables 5 and 6).Likewise, the number of cells changed incorrectly to persistent increases as temporal resolution becomes coarser.The hybrid dataset experiments feature a persistent water class developed from the actual dataset, where water cells from 2001 are made persistent through all timesteps.Results of these subsequent experiments using the hybrid LC dataset as input have been shown in Figure 10 and Table 7.
Outcomes of experiments utilizing both the hybrid LC and classification confidence layers have been shown in Figure 11 and Table 8.Focusing on the Central Okanagan region, Figure 12 showcases a comparison of forecasts obtained using the respective model specifications.With the persistent water class data as input, models created as per both specifications forecasted several correctly changed cells (Tables 7 and 8).

Discussion
Across all experiments, in utilizing the coarsest temporal resolution possible (eight-year temporal resolution), results obtained were poor and erratic (Figures 7-12, Tables 5-8).This scenario involves one input feature mapped to one output feature for training and testing.By comparing models trained with four temporal resolutions, it was also observed that there is an overall bias for this method to forecast persistent cells, despite the balanced sampling regime used.This bias toward unchanged cells was also demonstrated in prior studies considering LSTM [19].This method also demonstrates a bias toward the persistent majority class (non-forest), especially as the number of timesteps decreased and temporal resolution became coarser in all experiments.
Overall, the effects of increasing or decreasing temporal resolution were as expected.It was observed that method performance is impeded in scenarios involving coarser temporal resolutions.When forecasting land cover, the LSTM method forecasts a greater number of persistent cells than changed cells as temporal resolution becomes coarser.This is demonstrated in Table 5, where the number of persistent cells forecasted correctly increases as temporal resolution becomes coarser.The capacity of the models to forecast changed cells increases as temporal resolution becomes finer (Table 5).For instance, the models trained with the finest temporal resolution (one-year) produced the forecast with the most correctly transitioned cells for forest and anthropogenic areas.Therefore, it is observed that LSTM models are most effective when provided with increased numbers of timesteps and finer temporal resolutions in both the real-world and hybrid application.
While the classification confidence layer slightly increased the number of changed cells forecasted correctly in experiments where actual LC data is used, the greatest increase was seen in experiments using the hybrid dataset containing the persistent water class.This suggests the method's sensitivity to classification errors, paralleling findings indicating the method's sensitivity to random noise [20].The mixed input configuration is thus one approach for mitigating situations where fewer classes undergo abrupt or rapid changes.

Actual Dataset Experiments
Using the actual dataset, the models trained with one-year temporal resolution yielded highest counts of correctly simulated cells across the forest, anthropogenic areas, and non-forest area classes.This is exhibited by results including only the LC data layer (Figure 7), and the LC data layer with the confidence layer as ancillary data (Figure 8).This includes both changed and persistent cells belonging to each respective class.Similarly, the model trained with the finest temporal resolution forecasted the least number of changed cells incorrectly as persistent.That is, the number of changed cells simulated as persistent increases with temporal resolution.It is also observed that the highest number of simulation errors exists in "changed cells simulated incorrectly as persistent" counts (Tables 5 and 6).
The exemption to the aforementioned trends is the water class, exhibiting slightly increased counts of correctly simulated cells as temporal resolution becomes coarser.Overall, models performed poorly when forecasting water.Assumed to be mostly persistent, erroneous values may have factored in to poor performance for simulating this specific class.For instance, from 2016 to 2017, the number of cells occupied by water changes from 90,963 to 94,290 (Figure 3).This abrupt change occurring at the end of the available timeseries may be due to classification errors or discrepancies when annual data products were created.
It was hypothesized that adding the classification confidence layer would mitigate the effects of erroneous values and thus enhance LCC forecasts when considering all temporal resolution options.However, results obtained using the mixed input models failed to significantly improve in scenarios involving finer temporal resolutions (Table 6).This is also evident in the map produced for the Central Okanagan region (Figure 9).By including the LC classification confidence layer, the capacity for the models to forecast changes improved marginally as temporal resolution became coarser.For instance, Table 5 shows the model trained with eight-year temporal resolution data to forecast only two of the four possible classes, including the majority class (non-forest areas) and water cells.While still obtaining suboptimal results, models trained with both types of inputs forecasted some cells in forest and anthropogenic classes, albeit less than 1% in each class, respectively.

Hybrid Dataset Experiments
Following suit with results shown in Section 3, the models trained with finer temporal resolutions obtained the most favorable results (Tables 7 and 8).Models trained with the eight-year temporal resolution option forecasted only non-forest areas and persistent water cells (Table 7).It is also observed that cells occupied by persistent water through all timesteps have been forecasted with no errors across all modeling scenarios considering both exclusively the hybrid dataset as well as the mixed inputs.
Like results obtained in Section 3, the addition of the classification confidence layer as input to the model increased model performance slightly when considering coarser temporal resolutions and fewer timesteps.It similarly had little to no effect on LCC forecasting performance in scenarios where finer temporal resolutions were considered.However, when the considering the hybrid dataset with eight-year temporal resolution, the addition of the classification confidence layer as model input contributed to success in forecasting anthropogenic areas, with 92.6% of persistent anthropogenic areas forecasted correctly (Table 8).This can be viewed in Figure 12 at the smaller spatial extent.
As previously observed, the capacity of the method to forecast changes degrades as temporal resolution becomes coarser.An exemption to this trend is the number of non-forest areas correctly forecasted as changed (Tables 7 and 8).

Conclusions
The stacked LSTM modeling approach for forecasting LCC aims to detect patterns occurring across the temporal dimension to forecast forest, anthropogenic areas, non-forested non-anthropogenic areas, and water.Compared to previous applications of LSTM for prediction and classification tasks utilizing geospatial data, real-world applications thus far had not considered the implications and importance of the choice of geospatial input data characteristics on method performance.For slow-changing geospatial systems such as LCC, it is demonstrated that obtaining datasets that feature many timesteps and finer temporal resolutions enable more optimal models to be obtained.It is also indicative of potential issues arising when considering the LSTM method for use with geospatial datasets that are limited in the number of data layers or timesteps.
With demonstrated bias toward training datasets, future work should consider improved sampling strategies to further address this issue.Similarly, a consequence of the sampling regime employed was a significant loss of potential training data samples.Maintaining and obtaining additional high-quality training samples available should continue to be a priority for further research involving LSTM.
Future works should also consider strategies to use this method to its fullest potential when real-world geospatial datasets have limited layers available.For instance, additional data layers should be considered to increase method performance, especially if classes undergo more rapid changes.Moreover, different LC class aggregation strategies may produce improved results.This may entail differing compositions of or replacements of one or many of the forest, anthropogenic areas, non-forest areas, and water classes.It would be beneficial to consider sensitivity analyses of increasing LC class cardinality and implications of changing spatial resolution on method performance.Such experiments will require computational resources such as a distributed computing facility to efficiently run experiments with increasing LC classes and spatial resolutions.
Since this evaluation was conducted at a provincial scale, future works should also consider the implications of spatial extent and resolution on method performance.It is also recommended that future work utilize additional evaluation metrics that consider not only location-based metrics, but also spatial pattern metrics.While the integration of the available classification confidence layer available in the "MODIS Terra+Aqua Combined Land Cover Product" [32] was considered, further sensitivity analyses should be conducted to assess the effects of perturbations in input sequences and model response.Additional auxiliary or contextual data to provide more detail pertaining to spatial features in given input sequences should also be explored.This could include deriving additional layers pertaining to local spatial autocorrelation.Though the classification confidence layer did not enhance LCC forecasting performance in situations involving fine temporal resolution data, this additional data layer may be advantageous in data-scarce scenarios where improving temporal resolution or increasing the number of timesteps is not an option.It should be assessed whether increasing the number of additional features may improve the forecasting performance of the method or increase robustness of the method to varying geospatial input data characteristics such as temporal resolution, LC class cardinality, and the number of timesteps available.As such, geospatial data characteristics such as LC class cardinality can be maintained instead of undergoing aggregation in pre-processing procedures, while maintaining or enhancing yielded performance measures.
Given the lack of research endeavors exploring the effectiveness of LSTM for LCC forecasting, it was inconclusive as to what geospatial dataset characteristics were required to optimize the use of this modeling approach.By training and testing models using geospatial datasets with varying characteristics, this work aimed to contribute to future LCC forecasting applications by providing recommendations and an assessment displaying under which circumstances the method is most effective.In this application, it was determined that increasing the number of timesteps and obtaining data with finer temporal resolution enable the most optimal models to be developed for LCC forecasting.Likewise, the number of classes exhibiting change also showed impactful to method performance.Lastly, integrating additional data layers such as classification confidence proved useful in mitigating the effects of coarser temporal resolution on the method's capacity to simulate LCC.

2001 2017Figure 2 .
Figure 2. Land cover maps for British Columbia generated using actual data for years 2001 and 2017 are shown on the left.Corresponding classification confidence maps for years 2001 and 2017 are shown on the right.

Figure 2 .
Figure 2. Land cover maps for British Columbia generated using actual data for years 2001 and 2017 are shown on the left.Corresponding classification confidence maps for years 2001 and 2017 are shown on the right.

Figure 3 .
Figure 3. Number of cells composing each land cover class per year in the study area covering British Columbia, Canada from 2001 to 2017.

Figure 3 .
Figure 3. Number of cells composing each land cover class per year in the study area covering British Columbia, Canada from 2001 to 2017.

Figure 4 .
Figure 4. Stacked LSTM model with a land cover input sequence example using one-year temporal resolution.

Figure 4 .
Figure 4. Stacked LSTM model with a land cover input sequence example using one-year temporal resolution.

Figure 5 .
Figure 5. Stacked LSTM model configured for land cover and classification confidence input layers.

Figure 5 .
Figure 5. Stacked LSTM model configured for land cover and classification confidence input layers.

Figure 6 .
Figure 6.Actual land cover data for the Central Okanagan Region used for verification and visual assessment comparing forecasted outputs for 2017.

Figure 6 .
Figure 6.Actual land cover data for the Central Okanagan Region used for verification and visual assessment comparing forecasted outputs for 2017.

Table 4 .
Composition of each training set and for each temporal resolution.The number of cells belonging to each class has been shown, along with the original percentage of cells belonging to each class in the respective datasets for (a) the actual dataset and (b) the hybrid dataset.

Figure 7 .
Figure 7. Land cover classes obtained for year 2017 from models trained using the actual land cover dataset with (a) one-year, (b) two-year, (c) four-year, and (d) eight-year temporal resolution.

Figure 7 .
Figure 7. Land cover classes obtained for year 2017 from models trained using the actual land cover dataset with (a) one-year, (b) two-year, (c) four-year, and (d) eight-year temporal resolution.

Figure 8 .
Figure 8. Land cover classes obtained for year 2017 from models trained using the actual land cover dataset and classification confidence layer with (a) one-year, (b) two-year, (c) four-year, and (d) eightyear temporal resolution.

Figure 8 .
Figure 8. Land cover classes obtained for year 2017 from models trained using the actual land cover dataset and classification confidence layer with (a) one-year, (b) two-year, (c) four-year, and (d) eight-year temporal resolution.

Figure 9 .
Figure 9.Comparison of land cover forecasts centered on the Central Okanagan region, British Columbia, using the actual land cover dataset without and with classification confidence layer.

Figure 9 .
Figure 9.Comparison of land cover forecasts centered on the Central Okanagan region, British Columbia, using the actual land cover dataset without and with classification confidence layer.

Figure 10 .
Figure 10.Forecasted land cover classes obtained for year 2017 from models trained using the hybrid dataset with (a) one-year, (b) two-year, (c) four-year, and (d) eight-year temporal resolution.

Figure 10 .
Figure 10.Forecasted land cover classes obtained for year 2017 from models trained using the hybrid dataset with (a) one-year, (b) two-year, (c) four-year, and (d) eight-year temporal resolution.

Figure 11 .
Figure 11.Forecasted land cover classes obtained for year 2017 from models trained using the hybrid dataset and classification confidence layer with (a) one-year, (b) two-year, (c) four-year, and (d) eightyear temporal resolution.

Figure 11 .Figure 12 .
Figure 11.Forecasted land cover classes obtained for year 2017 from models trained using the hybrid dataset and classification confidence layer with (a) one-year, (b) two-year, (c) four-year, and (d) eight-year temporal resolution.

Figure 12 .
Figure 12.Comparison of land cover forecasts centered on the Central Okanagan region, British Columbia, using the hybrid dataset without and with classification confidence layer.

Table 1 .
Overview of original moderate-resolution imaging spectroradiometer (MODIS) land cover dataset characteristics.

Table 3 .
Years used to compose training and test datasets considering the four temporal resolution scenarios.The years used for training and testing target data have been underlined.

Table 5 .
Number of cells correctly and incorrectly simulated per class using the actual land cover dataset with the four temporal resolution options.

Table 6 .
Number of cells correctly and incorrectly forecasted per class using the actual land cover dataset and classification confidence layer with the four temporal resolution options.

Table 7 .
Number of cells correctly and incorrectly forecasted per class using the hybrid dataset with the four temporal resolution options.

Table 8 .
Number of cells correctly and incorrectly forecasted per class using the hybrid dataset and classification confidence layer with the four temporal resolution options.