Mitigating Imbalance of Land Cover Change Data for Deep Learning Models with Temporal and Spatiotemporal Sample Weighting Schemes

: An open problem impeding the use of deep learning (DL) models for forecasting land cover (LC) changes is their bias toward persistent cells. By providing sample weights for model training, LC changes can be allocated greater inﬂuence in adjustments to model internal parameters. The main goal of this research study was to implement and evaluate temporal and spatiotemporal sample weighting schemes that manage the inﬂuence of persistent and formerly changed areas. The proposed sample weighting schemes allocate higher weights to more recently changed areas based on the inverse temporal and spatiotemporal distance from previous changes occurring at a location or within the location’s neighborhood. Four spatiotemporal DL models (CNN-LSTM, CNN-GRU, CNN-TCN, and ConvLSTM) were used to compare the sample weighting schemes to forecast the LC changes of the Columbia-Shuswap Regional District in British Columbia, Canada, using data obtained from the MODIS annual LC dataset and other auxiliary spatial variables. The results indicate that the presented weighting schemes facilitated improvement over no sample weighting and the common inverse frequency weighting scheme for multi-year LC change forecasts, lowering errors due to quantity while reducing overall allocation error severity. This research study contributes to strategies for addressing the characteristic imbalances of multitemporal


Introduction
The abundance of openly available multitemporal remote sensing data continues to expand, accelerating studies of land change and pursuits of data-driven modeling techniques such as machine learning (ML) and deep learning (DL) [1]. These approaches have circumvented the need to encode nonlinear relationships between numerous variables in land change applications [2][3][4]. ML and DL methods are designed to learn patterns given training data, labels, and a loss function [5]. In particular, DL methodologies applied to multitemporal datasets have demonstrated favorable outcomes for LC classification and forecasting [5][6][7]. For example, recurrent neural networks (RNNs) are useful for extrapolating from timeseries data [5,6], and their combination with convolutional neural networks (CNNs) facilitates the extraction of spatial features from each timestep [7]. However, many DL models are sensitive to data imbalances, exhibiting biases toward samples belonging to majority groups characterizing the dataset [8][9][10]. Persistent areas dominate most land cover datasets [11], presenting a large challenge for the application of DL methodologies due to the relatively small amount of changed areas [12]. Therefore, mitigation of imbalance between changed and persistent areas is required for DL methodologies to be effective for LC change forecasting.
Previous studies have applied sampling strategies and data augmentation techniques to address the imbalance of changed and unchanged areas. For example, balanced sampling schemes were implemented to provide ML models equal numbers of changed and persistent samples using random sampling [13,14] and an iterative bootstrap sampling approach [15]. However, DL models benefit from larger amounts of data [4], inspiring geographic dataset augmentation procedures including adaptations of the synthetic minority over-sample technique (SMOTE) [16], transformations of data samples such as rotations and flips [17], and synthetic sample generation with generative adversarial networks (GANs) [18]. Despite this, such approaches have not been adapted to the dynamic characteristics of land change timeseries data, nor do they maintain directional spatial relationships in the case of manual transformations or the real-world spatiotemporal context of geographic data samples. Other strategies aimed at mitigating data imbalance include cost-sensitive methods such as sample weighting [9]. By allocating increased penalties to significant or minority samples, the spatial phenomena of interest have greater influence on the learned parameters of DL models [19]. Using sample weighting schemes, scarcer or anomalous geographic phenomena can be allocated higher importance in model training procedures without data-level manipulations.
Sample weights were previously explored to improve forecasts of underrepresented peak air pollution concentrations [20], to increase the importance of nearby samples for location recommendations [21], and to boost multi-class LC classification accuracy through inverse frequency weighted categorical cross-entropy [22]. Temporal and spatiotemporal distance decay were also implemented for modeling house prices with local regression techniques [23] and video target tracking with spatiotemporal DL models [24] to diminish the influence of locations or features based on both temporal and spatial proximity. While variations of temporal and spatiotemporal sample weighting schemes were explored for other model types and applications, it remains unknown how temporally and spatiotemporally weighted samples would affect the capacity of spatiotemporal DL models to forecast LC changes versus the commonly used inverse frequency weighting scheme.
The main objective of this research study was to propose and evaluate the potential effect of temporally and spatiotemporally weighted samples for managing the imbalances that inhibit DL model capacity to forecast LC changes. While it is acknowledged that perclass disparities characterize a second dimension of imbalance impacting the capacity of DL models to forecast LC changes [12], this research study focuses on LC changes overall. The new sample weighting schemes are implemented with the intent of ensuring more recently changed areas and changed locations with transitioning neighborhoods are more impactful in model training. This is achieved through inverse distance weighting schemes which allocate changed samples variable importance based on temporal and spatiotemporal proximity. Meanwhile, all persistent LC samples are assigned lower importance. The experimental cases compare four types of spatiotemporal DL models trained using the proposed sample weighting schemes versus those trained with no sample weights and inverse frequency weighted samples. The assessment uses change-focused measures to highlight potential improvements associated with the proposed sample weighting schemes for 5-year LC change forecasts. The goal is to determine if trends observed across all model types are maintained for a multistep LC forecast.

Study Area and Datasets
The study area selected for this research work was the Columbia-Shuswap Regional District in British Columbia, Canada (Figure 1a). In 2016, the population of this region was 51,366, and Salmon Arm was its largest city [25]. The MODIS dataset was utilized to provide annual LC data at 500 m spatial resolution, covering 2001 to 2020 [26]. The vast majority of the region was characterized by forests and shrublands, with these two LC classes spanning 45% and 40% of the area in 2001, respectively. Because per-class LC changes are not the priority of this research study, eight aggregated LC classes were used to preserve the characteristics of the region. The net change that occurred between 2001 and 2020 was 21,202 km 2 , with 2076 km 2 of average annual change (Figure 1b). provide annual LC data at 500 m spatial resolution, covering 2001 to 2020 [26]. The vast majority of the region was characterized by forests and shrublands, with these two LC classes spanning 45% and 40% of the area in 2001, respectively. Because per-class LC changes are not the priority of this research study, eight aggregated LC classes were used to preserve the characteristics of the region. The net change that occurred between 2001 and 2020 was 21,202 km 2 , with 2076 km 2 of average annual change (Figure 1b). In addition to multitemporal LC data, static spatial variables provide auxiliary information about drivers of land changes. Topographic variables provided are elevation and slope [27,28], alongside accessibility variables derived by calculating the Euclidean distances to population centers [6], roads [7], railways [3], protected areas [7,29], agricultural In addition to multitemporal LC data, static spatial variables provide auxiliary information about drivers of land changes. Topographic variables provided are elevation and slope [27,28], alongside accessibility variables derived by calculating the Euclidean distances to population centers [6], roads [7], railways [3], protected areas [7,29], agricultural reserves [30], and lakes and rivers [29]. The ASTER global digital elevation model was acquired for the topographic variables [31] and the proximity variables were computed with respect to data layers available from Statistics Canada [32,33]. All data layers were resampled to the MODIS dataset using nearest neighbor interpolation for categorical layers and bilinear interpolation for the continuous layers, since the LC dataset exhibits the coarsest spatial resolution. The LC data and auxiliary variables are reprojected to the NAD 1983 BC Environment Albers projected coordinate system, preserving planar area measurements. Data preprocessing procedures were completed in the ArcGIS Pro 2.9.1 software before deriving samples and sample weights in the next steps described.

Capturing Neighborhood Effects in Land Cover Data Samples
Before developing sample weights, the training data samples were extracted with respect to each cell of the study area. "Neighborhood effects" refer to the impact of changes and structure of phenomena in the vicinity of a sample location [34]. The LC at a sample location is highly dependent on the states and changes that occur around it, making neighborhood effects important to consider in LC change forecasting [4]. To integrate neighborhood effects in this research study, the LC composition surrounding each location at every timestep was included in each data sample. Considering the raster geospatial data layers described in Section 2.1, data samples were acquired for every cell comprising the study region using a Moore neighborhood configuration. A Moore neighborhood captures N cells provided a range parameter r, where N = (2r + 1) 2 cells [35]. Samples were obtained considering each cell as a central cell, with neighborhoods capturing cells within range r from the central location across all timesteps. The neighborhood size can also be referred to with respect to the areal dimensions captured by each sample, M × M, where M refers to how many cells comprise the longest edge of the central cell's neighborhood.
Data samples for the Columbia-Shuswap Regional District included both spatiotemporal LC and auxiliary spatial variables. The spatiotemporal LC obtained for each location was of size T × M × M × C, where T signifies the number of timesteps, and C represents the number of LC classes. The spatial variables were provided in the form of M × M × V, where V denotes the number of variables. The neighborhood dimensions used in this research study were set to 9 × 9 (or r = 4 cells in every direction from the central cell), according to other studies employing DL models [36][37][38] and findings suggesting land changes are captured well when M is less than 5 km with 500 m spatial resolution data [39]. Therefore, the 9 × 9 neighborhood captured the LC and spatial variables for a 20.25 km 2 area around each cell in the study region.

Model Specifications
The models implemented in this research study included those adapted for the spatiotemporal samples provided alongside auxiliary spatial variables as described in Section 2.2. Specific DL models called recurrent neural networks (RNNs) are useful for extracting patterns from sequential or timeseries data [37]. RNN implementations exhibit different variations in architectures, including long short-term memory (LSTM) [40] and gated recurrent units (GRUs) [41]. Temporal convolutional networks (TCNs) provide an alternative to RNNs for sequence modeling, using convolutional neural networks (CNNs) that operate on the temporal dimension of a geospatial dataset [42]. However, on their own, these sequence modeling techniques do not extract spatial correlations within data samples. Instead, capturing neighborhood effects first with traditional CNNs [43] and providing the outputs to sequence models (i.e., LSTM, GRU, and TCN) implements hybrid spatiotemporal models, which have been used to capture spatial and temporal patterns of land change [7,44]. Despite demonstrated integrations, these hybrid models do not capture explicit spatial and temporal relationships simultaneously. As such, convolutional LSTM (ConvLSTM) was formulated to accommodate spatiotemporal relationships directly, providing benefits over CNN-LSTM implementations [45].
The four spatiotemporal DL model types implemented in this research study were CNN-LSTM, CNN-GRU, CNN-TCN, and ConvLSTM. Each model was equipped with two input branches, accommodating the spatiotemporal LC sequences and the auxiliary spatial variables, respectively. This is similar to how 3D and 2D variables have been integrated in DL models in other geographic applications [46]. The general structure of the branched model implemented is shown in Figure 2.
The four spatiotemporal DL model types implemented in this research study were CNN-LSTM, CNN-GRU, CNN-TCN, and ConvLSTM. Each model was equipped with two input branches, accommodating the spatiotemporal LC sequences and the auxiliary spatial variables, respectively. This is similar to how 3D and 2D variables have been integrated in DL models in other geographic applications [46]. The general structure of the branched model implemented is shown in Figure 2. For all model types, the spatial variable input branch (denoted as "spatial branch" in Figure 2) was implemented with two sets of two convolution layers with 32, 32, 64, and 64 filters, respectively. After each set of two CNN layers, a 2 × 2 max pooling operation was applied. Next, the spatiotemporal LC input branch (denoted as "spatiotemporal branch" in Figure 2) was implemented according to the model type. The implementations for CNN-LSTM, CNN-GRU, and CNN-TCN were adapted from previous studies [46,47]. To implement the spatiotemporal branch of the model, spatial relationships from each timestep of the LC data samples were first extracted using CNN. Two sets of two convolution layers were followed by 2 × 2 max pooling operations, where the CNN layers were parameterized by 32, 32, 64, and 64 filters, respectively, with the ReLU activation function. The outputs of the CNN operations were flattened and provided to the temporal model component characterized by LSTM, GRU, or TCN of the respective models. The CNN-LSTM and CNN-GRU models each featured two recurrent layers of 32 and 128 neurons using the tanh activation function, respectively, based on previous implementations [36]. The CNN-TCN model featured two layers with 32 and 128 filters, with the ReLU activation function. The ConvLSTM implementation had two ConvLSTM layers, featuring 32 and 128 filters, respectively, with the ReLU activation function. The kernel size for all CNN and ConvLSTM layers was set to 3 × 3, as seen in previous research studies [36,48]. Before concatenating outputs of each model branch, a dropout factor of 10% was applied. Dropout regularization randomly drops a specified percentage of neurons with the intent of preventing the model from overfitting [49]. Lastly, the fully connected output layer featured nine neurons and the Softmax activation function, which outputted the probability of each LC class label [17]. For all models, the LC and spatial variable data samples were provided with neighborhood effects explained in Section 2.2. The LC data for each sample location were supplied directly to the spatiotemporal input branch in the form of T × M × Figure 2. Overview of the basic branched model structure used to accommodate the 9 × 9 spatiotemporal land cover sample sequences and the auxiliary spatial variables. The "spatial branch" is implemented with CNN layers. The "spatiotemporal branch" implementation varies according to the model type, characterized by CNN-LSTM, CNN-GRU, CNN-TCN, or ConvLSTM layers. Location x, y in the land cover sample is denoted in red, indicating the central cell of the neighborhood.
For all model types, the spatial variable input branch (denoted as "spatial branch" in Figure 2) was implemented with two sets of two convolution layers with 32, 32, 64, and 64 filters, respectively. After each set of two CNN layers, a 2 × 2 max pooling operation was applied. Next, the spatiotemporal LC input branch (denoted as "spatiotemporal branch" in Figure 2) was implemented according to the model type. The implementations for CNN-LSTM, CNN-GRU, and CNN-TCN were adapted from previous studies [46,47]. To implement the spatiotemporal branch of the model, spatial relationships from each timestep of the LC data samples were first extracted using CNN. Two sets of two convolution layers were followed by 2 × 2 max pooling operations, where the CNN layers were parameterized by 32, 32, 64, and 64 filters, respectively, with the ReLU activation function. The outputs of the CNN operations were flattened and provided to the temporal model component characterized by LSTM, GRU, or TCN of the respective models. The CNN-LSTM and CNN-GRU models each featured two recurrent layers of 32 and 128 neurons using the tanh activation function, respectively, based on previous implementations [36]. The CNN-TCN model featured two layers with 32 and 128 filters, with the ReLU activation function. The ConvLSTM implementation had two ConvLSTM layers, featuring 32 and 128 filters, respectively, with the ReLU activation function. The kernel size for all CNN and ConvLSTM layers was set to 3 × 3, as seen in previous research studies [36,48]. Before concatenating outputs of each model branch, a dropout factor of 10% was applied. Dropout regularization randomly drops a specified percentage of neurons with the intent of preventing the model from overfitting [49]. Lastly, the fully connected output layer featured nine neurons and the Softmax activation function, which outputted the probability of each LC class label [17]. For all models, the LC and spatial variable data samples were provided with neighborhood effects explained in Section 2.2. The LC data for each sample location were supplied directly to the spatiotemporal input branch in the form of T × M × M × C, and auxiliary spatial variable data were provided in the form of M × M × V to the spatial variable input branch.

Categorical Cross-Entropy Loss
Supervised DL models were trained by optimizing weights of a network with respect to some objective function or loss function [50]. Using the gradient descent (or stochastic gradient descent) algorithm, the aim is to minimize or maximize the amount of error between the real-world and projected value. Forecasting multi-class LC is formulated as a classification problem, requiring an objective function equipped for this probabilistic task. The categorical cross-entropy (CCE) is commonly used for classification problems, facilitating learning of multiple classes by backpropagating error with respect to forecasted values and real-world LC classes [51]. The CCE function computed with respect to sample i is expressed as follows as per previous studies [22]: where LC class labels are one-hot encoded vectors, denoted byŷ c and y c , whereŷ c represents the real-world LC of class c, and y c represents the forecasted LC for the location. C represents the number of LC classes.
In the computation of error with respect to the CCE function, every sample is given the same amount of influence on model updates. By adding a weight factor to the CCE function, samples can be given varying error penalties that consequently have differing impacts on model weight adjustments during model training procedures [52]. This is expressed as follows: where w i represents the sample weight of sample i. Values for w i assume the sample weights described in the next section.

Calculating Temporal and Spatiotemporal Sample Weights
This research study provides three new LC sample weighting schemes to improve model capacity to capture changes. The proposed temporal and spatiotemporal sample weights were implemented to bypass the need for any manual assignment of cost values to samples, which is challenging in many real-world applications [9]. The sample weighting schemes were based on temporal and spatiotemporal proximity to changes, achieved using the inverse distance weighting (IDW) approach [53]. For the task of LC change forecasting, the aim was to first identify whether a location underwent a change, then refine the sample weight based on whether the change happened recently or whether the location's neighborhood underwent a recent change. Accommodating temporal decay addresses temporal heterogeneity by giving higher importance to recently changed cells, thus decreasing the impact of changes that occurred long ago. For instance, notably larger quantities of changed areas were observed between 2002 and 2003 ( Figure 1b). Under the proposed sample weighting schemes, locations that changed but remained persistent following 2003 were weighted less than those that underwent changes more recently. Additionally, even if the change occurred earlier, a sample of a formerly changed location with an actively changing neighborhood was weighted with higher importance.
The effects of the three new sample weighting schemes are compared against unweighted samples and the widely used inverse frequency weighting scheme. Therefore, the five sample weighting schemes implemented in this research study were as follows: 1.
Unweighted (base case or "none"), where no sample weights were used; 2.
Binary weights (BW), where a traditional inverse frequency weighting scheme used the inverse frequency of changed versus persistent sample counts to assign sample weights; 3.
Temporal weighting scheme 1 (TW1), where the inverse temporal distance weight was computed with respect to the most recent change of the central cell;

4.
Temporal weighting scheme 2 (TW2), where the inverse temporal distance weight was computed with respect to the most recent change of the cell's neighborhood; 5.
Spatiotemporal weighting scheme (STW), where the inverse spatiotemporal distance weight was calculated with respect to the most recent change that occurred within the neighborhood of the central cell.
To compute the sample weights of the training data samples, changes were considered within the temporal extent of 2001 to 2014 (Figure 1b). The three steps to implement the weighting schemes are as follows: Step 1. Identify whether a change has occurred at the sample location, or central cell of the neighborhood (Figure 2). If the number of change incidents was one or greater from 2001 to 2014, the cell was considered as changed.
Step 2. Compute the inverse frequency weight according to the counts of persistent and changed cells. Similar to another research study [54], the initial weights of the samples were calculated on the basis of their overall type. For LC data, the initial weight of the changed samples was b c i = P/(P + C), and the weight of persistent samples was b p i = C/(P + C), where the number of changed cells is denoted by C, and the number of persistent cells is denoted by P. Persistent samples were assigned non-zero weights because they were still important to the learned model structure. Therefore, persistent sample weights (w p i ) assumed the values of b p i and required no further updates. This concludes the calculations required to implement the BW scheme, while temporal and spatiotemporal variation was added to changed sample weights in Step 3 for the TW1, TW2, and STW schemes.
Step 3. With the effect of persistent cells managed in Step 2, the sample weight calculations for the TW1, TW2, and STW schemes were applied to the changed sample weights (w c i ) as a function of the temporal and spatial variation occurring at the central cell and within its neighborhood over time. To implement the temporal weighting schemes (TW1 and TW2), the temporal distance from the most recent year of the training sample was computed with respect to the most recent change at the central cell (d cc ) and to the year of the most recent change occurring in the neighborhood of the central cell (d cn ), respectively. For the spatiotemporal weighting scheme (STW), the spatiotemporal distance was computed from the location and year of the latest central cell in a sample and the location and year of the most recently changed cell in its neighborhood d ST cn . Table 1 shows the formulations of TW1, TW2, and STW. By expanding the weighting schemes to consider changes taking place within a cell's neighborhood, the TW2 and STW schemes increased the weight of change samples with recent nearby transitions. This means that more dynamic change samples had greater influence on model parameter adjustments during the training procedure using the TW2 and STW schemes. The IDW power or exponent parameter was set to one because the spatial resolution of this research study was coarse and there were limited timesteps available, as well as to ensure weight values associated with of historical changes were non-zero. The resulting sample weights are presented in Figure 3. No manual adjustments or normalization techniques were applied to the weights to further influence sample importance. Therefore, the maximum possible weight values were constrained by Step 2, where the inverse proportion of changed and persistent cells was calculated. This led to the maximal weight value of 0.9 (Figure 3). Table 1. Sample weight scheme applied to samples identified as changed.

Sample Weight Scheme for Changed Locations Formula Description
Binary weight (BW) The inverse proportion of changed versus persistent samples Cell-change temporal weight (TW1) Temporal distance (d cc ) between most recent year (t n ) and the year of the most recent change event of the central cell (t cc )

Sample Weight Scheme for Changed Locations Formula Description
Neighborhood-change temporal weight (TW2) Temporal distance (d cn ) from the most recent year (t n ) and the year of change event occurring in the neighborhood of the central cell (t cn ) Spatiotemporal weight (STW)

Model Assessment
The measures selected for evaluating model performance in this study focused on LC changes instead of overall accuracy, since high accuracy values cannot be used to express the capacity of a model to forecast changes [55]. Instead, the assessment included figure of merit (FOM), producer's accuracy (PA), and user's accuracy (UA) [56], which were

Model Assessment
The measures selected for evaluating model performance in this study focused on LC changes instead of overall accuracy, since high accuracy values cannot be used to express the capacity of a model to forecast changes [55]. Instead, the assessment included figure of merit (FOM), producer's accuracy (PA), and user's accuracy (UA) [56], which were identified as more appropriate for evaluating LC change forecasts. FOM provides a measure of agreement between forecasted changes and real-world changes, calculated with respect to "hits", "misses", "wrong hits", and "false alarms". Computations for PA and UA involve subsets of the terms comprising the FOM measure. PA indicates the proportion of area that a model forecasts correctly as changed with respect to real-world changes, whereas UA suggests the amount of correctly forecasted changes versus all projected changes. The correct changes ("hits"), missed changes ("misses"), incorrect changes ("false alarms"), and correct persistence ("correct rejections") components of agreement and disagreement used to compute the FOM measure are also reported separately to showcase quantity and allocation of changes [57].
To conduct an error analysis, the error due quantity, error due to allocation, and allocation error distance [58] were calculated for each model and sample weight combination. Allocation error distance (AED) provides information about the severity of allocation errors, where the distance between the real-world locations and locations of erroneous forecasts were averaged with respect to each LC class [58]. Because LC class or category imbalance was not addressed in this research study, AED was computed with respect to all allocation errors (AED overall ), the largest classes (AED large ), the classes deemed "medium-sized" (AED medium ), and the smallest classes (AED small ). The purpose was to identify where the most severe AEs stemmed from. The large class size category encompassed evergreen forests, shrublands and savannas, and barren land, comprising 91.1% of the study area. The medium class size category captured permanent snow and ice, water bodies, and deciduous forests, covering 8.7% of the study area. The smallest classes were urban and built-up lands and croplands, occupying less than 1% of the study area. Python 3.9.1, GDAL 3.3, and Rasterstats 0.17.0 were used to implement the allocation error distance approach conveyed in prior work [58] and the other change-focused measures. In this research study, the FOM measure was used to identify six top-performing models based on the 2016 LC forecast. The performance of these models was then compared across the multi-year LC change forecasts to determine if trends were maintained. The mapped FOM components also supported a visual assessment for the forecasts produced by the best-performing models identified with respect to their capacity to forecast changes.

Experiment Settings
The models were implemented to the specifications described in Section 2.3 with Python 3.9.1 [59], the Keras API [60], TensorFlow 2.5.0 [61], and an open-source implementation of TCN (Keras-TCN 3.4.0) [62]. The CCE loss function described in Section 2.4 was employed in all models. To train the models, the batch size was set to 128, and the Adam optimizer was used with an initial learning rate of 0.01 [7]. Early stopping and learning rate reductions (using the "ReduceLROnPlateau" function from TensorFlow) were utilized to ensure that model training ceased or that the learning rate was decreased when performance gains were negligible [63].
In this study, the multi-year forecasts were generated using a "rolling-window" strategy according to the implementation in previous work [64]. Instead, "sequenceto-sequence" forecasting was demonstrated to circumvent the propagation of error across projected timesteps [44], as this approach would reduce the already limited timesteps available and lead to short sequences of MODIS LC data, which impeded DL models for LC change forecasting in a prior study [65]. With the "rolling-window" strategy, a multi-year LC change projection was produced using the previous forecast as the next timestep of the testing sequence. The LC data spanning 2001-2014 were used to populate the training dataset, while 2015 was withheld for model validation and 2016-2020 was used for model testing. Samples from the 2001-2014 training dataset were obtained using a rolling-window approach based on a previous application [66], where 10 timesteps comprised the training sequence, with the following timestep withheld as the training label. Therefore, the training data sequences spanned t n , t n+1 , . . . , t n+9 , with t n+10 as the training label. Each data sample comprised spatiotemporal LC data and static spatial variables with the neighborhood specifications expressed in Section 2.2. Samples for every location are provided to train the model with the sample weights described in Section 2.5 and shown in Figure 3. The four model types were trained considering each sample weight scheme (none, BW, TW1, TW2, and STW), where "none" refers to the experimental combinations or base case in which no sample weighting scheme was applied.

Multi-Year Change Assessment
The model assessment described in Section 2.6 was used to quantify the impact of the proposed sample weighting schemes across the four model types. With respect to the FOM measure obtained for 2016, the top six model and sample weight combinations were CNN-TCN STW , CNN-GRU STW , ConvLSTM TW2 , ConvLSTM STW , CNN-LSTM TW1 , and ConvLSTM TW1 (Figure 4). Following the FOM values obtained by these model and sample weight combinations, there was a 19.7% difference between the next model and sample weight combinations ( Figure 4) and a substantial drop in FOM values observed for the base case (denoted as "none"). All sample weighting schemes facilitated improved FOM measures over the base case for the 2016 LC change forecasts, regardless of model type. The BW scheme was associated with consistently improved FOM values compared to the base case, although the top combination using BW (CNN-GRU BW ) was 27.6-32.6% lower than the top six performers identified. In addition, the STW scheme was associated with higher FOM values for three of the four model types (CNN-GRU, CNN-TCN, and ConvLSTM) ( Figure 4). FOM values obtained with the TW2 scheme also enabled improved performance versus the BW scheme for the same three models. The TW1 scheme worked well for CNN-LSTM and ConvLSTM, but reduced FOM values for CNN-GRU and CNN-TCN.

Multi-Year Change Assessment
The model assessment described in Section 2.6 was used to quantify the impact of the proposed sample weighting schemes across the four model types. With respect to the FOM measure obtained for 2016, the top six model and sample weight combinations were CNN-TCNSTW, CNN-GRUSTW, ConvLSTMTW2, ConvLSTMSTW, CNN-LSTMTW1, and Con-vLSTMTW1 ( Figure 4). Following the FOM values obtained by these model and sample weight combinations, there was a 19.7% difference between the next model and sample weight combinations ( Figure 4) and a substantial drop in FOM values observed for the base case (denoted as "none"). All sample weighting schemes facilitated improved FOM measures over the base case for the 2016 LC change forecasts, regardless of model type. The BW scheme was associated with consistently improved FOM values compared to the base case, although the top combination using BW (CNN-GRUBW) was 27.6-32.6% lower than the top six performers identified. In addition, the STW scheme was associated with higher FOM values for three of the four model types (CNN-GRU, CNN-TCN, and Con-vLSTM) (Figure 4). FOM values obtained with the TW2 scheme also enabled improved performance versus the BW scheme for the same three models. The TW1 scheme worked well for CNN-LSTM and ConvLSTM, but reduced FOM values for CNN-GRU and CNN-TCN. After identifying the top six models with respect to FOM values computed for the 2016 forecasts, it was observed that these models maintained the highest FOM measures over the 5-year projection (Figure 5a). Meanwhile, the FOM values remained low for the base case, showing that no sample weights produce underperforming LC change models, regardless of model type. The BW scheme maintained consistent effects on all model types, although CNN-TCNBW and CNN-GRUBW showed a slight increase over time. In contrast to the initial trends seen in Figure 4, ConvLSTMTW1 surpassed ConvLSTMSTW and ConvLSTMTW2 after the 2017 projection, while CNN-TCNSTW, CNN-GRUSTW, and CNN-LSTMTW1 continued to yield the highest FOM measures (Figure 5a). The PA measures computed with respect to changed areas followed the same trends (Figure 5b), showing that the proportion of forecasted versus real-world changes was highest in forecasts obtained from the top six model and sample weight combinations. In contrast, the UA showed a trend dissimilar to those seen with FOM and PA measures (Figure 5c). The highest UA measures were obtained by CNN-LSTMNone and ConvLSTMNone, showing that the After identifying the top six models with respect to FOM values computed for the 2016 forecasts, it was observed that these models maintained the highest FOM measures over the 5-year projection (Figure 5a). Meanwhile, the FOM values remained low for the base case, showing that no sample weights produce underperforming LC change models, regardless of model type. The BW scheme maintained consistent effects on all model types, although CNN-TCN BW and CNN-GRU BW showed a slight increase over time. In contrast to the initial trends seen in Figure 4, ConvLSTM TW1 surpassed ConvLSTM STW and ConvLSTM TW2 after the 2017 projection, while CNN-TCN STW , CNN-GRU STW , and CNN-LSTM TW1 continued to yield the highest FOM measures (Figure 5a). The PA measures computed with respect to changed areas followed the same trends (Figure 5b), showing that the proportion of forecasted versus real-world changes was highest in forecasts obtained from the top six model and sample weight combinations. In contrast, the UA showed a trend dissimilar to those seen with FOM and PA measures (Figure 5c). The highest UA measures were obtained by CNN-LSTM None and ConvLSTM None , showing that the proportion of correctly simulated changes versus all projected change was high. The low quantities of changes projected with no sample weights boosted UA measures over time, as these measures were inflated by small amounts of projected LC change, as indicated by "hits", "false alarms", and "wrong changes" for both 2016 and 2020 (Figure 6a,b). Notably, ConvLSTM TW2 and ConvLSTM STW exhibited higher UA measures than the other top six models. This indicates that these combinations forecasted higher amounts of correctly changed areas out of all forecasted changes and reduced incorrectly changed areas, despite not attaining the maximal FOM or amount of hits (Figures 5c and 6). Overall, CNN-TCN STW yielded the highest amount of correctly changed area of all the 2020 LC change forecasts ( Figure 6). The highest quantity of hits or correctly changed area for each model type was associated with TW1, TW2, and STW, except for CNN-GRU TW1 . Additionally, CNN-TCN STW and ConvLSTM TW1 produced the highest number of false alarms of the top six models. models. This indicates that these combinations forecasted higher amounts of correctly changed areas out of all forecasted changes and reduced incorrectly changed areas, despite not attaining the maximal FOM or amount of hits (Figures 5c and 6). Overall, CNN-TCNSTW yielded the highest amount of correctly changed area of all the 2020 LC change forecasts ( Figure 6). The highest quantity of hits or correctly changed area for each model type was associated with TW1, TW2, and STW, except for CNN-GRUTW1. Additionally, CNN-TCNSTW and ConvLSTMTW1 produced the highest number of false alarms of the top six models.

Multi-Year Error Analysis
The highest amount of EQ was observed for every model's base case, in which no sample weights were used (Figure 7a). Meanwhile, three of the top six models (CNN-LSTMTW1, CNN-GRUSTW, and CNN-TCNSTW) provided the lowest EQ for their respective model types, showing that each projected more realistic quantities of changes. Con-vLSTMTW1 preserved the lowest EQ for the ConvLSTM model type, while EQ values attributed to ConvLSTMTW2 and ConvLSTMSTW gradually exceed those of ConvLSTMBW. Conversely, the base case attained the lowest EA measures, corresponding to the minimal quantities of false alarms and wrong changes observed (Figures 6 and 7b). Of the top six models, CNN-LSTMTW1, CNN-GRUSTW, CNN-TCNSTW, and ConvLSTMTW1 forecasted the highest amounts of changed area allocated incorrectly, while ConvLSTMSTW projected the lowest EA for each step of the 5-year projection.

Multi-Year Error Analysis
The highest amount of EQ was observed for every model's base case, in which no sample weights were used (Figure 7a). Meanwhile, three of the top six models (CNN-LSTM TW1 , CNN-GRU STW , and CNN-TCN STW ) provided the lowest EQ for their respective model types, showing that each projected more realistic quantities of changes. ConvLSTM TW1 preserved the lowest EQ for the ConvLSTM model type, while EQ values attributed to ConvLSTM TW2 and ConvLSTM STW gradually exceed those of ConvLSTM BW . Conversely, the base case attained the lowest EA measures, corresponding to the minimal quantities of false alarms and wrong changes observed (Figures 6 and 7b). Of the top six models, CNN-LSTM TW1 , CNN-GRU STW , CNN-TCN STW , and ConvLSTM TW1 forecasted the highest amounts of changed area allocated incorrectly, while ConvLSTM STW projected the lowest EA for each step of the 5-year projection. Considering the distance of erroneous allocations to real-world LC category locations, the AEDoverall, AEDlarge, AEDmedium, and AEDsmall maintained similar trends across the 5-year forecast (Figure 8). AEDoverall and AEDlarge values showing the smallest or nearest allocation errors were associated with the 2016 forecast by ConvLSTMTW2 (Figure 8a,b). For AEDoverall, a notable deviation was observed for CNN-TCNBW, where the overall allocation error severity exceeded the unweighted base case. The top six models produced overall allocation errors generally closer to the real-world areas than the unweighted base case in the projections for 2017-2020. The same trend was also noted for AEDlarge, except for the 2017 forecast produced by ConvLSTMTW1. However, it was observed that the erroneous large class allocations were either corrected or were more agreeable with the realworld 2018 LC allocations, as the AEDlarge value decreased for the next timestep. It should be noted that the spread of AEDoverall and AEDlarge values was not substantial, indicating that overall allocation errors and allocation errors attributed to the largest LC classes were marginal with respect to the spatial resolution of the dataset. The AEDmedium showed that the top six models produced allocation errors nearer to the real-world classes than the unweighted base case, except for the ConvLSTMSTW model (Figure 8c). While AEDsmall values computed from the 2016 forecast were zero for ConvLSTMSTW and ConvLSTMTW2, larger deviations were observed between 2018 and 2020. This implies agricultural or builtup areas were forecasted far from their real-world allocations. Considering the distance of erroneous allocations to real-world LC category locations, the AED overall , AED large , AED medium , and AED small maintained similar trends across the 5-year forecast (Figure 8). AED overall and AED large values showing the smallest or nearest allocation errors were associated with the 2016 forecast by ConvLSTM TW2 (Figure 8a,b). For AED overall , a notable deviation was observed for CNN-TCN BW , where the overall allocation error severity exceeded the unweighted base case. The top six models produced overall allocation errors generally closer to the real-world areas than the unweighted base case in the projections for 2017-2020. The same trend was also noted for AED large , except for the 2017 forecast produced by ConvLSTM TW1 . However, it was observed that the erroneous large class allocations were either corrected or were more agreeable with the real-world 2018 LC allocations, as the AED large value decreased for the next timestep. It should be noted that the spread of AED overall and AED large values was not substantial, indicating that overall allocation errors and allocation errors attributed to the largest LC classes were marginal with respect to the spatial resolution of the dataset. The AED medium showed that the top six models produced allocation errors nearer to the real-world classes than the unweighted base case, except for the ConvLSTM STW model (Figure 8c). While AED small values computed from the 2016 forecast were zero for ConvLSTM STW and ConvLSTM TW2 , larger deviations were observed between 2018 and 2020. This implies agricultural or built-up areas were forecasted far from their real-world allocations.

Visual Assessment
For the visual assessment, CNN-TCNSTW and ConvLSTMSTW were considered because the first exhibited the highest FOM, while the latter forecasted the smallest number of false alarms among the top six models. For the 2016 forecasts, the false alarms and misses exhibited a "salt-and-pepper" appearance (Figure 9a,c). The CNN-TCNSTW exhibited a few small clusters of false alarms in the west and southwest of the study region. However, for the 2020 LC forecast, CNN-TCNSTW showed more distinctive clusters of false alarms surrounded by correct changes. Areas that were persistent that were forecasted incorrectly as changed appeared to be near to or at the locations that had higher sample weights (Figures 3d and 9b). The 2020 forecast from ConvLSTMSTW showed smaller and fewer clusters of false alarms. It was shown previously that misses contributed the most

Visual Assessment
For the visual assessment, CNN-TCN STW and ConvLSTM STW were considered because the first exhibited the highest FOM, while the latter forecasted the smallest number of false alarms among the top six models. For the 2016 forecasts, the false alarms and misses exhibited a "salt-and-pepper" appearance (Figure 9a,c). The CNN-TCN STW exhibited a few small clusters of false alarms in the west and southwest of the study region. However, for the 2020 LC forecast, CNN-TCN STW showed more distinctive clusters of false alarms surrounded by correct changes. Areas that were persistent that were forecasted incorrectly as changed appeared to be near to or at the locations that had higher sample weights (Figures 3d and 9b). The 2020 forecast from ConvLSTM STW showed smaller and fewer clusters of false alarms. It was shown previously that misses contributed the most errors ( Figure 6), and missed changes were shown to be visually consistent across both 2020 projections (Figure 9b,d).

Discussion
The results obtained in this research study demonstrated that, regardless of the DL model chosen, the proposed sample weighting schemes were beneficial for forecasting LC changes. The TW1, TW2, and STW schemes generally showed improvement over the

Discussion
The results obtained in this research study demonstrated that, regardless of the DL model chosen, the proposed sample weighting schemes were beneficial for forecasting LC changes. The TW1, TW2, and STW schemes generally showed improvement over the traditional BW scheme. In particular, the STW scheme improved the FOM measures compared to the BW scheme across all model types explored. With respect to FOM values obtained for 2016 LC forecasts, six model and weight combinations were identified: CNN-TCN STW , CNN-GRU STW , ConvLSTM TW2 , ConvLSTM STW , CNN-LSTM TW1 , and ConvLSTM TW1 . It was observed that the highest FOM measures associated with these six combinations were maintained for the 5-year projection. Furthermore, the top six models attained the highest PA measures. The UA measures highlighted some interesting trends among the weighting schemes, notably for ConvLSTM. The gradually increasing UA observed for ConvLSTM STW with respect to cumulative forecasted changes indicated consistent increases in correct changes versus all projected changes, while forecasting fewer false alarms or incorrect LC transitions. Overall, STW was most beneficial for CNN-GRU and CNN-TCN across all experiments, while TW1, TW2, and STW similarly benefited ConvLSTM models. The TW1 scheme benefited CNN-LSTM most for all timesteps of the 5-year projection, and the FOM values associated with ConvLSTM TW1 also surpassed those of ConvLSTM TW2 and ConvLSTM STW after 2017. This aligns with the observation that sample weight values and variations of TW1 and STW are more similar than those characterizing TW2 or BW schemes ( Figure 3). However, the similarities did not expound the approximate 2% difference between FOM measures of CNN-LSTM and CNN-GRU with the TW1 and STW schemes (Figure 5a). This requires future investigation of model parameters, structure, and regularization techniques with respect to the weighting schemes. TW2 was associated with only one of the top six models (ConvLSTM TW2 ) and did not facilitate similar performance to models trained with STW for the other model types. This may be because TW2 sample weight values were more like those of the traditional inverse frequency weighting (BW) scheme observed across the study area (Figure 3a,c).
Considering the types of errors associated with the top six models, CNN-TCN STW forecasted the highest amount of correctly changed areas while forecasting the most false alarms for the 2016 projection. However, CNN-TCN STW forecasted the second most false alarms for the 2020 projection, superseded by ConvLSTM TW1 projecting the most persistent area incorrectly as changed. If maximizing the agreement of changed areas for the 5-year forecast was the sole objective, CNN-TCN STW would have been regarded as the "best" model and sample weight combination. Meanwhile, ConvLSTM STW forecasted 86.3% fewer false alarms with less error due to quantity (EQ) after 5 years, which was apparent in the visual assessment ( Figure 9). With respect to the AED measures, the expectation was that all sample weighting schemes would help mitigate allocation error severity overall and with respect to the largest classes, which was typically the case. The AED measures indicated that the worst allocation errors were generally associated with medium and small size classes, which makes sense because no techniques were used to address the LC class imbalance problem and since optimizing per-class change allocations was not the objective of this study. However, the AED medium indicated that the top six models generally produced less severe allocation errors than the unweighted base case. The exception to this trend was ConvLSTM STW , suggesting that one or more of the medium-sized classes were not well-allocated by this model. AED small measures also indicated ConvLSTM STW and ConvLSTM TW2 forecast built-up or agricultural areas far from their real-world allocations. This outcome was expected, as this second dimension of imbalance characterizing LC datasets adds challenges for multi-class change forecasting with DL models. As such, future research studies should investigate further combinations of sample weights with class weights or the focal loss function [67] to reduce the quantity and allocation error distance with respect to non-majority LC categories.
From the visual assessment, it was observed that the spatial variations of weights for the STW scheme ( Figure 3d) were associated with the agreeing and disagreeing allocations of LC changes forecasted with CNN-TCN STW and ConvLSTM STW (Figure 9). The CNN-TCN STW combination appeared more sensitive to the higher sample weight values in some areas, which was more noticeable in the 2020 forecast (Figure 9b). For instance, the projected hits and false alarms were typically seen clustered around locations with larger sample weights (Figures 3d and 9b). This may suggest that persistent samples still require reduced weights or undersampling strategies to manage their influence on learned model parameters. This outcome may also hinge upon the model type, since the 2020 forecast produced by ConvLSTM STW appeared less influenced by areas where the sample weight computed for the location was high. Conversely, in each resulting map, both sporadic and clustered areas of missed changes existed at similar spatial locations for areas with low weight values. However, given the appearance of more clustered hits and false alarms in the 2020 projections (Figure 9b,d), the STW scheme may be beneficial for future works seeking to manage properties like spatial variability [68].
The goal of this research study was to improve LC change forecasting with DL models by proposing and evaluating a sample weighting scheme that integrated temporal and spatiotemporal proximity from geographic timeseries data. The effect of sample weights derived from temporal and spatiotemporal distance from recent changes was unexplored for LC change forecasting with DL models, which are highly influenced by mostly persistent areas. It is also acknowledged LC change rates are typically small [69]. Missed changes were the most common error attributed to all model and sample weighting scheme combinations ( Figure 6). Yet, previous modeling endeavors attained 3.38% correct changes using 15-year temporal resolution data [27]. Therefore, 1.05% of correctly changed areas achieved by the CNN-TCN STW for the 5-year projection may have been somewhat comparable. Additionally, the FOM measure had a positive linear relationship with net observed changes [56]. For example, a land change model with 10-year temporal resolution obtained an FOM value of 9%, with less than 5% of the study area undergoing changes during that time period [28]. This corroborates the values obtained in this research study, in which CNN-TCN STW attained an FOM value of 7.8% with only 3.8% of the region undergoing changes from 2016 to 2020. Future work should consider further optimization of models and sample weighting schemes for datasets with finer spatial resolutions, expanded neighborhoods, longer LC sequences, and noisy datasets. Climatic variables are also important drivers of LC change [70] and should also be integrated to further enhance model capacity to forecast changes alongside the sample weighting schemes. Additional adjustments to computed sample weights may also be beneficial, as previous work identified that low weights were negligible in their effect on model training procedures [20]. Lastly, combinations of data augmentation [16] and the removal or undersampling of persistent samples [14] may be beneficial alongside the proposed sample weighting schemes. Nevertheless, the risk of removing potentially important LC data samples remains an open problem.

Conclusions
This research study explored the potential of proposed inverse temporal distance and inverse spatiotemporal distance sample weighting schemes for LC change forecasting with spatiotemporal DL models and multitemporal LC data available for the Columbia-Shuswap Regional District of BC. The rationale for training sample weighting was to decrease the influence of samples that underwent changes long ago while increasing the influence of changed samples that underwent more recent changes at the central location or within its neighborhood. With temporally and spatiotemporally weighted LC change samples, model forecasts showed consistent improvements in FOM and FOM components of agreement and disagreement versus the unweighted base case ("none") and the traditional inverse frequency weight (BW) schemes. While allocation errors remain an outstanding problem for the DL models, the proposed sample weighting schemes reduced the average distance of allocation errors to real-world LC categories overall and with respect to large and mediumsized LC classes. Based on the findings of this research study, the proposed sample weighting schemes enabled markedly better performance compared to using unweighted samples for LC change forecasting with spatiotemporal DL models. Of all sample weighting schemes, STW was consistently associated with improved FOM and PA measures for all model types versus the traditional BW scheme. It is recommended to explore, analyze, and further adjust the TW1, TW2, and STW sample weighting schemes with respect to other datasets and spatiotemporal DL model configurations in future studies.
Given the typically slow and scarce nature of LC change events that impede direct applications of DL models, this research study contributes to advancing strategies used to mitigate data imbalance for LC change forecasting and other geographic phenomena. It was previously unknown how temporal and spatiotemporal sample weighting schemes contribute to DL model capacity to forecast LC changes. As such, this research study introduced simple sample weights based on temporal and spatiotemporal proximity to change events, demonstrating improvements in DL model capacity to forecast LC changes. This was accomplished without randomly discarding potentially useful examples or augmenting synthetic or transformed samples that defy real-world spatial context and relationships. This research study can benefit any DL modeling endeavor that deals with LC data or imbalanced geographic timeseries data. With increasingly available open-source datadriven modeling approaches featuring sample weight parameter options, cost-sensitive learning techniques can be achieved without complex programmatic modifications. The new sample weighting schemes can also contribute to improving non-timeseries DL models or ML models more generally by allocating less significance to older samples obtained for geographic applications such as projecting urban growth, forest change, or agricultural expansion.