Deep Learning Classification of Cheatgrass Invasion in the Western United States Using Biophysical and Remote Sensing Data

Cheatgrass (Bromus tectorum) invasion is driving an emerging cycle of increased fire frequency and irreversible loss of wildlife habitat in the western US. Yet, detailed spatial information about its occurrence is still lacking for much of its presumably invaded range. Deep learning (DL) has demonstrated success for remote sensing applications but is less tested on more challenging tasks like identifying biological invasions using sub-pixel phenomena. We compare two DL architectures and the more conventional Random Forest and Logistic Regression methods to improve upon a previous effort to map cheatgrass occurrence at >2% canopy cover. High-dimensional sets of biophysical, MODIS, and Landsat-7 ETM+ predictor variables are also compared to evaluate different multi-modal data strategies. All model configurations improved results relative to the case study, and accuracy generally improved by combining data from both sensors with biophysical data. Cheatgrass occurrence is mapped at 30 m ground sample distance (GSD) with an estimated 78.1% accuracy, compared to 250 m GSD and 71% map accuracy in the case study. Furthermore, DL is shown to be competitive with well-established machine learning methods in a limited data regime, suggesting it can be an effective tool for mapping biological invasions and, more broadly, for multi-modal remote sensing applications.


Introduction
Cheatgrass (Bromus tectorum) was unintentionally introduced in North America in the late 19th century from Eurasia and is now found in every state in the contiguous US [1]. In the western US, it has become a dominant component in many shrubland and grassland ecosystems [2,3], resulting in an increase in fine fuels that can lead to a cycle of increased fire frequency, fire severity, and irreversible loss of native vegetation and wildlife habitat [4][5][6]. Detailed spatial information on the presence and abundance of cheatgrass is needed to better understand factors affecting its spread, assess fire risk, and be able to identify and prioritize areas for invasion treatment and fuels management. However, such information is lacking for much of the ostensibly invaded area in the western US, with exceptions for parts of the Great Basin ecoregion [7][8][9][10][11][12][13].
Cheatgrass invasion has been especially devastating in sagebrush (Artemisia spp.) ecosystems, which are home to a variety of sagebrush-obligate species such as greater sage-grouse (Centrocercus urophasianus). Sage-grouse historically occurred throughout a vast (~125,000 km²) area of the western US and portions of southern Alberta and British Columbia, Canada, but now occupy approximately half that area [14]. Remote sensing approaches to mapping cheatgrass for such a large area represent a significant challenge due to diverse environmental conditions and difficulties obtaining enough ground-truth data to train predictive models. Downs et al. [15], which we revisit later in this section, mapped cheatgrass for the sage-grouse range with moderate success (71% accuracy). To establish context for the current work, we broadly review the various approaches previously taken to map cheatgrass.
Remote sensing approaches to mapping cheatgrass distribution, percent cover, and dynamics (e.g., die-off, potential habitat, phenological metrics) generally fall into three categories: those focusing on spectral signatures or phenological indicators in overhead imagery [10,12,13,[16][17][18][19][20][21]; those based on modeling the ecological niche of cheatgrass using known ranges of biophysical conditions where cheatgrass is known to occur [22,23]; and those combining elements of those two approaches [7,8,11,15,24,25]. Much attention has been given to deriving phenological indicators of cheatgrass presence from spectral indices, such as the Normalized Difference Vegetation Index (NDVI), because its life cycle differs from many of the native plant species in its North American range. Cheatgrass is a winter annual that may begin growth in the late fall and senesce in late spring, whereas many native dryland ecosystem plants begin growing in mid to late spring and continue growth through summer under favorable precipitation conditions [2,26]. Thus, cheatgrass can be identified indirectly by assessing pixel-level chronologies of NDVI [7,10,12,13,19]. Phenological differences between cheatgrass and non-target vegetation can be difficult to detect in drier or cooler years, which is why some have focused on using imagery from years when cheatgrass is more likely to show an amplified NDVI response to above-normal winter or early spring precipitation [10,12]. Furthermore, the strength of this response varies among different landscapes because the timing of cheatgrass growth and senescence varies across ecological gradients such as elevation [12], soils [27], and climatic conditions [10,12,28].
The separability of cheatgrass from other vegetation in overhead imagery (either by phenology or spectral characteristics) is also affected by its relative abundance within a pixel. Some have elected to use sensor platforms with a high revisit rate and coarse spatial resolution, such as MODIS [7,8,21,24,29] or AVHRR [10], which offer better potential for capturing within-season variation of cheatgrass growth but contain more spectral heterogeneity due to their coarse spatial resolution. Others have chosen platforms such as Landsat-7 or -8 in favor of their finer spatial resolution, but at the expense of less-frequent return cycles and risk of missing peak NDVI [10,12,13,[17][18][19]21]. Some have used multiple sensors to make independent predictions of cheatgrass at different geographic scales (e.g., [10]) or temporal periods (e.g., [12,21]). Recently, Harmonized Landsat and Sentinel-2 (HLS) data [30] were used to map invasive annual exotic grass percent cover, the dominant component of which the authors assumed is cheatgrass [31]. To our knowledge, combining concurrent data from multiple sensors with complementary return cycles, spatial resolution, and spectral information to map cheatgrass has not been attempted.
We revisit Downs et al. [15], which became an important source of data and motivation for this study. Their approach utilized cheatgrass observations compiled from unrelated field campaigns throughout the western US to train a Generalized Additive Model that considered both remotely sensed phenological data (NDVI) and a broad suite of biophysical factors such as soil-moisture temperature regimes, vegetation type, potential relative radiation, growing degree days, and climatic factors (see Section 2 for data descriptions). However, 37 of 48 potentially useful biophysical factors were excluded from their model due to high correlation. While their model achieved reasonable (71%) test accuracy, they expressed uncertainty about map accuracy east of the Continental Divide due to substantially fewer observations in that region. We hypothesize that using a larger volume of remote sensing data and more robust machine learning approaches, including those in the deep learning domain, might benefit the problem.
Deep learning (DL) algorithms have received broad attention for environmental remote sensing applications such as land cover mapping, environmental parameter retrieval, data fusion and downscaling [32], as well as other remote sensing tasks such as image preprocessing, classification, target recognition, and scene understanding [33][34][35][36].
Reasons for the rise in popularity include well-demonstrated improvements in performance, ability to derive highly discriminative features from complex data, scalability to a diverse range of Big Earth Data applications, and improved accessibility to the broader scientific community [32,[35][36][37]. DL is also seen as a potentially powerful tool for extracting information more effectively from the rapidly increasing volumes of heterogeneous Earth Observation (EO) data [38][39][40].
Despite the advances with DL in remote sensing, its application to the field remains challenging due in part to a comparative lack of volume and diversity of labeled data relative to other domains, and the limited transferability of pre-trained models to remote sensing applications [34,37]. Zhang et al. [35] propose four key research topics for DL in remote sensing that remain largely unanswered: (1) maintaining the learning performance of DL methods with fewer adequate training samples; (2) dealing with the greater complexity of information and structure of remote sensing images; (3) transferring feature detectors learned by deep networks from one dataset to another; and (4) determining the proper depth of DL models for a given dataset. The availability of training data is a relevant concern in this study, as the number of field observations is smaller than what many DL practitioners prefer and what is typically seen in the literature. This concern applies broadly to the use of DL in spatial modeling of native and nonnative species, where field data are often time-consuming and expensive to collect or may not be readily accessible from other sources [12].
Our goal is to derive more discriminative, higher-resolution models of cheatgrass occurrence using Downs et al. [15] as a starting point. We expand from there by using DL and more traditional machine learning approaches to combine concurrent time series of Landsat-7 ETM+ and MODIS data. Our first objective is to compare the performance of all model types and configurations to identify a single high-performing model configuration. The second objective is to construct a consensus-based ensemble of the preferred model to generate a 30-m ground sample distance (GSD) map of cheatgrass occurrence for the historic range of sage-grouse. The results of this study are intended to provide more detailed information than previously available on the extent of cheatgrass invasion in the western US to support multiple land management agencies' efforts to mitigate impacts of cheatgrass invasion and facilitate further scientific investigation of factors affecting its spread.

Datasets
Three categories of data are used in this study: field observations of cheatgrass cover, time series earth observation satellite data, and biophysical spatial data. We utilize the same labeled field observations and 48 biophysical factors as Downs et al. [15], with the addition of a larger volume of multi-modal satellite data and three ancillary variables. Details about these data and their preparation are described in the following sections.

Field Observations
Downs et al. [15] compiled over 24,000 field vegetation measurements in the historic range of sage-grouse that were collected on multiple unrelated field campaigns between 2001 and 2014 (Figure 1). Of these observations, 6418 are deemed useful based on geographic accuracy and overlap with the study area, completeness, and rigor of collection methods. This was further reduced to 5973 after removing observations that had incomplete satellite data or were less than 60 m apart (i.e., the distance of at least two pixels). Nearest-neighbor spacing of field observations ranged from 61 m to 79,421 m, with a mean distance of 1837 m. Most of the excluded observations are from the U.S. Department of Agriculture (USDA) Forest Inventory and Analysis program, which does not provide the true geographic locations in the publicly available version of its data. All field data were collected from transects ranging from 25 m to 100 m in length using point intercept or standardized plot frame techniques.

Satellite Imagery
Time series spectral data from both annual and seasonal composite MODIS Terra [41] and Landsat-7 ETM+ satellites [42] are used in this study (Table 1). Both sets of imagery are composited on a pixel-wise basis to reduce residual cloud and aerosol contamination and to reflect the time of peak vegetation vigor, as determined by maximum NDVI within the composite period. We use seasonal composite data for the approximate Northern Hemisphere spring (1 March-31 May) and summer (1 June-31 August) periods, which correspond to periods of peak productivity (spring) and senescence (summer) in the life cycle of cheatgrass.
The Landsat-7 annual and seasonal composite data span most (2003-2012) of the period of selected field observations. We initially used MODIS data for the same period as Landsat-7, but later added annual and seasonal composite data for years 2001-2002 and 2015-2016 because it was found to improve results. In summary, we compiled annual and seasonal composite satellite data corresponding to all years that field observations were collected, except for 2001-2002 and 2014, for which we did not have Landsat-7 data. It is appropriate to use different time phases of satellite data in our models because these data are treated either as independent non-sequential variables or as independent time series in our models. Furthermore, the full temporal stack of satellite data at each sample location is considered in the analysis, not just the year corresponding to when the sample was collected (see Section 2.2). Google Earth Engine [43] is used to create annual and seasonal maximum NDVI composites from MODIS 16-day composite data and resample it to the spatial resolution of Landsat-7 (30 m GSD). For each MODIS and Landsat-7 composite time series, we also derive a delta-NDVI grid for each year that represents the pixel-wise difference between NDVI and the long-term median NDVI based on all data in the time series. This metric is similar to that used by Bradley and Mustard [10], who found the delta-NDVI between dry and wet years to be a useful indicator of cheatgrass presence, as invaded areas tend to exhibit greater inter-annual NDVI variability than native vegetation.
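The delta-NDVI metric described above can be sketched in a few lines; the NDVI values below are hypothetical, and the real computation is applied pixel-wise across composite grids.

```python
import numpy as np

# Hypothetical maximum-NDVI composite values for one pixel across years.
ndvi_series = np.array([0.21, 0.35, 0.18, 0.42, 0.27, 0.33])

# Delta-NDVI for each year: the pixel-wise difference between that year's
# NDVI and the long-term median of the full time series, so years of
# amplified green-up appear as positive anomalies.
delta_ndvi = ndvi_series - np.median(ndvi_series)
```

In the study, this is computed per pixel for each MODIS and Landsat-7 composite time series.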

Biophysical and Ancillary Data
We include the six types of biophysical spatial data used by Downs et al. [15] as well as the ancillary datasets Level-III ecoregions [44], elevation, and gridded latitude and longitude (Table 2). The first biophysical dataset, soil moisture-temperature regimes, is considered because these regimes are known to influence ecosystem resilience and resistance to invasive grasses, including cheatgrass [27,45,46]. These data were derived by Chambers et al. [47] from the SSURGO and STATSGO2 national soils databases and, where necessary, we replicated their method to expand its geographic extent to provide complete coverage for our study area.
The second category of biophysical data used is vegetation data derived from the national-scale LANDFIRE Existing Vegetation Type dataset [48]. LANDFIRE vegetation types were generalized into broader plant community associations that are more appropriate for the scale of our analysis and unlikely to change over the time phase of our field observations and satellite data.
Potential relative radiation (PRR) is a unitless index of available solar radiation for photosynthetic activity that is based on solar geometry and terrain and calculated for a specified seasonal period [49]; thus, it is assumed static across years and the time phase of our satellite imagery. PRR is calculated for the same approximate growing season of cheatgrass used in the previous study (i.e., 1 October-30 June).
Growing degree days (GDD) is an index that represents the relative amount of time during a specified period that temperatures are above a given threshold that is considered suitable for growth of the target species [50]. GDD is calculated using the same data and parameters as Downs et al. [15]; i.e., 1-km gridded DAYMET daily minimum and maximum temperature data [51] for 1 October 2014 to 30 April 2015, and a minimum temperature threshold of 0 °C (32 °F). While GDD varies across years due to interannual climate variation, we use a single year to be consistent with the previous study which noted they were primarily interested in representing general geographic patterns of GDD.
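As a concrete illustration, GDD can be accumulated from daily minimum and maximum temperatures. The mean-temperature formulation below is a common variant, and the temperature values are hypothetical; the exact formulation in [50] and the DAYMET inputs used in the study may differ in detail.

```python
import numpy as np

# Hypothetical daily minimum and maximum temperatures (deg C) for one pixel.
tmin = np.array([-4.0, -1.0, 2.0, 5.0])
tmax = np.array([3.0, 6.0, 10.0, 14.0])

base = 0.0  # minimum temperature threshold used in the study (0 deg C)

# A common GDD formulation: daily mean temperature above the base,
# clipped at zero so cold days contribute nothing, then summed.
daily_mean = (tmin + tmax) / 2.0
gdd = np.sum(np.clip(daily_mean - base, 0.0, None))
```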
The largest, and final, category of biophysical data is a collection of 4-km gridded climatic datasets depicting monthly and annual 30-year (1981-2010) norms for minimum and maximum temperature, and precipitation [52,53] (Table 2). From these data we also derive five seasonal climatic datasets that correspond to important periods during the growing season of cheatgrass: cumulative winter (December-February) precipitation, cumulative spring (April-May) precipitation, cumulative summer (July-August) precipitation, and winter (November-February) minimum and maximum temperature. Seasonal groupings were determined based on expert knowledge and exploratory analysis of cheatgrass occurrence from field data and climatic variables.

Variable Selection
We select four combinations of predictor variables that represent two generic approaches for mapping cheatgrass. The first approach is based on biophysical factors that affect ecological niche and is identical to that used by Downs et al. [15], with the addition of ecoregions and gridded latitude/longitude. The second approach combines ecological niche factors and spectral-spatial remote sensing. The purpose of comparing model performance with these sets is twofold: to evaluate potential gains and losses in classification accuracy by combining ecological and spectral data, and to evaluate gains and losses in classification accuracy by combining data from multiple sensors. For defining variable sets, let x be the complete set of predictor variables for a location in the study area, and let Di(x) be a function that selects a defined subset of variables from x. The four sets of variables tested (D1-D4) are further described in Table 3. All continuous predictor variables are standardized by subtracting their respective mean and dividing by the standard deviation.
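The standardization step can be sketched as follows, with a hypothetical matrix of continuous predictors (rows are samples, columns are variables):

```python
import numpy as np

# Hypothetical continuous predictors: e.g., a temperature and a
# precipitation variable for three sample locations.
X = np.array([[10.0, 200.0],
              [12.0, 180.0],
              [14.0, 220.0]])

# Z-score standardization: subtract each column's mean, divide by its
# standard deviation, so every variable has mean 0 and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```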

Analysis Methods
Four machine learning methods for predicting cheatgrass occurrence are compared: Random Forest (RF), Logistic Regression (LR), Deep Neural Networks (DNNs), and Joint Recurrent Neural Networks (JRNNs). The classification objective for all models is two classes of cheatgrass occurrence above and below a strong distribution break at 2% cheatgrass canopy cover (Figure 2). This break is selected because it provides a good balance of sample sizes in each class and allows us to avoid the assumption of true absence for sites where cheatgrass was not observed. The LR, DNN, and JRNN models are implemented in Python using the TensorFlow library (version 1.2.1), and RF models are implemented in Python using the scikit-learn library (version 0.18.2).

Sampling
Model training and selection is performed using k-fold cross-validation with 90% of the dataset. The other 10% of the data is randomly withdrawn for independent verification. As the data are assimilated from multiple field campaigns that are unevenly distributed throughout time and the study area, there is potential for spatial and temporal autocorrelation. To reduce these effects, samples are randomly shuffled and cross-validation splits are stratified using equal joint distributions based on ecoregion and generalized land cover. Spatial and temporal autocorrelation are further mitigated by ensembling k-fold models as described in Section 2.3.

Figure 2. The red line illustrates a strong break at 2% canopy cover and is used to define cover classes for this study.
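A minimal sketch of the stratified splitting idea, using scikit-learn's StratifiedKFold on a joint ecoregion/land-cover stratum label (all labels below are randomly generated stand-ins for the real field data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
n = 100

# Hypothetical ecoregion (3 levels) and generalized land-cover (2 levels)
# labels for each sample location.
ecoregion = rng.integers(0, 3, n)
landcover = rng.integers(0, 2, n)

# Joint stratum: one integer per (ecoregion, land cover) combination so
# each fold preserves their joint distribution.
strata = ecoregion * 2 + landcover

# Shuffle and split into 5 folds stratified on the joint label.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(skf.split(np.zeros(n), strata))
```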

Categorical Variables
Three categorical variables are included in the models: soil temperature and moisture regime (s), generalized vegetation cover type (v), and EPA Level III ecoregions (r) (Table 2). As the other inputs to our algorithms are real-valued vectors, we employ two simple strategies for incorporating vector representations of the categorical variables: one-hot vectors in RF models, and embedding vectors for LR, DNN, and JRNN models. One-hot vectors are necessary for incorporating categorical data in RF classifiers. This method can become a limitation when the cardinality of categorical variables is high, causing the dimensionality of the transformed vector to become unmanageable. However, this limitation does not exist in our case because cardinality is low for our categorical variables (i.e., Cs = 7, Cv = 15, Cr = 23). For a categorical variable with C possible classes, we define for each class i a one-hot vector ei ∈ {0,1}^C whose i-th element is 1 and whose remaining elements are 0. We define es, ev, and er as the one-hot vectors associated with the categorical values for a location.
Embedding vectors are a parametrized vector representation of categorical values. The values of embedding vectors are learned jointly with the other parameters of the LR, DNN, and JRNN models. We define three embedding matrices for the categorical variables, Ws ∈ ℝ^(q×7), Wv ∈ ℝ^(q×15), and Wr ∈ ℝ^(q×23), where q is the size of the embedding vector and the second dimension corresponds to the number of classes for a given categorical variable. We define zs, zv, and zr as the embedding vectors associated with the categorical values for a location.
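The two categorical encodings can be illustrated as follows; the class index, embedding size, and random initialization are assumptions for the example (in the actual models the embedding weights are learned jointly during training):

```python
import numpy as np

q = 4            # embedding size (a hyperparameter)
n_classes = 7    # e.g., the soil moisture-temperature regimes

# One-hot vector for class index i (used for the RF models).
i = 2
one_hot = np.zeros(n_classes)
one_hot[i] = 1.0

# Embedding lookup: column i of a q x n_classes matrix. Here the weights
# are randomly initialized; in the LR/DNN/JRNN models they are trained.
W = np.random.default_rng(0).normal(size=(q, n_classes))
embedding = W @ one_hot  # equivalent to selecting column i of W
```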

Random Forest
The RF method has been shown to perform well for predictive vegetation mapping, is resilient to overfitting, and provides competitive results compared to other machine learning approaches in low-resource data regimes [54][55][56][57]. To optimize performance of RF models, we performed a search over 200 random RF hyperparameter configurations. Key hyperparameters that we sampled (and their range of values) include: (1) sampling method (with and without replacement); (2) criterion for splitting nodes (Gini index); (3) maximum number of features (square root of the total number of features); (4) minimum number of samples in a leaf node (1, 2, 4); (5) minimum number of samples in a split node (2, 5, 10); (6) maximum depth of a tree (10-110); and (7) the number of decision trees in the forest (10-200).
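A random search of this kind can be sketched with scikit-learn's RandomizedSearchCV. The synthetic data, small n_iter, and exact sampling distributions are assumptions chosen for brevity; the study sampled 200 configurations over its real feature sets.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the real predictor matrix and occurrence labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Search space paraphrasing the hyperparameters listed above.
param_dist = {
    "bootstrap": [True, False],          # sampling with/without replacement
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
    "max_depth": randint(10, 111),       # 10-110
    "n_estimators": randint(10, 201),    # 10-200 trees
}

search = RandomizedSearchCV(
    RandomForestClassifier(criterion="gini", max_features="sqrt",
                           random_state=0),
    param_dist, n_iter=5, cv=3, random_state=0)  # n_iter=200 in the study
search.fit(X, y)
```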

Logistic Regression
We include LR as a baseline model in our study because it is used as the classification function in the DNN and JRNN models. With n as the number of continuous variables in the data subset and q as the size of an embedding vector, the input for a given sample is a vector u ∈ ℝ^(n+3q) composed of standardized real values of the continuous predictor variables and vector representations of the categorical variables described in Section 2.2.1 (Di is one of the variable selection functions defined in Table 3). The output of the LR model is a vector y ∈ ℝ^2 containing a probability distribution over the two classes of cheatgrass canopy cover for a given location:

y = softmax(Wu + b),

where the matrix W ∈ ℝ^(2×(n+3q)) and vector b ∈ ℝ^2 are learned parameters. The softmax is defined as the elementwise vector-valued function that normalizes elements into a probability distribution:

softmax(z)i = exp(zi) / Σj exp(zj).

We fit the parameters of the logistic regression model by minimizing the cross-entropy between the ground truth and predictions with the ADAM optimization algorithm [58]. ADAM is a variant of stochastic gradient descent that adjusts the model parameters adaptively according to estimates of the first- and second-order moments of the gradient. We performed 200 training runs with random initializations and uniformly sampled L2 regularization weights (range 0.001-0.1) to select the best LR model.
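A minimal numerical sketch of the LR forward pass and loss, with hypothetical input and parameter values:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax: normalize elements into a distribution.
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical input vector u (continuous variables plus embeddings) and
# learned parameters W, b for the two cover classes.
u = np.array([0.5, -1.2, 0.3])
W = np.array([[0.2, -0.1, 0.4],
              [-0.3, 0.2, 0.1]])
b = np.array([0.05, -0.05])

y = softmax(W @ u + b)   # class probabilities; sums to 1
loss = -np.log(y[1])     # cross-entropy if the true class index is 1
```

In training, this loss is averaged over mini-batches and minimized with ADAM.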

Deep Learning Models
We evaluate two DL architectures (DNN and JRNN) for mapping cheatgrass. Collectively, these architectures offer a high degree of modeling flexibility for determining the pixel-, object-, or structure-based features, with an added capability of representing all three types of information to express complex relationships [35]. As a universal function approximator, a DNN can derive highly discriminative features that represent complex relationships between input variables in a high dimensional space. In contrast, RNNs (specifically JRNN in this study) can utilize their internal state and flow in multiple directions to exploit sequential connections that may be present in time-series imagery.
The input to the DNN model is the same as defined for the LR model. The architecture of the DNN model (Figure 3) consists of L hidden layers hl, which are recursively defined as:

h0 = u,  hl = ReLU(Wl hl-1 + bl),  l = 1, ..., L,

where the ReLU (Rectified Linear Unit) [59] operation is defined as the elementwise vector-valued operation:

ReLU(z) = max(0, z).

The output of the DNN is directed to a linear LR function (Equation (3)). Dropout regularization is also used to avoid overfitting and reduce generalization error [60], and batch normalization is used to better condition gradient updates [61]. Together, these techniques help to stabilize learning during training.
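The recursive hidden-layer computation can be sketched with plain NumPy; layer sizes and random weights are placeholders, and dropout/batch normalization are omitted for brevity:

```python
import numpy as np

def relu(z):
    # Elementwise ReLU: max(0, z).
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
u = rng.normal(size=8)  # hypothetical input vector

# Forward pass through two hidden layers: h_l = ReLU(W_l h_{l-1} + b_l).
h = u
for size_in, size_out in [(8, 16), (16, 16)]:
    W = rng.normal(size=(size_out, size_in)) * 0.1
    b = np.zeros(size_out)
    h = relu(W @ h + b)

# The final hidden state feeds a linear LR output layer (two classes).
logits = rng.normal(size=(2, 16)) @ h
```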
As an extension to the DNN, we employ a joint composition of bidirectional RNNs, or JRNN, with Long Short-Term Memory (LSTM) to predict cheatgrass occurrence (Figure 3). Bidirectional RNNs have been found effective for modeling difficult time-series problems and operate by processing sequence data in both directions, thus allowing output nodes to receive signals from both previous and future time steps [62]. The use of LSTM in an RNN can reduce training difficulties and improve the RNN's ability to model long-term dependencies [63]. LSTM's capability to track discriminative values over arbitrary time intervals is especially useful if there are response lags of unknown duration between dependent events in a time series. Such is the case in this study, as the timing and magnitude of cheatgrass growth (and subsequently its detectable presence in time-series spectral-spatial data) may be accelerated or lagged depending on climatic conditions across and within years.
Let M = (m1, ..., mT) and L = (l1, ..., lT) be the MODIS and Landsat-7 time series, where mt and lt are vectors of annual-, spring-, and summer-composite pixel values of the t-th year for all spectral bands in Table 1. We define two bidirectional LSTM networks, fM and fL, for these respective imagery time series. The LSTMs provide condensed vector representations of the platform time series for a given location. These vectors are concatenated with the categorical embeddings and the continuous biophysical variables D1(x) and used as input to a DNN as previously described, but with the initial input h0 formed by concatenating fM(M), fL(L), the three categorical embedding vectors, and D1(x). It is important to note the JRNN cannot be applied with the D1 variable set alone due to the inherent time-series structure of the model.
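The JRNN input construction can be illustrated schematically. Here a simple mean-over-time linear map stands in for each bidirectional LSTM encoder, and all dimensions are hypothetical; the point is how the condensed time-series codes are concatenated with the other predictors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-year composite feature vectors for each sensor
# (years x bands); the real inputs follow Table 1.
modis_series = rng.normal(size=(14, 7))
landsat_series = rng.normal(size=(10, 6))

def encode(series, dim=8):
    # Stand-in for a bidirectional LSTM encoder: a fixed linear map of
    # the mean over time, yielding a condensed vector of length dim.
    W = np.random.default_rng(1).normal(size=(dim, series.shape[1]))
    return W @ series.mean(axis=0)

embeddings = rng.normal(size=3 * 4)  # three categorical embeddings, q = 4
biophysical = rng.normal(size=20)    # continuous D1(x) variables

# Concatenate condensed time-series codes with the other inputs to form
# the DNN input, as in the JRNN configuration.
dnn_input = np.concatenate([encode(modis_series), encode(landsat_series),
                            embeddings, biophysical])
```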
The parameters of the JRNN and DNN models are fit with the same optimization algorithm, objective function, and regularization strategies as described for the LR model. We perform 200 training runs with random initializations over randomly sampled hyperparameters to identify the best performing DL model configurations. Early stopping based on overall accuracy is used to mitigate overfitting model parameters for all models trained with stochastic gradient descent (LR, DNN, JRNN). The hyperparameters (and their ranges) that we sampled included: number of nodes per hidden layer (32-512, incremented by powers of 2); dropout rate (0-0.5); learn rate (0.001-0.03); and mini-batch size (16-128, incremented by powers of 2).

Figure 3. In the DNN configuration, the predictor variables (Equation (2)) are provided as input. In the JRNN configuration, the condensed vector representations of imagery time series created by the joint LSTM networks are concatenated with the D1(x) set of continuous biophysical variables and categorical predictors (Equation (7)) to create an input to the DNN.

Model Ensembling
The resultant probability distributions of cheatgrass occurrence (i.e., ≥2% cheatgrass canopy cover) are strongly bimodal with peaks above and below 0.5; therefore, we use a simple decision rule of p ≥ 0.5 to classify cheatgrass occurrence. Maps of cheatgrass occurrence are produced for each k-fold model from the cross-validation process and ensembled using a simple consensus (or class frequency) method. The method is used to create two types of ensemble maps of cheatgrass occurrence. The first type of ensemble map is based on the consensus value (0 to 5) and is intended to display spatial agreement among folds. The second type of ensemble map is a binary representation of the first using thresholds ranging from k ≥ 1 to k = 5, which represent less to more restrictive ensemble predictions of cheatgrass occurrence, respectively. This ensemble method is devised to provide insight into the spatial agreement of the individual folds and to identify an optimal ensemble of the folds, as described in the next section.
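The decision rule and consensus ensembling for a single pixel can be sketched as follows (the per-fold probabilities are hypothetical):

```python
import numpy as np

# Per-fold occurrence probabilities for one pixel from a 5-fold ensemble.
fold_probs = np.array([0.62, 0.48, 0.71, 0.55, 0.44])

# Decision rule p >= 0.5 per fold, then consensus = number of folds
# predicting occurrence (0 to 5).
votes = (fold_probs >= 0.5).astype(int)
consensus = int(votes.sum())

# Binary ensemble map at threshold k: occurrence if at least k folds agree.
k = 3
occurrence = int(consensus >= k)
```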
All predictions are mapped at a 30-m GSD equivalent to that of Landsat-7. Certain portions of the study area corresponding to non-target land cover types (i.e., cultivated lands, pasture and hay, closed canopy deciduous and evergreen forest, urban/developed lands, water) as depicted in the national Cropland Data Layer [64] and National Land Cover Dataset [65] are masked from mapped predictions.

Model Performance
We use common performance metrics for binary classifiers, including overall accuracy, precision, recall, and F1 score (the harmonic mean of precision and recall), to evaluate the models and resultant ensemble maps. Acceptable accuracy is defined as >71%, the level observed in our motivating study by Downs et al. [15]. Cross-validation performance is assessed by averaging performance metrics across the k test folds based on 90% of the entire dataset. Verification performance is assessed differently, as shown in Figure 4, due to the use of the consensus method to ensemble the k-fold predictions. Instead, k predictions are made for each verification sample and the consensus level that yields the best overall accuracy or F1 is then reported. In addition, we qualitatively investigate the effect of the consensus method on the performance metrics by plotting overall accuracy and F1 for all five consensus ensembles (i.e., k ≥ 1, k ≥ 2, k ≥ 3, k ≥ 4, k = 5) to determine which consensus level provides the best balance between overall accuracy and F1 (see Section 3.2). We chose to balance performance this way because overall accuracy can be misleading for imbalanced datasets and because it helps to balance Type I and Type II error.

Figure 4. Verification performance is assessed by ensembling the predictions from each of the k-fold models using the consensus method, which yields k estimates of performance. The same process is applied to each map pixel.
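These metrics can be computed with scikit-learn; the labels and predictions below are hypothetical:

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Hypothetical verification labels and consensus-ensemble predictions
# (1 = cheatgrass occurrence at >=2% canopy cover).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)  # harmonic mean of precision and recall
```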

Model Selection
In cases where data are limited, reducing variance in estimators of generalization performance, such as k-fold cross-validation, may be as important as unbiased estimates from independent validation for estimating true generalization performance [66]. Given this consideration, our methods for model selection, and ultimately choosing a high-performing model to map cheatgrass distribution, are intended to accommodate the relatively limited size of our dataset and to independently verify generalization performance of the various methods and resultant map. We tested 10- and 5-fold cross-validation and found 5-fold cross-validation to be the most effective in reducing variance across folds while maintaining acceptable accuracy. Hyperparameters for each model class are chosen according to best average cross-validation accuracy from 200 random hyperparameter sets for each model. Final selection of a model for mapping cheatgrass is subsequently based on cross-validation performance to avoid potential model selection bias. Performance on the 10% of the dataset reserved for verification is provided as a theoretically unbiased comparison of performance.

Results
The following results relate to our primary objectives as follows: (1) compare the performance of the LR, RF, DNN, and JRNN models tested with four combinations of biophysical and spectral-spatial variables to identify a high-performing model configuration (Section 3.1); and (2) construct a consensus-based ensemble of the preferred model to generate a 30-m GSD map of cheatgrass occurrence (Section 3.2).

Comparison of Model Types and Variable Sets
All model configurations that we tested demonstrated acceptable (>71%) overall accuracy in cross-validation and verification, except for LR with the biophysical variable set (D1). With this variable set, cross-validation accuracies improved slightly (2-4%) but verification accuracies did not improve relative to the previous study. The best performing models are those that combined Landsat-7 and MODIS with biophysical and ancillary data (D4), achieving cross-validation accuracies that are approximately 7-8% greater than the previous study (Table 4). Among these, the DNN-D4 configuration demonstrated the best cross-validation performance when considering overall accuracy (79.6%), variation in overall accuracy (2.47%; Table 5), and F1 (0.812; Table 6). Verification accuracy of this configuration is very similar (79.1%), suggesting the model is relatively stable when generalizing to unseen data.

Table 4. Overall accuracy (%) of cross-validation (CV) and verification (V) across model types and variable sets. The k-fold consensus threshold that yields the best verification accuracy is denoted in superscripts.

Table 6. Cross-validation (CV) and test F1 scores across model types and variable sets. The k-fold consensus threshold that yields the best verification F1 score is denoted in superscripts.

While the DL architectures we implemented do enhance prediction of cheatgrass occurrence, we found no apparent trends in overall accuracy (Table 4) or F1 (Table 6) across variable sets that suggest they are superior or inferior to their LR and RF counterparts. Within each variable set, the performance advantages of one model type over another are also not strongly distinctive. LR appears to be the slightly stronger model with set D3, although the JRNN achieves a slightly better F1 score with the verification dataset.
Similarly, the DNN has slightly better cross-validation performance with D4 than the other models, but RF appears to generalize better to the verification dataset. However, note that the verification accuracy of the RF model may be overly optimistic in this case, as its cross-validation accuracy is 4.4% lower.
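The cross-validation protocol used to compare model types can be sketched as follows. This is an illustrative stand-in, not the study's pipeline: the LR and RF models are represented with scikit-learn defaults, the DNN/JRNN architectures are omitted, and the data are synthetic placeholders for a D4-style predictor table.

```python
# Sketch of a k-fold comparison of LR and RF on one predictor table.
# Data and hyperparameters are illustrative, not the study's.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for biophysical + MODIS + Landsat-7 predictors.
X, y = make_classification(n_samples=2000, n_features=40,
                           n_informative=12, random_state=0)

models = {
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Reporting both the mean and the spread of fold accuracies, as in Table 4 and Table 5, guards against selecting a model that performs well only on a lucky split.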

Effects of Variable Sets
When we evaluate the effects of different variable sets on model performance, we find that adding satellite data (D2, D3, and D4) improves performance compared to using biophysical and ancillary data alone (D1). The differences in overall accuracy (Table 4) and F1 scores (Table 6) between using only MODIS (D2) or only Landsat-7 (D3) are mixed, suggesting no clear advantage of one sensor over the other for our application. However, three of the four model types tested (RF, DNN, and JRNN) achieve their best performance when using both sensors (D4). LR achieves its best performance using only Landsat-7 for satellite data (D3).
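The nesting of the variable sets described above can be sketched as simple feature stacks. Array names (`bio`, `modis`, `landsat`) and dimensions are placeholders, assuming D2 and D3 add one sensor each to the biophysical set and D4 adds both, as described in the text.

```python
# Illustrative construction of the four variable sets as column stacks.
# Feature counts are arbitrary stand-ins for the study's predictors.
import numpy as np

rng = np.random.default_rng(0)
n = 100
bio = rng.normal(size=(n, 20))      # biophysical + ancillary variables
modis = rng.normal(size=(n, 8))     # MODIS-derived predictors
landsat = rng.normal(size=(n, 12))  # Landsat-7 ETM+-derived predictors

variable_sets = {
    "D1": bio,                               # biophysical/ancillary only
    "D2": np.hstack([bio, modis]),           # + MODIS
    "D3": np.hstack([bio, landsat]),         # + Landsat-7
    "D4": np.hstack([bio, modis, landsat]),  # + both sensors
}
for name, X in variable_sets.items():
    print(name, X.shape)
```

Because the sensors contribute disjoint columns rather than fused pixels, each model is free to learn cross-sensor feature combinations on its own, which is the "simpler than fusion" strategy argued for in the Discussion.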
We later hypothesized that the size of our training dataset (N = 5973) may not have been large enough to give the DNN and JRNN models a competitive advantage over LR and RF. Recall that we had to discard more than 18,000 field observations due to poor geographic accuracy or other quality issues. We qualitatively tested this hypothesis by rerunning the experiments with all the available data (N = 6418). Overall accuracy and F1 from cross-validation improved for all model configurations (Table 7), except for LR with set D1, where the F1 score decreased slightly. The DNN and JRNN models became more competitive with RF and LR across variable sets, although this trend is subtle. These findings support our suspicion, although more observation data and more rigorous testing are needed to confirm it.
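One standard way to probe whether sample size is the limiting factor is a learning curve: cross-validated accuracy at increasing training-set fractions. The sketch below uses scikit-learn on synthetic data as an illustration of the technique, not a reproduction of the study's experiment.

```python
# Learning-curve sketch: does held-out accuracy still improve as
# training data grows? Data and model here are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=1500, n_features=30,
                           n_informative=10, random_state=1)

sizes, _, test_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=1),
    X, y, cv=5, train_sizes=np.linspace(0.2, 1.0, 5))

mean_scores = test_scores.mean(axis=1)
for n, s in zip(sizes, mean_scores):
    print(f"n={n}: CV accuracy {s:.3f}")
```

A curve that has not plateaued at the full sample size suggests that additional labeled observations, rather than a different architecture, would yield the larger gain.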

Ensemble Mapping of Cheatgrass Occurrence
As described in Section 2.4, final selection of a model for mapping cheatgrass is based on cross-validation performance, owing to the limited amount of available data and the steps taken to reduce the risk of overfitting, model selection bias, and generalization error. Based on this criterion, the DNN-D4 configuration is selected to map cheatgrass. We plot trends in DNN-D4 test overall accuracy, F1 score, precision, and recall across all five consensus levels (Figure 5) to examine the tradeoffs of our ensembling approach and to identify an appropriate level for post hoc analysis and interpretation. Recall that k ≥ 1 is the least restrictive ensemble and k = 5 is the most restrictive in terms of the predicted area. As expected, overall accuracy and precision increase as the consensus level becomes more restrictive and false positives are reduced. Note that overall accuracy is unacceptable (<71%) at k ≥ 1 due to poor precision. Overall accuracy peaks (79.1%) at k ≥ 4, beyond which declining recall and increasing false negatives offset further gains in precision, while the balance between precision and recall (F1) peaks at k ≥ 2 (F1 = 0.676). Therefore, we use the midpoint between peak accuracy and peak F1, k ≥ 3, to produce an accuracy-F1-balanced map of cheatgrass occurrence (Acc. = 78.1%, F1 = 0.673, Prec. = 0.644, Rec. = 0.704; Figure 6).

Figure 6. Predicted distribution of cheatgrass occurrence (>2% canopy cover) in the historic range of sage-grouse (excluding areas classified as non-target land cover types) depicted as: (a) the spatial agreement (i.e., consensus of k-fold predictions), and (b) the accuracy-F1-balanced consensus of k ≥ 3 overlaid by EPA Level-III ecoregion boundaries (numbering corresponds to ecoregion names in Table 8).
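The consensus-thresholding step can be sketched as follows: each of the five fold models casts a per-pixel vote, and a threshold k on the vote count defines the ensemble prediction. The data below are synthetic (five independently noisy voters), so the numbers are illustrative; only the qualitative precision/recall tradeoff mirrors the text.

```python
# Sketch of k-fold consensus ensembling with synthetic votes:
# precision rises and recall falls as the threshold k tightens.
import numpy as np

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=5000)
# Five fold-model predictions, each correct ~75% of the time.
votes = np.stack([np.where(rng.random(5000) < 0.75, y_true, 1 - y_true)
                  for _ in range(5)]).sum(axis=0)

metrics = {}
for k in range(1, 6):
    y_pred = (votes >= k).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    acc = np.mean(y_pred == y_true)
    metrics[k] = (acc, prec, rec, f1)
    print(f"k>={k}: acc={acc:.3f} prec={prec:.3f} rec={rec:.3f} f1={f1:.3f}")
```

Sweeping k and recording all four metrics, as in Figure 5, makes explicit the tradeoff that motivates choosing the midpoint threshold rather than the single-metric optimum.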
With this mapped prediction, we estimate 253,727 km² (or 22%) of the historic range of sage-grouse (excluding non-target land cover types as described in Section 2.4) to be invaded by cheatgrass. The effect of balancing accuracy and F1 score on predicted areal extent is evident when comparing Figures 6 and 7, which reveals that even minor differences in ensemble performance can have substantial impacts on the estimate of invaded area.
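The conversion from classified pixels to areal extent is simple arithmetic at 30 m GSD: each pixel covers 900 m². The sketch below works backward from the figures reported above (253,727 km², 22% of the masked range); the pixel count is implied, not taken from the map itself.

```python
# Back-of-the-envelope check of the areal estimate at 30 m GSD.
# Inputs are the reported summary figures, not the map raster.
PIXEL_AREA_KM2 = (30 * 30) / 1e6            # 900 m^2 = 0.0009 km^2 per pixel

predicted_km2 = 253_727                     # reported invaded extent
n_pixels = predicted_km2 / PIXEL_AREA_KM2   # implied count of invaded pixels
masked_range_km2 = predicted_km2 / 0.22     # implied masked sage-grouse range

print(f"{n_pixels:.3e} invaded pixels; ~{masked_range_km2:,.0f} km^2 masked range")
```

At this pixel size, a 1% shift in per-pixel precision or recall moves the estimate by thousands of square kilometers, which is why small ensemble differences translate into large differences in invaded area.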
Visual assessment of the cheatgrass maps and zonal analysis by ecoregion (Table 8) confirm that cheatgrass invasion is extensive in the Northern and Central Basin and Range, Snake River Plain, and Columbia Plateau, where it has been studied more extensively and is known to be pervasive. Other notable, less-studied areas of apparent invasion include a region overlapping the southern portion of the Wyoming Basin and the northern portion of the Colorado Plateaus ecoregions, a region in the western portion of the Northwestern Great Plains ecoregion, and a region overlapping the southern portion of the Northwestern Great Plains and the northern portion of the High Plains ecoregions.

Table 8. Predicted areal extents of cheatgrass occurrence by ecoregion based on k ≥ 3 spatial consensus. For reference, the total and masked areas of the historic range of sage-grouse are provided. The proportion of cheatgrass-occupied area is relative to the masked area of the sage-grouse range. The numbering of ecoregions corresponds to map labels in Figure 6b.

Discussion
This study focused on developing more discriminative, higher-resolution models of cheatgrass occurrence for the historic range of the greater sage-grouse, using Downs et al. [15] as a baseline. In doing so, we were able to improve overall accuracy by approximately 7% and increase spatial resolution from 250- to 30-m GSD, relative to the previous study. We consider these improvements biologically significant because even minor differences in accuracy can result in large differences in predicted areal extents, especially for species like cheatgrass that are widespread over large geographic areas [67,68]. The accuracy of our accuracy-F1-balanced cheatgrass map (78.1%) is comparable to other studies that focused on much smaller regions in the Great Basin, Snake River Plain, and Colorado Plateau. For example, Bradley and Mustard [10] achieved 64% and 72% accuracy using AVHRR and Landsat-7, respectively. In a related study [12], accuracies ranged from 54% to 74% using Landsat MSS, TM, and ETM+. It is worth noting, however, that these studies predicted more monotypic areas heavily infested with cheatgrass, whereas our study focused on identifying areas with at least 2% canopy cover of cheatgrass. Singh and Glenn [17] achieved 77% accuracy in southern Idaho using Landsat. Bishop et al. [19] report higher model accuracies (85-92%) for seven national parks in the Colorado Plateau, although these estimates are based on the combined area of low and high probability-of-occurrence classes where cheatgrass was considered present if it occurred at >10% canopy cover, making direct interpretation of accuracy difficult. In summary, we find our results encouraging compared to previous studies given the difference in geographic scale and ecological diversity of our study area, as well as the lower threshold for detection of cheatgrass occurrence.
Combining biophysical, ancillary, and satellite data generally improved the performance of the four model classes that we tested, lending further credence to approaches for mapping cheatgrass that incorporate ecological niche factors and remote sensing [7,8,11,15,24,25]. Looking more closely at the satellite data, we found that combining concurrent MODIS and Landsat-7 data generally improves model performance compared to using either sensor alone. We attribute this result to choosing sensors with spectral-spatial characteristics that are complementary for mapping cheatgrass and selecting robust machine learning techniques that are well-suited to deriving discriminative features from multi-modal data. This approach is simpler than fusing satellite data and provides greater flexibility for choosing and testing sensors for a given application. However, we do not discourage data fusion or the use of fused satellite data such as HLS [30], which has been shown to be useful for mapping exotic annual grass cover [31]. In fact, DL algorithms have shown promise for performing pixel-level image fusion [69]. As such, the combined modeling and data fusion capabilities of DL make it an intriguing tool for leveraging the rapidly increasing volume of EO imagery [38,39].
The similar performance among model architectures in this study underscores the importance of evaluating multiple analysis methods and variable combinations. In a meta-analysis of land cover mapping literature, accuracy differences due to algorithm choice were not found to be as large as those due to the type of data used [70]. While DL algorithms have proven superior in many remote sensing applications [35][36][37], their performance also hinges on having sufficiently large datasets from which to learn highly discriminative features. However, what constitutes "sufficiently large" is not well established and depends on the complexity of the problem and the learning algorithm [71]. This topic is considered by some to be one of the major research questions for DL in remote sensing that remains largely unanswered [35]. We did observe a benefit to all models from adding 10% more data, suggesting that sample size may be a limiting factor in our cross-model comparison. The performance of the DL models in this study is still encouraging, however, given the circumstances and their comparable performance to LR and RF under a limited data regime. This is consistent with others who have shown good performance using DL for similar land use/land cover applications [71][72][73][74][75]. Acquiring more field data was beyond the scope of this study but should be a priority for future research, given that more data have likely become available since the previous study.
We chose relatively simple DL methods as a logical first step to assess whether DL was appropriate for our application and might warrant investigating more computationally intensive methods such as convolutional neural networks (CNNs). CNNs are commonly used with overhead remote sensing imagery due to their ability to exploit information in neighboring pixels [36]. However, CNNs may not perform well when the phenomena of interest occur in mixed pixels or exist at the sub-pixel level [74], as is the case with cheatgrass. The problem can be exacerbated if higher-resolution imagery is unavailable or significant cloud cover is present. These considerations, along with the greater ease of use of the DNN and JRNN methods, factored into our decision to exclude CNNs from this study. However, the relative success of the DNN and JRNN methods does warrant future testing of CNN approaches, and a logical next step might be developing joint DNN-CNN or JRNN-CNN architectures for semi-supervised classification.

Conclusions
In this paper, we propose two straightforward DL approaches (DNN and JRNN) using large predictive variable sets of biophysical and multi-modal remote sensing data (MODIS and Landsat-7 ETM+) to improve prediction (accuracy and spatial resolution) of the invasive exotic annual grass, cheatgrass. We benchmark the DL models against two conventional machine learning algorithms (LR and RF) and compare results to a prior study that served as an inspiration and data source for this work. Both DL approaches were found to improve prediction, although they held only a slight advantage over LR and RF with our dataset. We surmise that more labeled data are needed to achieve better performance with the DL methods, but note that the preferred DNN model provided a 7-8% accuracy improvement over the comparison study. The model's ability to predict cheatgrass occurrence over the historic range of sage-grouse (i.e., much of the western US) with accuracy comparable to, or better than, previous smaller-scale studies is also noteworthy. Combining biophysical and multi-modal satellite data was also found to improve the prediction of cheatgrass in all models. A 30-m GSD map of cheatgrass occurrence is produced for the historic range of sage-grouse to help land managers and researchers better understand factors affecting its spread, assess fire risk, and identify and prioritize areas for treatment. We suggest future work explore the existing models with additional observation data collected in subsequent years, along with an expansion of remote sensing time-series data. In addition, data augmentation techniques should be explored to increase the total population of training data, and other DL architectures should be evaluated for performance improvements.

Data Availability Statement: The data, results, code, and figures presented in this paper are openly available at www.github.com/pnnl/fieryfuture. Biophysical and satellite raster datasets are available on request from the corresponding author.