2.2. Recurrent Neural Networks
Recurrent neural networks (RNNs) are a class of neural network architectures specific to processing sequence data, where predictors and outcomes are ordered in time. An RNN cell is a basic processing unit where the output at a given time is a function of the input and the output of the cell at previous times ([37], pp. 498–501). The simple RNN cell takes as input the set of predictors at time $t$, denoted as $x_t$, as well as the output of the recurrent connection at the previous time step, which we refer to in this paper as the “recurrent state”. The right-hand side of Figure 3 shows the computational flow in a single recurrent unit. The input $x_t$ and the recurrent state $h_{t-1}$ are combined in a linear combination with fixed weights $w_x$, $w_h$, and bias $b$. This linear combination is modified by an activation function, denoted as $\phi$, to produce the cell output $h_t = \phi(w_x^\top x_t + w_h h_{t-1} + b)$, which in the simple RNN cell is stored as the new hidden state. The weight vector $w_x$ is the same length as the input $x_t$. The weight $w_h$ and the bias $b$ are scalar values. For the simple RNN cell, the output is the same as the new recurrent state.
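As an illustration, a single step of the simple RNN cell described above can be sketched in Python as follows; the weights, bias, and input sequence are random placeholders, and tanh stands in for the generic activation function $\phi$:

```python
import numpy as np

# A minimal sketch of the simple RNN cell update; all values here are
# illustrative placeholders, not the weights used in this study.
rng = np.random.default_rng(0)
n_features, T = 3, 5
x_sequence = rng.normal(size=(T, n_features))  # predictors x_t over time
w_x = rng.normal(size=n_features)              # weight vector, same length as x_t
w_h, b = 0.5, 0.1                              # scalar recurrent weight and bias

h = 0.0                                        # initial recurrent state
for x_t in x_sequence:
    # Linear combination of input and recurrent state, then activation;
    # the output is stored as the new recurrent state (cf. Figure 4).
    h = np.tanh(w_x @ x_t + w_h * h + b)
```

The same weights $w_x$, $w_h$, and $b$ are reused at every iteration of the loop, which is exactly the weight sharing shown in the unrolled view of Figure 4.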
Figure 4 shows the recurrent unit “unrolled” in time, where the same computational unit is applied recursively. Over time, the recurrent state is modified, but the same set of weights and biases is used at each time step.
Multiple recurrent cells are typically stacked into a “recurrent layer”. One or more recurrent layers are used in conjunction with other neural network layers, such as the standard fully connected “dense” layer. We will refer to an “RNN” as any neural network architecture with one or more recurrent layers.
A popular type of recurrent model architecture is Long Short-Term Memory (LSTM) (Figure 5), which was originally developed to address certain issues with the simple RNN cell [38]. Like the simple RNN cell, the LSTM cell maintains a hidden state, which is the output of the cell at the previous time step. In addition to the hidden state, an LSTM cell maintains a “cell state”, denoted as $c_t$, which incorporates information about the output of the cell over more time steps in the past than just the previous time step. At time $t$, the hidden state $h_{t-1}$ and the predictors $x_t$ are used to modify the cell state through a series of computational gates. The resulting cell state $c_t$ can be viewed as a compromise between the old information as represented by the previous cell state $c_{t-1}$ and the new information as represented by $h_{t-1}$ and $x_t$. The final output of the cell at time $t$ is a combination of $c_t$, $h_{t-1}$, and $x_t$, and is stored as the new hidden state $h_t$. LSTM cells are typically stacked into an LSTM layer that is combined with other neural network layers to form a more complicated model architecture. We will refer to an “LSTM” network as any neural network that contains one or more LSTM layers. The computational graph in Figure 4 still applies to the LSTM cell, with the modification that the state passed recursively in time to the unit consists of both the hidden state and the cell state, and the output of the unit is the hidden state. For the LSTM cell, the combination of the hidden state and the cell state is the recurrent state.
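For reference, one standard formulation of the LSTM gate computations [38] is sketched below; the notation ($W$, $U$, $b$ per gate, $\sigma$ the logistic sigmoid, $\odot$ element-wise multiplication) is illustrative and may differ slightly from the variant implemented in the software used here:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(new cell state)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(new hidden state)}
\end{aligned}
$$

The forget and input gates implement the “compromise” described above: $f_t$ controls how much of the old cell state $c_{t-1}$ is retained, and $i_t$ controls how much new information enters the cell state.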
In this paper, an RNN is trained to map time series of weather variables to time series of FMC, processing one time step at a time. The weather inputs at time $t$ not only produce a predicted FMC at the same time $t$ but also modify the recurrent state and thus affect the predicted FMC at future times $t' > t$. Since numerical weather forecasts are used as inputs, the weather is treated as known in the future.
Neural networks contain many parameters that are fitted on a training set by minimizing a loss function comparing model output to the observed data. With RNNs, the loss can be computed over time, so the model can learn to minimize the prediction error over a forecast window. During training, a fixed number of time steps, called the sequence length, is used to calculate the loss.
We use the mean squared error (MSE) as a loss function, which is standard in machine learning for a continuous response. Suppose that an input sequence is of length $T$. The predicted FMC at time $t$ is denoted as $\hat{y}_t$, and the observed FMC at time $t$ is $y_t$. The loss for a single sequence is

$$\mathrm{Loss} = \frac{1}{T} \sum_{t=1}^{T} \left( \hat{y}_t - y_t \right)^2 \tag{3}$$
Loss is calculated by comparing the predicted time series of FMC to the observed values (
Figure 6).
2.3. RNN Training, Prediction, and Tuning
We trained a recurrent neural network (RNN) with gradient descent using the Adam optimizer. Inputs combine time-varying meteorological variables with static spatial features (longitude, latitude, elevation). The target variable is the 10 h dead fuel moisture content (FMC) measured at Remote Automated Weather Station (RAWS) sites.
Training data are arranged as a tensor of shape
(batch size, sequence length, features). Here, features is the number of unique inputs. Following Zhang et al. ([39], Section 9.7), we use truncated backpropagation through time (BPTT). A single sample consists of a tensor of shape
(sequence length, features). At the start of processing a sample, the recurrent state is reset. At each time step of the sample, a one-dimensional tensor of features is input to the model, the recurrent state is updated, and a single output is generated. This process is repeated for the entire sample. The output of the network from an entire sample is a time series of FMC predictions of the same sequence length as the input sample, which requires setting the function argument
return_sequences to true [
40]. The model loss for the sample is the MSE, as seen in Equation (
3), averaged over the entire sequence length. The gradient of the loss with respect to all model parameters is then calculated via BPTT. Gradients are calculated for each sample sequence in a batch and then averaged across samples [
41]. Model parameters are updated after each batch. A single epoch of training is completed when all batches of samples have been processed. Early stopping halts training when validation performance no longer improves.
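As an illustration, this training arrangement might be expressed in TensorFlow/Keras as in the following sketch; the shapes, the stand-in model, and all values are placeholders rather than the exact configuration used in this study (the study's architecture is given in Section 2.4):

```python
import numpy as np
import tensorflow as tf

# Illustrative data arrangement: (batch size, sequence length, features)
# for inputs, with one FMC target per time step. Values are placeholders.
n_samples, seq_len, n_features = 256, 48, 8
X = np.random.rand(n_samples, seq_len, n_features).astype("float32")
y = np.random.rand(n_samples, seq_len, 1).astype("float32")

# A compiled stand-in model returning the full output sequence.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(8, return_sequences=True,
                         input_shape=(seq_len, n_features)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mean_squared_error")

# Gradients are averaged across the sample sequences in each batch,
# parameters are updated after each batch, and early stopping halts
# training when the validation loss stops improving.
model.fit(X, y, batch_size=32, epochs=10, validation_split=0.1,
          callbacks=[tf.keras.callbacks.EarlyStopping(
              monitor="val_loss", patience=3, restore_best_weights=True)])
```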
Static predictors are repeated at each time step. Batches include samples from multiple locations to promote generalization and to learn the dependence on location.
During training, the effective temporal depth of optimization is limited by the sequence length and batch size. Consequently, truncated BPTT is applied only within this finite window, ensuring computational stability ([39], Section 9.7.1.2). In contrast, during prediction, the model operates with an unconstrained sequence length and a batch size equal to the number of unique prediction locations, receiving new weather inputs sequentially and updating its recurrent state indefinitely. The prediction process is therefore equivalent to unrolling the RNN for an arbitrary number of time steps without truncation, allowing the model to integrate meteorological history over potentially long periods. The same network parameters, trained within the finite truncated BPTT horizon, are reused for recursive forecasting by copying the learned weights from the training configuration to the stateful inference model ([37], p. 512).
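A minimal sketch of this weight transfer in Keras, assuming illustrative placeholder sizes (not the exact values from this study), might look as follows:

```python
import tensorflow as tf

# Placeholder sizes for illustration only.
n_units, n_features, seq_len, n_locations = 64, 8, 48, 16

# Training model: fixed sequence length for truncated BPTT, returning
# the full output sequence.
train_model = tf.keras.Sequential([
    tf.keras.layers.LSTM(n_units, return_sequences=True,
                         input_shape=(seq_len, n_features)),
    tf.keras.layers.Dense(1),
])

# Stateful inference model: unconstrained sequence length (None) and a
# batch size equal to the number of prediction locations, so the
# recurrent state persists across successive calls.
pred_model = tf.keras.Sequential([
    tf.keras.layers.LSTM(n_units, return_sequences=True, stateful=True,
                         batch_input_shape=(n_locations, None, n_features)),
    tf.keras.layers.Dense(1),
])

# Reuse the learned weights from the training configuration.
pred_model.set_weights(train_model.get_weights())
```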
Neural networks have many fixed hyperparameters, such as the number of cells to use within a layer, the overall number of layers, the batch size, the learning rate, etc. It is important to keep the data used to tune hyperparameters separate from the data used to estimate forecast accuracy. The metrics calculated on forecasts should represent the accuracy of the model when predicting entirely new data. The hyperparameter tuning used in this project will be described in
Section 2.6.
2.4. Recurrent Neural Network for Forecasting FMC
In this paper, we use the following RNN model architecture. The input is a collection of features consisting of weather variables and the geographic locations of RAWS. These inputs are fed into a single LSTM layer consisting of 64 cells with a hyperbolic tangent activation function. The LSTM layer outputs a sequence of the same length as the input data, and the entire sequence is passed through the rest of the network to the output layer. The output of the LSTM layer is fed into two dense, fully connected layers with Rectified Linear Unit (ReLU) activation and 32 and 16 cells, respectively. The LSTM layer processes the sequence recurrently, while the dense layers are applied to each time step in the sequence independently. Finally, the output of the model is a single neuron with linear activation. Linear activation is chosen for the final output layer since we are mapping the inputs to a continuous real number. A single neuron is used as the output layer since model predictions are generated by deploying the model point-wise at each location, so the FMC at a given time is a single scalar value. The final model output is a two-dimensional array, where the first dimension is the number of unique locations and the second is the number of time steps (48 h in our case). Other model architectures could be used to generate outputs that correspond to a spatial grid, but we chose this architecture because it is the easiest to implement and the most flexible when predicting across various spatial domains. The left side of
Figure 3 shows a diagram of the core architecture of the RNN model used in this study. Additional hyperparameters for the model used in this paper are shown in
Table 2.
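As a minimal sketch in TensorFlow/Keras (the software used in this study), the core architecture described above could be expressed as follows; n_features is a placeholder for the number of input features, and any hyperparameters not shown follow Table 2 or the TensorFlow defaults:

```python
import tensorflow as tf

def build_model(n_features, seq_len=48):
    """Sketch of the core RNN architecture (see Figure 3 and Table 2)."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(seq_len, n_features)),
        # Single LSTM layer: 64 cells, tanh activation, returning the
        # full output sequence rather than only the final hidden state.
        tf.keras.layers.LSTM(64, activation="tanh", return_sequences=True),
        # Dense layers applied to each time step independently.
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(16, activation="relu"),
        # Single output neuron with linear activation: one scalar FMC
        # value per location and time step.
        tf.keras.layers.Dense(1, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model
```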
The HRRR model is used to provide weather input to the RNN. The HRRR provides hourly forecasts up to a maximum of 48 h. The sequence length used to structure the data input
(batch size, sequence length, features), described in Section 2.3, is set to 48. Each input to the RNN consists of a 48 h sequence of observations, and every output of the model is a 48 h sequence of FMC predictions at the same times as the input. The batch size is a tuned hyperparameter, and the number of features was determined through the theoretical considerations described in
Section 2.1.2.
In this study, we evaluate the forecast accuracy over all of 2024, broken into 48 h periods to align with the sequence length of the training inputs. The initial hidden state and cell state of the LSTM layer are zero by default. Starting at 00:00 UTC on 1 January 2024, we input 48 h of HRRR weather data and geographic inputs at a given set of test locations to generate the first set of forecasts. The hidden state and cell state of the LSTM layer are then reset, and the next 48 h of inputs are processed. This process is repeated until all forecasts for 2024 are generated. The recurrent state of the LSTM layer is thus reset for each 48 h forecast period, and no information is retained across distinct forecasting periods. Resetting the recurrent state in this way is done to estimate the forecast accuracy of the model 48 h into the future. Resetting is not required by this model architecture, and an operational version of the model could retain the recurrent state for arbitrarily long.
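As a sketch, this rolling 48 h evaluation could look like the following, assuming the stateful pred_model from the earlier sketch in Section 2.3 and a hypothetical array inputs_2024 of shape (number of test locations, hours in 2024, features):

```python
import numpy as np

# Illustrative rolling forecast over 2024 in independent 48 h windows;
# inputs_2024 is a hypothetical array of HRRR and geographic inputs.
seq_len = 48
windows = []
for start in range(0, inputs_2024.shape[1], seq_len):
    pred_model.reset_states()   # zero hidden and cell states per window
    window = inputs_2024[:, start:start + seq_len, :]
    windows.append(pred_model.predict(window, verbose=0))
forecasts_2024 = np.concatenate(windows, axis=1)
```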
The hyperparameters for the RNN were selected with a restricted grid search, which evaluates a subset of all possible hyperparameter combinations. The combination of hyperparameters that results in the most accurate predictions on the validation set is chosen.
Table 2 lists the hyperparameters used for the RNN model. All hyperparameters listed were chosen via restricted grid search, except for the activation functions. The LSTM activation function is a hyperbolic tangent, the activation function for the internal dense layers is ReLU, and the activation function for the output is linear. The first two of these choices are viewed as sensible defaults, and the output layer has linear activation since it maps the inputs to a continuous real number. Any hyperparameters not listed in
Table 2 were set to their default values from the TensorFlow software project, which are generally accepted in the literature as reasonable defaults. The Adam optimizer was used for the gradient descent procedure. Additionally, the set of features was chosen based on theoretical considerations and was not subject to hyperparameter tuning, but a sensitivity analysis for the set of predictors used will be presented in
Section 3. The hyperparameter tuning method is discussed in
Section 2.6, and
Appendix A has additional technical details.
2.6. Analysis Design
We seek to forecast FMC at arbitrary locations and at future times. The model maps a time series of weather forecasts to a time series of spatial FMC forecasts based on the FMC and weather data presented to it during training. In a sense, the model interpolates in space and extrapolates the weather-to-FMC mapping in time. We will compare the models using the root mean squared error (RMSE) at new locations and at times later than those used for training. The RMSE for the FMC models is interpretable in units of percent. In all reported RMSE values in this paper, the square root is applied after any averaging operations; that is, the square root is always the final operation.
Cross-validation methods estimate the accuracy with which a model predicts unseen data ([
45], Section 7.10). In machine learning, samples are randomly drawn to form independent training, validation, and test sets ([
45], Section 7.2). The training set is used to fit the model, the validation set to tune hyperparameters and monitor overfitting, and the test set, kept separate, to estimate final predictive accuracy. In time-dependent problems, the test data should come from a period after the training data to reflect the real forecast conditions. This requirement conflicts with the common assumption that training and test samples are independent and identically distributed because temporal ordering introduces dependence. The evaluation of time series models must therefore balance statistical independence with causal realism. In spatially dependent problems, the test data needs to be from locations that were not used in training model parameters. If a model is used to predict at locations included in its training data, it has already learned aspects of the data structure at those locations, leading to overly optimistic accuracy estimates [
24].
To estimate the forecast error of FMC models at unobserved locations, we apply a spatiotemporal cross-validation procedure. In the hyperparameter tuning step, FMC observations from 2022 are used to train a set of candidate model architectures. A random sample of 90% of the RAWS is used to generate the training locations and the remaining 10% the testing locations. Then, the forecast accuracy of those models is compared at the testing locations over all of 2023. During the forecast period, HRRR inputs are used to generate predictions, but no FMC data from the testing time period or testing locations are used to inform predictions. The model architecture with the lowest RMSE when forecasting in 2023 is selected.
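As an illustration, the random spatial split might be implemented as in the following sketch, where station_ids is a hypothetical array of RAWS identifiers; the 90/10 proportions follow the text above:

```python
import numpy as np

# Illustrative 90/10 spatial split of RAWS into training and testing
# locations; station_ids is a hypothetical array of station identifiers.
def split_stations(station_ids, seed):
    rng = np.random.default_rng(seed)
    ids = rng.permutation(station_ids)
    n_test = int(round(0.1 * len(ids)))
    return ids[n_test:], ids[:n_test]   # (train_ids, test_ids)
```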
With the model architecture fixed, the model is trained from scratch on observations from 2023 with a new random split of training and testing locations. The trained model is used to forecast at locations not used in training over all of 2024, broken into 48 h forecast windows. The final accuracy metrics for the model are calculated by comparing the forecasts for 2024 to FMC observations at the testing locations. The training period always precedes the forecast period in time, and the random sampling of locations accounts for spatial uncertainty. A schematic diagram of this cross-validation method is shown in
Figure 7, with the training period of 2023 and the forecast period of 2024.
ODE+KF and the climatology method do not fit directly into the training and testing paradigm used in ML. ODE+KF is run over all of 2024 at the test locations in 48 h increments, each initialized with a spin-up period: it utilizes 24 h of data prior to the start of the forecast as a spin-up (i.e., with data assimilation) and is then run in forecast mode (i.e., without data assimilation) for 48 h. Each iteration of 24 h spin-up plus 48 h forecasting is independent. The climatology method produces forecasts by retrieving historical data for each time in 2024 and each testing location. The climatology method therefore does not utilize a spatial holdout like the ML models, so its accuracy metrics are more optimistic than those for the ML methods.
To account for the variability in the random spatial samples and the randomness due to model initialization, the training in 2023 and forecasting in 2024 were repeated 500 times with different random seeds. Thus, 500 random training/test splits of the RAWS were used, along with 500 different sets of initial weights for the ML models. We refer to these as replications from now on. The replications are used to construct uncertainty bounds on the final accuracy metrics. There were 151 RAWS with valid data in the forecast period of 2024; each replication had 16 RAWS in the test set, and each RAWS was included in the test set on average 53.0 times across replications. Additional sources of uncertainty come from the HRRR weather inputs, but the HRRR model provides only a single deterministic forecast, rather than an ensemble or probabilistic forecast that could be easily incorporated into an estimate of uncertainty. To quantify the uncertainty from the HRRR weather inputs, we would need an estimate of the uncertainty for each weather variable at each location in the study region and across the entire year. That would currently require many strong assumptions about the model, since no such analysis exists to our knowledge. For these reasons, we do not account for the uncertainty in the weather inputs in the error metrics presented in this paper, but any systematic errors or biases from the weather inputs are reflected in these error metrics.
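Schematically, the replication procedure can be sketched as below; split_stations is the hypothetical helper from the earlier sketch, and train_rnn and forecast_2024 are hypothetical stand-ins for model training and windowed forecasting, not functions from this study's code:

```python
# Illustrative replication loop: 500 random spatial splits and 500
# random model initializations; train_rnn and forecast_2024 are
# hypothetical placeholders for the training and forecasting steps.
per_replication_residuals = []
for seed in range(500):
    train_ids, test_ids = split_stations(station_ids, seed=seed)
    model = train_rnn(train_ids, seed=seed)        # fresh initial weights
    residuals = forecast_2024(model, test_ids)     # observed minus predicted
    per_replication_residuals.append(residuals)
```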
Within each replication, the forecast residuals are calculated by subtracting the predicted FMC from the observed FMC for each model at each time. The residual shows whether the forecast was too high or too low, and it can be positive or negative. We then calculate the squared residual, which is always positive. Within a replication for a given model, squared residuals are calculated across the set of test locations and the set of forecast times. The per-replication MSE is calculated by averaging the squared residuals over all test locations and test times for that replication. The overall RMSE is then calculated by averaging the per-replication MSE values and taking the square root. Uncertainty bounds are calculated as the square root of one standard deviation of the per-replication MSE values. The overall model bias is calculated in the same way using the raw residuals rather than the squared residuals. This metric is used to analyze whether the forecasts are systematically too high or too low. Both the RMSE and bias are interpretable in units of percent FMC. Equation (5) shows the mathematical form of the RMSE estimate used to quantify forecast error across the entire study region and all times in 2024, using the mathematical definitions in Table 6. Equation (7) shows the mathematical form of the bias estimate. Equation (6) shows how the standard deviation of the RMSE is calculated, and the standard deviation of the bias estimate is calculated in an analogous way.
We calculate the per-replication MSE by averaging the squared error over all times and locations. With $y_{l,t}$ the observed FMC and $\hat{y}_{l,t,r}$ the predicted FMC at test location $l$ and time $t$ in replication $r$, and $L$, $T$, and $R$ the numbers of test locations, forecast times, and replications (see Table 6):

$$\mathrm{MSE}_r = \frac{1}{LT} \sum_{l=1}^{L} \sum_{t=1}^{T} \left( y_{l,t} - \hat{y}_{l,t,r} \right)^2 \tag{4}$$

The overall RMSE is calculated by averaging the per-replication MSE over all replications and taking the square root:

$$\mathrm{RMSE} = \sqrt{ \frac{1}{R} \sum_{r=1}^{R} \mathrm{MSE}_r } \tag{5}$$

We estimate the uncertainty in the overall RMSE by calculating the square root of the standard deviation of the MSE across all replications:

$$\mathrm{SD}_{\mathrm{RMSE}} = \sqrt{ \sqrt{ \frac{1}{R-1} \sum_{r=1}^{R} \left( \mathrm{MSE}_r - \frac{1}{R} \sum_{r'=1}^{R} \mathrm{MSE}_{r'} \right)^2 } } \tag{6}$$

We calculate the bias of the models by averaging the raw error across all locations, times, and replications, and the associated standard deviation of the bias is calculated in a way analogous to the standard deviation of the MSE:

$$\mathrm{Bias} = \frac{1}{RLT} \sum_{r=1}^{R} \sum_{l=1}^{L} \sum_{t=1}^{T} \left( y_{l,t} - \hat{y}_{l,t,r} \right) \tag{7}$$
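Under the assumption that per_replication_residuals holds one array of raw residuals (observed minus predicted) per replication, as in the earlier sketch, the aggregation in Equations (4)–(7) can be expressed as:

```python
import numpy as np

# Per-replication MSE (Eq. 4), overall RMSE (Eq. 5), its uncertainty
# (Eq. 6), and overall bias (Eq. 7); square roots are applied last.
mse_r = np.array([np.mean(res ** 2) for res in per_replication_residuals])
rmse = np.sqrt(np.mean(mse_r))                  # Eq. (5)
rmse_sd = np.sqrt(np.std(mse_r, ddof=1))        # Eq. (6)
bias = np.mean([np.mean(res) for res in per_replication_residuals])  # Eq. (7)
```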
We further analyze the forecast error by location and by hour of the day. For the RMSE by hour of the day, squared errors are averaged over locations, replications, and days of the year to produce a single RMSE estimate for each hour, which shows how the forecast error changes throughout the diurnal FMC cycle. This also shows how the accuracy changes as the forecast runs longer into the future. To analyze the spatial variability in the forecast error, the RMSE and bias are calculated by averaging over all times and replications in the test set, producing accuracy metrics for each RAWS. Again, the replications are used to construct uncertainty bounds. Within a single replication, we obtain an estimate of the forecast accuracy for 16 of the RAWS. After many replications, we end up with estimated RMSE and bias, with uncertainty bounds, for every RAWS with data availability in 2024. The standard deviations of these RMSE estimates across replications are calculated in a way analogous to Equation (6).
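For instance, assuming hypothetical flattened arrays sq_err of squared residuals and hour of matching hour-of-day labels (flattened over locations, replications, and days), the per-hour RMSE could be computed as:

```python
import numpy as np

# Illustrative RMSE by hour of day: average the squared residuals
# within each hour, then take the square root last.
rmse_by_hour = np.array(
    [np.sqrt(np.mean(sq_err[hour == h])) for h in range(24)]
)
```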