Integrating Deep Learning and Hydrodynamic Modeling to Improve the Great Lakes Forecast

: The Laurentian Great Lakes, one of the world’s largest surface freshwater systems, pose a modeling challenge in seasonal forecast and climate projection. While physics-based hydrodynamic modeling is a fundamental approach, improving the forecast accuracy remains critical. In recent years, machine learning (ML) has quickly emerged in geoscience applications, but its application to the Great Lakes hydrodynamic prediction is still in its early stages. This work is the ﬁrst one to explore a deep learning approach to predicting spatiotemporal distributions of the lake surface temperature (LST) in the Great Lakes. Our study shows that the Long Short-Term Memory (LSTM) neural network, trained with the limited data from hypothetical monitoring networks, can provide consistent and robust performance. The LSTM prediction captured the LST spatiotemporal variabilities across the ﬁve Great Lakes well, suggesting an effective and efﬁcient way for monitoring network design in assisting the ML-based forecast. Furthermore, we employed an explainable artiﬁcial intelligence (XAI) technique named SHapley Additive exPlanations (SHAP) to uncover how the features impact the LSTM prediction. Our XAI analysis shows air temperature is the most inﬂuential feature for predicting LST in the trained LSTM. The relatively large bias in the LSTM prediction during the spring and fall was associated with substantial heterogeneity of air temperature during the two seasons. In contrast, the physics-based hydrodynamic model performed better in spring and fall yet exhibited relatively large biases during the summer stratiﬁcation period. Finally, we developed a statistical integration of the hydrodynamic modeling and deep learning results based on the Best Linear Unbiased Estimator (BLUE). The integration further enhanced prediction accuracy, suggesting its potential for next-generation Great Lakes forecast systems.


Introduction
The Laurentian Great Lakes, including Lake Superior, Lake Michigan, Lake Huron, Lake Ontario, and Lake Erie, are the world's largest freshwater systems. The five lakes However, the ML application to the Great Lakes is still in its early stages. Until recently, deep learning has begun to be used for the Great Lakes hydrodynamic modeling, in particular, in wave simulation in Lake Erie and Lake Michigan [36,37]. Hu et al. [37] tested two ML approaches for predicting wave height and period at two selected buoy locations in Lake Erie, based on the local wind information. The results show that both ML techniques improved the prediction of wave height spikes under strong wind conditions compared to a physics-based wave model (WaveWatch III). Nonetheless, their predictions were limited to two observational locations and does not provide wave conditions for the entire Lake Erie. On the other hand, Feng et al. [36] applied a multi-layer perceptron (MLP) network for wave prediction in Lake Michigan. The MLP was trained using the data generated by a physics-based SWAN (Simulating WAves Nearshore) model for [2005][2006][2007][2008][2009][2010][2011][2012][2013][2014] and tested for wave prediction in 2015 for the entire Lake Michigan. Hence, the MLP performance depended on the SWAN model accuracy. Their research focused on using deep learning as an emulator of the SWAN to improve computational efficiency.
Applying deep learning to the LST prediction makes the situation more challenging. Unlike waves that are primarily controlled by wind features, the LST is influenced by various meteorological factors such as surface air temperature, wind, surface heat fluxes, as well as lake thermodynamics and hydrodynamics. This paper explores the feasibility of using deep learning techniques to predict LSTs in all five Great Lakes under a scenario when only limited observational data are available. Furthermore, we apply an explainable artificial intelligence technique to examine the importance of features to the trained deep learning model. Finally, we seek an approach to integrate data-driven deep learning and physics-based hydrodynamic modeling that can further enhance forecast accuracy.

Neural Network
According to the Universal Approximation Theorem, artificial neural networks (ANNs) can approximate any nonlinear function with sufficient data for training [38]. At the very core of deep learning, ANNs are versatile and scalable, making them ideal for tackling large and highly complex nonlinear problems. ANN and other deep learning methods have dramatically improved performances in processing time series, images, text, speech, etc. [39]. A typical ANN framework comprises an input layer, one or more hidden layers, and an output layer. Every layer except the output layer includes a bias neuron and is fully connected to the next layer. For ANNs to work properly, they often need to be well trained using a labeled dataset, including inputs (i.e., features) and desired outputs (i.e., targets). The training of an ANN framework determines the neural network hyperparameters, including the connection weights and biases in each layer. This learning process is carried out iteratively until a desired level of accuracy is achieved between the predicted outputs and the expected results, measured by a loss function.

LSTM
Depending on the type of inputs, different algorithms in neural networks could be used. Long Short-Term Memory Neural Networks [LSTM] [40,41] are known to produce desirable results while dealing with time-series data. LSTM is a Recurrent Neural Network that analyzes time series data and receives inputs as well as its own outputs from the previous time step to make predictions at the current time step. Although the basic concept of mapping the inputs and outputs remains the same, the training algorithm in LSTM differs from a regular multi-layer ANN model. The structure of an LSTM cell is demonstrated in Figure 1.
The current input vector X t and the previous short-term state h (t−1) are fed to four different, fully connected layers, which serve different purposes [42]. The main layer is the one that outputs g t , which has the usual role of analyzing the current inputs X t and the previous state h (t−1) . The other three layers are gate controllers, in which the logistic activation function is used. The forget gate Γ f controls which parts of the long-term state should be erased. The input gate Γ i controls which parts of g t should be added to the long-term state. Finally, the output gate Γ o controls which parts of the long-term state should be read and output at this time step to h t . Equations (1)- (6) show the operations through which every time step of the data goes through to produce an output in an LSTM cell. (1) where W c , W f , W i , W o are the weight matrices of each of the four layers for their connection to the input vector X t . U c , U f , U i , U o are the weight matrices of each of the four layers for their connection to the previous short-term state h (t−1) . b c , b f , b i , b o are the bias terms for each of the four layers. The weights and biases are determined through model training and are responsible for learning the mapping function between inputs and outputs. The outputs from the LSTM cell are usually not in the desirable shape; thus, a batch normalization layer or dense layer is required to produce the desired outputs.

Architecture
The architecture of the LSTM model for the Great Lakes is demonstrated in Figure 2, which consists of three LSTM layers, two batch normalization layers, and a dense layer. The model takes five days of historical feature data and predicts the target on the fifth day. The hyperparameters of the model are determined through trial and error to reach the best model performance. The test values and the optimal values of the hyperparameters are given in Table 1. The first LSTM layer has 32 neurons, which takes the five days of historical feature data as inputs. Its outputs are fed into a batch normalization layer to normalize the inputs for the second LSTM layer, which includes 16 neurons. The third LSTM layer has eight neurons and takes inputs from the second batch normalization layer. Its outputs are fed to a dense layer, which finally predicts the results. During model training, additional dropout layers with a dropout rate of 0.2 are added to avoid overfitting. The LSTM model architecture, which includes three LSTM layers, two batch normalization layers, and a dense layer. n a is the number of neurons in each LSTM layer. h t and c t are hidden activation states and cell memory states. X t is the input time series, andŶ t is the predicted output.

Data Processing
The LSTM input features consist of seven meteorological variables that are critical to the LST variation, including downward shortwave, downward longwave, latent heat, sensible heat, surface air temperature, zonal wind speed, meridional wind speed, with water depth as an optional eighth input feature. The target variable for the LSTM prediction is the LST. In this study, we evaluated the feasibility of using observations from the proposed hypothetical monitoring network(s) in each lake to train the LSTMs. For example, we test a hypothetical monitoring network in Lake Erie consisting of 14 observing sites that were designed in the International Field Years of Lake Erie (IFYLE) program in 2005 ( Figure 3) to train the LSTM model for Lake Erie. It is referred to as a "hypothetical" monitoring network because there are no long-term (decade-long) observations in these proposed monitoring locations, as it was a seasonal monitoring effort in the summer of 2005. Therefore, we extracted the required variables at the proposed 14 sites from the reanalysis datasets for the LSTM training. After that, we evaluated the LSTM performance in predicting the LST over the entire lake with gridded over-lake meteorological inputs. The LSTM performance, in turn, can be used to evaluate the effectiveness of the proposed monitoring network in assisting the ML-based forecast. Such a method has been widely used in the Observing System Simulation Experiments (OSSEs) [43][44][45][46]. The same procedure and analysis are conducted for each lake. The 7 meteorological features at the 14 hypothetical monitoring sites were extracted from the Climate Forecast System Reanalysis (CFSR) dataset for 1995-2010 and from the CFSR's upgraded extension (i.e., Climate Forecast System Version 2 (CFV2)) available from 2011 to present at the National Center for Environmental Prediction (NCEP) [47]. For simplicity, we refer to both datasets as CFSR. CFSR is global reanalysis data with the assimilation of satellite radiances and all available conventional and satellite observations, and has been used to drive hydrodynamic models for the Great Lakes in various studies [13,17,48,49].
The target variable (daily LSTs) at the 14 hypothetical monitoring sites was extracted from the Great Lakes Surface Environmental Analysis (GLSEA; https://coastwatch.glerl. noaa.gov/glsea/glsea.html, accessed on 26 May 2022) developed by the NOAA Great Lakes Environmental Research Lab (GLERL). The data are derived from the Advanced Very High-Resolution Radiometer (AVHRR) satellite imagery and updated daily with information from the cloud-free portions of the previous day's satellite imagery. GLSEA products applied a smoothing algorithm to generate a continuous evolution of the LST as described in Schwab et al. [50]. The GLSEA-LST is currently the best available resource to examine the spatial and temporal variability of LST for the entire Great Lakes [2,3,51].
The training dataset includes 17 years (1995-2011) of 6205 daily data (29 February in leap years were ignored) at 14 locations. At each location, the dataset was grouped into two matrices, including an input matrix with the shape of (6205 days, 7 features) and a labeled output matrix of (6205 days, 1 target). In the designed LSTM, we used 5 days of historical feature data to predict the LST. Hence, the inputs were further temporalized into 6205 daily instances, each of which includes the historical 5-day feature data. For instance, to predict the LST on the 5th day, the selected features on the 1st, 2nd, 3rd, 4th, and 5th days were used as inputs. Thus, the resultant input matrix had a shape of (6205, 5, 7) for each point. Finally, the temporalized data for all 14 sites were stacked to create the input dataset for training, which had a shape of (86,870 (i.e., 6205 × 14), 5, 7). Correspondingly, the output dataset for training had a shape of (86,870 (i.e., 6205 × 14), 1).
An important transformation in the data processing for machine learning models is feature scaling. This is critical when input attributes have very different scales, as in this study. A standardization transformation was performed to make all the input features and the target on the same scale. The Standard Scaler function in the scikit-learn package was employed for this task. The mean and variance of the training dataset were stored and used to standardize the testing dataset.

LSTM Training and Validation
For each of the Great Lakes, an LSTM model was developed following the same procedure. The LSTM was implemented using the Keras API in the TensorFlow open-source platform. The training and validation of the LSTM were performed in Google Colab using GPU acceleration. The LSTM for the input datasets with and without the water depth feature were trained separately. The training dataset (1995-2011), after standardization and temporalization, was randomly shuffled into a training dataset and a validation dataset with a ratio of 95:5. The hyperparameters of the models, such as the number of activations, number of LSTM layers, and number of epochs, were determined through trial and error to produce the best model performance. The optimal values of these hyperparameters are given in Table 1. Two optimizers (i.e., Adam and Stochastic Gradient Descent (SGD)) were tested. The Adam optimizer converged faster and produced slightly better model performance. Therefore, the Adam optimizer and the mean-square-error (MSE) loss function were employed to train all the LSTM models. To avoid model overfitting or underfitting, the number of LSTM layers and the number of nodes in each layer must be properly chosen. By using a grid search technique, it was determined that three LSTM layers with activation units of (32,16,8) produced better performance. In addition, a dropout method was applied to avoid model overfitting. Two dropout layers with a dropout rate of 0.2 were added after the first and second LSTM layers. Other parameters, such as learning rate, epochs, batch size, and validation split, were determined by optimizing the model performance.
LSTM validation was carried out by comparing the LSTM performances on the training and validation datasets to determine whether the model is overfitting, underfitting, or well trained. If a model has good performance on the training data but poor generalization to the validation data, it is considered overfitting. If a model has poor performance on both the training and validation data, it is deemed to be underfitting. On the other hand, a well-trained model should perform well on both the training and validation datasets. During model validation, the model parameters determined from training, including all the hyperparameters, weights, and bias terms, remain unchanged. The validation dataset that was not used for training was fed into the model to produce the predictions. The model performance was evaluated by computing the MSE error between the model predictions and the labeled data in the validation dataset. Figure 4 shows the model performances on the training and validation datasets in Lake Erie. The training and validation errors reached a sufficiently low level with increased epochs. The loss functions decreased rapidly with increasing epochs (which defines the number of times that the learning algorithm will work through the entire training dataset) to a low level as the weights and bias terms in the model were optimized through model training. The model performance on both the training and validation data reached its optimized state with greater than 100 epochs, indicating that the model was well trained and did not suffer from overfitting or underfitting. Similar performances were observed in the training and validation for other lakes.

LSTM Prediction
The above validation confirms that the trained LSTM can predict the LST at the 14 selected hypothetical monitoring locations. However, the goal of this study is to evaluate if the trained LSTM is capable of predicting the LST spatiotemporal patterns over the entire lake. We generated mesh grids ( Figure 5) that cover the entirety of each lake to employ the trained LSTM to predict the LST at each mesh grid based on its meteorological features extracted from the CFSR dataset. The same grid mesh is used for a hydrodynamic model described in the next section. Therefore, the LSTs predicted from the LSTM and the hydrodynamic model are consistent for direct comparison.

Hydrodynamic Modeling
In parallel to deep learning, we applied a physics-based hydrodynamic model to simulate the surface temperatures in the Great Lakes to identify the strengths and weaknesses of ML and mechanistic modeling. The hydrodynamic model used in this study is based on the Finite Volume Community Ocean Model (FVCOM) [52]. FVCOM is a free-surface, primitive equation hydrodynamic model that solves the momentum, continuity, temperature, salinity, and density equations and is closed physically and mathematically using turbulence closure submodels. FVCOM is solved numerically using the finite-volume method in the integral form of the primitive equations over an unstructured triangular grid mesh and vertical sigma layers. The model grid resolution varies from 1-2 km near the coast to 2-4 km offshore ( Figure 5), with a total of 35,000 model grids and 40 vertical sigma layers. The Mellor-Yamada level-2.5 (MY25) turbulence closure model [53] was used for simulating vertical mixing processes, and the horizontal diffusivity was calculated based on horizontal velocity shear and grid resolution using the Smagorinsky numerical formulation [54]. The FVCOM model was run in the two-way coupled atmosphere-lake system [3] and driven by hourly surface meteorological forcing, including incoming shortwave and longwave radiations, air temperature, atmospheric pressure, wind fields, total cloud cover, and relative humidity.

LST Spatiotemporal Pattern from the LSTM Prediction
As presented in Section 2, we established the LSTM to predict the LST for each of the Great Lakes, including Lake Superior, Lake Michigan, Lake Huron, Lake Ontario, and Lake Erie. As Lake Erie has several concerning environmental issues associated with lake temperature (e.g., harmful algal blooms, large-scale hypoxic conditions), we use Lake Erie as an example to present detailed model results and analyses. The LSTM performances in the other four lakes are consistent and presented in Section 4.5.
We highlight the LSTM predictions of the two most important LST characteristics of the Great Lakes ecosystem: the temporal evolution of lake-wide average LST and the seasonal spatial patterns of LST. Figure 6 presents the daily lake-wide average surface temperature for 1995-2011. The LSTM performed very well in predicting the temporal variations of the surface temperature in all years in comparison to the GLSEA data. Both the seasonal and interannual variabilities are well captured by the LSTM. The model accurately reproduced seasonal variability for the warmer years (e.g., 2005, 2010) and the cold years (e.g., 2000,2004). The correlations between LSTM predictions and GLSEA data are very high at the daily scale (>0.99), low in annual mean bias (±0.3°C), and the root-mean-square-errors (RMSE) (0.5-1.0°C) across all years.  Figure 7 presents the spatial climatology of LSTs averaged over the 17-year prediction by the LSTM in comparison to the GLSEA reanalysis data. The LST is generally homogeneous in winter. In spring, the lake temperature warms progressively from the shallow western basin and southern coastal water to the deep water in the central and eastern basins when the stratification is developing with cross-slope LST gradients. The warmest surface water exists in the whole lake in summer, with relatively warmer waters in the shallow western basin and cooler water along the north coast. In fall, the LST decreases first in the western basin due to its shallowness, while the waters in the deep central and eastern basins are still relatively warm. The LSTM predicted the spatial patterns of the water temperature reasonably well, particularly during summer and winter, with biases less than 0.5°C in most regions. However, relatively large warm (1-2°C) biases appeared in the central and northern basins in spring and cold biases (1-2°C) in these regions in fall.

LST Spatiotemporal Pattern from the FVCOM Prediction
The above results confirmed that the LSTM can be an efficient and effective tool for LST prediction, yet, physics-based hydrodynamic modeling is still considered the most fundamental approach. The hydrodynamic model resolves the physical processes described by primitive equations that satisfy the conservation of mass, momentum, and energy. Therefore, the hydrodynamic model simulates the time evolution of entire 3-D hydrodynamic processes (not only surface temperature), including water currents, surface elevation, temperature, eddy viscosity, and diffusivity. More importantly, it provides us with a mechanistic understanding of the system.
The results from the hydrodynamic model FVCOM are presented in Figure 8. Overall, the FVCOM also simulated the seasonal and interannual variability of the LST well. However, the model over-predicted the summer surface temperature in most of the years, with some most noticeable summer warm biases in 1998, 2000, 2004, 2006, and 2009. This is also shown in the model simulated seasonal spatial pattern and error distribution of the LST (Figure 9). The FVCOM simulation showed a close agreement with GLSEA data in the winter, spring, and fall seasons in terms of detailed spatial patterns and temperature gradients in the lake. However, in the summer, a noticeable warm bias (1-2°C) exists in the region below the latitude of 42.2 degrees as well as in the northern end of the lake. The model predicted a stronger meridional temperature gradient than was observed in the GLSEA. Note that the exception for the comparison is the surface temperature difference near the northern coast. The narrow strip of cold water is caused by local upwelling, which is realistically resolved by the hydrodynamic model and is often missed in the GLSEA dataset due to its lower accuracy near the coast.

Integration of Hydrodynamic Model and LSTM
The above analyses show that both LSTM and FVCOM can predict the LST reasonably well, yet with different bias characteristics. Therefore, we explore the approach to integrating both predictions to optimize the estimation of lake status. In this study, we test a variational method to incorporate the results from the LSTM and FVCOM to further improve prediction accuracy.
Denote the LST as a state vector X. X P and X L are results from FVCOM and LSTM with respective error covariance matrices P and R. The Best Linear Unbiased Estimator (BLUE) theory states that the best linear estimate (X a ) is the solution to the following variational optimization problem: where J is called the cost function of the analysis (or misfit, or penalty function) for the hydrodynamic model penalty term J P (X) and LSTM penalty term J L (X). The solution to Equation (8) is where K equals K = P(P + R) −1 (11) Applying Equations (10) and (11) to each model grid point for localization, the best linear estimate (X a (i)) at a given grid point i is: where σ P (i) and σ L (i) are the root mean squared error (RMSE) of the FVCOM and LSTM predictions relative to the GLSEA at grid point i. The σ P (i) 2 and σ L (i) 2 were calculated for each season, respectively, based on the results for 1995-2011. For the upwelling region along the northern coast, we assigned normalized empirical weights of 0.8 and 0.2 to the FVCOM and LSTM predictions since only the hydrodynamic model simulation resolved the local upwellings. The improvement of LST estimation through the LSTM-FVCOM integration is significant and evident in Figures 10 and 11. Temporally, integrating the LSTM and FVCOM model results improved the LST predictions during all seasons. As summarized in the seasonal climatology (Figure 10), the warm biases (RMSE of 0.72°C) in lake-wide average surface temperature during summer in the FVCOM results are largely removed, and so are the cold bias with (RMSE of 0.65°C) during the winter. Similarly, the cold bias (RMSE of 0.8°C) in the LSTM prediction during the fall is also effectively corrected. The integration produced a more accurate estimate of LST than the individual prediction from either LSTM or FVCOM. Spatially, the LSTM and FVCOM integration efficiently reduced spatial errors in all seasons (Figure 11). The biases are minimized to (±0.5°C) for nearly the entire lake, except for a small area at the northern end of the lake in spring. The integration significantly reduced the relatively large errors in the central and northern basins from the LSTM prediction during spring and fall (Figure 7 vs. Figure 11) as well as the summer warm bias in the majority of the lake in the hydrodynamic modeling (Figure 9 vs. Figure 11). More importantly, the spatial temperature gradients and variability across the slope from shallow water to deep water and between the basins are well captured, reinforcing the enhanced prediction accuracy through the integration. The integrated results also retain upwelling regions as expected.

Prediction beyond the Training Period
The results in Section 4.1 show that the LSTM trained with limited data from the hypothetical monitoring network has the ability to predict the spatial and temporal patterns in the entire lake over the training period. This section examines whether the LSTM can be used to make predictions beyond the training period. We extended the LSTM prediction for a 5-year period (2012-2016) beyond the training period (1995-2011) with exact parameters, including all the hyperparameters, weights, and bias terms unchanged. The satisfactory performance of the LSTM is well retained for 2012-2016. The LSTM accurately captured the seasonal variation of LST as well as its interannual variability, indicated by the envelopes ( Figure 12). The results during the 5-year period appeared to be even slightly better than in the training period due to the upgraded, higher-resolution CFSR meteorological dataset after 2011 (i.e., CFV2, see Section 2.3). Improvement in hydrodynamic simulations due to the upgrade of CFSR since 2011 has also been documented in other studies [49].

Evaluation of the LSTM Performance for Other Lakes
The LSTMs for each lake were trained using data from 14 points following the same approach we used for Lake Superior, as described in Sections 2.3 and 2.4. The results are consistent with what we presented for Lake Erie, as summarized in Figure 13. The predictions for all lakes retain high correlations (>0.99) with the GLSEA reanalysis, low mean biases (0.02-0.77°C), and low RMSE values (0.3-0.9°C). The consistent LSTM performance across the five lakes suggests that the findings from Lake Erie's analysis are applicable to the other four lakes, such as the integration of deep learning and hydrodynamic modeling. It should be noted that the selection of the 14 hypothetical monitoring sites in the four lakes was fairly arbitrary (there was no trial-and-error or iterative tuning for selecting sites). The site selection was completed in a one-time attempt with a general principle that the sites chosen should have good coverage over the entire lake and represent both deep basins and coastal waters. The satisfactory results under such conditions reinforce the robustness of the LSTM performances. Lastly, we acknowledge that the selection of the monitoring sites may be further optimized to achieve better results to some extent, although this is beyond the scope of this study.

Effect of Water Depth
Water depth (bathymetry) is one of the most critical inputs in a physics-based hydrodynamic model. However, it is unclear whether it is an essential feature for LSTM training. The water depth is spatially varying but remains constant temporally. To examine its impact on LSTM performance, we added water depth as the eighth feature and re-trained the LSTM. Neural networks, in general, can learn to disregard non-essential input features during training. The features affecting more on the target are weighted more than the less relevant features. By comparing the model skills with and without the water depth feature, we found that the water depth feature was deemed redundant in the LSTM in our case. The skill metrics such as correlation, mean bias, and RMSE remain similar in the two LSTMs trained with or without the water depth feature, and adding the water depth feature does not further improve the LSTM prediction. This is likely because water depth is temporally constant, and, therefore, its effects can be directly modeled by the bias terms in the LSTM.

Understanding of Model Performance
While deep learning models are often considered as blackboxes, there is a growing need to understand the relationship between features and predictions. To that end, we employed an explainable artificial intelligence (XAI) technique named SHapley Additive exPlanations (SHAP) to uncover how the features impact the predictions [55]. SHAP can be used for explaining the prediction of an ML model by computing the contribution of each feature to the prediction. The SHAP analysis shows that the air temperature is the most influential feature of the LST prediction, with a significantly larger SHAP value than the other features (Figure 14a). Notice that the results suggest a strong correlation (which can be nonlinear) between the air temperature and LST, rather than indicating the causal relationship between the two, as physical conditions near the lake-atmosphere interface strongly influence both variables. The SHAP analysis also supports previous efforts that aimed to utilize the air temperature to reconstruct the lake temperature by developing simplified empirical functions [56,57]. These studies achieved certain success in reconstructing the lake-wide mean temperature but could not predict spatiotemporal patterns of LST as presented in this work using deep learning. In addition, the impact of the features on the LST is generally consistent with our mechanistic understanding of how LST would be affected by environmental factors, including the positive impact of air temperature, longwave radiation on the LST, and the negative impacts of latent and sensible heat fluxes, and wind speed (Figure 14b). Knowing that the air temperature is the most influential feature, we hypothesize that the relatively large bias in LSTM prediction during the spring and fall is closely associated with spatial anomalies in both air temperature and LST. The most substantial heterogeneity of air temperature occurs during the spring and fall with a spatial variability of ±2°C, while it is fairly homogeneous during the winter and summer with a spatial variability of less than 0.8°C (Figure 15a). The strong correlation of spatial patterns between air temperature and LST anomalies reinforces the SHAP analysis ( Figure 15b).
However, from the hydrodynamic modeling perspective, simulating the summer thermal structure is more challenging, partly due to the strong summer stratification. In the central basin, the hydrodynamic model often underestimates the mixed-layer depth and results in a warmer surface layer [8]. In addition, modeled temperature and circulation patterns are also sensitive to the anticyclonic wind vorticity, which can also lead to thinning of the hypolimnion in the central basin [12]. In the western basin, where the water depth is much shallower and often unstratified, the warm bias is most sensitive to surface heat flux with much less buffering effect due to its shallowness. To test the impact of atmospheric forcing uncertainty on the hydrodynamic modeling, we also conducted the simulation in which the FVCOM model is driven by in situ observation-interpolated forcing [13]. With this forcing, LST biases were reduced during the summer but caused more significant warm biases (not shown) during the fall due to the underestimated surface heat fluxes and wind speed.

Future Model Improvement
In the current study, the LST prediction at each grid cell is independent and determined by input features over the grid cell and the previous LST at the grid. This is because we aimed to evaluate the feasibility of using scattered observational data from the proposed hypothetical monitoring network(s) in each lake to train the LSTMs. The results confirm that the current LSTM framework worked well with comparable accuracy to the hydrodynamic modeling. However, if aiming to use data with a more completed spatial coverage in the entire lakes (e.g., satellite images and reanalysis data), we can alternate the ML framework to consider the spatial changes in the LST and the atmospheric forcing from neighboring grids in the prediction. This can be achieved with a hybrid convolutional neural network (CNN) and LSTM approach, in which the model learns the spatial feature of atmospheric and lake variables by CNN and then utilizes LSTM to examine the temporal relations in lake thermal progression [58]. In addition, a possible approach is to learn the dynamics in the frequency domain using Fourier neural operator (FNO), which uses a Fourier layer that implements a Fourier transform, a linear transform, and an inverse Fourier transform for a convolution-like operation to extract mesh-independent spatiotemporal features [59][60][61].
In this study, we demonstrated a postprocessing framework to statistically integrate the LSTM and FVCOM results. We note that there are various ways in which deep learning approaches can help improve hydrodynamic modeling. For instance, deep learning has been used to approximate subscale processes to improve climate and ocean modeling [30,35]. Wang et al. [62] also applied a deep learning approach to reconstruct Reynolds stresses in RANS simulations of a turbulent flow. In these studies, deep learning approaches were used to predict the unclosed terms in a hydrodynamic model to improve the prediction accuracy and efficiency of the model. Meanwhile, deep learning can assist hydrodynamic modeling by optimizing ensemble averages of physics-based model predictions [31,32] and can be embedded in the hydrodynamic model to parameterize the air-sea feedback [34].
On the other hand, ML models can also be improved by injecting the physical constraints into the design of deep learning models to force the outputs of deep learning to respect physical laws related to mechanistic modeling [29,63]. For example, one can apply the energy balance rule in the loss function of deep learning models when predicting lake temperature. Moreover, simulated data generated from a physical model can be used to pre-train the deep learning model to allow deep learning to converge faster with a smaller number of ground truth samples [64]. Furthermore, deep learning models can also be used to more efficiently calibrate the physical model [65]. However, those approaches are yet to be tested for hydrodynamic modeling and will be part of our future study.

Summary and Conclusions
In this paper, we applied LSTMs to predict the spatiotemporal variation of LST, an essential environmental variable and climate indicator in the Great Lakes. Furthermore, we demonstrated an integrated framework that incorporates LST predictions from physicsbased hydrodynamic modeling and data-driven deep learning to further enhance prediction accuracy. Compared to the previous ML applications to the Great Lakes prediction, our work advances the deep learning application in three aspects: • Previous ML studies of the Great Lakes hydrodynamic forecast focused on wave predictions, which are controlled primarily by wind features. These studies either developed the emulator of the physics-based wave model in a single lake [36] or developed a wind-wave nonlinear relationship at a few specific sites [37]. Using seven meteorological features, this study is the first one to apply deep learning to predict the spatiotemporal patterns of the LSTs across the five Great Lakes. • Our approach is highly efficient in developing systematic predictions across the Great Lakes. In the previous study [36], the training of the surrogate model required a large amount of training data generated from the physics-based wave model. This approach has two drawbacks (1) developing a physics-based model for all the five lakes becomes a prerequisite, and (2) the accuracy of the physics-based model constrains the deep learning performance. This study demonstrated the feasibility of training LSTMs with limited observations to make reliable predictions for the entire lake. The predictions from the LSTM models have consistent and robust performance across the lakes and are able to capture the temporal and spatial variabilities of LSTs for the entirety of the five Great Lakes. This is important as designing the observation network for data collection is primarily limited by the costs of deployment and maintenance. • We further examined the features through an explainable AI technique (i.e., SHAP) to better understand their contributions to the model prediction. The SHAP analysis revealed air temperature is the most influential feature for predicting the LST in the trained LSTMs. The prediction bias is closely associated with substantial spatial heterogeneity of air temperature, particularly during spring and fall.
• Lastly, we demonstrated the integration of data-driven deep learning and mechanistic hydrodynamic modeling in the Great Lakes LST prediction. Our results showed that using the variational method to integrate the FVCOM and LSTM results can further enhance prediction accuracy. While the hydrodynamic model provides us with the mechanistic understanding and description of the Great Lakes system, a welltrained deep learning model could serve as an auxiliary tool to the hydrodynamic model simulation. Therefore, this work offers a new viable avenue for developing the next-generation Great Lakes forecast system.