A Review of Machine Learning Applications in Land Surface Modeling

: Machine learning (ML), as an artiﬁcial intelligence tool, has acquired signiﬁcant progress in data-driven research in Earth sciences. Land Surface Models (LSMs) are important components of the climate models, which help to capture the water, energy, and momentum exchange between the land surface and the atmosphere, providing lower boundary conditions to the atmospheric models. The objectives of this review paper are to highlight the areas of improvement in land modeling using ML and discuss the crucial ML techniques in detail. Literature searches were conducted using the relevant key words to obtain an extensive list of articles. The bibliographic lists of these articles were also considered. To date, ML-based techniques have been able to upgrade the performance of LSMs and reduce uncertainties by improving evapotranspiration and heat ﬂuxes estimation, parameter optimization, better crop yield prediction, and model benchmarking. Widely used ML techniques used for these purposes include Artiﬁcial Neural Networks and Random Forests. We conclude that further improvements in land modeling are possible in terms of high-resolution data preparation, parameter calibration, uncertainty reduction, efﬁcient model performance, and data assimilation using ML. In addition to the traditional techniques, convolutional neural networks, long short-term memory, and other deep learning methods can be implemented.


Introduction
Machine learning (ML) and artificial intelligence (AI) progressively impact society, supported by significant enhancement of big data, computational efficiency, easily available data storage, and uninterrupted connectivity. ML algorithms are being increasingly applied in Earth and Environmental modeling studies as a result of growing resources of extensive data sets, easy computation, and upgraded ML algorithms. Some well-explored areas with ML applications in Earth Science are climate modeling [1,2], hydrologic modeling [3,4], remote sensing [5,6], etc. Land Surface models (LSMs) are the components of climate models, which simulate land surface processes, such as the partitioning and consumption of energy, moisture, momentum, and carbon. The understanding of land surface feedback on the meteorological and climatological features have gained attention in recent years [7][8][9]. The land provides forcing from the surface to the atmosphere via energy balance, surface heating (supplying buoyancy for convection), and surface roughness (generating eddies that modulate atmospheric boundary layer). Land-surface interactions are communicated through vegetation and soil moisture. LSMs are crucial to understand and predict the dynamics of the land surface and its involvement within the Earth system. The complex processes that interconnect different components of the terrestrial system, and the depth of convolution present in all of those processes, make the LSMs less amenable. With the advancement of land surface observation systems, such as the Global Observing System of World Meteorological Organization (WMO) and FLUXNET around the world, Earth 2021, 2 improved estimates of land surface radiation, ecology, hydrology, biology are now possible. However, representing the land-surface interactions properly in the model still remains a challenge. Specifically, in the areas of complex terrain and strong surface-atmosphere coupling. Major components of an LSM include (1) atmospheric forcing, (2) physical, chemical, and biological processes, (3) parameters, and (4) outputs.
The adoption of ML in land surface modeling has so far been gradual, but the related fields are now highly emerging. ML-based methods can create viable, complementary avenues toward knowledge discovery in land modeling. ML is often categorized into traditional methods and deep learning (DL) based methods. While traditional methods include random forests (RF), support vector machines (SVM), classification and regression trees (CART); DL techniques include Deep Learning Artificial Neural Network (DL-ANN), Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN) etc. Overall, ML models have repeatedly outperformed simpler statistical models and provided improved generalized solutions to unforeseen circumstances in many applications of Earth Sciences. The recent growth of big hydrologic data through remote sensing and data compilation has fostered the development.
The difference between the process-based LSMs and ML models is that the model structure for LSMs is based on the underlying physical principles that we believe are valid, whereas ML model structure is completely data-driven. Despite this fundamental difference, they can complement each other. Two strategies popular in literature in this regard are: (1) integrating LSMs and ML, (2) introducing physical constraints on ML models. Strategy (1) Is quite straightforward. One common way to implement it is to ensure the application of pure ML models only when the climatology is similar to the training data; otherwise, in case of rare events, the combined model relies on the LSM. In this way, modelers can avoid the uncertainty coming from ML models in the extreme cases while making the model less computationally expensive in general application. On the other hand, (2) acknowledges that data-driven algorithms often fail to satisfy the wellknown physics law constraints as they are not generally enforced with those. They tend to violate constraints on individual samples while optimizing the overall performance. LSM outputs can be used to pose constraints on the ML models to overcome this. Recently, interrogative studies have also been invoked to interpret ML models to ask questions like-(1) how trustworthy are ML models? or (2) how to interpret the processes that drive ML model systems?
The objectives of this review paper are (1) to feature the areas of improvement in land modeling using ML and (2) discuss the most important ML techniques in detail, and (3) suggest future directions for further improvements. To achieve these goals, we searched literature in google scholar database using a combination of the relevant key words: 'machine learning', 'land surface modeling', 'crop modeling', 'evapotranspiration estimation', 'parameters and uncertainty' etc. Moreover, the bibliographic lists of these articles were also considered to create an extensive list of articles. The most relevant 83 articles are cited and discussed in this paper.
This study demonstrates the potential of ML to propel LSMs and vice versa. Critically, this review paper provides information about (1) the importance of LSMs in different applications (2) the difficulties and limitations of traditional LSMs. (3) application of ML in LSMs in past literature. (4) concise and simple technical overview of ML for LSMs. (5) potential directions where ML can contribute to solving challenges in modeling land surface processes. The results from this study have implications for creating inexpensive, improved and tractable land surface models with less and quantifiable uncertainty. This in turn will help to construct upgraded coupled climate models to provide more realistic and refined future projections.

LSMs: Importance, Then and Now
As mentioned in the introduction, LSMs are numerical models that simulate land surface processes, such as absorption and partitioning of radiation, water, and carbon between the land surface and atmosphere. Provided with meteorological forcing as inputs (from an atmospheric model either in 'coupled' mode or an 'uncoupled' mode), they estimate latent heat fluxes (LH), sensible heat fluxes (SH), carbon fluxes, surface runoff, deep drainage, reflected solar and emitted longwave radiation as output [10,11]. While LH and SH control the boundary layer properties and precipitation; net carbon flux influences the atmospheric CO 2 content. These estimates play a critical role in determining the effects of human-modified land surface and human emissions on changes in the climate. LSMs are perhaps the most efficient tools to predict how the continuously evolving earth surface will modify the hydroclimate in coming years and centuries. The extents of modeling activities with LSMs include multiple interlinked disciplines (such as atmospheric modeling, crop modeling, and hydrologic modeling) relevant to this overarching problem.
LSMs were originally developed by the atmospheric modeling community who needed physical boundary conditions consisting of energy and moisture partitioning, albedo, and surface roughness to indicate the impact of the surface on the atmospheric processes. Richardson [12], in 1922, first mentioned the importance of stomatal conductance on weather processes. Early studies, such as Charney et al. [13] used albedo as a proxy for vegetation and started investigating the effects of deforestation in terms of it. Starting from the 1980s, scientists started understanding the land surface-atmosphere interactions [14,15]. Garatt et al. [16] summarized the importance of land surface in climate modeling in a review paper. He discussed different boundary layer schemes and the results from global climate model (GCM) sensitivity studies using these schemes. He concluded that the regional and global climate is significantly influenced by albedo, surface moisture and roughness, and the inclusion of vegetation. However, till then, it was not clear how much spatial detail of the surface is sufficient to accurately represent the lower boundary conditions. For that decade, improvements of LSMs were driven by the need to understand the effects of deforestation in various parts of the world. In the 2000s, scientists started to visualize the importance of land in the context of sub-seasonal to seasonal forecasting. The land surface was identified as a slowly varying component of the earth system, which has a major role in modulating the atmospheric response at a longer timescale than weather prediction. Koster et al. papers [7,17], in connection with the Global Land-Atmosphere Coupling Experiment (GLACE), identified soil moisture as an important factor altering evaporation and precipitation. They also highlighted the regions where strong coupling between soil moisture and precipitation exists. For the first time, they introduced the concept of 'coupling strength' to quantify such coupling, which is still being widely used in land-atmosphere interaction studies. However, while modeling these interactions, there exists a huge variation among the global models, attributable to the uncertainties in terrestrial and atmospheric branches, and the models fail to represent the land-surface coupling accurately [18]. Specifically, they found systematic biases in near-surface temperature, humidity, and precipitation, which contribute to the uncertainty. Seneviratne et al. [19] summarized the findings related to soil moisture-precipitation relations in a review paper and concluded that the relationship between soil moisture and precipitation is evident in observations and models. However, significant uncertainty remains in quantifying those in terms of the strength of coupling, and persistence characteristics. These studies indicate the need for further improvement in land surface models. The need for LSMs to quantify such biogeophysical and biogeochemical feedbacks to the climate system has formed the basis of their recent development efforts.
At present, LSMs have expanded from their initial simple biophysical configurations [20] to include representations of stomatal functioning [21], scaling information from leaf to canopy [22], soil moisture dynamic and surface hydrology [23,24], crop processes [25,26], land surface heterogeneity [27], dynamic vegetation [28,29], urban environment [30], land cover management [31,32], plant demographic processes and plant hydraulics [33], groundwater dynamics [34], soil microbial dynamics [35], leaf mesophyll process, nitrogen, phosphorus, carbon cycling and their mutual interactions [36]. The incorporation of processes in LSMs is driven by the need for extensive user communities, including ecologists, crop modelers, atmospheric scientists, biogeochemists, hydrologists, who explore interactions between different components of the system. Some widely used LSMs across the globe include Interaction Soil-Biosphere-Atmosphere (ISBA, [36]), The Community Land Model (CLM, [31]), JSBACH [37], Joint UK Land Environment Simulator (JULES, [38]), LPJ-GUESS [39], Noah-MP [40]. Along with the increasing capability of representing processes, LSMs are enhancing their spatial resolution as well, with the improvement in resolution of the atmospheric models. As the scope of LSMs broadens with the support of computational advancements, the questions of cognitive uncertainty and unresolved heterogeneity emerges as a challenge.

Complexity and Limitations of LSMs: Prospect of ML
The diversity of the interconnected processes in the terrestrial system, and the levels of entanglement present in these processes, pose a hurdle to build tractable land models. The propensity of scientists to focus on their own specific area of interest and the reality that the earth system is indeed complex are both responsible for this complex nature. Often, this reaches a point where no individuals are able to completely understand all aspects of any particular model, and the development teams strive to meet all the requirements placed on modern LSMs [11]. Even though, large uncertainty remains in our understanding and modeling of the interactions between atmospheric and terrestrial branches of the hydrologic cycle due to the non-trivial mechanisms at the land surface. Figure 1 illustrates the convoluted and connected processes in a typical LSM. The major parts, such as, atmosphere, hydrology, urban processes, agriculture; and plant physiology, soil biogeochemistry, soil physics related to each of those, are interlinked in an LSM. These major components are further segregated into smaller yet complicated processes. For example, agriculture includes fertilizer and pesticides usage, biomass burning, harvesting, irrigation, tillage, residual treatment etc. (Figure 1). The interactions are defined by the exchange of information between different parts of the model. However, some of the processes are still oversimplified in modeling. As such, most of the LSMs classify plant species into plant functional types (PFTs), within which the parameters are undifferentiated. Simulations consisting of a limited number of PFTs ignore biodiversity within a simulation grid. This may lead to uncertainty in the strength of climate responses when coupled to a climate model. Furthermore, understanding the combined effects of major greenhouse gases, such as Carbon dioxide, Methane, and Nitrogen dioxide on global warming are still at early stages due to constraints in the measurements of multiple gases. Limited models have the capacity to simulate such effects, which requires realistic carbon and nitrogen cycling processes.
LSMs are often applied at large spatial scales aimed to simulate the interactions between climate and land surface. Nonetheless, validation data for these models are obtained from flux tower sites. This geographical gap usually limits the accuracy of the models. Microbes may play fundamental roles in altering biogeochemical cycling as 'ecosystem engineers' [41]. However, very few LSMs include the effects of such organisms in an explicit manner. This limits our ability to estimate the climatic impacts of changes in soil biological community composition and diversity. In addition, the unavailability of high-resolution land-surface data affects the LSMs in capturing the effects of spatial heterogeneity. The development of high-accuracy fine resolution data is important for the interpretation of observations and model simulations.
Some of the limitations of LSMs can be highly benefited from the enormous information currently available from satellite data. However, extracting useful knowledge from terabytes of data provided by observation and LMS simulations is challenging. In contrast, ML models are simple in nature in terms of structure and easy to simulate the output once trained properly. Figure 2 illustrates the general structure of a multilayer perceptron model, which is a commonly used feedforward ANN type ML model and uses supervised learning techniques for training. Compared to LSMs (Figure 1), the structure is much simpler and furthermore, the model parameters are data driven. ML techniques can help and complement LSMs in several ways, including surrogate modeling, physics-guided machine learning, parameter estimation, and data assimilation to reduce the uncertainties and generate useful knowledge from large amounts of observational data. Some of the ML applications in land modeling are described in the next section. LSMs are often applied at large spatial scales aimed to simulate the interactions between climate and land surface. Nonetheless, validation data for these models are obtained from flux tower sites. This geographical gap usually limits the accuracy of the models. Microbes may play fundamental roles in altering biogeochemical cycling as 'ecosystem engineers' [41]. However, very few LSMs include the effects of such organisms in an explicit manner. This limits our ability to estimate the climatic impacts of changes in soil biological community composition and diversity. In addition, the unavailability of highresolution land-surface data affects the LSMs in capturing the effects of spatial heterogeneity. The development of high-accuracy fine resolution data is important for the interpretation of observations and model simulations.
Some of the limitations of LSMs can be highly benefited from the enormous information currently available from satellite data. However, extracting useful knowledge from terabytes of data provided by observation and LMS simulations is challenging. In contrast, ML models are simple in nature in terms of structure and easy to simulate the output once trained properly. Figure 2 illustrates the general structure of a multilayer perceptron model, which is a commonly used feedforward ANN type ML model and uses supervised learning techniques for training. Compared to LSMs (Figure 1), the structure is much simpler and furthermore, the model parameters are data driven. ML techniques can help and complement LSMs in several ways, including surrogate modeling, physicsguided machine learning, parameter estimation, and data assimilation to reduce the uncertainties and generate useful knowledge from large amounts of observational data. Some of the ML applications in land modeling are described in the next section.

Estimation of Evapotranspiration
Evapotranspiration (ET) is perhaps the most discussed variable in the land surface modeling since it is a major part of both water and energy (in the form of LH) balance. It is also a critical factor in the carbon cycle, acting as a trade-off between photosynthesis and transpiration. Since direct measurement of global terrestrial ET is implausible, one of the central uses of LSMs is providing ET estimates using other comprehensively measured hydrometeorological variables. Some noteworthy applications regarding ET estimation are listed below.
Alemohammad et al. [42] developed an ANN approach to estimate monthly LH, SH, and gross primary productivity (GPP) at a global scale for 8 years at 1 • × 1 • resolution using remotely sensed solar-induced fluorescence (SIF) measurements in addition to conventional precipitation, temperature, soil moisture, snow cover and net radiation (R n ) as inputs. When compared to eddy covariance measurements from FLUXNET, their method 'WECANN' outperformed traditional surface fluxes products provided by Global Land Evaporation Amsterdam Model (GLEAM), Moderate Resolution Imaging Spectroradiometer (MODIS) and European Centre for Medium-Range Weather Forecasts (ECMWF). The ML retrievals also demonstrated capability to represent the extents of several extreme drought and heat events. They were also able to analyze the effects of extreme climatic events on surface turbulent fluxes and GPP. Prior to this study, neural networks were primarily applied to satellite observations to estimate LH [43]. Some studies also applied generalized neural networks and artificial intelligence models at a local-regional scale [44][45][46][47]. However, these techniques, when applied to the global scale, often failed to predict extremes and perform outside their calibration range [48].
A major limitation of traditional ML methods in Earth Science, as pointed out by Zhao et al. [49], was the issue of not conserving the surface energy budget, leading to unrealistic predictions of various surface fluxes causing problems in implementation with a coupled atmospheric model. Physics-based models, although complicated, tend to have superior interpretability [50]. To overcome this, [49] developed the first 'physicsconstrained' ML model for ET estimation, which applied ML techniques while keeping the energy conservation equations valid, to provide a global ET estimate. Their hybrid model was able to learn the nonlinear relations from the data while obeying the physical laws. The study was significantly robust as they used 82 eddy covariance sites from FLUXNET with nine different PFTs and implemented a feedforward ANN escorted by that. This kind of modeling can enhance the capability to simulate ET during extreme climatic conditions like heatwaves and droughts.
Another study, Pan et al. [51], recently attempted to estimate global terrestrial ET (GTET) using two machine learning algorithms, RF, and Model Tree Ensemble and predicted its annual mean, interannual variability, and trends. They concluded that the utilization of satellite retrievals and deep-learning methods, and model-data fusion advances the predictive understanding of GTET. Critically, the ML methodology provided similar results to remote sensing-based products indicating a significant increasing trend in GTET.

Parameter Estimation and Uncertainty Assessment
LSMs are not properly equipped to accurately simulate the real world and lead to significant uncertainty when trying to capture the land-atmosphere interactions. One major reason behind this is the use of parameterizations for complex processes such as ET estimation, which includes an abundance of soil and vegetation parameters. For example, current global LSMs have over 20 parameters that are linked to just ET estimation [52]. Other sources of uncertainty in the LSMs include internal variability and forcing uncertainty.
Chaney et al. [52] provided global parameter estimates at 5 km spatial resolution for the Noah LSM using 85 eddy covariance sites in the global FLUXNET network and linked the most sensitive parameters to local environmental characteristics using an ML algorithm-extremely randomized trees (Extra-trees). They concluded that FLUXNET could be potentially used to tune the parameters of LSM. The calibration led to a significant enhancement in model performance in terms of Kling-Gupta Efficiency increasing from 0.54 to 0.71. Furthermore, their leave-one-out cross validation method revealed the potential to relate calibrated model parameters to local environmental conditions. Uncertainty quantification of LSMs is closely related to the parameter optimization problem. As such, it is far-fetched to obtain the accurate set of sensitive parameters for all observable variables, and different combinations of parameters may show similar skill to reproduce observations, also termed as 'equifinality' in hydrology [53]. An alternative method is to construct the probability density function of parameters, explaining the uncertainty in parameter estimation. Swada [54] applied ML techniques for parameter optimization and uncertainty assessment with global satellite observations. The technique was to create an ML model, which emulates the relationship between model parameters and LSM outputs and is computationally cheaper than LSM. They applied this technique, involving ML and Markov Chain Monte Carlo (MCMC), to EcoHydro-SiB (a modified version of [20]) model over a part of the Sahel region in Africa, and was successful in obtaining the nonparametric posterior distribution of four unknown parameters and enhanced the soil moisture and vegetation dynamics simulation skill of the LSM. The addition of ML techniques made optimization 50,000 times faster. They highlighted the importance of selecting suitable prior, consideration of error in meteorological forcing, and inclusion of satellite observations in the modeling chain.
In a recent study, Dagon et al. [55] investigated the biogeophysical parameter space of CLM5 and determined the sensitivity of parameters to get insight into the role of parameter choices on the overall model uncertainty. They implemented an ML approach to globally calibrate six parameters of CLM5, selected by a sensitivity analysis, to the observations of water and carbon fluxes. Specifically, they trained their feedforward ANNs to emulate CLM5 outputs given parameters as predictors. The trained ML models were able to estimate global optimal parameter values and quantify the contribution of parameter uncertainty to the overall uncertainty in LSM simulated GPP and LH. Moreover, they were able to inspect several interpretation methods to better understand the inner workings of the ANN as emulators of CLM.

Crop Yield Prediction
Crop yield is often predicted leveraging the physics based LSMs. Early and reliable crop yield forecasts are critical for farmers and decision-makers in food security policymaking, planning, and trade. Agriculture is greatly affected by weather and climate, and hence, large-scale crop growth simulations under a changing climate in a regional and global scale remains a priority. At the same time, the climate is affected by the extension of agricultural practices due to land-atmosphere interactions. The majority of the studies implemented a 'hybrid' approach, combining process-based modeling and ML, to provide a better yield estimate than those could provide individually.
Everingham et al. [56] provided a framework to combine RF (their ML technique) and APSIM (a process-based crop model) to predict the annual variation of sugarcane yield in Australia. Biomass indices, an output from the crop model, were identified to be critical as a predictor for the ML model for the next month's yield prediction. They also considered observed rainfall, temperature, radiation, and seasonal climate indices as input features for ML algorithms. This is an example of ML aiding the investigation of the relative importance of several predictors. Feng et al. [57] implemented a similar approach, with addition of climate extreme indices, to refine the prediction of wheat yield in Australia. They compared the performance of a process-based model ASPIM with several combinations of the process-based model and ML models, such as ASPIM+ multiple linear regression (MLR) and ASPIM+ MLR+ RF. They reported an additional improvement of 19% in the yield prediction accuracy as compared to ASPIM+MLR and 33% as compared to APSIM alone when the RF was included in the modeling chain.
Schlund et al. [58] developed a two-step approach to constrain the projected GPP at the end of the 21st century in the Representative Concentration Pathway 8.5 scenario with ML. They fed observational data into the ML algorithms that have been trained on Coupled Model Intercomparison Project (CMIP5) data to learn the relationships between present-day carbon estimates and the future scenario GPP (target variables). Their ML model showed superior performance to the CMIP5 ensemble mean. It also predicted an increasing GPP in the high latitudes. Folberth et al. [59] built novel crop meta-models from coarser GCMs to derive estimates of crop yield at a high spatial resolution without requiring the crop model set up at high-resolution. ML methods used in this study were RF and gradient boosting regression tree (GBRT). They were able to go to 0.25 • from 0.5 • and the results showed high accuracy (R 2 > 0.96) in predicting maize yields, ET and crop available water.
In a recent study, Shahhosseini et al. [60] showed that adding physics-based crop model variables as input features to ML models can improve the performance of ML models by 29% on average in the US Corn Belt. They compared the performances of several ML models such as RF, linear regression (LR), least absolute shrinkage and selection operator (LASSO) regression, Light Gradient Boost (LightGBM), Extreme Gradient Boost (XGBoost), and also an ensemble of them to investigate their added value individually and in combination. This study also assessed the relative importance of various variables from the process-based models as input features to different ML models. As such, they included phenology related variables, crop-related variables, and soil-related variables separately to investigate their individual added value in model predictions. The results indicated that the soil-water related variables were most important for crop yield prediction over the US corn belt.

Hybrid Simulation of Land-Surface Variables Other Than ET
Soil moisture, turbulent momentum, and heat flux are some of the critical variables of land-atmosphere interactions apart from ET. Pelissier et al. [61] combined Noah LSM and an ML technique to improve the prediction of top-layer soil moisture at nine AmeriFlux tower sites. They were able to obtain a 3-fold decrease in error metric. In particular, they highlighted the potential of learning structural error of a model using a hybrid model at a point-scale. However, they indicated that building a global scale model will need the inclusion of remote sensing data and account for uncertainty propagation through LSMs. In another study, ML was found better than the Monin-Obukhov similarity theory to predict momentum and heat fluxes [62]. The main objective of this paper was to substitute the Monin-Obukhov similarity theory for calculating fluxes in the LSM named 'Veg3d'. This technique of replacing a parameterization with an ML scheme has been extensively used in climate models [63,64]. [62] also showed that the results of ANN involved LSMs were equivalent, if not superior, to the conventional method. They indicated that implementation of an ML technique might save 5% of a central processing unit (CPU) time of a regional climate model, and this can be further improved by replacing all uncertain components of the climate models with neural network subroutines. This is a further impetus to implement ML methods for efficient prediction.

Benchmarking the LSMs
As the LSMs become increasingly complex and observational data volumes rapidly expand, there is a growing need for frequent and intensive testing and evaluating the models to fully utilize the richness of large Earth System data sets like satellite or FLUXNET measurements. Multi-scale synthesis and Terrestrial Model Intercomparison Project (MsT-MIP) [65,66] and Land Model Testbed (LMT) [67] are two such initiatives that incorporated ML techniques for this purpose. Schwalm et al. [68] quantified divergence as the spread in output from multiple models and an observational constraint. Considering inter-model spread as a measure of the approximations in physical and biogeochemical processes in the models, the ML experiments highlighted the uncertainties in the structures of the carbon pools and advised against the hard parametric limits on ecosystem function. Their results confirmed that model intercomparison projects like MsMTIP ensemble provide a proper framework to link model skill to structure, excluding confounding factors while taking into account inter-model spread. RF helped to find model structural characteristics, which serve as important factors of skill for several MsTIP variables. LMT, with a similar objective, leveraged existing tools of International Land Model Benchmarking (ILAMB) and was able to launch thousands of ensemble simulations simultaneously while perturbing the parameters of LSMs. They developed the ML-based benchmarking workflow to increase the diagnostic capacity of LMT. New relationship metrics were designed using ML methods to benchmark the LMT produced ensemble simulations against in-situ measurements of GPP, ET, and SH. The initiative was an outstanding attempt to accelerate the model development cycle by means of careful assessment of model fidelity and analysis of model outputs.

Techniques of Machine Learning
ML methods are automated or semi-automated techniques of data inference without any prior assumptions (data-driven). ML techniques can be broadly classified into (1) supervised learning and (2) unsupervised learning. Supervised learning is based on a priori specification of output and one or more inputs. In unsupervised learning, there is only input and no outputs, and the model is aimed to discover patterns in the data. Most applications in land surface models are based on supervised learning, where neural networks or other machine learning models are trained to provide a known output from the model. The supervised ML techniques can be further classified into two categories of (a) traditional ML methods and (b) DL based methods. In this section, we discuss the specific ML techniques used by the previously mentioned papers (for different purposes), with brief details, under these two categories. Our goal was to identify the most popular ones from these two categories and include further detailed description about those. The discussion is also summarized in Table 1. Following convention, ANN (traditional) is considered as networks with single hidden layer and DL-ANNs are the networks with multiple hidden layers.

ML Algorithm Category Purpose Reference
Feedforward ANN Traditional Provide monthly estimates of GPP, LH and SH at a global scale [42] ELM and GRNN Obtain ET o from temperature data in southwest China [44] RT, Bagging, RF and SVM Provide ET estimates in central Florida [46] SVM and GANN Provide crop ET estimates in China [47] RF, RT, KRR, RS, ANN Provide CO 2 , LH, SH, R n at multiple sites globally [48] Extra-Trees Obtain global parameter estimates for Noah land model [52] MCMC Parameter optimization of land model [54] RF Improving regional crop yield prediction [56,57,59,60] GBRT Constrain uncertainty in GPP estimates [58] DL-ANN

Traditional ML Methods
Common traditional ML methods include SVM, CART, bootstrap aggregating (Bagging), Kernel ridge regression (KRR), RF, ANN etc. Alemohammad et al. [42] used a feedforward ANN as a supervised learning approach to retrieve monthly estimates of GPP, LE, and SH at a global scale using remotely sensed SIF estimates besides other meteoro-logical predictors like precipitation, temperature, soil moisture, snow cover, and R n . The ANN consisted of three layers: (1) an input layer connected to input data, (2) one hidden layer, and (3) an output layer producing LE, SH, and GPP outputs. A number of neurons in the hidden layer were chosen based on the complexity of the problem. [44] compared the performance of an extreme learning machine (ELM) technique and generalized regression neural network (GRNN) model in predicting reference evapotranspiration (ET o ). Their ELM consisted of 50 hidden nodes, which performed better than the GRNN, which is similar to feedforward neural networks but based on nonlinear regression theory of function estimation. More specifically, GRNN contains four layers named summation layer, input layer, pattern layer, and output layer [68]. They were able to reproduce more accurate ET o estimates than the conventional empirical Hargreaves model. A similar study was conducted by Gocic et al. [45] over two weather stations in Serbia. [46] compared the supervised learning approaches Regression Tree (RT), Bagging, RF, and SVM while predicting actual ET. An RT model utilizes a decision tree as a predictive model, and target variables are real values. RTs can be further advanced by gradient boosting (GBRT) [58,59]. 'Bagging' is a machine learning ensemble technique. RF is a set of uncorrelated simple regression trees. In contrast, SVMs are supervised learning models to deal with classification and regression problems [69]. Some studies have used genetic algorithms to optimize the ANN (GANN) [47] to predict crop-specific ET estimates.
Apart from these commonly used techniques, KRR [70] and regression splines (RS) [71] are also used by few studies to estimate ET [48]. Global model parameter estimation in [52] benefited from Extra-Trees [72]. MCMC has also been used with ML (Gaussian Process Regression) for parameter optimization [54]. RF is the most commonly used technique for crop yield estimate [56,57]. To understand the relative importance of predictors, which may provide information about the usefulness of them at predicting the target variable, 'feature importance' is normally computed internally by these methods. Overall, ANN and RF are the most widely used traditional ML methods in LSM applications. However, RTs with improved 'gradient boosting' of 'Bagging' can be extremely powerful to create data-driven models for land-surface applications. We discuss RF in further due to its significant vogue in past literature.

Random Forests (RF)
RF [73][74][75] are supervised non-parametric ML algorithms, which are ensemble-learning methods using decision trees as base learners. This is a fast, flexible and robust approach to high-dimensional data mining that relies on aggregating the results of an ensemble of simple estimators. Decision trees are intuitive ways to classify objects or label objects by binary splitting. Variable space is divided into a set of boundaries, and a model is fitted to each set (which can be a constant in the simplest case). However, overfitting tendency of such trees can be reduced by an ensemble of randomized trees and that motivates the usage of 'bagging' or RF technique (equipped with some additional randomization techniques reducing the correlation between the trees). In LSM applications, RF is mostly used within the context of regression, i.e.; for continuous prediction rather than categorical. Users generally provide the number of trees in the forest and a criterion to measure the quality of a split, such as mean square error. Hyperparameters are also provided to improve the model performance and control overfitting. In general, statistical learning has two main purposes: (1) prediction and (2) inference. In the case of RF, interpreting the model can be done either by looking at a single tree in the forest or by examining the relative importance of independent variables. Overall, RF method is found to be (1) consistent (2) reducing the bias and variance simultaneously, (3) achieving convergence at high rate and (4) adaptive to sparsity, which makes it applicable even with noisy predictor variables [74].
RF is often superior to other traditional ML methods like SVM or CART because it (1) considers an ensemble of decision trees and does not need multiple models like others (e.g.; SVM) to provide a probabilistic prediction, (2) provides more generalized solution, (3) needs no feature scaling (4) can handle high dimensional spaces as well as large number of training examples.

Deep Learning (DL) methods
Some commonly used DL methods include DL-ANN, CNN, LSTM, variational autoencoders (VAN) and generative adversarial networks (GAN). However, till date, DL-ANNs are the only deep learning tools used with land modeling. For example, [43] used a neural network with two hidden layers containing fifteen neurons in total and LH, and SH were output from the model. Physics guided machine learning models have also used DL-ANN as the ML technique [49]. DL-ANNs are also useful in constructing emulators of LSMs in to characterize the uncertainties and parameter optimization [55]. Dew point temperature is an important parameter to assess surface humidity conditions, critical for agricultural purposes. Constantin et al. [76] realistically predicted dew point temperatures at sub-daily scale using other meteorological variables as input to a DL-ANN consisting of three hidden layers. We discuss ANN and DL in further detail below.

Artificial Neural Networks (ANN)
ANN is the supervised learning approach to approximate relationships between input and output using interconnecting units called neurons. The development of ANNs was motivated by the functioning of the human brain. Like the brain, the connections between neurons decide the function of the network. Multiple inputs given to the neuron are assigned with different weights, calculated by minimizing a loss function, that collectively determines the importance of input signals [77]. This part is termed as 'training' and some popular training methods are backpropagation, evolutionary algorithm and gradient descent. Output signal is produced by the summation block, which adds all of the weighted inputs algebraically. The input and output layers are called 'visible layers' as they are directly connected to inputs and outputs, respectively. Hidden layers are the layers in the middle (see Figure 2). 'Activations' are the outputs sent to the (i + 1) th layer from the i th layer, which is calculated based on the transformation functions (such as sigmoidal, tanh, rectified linear units or ReLU) of combined input from (i − 1) th layer. 'Width' is defined by the number of neurons on a layer and 'depth' is defined by the number of layers in the network.
Deep Learning (DL) is a suite of tools based on carefully designed large ANNs with multiple hidden layers. Compared to non-deep networks and traditional earliergeneration ML methods, DL is characterized by an abundance of layers to process the complex information in big data, addition of unsupervised learning units and effective regularization techniques. These allow automatic extraction of features from input data. By contrast, traditional ML models, such CART and SVM, need human construction of relevant input features for best results, which is often the most tedious step. Many of these conventional methods are also not suitable for generalization purposes and working with large data sets that DL can handle. Though DL-ANN are the only DL techniques used with land modeling so far, LSMs can be further benefitted from some of these techniques previously in water science, especially hydrology [78,79] and envisage more potential uses for those in the future. For example, CNN can be used to estimate crop yields by merging LSM and satellite data. However, DL can face some of the issues like other ML methods, such as convergence of the learning process dependent on training data, hidden layers acting as a black box, etc.
In addition to DL techniques, gaussian process emulation can be used for the simple representation of complex land-surface features and test different potential values of parameters. High-resolution land cover data can be obtained using Kernel Regression. Wavelet transform can also be used to quantify the information transfer between different model components of an LSM.

Possible future directions
Combining the applications of ML in past literature and keeping in mind the still existing problems in land surface modeling, we provide some future directions to improve the current status of land modeling leveraging machine learning techniques.

1.
ML techniques can be a suitable way to reduce the process complexity of the LSMs. Currently, the complex interactive processes of the terrestrial system and way they are represented in conventional LSMs makes them intractable. It is often difficult to assess the added value of a complex process given its cost of burdening the model. ML can help in this regard to take a modular approach, where modelers can represent a complex process in multiple ways and test the model performance easily. This will help in reducing the structural uncertainty of the models as well. The model intercomparison projects (e.g.; MsTMIP) are often constrained by the under-sampling of the potential range of model configurations. Such technique will also help us in collaborative development of the model, rather than adding processes on a specific need basis.

2.
Processes in the model which have little physics knowledge or complex calculation, but more data availability, could be replaced by a ML-based surrogate. As mentioned in Section 4.4, there have been some attempts to predict important land surface properties by a hybrid modeling approach, but we are still lacking in exploring some of the more fundamental variables which can be easily provided by ML applications as the amount of observations increase in recent days. For example, hydrology, phenology and snow cover fraction. Hydrologic observations are increasing all over the globe with advanced velocimetry techniques and phenology is easily obtained now from satellite remote sensing data. Improving these components leveraging ML will help upgrade the overall LSM performance.

3.
Parametric uncertainty can be investigated, and parameters can be better optimized following Dagon et al. [55]. However, we need to consider longer data records (now possible with the availability of computing resources) and more output variables. Good quality globally gridded observations are now available. For example, Atmospheric CO 2 from National Oceanic and Atmospheric Administration (NOAA), Commonwealth Scientific and Industrial Research Organisation (CSIRO), gridded FLUXNET from Max Planck Institute for Biogeochemistry (MPI-BCG), river flow (GRDC), albedo (MODIS), and so on (ILAMB project, etc.).

4.
Data assimilation is proved to be an important tool to improve model simulations at different earth system modeling applications [80,81]; and advancement in ML can further enhance that. Conventional data assimilation methods include Kalman filter and variational approaches. These methods have underlying assumptions of normality, Markovian processes, zero error covariance and similar ML algorithms, being completely data-driven, may improve the assimilation in terms of speed, accuracy, and efficiency.

5.
There are several avenues for better crop yield prediction. Soil hydrology related variables such as soil moisture or drought indices (obtained from either remote sensing, in-situ measurements, or process-based models) can further improve the ML predictions on crop yield. Precision agriculture is another advanced recent agricultural technology which includes sensors, robotics, and AI to assemble 'big data' which can further be processed with ML models to develop a sustainable agricultural practice with enhanced yield. We can use such data and the extracted information to assess the yield variability among regions to make informed decisions [82]. As such, the yield stability maps, included in the landscape, can provide information about environment friendly areas and constant low yield areas. We should focus on intensifying the high yield areas sustainably. On the other hand, low yield zones can be improved by perennial bioenergy crops and recycling of plant available nutrients. 6.
Interpretability of ML models [83] has potential to reveal some of the hidden links and physics between different parts of the terrestrial hydrology. Relative importance of the predictors to predict a specific outcome could be more rigorously analyzed for similar interpretations.

Conclusions
The implementation of ML techniques in earth system modeling has been widely accepted by many researchers in recent years. Direct application of such techniques, particularly on LSM, is quite recent. LSM plays an important role in simulating the feedback and interactions between the land surface and the atmosphere. As we escalate toward rapid changes in the climate and land use, the future of the terrestrial biosphere and hydrology remains a global concern. LSMs are the fundamental tools to simulate and predict the state of the terrestrial surface and play a critical role in optimizing the amount of carbon dioxide to limit global warming. This provides motivation for improving the performance of LSMs.
The complex nature of linked land-surface processes makes it difficult to create a tractable LSM. The interactive components include atmosphere, urban environment, canopy processes, agriculture, hydrology, soil dynamics, snow physics, soil processes and so on. Recently with the need for hydrologists, ecologists, biogeochemists, physiologists, the structure of LSMs is extremely complex. We argue that ML methods can help us achieve better performance of LSMs, make it efficient and also create a flexible structure of the models. With the help of ML and adequate data in recent ages, some critical applications of LSM have already been upgraded.
The main improvements were in the areas of ET estimation, parameter calibration, crop yield, model benchmarking, and uncertainty quantification. ET estimation is important in earth system modeling as ET is an important part of both water and energy cycles. ML methods were able to use satellite data and improve global ET estimates. The major complexity of the LSMs come from a significant number of parameterized processes within LSMs. ML surrogates were able to replace some of these processes to improve the model performance and reduce the uncertainty. Crop yield prediction majorly benefitted from the 'hybrid' model approach where process-based models and ML models were combined to achieve better yield predictions. Model benchmarking is a relatively newer concept but an extremely important part of LSM improvement, and ML has been helpful in this context too.
The widely used ML techniques along with land surface modeling are ANN, RF, and modified versions of Gradient trees. Supervised learning is often used with the help of these techniques to improve the LSM performance. Other ML techniques are also often used along with these, such as genetic algorithms, SVM, GBRT, 'Bagging,' MCMC, etc. We foresee that there is scope to apply the recent DL methods like CNN and LSTM regression to improve several parts of the land surface modeling, following other applications in water science and earth system modeling. For example, high-resolution data preparation and improved data assimilation techniques can leverage these advanced methods.
Additional improvements are possible in the areas of complexity reduction of LSMs, multiple parameter calibration, structure revision, uncertainty reduction, and sustainable crop yield estimate. Interpretability of ML models should be carefully investigated to further explore the underlying physics and processes behind the yet less-known land surface-related processes.