Evaluation of Machine-Learning Models for Predicting Aeolian Dust: A Case Study over the Southwestern USA

Abstract: Aeolian dust has widespread consequences for health, the environment, and regional hydrology. This study investigated the performance of several machine-learning (ML) models, including Multiple Linear Regression (MLR), Support Vector Machines (SVM), Random Forests (RF), Bayesian Regularized Neural Networks (BRNN), and Cubist (Cu), in predicting dust emissions over the Southwestern United States (US). Six meteorological and climatic variables (precipitation, air temperature, wind speed, ENSO, PDO, and NAO) were used to predict dust emissions. The correlation (r) and root mean square error (RMSE) for fine dust vary from 0.67 to 0.80 and from 0.40 to 0.52 µg/m³, respectively. For coarse dust, r and RMSE vary from 0.69 to 0.73 and from 2.01 to 2.34 µg/m³, respectively. The non-linear ML models outperformed linear regression for both fine and coarse dust. ML models underestimated high concentrations of dust. ML models predict fine dust better than coarse dust over the Southwestern US. Air temperature was found to be the most important predictor, followed by precipitation, for both fine- and coarse-dust prediction over the region. These results improve our understanding of the predictability of Southwestern US dust.


Introduction
The dust cycle is a key factor in the environment [1] and the global climate system [2] through the scattering and absorption of sunlight [3], and it is often associated with adverse effects on human health [4], traffic, and industrial machinery [5,6]. Reduced soil moisture due to low precipitation and/or higher temperature increases soil erodibility [2,7,8]. Drought-induced loss of vegetation cover further amplifies dust emissions from arid and semiarid regions [9,10]. The Southwestern United States (SWUS) is characterized by a dry climate and is a major US dust source, with large dust emissions in all four seasons [11]. Dust emissions in the SWUS (Figure 1) peak in spring (March-May). Average monthly fine dust (PM2.5) and coarse dust (PM10) concentrations over the SWUS are 1.1 µg/m³ and 6.22 µg/m³, respectively. Achakulwisut et al. [12] and Hand et al. [11] noted increasing trends in fine dust concentrations over the Southwestern US over recent decades.
Previous studies that examined the relative contribution of meteorological variables to dust variability are based on linear regression (e.g., [12,13]). Okin and Reheis [14] showed, based on correlation, that there is a significant relationship between ENSO anomalies and dust event frequency in the Southwestern United States. Machine learning (ML) techniques have emerged recently with great promise in environmental studies [15]. For example, Lee et al. [16] compared ML models for detecting dust aerosol in satellite images. Similarly, Ebrahimi-Khusfi et al. [17] predicted dusty days using ML algorithms. With an ML model, the non-linear relationship between dust emissions and meteorological variables can be better characterized. However, much less effort has been spent on assessing the accuracy of ML models in predicting dust emissions.

The purpose of this study is to evaluate the accuracy of machine-learning models in predicting aeolian dust. Previous studies on the relationship between dust emissions and meteorology are based on soil moisture and/or vegetation [8,13,18]. Much less effort has been made to determine the relative roles of precipitation and temperature in dust variability. This study compares the relative importance of precipitation and temperature in predicting aeolian dust over the Southwestern US. Finally, the ML models' performance in predicting fine dust (particle diameter ≤ 2.5 µm; PM2.5) and coarse dust (particle diameter 2.5-10 µm; PM10) is compared.

Study Area and Data
The Southwestern US is a prominent dust source [19]. Major dust sources over the region are the Chihuahuan Desert, the Colorado River, and the High Plains. Observed near-surface dust concentrations over the Southwestern US are available from the Interagency Monitoring of Protected Visual Environments (IMPROVE) network ([20]; available online: https://views.cira.colostate.edu/fed/Express/ImproveData.aspx (accessed on 10 April 2022)). The IMPROVE stations (Figure 1) have provided fine and coarse dust measurements since 1988. The sampler vacuum pump is run for 24 h and collects the dust (in µg/m³). Observations were performed every Saturday and Wednesday prior to 2001. Since then, the dust has been measured every third day, continuing to the present time.
Previous studies show that dust emissions over the Southwestern US are strongly correlated with drought [4] and wind speed [13] and are affected by climatic teleconnections [14]. Therefore, six meteorological and climatic factors that explain dust emissions were chosen as predictors: precipitation (pr), 2 m air temperature, near-surface (10 m) wind speed, the El Niño-Southern Oscillation (ENSO), the Pacific Decadal Oscillation (PDO), and the North Atlantic Oscillation (NAO). Monthly total precipitation, average 2 m air temperature, and 10 m wind speed were taken from the North American Regional Reanalysis (NARR) ([21]; available online: https://psl.noaa.gov/data/gridded/data.narr.monolevel.html (accessed on 10 April 2022)), available at 0.3° × 0.3° resolution. ENSO (Nino3.4), PDO, and NAO indices were taken from the National Oceanic and Atmospheric Administration (NOAA; available online: https://psl.noaa.gov/data/climateindices/list/ (accessed on 10 April 2022)). Most IMPROVE sites are located on federal lands and in national parks that are often not at the center of the dust sources. Therefore, studies on the relationship between dust and meteorology are performed on a regional scale rather than on a grid/station scale (e.g., [4,18]). We performed the analysis using dust intensity, total precipitation, air temperature, and wind speed averaged over the region (Figure 1). We used monthly average dust concentrations from 1988 to 2010 as training data and those from 2011 to 2020 as test data.
Climate 2022, 10, 78
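The chronological split described above (1988 to 2010 for training, 2011 to 2020 for testing) can be sketched as follows. This is a minimal illustration, not the authors' code: the `MonthlyRecord` type, the `split_by_year` helper, and the placeholder dust values are assumptions introduced here.

```python
from typing import NamedTuple


class MonthlyRecord(NamedTuple):
    year: int
    month: int
    dust: float  # regional mean dust concentration in µg/m³ (placeholder values below)


def split_by_year(records, last_train_year=2010):
    """Chronological split: years <= last_train_year go to training, later years to test."""
    train = [r for r in records if r.year <= last_train_year]
    test = [r for r in records if r.year > last_train_year]
    return train, test


# Synthetic placeholder series: one record per month from 1988 through 2020.
records = [MonthlyRecord(y, m, 1.0) for y in range(1988, 2021) for m in range(1, 13)]
train, test = split_by_year(records)
print(len(train), len(test))  # 276 120
```

Splitting by year rather than at random keeps every test month later than every training month, so the evaluation resembles predicting years the model has not seen.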

Multiple Linear Regression (MLR)
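Regression analyses are widely used to describe the linear relationship between a response variable and one or more explanatory variables [25]. The MLR equation is as follows:
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i \tag{1}$$
where $i = 1, \ldots, n$ indexes the observations, $y$ is the dependent/response variable, $x_1, \ldots, x_p$ are the independent/explanatory variables, $\beta_0$ is the y-intercept, $\beta_1, \ldots, \beta_p$ are the regression coefficients, and $\varepsilon$ denotes the model's residuals/errors.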
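As a rough illustration of fitting such a model, the sketch below estimates the intercept and coefficients by ordinary least squares with NumPy. The helper names `fit_mlr` and `predict_mlr` and the synthetic, noise-free data are assumptions made for this example; they are not taken from the study.

```python
import numpy as np


def fit_mlr(X, y):
    """Ordinary least squares; returns [beta_0, beta_1, ..., beta_p]."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])  # leading column of ones for beta_0
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta


def predict_mlr(beta, X):
    """Apply the fitted intercept and coefficients to new predictor rows."""
    X = np.asarray(X, dtype=float)
    return beta[0] + X @ beta[1:]


# Synthetic, noise-free data generated from y = 1 + 2*x1 - 0.5*x2.
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 1.0], [4.0, 2.0]])
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1]
beta = fit_mlr(X, y)
print(np.round(beta, 6))
```

Because the synthetic response is an exact linear function of the predictors, the fit should recover the generating coefficients up to floating-point error.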

Support Vector Machine (SVM)
SVMs are supervised learning methods used for both classification and regression [26]. SVM regression is often called Support Vector Regression (SVR) in the literature. SVM has two layers, where the weighting is non-linear in the first layer and linear in the second layer [27-30]. The SVM decision function is represented as
$$f(x) = \omega^{T}\phi(x) + b \tag{2}$$
where the non-linear function $\phi(\cdot)$ maps $x$ into a feature space, and $\omega$ and $b$ are parameters to be estimated from the $N$ training observations. The parameters are estimated by minimizing the sum of the empirical risk (the first term of Equation (3)) and the complexity term (the second term of Equation (3)):
$$R(\omega, b) = C\sum_{i=1}^{N} L_{\varepsilon}\left(y_i, f(x_i)\right) + \frac{1}{2}\lVert\omega\rVert^{2} \tag{3}$$
where $C$ is a positive constant that determines the trade-off between the model complexity and the extent to which model errors larger than $\varepsilon$ are tolerated, $\lVert\omega\rVert^{2}$ is the regularization term based on the Euclidean norm, and $L_{\varepsilon}$ is the loss function that is insensitive to errors smaller than $\varepsilon$ and has the advantage that not all data are needed to describe the regression vector $\omega$. Radial basis kernel functions are more suitable for handling non-linear problems and have fewer tunable parameters [31].
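The two terms of the SVR objective can be made concrete with a short sketch. It assumes the common form C · Σ L_ε + ½‖ω‖² for the objective and the usual definition of the radial basis kernel with a width parameter `gamma`. The function names and the numbers in the example are hypothetical.

```python
import math


def eps_insensitive_loss(y_true, y_pred, eps):
    """Zero inside the eps tube around the prediction, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)


def rbf_kernel(x, z, gamma):
    """Radial basis kernel exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def svr_objective(y_true, y_pred, w_norm_sq, C, eps):
    """Empirical-risk term plus complexity term: C * sum(L_eps) + 0.5 * ||w||^2."""
    risk = C * sum(eps_insensitive_loss(t, p, eps) for t, p in zip(y_true, y_pred))
    return risk + 0.5 * w_norm_sq


# Hypothetical observations and predictions; the first error falls inside the tube.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.05, 2.5, 2.0]
objective = svr_objective(y_true, y_pred, w_norm_sq=4.0, C=2.0, eps=0.1)
print(round(objective, 6))
```

A larger `C` weights the data misfit more heavily relative to the norm penalty, which is the trade-off the text describes.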

Random Forest (RF)
RF is a non-parametric algorithm based on decision trees [32]. RF consists of a combination of decision trees, each fitted to a randomly selected subset of samples from the training data. The RF algorithm builds $K$ regression trees from an input vector $x$. After $K$ such trees $\{T_k(x)\}_{k=1}^{K}$ are grown, the RF prediction is the average over all trees [33], expressed as:
$$\hat{f}_{RF}^{K}(x) = \frac{1}{K}\sum_{k=1}^{K} T_k(x) \tag{4}$$
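The averaging step can be illustrated with a deliberately simplified sketch: one-split regression trees (stumps) on a single predictor, each grown on a bootstrap resample, with the ensemble prediction taken as the mean over trees. This is bagging of stumps rather than a full random forest (there is no random feature selection at each split), and all names and data are assumptions for illustration.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """One-split regression tree on a single predictor; returns a predictor function."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # candidate split points
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, ml, mr)
    if best is None:  # the resample contained a single unique x value
        constant = mean(ys)
        return lambda x: constant
    _, threshold, ml, mr = best
    return lambda x: ml if x <= threshold else mr


def fit_forest(xs, ys, n_trees=50, seed=0):
    """Grow K trees, each on a bootstrap resample of the training pairs."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def predict_forest(trees, x):
    """Ensemble prediction: the average of the K individual tree predictions."""
    return mean(tree(x) for tree in trees)


# Synthetic step-shaped data: low response for x <= 4, high response for x > 4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]
trees = fit_forest(xs, ys)
print(predict_forest(trees, 2.0), predict_forest(trees, 7.0))
```

Averaging over many resampled trees smooths out the variance of any single tree, which is the motivation for the ensemble.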

Bayesian Regularized Neural Networks (BRNN)
The BRNN combines an artificial neural network (ANN) with a Bayesian method for estimating optimal parameters. Complex models are penalized in the Bayesian framework, which reduces the overfitting problem [34,35]. BRNN imposes prior distributions on the parameters of the model. The following objective function is minimized by gradient-based optimization to estimate the parameters [36]:
$$F = \beta E_D(D \mid w, M) + \alpha E_W(w \mid M) \tag{5}$$
where $E_D$ is the sum of squared errors on the data $D$, $M$ is the ANN model, $E_W(w \mid M)$ is the sum of squared ANN architecture weights $w$, $\alpha$ and $\beta$ are objective function parameters, and $\alpha E_W$ represents weight decay ($\alpha$ is the decay coefficient). The weights $w$ should be kept small to reduce the overfitting tendency.
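A small numerical sketch shows how the weight-decay term penalizes large weights. It assumes the regularized objective has the form F = β·E_D + α·E_W with both terms as sums of squares, and it holds α and β fixed, whereas in the Bayesian framework they are themselves estimated. The function name and the numbers are hypothetical.

```python
import math


def brnn_objective(errors, weights, alpha, beta):
    """F = beta * E_D + alpha * E_W, with both terms as sums of squares."""
    e_d = sum(e ** 2 for e in errors)   # data misfit (sum of squared errors)
    e_w = sum(w ** 2 for w in weights)  # penalty on the network weights
    return beta * e_d + alpha * e_w


# Two hypothetical networks with identical errors but different weight magnitudes.
errors = [0.5, -0.5]
f_small = brnn_objective(errors, weights=[0.1, 0.2], alpha=0.1, beta=1.0)
f_large = brnn_objective(errors, weights=[1.0, 2.0], alpha=0.1, beta=1.0)
print(f_small < f_large)  # True
```

With equal data misfit, the network with smaller weights attains the lower objective, which is how the penalty discourages overly complex fits.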

Cubist (Cu)
The non-parametric Cu method is based on the rule-based model tree proposed by Quinlan [37-40]. Cu linearly combines two models [41]: the prediction from the current model and the prediction from the parent model above it in the tree [41,42].
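The idea of linearly combining a model with its parent can be sketched as a simple weighted blend applied from a leaf up to the root. The fixed blending weight `a` and both function names are assumptions made for illustration only; the source does not state how Cubist chooses its weights, and this sketch does not reproduce that calculation.

```python
import math


def combine_predictions(child_pred, parent_pred, a):
    """Linear blend a * child + (1 - a) * parent, with 0 <= a <= 1."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    return a * child_pred + (1.0 - a) * parent_pred


def smooth_along_path(path_preds, a):
    """Blend predictions from the leaf model up to the root, one parent at a time.

    path_preds[0] is the leaf model's prediction; the last entry is the root's.
    """
    smoothed = path_preds[0]
    for parent_pred in path_preds[1:]:
        smoothed = combine_predictions(smoothed, parent_pred, a)
    return smoothed


# Hypothetical predictions from a leaf model, its parent, and the root model.
blended = smooth_along_path([4.0, 3.0, 2.0], a=0.5)
print(blended)  # 2.75
```

Blending with parent models pulls an extreme leaf prediction toward the predictions of models fitted on more data higher in the tree.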

Results and Discussion
The ML models' performance in predicting fine dust and coarse dust is shown in Table 1. The correlation (r) between the ML-predicted and observed dust concentrations from 2011 to 2020 ranged from 0.65 to 0.81, and the root mean square error (RMSE) ranged from 0.40 µg/m³ to 0.48 µg/m³ for fine dust. Similarly, for coarse dust, the correlation varied from 0.67 to 0.71, and the RMSE varied from 2.08 µg/m³ to 2.27 µg/m³. The correlations for fine dust were greater than those for coarse dust, implying that ML models predict fine dust better than coarse dust. The scatter plots of observed and predicted dust are shown in Figures 2 and 3 for fine dust and coarse dust, respectively. For fine dust, all ML models underestimated high concentrations of dust. The Cubist model, compared to the other models, largely underestimated both high and low dust concentrations. As with fine dust (Figure 2), ML models underestimated high concentrations of coarse dust (Figure 3).
Machine-learning algorithms show great potential (with correlations > 0.65) for predicting dustiness in the Southwestern US. ML models work better for predicting fine dust than for predicting coarse dust. Studies show that earth system models (ESMs) estimate fine dust emissions at less than 10% of the total dust emissions and perform poorly in simulating coarse dust [2].
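The two evaluation metrics used here, the correlation r and the RMSE, can be computed as in the sketch below. The observed and predicted values are synthetic and chosen only to show that the two metrics measure different things; they are not data from the study.

```python
import math


def pearson_r(obs, pred):
    """Pearson correlation between observed and predicted values."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_p = sum((p - mp) ** 2 for p in pred)
    return cov / math.sqrt(var_o * var_p)


def rmse(obs, pred):
    """Root mean square error, in the same units as the observations."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


# Synthetic example: predictions track the observations but carry a constant offset.
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.5, 2.5, 3.5, 4.5]
r_value = pearson_r(obs, pred)
rmse_value = rmse(obs, pred)
print(round(r_value, 6), round(rmse_value, 6))
```

In this example the correlation is essentially perfect while the RMSE is not zero, which is why the two metrics are reported together: r captures co-variation, and RMSE captures the magnitude of the errors.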
The poorer performance of ML models in predicting coarse dust is most likely due to the short transport distance and short residence time of coarse dust in the air. The IMPROVE observation stations are located on federal lands and in national parks far from the dust sources [20], so they are more likely to miss coarse dust. Therefore, regional meteorology cannot fully explain the variability of coarse dust. ML models also underestimate high-concentration dust events for both fine and coarse dust. High dust concentrations often occur as a result of thunderstorm outflow winds/wind gusts from deep convection [43], which cannot be fully captured by the monthly average wind speed.
The relative importance of each predictor variable was calculated as the percentage increase in RMSE when that variable was excluded as a predictor [23] and is shown in Figure 4. Temperature and precipitation are more important than the other predictors of regional dustiness over the Southwestern US for both fine dust and coarse dust. Climatic teleconnections (ENSO, PDO, and NAO) are more important predictors of coarse dust than of fine dust.
Previous studies have shown that bareness is the most important factor for dust variability over the region (e.g., Figure 6 in [13] and Figure 3 in [18]). The amount of bareness or vegetation in a region depends on both precipitation and temperature. In this study, we examined the relative importance of temperature and precipitation for dust emissions. Reduced precipitation promoted lower soil moisture, leading to more intense dust emissions. The opposite was true for temperature: higher temperatures were associated with more intense dust emissions. On a monthly timescale, as in this study, dust emissions respond strongly to temperature due to soil moisture depletion in the topsoil. The impacts of precipitation on dust emissions are stronger on longer timescales (i.e., annual) due to changes in vegetation cover [44].
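One reading of this importance measure is to refit the model without each predictor in turn and record the percentage increase in test RMSE. The sketch below does this with an ordinary least-squares model on synthetic data. The function names, the predictor labels `x1` to `x3`, and the generating coefficients are all assumptions, and the procedure in [23] may differ in detail.

```python
import numpy as np


def ols_rmse(X_train, y_train, X_test, y_test):
    """Fit least squares with an intercept on the training split; return test RMSE."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.sqrt(np.mean((y_test - A_test @ beta) ** 2)))


def drop_variable_importance(X_train, y_train, X_test, y_test, names):
    """Percentage increase in test RMSE when each predictor is left out in turn."""
    base = ols_rmse(X_train, y_train, X_test, y_test)
    importance = {}
    for j, name in enumerate(names):
        keep = [k for k in range(X_train.shape[1]) if k != j]
        reduced = ols_rmse(X_train[:, keep], y_train, X_test[:, keep], y_test)
        importance[name] = 100.0 * (reduced - base) / base
    return importance


# Synthetic data: x1 matters most, x2 less, and x3 not at all.
rng = np.random.default_rng(0)
names = ["x1", "x2", "x3"]  # hypothetical predictor labels
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
importance = drop_variable_importance(X[:100], y[:100], X[100:], y[100:], names)
for name, value in importance.items():
    print(f"{name}: {value:.1f}% increase in RMSE")
```

Predictors whose removal barely changes the RMSE contribute little beyond what the remaining predictors already explain, which matters when predictors are correlated with one another.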

Conclusions
This study investigated several machine-learning (ML) models' abilities to predict aeolian dust over the Southwestern US. The observed dust was taken from the Interagency Monitoring of Protected Visual Environments (IMPROVE) network, while the regional meteorology data (precipitation, temperature, wind speed) were retrieved from the North American Regional Reanalysis (NARR). The models' performances for fine dust and coarse dust were compared. Then, the relative importance of the predictors was assessed. The main conclusions from this study can be summarized as follows:
1. The non-linear models performed better than linear regression in predicting both fine and coarse dust. All ML models underestimated high concentrations of dust.
2. ML models better predicted fine dust than coarse dust over the study region.
3. The air temperature was the most important meteorological variable, followed by precipitation, for predicting monthly dust over the region.
The Southwestern US region is likely to see severe drought due to both reduced precipitation and increased temperatures [45], which reduce surface moisture and vegetation and thereby enhance erodibility and dust emissions. Because temperature and precipitation are the most important predictors, future warming-enhanced drought [46,47] over the region is likely to be associated with increased dust emissions and the related severe health and environmental concerns. ML models show potential for predicting dustiness over the region, which could support effective mitigation efforts.