Environmental Pollution Analysis and Impact Study-A Case Study for the Salton Sea in California

A natural experiment conducted on the shrinking Salton Sea, a saline lake in California, showed that each one-foot drop in lake elevation resulted in a 2.6% average increase in PM2.5 concentrations. As the lake shrinks, the asthma rate among children continues to increase, with one in five children being sent to the emergency department for asthma-related conditions. In this paper, several data-driven machine learning (ML) models are developed for forecasting air quality and dust emission in order to study, evaluate and predict the impacts on human health of the shrinkage of a saline lake such as the Salton Sea. The paper presents an improved long short-term memory (LSTM) model to predict the hourly air quality (O3 and CO) based on air pollutant and weather data from the previous 5 h. According to our experimental results, the model achieves R2 scores of 0.924 and 0.835 for O3 and CO, respectively. In addition, the paper proposes an ensemble model based on random forest (RF) and gradient boosting (GBoost) algorithms for forecasting hourly PM2.5 and PM10 using the air quality and weather data from the previous 5 h. Furthermore, the paper shares our research results for PM2.5 and PM10 prediction based on the proposed ensemble ML models using satellite remote sensing data. Daily PM2.5 and PM10 concentration maps for 2018 are created to display the regional air pollution density and severity. Finally, the paper reports artificial intelligence (AI) based research findings on measuring the impact of air pollution on the asthma prevalence rate of local residents in the Salton Sea region. A stacked ensemble model based on support vector regression (SVR), elastic net regression (ENR), RF and GBoost is developed for asthma prediction, with an R2 score of 0.978.


Introduction
The Salton Sea is one of the largest lakes in California. Since the water in the Salton Sea has no outflow to the ocean, the salt concentration keeps increasing and is now 50 percent higher than that of the ocean [1]. A natural experiment [2] showed that the shrinking of the Salton Sea led to an increase in PM2.5 concentrations, which can cause the asthma rate to keep rising among children in the Salton Sea region [3]. Recently, soil evaluation [4] and water quality prediction [5] for the Salton Sea have been carried out using machine learning (ML) and big data techniques. However, few publications focus on Salton Sea environmental pollution analysis and impact study. This paper aims to develop data-driven ML models to forecast the air quality and dust emission due to the shrinkage of the Salton Sea and their impact on human health. The Salton Sea area has some of the worst air quality in the U.S. Because of industrial emissions, pollution and the special geographic environment, its residents suffer from many health issues. The air pollution in Imperial County has a serious impact on residents' health, especially for children still in K-12. In one elementary school in Imperial County, 17 percent of students suffer from asthma; those 64 students have inhalers kept in the office [6]. That is a relatively high percentage for a single disease in a single area, which shows how much the dust affects people's health. The main cause of bad air quality in the Salton Sea area is microparticles in the air. When a particle has a diameter of less than 10 µm, it can enter the human lungs and bloodstream [7]. People can develop lung diseases like asthma after long-term exposure to microparticles in the air.
Because of the poor air quality, many residents already have existing diseases like asthma, and because the coronavirus attacks the lungs, this leads to a high mortality rate. DeLara claims that, even after more than a decade of controlling the pollution in the Salton Sea, the number of asthma and chronic obstructive pulmonary disease (COPD) patients still has not decreased [8].
Our efforts focus on forecasting air quality and dust emission to study the impacts on the asthma prevalence rate of local residents in the Salton Sea region. Air quality prediction models are either mechanism models or ML models. Owing to the significant growth in sensor technologies, a great deal of data has become available in the public domain, and the potential of ML models has attracted significant attention. Like many previously reported models, our work relies on ML models.
In this paper, we propose an improved long short-term memory (LSTM) model to predict hourly O3 and CO with higher accuracy based on air pollutant and weather data from the previous 5 h. In addition, while most existing papers focus on one method to predict the particulate matter (PM) concentration [17][18][19][20][21][26], our study develops two methods, based on different proposed ensemble models, to predict the PM concentration from weather data and from satellite data, respectively. This paper shows real-time satellite-based PM2.5 and PM10 concentration maps of the Salton Sea area to visualize the regions with the highest and lowest amounts of pollutants. While most research forecasts asthma with a single model [27][28][29][30][31], this paper proposes a stacked ensemble model for the asthma prevalence rate of local residents in the Salton Sea region with higher accuracy, reaching an R2 value of 0.978.
Structure: The rest of this paper is organized as follows. The related work is summarized in Section 2. Section 3 shows a detailed description of training and testing data preparation. Section 4 describes the methodology for each selected model and the proposed ensemble models. In Section 5, the results and case study are provided. Section 6 discusses the results of this research in the light of other similar studies. Section 7 concludes this paper.

Literature Survey
As lakes shrink and their beds are exposed, dry lake beds become potential sources of particulate matter, which further increases air pollution. According to Gholami et al., the dry bed of the Jazmurian great lake is the main source of dust emissions in that region [9]. Air quality prediction models are either mechanism models or ML models. Like many previously reported models, our work relies on ML models, which learn from the data and explore the relationship between air quality data and other parameters.
Many ML and neural network methods have been applied to predict air pollutant concentrations and dust emissions. Some research uses models such as LSTM, gradient boosting decision trees (GBDT) and deep feed-forward neural networks (DFNN) for shorter-term predictions [11]. Other research has tended to focus on models such as the least absolute shrinkage and selection operator (LASSO), support vector regression (SVR), random forest (RF), k-nearest neighbor (kNN), the eXtreme gradient boosting (XGBoost) algorithm, and artificial neural networks (ANN) for longer-term predictions [10].
In addition to the air quality comparison, we also surveyed existing work on ML-based impact analysis. The surveyed papers cover various regions and use various ML models to study the impact of environmental conditions on human health. Table 2 summarizes research papers highlighting the impact of air, dust and other environmental conditions on human health, particularly in relation to asthma. Many of these studies relate to the impact of indoor or outdoor air quality on human health. They include asthma prediction, the impact of air pollution on COVID mortality, and the impact of air pollution on human behavior [27], which is a unique study for identifying the impact of pollution on people's behavior. Razavi-Termeh et al. used environmental factors along with map locations to locate regions of a city with a high chance of asthma, using the spatial correlation between asthma and air pollution and patients' distance from streets and parks [28]. In another study conducted in Seoul [29], a hybrid deep learning model (HDLM) based on vector autoregression (VAR) and DFNN was used, which applies time series analysis to show the relationship between environmental pollutants and asthma statistics.
Chavda [30] combined daily air quality data with hospitalization statistics for asthma patients in California to show the correlation between the two. The experiments show that the decision tree (DT) outperforms the other models. Kim et al. showed that the LSTM outperformed the multinomial logistic (MNL) model, with a 57-84% increase in the ability to predict the chances of asthma in children caused by indoor air pollution [31]. Table 2 lists the literature survey on the air pollution impact.

Technology Survey
Table 3 lists the comparison of six forecasting and regression models for predicting air pollution along with the advantages and disadvantages.
ANN [12]
Description: Predicting future air quality by learning from time-series historical data.
Advantages: (1) Working well for short-term time series forecasting. (2) Improving the prediction accuracy while keeping the parameter count to a minimum.
Disadvantages: (1) Difficult to forecast when there are outliers in the data. (2) Often not interpretable.
SVR [13]
Description: Predicting discrete values by trying to fit the best line (hyperplane) within a threshold value.
Advantages: (1) Effective in higher dimensions.
Disadvantages: (1) Not suitable for large datasets. (2) Not robust when the dataset is noisy.
LR [14]
Description: Predicting PM concentrations for the current day (d) by using particulate matter data from day (d-1).
Advantages: (1) Easier to implement, with a much smaller number of parameters. (2) Simplest and cheapest when the data is limited and linear.
Disadvantages: (1) Assumes independence between input features, which does not hold for air and weather datasets and hence needs to be handled beforehand. (2) Sensitive to outliers, which must therefore be dealt with properly before forecasting.

Data Collection
To forecast hourly air pollutants, we collected hourly air data from the California Air Resources Board (ARB). The Air Quality Data Query Tool (AQDQT) [33] from ARB is used to collect raw data for air pollutants. The Meteorology Data Query Tool (MDQT) [34] from ARB is used to collect hourly weather data. There is a separate .csv file for each year for each pollutant and weather variable.
The PM data from a meteorological station only reflects the PM concentration around that station. In order to explore the PM2.5 and PM10 situation around the Salton Sea area and show the temporal and spatial change in particulate matter, we collected the normalized difference vegetation index (NDVI), the distance to the Salton Sea, weather and air pollutant data, and historical weather and air pollutant data. The NDVI data is taken from the Moderate Resolution Imaging Spectroradiometer (MODIS), which covers Riverside and Imperial counties with a spatial resolution of 1 km or 500 m. MODIS data is available from NASA [35] and can be downloaded from Earth Engine using Cygwin. The distance to the Salton Sea is derived from Landsat 8 satellite images of the Salton Sea area, which are collected from the U.S. Geological Survey and downloaded from Google Cloud Storage using Google's gsutil tool [36]. Each Landsat 8 data folder contains 11 band images, one MTL.txt file (the product metadata file), and one ANG.txt file, which contains the angle coefficients of the sensor viewing angle. Bands 2-4 are the visible blue, green, and red bands; combining them, we obtained a true-colour image. The in situ PM, weather and pollutant data are taken from the Environmental Protection Agency (EPA) [37].
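For reference, NDVI is conventionally defined from the red and near-infrared (NIR) surface reflectances as (NIR - Red) / (NIR + Red). The sketch below computes it for a few hypothetical reflectance values. The function name, the zero-denominator convention and the sample numbers are illustrative assumptions; the study itself takes NDVI from the MODIS product rather than computing it in this way.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized difference vegetation index for one pixel.

    Returns 0.0 when both reflectances are zero to avoid division by zero
    (an illustrative convention, not taken from the paper).
    """
    total = nir + red
    if total == 0:
        return 0.0
    return (nir - red) / total


# Hypothetical reflectance pairs (NIR, red), for illustration only.
samples = {
    "dense vegetation": (0.50, 0.08),
    "bare soil": (0.30, 0.25),
    "open water": (0.02, 0.05),
}
for label, (nir, red) in samples.items():
    print(f"{label}: NDVI = {ndvi(nir, red):.2f}")
```

Values close to 1 indicate dense vegetation, values near 0 indicate bare ground such as exposed lakebed, and negative values are typical of open water.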
To study the health impacts related to asthma prediction, different data sources are used. Asthma emergency department (ED) visit rates, hospitalizations and mortality rates are collected from the California Health and Human Services (CHHS) open data portal for both Imperial and Riverside counties [38,39]. The statistical data on asthma prevalence was collected from the Ask CHHS website [40]. The health data was combined with air pollutants [41] such as NO2, SO2, CO and O3, and particulate matter such as PM2.5 and PM10, to study their impact on human health. Changes in meteorological data [42] such as temperature, humidity, wind, and air pressure were also considered to study their effects on asthma. In addition, surface area data [43] is used to study the impact of Salton Sea shrinkage on air pollution and asthma.

Data Preprocessing
In this paper, five different kinds of data or images were preprocessed using the following steps.
Step 1: For the AQDQT and MDQT data, we used the following steps to preprocess the raw air pollutant and meteorological data.
• Merge the year-wise files for each pollutant and weather variable into one;
• Remove descriptive variables such as name, units, quality, prelim, met source, and site name from the dataset;
• Represent each pollutant and weather variable as a unique column for a given date and hour;
• Use pandas backfill and forward fill to impute null values;
• Create line plots, box plots and statistical correlation plots to detect outliers and anomalies.
Step 2: For MODIS data, we used the following steps to preprocess the raw data of NDVI.
Step 3: For Landsat 8 satellite images, we used the following steps to preprocess the raw data of the distance to the Salton Sea.
• Generate an open water cover mask for Landsat 8 using water detect [44];
• Obtain the shapefile of the Salton Sea area.
Step 4: For the asthma data, we used the following steps to preprocess the raw data.
• Collect the asthma data for the Salton Sea counties by zip code, stratified by age group;
• Map zip codes to monitoring sites to merge the year-wise air pollutant, weather and surface area data;
• Perform data cleaning on each dataset separately using pandas and NumPy;
• Impute the missing values with the group mean (year, county, monitoring site);
• Drop all the redundant columns from the merged dataset.
Step 5: For the EPA data, the yearly air quality data of six pollutants is preprocessed by the following steps.
• List the standard air pollution statistics for all six criteria pollutants, with one row per county per year;
• Merge all the csv files into a single data frame that can be used for further analysis.
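The null-value imputation in Step 1 can be illustrated with a minimal pure-Python sketch of backfill followed by forward fill. In pandas, the same logic corresponds to calling `bfill()` and then `ffill()` on a column. The function names, the fill order and the sample readings are assumptions made for illustration, not the authors' code.

```python
from typing import List, Optional


def forward_fill(values: List[Optional[float]]) -> List[Optional[float]]:
    """Replace each missing value (None) with the last observed value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def backward_fill(values: List[Optional[float]]) -> List[Optional[float]]:
    """Replace each missing value (None) with the next observed value."""
    return forward_fill(values[::-1])[::-1]


def impute(values: List[Optional[float]]) -> List[Optional[float]]:
    """Backfill first, then forward-fill any trailing gaps that remain."""
    return forward_fill(backward_fill(values))


# Hypothetical hourly O3 readings (ppm) with gaps, for illustration only.
hourly_o3 = [None, 0.031, None, None, 0.040, None]
print(impute(hourly_o3))  # [0.031, 0.031, 0.04, 0.04, 0.04, 0.04]
```

Backfill alone leaves the final gap unfilled because no later observation exists, which is why the forward fill is applied as a second pass.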

Training Data Preparation
In this paper, different models use different features, and each model is trained separately. The corresponding training, validation and test datasets are prepared in the following steps.
Step 1: For hourly air pollutant forecasting, we created three new features, which are the season, weekend flag and peak hours for each time step, based on the date of the observation. The label encoder from the scikit-learn library was used to encode categorical features. The standard scaler from the scikit-learn library was used to standardize the pollutant and weather variables. Lag and lead features are added to forecast future values. The data is shifted to add 5 h of previous lag features and sorted by date. We split the data into training, validation and test sets: 2015-2017 for training the models, 2018-2019 for validation, 2020 for testing the models and 2021 for analyzing quality, as shown in Table 4. The split data is reshaped to a 3D format for the DL models.
Step 2: For particulate matter prediction, we divided the prepared data into three datasets, which are the training dataset (60%), validation dataset (20%), and testing dataset (20%), as shown in Table 4.
Step 3: For the health impact study, one-hot encoding was performed to transform categorical features into numeric values. Outliers in the target feature were handled by applying a logarithmic transformation. Min-max normalization was used for feature scaling. The feature importance method was used for feature selection. The data is split into 80% for training and 20% for testing, as shown in Table 4.
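The 5 h lag features and the 3D reshaping from Step 1 can be sketched as a sliding window over time-sorted hourly rows. This is a minimal pure-Python illustration under assumed names (`make_windows`, `target_col`) and made-up data; in practice the same result is usually obtained with pandas shifts and a NumPy reshape.

```python
from typing import List, Sequence, Tuple


def make_windows(
    rows: Sequence[Sequence[float]], target_col: int, n_lags: int = 5
) -> Tuple[List[List[List[float]]], List[float]]:
    """Build (samples, timesteps, features) inputs and next-hour targets.

    rows must already be sorted by time, with one row of features per hour.
    """
    X, y = [], []
    for t in range(n_lags, len(rows)):
        # Inputs: the n_lags hours immediately before hour t.
        X.append([list(r) for r in rows[t - n_lags:t]])
        # Target: the value of the chosen column at hour t.
        y.append(rows[t][target_col])
    return X, y


# Hypothetical hourly records [CO, temperature]; the values are made up.
hourly = [[0.30 + 0.01 * h, 20.0 + h] for h in range(8)]
X, y = make_windows(hourly, target_col=0, n_lags=5)
print(len(X), len(X[0]), len(X[0][0]))  # 3 5 2 (samples, timesteps, features)
```

Because every sample only looks backwards in time, splitting the windows by year afterwards keeps future observations out of the training set.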

Model Development
To analyze the impact of saline sea pollution, we performed three tasks, and the models in our paper are accordingly divided into three parts. Each part had its own data needs; hence, separate models were developed, evaluated, and validated for each. First, we created time series forecasting models for hourly air pollutant concentration prediction, which include hourly CO and O3 prediction and hourly PM2.5 and PM10 prediction. Then, we predicted daily PM2.5 and PM10 from satellite data in the area, highlighting the impact of degrading saline seas on the air. Finally, we created prediction models showing the impact of saline sea air quality and decreasing surface area on health, particularly on asthma.

Hourly CO and O3 Prediction
To perform hourly forecasting of CO and O3, we have improved three base models, namely the LSTM, CNN and DFN [26], using the previous 5 h of data, as shown in Figure 1. The previous 5 h of NO2, SO2, O3, CO, PM10, wind speed, temperature, relative humidity and barometric pressure are given as inputs to each model to predict the upcoming concentration of CO in the air. The previous 5 h of O3, CO, NO2, relative humidity, temperature and wind speed are given as inputs to each model to predict the upcoming concentration of O3 in the air.

1. LSTM: The first step in model development is to transform the input data into the 3D format required by the LSTM. One of the advantages of this model is that it retains the time aspect of the data and helps identify complex nonlinear relationships in the data, whereas statistical models focus only on linearity. We went through various iterations of fine-tuning the model by changing the LSTM units per layer, adding additional LSTM layers and selecting different features. We obtained the best results for carbon monoxide with 5 past hours of data and 50 LSTM layers, and further tuned the model to avoid overfitting by adding L1 regularizers and a dropout of 0.5. We also used early stopping with a patience value of 20 to save the best model, as shown in Figure 1a. Similar to CO, a model for O3 is developed with a dropout rate of 0.4. It had nine input features from the past 5 h.
2. CNN: A one-dimensional CNN is used. We have one CNN layer followed by a pooling layer, then tune the number of hidden layers with "ReLU" activation and add a dropout layer if required. For CO prediction, we only added a dropout layer after the pooling layer and after fully connected layer 1. The final layer has one output without any activation function. Figure 1b shows the CNN model architecture for hourly CO prediction.
3. DFN: The DFN model was employed to forecast air pollutants with our data, in which the LSTM layer includes 24 LSTM memory units. We removed the flexible dropout layer for ozone prediction. For CO prediction, we used the DFN model with a flexible dropout layer, whose dropout rate is given by 0.19 + 0.0025 × g [26]. With the window size g chosen as 5 h for our data, the rate is 0.2025, as shown in Figure 1c.
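Two of the training settings above lend themselves to a short worked sketch: the flexible dropout rate 0.19 + 0.0025 × g and early stopping with a patience of 20. The code below is an illustrative, framework-free approximation; the function names and the validation-loss sequence are hypothetical, and the actual training presumably relies on the deep learning library's own early-stopping callback.

```python
def flexible_dropout_rate(window_hours: int) -> float:
    """Dropout rate 0.19 + 0.0025 * g for window size g, as quoted from [26]."""
    return 0.19 + 0.0025 * window_hours


def best_epoch_with_patience(val_losses, patience=20):
    """Return (best_epoch, stopped_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs; the weights from best_epoch are the ones kept.
    """
    best_loss, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


print(round(flexible_dropout_rate(5), 4))  # 0.2025

# Hypothetical validation losses: improve for 3 epochs, then plateau.
losses = [0.9, 0.5, 0.4] + [0.45] * 30
print(best_epoch_with_patience(losses, patience=20))  # (2, 22)
```

In the hypothetical run, the best loss occurs at epoch 2 and training halts at epoch 22, after 20 epochs without improvement.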

Hourly PM2.5 and PM10 Prediction
To perform the hourly forecasting of PM2.5 and PM10, we propose an ensemble model built from an RF regressor and a gradient boosting (GBoost) regressor. The models were individually optimized using the Bayes optimization method, which uses a surrogate function and the concept of the Bayes theorem to tune the hyperparameters of ML models. This method is efficient and fast for models with continuous and conditional parameters. The ensemble model takes the weighted average of the two models' outputs to make the final prediction. We assigned different weights to each model based on their individual scores; the RF regressor with a weight of 0.75 and the GBoost regressor with a weight of 0.25 gave the best results. To predict the upcoming hour of PM2.5 concentration in the air, the model is given 5 h of past values of seven features, which are PM2.5, PM10, barometric pressure, dew point, wind speed, humidity and temperature, as shown in Figure 2. Similarly, to predict the upcoming hour of PM10 concentration in the air, the model is given 5 h of past values of seven features, which are PM2.5, PM10, CO, NO2, O3, SO2 and wind speed.
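The weighted-average step can be sketched in a few lines. The example below blends two sets of predictions with the reported weights of 0.75 (RF) and 0.25 (GBoost); the function name and the prediction values are hypothetical and stand in for the outputs of the two tuned regressors.

```python
def weighted_average(rf_preds, gb_preds, w_rf=0.75, w_gb=0.25):
    """Blend two models' predictions with fixed weights that sum to 1."""
    if abs(w_rf + w_gb - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    if len(rf_preds) != len(gb_preds):
        raise ValueError("prediction lists must have the same length")
    return [w_rf * a + w_gb * b for a, b in zip(rf_preds, gb_preds)]


# Hypothetical next-hour PM2.5 predictions (µg/m³) from the two base models.
rf = [12.0, 20.0, 8.0]
gb = [16.0, 24.0, 4.0]
print(weighted_average(rf, gb))  # [13.0, 21.0, 7.0]
```

Giving the stronger model the larger weight keeps the blend close to its predictions while still letting the second model correct some of its errors.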

Daily Satellite-Based PM2.5 and PM10 Prediction
For the prediction of the PM2.5 and PM10 concentrations based on satellite data, we have developed three base models, which are RF, SVR and XGBoost. To obtain better performance, we created two ensemble models from these three models, a weighted average ensemble and a stacked ensemble, which were used for making the final prediction. To predict the daily PM2.5 concentration in the air, each model is given NDVI, distance to the Salton Sea, PM2.5, PM10, barometric pressure, dew point, wind speed, humidity and temperature as input. To predict the daily PM10 concentration in the air, each model is given NDVI, distance to the Salton Sea, PM2.5, PM10, CO, NO2, O3, SO2 and wind speed as input. Except for the inputs and model parameters, each model architecture for PM10 prediction is similar to that for PM2.5; therefore, we only show each model architecture for PM2.5.

1. RF: RF is a tree-based ensemble model. It is easy to use and performs well on large data. We use Grid Search CV with threefold cross-validation to optimize the parameters. The RF model architecture for daily PM2.5 prediction is shown in Figure 3a. The parameters of the RF model are "bootstrap: true, max_depth: 50, max_features: auto, min_samples_leaf: 2, min_samples_split: 2, n_estimators: 500" for PM10.

2. SVR: SVR is very similar to support vector machines (SVM), which can be used in classification and clustering problems. While iterating on SVR, we ran a Grid Search CV over the parameters, using threefold cross-validation, and searched different combinations in order to obtain a better result. We use SVM as our base model in the "PM concentration prediction using satellite images" part. The parameters of the SVR model are "kernel = 'rbf', degree = 3, gamma = 'auto', C = 100" for PM2.5, and "kernel = 'rbf', degree = 3, gamma = 'scale', C = 100" for PM10.

3. XGBoost: XGBoost is simple, efficient and easy to implement, which is suitable for dealing with a large number of pulsar candidates with an excellent generalization performance. The XGBoost model architecture for daily PM2.5 prediction is shown in Figure 3b.

5. Stacked ensemble model: The stacked ensemble model is created by using SVR, RF and XGBoost, in which the base learners are SVR, RF and XGBoost and the meta learner is LR. The SVR, RF and XGBoost models are used as the Level-0 stage, and our meta learner LR is used as Level-1 in order to find the target features for daily PM prediction. Figure 3d shows the stacked ensemble model with linear regression for daily PM2.5 prediction.
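As a rough illustration of the Level-0 and Level-1 arrangement, the sketch below builds a stacked regressor with scikit-learn's `StackingRegressor` on synthetic data. Several points are assumptions made to keep the example self-contained: `GradientBoostingRegressor` stands in for XGBoost, the hyperparameters other than the quoted SVR settings are library defaults or arbitrary, and the random features replace the satellite and station inputs.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data: 300 samples, 4 features (not the paper's dataset).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 4))
y = 10.0 * X[:, 0] - 4.0 * X[:, 1] + 2.0 * X[:, 2] + rng.normal(0.0, 0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Level-0 base learners; GradientBoostingRegressor stands in for XGBoost.
base_learners = [
    ("svr", SVR(kernel="rbf", gamma="scale", C=100)),
    ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
    ("gb", GradientBoostingRegressor(random_state=0)),
]

# Level-1 meta learner: linear regression on cross-validated base predictions.
stack = StackingRegressor(
    estimators=base_learners, final_estimator=LinearRegression(), cv=3
)
stack.fit(X_train, y_train)
preds = stack.predict(X_test)
print(f"R2 on held-out synthetic data: {r2_score(y_test, preds):.3f}")
```

The meta learner is fitted on out-of-fold predictions of the base learners, which limits the leakage that would occur if it saw predictions made on the base learners' own training data.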

Asthma Prevalence Rate Prediction
For health impact prediction, we used air pollutant data, weather data and asthma data to predict the asthma prevalence rate among different age groups. Starting with RF, grid search was used to fine-tune each model. SVR was also included to obtain the lowest error rate, thus yielding a better-fitting model. The third model, elastic net regression (ENR), was employed to reduce overfitting problems in linear models and to eliminate the coefficients of unimportant attributes. GBoost, one of the most powerful algorithms in ML, focuses on minimizing the bias error by combining several weak learners to form a strong learner. We tuned multiple parameters of GBoost and found the optimal n_estimators value, which is critical for asthma prediction. To obtain better performance, we created two ensemble models from all four models discussed above: a weighted average ensemble and a stacked ensemble are implemented for making the final prediction. The input of each model is the data for NO2, SO2, O3, CO, PM2.5, PM10, wind speed, pressure, dew point, temperature and relative humidity, together with the health data. The output is the prediction of the ED asthma visits.

Case Study Results
In this section, the Salton Sea, one of the largest lakes in California, is taken as an example to analyze the environmental pollution and study its impact on health. The water level is decreasing, and the special terrain makes the Salton Sea react poorly to pollution. Not only do factories dump industrial waste into it, but the low precipitation rate also causes the Salton Sea to shrink. The proposed models are developed to forecast the air quality and dust emission due to the shrinkage of the Salton Sea, and their impacts on human health, as follows.

Hourly Air Pollutant Prediction Results
We have developed models for each of four pollutants: O3, CO, PM2.5 and PM10. These are the major pollutants impacting the Salton Sea area, and they are directly correlated with dust and temperature in the region. Based on these pollutants, the final Air Quality Level (AQL) is determined for the upcoming hour in the area. The Air Quality Index (AQI) level is based on the US EPA standards.
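The US EPA AQI is obtained by linear interpolation within a concentration breakpoint interval: AQI = (I_hi - I_lo) / (C_hi - C_lo) × (C - C_lo) + I_lo. The sketch below applies that formula to a single interval. The function name is hypothetical, and the breakpoint values in the example are illustrative only; the applicable values should be taken from the current EPA breakpoint table, which is not reproduced in this paper.

```python
def aqi_from_concentration(c, c_lo, c_hi, i_lo, i_hi):
    """Linearly interpolate an AQI value inside one breakpoint interval.

    c          : measured (or predicted) pollutant concentration
    c_lo, c_hi : concentration breakpoints that bracket c
    i_lo, i_hi : AQI values corresponding to c_lo and c_hi
    """
    if not c_lo <= c <= c_hi:
        raise ValueError("concentration lies outside the given breakpoint interval")
    return round((i_hi - i_lo) / (c_hi - c_lo) * (c - c_lo) + i_lo)


# Illustrative interval only (PM2.5 in µg/m³ mapped to AQI 51-100);
# check the official EPA table for the breakpoints currently in force.
print(aqi_from_concentration(30.0, c_lo=12.1, c_hi=35.4, i_lo=51, i_hi=100))  # 89
```

The overall index for an hour is then the highest AQI among the individual pollutants, which is how a predicted concentration for each pollutant can be turned into a single air quality level.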

1. Hourly CO and O3 Prediction: Three proposed models, which are the LSTM, the CNN and the DFN, were developed and evaluated for predicting the upcoming hour's CO and O3 concentrations. The models were trained for 100 epochs using the Adam optimizer, and the loss was evaluated using the mean squared error (MSE). Compared to the other two models, training of the LSTM model was stable and it had learned better after 100 epochs, with the training and validation losses close to each other. The comparison of results for the different models is shown in Table 5. We obtained the best results with the LSTM for both CO and O3. Test data samples with predicted and actual values are shown in Figure 5a,c for CO and O3, respectively. The line charts of predicted and actual CO and O3 values for the LSTM model are shown in Figure 5b,d, respectively. The red dotted lines (predicted) and blue lines (actual) for CO and O3 in the test data are very close to each other for the LSTM model. The comparison is drawn after rescaling the values to the original unit of the data, i.e., ppm. The results in Figure 5 show a strong relationship between the predicted and actual values, and our model gave accurate results with very small errors.

2. Hourly PM Prediction Using Meteorological Station Data: ML models were developed and tuned. DL models did not provide the best results here, so they were not used for PM prediction. We developed an ensemble model of RF and GBoost for PM2.5 and PM10 prediction. Test data samples with predicted and actual values are shown in Figure 5e,g for PM2.5 and PM10, respectively. The line charts of predicted and actual PM2.5 and PM10 values for the ensemble model of RF and GBoost are shown in Figure 5f,h, respectively. The results show that the model predicts accurately. The proposed models in this paper have high capabilities and strengths over other models for our targeted problem.
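The two evaluation quantities used throughout the results, the MSE loss and the R2 score, can be written out directly. The sketch below is a minimal pure-Python version with made-up values; it is meant only to make the definitions concrete and is not the evaluation code used for the reported tables.

```python
def mean_squared_error(actual, predicted):
    """Average of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical actual vs. predicted hourly values, for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
print(mean_squared_error(actual, predicted))  # 0.25
print(r2_score(actual, predicted))            # 0.8
```

An R2 of 1 means the predictions match the observations exactly, while an R2 of 0 means the model does no better than always predicting the mean of the observed values.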

Satellite-Based Daily Particulate Matter Prediction Results
The PM data from a meteorological station only reflects the PM concentration around that station. Therefore, in order to explore the PM2.5 and PM10 situation around the Salton Sea area and show the temporal and spatial change in particulate matter, RF, SVR and XGBoost models were developed to explore the relationship between NDVI from the satellite data and the ground-level PM concentration. The three proposed models were tuned using grid search. Additionally, we explored the stacked ensemble and weighted average ensemble methods to identify the best model for our research. Table 6 shows the performance of the three base models (SVR, RF and XGBoost) and the two ensemble models. For PM2.5 prediction, the stacked ensemble outperformed the other models with an R2 score of 0.76; hence, the stacked ensemble model is selected as the candidate model. For PM10 prediction, the weighted average ensemble model has the highest accuracy. We focus on showing the daily PM concentration map of the Salton Sea area, where the expanding dry lakebed is a significant source of dust from late spring to early summer [45]. Satellite-based PM2.5 and PM10 concentration maps were created to visualize the regions with the highest and lowest pollutant levels. Figures 6 and 7 show the distribution of PM2.5 and PM10 across different months, respectively. The red line represents the Salton Sea area. Each small square in the figures represents an area of 2 km by 2 km. The darker the square, the higher the PM concentration. The PM concentration around the Salton Sea and its surroundings can thus be displayed intuitively as ML-generated snapshots at different times and places, making it more convenient to compare and analyze the PM concentration. From the comparison, we can see that spring (from March to May) and summer (from June to August) have the highest concentrations. This may be because wind speed is high around this time.

Health Impact Prediction Results
Four models, RF, SVR, ENR, and GBoost, were developed to predict asthma prevalence in both counties of the Salton Sea region. We tuned the hyperparameters of all the estimators using GridSearchCV with cv = 5, and the best parameters it returned were used to configure the estimators for training. We also explored a weighted average ensemble and a stacked ensemble to identify the best model for our research. The weights for RF, SVR, ENR, and GBoost were set to 0.3, 0.1, 0.1, and 0.5, respectively, based on their individual performance. The weighted average model has an R2 score of 0.95. We then attempted to improve performance further with a stacked ensemble in which the base learners are RF, SVR, ENR, and GBoost and the meta learner is LR. The stacked ensemble outperformed the other models with an R2 score of 0.978 and was therefore selected as the candidate model for predicting the health impact. Table 7 compares the results of all the models.
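The two ensembling strategies described above can be sketched with scikit-learn as follows. This is an illustrative example under stated assumptions rather than the authors' code: the data are synthetic, the parameter grids are small placeholders, and LR is assumed to mean ordinary linear regression. Only the cv = 5 setting and the weights 0.3, 0.1, 0.1, and 0.5 come from the text.

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for pollutant/weather features and asthma rates (assumption).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = 50.0 + 4.0 * X[:, 0] + 2.0 * X[:, 1] ** 2 + rng.normal(0.0, 0.5, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Placeholder grids (assumption); the grids used in the paper are not stated here.
candidates = {
    "rf": (RandomForestRegressor(random_state=42), {"n_estimators": [100, 200]}),
    "svr": (make_pipeline(StandardScaler(), SVR()), {"svr__C": [1.0, 10.0, 100.0]}),
    "enr": (ElasticNet(max_iter=10_000),
            {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.2, 0.8]}),
    "gboost": (GradientBoostingRegressor(random_state=42),
               {"n_estimators": [100, 200]}),
}

# Tune each base learner with 5-fold cross-validated grid search.
best = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5)
    search.fit(X_train, y_train)
    best[name] = search.best_estimator_

# Weighted average ensemble with the weights reported in the text.
weights = {"rf": 0.3, "svr": 0.1, "enr": 0.1, "gboost": 0.5}
weighted_pred = sum(w * best[name].predict(X_test) for name, w in weights.items())
weighted_r2 = r2_score(y_test, weighted_pred)

# Stacked ensemble: a linear meta learner is fitted on out-of-fold
# predictions of the four base learners.
stack = StackingRegressor(list(best.items()),
                          final_estimator=LinearRegression(), cv=5)
stack.fit(X_train, y_train)
stacked_r2 = r2_score(y_test, stack.predict(X_test))

print(f"weighted average R2 = {weighted_r2:.3f}, stacked R2 = {stacked_r2:.3f}")
```

The weighted average fixes the contribution of each base learner in advance, whereas stacking lets the meta learner fit those contributions from cross-validated predictions. That difference is one plausible reason a stacked model can score higher on the same base learners.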

Table 8 shows the performance of selected models from various studies that use time series data for air quality forecasting. DL models tend to perform better because they can capture nonlinear relationships across time steps. For our targeted problem, the models proposed in this paper compare favorably with these other models. They also make it straightforward to implement and evaluate multivariate time series forecasting, including forecasts over multiple time steps if required. Once the individual pollutants were predicted, we calculated the overall air quality index (AQI) for each hour and compared the original and predicted AQI values. Table 9 shows this comparison. The final accuracy for the AQI level is 86.7%.
Table 10 shows the performance of recently proposed models from various studies that use satellite data and other variables as input for particulate matter forecasting. For PM2.5 prediction, the proposed weighted average ensemble model of RFR, SVR, and XGB outperformed most of the other models with an R2 score of 0.76. For PM10 prediction, the proposed stacked ensemble model of RF, SVR, XGBoost and LR has the highest accuracy among all of the models.
Table 11 shows the performance of recently proposed models from various studies that use air quality and other variables as input for health prediction. The proposed stacked ensemble model of RF, SVR, ENR, GBoost and LR outperformed the other models with an R2 score of 0.978.
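The hourly AQI comparison summarized in Table 9 can be illustrated with a minimal sketch. It assumes the common approach of interpolating a sub-index for each pollutant linearly between breakpoints, taking the maximum sub-index as the overall AQI, and scoring how often the predicted AQI level matches the original one. The breakpoint table is illustrative: it is truncated, patterned on US EPA 24 h tables, and should be verified before use. The sample concentrations and all function names are hypothetical.

```python
# Illustrative breakpoints (assumption): (C_low, C_high, I_low, I_high).
BREAKPOINTS = {
    "pm25": [(0.0, 12.0, 0, 50), (12.1, 35.4, 51, 100),
             (35.5, 55.4, 101, 150), (55.5, 150.4, 151, 200)],
    "pm10": [(0.0, 54.0, 0, 50), (55.0, 154.0, 51, 100),
             (155.0, 254.0, 101, 150), (255.0, 354.0, 151, 200)],
}
LEVELS = [(50, "Good"), (100, "Moderate"),
          (150, "Unhealthy for Sensitive Groups"), (200, "Unhealthy")]


def sub_index(pollutant, conc):
    """Linearly interpolate the AQI sub-index for one pollutant."""
    for c_lo, c_hi, i_lo, i_hi in BREAKPOINTS[pollutant]:
        if conc <= c_hi:
            conc = max(conc, c_lo)  # snap values in the gap between bands
            return round((i_hi - i_lo) / (c_hi - c_lo) * (conc - c_lo) + i_lo)
    return BREAKPOINTS[pollutant][-1][3]  # clip above the truncated table


def overall_aqi(sample):
    """Overall AQI is the maximum sub-index over all pollutants."""
    return max(sub_index(p, c) for p, c in sample.items())


def level(aqi):
    """Map an AQI value to its category name."""
    for upper, name in LEVELS:
        if aqi <= upper:
            return name
    return LEVELS[-1][1]


# Hypothetical original and predicted concentrations for three hours.
observed = [{"pm25": 10.0, "pm10": 40.0},
            {"pm25": 30.0, "pm10": 60.0},
            {"pm25": 40.0, "pm10": 100.0}]
predicted = [{"pm25": 11.0, "pm10": 45.0},
             {"pm25": 33.0, "pm10": 50.0},
             {"pm25": 34.0, "pm10": 120.0}]

matches = [level(overall_aqi(o)) == level(overall_aqi(p))
           for o, p in zip(observed, predicted)]
level_accuracy = sum(matches) / len(matches)
print(f"AQI level accuracy: {level_accuracy:.1%}")
```

Accuracy is computed on the level rather than on the AQI value itself. As a result, a small concentration error that crosses a breakpoint counts as a miss, while a larger error that stays inside one band does not.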

Conclusions
In this paper, we have proposed forecasting models to predict the health impacts caused by air quality in the Salton Sea region. The work is divided into three main parts: environmental air pollutants, particulate matter, and asthma ED visits. Each model performed relatively well. Firstly, for hourly air pollutant forecasting, the LSTM model was applied to both O3 and CO forecasting using the previous 5 h of all pollutants and weather conditions. The model was trained with the MSE loss function and the Adam optimizer, and it obtained the best results, with low error and an R2 score above 0.9 for ozone prediction. Secondly, for hourly PM2.5 and PM10 prediction, a weighted average ensemble model based on RF and GBoost was proposed, using the previous 5 h of air pollutants and weather conditions. The models were tuned with Bayesian optimization to obtain the best result. Then, for daily particulate matter prediction, the proposed weighted average ensemble obtained the best result for PM2.5 and the stacked ensemble obtained the best result for PM10, based on a comparison of R2, RMSE and MAE values. Finally, for the health impact study, we used the SVR, ENR, RF and GBoost models to predict asthma ED visits. The stacked ensemble, with an R2 score of 0.978, was selected as the candidate model over the weighted average method.
We have two goals for future work. First, our specific goal is to enhance our satellite data by collecting data from additional satellites, which would allow a finer prediction of PM concentration around the Salton Sea. Second, our broad goal is to develop real-time health impact prediction dashboards that highlight the relationship between various environmental factors and asthma prevalence rates for more cities around the Salton Sea.
Data Availability Statement: Publicly available datasets were analyzed in this study. These data can be found here: https://saltonsea-air-health.herokuapp.com/, accessed on 13 May 2022.