In the Seeking of Association between Air Pollutant and COVID-19 Confirmed Cases Using Deep Learning

The COVID-19 pandemic has raised awareness of how the fatal spread of an infectious disease affects economic, political, and cultural sectors, with far-reaching social implications. Across the world, strategies aimed at quickly recognizing risk factors have helped shape public health guidelines and direct resources; however, these factors are challenging to analyze and predict because the events are still unfolding. This paper investigates the association between air pollutants and COVID-19 confirmed cases using deep learning. The dataset combines daily confirmed cases and air pollutant data for Delhi, India. We trained LSTM models on combinations of COVID-19 confirmed cases and AQI parameters over four lag times of 1, 3, 7, and 14 days. The findings indicate that the CO model performed best, with an average RMSE of 13, followed by pressure at 15, PM2.5 at 20, NO2 at 20, and O3 at 22.


Introduction
Although we remember that more than 300,000 people died in the United States during 26 January-3 October 2020, with two thirds of those deaths directly attributed to COVID-19 [1], we should also consider what the latest science says about the pandemic. People who live in places with severe air pollution face several hazards to their respiratory health throughout this outbreak. Recent research focuses on the correlation between air pollution and severe COVID-19 illness, emphasizing the need for everyone to breathe clean air. Research published in December 2020 attempted to assess the extent to which COVID-19 mortality is attributable to long-term exposure to fine particle pollution [2]. Using a combination of epidemiological data, satellite data, and other monitoring data worldwide, the researchers concluded that chronic air pollution might be responsible for 15% of COVID-19 fatalities globally [2]. The experts also distinguished air pollution generated by fossil fuels and 1.
To train the integration of COVID-19 Confirmed Case and AQI parameters in four different lag times, 1, 3, 7, and 14 days, using long short-term memory (LSTM) deep learning.

3. To evaluate and compare the RMSE values for the trained models.
This paper contributes to research on the correlation and prediction analysis of air pollutants and COVID-19 by combining different approaches, namely lag times and LSTM methods together with correlation analysis.

Background Review and Related Work
Cui et al. [9] discovered that the residents of a severely polluted area of China were more likely to die of SARS than residents in a less polluted area. Kan et al. [10] discovered that increases in particulate matter air pollution enhanced the probability of dying from the disease during the 2003 SARS pandemic. Numerous viruses, including adenovirus and influenza virus, have been proven to be transmitted by air particles. Zhao et al. [11] concluded that particulate matter was probably a factor in the propagation of the 2015 avian influenza. According to Chen et al. [12], air pollution can hasten the spread of respiratory diseases.

Research on Association of Air Pollutant and COVID-19
Researchers have studied the association between air pollution and COVID-19 in various countries. We reviewed their work to inform our study, as follows.
Zhu et al. [13] investigated the association between ambient air pollution and coronavirus infection. Daily confirmed cases, air pollution concentrations, and climatic data were collected in 120 cities in China between 23 January 2020 and 29 February 2020. They used a generalized additive model to examine the relationships between six air pollutants (PM2.5, PM10, SO2, CO, NO2, and O3) and confirmed cases of COVID-19.
Gupta et al. [14] estimated the increased risk of coronavirus disease, caused by severe acute respiratory syndrome coronavirus 2, by establishing a link between the mortality rate of infected individuals and air pollution, specifically particulate matter (PM) with aerodynamic diameters of 10 µm and 2.5 µm. Data from nine Asian cities were studied using statistical techniques such as analysis of variance and regression modeling.
Lolli et al. [15] quantified the relationship between COVID-19 transmission and meteorological and air quality indices in two major urban areas, Milan and Florence, as well as the autonomous province of Trento. Milan, the capital of the Lombardy region, is often regarded as the heart of Italy's COVID-19 epidemic.
Bashir et al. [16] investigated the relationship between COVID-19 and climatic indicators in New York City, United States of America. They analyzed secondary public data from the New York City Department of Health and the National Weather Service in the United States of America. The average temperature, lowest temperature, maximum temperature, rainfall, average humidity, wind speed, and air quality are all covered in the research. The Kendall and Spearman rank correlation tests were used to analyze the data.
Suhaimi et al. [17] investigated the relationships between air quality, climatic variables, and COVID-19 cases in Kuala Lumpur, Malaysia. The Department of Environment Malaysia provided air pollutants and meteorological data from 2018-2020, whereas the Ministry of Health Malaysia provided daily new COVID-19 case data in 2020.
Mehmood et al. [18] used geospatial tools to analyze the relationship between COVID-19 cases, air pollution, meteorological, and socioeconomic characteristics in three provincial capital cities and the federal capital city of Pakistan.
Hoang and Tran [19] investigated the temporal association in seven metropolitan centers and nine regions across Korea using the generalized additive model. The findings indicate a substantial nonlinear relationship between daily temperature and verified COVID-19 cases.
Travaglio et al. [20] matched current SARS-CoV-2 cases and fatalities from public databases to regional and subregional air pollution data collected across England.
In Singapore, Lorenzo et al. [21]

| Author (Year) | Objective | Location | Finding |
| --- | --- | --- | --- |
| | | | There is a positive association between a region's degree of air pollution and the mortality associated with COVID-19, demonstrating that air pollution is a significant and hidden factor exacerbating the worldwide burden of COVID-19-related mortality. |
| Lolli et al. (2020) | The correlation between meteorological and air quality indicators and COVID-19 transmission was quantified. | Northern Italy, Milan, and Florence | Although elements such as temperature and humidity are inversely correlated with viral transmission, air pollution (PM2.5) is positively correlated (to a lesser degree). |
| Bashir et al. (2020) | The connection between COVID-19 and climatic factors was analyzed. | New York City, USA | The COVID-19 pandemic was significantly associated with average temperature, minimum temperature, and air quality. |
| Mehmood et al. (2021) | Used geospatial approaches to examine the connection between COVID-19 cases, air pollution, meteorological, and socioeconomic characteristics. | Three out of four provinces of Pakistan (Punjab, Sindh, Khyber Pakhtunkhwa) | The findings reveal that daily COVID-19 cases are positively linked with PM2.5 and other meteorological variables, implying that climate has a significant role in determining the COVID-19 incidence rate in Pakistan. |
| Hoang and Tran | The generalized additive model was used to evaluate the temporal connection between ambient air pollution, weather, and COVID-19 infection. | Seven metropolitan cities and nine provinces across Korea | Daily temperature had a substantial nonlinear relationship with confirmed COVID-19 cases. |
| Travaglio et al. (2021) | Compared recent SARS-CoV-2 cases and fatalities from public databases with regional and subregional air pollution data collected at several locations. | England | There is a positive correlation between COVID-19 mortality and infectivity and air pollution concentrations, notably nitrogen oxides. |
| Lorenzo et al. (2021) | To determine the relationship between core air pollutant concentrations, climatic factors, and daily confirmed COVID-19 cases. | Singapore | There is a statistically significant positive correlation between NO2, PSI, PM2.5, and temperature and COVID-19 case numbers. |
| Mandalapu et al. | The link between air pollution and COVID-19 severity has been studied at the regional and metropolitan levels, but it is uncertain if this link holds true at the neighborhood level. | Los Angeles County, California | Eighteen of the twenty-three significant comparisons for the COVID-19 weekly death rate confirmed that NO2 levels were higher in neighborhoods with higher COVID-19 weekly death rates. Similarly, 12 of the 19 comparisons confirmed the same relationship with CO levels, 14 of the 23 comparisons with ozone levels, and 6 of the 6 comparisons with PM10. |
| Sidell et al. (2022) | To examine both long-term and short-term air pollution exposure, as well as COVID-19 occurrence, from 1 March 2020 to 28 February 2021. | Southern California | In all case peaks before February 2021, long-term PM2.5 and NO2 exposures were linked to an elevated probability of COVID-19 occurrence. Short-term exposures to PM2.5 and NO2 were also linked. Air pollution may have a role in raising the likelihood of COVID-19 infection. |
| Luo et al. (2022) | This study assessed the relationship between population movement and air quality in 332 Chinese cities from January to March (2019-2021), and the influence of three city factors (pollution level, city scale, and lockdown status) on this relationship. | 332 Chinese cities | Lower migration was linked to lower pollution levels (other than O3). Susceptibility to pollution changes is more probable as NO2 decreases and O3 increases, whereas insusceptibility to pollution is more likely for CO and SO2, and in cities with low migration. Cities with less air pollution and dense populations may benefit the most from lowering PM10 and PM2.5. Cities with rigorous traffic restrictions show stronger links between population movement and air pollution than cities without restrictions. The impacts of inter-city migration (ICM) and within-city migration (WCM) on air pollution were found to be minor when city characteristics were considered. |
| Abdullah et al. | To examine the connection between the Air Pollution Index (API) and COVID-19 infections. | | |
| Huang et al. (2022) | Data on air pollution and confirmed COVID-19 cases were collected from five severely affected cities in three South American nations. COVID-19's spread was measured using the daily real-time reproduction number (Rt). The influence of environmental contaminants on the pandemic was investigated using two commonly used models: generalized additive models (GAM) and multiple linear regression. | South America | (1) In all five locations, Rt, which potentially represents COVID-19 dissemination, exhibited a progressive drop. (2) Rt had a substantial effect on PM10 and SO2 in all of the locations studied. These two contaminants should be better monitored by regulators. (3) In cities with varying levels of air pollution, the link between air pollution and the spread of COVID-19 varied. The results indicate that there is a significant relationship between air pollution and COVID-19 infection. |

Research on Prediction of Air Pollutant and COVID-19 Using Deep Learning
Aragão et al. [27] examined climate factors as extra features in a data-driven multivariate prediction model to predict the number of COVID-19 deaths in Brazilian states and major cities in the near future. The basic premise is that including these climatic characteristics as inputs to data-driven model training increases prediction performance compared with single-input models. For both the multivariate and univariate models, the training adopted a stacked LSTM as the network architecture. Using the mean fitting error, average forecast error, and the profile of the cumulative deaths for the forecast as evaluation criteria, the tests revealed that the best multivariate model was more skillful than the best standard data-driven univariate model they found. These findings suggest that using additional important variables as input to a multivariate method may further improve the quality of prediction models.
Al-Qaness et al. [28] presented an improved version of the adaptive neuro-fuzzy inference system (ANFIS) for forecasting the air quality index in Wuhan City, China. Their PSOSMA is a modified meta-heuristic (MH) algorithm in which the slime mold algorithm (SMA) is enhanced with the particle swarm optimizer (PSO) to increase ANFIS performance. The proposed PSOSMA-ANFIS was trained using three years of air quality index time series data and then used to forecast fine particulate matter (PM2.5), sulfur dioxide (SO2), carbon dioxide (CO2), and nitrogen dioxide (NO2) for a year. The proposed PSOSMA was also compared with various MH algorithms used to train ANFIS. The results showed that the improved ANFIS incorporating PSOSMA outperformed the other methods.
Zhou et al. [29] addressed COVID-19 forecasting by incorporating the relevance of government interventions into their proposed model, the Interpretable Temporal Attention Network (ITANet). The proposed model has an encoder-decoder architecture that uses long short-term memory (LSTM) for temporal feature extraction and multi-head attention to capture long-term dependencies. ITANet outperforms other models in forecasting new confirmed COVID-19 cases.
Saravanan et al. [30] described the impact of lockdown measures on air quality and rainwater accumulation in major cities. With respect to varying time length and climatic variables, the effects of COVID-19 on the environment during lockdown conditions were compared with those without lockdown conditions. During the lockdown, the concentrations of particulate pollution in Chennai, Bangalore, Delhi, and Melbourne were measured. The findings of this research indicate the effects of government actions and give a detailed perspective of the death rate in relation to air quality decrease.
Xu et al. [31] created three deep learning models, CNN, LSTM, and CNN-LSTM, to forecast the number of COVID-19 cases for Brazil, India, and Russia. Among the models constructed in that study, the LSTM model had the best forecasting performance, which indicates an improvement in prediction accuracy over certain existing models.
Fu et al. [32] used public data sets from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE), the Air Quality Open Data Platform, the China Meteorological Data Network, and the WorldPop website. The Dual-link Bi-GRU Network predicts the epidemic scenario, and the Gauss-Newton iteration method quantifies the relationship between epidemic spread and other feature parameters. Among the selected features, the study found that population density had the strongest positive link with pandemic spread, followed by the number of landing planes.
Mumtaz et al. [33] proposed an indoor air quality monitoring and prediction system based on recent Internet of Things (IoT) sensors and machine learning capabilities, which can assess a variety of indoor pollutants. An IoT node including numerous sensors for eight pollutants, including NH3, CO, NO2, CH4, CO2, PM2.5, as well as the ambient temperature and air humidity, was designed for this purpose. With an accuracy of 99.37%, precision of 99%, recall of 98%, and F1-score of 99%, this model has shown promise in forecasting air pollutant concentrations as well as overall air quality.

LSTM Network
In nonlinear sequence prediction problems, the LSTM network is a recurrent neural network (RNN) architecture that can learn order dependence. It can retain information over long periods. The memory cell, which replaces the hidden-layer neurons of a classic network, is at the foundation of the LSTM network [34]. Like other RNNs, LSTM networks feature recurrent cells, but instead of a single neural network layer, the recurrent cell has interacting input, output, and forget gates [35]. The cell remembers values over arbitrary time intervals, and these three gates control the flow of information into and out of the cell. Based on the previous state, the available memory, and the current input, this structure lets the LSTM decide which cell values are kept and which are suppressed. LSTM networks were created to address the vanishing gradient problem that can occur when training traditional RNNs. Because there may be lags of unknown duration between important events in a time series, LSTM networks are well suited to classifying, processing, and making predictions based on time series data. In many cases, LSTM has an advantage over standard RNNs, hidden Markov models, and other sequence learning approaches due to its relative insensitivity to gap length; therefore, we selected LSTM as the model to predict from the integrated COVID-19 and air pollutant data.
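For reference, the gate mechanism described above is commonly written as the following update equations. This is the standard LSTM formulation, given here as an assumption, since the paper does not state which variant was used. The symbols are illustrative: x_t is the input at time t, h_t the hidden state, c_t the cell state, \sigma the logistic sigmoid, \odot the element-wise product, and W, U, b the learned weights and biases.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)} \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
h_t &= o_t \odot \tanh\left(c_t\right) && \text{(hidden state)}
\end{aligned}
```

Because c_t is updated additively and scaled by the forget gate rather than passed repeatedly through a squashing nonlinearity, gradients can survive across many time steps, which is the usual explanation for the LSTM's robustness to long gaps.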

Materials and Methods
In this section, we present the materials and methods, including the dataset used in this paper, the research workflows, and the LSTM training method.

Dataset
The dataset was extracted from different resources, as follows. Based on these resources, we obtained the parameters, as described in Table 2.

Research Workflows
First, we collected the dataset from two resources. The first is the Indian government's per-state COVID-19 dataset, from which Delhi was selected. The second is the AQI parameters by city/state (again, we selected Delhi). Then, we integrated these two resources on a per-day basis. After that, we prepared separate datasets with lag times of 1, 3, 7, and 14 days between air pollution and COVID-19 confirmed cases. Most approaches to lag time selection rely on trials to determine the best time lags, which may not always be sufficient in real-world circumstances [36][37][38][39]. Because these approaches are based on trial and error, they require training several models multiple times in order to identify the best among them. Next, we trained LSTM models on the dataset and used the models for prediction [40][41][42]. Figure 1 shows the workflow of this research.
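The per-day integration and lag separation can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that a lag of k days pairs the pollutant reading on day t with the confirmed cases on day t + k, and the function name `align_with_lag` and the toy values are hypothetical.

```python
from datetime import date, timedelta


def align_with_lag(pollutant_by_day, cases_by_day, lag_days):
    """Pair each day's pollutant reading with the case count lag_days later.

    Assumption: the pollutant leads the case count by lag_days.
    Days without a matching case record are dropped.
    """
    pairs = []
    for day in sorted(pollutant_by_day):
        target_day = day + timedelta(days=lag_days)
        if target_day in cases_by_day:
            pairs.append((day, pollutant_by_day[day], cases_by_day[target_day]))
    return pairs


# Hypothetical toy data: ten consecutive days of PM2.5 and confirmed cases.
start = date(2020, 4, 1)
pm25 = {start + timedelta(days=i): 50.0 + i for i in range(10)}
cases = {start + timedelta(days=i): 100 * i for i in range(10)}

# One aligned dataset per lag time used in the paper.
aligned = {lag: align_with_lag(pm25, cases, lag) for lag in (1, 3, 7, 14)}
for lag, rows in aligned.items():
    print(f"lag={lag:2d} days -> {len(rows)} aligned rows")
```

Longer lags leave fewer aligned rows, which is worth keeping in mind when comparing error values across lag times on a short series.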

Data Preprocessing
In machine learning, data preparation is a critical step that helps improve data quality and facilitates the extraction of relevant insights from the data [43,44]. After we integrated COVID-19 confirmed cases and air pollutants, we completed the data preprocessing. The data preprocessing was conducted as follows.

1. For handling the missing values, we marked all NA values with 0.
2. To make sure that the calculations are fine, we ensured that all data were floats.
3. To complete the standardization of all values, we normalized all features.
4. Then, we converted our time-series data to a supervised learning problem.
5. For the model's training requirements, we split the dataset into training and test sets.
6. Next, we paired the input and outputs from the data sequence.
7. For LSTM model, we needed to reshape the input into 3D (samples, timesteps, features).
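The preprocessing steps above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' pipeline: min-max scaling is assumed for the normalization step, the 80/20 chronological split and the window length are hypothetical, the helper names are invented for the example, and the windowing is done before the split for brevity.

```python
def fill_missing(rows, fill=0.0):
    """Steps 1-2: replace missing values (None) with 0 and cast everything to float."""
    return [[fill if v is None else float(v) for v in row] for row in rows]


def min_max_normalize(rows):
    """Step 3: scale every feature column to [0, 1] (assumed min-max scaling)."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def to_supervised(rows, n_timesteps, target_col=0):
    """Steps 4, 6, 7: pair a window of past rows with the next target value.

    X is nested as (samples, timesteps, features), the 3D layout an LSTM expects.
    """
    X, y = [], []
    for t in range(n_timesteps, len(rows)):
        X.append(rows[t - n_timesteps:t])
        y.append(rows[t][target_col])
    return X, y


def split_ordered(X, y, train_fraction=0.8):
    """Step 5: chronological split, so no shuffling across time."""
    cut = int(len(X) * train_fraction)
    return X[:cut], y[:cut], X[cut:], y[cut:]


# Hypothetical rows: [confirmed cases, PM2.5], with one missing reading.
raw = [[10, 80], [12, None], [15, 90], [20, 100], [30, 95], [45, 110]]
clean = fill_missing(raw)
scaled = min_max_normalize(clean)
X, y = to_supervised(scaled, n_timesteps=3)
X_train, y_train, X_test, y_test = split_ordered(X, y)
print(len(X), len(X[0]), len(X[0][0]))  # samples, timesteps, features
```

Replacing missing values with 0 follows the listed step; note that after min-max scaling a filled 0 becomes the column minimum, which may or may not be the intended treatment of a missing sensor reading.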

LSTM Training Modelling
The design network for the training model is illustrated in Figures 2 and 3, as follows.
The training was based on the mean absolute error (MAE) loss with the Adam optimizer. We implemented early stopping (EarlyStopping) to avoid overfitting. The network was fit for up to 200 epochs with a batch size of 72.
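To make the training criteria concrete, the sketch below shows the MAE loss and the stopping rule that early stopping applies to a sequence of validation losses. It is a framework-independent illustration, not the authors' training code: the patience value is a hypothetical choice (the paper does not report one), and the validation losses are made-up numbers.

```python
def mean_absolute_error(y_true, y_pred):
    """MAE: the average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def early_stopping_epoch(val_losses, patience=10, max_epochs=200):
    """Return the epoch (1-based) at which training would stop.

    Training halts once the validation loss has failed to improve for
    `patience` consecutive epochs, or when max_epochs is reached.
    The patience default is an assumption, not a value from the paper.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return min(len(val_losses), max_epochs)


# Hypothetical validation losses: improvement stalls after the third epoch.
val_losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.38]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
example_mae = mean_absolute_error([1, 2, 3], [2, 2, 5])
print(stop_epoch, example_mae)
```

With early stopping in place, the 200-epoch setting acts as an upper bound rather than the number of epochs actually run.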
To make a prediction, the process is as follows.

Results
Based on the designed experiments, we have 28 models for comparison. The results are as follows.

Matrix Correlations
Based on the correlation matrix in Figure 4, three parameters show the strongest positive correlations with confirmed cases: pressure, NO2, and PM2.5 at 0.53, 0.45, and 0.42, respectively.
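A correlation matrix of this kind can be computed as sketched below. The paper does not name the coefficient, so Pearson correlation is assumed here; the column names and values are hypothetical toy data, not the Delhi dataset.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy columns, chosen so the expected signs are obvious.
data = {
    "cases": [1, 2, 3, 4],
    "pressure": [2, 4, 6, 8],
    "o3": [4, 3, 2, 1],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```

Values near 0.4 to 0.5, as reported for pressure, NO2, and PM2.5, indicate a moderate linear relationship under this coefficient.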

Model Training Results
The purpose of the learning algorithm is to find a good fit that lies between an overfit and an underfit model. A good fit is characterized by training and validation losses that decline to a point of stability with only a slight difference between the two final loss values. The model's loss is almost always lower on the training dataset than on the validation dataset, so some gap between the training and validation learning curves should be expected. The learning curves indicate a good fit if the training loss drops to a point of stability and the validation loss also reaches a point of stability with only a small gap between it and the training loss, as shown in Figures 5-18. Figures 5, 7, 9, 11, 13, 15 and 17 illustrate that all the learning curves were a good fit; therefore, when plotted, the predictions are close to the test set, as shown in Figures 6, 8, 10, 12, 14, 16 and 18.

• COVID-19 Confirmed Cases and all AQI Parameters

RMSE and Variance Model Comparison
The RMSE comparison graph in Figure 19 illustrates the comparison of LSTM models of COVID-19 confirmed cases and air pollutant parameters. The dataset was divided into four lag times: 1, 3, 7, and 14 days.
1. The first model contains all air pollutant parameter training, which uses 12 parameters, PM2.5, PM10

The explained variance is a metric for how much of the variability in the observed values is accounted for by a machine learning model's predictions. To put it another way, it reflects how closely the predicted values track the actual values. Understanding how much information is lost when a model is fitted to the dataset is important. As a rule of thumb, a model should explain at least 60% of the variation, and the goal is a score close to 1. From the variance score comparison graph in Figure 20, it can be seen that the fourth model (pressure parameter training) has excellent variance scores of more than 0.9, whereas the first model has the worst variance scores. The NO2, PM2.5, CO, and humidity models have scores of more than 0.8, which is also acceptable.
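The two metrics compared in this section can be computed as in the sketch below. It assumes the usual definitions, RMSE as the square root of the mean squared error and explained variance as 1 - Var(y - ŷ) / Var(y); the sample values are hypothetical and are not results from the paper.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error; lower is better."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


def variance(values):
    """Population variance of a sequence."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def explained_variance(y_true, y_pred):
    """Explained variance score; 1.0 is a perfect score, higher is better."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - variance(residuals) / variance(y_true)


# Hypothetical actual and predicted case counts.
actual = [3, 5, 7, 9]
predicted = [2, 5, 8, 9]
error = rmse(actual, predicted)
score = explained_variance(actual, predicted)
print(f"RMSE={error:.4f}, explained variance={score:.2f}")
```

RMSE is expressed in the units of the target, so values are comparable only across models trained on the same target scale, whereas the explained variance score is unitless.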

Conclusions and Future Work
This study investigates the correlation of COVID-19 confirmed cases with AQI parameters using deep learning. The dataset was divided into four lag times: 1, 3, 7, and 14 days. The lag time experiments show that the 1-day lag time gives an excellent RMSE, and the deep learning models captured the association between COVID-19 and air pollutants well in the 1-day lag scenario. We also computed the correlation matrix, and the results rank the parameters as pressure, NO2, PM2.5, PM10, CO, and O3, followed by humidity; however, this order differed when we trained the deep learning models. Seven LSTM models were trained, pairing COVID-19 confirmed cases with all air pollutant parameters, PM2.5, NO2, pressure, O3, CO, and humidity, respectively. Among the trained models, the CO model performed best, with an average RMSE of 13. CO is followed by pressure at 15, PM2.5 at 20, NO2 at 20, O3 at 22, humidity at 37, and finally, all air pollutant parameters at 76. Based on these findings, we assume that CO, pressure, NO2, and PM2.5 play a significant role in COVID-19 confirmed rates. In the future, more machine learning algorithms can be applied to compare these results. Moreover, other data resources, such as people mobility, social media, and other countries' data, might be analyzed in further experiments.

Taiwan under Grant No. TCVGH-T1087804, TCVGH-T1097801, TCVGH-T1107803, TCVGH-1107201C, TCVGH-T1117803, TCVGH-NK1099003, and TCVGH-1103602D.
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: The data are available in a publicly accessible repository that does not issue DOIs. Publicly available datasets were analyzed in this study. This data can be found here: https://api.COVID-19india.org/csv/latest/states.csv and https://aqicn.org/data-platform/COVID-19/ (accessed on 1 January 2022).

Conflicts of Interest: The authors declare no conflict of interest.