1. Introduction
Clean air is paramount for healthy human life, which makes air quality maintenance an integral part of public health policy. However, in recent years, increasing urbanization, industrialization, and deforestation have made air pollution an increasingly pressing issue. Air pollution is caused primarily by the introduction of harmful chemical, biological, and particulate matter into the atmosphere. Among these dangerous materials, the most common and abundant is particulate matter 2.5 (PM2.5). This fine particulate matter is composed of a mixture of solid and liquid particles in the air. Its abundance above a certain threshold leads to smog and a hazy environment. When inhaled, these particles cause various cardiac and pulmonary problems. There is also a correlation between air pollution and meteorological conditions: factors such as wind, rain, temperature, pressure, ultraviolet radiation, and humidity can affect air pollution in a region. Therefore, a thorough understanding of the weather is essential for properly analyzing a region’s air quality.
Pakistan in particular has suffered the full brunt of this crisis. Many megacities have been suffering from smog and haze, resulting in various health problems for the residents. According to the World Health Organization (WHO), the air quality in Pakistan is generally considered unsafe. The most recent data published by the WHO indicate that the PM2.5 concentration across the region is, on average, 58 µg/m³, which is higher than the prescribed safe limit of 10 µg/m³ [1]. The PM2.5 concentration is exceptionally high in urban centers such as Lahore, Karachi, and Islamabad. In 2019, with an extremely high PM2.5 concentration of 68 µg/m³, Pakistan was declared the world’s second most polluted country by the World Air Quality Report published by IQAir [2]. Figure 1, Figure 2 and Figure 3 depict the PM2.5 concentration from 2019 to 2021.
An efficient and streamlined monitoring system needs to be developed through which the government can record and forecast air quality levels. This intervention would allow the local populace to take precautionary measures in the event of deterioration in air quality levels. At the same time, it would help the government bodies in terms of evidence-based policymaking for air pollution abatement. However, such a robust system requires a highly efficient forecasting model, which can accurately predict air quality over time. We tried to fill this gap using machine learning algorithms to develop an air quality prediction model. The main contributions of this work are as follows.
Data on air quality, air pollutants (PM2.5), and meteorological conditions for multiple cities in Pakistan were combined to produce a novel dataset.
The impact of different meteorological conditions, such as temperature, humidity, precipitation, wind speed, dew point, and pressure, on the daily PM2.5 concentration in multiple Pakistani cities was analyzed.
Several machine and deep learning models, including multivariate FbProphet, LSTM, and LSTM encoder–decoder, were used for the daily and hourly forecasting of PM2.5 levels across numerous cities in Pakistan.
2. Related Work
Various environmental protection agencies (EPAs) have offered a variety of methodologies for calculating the air quality index (AQI). While most of these agencies have shifted toward state-of-the-art machine learning techniques for forecasting the AQI, many still rely on mathematical calculation. In [3], the author employed the popular machine learning method of support vector regression (SVR) to forecast pollutant and particulate levels and predict the resulting AQI value in California, USA. The author employed SVR with a radial basis function (RBF) kernel to obtain the most accurate prediction. Across the six AQI categories defined by the US Environmental Protection Agency, the proposed model achieved a high accuracy of 94.5%. The limitation of the proposed approach was its limited amount of data and parameters, especially for NO2 and PM2.5. In [4], the author suggested an AQI and NOx forecasting method using SVR and the random forest method. Their study showed that the SVR-based model performed better than the random forest model for forecasting AQI and NOx. In [5], the author proposed using multinomial regression and K-nearest neighbors to predict different AQI buckets. These buckets contained the overall classification of the AQI as good, moderate, and severe.
In [6], the author used the previous day’s temperature, humidity, dew point, wind speed, pressure, visibility, and precipitation as predictors in their ANFIS model. The author employed techniques such as collinearity tests and forward selection (FS) to minimize the cost and time of calculation. These techniques removed the redundant input variables and selected different input variable combinations. This method produced a different model for the different constituents of pollutant prediction with a better accuracy and reduced the computational time.
In [7], the author proposed the use of hybrid single decomposition (HSD) and hybrid two-phase decomposition (HTPD) for predicting the AQI one day in advance. Among all the models, HTPD was the most accurate. Their model successfully reduced the instability of the raw data and simplified the intrinsic complexities of daily AQI prediction.
In [8], the author found a clear relationship between visibility and the AQI. The author concluded that the AQI and image visibility were negatively correlated: as visibility increased, the AQI value decreased, and vice versa. The author used images taken at high and low PM2.5 concentrations to obtain high-frequency information. The SVR model was then updated using this data. This approach provided a rapid and cost-effective method for predicting the AQI. In [9], the author conducted a comparative analysis of air quality in Taiwan and London. The author analyzed the air quality at multiple stations in Taiwan and proposed an enhanced decision tree model that could predict air quality levels with an R² value of 0.71 and a root mean squared error (RMSE) of 7.06.
To improve the accuracy of forecasting air pollutants in Shenzhen, China, ref. [10] proposed a hybrid method consisting of ARIMA and the Prophet method. They applied this hybrid model to 11 stations in the city and performed an error evaluation. They found that the hybrid method improved the prediction results. However, the processing speed of the proposed method was slow compared to other machine learning approaches. Similarly, in [11], the author proposed using the ARIMA model to forecast air pollutant (NOx, SO2, SPM, and RSPM) levels for the next five years in Nanded city, Maharashtra, India.
It is necessary to conduct a complete analysis of all the factors that influence air pollution in a region. However, most research has been limited to the relationship between the weather and air pollution. In contrast to conventional methods, ref. [12] used XGBoost and Bayesian optimization to investigate environmental, demographic, economic, and meteorological causes. This case study, which was conducted in the USA, provided excellent results.
In [13], the author proposed forecasting air quality for the next 48 h using a combination of neural network models. These models included artificial neural networks (ANN), convolutional neural networks (CNN), and long short-term memory (LSTM). They made use of this model to extract spatial–temporal relations. Their model outperformed many state-of-the-art models. The only weakness in their approach was the noise in their data. This noise was due to the use of different machines for data collection, which decreased the accuracy of the results.
In [14], the author proposed using a hybrid ensemble model, CERL, for forecasting the hourly air quality in northwest China. The advantage of CERL was that it exploited the benefits of both feed-forward and recurrent neural networks. With this model, they forecasted air pollutants from 1 to 8 h ahead with a relatively high accuracy, varying from 1 to 20% with the step size.
In [15], the author used a novel deep learning method for forecasting PM2.5 concentrations in Beijing, China. The proposed architecture consisted of a hybrid deep learning model, which was a combination of a one-dimensional convolutional neural network (1D-CNN) and a bi-directional long short-term memory (Bi-LSTM). A 1D-CNN was used to extract local trends and spatial features, while a Bi-LSTM was used to learn spatial–temporal dependencies. The author conducted extensive experiments and achieved a satisfactory accuracy with this model.
Air pollutant concentration depends on various factors. These factors are usually either left unexplored or used in their entirety to forecast pollutants. In [16], the author suggested a new feature extraction method for air pollutant prediction, especially for PM2.5. The author proposed a causality-based linear method to extract the most relevant features for predicting PM2.5. Their findings showed that the proposed feature extraction vastly improved the prediction results.
In [17], the author proposed using a combination of RNN and LSTM to forecast O3 levels for the next 8 to 72 h. The author used a decision tree to identify the input variables of the highest importance and then used these features for training the model. The proposed model achieved a satisfactory accuracy, with a mean absolute error of less than 2 for the 72-h forecasting sequence. The disadvantage of this approach was that it utilized a limited number of features for training the model, which might bias the results. Similarly, the author analyzed the AQI and PM2.5 concentration in the Chinese city of Fuzhou. The author applied the ARIMA model to analyze and forecast the PM2.5 concentration from 2014 to 2016. The study concluded that the PM2.5 concentration was closely related to the season and was significantly higher in winter than in summer. This study was unique in that it was conducted on new data and analyzed the seasonality of PM2.5 over time.
The remainder of this study is organized as follows. Section 3 describes the methodology employed for the hourly and daily forecasting of PM2.5. Section 4 presents the results and analyzes them. Section 5 discusses the research findings. Finally, Section 6 concludes the study and discusses the limitations and future work in this domain.
3. Materials and Methods
This section describes the system architecture and all the models used to solve the problems highlighted in the objectives. It consists of four subsections. Section 3.1 defines the system architecture for the solution of the problem statement. Section 3.2 explains the process of collecting data from multiple sources. Section 3.3 deals with the preprocessing of data. Section 3.4 describes the methodology of implementing the proposed models to obtain the required forecast.
3.1. Model Architecture
The model architecture describes the overall steps taken to derive results from a system. Figure 4 depicts the model architecture for the proposed system. In the first step, the air quality and weather data are collected from the required sources. This data then undergoes the required data preprocessing and feature engineering. Next, the data is split, scaled, and passed to the proposed models, i.e., multivariate FbProphet, LSTM, and LSTM encoder–decoder, to forecast future PM2.5 levels. Finally, the proposed models are tested and compared after hyperparameter tuning.
3.2. Data Collection
The data in this study was collected primarily from two sources: air quality data from sensors located at US embassies across Pakistan [18] and meteorological data from the World Weather website [19]. Data was recorded at hourly intervals from mid-2019 to Feb 2021. The meteorological data consisted of hourly features such as time, humidity, temperature, precipitation (in inches), UV index, wind speed, cloud cover, visibility, dew point, and pressure. Similarly, the air quality data consisted of hourly features such as time, AQI, and PM2.5 concentration. Table 1 displays the statistical distribution of the Lahore, Islamabad, and Karachi datasets.
The dataset was divided into training and testing datasets. For the daily PM2.5 prediction, the training dataset covered mid-2019 to Jan 2021, while the testing dataset covered Jan 2021 to Feb 2021 (30 days). Similarly, for the hourly PM2.5 prediction, the training dataset covered mid-2019 to Feb 2021, while the testing dataset consisted of the last 72 h.
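The chronological split described above can be sketched as follows. This is a minimal illustration, not the study's code: the function name `chronological_split`, the record layout (timestamp, PM2.5 value), and the synthetic values are all assumptions. The point it shows is that the held-out window is taken from the end of the series, so no future observation leaks into training.

```python
from datetime import datetime, timedelta


def chronological_split(records, test_size):
    """Split time-stamped records so the last `test_size` rows form the test set."""
    if not 0 < test_size < len(records):
        raise ValueError("test_size must be between 1 and len(records) - 1")
    ordered = sorted(records, key=lambda row: row[0])  # order by timestamp
    return ordered[:-test_size], ordered[-test_size:]


# Synthetic hourly series (hypothetical values, not the study's data).
start = datetime(2021, 2, 1)
hourly = [(start + timedelta(hours=h), 50.0 + h % 24) for h in range(240)]

# Hold out the last 72 h, mirroring the hourly test window described above.
train, test = chronological_split(hourly, test_size=72)
print(len(train), len(test))
```

A random shuffle before splitting would mix future and past observations, which is why the split is done on the sorted series.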
3.3. Data Preprocessing and Feature Extraction
Data preprocessing is the process through which raw data is transformed to make it understandable and usable for the end user. It is essential for any analysis, as uncleaned, unprocessed data leads to poor results. Data quality directly influences the quality of the information derived from the results.
The collected data in this study was processed and made available for feature extraction through multiple preprocessing techniques such as missing value imputation, data cleaning, encoding of categorical features, and data scaling.
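Two of the preprocessing steps listed above, missing value imputation and data scaling, can be illustrated with a short sketch. The text does not state which imputation or scaling method was used, so forward filling and min-max scaling are assumptions here, as are the helper names `forward_fill` and `fit_min_max` and the sample values.

```python
def forward_fill(values):
    """Replace each None with the most recent observed value.

    Leading gaps (before any observation) are left as None.
    """
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def fit_min_max(train_values):
    """Return a scaler mapping the training range onto [0, 1]."""
    low, high = min(train_values), max(train_values)
    span = (high - low) or 1.0  # avoid division by zero for constant data
    return lambda value: (value - low) / span


# Hypothetical PM2.5 readings with two missing hours.
pm25 = [40.0, None, 60.0, 80.0, None, 100.0]
clean = forward_fill(pm25)

# Fit the scaler on the training portion only, then apply it everywhere.
scale = fit_min_max(clean[:4])
scaled = [scale(value) for value in clean]
print(clean)
print(scaled)
```

Fitting the scaler on the training portion only means test values can fall outside [0, 1], which is expected and avoids leaking test statistics into training.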
Feature engineering and selection is the part of machine learning that predominantly affects the overall efficiency of any proposed model. It requires a complete understanding of the data and a thorough analysis. This analysis not only helps in understanding the features of the raw data but also helps to create new ones. In this study, feature engineering was divided into four parts. The first part involved removing irrelevant features whose values remained the same throughout the data. These features included QC name, Loc ID, weather code, duration, and weather icon URL. The next part combined several features to produce a new one, e.g., hour, day, month, and year were combined to form a DateTime feature. Feature engineering also needs to resolve the problem of multicollinearity. Multicollinearity occurs when there is a strong relationship between two or more independent variables (features), and it can seriously impact the overall interpretability and generalization of a model. To address this problem, Pearson correlation was used to identify strongly related features. The models were then fed this processed data for additional analysis and prediction.
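The constant-feature removal and the Pearson-based multicollinearity check can be sketched as below. This is an illustrative sketch under stated assumptions: the 0.9 threshold, the helper names, and the sample values are hypothetical and are not taken from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def drop_constant(features):
    """Remove features whose value never changes across the dataset."""
    return {name: col for name, col in features.items() if len(set(col)) > 1}


def correlated_pairs(features, threshold=0.9):
    """List feature pairs whose absolute correlation meets the threshold."""
    names = sorted(features)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson(features[a], features[b])) >= threshold
    ]


# Hypothetical sample: dew_point tracks temperature exactly, loc_id is constant.
features = {
    "temperature": [10.0, 12.0, 15.0, 20.0, 25.0],
    "dew_point": [5.0, 7.0, 10.0, 15.0, 20.0],
    "wind_speed": [3.0, 9.0, 2.0, 8.0, 4.0],
    "loc_id": [1, 1, 1, 1, 1],
}
kept = drop_constant(features)
pairs = correlated_pairs(kept)
print(sorted(kept), pairs)
```

Constant columns are dropped first because their standard deviation is zero, which would make the correlation undefined. For each flagged pair, one of the two features would typically be dropped or the pair combined.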
3.4. Modelling
In recent years, neural network models such as LSTM have proven to be extremely useful in solving such time series problems. LSTM is a modified version of the recurrent neural network (RNN). Figure 5 illustrates an LSTM block. LSTMs work exceptionally well in solving long-term dependency problems. The core feature of the LSTM is its cell state. The cell state is a horizontal line running along the top of the LSTM blocks, as shown in Figure 5. Cell states are similar to conveyor belts that carry information across multiple LSTM blocks. Besides the cell state, the LSTM also consists of three gates, namely (1) the forget gate, (2) the input gate, and (3) the output gate. The forget gate is primarily used to remove unwanted information from the cell state.
The status of the cell is updated using input gates. First, the sigmoid layer will decide which value it will update through the following equation:
Then, the Tanh layer will create a vector of new candidate values to be added through the following equation:
The final cell state contains both these values, as shown in the following equation:
Output gates are used for calculating the value of the next hidden state. This is achieved by passing the previous hidden state and the current input through the sigmoid function. The newly generated cell state is also passed through the tanh function. These two values are then multiplied to obtain the next hidden state, which is used for prediction. These two mathematical equations are listed as follows:
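For reference, the gate computations described in the preceding paragraphs take the following form in the standard LSTM formulation. The notation is an assumption, since it is not defined in the text: $x_t$ is the current input, $h_{t-1}$ the previous hidden state, $C_t$ the cell state, $W$ and $b$ the learned weights and biases of each gate, $\sigma$ the sigmoid function, and $\odot$ element-wise multiplication.

```latex
\begin{align}
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) && \text{(candidate values)} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(cell state update)} \\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(C_t) && \text{(next hidden state)}
\end{align}
```

The forget gate $f_t$ scales the old cell state, the input gate $i_t$ scales the candidate values $\tilde{C}_t$, and the output gate $o_t$ determines how much of the updated cell state is exposed as the hidden state $h_t$.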
In this study, besides incorporating neural network models such as LSTM and LSTM encoder and decoder, a thorough comparative analysis was also conducted between these models and a traditional time series model such as FbProphet for the forecasting of PM2.5.
According to the proposed methods, this study made daily and hourly PM2.5 concentration forecasts for Lahore, Islamabad, and Karachi for 30 days and 72 h. To achieve this goal, multivariate FbProphet, LSTM, and LSTM encoder–decoder models were utilized, and models were created separately for each city.
This study used the Keras Tuner framework for optimizing hyperparameters. Candidate parameter values were explored using a grid search, and each configuration was evaluated using the MAPE metric. The parameters that achieved the lowest MAPE value were used for the final models.
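The selection logic of a grid search driven by MAPE can be sketched in plain Python. This sketch deliberately does not use the Keras Tuner API; the function `grid_search`, the search space, and the stand-in scoring function `toy_validation_mape` are hypothetical and only illustrate the "try every combination, keep the lowest MAPE" procedure.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Evaluate every combination in `param_grid`.

    Returns (best_params, best_score), where lower scores are better.
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # e.g. validation MAPE of a trained model
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space, not the values reported in the study's tables.
grid = {"units": [32, 64, 128], "learning_rate": [0.01, 0.001]}


def toy_validation_mape(params):
    """Stand-in for 'train the model, forecast, and compute MAPE'."""
    return abs(params["units"] - 64) / 10 + params["learning_rate"] * 100


best, score = grid_search(grid, toy_validation_mape)
print(best, score)
```

In practice, `evaluate` would train a model with the given parameters and return its MAPE on a validation window, which makes the cost of a grid search grow with the product of the grid sizes.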
Table 2 shows the parameters used by the multivariate FbProphet model for hourly and daily forecasting. Table 3 displays the LSTM model architecture, and Table 4 illustrates the parameters used for tuning. Moreover, the time lag used for the LSTM model training was 3 and 4 for daily and hourly forecasting, respectively.
This study used the mean absolute percentage error (MAPE) to compare the performance of the proposed models. MAPE is the average absolute percentage error of a forecast. It is well suited to time series models because taking absolute values stops negative and positive errors from canceling each other out. Moreover, expressing errors as percentages makes it easier to compare models. Because one of the goals of this study was a comparative analysis of different models, this metric was a natural fit.
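A minimal sketch of the metric, assuming the usual definition MAPE = (100 / n) Σ |(y_t − ŷ_t) / y_t|, is given below. The function names and the sample values are illustrative only. The signed variant is included to show the cancellation that the absolute value prevents.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    if not actual or len(actual) != len(forecast):
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


def mean_percentage_error(actual, forecast):
    """Signed variant: positive and negative errors can cancel out."""
    total = sum((a - f) / a for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


# Hypothetical PM2.5 values: one over-forecast, one under-forecast, one exact.
actual = [100.0, 200.0, 50.0]
forecast = [110.0, 180.0, 50.0]

print(round(mape(actual, forecast), 2))                   # about 6.67
print(round(mean_percentage_error(actual, forecast), 2))  # about 0
```

The two 10% errors have opposite signs, so the signed average is close to zero even though the forecast is off by 10% on two of three points. MAPE reports the 6.67% average instead.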
This study used LSTM encoder–decoder for sequence-to-scalar forecasting instead of conventional sequence-to-sequence forecasting. LSTM encoder–decoder was also applied to acquire the hourly and daily forecast for the PM2.5 concentration in Lahore, Islamabad, and Karachi. The lag interval was the only difference between the daily and hourly forecast processes. The lag value was 4 for hourly forecasting and 3 for daily forecasting. Table 5 illustrates the general architecture for the final model for all three cities. Table 6 shows the parameters chosen in all the cities for hourly and daily forecasting.
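The sequence-to-scalar setup with a lag window can be illustrated as follows. This is a simplified sketch: it uses a single PM2.5 series, whereas the models in this study are multivariate, and the function name `make_windows` and the sample values are assumptions. Each input holds `lag` consecutive observations, and the target is the single value that follows.

```python
def make_windows(series, lag):
    """Build (inputs, targets) pairs for sequence-to-scalar forecasting.

    Each input is `lag` consecutive values; the target is the next value.
    """
    if lag < 1 or lag >= len(series):
        raise ValueError("lag must be between 1 and len(series) - 1")
    count = len(series) - lag
    inputs = [series[i:i + lag] for i in range(count)]
    targets = [series[i + lag] for i in range(count)]
    return inputs, targets


# Hypothetical PM2.5 readings.
pm25 = [55.0, 60.0, 58.0, 70.0, 65.0, 72.0]

x_daily, y_daily = make_windows(pm25, lag=3)    # lag of 3, as for daily forecasts
x_hourly, y_hourly = make_windows(pm25, lag=4)  # lag of 4, as for hourly forecasts
print(x_daily[0], y_daily[0])
print(x_hourly[0], y_hourly[0])
```

A sequence-to-sequence variant would instead return several future values per window as the target. Here only one value is predicted per input window.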
5. Discussion
The weather significantly impacts the PM2.5 concentration in different regions. A similar analysis was done by [9], where the author used penalties to find the impact of weather conditions on PM2.5 concentration. In this analysis, they concluded that a decrease in temperature and wind speed was the primary reason behind the increase in PM2.5 concentration. However, unlike the research conducted by [9], this research also analyzed other weather parameters such as humidity, pressure, and cloud cover. Moreover, this study concluded that besides wind speed and temperature, other parameters shown in Figure 7, Figure 15 and Figure 23 also had a substantial negative correlation with the PM2.5 concentration in Pakistan.
The models proposed in this study achieved low MAPE values for forecasting daily and hourly PM2.5 levels in multiple cities in Pakistan. Table 7 displays the combined results for the proposed models. These results confirmed that the prediction was better for hourly forecasting than for daily forecasting. This was because hourly forecasting had more data (11,000 records) than daily forecasting (650 records). Similarly, deep learning techniques, such as LSTM and LSTM encoder–decoder, outperformed the more conventional machine learning models. This was mainly because deep learning models are more robust and flexible in handling a sudden peak in PM2.5 concentration. Moreover, the results showed that the proposed multivariate models (multivariate FbProphet, LSTM, and LSTM encoder–decoder) performed better than the traditionally used univariate model. These results highlighted the importance of including more relevant features for forecasting PM2.5 concentrations.
The author in [13] also used similar deep learning approaches. In [13], the author proposed forecasting air quality for the next 48 h by utilizing a combination of neural network models, which included ANN, CNN, and LSTM. In that study, the author conducted forecasting for Taiwanese and Chinese datasets. The author in [13] did not use weather as a feature, as their primary aim was the spatiotemporal analysis of air pollutants. In contrast, this research analyzed data from Pakistan and included weather information, as it significantly impacted the region’s pollutant concentration. In [13], the authors obtained an MAE value of up to 10 for a 6-h prediction in different regions of China. In terms of accuracy, LSTM encoder–decoder obtained the highest accuracy among the models employed in this study, whereas the multivariate FbProphet model obtained the lowest. LSTM encoder–decoder obtained a MAPE value of up to 7.4% for the hourly forecast (72 h) and 15.07% for the daily forecast (30 days). Given the dataset utilized in this study and the low MAPE value, LSTM encoder–decoder was the ideal model for forecasting PM2.5 concentrations.
This study should have used statistical tests such as the t-test, ANOVA, and F-test for a more thorough model comparison. Similarly, instead of only using MAPE, other metrics such as mean absolute error (MAE) and root mean squared error (RMSE) would have provided a more in-depth analysis. These tests and metrics would have provided more conclusive evidence about whether there was any statistically significant difference between the results of the proposed models. However, MAPE was sufficient for explaining the general behavior in an initial model comparison. Not including statistical tests was one of the limitations of this research and will be the topic of our future research.
6. Conclusions
In conclusion, this research fulfilled all its goals and objectives. This study analyzed the impact of weather on the PM2.5 concentration across multiple cities in Pakistan. Through an in-depth exploratory data analysis and feature engineering, this study found that all the weather parameters negatively correlated with the PM2.5 concentration. This research also compared different machine learning models for daily and hourly forecasts of PM2.5 concentrations. The study showed that the LSTM encoder–decoder model performed best for this dataset. Furthermore, this study provided higher-level information to the time series models through a combination of pollutant and weather data. This additional information was crucial in improving the model’s accuracy, as shown in Table 7, which ranged from 15.1% to 63.9%.
In the future, it is suggested to use additional data for a more in-depth analysis. Moreover, it is also suggested to add other features, such as emissions data, to acquire a more comprehensive analysis of the models. Quantitative statistical tests such as the t-test, ANOVA, and F-test should be utilized going forward in order to obtain more definitive results. Finally, the latest state-of-the-art time series models, such as transformers and other attention-based models, could also be incorporated to forecast pollutants. These models, with their ability to eliminate recurrence and enable parallelization, can lead to less complex and more accurate models.