Article

Machine Learning-Based Prediction of Air Quality

by Yun-Chia Liang 1,*, Yona Maimury 1, Angela Hsiang-Ling Chen 2,* and Josue Rodolfo Cuevas Juarez 1

1 Department of Industrial Engineering and Management, Yuan Ze University, Taoyuan City 320, Taiwan
2 Department of Industrial and Systems Engineering, Chung Yuan Christian University, Taoyuan City 320, Taiwan
* Authors to whom correspondence should be addressed.
Appl. Sci. 2020, 10(24), 9151; https://doi.org/10.3390/app10249151
Submission received: 29 November 2020 / Revised: 16 December 2020 / Accepted: 18 December 2020 / Published: 21 December 2020

Abstract
Air, an essential natural resource, has been compromised in terms of quality by economic activities. Considerable research has been devoted to predicting instances of poor air quality, but most studies are limited by insufficient longitudinal data, making it difficult to account for seasonal and other factors. Several prediction models have been developed using an 11-year dataset collected by Taiwan’s Environmental Protection Administration (EPA). Machine learning methods, including adaptive boosting (AdaBoost), artificial neural network (ANN), random forest, stacking ensemble, and support vector machine (SVM), produce promising results for air quality index (AQI) level predictions. A series of experiments on datasets from three different regions shows that the stacking ensemble delivers consistently superior performance for R2 and RMSE, while AdaBoost provides the best results for MAE.

1. Introduction

Worldwide, air pollution is responsible for around 1.3 million deaths annually, according to the World Health Organization (WHO) [1]. The depletion of air quality is just one of the harmful effects of pollutants released into the air. Other detrimental consequences, such as acid rain, global warming, aerosol formation, and photochemical smog, have also increased over the last several decades [2]. The recent rapid spread of COVID-19 has prompted many researchers to investigate underlying pollution-related conditions contributing to the COVID-19 pandemic across countries. Several lines of evidence have shown that air pollution is linked to significantly higher COVID-19 death rates, and that patterns in COVID-19 death rates mimic patterns in areas with both high population density and high PM2.5 exposure [3]. All of the above raises an urgent need to anticipate and plan for pollution fluctuations to help communities and individuals better mitigate the negative impact of air pollution. To that end, air quality evaluation plays a significant role in monitoring and controlling air pollution.
The Environmental Protection Agency (EPA) tracks the commonly known criteria pollutants, i.e., ground-level ozone (O3), sulfur dioxide (SO2), particulate matter (PM10 and PM2.5), carbon monoxide (CO), and nitrogen dioxide (NO2). These pollutants are combined into a common index, the Air Quality Index (AQI), which indicates how clean or polluted the air currently is, or is forecast to become, in a given area. As the AQI increases, a higher percentage of the population is exposed. Different countries have their own air quality indices, corresponding to different air quality standards. In the United States, the US Environmental Protection Agency monitors six pollutants at more than 4000 sites: O3, PM10, PM2.5, NO2, SO2, and lead. Rybarczyk and Zalakeviciute [4] reviewed a selection of the 46 most relevant journal papers and found that more studies addressed O3, NO2, PM10, and PM2.5 than an overall AQI.
Recent research focuses on advanced statistical learning algorithms for air quality evaluation and air pollution prediction. Raimondo et al. [5], Garcia et al. [6], and Park et al. [7] used neural networks to build models predicting the prevalence of individual pollutants, e.g., particulate matter smaller than 10 microns (PM10). Raimondo et al. [5] used a support vector machine (SVM) and an artificial neural network (ANN) to train models. Their best ANN model attained almost 79% specificity with only a 0.82% false-positive rate, while their best SVM model achieved 80% specificity with a false-positive rate of only 0.13%. Yu et al. [8] proposed a random forest approach, named RAQ, for AQI category prediction. Yi et al. [9] then applied deep neural networks to the same task. Veljanovska and Dimoski [10] tested different algorithm settings for predicting AQI levels; their ANN model achieved an accuracy of 92.3%, outperforming k-nearest neighbor (k-NN), decision tree, and SVM.
The work presented in this paper focuses on the development of AQI prediction models for acute air pollution events 1, 8, and 24 h in advance. The following machine learning (ML) algorithms are investigated: random forest, adaptive boosting (AdaBoost), support vector machine, artificial neural network, and stacking ensemble methods. This research also observes how prediction performance decays over longer time frames, with accuracy measured by three commonly used metrics: mean absolute error (MAE), root mean squared error (RMSE), and R-squared (R2).

2. Machine Learning Prediction Methods

Machine learning involves computational methods that learn from complex data to build models for prediction, classification, and evaluation. This study attempts to build forecasting models capable of efficient pattern recognition and self-learning. In this section, the underlying principle of each machine learning method used in this study is discussed in turn.

2.1. Support Vector Machine

Support vector machine, a supervised learning method for classification, regression, and outlier detection, constructs a hyperplane that acts as a boundary between distinct data points, from which the output can be deduced [11]. Two distinctive versions of SVM are shown in Figure 1. For the classification problem in Figure 1a, data points that lie at the edge of an area closest to the hyperplane are considered support vectors. The space between these two regions is the margin between the classes. The hyperplanes determine the number of classes in the dataset, and the output for unseen data is predicted according to which class holds the most similarity with the new data. For the regression problem in Figure 1b, an approximation of such a hyperplane to a non-linear function is constructed at the maximal margin with linear regression. Hence, an additional parameter, known as the ε-insensitive loss, is introduced to tolerate deviations that lie inside the ε-tube [12].
The boundary lines (dashed lines) across the hyperplane (solid line) in SVR (support vector regression) are defined with respect to the parameter ε: the resulting lines are the function shifted by −ε and +ε from the hyperplane (assumed to be a straight line with equation ⟨w, x_i⟩ + b). SVR uses a penalty, introduced through the parameter C (cost factor), for output values outside the boundaries, either above (ξ_i) or below (ξ_i*); data points inside the boundaries are exempt. Since support vectors are the data points located near these boundary lines (see Figure 1b), moving ε further from the hyperplane decreases the number of support vectors, while moving ε closer to the hyperplane increases it. Finally, since most realistic problems are not linear, the kernel trick is commonly applied by mapping the training data onto a high-dimensional feature space. Kernel functions, e.g., linear, polynomial, radial basis function (RBF), sigmoid, and hyperbolic tangent, convert otherwise inseparable input data into separable data.
The parameter ε brings advantages but is often difficult to tune. Hence, scholars from the Australian National University proposed substituting the parameter ε with a parameter ν (hereinafter ν-SVM) to avoid this tedious tuning process for regression [13]. The parameter ν is also applicable to classification, where it replaces the cost factor C [14]. Values of ν between 0 and 1 are recommended, since ν is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors, offering a more meaningful parameter interpretation [15].
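As a concrete illustration, the following minimal sketch fits a ν-SVM regressor with an RBF kernel on synthetic data. scikit-learn's NuSVR is used here as an assumed stand-in for the Orange workflow actually employed in this study, and all data and parameter values are illustrative only.

```python
# Minimal nu-SVM regression sketch (scikit-learn assumed; synthetic data).
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(42)
X = rng.uniform(size=(500, 4))                    # toy feature matrix
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=500)

# nu in (0, 1] upper-bounds the fraction of margin errors and
# lower-bounds the fraction of support vectors, replacing epsilon.
model = NuSVR(kernel="rbf", C=3.0, nu=0.5, gamma="scale")
model.fit(X, y)
print("number of support vectors:", len(model.support_))
```

Lowering ν shrinks the permitted fraction of margin errors, which in turn reduces the number of support vectors the fitted model retains.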

2.2. Random Forest

Another prominent machine learning method, random forest, is a supervised ensemble algorithm that combines multiple decision trees into a forest using the bagging concept, which adds randomness to the model building. A random selection of features is used to split each individual tree, while a random selection of instances is used to create the training data subset for each decision tree. At each decision node in every tree, the best split is chosen from a random subset of the features. If the target attribute is categorical, the random forest chooses the most frequent prediction; if it is numerical, the average of all predictions is chosen.
Like SVM, the random forest can tackle both classification and regression cases. For prediction, each test data point is passed through every decision tree in the forest. The trees then vote on an outcome, and the prediction is produced from a majority vote (or average) among the models, resulting in a stronger and more robust single learner. Random forests can overcome the prediction variance of individual decision trees, in that the aggregated prediction approximates the ground truth (classification) or true value (regression). Figure 2 illustrates a random forest consisting of m trees.
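A minimal sketch of this bagging-and-averaging idea follows; scikit-learn is assumed, the data are synthetic, and the parameter values loosely echo Table 3 (200 trees, a random subset of features per split, minimum of 6 observations per leaf).

```python
# Random forest regression sketch (scikit-learn assumed; synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 4))
y = 10 * X[:, 0] + 5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=1000)

# n_estimators trees, each grown on a bootstrap sample of the instances;
# max_features limits the random feature subset considered at each split.
forest = RandomForestRegressor(n_estimators=200, max_features=4,
                               min_samples_leaf=6, bootstrap=True,
                               random_state=0)
forest.fit(X, y)

# For regression, the forest prediction is the average of the per-tree votes.
per_tree = np.stack([tree.predict(X[:1]) for tree in forest.estimators_])
print(per_tree.mean(), forest.predict(X[:1])[0])  # identical values
```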

2.3. Adaptive Boosting

The next method, adaptive boosting, also comes from the family of ensemble methods that combine several weak learners, but with a sequential arrangement instead of the parallel setting used by random forest. Boosting trains the base models one by one in sequence, assigning weights to the classifiers based on their accuracy in predicting a random set of input instances. By such means, the more accurate classifiers contribute more to the final answer. Weights are also assigned to each input item depending on how difficult the instance is to predict, averaged over all classifiers. The higher the weight, the harder it is to estimate the ground truth for the instance, and therefore the higher the chance that this item appears in the training subset of the succeeding iteration. In other words, the boosting process concentrates on the training data that are hard to classify and over-represents them in the training set for the next iteration, progressively focusing the stronger classifiers on the difficult-to-predict instances. Classifiers here are the base algorithms used to perform the prediction; the most common one in AdaBoost is the decision tree, although an ensemble can also mix algorithm types, e.g., decision trees, logistic regression, and naïve Bayes (for classification).
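The sketch below illustrates this sequential reweighting with decision-tree base learners; scikit-learn (version 1.2 or later for the `estimator` argument) is assumed, and the `loss` argument corresponds to the square/linear/exponential AdaBoost variants compared later in Section 4.

```python
# AdaBoost regression sketch (scikit-learn assumed; synthetic data).
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(size=(1000, 4))
y = np.sin(4 * X[:, 0]) + X[:, 1] + rng.normal(scale=0.1, size=1000)

# Each round reweights hard-to-predict instances so the next tree
# concentrates on them; more accurate trees get a larger say in the output.
booster = AdaBoostRegressor(estimator=DecisionTreeRegressor(max_depth=3),
                            n_estimators=100, learning_rate=0.8,
                            loss="linear", random_state=1)
booster.fit(X, y)
print("first five per-round learner weights:", booster.estimator_weights_[:5])
```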

2.4. Artificial Neural Network

The next approach adopted in this study is the artificial neural network. One of the earliest machine learning algorithms, the ANN is not only regarded as a “universal approximator” that can estimate any arbitrary function well [16], but also as the foundation of the most recent progress in artificial intelligence, known as deep learning or deep neural networks. The neural network simulates the structure and networks of the human brain in the process of information learning. A human learns new things by training the biological neurons in the brain with examples, and the extracted knowledge is later stored in memory. In an ANN, a considerable amount of input data is fed into artificial neurons; all neurons are trained and the network is adjusted to produce a better response, e.g., in a prediction or recognition task. The network is adjusted by updating the weights (w_i,1, w_i,2, w_i,3, …, w_i,r) of each neuron and the biases added in each summation procedure (see Figure 3). The complexity of the network is determined by the number of hidden layers. The net output, denoted a_i, is transformed non-linearly by an activation function f to form an output y_i that is forwarded to the next hidden layer. Numerous activation functions are employed to bring non-linearity to the input signal, allowing the network to adapt to a wide variety of instances; these include sigmoid, ReLU, leaky ReLU, and hyperbolic tangent (tanh) functions.
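The computation performed by a single layer of such neurons can be sketched numerically as follows; the weights, biases, and inputs are arbitrary illustrative values.

```python
# Forward pass of one ANN layer: net output a_i = sum_j(w_i,j * x_j) + b_i,
# then y_i = f(a_i) with a non-linear activation f (ReLU here).
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

x = np.array([0.2, 0.7, 0.1])        # inputs from the previous layer
W = np.array([[0.5, -0.3, 0.8],      # weights w_i,1 ... w_i,r, one row per neuron
              [0.1,  0.9, -0.4]])
b = np.array([0.05, -0.1])           # one bias per neuron (the summation "adder")

a = W @ x + b                        # net outputs a_i
y = relu(a)                          # activated outputs y_i, fed to the next layer
print(a, y)
```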

2.5. Linear Regression

Linear regression is probably the method with which most academics start their machine learning experience. Its main working principle is fitting the relationship between one or more independent variables and the dependent variable as a line in n dimensions, where n denotes the number of variables in the dataset. The line is constructed to minimize the total error when fitting all instances. Under machine learning, linear regression is equipped with the capability to learn continuously by optimizing the parameters of the model, namely w0, w1, w2, …, wm (as illustrated in Figure 4). Most commonly, optimization is carried out by gradient descent: the loss function is partially differentiated with respect to each parameter, and each parameter is updated by subtracting the derivative multiplied by a specified learning rate. The learning rate can be tuned in the simplest way, by rule of thumb (trial and error), or by a more sophisticated scheme, e.g., a meta-heuristic. Another tunable aspect is the amount of generalization added to the model. Regularization is applied to lessen the chance of overfitting and increase the robustness of the model. Two types of regularization used in linear regression are lasso and ridge regression. Lasso regularization can eliminate less important features by driving their coefficients to zero while retaining more important ones. Ridge regularization, on the other hand, does not eliminate features but shrinks the magnitude of the coefficients to obtain a lower variance in the model.
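A minimal gradient-descent sketch for ridge-regularized linear regression follows; the data, learning rate, and regularization strength are illustrative assumptions.

```python
# Gradient descent for ridge-regularized linear regression: each step
# subtracts the learning rate times the gradient of the squared-error loss.
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + 0.3 + rng.normal(scale=0.05, size=200)

Xb = np.hstack([np.ones((200, 1)), X])   # prepend a ones column for w0
w = np.zeros(4)                          # parameters w0, w1, w2, w3
lr, lam = 0.1, 0.01                      # learning rate, ridge strength

for _ in range(2000):
    residual = Xb @ w - y
    grad = Xb.T @ residual / len(y) + lam * w   # MSE gradient + ridge term
    w -= lr * grad                              # parameter update step

print(w)   # approaches [0.3, 1.5, -2.0, 0.7] (bias also penalized, for brevity)
```

Swapping the ridge term `lam * w` for `lam * np.sign(w)` gives the (sub)gradient of the lasso penalty, which drives small coefficients toward exactly zero.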

2.6. Stacking Ensemble

Though coming from the same ensemble family, stacking differs from random forest and the boosting strategy in AdaBoost in several ways. In bagging, variance in the final ensemble model is reduced by randomly selecting a subset of features and instances for each predictor, which then learns in parallel and independently; the outcomes are aggregated by averaging to generate an ensemble prediction. Boosting, on the other hand, passes the dataset through all the learners arranged sequentially. Each instance and each learner is given a weight that is updated on each pass (instance) and each iteration (learner). This weighting procedure results in an uneven contribution of each learner to the final prediction and an uneven prioritization of each instance during training, replacing the output averaging and training-subset randomization of the bagging concept.
In stacking, each base predictor takes the whole dataset, without any differentiation of the input, and works in the canonical way to produce its result. The special property of this method lies in the aggregation mechanism: after learning, the outputs of the predictors become the inputs of an aggregator algorithm that produces the final prediction. The training set used by the base predictors in the first learning stage must differ from the one used by the aggregator, because fitting the aggregator on the same instances from which its inputs were created introduces bias. However, splitting the data into two such sets is problematic when data are limited. To overcome this, the common k-fold cross-validation approach is usually adopted to provide more data for training both predictors and aggregator, thereby facilitating a more accurate performance measure [17]. In practice, stacking usually considers multiple types of learners, while bagging and boosting more commonly use homogeneous learners. Besides the algorithms used, the design of a stacking ensemble can also be altered by the stacking level: with more than two levels, the middle layers are filled with multiple aggregators. However, since increasing the number of levels increases computation time, this parameter usually remains at its default (i.e., two levels).
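The sketch below assembles such a two-level stack: heterogeneous base learners whose out-of-fold predictions, produced by k-fold cross-validation, feed an L2 ridge-regression aggregator (matching the aggregator listed in Table 3). scikit-learn is assumed and the data are synthetic.

```python
# Two-level stacking sketch (scikit-learn assumed; synthetic data).
import numpy as np
from sklearn.ensemble import (StackingRegressor, RandomForestRegressor,
                              AdaBoostRegressor)
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor
from sklearn.svm import NuSVR

rng = np.random.default_rng(4)
X = rng.uniform(size=(300, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=300)

base_learners = [
    ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
    ("ada", AdaBoostRegressor(n_estimators=100, random_state=0)),
    ("svm", NuSVR(kernel="rbf", nu=0.5, C=3.0)),
    ("ann", MLPRegressor(hidden_layer_sizes=(50,), max_iter=500, random_state=0)),
]

# cv=5: each base learner's out-of-fold predictions become the aggregator's
# features, avoiding the bias of fitting the aggregator on already-seen data.
stack = StackingRegressor(estimators=base_learners,
                          final_estimator=Ridge(alpha=0.3), cv=5)
stack.fit(X, y)
print("aggregator weights over base outputs:", stack.final_estimator_.coef_)
```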

3. Implementation Methodology

The methodology in this study consists of the following procedures: data collection and preprocessing, feature selection, time windowing, and model building. All machine learning models in this study are constructed on Orange, an open-source data mining platform written in Python. In this section, the details of these procedures are discussed in turn.

3.1. Data Collection

The main pollutant emissions in Taiwan are due to the energy production industry, traffic, waste incineration, and agriculture. In Taiwan, six pollutants (O3, PM2.5, PM10, CO, SO2, and NO2) are monitored and controlled based on their concentration time series. Three types of data are used as predictors: air quality data (AQ), meteorological data (MET), and time data (TIME: the day of the month, day of the week, and hour of the day). From 1 January 2008 to 31 December 2018, air quality data were collected from several monitoring stations across Taiwan and reported via the EPA’s website [18]. Over the same timeframe, meteorological data were provided in 1-h intervals by Taiwan’s Central Weather Bureau (CWB) for three air monitoring stations: Zhongli (Northern Taiwan), Changhua (Central Taiwan), and Fengshan (Southern Taiwan). The datasets represent different environmental conditions related to air pollutant concentration.

3.2. Data Pre-Processing

The raw data for the Zhongli, Changhua, and Fengshan monitoring stations comprise 91,672, 94,453, and 94,145 data points, respectively. The analysis of these readings begins with a crucial phase: data preprocessing. Various preprocessing operations precede the learning phase. At any particular time, one invalid variable does not invalidate the whole data group; it is either marked blank or, where available, replaced by a value sourced from the CWB, without eliminating the full row. Missing values are treated by imputation to recover the corresponding values. Because the CWB readings are not spatially co-located with the original monitoring stations, only relative humidity, temperature, and rainfall are imputed this way; wind speed and wind direction are not. A subsequent imputation step uses the k-NN algorithm to substitute the remaining invalid or missing data that did not qualify for the previous process. Note that the percentage of missing values is lower than 1.3% in all three station datasets.
Then, input and target data are normalized to eliminate potential biases, so that a variable’s significance is not affected by its range or units. All raw data values are normalized to the range [0, 1]. Without normalization, inputs on a larger scale than others tend to dominate the measurement and are consequently given greater priority. Normalization not only improves the model learning rate but also supports k-NN performance, because the imputation is decided by a distance measure.
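The two steps can be sketched as follows; scikit-learn's KNNImputer and MinMaxScaler are assumed stand-ins for the corresponding Orange operations, and the readings are invented.

```python
# Preprocessing sketch: k-NN imputation of missing readings, then
# min-max normalization of all values to [0, 1].
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Toy block of hourly readings; np.nan marks invalid or missing values.
readings = np.array([[12.0, 0.04, np.nan],
                     [14.0, np.nan, 55.0],
                     [np.nan, 0.05, 58.0],
                     [13.0, 0.06, 60.0]])

imputed = KNNImputer(n_neighbors=2).fit_transform(readings)         # fill gaps
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(imputed)  # rescale
print(scaled)
```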

3.3. Feature Engineering

Regarding feature selection for the predictive models, the hourly AQI reading is the highest sub-index among the six pollutants O3, PM2.5, PM10, CO, SO2, and NO2. To convert the time-window-specific concentrations of the six pollutants, the AQI Taiwan Guidelines [18] are adopted and the AQI is calculated manually using Equations (1) and (2) below; the index values of O3, PM2.5, and PM10 are required to define the AQI in Taiwan, and the lack of one or more of these values significantly reduces the accuracy of the assessment of current air quality.
$$\mathrm{AQI} = \begin{cases} \max\left(I_{O_3}, I_{PM_{2.5}}, I_{PM_{10}}, I_{CO}, I_{SO_2}, I_{NO_2}\right), & \text{if all sub-indices are available}\\ \max\left(I_{O_3}, I_{PM_{2.5}}, I_{PM_{10}}\right), & \text{otherwise} \end{cases} \tag{1}$$
Pollutant concentration ($value_i$) is converted to a pollutant sub-index ($I_i$) by the following formula:
$$I_i = LB_j + \frac{value_i - lb_i}{ub_i - lb_i} \times \left(UB_j - LB_j\right) \tag{2}$$
where i ∈ {O3, PM2.5, PM10, CO, SO2, NO2}; j denotes the AQI level occupied by the concentration of the specific pollutant, using the categories good, moderate, unhealthy for sensitive groups, unhealthy, very unhealthy, and hazardous; $lb_i$ and $ub_i$ are the lower and upper concentration breakpoints of that level; and $LB_j$ and $UB_j$ are the corresponding index breakpoints. The data transformation defines the time-window-specific concentration used to calculate the $I_i$ values. For example, based on the AQI from Taiwan’s EPA website [18], the concentration $value_{O_3}$ = 0.060 ppm falls in the interval with $lb_{O_3}$ = 0.055 ppm and $ub_{O_3}$ = 0.070 ppm, corresponding to the “moderate” level with $LB_{moderate}$ = 51 and $UB_{moderate}$ = 100, yielding $I_{O_3}$ = 51 + (0.060 − 0.055)/(0.070 − 0.055) × (100 − 51) ≈ 67. The $value_{O_3}$ is defined by one of two conditions: if the 8-h average concentration is more precautionary for a specific site and is below 0.2 ppm, that value is used; otherwise, the 1-h average concentration is considered. Both $value_{PM_{2.5}}$ and $value_{PM_{10}}$ are moving averages that combine two time windows, i.e., the last 12 h and the last 4 h (see Table 1). Other variables, such as $value_{CO}$ and $value_{NO_2}$, account for a single time window, i.e., the last 8 h and the last 1 h, respectively. Meanwhile, $value_{SO_2}$ uses the 24-h average concentration if the 1-h average exceeds 185 ppb; otherwise, the 1-h average value is used.
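A small sketch of Equations (1) and (2) for the worked O3 example above follows; the breakpoint values are taken from that example only, not from a complete Taiwan AQI breakpoint table.

```python
# Equation (2): linear interpolation of a concentration onto its AQI band.
def sub_index(value, lb, ub, LB, UB):
    return LB + (value - lb) / (ub - lb) * (UB - LB)

# O3 at 0.060 ppm in the "moderate" band [0.055, 0.070] -> index [51, 100].
i_o3 = sub_index(0.060, lb=0.055, ub=0.070, LB=51, UB=100)
print(round(i_o3, 1))   # 67.3

# Equation (1): the AQI is the maximum over the available sub-indices
# (extend the list with I_PM2.5, I_PM10, etc. as they are computed).
aqi = max([i_o3])
```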
The AQI mechanism introduces several new variables to train the prediction model (Table 1). For several pollutants, time windows other than hourly are more sensitive in determining the AQI; therefore, the prediction horizon is investigated to clarify the time dependency between consecutive data points and its effect on long-term prediction accuracy. With the AQI calculation established, future AQI readings at three different time intervals are regarded as target variables, as summarized in Table 2.

3.4. Performance Evaluation

According to Iskandaryan et al. [19], the most used metrics are RMSE (root mean squared error) and MAE (mean absolute error), calculated from the difference between the prediction and the true value, while another metric, R2 (R-squared), is essential to explain the strength of the relationship between the predictive model and the target variable [20]. These three metrics provide a baseline for comparative analysis across different parameter settings for each model and across different methods. However, performance validation is biased when the data set is split, trained, and tested only once; a result drawn from one testing subset may no longer hold when the subset changes. To overcome this problem, each model is rebuilt 20 times using different random subsets of training and testing samples, keeping the splitting proportion the same (80:20). Each reported metric is a single value: the average performance of the 20 identically configured models, each validated on a different subset of testing instances.
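This validation loop can be sketched as follows; scikit-learn is assumed, random forest stands in for any of the five methods, and the data are synthetic.

```python
# Repeated hold-out evaluation: rebuild the model 20 times on random
# 80:20 splits and report the average MAE, RMSE, and R2.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.uniform(size=(2000, 4))
y = 4 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.2, size=2000)

maes, rmses, r2s = [], [], []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=seed)
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    pred = model.fit(X_tr, y_tr).predict(X_te)
    maes.append(mean_absolute_error(y_te, pred))
    rmses.append(np.sqrt(mean_squared_error(y_te, pred)))
    r2s.append(r2_score(y_te, pred))

print(np.mean(maes), np.mean(rmses), np.mean(r2s))  # single reported values
```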

4. Results and Discussion

This section is organized into three parts. First, a general description of the datasets, which are organized by geographic distribution across Taiwan, is provided. The second part discusses the detailed development of the AQI prediction models, including their parameter settings and the effect of imputation. The last part evaluates the performance of the AQI forecasting models.

4.1. Data Summary

In the Zhongli dataset, moderate is the most frequent AQI level in any given month (Figure 5a). Unhealthy occurs more frequently from December through April, indicating that peak pollution usually occurs in winter and spring. The year-based grouping (Figure 5b) clearly shows a general drop in pollution levels from 2014 to 2018, with a small uptick in 2016. Overall, the moderate class accounts for 51% of cases, while good and unhealthy account for 28% and 21%, respectively.
Similar to the Zhongli AQI pattern, pollution in Changhua peaks in March (Figure 6a). However, the degree of air pollution is more severe in Changhua, with unhealthy accounting for 59% of March readings, as opposed to 39% for Zhongli. Like Zhongli, higher AQI levels in Changhua are also clustered in winter and spring, but September, October, and November also featured significant instances of the unhealthy class (respectively 35%, 38%, and 41%). In general, Changhua has poorer air quality than Zhongli, with more frequent AQI > 100 incidents both monthly and annually. However, the full-year AQI readings in Figure 6b show that air quality has gradually improved over time, with a 34% drop in instances of unhealthy from 2008 to 2018.
Southern Taiwan, especially Kaohsiung City, is notorious for its poor air quality, due not only to emissions from nearby industrial parks but also to particulate matter blowing in from China and Southeast Asia. Figure 7a,b shows significant instances of the unhealthy class (red bars) for most of the year, with reduced pollution levels only from May to September. The worst air quality is concentrated in December and January (78% and 80% unhealthy, respectively).
The winter spike in air pollution is partly due to seasonal atmospheric phenomena that trap air pollution closer to the ground for extended periods. From October to March, Fengshan air quality readings are good less than 5% of the time. In terms of year-based AQI class composition, little improvement is seen until 2014–2015, when unhealthy readings declined sharply, after which levels remained relatively stable. Overall, across the 11 years, the Fengshan dataset is dominated by AQI > 100 (46%), followed by 51 ≤ AQI ≤ 100 (37%) and AQI ≤ 50 (17%).

4.2. AQI Prediction Model

Table 3 specifies the parameters used to generate the prediction models for all datasets (Zhongli, Changhua, and Fengshan). In principle, each parameter takes three values per dataset, one for each time step (F1/F8/F24). However, to ease documentation, any value shared across all datasets, or at least across time steps, is written only once; for example, the Changhua dataset uses the same number of trees (100) in AdaBoost across all time step categories, and the parameter m in random forest takes only one value in all models. To evaluate each model, 80% of the data points are fed into the training process, while the remaining 20% are reserved for testing.
Table 4 presents the evaluation results of the Zhongli F1-AQI prediction using the five methods, with and without imputation. The machine learning algorithms perform very well in predicting the following hour’s AQI level in Zhongli. The linear kernel proves to be the best input transformation for SVM, with R2 results of 0.953 (without imputation) and 0.963 (with imputation), and imputation improves SVM on all evaluation metrics. Furthermore, in terms of MAE, SVM-RBF outperforms SVM-Linear, but the opposite holds for RMSE. This may be because RBF produces more samples with large prediction errors despite a smaller average error (larger errors incur a greater penalty under RMSE).
The performances of random forest, AdaBoost, ANN, and the stacking ensemble are all comparable. Random forest and the stacking ensemble obtain slightly better R2 performance (by 0.001). Unlike SVM, imputation does not affect the prediction results for AdaBoost, random forest, or the stacking ensemble, indicating their robustness to missing data. Imputation provides only a small degree of improvement for ANN, resulting in R2 values tied with AdaBoost. Several regression loss functions (square, linear, exponential) were tested with AdaBoost, but no decisive winner emerged; the differences are so minor that any interpretation could easily be distorted by randomness.
Table 5 summarizes the results for the 8-h Zhongli AQI prediction. The best R2 value, 0.764, is obtained by the stacking ensemble method. The performance of SVM worsens, with R2 values below 0.6 across all kernels and MAE and RMSE values of about 17 and 23, respectively. ANN and random forest perform better than SVM, with R2 scores exceeding 0.7 and error metrics only slightly worse than those of AdaBoost and the stacking ensemble. These results match expectations, since uncertainty increases over a longer period and makes the forecast more difficult; accordingly, the overall values are worse than those of the F1-AQI prediction.
Table 6 shows that no method targeting the F24-AQI prediction produced an R2 score above 0.6, with the lowest score being 0.091. Simply put, the predictions fit the data poorly. The stacking ensemble still ranks first, but its R2 gap to the second-best method (AdaBoost-Linear) is larger than in the previous cases. SVM performance trails far behind the other methods, with its best metric scores obtained by the RBF kernel; even so, the R2 score is so low that SVM is considered unsuitable for 24-h prediction.
Predictive model results for F1-AQI Changhua are similar to those for F1-AQI Zhongli. Stacking ensemble, AdaBoost, and random forest provide the best performance for one-hour AQI level prediction (see Table 7). These algorithms perform better for all evaluation metrics in Changhua than in Zhongli. Also, the imputation process reduces the performance of SVM, but not the other algorithms.
When it comes to the F8-AQI prediction (as shown in Table 8), the Changhua prediction again outperforms that of Zhongli. AdaBoost and stacking ensemble both yield R2 scores exceeding 0.8. Without imputation, stacking ensemble outperforms the other methods. However, with imputation, AdaBoost performance is comparable to that of stacking ensemble. SVM-linear gives the highest MAE and RMSE results, i.e., 23.412 and 31.189, respectively. These error metrics can be further reduced to 19.623 and 25.628 by imputation.
As in the Zhongli dataset, the time step selection affects the performance of the machine learning methods, and this is consistent with the results for the F24-AQI prediction models in Changhua (Table 9). Performance declines across all models, with very low R2 values. SVM-Polynomial gives the worst performance on the imputed dataset, with an MAE exceeding 30 and an RMSE exceeding 40. The best performance is still obtained by the stacking ensemble method, with an R2 score of 0.605 and MAE and RMSE values below 19 and 26, respectively. Among the SVM kernels, the radial basis function appears to be the most effective for 24-h AQI predictions. Moreover, AdaBoost-Exponential slightly underperforms the stacking ensemble in terms of R2 and RMSE but consistently provides better MAE results.
Table 10 summarizes the results for the one-hour prediction model, without and with the k-NN imputation step, in the Fengshan dataset. Stacking ensemble learning outperforms the other techniques in terms of RMSE and R2, while SVM obtains the worst performance in every prediction case. Imputation slightly enhances the results, particularly for the RBF and linear kernels, but not for the polynomial kernel, whose performance declines on the imputed dataset. Note also that, compared with the other two cities, Zhongli and Changhua, Fengshan shows the best performance on all evaluation measures.
For the eight-hour prediction, imputation has a significant impact on SVM-Linear, increasing R2 from 0.318 to 0.546 (as shown in Table 11); its RMSE and MAE also improve by more than 10%, moving closer to the performance of the other SVM kernels. Of the three locations, the machine learning algorithms have the biggest impact on 8-h predictions in Fengshan, with the stacking ensemble providing the greatest improvement, followed by AdaBoost, random forest, ANN, and SVM. This ordering is consistent across all results.
As summarized in Table 12, for the 24-h predictions in Fengshan, while overall SVM results are not promising, the other methods show quite acceptable evaluation scores. The top three methods (stacking ensemble, AdaBoost, and random forest) obtained R2 scores exceeding 0.71 for which the MAE and RMSE results are comparable to the F8-AQI prediction for Fengshan. Surprisingly, the stacking ensemble is found to be affected by imputation but, even with imputation, the MAE value is still higher than that of all AdaBoost versions (linear, square, and exponential). AdaBoost and stacking ensemble show consistent results, and AdaBoost generally obtains worse RMSE and R2 but better MAE.

4.3. Implementation of AQI Forecasting Model

This section describes a simulation-like AQI forecast using the stacking ensemble and AdaBoost (the two best methods from the analyses in Section 4.2) as backend techniques. Each prediction is accompanied by a prediction interval (PI) at a 95% confidence level, i.e., a tolerance around the predicted value such that there is a 95% chance that the actual observation falls within this range. The prediction interval is calculated using the formula below [21]:
$$PI = z_{\alpha/2} \times \sigma \tag{3}$$
where σ represents the standard deviation of the residual errors defined as [22]:
$$\sigma = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{4}$$
Prediction intervals, which reflect the uncertainty of a model’s output, should be adjusted dynamically as new observations arrive every hour, ensuring that the prediction interval is always current. One month of samples (December 2018) from the Zhongli dataset is used to estimate the standard deviation.
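A sketch of Equations (3) and (4) follows, with one month of invented hourly residuals standing in for the December 2018 Zhongli sample; z_(α/2) = 1.96 for a 95% confidence level.

```python
# Prediction interval per Equations (3)-(4): PI = z_(alpha/2) * sigma,
# with sigma the standard deviation of the residual errors.
import numpy as np

def prediction_interval(y_true, y_pred, z=1.96):
    sigma = np.sqrt(np.sum((y_true - y_pred) ** 2) / (len(y_true) - 2))
    return z * sigma

# Toy stand-in for one month of hourly AQI observations and model outputs.
rng = np.random.default_rng(6)
y_true = rng.uniform(20, 120, size=744)             # 31 days x 24 h
y_pred = y_true + rng.normal(scale=5.0, size=744)   # hypothetical predictions

pi = prediction_interval(y_true, y_pred)
forecast = 85.0                                      # hypothetical F1-AQI forecast
print(f"95% PI: [{forecast - pi:.1f}, {forecast + pi:.1f}]")
```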
As shown in Figure 8a, the higher the prediction time step, the wider the tolerance needed around the estimate. AdaBoost and the stacking ensemble outperform the other techniques tested in the previous section, obtaining similar predictions and prediction intervals. The predictions here are all based on authentic data, with the best model in each prediction category reused. Figure 8b shows another forecast constructed during winter, providing an example of poor air quality captured in the F1-AQI, F8-AQI, and F24-AQI predictions using AdaBoost and the stacking ensemble.
Figure 9 illustrates how this information could be provided and visualized for a sample of incoming data in an air quality monitoring and forecasting system. As in Figure 8, the higher the prediction time step, the wider the tolerance needed to accompany the estimate, and the AdaBoost and stacking predictions, like their prediction intervals, remain close to each other. The predictions follow a realistic scheme in which the best model in each prediction category is reused, incorporating the actual values of the 24 input features.

5. Conclusions

Applying artificial intelligence methods provides promising results for AQI forecasting. This study used data collected by Taiwan’s EPA and CWB over 11 years. Three regions of Taiwan were considered (north: Zhongli; central: Changhua; south: Fengshan), including two places (Changhua and Fengshan) notorious for bad air quality all year round. With good R2 results, the stacking ensemble and AdaBoost offer the best target prediction performance across the three datasets; more specifically, the stacking ensemble delivers the best RMSE results, while AdaBoost provides the best MAE results. SVM yields the worst results among all methods explored and provides meaningful results only for 1-h predictions. The results also confirm that the two machine learning methods AdaBoost and stacking ensemble can outperform methods popular in the literature, such as SVM, random forest, and ANN. In other words, AdaBoost and stacking ensemble can be considered new and superior alternatives for AQI forecasting.
This study also indicates that prediction performance varies across regions of Taiwan. Comparing results across the three regional datasets shows the best results for Fengshan AQI prediction (Southern Taiwan), where performance decay with increasing time step is less pronounced than in Zhongli (north) and Changhua (central). In addition, 95% confidence intervals are calculated for the 1-h, 8-h, and 24-h forecasts. Compared to a single-value prediction, the 95% C.I. provides a better reference for decision-makers; for example, an event planner can decide with greater confidence whether outdoor activities can proceed based on the air quality forecast. Future work should focus on improving performance using the stacking ensemble, AdaBoost, and random forest with hyperparameter optimization, particularly for predictions with larger time steps (F8-AQI and F24-AQI).

Author Contributions

Conceptualization, Y.M., J.R.C.J., and Y.-C.L.; methodology, Y.M. and Y.-C.L.; software, Y.M.; validation, Y.M. and Y.-C.L.; formal analysis, Y.M.; data curation, Y.M.; writing—original draft preparation, Y.M.; writing—review and editing, Y.-C.L. and A.H.-L.C.; visualization, Y.M.; supervision, Y.-C.L.; project administration, Y.-C.L.; funding acquisition, Y.-C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by the Ministry of Science and Technology, Taiwan, MOST 106-2221-E-155-025, MOST 107-2221-E-155-043.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. World Health Organization. Air Pollution. Available online: https://www.who.int/health-topics/air-pollution#tab=tab_1/ (accessed on 13 March 2020).
2. Ghorani-Azam, A.; Riahi-Zanjani, B.; Balali-Mood, M. Effects of Air Pollution on Human Health and Practical Measures for Prevention in Iran. J. Res. Med. Sci. 2016, 21, 1–12.
3. Conticini, E.; Frediani, B.; Caro, D. Can Atmospheric Pollution Be Considered a Co-factor in Extremely High Level of SARS-CoV-2 Lethality in Northern Italy? Environ. Pollut. 2020, 261, 114465.
4. Rybarczyk, Y.; Zalakeviciute, R. Machine Learning Approaches for Outdoor Air Quality Modelling: A Systematic Review. Appl. Sci. 2018, 8, 2570.
5. Raimondo, G.; Montuori, A.; Moniaci, W.; Pasero, E.; Almkvist, E. A Machine Learning Tool to Forecast PM10 Level. In Proceedings of the Fifth Conference on Artificial Intelligence Applications to Environmental Science, San Antonio, TX, USA, 14–18 January 2007; pp. 1–9.
6. Garcia, J.M.; Teodoro, F.; Cerdeira, R.; Coelho, R.M.; Kumar, P.; Carvalho, M.G. Developing a Methodology to Predict PM10 Concentrations in Urban Areas Using Generalized Linear Models. Environ. Technol. 2016, 37, 2316–2325.
7. Park, S.; Kim, M.; Kim, M.; Namgung, H.-G.; Kim, K.-T.; Cho, K.H.; Kwon, S.-B. Predicting PM10 Concentration in Seoul Metropolitan Subway Stations Using Artificial Neural Network (ANN). J. Hazard. Mater. 2018, 341, 75–82.
8. Yu, R.; Yang, Y.; Yang, L.; Han, G.; Move, O.A. RAQ-A Random Forest Approach for Predicting Air Quality in Urban Sensing Systems. Sensors 2016, 16, 86.
9. Yi, X.; Zhang, J.; Wang, Z.; Li, T.; Zheng, Y. Deep Distributed Fusion Network for Air Quality Prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, UK, 19–23 August 2018; pp. 965–973.
10. Veljanovska, K.; Dimoski, A. Air Quality Index Prediction Using Simple Machine Learning Algorithms. Int. J. Emerg. Trends Technol. Comput. Sci. 2018, 7, 25–30.
11. Muhammad, I.; Yan, Z. Supervised Machine Learning Approaches: A Survey. ICTACT J. Soft Comput. 2015, 5, 946–952.
12. Awad, M.; Khanna, R. Support Vector Regression. In Efficient Learning Machines; Apress: Berkeley, CA, USA, 2015.
13. Schölkopf, B.; Smola, A.J.; Williamson, R.; Bartlett, P. New Support Vector Algorithms. Neural Comput. 2000, 12, 1207–1245.
14. Chang, C.-C.; Lin, C.-J. Training ν-Support Vector Regression: Theory and Algorithms. Neural Comput. 2002, 14, 1959–1977.
15. Wu, X.; Srihari, R. New ν-Support Vector Machines and Their Sequential Minimal Optimization. In Proceedings of the Twentieth International Conference on Machine Learning, Washington, DC, USA, 21–24 August 2003; AAAI Press: Washington, DC, USA, 2003; pp. 824–831.
16. Yu, L.; Wang, S.; Lai, K.K. Basic Learning Principles of Artificial Neural Networks. In Foreign-Exchange-Rate Forecasting with Artificial Neural Networks; Yu, L., Wang, S., Lai, K.K., Eds.; Springer: Boston, MA, USA, 2007; pp. 27–37.
17. Rocca, J. Ensemble Methods: Bagging, Boosting and Stacking. Available online: https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205 (accessed on 23 April 2019).
18. Taiwan’s Environmental Protection Administration. Taiwan Air Quality Monitoring Network. Available online: https://taqm.epa.gov.tw/taqm/en/b0201.aspx (accessed on 13 March 2020).
19. Iskandaryan, D.; Ramos, F.; Trilles, S. Air Quality Prediction in Smart Cities Using Machine Learning Technologies Based on Sensor Data: A Review. Appl. Sci. 2020, 10, 2401.
20. Dufour, J.M. Coefficients of Determination; McGill University: Montréal, QC, Canada, 2011.
21. Brownlee, J. Prediction Intervals for Machine Learning. Available online: https://machinelearningmastery.com/prediction-intervals-for-machine-learning/ (accessed on 30 May 2018).
22. Shrestha, D.L.; Solomatine, D.P. Machine Learning Approaches for Estimation of Prediction Interval for the Model Output. Neural Netw. 2006, 19, 225–235.
Figure 1. Overview of SVM algorithm: (a) SVM for classification; (b) SVM for regression.
Figure 2. Illustration of a random forest algorithm.
Figure 3. Illustration of artificial neural network.
Figure 4. Demonstration of linear regression’s learning process.
Figure 5. Composition of AQI classes in Zhongli: (a) Month-based; (b) Year-based and Overall-based.
Figure 6. Composition of AQI classes in Changhua: (a) Month-based; (b) Year-based and Overall-based.
Figure 7. Composition of AQI classes in Fengshan: (a) Month-based; (b) Year-based and Overall-based.
Figure 8. Forecast of AQI: (a) on 28 May 2019, 07:00; (b) on 4 February 2019, 03:00.
Figure 9. Illustration of Air Quality Monitoring and Forecasting System.
Table 1. Other features added to the prediction model.

No. | Feature | Type | Description
1 | O3 8-h | Numeric | Calculated as the O3 average over the last 8 h
2 | PM10 moving average | Numeric | (0.5 × average of PM10 in the last 12 h) + (0.5 × average of PM10 in the last 4 h)
3 | PM2.5 moving average | Numeric | Calculated using the same rule as the PM10 moving average
4 | CO 8-h | Numeric | The average concentration over the last 8 h
5 | AQI index | Numeric | AQI value based on the maximum index among the AQI pollutants (PM10, PM2.5, NO2, SO2, O3, and CO)
Table 2. Description of target variables.

No. | Target | Type | Description
1 | F1-AQI | Numeric | AQI index for the next 1 h
2 | F8-AQI | Numeric | AQI index for the next 8 h
3 | F24-AQI | Numeric | AQI index for the next 24 h
Table 3. Parameter Design of ML Methods (values listed as F1/F8/F24; a single value applies to all time steps).

Method | Zhongli | Changhua | Fengshan
Random Forest | No. of trees = 100/200/200; m = 4; min. observation = 6/6/3 | No. of trees = 200; m = 4; min. observation = 6/3/3 | No. of trees = 200/100/200; m = 4; min. observation = 6/3/3
AdaBoost | No. of trees = 100; α = 0.8/0.9999/0.9 | No. of trees = 100; α = 0.8/0.9/0.9999 | No. of trees = 100; α = 0.8/0.9/0.8
SVM-Linear | C = 3/0.1/0.12, ν = 0.5 | C = 3/0.12/0.1, ν = 0.5 | C = 3/0.12/0.1, ν = 0.5/0.5/0.9
SVM-Polynomial | C = 3/0.7/0.9, ν = 0.5/0.2/0.1, γ = auto, c = 3/5/3, d = 1 | C = 3/0.9/0.9, ν = 0.5/0.2/0.1, γ = auto, c = 3, d = 1 | C = 3/0.9/0.9, ν = 0.5/0.2/0.9, γ = auto, c = 3, d = 1
SVM-RBF | C = 3/1/1, ν = 0.5, γ = auto | C = 3/3/1, ν = 0.5, γ = auto | C = 3/3/1, ν = 0.5/0.5/0.2, γ = auto
SVM (all kernels) | Max. no. of iterations = 3000 (all datasets)
ANN | Activation function: identity; optimizer: L-BFGS-B; input neurons = 24; hidden neurons per layer = 50/50/50; output neurons = 1; α = 0.0001; max. iterations = 300 (all datasets)
Stacking Ensemble | Regularization: L2 ridge regression; α = 0.3 (all datasets)
Table 4. Results of ML Algorithms for Zhongli F1-AQI Prediction.

Method | Without Imputation (RMSE / MAE / R2) | With Imputation (RMSE / MAE / R2)
SVM-Polynomial | 9.836 / 8.275 / 0.923 | 8.145 / 6.827 / 0.947
SVM-RBF | 9.298 / 5.119 / 0.931 | 8.832 / 4.617 / 0.938
SVM-Linear | 7.659 / 6.050 / 0.953 | 6.790 / 5.217 / 0.963
Random Forest | 3.255 / 2.208 / 0.992 | 3.257 / 2.207 / 0.992
AdaBoost-Square | 3.291 / 2.187 / 0.991 | 3.337 / 2.185 / 0.991
AdaBoost-Linear | 3.328 / 2.191 / 0.991 | 3.308 / 2.189 / 0.991
AdaBoost-Exponential | 3.336 / 2.193 / 0.991 | 3.327 / 2.193 / 0.991
ANN | 3.572 / 2.438 / 0.990 | 3.378 / 2.396 / 0.991
Stacking Ensemble | 3.236 / 2.196 / 0.992 | 3.243 / 2.199 / 0.992
Table 5. Results of ML Algorithms for Zhongli F8-AQI Prediction.

Method | Without Imputation (RMSE / MAE / R2) | With Imputation (RMSE / MAE / R2)
SVM-Polynomial | 24.308 / 17.981 / 0.526 | 23.244 / 17.135 / 0.567
SVM-RBF | 23.375 / 17.283 / 0.562 | 23.358 / 17.278 / 0.563
SVM-Linear | 24.262 / 18.327 / 0.528 | 26.674 / 20.174 / 0.430
Random Forest | 17.471 / 12.408 / 0.755 | 17.477 / 12.413 / 0.755
AdaBoost-Square | 17.386 / 11.801 / 0.758 | 17.352 / 11.788 / 0.759
AdaBoost-Linear | 17.273 / 11.693 / 0.761 | 17.221 / 11.679 / 0.762
AdaBoost-Exponential | 17.283 / 11.691 / 0.761 | 17.284 / 11.685 / 0.761
ANN | 18.786 / 13.502 / 0.717 | 18.759 / 13.486 / 0.718
Stacking Ensemble | 17.167 / 11.804 / 0.764 | 17.178 / 11.799 / 0.764
Table 6. Results of ML Algorithms for Zhongli F24-AQI Prediction.

Method | Without Imputation (RMSE / MAE / R2) | With Imputation (RMSE / MAE / R2)
SVM-Polynomial | 33.639 / 24.799 / 0.098 | 34.194 / 25.034 / 0.068
SVM-RBF | 30.635 / 23.340 / 0.252 | 30.335 / 23.053 / 0.267
SVM-Linear | 37.001 / 28.904 / 0.091 | 36.835 / 28.595 / 0.081
Random Forest | 24.974 / 18.648 / 0.503 | 25.007 / 18.667 / 0.502
AdaBoost-Square | 24.219 / 16.724 / 0.533 | 24.226 / 16.753 / 0.532
AdaBoost-Linear | 24.039 / 16.586 / 0.540 | 24.074 / 16.614 / 0.538
AdaBoost-Exponential | 24.053 / 16.574 / 0.539 | 24.099 / 16.620 / 0.537
ANN | 29.150 / 21.957 / 0.323 | 29.113 / 21.927 / 0.325
Stacking Ensemble | 23.825 / 16.667 / 0.548 | 23.831 / 16.693 / 0.548
Table 7. Results of ML Algorithms for Changhua F1-AQI Prediction.

Method | Without Imputation (RMSE / MAE / R2) | With Imputation (RMSE / MAE / R2)
SVM-Polynomial | 12.419 / 11.090 / 0.907 | 14.116 / 12.683 / 0.880
SVM-RBF | 9.672 / 4.639 / 0.944 | 9.596 / 4.497 / 0.944
SVM-Linear | 9.169 / 7.055 / 0.949 | 10.033 / 7.638 / 0.939
Random Forest | 3.059 / 2.055 / 0.994 | 3.105 / 2.066 / 0.994
AdaBoost-Square | 3.093 / 2.046 / 0.994 | 3.126 / 2.054 / 0.994
AdaBoost-Linear | 3.089 / 2.043 / 0.994 | 3.115 / 2.048 / 0.994
AdaBoost-Exponential | 3.093 / 2.046 / 0.994 | 3.126 / 2.054 / 0.994
ANN | 3.914 / 2.505 / 0.991 | 3.870 / 2.541 / 0.991
Stacking Ensemble | 3.039 / 2.043 / 0.994 | 3.076 / 2.057 / 0.994
Table 8. Results of ML Algorithms for Changhua F8-AQI Prediction.

Method | Without Imputation (RMSE / MAE / R2) | With Imputation (RMSE / MAE / R2)
SVM-Polynomial | 26.225 / 20.082 / 0.585 | 25.818 / 19.919 / 0.598
SVM-RBF | 25.548 / 19.422 / 0.606 | 25.730 / 19.597 / 0.600
SVM-Linear | 31.189 / 23.412 / 0.413 | 25.628 / 19.623 / 0.603
Random Forest | 18.435 / 13.711 / 0.795 | 18.423 / 13.707 / 0.795
AdaBoost-Square | 17.877 / 12.747 / 0.807 | 17.871 / 12.734 / 0.807
AdaBoost-Linear | 17.825 / 12.732 / 0.808 | 17.810 / 12.718 / 0.809
AdaBoost-Exponential | 17.822 / 12.733 / 0.808 | 17.815 / 12.729 / 0.808
ANN | 20.451 / 15.329 / 0.748 | 20.312 / 15.213 / 0.751
Stacking Ensemble | 17.801 / 12.855 / 0.809 | 17.792 / 12.856 / 0.809
Table 9. Results of ML Algorithms for Changhua F24-AQI Prediction.

Method | Without Imputation (RMSE / MAE / R2) | With Imputation (RMSE / MAE / R2)
SVM-Polynomial | 40.662 / 31.689 / 0.006 | 41.834 / 32.727 / 0.052
SVM-RBF | 34.879 / 26.977 / 0.269 | 34.852 / 26.948 / 0.270
SVM-Linear | 37.451 / 29.047 / 0.157 | 37.092 / 28.703 / 0.173
Random Forest | 26.765 / 20.281 / 0.570 | 26.786 / 20.299 / 0.569
AdaBoost-Square | 26.282 / 18.781 / 0.585 | 26.288 / 18.799 / 0.585
AdaBoost-Linear | 25.786 / 18.204 / 0.600 | 25.817 / 18.246 / 0.599
AdaBoost-Exponential | 25.747 / 18.144 / 0.602 | 25.773 / 18.185 / 0.601
ANN | 30.919 / 23.753 / 0.426 | 30.803 / 23.647 / 0.430
Stacking Ensemble | 25.630 / 18.255 / 0.605 | 25.655 / 18.294 / 0.604
Table 10. Results of ML Algorithms for Fengshan F1-AQI Prediction.

Method | Without Imputation (RMSE / MAE / R2) | With Imputation (RMSE / MAE / R2)
SVM-Polynomial | 7.072 / 5.556 / 0.974 | 7.275 / 5.821 / 0.973
SVM-RBF | 9.119 / 5.542 / 0.957 | 8.324 / 4.702 / 0.964
SVM-Linear | 9.529 / 7.400 / 0.953 | 8.485 / 6.621 / 0.963
Random Forest | 2.971 / 1.869 / 0.995 | 2.979 / 1.868 / 0.995
AdaBoost-Square | 3.020 / 1.771 / 0.995 | 2.996 / 1.766 / 0.995
AdaBoost-Linear | 2.995 / 1.767 / 0.995 | 2.983 / 1.760 / 0.995
AdaBoost-Exponential | 3.020 / 1.771 / 0.995 | 2.996 / 1.766 / 0.995
ANN | 3.966 / 2.544 / 0.992 | 3.821 / 2.585 / 0.992
Stacking Ensemble | 2.925 / 1.823 / 0.996 | 2.921 / 1.814 / 0.996
Table 11. Results of ML Algorithms for Fengshan F8-AQI Prediction.

Method | Without Imputation (RMSE / MAE / R2) | With Imputation (RMSE / MAE / R2)
SVM-Polynomial | 25.658 / 19.806 / 0.659 | 25.512 / 19.606 / 0.663
SVM-RBF | 25.810 / 20.392 / 0.655 | 25.665 / 20.248 / 0.659
SVM-Linear | 36.292 / 26.859 / 0.318 | 29.598 / 23.028 / 0.546
Random Forest | 16.634 / 12.111 / 0.857 | 16.606 / 12.100 / 0.857
AdaBoost-Square | 16.440 / 11.399 / 0.860 | 16.498 / 11.391 / 0.859
AdaBoost-Linear | 16.367 / 11.364 / 0.861 | 16.339 / 11.367 / 0.862
AdaBoost-Exponential | 16.387 / 11.373 / 0.861 | 16.398 / 11.367 / 0.861
ANN | 19.112 / 14.285 / 0.811 | 18.975 / 14.182 / 0.814
Stacking Ensemble | 16.302 / 11.517 / 0.862 | 16.279 / 11.527 / 0.863
Table 12. Results of ML Algorithms for Fengshan F24-AQI Prediction.

Method | Without Imputation (RMSE / MAE / R2) | With Imputation (RMSE / MAE / R2)
SVM-Polynomial | 35.203 / 28.025 / 0.357 | 35.330 / 28.140 / 0.353
SVM-RBF | 37.696 / 30.300 / 0.263 | 35.368 / 28.485 / 0.351
SVM-Linear | 35.954 / 28.520 / 0.329 | 36.511 / 28.763 / 0.309
Random Forest | 23.388 / 17.476 / 0.716 | 23.384 / 17.485 / 0.716
AdaBoost-Square | 22.935 / 15.932 / 0.727 | 22.927 / 15.939 / 0.727
AdaBoost-Linear | 22.663 / 15.743 / 0.734 | 22.654 / 15.753 / 0.734
AdaBoost-Exponential | 22.708 / 15.777 / 0.733 | 22.723 / 15.790 / 0.732
ANN | 27.008 / 20.542 / 0.622 | 26.882 / 20.416 / 0.625
Stacking Ensemble | 22.872 / 16.372 / 0.729 | 22.618 / 16.105 / 0.735
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
