Abstract
Following the rapid development of various industrial sectors, air pollution frequently occurs in every corner of the world. As a dominant pollutant in Malaysia, particulate matter PM10 can cause highly detrimental effects on human health. This study aims to predict the daily average concentration of PM10 based on the data collected from 60 air quality monitoring stations in Malaysia. Building a forecasting model for each station is time-consuming and unrealistic; therefore, a hybrid model that combines the k-means clustering technique and the long short-term memory (LSTM) model is proposed to reduce the number of models and the overall model training time. Based on the training set, the stations were clustered using the k-means algorithm and an LSTM model was built for each cluster. Then, the prediction performance of the hybrid model was compared with the univariate LSTM model built independently for each station. The results show that the hybrid model has a comparable prediction performance to the univariate LSTM model, as it gives the relative percentage difference (RPD) less than or equal to 50% based on at least two accuracy metrics for 43 stations. The hybrid model can also fit the actual data trend well with a much shorter training time. Hence, the hybrid model is more competitive and suitable for real applications to forecast air quality.
1. Introduction
In line with the rapid development of various industrial sectors, air pollution frequently occurs worldwide, including in Malaysia. According to the World Health Organization (WHO) [1], air pollution is defined as the contamination of indoor and outdoor environments by impurities that modify the natural features of the environment. Data collected by WHO reveal that most of the global population breathes highly contaminated air that exceeds WHO guidelines. Air pollution can cause detrimental effects on human health, especially the respiratory system, and becomes one of the fundamental sources of morbidity and mortality [1].
In Malaysia, the air pollutant index (API) adopts six main air pollutants and serves as an indicator to deliver accurate and insightful information on air quality status in any area to the public [2]. Rani et al. [3] analyzed the trend of the API in Malaysia from years 2010 to 2015 based on various categories by using XLSTAT. In October 2010, the concentration of particulate matter 10 μm or less in diameter, better known as PM10, was extremely high in some areas in Johor following the occurrence of forest fires in Indonesia, which led to high API values as the highest relative subindex of monitored pollutants that account for the API readings [3,4]. This suggests that such fine dust often found in polluted air contributes greatly to the variability of the API [3].
Particulate matter is not just the main air pollutant in the Southeast Asia region, but is also identified as the most severe city pollutant around the globe [5,6]. For instance, most of the daily average PM10 concentrations at three monitoring stations in Buenos Aires from the years 2010 to 2018 exceeded the standard limit of the WHO guidelines, that is, 50 μg/m3 [7]. Some research findings highlight that the particulate matter concentrations have certain correlations with the weather conditions, four seasons and monsoons [8,9,10].
Due to the increasing public awareness of the dangers of air pollution, numerous air quality-related studies have been performed using various statistical and deep learning models, including forecasting and clustering. Clustering is an exploratory data analysis technique that investigates the fundamental structure of data [11]. By adopting the clustering technique, the data are assigned into several distinct groups based on their degree of similarity before any further analysis or modeling can be performed. As the data within the cluster can be treated using the same analysis technique, it can save costs and computation time. There are several types of clustering methods, such as partitional clustering, hierarchical clustering and fuzzy clustering. Hierarchical clustering groups similar objects into clusters that eventually merge into a single cluster, whereas fuzzy clustering is a soft-clustering technique in which the objects can be clustered into more than one cluster. As a partitional clustering method, the k-means algorithm is one of the most common and popular techniques since it can be implemented easily [12]. It classifies data with closer centroid values into the same cluster such that the differences between the clusters are maximized. For instance, k-means clustering was used to analyze the significant changes in air quality in Southampton [13]. While Kim et al. [14] applied this algorithm to cluster monitoring stations in the United States based on different temporal patterns of PM2.5, Beaver and Palazoglu [15] adopted it to classify classes of ozone episodes in San Francisco.
Air quality time series clustering in Malaysia is often used to identify the patterns between clusters and to categorize areas into zones based on the pollution level so that government policies can be executed accurately [16]. In this context, Suris et al. [17] clustered the PM10 data in Malaysia using dynamic time warping (DTW) as the dissimilarity measure. Adopting four clustering techniques, that is, k-means, partitioning around medoids (PAM), agglomerative hierarchical clustering (AHC) and fuzzy k-means (FKM), they found that the clusters were formed mainly on the basis of the region and geographical location of the stations instead of the station category and local economic activities. A similar result was obtained by Rahman et al. [11], whereby the stations were classified into high, medium and low pollution regions using the AHC technique based on the daily average PM2.5 concentration.
As climatic and environmental issues concern society, air quality forecasting has become a focus among researchers, as an accurate prediction can reduce the effect of pollution on humans and the biosphere [18]. Therefore, various types of prediction models have been applied in previous studies. For instance, Aditya et al. [19] used the logistic regression and autoregression (AR) models to detect air quality and predict the concentration of PM2.5. A similar approach is shown in the research by Bhalgat et al. [18], which adopted AR and autoregressive integrated moving average (ARIMA) models to predict the concentration of sulfur dioxide (SO2). Meanwhile, Guo et al. [20] used a geographically and temporally weighted regression model to calibrate the spatiotemporal dynamic PM2.5 concentrations to manage haze pollution in China. The random forest method is also deemed capable of modelling various concentrations of air pollutants, such as PM2.5 and ozone [21,22]. In fact, random forest regression is believed to predict air pollutant concentrations more accurately than linear regression and decision trees [23].
In recent years, neural networks have been preferred by researchers rather than the abovementioned traditional models due to their ability to fit non-linear data with higher accuracy [10]. The long short-term memory (LSTM) model is a deep learning method modified based on the concept of the recurrent neural network (RNN). Given its strength in solving the shortcomings of the RNN model, such as poor performance with tasks that involve long-term dependency and a vanishing and exploding gradient, the LSTM is found to be suitable to predict sequential data, including time series data. The outstanding performance of the LSTM model is observed through a lower root mean squared error (RMSE) in predicting the prices of gold [24] and Bitcoin [25], as well as influenza-like illnesses and respiratory diseases [26].
In terms of air quality prediction, the LSTM model also possesses great potential to give an accurate result [27]. The findings obtained by Bakar et al. [28] show that the multivariate LSTM model predicted the PM10 concentration at five selected monitoring stations most accurately with the lowest RMSE values, followed by the univariate LSTM model and the univariate ARIMA model. Aiming to increase prediction accuracy, hybrid models that involve a combination of techniques are gaining popularity in the research field. Zhang et al. [29] discovered that the combination of principal component analysis (PCA) and least squares support vector machine (LSSVM) can reduce the noise in meteorological data, hence giving more accurate predictions in API than the ARIMA model. The PCA–ANN model that uses only the significant parameters also seems competitive in giving a better prediction than the standalone artificial neural network (ANN) model [30].
A clustering-based LSTM model considers the changes in features that are more specific to each cluster, making it an ideal choice to improve prediction accuracy. Yulita et al. [31] utilized fuzzy clustering and bidirectional LSTM (Bi-LSTM) to obtain higher accuracy and precision in classifying sleep stages. In accordance with the findings obtained in the study on the load prediction for dynamic spectrum allocation performed by Liu et al. [32] using AHC–LSTM, Li et al. [33] also found that type-2 fuzzy clustering-based LSTM can increase the accuracy with a much shorter model training time in long-term traffic volume prediction than the LSTM, random forest, back propagation network (BPN) and deep neural network (DNN).
Besides the abovementioned combinations, k-means clustering is also one of the widely used techniques in hybrid models. Ao et al. [10] first clustered meteorological data according to seasons using the k-means algorithm, then combined the clustering results with the air pollutant concentrations to be input into the Bi-LSTM model. It was found that the proposed model outperforms the other models as it can overcome the continuous fluctuation in meteorological conditions. Using the k-means–LSTM model, Baca et al. [34] also obtained a better air quality prediction in Andahuaylas, Peru.
Air quality prediction is indeed important for society to take preliminary preparations and preventive measures against poor air conditions. In order to figure out the potential of the hybrid model in predicting the daily average PM10 concentration in Malaysia, this study proposes a clustering-based LSTM model and compares its performance with the univariate LSTM model without clustering. Being a state-of-the-art deep learning method, the LSTM model usually outperforms conventional forecasting models in prediction accuracy. However, it is too time-consuming and unrealistic to construct the model individually for each station, especially in real-life applications. If the model is trained based on a few samples and generalizes its finding to all stations, it might cause an undesirably low accuracy at some stations outside the sampling. Therefore, such a combination of techniques is deemed capable of increasing the prediction accuracy with much less computation time, thus proving to be more efficient than the classical forecasting technique.
2. Materials and Methods
2.1. Data Preprocessing
The data used in this study are the daily average PM10 concentrations monitored at 60 air quality monitoring stations in Malaysia from 5 July 2017 to 31 January 2019, provided by the Malaysian Department of Environment (DOE). The dataset, with a length of 576 days for each time series, was divided into the training set and test set based on a ratio of 8:2 [18,26,35]. Data normalization was carried out in order to eliminate the effect of a wide range observed in the PM10 concentration, to speed up the training process and to increase prediction accuracy [35]. The training data was scaled into a range of [0, 1] using the min–max scaler as follows:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
where $x'$ and $x$ refer to the scaled data and the original data, respectively, whereas $x_{\min}$ and $x_{\max}$ represent the minimum and maximum values of the data, respectively.
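The scaling step can be sketched as below. This is a minimal illustration only: the function names and the toy values are hypothetical, and reusing the training minimum and maximum to scale the test set is an assumption, since the text only states that the training data was scaled.

```python
def fit_min_max(train):
    """Learn the minimum and maximum from the training values only."""
    return min(train), max(train)


def min_max_scale(values, x_min, x_max):
    """Apply x' = (x - x_min) / (x_max - x_min) to every value."""
    span = x_max - x_min
    if span == 0:
        raise ValueError("a constant series cannot be min-max scaled")
    return [(x - x_min) / span for x in values]


def inverse_min_max(scaled, x_min, x_max):
    """Map scaled values back to the original concentration units."""
    return [s * (x_max - x_min) + x_min for s in scaled]


# Made-up PM10 values, for illustration only.
train = [20.0, 35.0, 50.0, 80.0]
test = [30.0, 95.0]

x_min, x_max = fit_min_max(train)
train_scaled = min_max_scale(train, x_min, x_max)
# Assumption: the test set is scaled with the training statistics,
# so test values can fall outside [0, 1].
test_scaled = min_max_scale(test, x_min, x_max)
restored = inverse_min_max(train_scaled, x_min, x_max)
```

Under this assumption, a test value above the training maximum maps to a number greater than 1, which is expected behavior rather than an error.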
2.2. Time Series K-Means Clustering
The k-means approach is a partitional clustering technique that decomposes the data into a set of disjoint clusters based on the nearest centroids.
Let $X = \{x_{ij} : i = 1, \ldots, n;\ j = 1, \ldots, p\}$ be a data matrix, where $x_{ij}$ represents the $j$-th variable observed for the $i$-th object. According to Kobylin and Lyashenko [36], the k-means algorithm usually adopts the Euclidean distance as the proximity measure:
$$d(x_i, x_{i'}) = \sqrt{\sum_{j=1}^{p} \left(x_{ij} - x_{i'j}\right)^2}$$
This distance measure has been proven competitive in terms of time series classification accuracy [37].
Additionally, the shape-based DTW distance can also be implemented to measure the proximity in time series clustering. Despite being a good similarity and dissimilarity measure [17], this approach typically consumes more computation time due to its dynamic and complicated calculations [38]. Since the time series data are of the same length, the Euclidean distance has been chosen as the proximity measure [39].
The procedure for time series k-means clustering is as follows:
- (i) Initialize the $k$ clusters from randomly chosen cluster centroids;
- (ii) Allocate each datapoint to the nearest cluster by employing the Euclidean distance;
- (iii) Recompute the cluster centroids based on the current cluster members;
- (iv) Repeat steps (ii) and (iii) until there are no changes in the cluster membership.
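The four steps above can be sketched in plain Python as follows. This is an illustrative sketch, not the implementation used in the study: the function names, the seed handling, the iteration cap and the toy series are all assumptions.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length time series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_series(members):
    """Pointwise mean of a non-empty list of equal-length series."""
    n = len(members)
    return [sum(column) / n for column in zip(*members)]


def kmeans(series, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step (i): pick k of the series at random as initial centroids.
    centroids = [list(s) for s in rng.sample(series, k)]
    labels = None
    for _ in range(max_iter):
        # Step (ii): assign each series to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(s, centroids[c]))
            for s in series
        ]
        # Step (iv): stop once the membership no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step (iii): recompute each centroid from its current members.
        for c in range(k):
            members = [s for s, lab in zip(series, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_series(members)
    return labels, centroids


# Toy scaled series: two "low" stations and two "high" stations.
toy = [
    [0.1, 0.2, 0.1],
    [0.2, 0.1, 0.2],
    [0.8, 0.9, 0.8],
    [0.9, 0.8, 0.9],
]
labels, centroids = kmeans(toy, k=2)
```

On this toy input the two low series end up in one cluster and the two high series in the other, although which cluster receives label 0 depends on the random initialization.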
The k-means algorithm classifies a time series into clusters in such a way that the within-group sum of squares (WGSS) is minimized. According to Maharaj et al. [40], the objective function of the k-means clustering is as follows:
$$\min \sum_{i=1}^{n} \sum_{c=1}^{k} u_{ic}\, d^2(x_i, h_c)$$
where $h_c$ is the centroid of the $c$-th cluster and $u_{ic}$ is the degree of membership of the $i$-th object in the $c$-th cluster, which takes a value in $\{0, 1\}$. If $u_{ic} = 1$, it indicates that the $i$-th object is in the $c$-th cluster. On the contrary, $u_{ic} = 0$ shows that the $i$-th object is not in the $c$-th cluster.
Choosing an optimum number of clusters can be a challenging task. In this study, the optimal $k$ is chosen based on internal indices, that is, the WGSS visualized on the elbow plot and the silhouette index. For each time series, the error is defined as the distance to the nearest cluster centroid [41].
The $k$ that gives the highest gradient and the sharpest elbow curve is chosen as the candidate before it is evaluated by the silhouette index, as shown below:
$$s = \frac{b - a}{\max(a, b)}$$
where $a$ is the average distance within the cluster and $b$ represents the average distance between the clusters. This index is a metric that evaluates the accuracy of a clustering technique based on scores between −1 and 1. A coefficient of 1 indicates that the clusters are well separated and clearly distinguished, whereas a score of −1 means that the clusters are not appropriately partitioned. If the silhouette index has a value of 0, it shows that the distance between the clusters is insignificant. Therefore, a higher index score indicates a better separation of the clusters [42,43].
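Both internal indices can be computed directly from a clustering result, as in the sketch below. It assumes the common per-object form of the silhouette (mean distance to the other members of the object's own cluster versus the smallest mean distance to another cluster), averaged over all objects; the function names and toy data are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def wgss(series, labels, centroids):
    """Within-group sum of squares: squared distance to the own centroid."""
    return sum(
        euclidean(s, centroids[lab]) ** 2 for s, lab in zip(series, labels)
    )


def silhouette_index(series, labels):
    """Average of (b - a) / max(a, b) over all objects."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("the silhouette index needs at least two clusters")
    scores = []
    for i, s in enumerate(series):
        own = [
            euclidean(s, t)
            for j, t in enumerate(series)
            if labels[j] == labels[i] and j != i
        ]
        if not own:
            scores.append(0.0)  # convention for single-member clusters
            continue
        a = sum(own) / len(own)
        b = min(
            sum(euclidean(s, t) for j, t in enumerate(series) if labels[j] == c)
            / labels.count(c)
            for c in clusters
            if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


toy = [
    [0.1, 0.2, 0.1],
    [0.2, 0.1, 0.2],
    [0.8, 0.9, 0.8],
    [0.9, 0.8, 0.9],
]
good_labels = [0, 0, 1, 1]
bad_labels = [0, 1, 0, 1]
good_centroids = [[0.15, 0.15, 0.15], [0.85, 0.85, 0.85]]

toy_wgss = wgss(toy, good_labels, good_centroids)
good_score = silhouette_index(toy, good_labels)
bad_score = silhouette_index(toy, bad_labels)
```

The well-separated partition scores close to 1, whereas the deliberately mixed partition scores below 0, which mirrors the interpretation of the index given above.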
2.3. LSTM Model
2.3.1. Introduction
An LSTM model is the extension of RNN and is capable of learning long-term dependency and storing the information for a long period. These characteristics of LSTM make it a state-of-the-art model, especially in time series prediction, which highly depends on the changing patterns of previous values.
Generally, the chain-like LSTM structure consists of three gates that control the flow of information in the memory cell, namely, the forget gate, input gate and output gate. In every cell, there are two types of non-linear activation functions, that is, the sigmoid function and the hyperbolic tangent (tanh) function. The other components of the LSTM cell include the cell state and hidden state. At each gate, there exist weights, $W$, and biases, $b$.
According to Colah [44], the key to LSTM is the cell state, which is the horizontal line running through the top of the diagram shown in Figure 1.
Figure 1.
LSTM cell structure.
The cell state runs straight down the entire chain with a few minor linear interactions. Information can flow along the cell state under the control of three gates that are composed of a sigmoid neural net layer and a pointwise multiplication operation. The sigmoid layer gives an output between 0 and 1 to indicate how much of each component should be let through. None of the information can flow through the gates when a value of 0 is output. On the other hand, a value of 1 indicates all the information can be let through.
The process in the LSTM cell begins at the forget gate, whereby the sigmoid layer determines what information needs to be removed from the cell state. Looking at the former hidden state $h_{t-1}$ and input data $x_t$, it outputs a value between 0 and 1 for each number in the former cell state $C_{t-1}$. This process can be described by the following equation:
$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)$$
Next, the new information to be stored in the cell state will be determined in two steps. Firstly, the sigmoid layer at the input gate will determine which values are to be updated. Secondly, the tanh layer will produce a vector of new candidate values that could be added to the cell state. These processes can be expressed as follows:
$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right)$$
$$\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)$$
Then, a combination of the outputs will be used to update the former cell state $C_{t-1}$ into the new cell state $C_t$. The former cell state is multiplied by $f_t$ to lose the decided information before it is added to the product of $i_t$ and $\tilde{C}_t$. These are the new candidate values that have been scaled by how much each cell state value should be updated. The process is described by the following equation:
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$
Finally, the output gate decides what information should be output based on the filtered cell state. Firstly, the former hidden state and the input data will be run through the sigmoid layer to decide which part is to be eliminated. Then, the cell state will be put through the tanh layer to generate the values between −1 and 1 before multiplying by the output from the sigmoid layer. Eventually, only the decided portion will be output. The following equations summarize the processes that occur at the output gate:
$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right)$$
$$h_t = o_t * \tanh(C_t)$$
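A single forward step through the forget, input and output gates can be written out as below. This is a didactic sketch in plain Python, not the trained model from the study: the parameter dictionary, its key names and the zero-valued weights are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Compute W . v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM cell update; returns the new hidden and cell states."""
    v = h_prev + x_t  # concatenation of the former hidden state and the input
    # Forget gate: how much of each former cell state value to keep.
    f_t = [sigmoid(z) for z in affine(params["W_f"], v, params["b_f"])]
    # Input gate and candidate values: what new information to store.
    i_t = [sigmoid(z) for z in affine(params["W_i"], v, params["b_i"])]
    c_tilde = [math.tanh(z) for z in affine(params["W_C"], v, params["b_C"])]
    # Cell state update: forget part of the old state, add scaled candidates.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    # Output gate: which part of the squashed cell state to emit.
    o_t = [sigmoid(z) for z in affine(params["W_o"], v, params["b_o"])]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Hypothetical parameters: hidden size 1, input size 1, all weights zero,
# so every gate outputs sigmoid(0) = 0.5 and the candidate is tanh(0) = 0.
zero_params = {
    name: [[0.0, 0.0]] for name in ("W_f", "W_i", "W_C", "W_o")
}
zero_params.update({name: [0.0] for name in ("b_f", "b_i", "b_C", "b_o")})

h_t, c_t = lstm_step(x_t=[0.3], h_prev=[0.0], c_prev=[1.0], params=zero_params)
```

With all-zero parameters the cell simply halves the former cell state, which makes the role of each gate easy to check by hand before moving to learned weights.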
2.3.2. Multivariate LSTM Model
As more than one feature is considered when constructing the hybrid model for each cluster, the LSTM model is said to be multivariate. In this study, the mean squared error (MSE) was adopted as the loss function.
Adaptive moment estimation (Adam) was employed to update the weights in the neural network based on the training data. The number of epochs was set as 100. Aiming to avoid overfitting, early stopping was employed to stop the training whenever there was no improvement in the model performance for 15 consecutive epochs [45].
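The stopping rule can be illustrated with a small simulation. The sketch assumes that "model performance" means a monitored loss that should decrease and that the first epoch without improvement counts toward the 15-epoch limit; the function name and the made-up loss curve are hypothetical.

```python
def run_with_early_stopping(epoch_losses, max_epochs=100, patience=15):
    """Replay a sequence of per-epoch losses under an early-stopping rule.

    Returns (epochs_run, best_epoch, best_loss).
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # consecutive epochs without improvement
    epochs_run = 0
    for epoch, loss in enumerate(epoch_losses, start=1):
        if epoch > max_epochs:
            break
        epochs_run = epoch
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break  # no improvement for `patience` consecutive epochs
    return epochs_run, best_epoch, best_loss


# Made-up loss curve: improves for 10 epochs, then plateaus at a worse value.
losses = [1.0 / e for e in range(1, 11)] + [0.2] * 90
epochs_run, best_epoch, best_loss = run_with_early_stopping(losses)
```

In this toy run the best loss occurs at epoch 10 and training halts at epoch 25, well before the 100-epoch cap.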
The optimum values for other hyperparameters, such as the dropout rate, hidden neuron, timestep, batch size and hidden layer, were determined by using the manual tuning approach to obtain the best model performance at the training stage.
2.3.3. Univariate LSTM Model
A univariate LSTM model is a model that is trained based on one feature only, that is, it only involves one time series. The model construction process is the same as in the multivariate LSTM model, except for the number of input features.
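For a single time series, the timestep hyperparameter determines how past values are arranged into input windows. The sketch below assumes one-step-ahead forecasting, which the text does not state explicitly; the function name and toy series are hypothetical.

```python
def make_windows(series, timestep):
    """Turn one series into (inputs, targets) for one-step-ahead prediction.

    Each input holds `timestep` consecutive values and the target is the
    value that immediately follows them.
    """
    inputs, targets = [], []
    for start in range(len(series) - timestep):
        inputs.append(series[start:start + timestep])
        targets.append(series[start + timestep])
    return inputs, targets


toy_series = [1, 2, 3, 4, 5]
X, y = make_windows(toy_series, timestep=3)
# X holds [1, 2, 3] and [2, 3, 4]; y holds 4 and 5.
```

A multivariate variant would carry one column per feature in every window, whereas the univariate case shown here has a single column.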
2.4. Comparison of Model Prediction Performance
There are three accuracy metrics adopted as the prediction performance indicators for the constructed models in this study, namely RMSE, mean absolute error (MAE) and mean absolute percentage error (MAPE).
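The three metrics are not written out in the text, so the sketch below uses their standard textbook definitions, with MAPE expressed as a percentage; the function names and sample values are illustrative.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean absolute error."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n


def mape(actual, predicted):
    """Mean absolute percentage error, in percent; actual values must be non-zero."""
    n = len(actual)
    return 100.0 / n * sum(abs((a - p) / a) for a, p in zip(actual, predicted))


# Made-up concentrations, for illustration only.
actual = [50.0, 40.0, 25.0]
predicted = [45.0, 44.0, 30.0]

toy_rmse = rmse(actual, predicted)  # sqrt(22), about 4.69
toy_mae = mae(actual, predicted)    # 14 / 3, about 4.67
toy_mape = mape(actual, predicted)  # about 13.33 %
```

RMSE penalizes large errors more heavily than MAE, while MAPE is scale-free, which is one reason for reporting all three together.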
Then, the relative percentage difference (RPD) was calculated for each accuracy metric to compare the prediction performance between both models. Generally, the RPD is computed using the following formula:
$$\mathrm{RPD} = \frac{\left|x_1 - x_2\right|}{\left(x_1 + x_2\right)/2} \times 100\%$$
where $x_1$ and $x_2$ are the values measured by the first and second methods, respectively, which are the values obtained from the proposed hybrid model and univariate LSTM model in this case. The RPD is a common method to compare two experimental values when there is no theoretical value as a reference [46]. A good RPD value can be defined based on the types of experiments. In general, an acceptable RPD value ranges from 0% to 50% [47].
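As a worked example, the RPD of two metric values can be computed as below, assuming the usual definition of the absolute difference divided by the mean of the two values. The function name and the sample numbers are hypothetical.

```python
def rpd(x1, x2):
    """Relative percentage difference between two measurements, in percent."""
    mean = (x1 + x2) / 2.0
    if mean == 0:
        raise ValueError("RPD is undefined when both values are zero")
    return abs(x1 - x2) * 100.0 / mean


# Made-up RMSE values for a hybrid and a univariate model at one station.
within_range = rpd(6.0, 4.0)   # |6 - 4| / 5 * 100, about 40 %
outside_range = rpd(9.0, 3.0)  # |9 - 3| / 6 * 100, about 100 %
```

Because the denominator is the mean of the two values, the measure is symmetric: swapping the two models gives the same RPD.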
2.5. Framework
This study involves three main components, namely, the time series clustering phase, the modeling phase and a comparison of the model prediction performance, as summarized in Figure 2.
Figure 2.
Flow chart of the framework.
As the first step to constructing the proposed model, the air quality monitoring stations were grouped into clusters by utilizing the time series k-means clustering approach based on the training set. Then, a multivariate LSTM model was trained for each cluster. Combined with the clustering results, the observed values in the test set were compared with the corresponding predicted values based on RMSE, MAE and MAPE.
After that, a univariate LSTM model was constructed independently for each station by using the same hyperparameter settings as its corresponding hybrid model. Hence, a total of 60 univariate LSTM models were built. Similar to the proposed model, the prediction performance for each univariate model was measured based on three accuracy metrics. Lastly, the prediction accuracy was compared between both models by using RPD.
3. Results and Discussion
3.1. Descriptive Analysis
The dataset was split into a training set and a test set by a ratio of 8:2, where the training set consists of data ranging from 5 July 2017 to 30 September 2018 and the test set comprises the last four months, that is, from 1 October 2018 to 31 January 2019.
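The lengths of the two subsets follow from the stated dates, as the short check below shows. It counts both end dates as part of each period; the variable names are illustrative.

```python
from datetime import date


def n_days(start, end):
    """Number of calendar days from start to end, both inclusive."""
    return (end - start).days + 1


total_days = n_days(date(2017, 7, 5), date(2019, 1, 31))
train_days = n_days(date(2017, 7, 5), date(2018, 9, 30))
test_days = n_days(date(2018, 10, 1), date(2019, 1, 31))

train_share = train_days / total_days
test_share = test_days / total_days
```

This gives 453 training days and 123 test days out of 576, that is, roughly 78.6% and 21.4%, which is close to the nominal 8:2 ratio.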
Table 1 shows the minimum value, maximum value and quartiles for the whole dataset.
Table 1.
Minimum value, maximum value and quartiles for the whole dataset (μg/m3).
3.2. Time Series K-Means Clustering
Before the clustering and modeling phases were carried out, the training set was scaled into a range of [0, 1] by adopting min–max normalization. Then, the 60 monitoring stations were clustered based on the k-means algorithm. To identify the optimum number of clusters, the values of WGSS were calculated for a range of candidate values of $k$ and visualized in Figure 3.
Figure 3.
Elbow plot.
By using the elbow method, the optimum number of clusters was estimated to be between $k = 2$ and 4. To further validate the goodness of separation, the silhouette index was applied to the identified candidates. Table 2 shows the silhouette scores for each number of clusters.
Table 2.
Silhouette scores for each number of clusters.
Based on the table above, $k = 2$ has the highest silhouette score, while the lowest index occurs at a larger $k$. A higher index score indicates a better partitioning of the data, hence $k = 2$ is said to be the optimum number of clusters.
The clustering results show that Cluster 1 consists of 19 stations, whereas Cluster 2 comprises 41 stations. Table 3 lists the cluster membership for the daily average PM10 concentration according to the stations.
Table 3.
Cluster membership for daily average PM10 concentration according to stations.
Figure 4 shows the distribution of stations according to clusters.
Figure 4.
Distribution of air quality monitoring stations according to clusters.
It was found that most stations in Cluster 1 are in the more developed states along the west coast of Peninsular Malaysia, such as Selangor, Perak, Pulau Pinang and Kuala Lumpur. On the other hand, Cluster 2 is mainly made up of stations that are widely distributed in the less developed states around the east coast of Peninsular Malaysia and east Malaysia, including Terengganu, Kelantan, Sabah and Sarawak.
Moreover, the number of stations based on categories according to the clusters is shown in Figure 5.
Figure 5.
Bar chart for the number of stations based on categories according to clusters.
The figure above demonstrates that most stations in Cluster 1 are located in suburban and urban areas in Klang Valley, with only one station each in the rural and industrial categories. In addition, the majority of the stations in Cluster 2 are categorized as suburban, followed by rural, industrial and urban. It was also observed that Cluster 2 has more stations located in suburban, rural and industrial areas than Cluster 1, which has more urban stations.
After classifying the test set into the clusters, the minimum values, maximum values and quartiles according to the clusters are tabulated in Table 4.
Table 4.
Minimum values, maximum values and quartiles according to clusters (μg/m3).
Table 4 highlights that the range of the daily average PM10 concentration for the whole dataset in Cluster 2, that is, 231.45 μg/m3, is much higher than the range of 173.66 μg/m3 in Cluster 1. This may arise because the stations are mainly spread across neighboring states that receive a similar level of haze pollution carried by the monsoon winds [8,9]. On the other hand, the median of the daily average concentration of PM10 for the whole dataset in Cluster 1 is higher than that in Cluster 2 by 9.08 μg/m3. Such a circumstance is believed to be closely related to the fact that most stations in Cluster 1 are in highly developed areas, including Klang Valley and Pulau Pinang [11].
The time plots of the daily average concentration of PM10 for the training set and test set of the selected stations in each cluster are extracted and visualized in Figure 6 and Figure 7, respectively.
Figure 6.
Time plots of daily average concentration of PM10 for training set of selected stations in each cluster: (a) Cluster 1; (b) Cluster 2.
Figure 7.
Time plots of daily average concentration of PM10 for test set of selected stations in each cluster: (a) Cluster 1; (b) Cluster 2.
From Figure 6, it can be seen that the stations within each cluster have a similar and stable time series pattern across the time range, except for a few spikes observed during a certain period. The drastic increase in the concentration of PM10 for both clusters around August until mid-September 2018 seems to be closely associated with the transboundary haze that affected most areas of Malaysia at that point.
According to Yusof [48], unhealthy API readings were recorded in some states due to haze originating from North Sumatra and West Kalimantan at the time. The situation became worse and lasted until September as the southwest monsoon wind blew toward Peninsular Malaysia. Some states also experienced hot and dry climates with less rainfall, giving rise to the increase in the daytime temperature. Such weather caused wildfires in certain locations, for instance, the occurrence of peatland fires in Klang, Selangor [49]. As a result, the air quality deteriorated at station CA21B in Klang, where the daily average concentration of PM10 rose to the maximum value of 180.23 μg/m3 in Cluster 1.
Referring to the time plots in Cluster 2, the highest daily average concentration of PM10 during the hazy period was recorded by station CA55Q, which is located in Permyjaya, Miri, Sarawak. This situation was deemed to be primarily driven by the forest fires at the nearby Industrial Training Institute, Permyjaya, which reduced the air quality in Miri and worsened the hazy conditions. According to Kawi [50], the API reading in Miri reached an unhealthy level of 130 in the morning on 19 August 2018. In conjunction with the nearly unhealthy API readings caused by the wildfire smoke from West Kalimantan, Indonesia, the PM10 concentration at other stations in Sarawak, such as Bintulu, Mukah, Sibu and Sarikei, also reported an increase during the hazy period.
Generally, the values of the test set are lower than those of the training set, not exceeding 75 μg/m3 in both clusters, as shown in Figure 7. This leads to a small difference of 3.41 μg/m3 in the data range between the two clusters based on Table 4.
In a nutshell, the time series k-means clustering has assigned the stations into two clusters with a size of 19 and 41 stations, respectively. This result forms the basis of the proposed model.
3.3. Construction of Hybrid Models
A multivariate LSTM model was trained based on the training set for each cluster. An optimum setting of the values of the hyperparameters was tuned manually to achieve the best model performance in the training phase. After a few trials, it was found that the models for both clusters perform well under the same hyperparameter settings as tabulated in Table 5.
Table 5.
Optimum hyperparameter settings according to clusters.
By applying the settings above, the MSE and RMSE, as well as the computation time were computed to evaluate the fitness of the hybrid models to the training set, as shown in Table 6.
Table 6.
Model performance of hybrid models and computation time in training phase.
As depicted in the table above, the RMSE values for both of the hybrid models are significantly low in the training phase, indicating that the constructed models can learn the trend of the training set well. In terms of the training time, both models required a similar duration, between 83 s and 85 s.
3.4. Construction of Univariate LSTM Models
By using the same hyperparameter settings as the corresponding hybrid models, as shown in Table 5, a univariate LSTM model was constructed independently for each station. The model performance and computation time are recorded in Table 7 to assess the degree of fitness of each model to the training set.
Table 7.
Model performance of univariate LSTM models and computation time in training phase.
Overall, the RMSE values for the univariate LSTM models during the training phase are higher than those of the hybrid models, indicating a poorer fit to the training set. Nevertheless, 38 stations have RMSE values lower than 0.1 in the training phase. In addition, about 74 s to 99 s were needed to train each univariate model.
3.5. Comparison of Prediction Performance between Hybrid Models and Univariate LSTM Models
The prediction performance was computed by comparing the predicted values and the actual test data based on three accuracy metrics, namely RMSE, MAE and MAPE. Then, the difference in prediction performance between the two models was measured based on RPD for each metric. If a model has a smaller value than another for at least two metrics, then it is said to have a better prediction performance. Moreover, a hybrid model is said to have comparable prediction accuracy to the univariate model if the RPD values are less than or equal to 50%. Table 8 displays the abovementioned values for all the stations; the smaller values of accuracy metrics and RPD values below or equal to 50% are listed in bold.
Table 8.
Comparison of prediction performance between hybrid models and univariate LSTM models.
Based on Table 8, the hybrid model recorded a lower value for at least two accuracy metrics at two stations in Cluster 1, namely CA16W and CA17W. Despite having a better prediction performance for most stations, the univariate model does not significantly outperform the hybrid model based on RPD values. This is because the RPD values are more than 50% for at least two accuracy metrics at only four stations, which are CA21B, CA22B, CA33J and CA34J. Hence, it can be concluded that the proposed model has a competitive prediction performance in Cluster 1.
On the other hand, it is highlighted that the proposed model is capable of giving a more accurate prediction for station CA02K based on much lower RMSE, MAE and MAPE values compared to the univariate model. Focusing on the RPD values, the prediction performance of the proposed model only varies significantly from the univariate model at 13 stations in Cluster 2.
There are 39 stations with an RPD less than or equal to 50% for RMSE. Among these stations, 12 of them have RPD values within 0–10%, 6 stations have RPD around 10–20%, 10 stations and 3 stations have a range of 20–30% and 30–40%, respectively, while the rest have RPD values within 40–50%. Meanwhile, most of the satisfactory RPD values based on MAE fall in the range of 0–10% (12 stations), followed by the range of 30–40% (9 stations), 10–20% and 20–30% (8 stations, respectively) and 40–50% (6 stations). Lastly, 47 stations have an RPD less than or equal to 50% for MAPE. It is observed that most of the RPD values based on MAPE fall in the range of 0–10% (18 stations), followed by 10–20% (12 stations), 20–30% (7 stations), 40–50% (6 stations) and 30–40% (4 stations). In short, the hybrid model can output a competitive prediction performance compared to the univariate model, as it records an acceptable range of RPD values based on all three metrics.
If the prediction performance of the hybrid model does not vary significantly from that of the univariate model, based on the RPD for at least two accuracy metrics at a station, then the proposed model can be considered suitable for forecasting the PM10 concentration at that station. From Table 8, the hybrid model could potentially be adopted as the PM10 prediction model at 43 stations (71.67%), whereas the univariate LSTM model is more suitable for the stations in Johor, Terengganu and Sabah.
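The station-level decision rule can be expressed as a short check. The sketch below assumes the 50% threshold and the two-out-of-three-metrics criterion stated above; the function name and the RPD values are hypothetical and serve only to illustrate the rule.

```python
from typing import Mapping


def hybrid_is_suitable(rpd_by_metric: Mapping[str, float],
                       threshold: float = 50.0,
                       min_metrics: int = 2) -> bool:
    """Return True when at least `min_metrics` RPD values are <= threshold."""
    within = sum(1 for value in rpd_by_metric.values() if value <= threshold)
    return within >= min_metrics


# Hypothetical RPD values (%) per station; not values from Table 8.
stations = {
    "S1": {"RMSE": 12.0, "MAE": 35.0, "MAPE": 8.0},
    "S2": {"RMSE": 62.0, "MAE": 48.0, "MAPE": 41.0},
    "S3": {"RMSE": 71.0, "MAE": 66.0, "MAPE": 30.0},
}
suitable = [name for name, r in stations.items() if hybrid_is_suitable(r)]
share = 100.0 * len(suitable) / len(stations)
print(suitable, f"{share:.2f}%")
```

With the counts reported in the text, the same share calculation gives 43/60 = 71.67%.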
Figure 8 shows the actual and predicted values for selected stations from both clusters. Both models can fit the actual data trend well for stations CA10A (Cluster 1) and CA01R (Cluster 2). Plots from other stations were also investigated and similar results were observed.
Figure 8.
Actual and predicted values for selected stations from each cluster: (a) CA10A (Cluster 1); (b) CA01R (Cluster 2).
To summarize, the prediction accuracy of the hybrid model does not deviate significantly from that of the univariate model, as the RPD values are within the acceptable 50% range for at least two accuracy metrics at 43 stations (71.67% of the stations). This demonstrates the capability of the hybrid model to predict the PM10 concentration at a similar accuracy level to the univariate model. Furthermore, the hybrid model can capture and fit the actual data trend quite well for most stations, with a much shorter computation time than the univariate LSTM model. This is because only one hybrid model is constructed for each cluster, whereas a univariate model is constructed individually for each station, leading to a total model training time of 4951.842 s for 60 univariate models and just 168.237 s for two hybrid models. This much shorter computation time, achieved without a notable loss in prediction performance or trend fitting, makes the hybrid model a more suitable forecasting model.
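The size of the time saving follows directly from the two totals reported above. The short calculation below uses only those figures; the variable names are illustrative.

```python
univariate_total_s = 4951.842  # total training time for 60 univariate models
hybrid_total_s = 168.237       # total training time for 2 hybrid models

speedup = univariate_total_s / hybrid_total_s
reduction_pct = 100.0 * (1.0 - hybrid_total_s / univariate_total_s)
mean_univariate_s = univariate_total_s / 60  # average time per univariate model
mean_hybrid_s = hybrid_total_s / 2           # average time per hybrid model

print(f"{speedup:.1f}x faster, {reduction_pct:.1f}% less training time")
print(f"per model: {mean_univariate_s:.1f} s vs {mean_hybrid_s:.1f} s")
```

On these figures, the average training time per model is similar for both approaches (about 82.5 s and 84.1 s), so the saving comes almost entirely from training 2 models instead of 60.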
Nevertheless, hazy conditions during certain periods of the training set affected the air quality at each location to a different degree, which is one of the factors behind the better prediction accuracy of the univariate LSTM model at some stations. The PM10 concentration increases drastically during hazy days in conjunction with the high emissions of particulate matter and greenhouse gases. On the other hand, PM10 is at a low concentration during normal days, as the aerosol particles are released by mobile sources, including motor vehicles, and stationary sources, such as factories [6]. Because the hybrid model uses the data from all the stations within the same cluster to predict the PM10 values, without accounting for the localized pollution level as the univariate model does, it may tend to overestimate PM10 at stations that are less affected by the transboundary haze.
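The pooling effect described above can be illustrated with a minimal sketch. It assumes that a cluster-level model is trained on sliding windows stacked from every station in the cluster; the function names, the lookback length and the sample values are hypothetical and are not taken from the study.

```python
def make_windows(series, lookback):
    """Return (inputs, targets): each input holds `lookback` consecutive
    values and the target is the value that immediately follows."""
    inputs, targets = [], []
    for i in range(len(series) - lookback):
        inputs.append(series[i:i + lookback])
        targets.append(series[i + lookback])
    return inputs, targets


def pooled_training_set(cluster_series, lookback):
    """Stack the windows of every station in one cluster into one set."""
    X, y = [], []
    for series in cluster_series.values():
        xi, yi = make_windows(series, lookback)
        X.extend(xi)
        y.extend(yi)
    return X, y


# Hypothetical daily PM10 values: station "B" contains a haze-like spike.
cluster = {"A": [30, 32, 31, 35, 40], "B": [55, 60, 58, 90, 120]}
X, y = pooled_training_set(cluster, lookback=3)
print(len(X), y)
```

In this toy case, the targets from station "B" enter the same training set as those from station "A", which is one way a shared model could learn haze levels that a less affected station never experienced.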
In addition, the concentration of PM10 is also influenced by meteorological factors, such as wind speed, temperature and relative humidity [6]. The concentration of particulates has been found to correlate with temperature, wind speed, dew point and air pressure [6,19]. Consistent with this, Zhang et al. [51] found a significant correlation between particulates and relative humidity during the winter season in Nanyang. Pineda Rojas et al. [7] also reported that high daily average PM10 concentrations are often recorded when the sky cover and relative humidity are low. Similar to the finding that the PM10 concentration is high during the southwest monsoon season [9], Yassen and Jahi [8] found that the TSP concentration in Klang Valley is higher during that season than during the rainy season. Thus, it can be concluded that different real-time meteorological conditions at each station influence the concentration of particulate matter and lead to a slightly lower prediction accuracy of the hybrid model for some stations.
4. Conclusions
In brief, this study proposed a novel hybrid model that combines both the k-means clustering technique and the state-of-the-art LSTM model in predicting the daily average PM10 concentration in Malaysia. Throughout the study, comparisons were made between the hybrid model and the univariate LSTM model in terms of prediction performance, trend fitting and computation time.
In this study, 60 air quality monitoring stations were divided into two distinct clusters using the time series k-means clustering method. Cluster 1 consists of 19 stations that are mainly distributed in highly developed areas, such as Klang Valley and Pulau Pinang, so most of them fall under the urban and suburban categories. On the other hand, Cluster 2 comprises 41 suburban and rural stations that are located mainly on the east coast of Peninsular Malaysia, Sabah and Sarawak. The within-cluster time series patterns are quite similar and relatively stable, with a few unexpected spikes, especially during the transboundary hazy period.
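The clustering step can be sketched with a plain k-means loop over equal-length series. This is only an illustration under stated assumptions: it uses the Euclidean distance, a fixed k = 2 and synthetic series for six hypothetical stations, and it does not reproduce the distance measure, software or data used in the study.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans_series(series, k=2, n_iter=50, seed=0):
    """Cluster equal-length series with plain k-means (Euclidean distance)."""
    rng = random.Random(seed)
    centroids = [list(s) for s in rng.sample(series, k)]
    labels = [0] * len(series)
    for _ in range(n_iter):
        # Assignment step: each series joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: euclidean(s, centroids[c]))
                      for s in series]
        # Update step: each centroid becomes the pointwise mean of its members.
        for c in range(k):
            members = [s for s, lab in zip(series, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments are stable, so the loop has converged
        labels = new_labels
    return labels, centroids


# Synthetic 30-day series: three higher-level and three lower-level stations.
data_rng = random.Random(1)
urban = [[50 + data_rng.gauss(0, 2) for _ in range(30)] for _ in range(3)]
rural = [[30 + data_rng.gauss(0, 2) for _ in range(30)] for _ in range(3)]
labels, centroids = kmeans_series(urban + rural, k=2)
print(labels)
```

One forecasting model would then be trained per label, which is what reduces the number of models from one per station to one per cluster.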
The results show that the hybrid model can give a comparable prediction performance to the univariate LSTM model based on the RPD values for three accuracy metrics. In terms of fitting the actual trend, the hybrid model can capture the patterns of daily average PM10 concentration, although it gives a poorer result than the univariate model for some stations due to several factors, such as the hazy periods in the training set, which affected the air quality to different degrees across locations, and the varying meteorological conditions at each location. In addition, the hybrid model significantly outperforms the univariate LSTM model in terms of training time, suggesting that the proposed model can effectively increase the prediction efficiency in real-life applications.
As for future research directions, it is suggested to consider other meteorological factors, especially wind speed, during the clustering phase to account for their influence on the PM10 concentration. Moreover, the hourly PM10 concentration also warrants further study so that the public can better plan their daily activities. In that context, two-step k-means clustering could be implemented to better capture the variation in the PM10 concentration before constructing the forecasting model for each subclass of the main clusters. Finally, hybrid models that employ different forecasting methods, such as ARIMA, gated recurrent unit (GRU) and LSSVM models, can be compared to identify which combination of techniques predicts the PM10 concentration better.
Author Contributions
Conceptualization, N.M.A. and H.Y.L.; methodology, N.M.A. and H.Y.L.; software, H.Y.L.; validation, N.M.A. and M.A.A.B.; formal analysis, H.Y.L.; investigation, H.Y.L.; resources, N.M.A. and M.A.A.B.; data curation, N.M.A. and H.Y.L.; writing—original draft preparation, H.Y.L.; writing—review and editing, N.M.A. and M.A.A.B.; visualization, H.Y.L.; supervision, N.M.A. and M.A.A.B.; project administration, N.M.A.; funding acquisition, M.A.A.B. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by Universiti Kebangsaan Malaysia with the grant number GP-K017073.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data were obtained from the Malaysian Department of Environment (DOE) and are available from DOE upon request.
Acknowledgments
The authors would like to express their utmost gratitude to the Malaysian Department of Environment (DOE) for providing the air quality data used in this study. In addition, the authors would also like to thank Universiti Kebangsaan Malaysia for the allocation of the research grant, GP-K017073.
Conflicts of Interest
The authors declare no conflict of interest.
References
- WHO. Air Pollution. Available online: https://www.who.int/health-topics/air-pollution (accessed on 15 May 2022).
- Kamaruddin, S.B. UKM Pakarunding Kaji Semula Cara Nilai Kualiti Udara. Available online: https://www.ukm.my/news/Latest_News/ukm-pakarunding-kajli-semula-cara-nilai-kualiti-udara/ (accessed on 15 May 2022).
- Rani, N.L.A.; Azid, A.; Khalit, S.I.; Juahir, H.; Samsuding, M.S. Air Pollution Index Trend Analysis in Malaysia, 2010–2015. Pol. J. Environ. Stud. 2018, 27, 801–807.
- Malaysian Department of Environment (DOE). Pengiraan Indeks Pencemar Udara (IPU). Available online: http://apims.doe.gov.my/pdf/API_Calculation.pdf (accessed on 20 January 2023).
- Al Jallad, F.; Al Katheeri, E.; Al Omar, M. Concentrations of Particulate Matter and Their Relationships with Meteorological Variables. Sustain. Environ. Res. 2013, 23, 191–198.
- Chooi, Y.H.; Yong, E.L. The Influence of PM2.5 and PM10 on Air Pollution Index (API). In Proceedings of the Civil Engineering Research Work: Environmental Engineering, Hydraulics & Hydrology, UTM, Johor Bahru, Malaysia, 7–8 June 2016; pp. 132–143.
- Pineda Rojas, A.L.; Borge, R.; Mazzeo, N.A.; Saurral, R.I.; Matarazzo, B.N.; Cordero, J.M.; Kropff, E. High PM10 Concentrations in the City of Buenos Aires and Their Relationship with Meteorological Conditions. Atmos. Environ. 2020, 241, 117773.
- Yassen, M.E.; Jahi, J.M. Investigation of Variations and Trends in TSP Concentrations in the Klang Valley Region, Malaysia. Malays. J. Environ. Manag. 2007, 8, 57–68.
- Rahman, S.R.A.; Ismail, S.N.S.; Raml, M.F.; Latif, M.T.; Abidin, E.Z.; Praveena, S.M. The Assessment of the Ambient Air Pollution Trend in Klang Valley, Malaysia. World Environ. 2015, 5, 1–11.
- Ao, D.; Cui, Z.; Gu, D. Hybrid Model of Air Quality Prediction Using K-Means Clustering and Deep Neural Network. In Proceedings of the 38th Chinese Control Conference, Guangzhou, China, 27–30 July 2019; pp. 8416–8421.
- Rahman, E.; Hamzah, F.M.; Latif, M.T.; Dominick, D. Assessment of PM2.5 Patterns in Malaysia Using the Clustering Method. Aerosol Air Qual. Res. 2022, 22, 210161.
- Ariff, N.M.; Bakar, M.A.A.; Zamzuri, Z.H. Academic Preference Based on Students’ Personality Analysis through K-Means Clustering. Malays. J. Fund. Appl. Sci. 2020, 16, 328–333.
- Shafi, J.; Waheed, A. K-Means Clustering Analysing Abrupt Changes in Air Quality. In Proceedings of the Fourth International Conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, India, 5–7 November 2020; pp. 26–30.
- Kim, S.B.; Park, S.K.; Sattler, M.; Russell, A.G. Characterization of Spatially Homogeneous Regions Based on Temporal Patterns of Fine Particulate Matter in the Continental United States. J. Air Waste Manag. Assoc. 2008, 58, 965–975.
- Beaver, S.; Palazoglu, A. A Cluster Aggregation Scheme for Ozone Episode Selection in the San Francisco, CA Bay Area. Atmos. Environ. 2006, 40, 713–725.
- Aghabozorgi, S.; Shirkhorshidi, A.S.; Teh, Y.W.; Soltanian, H.; Herawan, T. Spatial and Temporal Clustering of Air Pollution in Malaysia: A Review. In Proceedings of the International Conference on Agriculture, Environment and Biological Sciences (ICFAE’14), Antalya, Turkey, 4–5 June 2014; pp. 67–72.
- Suris, F.N.A.; Bakar, M.A.A.; Ariff, N.M.; Mohd Nadzir, M.S.; Ibrahim, K. Malaysia PM10 Air Quality Time Series Clustering Based on Dynamic Time Warping. Atmosphere 2022, 13, 503.
- Bhalgat, P.; Pitale, S.; Bhoite, S. Air Quality Prediction Using Machine Learning Algorithms. Int. J. Comput. Appl. Technol. Res. 2019, 8, 367–370.
- Aditya, C.R.; Chandana, R.D.; Nayana, D.K.; Praveen, G.V. Detection and Prediction of Air Pollution Using Machine Learning Models. Int. J. Eng. Trends Technol. 2018, 59, 204–207.
- Guo, B.; Wang, X.; Pei, L.; Su, Y.; Zhang, D.; Wang, Y. Identifying the spatiotemporal dynamic of PM2.5 concentrations at multiple scales using geographically and temporally weighted regression model across China during 2015–2018. Sci. Total Environ. 2021, 751, 141765.
- Guo, B.; Zhang, D.; Pei, L.; Su, Y.; Wang, X.; Bian, Y.; Zhang, D.; Yao, W.; Zhou, Z.; Guo, L. Estimating PM2.5 concentrations via random forest method using satellite, auxiliary, and ground-level station dataset at multiple temporal scales across China in 2017. Sci. Total Environ. 2021, 778, 146288.
- Guo, B.; Wu, H.; Pei, L.; Zhu, X.; Zhang, D.; Wang, Y.; Luo, P. Study on the spatiotemporal dynamic of ground-level ozone concentrations on multiple scales across China during the blue sky protection campaign. Environ. Int. 2022, 170, 107606.
- Sharma, R.; Shilimkar, G.; Pisal, S. Air Quality Prediction by Machine Learning. Int. J. Sci. Res. Sci. Technol. 2021, 8, 486–492.
- Uh, B.H.; Majid, N. Comparison of ARIMA Model and Artificial Neural Network in Forecasting Gold Price. J. Qual. Meas. Anal. 2021, 17, 31–39.
- Chee, K.C.; Omar, N. Bitcoin Price Prediction Based on Sentiment of News Article and Market Data with LSTM Model. Asia-Pac. J. Inf. Technol. Multimed. 2020, 9, 1–16.
- Tsan, Y.T.; Chen, D.Y.; Liu, P.Y.; Kristiani, E.; Nguyen, K.L.P.; Yang, C.T. The Prediction of Influenza-Like Illness and Respiratory Disease Using LSTM and ARIMA. Int. J. Environ. Res. Public Health 2022, 19, 1858.
- Khumaidi, A.; Raafi’udin, R.; Solihin, I.P. Pengujian Algoritma Long Short Term Memory untuk Predikasi Kualitas Udara dan Suhu Kota Bandung. J. Telematika 2020, 15, 13–18.
- Bakar, M.A.A.; Ariff, N.M.; Mohd Nadzir, M.S.; Ong, L.W.; Suris, F.N.A. Prediction of Multivariate Air Quality Time Series Data Using Long Short-Term Memory Network. Mal. J. Fund. Appl. Sci. 2022, 18, 52–59.
- Zhang, Y.; Yang, M.; Yang, F.; Dong, N. A Multi-Step Prediction Method of Urban Air Quality Index Based on Meteorological Factors Analysis. In Proceedings of the International Conference on Environment, Renewable Energy and Green Engineering (EREGCE 2022), Online, China, 22–24 April 2022; p. 01010.
- Azid, A.; Juahir, H.; Toriman, M.E.; Kamarudin, M.K.A.; Saudi, A.S.M.; Hasnam, C.N.C.; Aziz, N.A.A.; Azaman, F.; Latif, M.T.; Zainuddin, S.F.M.; et al. Prediction of the Level of Air Pollution Using Principal Component Analysis and Artificial Neural Network Techniques: A Case Study in Malaysia. Water Air Soil Pollut. 2014, 225, 2063.
- Yulita, I.N.; Fanany, M.I.; Arymurthy, A.M. Fuzzy Clustering and Bidirectional Long Short-Term Memory for Sleep Stages Classification. In Proceedings of the 2017 International Conference on Soft Computing, Intelligent System and Information Technology, Denpasar, Bali, Indonesia, 26–29 September 2017; pp. 11–16.
- Liu, L.; Jahromi, H.M.; Cai, L.; Kidston, D. Hierarchical Agglomerative Clustering and LSTM-Based Load Prediction for Dynamic Spectrum Allocation. In Proceedings of the 2021 IEEE 18th Annual Consumer Communications & Networking Conference (CCNC), Las Vegas, NV, USA, 9–12 January 2021; pp. 1–6.
- Li, R.; Hu, Y.; Liang, Q. T2F-LSTM Method for Long-Term Traffic Volume Prediction. IEEE Trans. Fuzzy Syst. 2020, 28, 3256–3264.
- Baca, H.A.H.; Valdivia, F.d.L.P.; Ibarra, M.J.; Cruz, M.A.; Baca, M.E.H. Air Quality Prediction Based on Long Short-Term Memory (LSTM) and Clustering K-Means in Andahuaylas, Peru. In Proceedings of the 2021 Future of Information and Communication Conference (FICC): Advances in Information and Communication, Vancouver, Canada, 29–30 April 2021; pp. 179–191.
- Chen, H.; Guan, M.; Li, H. Air Quality Prediction Based on Integrated Dual LSTM Model. IEEE Access 2021, 9, 93285–93297.
- Kobylin, O.; Lyashenko, V. Time Series Clustering Based on the K-Means Algorithm. J. La Multiapp 2020, 1, 1–7.
- Lkhagva, B.; Suzuki, Y.; Kawagoe, K. New Time Series Data Representation ESAX for Financial Applications. In Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDEW’06), Atlanta, GA, USA, 3–7 April 2006; pp. 17–22.
- Sardá-Espinosa, A. Time-Series Clustering in R Using the dtwclust Package. R J. 2019, 11, 22–43.
- Hautamaki, V.; Nykanen, P.; Franti, P. Time-Series Clustering by Approximate Prototypes. In Proceedings of the 19th International Conference on Pattern Recognition, Tampa, FL, USA, 8–11 December 2008; pp. 1–4.
- Maharaj, E.A.; D’Urso, P.; Caiado, J. Time Series Clustering and Classification, 1st ed.; CRC Press: Boca Raton, FL, USA, 2019.
- Aghabozorgi, S.; Shirkhorshidi, A.S.; Teh, Y.W. Time-Series Clustering—A Decade Review. Inf. Syst. 2015, 53, 16–38.
- Bhardwaj, A. Silhouette Coefficient. Available online: https://towardsdatascience.com/silhouette-coefficient-validating-clustering-techniques-e976bb81d10c (accessed on 31 May 2022).
- Denyse. Time Series Clustering—Deriving Trends and Archetypes from Sequential Data. Available online: https://towardsdatascience.com/time-series-clustering-deriving-trends-and-archetypes-from-sequential-data-bb87783312b4 (accessed on 31 May 2022).
- Colah. Understanding LSTM Networks. Available online: https://colah.github.io/posts/2015-08-Understanding-LSTMs/ (accessed on 31 May 2022).
- Vijay, U. Early Stopping to Avoid Overfitting in Neural Network—Keras. Available online: https://medium.com/zero-equals-false/early-stopping-to-avoid-overfitting-in-neural-network-keras-b68c96ed05d9 (accessed on 10 January 2023).
- NC State University Physics Department. Percent Error and Percent Difference. Available online: https://www.webassign.net/question_assets/ncsucalcphysmechl3/percent_error/manual.html (accessed on 10 January 2023).
- Northern Territory Department of Lands, Planning and the Environment (DLPE). Appendix D—Data Quality Objectives, Quality Assurance, Quality Control. Available online: https://ntepa.nt.gov.au/__data/assets/pdf_file/0003/286149/Edith-River-Investigation-Report (accessed on 10 January 2023).
- Yusof, N.A.M. Jerebu Akibat Kebakaran di Sumatera dan Kalimantan. Available online: https://www.bharian.com.my/berita/nasional/2018/08/463184/jerebu-akibat-kebakaran-di-sumatera-dan-kalimantan (accessed on 10 January 2023).
- Nufael, A. Malaysia Alami Jerebu Akibat Pembakaran Terbuka di Kalimantan. Available online: https://www.benarnews.org/malay/berita/my-jerebu-180817-08172018183152.html (accessed on 10 January 2023).
- Kawi, M.R. IPU Sarawak Naik, Miri Catat Bacaan Tidak Sihat. Available online: https://www.bharian.com.my/berita/wilayah/2018/08/463688/ipu-sarawak-naik-miri-catat-bacaan-tidak-sihat (accessed on 10 January 2023).
- Zhang, M.; Chen, S.; Zhang, X.; Guo, S.; Wang, Y.; Zhao, F.; Chen, J.; Qi, P.; Lu, F.; Chen, M. Characters of Particulate Matter and Their Relationship with Meteorological Factors during Winter Nanyang 2021–2022. Atmosphere 2023, 14, 137.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).