PM2.5 Concentration Prediction Model: A CNN–RF Ensemble Framework

Although many machine learning methods have been widely used to predict PM2.5 concentrations, single and hybrid methods still have shortcomings. This study integrated the feature extraction ability of the convolutional neural network (CNN) and the regression ability of random forest (RF) to propose a novel CNN-RF ensemble framework for PM2.5 concentration modeling. Observational data from 13 monitoring stations in Kaohsiung in 2021 were selected for model training and testing. First, a CNN was implemented to extract key features from meteorological and pollution data. Subsequently, the RF algorithm was employed to train the model with five input factors, namely the feature extracted by the CNN and four spatiotemporal factors: the day of the year, the hour of the day, latitude, and longitude. Independent observations from two stations were used to evaluate the models. The findings demonstrated that the proposed CNN–RF model had better modeling capability than the independent CNN and RF models: the average improvements in root mean square error (RMSE) and mean absolute error (MAE) ranged from 8.10% to 11.11%. In addition, the proposed CNN–RF hybrid model had fewer excess residuals at thresholds of 10 µg/m³, 20 µg/m³, and 30 µg/m³. The results revealed that the proposed CNN–RF ensemble framework is a stable, reliable, and accurate method that can generate superior results compared with the single CNN and RF methods. The proposed method could be a valuable reference for readers and may inspire researchers to develop even more effective methods for air pollution modeling. This research has important implications for air pollution research, data analysis, model estimation, and machine learning.


Introduction
According to the World Health Organization (WHO), air pollution kills nearly 7 million people worldwide every year. Currently, nine out of ten people breathe air that exceeds WHO guidelines for pollutants, with those living in low- and middle-income countries suffering the most [1]. Pollutants that are major public health problems include particulate matter, carbon monoxide, ozone, nitrogen dioxide, sulfur dioxide, black carbon, and polycyclic aromatic hydrocarbons. Many scholars have suggested that air pollutants can cause considerable damage to humans and the living environment [2][3][4][5][6][7][8][9].
Ambient particulate matter (PM), a primary type of air pollution, consists of small solid or liquid particles suspended in the air; therefore, it is also called atmospheric PM. Generally, suspended particles with an aerodynamic diameter of less than or equal to 2.5 µm are classified as fine particulate matter (PM2.5) [10]; those with an aerodynamic diameter less than or equal to 10 µm are classified as particulate matter (PM10). As early as the 1970s, studies indicated a relationship between suspended particulates and human health. Multiple studies have verified that aerosols cause diseases of the respiratory and cardiovascular systems, which may lead to conditions such as asthma, lung cancer, birth defects, and premature death [11]. Therefore, governments worldwide are currently [...]
Int. J. Environ. Res. Public Health 2023, 20, 4077
Machine learning methods typically use a variety of factors to predict PM2.5 concentrations, including meteorological, air pollution, spatiotemporal, land use, and satellite remote sensing data [56][57][58][59][60]. The factors selected depend on the modeling goals and available data. In this study, the focus was on developing a method for multistation PM2.5 concentration prediction models. To ensure data acquisition convenience, meteorological and air pollution parameters from ground air quality monitoring stations, along with their corresponding spatiotemporal parameters, were used for modeling. The CNN convolutional layer is used to extract key features through deep learning techniques. The RF machine learning technique has shown good predictive accuracy, stability, and speed. Therefore, this study proposed a hybrid method, CNN-RF, that combines deep learning with machine learning. This design exploits the advantages of each method to construct a stable, reliable, and accurate prediction model for PM2.5 concentrations.
This paper is organized as follows: Section 2 introduces the related algorithms and experiments employed in this study and describes the construction of the proposed CNN-RF framework. Section 3 presents an analysis and comparison of the research results. Section 4 states the conclusion and recommendations for future research.

Datasets and Preprocessing
This study selected the industrial city of Kaohsiung as the research area because of its persistent air pollution. According to observation data from the Taiwan Air Quality Monitoring Network, Kaohsiung had the highest annual average PM2.5 concentrations in Taiwan in 2021 (Figure 1). This study used the observation data collected by 13 monitoring stations in the Kaohsiung area in 2021 to construct a PM2.5 concentration model; the data, which included hourly weather conditions, air pollution values, location, and time, were downloaded from the Taiwan Air Quality Monitoring Network of the Environmental Protection Administration. The observation data from 11 monitoring stations were employed for model training. To detect overfitting, which could affect the model's performance, and to verify the generalization ability of the trained model, independent observation data from two monitoring stations, Fuxing and Fongshan, were used for model testing.
The testing stations used in this study are part of the Taiwan Air Quality Monitoring Network's six traffic air quality monitoring stations. These stations are strategically located in areas with high traffic flow or high pollution caused by traffic emissions, both of which are prevalent in the prosperous areas of Kaohsiung. Fongshan, the location of one of the testing stations, is the most populous administrative district in Kaohsiung. In terms of geographical location, the Fuxing station is located adjacent to the training stations, while the Fongshan station is positioned in the middle of the training stations, making it farther away from the training stations than the Fuxing station. The testing data consist of 15,943 observations, equivalent to 18% of the training data, meeting the necessary requirements for model testing. The selected testing stations are thus located in high traffic flow areas with a dense population and high land use, which are factors that significantly impact PM2.5 concentration. The difference in distance between the testing stations and the adjacent training stations allows for an examination of the modeling performance influenced by the spatial factor (Figure 2).
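The station-based hold-out described above can be illustrated with a minimal Python sketch. It is not the authors' Matlab code; the record layout, the function name `split_by_station`, and the toy values are assumptions made only for illustration. Only the two testing station names and the observation counts come from the text.

```python
# Stations held out for independent testing, as named in the text.
TEST_STATIONS = {"Fuxing", "Fongshan"}


def split_by_station(records, test_stations=TEST_STATIONS):
    """Split observations so that no testing station contributes training rows."""
    train = [r for r in records if r["station"] not in test_stations]
    test = [r for r in records if r["station"] in test_stations]
    return train, test


# Toy records: "StationA"/"StationB" and all PM2.5 values are invented.
records = [
    {"station": "StationA", "pm25": 21.0},
    {"station": "StationB", "pm25": 18.5},
    {"station": "Fuxing", "pm25": 25.0},
    {"station": "Fongshan", "pm25": 30.0},
]
train, test = split_by_station(records)

# Size of the testing set relative to the training set, using the reported counts.
test_share = 15943 / 88383
print(f"test/train ratio: {test_share:.2%}")
```

Splitting by station rather than by random rows keeps every hourly record of a testing station out of training, so the test measures generalization to unseen locations.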
This study collected the hourly observation data of 13 monitoring stations from 1 January to 31 December 2021. Excluding missing data, this study collected 88,383 training observations and 15,943 independent testing observations, including 8018 and 7925 observations from the Fuxing and Fongshan stations, respectively (Table 1). Each observation was composed of eight air pollution factors, specifically CO, NO2, NO, NOx, SO2, O3, PM10, and PM2.5; five meteorological factors, namely wind speed, wind direction, relative humidity, rainfall, and ambient temperature; and four spatiotemporal factors, namely day of the year (DoY), hour of the day (HoD), latitude (Lat), and longitude (Long). This study aims to develop a new model for predicting PM2.5 concentration. Given the air pollution and meteorological factors at a particular location and time, the model estimates the unknown PM2.5 concentration. The focus of the study is not on examining the relationship between PM2.5 concentration and air pollution or meteorological factors. Therefore, the collected PM2.5 concentration is treated as the dependent variable in the proposed CNN-RF model, and the other factors are treated as independent variables.
Because the independent variables have different ranges, they would not contribute equally to model fitting, and bias could occur. Min-max normalization, one of the most common methods to normalize data, was therefore implemented for the independent variables. For each variable, the minimum value is transformed into 0, the maximum value is transformed into 1, and every other value is transformed into a value between 0 and 1. This method can improve the convergence speed and accuracy of the model. The conversion method (Equation (1)) is expressed as follows:
$$X_{nom} = \frac{X - X_{min}}{X_{max} - X_{min}} \quad (1)$$
where $X_{nom}$ is the normalized independent variable, $X$ is the original independent variable, $X_{min}$ is the minimum value of the original independent variable, and $X_{max}$ is the maximum value of the original independent variable.
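As a worked illustration of the min-max conversion, the following Python sketch normalizes one variable. The function name `min_max_normalize`, the sample temperatures, and the handling of a constant column are assumptions for illustration and are not taken from the study.

```python
def min_max_normalize(values):
    """Map the minimum to 0, the maximum to 1, and everything else in between."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column makes the ratio undefined; returning zeros is an
        # assumption of this sketch, not something the study specifies.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Invented ambient temperatures in degrees Celsius.
temperature = [18.0, 24.0, 30.0]
normalized = min_max_normalize(temperature)
print(normalized)  # [0.0, 0.5, 1.0]
```

In practice the minimum and maximum would usually be taken from the training data and reused for the testing data, so that both sets share one scale; the text does not state which convention was used.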

Proposed CNN-RF Framework
CNN is a class of artificial neural network (ANN) for processing data with a grid-like pattern. It employs a convolutional deep learning technique to achieve feature extraction; features are automatically deduced and optimally tuned for the desired outcome. CNN has a mathematical structure that typically consists of three types of layers: the convolutional layer, the pooling layer, and the fully connected layer. The first two, the convolutional and pooling layers, extract features (i.e., feature learning), whereas the fully connected layer maps the extracted features to the final output. The convolutional layer, which plays a key role in CNN, is composed of a stack of mathematical operations such as convolution, a specialized type of linear operation. CNN can efficiently process images; therefore, it is commonly employed to analyze visual images in tasks such as image classification, segmentation, and medical image analysis, and it has also been applied to natural language processing. The flexible nature of deep learning enables its adaptation to process time-series data.
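To make the convolution and pooling operations concrete, the following pure-Python sketch applies a one-dimensional convolution, a ReLU activation, and max pooling to a short sequence. The kernel is hand-picked and the signal is invented; a trained CNN would learn its kernels from data, and the ReLU step is a common choice rather than something stated in the text.

```python
def conv1d(signal, kernel):
    """Valid-mode sliding dot product (cross-correlation, as in most deep learning libraries)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(xs):
    """Keep positive responses and zero out negative ones."""
    return [max(0.0, x) for x in xs]


def max_pool1d(xs, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(xs[i:i + size]) for i in range(0, len(xs) - size + 1, size)]


signal = [1.0, 2.0, 4.0, 3.0, 1.0, 0.0]  # invented input sequence
kernel = [1.0, 0.0, -1.0]                # hand-picked kernel, not a learned one

feature_map = conv1d(signal, kernel)     # [-3.0, -1.0, 3.0, 3.0]
activated = relu(feature_map)            # [0.0, 0.0, 3.0, 3.0]
pooled = max_pool1d(activated, 2)        # [0.0, 3.0]
print(feature_map, activated, pooled)
```

The convolution responds where the sequence falls over a three-step window, and pooling halves the length while keeping the strongest response in each window.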
RF, a prediction modeling and behavioral analysis technique, is based on decision trees (DTs). It employs bagging (bootstrap aggregation), an ensemble learning technique. The RF method fits many DTs on subsamples of the data set and combines the output of all the DTs. This method achieves greater accuracy because it reduces the variance and overfitting problems of individual DTs. For classification, RF uses the class that receives the most votes across the trees as its prediction; for regression, as in this study, it averages the outputs of the trees. Each tree is trained on samples drawn from the initial dataset. Features are then randomly selected and used to generate the nodes of each tree. The trees in the forest are grown without pruning, and the forecast is formed only after all trees have contributed. Thus, RF combines many weakly correlated weak learners into a strong learner. The RF technique also employs an advanced method to address missing data: missing values are replaced by the value that occurs most often in a particular node. Because this method processes variables quickly, it is well-suited for complicated tasks. The RF method exhibits outstanding predictive ability.
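The bagging idea behind RF regression can be sketched in pure Python as follows. This is a deliberately simplified stand-in, not the study's Matlab implementation: each "tree" is a depth-1 regression stump, one random feature subset is drawn per tree rather than per node, and all names and data are hypothetical.

```python
import random


def fit_stump(X, y, feature_ids):
    """Find the single split (feature, threshold) with the lowest squared error."""
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            if not left or not right:
                continue  # not a real split
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, f, t, lm, rm)
    if best is None:
        # No valid split exists: fall back to predicting the sample mean.
        mean = sum(y) / len(y)
        return (0, float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, row):
    f, t, left_mean, right_mean = stump
    return left_mean if row[f] <= t else right_mean


def fit_forest(X, y, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
        feats = rng.sample(range(len(X[0])), n_features)       # random feature subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    preds = [predict_stump(s, row) for s in forest]
    return sum(preds) / len(preds)  # regression: average the trees, no voting


# Invented data: the target rises with feature 0; feature 1 is uninformative.
X = [[float(i), 0.0] for i in range(10)]
y = [2.0 * i for i in range(10)]
forest = fit_forest(X, y)
low = predict_forest(forest, [0.0, 0.0])
high = predict_forest(forest, [9.0, 0.0])
print(low, high)
```

Averaging many trees fitted on different bootstrap samples is what reduces the variance of any single tree; a full RF would grow deeper, unpruned trees.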
As mentioned in the Introduction, hybrid methods have been widely employed in PM2.5 concentration prediction research because they are able to better quantify complex data. However, not all models can be effectively combined; which methods are suitable for ensembling, and at which stage of the data processing, needs to be studied. The proposed CNN-RF model in this study exploits the advantages of CNNs and the RF method. A CNN was implemented to extract features from air pollution and meteorological data. This study employed the extracted feature (i.e., the predicted PM2.5) and the spatiotemporal variables of each observation, namely DoY, HoD, Lat, and Long, as the input data and PM2.5 as the output data; the RF method was implemented to construct the model of PM2.5 concentrations (Figure 3). Table 2 presents the hyperparameters of the proposed CNN-RF, CNN, and RF models.
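The two-stage data flow described above can be sketched as follows. The CNN is replaced by a placeholder function so that the example runs on its own; `stage1_stub`, `build_rf_inputs`, the record layout, and every numeric value are hypothetical and serve only to show how the five RF inputs are assembled.

```python
def stage1_stub(pollution_and_weather):
    """Placeholder for the trained CNN, which reduces the air pollution and
    meteorological inputs to one extracted feature (the predicted PM2.5).
    The mean is used here only so the sketch runs; it is not the real model."""
    return sum(pollution_and_weather) / len(pollution_and_weather)


def build_rf_inputs(cnn_feature, doy, hod, lat, lon):
    """Assemble the five RF input factors for one observation."""
    return [cnn_feature, doy, hod, lat, lon]


# One invented observation: 7 pollutant values (PM2.5 itself is the target)
# plus 5 meteorological values, already normalized to [0, 1].
observation = {
    "inputs": [0.2, 0.4, 0.1, 0.5, 0.3, 0.6, 0.7, 0.2, 0.9, 0.5, 0.0, 0.4],
    "doy": 45,      # day of the year
    "hod": 13,      # hour of the day
    "lat": 22.63,   # illustrative coordinates only
    "lon": 120.30,
}

cnn_feature = stage1_stub(observation["inputs"])
rf_row = build_rf_inputs(
    cnn_feature,
    observation["doy"],
    observation["hod"],
    observation["lat"],
    observation["lon"],
)
print(rf_row)
```

Each such row, paired with the observed PM2.5 value, would form one training example for the RF stage.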

Experimental Equipment and Assessment Indicators
This study employed an ASUS (ASUSTeK Computer) ExpertCenter D700SC_M700MC running Matlab R2022b on Microsoft Windows 10 Professional Edition. The model verification process was divided into training and independent testing. The mean square error (MSE), root mean square error (RMSE), mean absolute error (MAE), coefficient of determination (R²), and mean accuracy (MA) were employed to evaluate model performance. In addition, the numbers of large residuals were calculated to assess model reliability and stability. Residual thresholds were set to 10 µg/m³, 20 µg/m³, and 30 µg/m³. The relevant assessment indicators are expressed as follows (Equations (2)-(6)):
$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_{tru,i} - y_{p,i}\right)^{2} \quad (2)$$
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{tru,i} - y_{p,i}\right)^{2}} \quad (3)$$
$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_{tru,i} - y_{p,i}\right| \quad (4)$$
$$R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(y_{tru,i} - y_{p,i}\right)^{2}}{\sum_{i=1}^{n}\left(y_{tru,i} - \bar{y}\right)^{2}} \quad (5)$$
$$MA = [\ldots] \quad (6)$$
where $y_{tru,i}$ is the actual measured value of the ith PM2.5 concentration, $y_{p,i}$ is the estimated concentration of the ith PM2.5, $n$ is the number of observations, and $\bar{y}$ is the mean of the actual measured values.
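The error indicators and the excess-residual counts can be computed as in the following Python sketch, which uses the standard definitions of MSE, RMSE, MAE, and R². MA is left out because its formula is not restated in the text. The sample values are invented, and treating "excess" as an absolute residual strictly greater than the threshold is an assumption.

```python
import math


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def count_excess(y_true, y_pred, threshold):
    """Number of observations whose absolute residual exceeds the threshold."""
    return sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) > threshold)


# Invented PM2.5 values in µg/m³; the last prediction is deliberately poor.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 52.0]

excess = {thr: count_excess(y_true, y_pred, thr) for thr in (10, 20, 30)}
print(mse(y_true, y_pred), rmse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
print(excess)
```

Note that MSE is expressed in squared concentration units, (µg/m³)², whereas RMSE and MAE share the unit of the observations.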

Model Evaluation
This section compares the PM2.5 modeling performances of the CNN-RF, CNN, and RF models. The performance evaluation indicators for model training are presented in Table 3. The indicators were calculated from 88,383 pairs of hourly predicted and observed PM2.5 concentrations. The results reveal that the RF method had the best training performance, followed by the CNN-RF model. The RMSE, MSE, MAE, and R² of the CNN-RF model were 3.67 µg/m³, 13.44 (µg/m³)², 2.66 µg/m³, and 0.93, respectively. All the evaluation indicators of the CNN-RF model were similar to those of the RF method. Both the RF and CNN-RF models achieved good training results. Figure 4 illustrates the scatterplots of the predictive PM2.5 concentration models. The RF method generated good predictions of PM2.5 concentration values during model training, followed by the proposed CNN-RF model.
A good prediction model is accurate and reliable in its predictions. In addition to being accurate, a reliable prediction model should be robust and able to generalize well to new data; that is, it should perform well not only on the data it was trained on but also on data that it has not seen before. The consistency between a model's training and validation performance is therefore an important indicator for assessing it. Generally, the training accuracy of a model is better than its validation accuracy, but a large gap between the two indicates overfitting, which can easily lead to overestimation and misjudgment of the model. Further validation of the models' performance was therefore conducted using an independent testing set. This allowed for an unbiased evaluation of each model's ability to generalize to new, unseen data, which is an important step in machine learning model development and evaluation.

Model Validation
Independent testing assessed the capability of the three models. Table 4 presents the results for the three models at the Fuxing and Fongshan testing stations. The results differ from those of the training. The proposed CNN-RF model had the best performance for all assessment indicators at the two testing stations; the RMSE, MAE, R², and MA at the Fuxing station were 4.69 µg/m³, 3.47 µg/m³, 0.89, and 83.81, respectively. At the Fongshan station, the RMSE, MAE, R², and MA were 5.06 µg/m³, 3.79 µg/m³, 0.88, and 83.50, respectively. Notably, all models had better MAs at the Fuxing station than at the Fongshan station. A possible explanation is that the Fuxing station is closer to the training stations than the Fongshan station is (Figure 2).
This study calculated the average values of RMSE and MAE at the Fuxing and Fongshan stations to compare the performance of the proposed CNN-RF method with those of the RF and CNN methods, as presented in Table 5. The average values of the CNN-RF model for RMSE, MAE, R², and MA were 4.88 µg/m³, 3.63 µg/m³, 0.88, and 83.66, respectively. Compared with RF and CNN, the CNN-RF model had RMSE improvement rates of 8.61% and 11.11%, respectively. Furthermore, compared with RF and CNN, the CNN-RF model had MAE improvement rates of 8.10% and 9.48%, respectively. The findings demonstrate that the CNN-RF model had the best testing performance.
This study counted the residuals exceeding the thresholds of 10 µg/m³, 20 µg/m³, and 30 µg/m³ to assess the models' stability and reliability. The results in Table 6 reveal a pattern similar to the aforementioned results. During model training, the RF method had the fewest excess residuals at each threshold, followed by the CNN-RF model. The numbers of excess residuals at thresholds of 10 µg/m³, 20 µg/m³, and 30 µg/m³ were 1212, 72, and 14, respectively. The CNN-RF and RF models had similar numbers of excess residuals. The proposed CNN-RF model yielded the best results at the testing stations; its quantities of excess residuals were lower than those of the RF and CNN methods for all stations and thresholds. At the Fuxing station, the proposed model had excess residuals of 344, 14, and 0 at thresholds of 10 µg/m³, 20 µg/m³, and 30 µg/m³, respectively; at the Fongshan station, the excess residuals were 423, 21, and 1 at the same thresholds.
Figure 5 illustrates the scatterplots of the predictive PM2.5 concentration models. The proposed CNN-RF model performed well during independent model testing and yielded stable, reliable, and accurate results for PM2.5 concentration modeling. This hybrid method could therefore be adopted in PM2.5 prediction research.
The results of this research show that RF has a large difference between its training and validation results in RMSE, MAE, and R², whereas CNN-RF has better validation results than RF. This suggests that the CNN-RF proposed in this study is accurate, robust, and reliable. The CNN-RF model combines the feature learning and spatial relationship abilities of the CNN with the ensemble learning and averaging capabilities of the RF. The advantage of the CNN-RF model is that it effectively combines the strengths of both CNN and RF methods, resulting in improved performance in PM2.5 concentration modeling. This hybrid approach can potentially be applied to a range of prediction tasks and provide useful insights for researchers in the fields of air pollution, data analysis, model estimation, and machine learning.
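The station averages and improvement rates discussed in this section can be reproduced with a short Python sketch. The station-level values are those reported in the text for the CNN-RF model. The definition of the improvement rate as the relative reduction in error against a baseline is an assumption, and the baseline numbers in the last example are hypothetical rather than values from Table 5.

```python
def mean(values):
    return sum(values) / len(values)


def improvement_rate(baseline_error, proposed_error):
    """Percentage reduction in error relative to the baseline (assumed definition)."""
    return (baseline_error - proposed_error) / baseline_error * 100.0


# CNN-RF testing results reported in the text, in µg/m³.
rmse_fuxing, rmse_fongshan = 4.69, 5.06
mae_fuxing, mae_fongshan = 3.47, 3.79

avg_rmse = mean([rmse_fuxing, rmse_fongshan])  # reported as 4.88 after rounding
avg_mae = mean([mae_fuxing, mae_fongshan])     # reported as 3.63
print(round(avg_rmse, 2), round(avg_mae, 2))

# Hypothetical baseline: an error falling from 5.0 to 4.5 is a 10% improvement.
example = improvement_rate(5.0, 4.5)
print(round(example, 2))
```

Under this definition, a positive rate means the proposed model has the lower error, and a negative rate would mean the baseline performs better.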

Conclusions
This study proposed a novel method, CNN-RF, for predicting PM2.5 concentrations by combining the advantages of CNN feature extraction and RF regression. The method uses a CNN to extract key features from meteorological and pollution data and reduce them to a single key factor, which is then combined with spatiotemporal factors and used with RF to construct a PM2.5 concentration prediction model. The proposed CNN-RF was tested on observational data from 13 monitoring stations in Kaohsiung, Taiwan, in 2021. The results showed that the proposed CNN-RF outperformed both the independent CNN and RF models. A summary of the results is given in the following sections.

Model Evaluation
Surprisingly, the training performance of the proposed CNN-RF model was not the best of the three models. Instead, the RF model exhibited excellent performance, with an R² value of 0.94. Notably, the RF model outperformed the CNN-RF model by approximately 9% and 11% in terms of RMSE and MAE, respectively, indicating its strong ability to fit the training data. However, this study reinforces the importance of using an independent validation mechanism in experiments to determine the presence of overfitting, as model evaluation should not rely solely on training results. The results demonstrate that the CNN-RF model effectively mitigated overfitting.

Model Validation
Once again, the proposed CNN-RF model surprised us by demonstrating superior validation performance, despite having lower training performance than the RF model. In terms of RMSE and MAE, the CNN-RF model outperformed the RF model by approximately 8%. When the training and validation performance are evaluated together, the proposed CNN-RF model showed a relative shift of almost 20 percentage points, from approximately 10% worse than the RF model in training to approximately 8% better in validation. This phenomenon is rare in a predictive model and indicates the effectiveness of the proposed integrated framework, which combines the strengths of CNN and RF. The proposed model not only demonstrated the best performance during model validation but also exhibited minimal differences between training and validation results, demonstrating consistency and generalizability, which are important indicators of a good model. Therefore, the proposed CNN-RF model possesses these desirable characteristics and represents a significant advancement in the field.
In terms of future research, it would be interesting to test the proposed CNN-RF model on a national scale rather than just focusing on the urban area of Kaohsiung. Additionally, integrating LSTM for the early forecasting of PM2.5 concentrations could be a promising direction for further investigation.