Spatiotemporally Continuous Reconstruction of Retrieved PM2.5 Data Using an Autogeoi-Stacking Model in the Beijing-Tianjin-Hebei Region, China
Abstract
1. Introduction
2. Study Area and Data
2.1. Study Region
2.2. Data
2.2.1. Ground-Level Measurements
2.2.2. China High Air Pollutants
2.2.3. Reanalysis Information
2.2.4. Ancillary Data
3. Research Methods
3.1. Machine Learning Models
- Random forest (RF): RF is an algorithm based on the bagging method and the classification and regression tree (CART) method, initially proposed by Breiman [38]. During training, RF repeatedly draws random samples of the training data and builds a corresponding number of weak learners (i.e., CART trees). At each node of a CART tree, a feature subset (M) is randomly selected from all the features (N, N > M), and the best split within that subset is chosen according to an impurity criterion such as the Gini coefficient. The randomness in both the sampling of the data and the selection of the feature subset (M) improves the robustness of RF. The output of the regression model is the mean of the predictions of all weak learners.
- Extremely randomized trees (ET): ET is a variant of RF that is more extreme and random than RF [39]. Unlike RF, which trains each tree on a random sample drawn from the data set, ET always trains on the entire data set. In addition, when ET splits a node, it does not search for the optimal split according to the Gini coefficient but splits the features at random. These differences enhance the generalization ability of ET to some extent.
- Gradient boosting decision tree (GBDT): GBDT is a model based on the boosting method and the CART regression tree, in contrast to the bagging used in RF and ET [40]. During training, each iteration of GBDT fits the residual between the prediction of the previous round and the actual value. In addition, GBDT can evaluate the importance of each feature (i.e., the selected independent variables) by counting how often each variable is used.
- Extreme gradient boosting (Xgboost): Xgboost is an algorithm proposed under the GBDT framework, and its performance is better than that of GBDT [41]. Xgboost approximates the cost function with a second-order Taylor expansion, which further enhances the fitting ability of the model. Moreover, Xgboost uses L2 regularization to reduce overfitting. During training, it handles NaN values by comparing the split gain obtained when they are sent to the left or to the right child node and choosing the better direction. Xgboost sorts all features before traversing them to find the optimal split point and then generates the child nodes; this process incurs additional memory and time costs, which is one of the reasons why LightGBM was proposed.
- Light gradient boosting machine (LightGBM): LightGBM is an algorithm proposed after Xgboost that addresses shortcomings of GBDT and Xgboost [42]. LightGBM adopts the histogram algorithm. During training, the continuous values of each feature are discretized into K integers, and a histogram with K bins is constructed. Statistics are then accumulated per bin according to the discretized values. LightGBM can therefore search for the optimal split point among the bins instead of traversing every feature value, which significantly decreases memory usage and computation costs and improves the algorithm’s efficiency. Owing to the lower memory usage and computation costs, LightGBM can handle large-scale data.
- Histogram-based gradient boosting (HistGBM): HistGBM is an algorithm based on GBDT and inspired by LightGBM [43]. As a GBDT method, HistGBM can also evaluate the importance of each feature. It likewise adopts the histogram algorithm to decrease memory usage and computation costs, thereby increasing the model’s robustness, speed, and stability. On large data sets, HistGBM is more efficient than the original GBDT. It has native support for NaN values: during training, the tree learns at each split point, based on the potential gain, whether samples with NaN values should go to the left or the right child.
- Catboost: Catboost is a boosting algorithm proposed by Yandex under the GBDT framework; it is efficient, robust, and natively supports GPU acceleration [44]. It uses fully symmetric (oblivious) trees, in which the same split is applied to all nodes at a given depth, to reduce the risk of overfitting. In addition, Catboost adopts a novel algorithm called ordered boosting, which overcomes the prediction shift and gradient bias problems of the original GBDT algorithm. The model handles categorical features natively and automatically combines them to generate more useful information (i.e., newly constructed variables) [45].
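The histogram idea shared by LightGBM and HistGBM can be illustrated with a minimal pure-Python sketch. It is not the implementation of either library: the equal-width binning, the squared-error gain, and the function names `bin_feature` and `best_histogram_split` are assumptions made for illustration, and the sample values are invented.

```python
def bin_feature(values, k):
    """Map continuous values to integer bin indices 0..k-1 (equal-width bins)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against a constant feature
    return [min(int((v - lo) / width), k - 1) for v in values]


def best_histogram_split(values, targets, k=4):
    """Return (bin_index, gain) for the best split of the form 'bin <= bin_index'.

    Only the k - 1 bin boundaries are scanned, not every distinct value.
    """
    bins = bin_feature(values, k)
    # Accumulate per-bin statistics: sample count and sum of targets.
    count = [0] * k
    total = [0.0] * k
    for b, y in zip(bins, targets):
        count[b] += 1
        total[b] += y

    n, s = len(targets), sum(targets)
    best_bin, best_gain = None, 0.0
    left_n, left_s = 0, 0.0
    for b in range(k - 1):
        left_n += count[b]
        left_s += total[b]
        right_n, right_s = n - left_n, s - left_s
        if left_n == 0 or right_n == 0:
            continue
        # Reduction in the sum of squared errors obtained by splitting here.
        gain = left_s ** 2 / left_n + right_s ** 2 / right_n - s ** 2 / n
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain


# Illustrative values only: two well-separated groups of samples.
x = [0.1, 0.2, 0.3, 0.4, 5.1, 5.2, 5.3, 5.4]
y = [1.0, 1.0, 1.0, 1.0, 9.0, 9.0, 9.0, 9.0]
split_bin, gain = best_histogram_split(x, y, k=4)
print(split_bin, gain)
```

With K bins the scan visits at most K − 1 candidate splits per feature, which is where the memory and time savings described above come from.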
3.2. Automated Feature Engineering
- At the first depth, non-linear transformations are applied to the raw data (e.g., ).
- At the second depth, the constructed features are combined with the original features (e.g., ). Subsequent depths repeat this step.
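The two depths can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code of the autofeat library: the transform set (`log`, `sqrt`, `sq`), the use of multiplication as the combination operator, and the function names are hypothetical, because the examples in the text above did not survive extraction. The feature names come from the data table of this article; the numbers are invented.

```python
import math
from itertools import combinations

# Hypothetical non-linear transforms; the source does not list the ones used.
TRANSFORMS = {
    "log": lambda v: math.log(v) if v > 0 else None,
    "sqrt": lambda v: math.sqrt(v) if v >= 0 else None,
    "sq": lambda v: v * v,
}


def first_depth(raw):
    """Depth 1: apply each non-linear transform to each raw feature."""
    out = {}
    for name, col in raw.items():
        for tname, fn in TRANSFORMS.items():
            new_col = [fn(v) for v in col]
            if None not in new_col:  # skip transforms undefined on this column
                out[f"{tname}({name})"] = new_col
    return out


def second_depth(raw, constructed):
    """Depth 2: combine (here: multiply) pairs of original and constructed features."""
    pool = {**raw, **constructed}
    out = {}
    for a, b in combinations(pool, 2):
        out[f"{a}*{b}"] = [x * y for x, y in zip(pool[a], pool[b])]
    return out


# Illustrative values only.
raw = {"T2M": [280.0, 285.0, 290.0], "BLH": [300.0, 800.0, 1200.0]}
d1 = first_depth(raw)
d2 = second_depth(raw, d1)
print(len(d1), len(d2))
```

The number of candidates grows quickly with depth (2 raw features already yield 6 and then 28 candidates here), which is why a feature selection step follows.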
Algorithm 1. The pseudo-code of the feature selection process
Process: Feature selection
Input: Set B, which contains m features
Output: Set D, which consists of l features (l < m)
1. Calculate the correlation of the features and remove the features highly correlated with the original features; k features are preserved (k < m);
2. Use the L1-regularized linear model to preliminarily select features into set C; j features are preserved (j < k);
for i = 1 to n do
    Split the remaining features (k − j) into equal chunks (f, less than (k − j)/2);
    Train models on the set C and each chunk (i.e., j + f);
    Score models on the set C and each chunk;
end for
return set D
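As a partial illustration of Algorithm 1, the sketch below covers only step 1 (the correlation filter) and the chunking used inside the loop; the L1-regularized selection and the model training and scoring are deliberately left out. The threshold of 0.9 and the names `pearson`, `drop_correlated`, and `equal_chunks` are assumptions, not values or identifiers from the source.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / math.sqrt(va * vb)


def drop_correlated(original, candidates, threshold=0.9):
    """Step 1: keep candidates whose |r| with every original feature is below threshold."""
    kept = {}
    for name, col in candidates.items():
        if all(abs(pearson(col, o)) < threshold for o in original.values()):
            kept[name] = col
    return kept


def equal_chunks(names, size):
    """Loop body: split the remaining feature names into chunks of at most `size`."""
    return [names[i:i + size] for i in range(0, len(names), size)]


# Illustrative values only: one redundant and one non-redundant candidate.
original = {"x": [1, 2, 3, 4, 5]}
candidates = {"2x+1": [3, 5, 7, 9, 11], "zigzag": [1, -1, 1, -1, 1]}
kept = drop_correlated(original, candidates)
chunks = equal_chunks(["a", "b", "c", "d", "e"], 2)
print(list(kept), chunks)
```

In a full pipeline, each chunk would be appended to set C, a model trained and scored on the union, and the features that survive across iterations returned as set D.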
3.3. Model Development
- Stage 1 (before the model ensemble): two targets were used to estimate PM2.5 in this stage: values from the CHAP dataset and ground-truth observations (8:00–18:00, local time). With the ground-truth samples as the target, the proposed method captured the spatial heterogeneity of the in situ values and the temporal variation in the ground-based observations. With the satellite-retrieved data as the target, the spatial heterogeneity of the PM2.5 distributions in the CHAP dataset was studied. Each of the seven selected models was trained three times (steps 0, 1, and 2) with default parameters for each of the two targets (7 × 3 × 2 = 42 runs in total). For every model and target, the best of the three steps was retained (7 × 1 × 2 = 14), so 14 models remained for the two targets. These models then underwent hyperparameter optimization via the optuna library (version 2.9.1). The number of optimization iterations differed according to the training speed of each model.
- Stage 2 (model ensemble): Since PM2.5 concentrations at night (19:00–07:00, local time) were not included in the 14 models, generating the 24 h PM2.5 distributions directly would cause a large prediction bias. Thus, when training on the dataset consisting of the estimated results of stage 1 (00:00–23:00, local time), the 24 h ground-based observations were used to constrain the Catboost model in stage 2. The hyperparameters of the best model were tuned. For the Catboost method, the hyperparameter optimization process and the optimal hyperparameters can be found in the Supplementary Materials (Table S2, Figure S1).
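The bookkeeping of stage 1 (42 training runs reduced to 14 retained models) can be checked with a short sketch. The scoring function is a deterministic stand-in and not a result from this study, and the target labels `"CHAP"` and `"ground"` are hypothetical names for the two targets.

```python
from itertools import product

MODELS = ["RF", "ET", "GBDT", "Xgboost", "LightGBM", "HistGBM", "Catboost"]
STEPS = [0, 1, 2]
TARGETS = ["CHAP", "ground"]  # hypothetical labels for the two targets


def placeholder_score(model, step, target):
    """Stand-in for a validation score; NOT a result from the paper."""
    return (len(model) * 7 + step * 3 + len(target)) % 11


# One run per (model, step, target) combination: 7 x 3 x 2 = 42.
runs = {
    (m, s, t): placeholder_score(m, s, t)
    for m, s, t in product(MODELS, STEPS, TARGETS)
}

# Keep the best-scoring step for every (model, target) pair: 7 x 1 x 2 = 14.
best = {}
for (m, s, t), score in runs.items():
    if (m, t) not in best or score > best[(m, t)][1]:
        best[(m, t)] = (s, score)

print(len(runs), len(best))
```

The 14 retained entries correspond to the models whose stage 1 estimates form the training set of the stage 2 ensemble.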
4. Results
4.1. Autofeat Impacts
4.2. Spatial Performance
4.3. Temporal Performance
5. Discussion
6. Conclusions
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
AOD | Aerosol Optical Depth |
AOT | Aerosol optical thickness |
Autofeat | Automated feature engineering |
Autogeoi-stacking | The automatic geo-intelligent stacking model |
BTH | Beijing-Tianjin-Hebei |
CAQRA | A high-resolution air quality reanalysis dataset over China |
CART | Classification and regression decision tree |
CNEMC | China National Environmental Monitoring Center |
CV | Cross-validation |
CV-R2 | Cross-validation coefficient of determination |
DNN | The deep neural network method |
ECMWF | European Centre for Medium-range Weather Forecasts |
EnKF | Ensemble Kalman filter |
ERA5 | ECMWF Reanalysis v5 |
ET | Extremely randomized trees |
GAM | Generalized additive model |
GBDT | Gradient boosting decision tree |
GBM | Gradient boosting machine |
Geoi-DBN | Geo-intelligent deep belief network |
GTWR | Geographically and temporally weighted regression model |
GWR | Geographically weighted regression model |
HistGBM | Histogram-based gradient boosting machine |
JRA55 | The Japanese 55-year Reanalysis |
LightGBM | Light Gradient Boosting Machine |
LME | Linear mixed-effect model |
MAE | Mean absolute error |
MERRA-2 | The second Modern-Era Retrospective analysis for Research and Applications |
NAQPMS | Nested Air Quality Prediction Modeling System |
NCEP-2 | The NCEP/DOE Reanalysis 2 |
NDVI | Normalized Difference Vegetation Index |
PM2.5 | Fine particulate matter with a diameter of less than 2.5 µm |
R2 | Coefficient of determination |
RF | Random forest |
RMSE | Root-mean square error |
SHAP | Shapley additive explanation |
SMOTE | Synthetic minority over-sampling technique |
SRTM | The Shuttle Radar Topography Mission |
STET | The space-time extra-trees method |
STLG | The fast space-time LightGBM model |
STRF | The space-time random forest model |
Xgboost | Extreme gradient boosting |
YRD | Yangtze River Delta |
Appendix A
Model | Training Device | RAM (M) | Mean R2 | Time Cost |
---|---|---|---|---|
Catboost | GPU | 0.94 | 0.78 | 2 h 6 min |
ET | CPU | 1588 | 0.87 | 1 h 52 min |
GBDT | CPU | 0.17 | 0.72 | 2 h 24 min |
HistGBM | CPU | 0.43 | 0.80 | 2 h 36 min |
LightGBM | CPU | 0.28 | 0.80 | 5 h 10 min |
RF | CPU | 992.64 | 0.85 | 1 h 58 min |
Xgboost | CPU | 0.66 | 0.82 | 1 h 46 min |
References
- Brook, R.D.; Urch, B.; Dvonch, J.T.; Bard, R.L.; Speck, M.; Keeler, G.; Morishita, M.; Marsik, F.J.; Kamal, A.S.; Kaciroti, N.; et al. Insights Into the Mechanisms and Mediators of the Effects of Air Pollution Exposure on Blood Pressure and Vascular Function in Healthy Humans. Hypertension 2009, 54, 659–667.
- Xing, Y.-F.; Xu, Y.-H.; Shi, M.-H.; Lian, Y.-X. The Impact of PM2.5 on the Human Respiratory System. J. Thorac. Dis. 2016, 8, 6.
- Shi, Y.; Zhao, A.; Matsunaga, T.; Yamaguchi, Y.; Zang, S.; Li, Z.; Yu, T.; Gu, X. Underlying Causes of PM2.5-Induced Premature Mortality and Potential Health Benefits of Air Pollution Control in South and Southeast Asia from 1999 to 2014. Environ. Int. 2018, 121, 814–823.
- Xu, Y.; Huang, Y.; Guo, Z. Influence of AOD Remotely Sensed Products, Meteorological Parameters, and AOD–PM2.5 Models on the PM2.5 Estimation. Stoch. Environ. Res. Risk Assess. 2021, 35, 893–908.
- Lin, C.; Labzovskii, L.D.; Mak, H.W.L.; Fung, J.C.H.; Lau, A.K.H.; Kenea, S.T.; Bilal, M.; Hey, J.D.V.; Lu, X.; Ma, J. Observation of PM2.5 Using a Combination of Satellite Remote Sensing and Low-Cost Sensor Network in Siberian Urban Areas with Limited Reference Monitoring. Atmos. Environ. 2020, 227, 117410.
- Li, J.; Zhang, H.; Chao, C.-Y.; Chien, C.-H.; Wu, C.-Y.; Luo, C.H.; Chen, L.-J.; Biswas, P. Integrating low-cost air quality sensor networks with fixed and satellite monitoring systems to study ground-level PM2.5. Atmos. Environ. 2020, 223, 117293.
- Wang, J.; Christopher, S.A. Intercomparison between Satellite-Derived Aerosol Optical Thickness and PM2.5 Mass: Implications for Air Quality Studies. Geophys. Res. Lett. 2003, 30, 2095.
- Xie, Y.; Wang, Y.; Zhang, K.; Dong, W.; Lv, B.; Bai, Y. Daily Estimation of Ground-Level PM2.5 Concentrations over Beijing Using 3 Km Resolution MODIS AOD. Environ. Sci. Technol. 2015, 49, 12280–12288.
- Guo, Y.; Tang, Q.; Gong, D.-Y.; Zhang, Z. Estimating Ground-Level PM2.5 Concentrations in Beijing Using a Satellite-Based Geographically and Temporally Weighted Regression Model. Remote Sens. Environ. 2017, 198, 140–149.
- Ma, Z.; Hu, X.; Huang, L.; Bi, J.; Liu, Y. Estimating Ground-Level PM2.5 in China Using Satellite Remote Sensing. Environ. Sci. Technol. 2014, 48, 7436–7444.
- Ranjan, A.K.; Patra, A.K.; Gorai, A.K. A Review on Estimation of Particulate Matter from Satellite-Based Aerosol Optical Depth: Data, Methods, and Challenges. Asia-Pac. J. Atmos. Sci. 2021, 57, 679–699.
- Zhang, Y.; Li, Z.; Bai, K.; Wei, Y.; Xie, Y.; Zhang, Y.; Ou, Y.; Cohen, J.; Zhang, Y.; Peng, Z.; et al. Satellite remote sensing of atmospheric particulate matter mass concentration: Advances, challenges, and perspectives. Fundam. Res. 2021, 1, 240–258.
- Lee, C.; Lee, K.; Kim, S.; Yu, J.; Jeong, S.; Yeom, J. Hourly Ground-Level PM2.5 Estimation Using Geostationary Satellite and Reanalysis Data via Deep Learning. Remote Sens. 2021, 13, 2121.
- Lu, X.; Wang, J.; Yan, Y.; Zhou, L.; Ma, W. Estimating Hourly PM2.5 Concentrations Using Himawari-8 AOD and a DBSCAN-Modified Deep Learning Model over the YRDUA, China. Atmos. Pollut. Res. 2021, 12, 183–192.
- Wei, J.; Li, Z.; Pinker, R.T.; Wang, J.; Sun, L.; Xue, W.; Li, R.; Cribb, M. Himawari-8-Derived Diurnal Variations in Ground-Level PM2.5 Pollution across China Using the Fast Space-Time Light Gradient Boosting Machine (LightGBM). Atmos. Chem. Phys. 2021, 21, 7863–7880.
- Chen, J.; Yin, J.; Zang, L.; Zhang, T.; Zhao, M. Stacking Machine Learning Model for Estimating Hourly PM2.5 in China Based on Himawari 8 Aerosol Optical Depth Data. Sci. Total Environ. 2019, 697, 134021.
- Song, Z.; Fu, D.; Zhang, X.; Han, X.; Song, J.; Zhang, J.; Wang, J.; Xia, X. MODIS AOD Sampling Rate and Its Effect on PM2.5 Estimation in North China. Atmos. Environ. 2019, 209, 14–22.
- Shin, M.; Kang, Y.; Park, S.; Im, J.; Yoo, C.; Quackenbush, L.J. Estimating Ground-Level Particulate Matter Concentrations Using Satellite-Based Data: A Review. GISci. Remote Sens. 2020, 57, 174–189.
- Chen, Z.-Y.; Zhang, T.-H.; Zhang, R.; Zhu, Z.-M.; Yang, J.; Chen, P.-Y.; Ou, C.-Q.; Guo, Y. Extreme Gradient Boosting Model to Estimate PM2.5 Concentrations with Missing-Filled Satellite Data in China. Atmos. Environ. 2019, 202, 180–189.
- Jiang, T.; Chen, B.; Nie, Z.; Ren, Z.; Xu, B.; Tang, S. Estimation of Hourly Full-Coverage PM2.5 Concentrations at 1-Km Resolution in China Using a Two-Stage Random Forest Model. Atmos. Res. 2021, 248, 105146.
- Xiao, Q.; Geng, G.; Cheng, J.; Liang, F.; Li, R.; Meng, X.; Xue, T.; Huang, X.; Kan, H.; Zhang, Q.; et al. Evaluation of Gap-Filling Approaches in Satellite-Based Daily PM2.5 Prediction Models. Atmos. Environ. 2021, 244, 117921.
- Zhan, Y.; Luo, Y.; Deng, X.; Chen, H.; Grieneisen, M.L.; Shen, X.; Zhu, L.; Zhang, M. Spatiotemporal Prediction of Continuous Daily PM2.5 Concentrations across China Using a Spatially Explicit Machine Learning Algorithm. Atmos. Environ. 2017, 155, 129–139.
- Brokamp, C.; Jandarov, R.; Hossain, M.; Ryan, P. Predicting Daily Urban Fine Particulate Matter Concentrations Using a Random Forest Model. Environ. Sci. Technol. 2018, 52, 4173–4179.
- Li, T.; Zhang, C.; Shen, H.; Yuan, Q.; Zhang, L. Real-time and seamless monitoring of ground-level PM2.5 using satellite remote sensing. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2018, IV-3, 143–147.
- Wu, J.; Li, T.; Zhang, C.; Cheng, Q.; Shen, H. Hourly PM2.5 Concentration Monitoring With Spatiotemporal Continuity by the Fusion of Satellite and Station Observations. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 8019–8032.
- Li, T.; Shen, H.; Yuan, Q.; Zhang, X.; Zhang, L. Estimating Ground-Level PM2.5 by Fusing Satellite and Station Observations: A Geo-Intelligent Deep Learning Approach: Deep Learning for PM2.5 Estimation. Geophys. Res. Lett. 2017, 44, 11985–11993.
- Wei, J.; Li, Z.; Cribb, M.; Huang, W.; Xue, W.; Sun, L.; Guo, J.; Peng, Y.; Li, J.; Lyapustin, A.; et al. Improved 1 Km Resolution PM2.5 Estimates across China Using Enhanced Space–Time Extremely Randomized Trees. Atmos. Chem. Phys. 2020, 20, 3273–3289.
- Wei, J.; Huang, W.; Li, Z.; Xue, W.; Peng, Y.; Sun, L.; Cribb, M. Estimating 1-Km-Resolution PM2.5 Concentrations across China Using the Space-Time Random Forest Approach. Remote Sens. Environ. 2019, 231, 111221.
- Li, H.; Yang, Y.; Wang, H.; Li, B.; Wang, P.; Li, J.; Liao, H. Constructing a Spatiotemporally Coherent Long-Term PM2.5 Concentration Dataset over China during 1980–2019 Using a Machine Learning Approach. Sci. Total Environ. 2021, 765, 144263.
- Zhang, J.; Fogelman-Soulié, F.; Largeron, C. Towards Automatic Complex Feature Engineering. In Proceedings of the International Conference on Web Information Systems Engineering, Dubai, United Arab Emirates, 12–15 November 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 312–322.
- Domingos, P. A Few Useful Things to Know about Machine Learning. Commun. ACM 2012, 55, 78–87.
- He, Q.; Gu, Y.; Zhang, M. Spatiotemporal Trends of PM2.5 Concentrations in Central China from 2003 to 2018 Based on MAIAC-Derived High-Resolution Data. Environ. Int. 2020, 137, 105536.
- He, Q.; Gao, K.; Zhang, L.; Song, Y.; Zhang, M. Satellite-Derived 1-Km Estimates and Long-Term Trends of PM2.5 Concentrations in China from 2000 to 2018. Environ. Int. 2021, 156, 106726.
- Ma, J.; Zhang, R.; Xu, J.; Yu, Z. MERRA-2 PM2.5 Mass Concentration Reconstruction in China Mainland Based on LightGBM Machine Learning. Sci. Total Environ. 2022, 827, 154363.
- Kong, L.; Tang, X.; Zhu, J.; Wang, Z.; Li, J.; Wu, H.; Wu, Q.; Chen, H.; Zhu, L.; Wang, W.; et al. A 6-Year-Long (2013–2018) High-Resolution Air Quality Reanalysis Dataset in China Based on the Assimilation of Surface Observations from CNEMC. Earth Syst. Sci. Data 2021, 13, 529–570.
- Zhao, Q.; Zhao, W.; Bi, J.; Ma, Z. Climatology and Calibration of MERRA-2 PM2.5 Components over China. Atmos. Pollut. Res. 2021, 12, 357–366.
- Ma, J.; Xu, J.; Qu, Y. Evaluation on the Surface PM2.5 Concentration over China Mainland from NASA’s MERRA-2. Atmos. Environ. 2020, 237, 117666.
- Rigatti, S.J. Random Forest. J. Insur. Med. 2017, 47, 31–39.
- Wei, J.; Li, Z.; Lyapustin, A.; Sun, L.; Peng, Y.; Xue, W.; Su, T.; Cribb, M. Reconstructing 1-Km-Resolution High-Quality PM2.5 Data Records from 2000 to 2018 in China: Spatiotemporal Variations and Policy Implications. Remote Sens. Environ. 2021, 252, 112136.
- Zhan, Q.; Fan, Z.; Yan, S.; Yang, S.; Yang, C. New MAIAC AOD Product Based High Resolution PM2.5 Spatial-Temporal Distribution Change at Urban Scale—Case Study of Wuhan. In Proceedings of the 2019 10th International Workshop on the Analysis of Multitemporal Remote Sensing Images (MultiTemp), Shanghai, China, 5–7 August 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–4.
- Gui, K.; Che, H.; Zeng, Z.; Wang, Y.; Zhai, S.; Wang, Z.; Luo, M.; Zhang, L.; Liao, T.; Zhao, H.; et al. Construction of a Virtual PM2.5 Observation Network in China Based on High-Density Surface Meteorological Observations Using the Extreme Gradient Boosting Model. Environ. Int. 2020, 141, 105801.
- Zhong, J.; Zhang, X.; Gui, K.; Wang, Y.; Che, H.; Shen, X.; Zhang, L.; Zhang, Y.; Sun, J.; Zhang, W. Robust Prediction of Hourly PM2.5 from Meteorological Data Using LightGBM. Natl. Sci. Rev. 2021, 8, nwaa307.
- Guryanov, A. Histogram-Based Algorithm for Building Gradient Boosting Ensembles of Piecewise Linear Decision Trees. In Proceedings of the International Conference on Analysis of Images, Social Networks and Texts, Kazan, Russia, 17–19 July 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 39–50.
- Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased Boosting with Categorical Features. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; Volume 31, pp. 6638–6648.
- Dorogush, A.V.; Ershov, V.; Gulin, A. CatBoost: Gradient Boosting with Categorical Features Support. arXiv 2018, arXiv:1810.11363.
- Horn, F.; Pack, R.; Rieger, M. The Autofeat Python Library for Automated Feature Engineering and Selection. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Würzburg, Germany, 16–20 September 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 111–120.
- Selvam, S.K.; Rajendran, C. Tofee-Tree: Automatic Feature Engineering Framework for Modeling Trend-Cycle in Time Series Forecasting. Neural Comput. Appl. 2021, 1–20.
- Wang, M.; Ding, Z.; Pan, M. LbR: A New Regression Architecture for Automated Feature Engineering. In Proceedings of the 2020 International Conference on Data Mining Workshops (ICDMW), Sorrento, Italy, 17–20 November 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 432–439.
- Shi, Q.; Zhang, Y.-L.; Li, L.; Yang, X.; Li, M.; Zhou, J. SAFE: Scalable Automatic Feature Engineering Framework for Industrial Tasks. In Proceedings of the 2020 IEEE 36th International Conference on Data Engineering (ICDE), Dallas, TX, USA, 20–24 April 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1645–1656.
- Khurana, U.; Samulowitz, H.; Turaga, D. Feature Engineering for Predictive Modeling Using Reinforcement Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
- Zhang, Y.; Liang, S.; Zhu, Z.; Ma, H.; He, T. Soil Moisture Content Retrieval from Landsat 8 Data Using Ensemble Learning. ISPRS J. Photogramm. Remote Sens. 2022, 185, 32–47.
- Yadav, S.; Shukla, S. Analysis of K-Fold Cross-Validation over Hold-Out Validation on Colossal Datasets for Quality Classification. In Proceedings of the 2016 IEEE 6th International Conference on Advanced Computing (IACC), Bhimavaram, India, 27–28 February 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 78–83.
- Rodriguez, J.D.; Perez, A.; Lozano, J.A. Sensitivity Analysis of K-Fold Cross Validation in Prediction Error Estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 569–575.
- Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 4768–4777.
- Altman, N.; Krzywinski, M. The Curse(s) of Dimensionality. Nat. Methods 2018, 15, 399–400.
- Dao, F.-Y.; Lv, H.; Wang, F.; Feng, C.-Q.; Ding, H.; Chen, W.; Lin, H. Identify Origin of Replication in Saccharomyces Cerevisiae Using Two-Step Feature Selection Technique. Bioinformatics 2019, 35, 2075–2083.
- Geng, G.; Xiao, Q.; Liu, S.; Liu, X.; Cheng, J.; Zheng, Y.; Xue, T.; Tong, D.; Zheng, B.; Peng, Y.; et al. Tracking Air Pollution in China: Near Real-Time PM2.5 Retrievals from Multisource Data Fusion. Environ. Sci. Technol. 2021, 55, 12106–12115.
- Xu, S.; Zhang, Z.; Du, X.; Li, Y.; Zhang, S.; Xu, P.; Zhang, B.; Meng, F. Impact of Residential Coal Combustion Control in Beijing-Tianjin-Hebei and Surrounding Region on PM2.5 in Beijing. Res. Environ. Sci. 2021, 34, 2876–2886.
- Zhang, L.; An, J.; Liu, M.; Li, Z.; Liu, Y.; Tao, L.; Liu, X.; Zhang, F.; Zheng, D.; Gao, Q.; et al. Spatiotemporal Variations and Influencing Factors of PM2.5 Concentrations in Beijing, China. Environ. Pollut. 2020, 262, 114276.
- Zhao, H.; Zheng, Y.; Li, C. Spatiotemporal Distribution of PM2.5 and O3 and Their Interaction During the Summer and Winter Seasons in Beijing, China. Sustainability 2018, 10, 4519.
- Manning, M.I.; Martin, R.V.; Hasenkopf, C.; Flasher, J.; Li, C. Diurnal Patterns in Global Fine Particulate Matter Concentration. Environ. Sci. Technol. Lett. 2018, 5, 687–691.
- Wang, L.; Xiong, Q.; Wu, G.; Gautam, A.; Jiang, J.; Liu, S.; Zhao, W.; Guan, H. Spatio-Temporal Variation Characteristics of PM2.5 in the Beijing–Tianjin–Hebei Region, China, from 2013 to 2018. Int. J. Environ. Res. Public Health 2019, 16, 4276.
- Ding, Y.; Chen, Z.; Lu, W.; Wang, X. A CatBoost Approach with Wavelet Decomposition to Improve Satellite-Derived High-Resolution PM2.5 Estimates in Beijing-Tianjin-Hebei. Atmos. Environ. 2021, 249, 118212.
- Zheng, G.J.; Duan, F.K.; Su, H.; Ma, Y.L.; Cheng, Y.; Zheng, B.; Zhang, Q.; Huang, T.; Kimoto, T.; Chang, D.; et al. Exploring the Severe Winter Haze in Beijing: The Impact of Synoptic Weather, Regional Transport and Heterogeneous Reactions. Atmos. Chem. Phys. 2015, 15, 2969–2983.
- Blagus, R.; Lusa, L. SMOTE for High-Dimensional Class-Imbalanced Data. BMC Bioinform. 2013, 14, 106.
- Yu, Z.; Qu, Y.; Wang, Y.; Ma, J.; Cao, Y. Application of Machine-Learning-Based Fusion Model in Visibility Forecast: A Case Study of Shanghai, China. Remote Sens. 2021, 13, 2096.
- Vu, B.N.; Bi, J.; Wang, W.; Huff, A.; Kondragunta, S.; Liu, Y. Application of Geostationary Satellite and High-Resolution Meteorology Data in Estimating Hourly PM2.5 Levels during the Camp Fire Episode in California. Remote Sens. Environ. 2022, 271, 112890.
- Hu, H.; Hu, Z.; Zhong, K.; Xu, J.; Zhang, F.; Zhao, Y.; Wu, P. Satellite-Based High-Resolution Mapping of Ground-Level PM2.5 Concentrations over East China Using a Spatiotemporal Regression Kriging Model. Sci. Total Environ. 2019, 672, 479–490.
Categories | Abbreviation | Content | Unit | Spatial Resolution | Data Source |
---|---|---|---|---|---|
Ground Truth | PM2.5 | PM2.5 | µg/m3 | Point | CNEMC |
Satellite Retrieval | PM2.5 | PM2.5 | µg/m3 | 0.05° × 0.05° | CHAP |
Meteorological | 10WU | 10 m u-component of wind | m/s | 0.25° × 0.25° | ERA5 |
Meteorological | 10WV | 10 m v-component of wind | - | - | - |
Meteorological | 100WU | 100 m u-component of wind | - | - | - |
Meteorological | 100WV | 100 m v-component of wind | - | - | - |
Meteorological | T2M | 2 m temperature | K | - | - |
Meteorological | D2M | 2 m dewpoint temperature | - | - | - |
Meteorological | RH | Relative humidity | % | - | - |
Meteorological | SP | Surface pressure | Pa | - | - |
Meteorological | BLH | Boundary-layer height | m | - | - |
Meteorological | PRE | Total precipitation | - | - | - |
Meteorological | KX | K index | K | - | - |
Aerosols | PM2.5 | PM2.5 | µg/m3 | 0.5° × 0.625° | MERRA2 |
Aerosols | BC | Black carbon aerosol | - | - | - |
Aerosols | OC | Organic carbon aerosol | - | - | - |
Aerosols | DUST | Dust aerosol | - | - | - |
Aerosols | SO4 | Sulfate aerosol | - | - | - |
Aerosols | SS | Sea salt aerosol | - | - | - |
Aerosols | CO | Carbon monoxide | - | 0.136° × 0.136° | CAQRA |
Aerosols | O3 | Ozone | - | - | - |
Topographic | SRTM | Surface elevation | m | 90 m | SRTM |
Study | Domain | Resolution | Gaps | Period | R2 | RMSE (µg/m3) |
---|---|---|---|---|---|---|
Wei et al., 2021 [12] | China | 0.05°, hourly | Yes | 2018 | 0.82, 0.71, 0.87, 0.86 | 14.55, 9.63, 11.83, 17.57 |
Chen et al., 2019 [13] | China | 0.05°, hourly | Yes | 2016 | 0.82, 0.72, 0.86, 0.86 | 15.90, 11.00, 16.40, 21.40 |
Jiang et al., 2021 [17] | China | 1 km, hourly | No | 2018–2019 | 0.85, 0.80, 0.85, 0.90 | 12.27, 7.78, 9.50, 14.83 |
Wei et al., 2020 [24] | China | 1 km, daily | Yes | 2017–2018 | 0.88, 0.79, 0.90, 0.88 | 11.23, 7.23, 8.97, 12.84 |
Wei et al., 2019 [25] | China | 1 km, daily | Yes | 2015–2016 | 0.81, 0.69, 0.84, 0.85 | 14.79, 9.62, 14.59, 20.06 |
This study | BTH | 0.05°, hourly | No | 2018 | 0.86, 0.67, 0.89, 0.91 | 16.69, 12.24, 15.10, 19.40 |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Chu, W.; Zhang, C.; Zhao, Y.; Li, R.; Wu, P. Spatiotemporally Continuous Reconstruction of Retrieved PM2.5 Data Using an Autogeoi-Stacking Model in the Beijing-Tianjin-Hebei Region, China. Remote Sens. 2022, 14, 4432. https://doi.org/10.3390/rs14184432