1. Introduction
Air pollution has become a global environmental problem. Fine particulate matter (PM2.5) has been identified as one of the most significant pollutants associated with this problem [1]. PM2.5 exerts a considerable influence on both human health and the environment. It has been shown to cause serious health complications, including respiratory [2], cardiovascular [3], and neurological diseases [4]. PM2.5 also has significant negative impacts on transportation [5], climate [6], the environment [7], and buildings [8]. Therefore, it is imperative to ascertain the spatiotemporal patterns of PM2.5 concentration changes.
The limited number of monitoring stations in the Beijing–Tianjin–Hebei area results in inadequate coverage. Although a dedicated monitoring network for PM2.5 concentrations has been established, this network alone cannot accurately reflect the spatiotemporal variation of fine particulate matter concentrations [9]. Satellite remote sensing offers researchers a wider observational range and the capacity for uninterrupted data acquisition [10]. PM2.5 concentrations have been shown to correlate strongly with satellite-observed aerosol optical depth (AOD) [11,12]. A preliminary estimate of PM2.5 concentrations can therefore be made by qualitative analysis of the observed AOD [13]. To do so, however, ground-station factors such as meteorological and pollutant data must also be taken into account in the fitting [14].
AOD data is obtained primarily from satellite observations and reanalysis products. Satellite-derived data has superior spatial resolution, but cloud cover can cause data gaps. Satellites that acquire AOD data fall into three types: polar-orbiting satellites, geostationary satellites, and high-resolution satellites [15]. Polar-orbiting satellites offer daily revisit cycles and relatively high resolution; representative examples include Aqua (3 km, from NASA) and VIIRS (6 km, from NASA). Geostationary satellites provide hourly data at reduced resolution, as with Himawari-8 (10 km, from the Japan Meteorological Agency, JMA) and GOES-16 (6 km, from the National Oceanic and Atmospheric Administration, NOAA). High-resolution satellites offer the finest resolution but have longer revisit periods; representative examples include Landsat-8 (30 m, 16 days, from NASA) and Sentinel-2 (20 m, 5 days, from the European Space Agency, ESA). This study uses data from the polar-orbiting satellite Aqua because it balances data resolution and frequency. Reanalysis data is complete, with no gaps, but has lower resolution, typically around 50 km. Representative datasets include MERRA-2 (from NASA) and ERA5 (from the European Centre for Medium-Range Weather Forecasts, ECMWF). This study employs MERRA-2 data because of its superior assimilation quality [16]. Using reanalysis AOD to compensate for the gaps in satellite-observed AOD has emerged as a processing method for improving AOD availability [17,18,19,20].
Methods for estimating PM2.5 concentrations have developed through three main stages. In the earliest stage, researchers employed interpolation techniques [21,22,23] to estimate PM2.5 concentrations in areas where no monitoring had been conducted. However, the scarcity of known points at the periphery of the study area, or where nearby monitoring stations are sparse, compromises the precision of the interpolation [24]. Subsequently, statistical fitting became an important research approach. Among statistical models for PM2.5 concentration prediction, geographically weighted regression (GWR) [25] and geographically and temporally weighted regression (GTWR) [26] are the most commonly used, because they handle spatiotemporal non-stationarity well [27]. However, these models fall short in terms of accuracy and interpretability. In the third and current stage, machine learning models have brought enhanced accuracy and interpretability to PM2.5 prediction [28,29]. Among these, the XGBoost model has demonstrated outstanding performance in PM2.5 concentration prediction [30,31,32,33]. Machine learning methods have been shown to outperform statistical models in terms of regularization [34] and generalization [35]. Nevertheless, their handling of spatiotemporal non-stationarity is unsatisfactory. It is therefore essential to combine accurate and interpretable machine learning models with statistical models that handle spatiotemporal non-stationarity robustly, forming a two-stage model for predicting PM2.5 concentrations [36,37,38,39]. In this study, XGBoost is selected as the machine learning model in the two-stage framework, and GWR and GTWR are each used in turn as the statistical model.
Although previous studies have explored two-stage frameworks combining XGBoost and GTWR [36], and others have fused MODIS C6.1 AOD with MERRA-2 AOD data [17], research is still lacking that focuses on the Beijing–Tianjin–Hebei (BTH) region and uses higher-resolution fused AOD data while fully accounting for both nonlinear relationships and spatiotemporal dependencies in PM2.5 estimation. This study aims to fill this gap. The main contributions are as follows. First, the MODIS C6.1 3 km AOD is fused with MERRA-2 data on a 3 km grid for the Beijing–Tianjin–Hebei region, producing AOD data with higher spatial resolution and broader coverage and thereby providing a more reliable basis for PM2.5 prediction. Second, the fused AOD data is incorporated into a two-stage framework of nonlinear modeling followed by spatiotemporal residual correction, which captures nonlinear effects and spatiotemporal dependence simultaneously and thus improves the robustness of the model.
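The two-stage idea described above can be illustrated with a minimal sketch. It is not the implementation used in this study: a least-squares line stands in for XGBoost, and a Gaussian space-time kernel average of the residuals stands in for GTWR. The function names, the kernel form, the bandwidth and the scale factors `lam` and `mu` are all assumptions made for illustration, and the station values are toy data.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; a simple stand-in for XGBoost."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def st_weight(p, q, bandwidth=2.0, lam=1.0, mu=1.0):
    """Assumed Gaussian space-time kernel between two (x, y, t) points."""
    d2 = lam * ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) + mu * (p[2] - q[2]) ** 2
    return math.exp(-d2 / bandwidth ** 2)

def local_residual(target, sites, residuals):
    """Kernel-weighted mean of stage-one residuals near `target`."""
    weights = [st_weight(target, s) for s in sites]
    return sum(w * r for w, r in zip(weights, residuals)) / sum(weights)

def two_stage_predict(target, target_aod, sites, aods, pm25):
    # Stage 1: global model of PM2.5 from AOD.
    intercept, slope = fit_line(aods, pm25)
    residuals = [y - (intercept + slope * x) for x, y in zip(aods, pm25)]
    # Stage 2: add the locally weighted residual as a correction.
    return intercept + slope * target_aod + local_residual(target, sites, residuals)

# Toy stations (x, y, t): two clusters with opposite local biases of +5 and -5.
sites = [(0, 0, 0), (0, 1, 0), (10, 10, 0), (10, 11, 0)]
aods = [0.2, 0.4, 0.2, 0.4]
pm25 = [25.0, 35.0, 15.0, 25.0]
prediction = two_stage_predict((0, 0.5, 0), 0.3, sites, aods, pm25)
print(round(prediction, 2))
```

In this toy case the global fit alone gives 25 at the target, and the second stage recovers the local bias of the nearby cluster. A full GTWR would fit local regression coefficients rather than a local mean.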
5. Discussion
Compared with existing studies that either used coarse-resolution AOD products or did not address residual spatial autocorrelation, this study integrates 3 km MODIS AOD with MERRA-2 data and employs a two-stage XGBoost–GTWR hybrid model, achieving higher prediction accuracy (R2 = 0.95) while retaining fine spatial granularity. Table 12 compares the present study with previous ones.
To address the shortage of effective pixels caused by cloud gaps in Aqua AOD, this study employs a fusion algorithm that supplements the missing areas of Aqua with MERRA-2 reanalysis AOD. The results show that 41.3% of the originally missing pixels were successfully filled, and the filled field exhibited significantly better spatiotemporal continuity than single-satellite products. Although MERRA-2 had to be resampled from 0.5° to 3 km, its hourly output and completeness provided a reliable basis for fusion, and the high-resolution details of Aqua were preserved through residual correction. This strategy is intended to maintain the spatial accuracy of the satellite product. The fused field provides continuous AOD input for subsequent PM2.5 estimation, thereby reducing the estimation variance caused by data gaps.
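The gap-filling step can be sketched as follows. The exact fusion algorithm is not reproduced here; this is a simplified illustration in which the reanalysis field is assumed to be already resampled to the satellite grid, missing satellite pixels are marked with `None`, and the residual correction is reduced to a single mean bias. The function name `fuse_aod` and the sample values are hypothetical.

```python
def fuse_aod(sat, rean):
    """Fill missing satellite pixels with bias-corrected reanalysis AOD.

    sat  : 2-D list of satellite AOD, with None where clouds left a gap
    rean : 2-D list of reanalysis AOD on the same grid (assumed complete)
    """
    # Mean satellite-minus-reanalysis difference over pixels where both exist.
    diffs = [s - r
             for sat_row, rean_row in zip(sat, rean)
             for s, r in zip(sat_row, rean_row)
             if s is not None]
    bias = sum(diffs) / len(diffs) if diffs else 0.0
    # Keep valid satellite pixels untouched; fill gaps with corrected reanalysis.
    return [[s if s is not None else r + bias
             for s, r in zip(sat_row, rean_row)]
            for sat_row, rean_row in zip(sat, rean)]

# Toy 2 x 2 grid with two cloud gaps.
sat = [[0.50, None],
       [None, 0.30]]
rean = [[0.40, 0.45],
        [0.35, 0.20]]
fused = fuse_aod(sat, rean)
print(fused)
```

Valid satellite pixels pass through unchanged, so the fine-scale detail of the satellite product is kept, while gaps inherit the reanalysis pattern shifted by the local mean difference. A spatially varying correction would be the natural refinement of this sketch.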
PM2.5 concentrations across the Beijing–Tianjin–Hebei region decline gradually from January to August and then rise gradually until December. The highest monthly average, recorded in January, is more than three times the lowest, recorded in August. This is largely because the southeasterly monsoon in summer facilitates the dispersion of PM2.5, whereas the northwesterly monsoon in winter is blocked by the Yanshan and Taihang mountain ranges. Furthermore, winter heating exacerbates the accumulation of PM2.5.
Additionally, RMSE and MAE rise markedly during winter months with high PM2.5 concentrations; however, this does not indicate a deterioration in model accuracy. As absolute-error metrics, RMSE and MAE scale linearly with the ambient concentration itself. Consequently, even when the relative error remains unchanged, their values increase in proportion to the elevated winter concentrations. Relying solely on RMSE or MAE to evaluate performance across months with vastly different magnitudes (e.g., January vs. August) can therefore lead to the erroneous conclusion that “winter predictions are worse” [37,56,57,58]. To eliminate this scale effect, we introduced the dimensionless PMSE (%MSE), which normalizes errors to a percentage scale and enables a fair comparison of true model accuracy across seasons and pollution levels.
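The scale effect can be checked with a short numerical example. The concentrations below are invented, and `relative_rmse` (RMSE divided by the mean observation) is only an illustrative normalized metric chosen for this sketch; it is not claimed to be the PMSE definition used in this study.

```python
import math

def rmse(obs, pred):
    """Root mean square error, in the units of the observations."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def mae(obs, pred):
    """Mean absolute error, in the units of the observations."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def relative_rmse(obs, pred):
    """RMSE as a percentage of the mean observation (illustrative only)."""
    return 100.0 * rmse(obs, pred) / (sum(obs) / len(obs))

# Hypothetical "summer" values and "winter" values three times as large.
summer_obs = [20.0, 30.0, 40.0]
winter_obs = [3 * o for o in summer_obs]

# Same 10 % overestimate in both seasons, i.e. identical relative error.
summer_pred = [1.1 * o for o in summer_obs]
winter_pred = [1.1 * o for o in winter_obs]

for name, obs, pred in [("summer", summer_obs, summer_pred),
                        ("winter", winter_obs, winter_pred)]:
    print(name, rmse(obs, pred), mae(obs, pred), relative_rmse(obs, pred))
```

Both absolute metrics triple in the winter case even though the model is equally accurate in relative terms, while the normalized metric is the same for both seasons.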
The spatial pattern of PM2.5 in the Beijing–Tianjin–Hebei region, with higher concentrations in the south than in the north, results from a combination of emissions, topography and meteorology. The central-southern plains (Shijiazhuang, Xingtai, Handan and Hengshui) are characterized by elevations below 50 m, high industrial and transport emissions, and a semi-enclosed ‘dustpan’ topography bordered by the Taihang Mountains to the west and the Bohai Sea to the east. This configuration gives rise to frequent winter stagnation and ready accumulation of pollutants. Conversely, the northern mountain and plateau regions (Zhangjiakou, Chengde and north-west Beijing) have elevations above 800 m, sparse populations and low emission baselines, together with intense turbulence and favorable dispersion conditions. The coastal industrial belt, comprising Tianjin, Tangshan and Cangzhou, has flat terrain but dense port transportation infrastructure and a significant presence of heavy chemical industries. Combined with sea–land breeze circulation, these emissions lead to significant secondary aerosol formation, which places the belt in the second tier of concentration levels. Despite having the highest population density, urban areas in Beijing have experienced a significant decrease in primary emission factors since 2013. This decline can be attributed to strict measures such as the conversion of coal-fired power plants to electricity, the crackdown on scattered, disorderly and polluting enterprises, and stricter vehicle emission standards. Consequently, the average PM2.5 concentration in Beijing is now 30–50 μg/m3 lower than in southern plain cities, which highlights the decisive role of emission reduction policies in local concentration levels.
6. Conclusions
This study presents a novel two-stage framework that integrates a machine learning model (XGBoost) with a spatiotemporal regression model (GTWR) to estimate daily PM2.5 concentrations in the Beijing–Tianjin–Hebei region at a fine spatial resolution of 3 km. Compared with previous studies, the main innovations are as follows: a high-resolution AOD fusion strategy that combines MODIS Collection 6.1 (3 km) with MERRA-2 background data, improving both data completeness and spatial coverage; a residual correction mechanism in which GTWR explicitly models the spatiotemporal autocorrelation of the prediction residuals from XGBoost, enhancing robustness and accuracy; and a comprehensive model evaluation across monthly and seasonal time scales, confirming that the proposed XGBoost–GTWR model outperforms traditional single-stage and hybrid models (R2 = 0.95). These innovations offer a generalizable framework for fine-scale air pollution estimation in data-scarce or topographically complex regions.
This research still has some shortcomings that leave room for future work. First, the current feature set includes only natural factors, such as meteorology, pollutants and remote sensing data. Because anthropogenic emission information is reflected only indirectly through natural variables, it is difficult to attribute PM2.5 concentrations directly to either natural or anthropogenic drivers. Future work will introduce socioeconomic indicators in order to construct a dual natural–anthropogenic feature space. This will enable the explicit quantification and attribution analysis of emission source contributions. Second, the study used one year of data to predict PM2.5 concentrations and was therefore unable to explore the interannual variation of PM2.5 concentration. Using several consecutive years of meteorological and pollutant observations would make it possible to explore the annual variation pattern of PM2.5 concentration as well as the monthly year-on-year variation pattern.