Article

High-Resolution Spatial Prediction of Daily Average PM2.5 Concentrations in Jiangxi Province via a Hybrid Model Integrating Random Forest and XGBoost

by Yuming Tang 1,*, Jing Deng 2, Xinyi Cui 2, Zuhan Liu 2, Liu Yang 2, Shaoquan Zhang 3 and Yeheng Liang 4
1 Smart Water Monitoring Laboratory with Air-Space-Ground Integration, School of Information Engineering, Jiangxi University of Water Resources and Electric Power, Nanchang 330099, China
2 School of Communications and Information Engineering, Jiangxi University of Water Resources and Electric Power, Nanchang 330099, China
3 Jiangxi Province Key Laboratory of Smart Water Conservancy, School of Information Engineering, Jiangxi University of Water Resources and Electric Power, Nanchang 330099, China
4 School of Geography and Planning, Sun Yat-Sen University, Guangzhou 510275, China
* Author to whom correspondence should be addressed.
Atmosphere 2025, 16(12), 1317; https://doi.org/10.3390/atmos16121317
Submission received: 15 October 2025 / Revised: 16 November 2025 / Accepted: 20 November 2025 / Published: 22 November 2025
(This article belongs to the Section Air Quality)

Abstract

Numerous machine learning models have been widely used for the spatial prediction of PM2.5 mass concentrations in the field of remote sensing, but most studies rely on single models, limiting their ability to capture complex nonlinear relationships. Furthermore, traditional Aerosol Optical Depth (AOD) methods suffer from extensive missing values due to algorithmic limitations, hindering daily PM2.5 mass concentration retrieval. This study first developed a hybrid random forest and extreme gradient boosting model (RF-XGBoost) to overcome single-model accuracy constraints. Subsequently, Top-of-Atmosphere (TOA) reflectance replaced conventional AOD as the hybrid model’s input. Finally, we integrated four-year (2020–2023) TOA reflectance, normalized difference vegetation index (NDVI) data, meteorological data, digital elevation model (DEM) data, and day-of-year data to develop a high-precision hybrid model specifically optimized for Jiangxi Province. The simulation results demonstrated that the hybrid RF-XGBoost model (test-R2 = 0.82, RMSE = 7.25 μg/m3, MAE = 4.90 μg/m3) outperformed the single Random Forest Model by 25% and 26% in terms of the root mean square error (RMSE) and mean absolute error (MAE), respectively. The high predictive accuracy of our method confirms its effectiveness in generating reliable PM2.5 estimates. The resulting four-year dataset also successfully delineated the characteristic seasonal PM2.5 pattern in the region, with the highest levels in winter and the lowest in summer, alongside a clear decreasing annual trend, signifying gradual atmospheric improvement.

1. Introduction

PM2.5 refers to particulate matter suspended in the air with an aerodynamic diameter less than or equal to 2.5 μm. These fine particles are aerosols that can remain airborne for a long time and spread. Studies have shown that long-term inhalation of these particles can lead to a variety of human diseases, such as respiratory, cardiovascular and neurological diseases [1,2,3,4]. Severe air pollution from rapid industrialization is a global issue, and the monitoring of pervasive pollutants like PM2.5 remains a major challenge for many nations. China serves as a notable example [5]: although a large ground monitoring network covering major urban locations has been established to monitor major pollutants, including PM2.5 and PM10 mass concentrations, the stations are sparse and spatially discontinuous, so the network cannot satisfy the daily monitoring requirements of most urban and non-urban areas on a large scale. Satellite remote sensing offers extensive spatial coverage, multi-temporal data acquisition, and cost-effectiveness [6]; it is well suited to large-scale monitoring of atmospheric pollutants and to identifying correlations between air pollutants and human activities in a given area [7]. Currently, three primary types of models are employed to estimate PM2.5 mass concentrations via satellite remote sensing: empirical, semi-empirical, and physical models. However, physical and semi-empirical models require extensive measurements of PM2.5 optical parameters, which significantly limits their practical application [8,9]. Empirical models, by contrast, do not need to represent aerosol physical mechanisms explicitly and are computationally efficient, so they have been widely applied. Therefore, empirical (statistical) models are a more practical choice than physical and semi-empirical models for estimating daily PM2.5 concentrations.
Most empirical model studies have used Aerosol Optical Depth (AOD) data to estimate PM2.5 mass concentrations through statistical modeling [10]. Earlier studies used simple linear and nonlinear regression models to relate AOD to PM2.5 mass concentrations, but the accuracy of this approach was generally low (r2 = 0.10–0.65) [11,12]. Subsequently, more advanced and mature statistical models were developed to improve accuracy (r2 = 0.65–0.95) [13]. The neural network model (BPNNM) performs well in complex spatiotemporal prediction tasks [14], such as grid-based PM2.5 prediction. One study that combined the advantages of LSTM and Transformer reported MAE = 18.66 μg/m3 and R2 = 0.76 in 24 h prediction tasks [15]. In a systematic study on the prediction of PM2.5 and PM10 concentrations, the LSTM model achieved the best performance (MAE = 0.593 μg/m3 and RMSE = 0.804 μg/m3) [14]. The Random Forest Model (RFM) demonstrated excellent high-resolution prediction capabilities, achieving R2 = 0.89–0.93 and RMSE = 8.5–12.3 μg/m3 at a 0.01° spatial resolution in the Beijing-Tianjin-Hebei region and significantly outperforming linear regression models [16]. For PM2.5 related to urban traffic, the RFM used real-time traffic flow and meteorological data to achieve an hourly concentration prediction RMSE of 15.7 μg/m3 and was robust to missing AOD data [14]. The Extreme Gradient Boosting (XGBoost) model is a machine learning algorithm based on the gradient boosting decision tree (GBDT) framework. It is renowned for its high predictive accuracy and efficiency with structured data, which are critical for reliable PM2.5 estimation.
The model’s performance can be further enhanced through Bayesian optimization, an automated hyperparameter tuning technique that efficiently searches for the configuration that maximizes predictive performance. For instance, in a study applying this method to national-scale PM2.5 prediction [17], Bayesian optimization refined an XGBoost model, raising the R2 to 0.91 (an improvement of 0.04 over the base GBDT) while also doubling the training efficiency. This gain in R2 makes the predictions more accurate and reliable. The traditional Support Vector Machine model (SVMM) exhibits strong robustness in small-sample tasks after optimization. In long-term predictions for typical regions in China, SVMM achieves an R2 of approximately 0.82–0.85, which is lower than that of RFM and XGBoost, but it is more stable on small samples in sparsely sampled regions [18]. Among these models, RFM and XGBoost perform best in the spatial prediction of PM2.5 mass concentrations. However, the XGBoost model suffers from slow training speed and high sensitivity to outliers, and the RFM extrapolates poorly when predicting beyond the training data distribution. In addition, most studies rely on AOD data products to establish the relationship with PM2.5 mass concentrations [19,20]; these products usually contain large areas of missing data and often cannot cover the entire study area because of the limitations of the different AOD retrieval algorithms. Furthermore, the majority of research targets economically developed areas with abundant ground stations, while studies on underdeveloped areas with sparse and clustered station distributions remain scarce. In such areas, short historical records lead to small sample sizes, which in turn cause model instability and low prediction accuracy.
To address these problems, we first built an RF-XGBoost hybrid model with enhanced generalization capability to overcome the limitations of a single model [21,22]. Second, TOA reflectance data were employed instead of conventional AOD retrievals to derive spatially continuous PM2.5 mass concentrations. Finally, considering Jiangxi Province’s status as a less-developed central region with sparse and geographically clustered ground monitoring stations, the hybrid model takes four years (2020–2023) of multi-factor feature data (MODIS L1B TOA reflectance data, NDVI data, meteorological data, DEM and day of year) as inputs to establish nonlinear relationships with PM2.5 mass concentrations, which enhances model stability and robustness. This ensures that the high-spatial-resolution (1 km × 1 km) PM2.5 mass concentrations retrieved by the hybrid model can be reliably used in less-developed areas with sparse ground monitoring stations.
The structure of the constructed RF-XGBoost model is shown in Figure 1. First, the feature data are processed to obtain a complete dataset, which is split into test and training sets in a 1:3 ratio. The XGBoost parameters were set as follows: number of trees = 1000, learning rate = 0.05, minimum leaf size = 3, and maximum depth = 8; the RFM parameters were number of trees = 100 and minimum leaf size = 1. Then, the RFM was trained ten times independently using TreeBagger, and the XGBoost model was also trained ten times independently using fitrensemble, each time with a random 75:25 split of the data into training and testing sets. The two models were then used for prediction, and their predictions were fused by weighted averaging; after the weights were adjusted, an equal weight of 50% for each model gave the optimal result. The R2, RMSE, and MAE metrics from the ten independent experiments are used to report the evaluation results. Finally, a scatter density plot of the observed PM2.5 values against the fused predictions was produced.
This paper is organized as follows: Section 2 provides a detailed overview of the dataset used in this study and the model construction process. Section 3 presents the prediction results of the RF-XGBoost model, as well as the annual and seasonal PM2.5 concentration estimates for the Jiangxi Province region from 2020 to 2023. Section 4 discusses the main findings of this study and compares the performance of different models. Finally, Section 5 summarizes this study.

2. Materials and Methods

2.1. Study Area

Jiangxi Province is located in southeastern China, on the south bank of the middle and lower reaches of the Yangtze River, in the latitudinal range 24°29′–30°04′ and longitudinal range 113°34′–118°28′ (Figure 2). As an important ecological barrier in southern China, Jiangxi Province has the largest freshwater lake in China, Poyang Lake, as well as national nature reserves such as Mount Lu and Jinggangshan. Jiangxi presents a significant case study of regions grappling with conflicts between economic development and ecological conservation. Its recent reindustrialization and urbanization have introduced significant atmospheric pollution, which threatens local environments and, through the Yangtze River system and atmospheric circulation [23,24], poses broader ecological risks. This combination of high stakes and insufficient ground-level monitoring data makes Jiangxi an ideal region to apply and benefit from satellite-based PM2.5 monitoring, and the findings here are expected to provide insights for other ecologically sensitive areas undergoing similar development.

2.2. Datasets

The feature variables for this study were selected based on their established physical relationships with PM2.5 concentrations, ensuring the model captures key processes including emission, dispersion, and transformation. We integrated 14 variables from multiple categories, as summarized in Table 1.
Satellite-based variables included the Top-of-Atmosphere (TOA) reflectance from MODIS Bands 1, 3, and 7, which provide a direct optical proxy for aerosol abundance. The Normalized Difference Vegetation Index (NDVI) was used as an indicator of land cover and phenological dynamics, which influence dust resuspension and dry deposition. Meteorological variables critically govern pollutant dynamics. Planetary Boundary Layer Height (PBLH) and wind speed components (U10, V10) control the dispersion and transport of PM2.5. Temperature parameters (T2m, SKT), relative humidity (RH), and surface pressure (SP) influence chemical transformation and particle growth, while total precipitation (TP) accounts for wet deposition. Topographic and land-use characteristics were represented by the Digital Elevation Model (DEM), which helps characterize terrain-induced pollution confinement or ventilation. This multi-source data fusion strategy ensures that the model is informed by both direct particulate-matter signals and the underlying physical environment governing PM2.5 concentrations.

2.2.1. Ground Station PM2.5 Data

In this study, hourly PM2.5 mass concentration data from 2020 to 2023 were acquired from the China National Environmental Monitoring Center (CNEMC; http://www.cnemc.cn/) (accessed on 18 October 2024). The dataset encompasses all available ground monitoring stations (a total of 69 sites) across Jiangxi Province. To ensure data quality, a rigorous screening procedure was implemented. Hourly records with null values or marked as invalid by the data provider were excluded. Furthermore, physically implausible negative concentrations were discarded. The daily average PM2.5 concentration for each station was subsequently calculated as the arithmetic mean of all valid hourly measurements within a 24 h period (from 00:00 to 23:00 local time). A key quality control criterion was applied at this stage: a daily average was computed only if at least 18 valid hourly records (75% data capture rate) were available for that day. This step is critical to minimize the uncertainty associated with sparse temporal sampling.
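The screening and averaging rule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `daily_mean_pm25` and the use of `None` to mark missing or provider-flagged records are assumptions made for the example.

```python
from typing import Optional, Sequence

MIN_VALID_HOURS = 18  # 75% of 24 hourly records, as stated in the text


def daily_mean_pm25(hourly: Sequence[Optional[float]]) -> Optional[float]:
    """Return the daily mean of valid hourly PM2.5 values, or None.

    `hourly` holds up to 24 values for one station-day; None marks a
    missing or provider-flagged record (hypothetical encoding).
    """
    # Drop missing records and physically implausible negative values.
    valid = [v for v in hourly if v is not None and v >= 0]
    if len(valid) < MIN_VALID_HOURS:
        return None  # too few valid hours: no daily average is computed
    return sum(valid) / len(valid)
```

Returning `None` rather than a partial average keeps station-days with poor data capture out of the training samples altogether.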

2.2.2. MODIS Data

We collected Aqua MODIS L1B calibrated radiances (MYD021KM) covering the area of Jiangxi Province from 2020 to 2023 (https://ladsweb.modaps.eosdis.nasa.gov/) (accessed on 30 October 2024), and seven bands (band 1–band 7) were extracted from the raw MODIS data. The raw data underwent a standardized preprocessing chain using the MODIS Conversion Toolkit (MCTK) in the ENVI (version 6.1) software environment. The preprocessing steps included geometric correction and geometric cropping. For geometric correction, the data were projected from the native swath-based Sinusoidal projection to a uniform GeoTIFF format using the WGS84 geographic coordinate system, thereby correcting for sensor view geometry and Earth curvature effects. In addition, the Aqua MODIS L1B data were converted to TOA reflectance by radiometric calibration, which involves reading the gains and offsets from the MODIS L1B data header file and then applying them to convert the radiance to TOA reflectance.
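As a rough illustration of the radiometric calibration step, the sketch below applies the linear rescaling commonly documented for MODIS L1B reflective bands, in which the scaled integer gives reflectance multiplied by the cosine of the solar zenith angle. The function name and argument names are hypothetical, and the text does not state whether the authors applied the solar zenith normalization, so that step is an assumption here.

```python
import math


def toa_reflectance(scaled_integer: int, scale: float, offset: float,
                    solar_zenith_deg: float) -> float:
    """Convert one MODIS L1B scaled integer to TOA reflectance.

    scale and offset stand for the per-band reflectance scale (gain) and
    offset read from the file; solar_zenith_deg is the solar zenith angle
    at the pixel. All names are illustrative.
    """
    # Linear rescaling gives reflectance multiplied by cos(solar zenith).
    rho_cos_theta = scale * (scaled_integer - offset)
    # Assumed step: divide by cos(solar zenith) to normalize illumination.
    return rho_cos_theta / math.cos(math.radians(solar_zenith_deg))
```

In practice the same expression would be applied to whole band arrays rather than to single pixels.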

2.2.3. Meteorological Data

We collected four years (2020–2023) of hourly meteorological data from the European Centre for Medium-Range Weather Forecasts (ECMWF; http://www.ecmwf.int/) (accessed on 29 October 2024). These meteorological data are described in detail in Table 1. To align temporally with the daily PM2.5 data, the hourly meteorological data were aggregated into daily averages: for each variable, the daily mean was calculated as the arithmetic average of all 24 hourly values for that day. No additional spatial interpolation was performed at this stage; the native ERA5 grid was used in subsequent modeling steps to maintain the physical consistency of the meteorological fields.

2.2.4. Elevation Data

We obtained elevation data with a resolution of 1 km from the Resource and Environmental Sciences Data Center (RESDC) (http://www.resdc.cn/) (accessed on 22 October 2024). This dataset provides the height above sea level and is widely used in regional environmental studies. The elevation data were cropped to the extent of Jiangxi Province and used as inputs to the model.

2.2.5. Land Use Data

We collected normalized difference vegetation index (NDVI) data with a 30-day temporal resolution covering Jiangxi Province from 2020 to 2023 from the Level-1 and Atmosphere Archive & Distribution System Distributed Active Archive Center (LAADS DAAC; https://ladsweb.modaps.eosdis.nasa.gov/) (accessed on 30 October 2024).

2.2.6. Data Integration

To harmonize all impact factor data to a 1 km spatial resolution, we bilinearly interpolated the coarse-resolution meteorological data (0.25°) and resampled them to 1 km. Subsequently, we cleaned the PM2.5 mass concentration data. We excluded unrealistic values (below 0 μg/m3 or above 200 μg/m3), which may represent sensor anomalies or exceptional pollution events outside our model’s scope, as well as missing values recorded by ground monitoring stations. All impact factor data were cropped to the Jiangxi Province region at a constant spatial resolution of 1 km. Finally, we used the latitude and longitude of each PM2.5 monitoring station to extract the values of all impact factors as model inputs. We performed cloud detection on the MODIS TOA reflectance data when extracting the TOA reflectance and NDVI values, and pixels identified as cloudy were excluded from the analysis. No spatial or temporal interpolation was applied to fill these data gaps, as the underlying surface features are physically unobservable through clouds. While this reduces daily spatial completeness, the long-term, multi-year dataset ensures robust model training by relying exclusively on high-quality, cloud-free observations. Although some studies [25] have shown that including values extracted from cloudy areas does not affect the simulation accuracy of the model, this study does not use cloud reflectance as an input feature of the RF-XGBoost model; PM2.5 values are therefore not extracted when a ground monitoring station is located in a cloudy area. Because the NDVI data have a temporal resolution of 30 days, we assigned to each day the NDVI value from the time point closest to the MODIS imaging time.
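The bilinear interpolation used to bring the 0.25° meteorological fields to finer locations can be sketched for a single point as below. The function `bilinear` and its argument layout are assumptions for illustration; the authors' actual resampling tool is not specified in the text.

```python
def bilinear(lon, lat, lon0, lat0, step, v00, v10, v01, v11):
    """Bilinearly interpolate a gridded field at (lon, lat).

    (lon0, lat0) is the lower-left corner of the enclosing grid cell and
    step is the grid spacing in degrees (0.25 for the meteorological data).
    v00, v10, v01, v11 are the values at the lower-left, lower-right,
    upper-left, and upper-right corners of that cell.
    """
    tx = (lon - lon0) / step  # fractional position along longitude, 0..1
    ty = (lat - lat0) / step  # fractional position along latitude, 0..1
    bottom = v00 * (1 - tx) + v10 * tx  # interpolate along the lower edge
    top = v01 * (1 - tx) + v11 * tx     # interpolate along the upper edge
    return bottom * (1 - ty) + top * ty  # blend the two edges
```

A point at the center of a cell receives the average of the four corner values, while a point on a grid node receives that node's value unchanged.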

2.3. Build RF-XGBoost Model

The RF-XGBoost model in this study is implemented as a weighted-averaging ensemble. As illustrated in Figure 3, the base RFM and XGBoost models are first trained independently on the same set of input features. Their predictions are then combined using a pre-defined, equal weight of 0.5 for each model to produce the final PM2.5 concentration estimate. This fixed-weight strategy was adopted after preliminary experiments indicated that both base models exhibited comparable and complementary predictive performance. Averaging their outputs provides a straightforward yet effective mechanism to enhance model robustness and stability, reducing the variance and potential overfitting of any single model [26,27]. This approach leverages the respective strengths of RF, which is robust to noise, and XGBoost, which is known for its high predictive accuracy, resulting in a more generalized and reliable ensemble prediction.
We analyzed the correlations between the different MODIS L1B spectral bands and chose band 1 (0.65 μm), band 3 (0.47 μm), and band 7 (2.11 μm), which have the lowest correlation coefficients, as hybrid model inputs. The final RF-XGBoost model output can therefore be expressed as follows:
$$ PM_{2.5} = \mathrm{RFXGBoost}(band_1, band_3, band_7, TP, U_{10}, V_{10}, RH, SKT, PBLH, SP, DEM, NDVI) $$
where $PM_{2.5}$ is the estimated daily mean PM2.5 concentration; $band_1$, $band_3$, and $band_7$ are the TOA reflectance from MODIS bands 1, 3, and 7; $TP$, $U_{10}$, $V_{10}$, $RH$, $SKT$, $PBLH$, and $SP$ are the daily averaged meteorological variables; and $DEM$ and $NDVI$ represent the topographic elevation and vegetation index, respectively.
In our ensemble implementation, we first trained the RF and XGBoost models separately to establish baseline performance metrics before ensemble integration. The prediction outputs of these single models were then used as inputs to a weighted combination that generates the final PM2.5 mass concentration estimates. After repeatedly adjusting the weights, we found that the best results are achieved when the two models are weighted equally at 0.5 [22]. The final result can be expressed as follows:
$$ \mathrm{Final\_PM}_{2.5} = \omega_{RF} \, RF(PM_{2.5}) + \omega_{XGB} \, XGB(PM_{2.5}) $$
where $\mathrm{Final\_PM}_{2.5}$ represents the final PM2.5 mass concentration value, $RF(PM_{2.5})$ and $XGB(PM_{2.5})$ represent the single prediction results of the RF and XGBoost models, $\omega_{RF}$ represents the RF model weight, and $\omega_{XGB}$ represents the XGBoost model weight.
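The weighted fusion above, together with one possible way of comparing candidate weights, can be sketched as follows. The text does not describe how the weights were adjusted, so the grid scan in `best_weight` is a hypothetical procedure, and all function names are illustrative; the sketch also assumes the two weights sum to 1.

```python
def fuse(rf_pred, xgb_pred, w_rf=0.5):
    """Weighted average of two prediction lists; the weights sum to 1."""
    w_xgb = 1.0 - w_rf
    return [w_rf * r + w_xgb * x for r, x in zip(rf_pred, xgb_pred)]


def rmse(obs, est):
    """Root mean square error between observations and estimates."""
    return (sum((o - e) ** 2 for o, e in zip(obs, est)) / len(obs)) ** 0.5


def best_weight(obs, rf_pred, xgb_pred, steps=10):
    """Scan w_rf over 0, 1/steps, ..., 1 and return the lowest-RMSE weight.

    Hypothetical selection procedure, shown only to illustrate how a
    weight such as 0.5 could be chosen on held-out data.
    """
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: rmse(obs, fuse(rf_pred, xgb_pred, w)))
```

When the two base models err in opposite directions by similar amounts, equal weights cancel much of the error, which is consistent with the complementary behavior the text attributes to RF and XGBoost.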
After building the RF-XGBoost model, we input all variables into the model to predict the PM2.5 mass concentration over the whole image. Random Forest was used to generate the initial features or noise-reduced preprocessing, and the processed data were then further fitted by XGBoost. This method combines RF’s noise tolerance with XGBoost’s residual tuning.
In this study, PM2.5 mass concentration data were used as the dependent variable of RF-XGBoost, and the other influencing factors were used as the independent variables. PM2.5 mass concentration data for four years (2020–2023) were extracted from the ground monitoring stations in Jiangxi Province to train the hybrid model. All modeling work was done in MATLAB R2024a. In addition, through iterative testing, we determined the optimal values of two important model parameters, learning_Rate_XGB and min_LeafSize_RF, to be 0.05 and 1, respectively.

2.4. Validation

The predictive accuracy of the hybrid model was assessed using the 10-fold cross-validation (CV) method [28]. The dataset was randomly divided into ten mutually exclusive subsets. In an iterative process, nine subsets were used for training and the remaining one for testing. This was repeated ten times such that each subset served as the test set once. The final performance metrics (RMSE, R2, and MAE) were computed as the average of the ten validation results, providing a robust accuracy estimate for the RF-XGBoost model.
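The 10-fold partitioning described above can be sketched with the standard library alone. This is an illustrative Python sketch rather than the authors' MATLAB implementation; the function name `k_fold_indices` and the fixed random seed are assumptions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k mutually exclusive folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)       # random assignment to folds
    folds = [idx[i::k] for i in range(k)]  # k disjoint subsets
    for i, test_idx in enumerate(folds):
        # Training indices are all samples outside the current test fold.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

Each sample appears in exactly one test fold, so averaging the per-fold metrics uses every observation once for validation.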
The selection of these three metrics is based on their complementary strengths in providing a comprehensive assessment of model performance from different perspectives: R2 measures the proportion of variance in the observed data that is explained by the model, indicating the overall goodness-of-fit. RMSE gives greater weight to large errors and is sensitive to outlier predictions, reflecting the model’s prediction precision. MAE represents the average magnitude of errors without considering their direction, providing an intuitive measure of average error. Their expressions are defined as follows:
$$ R^2 = 1 - \frac{\sum_{i=1}^{n} \left( PM_{2.5}^{obs}(i) - PM_{2.5}^{est}(i) \right)^2}{\sum_{i=1}^{n} \left( PM_{2.5}^{obs}(i) - \overline{PM_{2.5}^{obs}} \right)^2} $$
$$ RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( PM_{2.5}^{obs}(i) - PM_{2.5}^{est}(i) \right)^2} $$
$$ MAE = \frac{1}{n} \sum_{i=1}^{n} \left| PM_{2.5}^{obs}(i) - PM_{2.5}^{est}(i) \right| $$
where $n$ denotes the number of test samples, $PM_{2.5}^{obs}(i)$ and $PM_{2.5}^{est}(i)$ are the observed and estimated PM2.5 mass concentrations for the $i$th sample, respectively, and $\overline{PM_{2.5}^{obs}}$ is the mean of the observed values.
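The three metrics defined above translate directly into code. The following is a minimal sketch with illustrative function names, operating on plain lists of observed and estimated concentrations.

```python
import math


def r2_score(obs, est):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - e) ** 2 for o, e in zip(obs, est))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def rmse(obs, est):
    """Root mean square error; squaring gives large errors more weight."""
    return math.sqrt(sum((o - e) ** 2 for o, e in zip(obs, est)) / len(obs))


def mae(obs, est):
    """Mean absolute error; the average error magnitude, ignoring sign."""
    return sum(abs(o - e) for o, e in zip(obs, est)) / len(obs)
```

Because RMSE squares each residual before averaging, it is never smaller than MAE on the same data, and the gap between the two widens when a few predictions are far off.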

3. Results

3.1. RF-XGBoost Performance

We extracted four years of sample data, comprising a total of 37,754 data points corresponding to the latitude and longitude of each ground monitoring station. In addition, a comprehensive statistical analysis was conducted for all input variables. Across the entire study area of Jiangxi Province, the seasonal average PM2.5 mass concentrations ranked from highest to lowest as follows: winter (22.5 μg/m3), autumn (15.8 μg/m3), spring (14.6 μg/m3), and summer (9.1 μg/m3) (Table A1). Figure 4 presents both the simulated PM2.5 mass concentrations based on TOA reflectance over four years and the observed PM2.5 mass concentrations. For the test set, R2 is 0.82, RMSE is 7.25 μg/m3, MAE is 4.90 μg/m3. The slope of the fitted curve is greater than 1, indicating that the observed PM2.5 mass concentrations are higher than the estimated values.
Figure 5 illustrates the simulated retrieval results using TOA reflectance for different years, demonstrating that the model is stable across years. We divided the sample data by year and conducted simulation tests for each annual dataset. The annual test-R2 values reached a maximum of 0.86 in 2020 and a minimum of 0.83 in 2021. Over the four-year period from 2020 to 2023, the model’s fitting errors fluctuated only slightly, with RMSE ranging from 6.28 μg/m3 to 6.83 μg/m3 and MAE from 3.97 μg/m3 to 4.56 μg/m3. We further divided the data for each year into seasons. Figure 6 shows that the model fitted well in spring and autumn, with R2 values of 0.69 and 0.72, respectively. Both RMSE and MAE remained at relatively low levels, indicating that the model could estimate PM2.5 mass concentrations accurately during these two seasons. In summer, the R2 dropped to 0.60, slightly lower than in spring and autumn, although both RMSE and MAE were also small. This suggests that overall PM2.5 concentrations in summer were low and that the model’s ability to fit low-concentration intervals was limited. In winter, the R2 increased to 0.78, showing a strong correlation. However, both RMSE and MAE rose significantly, reaching 11.15 μg/m3 and 7.25 μg/m3, respectively. This reflects greater errors under high-concentration pollution scenarios, especially in extreme high-value ranges. Overall, the model’s simulation performance was most stable in spring and autumn, showed slight fluctuations in summer, and experienced increased errors in winter due to frequent high-pollution events. 
This seasonal variation pattern is highly consistent with the temporal characteristics of actual meteorological conditions and pollution emissions, further confirming the model’s effectiveness in capturing the spatiotemporal distribution of PM2.5 across different seasons in Jiangxi Province.
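Splitting samples by season, as done for the seasonal evaluation, can be sketched as below. The text does not define the season boundaries, so the month-to-season mapping (meteorological seasons, with winter as December to February) is an assumption, as are the names `SEASON_BY_MONTH` and `seasonal_means`.

```python
from collections import defaultdict
from datetime import date

# Assumed meteorological seasons; the source does not state its definition.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def seasonal_means(records):
    """Map an iterable of (date, pm25) pairs to {season: mean pm25}."""
    groups = defaultdict(list)
    for day, value in records:
        groups[SEASON_BY_MONTH[day.month]].append(value)
    return {season: sum(v) / len(v) for season, v in groups.items()}
```

The same grouping can be applied to pairs of observed and predicted values so that R2, RMSE, and MAE are computed separately for each season.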

3.2. Estimated PM2.5 Mass Concentrations in Jiangxi Province

We input four years of TOA reflectance, meteorological data, DEM, NDVI, and day information covering the entire Jiangxi Province into the model to derive the daily PM2.5 mass concentrations for Jiangxi Province over the four-year period [29]. In addition, statistical tests were performed on the daily PM2.5 mass concentrations, and the average results for different years and seasons were ultimately obtained. Figure 7 demonstrates the spatial distribution of annual average PM2.5 mass concentrations in Jiangxi Province from 2020 to 2023. PM2.5 mass concentrations exhibited significant spatial heterogeneity across the province, with higher concentrations observed in the northern and central-northern regions (such as Jiujiang, Jingdezhen, Nanchang, and Shangrao), while the southern regions (such as Ganzhou and its surrounding areas) displayed lower concentrations. Over the four-year period, PM2.5 mass concentrations remained relatively low, with values below 40 μg/m3, and the vast majority of areas in Jiangxi Province met the national Grade II standard. It is noteworthy that, as the provincial capital of Jiangxi, Nanchang exhibits not only representative changes in PM2.5 mass concentrations but also a pronounced high-value distribution pattern radiating from Nanchang to surrounding cities such as Jiujiang, Jingdezhen, Yichun, and Shangrao. From 2020 to 2023, this radiating trend showed a gradual weakening. Specifically, in 2020 and 2021, the high-value PM2.5 zones centered around Nanchang and its neighboring urban agglomerations covered a broad area, with a distinct concentration gradient. However, by 2022 and 2023, PM2.5 mass concentrations in the areas surrounding Nanchang had decreased, the extent of high-value zones had narrowed, and the spatial gradient had become less pronounced. 
This indicates that, with the continuous implementation of air pollution control measures, the outward-radiating influence of pollution centered on Nanchang has been progressively reduced, leading to an overall improvement in regional air quality. Regarding individual cities (Table A2), from 2020 to 2023, the annual average PM2.5 mass concentrations in northern and central cities such as Nanchang, Jiujiang, and Jingdezhen generally remained above the provincial average, but all have shown a downward trend year by year. For example, in Nanchang, the annual average PM2.5 mass concentration was 27 μg/m3 in 2020 and 29 μg/m3 in 2023, indicating minor fluctuations but an overall positive long-term trend. Jiujiang and Jingdezhen have exhibited similar patterns. In contrast, southern cities such as Ganzhou, Ji’an, and Fuzhou have consistently maintained lower annual average PM2.5 mass concentrations, with Ganzhou recording only 20 μg/m3 in 2023. This further highlights the north–south disparity in PM2.5 pollution across Jiangxi Province.
The distribution of PM2.5 mass concentrations in Jiangxi Province across different seasons is shown in Figure 8. The spatial distribution of PM2.5 in Jiangxi Province exhibits distinct seasonal characteristics, with the highest concentrations in winter, followed by autumn and spring, and the lowest in summer. The extremely low PM2.5 mass concentrations during summer are mainly attributed to abundant precipitation, which effectively removes airborne pollutants, as well as high temperatures and strong winds that facilitate the dispersion and dilution of pollutants. By contrast, winter in Jiangxi is characterized by lower wind speeds, frequent wind field convergence, temperature inversions, and high humidity, all of which contribute to the accumulation of pollutants. Notably, over the four-year period, most regions in Jiangxi experienced a gradual decline in winter PM2.5 mass concentrations, especially in the northern and central areas. In other seasons, PM2.5 mass concentrations remained relatively stable, indicating that Jiangxi Province possesses a strong adaptive capacity for controlling PM2.5 pollution.
From the perspective of city-level distribution, northern and central cities such as Nanchang, Jiujiang, and Jingdezhen exhibit significantly higher PM2.5 mass concentrations in winter compared to other seasons, with a broad distribution of high-value zones. This suggests that pollutants tend to accumulate more easily in these areas during winter. In spring and autumn, although PM2.5 mass concentrations in these cities fluctuate, the overall levels are lower than in winter, and the spatial distribution is more uniform. During summer, regardless of whether it is northern cities like Nanchang, Jiujiang, and Jingdezhen, or southern cities such as Ganzhou, Jian, and Fuzhou, PM2.5 mass concentrations drop to the lowest levels of the year, and spatial differences are minimized. This indicates that air quality is generally good across the province in summer. In southern Jiangxi, PM2.5 mass concentrations remain the lowest throughout all seasons, with minimal seasonal fluctuations, reflecting a favorable air environment and strong self-purification capacity. In contrast, northern cities such as Nanchang experience more pronounced seasonal variations, with especially marked differences between winter and summer. This highlights the significant impact of urbanization, industrial activities, and meteorological conditions on the accumulation and dispersion of pollutants.

4. Discussion

In this study, we estimated PM2.5 mass concentrations in Jiangxi Province using a hybrid Random Forest and XGBoost (RF-XGBoost) model, based on satellite TOA reflectance data and multiple influencing factors. In contrast to the traditional AOD products previously used for PM2.5 estimation [30], our approach uses TOA reflectance data [31,32,33], which maintains high simulation accuracy and provides a comprehensive representation of PM2.5 mass concentration distributions across Jiangxi Province. This method also enables the estimation of daily PM2.5 mass concentrations throughout the province.
Figure 9 shows the average annual PM2.5 mass concentrations in Jiangxi Province from 2020 to 2023. Regions with higher GDP tend to have higher PM2.5 mass concentrations; Nanchang, Jiujiang, and Jingdezhen, for example, are leading cities in both economic output and PM2.5 mass concentrations. This spatial correlation suggests that economic activity is strongly related to atmospheric pollution levels. The distribution of factories shows that high PM2.5 mass concentrations coincide with dense clusters of factories, particularly heavy industrial plants such as iron and steel plants, smelters, chemical plants, petroleum plants, refineries, biochemical plants, and fertilizer plants. These industrial operations release large amounts of particulate matter and other air pollutants, which act as local sources of high PM2.5 mass concentrations. In contrast, regions with low GDP and limited industrial infrastructure, such as Ganzhou, exhibit some of the lowest PM2.5 mass concentrations in the province. This observation underscores the influence of economic development and industrial activity, particularly the spatial arrangement of heavy industrial facilities, on PM2.5 distribution patterns. The finding supports the spatial analysis of the model outcomes and provides a scientific rationale for future air pollution mitigation policies.
The RF-XGBoost model offers several advantages. First, it possesses strong nonlinear modeling capabilities: through its gradient-boosted tree structure, XGBoost can automatically capture the interactions and nonlinear relationships among influencing factors such as TOA reflectance, NDVI, meteorological data, and DEM. Second, the Random Forest component reduces variance through bagging, while XGBoost reduces bias via boosting; combining the two further enhances predictive accuracy and adapts to the diverse pollution scenarios in Jiangxi Province. Third, the regularization terms (L1/L2) in XGBoost and the random subsampling in Random Forest help suppress overfitting, keeping the model stable even in the presence of missing data and outliers. Fourth, XGBoost supports multithreading and distributed computing, making it suitable for processing real-time data streams from the 69 monitoring stations across Jiangxi Province and meeting the demand for rapid prediction. In addition, the RF-XGBoost model can leverage the strengths of both algorithms by adjusting the weighting parameters [34]. The Random Forest component within the hybrid model can also provide the contribution of each influencing factor to the estimated PM2.5 mass concentrations. Figure 10 shows that relative humidity (RH) made the highest contribution, reflecting the significant impact of atmospheric moisture content on PM2.5 mass concentrations. Day of year and the 10 m U-component of wind speed (u10) were the second and third most important factors, indicating that seasonal variation, human activity cycles, and horizontal airflow play substantial roles in the transport and dispersion of PM2.5.
Total precipitation (Totalp) and planetary boundary layer height (PBLH) also showed relatively high contributions, which is consistent with theoretical expectations, as these factors determine the atmosphere’s capacity to remove PM2.5 particles through precipitation and the vertical mixing volume of pollutants, respectively, thereby directly affecting surface-level PM2.5 mass concentrations.
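The idea of balancing the two components through weighting parameters can be illustrated with a minimal sketch. It assumes, purely for illustration, that the hybrid output is a convex combination of the Random Forest and XGBoost predictions and that the weight is chosen by a grid search on validation RMSE; the function names and sample values are hypothetical and are not taken from the authors' implementation.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def blend(rf_pred, xgb_pred, w_rf):
    """Convex combination: w_rf * RF + (1 - w_rf) * XGBoost (assumed form)."""
    return [w_rf * r + (1.0 - w_rf) * x for r, x in zip(rf_pred, xgb_pred)]


def best_weight(y_val, rf_pred, xgb_pred, steps=20):
    """Grid-search the RF weight in [0, 1] that minimises validation RMSE."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(
        candidates,
        key=lambda w: rmse(y_val, blend(rf_pred, xgb_pred, w)),
    )


# Hypothetical validation targets and component predictions (ug/m3).
y_val = [10.0, 20.0, 30.0, 40.0]
rf_pred = [12.0, 22.0, 32.0, 42.0]   # component that overestimates by 2
xgb_pred = [8.0, 18.0, 28.0, 38.0]   # component that underestimates by 2

w = best_weight(y_val, rf_pred, xgb_pred)
hybrid = blend(rf_pred, xgb_pred, w)
```

In this toy case the two components err in opposite directions, so an intermediate weight cancels the errors. With real predictions the gain is smaller, and the weight should be selected on data held out from the final test set.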
Finally, we compared the proposed model with traditional models, including BPNNM, RFM, XGBoost, and SVMM [35,36,37]. The RF-XGBoost model demonstrated higher predictive accuracy and the advantage of complementary integration (Figure 11) [38]. We randomly sampled the data and employed cross-validation to obtain the evaluation metrics (R2, RMSE, and MAE) for each model [38,39]. The results showed that RF-XGBoost achieved higher R2 values and the lowest RMSE and MAE values, indicating that it explains the variability of PM2.5 mass concentrations in Jiangxi Province better than the other four algorithms. In addition, considering the spatial correlation of model prediction errors, the RF-XGBoost model yielded the smallest RMSE and MAE values on the test set, further demonstrating its superior accuracy in predicting daily average PM2.5 mass concentrations in Jiangxi Province. It is also noteworthy that SVMM and BPNNM took longer to train than RF-XGBoost and produced inferior simulation results.
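For reference, the three evaluation metrics and a shuffled k-fold split can be sketched as follows. The formulas are the standard definitions of R2, RMSE, and MAE; the fold construction, the seed, and the sample values are illustrative assumptions, since this section does not give the number of folds or the sampling details.

```python
import math
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k shuffled folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# Hypothetical observed and estimated concentrations (ug/m3).
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]
scores = {
    "r2": r2_score(y_true, y_pred),
    "rmse": rmse(y_true, y_pred),
    "mae": mae(y_true, y_pred),
}
splits = list(kfold_indices(10, 5))
```

In a model comparison, each candidate would be fitted on the training indices of a fold and scored on its test indices, with the metrics averaged over the folds.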
Although the RF-XGBoost model outperforms most of the constructed models, it still has certain limitations, which are closely related to the orientation and resource allocation of current air pollution control policies. In recent years, both the national government and Jiangxi Province have promoted the construction of ground monitoring stations through policies such as the “Blue Sky Protection Campaign” and the “Ecological Environment Monitoring Network Construction Plan.” However, investment has favored urban centers and surrounding areas with dense populations and active economies (for example, the central urban areas of Nanchang and Jiujiang) in order to prioritize air quality monitoring in key regions. As a result, the ground monitoring stations are relatively concentrated, as shown in Figure 2: stations are mainly located in the urban and surrounding areas of each prefecture-level city, while coverage in remote mountainous and rural non-key monitoring areas is severely lacking. These policy-driven disparities in monitoring resources significantly affect the model’s predictions. On the one hand, data from densely monitored areas provide the model with abundant and accurate observations, improving prediction accuracy in these regions. On the other hand, in remote or sparsely monitored areas, the lack of direct observations forces the model to rely on interpolation or extrapolation from nearby stations, which may introduce uncertainty and errors. The spatial imbalance in monitoring station distribution therefore affects, to some extent, the model’s generalization ability and its capacity for fine-scale characterization of the PM2.5 spatial distribution across the province.
Another notable limitation is the model’s insufficient performance under extreme pollution scenarios. In winter, although the model’s goodness-of-fit is relatively high (test-R2 = 0.78), prediction errors increase significantly (RMSE = 11.15 μg/m3, MAE = 7.25 μg/m3), and the model fails to effectively capture the complex nonlinear relationships between PM2.5 and its driving factors under extreme conditions. In the future, upgrading monitoring network policies toward “balanced and diversified” approaches could address the spatial data imbalance, support a “policy-environment” interactive feature system, and enhance prediction accuracy under extreme pollution scenarios. Additionally, leveraging policy data-sharing platforms to strengthen model iteration and regional adaptability would align model optimization more closely with policy orientation and resource utilization, thereby further improving the model’s predictive accuracy and spatial representativeness at the provincial scale.
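A per-season error breakdown of the kind quoted for winter can be obtained by stratifying the test residuals. The sketch below is a generic illustration under assumed inputs: the record layout (group label, observed value, predicted value) and the sample values are hypothetical and do not reproduce the study's data.

```python
import math
from collections import defaultdict


def errors_by_group(records):
    """Compute RMSE and MAE separately for each group label.

    `records` is an iterable of (group, observed, predicted) tuples,
    where `group` could be a season name or a station identifier.
    """
    squared = defaultdict(float)
    absolute = defaultdict(float)
    counts = defaultdict(int)
    for group, observed, predicted in records:
        diff = predicted - observed
        squared[group] += diff * diff
        absolute[group] += abs(diff)
        counts[group] += 1
    return {
        group: {
            "rmse": math.sqrt(squared[group] / counts[group]),
            "mae": absolute[group] / counts[group],
            "n": counts[group],
        }
        for group in counts
    }


# Hypothetical test records (ug/m3); the winter rows include a high episode.
records = [
    ("winter", 60.0, 50.0),
    ("winter", 30.0, 34.0),
    ("summer", 8.0, 9.0),
    ("summer", 10.0, 9.0),
]
report = errors_by_group(records)
```

Grouping by station identifier instead of season would give a comparable view of how errors vary between densely and sparsely monitored areas.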

5. Conclusions

This study employed a hybrid model combining Random Forest (RFM) and XGBoost, integrating multi-source data such as TOA reflectance, NDVI, meteorological data, DEM, and day-of-year information, to predict the long-term daily average PM2.5 mass concentrations across Jiangxi Province. The simulation results demonstrate that the proposed model exhibits high prediction accuracy, with a test-R2 of 0.82, RMSE of 7.25 μg/m3, and MAE of 4.90 μg/m3. Annual inversion results from 2020 to 2023 indicate an overall decline in PM2.5 mass concentrations (Table A1), reflecting a gradual improvement in the regional atmospheric environment. The PM2.5 concentrations in Jiangxi Province show distinct spatial differences and seasonal characteristics. Spatially, PM2.5 concentrations are significantly higher in northern Jiangxi and lower in the south; seasonally, concentrations are lowest in summer and highest in winter, a pattern closely related to frequent summer precipitation in the province. Comparative results show that the RFM (test-R2 = 0.85, RMSE = 9.51 μg/m3, MAE = 6.61 μg/m3) has a slightly higher test-R2 than the RF-XGBoost model (test-R2 = 0.82, RMSE = 7.25 μg/m3, MAE = 4.90 μg/m3), indicating better explanatory power for the impact factors. In terms of prediction error, however, the RF-XGBoost model attains lower RMSE and MAE: integrating the XGBoost algorithm enhances the overall model’s fitting and generalization abilities and better captures complex nonlinear relationships within the data, resulting in more accurate predictions. This study not only provides a scientific basis for analyzing the temporal and spatial distribution characteristics of air pollution in Jiangxi Province, but also offers technical support for relevant authorities in formulating air pollution prevention and control measures and environmental management policies.
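The error reduction implied by the RFM and RF-XGBoost figures quoted above can be checked with a short calculation. The helper name is hypothetical; the input values are the RMSE and MAE reported in this section.

```python
def relative_reduction(baseline, improved):
    """Fractional decrease of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# Test-set errors reported in this section (ug/m3).
rfm = {"rmse": 9.51, "mae": 6.61}
rf_xgboost = {"rmse": 7.25, "mae": 4.90}

reductions = {
    metric: relative_reduction(rfm[metric], rf_xgboost[metric])
    for metric in rfm
}
for metric, value in reductions.items():
    print(f"{metric}: {value:.1%} lower than RFM")
```

Both errors come out roughly a quarter lower for the hybrid model than for the single Random Forest model.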
In the future, multi-source data fusion can be utilized by incorporating more remote sensing data and ground observation data to further enhance the model’s predictive capability in data-scarce regions. To address the issue of insufficient monitoring data in remote areas, techniques such as Generative Adversarial Networks (GANs) can be employed to improve the model’s spatial generalization ability. Additionally, extending the model’s prediction horizon for longer-term PM2.5 trend forecasting can provide strong scientific support for policy-making and air quality management.

Author Contributions

Conceptualization, Y.T.; software, Y.T., J.D. and X.C.; data curation, J.D. and X.C.; writing—original draft preparation, J.D. and X.C.; writing—review and editing, Y.T., L.Y. and S.Z.; supervision, Y.T., Z.L. and S.Z.; project administration, Z.L.; funding acquisition, Z.L. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

National Natural Science Foundation of China (No. 41901352 and No. 42261077); Guangdong Basic and Applied Basic Research Foundation (No. 2020A1515010780); Science and Technology Projects in Guangzhou (No. 202102020454).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to privacy concerns.

Acknowledgments

We appreciate the support of the European Centre for Medium-Range Weather Forecasts (ECMWF) in providing reanalysis meteorological data. In addition, we sincerely appreciate all the anonymous reviewers for their excellent comments and efforts.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Annual average PM2.5 mass concentrations from 2020 to 2023 and PM2.5 mass concentrations in all seasons of each year (unit: μg/m3).

| Year | Annual Average | Spring | Summer | Autumn | Winter |
|------|----------------|--------|--------|--------|--------|
| 2020 | 15.64 | 15.34 | 9.83 | 16.49 | 22.61 |
| 2021 | 15.48 | 15.06 | 9.41 | 16.19 | 22.51 |
| 2022 | 13.75 | 14.10 | 8.69 | 15.21 | 21.29 |
| 2023 | 14.33 | 13.74 | 8.30 | 15.17 | 21.79 |

Table A2. Annual PM2.5 mass concentrations, GDP, and factory statistics for major cities in Jiangxi Province.

| City | 2020 (μg/m3) | 2021 (μg/m3) | 2022 (μg/m3) | 2023 (μg/m3) | Avg (μg/m3) | Total GDP (100 Million Yuan) | Total Factory (ton) |
|------|------|------|------|------|------|------|------|
| Nanchang | 27.44 | 28.63 | 26.38 | 29.62 | 28.02 | 104,645.75 | 65 |
| Jiujiang | 36.26 | 29.36 | 27.85 | 29.36 | 30.71 | 68,724.00 | 96 |
| Jingdezhen | 24.47 | 22.64 | 20.47 | 22.37 | 22.48 | 60,671.25 | 25 |
| Pingxiang | 29.47 | 30.46 | 26.24 | 29.34 | 28.87 | 81,194.00 | 35 |
| Xinyu | 30.56 | 29.24 | 26.42 | 28.21 | 28.61 | 97,189.25 | 23 |
| Yingtan | 30.75 | 23.42 | 21.34 | 23.32 | 24.71 | 100,654.75 | 57 |
| Ganzhou | 24.63 | 21.64 | 19.43 | 20.47 | 21.54 | 47,200.75 | 35 |
| Ji'an | 28.45 | 26.54 | 24.45 | 27.34 | 26.69 | 57,341.25 | 45 |
| Yichun | 30.47 | 29.34 | 25.42 | 28.36 | 28.39 | 64,809.00 | 51 |
| Fuzhou | 27.75 | 24.46 | 22.64 | 24.21 | 24.77 | 51,134.50 | 39 |
| Shangrao | 21.23 | 26.42 | 21.74 | 23.21 | 22.15 | 47,979.25 | 41 |

References

  1. Vanoli, J.; Quint, J.K.; Rajagopalan, S.; Stafoggia, M.; Al-Kindi, S.; Mistry, M.N.; Masselot, P.; de la Cruz Libardi, A.; Ng, C.F.S.; Madaniyazi, L.; et al. Association between long-term exposure to low ambient PM2.5 and cardiovascular hospital admissions: A UK Biobank study. Environ. Int. 2024, 192, 109011.
  2. Zhang, S.; Li, X.; Zhang, L.; Zhang, Z.; Li, X.; Xing, Y.; Wenger, J.C.; Long, X.; Bao, Z.; Qi, X.; et al. Disease types and pathogenic mechanisms induced by PM2.5 in five human systems: An analysis using omics and human disease databases. Environ. Int. 2024, 190, 108863.
  3. Xu, J.; Ni, M.; Wang, J.; Zhu, J.; Niu, G.; Cui, J.; Li, X.; Meng, Q.; Chen, R. Low-level PM2.5 induces the occurrence of early pulmonary injury by regulating circ_0092363. Environ. Int. 2024, 187, 108700.
  4. Cao, J.; Yang, C.; Li, J.; Chen, R.; Chen, B.; Gu, D.; Kan, H. Association between long-term exposure to outdoor air pollution and mortality in China: A cohort study. J. Hazard. Mater. 2011, 186, 1594–1600.
  5. Zhou, D.; Yang, Y.; Zhao, Z.; Zhou, K.; Zhang, D.; Tang, W.; Zhou, M. Air pollution-related disease and economic burden in China, 1990–2050: A modelling study based on Global burden of disease. Environ. Int. 2025, 196, 109300.
  6. Bai, K.; Li, K.; Sun, Y.; Wu, L.; Zhang, Y.; Chang, N.-B.; Li, Z. Global synthesis of two decades of research on improving PM2.5 estimation models from remote sensing and data science perspectives. Earth Sci. Rev. 2023, 241, 104461.
  7. Tian, J.; Chen, D. A semi-empirical model for predicting hourly ground-level fine particulate matter (PM2.5) concentration in southern Ontario from satellite remote sensing and ground-based meteorological measurements. Remote Sens. Environ. 2010, 114, 221–229.
  8. Jumaah, H.J.; Dawood, M.A.; Abd Alreza, T.A.; Meteab, M.A. Air pollution landscape in Iraq: A Sentinel-5P based assessment of key atmospheric pollutants. DYSONA Appl. Sci. 2026, 7, 82–87.
  9. Shang, K.; Yao, Y.; Di, Z.; Jia, K.; Zhang, X.; Fisher, J.B.; Chen, J.; Guo, X.; Yang, J.; Yu, R.; et al. Coupling physical constraints with machine learning for satellite-derived evapotranspiration of the Tibetan Plateau. Remote Sens. Environ. 2023, 289, 113519.
  10. Jiang, T.; Chen, B.; Nie, Z.; Ren, Z.; Xu, B.; Tang, S. Estimation of hourly full-coverage PM2.5 concentrations at 1-km resolution in China using a two-stage random forest model. Atmos. Res. 2021, 248, 105146.
  11. Sorek-Hamer, M.; Strawa, A.; Chatfield, R.; Esswein, R.; Cohen, A.; Broday, D. Improved retrieval of PM2.5 from satellite data products using non-linear methods. Environ. Pollut. 2013, 182, 417–423.
  12. Chen, Z.-Y.; Zhang, T.-H.; Zhang, R.; Zhu, Z.-M.; Ou, C.-Q.; Guo, Y. Estimating PM2.5 concentrations based on non-linear exposure-lag-response associations with aerosol optical depth and meteorological measures. Atmos. Environ. 2018, 173, 30–37.
  13. Chen, G.; Li, S.; Knibbs, L.D.; Hamm, N.A.S.; Cao, W.; Li, T.; Guo, J.; Ren, H.; Abramson, M.J.; Guo, Y. A machine learning method to estimate PM2.5 concentrations across China with remote sensing, meteorological and land use information. Sci. Total Environ. 2018, 636, 52–60.
  14. Suleiman, A.; Tight, M.; Quinn, A. Applying machine learning methods in managing urban concentrations of traffic-related particulate matter (PM10 and PM2.5). Atmos. Pollut. Res. 2019, 10, 134–144.
  15. Ye, Y.; Cao, Y.; Dong, Y.; Yan, H. A graph neural network and Transformer-based model for PM2.5 prediction through spatiotemporal correlation. Environ. Model. Softw. 2025, 191, 106501.
  16. Zhao, C.; Wang, Q.; Ban, J.; Liu, Z.; Zhang, Y.; Ma, R.; Li, S.; Li, T. Estimating the daily PM2.5 concentration in the Beijing-Tianjin-Hebei region using a random forest model with a 0.01 × 0.01 spatial resolution. Environ. Int. 2020, 134, 105297.
  17. Song, Y.; Zhang, C.; Jin, X.; Zhao, X.; Huang, W.; Sun, X.; Yang, Z.; Wang, S. Spatial prediction of PM2.5 concentration using hyper-parameter optimization XGBoost model in China. Environ. Technol. Innov. 2023, 32, 103272.
  18. Yi, L.; Mengfan, T.; Kun, Y.; Yu, Z.; Xiaolu, Z.; Miao, Z.; Yan, S. Research on PM2.5 estimation and prediction method and changing characteristics analysis under long temporal and large spatial scale-A case study in China typical regions. Sci. Total Environ. 2019, 696, 133983.
  19. Yan, X.; Zang, Z.; Luo, N.; Jiang, Y.; Li, Z. New interpretable deep learning model to monitor real-time PM2.5 concentrations from satellite data. Environ. Int. 2020, 144, 106060.
  20. Fu, M.; Kelly, J.A.; Clinch, J.P. Prediction of PM2.5 daily concentrations for grid points throughout a vast area using remote sensing data and an improved dynamic spatial panel model. Atmos. Environ. 2020, 237, 117667.
  21. Wu, Y.; Cai, D.; Gu, S.; Jiang, N.; Li, S. Compressive strength prediction of sleeve grouting materials in prefabricated structures using hybrid optimized XGBoost models. Constr. Build. Mater. 2025, 476, 141319.
  22. Meiseles, A.; Rokach, L. Iterative Feature eXclusion (IFX): Mitigating feature starvation in gradient boosted decision trees. Knowl. Based Syst. 2024, 289, 111546.
  23. Ali, A.; Huang, Z.; Bilal, M.; Assiri, M.E.; Mhawish, A.; Nichol, J.E.; de Leeuw, G.; Almazroui, M.; Wang, Y.; Alsubhi, Y. Long-term PM2.5 pollution over China: Identification of PM2.5 pollution hotspots and source contributions. Sci. Total Environ. 2023, 893, 164871.
  24. Zhang, H.; Jiang, Y.; Ding, M.; Xie, Z. Level, source identification, and risk analysis of heavy metal in surface sediments from river-lake ecosystems in the Poyang Lake, China. Environ. Sci. Pollut. Res. 2017, 24, 21902–21916.
  25. Jie, P.; Zhou, Y.; Zhang, Z.; Wei, F. Heating energy consumption prediction based on improved GA-BP neural network model. Energy 2025, 328, 136392.
  26. Ou, J.; Zhang, J.; Li, H.; Duan, B. Road damage prediction and intelligent maintenance methods based on stacking ensemble learning. Adv. Eng. Inform. 2025, 66, 103466.
  27. Li, X.; Chen, H.; Xu, L.; Mo, Q.; Du, X.; Tang, G. Multi-model fusion stacking ensemble learning method for the prediction of berberine by FT-NIR spectroscopy. Infrared Phys. Technol. 2024, 137, 105169.
  28. Roy, J.; Saha, S. Ensemble hybrid machine learning methods for gully erosion susceptibility map: K-fold cross validation approach. Artif. Intell. Geosci. 2022, 3, 28–45.
  29. Guo, B.; Zhang, D.; Pei, L.; Su, Y.; Wang, X.; Bian, Y.; Zhang, D.; Yao, W.; Zhou, Z.; Guo, L. Estimating PM2.5 concentrations via random forest method using satellite, auxiliary, and ground-level station dataset at multiple temporal scales across China in 2017. Sci. Total Environ. 2021, 778, 146288.
  30. Xu, Q.; Chen, X.; Yang, S.; Tang, L.; Dong, J. Spatiotemporal relationship between Himawari-8 hourly columnar aerosol optical depth (AOD) and ground-level PM2.5 mass concentration in mainland China. Sci. Total Environ. 2021, 765, 144241.
  31. Yang, L.; Xu, H.; Yu, S. Estimating PM2.5 concentrations in Yangtze River Delta region of China using random forest model and the Top-of-Atmosphere reflectance. J. Environ. Manag. 2020, 272, 111061.
  32. Chen, X.; Zhang, W.; He, J.; Zhang, L.; Guo, H.; Li, J.; Gu, X. Mapping PM2.5 concentration from the top-of-atmosphere reflectance of Himawari-8 via an ensemble stacking model. Atmos. Environ. 2024, 330, 120560.
  33. Amiri, Z.; Shahne, M.Z. Modeling PM2.5 concentration in Tehran using satellite-based Aerosol optical depth (AOD) and machine learning: Assessing input contributions and prediction accuracy. Remote Sens. Appl. Soc. Environ. 2025, 38, 101549.
  34. Chowdhury, S.; Saha, A.K.; Das, D.K. Hydroelectric Power Potentiality Analysis for the Future Aspect of Trends with R2 Score Estimation by XGBoost and Random Forest Regressor Time Series Models. Procedia Comput. Sci. 2025, 252, 450–456.
  35. Zhang, D.; Du, L.; Wang, W.; Zhu, Q.; Bi, J.; Scovronick, N.; Naidoo, M.; Garland, R.M.; Liu, Y. A machine learning model to estimate ambient PM2.5 concentrations in industrialized highveld region of South Africa. Remote Sens. Environ. 2021, 266, 112713.
  36. Yang, Y.; Wang, Z.; Cao, C.; Xu, M.; Yang, X.; Wang, K.; Guo, H.; Gao, X.; Li, J.; Shi, Z. Estimation of PM2.5 concentration across China based on multi-source remote sensing data and machine learning methods. Remote Sens. 2024, 16, 467.
  37. Li, X.; Li, L.; Chen, L.; Zhang, T.; Xiao, J.; Chen, L. Random Forest estimation and trend analysis of PM2.5 concentration over the Huaihai economic zone, China (2000–2020). Sustainability 2022, 14, 8520.
  38. Vovk, T.; Kryza, M.; Werner, M. Using random forest to improve EMEP4PL model estimates of daily PM2.5 in Poland. Atmos. Environ. 2024, 332, 120615.
  39. Lu, J.; Zhang, Y.; Chen, M.; Wang, L.; Zhao, S.; Pu, X.; Chen, X. Estimation of monthly 1 km resolution PM2.5 concentrations using a random forest model over “2 + 26” cities, China. Urban Clim. 2021, 35, 100734.
Figure 1. Flowchart of the specific process for the Random Forest and Extreme Gradient Boosting (RF-XGBoost) models.
Figure 2. The region of Jiangxi Province, China, and the distribution of ground monitoring stations.
Figure 3. Structure and specific schematic of the Random Forest and Extreme Gradient Boost (RF-XGBoost) models.
Figure 4. Density scatter plots of observed versus estimated PM2.5 mass concentrations.
Figure 5. Density scatter plots of observed PM2.5 mass concentrations versus those estimated using TOA reflectance for different years.
Figure 6. Density scatter plots of observed PM2.5 mass concentrations versus those estimated using TOA reflectance for different seasons.
Figure 7. Spatial Distribution of Annual Average PM2.5 Mass Concentration in Jiangxi Province from 2020 to 2023.
Figure 8. Seasonal distribution of estimated PM2.5 mass concentrations in Jiangxi Province from 2020 to 2023.
Figure 9. Spatial Distribution of Annual Average PM2.5 Mass Concentration, GDP, and Factory Locations in Jiangxi Province from 2020 to 2023.
Figure 10. Relative Importance of Influencing Factors in the RF-XGBoost Model.
Figure 11. Comparative Analysis of the Predictive Performance of Analytical Models.
Table 1. Catalog of datasets used for modeling.

| Data Directory | Variable | Spatial Resolution | Temporal Resolution |
|----------------|----------|--------------------|---------------------|
| PM2.5 data | Daily mean PM2.5 mass concentration | - | Hour |
| MODIS data | Top-of-Atmosphere reflectance (Band 1, 3, 7) | 1 km | Day |
| Meteorological data | Planetary boundary layer height (PBLH) | 0.25° | Hour |
| | Total precipitation (TP) | 0.25° | Hour |
| | Relative humidity (RH) | 0.25° | Hour |
| | 2 m air temperature (T2m) | 0.25° | Hour |
| | Surface skin temperature (SKT) | 0.25° | Hour |
| | Surface pressure (SP) | 0.25° | Hour |
| | 10 m northward wind speed (V10) | 0.25° | Hour |
| | 10 m eastward wind speed (U10) | | |
| Elevation Data | DEM | 1 km | - |
| Land Use Data | Normalized difference vegetation index (NDVI) | 1 km | 30 days |
| Day Data | Number of days in a year | - | 1 day |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
