Identifying Meteorological and Gaseous Pollutant Factors Across PM2.5 Pollution Levels for Sustainable Air Quality Management in the Beijing–Tianjin–Hebei Region Using CatBoost–SHAP: A 2021–2024 Analysis

Zeng, Ling; Shuai, Dandan; Xu, Daichi; Jing, Linhai

doi:10.3390/su18115611

¹ Geomathematics Key Laboratory of Sichuan Province, Chengdu Technological University, Chengdu 610059, China

² School of Mathematical Science, Chengdu Technological University, Chengdu 610059, China

³ School of Artificial Intelligence, China University of Geosciences, Beijing 100083, China

* Authors to whom correspondence should be addressed.

Sustainability 2026, 18(11), 5611; https://doi.org/10.3390/su18115611

Submission received: 24 April 2026 / Revised: 20 May 2026 / Accepted: 29 May 2026 / Published: 2 June 2026

Abstract

This study examines the meteorological and gaseous pollutant drivers of PM_2.5 under mild, moderate, and severe pollution conditions in the Beijing–Tianjin–Hebei region, with the aim of supporting sustainable air quality management. Daily observations from approximately 65 monitoring stations from 1 November 2021 to 31 October 2024 were used, including PM_2.5, four gaseous pollutants (SO₂, NO₂, CO, and O₃), and five meteorological variables: temperature, pressure, relative humidity, precipitation, and wind speed. A CatBoost–SHAP framework was adopted, with CatBoost used for station-level spatial prediction of PM_2.5 and SHAP applied to interpret variable contributions. Based on predefined PM_2.5 thresholds, 425 pollution days were classified into three pollution-level scenarios. These pollution days occurred mainly in winter and spring, with higher frequencies in Handan, Baoding, and Shijiazhuang, followed by Tianjin and Beijing. The model performed well across the three pollution-level scenarios. The severe-pollution scenario achieved the highest R², indicating a clearer spatial structure under high-PM_2.5 conditions. Although absolute RMSE and MAE increased with pollution severity, their normalized values changed little, suggesting that larger errors mainly reflected stronger spatial heterogeneity at higher PM_2.5 concentrations. SHAP results showed that CO, precipitation, wind speed, and temperature dominated the prediction structure. CO was the most stable and influential predictor, but its importance should be interpreted as an indicator of combustion-related pollution accumulation rather than direct causality. Precipitation represented event-dependent wet scavenging, wind speed reflected dispersion conditions, and temperature captured seasonal and thermal background effects. SHAP dependence analysis further indicated that CO had the clearest direct dependence, whereas wind speed and temperature were more background-dependent, and precipitation acted as an episodic nonlinear regulator.

Keywords:

PM_2.5; Beijing–Tianjin–Hebei; CatBoost–SHAP; meteorological factors; gaseous pollutants; pollution-level scenarios; sustainable air quality management

1. Introduction

Fine particulate matter (PM_2.5) is a major air pollutant with substantial implications for air quality and human health. A growing body of evidence has shown that exposure to PM_2.5 is closely associated with cardiovascular and respiratory diseases and with increased premature mortality risk [1,2]. In China, rapid industrialization, urbanization, dense population, and intensive energy consumption have long made the Beijing–Tianjin–Hebei (BTH) region a hotspot of PM_2.5 pollution. In addition, the surrounding Taihang and Yanshan Mountains, together with unfavorable regional meteorological conditions, can weaken atmospheric ventilation and promote pollutant accumulation, giving the BTH region pronounced characteristics of regional haze and heavy PM_2.5 pollution [3,4].

Although annual mean PM_2.5 concentrations in the Beijing–Tianjin–Hebei (BTH) region have declined markedly since the implementation of China’s Air Pollution Prevention and Control Action Plan in 2013, with 2024 levels in Beijing, Tianjin, and Hebei all more than 60% lower than those in 2013 [5], seasonal and regional pollution episodes still occur repeatedly. In particular, wintertime PM_2.5 pollution in North China and the BTH region is often aggravated by unfavorable meteorological conditions, including atmospheric stagnation, weak winds, stable stratification, high humidity, and suppressed boundary-layer development, which facilitate pollutant accumulation and enhance the effects of combustion-related emissions [6,7]. Recent studies further suggest that, despite the overall decline in PM_2.5 concentrations, regional transport, meteorological stagnation, and precursor sensitivity continue to play important roles in determining pollution severity and the evolution of pollution episodes in the BTH region [8,9]. Therefore, clarifying how meteorological factors and gaseous pollutants influence PM_2.5 under different pollution levels is essential for developing targeted and pollution-level-specific control strategies in this region.

Previous studies have extensively investigated PM_2.5 pollution in the BTH region from the perspectives of spatiotemporal distribution [10,11], influencing factors [6,12], and regional transport and pollution episodes [13,14]. However, most of these studies have focused on long-term average conditions, overall trends, or individual pollution events, while systematic comparisons across different PM_2.5 pollution levels remain limited. In addition, existing studies have seldom quantified the relative contribution strength, ranking variation, and influence patterns of meteorological factors and gaseous precursors under mild, moderate, and severe pollution conditions. Therefore, the differentiated driving mechanisms of PM_2.5 across pollution levels in the BTH region remain insufficiently understood.

In recent years, data-driven methods, including machine learning [15], deep learning [16,17], and explainable artificial intelligence [18,19,20], have been increasingly applied to air quality prediction and factor analysis. These approaches are capable of capturing complex nonlinear relationships among meteorological variables, precursor pollutants, and air-quality indicators, and often achieve strong predictive performance [15,16,17]. In particular, interpretable frameworks based on SHAP have made it possible to quantify the contribution of individual predictors and improve model transparency [18,19].

To address these gaps, this study uses daily observations from approximately 65 air quality monitoring stations in the BTH region from 1 November 2021 to 31 October 2024 and investigates PM_2.5 pollution under different severity levels using a CatBoost–SHAP framework [21,22,23]. Unlike studies focusing on the full range of daily PM_2.5 concentrations, this work specifically focuses on PM_2.5 pollution days. These pollution days are classified into three levels based on daily PM_2.5 concentrations: mild pollution (75 μg/m³ ≤ PM_2.5 < 115 μg/m³), moderate pollution (115 μg/m³ ≤ PM_2.5 < 150 μg/m³), and severe pollution (PM_2.5 ≥ 150 μg/m³). Daily PM_2.5 concentrations are analyzed together with four gaseous pollutants (SO₂, NO₂, CO, and O₃) and five meteorological variables, including temperature (T), pressure (P), relative humidity (RH), precipitation (PRE), and wind speed (WS). Specifically, this study aims to: (1) reveal the temporal and spatial distribution characteristics of mild, moderate, and severe PM_2.5 pollution days in the BTH region during 2021–2024; (2) quantify the relative importance and ranking changes in meteorological factors and gaseous precursors across pollution levels; and (3) examine the dependence patterns of key drivers under different covariate backgrounds.

2. Methodology

2.1. Study Area and Data Source

The study area encompasses the Beijing–Tianjin–Hebei (BTH) region in North China, located in the northern part of the North China Plain and surrounded by the Taihang Mountains to the west and the Yanshan Mountains to the north. Spanning approximately 218,000 km², this national-level strategic region serves as China’s political, economic, and cultural hub, with Beijing as the national capital and Tianjin as a major international port city. It plays a pivotal role in driving innovation, governance, coordinated regional development, and high-quality urbanization under the Beijing–Tianjin–Hebei Coordinated Development Plan.

Geographically, the BTH region includes the municipality of Beijing, the municipality of Tianjin, and the entirety of Hebei Province, which consists of 11 prefecture-level cities (Figure 1). Its coordinates range from approximately 113°04′ E to 119°53′ E and 36°01′ N to 42°37′ N. The terrain consists predominantly of plains in the east and southeast, with hills and low to medium mountains in the western and northern parts, which significantly influence local meteorological conditions and pollutant dispersion.

The dataset comprises daily mean PM_2.5 concentrations and four gaseous pollutants (SO₂, NO₂, CO, and O₃) collected from approximately 65 air quality monitoring stations across the BTH region (Figure 1). All air pollutant data, along with five meteorological variables—mean air temperature (T), atmospheric pressure (P), relative humidity (RH), precipitation (PRE), and mean wind speed (WS)—were obtained from the Environmental Information and Analysis (EIA) Data Platform available at http://eia-data.com/ (accessed on 31 December 2025), which integrates official observations from the China National Environmental Monitoring Center (CNEMC) network. To generate station-specific meteorological inputs, each air quality station was matched to its nearest meteorological station based on spatial proximity, and the corresponding meteorological time series was assigned as the matched record. The study period spans from 1 November 2021 to 31 October 2024, encompassing three complete seasonal cycles and precisely aligning with the three defined annual periods (1 November 2021–31 October 2022, 1 November 2022–31 October 2023, and 1 November 2023–31 October 2024).
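The nearest-station matching described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the text only says "spatial proximity", so the great-circle (haversine) distance is an assumption, and the station identifiers and coordinates are hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def match_nearest(aq_stations, met_stations):
    """Map each air quality station id to the id of its nearest meteorological station."""
    matches = {}
    for aq_id, (lat, lon) in aq_stations.items():
        matches[aq_id] = min(
            met_stations,
            key=lambda met_id: haversine_km(lat, lon, *met_stations[met_id]),
        )
    return matches

# Hypothetical coordinates (lat, lon), for illustration only
aq_stations = {"AQ_A": (39.90, 116.40), "AQ_B": (38.05, 114.50)}
met_stations = {
    "MET_1": (39.93, 116.28),
    "MET_2": (39.10, 117.20),
    "MET_3": (38.03, 114.42),
}
matches = match_nearest(aq_stations, met_stations)
```

Each air quality station then inherits the full meteorological time series of its matched station, so several air quality stations may share one meteorological record.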

The majority of these approximately 65 air quality stations are concentrated in densely populated urban and downtown areas, consistent with national monitoring priorities that emphasize locations with high population exposure and intense emission sources. Consequently, coverage is dense in major urban cores such as central Beijing, Tianjin, and Shijiazhuang, but relatively sparse in remote or mountainous counties, particularly in western and northern Hebei Province.

2.2. Descriptive Statistics

Descriptive statistics of datasets for the air quality variable (PM_2.5), four gaseous pollutants (SO₂, NO₂, CO, and O₃), and five meteorological variables (T, P, RH, PRE, and WS), aggregated across all 65 stations, are summarized in Table 1. Statistics include the mean, minimum (Min) and maximum (Max) values, standard deviation (SD), and coefficient of variation (CV), calculated as (SD/mean) × 100%.

Table 1 shows that PM_2.5 exhibited substantial variability across the BTH region during the study period, with a mean concentration of 38.93 μg/m³, a standard deviation of 32.78 μg/m³, and a coefficient of variation of 84%, indicating considerable temporal and spatial fluctuations. Among the gaseous precursors, O₃ had the highest mean concentration (102.21 μg/m³), whereas SO₂ showed the lowest mean level (6.49 μg/m³). NO₂ and CO displayed moderate variability, suggesting relatively stable but still fluctuating precursor conditions.
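As a quick check of the coefficient of variation, CV = (SD/mean) × 100%, the PM_2.5 values quoted above can be plugged in directly. The helper `describe` is a hypothetical convenience function, and its use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
import statistics

def describe(values):
    """Return the summary statistics listed for Table 1 for one variable."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD (assumption; the estimator is not stated)
    return {
        "mean": mean,
        "min": min(values),
        "max": max(values),
        "sd": sd,
        "cv_percent": sd / mean * 100,
    }

# CV of PM2.5 from the reported mean (38.93) and SD (32.78), both in ug/m3
cv_pm25 = 32.78 / 38.93 * 100  # about 84.2, consistent with the reported 84%

# Hypothetical three-value series to exercise the helper
toy = describe([10.0, 20.0, 30.0])
```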

2.3. Temporal Statistics of PM_2.5 Pollution Level Days

Figure 2 presents the monthly distribution of PM_2.5 pollution days at three levels in the BTH region from 1 November 2021 to 31 October 2024. The figure consists of three subfigures corresponding to Year 1 (1 November 2021–31 October 2022), Year 2 (1 November 2022–31 October 2023), and Year 3 (1 November 2023–31 October 2024). Within each subfigure, three pie charts illustrate the distributions of mild pollution, moderate pollution, and severe pollution.

The PM_2.5 pollution levels were classified according to the Ambient Air Quality Standards of China (GB 3095–2012): mild pollution was defined as 75 μg/m³ ≤ PM_2.5 < 115 μg/m³, moderate pollution as 115 μg/m³ ≤ PM_2.5 < 150 μg/m³, and severe pollution as PM_2.5 ≥ 150 μg/m³ [24].

The sectors are labeled by month and colored by season, including winter (November–January, shades of blue), spring (February–April, shades of green), summer (May–July, shades of red), and autumn (August–October, shades of yellow). The classification of pollution days was based on the daily maximum PM_2.5 concentration among approximately 65 monitoring stations. Specifically, a day was classified as a mild, moderate, or severe pollution day when this daily maximum PM_2.5 concentration fell within the corresponding PM_2.5 pollution-level range.
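The day-level classification rule and the season convention used here can be written compactly. This is an illustrative sketch: the function names are hypothetical, and returning `None` for days below 75 μg/m³ is an assumption about how non-pollution days are handled.

```python
def pollution_level(pm25):
    """Pollution level for a PM2.5 concentration in ug/m3, or None below 75."""
    if pm25 >= 150:
        return "severe"
    if pm25 >= 115:
        return "moderate"
    if pm25 >= 75:
        return "mild"
    return None

def classify_day(station_pm25):
    """Classify a day by the maximum daily PM2.5 among all stations."""
    return pollution_level(max(station_pm25))

def season(month):
    """Season convention of the study: winter is November to January, and so on."""
    if month in (11, 12, 1):
        return "winter"
    if month in (2, 3, 4):
        return "spring"
    if month in (5, 6, 7):
        return "summer"
    return "autumn"

# Hypothetical daily values at three stations: the maximum (120) decides the level
level = classify_day([40.0, 80.0, 120.0])
```

Because the daily maximum decides the level, a day can be labeled severe even when most stations record much lower concentrations.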

Clear seasonal differences were observed across the three pollution levels. Severe pollution days were concentrated predominantly in winter throughout the three study years, with only limited occurrences in spring and autumn and almost none in summer. Moderate pollution days were also mainly distributed in winter and spring, while autumn contributed a smaller share and summer remained negligible. In contrast, mild pollution days occurred more frequently and were most common in spring, followed by winter, with autumn showing a noticeable contribution and summer consistently recording the fewest pollution days.

Total precipitation over the three-year study period was also summarized by season: it was highest in summer (88,794.23 mm), followed by autumn (17,676.16 mm), spring (12,922.33 mm), and winter (3997.56 mm). This pattern was the opposite of the seasonal distribution of PM_2.5 pollution days: summer had the highest rainfall but the fewest pollution events [25].

At the monthly scale, January was one of the most prominent months for moderate and severe pollution, and November and December also formed a clear winter cluster. For mild pollution, March and April were particularly important in several years, while February became more prominent in the later study period. Autumn pollution was mainly concentrated in September and October, especially for severe events in Year 2 and Year 3.

Overall, the frequency and severity of PM_2.5 pollution in the BTH region were dominated by winter, followed by spring.

2.4. Spatial Statistics of PM_2.5 Pollution Level Days

Figure 3 presents the spatial distribution of mild, moderate, and severe PM_2.5 pollution days in the BTH region during the study period. It summarizes the frequency of pollution-level days in different cities and highlights the spatial differences among the three pollution categories. Overall, PM_2.5 pollution days were unevenly distributed across cities in the BTH region, and the highest numbers of pollution days were recorded in Handan, Baoding, and Shijiazhuang, followed by Tianjin and Beijing.

2.5. CatBoost

CatBoost is an ensemble learning algorithm based on gradient boosting decision trees, which constructs a strong predictor by iteratively combining multiple weak learners [21,26]. Given a training dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ denotes the input feature vector and $y_i$ denotes the observed target value of the $i$-th sample, the model starts from an initial prediction defined as:

$$\hat{y}_i^{(0)} = \arg\min_{c} \sum_{i=1}^{n} L(y_i, c) \tag{1}$$

where $L(\cdot)$ is the loss function. At the $m$-th boosting iteration, the model prediction is updated by adding a new decision tree to the previous prediction:

$$\hat{y}_i^{(m)} = \hat{y}_i^{(m-1)} + \eta f_m(x_i) \tag{2}$$

where $\hat{y}_i^{(m)}$ is the predicted value after the $m$-th iteration, $\eta$ is the learning rate, and $f_m(x_i)$ is the output of the newly added decision tree. Under the gradient boosting framework, the new tree is fitted to the pseudo-residuals, which are defined as the negative gradient of the loss function with respect to the current prediction [26]:

$$r_i^{(m)} = -\left[\frac{\partial L(y_i, \hat{y}_i)}{\partial \hat{y}_i}\right]_{\hat{y}_i = \hat{y}_i^{(m-1)}} \tag{3}$$

After $M$ iterations, the final prediction of the CatBoost model can be expressed as

$$\hat{y}_i = \hat{y}_i^{(0)} + \eta \sum_{m=1}^{M} f_m(x_i) \tag{4}$$

Compared with conventional gradient boosting methods, CatBoost introduces ordered boosting to reduce prediction shift during training and thereby improve model robustness and generalization performance [21]. In addition, as a tree-based ensemble model, CatBoost is capable of capturing nonlinear relationships and complex interactions among variables, which makes it suitable for identifying the combined effects of meteorological factors and gaseous precursors on PM_2.5 pollution levels.
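To make the update rules in Equations (1)–(4) concrete, the following sketch implements plain gradient boosting with one-split regression trees and squared loss. It is not CatBoost: ordered boosting and CatBoost's tree structure are omitted, and the single feature, the data values, and the hyperparameters (`n_trees`, `eta`) are hypothetical.

```python
def fit_stump(x, r):
    """Fit a one-split regression tree to residuals r on a single feature x.

    Assumes x contains at least two distinct values.
    """
    best = None
    for t in sorted(set(x))[:-1]:  # drop the largest value so the right leaf is never empty
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - lm) ** 2 for ri in left) + sum((ri - rm) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def boost(x, y, n_trees=50, eta=0.1):
    """Gradient boosting for the squared loss L = (y - yhat)^2 / 2."""
    y0 = sum(y) / len(y)  # Eq. (1): the mean minimises the squared loss
    pred = [y0] * len(y)
    trees = []
    for _ in range(n_trees):
        # Eq. (3): for squared loss the negative gradient is the ordinary residual
        resid = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_stump(x, resid)
        trees.append(tree)
        # Eq. (2): shrink the new tree by the learning rate and add it
        pred = [pi + eta * tree(xi) for pi, xi in zip(pred, x)]
    # Eq. (4): initial prediction plus the shrunken sum of all trees
    return lambda xi: y0 + eta * sum(tree(xi) for tree in trees)

def mse(y, pred):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)

# Hypothetical CO (mg/m3) and PM2.5 (ug/m3) values at six stations
x = [0.4, 0.6, 0.8, 1.0, 1.2, 1.4]
y = [60.0, 72.0, 85.0, 98.0, 110.0, 125.0]
model = boost(x, y)
baseline_mse = mse(y, [sum(y) / len(y)] * len(y))
boosted_mse = mse(y, [model(xi) for xi in x])
```

With squared loss, each added tree lowers the training error as long as the residuals are not already zero, which is why `boosted_mse` ends up below `baseline_mse` on the training data.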

2.6. SHapley Additive exPlanations

SHAP (SHapley Additive exPlanations) was introduced to explain how each predictor contributed to the CatBoost-based PM_2.5 estimates [22]. For each sample, the model output was represented as the sum of a baseline prediction and the feature-level SHAP contributions. For the $i$-th sample, the prediction can be written as:

$$\hat{y}_i = \phi_0 + \sum_{j=1}^{p} \phi_{ij} \tag{5}$$

where $\phi_0$ is the expected model output, $p$ is the number of explanatory variables, and $\phi_{ij}$ denotes the SHAP value of feature $j$ for sample $i$. Positive and negative SHAP values indicate whether a given feature increases or decreases the predicted PM_2.5 concentration relative to the baseline.

The SHAP value of feature $j$ is obtained by averaging its marginal contribution over all possible feature subsets:

$$\phi_{ij} = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|!\,(p - |S| - 1)!}{p!} \left[ f_{S \cup \{j\}}(x_i) - f_S(x_i) \right] \tag{6}$$

where $F$ is the complete feature set, $S$ is a subset that does not include feature $j$, $f_S(x_i)$ is the model output based on subset $S$, and $f_{S \cup \{j\}}(x_i)$ is the output after feature $j$ is added. This formulation provides consistent feature attribution and links each prediction to the contribution of individual variables [22].

Because the prediction model in this study was tree-based, SHAP values were calculated using TreeSHAP, which is designed for efficient interpretation of tree ensemble models [23]. The global contribution of each variable was summarized by the mean absolute SHAP value:

$$I_j = \frac{1}{n} \sum_{i=1}^{n} |\phi_{ij}| \tag{7}$$

where $I_j$ represents the overall contribution strength of feature $j$ across all samples. A larger $I_j$ indicates that the variable played a more important role in PM_2.5 prediction. In this study, these SHAP values were used to compare variable importance across pollution levels and to examine the dependence patterns of dominant meteorological and gaseous pollutant factors.
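Equations (5)–(7) can be illustrated by brute-force enumeration on a small toy model. This sketch is not TreeSHAP. It assumes that features outside the subset S are replaced by background (column-mean) values, which is only one possible way to define f_S. The toy model, its coefficients, and the sample values are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, background):
    """Exact Shapley values for one sample by enumerating all feature subsets (Eq. 6)."""
    p = len(x)

    def f_subset(subset):
        # Features in the subset keep their observed value; the rest use the background
        z = [x[k] if k in subset else background[k] for k in range(p)]
        return model(z)

    phi = []
    for j in range(p):
        others = [k for k in range(p) if k != j]
        total = 0.0
        for size in range(p):  # subset sizes 0 .. p-1
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (f_subset(s | {j}) - f_subset(s))
        phi.append(total)
    return f_subset(set()), phi  # baseline phi_0 and the per-feature contributions

def toy_model(z):
    """Hypothetical PM2.5 response to CO, wind speed, and temperature."""
    co, ws, t = z
    return 30.0 + 60.0 * co - 4.0 * ws + 0.5 * t - 3.0 * co * ws

samples = [[1.2, 1.0, -2.0], [0.8, 3.0, 5.0], [1.6, 0.5, -6.0]]
background = [sum(col) / len(col) for col in zip(*samples)]

results = [shapley_values(toy_model, x, background) for x in samples]

# Eq. (7): mean absolute SHAP value per feature across samples
importance = [
    sum(abs(phi[j]) for _, phi in results) / len(results)
    for j in range(len(background))
]
```

For every sample the baseline plus the three contributions reproduces the model output, which is the additivity property stated in Equation (5). Enumeration costs grow exponentially with the number of features, which is the practical reason a tree-specific algorithm is used for ensembles.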

2.7. Evaluation Metrics

Model performance is evaluated using mean absolute error (MAE), root mean square error (RMSE), and the coefficient of determination (R²). MAE reflects the average magnitude of prediction errors, RMSE penalizes larger errors more heavily, and R² measures the proportion of variance in observed PM_2.5 values explained by the predictions.

2.7.1. Mean Absolute Error (MAE)

MAE calculates the average absolute difference between actual and predicted values, offering a straightforward error metric that is less sensitive to outliers than RMSE. It is defined as

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \tag{8}$$

where $y_i$ and $\hat{y}_i$ represent the observed and predicted PM_2.5 values for sample $i$, respectively, and $n$ is the number of samples. A smaller MAE indicates a lower average prediction error. To compare MAE across pollution levels with different PM_2.5 concentration baselines, normalized MAE was also calculated as:

$$\mathrm{MAE\%} = \frac{\mathrm{MAE}}{\bar{y}} \tag{9}$$

where $\bar{y}$ is the mean observed PM_2.5 concentration for the corresponding daily model.

2.7.2. Root Mean Square Error (RMSE)

RMSE was used to measure prediction error while assigning greater weight to larger deviations between observed and predicted values. It is expressed as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{10}$$

where $y_i$, $\hat{y}_i$, and $n$ are as defined above.

Compared with MAE, RMSE is more sensitive to large errors and therefore provides complementary information on model performance. Similarly, normalized RMSE was calculated to reduce the influence of different PM_2.5 concentration scales among pollution levels:

$$\mathrm{RMSE\%} = \frac{\mathrm{RMSE}}{\bar{y}} \tag{11}$$

where $\bar{y}$ is the mean observed PM_2.5 concentration for the corresponding daily model.

2.7.3. R-Squared (R²)

$R^2$, or the coefficient of determination, is a unitless measure used to evaluate how much of the variance in the dependent variable is explained by the regression model. It reflects the overall goodness of fit and is defined as

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{12}$$

where $y_i$ is the observed value, $\hat{y}_i$ is the predicted value, $n$ is the sample size, and $\bar{y}$ is the mean of the observed values. An $R^2$ value close to 1 indicates strong explanatory power, while a value near 0 suggests limited improvement over the mean prediction. Negative values may occur when the model performs worse than the mean-based baseline.
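The metrics in Equations (8)–(12) can be computed in a few lines. The sketch below uses three hypothetical observed and predicted PM_2.5 values and follows the ratio form of the normalized errors exactly as written in Equations (9) and (11), without a multiplication by 100.

```python
import math

def mae(y, yhat):
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def rmse(y, yhat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def r2(y, yhat):
    ybar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - ybar) ** 2 for a in y)
    return 1 - ss_res / ss_tot

def normalized(error, y):
    """Divide an error metric by the mean observed value (Eqs. 9 and 11)."""
    return error / (sum(y) / len(y))

# Hypothetical observed and predicted PM2.5 (ug/m3) at three test stations
y_obs = [80.0, 100.0, 120.0]
y_pred = [82.0, 100.0, 110.0]

mae_value = mae(y_obs, y_pred)       # (2 + 0 + 10) / 3 = 4.0
rmse_value = rmse(y_obs, y_pred)     # sqrt(104 / 3), larger than the MAE
r2_value = r2(y_obs, y_pred)         # 1 - 104 / 800 = 0.87
mae_norm = normalized(mae_value, y_obs)   # 4.0 / 100.0 = 0.04
```

The single 10 μg/m³ miss pulls RMSE above MAE, which illustrates why the two metrics are reported together.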

3. Results

According to the pollution-day definition and temporal statistics presented in Section 2.3, each day was classified as a mild, moderate, or severe PM_2.5 pollution day according to the PM_2.5 pollution-level range into which the daily maximum PM_2.5 concentration among the 65 monitoring stations fell. Based on this definition, a total of 425 pollution days were identified during the three-year study period and retained for subsequent analysis.

For each selected pollution day, observations from all 65 stations were used for modeling, including stations that did not reach the pollution level assigned to that day, with PM_2.5 as the response variable and nine explanatory variables (five meteorological factors and four gaseous pollutants). An 8:2 split was applied to the station observations to construct the training and test datasets; that is, the split was applied across stations rather than across the temporal sequence of the full study period. CatBoost was used to perform daily spatial prediction of PM_2.5, while TreeSHAP was employed to quantify the contributions of the nine predictors.
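A station-wise 8:2 split for one pollution day might look like the sketch below. The text does not say how stations were assigned to the two sets, so the random shuffle, the seed, and the station identifiers are all assumptions made for illustration.

```python
import random

def split_stations(station_ids, train_fraction=0.8, seed=0):
    """Shuffle station ids and split them into training and test sets."""
    ids = list(station_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_train = round(train_fraction * len(ids))
    return ids[:n_train], ids[n_train:]

# Hypothetical identifiers for 65 stations
station_ids = [f"S{k:02d}" for k in range(1, 66)]
train_ids, test_ids = split_stations(station_ids)
```

With 65 stations this yields 52 training and 13 test stations per daily model, so each day's test metrics rest on a small number of held-out stations.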

3.1. Performance of CatBoost for PM_2.5 Spatial Prediction

Across the 425 pollution days included in the analysis, the CatBoost model showed generally good predictive skill for station-level PM_2.5, although its performance varied with pollution severity. Figure 4 presents the boxplots of the model performance metrics (R², RMSE, and MAE) over the 425 pollution days.

As shown in Figure 4, across all pollution days, the median values of R², RMSE, and MAE were 0.866, 7.509, and 5.906, respectively. For mild pollution days, the median R², RMSE, and MAE were 0.821, 6.639, and 5.156, respectively. For moderate pollution days, the corresponding median values were 0.882, 7.544, and 6.135, representing increases of 7.4%, 13.6%, and 19.0% relative to the mild pollution level. For severe pollution days, the model achieved the highest explanatory power, with a median R² of 0.919, while the median RMSE and MAE increased to 11.001 and 8.736, respectively, corresponding to increases of 11.9%, 65.7%, and 69.4% compared with the mild pollution level.
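The relative changes quoted above follow directly from the reported medians, as this short check shows. The dictionary layout and function name are illustrative only; the numbers are those given in the text.

```python
# Median metrics reported for each pollution level
mild = {"R2": 0.821, "RMSE": 6.639, "MAE": 5.156}
moderate = {"R2": 0.882, "RMSE": 7.544, "MAE": 6.135}
severe = {"R2": 0.919, "RMSE": 11.001, "MAE": 8.736}

def pct_increase(new, base):
    """Percentage increase of new relative to base."""
    return (new - base) / base * 100

moderate_vs_mild = {k: round(pct_increase(moderate[k], mild[k]), 1) for k in mild}
severe_vs_mild = {k: round(pct_increase(severe[k], mild[k]), 1) for k in mild}
# moderate_vs_mild -> {"R2": 7.4, "RMSE": 13.6, "MAE": 19.0}
# severe_vs_mild   -> {"R2": 11.9, "RMSE": 65.7, "MAE": 69.4}
```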

To account for differences in baseline PM_2.5 concentrations among pollution-day scenarios, normalized RMSE (RMSE%) and normalized MAE (MAE%) were additionally calculated for each daily model by dividing the RMSE and MAE by the mean observed PM_2.5 concentration across all 65 stations on that day. Across all pollution days, the median values of RMSE% and MAE% were 0.85 and 0.89, respectively. The median RMSE% and MAE% were 0.88 and 0.98 for mild pollution days, 0.87 and 0.94 for moderate pollution days, and 0.89 and 0.88 for severe pollution days, respectively. Overall, the normalized errors showed only minor differences among pollution levels.

3.2. SHAP-Based Interpretation of Meteorological and Gaseous Pollutant Contributions

To further interpret the CatBoost model, the SHAP results of the nine explanatory variables were summarized by mean absolute SHAP values, relative percentage of contributions among the nine variables, rankings, and sign frequencies across pollution levels and study years (Table 2).
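The summary quantities listed above can be derived from a matrix of per-sample SHAP values as sketched below. The SHAP values are hypothetical, only three of the nine variables are shown, and treating "sign frequency" as the fraction of samples with a positive SHAP value is an assumption about how that column was defined.

```python
def summarize_shap(shap_rows, names):
    """Mean |SHAP|, relative share, rank, and positive-sign frequency per feature."""
    n = len(shap_rows)
    mean_abs = [sum(abs(row[j]) for row in shap_rows) / n for j in range(len(names))]
    total = sum(mean_abs)
    order = sorted(range(len(names)), key=lambda j: mean_abs[j], reverse=True)
    rank = {j: position + 1 for position, j in enumerate(order)}
    return {
        name: {
            "mean_abs": mean_abs[j],
            "share_percent": 100 * mean_abs[j] / total,
            "rank": rank[j],
            "positive_fraction": sum(row[j] > 0 for row in shap_rows) / n,
        }
        for j, name in enumerate(names)
    }

# Hypothetical SHAP values (rows are samples, columns are CO, PRE, WS)
names = ["CO", "PRE", "WS"]
shap_rows = [
    [6.0, -1.0, -2.0],
    [-4.0, 0.5, 1.0],
    [5.0, -1.5, -3.0],
]
summary = summarize_shap(shap_rows, names)
```

In this toy case CO has a mean absolute SHAP value of 5.0 and a share of 62.5%, WS ranks second with 25%, and PRE ranks third with 12.5%.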

Overall, the contribution structure showed a clear concentration pattern. Across all year–pollution combinations, CO, PRE, WS, and T consistently accounted for the vast majority of the total SHAP contribution, with a combined relative share of approximately 88–95%. This indicates that PM_2.5 prediction mainly relied on a small core group of variables, rather than being evenly explained by all nine predictors. Within this group, CO made the largest and most stable contribution to model prediction, while PRE, WS, and T together formed the leading meteorological controls, although their internal ranking varied slightly across years and pollution levels. By contrast, SO₂, RH, and NO₂ provided only limited supplementary contributions, and O₃ and P remained negligible throughout.

Among the gaseous pollutants, the contribution structure was overwhelmingly dominated by CO, whereas SO₂, NO₂, and especially O₃ played much smaller roles. Among the meteorological variables, the main contribution came from PRE, WS, and T, while RH and P were consistently weak. Therefore, the overall SHAP structure was primarily shaped by one dominant gaseous pollutant indicator (CO) and three major meteorological regulators (PRE, WS, and T).

Across pollution levels, most variables showed increasing mean absolute SHAP values from mild to moderate to severe pollution, suggesting that their contributions generally strengthened as pollution intensified. This pattern was particularly clear for CO, and was also evident for WS, T, SO₂, RH, NO₂, O₃, and P. However, PRE did not show a regular monotonic trend, but instead displayed substantial interannual and inter-level fluctuations. This irregularity likely reflects the event-dependent nature of precipitation, whose contribution depends more on whether rainfall occurs and under what pollution context it occurs than on pollution severity alone.

The annual results further confirmed the robustness of this structure. CO remained the top-ranked variable under mild, moderate, and severe pollution in all three years. PRE, WS, and T consistently occupied the next most important positions, although their internal order varied slightly across years and pollution levels. In contrast, SO₂, RH, and NO₂ generally remained in the middle positions, while O₃ and P consistently ranked at the bottom. Overall, Table 2 indicates that PM_2.5 prediction in the BTH region was mainly explained by the combined contribution of CO, PRE, WS, and T, while the remaining variables played secondary or negligible roles.

3.3. SHAP Dependence Patterns of the Dominant Variables Under Different Covariate Backgrounds

Based on the results of Section 3.2, the analysis of SHAP dependence patterns focuses on CO, PRE, WS, and T, which together accounted for the vast majority of the total SHAP contribution. Figures S1–S4 show the SHAP dependence plots of these four variables under mild PM_2.5 pollution conditions, Figures S5–S8 show the corresponding plots under moderate pollution conditions, and Figures S9–S12 show the corresponding plots under severe pollution conditions.

Each supplementary figure contains eight subfigures (a–h), with each subfigure showing the SHAP dependence pattern of the focal variable under the background of one of the other eight explanatory variables. In each subfigure, the x-axis represents the value of the focal variable, the y-axis represents its SHAP value for PM_2.5 prediction, and the point color indicates the value of one of the other eight covariates. These plots therefore illustrate both the dependence pattern of each focal variable and its variation under different covariate backgrounds.

Among the four focal variables, CO showed the clearest and most stable dependence pattern across the three pollution levels. In Figures S1, S5 and S9, CO-SHAP generally increased with increasing CO concentration, and the SHAP range became wider under heavier pollution conditions. In addition, relatively clear background relationships were observed with RH and NO₂, while a weaker relationship was also visible with SO₂, especially under mild and moderate pollution conditions.

PRE showed a nonlinear and irregular dependence pattern across all three pollution levels. Although PRE values were concentrated mainly in the low-value range, the corresponding SHAP values still exhibited considerable vertical variation, without a clear monotonic trend with increasing precipitation. Compared with the other focal variables, PRE did not show consistently clear relationships with the other eight covariates.

For WS, the dependence pattern with its own values was also nonlinear and without a clear monotonic trend. However, relatively clear background relationships were observed with CO and NO₂. In Figures S3, S7 and S11, higher CO and NO₂ values were generally associated with higher WS-SHAP values, and this pattern became more evident under severe pollution conditions.

For T, the dependence pattern with its own values was broad and nonlinear across the three pollution levels. Relatively clear background relationships were observed with CO, NO₂, and especially P. In Figures S4, S8 and S12, higher CO, NO₂, and P values were generally associated with higher T-SHAP values, and this pattern became slightly clearer under severe pollution conditions. Among these covariates, the relationship with P appeared to be the most distinct.

Overall, the dependence plots showed that CO had the clearest self-dependence, whereas PRE, WS, and T mainly exhibited nonlinear and irregular dependence patterns with their own values. At the same time, WS showed relatively clear background relationships with CO and NO₂, and T showed relatively clear background relationships with CO, NO₂, and P.

4. Discussion

4.1. Implications of Model Performance Differences Across Pollution Levels

The contrasting behavior of R², RMSE, MAE, and the normalized error metrics across pollution levels suggests that changes in pollution conditions affected model performance in different ways. Although R² decreased from severe to mild pollution, the relative reduction was modest. This indicates that the model’s ability to capture the spatial variation in PM_2.5 remained comparatively stable across pollution levels, whereas the magnitude of absolute prediction errors was more strongly influenced by pollution intensity.

The higher R² under severe pollution indicates that the spatial variation in PM_2.5 became more structured, allowing the model to capture relative differences among stations more effectively. This may be because severe pollution episodes are more strongly influenced by regional transport, pollutant accumulation, and stable meteorological conditions, which together can produce a more persistent and organized spatial pattern of PM_2.5. By contrast, the lower and broader R² under mild pollution implies that PM_2.5 spatial patterns were less stable or less clearly organized under cleaner conditions.

Meanwhile, the increase in RMSE and MAE from mild to severe pollution indicates that absolute prediction errors became larger as PM_2.5 concentrations increased. However, after normalization by the mean observed PM_2.5 concentration of each daily model, the relative errors showed only minor differences among pollution levels. This suggests that the larger absolute errors under severe pollution were associated at least partly with higher PM_2.5 concentration levels, rather than with a clear decline in relative model performance. Therefore, severe pollution conditions appeared to improve the predictability of relative spatial patterns while simultaneously increasing the magnitude of absolute prediction errors.
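The distinction between absolute and normalized errors can be illustrated with a minimal sketch. The function name `daily_metrics` and the toy station concentrations are assumptions introduced here for illustration and are not values from this study; only the normalization by the mean observed PM_2.5 follows the description in the text.

```python
import math

def daily_metrics(obs, pred):
    """Return RMSE, MAE, R2 and mean-normalized errors (%) for one daily model."""
    n = len(obs)
    mean_obs = sum(obs) / n
    errors = [p - o for o, p in zip(obs, pred)]
    sse = sum(e * e for e in errors)                 # sum of squared errors
    sst = sum((o - mean_obs) ** 2 for o in obs)      # total sum of squares
    rmse = math.sqrt(sse / n)
    mae = sum(abs(e) for e in errors) / n
    return {
        "rmse": rmse,
        "mae": mae,
        "r2": 1.0 - sse / sst,
        "rmse_pct": 100.0 * rmse / mean_obs,         # RMSE% as described in the text
        "mae_pct": 100.0 * mae / mean_obs,           # MAE% as described in the text
    }

# Hypothetical station values (µg/m³) for one mild pollution day
mild_obs = [80.0, 90.0, 100.0, 110.0]
mild_pred = [85.0, 88.0, 96.0, 113.0]

# The same spatial pattern scaled by 2.5 to mimic a severe pollution day
severe_obs = [2.5 * v for v in mild_obs]
severe_pred = [2.5 * v for v in mild_pred]

mild = daily_metrics(mild_obs, mild_pred)
severe = daily_metrics(severe_obs, severe_pred)
print(mild)
print(severe)
```

In this toy case the scaled day has RMSE and MAE values that are 2.5 times larger, while RMSE%, MAE%, and R² are unchanged, which is the sense in which larger absolute errors need not imply weaker relative performance.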

4.2. Dominant Variable Contributions Across Pollution Levels

A clear pattern in Table 2 is that the influence of most variables on model prediction increased with pollution severity. For the majority of predictors, mean absolute SHAP values were higher under moderate pollution than under mild pollution, and became highest under severe pollution. In particular, the four dominant variables identified in Section 3.2—CO, PRE, WS, and T—generally became more influential as pollution intensified. However, an increase in mean absolute SHAP value should be interpreted as a stronger contribution to model prediction, rather than as evidence of a uniformly positive effect on PM_2.5.
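The difference between contribution magnitude and contribution direction can be shown with a short sketch. The SHAP values below are hypothetical numbers chosen for illustration, and the helper names `mean_abs` and `mean_signed` are assumptions rather than part of the study's workflow.

```python
def mean_abs(values):
    """Mean absolute SHAP value: magnitude of contribution, ignoring sign."""
    return sum(abs(v) for v in values) / len(values)

def mean_signed(values):
    """Mean signed SHAP value: net direction of contribution."""
    return sum(values) / len(values)

# Hypothetical per-station SHAP values (µg/m³) of one predictor
shap_mild = [1.0, -1.0, 2.0, -2.0]
shap_severe = [4.0, -4.0, 6.0, -6.0]

for label, vals in (("mild", shap_mild), ("severe", shap_severe)):
    print(label, mean_abs(vals), mean_signed(vals))
```

Here the mean absolute value rises from 1.5 to 5.0 while the signed mean stays at zero, so a larger mean |SHAP| indicates a stronger influence on the prediction in either direction, not a uniformly positive effect.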

Within this overall trend, CO remained the top-ranked variable across all years and pollution levels. PRE, WS, and T consistently remained within the top four, although their internal order varied slightly. In the total three-year results, PRE contributed more than WS under mild and moderate pollution, whereas WS exceeded PRE under severe pollution. This suggests that the relative hierarchy among the dominant meteorological variables changed as pollution intensified. The contrast was particularly evident in Year 2, when PRE dropped to the fourth position under severe pollution, while WS remained among the leading contributors. Overall, Table 2 indicates that increasing pollution severity not only strengthened the influence of the dominant variables, but also reshaped their relative hierarchy, especially among the leading meteorological controls.
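The shift in hierarchy can be checked directly from the "Total 3 years" block of Table 2. The sketch below assumes that the relative |SHAP| of a variable is its mean |SHAP| divided by the sum over all nine variables, an assumption that reproduces the tabulated percentages; the dictionary and function names are introduced here for illustration only.

```python
# Mean |SHAP| values copied from Table 2 ("Total 3 years")
TOTAL = {
    "mild": {"CO": 5.302, "PRE": 2.755, "WS": 1.196, "T": 0.540, "SO2": 0.238,
             "RH": 0.161, "NO2": 0.117, "O3": 0.070, "P": 0.042},
    "moderate": {"CO": 6.960, "PRE": 2.403, "WS": 1.933, "T": 0.798, "SO2": 0.335,
                 "RH": 0.288, "NO2": 0.174, "O3": 0.103, "P": 0.058},
    "severe": {"CO": 9.340, "PRE": 3.020, "WS": 3.429, "T": 1.292, "SO2": 0.495,
               "RH": 0.400, "NO2": 0.248, "O3": 0.154, "P": 0.119},
}

def relative_share(mean_abs_shap):
    """Percentage share of each variable in the summed mean |SHAP| (assumed definition)."""
    total = sum(mean_abs_shap.values())
    return {name: 100.0 * value / total for name, value in mean_abs_shap.items()}

def ranking(mean_abs_shap):
    """Variable names ordered from largest to smallest mean |SHAP|."""
    return sorted(mean_abs_shap, key=mean_abs_shap.get, reverse=True)

for level, values in TOTAL.items():
    share = relative_share(values)
    print(level, ranking(values)[:4], round(share["CO"], 2))
```

The top four variables are ordered CO, PRE, WS, T under mild and moderate pollution and CO, WS, PRE, T under severe pollution, while the CO share stays close to one half in all three cases.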

4.2.1. Stable Dominance of CO Across Years and Pollution Levels

Among all explanatory variables, CO showed the most stable dominance across the three study years and all pollution levels. Its persistent first-place ranking suggests that CO was the strongest and most robust predictor of PM_2.5 in this study. Unlike the other dominant variables, CO also showed a relatively clear dependence pattern in Section 3.3, with higher CO concentrations generally associated with higher CO-SHAP values, making its role easier to interpret in physical terms.

The stable importance of CO may reflect its close association with combustion-related emissions and pollution accumulation. In the BTH region, CO is commonly linked to traffic, residential heating, and other incomplete combustion sources that also contribute directly or indirectly to PM_2.5 formation [27,28]. In addition, CO may serve as an integrated indicator of accumulation-favorable conditions, because both CO and PM_2.5 can accumulate under weak atmospheric dispersion and stagnant meteorological conditions [29,30]. However, the high SHAP importance of CO should be interpreted as a strong predictive association rather than direct causal evidence. Because CO is often co-emitted with PM_2.5 or its precursors and is also influenced by similar accumulation conditions, its contribution may partly reflect co-pollutant conditions and combustion-related accumulation processes, rather than an independent physical driving effect. Therefore, although the dominance of CO was statistically stable and physically meaningful, CO should be interpreted mainly as an indicator of combustion-related pollution accumulation rather than as an isolated physical driver of PM_2.5 variation.

4.2.2. Roles of the Leading Meteorological Regulators: PRE, WS, and T

Among the meteorological variables, PRE, WS, and T consistently ranked within the top four across years and pollution levels, indicating that the leading meteorological controls on PM_2.5 in this study were associated with scavenging, dispersion, and broader thermal background conditions.

PRE mainly reflects the removal effect of precipitation on airborne particles. Rainfall can reduce PM_2.5 concentrations through wet deposition, but its contribution was highly variable across years and pollution levels. This suggests that the role of precipitation was strongly event-dependent and likely controlled by the occurrence, timing, and intensity of rainfall. The sharp reduction in PRE importance under severe pollution in Year 2 may indicate that rainfall was too limited or too infrequent to provide effective scavenging during those heavy-pollution events.

WS mainly represents the role of ventilation and dispersion. Higher wind speed generally promotes pollutant dilution and transport, whereas weak wind conditions favor stagnation and pollutant buildup. Its relatively strong contribution, especially under severe pollution, indicates that dispersion-related conditions became increasingly important in differentiating PM_2.5 levels when pollution was already intense. The fact that WS generally exceeded PRE under severe pollution suggests that, during intense pollution episodes, weakly ventilated and stagnant atmospheric conditions may have played a more important role than episodic wet scavenging. Even when precipitation occurred, its removal effect may have been insufficient to offset the stronger control exerted by poor dispersion.

Compared with PRE and WS, T likely reflects a more complex background influence. Temperature may partly act as a proxy for broader seasonal and meteorological conditions associated with PM_2.5 pollution. For example, lower temperatures are often associated with more stable atmospheric conditions and, at the same time, with stronger residential heating demand, both of which may favor PM_2.5 accumulation. Thus, the role of T in this study may reflect combined thermal, seasonal, and emission-related backgrounds rather than a simple one-way temperature effect alone.

4.3. Physical Interpretation of the Dependence Patterns of the Dominant Variables

The dependence patterns of the dominant variables provide further insight into how they contributed to PM_2.5 prediction under different pollution and meteorological backgrounds. Among the four dominant variables, CO showed the clearest and most stable self-dependence. CO-SHAP generally increased with increasing CO concentration across mild, moderate, and severe pollution conditions, indicating that the value of CO itself had a direct influence on its SHAP response. In addition to this self-dependence, CO-SHAP also showed relatively clear background relationships with RH and NO₂, and a weaker relationship with SO₂. Physically, higher RH likely reflects a more accumulation-favorable atmospheric background, while higher NO₂ and SO₂ indicate stronger coexisting combustion-related pollutant loads. Under such conditions, the contribution of CO to PM_2.5 prediction became more pronounced.

By contrast, WS did not show a clear monotonic relationship with its own values. Instead, its SHAP values varied more clearly with CO and NO₂ backgrounds, especially under severe pollution conditions. This suggests that the effect of wind speed became more important when the surrounding pollution burden was already high. In other words, under heavily polluted conditions, ventilation and dispersion played a larger role in determining PM_2.5 levels.

A similar pattern was observed for T. T-SHAP did not show a clear monotonic relationship with temperature itself, but it varied more clearly under different CO, NO₂, and especially P backgrounds. This suggests that temperature did not act mainly as an isolated direct driver. Instead, T likely reflected a broader seasonal and meteorological background. In particular, lower temperatures may be associated with wintertime conditions characterized by higher combustion-related CO and NO₂ levels and weaker atmospheric dispersion under higher-pressure conditions. Therefore, T is better interpreted as a proxy for broader pollution and meteorological environments than as a simple one-way temperature effect.

Compared with the other dominant variables, PRE showed the least regular dependence pattern. Although PRE values were mostly concentrated in the low range, PRE-SHAP still varied substantially, without a clear monotonic relationship with precipitation itself or with the backgrounds of the other covariates. This suggests that the contribution of PRE was highly event-dependent. In physical terms, precipitation does not act as a continuously operating background factor, but as an episodic regulator whose effect depends on whether rainfall occurs and on its timing, duration, and intensity relative to the pollution episode.

Overall, the dependence plots suggest that the dominant variables contributed to PM_2.5 prediction in different ways. CO showed the clearest self-dependence, whereas WS and T were more background-dependent and PRE behaved more as an episodic nonlinear regulator. Together, these patterns indicate that PM_2.5 prediction in the BTH region depended not only on the dominant variables themselves, but also on the broader covariate backgrounds in which they operated.
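One way to quantify how monotonic a self-dependence pattern is, offered here as an added illustration rather than as the procedure used in this study (which read the patterns from dependence plots), is the Spearman rank correlation between a feature and its own SHAP values. The data below are hypothetical, and all function names are assumptions.

```python
def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical (feature value, SHAP value) pairs
co = [0.4, 0.6, 0.9, 1.2, 1.8]
co_shap = [-3.0, -1.0, 0.5, 2.0, 6.0]   # rises steadily with CO
ws = [1.0, 1.5, 2.0, 3.0, 4.0]
ws_shap = [0.5, -1.0, 1.5, -0.5, 0.2]   # no consistent direction

print(spearman(co, co_shap), spearman(ws, ws_shap))
```

A coefficient near 1 corresponds to the clear self-dependence described for CO, whereas a coefficient near 0 corresponds to the irregular patterns described for WS, T, and PRE. A weak rank correlation does not rule out a non-monotonic relationship, so such a check complements the dependence plots rather than replacing them.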

4.4. Spatial Implications of Pollution Days in BTH

The spatial statistics of pollution-level days showed that the highest numbers of PM_2.5 pollution days were concentrated in Handan, Baoding, and Shijiazhuang, followed by Tianjin and Beijing (Figure 3 and Section 2.4). This pattern suggests that the occurrence of pollution days in the BTH region was shaped by the combined influence of anthropogenic emissions, topographic constraints, meteorological conditions, and regional transport. At the regional scale, the BTH area is characterized by the Taihang Mountains to the west and the Yanshan Mountains to the north, a topographic setting that can weaken atmospheric ventilation and favor pollutant accumulation under stagnant weather conditions. This regional background helps explain why several core BTH cities consistently ranked high in the frequency and severity of PM_2.5 pollution days.

For Shijiazhuang, Tianjin, and Beijing, their high ranking can generally be understood in the context of their roles as major metropolitan centers in the BTH region, where dense population, intensive traffic activity, high energy consumption, and concentrated urban functions contribute to strong anthropogenic emission backgrounds. The ranking order among these cities, from Shijiazhuang to Tianjin and then to Beijing, may further reflect differences in the strength and duration of air pollution control efforts.

By contrast, the high ranking of Handan and Baoding may reflect more city-specific disadvantages beyond the common metropolitan background. In Handan, the frequent pollution days may be more closely associated with its historically heavy industrial structure and combustion-related emissions, which is broadly consistent with the dominant role of CO identified in this study. In Baoding, the concentration of pollution days may be more strongly linked to unfavorable dispersion conditions and its position within the regional pollution transport pathway.

4.5. Scale-Dependent Interpretation of Meteorological Roles

In this study, each air quality monitoring station was assigned meteorological variables from the nearest meteorological station. This approach has the advantage of using direct observational meteorological records and avoids additional uncertainty introduced by meteorological interpolation or model-based reconstruction. However, because meteorological stations are fewer than air quality stations, several nearby air quality stations, especially within the same city or densely populated urban area, may share the same meteorological record.

As a result, local-scale meteorological differences among these air quality stations cannot be fully represented. For example, several air quality stations may have identical wind speed, temperature, pressure, or precipitation inputs, while their PM_2.5 concentrations differ due to local emissions, traffic intensity, urban structure, or other microscale factors. Therefore, the meteorological roles identified in this study should be interpreted mainly at the city or regional scale, rather than at the microscale within urban areas.

This limitation may smooth local meteorological heterogeneity and reduce the model’s ability to distinguish microscale meteorological effects on PM_2.5 spatial variability. Nevertheless, the matched meteorological data still provide meaningful city-scale and regional-scale meteorological background information.
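The nearest-station matching and its consequence can be sketched as follows. The great-circle (haversine) distance, the station identifiers, and the coordinates are all assumptions made for illustration; the study does not state which distance measure was used.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def nearest_station(aq_sites, met_sites):
    """Map each air quality site id to the id of its nearest meteorological site."""
    return {
        aq_id: min(met_sites, key=lambda m: haversine_km(lat, lon, *met_sites[m]))
        for aq_id, (lat, lon) in aq_sites.items()
    }

# Hypothetical station coordinates as (latitude, longitude)
aq_sites = {"AQ_1": (39.90, 116.40), "AQ_2": (39.95, 116.35), "AQ_3": (38.05, 114.50)}
met_sites = {"MET_A": (39.93, 116.28), "MET_B": (38.03, 114.42)}

matches = nearest_station(aq_sites, met_sites)
print(matches)
```

In this example AQ_1 and AQ_2 are both matched to MET_A and therefore receive identical meteorological inputs, which is why differences in PM_2.5 between such stations cannot be attributed to microscale meteorology in this framework.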

5. Conclusions

This study investigated the meteorological and gaseous pollutant drivers of PM_2.5 across mild, moderate, and severe pollution-day scenarios in the Beijing–Tianjin–Hebei (BTH) region from 1 November 2021 to 31 October 2024 using the CatBoost–SHAP framework. The key findings are as follows.

(1) The CatBoost model showed good station-level PM_2.5 spatial prediction ability across pollution days. Severe pollution cases had the highest R², indicating a clearer spatial structure of PM_2.5, while the increase in RMSE and MAE was mainly associated with the higher concentration scale and stronger spatial heterogeneity during severe pollution days, as RMSE% and MAE% differed only slightly among pollution levels.

(2) The SHAP results showed a highly concentrated contribution structure. Across all years and pollution levels, CO, PRE, WS, and T accounted for the vast majority of the total SHAP contribution. CO remained the most dominant and stable predictor, which likely reflects combustion-related emissions and pollutant accumulation processes; it should be interpreted as a robust predictive indicator rather than as direct causal evidence.

(3) The importance of most variables increased with pollution severity. In particular, PRE contributed more than WS under mild and moderate pollution, whereas WS exceeded PRE under severe pollution, indicating a shift in the relative roles of scavenging and dispersion as pollution intensified.

(4) The dependence plots showed different contribution patterns among the dominant variables. CO showed the clearest self-dependence, WS and T were more background-dependent, and PRE behaved more as an episodic nonlinear regulator.

(5) PM_2.5 pollution days in the BTH region showed clear temporal and spatial heterogeneity. Pollution days occurred mainly in winter and spring, and were most concentrated in Handan, Baoding, and Shijiazhuang, followed by Tianjin and Beijing.

(6) The meteorological effects identified in this study should be interpreted mainly at the city or regional scale rather than the microscale.

Overall, these findings indicate that mild, moderate, and severe PM_2.5 pollution-day scenarios in the BTH region were jointly shaped by combustion-related accumulation, dominant meteorological regulators, and regional spatial–temporal heterogeneity. Because this study focused only on selected pollution days rather than regular, clean, or low-level PM_2.5 conditions, the results should be interpreted as pollution-episode-specific and should not be directly generalized to the full PM_2.5 concentration range.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/su18115611/s1: Figures S1–S12.

Author Contributions

Conceptualization, L.Z.; methodology, L.Z.; investigation, L.Z.; resources, L.Z.; writing—original draft, L.Z.; writing—review and editing, L.Z.; supervision, L.Z.; validation, L.Z.; funding acquisition, L.Z. and L.J.; data curation, D.S.; formal analysis, D.S., D.X. and L.J.; visualization, D.S. and D.X. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the National Science and Technology Major Project for Deep Earth (No. 2025ZD1008103) and the Scientific Program of Kashi, Xinjiang (KS2024010).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are available online at http://eia-data.com/ (accessed on 4 December 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Han, C.; Xu, R.; Ye, T.; Xie, Y.; Zhao, Y.; Liu, H.; Yu, W.; Zhang, Y.; Li, S.; Zhang, Z.; et al. Mortality burden due to long-term exposure to ambient PM2.5 above the new WHO air quality guideline based on 296 cities in China. Environ. Res. 2022, 213, 113965.
2. Krittanawong, C.; Qadeer, Y.K.; Hayes, R.B.; Wang, Z.; Thurston, G.D.; Virani, S.; Lavie, C.J. PM2.5 and cardiovascular diseases: State-of-the-Art review. Int. J. Cardiol. Cardiovasc. Risk Prev. 2023, 19, 200217.
3. Yang, Y.; Ren, X.; Liao, H.; Wang, H. Winter particulate pollution severity in North China driven by atmospheric teleconnections. Nat. Geosci. 2022, 15, 349–355.
4. Peng, Y.; Wang, H.; Zhang, X.; Liu, Z.; Zhang, W.; Li, S.; Han, C.; Che, H. Superimposed effects of typical local circulations driven by mountainous topography and aerosol-radiation interaction on heavy haze in the Beijing-Tianjin-Hebei central and southern plains in winter. Atmos. Chem. Phys. 2022, 23, 8325–8339.
5. Beijing Municipal Ecology and Environment Bureau. PM2.5 Concentration in Beijing-Tianjin-Hebei Region Decreases by 3.4 Percent Year-on-Year in 2024. Beijing Government; 3 March 2025. Available online: https://english.beijing.gov.cn/latest/news/202503/t20250303_4023243.html (accessed on 3 March 2026).
6. Li, Q.; Li, X.; Li, H. Factors influencing PM2.5 concentrations in the Beijing–Tianjin–Hebei urban agglomeration using a geographical and temporal weighted regression model. Atmosphere 2022, 13, 407.
7. Xu, R.; Zhang, H.-D.; Yang, X.-W.; Cheng, S.-Y.; Zhang, T.-H.; Jiang, Q. Concentration characteristics of PM2.5 and the causes of heavy air pollution events in Beijing during autumn and winter. Huan Jing Ke Xue 2019, 40, 3405–3414. (In Chinese)
8. Qiang, Y.; Wang, C.; Wang, X.; Cheng, S. Analysis of PM2.5 transport characteristics and continuous improvement in high-emission-load areas of the Beijing–Tianjin–Hebei region in winter. Sustainability 2025, 17, 6389.
9. Zhang, H.; Guo, W.; Wang, R.; Wang, X.; Shan, W.; Yao, Z. Impacts of meteorology and precursor emission change on PM2.5 and O3 and identification of synergistic emission reduction pathway: A case of combined pollution event in Beijing, China. Environ. Pollut. 2025, 368, 125704.
10. Yang, X.; Xiao, D.; Bai, H.; Tang, J.; Wang, W. Spatiotemporal distributions of PM2.5 concentrations in the Beijing–Tianjin–Hebei region from 2013 to 2020. Front. Environ. Sci. 2022, 10, 842237.
11. Wang, L.; Xiong, Q.; Wu, G.; Gautam, A.; Jiang, J.; Liu, S.; Zhao, W.; Guan, H. Spatio-temporal variation characteristics of PM2.5 in the Beijing–Tianjin–Hebei Region, China, from 2013 to 2018. Int. J. Environ. Res. Public Health 2019, 16, 4276.
12. Huang, T.; Yu, Y.; Wei, Y.; Wang, H.; Huang, W.; Chen, X. Spatial–seasonal characteristics and critical impact factors of PM2.5 concentration in the Beijing–Tianjin–Hebei urban agglomeration. PLoS ONE 2018, 13, e0201364.
13. Zhang, W.; Hai, S.; Zhao, Y.; Sheng, L.; Zhou, Y.; Wang, W.; Li, W. Numerical modeling of regional transport of PM2.5 during a severe pollution event in the Beijing–Tianjin–Hebei region in November 2015. Atmos. Environ. 2021, 254, 118393.
14. Liu, Y.; Zheng, Y.; Geng, G.; Cao, J.; Chen, C.; Hu, H.; Wang, X.; Wen, Z.; Feng, Y.; Lei, Y.; et al. Evolving regional transport of PM2.5 in Beijing–Tianjin–Hebei and its surrounding areas from 2013 to 2020. Environ. Res. Lett. 2025, 20, 124049.
15. Udristioiu, M.T.; Mghouchi, Y.E.; Yildizhan, H. Prediction, modelling, and forecasting of PM and AQI using hybrid machine learning. J. Clean. Prod. 2023, 421, 138496.
16. Zeng, L.; Dong, R.; Yuan, M.; Jing, L.; Jiao, S. Evaluating deep learning time series models for PM2.5 forecasting across diverse horizons. iScience 2026, 29, 114770.
17. Hu, B.; Zeng, L.; Fan, H. Comparative Study of Four Hybrid Spatiotemporal Models for Daily PM2.5 Prediction in the Chengdu–Chongqing Region. Sustainability 2026, 18, 3126.
18. Hu, B.; Zeng, L.; Fan, H. Interpretable data-driven ozone prediction using statistical diagnostics, XGBoost, SHAP and temporal fusion transformers. Sustainability 2026, 18, 1009.
19. Liu, S.; Wang, G.; Kong, F.; Zhao, N.; Gao, W.; Zhang, H. PM2.5 pollution characteristics, drivers, and regional transport during different pollution levels in Linyi, China: An integrated PMF-ML-SHAP framework and transport models. J. Hazard. Mater. 2025, 494, 138534.
20. Stirnberg, R.; Cermak, J.; Kotthaus, S.; Haeffelin, M.; Andersen, H.; Fuchs, J.; Kim, M.; Petit, J.-E.; Favez, O. Meteorology-driven variability of air pollution (PM1) revealed with explainable machine learning. Atmos. Chem. Phys. 2021, 21, 3919–3948.
21. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. Adv. Neural Inf. Process. Syst. 2018, 18, 31.
22. Lundberg, S.M.; Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 17, 30.
23. Lundberg, S.M.; Erion, G.G.; Lee, S.-I. Consistent individualized feature attribution for tree ensembles. arXiv 2018, arXiv:1802.03888.
24. GB 3095-2012; Ambient Air Quality Standards. Ministry of Environmental Protection of the People’s Republic of China: Beijing, China, 2012. Available online: https://www.mee.gov.cn/ywgz/fgbz/bz/bzwb/dqhjbh/dqhjzlbz/201203/t20120302_224165.shtml (accessed on 14 January 2026).
25. Li, Y.; Zhou, L.; Liu, H.; Liu, S.; Feng, M.; Song, D.; Tan, Q.; Jiang, H.; Zuoqiu, S.; Yang, F. Disparities in precipitation effects on PM2.5 mass concentrations and chemical compositions: Insights from online monitoring data in Chengdu. J. Environ. Sci. 2025, 156, 421–434.
26. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
27. Panagi, M.; Fleming, Z.L.; Monks, P.S.; Ashfold, M.J.; Wild, O.; Hollaway, M.; Zhang, Q.; Squires, F.A.; Hey, J.D.V. Investigating the regional contributions to air pollution in Beijing: A dispersion modelling study using CO as a tracer. Atmos. Chem. Phys. 2020, 20, 2825–2838.
28. Zhang, R.; Chen, C.; Liu, S.; Wu, H.; Zhou, W.; Li, P. Emission inventory of air pollutants from residential coal combustion over the Beijing-Tianjin-Hebei Region in 2020. Air Qual. Atmos. Health 2023, 16, 1823–1832.
29. Yan, F.; Su, H.; Cheng, Y.; Huang, R.; Liao, H.; Yang, T.; Zhu, Y.; Zhang, S.; Sheng, L.; Kou, W.; et al. Frequent haze events associated with transport and stagnation over the corridor between the North China Plain and Yangtze River Delta. Atmos. Chem. Phys. 2024, 24, 2365–2376.
30. Wang, S.; Gao, J.; Guo, L.; Nie, X.; Xiao, X. Meteorological influences on spatiotemporal variation of PM2.5 concentrations in atmospheric pollution transmission channel cities of the Beijing–Tianjin–Hebei region, China. Int. J. Environ. Res. Public Health 2022, 19, 1607.

Figure 1. Study area and air quality monitoring stations.

Figure 2. Temporal distribution of mild, moderate, and severe PM_2.5 pollution days by month and season in the BTH region during 1 November 2021–31 October 2024.

Figure 3. Spatial distribution of mild, moderate, and severe PM_2.5 pollution days in the BTH region during 1 November 2021–31 October 2024.

Figure 4. Boxplots of CatBoost model performance metrics (RMSE, MAE, and R²) under different PM_2.5 pollution levels for Year 1, Year 2, Year 3, and the total three years.

Table 1. Descriptive statistics of the study variables.

Index	Unit	Mean	Min~Max	Standard Deviation (SD)	Coefficient of Variation (CV)
PM_2.5	μg/m³	38.93	1.0~356.0	32.78	84%
SO₂	μg/m³	6.49	1.0~95.0	4.06	63%
NO₂	μg/m³	28.76	1.0~139.0	17.35	60%
O₃	μg/m³	102.21	1.0~331.0	53.17	52%
CO	mg/m³	0.64	0.1~7.05	0.33	51%
T	°C	13.38	−20.72~35.78	11.8	88%
P	hPa	992.05	908.2~1044.6	28.65	3%
RH	%	53.2	9.18~100.0	19.07	36%
PRE	mm	7.62	0.0094~270.76	15.11	198%
WS	m/s	2.39	0.15~7.41	0.96	40%

T: Temperature; P: Pressure; PRE: Precipitation; WS: Wind speed; RH: Relative humidity.

Table 2. Mean absolute SHAP ranking and sign frequency across pollution levels.

Variable	Mild			Moderate			Severe
Variable	Rank	Mean |SHAP|	Relative |SHAP|	Rank	Mean |SHAP|	Relative |SHAP|	Rank	Mean |SHAP|	Relative |SHAP|
Year 1
CO	1	3.995	49.70%	1	7.111	51.38%	1	8.46	38.54%
PRE	2	1.882	23.41%	2	3.178	22.96%	2	7.314	33.32%
WS	3	1.126	14.01%	3	1.922	13.89%	3	3.694	16.83%
T	4	0.48	5.97%	4	0.732	5.29%	4	1.195	5.44%
SO₂	5	0.214	2.66%	5	0.313	2.26%	5	0.481	2.19%
RH	6	0.134	1.67%	6	0.276	1.99%	6	0.288	1.31%
NO₂	7	0.105	1.31%	7	0.16	1.16%	7	0.246	1.12%
O₃	8	0.065	0.81%	8	0.1	0.72%	8	0.159	0.72%
P	9	0.038	0.47%	9	0.047	0.34%	9	0.113	0.51%
Year 2
CO	1	5.983	61.87%	1	5.963	47.70%	1	10.232	60.16%
PRE	3	1.174	12.14%	2	2.761	22.09%	4	0.813	4.78%
WS	2	1.348	13.94%	3	2.157	17.26%	2	3.262	19.18%
T	4	0.524	5.42%	4	0.819	6.55%	3	1.201	7.06%
SO₂	5	0.205	2.12%	5	0.273	2.18%	5	0.558	3.28%
RH	6	0.182	1.88%	6	0.202	1.62%	6	0.436	2.56%
NO₂	7	0.13	1.34%	7	0.173	1.38%	7	0.246	1.45%
O₃	8	0.086	0.89%	8	0.096	0.77%	8	0.144	0.85%
P	9	0.039	0.40%	9	0.056	0.45%	9	0.116	0.68%
Year 3
CO	1	6.218	46.15%	1	7.845	61.64%	1	8.87	50.23%
PRE	2	4.783	35.50%	3	1.126	8.85%	3	2.449	13.87%
WS	3	1.164	8.64%	2	1.708	13.42%	2	3.435	19.45%
T	4	0.615	4.56%	4	0.852	6.69%	4	1.494	8.46%
SO₂	5	0.287	2.13%	5	0.424	3.33%	6	0.423	2.40%
RH	6	0.174	1.29%	6	0.393	3.09%	5	0.442	2.50%
NO₂	7	0.12	0.89%	7	0.192	1.51%	7	0.252	1.43%
O₃	8	0.063	0.47%	8	0.114	0.90%	8	0.165	0.93%
P	9	0.049	0.36%	9	0.073	0.57%	9	0.128	0.72%
Total 3 years
CO	1	5.302	50.88%	1	6.96	53.33%	1	9.34	50.49%
PRE	2	2.755	26.44%	2	2.403	18.41%	3	3.02	16.33%
WS	3	1.196	11.48%	3	1.933	14.81%	2	3.429	18.54%
T	4	0.54	5.18%	4	0.798	6.11%	4	1.292	6.98%
SO₂	5	0.238	2.28%	5	0.335	2.57%	5	0.495	2.68%
RH	6	0.161	1.54%	6	0.288	2.21%	6	0.4	2.16%
NO₂	7	0.117	1.12%	7	0.174	1.33%	7	0.248	1.34%
O₃	8	0.07	0.67%	8	0.103	0.79%	8	0.154	0.83%
P	9	0.042	0.40%	9	0.058	0.44%	9	0.119	0.64%

