Precipitation Prediction and Factor Interpretation at Maqu Station in the Eastern Qinghai-Tibet Plateau Based on XGBoost-SHAP

Zhao, Dandan; Zhang, Shaoqing; Liu, Guangjing; Pan, Xiaole; Wang, Tianyi; Ding, Huiyu; Sang, Wenjun; Ma, Yongjing

doi:10.3390/w18111355

Open Access Article


by Dandan Zhao ¹,²,*, Shaoqing Zhang ¹, Guangjing Liu ³, Xiaole Pan ², Tianyi Wang ⁴, Huiyu Ding ¹, Wenjun Sang ¹ and Yongjing Ma ²

¹ College of Aviation Meteorology, Civil Aviation Flight University of China, Chengdu 618307, China

² State Key Laboratory of Atmospheric Boundary Layer Physics and Atmospheric Chemistry (LAPC), Institute of Atmospheric Physics, Chinese Academy of Sciences, Beijing 100029, China

³ Heavy Rain and Drought-Flood Disasters in Plateau and Basin Key Laboratory of Sichuan Province, Institute of Tibetan Plateau Meteorology, China Meteorological Administration, Chengdu 610072, China

⁴ Meteorological Service Center of Hubei Province, Wuhan 430074, China

* Author to whom correspondence should be addressed.

Water 2026, 18(11), 1355; https://doi.org/10.3390/w18111355

Submission received: 15 April 2026 / Revised: 26 May 2026 / Accepted: 1 June 2026 / Published: 3 June 2026

(This article belongs to the Special Issue Application of Big Data and Machine Learning in Hydrological Forecasting and Water Resource Management)


Abstract

Accurate precipitation forecasting on the Qinghai-Tibet Plateau (QTP) remains a significant challenge due to complex terrain and nonlinear atmospheric dynamics. This study evaluates an XGBoost-SHAP framework for 24 h precipitation forecasting at Maqu Station, leveraging multi-source observations from 2020 to 2022. Vertical profile analyses via microwave radiometer (MWR) indicate that moisture is predominantly confined to altitudes below 4 km (AGL), with Integrated Water Vapor (IWV) and Liquid Water Path (LWP) typically varying between 0–15 mm and 0–2.5 mm, respectively. The optimized XGBoost model achieves an annual R² of 0.872 and a Root Mean Square Error (RMSE) of 1.609 mm, showing improved statistical consistency compared with standard Random Forest baselines. While the framework maintains robust performance for winter stratiform precipitation (RMSE = 0.32 mm), predictive variance increases during summer convective periods (RMSE = 3.26 mm). SHAP diagnostic analysis identifies Dew Point Temperature (DPT) as a consistent year-round predictor. Feature sensitivity analysis further reveals shifting seasonal driving mechanisms: spring precipitation appears sensitive to mid-tropospheric geopotential height, whereas summer forecasts are more strongly modulated by 500 hPa specific humidity and lower-level water vapor density. Overall, the XGBoost-SHAP framework serves as a transparent and physically plausible diagnostic tool for examining seasonal moisture–dynamic coupling. While these site-specific results are encouraging, they represent a localized empirical baseline; further cross-site validation is required to assess regional generalizability.

Keywords:

eastern Qinghai-Tibet Plateau; precipitation prediction; XGBoost-SHAP method; microwave radiometer

1. Introduction

The Qinghai-Tibet Plateau, with an average elevation of over 4000 m and known as the “water tower of Asia,” plays a crucial role in the global climate system and regional water cycle [1,2]. Precipitation on the Qinghai-Tibet Plateau (QTP) serves as a critical component of the regional hydrological cycle, providing essential water replenishment for over ten major Asian rivers and sustaining local ecosystems. Beyond its regional hydrologic significance, plateau precipitation exerts a substantial influence on the global energy balance through complex land-atmosphere thermodynamic feedback mechanisms [3]. Against the backdrop of global warming, the Qinghai-Tibet Plateau is profoundly impacted by climate change, characterized by significant shifts in precipitation patterns [4,5]. Numerous studies indicate that precipitation in recent years has exhibited distinct spatiotemporal variability [6]. For instance, some regions have experienced significant increases in precipitation [7], while others have seen decreases; additionally, interannual and decadal variations in precipitation have become more complex [8]. These changes in precipitation patterns have profound implications for ecosystems [9], water resource management [10], agricultural production [11], and socio-economic development in the plateau and its surrounding regions. For example, increased precipitation may accelerate glacial melting, triggering geological disasters such as floods and landslides; conversely, reduced precipitation may exacerbate droughts, thereby affecting vegetation growth and livestock development [12,13]. However, accurately predicting precipitation on the Qinghai-Tibet Plateau has long been a major challenge in atmospheric science [14]. Due to the plateau’s harsh environment, meteorological observation networks remain spatially sparse, limiting the acquisition of high-resolution spatiotemporal precipitation data. 
Furthermore, the interplay between complex topography and heterogeneous thermal forcing induces highly nonlinear atmospheric dynamics, which significantly complicates the accurate characterization of local precipitation processes [15]. Consequently, traditional numerical weather models struggle with precipitation forecasting in this region [16], largely because existing universal parametrization schemes—originally designed for low-altitude terrains—lack the capacity to accurately represent the diurnal evolution of convective precipitation under high-altitude conditions [17]. Recent investigations emphasize that standard empirical relationships and conventional microphysics schemes—often parameterized for low-altitude plains—frequently fail to represent the unique vertical profiles of raindrop size distribution (DSD) over the eastern Qinghai-Tibet Plateau. This discrepancy arises primarily because atmospheric low-density conditions and intense terrain-induced forcing significantly modulate DSD evolution, processes that are not adequately accounted for in current high-altitude precipitation models [18].

In recent years, the rapid development of machine learning algorithms has provided new approaches for precipitation prediction [19,20]. Among these algorithms, XGBoost has emerged as a leading method in hydrometeorological prediction due to its high efficiency [21,22,23], strong robustness, and ability to handle high-dimensional data [24]. However, existing machine learning applications in this region often suffer from limited spatial adaptability. This deficiency arises primarily because these models struggle to integrate multi-source heterogeneous datasets or adequately capture the highly nonlinear atmospheric couplings across distinct altitude zones [25]. Moreover, current research focuses predominantly on low-altitude regions, leaving high-altitude areas under-represented, and studies of short-term precipitation forecasting over the Tibetan Plateau remain scarce [26]. In addition, machine learning models are often criticized as “black-box” models because their predictions lack physical interpretability, which limits their application in operational practice. To address this issue, explainable artificial intelligence (XAI) technologies have emerged, such as the Shapley Additive Explanations (SHAP) method [27]. This method quantifies the contribution of each input feature to the model’s prediction from a game-theoretic perspective, providing a powerful tool for analyzing the intrinsic relationship between meteorological elements and precipitation processes [28,29]. Given the complex cryospheric-atmospheric interactions and pronounced climatic heterogeneity across the plateau, uninterpretable approaches constrain both the diagnosis of parameter sensitivity and the transparency of decision-making, making targeted interpretability research an urgent priority for high-altitude modeling [30].

This study focuses on Maqu Station, situated in the eastern Qinghai-Tibet Plateau (QTP), utilizing an integrated dataset of ground observations, microwave radiometer (MWR) measurements, and ERA5 reanalysis. We implement an XGBoost-SHAP framework to evaluate short-term precipitation forecasting and enhance model interpretability. By integrating MWR-derived vertical profiles, this research addresses the limitations of conventional models in capturing the plateau’s complex, nonlinear atmospheric dynamics. Rather than pursuing a generalized solution, this study provides a site-specific empirical analysis of the drivers of precipitation and their seasonal variability. By elucidating the model’s decision-making process, this work aims to contribute a robust diagnostic baseline for local forecasting and improve the mechanistic understanding of moisture–dynamic coupling in high-altitude environments.

2. Data and Methods

2.1. Data

A three-year water vapor sounding campaign was conducted at Maqu Station (100.75° E, 33.76° N, red dot in Figure 1) in the eastern Tibetan Plateau from January 2020 to December 2022. Vertical profiles of atmospheric parameters (e.g., temperature, water vapor density, and liquid water density) were obtained by a microwave radiometer (MWR, MP3000A, XIWU Company, Chengdu, China). The MWR operated in a continuous observation mode with a temporal resolution of 1 s, providing temperature and humidity profile data from the ground to an altitude of 10 km. The vertical resolution of the profile data was 50 m for the 0–500 m layer, 100 m for the 500–2000 m layer, and 250 m for the 2000–10,000 m layer. In addition, three years (2020–2022) of hourly surface meteorological observations were collected at Maqu Station, including temperature, surface temperature, relative humidity, surface atmospheric pressure, wind speed, wind direction, dew point temperature, and precipitation.

This study employed daily post-processed single-level statistics and hourly single-layer reanalysis data from the ERA5 dataset, developed by the European Centre for Medium-Range Weather Forecasts (ECMWF). The data feature a spatial resolution of 0.25° × 0.25°. To match the spatial location of the study site, we extracted the specific grid cell centered at 100.75° E, 33.76° N, which represents the spatial extent covering the Maqu Station. All data, available at https://cds.climate.copernicus.eu/, cover the period from 2020 to 2022 and include single-level near-surface variables as well as isobaric variables at 500 hPa and 700 hPa. A detailed list of the selected variables is summarized in Table 1.

2.2. Methods

2.2.1. Feature Selection

Given the massive volume of multi-source meteorological data (e.g., ground observation data, microwave radiometer measurements, and ERA5 reanalysis data) involved in this study, coupled with the high dimensionality and complexity of the derived feature variables (e.g., atmospheric temperature and water vapor density), direct utilization of all original features in model training would not only introduce substantial computational overhead—prolonging the model training cycle—but also risk including redundant or irrelevant features. Such issues could interfere with the model’s ability to capture the intrinsic relationships between key meteorological factors and precipitation processes, thereby compromising prediction accuracy [31]. Therefore, a rigorous feature selection step is essential to optimize the input feature set prior to model construction.

To guarantee the consistency and validity of multi-source data for subsequent correlation analysis, this study first implemented a standardized data integration workflow [32]. Temporal alignment was strictly enforced by unifying all datasets to Coordinated Universal Time (UTC). This ensured that variables from ground observations, microwave radiometer profiles, and ERA5 reanalysis were accurately retrieved and matched at identical time steps. In addition, spatial matching was conducted to resolve discrepancies across datasets. Given that Maqu Station (100.75° E, 33.76° N) is situated in a region characterized by relatively flat alpine meadows and pastoral basins, the gridded ERA5 data exhibit a high degree of spatial representativeness. The ERA5 variables were spatially interpolated to the station’s exact location using the bilinear interpolation method. This approach accounts for the surrounding grid environment and provides a more consistent spatial reference than the nearest-neighbor method, ensuring all integrated data are synchronized to the same geographic point and temporal instance.

Following the spatiotemporal alignment of multi-source data, an integrated feature evaluation framework was implemented. Initially, a pairwise Pearson correlation analysis was conducted to quantify the linear relationships between candidate features and the target variable (precipitation), as well as to assess potential inter-feature redundancies. The Pearson correlation coefficient $\gamma_{XY}$ was calculated as:

$$\gamma_{XY} = \frac{\sum_{i=1}^{n} (X_{i} - \bar{X})(Y_{i} - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_{i} - \bar{X})^{2} \sum_{i=1}^{n} (Y_{i} - \bar{Y})^{2}}} \tag{1}$$

where $X_{i}$ and $Y_{i}$ are the $i$-th observations of variables $X$ and $Y$, $\bar{X}$ and $\bar{Y}$ are their respective sample means, and $n$ is the total number of observations. This preliminary screening served to eliminate variables with negligible linear relevance (correlation coefficient below 0.2), thereby reducing the computational dimensionality. The threshold of 0.2 was selected based on the statistical classification by Evans [33], which defines correlation coefficients between 0.20 and 0.39 as a meaningful, albeit weak, correlation. This criterion ensures that the model retains predictors with potential influence on precipitation while filtering out irrelevant noise, a common practice in meteorological feature selection to balance model complexity and predictive power [34]. To further address potential nonlinear dependencies and multicollinearity that Pearson analysis might overlook, the intrinsic feature selection mechanism of the XGBoost model and the SHAP (SHapley Additive exPlanations) framework were subsequently employed. This multi-stage approach ensures that the selected predictors are both physically representative and statistically robust for precipitation forecasting [35].
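The Pearson screening step can be illustrated with a minimal sketch. It assumes the 0.2 cut-off is applied to the absolute value of the coefficient, which the text does not state explicitly. The feature names and toy values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no linear information
    return cov / math.sqrt(var_x * var_y)

def screen_features(features, target, threshold=0.2):
    """Keep features whose |correlation| with the target reaches the threshold.

    Using the absolute value is an assumption; the text only gives the 0.2 cut-off.
    """
    scores = {name: pearson(values, target) for name, values in features.items()}
    kept = [name for name, r in scores.items() if abs(r) >= threshold]
    return kept, scores

# Hypothetical toy data: names and values are illustrative only.
precip = [0.0, 1.0, 2.0, 3.0, 4.0]
candidates = {
    "dpt": [1.0, 2.0, 3.0, 4.0, 5.0],      # rises linearly with precip
    "noise": [1.0, -1.0, 0.0, -1.0, 1.0],  # no linear relation to precip
}
kept, scores = screen_features(candidates, precip)
```

In this toy case only the linearly related series survives the screening; a feature with a purely nonlinear link to precipitation would also be dropped, which is why the text follows this step with model-based selection.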

2.2.2. XGBoost

XGBoost (Extreme Gradient Boosting) is an efficient and widely used machine learning algorithm, which is fundamentally built on the gradient boosting framework [36]. Its core principle involves iteratively training a sequence of weak learners (typically decision trees): in each iteration, the newly added weak learner focuses on fitting the residuals between the predictions of the existing ensemble model and the true labels (representing the observed 24 h precipitation values at Maqu Station), thereby continuously optimizing the model’s predictive accuracy. This incremental improvement enables XGBoost to capture complex nonlinear relationships in data, endowing it with exceptionally strong fitting capabilities [37,38,39].

The objective function of XGBoost integrates a loss function (to measure prediction error) and regularization terms (to prevent overfitting), expressed as follows [36]:

$$L(\phi) = \sum_{i=1}^{n} l\left(y_{i}, \hat{y}_{i}^{(t)}\right) + \sum_{k=1}^{t} \Omega(f_{k}) \tag{2}$$

where $L(\phi)$ denotes the total objective function of the XGBoost model at the $t$-th iteration; $n$ is the number of training samples; $l(y_{i}, \hat{y}_{i}^{(t)})$ represents the loss function that quantifies the error between the true label $y_{i}$ and the predicted value $\hat{y}_{i}^{(t)}$ of the $i$-th sample at the $t$-th iteration (specifically, the Mean Squared Error (MSE) was employed as the loss function for this precipitation regression task); $t$ is the number of weak learners (decision trees) in the ensemble; $f_{k}$ ($k = 1, 2, \ldots, t$) denotes the $k$-th weak learner (a decision tree); and $\Omega(f_{k})$ is the regularization term for the $k$-th weak learner, defined as:

$$\Omega(f_{k}) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} \omega_{j}^{2} \tag{3}$$

Here, $\gamma$ is the complexity penalty parameter for the decision tree (controlling the minimum loss reduction required to split a node); $T$ is the number of leaf nodes in the $k$-th decision tree; $\lambda$ is the L2 regularization coefficient for leaf node weights (reducing the magnitude of weights to avoid overfitting); and $\omega_{j}$ is the predicted value (weight) of the $j$-th leaf node.

To efficiently optimize the objective function, XGBoost approximates the loss function using a second-order Taylor expansion around the predicted value of the $(t-1)$-th iteration, $\hat{y}_{i}^{(t-1)}$. Because the new prediction is $\hat{y}_{i}^{(t)} = \hat{y}_{i}^{(t-1)} + f_{t}(x_{i})$, the loss function for the $i$-th sample is expanded as:

$$l\left(y_{i}, \hat{y}_{i}^{(t)}\right) \approx l\left(y_{i}, \hat{y}_{i}^{(t-1)}\right) + g_{i} f_{t}(x_{i}) + \frac{1}{2} h_{i} f_{t}^{2}(x_{i}) \tag{4}$$

where $g_{i} = \partial_{\hat{y}_{i}^{(t-1)}} l(y_{i}, \hat{y}_{i}^{(t-1)})$ is the first-order gradient (error sensitivity) of the loss function at $\hat{y}_{i}^{(t-1)}$; $h_{i} = \partial_{\hat{y}_{i}^{(t-1)}}^{2} l(y_{i}, \hat{y}_{i}^{(t-1)})$ is the second-order gradient (curvature of the loss surface) at $\hat{y}_{i}^{(t-1)}$; and $f_{t}(x_{i})$ is the output of the $t$-th weak learner for the $i$-th sample $x_{i}$.
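As an illustration of how the gradient statistics are used, the sketch below computes $g_{i}$ and $h_{i}$ for a squared-error loss, the closed-form leaf weight $-\sum g_{i} / (\sum h_{i} + \lambda)$ that minimizes the second-order approximation plus the L2 penalty, and the resulting split gain. These closed forms are the standard result from the original XGBoost derivation [36] and are not restated in this paper. The 0.5 scaling of the loss and the toy values are assumptions.

```python
def grad_hess_squared_error(y_true, y_prev):
    """g_i and h_i for l = 0.5 * (y - y_hat)^2 (the 0.5 scaling is an assumption)."""
    g = [p - t for t, p in zip(y_true, y_prev)]
    h = [1.0] * len(y_true)
    return g, h

def leaf_weight(g, h, lam):
    """Weight w minimizing sum(g*w + 0.5*h*w^2) + 0.5*lam*w^2 over one leaf."""
    return -sum(g) / (sum(h) + lam)

def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Loss reduction from splitting one leaf in two, minus the gamma penalty."""
    def score(g, h):
        return sum(g) ** 2 / (sum(h) + lam)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma

# Hypothetical toy sample: two dry days and two wet days (mm).
y_true = [0.0, 0.0, 4.0, 6.0]
y_prev = [2.5, 2.5, 2.5, 2.5]  # ensemble prediction after t-1 trees
g, h = grad_hess_squared_error(y_true, y_prev)
w_dry = leaf_weight(g[:2], h[:2], lam=1.0)
w_wet = leaf_weight(g[2:], h[2:], lam=1.0)
gain = split_gain(g[:2], h[:2], g[2:], h[2:], lam=1.0, gamma=0.5)
```

A split is only worthwhile when the gain is positive, which is how $\gamma$ acts as the minimum loss reduction required to split a node, while $\lambda$ shrinks each leaf weight toward zero.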

XGBoost exhibits several key advantages that make it well-suited for precipitation prediction in the Tibetan Plateau region. First, in terms of computational efficiency, XGBoost adopts multiple optimization strategies. The second-order Taylor expansion (Equation (4)) allows it to use first- and second-order gradient information simultaneously, enabling faster calculation of model updates and accelerating training. Second, the regularization terms (Equations (2) and (3)) mitigate overfitting, so the model remains robust and reliable when processing large-scale, complex meteorological datasets [24]. Third, these advantages directly address the inherent complexities of Tibetan Plateau precipitation prediction. The region’s rugged topography and complex atmospheric dynamics result in highly nonlinear relationships between predictors and precipitation, which XGBoost can capture more effectively than linear models. For the long-term, high-dimensional meteorological observations accumulated in the region, XGBoost’s optimization strategies and gradient-based learning allow it to process multi-source data (ground-based, radiometer, and reanalysis datasets) rapidly, shortening the training cycle [40]. Meanwhile, the regularization term helps handle the high level of uncertainty and noise in Tibetan Plateau meteorological data, such as signal fluctuations caused by extreme altitude and harsh environmental conditions. By limiting overfitting to such noise, XGBoost helps keep the extracted precipitation patterns physically consistent with the plateau’s climate mechanisms, thereby improving the stability and reliability of the results [41].

2.2.3. SHAP Model Interpretation Method

SHAP (Shapley Additive Explanations) derives from the Shapley value in game theory. It quantifies the contribution of influencing factors to model predictions by analyzing the mode (whether a factor promotes or inhibits precipitation) and magnitude (the degree of importance and sensitivity of the prediction to the factor) of their impacts on precipitation. As an additive feature attribution method within the framework of explainable artificial intelligence (XAI), the SHAP method was first proposed by Lundberg and Lee [27]. It interprets the model’s decision-making process based on the components and features learned during model training.

The Shapley value of a feature is computed as the weighted average of its marginal contributions across all possible subsets of the remaining features. It further quantifies the relative importance of variables: a higher relative importance value signifies a greater contribution of the variable to the model’s predictive outcomes. Additionally, SHAP partial dependence plots (PDPs) can indicate the interaction relationships between influencing factors and precipitation [42,43].
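The subset-averaging definition can be made concrete with a brute-force sketch for a three-feature toy model. It is only an illustration: the model `toy_model`, the instance, and the all-zero baseline are hypothetical, and replacing absent features by a baseline value is a simplification of how SHAP libraries estimate the coalition value for tree models.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every subset of the other features."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Shapley weight |S|! (n - |S| - 1)! / n! for subsets of this size
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi

def toy_model(x):
    """Hypothetical model: one additive term plus one interaction term."""
    return 2.0 * x[0] + x[1] * x[2]

instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]

def coalition_value(subset):
    """Model output with features in `subset` at their instance values, others at baseline."""
    x = [instance[j] if j in subset else baseline[j] for j in range(len(instance))]
    return toy_model(x)

phi = shapley_values(coalition_value, n_features=3)
```

The additive term is attributed entirely to the first feature, the interaction term is shared equally between the other two, and the attributions sum to the difference between the prediction and the baseline output. Enumeration scales exponentially with the number of features, so it is practical only for small illustrations.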

2.2.4. Model Evaluation and Interpretability Analysis

Root mean square error ($RMSE$), mean absolute error ($MAE$), coefficient of determination ($R^{2}$), $Bias$, and False Alarm Rate ($FAR$) are used as indicators to evaluate the accuracy of the XGBoost model. The closer the $RMSE$ and $MAE$ are to 0, and the closer the $R^{2}$ value is to 1, the higher the model accuracy. The formulas for each indicator are as follows:

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_{i} - \hat{y}_{i})^{2}} \tag{5}$$

$$MAE = \frac{1}{N} \sum_{i=1}^{N} |y_{i} - \hat{y}_{i}| \tag{6}$$

$$R^{2} = 1 - \frac{\sum_{i=1}^{N} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i=1}^{N} (y_{i} - \bar{y})^{2}} \tag{7}$$

$$Bias = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_{i} - y_{i}) \tag{8}$$

$$FAR = \frac{FP}{TP + FP} \tag{9}$$

Here, $N$ denotes the total number of samples; $y_{i}$ denotes the observed value of the target variable $y$ (i.e., observed precipitation in this study) for the $i$-th sample; $\bar{y}$ represents the mean value of the target variable $y$; and $\hat{y}_{i}$ denotes the predicted value of the target variable $y$ corresponding to the $i$-th sample. In the $FAR$ formula, $TP$ (True Positives) represents the number of events correctly predicted, while $FP$ (False Positives) denotes the number of events predicted by the model that did not occur in reality.
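The five indicators can be sketched in a few lines of plain Python. $R^{2}$ is computed in its standard form, one minus the ratio of the residual sum of squares to the total sum of squares. The event threshold used to count $TP$ and $FP$ is not given in the text, so `event_threshold=0.1` (mm) is a hypothetical choice, as are the toy values.

```python
import math

def rmse(obs, pred):
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def mae(obs, pred):
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def r2(obs, pred):
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

def bias(obs, pred):
    return sum(p - o for o, p in zip(obs, pred)) / len(obs)

def far(obs, pred, event_threshold=0.1):
    """False Alarm Rate; the event threshold (mm) is an assumed value."""
    tp = sum(1 for o, p in zip(obs, pred) if p >= event_threshold and o >= event_threshold)
    fp = sum(1 for o, p in zip(obs, pred) if p >= event_threshold and o < event_threshold)
    return fp / (tp + fp) if (tp + fp) else float("nan")

# Hypothetical 24 h precipitation values (mm), illustrative only.
observed = [0.0, 0.0, 2.0, 6.0]
predicted = [0.0, 1.0, 2.0, 5.0]
metrics = {
    "RMSE": rmse(observed, predicted),
    "MAE": mae(observed, predicted),
    "R2": r2(observed, predicted),
    "Bias": bias(observed, predicted),
    "FAR": far(observed, predicted),
}
```

In this toy case the over- and under-predictions cancel, so the Bias is zero even though the RMSE is not, which is why the two indicators are reported together.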

Furthermore, a normalized Taylor diagram was constructed to provide a multi-dimensional statistical summary of how well the predicted patterns match observations. The geometric relationship between the correlation coefficient ($r$), standard deviation ($\sigma$), and centered RMSE ($RMSE_{c}$) is defined as:

$$RMSE_{c}^{2} = \sigma_{obs}^{2} + \sigma_{pred}^{2} - 2 \sigma_{obs} \sigma_{pred} \cdot r \tag{10}$$

where $\sigma_{obs}$ and $\sigma_{pred}$ are the standard deviations of the observed and predicted series.
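The Taylor diagram rests on a law-of-cosines relation: the squared centered RMSE equals the sum of the two variances minus twice the product of the standard deviations and the correlation. The sketch below checks this numerically on hypothetical values, using population standard deviations (dividing by the sample count), which is an assumption about the convention used.

```python
import math

def taylor_statistics(obs, pred):
    """Return (sigma_obs, sigma_pred, r, centered_rmse) using population statistics."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_p = sum(pred) / n
    dev_o = [o - mean_o for o in obs]
    dev_p = [p - mean_p for p in pred]
    sigma_o = math.sqrt(sum(d * d for d in dev_o) / n)
    sigma_p = math.sqrt(sum(d * d for d in dev_p) / n)
    r = sum(a * b for a, b in zip(dev_o, dev_p)) / (n * sigma_o * sigma_p)
    centered_rmse = math.sqrt(sum((b - a) ** 2 for a, b in zip(dev_o, dev_p)) / n)
    return sigma_o, sigma_p, r, centered_rmse

# Hypothetical series (mm), illustrative only.
observed = [0.0, 1.0, 2.0, 5.0]
predicted = [0.5, 1.0, 1.5, 4.0]
sigma_o, sigma_p, r, crmse = taylor_statistics(observed, predicted)

lhs = crmse ** 2
rhs = sigma_o ** 2 + sigma_p ** 2 - 2.0 * sigma_o * sigma_p * r

# Normalized coordinates: divide by the observed standard deviation.
normalized_sigma = sigma_p / sigma_o
normalized_crmse = crmse / sigma_o
```

Dividing by the observed standard deviation places the reference point at unit distance on the horizontal axis, so models evaluated on different seasons can share one diagram.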

To rigorously validate the statistical significance of seasonal performance disparities, the non-parametric Kruskal–Wallis H-test was implemented. The test statistic $H$ is calculated as:

$$H = \frac{12}{N(N+1)} \sum_{j=1}^{k} \frac{R_{j}^{2}}{n_{j}} - 3(N+1) \tag{11}$$

where $N$ is here the total number of samples pooled across all seasons, $k$ is the number of seasons, $n_{j}$ is the sample size of the $j$-th season, and $R_{j}$ is the sum of ranks for that season. A significant result (p < 0.05) indicates that the observed performance differences reflect distinct seasonal regimes rather than stochastic variance.
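The H statistic can be computed directly from pooled ranks, as in the sketch below. Ties receive average ranks, but no tie correction is applied and no p-value is computed. The four groups of values stand in for per-season errors and are hypothetical.

```python
def average_ranks(values):
    """1-based ranks of `values`, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic without tie correction."""
    pooled = [v for group in groups for v in group]
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    weighted = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3.0 * (n_total + 1)

# Hypothetical absolute errors (mm) for four seasons, illustrative only.
seasonal_errors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
h_stat = kruskal_wallis_h(seasonal_errors)
```

In practice the statistic is compared against a chi-square distribution with $k - 1$ degrees of freedom. A library routine such as `scipy.stats.kruskal` returns both the tie-corrected statistic and the p-value.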

Finally, model interpretability was addressed using the SHapley Additive exPlanations (SHAP) method. Two complementary SHAP visualizations were employed: (1) SHAP bar histograms to rank global feature importance based on mean absolute SHAP values; and (2) SHAP summary plots (beeswarm plots) to illustrate the distribution and directional impact of each feature. This dual approach provides a transparent explanation of how key predictors—such as dew point temperature (DPT) and specific humidity—influence the model’s output across different seasons.

2.2.5. Physical Parameters Derivation

To better characterize the moisture conditions over Maqu Station, two key column-integrated parameters, Integrated Water Vapor (IWV) and Liquid Water Path (LWP), were utilized in this study. These parameters represent the total amount of water vapor and liquid water (such as cloud droplets) in a vertical column of the atmosphere, respectively.

$IWV$ and $LWP$ are derived by vertically integrating the water vapor density ($\rho_{v}$) and liquid water density ($\rho_{l}$) profiles retrieved from the microwave radiometer observations from the surface ($z_{0}$) up to a height of 10 km ($z_{max}$):

$$IWV = \int_{z_{0}}^{z_{max}} \rho_{v}(z)\, dz \tag{12}$$

$$LWP = \int_{z_{0}}^{z_{max}} \rho_{l}(z)\, dz \tag{13}$$

where $\rho_{v}$ is the water vapor density at height $z$ (g/m³) and $\rho_{l}$ is the liquid water density at height $z$ (g/m³). The resulting $IWV$ and $LWP$ are expressed in centimeters (cm) or millimeters (mm), representing the equivalent depth of liquid water: with densities in g/m³ and heights in m, the integral is in g/m², and 1 kg/m² of liquid water corresponds to a depth of 1 mm.
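A minimal numerical sketch of the column integration uses the trapezoidal rule on a height grid built from the stated MWR vertical resolution (50 m, 100 m, and 250 m layers). The exact level layout and the two idealized profiles are assumptions made for illustration.

```python
def mwr_heights():
    """Height grid (m above ground) following the stated layer resolutions."""
    heights = list(range(0, 500, 50))         # 0-500 m at 50 m
    heights += list(range(500, 2000, 100))    # 500-2000 m at 100 m
    heights += list(range(2000, 10001, 250))  # 2000-10,000 m at 250 m
    return heights

def column_integral_mm(density_g_m3, heights_m):
    """Trapezoidal integral of a density profile, in mm of liquid water."""
    total_g_m2 = 0.0
    for k in range(len(heights_m) - 1):
        dz = heights_m[k + 1] - heights_m[k]
        total_g_m2 += 0.5 * (density_g_m3[k] + density_g_m3[k + 1]) * dz
    return total_g_m2 / 1000.0  # g/m^2 -> kg/m^2, and 1 kg/m^2 equals 1 mm depth

heights = mwr_heights()

# Idealized profiles (g/m^3), illustrative only.
uniform_profile = [1.0 for _ in heights]
linear_profile = [2.0 * (1.0 - z / 10000.0) for z in heights]

iwv_uniform = column_integral_mm(uniform_profile, heights)
iwv_linear = column_integral_mm(linear_profile, heights)
```

The same function applies to the liquid water density profile to obtain LWP. The non-uniform grid is handled by using the actual spacing between adjacent levels rather than a fixed step.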

2.2.6. Model Training and Optimization Strategy

A systematic phased strategy was adopted to optimize the hyperparameters of the XGBoost model. Initially, a base model was established with a learning rate of 0.1 and a maximum limit of 2000 trees. To prevent overfitting, an early stopping mechanism was implemented to monitor the Root Mean Square Error (RMSE) every 10 iterations; training was terminated if no improvement occurred for 50 consecutive rounds, which yielded an optimal 1000 iterations. Building on this baseline, a grid search was performed to refine structural parameters, specifically targeting max_depth [7,8,9,10] and min_child_weight [3,4,5,6,7,8], which resulted in optimal values of 9 and 5, respectively. Following the structural optimization, the model’s generalization capability was further enhanced by fine-tuning regularization terms and sampling ratios. The final optimized hyperparameters are summarized in Table 2.
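The stopping rule and the size of the structural grid can be sketched without any training library. The sketch checks the validation RMSE at every round rather than every 10 iterations, and the RMSE history is synthetic. Both are simplifications for illustration and not the authors' training code.

```python
from itertools import product

def early_stopping_round(val_rmse, patience=50):
    """Return (best_round, stop_round) for a per-round validation RMSE history."""
    best_rmse = float("inf")
    best_round = 0
    for rnd, score in enumerate(val_rmse, start=1):
        if score < best_rmse:
            best_rmse, best_round = score, rnd
        elif rnd - best_round >= patience:
            return best_round, rnd  # no improvement for `patience` rounds
    return best_round, len(val_rmse)

# Synthetic history: RMSE improves for 100 rounds, then stays worse.
history = [2.0 - 0.01 * r for r in range(1, 101)] + [1.2] * 200
best_round, stop_round = early_stopping_round(history)

# Candidate (max_depth, min_child_weight) pairs from the stated search ranges.
grid = list(product([7, 8, 9, 10], [3, 4, 5, 6, 7, 8]))
```

The grid contains 24 candidate pairs, each of which would be fitted and scored on the validation set. Keeping the number of trees fixed at the early-stopped value while searching the structural parameters is what makes the phased approach cheaper than a joint search.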

To complement the optimization process, a linear weight augmentation strategy was applied to samples with precipitation exceeding 5 mm to mitigate the data imbalance inherent in heavy rainfall events. Simultaneously, to account for temporal dependencies and ensure robust model evaluation, a chronological stratified partitioning strategy was employed. The seasonal datasets were sequentially partitioned into three independent subsets: a training set (70%) for model fitting, a validation set (15%) for hyperparameter tuning, and a strictly independent test set (15%) for final performance verification. This tripartite division ensures that the model evaluation is conducted on data entirely unseen during the training and optimization phases.
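The partitioning and weighting steps might look like the sketch below for one season's time-ordered samples. The functional form of the weights is not given in the text, so `1 + slope * (p - 5)` with `slope=0.1` is a hypothetical choice that only illustrates a weight growing linearly above the 5 mm threshold.

```python
def chronological_split(samples, train_pct=70, val_pct=15):
    """Split time-ordered samples into train/validation/test without shuffling."""
    n = len(samples)
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test

def linear_weights(precip_mm, threshold=5.0, slope=0.1):
    """Weight 1 up to the threshold, increasing linearly above it.

    The slope is a hypothetical value; the text specifies only the 5 mm threshold.
    """
    return [1.0 + slope * max(0.0, p - threshold) for p in precip_mm]

# Hypothetical season of 20 time-ordered sample indices.
season = list(range(20))
train, val, test = chronological_split(season)

weights = linear_weights([0.0, 5.0, 15.0])
```

Because the split preserves time order, every validation and test sample is later than every training sample in the same season, which avoids leaking future information into the fit.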

2.2.7. Baseline Model: Random Forest (RF)

To evaluate the “added value” of our proposed framework, a standard, unweighted Random Forest (RF) regressor was introduced as a baseline. To ensure a strictly controlled comparison, the RF model utilized the identical input features and data split as the primary XGBoost model. Consistent with the phased optimization approach described in Section 2.2.6, the key structural hyperparameters of the RF model were tuned via grid search on the training-validation dataset. The finalized configuration consists of 500 independent trees (n_estimators = 500), a maximum tree depth of 12 (max_depth = 12), and a splitting feature restriction of max_features = ‘sqrt’. Statistical benchmarking was executed across both the overall test set and individual seasons.

3. Results and Discussion

3.1. Time Series of Observations

Figure 2 illustrates the temporal evolution of meteorological observations at Maqu Station (2020–2022). All variables exhibit pronounced seasonal cycles, consistent with the regional climate characterized by warm–humid summers and cold–dry winters. No significant interannual anomalies were detected over the three-year record.

Integrated Water Vapor (IWV) fluctuates between 0 and 15 mm, driven primarily by seasonal atmospheric circulation: IWV peaks during the summer monsoon period due to enhanced moisture transport and reaches minima during winter under the influence of cold, dry air masses. Liquid Water Path (LWP) exhibits more transient fluctuations within a range of 0–2.5 mm, reflecting episodic condensation processes and cloud liquid water dynamics.

Surface and 2-m air temperatures exhibit consistent seasonal trends, ranging from −20 to 20 °C. The observed diurnal temperature discrepancy between the surface and the near-surface atmosphere arises from distinct thermal absorption and radiative cooling rates (Figure 2b). Daily precipitation is predominantly light (0–10 mm), punctuated by occasional high-intensity events. Dew Point Temperature (DPT) follows a similar seasonal pattern, with its proximity to ambient temperature serving as a critical threshold for saturation and subsequent precipitation (Figure 2c). Overall, these observations confirm that atmospheric moisture availability and precipitation intensity are significantly higher in spring and summer, facilitating active convective development compared with the quiescent autumn and winter regimes.

3.2. Vertical Structures of Observations

Regarding the vertical profile, liquid water density is highest near the surface and attenuates rapidly with altitude, nearing zero above 4 km (Figure 3a–c). Water vapor density exhibits a similar vertical decay but demonstrates a smoother gradient and higher absolute values (Figure 3d–f). In contrast, relative humidity (RH) shows a more pronounced vertical gradient, characterized by higher values at lower altitudes that decline gradually toward the upper troposphere (Figure 3g–i).

Seasonal comparisons reveal that these vertical profiles are strongly modulated by atmospheric circulation. During winter (Figure 3c,f,i), the region is dominated by cold, dry air masses associated with the winter monsoon, resulting in suppressed liquid water and water vapor densities at all levels relative to the annual mean. Conversely, during summer (Figure 3b,e,h), the influx of warm, humid monsoon systems leads to an enhanced moisture supply, resulting in significantly higher liquid water density, water vapor density, and RH in the lower atmosphere.

Collectively, these observations demonstrate a distinct seasonal moisture stratification: the summer atmosphere is characterized by an enriched vertical moisture structure, whereas winter is marked by relative moisture scarcity. This stratification reflects the seasonal transition of the plateau monsoon, which fundamentally dictates the vertical availability of atmospheric water content.

3.3. XGBoost Prediction Results

3.3.1. Training

Feature selection preceded model training. Based on the correlation analysis, we retained features with a correlation coefficient greater than or equal to 0.2 for subsequent machine learning prediction (Figure 4). Notably, water vapor density at various altitudes (specifically wvd0, wvd1, and wvd2) exhibited significant correlations with precipitation. The inclusion of these parameters serves to quantitatively represent the moisture-rich environment within the 0–4 km atmospheric layer, providing the model with critical vertical profile information that aligns with the physical requirements for precipitation formation.

The XGBoost regression model was trained and optimized using the systematic three-phase strategy and chronological validation described in Section 2.2.6. Through this optimization, the model achieved its best performance with an optimal iteration count of 1000. The final hyperparameter configuration—including a max_depth of 9 and a min_child_weight of 5—successfully balanced the model complexity and generalization capability.

Table 2 presents the final optimized hyperparameters. By implementing the linear weight augmentation strategy for heavy rainfall (precipitation > 5 mm), the model’s sensitivity to high-intensity events was effectively enhanced, ensuring robust predictive accuracy across different precipitation intensities.

3.3.2. Model Predictive Performance

Density-based scatter plots and SHAP attribution were employed to evaluate the predictive performance of the precipitation model and quantify predictor contributions. The weighted XGBoost framework was benchmarked against unweighted XGBoost and Random Forest (RF) baselines (Figure 5, Table 3). Annually, the weighted XGBoost achieved an R² of 0.872, outperforming both the unweighted XGBoost (R² = 0.863) and the RF baseline (R² = 0.787). Quantitatively, the weighted model yielded an RMSE of 1.609 mm and an MAE of 0.691 mm, representing a systematic reduction in error relative to the unweighted XGBoost (RMSE = 1.660 mm) and the RF baseline (RMSE = 2.073 mm).

As shown in the density scatter plots (Figure 5a,b), most predictions align closely with the 1:1 identity line. The 95% Prediction Interval (PI), calculated as

$$\hat{y} \pm 1.96 \times \mathrm{RMSE}$$

under a homoscedastic assumption, provides an empirical boundary for individual forecast fluctuations. While the majority of observations remain within this statistical range, this interval serves as a localized verification of residual variance rather than a comprehensive probabilistic uncertainty quantification. The density gradient suggests higher prognostic consistency for low-to-moderate precipitation (0–5 mm), which constitutes the majority of observed events. The weighted XGBoost maintained a minimal Bias (−0.063 mm), demonstrating better predictive balance across the precipitation spectrum than the RF baseline (−0.087 mm).
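The interval and its empirical coverage can be checked with a short sketch. The RMSE value is the annual figure reported above; the function names and the toy forecasts are hypothetical. Note that the symmetric formula can give negative lower bounds for small forecasts, which is one reason the interval is only an empirical error band.

```python
def prediction_interval(predicted, rmse, z=1.96):
    """Symmetric interval y_hat +/- z * RMSE for each forecast."""
    half_width = z * rmse
    return [(p - half_width, p + half_width) for p in predicted]

def empirical_coverage(observed, intervals):
    """Fraction of observations that fall inside their interval."""
    inside = sum(1 for o, (lo, hi) in zip(observed, intervals) if lo <= o <= hi)
    return inside / len(observed)

ANNUAL_RMSE_MM = 1.609                  # annual RMSE of the weighted model
half_width_mm = 1.96 * ANNUAL_RMSE_MM   # about 3.15 mm

# Toy forecasts and observations (not station data).
forecasts = [1.0, 2.0, 10.0]
observations = [1.5, 6.0, 9.0]
intervals = prediction_interval(forecasts, ANNUAL_RMSE_MM)
coverage = empirical_coverage(observations, intervals)
print(round(half_width_mm, 2), coverage)
```

With a constant half-width of roughly 3.15 mm, an observation of 6 mm against a 2 mm forecast falls outside the band, illustrating how heavy events dominate the misses under the homoscedastic assumption.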

Targeted evaluation of heavy precipitation (>5 mm) shows that the weighted strategy mitigates the systematic underestimation of high-intensity events (Table 3). For these events, the weighted model achieved an R² of 0.727 (unweighted: 0.686; RF: 0.529) and an RMSE of 4.470 mm, a marked improvement over the baselines. Although all models displayed a negative bias for heavy precipitation, the weighted framework approached zero most closely (−2.79 mm). Notably, the False Alarm Rate (FAR) remained stable (0.089), indicating that the weighting strategy improves intensity estimation without increasing classification errors.
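The heavy-event evaluation and the FAR can be sketched as follows. The FAR is assumed here to be false alarms divided by all forecast rain events, and the 0.1 mm rain/no-rain threshold is hypothetical; neither choice is confirmed by this section. Only the 5 mm heavy threshold comes from the text.

```python
def false_alarm_rate(observed, predicted, rain_threshold=0.1):
    """False alarms / (hits + false alarms) for rain/no-rain forecasts.

    rain_threshold is an illustrative assumption.
    """
    hits = 0
    false_alarms = 0
    for o, p in zip(observed, predicted):
        if p >= rain_threshold:
            if o >= rain_threshold:
                hits += 1
            else:
                false_alarms += 1
    forecast_events = hits + false_alarms
    return false_alarms / forecast_events if forecast_events else 0.0

def heavy_subset(observed, predicted, threshold=5.0):
    """Keep only the pairs whose observed precipitation exceeds the threshold."""
    pairs = [(o, p) for o, p in zip(observed, predicted) if o > threshold]
    return [o for o, _ in pairs], [p for _, p in pairs]

# Toy data: one false alarm, two underestimated heavy events.
obs = [0.0, 0.0, 2.0, 8.0, 12.0]
pred = [0.5, 0.0, 1.5, 6.0, 9.0]
far = false_alarm_rate(obs, pred)
heavy_obs, heavy_pred = heavy_subset(obs, pred)
heavy_bias = sum(p - o for o, p in zip(heavy_obs, heavy_pred)) / len(heavy_obs)
print(far, heavy_bias)
```

Because the FAR depends only on whether rain is forecast, reweighting heavy samples can change intensity errors while leaving this score nearly unchanged.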

SHAP attribution analysis identifies Dew Point Temperature (DPT) as the primary predictor (mean absolute SHAP value > 1.25, Figure 5c), consistent with its role as a proxy for lower-tropospheric moisture saturation. The high SHAP values for mid-tropospheric geopotential height (z500) and specific humidity (q500) indicate that the framework captures synoptic-scale circulation dynamics and vertical moisture transport over the QTP. The lower reliance on surface relative humidity suggests the model prioritizes large-scale thermodynamic columns over volatile surface variables, a strategy broadly aligned with alpine precipitation formation mechanisms.
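The ranking shown in Figure 5c corresponds to averaging absolute SHAP values per feature. The sketch below assumes a precomputed SHAP matrix with one row per sample and one column per feature; in practice such a matrix is commonly obtained from `shap.TreeExplainer(model).shap_values(X)`. The numbers are toy values, not results from this study.

```python
def mean_abs_shap(shap_matrix, feature_names):
    """Rank features by mean absolute SHAP value, largest first."""
    n_samples = len(shap_matrix)
    scores = {}
    for j, name in enumerate(feature_names):
        scores[name] = sum(abs(row[j]) for row in shap_matrix) / n_samples
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy SHAP matrix: rows are samples, columns follow feature_names.
feature_names = ["DPT", "z500", "RH"]
shap_matrix = [
    [1.5, -0.5, 0.1],
    [-1.0, 0.75, -0.1],
    [2.0, -0.25, 0.1],
]
ranking = mean_abs_shap(shap_matrix, feature_names)
print(ranking)
```

Taking absolute values before averaging prevents positive and negative contributions from cancelling, so the score measures influence on the output and not its direction.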

To evaluate the predictive capabilities across seasonal atmospheric regimes, we benchmarked the proposed framework against the Random Forest (RF) baseline. A Kruskal–Wallis H-test confirmed that differences in error distributions across seasons are statistically significant (H = 1216.108, p < 0.001), indicating that model performance is fundamentally modulated by seasonal atmospheric dynamics.
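The H statistic can be reproduced with a small pure-Python sketch. It assumes each group holds the forecast errors of one season, which is an interpretation of the text, and it applies the standard tie correction. The p-value follows from a chi-square distribution with k − 1 degrees of freedom and is not computed here; `scipy.stats.kruskal` provides both values directly.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with tie correction."""
    # Pool all values, remembering which group each one came from.
    pooled = sorted((value, g) for g, group in enumerate(groups) for value in group)
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0.0
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    h = 12.0 / (n_total * (n_total + 1)) * sum(
        r ** 2 / len(group) for r, group in zip(rank_sums, groups)
    ) - 3.0 * (n_total + 1)
    correction = 1.0 - tie_term / (n_total ** 3 - n_total)
    return h / correction if correction > 0 else 0.0

# Toy groups with clearly separated error levels.
h_stat = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(round(h_stat, 3))
```

Because the test is rank-based, it does not assume normally distributed errors, which suits the strongly skewed error distributions of precipitation forecasts.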

Comparative metrics (Table 4) demonstrate that the proposed framework consistently outperforms the RF baseline. In Autumn, the framework achieved an R² of 0.88 and a minimal Bias of −0.01 mm, notably improving upon the RF (R² = 0.72, Bias = −0.14 mm). Under transitional Spring conditions, the proposed model maintained stability (R² = 0.85, RMSE = 0.75 mm), showing a lower variance than the RF baseline. During Winter, characterized by low-intensity stratiform precipitation, the framework minimized absolute errors (RMSE = 0.32 mm, MAE = 0.11 mm), with predictions clustering tightly along the 1:1 identity line (Figure 6d).

In contrast, Summer convective periods present a greater challenge for both models. While the proposed framework achieved a slightly higher R² (0.80 vs. 0.79), the RF baseline maintained a lower seasonal RMSE (2.51 mm vs. 3.26 mm). Nevertheless, our framework exhibited a reduced Bias (−0.07 mm) compared with the RF (−0.18 mm), suggesting a more effective mitigation of systematic underestimation in convective scenarios.

For heavy precipitation (>5 mm), Taylor diagram analysis (Figure 6e) indicates that while the model retains high phase synchrony (r > 0.79), it exhibits a “smoothing effect” on peak magnitudes, particularly in Summer. Note that Winter heavy events were excluded from this analysis due to insufficient sample size (N = 6). The 5 mm threshold represents an empirical compromise to ensure statistical significance across transitional seasons. These results suggest that while the framework provides an effective diagnostic baseline, integrating season-specific dynamic constraints or hybrid loss functions is necessary to resolve the complex microphysics of intense convective events.
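The statistics summarized in a Taylor diagram can be computed as sketched below: the correlation r, the ratio of predicted to observed standard deviation, and the centered RMSE. Population standard deviations are assumed, and the toy data are constructed so that the forecast has perfect phase agreement but half the observed amplitude.

```python
import math

def taylor_statistics(observed, predicted):
    """Correlation, normalized standard deviation and centered RMSE."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    std_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed) / n)
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted) / n)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted)) / n
    r = cov / (std_o * std_p)
    # Law-of-cosines relation underlying the Taylor diagram.
    crmse = math.sqrt(max(0.0, std_o ** 2 + std_p ** 2 - 2.0 * std_o * std_p * r))
    return {"r": r, "std_ratio": std_p / std_o, "crmse": crmse}

# Toy heavy events: predicted = 0.5 * observed + 4 (smoothed peaks).
stats = taylor_statistics([6.0, 8.0, 10.0, 12.0], [7.0, 8.0, 9.0, 10.0])
print(stats)
```

A correlation near 1 combined with a standard deviation ratio below 1 is the signature of the smoothing effect: event timing is captured while peak magnitudes are damped.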

The seasonal-scale feature importance, derived from SHAP values (Figure 7a–d), provides an interpretative framework to examine the model’s sensitivity to various predictors across different climatic regimes. It is essential to clarify that these SHAP attributions characterize the statistical contribution of each variable within the specific model architecture, rather than establishing direct physical causality. Nonetheless, the observed hierarchy of feature importance appears indicative of established meteorological processes.

In Spring (Figure 7a), the model’s reliance on Dew Point Temperature (DPT) and z500 suggests that these variables serve as effective statistical proxies for moisture saturation and large-scale circulation patterns during this period. For the Summer season (Figure 7b), the increased sensitivity to q500 and wvd2 (2 km water vapor density) is consistent with the heightened role of vertical moisture transport typical of monsoon-influenced environments. During Autumn (Figure 7c), the transition of q500 and q700 to leading positions implies that mid-tropospheric moisture levels may offer more informative signals for precipitation estimation during this transitional phase. In contrast, the overall reduction in SHAP magnitudes during Winter (Figure 7d) likely reflects a regime where predictive variance is primarily constrained by absolute moisture availability.

In conclusion, while these SHAP results are primarily reflective of the model’s internal weighting, their alignment with the conceptual transition from ‘moisture triggering’ to ‘dynamic-moisture coupling’ suggests that the model leverages physically meaningful statistical signals. This consistency supports the potential utility of season-specific strategies in refining future precipitation forecasting frameworks.

3.4. Seasonal Interpretability and Consistency with Physical Processes

The seasonal SHAP feature effect distributions (Figure 8a–d) characterize the model’s internal sensitivity to meteorological predictors across varying climatic regimes. These distributions represent the statistical dependency of the model on specific features rather than definitive causal pathways, offering an interpretative framework that aligns with established meteorological principles.

In Spring (Figure 8a), the model exhibits a prominent positive sensitivity to Dew Point Temperature (DPT), where higher DPT values (represented in red) consistently elevate the predicted precipitation. This pattern suggests that the model effectively leverages moisture saturation as a primary predictive signal for spring precipitation. In Summer (Figure 8b), the dynamic range of SHAP values reaches its seasonal maximum (exceeding 10.0 for DPT and q500), reflecting the model’s heightened sensitivity to the intense moisture and convective signals typical of the monsoon season. The wide dispersion of SHAP values for mid-tropospheric humidity (q500) indicates that the model heavily relies on moisture-driven instability to characterize heavy rainfall events.

A distinct transition in predictive logic is observed in Autumn (Figure 8c), where q500 and q700 emerge as the most influential variables, surpassing DPT. This hierarchy suggests that during the post-monsoon transition, the model prioritizes mid-tropospheric moisture accumulation and large-scale convergence signals over surface-level conditions. In contrast, the SHAP value range in Winter (Figure 8d) is markedly compressed (typically within ±0.1), reflecting the limited variance and intensity of cold-season precipitation. Although DPT remains a leading predictor, its minimal impact magnitude indicates a regime primarily constrained by absolute moisture availability.
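The contrast between seasonal SHAP ranges can be quantified with a simple grouping step. The sketch assumes meteorological seasons (March to May as Spring, and so on), which is an assumption because the season definition is not restated in this section. The SHAP values for a single feature are toy numbers.

```python
# Assumed month-to-season mapping (meteorological seasons).
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}

def seasonal_shap_range(months, shap_values):
    """Return the (min, max) SHAP value of one feature for each season."""
    ranges = {}
    for month, value in zip(months, shap_values):
        season = SEASON_BY_MONTH[month]
        low, high = ranges.get(season, (value, value))
        ranges[season] = (min(low, value), max(high, value))
    return ranges

# Toy SHAP values of one predictor, with the month of each sample.
months = [1, 1, 7, 7, 7]
values = [0.05, -0.08, 4.0, -2.0, 10.5]
ranges = seasonal_shap_range(months, values)
print(ranges)
```

Comparing the width of each range makes the seasonal compression explicit: a narrow winter band against a wide summer band reflects the different magnitudes of the predicted precipitation itself.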

In summary, the seasonal evolution of the model’s predictive hierarchy—shifting from moisture-triggering in spring and convective-coupling in summer to mid-level regulation in autumn and moisture-limitation in winter—demonstrates that the framework has captured statistically consistent proxies for seasonal meteorological variance. This interpretative consistency supports the robustness of the model in adapting its internal logic to varying climatic conditions.

4. Discussion

The XGBoost framework yields an annual R² of 0.872, consistent with recent machine learning applications across the Qinghai-Tibet Plateau (QTP) [19,40]. Benchmarking against a Random Forest (RF) baseline (Table 4) clarifies the framework’s relative value: while RF captures basic seasonal patterns, the sequential boosting architecture of XGBoost yields a smaller systematic bias during convective summers (R² = 0.80, Bias = −0.07 mm vs. RF Bias = −0.18 mm), although RF retains the lower summer RMSE (2.51 mm vs. 3.26 mm). The summer advantage therefore lies in mitigating systematic underestimation, not in reducing overall error spread. During winter, the low MAE (0.11 mm) implies a potential to mitigate the light precipitation overestimation often found in satellite products [44]. However, Taylor diagram evaluations reveal a characteristic “smoothing effect” during heavy events (>5 mm). While correlation remains high (r > 0.84), R² values decline, indicating that the model captures temporal trends well but underrepresents extreme peak variance (STD < 1.0). This limitation, common in gradient-boosted trees, stems from an objective function that prioritizes global residual minimization over extreme outlier fitting. Consequently, these results should be viewed as a localized empirical baseline requiring further cross-site validation.

Distinguishing this study from conventional “black-box” machine learning applications, the integration of the SHAP framework provides valuable diagnostic transparency. Our results suggest that Dew Point Temperature (DPT) acts as a relatively consistent year-round predictor, while also revealing potential seasonal synergistic effects between 500 hPa specific humidity and lower-level water vapor density during the summer monsoon. This interpretability is consistent with established atmospheric physics regarding moisture–dynamic coupling [42], thereby supporting the model’s reliability by demonstrating that its statistical outputs are grounded in physically plausible predictors [41].

Despite these findings, several limitations necessitate a cautious interpretation. The 95% prediction intervals (PI) in our scatter plots serve strictly as an empirical error baseline ($\hat{y} \pm 1.96 \times \mathrm{RMSE}$). As a deterministic framework, this study does not provide a comprehensive probabilistic uncertainty quantification; therefore, our results should not be overinterpreted as an exhaustive robustness assessment. Additionally, the limited three-year record at Maqu (2020–2022) may not fully capture interannual oscillations or the plateau’s extreme microclimatic heterogeneity [45]. Future work will aim to incorporate multi-site, multi-decadal datasets and explore hybrid loss functions to prioritize extreme value distributions, ultimately refining high-intensity convective forecasting.

5. Conclusions

Based on microwave radiometer and ground-based observations at Maqu Station (2020–2022), this study characterizes the seasonal evolution of atmospheric moisture over the eastern Qinghai-Tibet Plateau (QTP) and evaluates an XGBoost-SHAP framework for 24 h precipitation forecasting. The primary conclusions are as follows:

First, meteorological parameters in the Maqu region exhibit distinct seasonal cycles. Integrated Water Vapor (IWV) and Liquid Water Path (LWP) are predominantly confined to the lower troposphere (below 4 km AGL), with their vertical distributions modulated primarily by seasonal plateau monsoon transitions.

Second, the optimized XGBoost framework demonstrates favorable predictive performance, achieving an annual R² of 0.872. Benchmarking against a Random Forest (RF) baseline confirms the framework’s superior predictive skill, particularly in reducing systematic bias. However, performance remains regime-dependent (p < 0.001). While the model exhibits high phase synchrony (r > 0.91), it displays a characteristic “smoothing effect” during convective summer periods, where extreme precipitation peaks are often underestimated. This suggests that while gradient-boosted architectures effectively capture global temporal trends, they remain challenged by the stochastic nature of high-intensity convective extremes.

Third, SHAP attribution reveals a clear hierarchy in predictor importance. Dew Point Temperature (DPT) serves as a consistent year-round predictor, while seasonal variations in model sensitivity—such as the reliance on mid-tropospheric humidity during summer—reflect the complex moisture–dynamic coupling inherent to the QTP.

Finally, this study establishes the XGBoost-SHAP framework as a physically plausible diagnostic tool for localized precipitation forecasting. We emphasize that these results constitute an empirical baseline rather than a generalized regional solution. Future research should prioritize cross-site validation and the integration of hybrid loss functions to better constrain the prediction uncertainties associated with extreme convective events.

Author Contributions

Conceptualization, D.Z. and S.Z.; methodology, S.Z.; software, S.Z.; validation, S.Z. and H.D.; formal analysis, S.Z.; investigation, X.P.; resources, G.L. and D.Z.; data curation, G.L., T.W. and D.Z.; writing—original draft preparation, S.Z.; writing—review and editing, D.Z. and Y.M.; visualization, W.S.; supervision, D.Z.; project administration, D.Z.; funding acquisition, D.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (42307144), the Second Comprehensive Scientific Research Project on the Tibetan Plateau (2019QZKK0105), the Fundamental Research Funds for the Central Universities (24CAFUC07001; PHD2023-018), and the National Natural Science Foundation of China (42105087).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors acknowledge the European Centre for Medium-Range Weather Forecasts (ECMWF) for providing the reanalysis data used in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Lin, Q.; Chen, J.; Chen, D.; Wang, X.; Li, W.; Scherer, D. Impacts of Bias-Corrected ERA5 Initial Snow Depth on Dynamical Downscaling Simulations for the Tibetan Plateau. J. Geophys. Res. Atmos. 2021, 126, e2021JD035625.
2. Zhu, Q.; Liu, Y.; Shao, T.; Luo, R.; Tan, Z. Role of the Tibetan Plateau in Northern Drought Induced by Changes in the Subtropical Westerly Jet. J. Clim. 2021, 34, 4955–4969.
3. Huang, J.; Zhou, X.; Wu, G.; Xu, X.; Zhao, Q.; Liu, Y.; Duan, A.; Xie, Y.; Ma, Y.; Zhao, P.; et al. Global Climate Impacts of Land-Surface and Atmospheric Processes Over the Tibetan Plateau. Rev. Geophys. 2023, 61, e2022RG000771.
4. Ma, M.; Tang, J.; Ou, T.; Chen, D. Subdaily Extreme Precipitation and Its Linkage to Global Warming Over the Tibetan Plateau. J. Geophys. Res. Atmos. 2023, 128, e2023JD039062.
5. Wu, F.; You, Q.; Pepin, N.; Kang, S.; Zhai, P. Surface Warming in Summer over the Tibetan Plateau: Local and Atmospheric Circulation Processes. Glob. Planet. Change 2025, 252, 104904.
6. Zhang, X.; Li, X.; Che, T.; Yang, C.; Duan, H.; Wu, J.; Liu, Y. Changes in Precipitation Phases Based on the Multi-Discrimination Method in the Tibetan Plateau. Atmos. Res. 2024, 310, 107597.
7. Ding, Z.; Ha, Y.; Hu, Y.; Zhu, Y.; Dai, H.; Zhong, Z. Spatiotemporal Characteristics of Summer Extreme Precipitation over the Inner Tibetan Plateau in Recent Decades. npj Clim. Atmos. Sci. 2025, 8, 193–199.
8. Lu, M.; Yang, S.; Fan, H.; Wang, J. Interdecadal Instability of the Interannual Connection between Southern Tibetan Plateau Precipitation and Southeast Asian Summer Monsoon. Atmos. Res. 2023, 291, 106825.
9. Zheng, Y.; Liu, H.; Du, Q.; Liu, Y.; Sun, J.; Cun, H.; Järvi, L. Effects of Precipitation Seasonal Distribution on Net Ecosystem CO2 Exchange over an Alpine Meadow in the Southeastern Tibetan Plateau. Int. J. Biometeorol. 2022, 66, 1561–1573.
10. Liu, J.; Chen, C.; Qing, N.; Li, X.; Lai, J.; Guo, B. Response of Water Resources to Climate Change in Zoige, Tibetan Plateau. J. Glaciol. Geocryol. 2016, 38, 498–508.
11. Sun, Z.; Wang, C. Impact of Changing Climate on Agriculture in China. Ke Ji Dao Bao 2010, 28, 110–117.
12. Zhao, D.; Zhu, Y.; Wu, S.; Zheng, D. Projection of Vegetation Distribution to 1.5 Degrees C and 2 Degrees C of Global Warming on the Tibetan Plateau. Glob. Planet. Change 2021, 202, 103525.
13. Zou, F.; Li, H.; Hu, Q. Responses of Vegetation Greening and Land Surface Temperature Variations to Global Warming on the Qinghai-Tibetan Plateau, 2001–2016. Ecol. Indic. 2020, 119, 106867.
14. Chen, Q.; Liu, H.; Hu, M.; Ge, F.; Li, Y. A Review of the Studies on Extreme Precipitation in the Complex Terrain Region on the Eastern Side of the Tibetan Plateau. Torrential Rain Disasters 2024, 43, 255–265.
15. Wang, Z.Q.; Duan, A.M.; Li, M.S.; He, B. Influences of Thermal Forcing over the Slope/Platform of the Tibetan Plateau on Asian Summer Monsoon: Numerical Studies with the WRF Model. Acta Geophys. Sin. 2016, 59, 3175–3187.
16. Guo, Y.; Zheng, H.; Yang, Y.; Sang, Y.; Wen, C. A Hydrogeomorphic Dataset for Characterizing Catchment Hydrological Behavior across the Tibetan Plateau. Earth Syst. Sci. Data 2024, 16, 1651–1665.
17. Yang, K.; Zhou, X.; Ma, X.; Chen, D.; Chen, F.; Dai, Y.; Jiang, Y.; Huang, A.; Lin, Y.; Liu, J.; et al. A Physically-Refined Regional Climate Model for the Tibetan Plateau. Sci. Bull. 2025, 70, 4070–4079.
18. Dong, P.; Jiang, X.; Zhao, X.; Dong, Y.; Zheng, J.; Hu, C.; Gao, G.; Liu, L.; Li, S.; Bu, L. Vertical Profiles of Raindrop Size Distribution Parameters of Summer Rainfall in the Eastern Tibetan Plateau: Retrieval Method and Characteristics. Atmos. Meas. Tech. 2026, 19, 1407–1419.
19. Lyu, Y.; Yong, B. A Novel Double Machine Learning Strategy for Producing High-Precision Multi-Source Merging Precipitation Estimates over the Tibetan Plateau. Water Resour. Res. 2024, 60, e2023WR035643.
20. Ning, C.; Wang, Y.; Nan, Z.; Chen, H.; Liu, C. Study on Correction of Daily Precipitation Data of the Qinghai-Tibetan Plateau with Machine Learning Models. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium, IGARSS 2016 - Proceedings; IEEE: New York, NY, USA, 2016; Volume 2016, pp. 517–520.
21. Dong, J.; Zeng, W.; Wu, L.; Huang, J.; Gaiser, T.; Srivastava, A.K. Enhancing Short-Term Forecasting of Daily Precipitation Using Numerical Weather Prediction Bias Correcting with XGBoost in Different Regions of China. Eng. Appl. Artif. Intell. 2023, 117, 105579.
22. Grecco Sanches, R.; Sanches Miani, R.; César dos Santos, B.; Martins Moreira, R.; Zen de Figueiredo Neves, G.; Bourscheidt, V.; Augusto Toledo Rios, P. Using Xgboost Models for Daily Rainfall Prediction. An. De Geogr. De La Univ. Complut. 2025, 45, 75–92.
23. Kumar, V.; Kedam, N.; Sharma, K.V.; Khedher, K.M.; Alluqmani, A.E. A Comparison of Machine Learning Models for Predicting Rainfall in Urban Metropolitan Cities. Sustainability 2023, 15, 13724.
24. Mai, X.; Zhong, H.; Li, L. Research on Rain or Shine Weather Forecast in Precipitation Nowcasting Based on XGBoost. In Advances in Natural Computation, Fuzzy Systems and Knowledge Discovery; Meng, H., Wang, L., Xiong, N., Lei, T., Li, M., Li, K., Eds.; Springer International Publishing AG: Cham, Switzerland, 2021; Volume 88, pp. 1313–1319.
25. Han, Y.; Jin, W.; Liu, H.; Wang, W.; Ma, J.; Zhao, W. Optimization of Ecological Restoration Efficiency in Qinghai-Tibet Plateau Using the Cubist Regression Tree Model: A Study of Environmental Adaptability Models. PLoS ONE 2025, 20, e0335056.
26. Dong, N.; Hao, H.; Yang, M.; Wei, J.; Xu, S.; Kunstmann, H. Deep-Learning-Based Sub-Seasonal Precipitation and Streamflow Ensemble Forecasting over the Source Region of the Yangtze River. Hydrol. Earth Syst. Sci. 2025, 29, 2023–2042.
27. Lundberg, S.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. arXiv 2017, arXiv:1705.07874.
28. Wang, F.; Wang, X.; Li, S. Explainable Machine Learning for Predictive Modeling of Blowing Snow Detection and Meteorological Feature Assessment Using XGBoost-SHAP. PLoS ONE 2025, 20, e0318835.
29. Qian, Q.; Jia, X. Seasonal Forecast of Winter Precipitation over China Using Machine Learning Models. Atmos. Res. 2023, 294, 106961.
30. Li, Y.; Che, T.; Dai, L.; Wu, A.; Wang, J. Investigating the spatiotemporal behavior of VIC model parameters over the Tibetan plateau via global sensitivity analysis and machine learning. Int. J. Digit. Earth 2026, 19, 2625537.
31. Poletaev, A.; Liu, B.; Li, L.; Avagyan, V.; Voskanyan, V. Improve the Prediction in the Digital Era: Causal Feature Selection with Minimum Redundancy. J. Digit. Econ. 2024, 3, 14–36.
32. Noardo, F. Multisource Spatial Data Integration for Use Cases Applications. Trans. GIS 2022, 26, 2874–2913.
33. Evans, J.D. Straightforward Statistics for the Behavioral Sciences; Brooks/Cole Pub. Co.: Pacific Grove, CA, USA, 1996.
34. Cohen, J. Statistical Power Analysis for the Behavioral Sciences, 2nd ed.; Lawrence Erlbaum Associates: Hillsdale, NJ, USA, 1988.
35. Saputra, A.; Ramli, K.; Nugroho, A.S. Performance with Feature Selection Based Machine Learning: Combination of Low Variance Filter Pearson Correlation for Bots and Brute Force. In Proceedings of the 2025 International Conference on Computer Sciences, Engineering, and Technology Innovation (ICoCSETI), Jakarta, Indonesia, 21 January 2025; IEEE: New York, NY, USA, 2025; pp. 933–938.
36. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM: New York, NY, USA, 2016; Volume 13–17, pp. 785–794.
37. Qian, Q.; Jia, X.; Lin, H.; Zhang, R. Seasonal Forecast of Nonmonsoonal Winter Precipitation over the Eurasian Continent Using Machine-Learning Models. J. Clim. 2021, 34, 7113–7129.
38. Ali, S.; Khorrami, B.; Jehanzaib, M.; Tariq, A.; Ajmal, M.; Arshad, A.; Shafeeque, M.; Dilawar, A.; Basit, I.; Zhang, L.; et al. Spatial Downscaling of GRACE Data Based on XGBoost Model for Improved Understanding of Hydrological Droughts in the Indus Basin Irrigation System (IBIS). Remote Sens. 2023, 15, 873.
39. Yue, F.; Wang, X.; Ai, R.; Wu, Y.; Li, Q.; Feng, G. Predicting Summer Precipitation in China: A Hybrid Downscaling Model Using the XGBoost Method. Int. J. Climatol. 2025, 45, e70064.
40. Lei, H.; Li, H.; Zhao, H. Refining Daily Precipitation Estimates Using Machine Learning and Multi-Source Data in Alpine Regions with Unevenly Distributed Gauges. J. Hydrol. Reg. Stud. 2025, 58, 102272.
41. Schütz, M.; Schütz, A.; Bendix, J.; Thies, J.M.u.B. Evaluating Station, Satellite, & Combined Data for XGBoost-Based Visibility Forecast. Atmos. Res. 2026, 328, 108395.
42. Wang, M.; Li, Y.; Yuan, H.; Zhou, S.; Wang, Y.; Adnan Ikram, R.M.; Li, J. An XGBoost-SHAP Approach to Quantifying Morphological Impact on Urban Flooding Susceptibility. Ecol. Indic. 2023, 156, 111137.
43. Zhao, C.; Lin, Z.; Yang, L.; Jiang, M.; Qiu, Z.; Wang, S.; Gu, Y.; Ye, W.; Pan, Y.; Zhang, Y.; et al. A Study on the Impact of Meteorological and Emission Factors on PM2.5 Concentrations Based on Machine Learning. J. Environ. Manag. 2025, 376, 124347.
44. Lei, K.; Zhang, L.; Gao, L. Evaluation of IMERG Precipitation Product Downscaling Using Nine Machine Learning Algorithms in the Qinghai Lake Basin. Water 2025, 17, 1776.
45. Liu, S.; Wang, J.; Shi, F.; Zhuo, P.; Ao, T. Research on Multi-Source Precipitation Fusion Based on Classification and Regression Machine Learning Methods: A Case Study of the Min River Basin in the Eastern Source of the Qinghai-Tibet Plateau. Remote Sens. 2025, 17, 3982.

Figure 1. Geographic location of the study area and surrounding topography: The red dot indicates the location of Maqu Station situated in the eastern Tibetan Plateau, with background shading representing the digital elevation model (DEM) in meters (m).

Figure 2. Time–series scatterplot of atmospheric and meteorological parameters from 2020 to 2022: (a) Integral Water Vapor Content (IWV) and Liquid Water Path (LWP); (b) Atmospheric Temperature and Surface Temperature; (c) One–Day Precipitation and Dew Point Temperature (DPT).

Figure 3. Vertical profiles of liquid water density (a–c), water vapor density (d–f), and relative humidity (g–i). The three columns from left to right represent the annual average, summer, and winter, respectively.

Figure 4. Heatmap of correlation with 24-h precipitation. Features with correlation coefficients less than 0.2 are omitted and denoted by ellipses due to their large quantity.

Figure 5. Evaluation of annual precipitation forecasts and feature attributions: (a) scatter plot of predicted versus actual annual precipitation, with colors indicating point density, complemented by the 1:1 reference line and the 95% prediction interval (PI); (b) baseline model comparison using Random Forest predictions; (c) feature importance distribution derived from mean absolute SHAP values, showing the potential influence of meteorological variables on the model output.

Figure 6. Seasonal precipitation prediction performance: (a–d) Scatter plots of predicted versus actual precipitation for each season, with data point density indicated by color and showing the 1:1 line and 95% CI. (e) Taylor diagram showcasing model performance statistics, with heavy precipitation events marked by triangles.

Figure 7. Seasonal SHAP-based feature importance in precipitation prediction: spring (a), summer (b), autumn (c), winter (d).

Figure 8. Seasonal SHAP effect distributions of features in precipitation prediction: (a) spring, (b) summer, (c) autumn, (d) winter.

Table 1. Main variables involved in this study.

Data Type	Level & Temporal Resolution	Variables
Daily post-processed single-level statistics	Hourly (0.25° × 0.25°)	10 m u-component of wind (u10), 10 m v-component of wind (v10), 2 m dewpoint temperature (d2m), 2 m temperature (t2m), Surface pressure (sp), Total precipitation (tp), Skin temperature (skt), 100 m u-component of wind (u100), 100 m v-component of wind (v100), Surface latent heat flux (slhf), Surface net solar radiation (ssr), Surface net thermal radiation (str), Surface sensible heat flux (sshf), Downward surface solar radiation (ssrd), Downward surface thermal radiation (strd), Cloud base height (cbh), total cloud cover (tcc), Evaporation (e), Potential evaporation (pev), Runoff (ro), Leaf area index (high vegetation) (lai_hv), Leaf area index (low vegetation) (lai_lv)
Single layer hour-by-hour	500 hPa & 700 hPa Hourly (0.25° × 0.25°)	Divergence (d500; d700), Fraction of cloud cover (cc500; cc700), Geopotential (z500; z700), Ozone mass mixing ratio (o3500; o3700), Potential vorticity (pv500; pv700), Relative humidity (r500; r700), Specific cloud ice water content (ciwc500; ciwc700), Specific cloud liquid water content (clwc500; clwc700), Specific humidity (q500; q700), Specific rain water content (crwc500; crwc700), Specific snow water content (cswc500; cswc700), Temperature (t500; t700), u-component of wind (u500; u700), v-component of wind (v500; v700), Vertical velocity (w500; w700), Vorticity (relative) (vo500; vo700)
Microwave Radiometer (MWR, MP3000A)	Hourly	Relative humidity (RH0; RH1…), water vapor density (wvd0; wvd1…), liquid water density (lwd0; lwd1…), Integral Water Vapor Content (IWV), Liquid Water Path (LWP)
Surface Meteorological Observations	Hourly	Temperature (T), surface temperature (T_surf), relative humidity (RH), surface atmospheric pressure (P), wind speed (Wspd), wind direction (Wdir), dew point temperature (DPT), precipitation (PRE_24h)

Table 2. Final Optimized Hyperparameter Configuration for the Model.

Category	Hyperparameter	Optimal Value
Boosting Control	learning_rate	0.1
	n_estimators	1000
Tree Structure	max_depth	9
	min_child_weight	5
Regularization	reg_alpha	0.5
	reg_lambda	1.0
Stochastic Sampling	subsample	0.8
	colsample_bytree	0.8

Table 3. Performance comparison among the unweighted XGBoost, weighted XGBoost, and Random Forest (RF) baseline models.

Category	Metric	Unweighted Model (XGB-SHAP)	Weighted Model (XGB-SHAP) (This Study)	Baseline Model (RF)
Overall	R²	0.863	0.872	0.787
	MAE (mm)	0.692	0.691	1.051
	RMSE (mm)	1.660	1.609	2.073
	FAR	0.0874	0.0890	0.0906
Heavy (>5 mm)	R²	0.686	0.727	0.529
	MAE (mm)	3.093	2.942	4.436
	RMSE (mm)	4.783	4.470	5.861
	Bias (mm)	−2.8092	−2.7939	−4.383

Note: Bold values indicate the best performance for each evaluation metric within the respective category (Overall and Heavy precipitation).

Table 4. Seasonal predictive performance metrics for the XGBoost and Random Forest (RF) models.

Metric	Spr (XGB)	Spr (RF)	Sum (XGB)	Sum (RF)	Aut (XGB)	Aut (RF)	Win (XGB)	Win (RF)
R²	0.85	0.83	0.80	0.79	0.88	0.72	0.85	0.77
RMSE (mm)	0.75	1.73	3.26	2.51	1.17	1.99	0.32	1.95
MAE (mm)	0.44	0.91	1.67	1.18	0.60	1.04	0.11	1.06
Bias (mm)	−0.04	0.03	−0.07	−0.18	−0.01	−0.14	−0.02	−0.06

Note: Spr, Sum, Aut, and Win denote Spring, Summer, Autumn, and Winter, respectively. XGB refers to the primary framework analyzed in this study, and RF denotes the baseline random forest configuration. The metrics (R², RMSE, MAE, and Bias) are derived based on the corresponding seasonal validation subsets, consistent with the distributions illustrated in Figure 6. Bold values indicate the best performance for each evaluation metric in each season.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhao, D.; Zhang, S.; Liu, G.; Pan, X.; Wang, T.; Ding, H.; Sang, W.; Ma, Y. Precipitation Prediction and Factor Interpretation at Maqu Station in the Eastern Qinghai-Tibet Plateau Based on XGBoost-SHAP. Water 2026, 18, 1355. https://doi.org/10.3390/w18111355

