Impact of High Temporal Resolution Data on Water Quality Modeling: Insights from Erhai Case Study

Shi, Xiaomeng; Li, Yu; Yao, Bo; Wang, Shengrui; Ni, Shouqing

doi:10.3390/pr13061726


by Xiaomeng Shi ¹,², Yu Li ¹,²,*, Bo Yao ³, Shengrui Wang ¹,² and Shouqing Ni ⁴

¹ College of Water Sciences, Beijing Normal University, Beijing 100875, China

² Guangdong-Hong Kong Joint Laboratory for Water Security, Center for Water Research, Advanced Institute of Natural Sciences, Beijing Normal University at Zhuhai, Zhuhai 519087, China

³ Key Laboratory for Mechanics in Fluid Solid Coupling Systems, Institute of Mechanics, Chinese Academy of Sciences, Beijing 100190, China

⁴ School of Environmental Science and Engineering, Shandong University, Jinan 250100, China

* Author to whom correspondence should be addressed.

Processes 2025, 13(6), 1726; https://doi.org/10.3390/pr13061726

Submission received: 24 April 2025 / Revised: 21 May 2025 / Accepted: 28 May 2025 / Published: 31 May 2025

(This article belongs to the Section AI-Enabled Process Engineering)


Abstract

Lake monitoring is essential for sustaining aquatic ecosystems, and accurate estimation and prediction of water quality parameters is crucial to this effort. Despite its importance, the performance of predictive models built on data of varying temporal resolution has not been systematically explored. This study used daily and 4 h high temporal resolution (HTR) datasets to assess the performance of multiple machine learning models, namely Support Vector Regression (SVR), Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Long Short-Term Memory (LSTM) networks, under consistent data scales. The results indicate that dissolved oxygen (DO) exhibits pronounced sensitivity to temporal resolution, while total nitrogen (TN), total phosphorus (TP), and ammonia nitrogen (NH₃-N) show distinct, parameter-specific response patterns that align with the temporal characteristics of their underlying biogeochemical processes. This research deepens the understanding of how temporal data resolution influences model performance in water quality prediction, offering insights for selecting data resolutions and modeling techniques to enhance lake monitoring and protection strategies.

Keywords: high-frequency monitoring; water quality prediction; temporal resolution sensitivity; machine learning; wavelet analysis

1. Introduction

The protection of aquatic environments has become an urgent global priority, as intensified human activities and accelerating climate change continue to increase the frequency and severity of water pollution incidents [1,2]. Studies have shown that the implementation of water treatment strategies could potentially alleviate approximately 20% of global economic water scarcity risks, which surged from USD 116 billion to USD 380 billion between 1995 and 2010; during the same period, risks specifically related to water quality increased from 20% to 30% [3]. The resulting influx of contaminants and heightened hydrological variability threaten the ecological integrity of freshwater systems and undermine essential ecosystem services. In this context, effective water quality monitoring is essential for the timely detection of pollution events, assessment of environmental trends, and informed decision-making.

However, the intrinsic fluctuations of key water parameters—arising from diverse biogeochemical processes such as nutrient cycling, phytoplankton dynamics, and contaminant transport—occur across a wide range of temporal scales [4,5]. While traditional methods such as biomonitoring techniques offer valuable ecological insights by capturing cumulative impacts and subtle environmental changes through biological responses [6], they often struggle to provide real-time data for rapid pollution events and can be labor-intensive. Similarly, advanced tools like single-molecule detection techniques and Atomic Force Microscopy (AFM), while providing highly detailed physicochemical information about individual aquatic microorganisms and biomolecules [7], primarily focus on highly specific, discrete analyses rather than broad-spectrum, continuous monitoring of multiple water quality parameters. These dynamic processes require monitoring strategies with frequencies tailored to their specific temporal characteristics to accurately capture both rapid events and gradual changes. Such considerations highlight the necessity for adaptive, high-resolution monitoring frameworks that can address the growing complexity of aquatic environments and support sustainable water resource management.

Traditional water quality monitoring has primarily relied on field sampling and laboratory analysis, which typically produce low-frequency datasets at daily, weekly, or longer intervals [8,9,10]. While such approaches are useful for identifying baseline conditions and long-term trends, they often fail to capture the sudden, short-duration, and high-impact pollution events that increasingly threaten aquatic systems, especially in regions with complex environmental dynamics and multiple pollution sources [11]. To address these shortcomings, high-frequency monitoring technologies, such as multi-parameter ultra-spectral sensors, automated in situ platforms, and unmanned aerial vehicles (UAVs), have emerged as powerful alternatives, providing near-continuous measurements of key water quality parameters on timescales of minutes to hours [12,13]. Moreover, the wealth of high-resolution data generated by these modern systems aligns well with recent advancements in machine learning and other data-driven methods, which excel at uncovering complex patterns, identifying anomalies, and forecasting future water quality states [14,15,16]. For example, Zhang et al. [17] developed attention mechanism-based deep learning models using data sampled at 2 h intervals for wastewater treatment plant monitoring, which consistently outperformed baseline models in predicting parameters such as total nitrogen, total phosphorus, and chemical oxygen demand. Rodriguez-Perez et al. [18] developed a method for detecting technical anomalies in high-frequency water quality data using artificial neural networks (ANNs). Using turbidity and conductivity data collected by automated in situ sensors in riverine and estuarine environments, they achieved effective detection of sudden and long-term anomalies through semi-supervised and supervised classification, with models calibrated via Bayesian multi-objective optimization. Huang et al. [19] developed a Self-Attentive LSTM (SA-LSTM) model integrated with LOADEST, using weekly/monthly sampled data from Dongting Lake, achieving Nash–Sutcliffe Efficiency (NSE) scores of 0.71 for CODMn and 0.57 for NH₃-N, and reducing RMSE by 20–30% compared to standalone machine learning models.

While high-resolution water quality data offer significant potential to advance the efforts of data-driven models, their use might introduce additional challenges as well. For example, high-frequency datasets are not only more susceptible to noise, but also introduce substantial computational complexity [20]. Furthermore, the acquisition of long-term, continuous high-frequency datasets remains difficult, and even when available, these datasets demand rigorous quality control due to increased variability from frequent sampling. Systematic studies exploring the trade-offs between temporal data resolution and predictive accuracy remain scarce, especially given the substantial resources required for deploying and maintaining high-frequency monitoring networks. The prevailing assumption that simply increasing sampling frequency leads to better model performance has left the true value and limitations of high-frequency data underexplored [21], highlighting the critical need for systematic investigations into how temporal resolution influences predictive modeling in water quality monitoring.

This study aims to fill such a research gap by investigating the impact of temporal resolution on water quality prediction model performance. Using Erhai Lake as a pilot region, we systematically evaluated four representative machine learning models (SVR, RF, XGBoost, and LSTM) across two temporal resolutions (4-hourly and daily) for four critical water quality parameters, namely, total nitrogen, total phosphorus, ammonia nitrogen, and dissolved oxygen. This multifaceted analytical approach provides insights into the complex relationships between temporal resolution, model architecture, and water quality parameter characteristics, contributing to more efficient and effective monitoring and prediction systems. Our main contributions include characterizing the statistical differences between daily average and high temporal resolution water quality data and quantitatively evaluating the performance of popular machine learning models across different temporal resolutions for various water quality parameters. The results provide a scientific basis for optimizing monitoring strategies and model selection in monitoring and managing the water environment.

The remainder of this study is organized as follows: Section 2 introduces the materials, methods, and experimental design employed in this research. Section 3 presents the results regarding the data characteristics and model performance, and Section 4 further discusses insights behind the findings. The final conclusions are then presented in Section 5.

2. Materials and Methods

2.1. Study Area

Erhai Lake (see Figure 1), located in the northwestern part of Yunnan Province, Southwestern China, lies at an elevation of 1972 m above sea level. As one of the country’s largest freshwater lakes, it spans an area of approximately 249.8 km² and has an average depth of 10.2 m. Although there have been efforts to improve and protect the lake, it continues to suffer from eutrophication, pollution, and habitat degradation. These issues are primarily driven by rapid urban growth, tourism, and the intensification of agriculture. The water quality of the lake has fluctuated between Class II and Class III standards [22]. Significant algal blooms have occurred since 1996, with particularly severe episodes in 2003–2004 and again after 2013.

2.2. Datasets

We collected 4 h water quality data from the Erhai Huxin monitoring site, a national key water quality assessment section located in Erhai Lake. This site was the sole monitoring location that provided consistent high temporal resolution data for a wide spectrum of water quality parameters over the study period. We selected ten key parameters in freshwater systems as input features: water temperature (WT), pH, dissolved oxygen (DO), permanganate index (IMn), electrical conductivity (EC), turbidity (NTU), ammonia nitrogen (NH₃-N), total nitrogen (TN), total phosphorus (TP), and chlorophyll a (Chla). After excluding outliers and filling in a small amount of missing data, we obtained a dataset of 7994 records covering the period from 5 March 2020 to 25 December 2023.

2.3. Models and Evaluation Metrics

This section introduces the predictive models adopted in the study and the evaluation metrics used to quantify their performance.

2.3.1. Machine Learning Models for Water Quality Predictions

(1): Support Vector Regression (SVR)

Support Vector Regression (SVR) [23] is an extension of Support Vector Machines (SVMs) for regression tasks. SVR aims to find a function that deviates from the training data by a value no greater than a predetermined margin ε, while remaining as flat as possible. An SVR optimization problem can be formulated as follows:

\min_{w, b, ξ, ξ^{*}} \frac{1}{2} {‖w‖}^{2} + C \sum_{i = 1}^{n} (ξ_{i} + ξ_{i}^{*})

(1)

Subject to the following:

y_{i} - w^{T} ϕ (x_{i}) - b \leq ε + ξ_{i}

(2)

w^{T} ϕ (x_{i}) + b - y_{i} \leq ε + ξ_{i}^{*}

(3)

ξ_{i}, ξ_{i}^{*} \geq 0, i = 1, \dots, n

(4)

where w is the weight vector, b is the bias term, ξ_{i} and ξ_{i}^{*} are slack variables, C is the regularization parameter, and ϕ (x) is the feature map associated with the kernel function, which maps the input features to a higher-dimensional space. SVR is effective for handling non-linear relationships in data with moderate dimensionality.
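As a small illustration of Equations (1)–(4), the sketch below computes the slack variables implied by an ε-tube around a set of predictions and then evaluates the objective. The function names, the weight vector, and the numeric values are hypothetical and chosen only for illustration; this is not the solver or configuration used in the study.

```python
def epsilon_insensitive_slacks(y_true, y_pred, epsilon):
    """Return the smallest slack pairs (xi, xi_star) satisfying constraints (2)-(4)."""
    slacks = []
    for y, f in zip(y_true, y_pred):
        xi = max(0.0, y - f - epsilon)        # observation lies above the tube
        xi_star = max(0.0, f - y - epsilon)   # observation lies below the tube
        slacks.append((xi, xi_star))
    return slacks


def svr_objective(w, slacks, C):
    """Evaluate Eq. (1): 0.5 * ||w||^2 + C * sum(xi + xi_star)."""
    flatness = 0.5 * sum(wj * wj for wj in w)
    penalty = C * sum(xi + xi_star for xi, xi_star in slacks)
    return flatness + penalty


# Illustrative values only (not taken from the Erhai dataset).
y_true = [7.8, 7.5, 8.2]
y_pred = [7.7, 7.9, 8.2]
slacks = epsilon_insensitive_slacks(y_true, y_pred, epsilon=0.2)
objective = svr_objective(w=[0.5, -0.5], slacks=slacks, C=10.0)
```

Only the second point falls outside the tube, so it alone contributes to the penalty term; points inside the tube cost nothing, which is what allows the fitted function to stay flat.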

(2): Random Forest (RF)

Random Forest [24] is an ensemble learning method that constructs multiple decision trees during training and outputs the average prediction of the individual trees for regression tasks. Each tree is built from a bootstrap sample of the training data, and at each node, a random subset of features is considered for splitting. The prediction function can be expressed as follows:

f_{R F} (x) = \frac{1}{B} \sum_{b = 1}^{B} T_{b} (x)

(5)

where B is the number of trees in the forest and T_{b} (x) is the prediction of the b-th tree. Random Forest reduces overfitting through its ensemble nature and feature randomization, making it robust for environmental modeling where data may contain complex interactions and noise.
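The averaging in Equation (5) can be sketched with a deliberately simplified forest: each "tree" below is a one-split regression stump fitted to a bootstrap sample of a single feature. Feature randomization is omitted because there is only one feature, and all names and data are hypothetical; the sketch is meant to show bootstrap-and-average, not to reproduce the study's models.

```python
import random


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree (a single split) on one feature."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:                       # no valid split: predict the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_forest(xs, ys, n_trees, seed=0):
    """Fit n_trees stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def predict_forest(trees, x):
    """Eq. (5): average of the B individual tree predictions."""
    return sum(tree(x) for tree in trees) / len(trees)


# Synthetic step-like data, for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0.30, 0.31, 0.29, 0.30, 0.40, 0.41, 0.39, 0.40]
forest = fit_forest(xs, ys, n_trees=25)
low, high = predict_forest(forest, 2), predict_forest(forest, 7)
```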

(3): Extreme Gradient Boosting (XGBoost)

XGBoost [25] is an advanced implementation of gradient boosting that sequentially builds decision trees, with each new tree trained to correct the errors of the ensemble of existing trees. The objective function being optimized in XGBoost is as follows:

L (ϕ) = \sum_{i = 1}^{n} l (y_{i}, \hat{y_{i}}) + \sum_{k = 1}^{K} Ω (f_{k})

(6)

where l is a differentiable convex loss function, \hat{y_{i}} is the prediction, and Ω (f) = γ T + \frac{1}{2} λ {‖w‖}^{2} is a regularization term that penalizes the complexity of the model, with T the number of leaves in a tree and w its vector of leaf weights. XGBoost incorporates sophisticated regularization techniques and a system for handling sparse data, making it highly effective for water quality prediction where relationships may be complex and data often contain missing values.
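To make the two terms of Equation (6) concrete, the following sketch evaluates the objective for a toy ensemble in which each tree is represented only by its list of leaf weights. Squared error is assumed as the loss l, and the values of γ and λ are arbitrary; none of these choices are taken from the study.

```python
def tree_complexity(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lambda * ||w||^2, with T the number of leaves."""
    n_leaves = len(leaf_weights)
    return gamma * n_leaves + 0.5 * lam * sum(w * w for w in leaf_weights)


def regularized_objective(y_true, y_pred, trees, gamma, lam):
    """Eq. (6), using squared error as the (assumed) convex loss l."""
    loss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    penalty = sum(tree_complexity(w, gamma, lam) for w in trees)
    return loss + penalty


# Two hypothetical trees with 2 and 3 leaves, respectively.
trees = [[0.5, -0.5], [0.1, 0.2, -0.3]]
value = regularized_objective([1.0, 2.0], [0.5, 2.5], trees, gamma=1.0, lam=2.0)
```

Here the data-fit term contributes 0.5 and the two complexity terms contribute 2.5 and 3.14, so a tree with more leaves or larger leaf weights must reduce the loss by more than its penalty to be worth adding.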

(4): Long Short-Term Memory (LSTM)

LSTM [25] is a specialized recurrent neural network architecture designed to learn long-term dependencies in sequential data. Unlike traditional neural networks, LSTM contains memory cells with gates that regulate the flow of information, allowing the network to capture temporal patterns across different timescales. The core equations of an LSTM cell are as follows:

f_{t} = σ (W_{f} \cdot [h_{t - 1}, x_{t}] + b_{f}), \quad i_{t} = σ (W_{i} \cdot [h_{t - 1}, x_{t}] + b_{i})

(7)

\tilde{C}_{t} = \tanh (W_{C} \cdot [h_{t - 1}, x_{t}] + b_{C})

(8)

C_{t} = f_{t} \times C_{t - 1} + i_{t} \times \tilde{C}_{t}

(9)

o_{t} = σ (W_{o} \cdot [h_{t - 1}, x_{t}] + b_{o})

(10)

h_{t} = o_{t} \times \tanh (C_{t})

(11)

where f_{t}, i_{t}, and o_{t} are the forget, input, and output gates, respectively, C_{t} is the cell state, h_{t} is the hidden state, W and b are the weight matrices and bias vectors, and σ is the sigmoid function. LSTM networks are suitable for water quality prediction as they capture complex temporal dependencies in environmental time series data, including seasonal patterns and irregular fluctuations.
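A single time step of the cell defined by Equations (7)–(11) can be written out in plain Python as below. The layer sizes, the all-zero weights, and the input values are placeholders chosen so the arithmetic is easy to follow; they are not the architecture or parameters used in the study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Compute W . v + b for a matrix W stored as a list of rows."""
    return [sum(wij * vj for wij, vj in zip(row, v)) + bi for row, bi in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following Eqs. (7)-(11)."""
    z = h_prev + x_t                                   # concatenation [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(params["W_f"], z, params["b_f"])]
    i_t = [sigmoid(a) for a in affine(params["W_i"], z, params["b_i"])]
    c_tilde = [math.tanh(a) for a in affine(params["W_C"], z, params["b_C"])]
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    o_t = [sigmoid(a) for a in affine(params["W_o"], z, params["b_o"])]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Placeholder parameters: 2 hidden units, 3 input features, zero weights.
hidden, n_inputs = 2, 3
zeros_W = [[0.0] * (hidden + n_inputs) for _ in range(hidden)]
params = {name: zeros_W for name in ("W_f", "W_i", "W_C", "W_o")}
params.update({"b_f": [0.0] * hidden, "b_i": [0.0] * hidden,
               "b_C": [1.0] * hidden, "b_o": [0.0] * hidden})

h_t, c_t = lstm_step([0.2, 0.5, 0.1], [0.0, 0.0], [1.0, -1.0], params)
```

With zero weights every gate evaluates to 0.5, so the new cell state is simply the average of the previous cell state and the candidate tanh(1), which makes the roles of the forget and input gates in Equation (9) easy to verify by hand.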

2.3.2. Wavelet Component Analysis

Wavelet Component Analysis [26] was performed using the discrete wavelet transform (DWT) with the Daubechies 4 (db4) mother wavelet and a decomposition level of 4. This method decomposes signals into approximation and detail coefficients, enabling multi-scale feature extraction. The db4 wavelet was selected for its balance between smoothness and compact support, while level 4 decomposition was chosen to balance noise reduction and frequency localization.
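The decomposition described above can be sketched in a few lines of standard-library Python. The filter below is the 8-tap db4 low-pass filter as commonly tabulated; the periodic (wrap-around) boundary handling, the function names, and the synthetic series are assumptions made for this illustration, since the text does not state which software or boundary mode was used.

```python
import math

# db4 low-pass analysis filter (8 taps), as commonly tabulated.
DB4_LO = [
    -0.010597401784997278, 0.032883011666982945, 0.030841381835986965,
    -0.18703481171888114, -0.02798376941698385, 0.6308807679295904,
    0.7148465705525415, 0.23037781330885523,
]
# High-pass filter from the quadrature-mirror relation g[n] = (-1)^n h[L-1-n].
DB4_HI = [((-1) ** n) * DB4_LO[len(DB4_LO) - 1 - n] for n in range(len(DB4_LO))]


def dwt_step(signal):
    """One DWT level with periodic boundary handling (an assumption)."""
    n = len(signal)
    approx, detail = [], []
    for k in range(n // 2):
        window = [signal[(2 * k + m) % n] for m in range(len(DB4_LO))]
        approx.append(sum(h * s for h, s in zip(DB4_LO, window)))
        detail.append(sum(g * s for g, s in zip(DB4_HI, window)))
    return approx, detail


def wavedec(signal, level=4):
    """Return [A_level, D_level, ..., D_1] for a signal of even length."""
    details = []
    approx = list(signal)
    for _ in range(level):
        approx, detail = dwt_step(approx)
        details.append(detail)
    return [approx] + details[::-1]


# Synthetic series: slow trend plus a 6-sample cycle
# (6 samples correspond to one day at 4 h resolution).
series = [0.01 * t + math.sin(2 * math.pi * t / 6) for t in range(128)]
coeffs = wavedec(series, level=4)
energy_in = sum(v * v for v in series)
energy_out = sum(v * v for band in coeffs for v in band)
```

Because the db4 filters are orthogonal, the total energy of the coefficients matches that of the input, so the share of energy in each detail band indicates how much of the variability sits at each timescale.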

2.3.3. SHapley Additive exPlanations (SHAP)

SHapley Additive exPlanations (SHAP) is a unified framework for interpreting machine learning models by quantifying the contribution of each input feature to the model’s output [27,28]. Grounded in cooperative game theory, SHAP values represent the average marginal contribution of a feature across all possible coalitions.
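The phrase "average marginal contribution across all possible coalitions" can be made concrete by computing exact Shapley values for a tiny model through brute-force enumeration. This is not the SHAP library's algorithm, which relies on faster model-specific or sampling-based estimators; the toy model, the baseline-replacement value function, and the feature values are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating every coalition of features.

    Features outside a coalition are set to their baseline value, which is
    one common (assumed) way to define the coalition value function.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[j] if j in coalition else baseline[j] for j in range(n)]
        return model(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                gain = value(set(subset) | {i}) - value(set(subset))
                phi[i] += weight * gain
    return phi


def toy_model(features):
    """Hypothetical response with one interaction term (not the paper's model)."""
    wt, do, ph = features
    return 0.5 * wt - 0.2 * do + 0.1 * wt * ph


x = [20.0, 7.5, 8.5]           # instance to explain (illustrative values)
baseline = [17.0, 7.8, 8.6]    # reference point (illustrative values)
phi = shapley_values(toy_model, x, baseline)
```

The contributions sum to the difference between the model output at x and at the baseline, and the purely additive feature receives exactly its own term, while the interaction between the other two features is shared between them.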

2.3.4. Evaluation Metrics

The following four metrics were used to evaluate model performance, with each highlighting a different aspect of prediction quality, including accuracy, variability, and reliability.

(1): MAPE—Mean Absolute Percentage Error

MAPE = \frac{100\%}{n} \sum_{i = 1}^{n} \left| \frac{y_{i} - \hat{y_{i}}}{y_{i}} \right|

(12)

MAPE measures the average absolute percentage difference between predicted values (\hat{y_{i}}) and actual values (y_{i}). It expresses the prediction error as a percentage, which makes it easy to interpret across different scales.

(2): rRMSE—Relative Root Mean Square Error

rRMSE = \frac{\sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - \hat{y_{i}})}^{2}}}{\bar{y}} \times 100\%

(13)

Relative RMSE is the Root Mean Square Error normalized by the mean of the observed values (\bar{y}). It expresses the RMSE as a percentage of the mean, helping to compare performance across different datasets or models.

(3): R²—Coefficient of Determination

R^{2} = 1 - \frac{\sum_{i = 1}^{n} {(y_{i} - \hat{y_{i}})}^{2}}{\sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}}

(14)

R² indicates how well the model explains the variability of the observed data. A value of 1 means perfect prediction, while zero means that the model performs no better than the mean of the observed values.
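Equations (12)–(14) translate directly into code. The sketch below is a plain-Python rendering with made-up observations and predictions; the function names are chosen here for illustration.

```python
import math


def mape(obs, pred):
    """Eq. (12): mean absolute percentage error, in percent."""
    return 100.0 / len(obs) * sum(abs((o - p) / o) for o, p in zip(obs, pred))


def rrmse(obs, pred):
    """Eq. (13): RMSE divided by the mean observation, in percent."""
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))
    return rmse / (sum(obs) / len(obs)) * 100.0


def r_squared(obs, pred):
    """Eq. (14): 1 - SS_res / SS_tot."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


# Illustrative values only (units of mg/L).
obs = [8.0, 7.5, 7.0, 8.5]
pred = [7.8, 7.6, 7.2, 8.1]
scores = {"MAPE": mape(obs, pred), "rRMSE": rrmse(obs, pred), "R2": r_squared(obs, pred)}
```

Note that Equation (14) is not bounded below by zero: whenever the residual sum of squares exceeds the total sum of squares, R² becomes negative, meaning the predictions are worse than simply using the observed mean.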

(4): Kling–Gupta efficiency

The modified Kling–Gupta efficiency (KGE) [29] and its three components (β, α, ρ) were used to evaluate model performance [30,31]. The theoretical version of the modified KGE metric is

K G E = 1 - \sqrt{{(β - 1)}^{2} + {(α - 1)}^{2} + {(ρ - 1)}^{2}}

(15)

β = \frac{\bar{S}}{\bar{O}}

(16)

α = \frac{σ_{S}}{σ_{O}}

(17)

ρ = \frac{Cov (S, O)}{σ_{S} σ_{O}}

(18)

where β (Bias) measures the ratio of the mean of the simulated values (\bar{S}) to the mean of the observed values (\bar{O}); α (Variability) measures the ratio of the standard deviation of the simulated values (σ_{S}) to the standard deviation of the observed values (σ_{O}); and ρ (Correlation) measures the Pearson correlation coefficient between the simulated values (S) and the observed values (O).
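The KGE and its components can be computed as in the following sketch, which follows Equations (15)–(18) exactly as written above, with α defined as a ratio of standard deviations. Population standard deviations are used here as an assumption; the function name and sample data are illustrative only.

```python
import math


def kge_components(sim, obs):
    """Return (KGE, beta, alpha, rho) following Eqs. (15)-(18)."""
    n = len(obs)
    mean_s, mean_o = sum(sim) / n, sum(obs) / n
    std_s = math.sqrt(sum((s - mean_s) ** 2 for s in sim) / n)
    std_o = math.sqrt(sum((o - mean_o) ** 2 for o in obs) / n)
    cov = sum((s - mean_s) * (o - mean_o) for s, o in zip(sim, obs)) / n
    beta = mean_s / mean_o               # bias ratio
    alpha = std_s / std_o                # variability ratio
    rho = cov / (std_s * std_o)          # Pearson correlation
    kge = 1.0 - math.sqrt((beta - 1) ** 2 + (alpha - 1) ** 2 + (rho - 1) ** 2)
    return kge, beta, alpha, rho


obs = [0.30, 0.32, 0.35, 0.31, 0.29]     # illustrative values only
sim = [2.0 * o for o in obs]             # perfectly correlated but biased
kge, beta, alpha, rho = kge_components(sim, obs)
```

In this example the simulation is perfectly correlated with the observations, yet doubling every value gives β = α = 2 and a negative KGE, which shows how the decomposition separates bias and variability errors from correlation.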

2.4. Experiment Design

The experiments (see Figure 2) use 4-hourly water quality monitoring data from the Erhai watershed to create two distinct datasets: a low-frequency daily aggregation (DA) dataset and a high-frequency 4 h high temporal resolution (HTR) dataset. We generated the DA dataset through a specific method: for each calendar day, we randomly selected one observation from the raw dataset as the representative value. This sampling process was repeated ten times to create ten DA data subsets.

To address potential bias from sample size differences, we performed ten independent random samplings on the HTR data, with each sample strictly matching the corresponding DA subset’s sample size. We then averaged the ten results to minimize sampling randomness. We applied stratified cross-validation for model training, building prediction models independently for each data subset. The final performance metrics were calculated as the arithmetic mean of the ten modeling results. We also established independent validation sets to ensure the statistical reliability of our evaluation conclusions. Our evaluation framework includes four complementary metrics: Mean Absolute Percentage Error (MAPE) to measure prediction bias; coefficient of determination (R²) to assess how well the model explains water quality variations; relative Root Mean Square Error (rRMSE) to evaluate prediction error fluctuations; and the Kling–Gupta efficiency (KGE), helping identify parameters sensitive to temporal resolution. This approach reveals how temporal resolution affects model accuracy, stability, and timeliness. Finally, Wavelet Component Analysis and SHAP (SHapley Additive exPlanations) were employed to interpret the underlying drivers of the model predictions, providing insights into how different temporal components and input variables contribute to the model’s performance across varying time resolutions.
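The sampling scheme described above can be sketched as follows: one randomly chosen observation per calendar day forms a daily subset, and a size-matched random sample is drawn from the full 4 h series, repeated ten times. The helper names, the seed, the synthetic records, and the choice of uniform sampling without replacement for the HTR subsets are assumptions made for this illustration.

```python
import random
from collections import defaultdict
from datetime import datetime, timedelta


def daily_subset(records, rng):
    """Pick one randomly chosen observation per calendar day."""
    by_day = defaultdict(list)
    for timestamp, value in records:
        by_day[timestamp.date()].append((timestamp, value))
    return [rng.choice(by_day[day]) for day in sorted(by_day)]


def matched_htr_subset(records, size, rng):
    """Draw a random HTR sample with the same size as a daily subset."""
    return sorted(rng.sample(records, size))


def build_subsets(records, repeats=10, seed=42):
    """Create `repeats` pairs of (daily subset, size-matched HTR subset)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(repeats):
        da = daily_subset(records, rng)
        htr = matched_htr_subset(records, len(da), rng)
        pairs.append((da, htr))
    return pairs


# Synthetic 4-hourly series covering 30 days (6 records per day).
start = datetime(2020, 3, 5)
records = [(start + timedelta(hours=4 * k), 7.5 + 0.01 * k) for k in range(180)]
pairs = build_subsets(records)
```

Matching the sample sizes in this way means that any difference in model skill between the two subsets reflects the timing of the observations rather than the number of training records.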

The research addresses two key questions: (1) how changes in temporal resolution directionally impact model performance indicators, and (2) whether different water quality parameters (such as dissolved oxygen and ammonia nitrogen) respond differently to changes in temporal scale.

3. Results

3.1. Statistical Characterization of Data

We compared the distributional characteristics of the original 4 h resolution data and the daily mean (DM) values (Table 1). Both the daily and HTR datasets exhibit similar water quality characteristics. The daily dataset shows a mean temperature of 17.30 °C (standard deviation 4.56 °C) and a mean pH value of 8.56 (standard deviation 0.20), indicating stable alkaline conditions. The permanganate index averages 4.20 mg/L (standard deviation 0.86), while dissolved oxygen averages 7.78 mg/L (standard deviation 1.03), suggesting adequate oxygenation with moderate fluctuations. Ammonia nitrogen concentrations remain low (mean 0.03 mg/L), with total phosphorus and total nitrogen averaging 0.02 mg/L and 0.32 mg/L, respectively. Electrical conductivity averages 290.75 μS/cm (standard deviation 25.40). Notably, turbidity shows a mean value of 4.75 but with a high standard deviation of 10.49 and a maximum value of 262.25, indicating significant outliers. Chlorophyll a concentrations average merely 0.01 mg/m³, reflecting low algal growth levels. The signal-to-noise ratio (SNR), calculated by dividing the mean of the signal by the standard deviation (std) of the noise, is consistently higher in the daily dataset across parameters. For example, the values for dissolved oxygen (8.14) and temperature (45.78) both exceed those of the HTR dataset, suggesting superior signal quality in the daily measurements.

The HTR dataset presents largely comparable parameters to the daily dataset: temperature averages 17.43 °C (standard deviation 4.25 °C), pH averages 8.52 (standard deviation 0.20), permanganate index averages 4.21 mg/L (standard deviation 0.84), and dissolved oxygen averages 7.65 mg/L (standard deviation 1.02). Ammonia nitrogen maintains a mean of 0.03 mg/L (standard deviation 0.01). Total phosphorus and total nitrogen average 0.02 mg/L and 0.32 mg/L, respectively, while electrical conductivity averages 290.69 μS/cm (standard deviation 24.28). Turbidity shows a mean of 4.41 and a maximum value of 184.33, with a standard deviation (11.72) higher than that of the daily dataset (10.49). Chlorophyll a concentrations remain consistent with the daily dataset at 0.01 mg/m³. In terms of SNR, the HTR dataset demonstrates slightly lower values overall compared to the daily dataset, with dissolved oxygen at 7.54, temperature at 41.73, and pH at 5.01, all lower than their daily counterparts, indicating greater data variability in specific parameters within the HTR measurements.

The parameter distribution characteristics are shown in Figure 3. In the HTR data, the distributions of standardized values differ markedly among parameters: temperature (Temp) forms a steep unimodal peak near zero, IMn forms a broad peak around 0.50 that extends into the 1.00 range, and NH₃-N and TN show primary peaks at 0.00 and 0.20, respectively, with TN exhibiting a slight right skew in the range above 0.4 (Figure 3a). In comparison, the daily average parameters demonstrate a more concentrated distribution pattern: pH and Chla form sharp peaks near the origin, DO shows a prominent peak at 0.4 and a notable right skew, EC displays a broad peak around 0.3, and TP shows a rapid decay in the right tail (Figure 3, right panel). The differences in kurtosis among the parameter distributions indicate that the HTR data have a wider range of variability compared to the daily average data, with IMn and EC exhibiting the broadest distribution spans across both timescales.

3.2. Model Performance of DA and HTR Datasets

We compared the predictive performance of four models—SVR, RF, XGBoost, and LSTM—across two temporal scales (4 h HTR and daily) for four water quality parameters (TN, TP, NH3_N, and DO). The results revealed significant differences in model performance across different water quality parameters (Table 2). The RF and XGBoost models demonstrated strong predictive capabilities for most water quality parameters, while SVR showed limited effectiveness except for DO predictions. The RF model achieved superior performance for TN prediction with HTR and daily MAPE values of 4.60% and 10.81%, rRMSE values of 7.23% and 14.93%, and R² values of 0.76 and 0.62, respectively. Similarly, for TP prediction, RF maintained strong performance with HTR and daily MAPE values of 6.17% and 10.73%, rRMSE values of 9.81% and 18.02%, and R² values of 0.66 and 0.62. For NH3-N, both RF and XGBoost models achieved identical HTR MAPE values of 3.60%, with RF maintaining slightly better daily performance (MAPE 9.30% versus 10.07%). For DO prediction, all models performed relatively well, with XGBoost achieving the lowest HTR MAPE of 1.71% (rRMSE 2.61%, R² 0.94), closely followed by RF with MAPE 1.73% (rRMSE 2.66%, R² 0.94).

Overall, tree-based models (RF and XGBoost) demonstrated superior and more consistent predictive capabilities compared to SVR and LSTM, particularly when handling variations in temporal resolution. SVR consistently underperformed across most parameters, often failing to capture concentration patterns effectively. LSTM, despite its strengths in sequential data, showed mixed results and did not consistently outperform the tree-based algorithms.

Despite its theoretical advantages in handling sequential data, LSTM did not outperform the tree-based models. It performed poorly in TP prediction, with HTR and daily MAPE values of 12.08% and 16.23%, rRMSE values of 16.21% and 24.43%, and R² values of only 0.09 and 0.36, respectively. The weakness of SVR was most evident for TP (HTR MAPE 36.55%, daily MAPE 104.86%) and NH₃-N (HTR MAPE 71.04%, daily MAPE 200.92%), where negative R² values indicate predictions worse than the mean of the observations, that is, a complete failure to capture concentration patterns.

Our analysis indicates that TN, NH3-N, and DO are relatively easier to predict than TP, which presents significant challenges across all models. This pattern was consistent in both temporal resolutions. For instance, even the best-performing RF model achieved HTR and daily MAPE values of 6.17% and 10.73% for TP, compared to 3.60% and 9.30% for NH3-N. The DO prediction task yielded the best results across all models, with the lowest MAPE values ranging from 1.71% to 2.50% for HTR predictions and 4.31% to 5.20% for daily predictions.

The benefit of using high temporal resolution data varied by parameter. Using HTR data consistently yielded better model performance for TN, NH₃-N, and DO predictions than daily aggregated data. For example, RF’s MAPE for TN improved from 10.81% (daily) to 4.60% (HTR), and for NH₃-N, it improved from 9.30% (daily) to 3.60% (HTR). The improvement was also notable for DO, with RF’s MAPE decreasing from 4.31% (daily) to 1.73% (HTR). However, for TP prediction, the improvement from high-frequency data was less pronounced, with RF’s MAPE decreasing from 10.73% (daily) to 6.17% (HTR). These variable benefits across parameters suggest that the added value of high-frequency monitoring may depend on the specific water quality parameter’s inherent characteristics and fluctuation patterns.
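The size of these improvements is easier to compare when expressed as a relative reduction in MAPE. The short calculation below uses only the RF values quoted in this paragraph; the helper name is introduced here for illustration.

```python
# RF MAPE values (%) quoted in the text, stored as (daily, HTR).
rf_mape = {
    "TN": (10.81, 4.60),
    "TP": (10.73, 6.17),
    "NH3-N": (9.30, 3.60),
    "DO": (4.31, 1.73),
}


def relative_reduction(daily, htr):
    """Percentage reduction in MAPE when moving from daily to HTR data."""
    return (daily - htr) / daily * 100.0


reductions = {name: relative_reduction(*pair) for name, pair in rf_mape.items()}
for name, value in sorted(reductions.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.1f}% lower MAPE with HTR data")
```

On these figures the relative reduction for TP is roughly 42%, compared with about 57–61% for TN, DO, and NH₃-N, which quantifies the weaker benefit of HTR data for TP.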

4. Discussion

4.1. Insights from Model Performance with Different Temporal Data

In this section, we discuss the temporal scaling effects using the Kling–Gupta efficiency metric and its components. The KGE measures model performance by considering the bias, correlation, and variability between predictions and observations. The results are shown in Figure 4. Among the prediction models, SVR exhibits the lowest KGE scores in predicting TN, TP, and NH₃-N. The consistently poor performance (KGE = 0.42–0.56 for TN, approximately 0.00 for both TP and NH₃-N) stems from inadequate correlation metrics and extreme bias amplification (bias ratios of 1.02–1.28 for TN, 1.28–1.87 for TP, and 1.58–3.00 for NH₃-N), indicating fundamental limitations in capturing temporal patterns of target variables.

RF, XGBoost, and LSTM models demonstrate superior capability, achieving positive KGE scores across all target variables. For TN prediction, tree-based models (RF and XGBoost) notably benefit from high temporal resolution (HTR) data, enabling better temporal pattern capture, though with slight performance variations. XGBoost achieved the most stable TN simulations across temporal resolutions (Daily KGE = 0.83, HTR KGE = 0.75), driven by robust variability correlation (0.89 daily, 0.84 HTR) and minimal bias (1.00). While RF exhibited comparable daily performance (KGE = 0.80), its HTR KGE declined to 0.70 due to reduced correlation (0.75 vs. 0.84 daily). Similar enhancement effects from HTR data are observed for RF and XGBoost in predicting NH3-N and DO, where XGBoost consistently emerged as the most reliable model (NH3-N: Daily KGE = 0.92, HTR KGE = 0.83; DO: Daily KGE = 0.96, HTR KGE = 0.88).

An exception occurs in TP prediction, where HTR data result in worse performance than daily aggregated (DA) data. RF and XGBoost maintained relatively stable KGE across scales (RF: 0.72–0.81; XGBoost: 0.73–0.84) with balanced bias ratios (1.00) and strong variability correlations (0.80–0.95), but the performance degradation with HTR data might be attributed to the low signal-to-noise ratio (SNR) of TP measurements. This suggests that high noise components introduced in HTR datasets might degrade model performance instead of enhancing it [32]. Therefore, assessing the quality of monitoring data in terms of signal and noise components is crucial before implementing high-frequency monitoring systems.

The LSTM model achieved better performance with HTR data in predicting NH3-N and DO, with improvement observed in DO prediction. Surprisingly, LSTM demonstrated an inverse pattern for TN with superior HTR KGE (0.81) compared to daily (0.74), reflecting its capacity to leverage sequential dependencies in high-frequency data. Such behavior might be related to the recurrent mechanism of LSTM design, which is specifically engineered to capture long-term dependencies in sequential data [33]. Another interesting observation is that using HTR data for TP prediction with LSTM significantly enhanced the model’s ability to capture temporal variability, but correlation performance worsened (HTR KGE = 0.53 vs. daily KGE = 0.55). This dichotomy suggests that LSTM’s temporal advantage is conditional on parameter-specific dynamic characteristics rather than universally applicable [34,35].

Overall, the impact of temporal resolution varied significantly by model architecture. Tree-based methods (RF, XGBoost) showed inherent robustness to temporal scaling, experiencing only moderate performance degradation when moving from daily to HTR data (Average ΔKGE = 0.09). SVR’s structural limitations led to extreme sensitivity and poor performance across resolutions. LSTM, conversely, displayed non-linear temporal effects—excelling in TN HTR predictions but struggling with DO and NH₃-N at finer resolutions, exhibiting the largest temporal sensitivity for DO (ΔKGE = 0.21 between daily 0.91 and HTR 0.70). This highlights challenges in modeling oxygen fluctuations against complex biochemical interactions at finer resolutions [35].

4.2. SHAP Interpretability Analysis for Main Driving Factors

We first applied SHAP analysis to identify the key features that influence model predictions. This helps relate model performance, dominant features, and the temporal characteristics of water quality parameters (Figure 5 and Figure 6).

Statistical analysis revealed significant differences in SHAP values between models across all water quality parameters. For TN prediction, SHAP analysis revealed that DO and temperature were the primary influencing factors. The SVR model showed higher dependency on DO, EC, and temperature (SHAP values 0.04–0.05) compared to the RF and XGBoost models (SHAP values < 0.02). This pattern of feature importance might explain why RF and XGBoost achieved better HTR performance (KGE = 0.80–0.83) compared to daily aggregation (KGE = 0.70–0.75), as these models possibly better captured sub-daily variations (variability correlation 0.84–0.89 vs. 0.75–0.84 daily). The relationship between environmental factors and nitrogen dynamics has been well-documented in previous studies [36,37].

For TP prediction, SHAP analysis identified TN as the most critical feature (SHAP value approximately 0.0015 in LSTM), with temperature also playing a significant role across traditional machine learning models. The complex interactions between these factors possibly contribute to the universally poorer predictions for TP (best RF daily KGE = 0.81) compared to other parameters. High-frequency noise may have degraded model accuracy at finer temporal resolutions (ΔKGE = −0.09 for RF), suggesting why daily aggregation improved model stability through smoothing of high-frequency components [38].

For NH₃-N prediction, SHAP analysis indicated that DO was the dominant feature (SHAP value approximately 0.0055 in LSTM), while pH showed abnormally high importance in the SVR model (SHAP value around 0.008). This feature importance pattern might explain XGBoost’s optimal daily performance (KGE = 0.92) compared to HTR (KGE = 0.83), as daily aggregation potentially better aligned with the parameter’s intrinsic variations. RF’s 11% degradation in HTR performance further points to possible limitations in resolving high-frequency fluctuations present in NH₃-N signals [39].

For DO prediction, SHAP analysis identified temperature as the primary driver, with SHAP values ranging from 0.05 in the LSTM model to approximately 0.5 in the SVR model, substantially exceeding the contribution of other features. This finding aligns with established knowledge of temperature-driven oxygen dynamics in aquatic systems [40]. The strong relationship between temperature variation patterns and DO explains the excellent performance across all models (HTR XGBoost KGE = 0.96, HTR RF KGE = 0.94), with only marginal improvements from high temporal resolution data (ΔKGE = 0.08–0.10). This feature–response relationship supports the stable variability correlations (0.93–0.97) observed across temporal resolutions.

The integration of SHAP analysis with model performance metrics suggests that prediction accuracy is highest when model architecture and temporal sampling align with the dominant processes influencing water quality parameters. Parameters with strong temperature dependence (DO) achieved consistent performance across models and temporal scales, while parameters with complex multi-factor dynamics (TP) present inherent prediction challenges regardless of model sophistication. Parameters influenced by slower environmental processes (NH₃-N) might perform better with daily aggregation, while those with significant sub-daily components (TN) could benefit from high temporal resolution monitoring [41].

This analysis demonstrates that the predictability of water quality parameters is potentially linked to the alignment among feature importance patterns, model architecture capabilities, and the inherent temporal characteristics of the target parameters.

4.3. Wavelet Decomposition for Temporal Variability Components

The previous analysis implied that the benefit of using high temporal resolution data is not uniform. Both model structure and the SNR of the data may play a role in the final performance. In this section, we use wavelet decomposition to examine the temporal characteristics of the water quality measurements and explore the predictability of the target parameters. Under a 4 h sampling interval, the D1–D4 decomposition levels correspond to timescales of 8 h, 16 h, 32 h, and 64 h, respectively, thereby characterizing dynamic processes from intra-daily fluctuations to multi-day variations (Figure 7).

The results showed that, for lake water temperature, pH, DO, and TN, the sub-daily component (D2) carries the largest share of detail energy. Specifically, for DO the primary (trend) component held 99.57% of the total energy, and D2 dominated the residual energy (47.40% at the 16 h scale), aligning with diurnal photosynthetic cycles [42]. Similarly, TN showed D2-scale dominance (41.57% at 16 h), alongside possible intermittent high-frequency influences related to rainfall events. These results may reflect the sensitivity of these variables to daily environmental cycles, such as sunlight and air temperature changes.
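
The two-tier energy accounting described above (trend share of total energy, then detail shares of the residual energy) can be sketched with a small orthonormal Haar transform. The wavelet family used in the study is not stated in this section, so Haar is an assumption chosen for simplicity. The function names and the synthetic series are hypothetical.

```python
import math

def haar_step(signal):
    """One level of the orthonormal Haar transform (even-length input)."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def energy_shares(signal, levels=4, dt_hours=4):
    """Return (trend % of total, detail % of residual energy, level timescales)."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("length must be divisible by 2**levels")
    approx = list(signal)
    detail_energy = {}
    for level in range(1, levels + 1):
        approx, detail = haar_step(approx)
        detail_energy[f"D{level}"] = sum(d * d for d in detail)
    trend_energy = sum(a * a for a in approx)
    residual = sum(detail_energy.values())
    trend_pct = 100.0 * trend_energy / (trend_energy + residual)
    detail_pct = {k: 100.0 * v / residual for k, v in detail_energy.items()}
    # Level j is labelled with the timescale dt * 2**j (8, 16, 32, 64 h at 4 h)
    scales = {f"D{j}": dt_hours * 2 ** j for j in range(1, levels + 1)}
    return trend_pct, detail_pct, scales

# Synthetic DO-like series: 16 days at 4 h steps, mean 7.65 with a 24 h cycle
series = [7.65 + 1.0 * math.sin(2 * math.pi * (4 * n) / 24) for n in range(96)]
trend_pct, detail_pct, scales = energy_shares(series)
dominant_detail = max(detail_pct, key=detail_pct.get)
```

In this synthetic case the trend component holds almost all of the energy because the mean is large relative to the fluctuations, and the D2 level holds the largest share of the remaining energy. In the usual dyadic interpretation, D2 covers periods of roughly 16 to 32 h at a 4 h step, which brackets the 24 h cycle.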

The presence of strong diurnal patterns in these parameters might explain their good predictability, as shown in previous model performance results. For DO, this mesoscale regularity may explain why all models achieved satisfactory predictions (HTR XGBoost KGE = 0.96, HTR RF KGE = 0.94, as reported in Section 4.2) with only marginal improvements from high temporal resolution data (ΔKGE = 0.08–0.10). The balanced energy distribution across sub-daily scales enabled comparable performance between temporal resolutions, as reflected in stable variability correlations (0.93–0.97) [42]. For TN, this temporal characteristic may explain RF and XGBoost’s good HTR performance (KGE = 0.80–0.83 vs. daily 0.70–0.75) through enhanced capture of sub-daily variations (variability correlation 0.84–0.89 vs. 0.75–0.84 daily).

For IMn, NH₃-N, EC, and NTU, the D4 component appears to be dominant. NH₃-N demonstrated D4 dominance (38.93% at 64 h) aside from primary energy, which is consistent with slow sediment-exchange processes. EC showed extreme low-frequency dominance (99.99% trend), explaining its minimal temporal resolution sensitivity [43]. NTU exhibited multi-scale energy distribution (54.85% trend), correlating with known sensitivity to disturbance events. These components correspond to longer-period fluctuations that might be mainly driven by slower biogeochemical processes or multi-day weather conditions [16,44]. XGBoost’s optimal daily performance for NH₃-N (KGE = 0.92 vs. HTR = 0.83) aligns with this low-frequency control, while RF’s 11% HTR degradation reveals possible limitations in resolving residual high-frequency fluctuations.

Interestingly though, TP shows no dominant component, displaying balanced multi-scale energy (D1–D4: 24.77–27.79%) aside from the primary energy. This implies that the variation in TP in Erhai Lake could involve multiple processes operating at different timescales as well as irregular external inputs, reflecting complex interactions from rapid runoff (D1) to weekly sediment processes. This spectral complexity corresponds with TP’s universally poor predictions (best RF daily KGE = 0.81) compared to other parameters, particularly in HTR mode where high-frequency noise potentially degrades performance (ΔKGE = −0.09 for RF). The energy distribution explains why daily aggregation improved model stability through high-frequency smoothing [45,46].
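
The smoothing effect of daily aggregation can be illustrated with a minimal simulation. It assumes independent, zero-mean Gaussian noise on each 4 h sample, which is an idealization; real sensor noise may be autocorrelated, in which case averaging reduces it less. The noise level and series length are hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

SAMPLES_PER_DAY = 6   # 24 h / 4 h sampling interval
DAYS = 400            # hypothetical record length
NOISE_STD = 0.01      # hypothetical noise level, same order as the TP std in Table 1

# Pure measurement noise at 4 h resolution
noise = [random.gauss(0.0, NOISE_STD) for _ in range(DAYS * SAMPLES_PER_DAY)]

# Daily aggregation: mean of the six 4 h samples in each day
daily_means = [
    statistics.fmean(noise[d * SAMPLES_PER_DAY:(d + 1) * SAMPLES_PER_DAY])
    for d in range(DAYS)
]

std_htr = statistics.pstdev(noise)
std_daily = statistics.pstdev(daily_means)
reduction = std_htr / std_daily  # theory for independent noise: sqrt(6), about 2.45
```

Under the independence assumption, averaging six samples lowers the noise standard deviation by a factor of about √6 while leaving slower signal components largely intact. This is consistent with daily aggregation helping a low-SNR parameter such as TP.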

Variables with strong cyclic patterns such as TN and DO might benefit from HTR datasets due to their distinct temporal signatures. For instance, LSTM’s inverse temporal pattern for TN (HTR KGE = 0.81 > daily 0.74) suggests a unique capacity to leverage sequential dependencies despite shared D2 dominance [47,48,49]. On the other hand, the prediction of TP will likely be more challenging even with the HTR dataset, as its strong variability originates from complex biogeochemical processes operating across multiple timescales. An alternative for improving the predictability of TP could be to enlarge the spectrum of input variables.

Supplementary analysis confirmed these general patterns: Chla revealed D3–D4 dominance beside primary energy (46.43%), aligning with algal growth cycles [50]. The spectral analysis suggests that models achieve optimal performance when their architectural design corresponds with the prevailing scale characteristics of the underlying processes, while high-frequency complexities (as in TP) may inherently limit prediction accuracy regardless of temporal resolution [51].

4.4. Implications for Water Quality Monitoring and Modeling

This study integrates machine learning models with SHAP value analysis to uncover the driving mechanisms behind dynamic changes in water quality parameters. Our results provide practical implications for designing effective monitoring strategies across different temporal resolutions.

For routine water quality monitoring programs, our findings demonstrate that daily resolution data generally satisfy prediction requirements for most parameters. Tree-based machine learning models like RF and XGBoost achieve satisfactory prediction performance (KGE > 0.80 for most parameters) using daily aggregated data, which meets the basic needs of lake water quality monitoring while minimizing resource investment. This finding is relevant for resource-constrained monitoring programs where high-frequency sensor deployment may not be feasible.

However, parameter-specific temporal dependencies suggest that differentiated monitoring strategies might be optimal. Temperature-sensitive parameters like DO benefit from high-resolution monitoring that captures diurnal fluctuations (HTR XGBoost KGE = 0.96 vs. daily KGE = 0.88), while parameters with complex multi-scale dynamics like TP present inherent prediction challenges regardless of temporal resolution or model sophistication. For specialized lake types, such as phosphorus-limited eutrophic lakes, precise TP monitoring remains critical. Our analysis reveals that despite the challenges in TP prediction, daily aggregated data provide more stable model performance than high-resolution data (RF daily KGE = 0.81 vs. HTR KGE = 0.72), possibly due to the smoothing of high-frequency noise components that might otherwise degrade model accuracy [52].

Comparative model analysis demonstrates that non-linear models (i.e., LSTM, RF, XGBoost) exhibit notable advantages in capturing the interactions of features across different temporal patterns [53]. LSTM models show particular promise for parameters with strong sequential dependencies, as evidenced by their superior performance for TN prediction using high temporal resolution data (HTR KGE = 0.81 vs. daily KGE = 0.74). This suggests that advanced modeling frameworks could compensate for limitations in monitoring frequency when deployed appropriately [54].

For implementation in water resource management, we recommend a tiered monitoring approach where routine parameters are monitored at daily intervals using cost-effective traditional machine learning models, while critical parameters for specific lake systems receive targeted high-frequency monitoring coupled with specialized modeling techniques. Such an approach would optimize resource allocation while ensuring accurate water quality assessment and prediction capabilities [55,56]. This strategy holds the potential to enhance water quality prediction performance and provide more reliable spatiotemporal dynamics for watershed management decision-making.
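
As a rough illustration of the tiered approach, the sketch below selects a monitoring resolution per parameter from the KGE pairs quoted in Sections 4.2 and 4.4. The decision rule and the `min_gain` threshold are hypothetical and are not part of the study; they only show how such a rule could be encoded.

```python
# parameter: (model, HTR KGE, daily KGE), values as quoted in Sections 4.2 and 4.4
reported_kge = {
    "DO": ("XGBoost", 0.96, 0.88),
    "TP": ("RF", 0.72, 0.81),
    "NH3-N": ("XGBoost", 0.83, 0.92),
    "TN": ("LSTM", 0.81, 0.74),
}

def preferred_resolution(htr_kge, daily_kge, min_gain=0.05):
    """Prefer HTR only if it improves KGE by at least min_gain (assumed threshold)."""
    return "HTR (4 h)" if htr_kge - daily_kge >= min_gain else "daily"

plan = {
    parameter: preferred_resolution(htr, daily)
    for parameter, (_model, htr, daily) in reported_kge.items()
}
```

With these inputs the rule assigns HTR monitoring to DO and TN and daily monitoring to TP and NH₃-N, in line with the parameter-specific recommendations. A practical rule would also need to weigh sensor cost and data quality.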

5. Conclusions

This study comprehensively investigates the impact mechanism of temporal resolution on water quality prediction models through a multi-scale analysis approach, providing a robust theoretical basis for optimizing monitoring strategies. The results indicate significant differences in the sensitivity of water quality parameters to temporal resolution: DO exhibits relatively consistent performance across temporal scales in all models, with XGBoost achieving optimal results at both daily (KGE = 0.96) and HTR (KGE = 0.88) resolutions through exceptional correlation (0.97 daily, 0.94 HTR) and variability reproduction. This temporal stability aligns closely with DO’s pronounced diurnal cycle driven by photosynthesis (47.40% of energy at the 16 h scale). In contrast, TN, TP, and NH₃-N demonstrate parameter-specific response patterns, reflecting the unique temporal scale characteristics of different biogeochemical processes. For instance, NH₃-N shows a dominant characteristic at the 64 h scale (38.93% of energy), explaining why XGBoost emerged as the most reliable model for NH₃-N simulation (Daily KGE = 0.92, HTR KGE = 0.83), supported by near-ideal bias ratios (1.00–1.01) and superior variability tracking (0.94 both scales).

Model comparison analysis further reveals that tree-based methods (RF, XGBoost) demonstrate moderate HTR performance degradation (Average ΔKGE = 0.09), suggesting their inherent robustness to temporal scaling. In contrast, SVR displayed systematic limitations, particularly in HTR variability correlation and excessive bias (1.28–3.00), resulting in failures for TP and NH₃-N predictions (KGE = 0.00). LSTM exhibited non-linear temporal effects, excelling in TN HTR predictions (HTR KGE = 0.81 > daily KGE = 0.74) but struggling with DO and NH₃-N at finer resolutions (ΔKGE = 0.21 for DO). Feature importance analysis based on SHAP values further elucidates the temporal scale effects of environmental driving factors. Temperature is identified as the most significant factor influencing DO variations, with a strong correlation at the 16 h scale, confirming the dominant role of the diurnal photosynthesis cycle. These findings provide scientific support for developing parameter-specific monitoring strategies: high-frequency monitoring (such as the 4 h resolution used here) is beneficial for parameters with significant sub-daily components like TN, while parameters dominated by low-frequency processes (NH₃-N) perform adequately with daily aggregation. The “parameter-scale-model” matching framework established in this study not only deepens the understanding of water quality dynamics but also provides methodological guidance for the optimization of intelligent monitoring system design.

This study highlights the influence of temporal resolution on water quality model performance, revealing distinct sensitivities across parameters. DO showed stable predictive accuracy across scales, driven by its diurnal pattern, while nutrient indicators such as TN, TP, and NH₃-N exhibited scale-dependent behaviors reflecting varied biogeochemical dynamics. XGBoost consistently outperformed other models for DO and NH₃-N, while tree-based methods demonstrated general robustness to resolution changes. In contrast, SVR showed notable degradation at high temporal resolution, and LSTM displayed mixed results depending on the parameter. SHAP analysis further confirmed scale-specific feature importance, particularly the role of temperature in sub-daily DO variation.

This study, while contributing valuable insights, has several limitations that warrant attention. First, the current analysis focuses exclusively on Erhai Lake, which might constrain its generalizability to other regions with different climatic, hydrological, and socio-economic characteristics. Second, it is important to acknowledge that only the water quality data from a single long-term monitoring site were used in the analysis. This limits our ability to assess the spatial variability and gradients of water quality parameters across the entire lake, given that the Erhai watershed is characterized by strong heterogeneous hydrological and environmental conditions which might influence the contaminant distribution in Erhai Lake. Future research might explore spatial patterns of lake water quality and include other variables besides water quality parameters in prediction models to further enhance the generalizability.

Author Contributions

Conceptualization, X.S. and Y.L.; methodology, X.S. and Y.L.; writing—original draft preparation, X.S.; writing—review and editing, Y.L. and B.Y.; visualization, X.S.; supervision, Y.L.; project administration, S.W.; funding acquisition, S.W. and S.N. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Key R&D Program of Shandong Province (2021CXGC011202), the Major Science and Technology Project of Yunnan Province (202202AE090034), and the Project of Science and Technology Department of Yunnan Province (202304BQ040005 and 202305AF150055). Computational resources were provided by the Interdisciplinary Intelligence SuperComputer Center of Beijing Normal University Zhuhai.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Lakra, W.S.; Sarkar, U.K.; Dubey, V.K.; Sani, R.; Pandey, A. River Inter Linking in India: Status, Issues, Prospects and Implications on Aquatic Ecosystems and Freshwater Fish Diversity. Rev. Fish Biol. Fish. 2011, 21, 463–479.
2. Cloern, J.E.; Abreu, P.C.; Carstensen, J.; Chauvaud, L.; Elmgren, R.; Grall, J.; Greening, H.; Johansson, J.O.R.; Kahru, M.; Sherwood, E.T.; et al. Human Activities and Climate Variability Drive Fast-Paced Change across the World’s Estuarine–Coastal Ecosystems. Glob. Chang. Biol. 2016, 22, 513–529.
3. Yang, J.; Li, J.; van Vliet, M.T.H.; Jones, E.R.; Huang, Z.; Liu, M.; Bi, J. Economic Risks Hidden in Local Water Pollution and Global Markets: A Retrospective Analysis (1995–2010) and Future Perspectives on Sustainable Development Goal 6. Water Res. 2024, 252, 121216.
4. Zitoun, R.; Marcinek, S.; Hatje, V.; Sander, S.G.; Völker, C.; Sarin, M.; Omanović, D. Climate Change Driven Effects on Transport, Fate and Biogeochemistry of Trace Element Contaminants in Coastal Marine Ecosystems. Commun. Earth Environ. 2024, 5, 560.
5. Dibike, Y.B.; Broadbent, J.; Musetta-Lambert, J.; Reid, T.; Spoelstra, J.; Monk, W.A.; Nicholls, E.M.; Shrestha, R.R.; Beltaos, S.; Peters, D.L.; et al. Toward a Canadian National River Water Quality Modeling System: State of Science and Future Prospects. Environ. Rev. 2025, 33, 1–26.
6. Sumudumali, R.G.I.; Jayawardana, J.M.C.K. A Review of Biological Monitoring of Aquatic Ecosystems Approaches: With Special Reference to Macroinvertebrates and Pesticide Pollution. Environ. Manag. 2021, 67, 263–276.
7. Marcuello, C. Present and Future Opportunities in the Use of Atomic Force Microscopy to Address the Physico-Chemical Properties of Aquatic Ecosystems at the Nanoscale Level. Int. Aquat. Res. 2022, 14, 231–240.
8. Deng, C.; Liu, L.; Li, H.; Peng, D.; Wu, Y.; Xia, H.; Zhang, Z.; Zhu, Q. A Data-Driven Framework for Spatiotemporal Characteristics, Complexity Dynamics, and Environmental Risk Evaluation of River Water Quality. Sci. Total Environ. 2021, 785, 147134.
9. Zhong, S.; Zhang, K.; Bagheri, M.; Burken, J.G.; Gu, A.; Li, B.; Ma, X.; Marrone, B.L.; Ren, Z.J.; Schrier, J.; et al. Machine Learning: New Ideas and Tools in Environmental Science and Engineering. Environ. Sci. Technol. 2021, 55, 12741–12754.
10. Ait Ballagh, F.E.; Rabouille, C.; Andrieux-Loyer, F.; Soetaert, K.; Elkalay, K.; Khalil, K. Spatio-Temporal Dynamics of Sedimentary Phosphorus along Two Temperate Eutrophic Estuaries: A Data-Modelling Approach. Cont. Shelf Res. 2020, 193, 104037.
11. Wang, D.; Wang, Y. Emergency Capacity of Small Towns to Endure Sudden Environmental Pollution Accidents: Construction and Application of an Evaluation Model. Sustainability 2021, 13, 5511.
12. Sampaio, F.G.; Araújo, C.A.S.; Dallago, B.S.L.; Stech, J.L.; Lorenzzetti, J.A.; Alcântara, E.; Losekann, M.E.; Marin, D.B.; Leão, J.A.D.; Bueno, G.W. Unveiling Low-to-High-Frequency Data Sampling Caveats for Aquaculture Environmental Monitoring and Management. Aquac. Rep. 2021, 20, 100764.
13. Seifert-Dähnn, I.; Furuseth, I.S.; Vondolia, G.K.; Gal, G.; de Eyto, E.; Jennings, E.; Pierson, D. Costs and Benefits of Automated High-Frequency Environmental Monitoring—The Case of Lake Water Management. J. Environ. Manag. 2021, 285, 112108.
14. Marcé, R.; George, G.; Buscarinu, P.; Deidda, M.; Dunalska, J.; de Eyto, E.; Flaim, G.; Grossart, H.-P.; Istvanovics, V.; Lenhardt, M.; et al. Automatic High Frequency Monitoring for Improved Lake and Reservoir Management. Environ. Sci. Technol. 2016, 50, 10780–10794.
15. Xie, Y. A Hybrid Deep Learning Approach to Improve Real-Time Effluent Quality Prediction in Wastewater Treatment Plant. Water Res. 2024, 250, 121092.
16. Shuai, P.; Chen, X.; Mital, U.; Coon, E.T.; Dwivedi, D. The Effects of Spatial and Temporal Resolution of Gridded Meteorological Forcing on Watershed Hydrological Responses. Hydrol. Earth Syst. Sci. 2022, 26, 2245–2276.
17. Zhang, Y.; Wang, J.; Li, C.; Duan, H.; Wang, W. Attention-Based Deep Learning Models for Predicting Anomalous Shock of Wastewater Treatment Plants. Water Res. 2025, 275, 123192.
18. Rodriguez-Perez, J.; Leigh, C.; Liquet, B.; Kermorvant, C.; Peterson, E.; Sous, D.; Mengersen, K. Detecting Technical Anomalies in High-Frequency Water-Quality Data Using Artificial Neural Networks. Environ. Sci. Technol. 2020, 54, 13719–13730.
19. Huang, S.; Xia, J.; Wang, Y.; Lei, J.; Wang, G. Water Quality Prediction Based on Sparse Dataset Using Enhanced Machine Learning. Environ. Sci. Ecotechnology 2024, 20, 100402.
20. Kocheturov, A.; Pardalos, P.M.; Karakitsiou, A. Massive Datasets and Machine Learning for Computational Biomedicine: Trends and Challenges. Ann. Oper. Res. 2019, 276, 5–34.
21. Bieroza, M.; Acharya, S.; Benisch, J.; ter Borg, R.N.; Hallberg, L.; Negri, C.; Pruitt, A.; Pucher, M.; Saavedra, F.; Staniszewska, K.; et al. Advances in Catchment Science, Hydrochemistry, and Aquatic Ecology Enabled by High-Frequency Water Quality Measurements. Environ. Sci. Technol. 2023, 57, 4701–4719.
22. Xu, T.; Ma, W.; Chen, J.; Duan, L.; Li, H.; Zhang, H. Water Quality of Lake Erhai in Southwest China and Its Projected Status in the near Future. Water 2024, 16, 972.
23. Awad, M.; Khanna, R. Support Vector Regression. In Efficient Learning Machines; Apress: Berkeley, CA, USA, 2015; pp. 67–80. ISBN 978-1-4302-5989-3.
24. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
25. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13 August 2016; ACM: New York, NY, USA; pp. 785–794.
26. Nason, G.P.; Sachs, R.V. Wavelets in Time-Series Analysis. Philos. Trans. R. Soc. London. Ser. A Math. Phys. Eng. Sci. 1999, 357, 2511–2526.
27. El, K.; Higdon, B.P.; Başar, A. Interpreting Financial Time Series with SHAP Values. In Proceedings of the 29th Annual International Conference on Computer Science and Software Engineering, Toronto, ON, Canada, 4–6 November 2019.
28. Kim, Y.; Kim, Y. Explainable Heat-Related Mortality with Random Forest and SHapley Additive exPlanations (SHAP) Models. Sustain. Cities Soc. 2022, 79, 103677.
29. Kling, H.; Fuchs, M.; Paulin, M. Runoff Conditions in the Upper Danube Basin under an Ensemble of Climate Change Scenarios. J. Hydrol. 2012, 424–425, 264–277.
30. Liu, H.; Yang, R.; Duan, Z.; Wu, H. A Hybrid Neural Network Model for Marine Dissolved Oxygen Concentrations Time-Series Forecasting Based on Multi-Factor Analysis and a Multi-Model Ensemble. Engineering 2021, 7, 1751–1765.
31. He, M.; Wu, S.; Huang, B.; Kang, C.; Gui, F. Prediction of Total Nitrogen and Phosphorus in Surface Water by Deep Learning Methods Based on Multi-Scale Feature Extraction. Water 2022, 14, 1643.
32. Jia, J.; Zheng, X.; Wang, Y.; Chen, Y.; Karjalainen, M.; Dong, S.; Lu, R.; Wang, J.; Hyyppä, J. The Effect of Artificial Intelligence Evolving on Hyperspectral Imagery with Different Signal-to-Noise Ratio, Spectral and Spatial Resolutions. Remote Sens. Environ. 2024, 311, 114291.
33. Sangiorgio, M.; Dercole, F. Robustness of LSTM Neural Networks for Multi-Step Forecasting of Chaotic Time Series. Chaos Solitons Fractals 2020, 139, 110045.
34. Chen, K.; Chen, H.; Zhou, C.; Huang, Y.; Qi, X.; Shen, R.; Liu, F.; Zuo, M.; Zou, X.; Wang, J.; et al. Comparative Analysis of Surface Water Quality Prediction Performance and Identification of Key Water Parameters Using Different Machine Learning Models Based on Big Data. Water Res. 2020, 171, 115454.
35. Zheng, H.; Liu, Y.; Wan, W.; Zhao, J.; Xie, G. Large-Scale Prediction of Stream Water Quality Using an Interpretable Deep Learning Approach. J. Environ. Manag. 2023, 331, 117309.
36. Wu, K.; Hu, M.; Zhang, Y.; Zhou, J.; Wu, H.; Wang, M.; Chen, D. Long-Term Riverine Nitrogen Dynamics Reveal the Efficacy of Water Pollution Control Strategies. J. Hydrol. 2022, 607, 127582.
37. Basu, N.B.; Van Meter, K.J.; Byrnes, D.K.; Van Cappellen, P.; Brouwer, R.; Jacobsen, B.H.; Jarsjö, J.; Rudolph, D.L.; Cunha, M.C.; Nelson, N.; et al. Managing Nitrogen Legacies to Accelerate Water Quality Improvement. Nat. Geosci. 2022, 15, 97–105.
38. Bailey, B.A.; Doney, S.C.; Lima, I.D. Quantifying the Effects of Dynamical Noise on the Predictability of a Simple Ecosystem Model. Environmetrics 2004, 15, 337–355.
39. Khullar, S.; Singh, N. Machine Learning Techniques in River Water Quality Modelling: A Research Travelogue. Water Supply 2020, 21, 1–13.
40. Hutchings, A.M.; de Vries, C.S.; Hayes, N.R.; Orr, H.G. Temperature and Dissolved Oxygen Trends in English Estuaries over the Past 30 Years. Estuar. Coast. Shelf Sci. 2024, 306, 108892.
41. Fu, B.; Horsburgh, J.S.; Jakeman, A.J.; Gualtieri, C.; Arnold, T.; Marshall, L.; Green, T.R.; Quinn, N.W.T.; Volk, M.; Hunt, R.J.; et al. Modeling Water Quality in Watersheds: From Here to the Next Generation. Water Resour. Res. 2020, 56, e2020WR027721.
42. Kang, M.; Tian, Y.; Peng, S.; Wang, M. Effect of Dissolved Oxygen and Nutrient Levels on Heavy Metal Contents and Fractions in River Surface Sediments. Sci. Total Environ. 2019, 648, 861–870.
43. Ren, J.; Chen, Q.; Ma, D.; Xie, R.; Zhu, H.; Zang, S. Study on a Fast EC Measurement Method of Soda Saline-Alkali Soil Based on Wavelet Decomposition Texture Feature. Catena 2021, 203, 105272.
44. Wang, C.; Yang, Y.; Yang, B.; Lin, H.; Miller, T.R.; Newton, R.J.; Guo, L. Causal Relationship between Alkaline Phosphatase Activities and Phosphorus Dynamics in a Eutrophic Coastal Lagoon in Lake Michigan. Sci. Total Environ. 2021, 787, 147681.
45. Wang, J.; Wang, W.; Xiong, J.; Li, L.; Zhao, B.; Sohail, I.; He, Z. A Constructed Wetland System with Aquatic Macrophytes for Cleaning Contaminated Runoff/Storm Water from Urban Area in Florida. J. Environ. Manag. 2021, 280, 111794.
46. Xia, Y.; Zhang, M.; Tsang, D.C.W.; Geng, N.; Lu, D.; Zhu, L.; Igalavithana, A.D.; Dissanayake, P.D.; Rinklebe, J.; Yang, X.; et al. Recent Advances in Control Technologies for Non-Point Source Pollution with Nitrogen and Phosphorous from Agricultural Runoff: Current Practices and Future Prospects. Appl. Biol. Chem. 2020, 63, 8.
47. Cambon, C.; Scott, J.F. Linear and Nonlinear Models of Anisotropic Turbulence. Annu. Rev. Fluid Mech. 1999, 31, 1–53.
48. Längkvist, M.; Karlsson, L.; Loutfi, A. A Review of Unsupervised Feature Learning and Deep Learning for Time-Series Modeling. Pattern Recognit. Lett. 2014, 42, 11–24.
49. Wang, K.; Yang, H.; Chang, Y.; Huang, W.; Jiang, X. Phosphorus Release and Distribution in Sediment Resuspension Systems under Disturbing Conditions. Chemosphere 2024, 359, 142386.
50. Yang, J.; Holbach, A.; Wilhelms, A.; Qin, Y.; Zheng, B.; Zou, H.; Qin, B.; Zhu, G.; Norra, S. Highly Time-Resolved Analysis of Seasonal Water Dynamics and Algal Kinetics Based on in-Situ Multi-Sensor-System Monitoring Data in Lake Taihu, China. Sci. Total Environ. 2019, 660, 329–339.
51. Wai, K.P.; Chia, M.Y.; Koo, C.H.; Huang, Y.F.; Chong, W.C. Applications of Deep Learning in Water Quality Management: A State-of-the-Art Review. J. Hydrol. 2022, 613, 128332.
52. Neumann, A.; Dong, F.; Shimoda, Y.; Arnillas, C.A.; Javed, A.; Yang, C.; Zamaria, S.; Mandal, S.; Wellen, C.; Paredes, D.; et al. A Review of the Current State of Process-Based and Data-Driven Modelling: Guidelines for Lake Erie Managers and Watershed Modellers. Environ. Rev. 2021, 29, 443–490.
53. Afan, H.A.; El-shafie, A.; Mohtar, W.H.M.W.; Yaseen, Z.M. Past, Present and Prospect of an Artificial Intelligence (AI) Based Model for Sediment Transport Prediction. J. Hydrol. 2016, 541, 902–913.
54. Mantovani, C.; Corgnati, L.; Horstmann, J.; Rubio, A.; Reyes, E.; Quentin, C.; Cosoli, S.; Asensio, J.L.; Mader, J.; Griffa, A. Best Practices on High Frequency Radar Deployment and Operation for Ocean Current Measurement. Front. Mar. Sci. 2020, 7, 210.
55. Brown, L.E.; Maavara, T.; Zhang, J.W.; Chen, X.H.; Klaar, M.; Moshe, F.O.; Ben-Zur, E.; Stein, S.; Grayson, R.; Carter, L.; et al. Integrating Sensor Data and Machine Learning to Advance the Science and Management of River Carbon Emissions. Crit. Rev. Environ. Sci. Technol. 2025, 55, 600–623.
56. Deng, Y.; Zhang, Y.; Pan, D.; Yang, S.X.; Gharabaghi, B. Review of Recent Advances in Remote Sensing and Machine Learning Methods for Lake Water Quality Management. Remote Sens. 2024, 16, 4196.

Figure 1. A map of the Erhai watershed. The blue area represents Erhai Lake, while the black boundary outlines the Erhai watershed. The pink dot indicates the monitoring site. The background hillshade provides topographic context.

Figure 2. Experimental workflow for evaluating temporal resolution effects on water quality modeling.

Figure 3. The probability density distribution of input feature variables (a,b). The values were normalized to a 0–1 scale to facilitate comparison.

Figure 4. KGE values for water quality indexes using different models. Blue denotes HTR results; red denotes daily results.

Figure 5. SHAP value analysis with HTR data for (a) TN, (b) TP, (c) NH₃-N, and (d) DO at a significance level of 0.05 (i.e., p < 0.05).

Figure 6. SHAP value analysis with DA data for (a) TN, (b) TP, (c) NH₃-N, and (d) DO at a significance level of 0.05 (i.e., p < 0.05).

Figure 7. Wavelet Energy Decomposition for water quality indicators: main trend (A4) and detail components (D1–D4).

Table 1. Summary of basic statistics for high temporal resolution (HTR) and daily mean (DM) observations of various water quality parameters.

Parameter	Mean (HTR)	Mean (DM)	Std (HTR)	Std (DM)	Min (HTR)	Min (DM)	25% (HTR)	25% (DM)	50% (HTR)	50% (DM)	75% (HTR)	75% (DM)	Max (HTR)	Max (DM)	SNR (HTR)	SNR (DM)
Temp	17.43	17.30	4.25	4.56	8.28	8.53	13.91	13.36	18.23	17.72	21.17	21.62	26.89	25.25	4.10	3.80
pH	8.52	8.56	0.20	0.19	5.75	6.74	8.37	8.46	8.54	8.57	8.66	8.68	9.27	9.13	41.73	45.78
IMn	4.21	4.19	0.84	0.80	2.61	3.17	3.87	3.87	4.10	4.06	4.40	4.35	20.97	20.97	5.01	5.22
DO	7.65	7.75	1.02	0.95	2.46	4.20	7.02	7.06	7.59	7.77	8.36	8.45	13.77	10.92	7.54	8.14
NH₃-N	0.03	0.03	0.01	0.01	0.01	0.03	0.03	0.03	0.03	0.03	0.03	0.03	0.18	0.16	2.32	2.45
TP	0.02	0.02	0.01	0.01	0.01	0.01	0.02	0.02	0.02	0.02	0.02	0.03	0.09	0.07	3.44	3.73
TN	0.33	0.32	0.08	0.08	0.09	0.10	0.27	0.26	0.32	0.31	0.38	0.37	0.83	0.55	4.04	4.10
EC	292.61	290.75	24.28	25.40	154.85	154.94	288.57	287.05	295.59	293.44	303.50	302.27	320.08	319.88	12.05	11.45
NTU	4.41	4.75	11.72	10.49	0.27	0.41	2.13	2.04	2.68	3.11	4.25	4.66	342.58	184.33	0.38	0.45
Chla	0.01	0.01	0.02	0.02	0.00	0.00	0.00	0.00	0.01	0.01	0.01	0.01	0.40	0.28	0.61	0.69

HTR = high temporal resolution; DM = daily mean. Statistical terms: std = standard deviation, SNR = signal-to-noise ratio. Percentiles (25%, 50%, 75%) refer to 25th, 50th (median), and 75th percentiles.
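The relationship between the HTR and DM columns, and the SNR statistic, can be illustrated with a minimal sketch. It assumes one reading every 4 h (six per day) and takes SNR as the mean divided by the sample standard deviation. That definition is inferred from the table itself (for example, Temp HTR: 17.43 / 4.25 ≈ 4.10) and is not stated in this section. The function names and the sample readings are hypothetical.

```python
from statistics import mean, stdev

SAMPLES_PER_DAY = 6  # assumption: one reading every 4 h


def daily_means(series, per_day=SAMPLES_PER_DAY):
    """Average consecutive blocks of `per_day` readings; a trailing partial day is dropped."""
    n_days = len(series) // per_day
    return [mean(series[d * per_day:(d + 1) * per_day]) for d in range(n_days)]


def snr(series):
    """Signal-to-noise ratio, assumed here to be mean / sample standard deviation."""
    return mean(series) / stdev(series)


# Hypothetical 4 h dissolved-oxygen readings covering two days (not data from the study).
do_htr = [7.0, 6.5, 7.5, 8.5, 8.0, 7.5, 7.2, 6.8, 7.6, 8.8, 8.2, 7.4]
do_dm = daily_means(do_htr)

print("daily means:", do_dm)
print("SNR (HTR):", snr(do_htr))
print("SNR (DM):", snr(do_dm))
```

Averaging within each day removes sub-daily extremes, which is in line with the narrower Min to Max ranges reported for DM relative to HTR for DO, NTU, and chla.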

Table 2. Performance comparison of four machine learning models under high temporal resolution (HTR) and daily aggregation (DA) timescales for water quality prediction.

Model	Index	MAPE (HTR)	MAPE (DA)	rRMSE (HTR)	rRMSE (DA)	R² (HTR)	R² (DA)
SVR	TN	12.04	18.59	14.39	20.76	0.00	0.27
	TP	36.55	100.10	35.90	91.44	−4.08	−9.79
	NH₃-N	71.04	228.92	65.05	203.71	−5.87	−27.75
	DO	2.14	3.59	3.20	5.21	0.91	0.80
RF	TN	4.60	9.80	7.23	13.70	0.76	0.68
	TP	6.17	8.96	9.81	14.18	0.66	0.75
	NH₃-N	3.60	8.04	8.29	18.41	0.90	0.77
	DO	1.73	3.37	2.66	5.05	0.94	0.81
XGBoost	TN	4.75	10.01	7.31	14.25	0.75	0.66
	TP	6.38	9.26	9.85	14.30	0.65	0.75
	NH₃-N	3.60	8.83	8.15	19.93	0.90	0.73
	DO	1.71	3.38	2.61	5.01	0.94	0.81
LSTM	TN	5.98	9.38	8.45	12.62	0.66	0.70
	TP	12.08	14.15	16.21	22.56	0.09	0.40
	NH₃-N	9.91	14.16	15.56	29.90	0.61	0.56
	DO	2.50	5.44	3.56	8.27	0.88	0.61

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
