1. Introduction
Eutrophication is a major global environmental problem that has negative impacts on freshwater, estuarine, and marine ecosystems [1]. The main drivers of eutrophication include excessive loads of nutrients discharged from different sources [2,3]. Eutrophication and its consequences have become major concerns worldwide due to the long-term detrimental impacts on water quality and water ecosystem equilibrium [4]. Lakes are treasures of the natural environment, playing a crucial role in regional water balance, climate conditions, and economic development [5]. Limnological studies over the past five decades have shown that human activities are driving the deterioration of lake ecosystems and their catchment areas at an alarming pace by altering the abiotic and biotic conditions of natural ecosystems [6].
The primary cause of eutrophication has been identified as excessive anthropogenic loads of biogenic pollutants entering lakes from various sources within their watersheds, together with the improper development of those watersheds. Preventing further degradation of lake ecosystems requires the immediate development of comprehensive management strategies to protect their biological resources and improve water quality. The basic task in developing such strategies is establishing an appropriate system for eutrophication monitoring and applying reliable methodologies of trophic status assessment, which are integral components of the lake management system. This forms the foundation for formulating prognostic models of lake eutrophication under different scenarios and for selecting suitable methods of lake rehabilitation and protection.
Eutrophication modeling can be classified into two main groups: deterministic (process-based) models and statistical models [7,8,9].
Process-based models describe selected elements of ecosystem functioning, such as nutrient cycling or phytoplankton growth. Their advantage lies in their cause-and-effect nature and the ability to closely reproduce ecosystem dynamics. Their disadvantage is the requirement for detailed data and an in-depth understanding of the conditions prevailing in a given ecosystem. Although these models are easy to calibrate, their generalizability across sites is limited [10]. Water quality issues have been studied by limnologists, using multi-dimensional lake and reservoir hydrodynamic and water quality models created with sophisticated computational capabilities and a better understanding of eutrophication processes [11].
In contrast to conventional statistically based water quality models, which assume linear and normally distributed relationships between response and predictor variables, artificial neural networks (ANNs) can capture the non-linear interactions typical of ecosystems [12]. The significance of these models is that they can reproduce the non-linear relationships among variables, unlike conventional statistical models that are based on linear relationships [13,14].
The aim of this study was not to create a eutrophication prediction model but to identify the key limiting factors of eutrophication and establish a ranking of their impact on the progression of this process in two different types of freshwater lakes, namely, natural Dal Lake in Kashmir, India, and Dobczyce Dam Reservoir in southern Poland, using multiple linear regression (MLR) and artificial neural network (ANN) models. The research focuses on modeling the values of the Index of Trophic Status (ITS), which was chosen as the eutrophication assessment method. The research also seeks to investigate how data-driven methods can improve the reliability of trophic status predictions, providing a basis for lake protection strategies through the management of the main factors of eutrophication.
The limiting factors of eutrophication include any environmental condition that restricts the growth rate of vegetation, often explained by Liebig’s Law of the Minimum, where the scarcest resource dictates growth. In the case of eutrophication, these factors can include phosphorus and nitrogen, BOD, COD, temperature, and insolation, depending on the specific water ecosystem. Nitrogen and phosphorus are the basic nutrients for aquatic plants, and COD indirectly reflects the content of these substances. Temperature and sunlight influence the intensity of photosynthesis and the decomposition of aquatic plants. Depending on the specific features of water ecosystem functioning and the local conditions, different factors may play a leading role.
ANNs are now widely used in trophic status prediction, mainly coupled with spatial analysis [15,16,17,18]. Neural networks can be used to predict the values of traditional trophic state indicators, such as chlorophyll-a, water transparency, and oxygen [19]. In this study, regression analysis and ANNs were used not to predict individual eutrophication indicator values, but to estimate the ITS, which was selected as the trophic level indicator. The identification of the primary limiting factors driving eutrophication development started with a classical and widely used method, namely, MLR [20]. Subsequently, two sets of neural network models were developed; however, these models remain relatively basic due to the limited availability of long-term monitoring data, which is a major constraint in most studies related to trophic state assessment [21,22].
2. Materials and Methods
We selected two lakes for our study: Dal Lake, located in Kashmir, India, and Dobczyce Lake, located in southern Poland. The two lakes have different origins. Dal Lake is a natural lake fed by glacial waters, whereas Dobczyce Lake is an artificial reservoir built on a river. However, their situation is similar, as both suffer from eutrophication problems. The waters of both lakes are monitored, but the monitoring data are not available online, and comprehensive qualitative data are lacking. We assume that the applied research method will enable the identification of the main eutrophication factors with equal effectiveness for both Dal Lake and the Dobczyce Reservoir.
2.1. Study Area: Dal Lake
Dal Lake is an urban lake located in Srinagar city, Jammu and Kashmir. It lies between 34°5′ and 34°9′ N and 74°49′ and 74°53′ E at a mean altitude of 1583 m above sea level. It is the second largest lake in the state of Jammu and Kashmir, with a lake surface area of 22 km² and a catchment spread over 337 km² [23]. The lake has a multi-drainage basin with a total water holding capacity of 15.45 million cubic meters (Mm³), and the open water is spread over 10.5 km² [24]. The lake is surrounded by the Zabarwan mountain range in the east and the Srinagar city area in the west, occupying a central position in the Kashmir valley [25]. The lake has enormous ecological and socio-economic importance. Local residents depend on the lake for their livelihoods through tourism, fisheries, and agriculture [26].
Figure 1 shows the location of the lake within the catchment. Dal Lake is a eutrophic lake subject to high anthropogenic pressure [23].
2.2. Study Area: Dobczyce Dam Reservoir
Dobczyce Lake is an artificial dam reservoir located in the southern part of Poland, within the Lesser Poland voivodship, approximately 30 km south of Krakow. The lake was constructed between 1974 and 1987. The reservoir provides drinking water supply, hydropower generation, and recreation [27]. The lake area is about 10.6 km² with a maximum depth of 28 m and an average depth of approximately 10 m [28]. The lake catchment area is approximately 768 km² (Figure 2). It comprises various land use classes, including forested slopes in the upper catchment area and intensive agricultural and urban areas in the lower part of the basin, which contribute to the nutrient inflows that lead to annual eutrophication events [15]. The lake provides a water source for over one million people, yet it faces tremendous environmental pressures linked to nutrient loading, sediment retention, and alterations in hydrological regimes. Dobczyce Lake serves as an excellent case study for hydrological, trophic, and ecosystem service modeling under temperate European conditions [15].
2.3. Datasets
The models were built using two independent datasets, with the same variables chosen as predictors. The datasets include total nitrogen (TN, mg/L), ammonium (NH4, mg/L), total phosphorus (TP, mg/L), phosphate (PO4, mg/L), water temperature (WT, degrees Celsius), water transparency (T, NTU), and chemical oxygen demand (COD, mg/L). The Index of Trophic Status (ITS) value served as the target. The Dal Lake dataset contains 349 observations, collected by the National Environmental Agency of India during the monitoring period of 1997–2023. Data were generally collected four times each year (one timepoint per season: spring, summer, autumn, and winter). The Dobczyce Lake dataset contains 50 observations, collected by the Environmental Protection Inspectorate of Poland during the period of 2010–2022. These data were collected at irregular intervals over the year, in different seasons. Data were cleaned and checked for outliers; they contained no missing values (NAs).
Figure 3 shows the measuring point locations for both lakes.
2.4. Methodology for the Assessment of Trophic Status
The selected method for assessing the trophic state is based on the ITS, which belongs to a group of aggregated indices that also includes the Carlson Trophic State Index (TSI) and the Vollenweider TRIX [29,30]; however, these indices use different approaches. Aggregated indices combine the dependencies of separate eutrophication indicators on eutrophication factors, as established by different authors, into a single numerical index that can comprehensively define the trophic status of a body of water. A characteristic feature of the ITS is that it does not use input and output indicators, such as nitrogen and phosphorus forms on one side of the system and chlorophyll on the other side, to assess trophy. The ITS is based on the mutual relationship between pH and the oxygen saturation of water; this reflects the production–decomposition balance of organic matter produced by water vegetation. A shift in the biotic balance leads to changes in the gas levels in waters and, consequently, in the quantitative ratios of oxygen and carbon dioxide concentrations. Thus, a change in the quantitative ratio of these two gases reflects changes in the balance of the production and decomposition processes. In an aquatic environment, the oxygen content can be expressed based on the oxygen saturation of the water, whereas the CO2 content can be expressed by the pH. A change in CO2 concentration leads to a change in pH, which is driven by the carbonate balance, i.e., the proportion of carbonate and bicarbonate ions [31]. The ITS was created originally for the brackish waters of Neva Bay of the Gulf of Finland; however, it has since been applied to different types of waters (salt waters, transitional waters, rivers, and lakes) [23]. The application of the ITS requires a strong correlation between pH and oxygen saturation in the water. The ITS value can be calculated using Equation (1) as follows:
where
pH—pH value;
[DO%]—water oxygen saturation (%), measured simultaneously with pH;
a—slope coefficient of the pH–DO% linear regression;
n—number of measurements.
The trophic state of a body of water can be assessed by the ITS boundary values determined for waters with different trophic states (Table 1).
To use the selected ITS for the assessment of the trophic state of the two studied water bodies, a correlation analysis was conducted to confirm the existence of a linear relationship between pH and water oxygen saturation. In both water bodies, the correlation was very strong, as the Pearson coefficient (r) was above 0.9. The next step was to calculate the regression line to obtain parameter a in the formula. For Dal Lake, a = 0.08. For Dobczyce Lake, a = 0.11.
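The steps described above (checking the pH and oxygen saturation correlation, estimating the slope a, and computing the ITS) can be sketched as follows. This is a minimal Python illustration rather than the authors' R workflow. It assumes the commonly published form of the index, ITS = mean(pH) + a · (100 − mean(DO%)), and assumes that a is the slope of pH regressed on DO%; the function names and sample values are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def regression_slope(x, y):
    """Least-squares slope of y regressed on x (assumed direction: pH on DO%)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def index_of_trophic_status(ph, do_sat, a):
    """Assumed ITS form: mean pH plus a times the oxygen saturation deficit."""
    n = len(ph)
    return sum(ph) / n + a * (100.0 - sum(do_sat) / n)


# Hypothetical paired measurements (not data from the study).
do_sat = [70.0, 80.0, 90.0, 100.0, 110.0]  # oxygen saturation, %
ph = [7.0, 7.5, 8.0, 8.5, 9.0]

r = pearson_r(do_sat, ph)
a = regression_slope(do_sat, ph)
its = index_of_trophic_status(ph, do_sat, a)
print(f"r = {r:.2f}, a = {a:.3f}, ITS = {its:.2f}")
```

The resulting ITS would then be compared with the boundary values in Table 1 to assign a trophic class.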
3. Model Formulation
3.1. Model Assumptions
The reliable prediction of the progression of eutrophication processes is based on the prediction of ITS values under different scenarios of nutrient loading and ecosystem responses (e.g., chlorophyll-a concentrations). The predictive models should be both quantitatively reliable and easily interpretable. The models were evaluated using the following metrics: the coefficient of determination (R²), mean absolute error (MAE), root mean square error (RMSE), percent bias (PBIAS), and Nash–Sutcliffe efficiency (NSE). These metrics are the most commonly used to assess predictive reliability. Moreover, interpretability tools, such as SHAP values, H-statistics, Garson and Olden connection-weight algorithms, partial dependence plots (PDPs), and accumulated local effects (ALE) plots, were employed to understand the impacts and relationships of the parameters within the complex eutrophication models. All analyses were performed in R (CRAN) version 4.5.1 with RStudio version 2025.09.1-401. For analysis purposes, all chosen eutrophication predictors (COD, TN, NH4, TP, PO4, WT, and T) were normalized (z-score normalization) [32]. The Dal Lake data were divided into training and test data at an 80/20 ratio, whereas a ratio of 70/30 was used for the Dobczyce Dam Reservoir data. This normalization method and data division yielded the best final results. We generated 20 random datasets for all models. Training data were used for model architecture selection. Then, the test data were used for model verification and to analyze limiting factors.
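The preprocessing described above can be illustrated with a short Python sketch (the study itself used R). It applies a z-score to each predictor column and then draws a seeded random split. The function names are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state which variant was applied.

```python
import random
from statistics import mean, stdev


def zscore_columns(rows):
    """Standardize each column to zero mean and unit sample standard deviation.

    Assumes no column is constant (a zero standard deviation would divide by zero).
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def train_test_split(rows, train_fraction, seed):
    """Shuffle row indices with a fixed seed and cut at the requested fraction."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    cut = round(train_fraction * len(rows))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


# Hypothetical two-predictor table with 10 observations (not study data).
rows = [[float(i), float(2 * i + 1)] for i in range(10)]
scaled = zscore_columns(rows)
train, test = train_test_split(scaled, train_fraction=0.8, seed=1)
print(len(train), len(test))
```

Repeating the split with 20 different seeds would correspond to the 20 random datasets mentioned in the text, under the assumption that each dataset is an independent random partition.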
For regression, the standard linear model was chosen, with 10-fold cross-validation and 40 repeats. For predictor influence testing, the following methods were chosen: AIC, Leaps, Boruta, and Earth algorithms. Results were evaluated based on R², RMSE, MAE, PBIAS, and NSE.
Regarding ANNs, we chose the NNET and NeuralNet models, with 10-fold cross-validation and 40 repeats. The goal was to identify the best ANN based on R², RMSE, MAE, PBIAS, and NSE values and then test it (10-fold cross-validation, 40 repetitions) using SHAP, Garson and Olden, H-statistics, PDP, and ALE [32].
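Both the regression and the ANN workflows rely on 10-fold cross-validation with 40 repeats. The Python sketch below shows one way such index splits can be generated; it is an illustration only and does not reproduce the R tooling used in the study. The generator name and the default seed are hypothetical.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=40, seed=0):
    """Yield (repeat, fold, train_indices, validation_indices) tuples.

    Each repeat reshuffles the sample indices and partitions them into
    n_folds disjoint validation folds.
    """
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        folds = [indices[f::n_folds] for f in range(n_folds)]
        for f in range(n_folds):
            validation = folds[f]
            train = [i for g in range(n_folds) if g != f for i in folds[g]]
            yield repeat, f, train, validation


# With 50 observations (the size of the Dobczyce dataset), each validation
# fold holds 5 samples and the procedure produces 10 * 40 = 400 model fits.
splits = list(repeated_kfold(50))
print(len(splits), len(splits[0][3]))
```

Averaging a metric over all splits gives a more stable estimate than a single split, which matters most for small datasets.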
NNET is a simple, classic feedforward network with one hidden layer. For the hidden layer, the only activation option is sigmoid. The output activation can be linear, logistic, or softmax. Linear activation was chosen because the other options are intended for binary or categorical data. In the first stage, tests with configurations from 3 to 10 neurons showed that 4 neurons give the best prediction results for Dal Lake, whereas 5 neurons yield the best results for Dobczyce Reservoir. Maxit was set to 600, whereas the convergence was set at 400 [33].
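The architecture described above (one sigmoid hidden layer and a linear output) can be written out as a forward pass. The Python sketch below is a structural illustration only: it does not use or reproduce the nnet package, and all weights and function names are hypothetical.

```python
from math import exp


def sigmoid(z):
    """Logistic activation used in the hidden layer."""
    return 1.0 / (1.0 + exp(-z))


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Single-hidden-layer network: sigmoid hidden units, linear output.

    w_hidden[h][i] is the weight from input i to hidden neuron h.
    """
    hidden = [
        sigmoid(b + sum(w * xi for w, xi in zip(weights, x)))
        for weights, b in zip(w_hidden, b_hidden)
    ]
    return b_out + sum(v * h for v, h in zip(w_out, hidden))


def n_parameters(n_inputs, n_hidden):
    """Count weights and biases of the network, assuming one output neuron."""
    return (n_inputs + 1) * n_hidden + (n_hidden + 1)


# Seven standardized predictors (COD, TN, NH4, TP, PO4, WT, T), four hidden neurons.
x = [0.0] * 7
w_hidden = [[0.0] * 7 for _ in range(4)]
b_hidden = [0.0] * 4
w_out = [1.0, 1.0, 1.0, 1.0]
prediction = forward(x, w_hidden, b_hidden, w_out, b_out=0.0)
print(prediction, n_parameters(7, 4), n_parameters(7, 5))
```

Counting parameters in this way is a quick check of model size against the number of available observations.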
NeuralNet is a feedforward multilayer perceptron (MLP) ANN with up to three hidden layers. Sigmoid or tanh activation functions can be used in the hidden layers, and a sigmoid, tanh, or linear function can be used as the output. The initial search covered 0 to 10 neurons in each layer. The best results were obtained with network configurations of 3-1-4 for Dal Lake and 6-1-1 for Dobczyce Reservoir. The modeling algorithms are shown in Figure 4.
3.2. Model Performance Metrics
The coefficient of determination (R²) quantifies the proportion of variance in the observed eutrophication process described by the model. A higher R² value indicates that the model depicts the major variance in the observed sampling [34]. However, the quantification of the errors and bias in the model cannot be obtained based on R². The R² value can be calculated using Equation (2) as follows:
The mean absolute error (MAE) represents the average absolute difference between the predicted and observed results, with lower values indicating better performance [35]. MAE can be calculated using Equation (3) as follows:
Root mean square error (RMSE) gives an estimate of the average magnitude of predictive errors, with larger errors weighted more heavily. This metric is especially appropriate for eutrophication modeling given the frequent variability in nutrient concentrations and algal biomass over time. Lower RMSE values indicate closer agreement between the predicted and actual trophic levels [36]. RMSE can be calculated using Equation (4) as follows:
The percent bias (PBIAS) measures the average tendency of the model to over- or underpredict concentrations. A positive PBIAS value indicates an underestimation of nutrient loading, whereas a negative PBIAS value indicates overestimation [37]. The PBIAS can be calculated using Equation (5) as follows:
The residual variance of the predicted values relative to the variance of the observed values is expressed using the Nash–Sutcliffe Efficiency (NSE). An NSE equal to 1 indicates that the model accurately follows the observed eutrophication patterns. An NSE below 0 indicates poor prediction, i.e., the model performs worse than the mean of the observed values [36]. The NSE value can be calculated using Equation (6) as follows:
All equations are expressed in terms of the number of observations, the observed values, the predicted values, and the mean of the observed values.
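The five metrics can be sketched in Python using their standard textbook definitions. This is an illustration, not the authors' code. Two points are assumptions: R² is computed here as the squared Pearson correlation between observed and predicted values (other definitions exist), and PBIAS uses the observed-minus-predicted convention, which matches the sign interpretation given above. Function names and sample values are hypothetical.

```python
from math import sqrt


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def rmse(obs, pred):
    """Root mean square error."""
    return sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def pbias(obs, pred):
    """Percent bias; positive values mean the model underestimates on average."""
    return 100.0 * sum(o - p for o, p in zip(obs, pred)) / sum(obs)


def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 minus residual variance over observed variance."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def r_squared(obs, pred):
    """Squared Pearson correlation between observed and predicted values (assumed definition)."""
    mo, mp = sum(obs) / len(obs), sum(pred) / len(pred)
    sxy = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    sxx = sum((o - mo) ** 2 for o in obs)
    syy = sum((p - mp) ** 2 for p in pred)
    return sxy ** 2 / (sxx * syy)


# Hypothetical observed and predicted ITS values (not study data).
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [1.0, 5.0, 5.0, 9.0]
for name, fn in [("MAE", mae), ("RMSE", rmse), ("PBIAS", pbias), ("NSE", nse), ("R2", r_squared)]:
    print(name, round(fn(observed, predicted), 3))
```

Note that R² computed this way is insensitive to systematic bias, which is why it is reported alongside PBIAS and NSE.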
3.3. Model Interpretability Methods
Various machine learning techniques, such as random forest, gradient boosting, and neural networks, are increasingly used to predict eutrophication development in surface waters in response to various factors. Several techniques are used to interpret the results obtained with these models. Shapley Additive Explanations (SHAP) provide a theoretically grounded technique that measures the contribution of each predictor variable (positive or negative) and describes the influence of factors, such as temperature, nitrogen, and phosphorus, on the trophic state dynamics [38]. The H-statistic quantifies the strength of the interaction effects that occur when several factors (e.g., temperature and nutrient load) act together [39]. In neural network models, the Garson and Olden methods analyze the connection weights across the network layers to estimate the contribution of each input variable to the outcome variable. The Olden approach captures both the positive and negative influence of each input variable, explaining how each variable affects variation in the predicted eutrophication [40]. Global visualization tools like partial dependence plots (PDPs) show the average marginal effect of one or two variables on predicted eutrophication levels [41]. Accumulated local effect (ALE) plots extend this approach by accounting for the correlations between predictors, which is essential because the nutrient and hydrological variables are closely related to each other.
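The connection-weight idea behind the Garson and Olden methods can be illustrated for a network with one hidden layer and one output. The Python sketch below uses a commonly cited formulation: Olden sums the signed products of input-to-hidden and hidden-to-output weights, whereas Garson partitions each hidden neuron's absolute output weight among the inputs in proportion to their absolute input weights. Published variants differ in detail, bias terms are ignored here, and the function names and weights are hypothetical.

```python
def olden_importance(w_in, w_out):
    """Signed importance: sum over hidden neurons of w_in[h][i] * w_out[h]."""
    n_inputs = len(w_in[0])
    return [
        sum(w_in[h][i] * w_out[h] for h in range(len(w_out)))
        for i in range(n_inputs)
    ]


def garson_importance(w_in, w_out):
    """Relative importance in [0, 1]; magnitudes only, values sum to 1."""
    n_inputs = len(w_in[0])
    contribution = [0.0] * n_inputs
    for weights, v in zip(w_in, w_out):
        total = sum(abs(w) for w in weights)
        for i, w in enumerate(weights):
            # Share of this hidden neuron's output weight attributed to input i.
            contribution[i] += abs(w) / total * abs(v)
    overall = sum(contribution)
    return [c / overall for c in contribution]


# Hypothetical network: 2 inputs, 2 hidden neurons, 1 output.
w_in = [[1.0, -1.0], [2.0, 0.0]]  # w_in[h][i]: input i -> hidden neuron h
w_out = [1.0, -2.0]               # hidden neuron h -> output

print(olden_importance(w_in, w_out))
print(garson_importance(w_in, w_out))
```

In this example both methods rank the first input highest, but only the Olden values show that its net influence is negative.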
4. Results
4.1. Dal Lake
4.1.1. Correlation and Regression Analysis
Correlation analysis was conducted to analyze the relationship of the ITS with all water quality parameters in Dal Lake (Figure 5). Pearson linear correlation analysis with pairwise complete observations showed that the trophic state, expressed by the ITS value, mostly depends on factors such as water transparency, total phosphorus, and total nitrogen, including their mineral forms, as evidenced by the strength of the correlations obtained. An interesting observation is the negative relationship between ITS and phosphorus. This usually occurs when areas experience algal blooms, which affect the phosphorus–chlorophyll-a relationship [42]. Similarly, in the case of COD, the relationship with the trophic state can vary [43]. In some cases, COD does not directly influence eutrophication, while in other situations it can be a significant factor and may be used as a component of the Index of Trophic Status [44].
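"Pairwise complete observations" means that each correlation is computed from the rows in which both variables are present. A minimal Python sketch of this idea follows; it is an illustration rather than the R call used in the study, and the function name and sample values are hypothetical.

```python
from math import sqrt


def pearson_pairwise_complete(x, y):
    """Pearson r using only positions where both values are present (not None)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / sqrt(sxx * syy)


# Hypothetical series with gaps; only three positions are complete for both.
its = [1.0, 2.0, None, 4.0, 5.0]
tp = [10.0, 8.0, 6.0, None, 2.0]
print(round(pearson_pairwise_complete(its, tp), 3))
```

Because different variable pairs may use different subsets of rows, the resulting coefficients are not all based on the same sample size.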
Regression analysis was also applied to all chosen eutrophication factors to identify the primary limiting factors of eutrophication in Dal Lake. The Boruta, AIC, Leaps, and Earth algorithms consistently selected the same predictors in every model outcome. Total phosphorus (TP), followed by transparency and total nitrogen (TN), were the main limiting factors of eutrophication, as shown in Figure 6.
4.1.2. Neural Network (NNET) Model
The best NNET model for Dal Lake, selected from a variety of models with 3 to 10 neurons, is a 4-neuron model with linear output. The boxplot illustrates the distribution of the neural network weights assigned to the various eutrophication factors in Dal Lake. Each box shows the central tendency and variability of the weights for a variable, with the whiskers and outliers showing the range and extreme observed values, respectively. In the case of Dal Lake, the weights associated with the T, PO4, TN, and TP parameters show the greatest variability, indicating that these variables potentially had a greater influence in the neural network modeling (Figure 7).
SHAP phi values
SHAP phi values for the Dal Lake NNET model are largely clustered near zero, with some outliers for COD and WT, indicating a generally uniform and moderate influence and relatively low variability in the contribution of individual parameters (Figure 8).
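To make the meaning of a phi value concrete, the Python sketch below computes exact Shapley values for a single prediction by enumerating all feature subsets, replacing "absent" features with a baseline value. This is a simplified illustration: the baseline-replacement scheme, the function names, and the toy linear model are assumptions, and the R tooling used in the study typically approximates these values by sampling instead.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance x relative to a baseline instance.

    Feasible only for a small number of features (2**n model evaluations).
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                without_i = [x[j] if j in present else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def toy_model(v):
    """Hypothetical linear model standing in for a fitted network."""
    return 2.0 * v[0] - 1.0 * v[1] + 0.5 * v[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi, sum(phi), toy_model(x) - toy_model(baseline))
```

The phi values sum to the difference between the prediction for x and the prediction for the baseline, so values clustered near zero indicate predictions that stay close to the expected value.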
H-Statistics
Figure 9 shows the distribution of the H-statistic values of eutrophication factors in Dal Lake, as computed using a neural network model. The H-statistic quantifies the strength of the interaction between the parameters, with greater values indicating a stronger interaction. All parameters (COD, NH4, PO4, T, TN, TP, and WT) show median H-statistic values close to zero, indicating low to moderate interactions. However, COD and NH4 exhibit high outliers.
Garson and Olden Importance Analysis
The Garson and Olden analyses were employed to evaluate the relative importance of input parameters in the neural network (NNET) model for Dal Lake. According to the Garson method, total nitrogen (TN) and total phosphorus (TP) were identified as the primary limiting factors, demonstrating a strong contribution to the prediction of eutrophication development. The Olden method yields a similar pattern but provides directional insight, representing both the positive and negative influences of the predictors (Figure 10).
4.1.3. Neural Net (NeuralNet) Model
In the NeuralNet model, a set of three-layer networks was proposed, with 0 to 10 neurons per layer and different activation functions. For Dal Lake, the best results were obtained with a model with three neurons in the first layer, one in the second, and four in the third layer, with a linear output and sigmoid activation in the hidden layers.
Figure 11 shows the distribution of the connecting weights assigned to the various quality parameters by the NeuralNet model for Dal Lake. The parameters COD and PO4 show clear dispersion, with both negative and positive weights, indicating that these parameters are essential drivers of model behavior. Other variables with a narrower distribution and weight values near zero, such as T, TN, and WT, show more stable and less variable contributions. These variables have a relatively consistent but moderate influence on model prediction.
SHAP phi values
Figure 12 illustrates the SHAP phi values for each variable modeled using the NeuralNet algorithm for Dal Lake, providing a quantitative assessment of each variable. The SHAP phi metric indicates the extent to which each parameter moves predictions away from the expected value. COD, NH4, PO4, TN, TP and WT show distributions close to zero with a narrow range, indicating that the influence on model output is generally small and consistent. Only a few outliers are observed, indicating the limited cases where a few parameters have a stronger influence.
H-Statistics
Figure 13 presents the distribution of the H-statistic values of different eutrophication factors in Dal Lake as obtained using the NeuralNet model. The H-statistic measures the strength of the relationship between variables. Stronger interactions are indicated by higher H-statistic values. Among all the parameters, TP and TN exhibit the highest median H-statistics, indicating a greater influence on the model prediction. The results for Dal Lake revealed stronger interactions, with NH4, WT, and T significantly influencing model prediction.
4.2. Dobczyce Dam Reservoir
4.2.1. Correlation and Regression Analysis
Pearson linear correlation analysis with pairwise complete observations was applied to all the parameters.
Figure 14 shows the correlation plot for all parameters in the Dobczyce Reservoir. The situation here differs from that in Dal Lake. Specifically, ITS values are highly correlated with temperature, COD, and transparency, but nitrogen forms do not directly influence eutrophication development. As in Dal Lake, phosphorus and phosphates are negatively correlated with the ITS.
After predictor analysis with Boruta, AIC, Leaps, and Earth, the most influential parameters were not clearly identified. However, the most commonly selected variables included phosphorus forms, total nitrogen, transparency, and COD (Figure 15).
4.2.2. Neural Network (NNET) Model
For the neural network (NNET) model, the best-performing configuration for Dobczyce Reservoir was obtained using a five-neuron architecture with a linear output. The boxplot shown in Figure 16 highlights the distribution of the neural network weights assigned to the various water quality parameters. Each box represents the central tendency and variability of the weights for a given parameter, while the whiskers and outliers indicate the range and extreme observed values.
In Dobczyce Dam Reservoir, the weights are much more diverse. Weights for TP and TN exhibit higher values than those of other parameters, whereas transparency exhibits the largest negative influence (Figure 16).
SHAP phi values
In Dobczyce Lake, the SHAP phi results are generally similar to those for Dal Lake, with values concentrated near the zero line and a few outliers for COD and WT (Figure 17). However, the phi values for Dobczyce Lake show greater variability with more prominent outliers, indicating a more dynamic impact of these variables on model prediction.
H-Statistics
H-statistic analysis using the neural network (NNET) model for Dobczyce Reservoir revealed weak interactions among the parameters. Most of the parameters show median H-statistic values between 0.1 and 0.2, suggesting a predominantly additive influence with limited interdependence. In contrast, slightly stronger interactions were obtained for water temperature and nutrients (PO4, TN, and TP) (Figure 18).
Garson and Olden Importance Analysis
For Dobczyce Reservoir, the Garson method identified chemical oxygen demand (COD), total nitrogen (TN), and total phosphorus (TP) as the most influential predictors with higher contributions to the model. In contrast, water temperature (WT), PO4, and NH4 demonstrated comparatively lower contributions. The Olden method, which considers both the magnitude and direction of influence, showed both positive and negative contributions from COD, TP, and TN, which exhibit greater influence patterns (Figure 19).
4.2.3. NeuralNet Model
The best result was obtained for Dobczyce Reservoir using the NeuralNet model with six neurons in the first layer, one neuron in the second layer, and one neuron in the third layer, with the same activation functions as for Dal Lake. The spread of the weights shows the extent to which each parameter can affect the neural network predictions. Specifically, a wider spread of the weights signifies a greater effect, whereas a narrow range around zero indicates a more stable and consistent influence. Among all the parameters, COD and NH4 exhibit a broad distribution with more positive outliers, indicating that these factors play a critical role in shaping the model results. Thus, the overall results highlight that COD and NH4 are key factors contributing to the Dobczyce Reservoir model (Figure 20).
SHAP phi values
For Dobczyce Lake, most of the parameters clustered near zero, whereas NH4 and TN exhibited slightly wider spreads with frequent positive outliers (Figure 21).
H-Statistics
The H-statistic values presented in Figure 22 for Dobczyce Reservoir show the variation in the strength of the interactions among the water quality parameters. NH4, WT, and T show the highest median H-statistic values, indicating a greater influence on the model. However, PO4 and COD display lower H-statistic values, indicating weaker interactions with the other parameters. In general, temperature and nutrient-related parameters play a more essential role in affecting model behavior than COD and phosphate.
5. Discussion
The model evaluation results comparing eutrophication predictions in two freshwater ecosystems, namely, Dal Lake in India and Dobczyce Dam Reservoir in Poland, are summarized in Table 2.
Table 2 shows that the ITS prediction results are not perfect. The phi, H-statistic, and Garson and Olden values also reveal difficulties in prediction, which is a widespread problem for scenarios with limited data [19,22]. The main utility of the presented models, however, lies in the possibility of identifying the primary limiting factors of eutrophication development and determining their impact ranking based on the specific individual characteristics of water ecosystems. Regression models using AIC, Leaps, Boruta, and Earth algorithms indicate that the key factor of eutrophication in Dal Lake is TP along with its mineral forms. In Dobczyce Reservoir, eutrophication is driven by a multifactorial influence of biogenic matter (PO4, TN, TP) and reinforced by temperature (WT).
The influence of single factors identified as predictors in NNET was assessed with DALEX [45], using a combination of PDP/ALE methods. Figure 23 and Figure 24 show analysis results for Dal Lake and Dobczyce Lake for models created with NNET.
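The partial dependence idea behind these plots can be sketched in a few lines of Python. For each grid value of one predictor, that predictor is overwritten in every observation and the model predictions are averaged. This is an illustration only: it does not use DALEX, it does not implement ALE, and the function name, toy model, and data are hypothetical.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with one feature fixed at each grid value."""
    curve = []
    for value in grid:
        predictions = []
        for row in rows:
            modified = list(row)
            modified[feature] = value  # overwrite the feature of interest
            predictions.append(predict(modified))
        curve.append(sum(predictions) / len(predictions))
    return curve


def toy_model(v):
    """Hypothetical model: linear in feature 0, quadratic in feature 1."""
    return 3.0 * v[0] + v[1] ** 2


rows = [[0.0, 1.0], [0.0, 3.0]]
curve = partial_dependence(toy_model, rows, feature=0, grid=[0.0, 1.0, 2.0])
print(curve)
```

A straight, steep curve corresponds to the "strong, linear" influence described for some predictors, whereas a flat curve indicates a stable factor with little effect on the predicted ITS.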
For Dal Lake NNET models, NH4 and PO4 represent stable factors. COD and WT have strong effects, but only at their highest values. These findings are consistent with the studies by Rather and Dar, where COD was one of the most influential factors for trophic change [6]. TN exhibits a constant and almost linear influence on ITS values.
For Dobczyce Lake, the most influential variable is WT, which exhibits a strong, linear influence on ITS values. The remaining variables (NH4, PO4) show a weaker impact. The main role of temperature (WT) may be due to various reasons. Some authors attribute it to the relationship between temperature and sediments. In the case of Dobczyce Lake, this is very likely given the large amount of sediment brought into the reservoir [46,47]. Several studies directly show the significant influence of temperature on the entire eutrophication process, demonstrating a direct effect on nitrogen and phosphorus compounds [48,49,50]. These findings could also indicate that the eutrophication process is mainly dependent on temperature under low nutrient conditions, which result from the strict protection of the Dobczyce Dam Reservoir.
Figure 25 and Figure 26 show the DALEX results for the NeuralNet models of Dal Lake and Dobczyce Reservoir.
The analysis shows an even weaker influence of COD, NH4, and PO4 on the results. As in the NNET results, a strong, dominant influence of TN is observed. The studies of Hassan et al. show that nitrogen is present in Dal Lake sediments at greater levels than phosphorus, potentially explaining our results [51]. For Dal Lake, we assume that the most important limiting factor of eutrophication is TN along with its mineral forms.
Analysis of the inputs for the NeuralNet model for Dobczyce Reservoir also shows a dominant impact of temperature and a decreasing impact of phosphorus. This situation is rather specific but has been previously reported [52,53]. In this case, it may be explained by the fact that Dobczyce Reservoir supplies water for Krakow and nearby settlements and is subject to restrictive protection with a ban on all types of use, with the exception of those related to water supply.
Ultimately, methods such as SHAP and H-statistics were not effective in selecting the variable that most influences eutrophication. These metrics are quite sensitive to the amount of data [
54]. The Garson and Olden statistics showed variation among predictors, but they could not be applied to the NeuralNet models because the algorithm had insufficient data to run properly. Our results are similar to those of other authors using ANNs with small datasets: most are not satisfactory in terms of prediction. However, with cross-validation, one can obtain results that are satisfactory for determining the significance of variables [
22,
55,
56]. The above-mentioned authors generated models using 89, 168 and 96 samples, respectively. Thus, the amount of data from Lake Dobczyce seems to be insufficient. A summary of the modeling results is presented in
Table 3.
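To make the two weight-based statistics mentioned above concrete, the following is a minimal sketch of Garson's algorithm and Olden's connection-weights measure for a network with one hidden layer and a single output. The weight values and their assignment to WT, NH4, and PO4 are hypothetical and chosen only for illustration; they are not the weights of the models fitted in this study, and bias terms are ignored.

```python
def garson_importance(w_in, w_out):
    """Relative importance of each input (Garson's algorithm).

    w_in[i][j]: weight from input i to hidden neuron j.
    w_out[j]:   weight from hidden neuron j to the single output.
    Returns non-negative shares that sum to 1.
    """
    n_inputs, n_hidden = len(w_in), len(w_out)
    totals = [0.0] * n_inputs
    for j in range(n_hidden):
        # Absolute contribution of each input routed through hidden neuron j.
        contrib = [abs(w_in[i][j]) * abs(w_out[j]) for i in range(n_inputs)]
        denom = sum(contrib)
        if denom == 0.0:
            continue  # a dead hidden neuron contributes nothing
        for i in range(n_inputs):
            totals[i] += contrib[i] / denom
    grand = sum(totals)
    return [t / grand for t in totals]


def olden_importance(w_in, w_out):
    """Signed importance of each input (Olden's connection weights)."""
    n_hidden = len(w_out)
    return [sum(row[j] * w_out[j] for j in range(n_hidden)) for row in w_in]


# Hypothetical weights: 3 inputs (WT, NH4, PO4), 2 hidden neurons.
names = ["WT", "NH4", "PO4"]
w_in = [[2.0, -1.5],
        [0.5, 0.3],
        [-0.2, 0.4]]
w_out = [1.0, -0.8]

garson = garson_importance(w_in, w_out)
olden = olden_importance(w_in, w_out)
for name, g, o in zip(names, garson, olden):
    print(f"{name}: Garson share = {g:.3f}, Olden weight = {o:.3f}")
```

Garson's measure uses absolute weights and therefore reports only the magnitude of each input's contribution, whereas Olden's measure keeps the sign and so also indicates the direction of the effect.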
6. Conclusions
The purpose of this study was to evaluate the potential use of ANNs for eutrophication development predictions under different scenarios. It is well established that the functioning of any ecosystem, including aquatic ecosystems, is controlled not by the full range of factors but by a few key limiting factors. Thus, efficient water protection strategies aimed at mitigating eutrophication and its negative consequences must be based primarily on the management of these factors. These key factors also provide the basis for the development of prognostic models of eutrophication under different conditions, which are essential for addressing practical and engineering challenges in water resource management.
One method for defining such factors is the application of ANNs. ANNs are used for prediction, ideally with large datasets. ANNs can also be used in situations with limited data, but the prediction results will be less accurate. In cases with limited data, it is very important to determine the optimal network structure in terms of the number of layers, neurons, and activations, as well as the use of cross-validation and repetitions.
Even if the networks do not provide accurate predictions, they can still be a source of valuable information on the causes of poor trophic status. This information could be used to indicate the key water quality parameters responsible for a deteriorated trophic status.
The strategy for identifying which variable is most important for eutrophication processes should involve creating different models, employing cross-validation, and evaluating them multiple times [
22]. This strategy will allow for obtaining a model that provides stable results that can be considered dependable.
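The strategy described above can be sketched as follows, under stated assumptions. The example uses repeated k-fold cross-validation and, in each fold, a permutation-based importance (the increase in held-out error after shuffling one predictor). To remain self-contained, it replaces the ANN with a simple least-squares model and uses synthetic data; the function names, the data, and the choice of five folds and ten repeats are illustrative assumptions, not the configuration used in this study.

```python
import random


def fit_linear(X, y):
    """Least-squares fit with intercept (stand-in for the ANN; an assumption)."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Augmented normal equations [A^T A | A^T y].
    M = [[sum(r[a] * r[b] for r in rows) for b in range(p)]
         + [sum(r[a] * t for r, t in zip(rows, y))] for a in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, p), key=lambda q: abs(M[q][c]))
        M[c], M[piv] = M[piv], M[c]
        for q in range(p):
            if q != c:
                f = M[q][c] / M[c][c]
                M[q] = [u - f * v for u, v in zip(M[q], M[c])]
    return [M[a][p] / M[a][a] for a in range(p)]


def predict(beta, X):
    return [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in X]


def mse(y, yhat):
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)


def repeated_cv_importance(X, y, k=5, repeats=10, seed=0):
    """Return (share of fits in which each variable ranks first,
    mean increase in held-out MSE after permuting each variable)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    top_counts, total_inc, n_fits = [0] * p, [0.0] * p, 0
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)  # a new random partition for every repeat
        for test in (idx[f::k] for f in range(k)):
            held = set(test)
            train = [i for i in idx if i not in held]
            beta = fit_linear([X[i] for i in train], [y[i] for i in train])
            Xt = [list(X[i]) for i in test]
            yt = [y[i] for i in test]
            base = mse(yt, predict(beta, Xt))
            inc = []
            for j in range(p):
                col = [r[j] for r in Xt]
                rng.shuffle(col)  # break the link between variable j and y
                Xp = [r[:j] + [c] + r[j + 1:] for r, c in zip(Xt, col)]
                inc.append(mse(yt, predict(beta, Xp)) - base)
            top_counts[max(range(p), key=lambda j: inc[j])] += 1
            total_inc = [t + d for t, d in zip(total_inc, inc)]
            n_fits += 1
    return [c / n_fits for c in top_counts], [t / n_fits for t in total_inc]


# Synthetic data (hypothetical): the first predictor dominates by construction.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(60)]
y = [3.0 * r[0] + 0.5 * r[1] + 0.1 * r[2] + data_rng.gauss(0, 0.5) for r in X]

top_share, mean_increase = repeated_cv_importance(X, y)
print("share of fits ranked first:", top_share)
print("mean increase in MSE:", mean_increase)
```

A variable that ranks first in nearly all folds and repeats can be regarded as a stable result, whereas a ranking that changes between partitions suggests that the dataset is too small to support a conclusion.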
The applied and verified methodology for assessing trophic status, together with the use of machine learning to assess the main causes of eutrophication, has a wide range of potential applications: (1) these models enable a quick and precise assessment of the trophic status of aquatic ecosystems, even when monitoring is limited or not systematic; (2) they allow for the identification of key factors/drivers of the eutrophication process in specific water ecosystems, which simultaneously allows for the identification of their sources; (3) they serve as the basis for the formulation of prognostic models of eutrophication development; (4) on this basis, it is possible to prioritize protective measures that focus on managing the limiting factors and mitigating their impact on eutrophication processes.
The limitations of the proposed methodology are associated with the need to build long-term datasets based on regular measurements of nutrient concentrations and other water quality parameters, as well as data on accompanying factors (precipitation, temperature, flows, land use). This, in turn, requires organizing a monitoring system dedicated specifically to eutrophication, which is costly. Limitations may also include the need for sufficient data, the risk of overfitting, low interpretability, poor performance under new environmental conditions, difficulty in modeling biogeochemical processes, complex calibration, and high computational costs.