Application of Machine Learning in Modeling the Relationship between Catchment Attributes and Instream Water Quality in Data-Scarce Regions

This research delves into the efficacy of machine learning models in predicting water quality parameters within a catchment area, focusing on unraveling the significance of individual input variables. In order to manage water quality, it is necessary to determine the relationship between the physical attributes of the catchment, such as geological permeability and hydrologic soil groups, and in-stream water quality parameters. Water quality data were acquired from the Iran Water Resource Management Company (WRMC) through monthly sampling. For statistical analysis, the study utilized 5-year means (1998–2002) of water quality data. A total of 88 final stations were included in the analysis. Using machine learning methods, the paper gives relations for 11 in-stream water quality parameters: Sodium Adsorption Ratio (SAR), Na+, Mg2+, Ca2+, SO42−, Cl−, HCO3−, K+, pH, conductivity (EC), and Total Dissolved Solids (TDS). To comprehensively evaluate model performance, the study employs diverse metrics, including Pearson’s Linear Correlation Coefficient (R) and the mean absolute percentage error (MAPE). Notably, the Random Forest (RF) model emerges as the standout model across various water parameters. Integrating research outcomes enables targeted strategies for fostering environmental sustainability, contributing to the broader goal of cultivating resilient water ecosystems. As a practical pathway toward achieving a delicate balance between human activities and environmental preservation, this research actively contributes to sustainable water ecosystems.


Introduction
River water quality plays a crucial role in ensuring the sustainability and health of freshwater ecosystems. Traditional monitoring methods often have spatial and temporal coverage limitations, leading to difficulties in effectively assessing and managing water quality [1]. However, recent developments in machine learning techniques have indicated the potential to predict water quality accurately based on catchment characteristics [1,2]. One of the underlying reasons for the growing interest in applying machine learning techniques to predict river water quality is the ability to simultaneously consider a wide range of catchment characteristics that affect water quality, including land use, soil properties, climate data, topography, and hydrological features.
The development of interpretable machine learning techniques that provide insight into the model's decision-making process and highlight the most influential catchment characteristics would improve the reliability and usability of predictive models [25].
Addressing these gaps requires interdisciplinary collaboration among hydrologists, ecologists, data scientists, policymakers, and other relevant stakeholders. Furthermore, focusing on data-driven approaches, data-sharing initiatives, and advances in computational methods will be critical to advancing the field and harnessing the full potential of machine learning in predicting river water quality from catchment characteristics. In this study, we investigate how spatial variations in catchment characteristics can be used to explain river water quality using machine learning techniques.
This paper explores the relationship between catchment attributes and in-stream water quality parameters using machine learning methods. It evaluates model accuracy with RMSE, MAE, R, and MAPE; identifies optimal models for 11 parameters; and determines the most influential variables.
The predictive models developed for each water parameter demonstrate strong performance in most cases. The significant variables identified provide insights into the key factors influencing water quality in the studied catchment. This research, therefore, serves as a catalyst for a nuanced and effective approach to water resource management, underpinned by the empirical foundation laid by the predictive models and the discerned influential variables. Integrating these findings into decision-making processes holds the potential to optimize resource allocation, mitigate environmental impacts, and ultimately contribute to the overarching goal of achieving sustainable and resilient water ecosystems. The study highlights the potential of artificial intelligence for quick and accurate water quality assessment, tailored to watershed attributes.

Materials and Methods
To establish relationships between catchment attributes and water quality parameters, machine learning methods were employed. Multiple algorithms, including regression trees, TreeBagger, Random Forests, and Gaussian process regression (GPR), were applied to construct predictive models for each water quality parameter.

Regression Tree Models
The fundamental concept behind regression trees is to partition the input space into distinct regions and assign predictive values to these regions. This segmentation enables the model to make predictions based on the most relevant conditions and characteristics of the data. A regression tree (RT) is a simple and comprehensible machine learning model applicable to both regression and classification problems. It follows a tree-like structure composed of nodes and branches. Figure 1 illustrates the partitioning of an input space into distinct regions and the representation of a 3D regression surface within a regression tree [26].
Each node corresponds to a specific condition related to the input data, and this condition is evaluated at each node as the data progresses through the tree.
To predict an outcome for a given input, the starting point is the root node of the tree (Figure 1). Here, the initial condition associated with the input feature(s) is evaluated. Depending on whether this condition is true or false, a branch is followed to the next node. This process repeats recursively until a leaf node is reached. The value stored at the leaf node serves as the prediction for the input instance; for regression tasks, this value is typically numeric.
As the tree is traversed, the input space is progressively refined. Initially, all instances belong to a single set represented by the root node. As the algorithm progresses, the input space is divided into smaller subsets, based on conditions that tailor predictions to different regions within the input space.
Constructing a regression tree involves determining the optimal split variable ("j") and split point ("s") to partition the input space effectively. These are chosen by minimizing a specific expression (Equation (1)) over all input features. The goal is to minimize the sum of squared differences between observed and predicted values in the resulting regions [27][28][29].
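The minimized expression is not reproduced in this extraction; a standard form of the CART least-squares split criterion consistent with the description above (the paper's exact notation may differ) is:

```latex
\min_{j,\,s}\left[\;\min_{c_1}\sum_{x_i \in R_1(j,s)}(y_i - c_1)^2 \;+\; \min_{c_2}\sum_{x_i \in R_2(j,s)}(y_i - c_2)^2\right],
\qquad (1)
```

where R1(j,s) = {x | x_j ≤ s} and R2(j,s) = {x | x_j > s} are the two resulting regions, and the inner minima are attained by taking each c_m as the mean of the observed y_i within region R_m.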
Once "j" and "s" are identified, tree building continues by iteratively dividing the resulting regions. This procedure is referred to as a "greedy approach" because it prioritizes local optimality at each step. The binary recursive partitioning divides the input space into non-overlapping regions, each characterized by its mean response value.
The depth of a regression tree serves as a critical factor in preventing overfitting (too much detail) or underfitting (too simplistic).
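The paper does not include code; the following minimal scikit-learn sketch (synthetic data and parameter values are invented for illustration, not taken from the study) shows how leaf size controls the depth/complexity trade-off described above:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(0, 0.1, size=200)

# Large leaves give a shallow, coarse tree (risk of underfitting);
# tiny leaves give a deep tree that can memorize the training data (overfitting).
shallow = DecisionTreeRegressor(min_samples_leaf=50, random_state=0).fit(X, y)
deep = DecisionTreeRegressor(min_samples_leaf=1, random_state=0).fit(X, y)

err_shallow = float(np.mean((shallow.predict(X) - y) ** 2))  # training MSE
err_deep = float(np.mean((deep.predict(X) - y) ** 2))
```

The deep tree's near-zero training error does not imply better generalization, which is why the tree depth and minimum leaf size must be tuned, as discussed in the Results.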

Ensembles of Regression Trees: Bagging, Random Forest, and Boosted Trees
Bagging is an ensemble method that creates multiple subsets of the training dataset through random sampling with replacement (bootstrap samples).
The process begins with the creation of multiple bootstrap samples from the original dataset. Bootstrap sampling randomly selects data points from the dataset with replacement, so the same point can be selected multiple times while others may not be selected at all. In this way, subsets of the same size as the original dataset are formed and used to train the models.
Each subset is used to train a separate regression tree, and their predictions are aggregated to form the final prediction (Figure 2). Bagging reduces variance by averaging predictions from multiple models, making it particularly effective when the base models are unstable or prone to overfitting [27][28][29].
In bagging, diverse training sets are generated by repeated sampling with replacement from the original dataset. The primary goal is to reduce the variance of the model's predictions by aggregating results across these subsets: each subset contributes to the final prediction, and averaging over multiple models enhances robustness and predictive accuracy.
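The bootstrap-and-average procedure just described can be sketched directly; this is an illustrative implementation on invented synthetic data, not the paper's TreeBagger setup:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X.ravel()) + rng.normal(0, 0.3, size=300)

n_trees, n = 25, len(X)
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw n indices with replacement, so some points
    # repeat while others are left out of this subset entirely.
    idx = rng.integers(0, n, size=n)
    trees.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))

# The bagged prediction is the average of the individual tree predictions.
X_new = np.linspace(0, 10, 50).reshape(-1, 1)
bagged_pred = np.mean([t.predict(X_new) for t in trees], axis=0)
```

In practice the same behavior is available via `sklearn.ensemble.BaggingRegressor`; the explicit loop is shown only to mirror the description in the text.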
Random Forests, a variant of bagging, introduce additional diversity among the constituent models of the ensemble. Multiple regression trees are created, each trained on a distinct bootstrap sample of the data, and at each split only a randomly selected subset of the available features is considered. This decorrelates the individual trees, further reducing variance. The final prediction is generated by aggregating the predictions of the decorrelated trees, yielding a robust, high-performing model [26][27][28][29].
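As an illustration of the feature-subsetting idea (synthetic data and hyperparameter values are invented, not the study's configuration), a Random Forest in scikit-learn exposes this through `max_features`:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))                             # e.g. eight catchment attributes
y = 2.0 * X[:, 0] + X[:, 3] + rng.normal(0, 0.1, size=200)

# max_features restricts the predictors considered at each split, which
# decorrelates the trees; the forest averages their predictions.
rf = RandomForestRegressor(n_estimators=200, max_features=3,
                           oob_score=True, random_state=0).fit(X, y)

importances = rf.feature_importances_                     # per-attribute influence
```

The `feature_importances_` attribute is one way such a model can rank influential catchment attributes, analogous to the variable-importance analysis reported in this paper.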
Boosted trees are trained sequentially, and within this paradigm gradient boosting stands out as a widely used technique for enhancing overall model performance (Figure 3). In gradient boosting, submodels are introduced iteratively, each chosen for its capacity to estimate the residuals (errors) of the preceding model in the sequence. By focusing on minimizing these residuals, gradient boosting ensures that each new submodel corrects the errors left by its predecessors, producing a robust and adaptive ensemble. This iterative refinement makes the final ensemble proficient at capturing intricate patterns and nuances in the data and yields highly accurate predictions.
Figure 2. Creating regression tree ensembles using the bagging approach [30].
Figure 3. The application of gradient boosting within regression tree ensembles [26].
The fundamental concept is rooted in gradient-based optimization: the current solution is refined by adding a vector proportional to the negative gradient of the function being minimized [31][32][33]. This is logical because the negative gradient points in the direction in which the function decreases. When applied with a quadratic error function, each subsequent model aims to correct the discrepancies left by its predecessors, essentially reinforcing and improving the model with a focus on the residual errors from earlier stages.
In gradient-boosted trees, the learning rate, often denoted "lambda" (λ), is a crucial hyperparameter that controls the contribution of each tree to the final prediction. It determines how quickly the model adapts to the errors of the previous trees during boosting. A lower learning rate means the model adjusts more gradually and may require a larger number of trees to achieve the same accuracy, while a larger learning rate leads to faster adaptation but risks overfitting with too few trees.
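A minimal sketch of this learning-rate effect (invented synthetic data; not the authors' model configuration) with scikit-learn:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(300, 2))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(0, 0.1, size=300)

# With the number of trees held fixed, a smaller learning rate (lambda)
# shrinks each tree's contribution, so the ensemble adapts more slowly.
slow = GradientBoostingRegressor(n_estimators=50, learning_rate=0.01,
                                 random_state=0).fit(X, y)
fast = GradientBoostingRegressor(n_estimators=50, learning_rate=0.3,
                                 random_state=0).fit(X, y)

train_err_slow = float(np.mean((slow.predict(X) - y) ** 2))
train_err_fast = float(np.mean((fast.predict(X) - y) ** 2))
```

At a fixed ensemble size, the low-λ model remains far from the training data while the high-λ model fits it quickly, which is exactly the trade-off between the number of trees and the learning rate described above.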
Boosted tree (BT) models are computationally more demanding because their trees must be trained sequentially, whereas RT and RF models can be trained in parallel.

Gaussian Process Regression (GPR)
Gaussian processes provide a probabilistic framework for modeling functions, capturing uncertainties, and making predictions in regression tasks. The choice of covariance function and hyperparameters allows flexibility in modeling relationships among variables [34].
Gaussian process modeling estimates an unknown function f(·) in nonlinear regression problems. It assumes that this function follows a Gaussian process characterized by a mean function µ(·) and a covariance function k(·,·). The covariance matrix K is a fundamental component of GPR and is determined by the kernel function k.
The kernel function k plays a pivotal role in capturing the relationships between input data points x and x′. It quantifies the covariance, or similarity, between the random values f(x) and f(x′). One of the most widely used kernels is the squared exponential kernel, k(x, x′) = σ² exp(−‖x − x′‖² / (2l²)). In this expression, σ² represents the signal variance, a model parameter that quantifies the overall variability or magnitude of the function values. The exponential function models the similarity between x and x′: it decreases as the distance between x and x′ increases, capturing the idea that nearby points are more strongly correlated.
The parameter l, known as the length scale, controls the smoothness and spatial extent of the correlation. A smaller l results in more rapid changes in the function, while a larger l leads to smoother variations.
The observations in a dataset y = {y1, . . . , yn} can be viewed as a sample from a multivariate Gaussian distribution, (y1, . . . , yn) ~ N(µ, K). Gaussian processes model the relationship between the input variables x and the target variables y in the presence of additive noise ε ~ N(0, σn²); the goal is to estimate the unknown function f(·). The mean vector µ and covariance matrix K capture the relationships among the data points. Conditioning on the observed data y = (y1, . . . , yn)ᵀ, the response y* at a test point x* follows a Gaussian distribution N(µ*, σ*²) with

µ* = µ + k*ᵀ K⁻¹ (y − µ),   σ*² = k(x*, x*) − k*ᵀ K⁻¹ k*,   (5)

where k* denotes the vector of covariances between the test point and the training points.
In traditional GPR, a single length-scale parameter l and signal variance σ² are used for all input dimensions. In contrast, the Automatic Relevance Determination (ARD) approach employs a separate length-scale parameter (l_i) for each input dimension i, so a dataset with m input dimensions has m individual length-scale parameters [34].
The key advantage of ARD is that it automatically determines the relevance of each input dimension. By allowing each dimension its own length-scale parameter, the model can assign different degrees of importance to each dimension, focusing on dimensions that are relevant to the target variable and being less influenced by irrelevant ones.
GPR involves matrix operations, and its computational cost can become an issue for large datasets. Techniques such as sparse approximations or specialized kernels can address these challenges. GPR is frequently used in Bayesian optimization, where the goal is to optimize an unknown objective function that is expensive to evaluate.
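A minimal sketch of ARD in scikit-learn (synthetic data invented for illustration; the paper's GPR implementation and kernel settings may differ): an anisotropic RBF kernel with one length scale per input dimension implements exactly the l_i-per-dimension scheme described above.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(80, 3))                      # three input dimensions
y = np.sin(4 * X[:, 0]) + rng.normal(0, 0.05, size=80)   # only dimension 0 is relevant

# Anisotropic RBF kernel: one length scale per dimension (ARD). Fitting drives
# the length scales of irrelevant dimensions toward large values.
kernel = ConstantKernel(1.0) * RBF(length_scale=[1.0, 1.0, 1.0])
gpr = GaussianProcessRegressor(kernel=kernel, alpha=0.05 ** 2,
                               normalize_y=True, n_restarts_optimizer=2,
                               random_state=0).fit(X, y)

length_scales = gpr.kernel_.k2.length_scale   # fitted per-dimension length scales
mean, std = gpr.predict(X[:5], return_std=True)
```

After fitting, the relevant dimension receives the smallest length scale, so inspecting `length_scales` gives a built-in measure of input relevance, and `return_std=True` exposes the predictive uncertainty that distinguishes GPR from the tree-based models.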

Case Study of the Caspian Sea Basin
This study took place in the Caspian Sea catchment area in Northern Iran, covering approximately 618 m², with coordinates ranging from 49°48′ to 54°41′ longitude and from 35°36′ to 37°19′ latitude (Figure 4).
Initially, 108 water quality monitoring stations scattered across the southern basin of the Caspian Sea were selected for analysis (Figure 4). To define the upstream catchment boundaries, digital elevation models (DEMs) with a resolution of 30 m × 30 m from the USGS database were used, with boundary refinement achieved through a user digitizing technique. Eighteen macro-sized catchments (those exceeding 1000 km²) were excluded from the modeling process because of their disproportionate influence on hydrological dynamics.
Water quality data, including SAR, Na+, Mg2+, Ca2+, SO42−, Cl−, HCO3−, K+, pH, EC, and TDS, were obtained from the Iran Water Resource Management Company (WRMC) through monthly sampling. Collection adhered to the WRMC Guidelines for Surface Water Quality Monitoring (2009) and EPA-841-B-97-003 standards [36]. For statistical analysis, the 5-year means (1998–2002) of the water quality data were calculated. After scrutinizing for normality and identifying outliers, 88 final stations were retained. The geographic scope of the study area is illustrated in Figure 4.
A land cover dataset was created using a 2002 digital land cover map (scale 1:250,000) from the Forest, Ranges, and Watershed Management Organization of Iran. The original land cover categories were consolidated into six classes: bare land, water bodies, urban areas, agriculture, rangeland, and forest, following the land use and land cover classification system of [37]. Furthermore, digital geological and soil feature maps (1:250,000 scale) were obtained from the Geological Survey of Iran (www.gsi.ir, accessed on 24 April 2021). Detailed information about the characteristics of the catchments and their statistical attributes can be found in Tables 1 and 2.
In this study, hydrologic soil groups and geological permeability classes were developed and applied, together with land use/land cover types, within the modeling process. Hydrologic soil groups reflect runoff potential and can be used to determine runoff curve numbers. They comprise four classes (A, B, C, and D), with A having the lowest runoff potential and D the highest. Notably, soil profiles can be significantly altered by changes in land use/land cover; in such cases, the texture of the new surface soil can be used to assign the hydrologic soil group, as described in Table 1 [38]. Furthermore, the study incorporates geological permeability attributes of the catchments, with three permeability classes (Low, Medium, and High). These classes are associated with characteristics of the geological formations such as effective porosity; cavity type, size, and connectivity; rock density; pressure gradient; and fluid properties such as viscosity.
The range and statistical properties of the training and test data play a fundamental role in the development and evaluation of machine learning models: they affect generalization, robustness, fairness, and performance in diverse real-world scenarios. Statistical properties of the input and output data are given in Tables 1 and 2.
The machine learning methods used in this paper were assessed using five-fold cross-validation. The dataset was randomly divided into five subsets; four were used to train the model, and the remaining subset was used for validation (testing). This process was repeated five times, so that each subset was used exactly once for validation, and the results of the five repetitions were averaged to produce a single estimate.
All models were trained and tested under identical conditions, ensuring a fair and consistent evaluation of their performance.This practice is essential in machine learning to provide a level playing field for comparing different algorithms and models.
When machine learning models are trained and tested under equal conditions, they are exposed to the same datasets, preprocessing steps, and evaluation metrics. Model quality was assessed using several evaluation and performance measures: RMSE, MAE, Pearson's linear correlation coefficient (R), and MAPE.
The RMSE criterion, expressed in the same units as the target values, measures the model's general accuracy. It is calculated as the square root of the average squared difference between the actual values (dk) and the model's predictions (ok) over the N training samples.
The MAE criterion represents the mean absolute error of the model, emphasizing absolute accuracy. It calculates the average absolute difference between the actual values and the model's predictions.
Pearson's Linear Correlation Coefficient (R) provides a relative measure of accuracy assessment.It considers the correlation between the actual values (d k ) and the model's predictions (o k ) relative to their respective means (d and o).Values of R greater than 0.75 indicate a strong correlation between the variables.
The MAPE is a relative criterion that evaluates accuracy by calculating the average percentage differences between the actual values and the model's predictions.
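The formulas for these four criteria are not reproduced in this extraction; their standard definitions, written in the notation used above (actual values d_k, predictions o_k, sample size N, means d̄ and ō), are:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{k=1}^{N}(d_k - o_k)^2}, \qquad
\mathrm{MAE} = \frac{1}{N}\sum_{k=1}^{N}\lvert d_k - o_k\rvert,
```

```latex
R = \frac{\sum_{k=1}^{N}(d_k-\bar{d})(o_k-\bar{o})}
         {\sqrt{\sum_{k=1}^{N}(d_k-\bar{d})^2}\,\sqrt{\sum_{k=1}^{N}(o_k-\bar{o})^2}}, \qquad
\mathrm{MAPE} = \frac{100}{N}\sum_{k=1}^{N}\left\lvert\frac{d_k - o_k}{d_k}\right\rvert .
```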
This research deals with a limited dataset, and in this case, there is a higher risk of overfitting, where a model performs well on the training data but fails to generalize to new, unseen data. Five-fold cross-validation helps mitigate overfitting by partitioning the dataset into five subsets, using four for training and one for testing in each iteration. This process allows for a more robust evaluation of the model's performance.
Five-fold cross-validation efficiently utilizes the available data by rotating through different subsets for training and testing, ensuring that each data point contributes to training and evaluation.
Moreover, cross-validation provides a more robust estimate of the model's performance by averaging the evaluation metrics across multiple folds. This helps ensure that our results are not overly dependent on the particular random split of the data. Additionally, cross-validation allows us to iteratively train and evaluate the model on different subsets, aiding in the fine-tuning of hyperparameters and ensuring the model's performance is consistently reliable (Figure 5).


Results
The paper analyzes the application of regression trees, bagging, RF, gradient boosting, and Gaussian process regression models using a systematic approach (Figure 5). For each model, the hyperparameters were varied over an appropriate range, and optimal values were determined using a grid-search method. The following values were analyzed:
• Regression tree model: the depth of a regression tree is a crucial factor in preventing overfitting, which occurs when the tree becomes too detailed and fits the training data too closely, as well as underfitting, which happens when the tree is too simplistic and fails to capture the underlying patterns. Minimum number of samples per leaf: the study considered values from 2 to 10 samples per tree leaf, with a 1-sample increment.
• Boosted tree model: number of generated trees (B), analyzed within a range of 1-100 trees.
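The grid search over such a range might look as follows; this is a hedged sketch assuming scikit-learn, with synthetic data and the 2-10 minimum-leaf-size range given above:

```python
# Hedged sketch of the grid search over the minimum leaf size (2-10, step 1).
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(88, 14))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=88)

grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"min_samples_leaf": list(range(2, 11))},  # 2..10 inclusive
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",   # candidates compared by MSE
)
grid.fit(X, y)
best_leaf = grid.best_params_["min_samples_leaf"]
```

Every candidate value is cross-validated and the value minimizing the mean squared error is retained, mirroring the procedure described above.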
All models were first compared in terms of mean squared error, and the optimal model among all those analyzed was then evaluated on the test data using the RMSE, MAE, MAPE, and R criteria.
The paper illustrates the detailed procedure for determining the optimal model for the prediction of the SAR parameter, while for all other parameters the procedure is given in a more concise form. Accompanying results for the other parameters can be found in Appendix A (Tables A1-A10).


Prediction of SAR Parameter Values
• Regression tree models
Table 3 illustrates the impact of the minimum leaf size on the accuracy of the regression tree (RT) model. It shows how changing the minimum leaf size affects key performance metrics such as RMSE, MAE, MAPE, and R. Accordingly, the leaf size is almost positively associated with the RMSE but inversely correlated with the R values.
In this particular case, it was found that models with less complexity, i.e., with a minimum of 10 samples per terminal leaf, have higher accuracy (Figure 6).

•
TreeBagger models and Random Forest models
The application of TB and RF models was analyzed simultaneously (Figure 7). The figure shows the dependence of the achieved accuracy of the model on the hyperparameter values. The TB model represents the borderline case of the RF model in which all variables are considered as candidates at each split.
Among the optimal models in this group, the RF model with 500 generated trees proved to be the best. The model that uses a subset of eight variables with a minimum of one sample per terminal leaf has higher accuracy according to the RMSE and R criteria, while the model that uses a subset of six variables with a minimum of one sample per terminal leaf has higher accuracy according to the MAE and MAPE criteria (Table 4). Optimal values according to different accuracy criteria are marked with bold numbers in Table 4.
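The TB/RF relationship can be sketched as follows, using scikit-learn's `RandomForestRegressor` as an assumed analogue of MATLAB's TreeBagger, on synthetic data:

```python
# Sketch: a TreeBagger-style bagging model is the limiting case of RF where
# all predictors are candidates at every split (max_features=None).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(88, 14))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=88)

rf = RandomForestRegressor(n_estimators=500, max_features=8,     # 8-variable subset
                           min_samples_leaf=1, random_state=0).fit(X, y)
tb = RandomForestRegressor(n_estimators=500, max_features=None,  # all 14 variables
                           min_samples_leaf=1, random_state=0).fit(X, y)
```

The two models differ only in the number of candidate splitting variables, which is exactly the hyperparameter swept in Figure 7.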
With the BT model (Figure 8), it was shown that the highest accuracy is obtained by applying complex models with a large number of branches.
The optimal obtained model had a structure of 32 branches and a Learning Rate value equal to 0.01.
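A sketch of such a boosted model with the small learning rate reported above; scikit-learn's gradient boosting is used here as a stand-in, and the exact tree structure from the paper is not reproduced:

```python
# Hedged sketch of a boosted tree (BT) model with learning rate 0.01.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(88, 14))
y = X[:, 0] + rng.normal(scale=0.3, size=88)

bt = GradientBoostingRegressor(n_estimators=100,    # B analyzed up to 100
                               learning_rate=0.01,
                               random_state=0).fit(X, y)
```

With a small learning rate each tree contributes only a small correction, which is why more complex ensembles with many trees achieved the highest accuracy.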

GPR models
The optimal values for the parameters of the applied models with different kernel functions were obtained by maximizing the marginal likelihood (Tables 5 and 6).

The marginal likelihood is a function of the observed data (y(X)) and the model parameters {l, σ², η²}; the model parameters are determined by maximizing this function.
Importantly, taking the logarithm of the marginal likelihood yields the same optimum as maximizing the original likelihood. Therefore, model parameter optimization is typically carried out by applying gradient-based procedures to the log marginal likelihood, simplifying the optimization without altering the final outcome. The comparative results of the implemented ML models are presented in Table 7. Optimal values according to different accuracy criteria are marked with bold numbers in Table 7.
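As an illustration of this fitting procedure, here is a sketch assuming scikit-learn's GPR, which maximizes the log marginal likelihood with a gradient-based optimizer during `fit` (synthetic 1-D data):

```python
# Sketch: GPR hyperparameters (length scale l, signal variance, noise variance)
# are obtained by maximizing the log marginal likelihood.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

rng = np.random.default_rng(4)
X = rng.uniform(0.0, 5.0, size=(40, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=40)

kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(0.1)
gpr = GaussianProcessRegressor(kernel=kernel, random_state=0).fit(X, y)
lml = gpr.log_marginal_likelihood_value_   # value at the optimized parameters
```

The optimized kernel parameters play the role of {l, σ², η²} above; different kernel functions (e.g., the exponential kernel used later) are compared by this same criterion.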
The values of all accuracy criteria on the test data set are shown in Table 7. According to the RMSE and R criteria, the RF model had the highest accuracy (it uses a subset of eight variables for calculation, with a minimum of one sample per terminal leaf), while according to the MAE and MAPE criteria, the GP model with an exponential kernel function stood out as the most accurate. On the optimal RF model, the significance of each input variable was determined by permuting the values of the considered variable within the training data and recalculating the out-of-bag error for the permuted data. The significance of the variable (Figure 9) is then determined as the mean of the differences before and after permutation, divided by the standard deviation of these differences. A variable with a higher value is ranked as more significant than variables with smaller values.
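The permutation idea can be sketched as follows. Note the paper recomputes the out-of-bag error, whereas this stand-in uses scikit-learn's `permutation_importance`, which reports the mean score decrease (and its standard deviation) per variable:

```python
# Sketch of permutation-based variable significance: permuting an important
# variable degrades the model far more than permuting a noise variable.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(5)
X = rng.normal(size=(88, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.2, size=88)   # variable 0 dominates

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]  # most significant first
```

Here the dominant synthetic variable is ranked first, just as the catchment attributes are ranked in Figure 9.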


Prediction of Na + Parameter Values
RF models proved to be the optimal models for predicting sodium ion (Na+) concentrations, while the analysis of all models in terms of accuracy is given in Appendix A (Table A1). The dependence of the adopted accuracy criteria on the model parameters is shown in Figure 10. Based on the defined accuracy criteria, four models with the following criteria values were selected (Table 8). The "Weighted Sum Model" or "Simple Multi-Criteria Ranking" method was used to select the optimal model. For the minimization objectives (RMSE, MAE, MAPE), Min-Max normalization is applied, and for the maximization objective (R), Max-Min normalization is applied to ensure that all metrics are on the same scale. Equal weights are assigned to the normalized evaluation metrics to indicate their relative importance in the decision-making process. The weighted sum method calculates an aggregated value for each model that considers all four normalized metrics. All models are ranked based on their aggregated values, with a lower aggregated value indicating better overall performance (Table 9). The optimal model obtained was the RF model with 500 trees, which uses a subset of 11 variables with a minimum of six samples per leaf. The assessment of the significance of individual input variables for the accuracy of the prediction was performed on the obtained model with the highest accuracy (Figure 11).
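A minimal sketch of this ranking scheme, using hypothetical metric values for three candidate models and equal weights (lower aggregated value = better):

```python
# Sketch of the weighted-sum multi-criteria ranking: minimization metrics are
# min-max normalized, the maximization metric (R) is reversed, and models are
# ordered by the equally weighted sum (lower = better).
import numpy as np

def rank_models(rmse, mae, mape, r, weights=(0.25, 0.25, 0.25, 0.25)):
    def norm_min(v):          # 0 = best for minimization objectives
        v = np.asarray(v, float)
        return (v - v.min()) / (v.max() - v.min())
    def norm_max(v):          # 0 = best for the maximization objective
        v = np.asarray(v, float)
        return (v.max() - v) / (v.max() - v.min())
    cols = (norm_min(rmse), norm_min(mae), norm_min(mape), norm_max(r))
    agg = sum(w * c for w, c in zip(weights, cols))
    return list(np.argsort(agg))   # model indices, best first

# Hypothetical criteria values for three candidate models:
order = rank_models([0.50, 0.60, 0.55], [0.40, 0.45, 0.42],
                    [0.20, 0.25, 0.22], [0.80, 0.75, 0.78])
```

Because the R column is reversed, all four normalized metrics point the same way, so the model with the smallest aggregate is best under every criterion simultaneously.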


Prediction of Magnesium (Mg 2+ ) Parameter Values
RF models proved to be the optimal models for predicting magnesium ion (Mg2+) concentrations. An analysis of all models in terms of accuracy is given in Appendix A (Table A2). The dependence of the adopted accuracy criteria on the model parameters is shown in Figure 12. Based on the defined accuracy criteria, three models with the following values were selected (Table 10). Optimal values according to different accuracy criteria are marked with bold numbers in Table 10. "Simple Multi-Criteria Ranking" was applied again when selecting the optimal model (Table 11). The optimal model obtained was the RF model with 500 trees, which uses a subset of 12 variables with a minimum of one sample per leaf. The assessment of the importance of individual input variables for the accuracy of the prediction was performed on the obtained model with the highest accuracy (Figure 13).


Prediction of Ca 2+ Parameter Values
RF models proved to be the optimal models for predicting calcium ion (Ca2+) concentrations. An analysis of all models in terms of accuracy is given in Appendix A (Table A3). The dependence of the adopted accuracy criteria on the model parameters is shown in Figure 14. According to all the defined accuracy criteria, only one model stood out, with values for RMSE, MAE, MAPE, and R of 0.5847, 0.4500, 0.2007, and 0.7496, respectively. The assessment of the significance of individual input variables for the accuracy of the prediction was performed on the obtained model with the highest accuracy (Figure 15).

Prediction of SO4 2− Parameter Values
The RF models proved to be the optimal models for predicting SO4 2− levels. An analysis of all models in terms of accuracy is given in Appendix A (Table A4). The dependence of the adopted accuracy criteria on the model parameters is shown in Figure 16. According to all defined accuracy criteria, only one model was singled out, with values for RMSE, MAE, MAPE, and R of 0.5526, 0.3122, 0.5050, and 0.7381, respectively.
The assessment of the significance of the individual input variables for the accuracy of the prediction was performed directly on the obtained model with the highest accuracy (Figure 17).

Prediction of Cl − Parameter Values
RF models proved to be the optimal models for predicting Cl− concentrations. An analysis of all models in terms of accuracy is given in Appendix A (Table A5). The dependence of the adopted accuracy criteria on the model parameters is shown in Figure 18. Based on the defined accuracy criteria, three models were selected (Table 12). Optimal values according to different accuracy criteria are marked with bold numbers in Table 12. The optimal model obtained was the RF model with 500 trees, which uses a subset of 11 variables with a minimum of five samples per leaf (Table 13). The assessment of the importance of individual input variables for the accuracy of the prediction was performed on the obtained model with the highest accuracy (Figure 19).


Prediction of HCO3− Parameter Values
GPR models proved to be the optimal models for predicting HCO3− concentrations. Very similar accuracy values were also obtained with the RF models. However, since the difference between the GPR model and the RF model is practically negligible, and since it is not possible to obtain significance values for individual input variables from the obtained GPR model (it has the same length-scale parameter for all variables), RF models were used for the analysis. An analysis of all models in terms of accuracy is given in Appendix A (Table A6).
The dependence of the adopted accuracy criteria on the parameters of the RF model is shown in Figure 20.
In the specific case of applying the RF model, two models were distinguished: the RF model that uses ten variables as a subset for analysis, with a minimum of one sample per terminal leaf, which is optimal according to the RMSE, MAE, and MAPE criteria, and the model that uses 13 variables as a subset for analysis, with a minimum of two samples per terminal leaf, which is optimal according to the R criterion. Since the first model is optimal according to three of the adopted accuracy criteria (RMSE, MAE, and MAPE), and the difference in the R criterion is practically negligible, the first model can be considered optimal.
The optimal model has the following criterion values for RMSE, MAE, MAPE, and R of 0.5174, 0.4252, 0.1822, and 0.7721, respectively.
The assessment of the importance of the individual input variables for the accuracy of the prediction was performed on the obtained model with the highest accuracy (Figure 21).

Prediction of K + Parameter Values
The RF models proved to be the optimal models for predicting K+ levels. An analysis of all models in terms of accuracy is given in Appendix A (Table A7). The dependence of the adopted accuracy criteria on the model parameters is shown in Figure 22. In terms of accuracy, three models were singled out, and the optimal model was obtained by applying the Simple Multi-Criteria Ranking method (Tables 14 and 15). Optimal values according to different accuracy criteria are marked with bold numbers in Table 14.

Prediction of pH Parameter Values

The RF models proved to be the optimal models for predicting pH values. In terms of accuracy, three models were singled out, and the optimal model was obtained by applying the Simple Multi-Criteria Ranking method (Tables 16 and 17). Optimal values according to different accuracy criteria are marked with bold numbers in Table 16. Using the weighted sum method, an aggregated value was calculated for each model, taking into account all four normalized metrics.
The analysis of the significance of the individual input variables of the model was performed on the optimal RF model and shown in Figure 25.

Prediction of EC Parameter Values
RF models proved to be optimal models for EC parameter prediction. An analysis of all models in terms of accuracy is given in Appendix A (Table A9). The dependence of the adopted accuracy criteria on the model parameters is shown in Figure 26.

According to all defined accuracy criteria, only one model was singled out, with values for RMSE, MAE, MAPE, and R of 271.5346, 149.3192, 0.2779, and 0.7665, respectively.
The obtained hyperparameter values for the number of trees, the subset of splitting variables, and the minimum number of samples per leaf are 500, 6, and 1, respectively. The analysis of the significance of the individual input variables of the model was performed on the optimal RF model and is shown in Figure 27.

Prediction of TDS Parameter Values
The RF models proved to be the optimal models for predicting TDS levels. An analysis of all models in terms of accuracy is given in Appendix A (Table A10). The dependence of the adopted accuracy criteria on the model parameters is shown in Figure 28.
In terms of accuracy, three models were singled out, and the optimal model was obtained by applying the Simple Multi-Criteria Ranking method (Tables 18 and 19). Optimal values according to different accuracy criteria are marked with bold numbers in Table 18.

The analysis of the significance of the individual input variables of the model was performed on the optimal RF model and is shown in Figure 29.

Discussion
In our research, most models demonstrated satisfactory accuracy, meeting the predefined criteria. However, a subset of models exhibited shortcomings in specific criteria. To gauge accuracy effectively, we relied on relative metrics, notably the correlation coefficient (R) and the mean absolute percentage error (MAPE), as they offer more insightful perspectives than absolute criteria such as RMSE and MAE (Table 20). Table 20 highlights the accuracy of the machine learning models in predicting individual water parameters. Notably, the RF model emerged as the best performer across various parameters, underscoring its efficacy.
Analyzing the R values reveals the overall satisfactory performance of most models, except for the pH prediction model. Examining the MAPE values identified five models (SAR, Na+, SO4 2−, Cl−, and TDS) for which this metric is relatively higher than for the others. Despite these nuances, our primary research focus was unraveling the significance of individual input variables within the constraints of limited data.
When we delve into the significance of the individual input variables, our conclusions (Table 21) unveil the following crucial insights. Forest Cover ('F'): forest areas significantly influence diverse water quality parameters. Trees and vegetation in forests contribute organic matter to water bodies, influencing ion concentrations, and the root systems of trees can affect the uptake of certain ions. Forests strongly impact the concentrations of sodium, magnesium, calcium, chloride, sulfate, bicarbonate, and potassium ions. Forests also act as natural filters, reducing the transport of sediments and pollutants into water bodies; cleaner water, with fewer suspended solids, tends to have lower TDS and EC. Additionally, forest areas often have minimal human activities compared to urban or agricultural areas.
Rangeland ('RL') is essential for predicting water sulfate ion concentrations. This suggests that the characteristics associated with rangeland, such as land cover and land use patterns, significantly influence sulfate levels. Rangeland also strongly affects SAR by influencing sodium concentrations, vital for evaluating water's suitability for irrigation and soil health, and its notable impact on magnesium levels further showcases its role in shaping water quality. Rangeland's influence on pH highlights its role in determining water acidity or alkalinity, which is crucial for aquatic ecosystems and nutrient availability. Rangeland likewise significantly influences electrical conductivity, providing insights into dissolved ion content, essential for understanding overall water composition. While having a somewhat lesser impact, rangeland still plays a discernible role in shaping sodium concentrations, contributing to insights into water salinity and its ecological implications.
Urban Area ('UA'): urban areas have a moderate impact on magnesium, chloride, and bicarbonate ion levels and on the SAR parameter, owing to urbanization and land use changes that introduce contaminants and alter water chemistry. Their impact on the calcium, sulfate, and EC parameters is smaller.
The Agricultural Area ('AA') substantially impacts potassium, SAR, and sodium, with a moderate impact on calcium, TDS, and magnesium. The influence of AA on these parameters can be explained by the use of potassium-containing fertilizers in agricultural areas, leading to elevated potassium concentrations in water; cultivation practices and nutrient management contribute to increased potassium levels. Additionally, agricultural activities often involve irrigation, and irrigation water with high sodium content can increase SAR and raise sodium levels in the water. Moreover, agricultural runoff can introduce calcium, magnesium, and other dissolved solids into water sources.
Catchment Area (CA): The size of the catchment area plays a moderate role in ion transport, particularly affecting SAR, sodium, bicarbonate, calcium, and sulfate levels. Larger catchments may interact with more diverse geological and soil features, which moderately affects sodium adsorption ratios.
Considering the different hydrologic soil groups (HSGA, HSGB, HSGC, HSGD) and geological permeability classes (GHGM, GHGN, GHGT) underscores their impact on ion retention and release. Sandy soils facilitate easier ion movement, while clayey soils retain ions. Geological permeability influences potassium, magnesium, calcium, and bicarbonate levels, showcasing the interconnectedness of soil and geological characteristics with water parameters.

Conclusions
Our study demonstrates the effectiveness of machine learning methods in predicting and assessing water quality parameters within a catchment area. The Random Forest (RF) model emerged as the standout performer, providing a robust tool for efficient and accurate water quality evaluation.
While certain models fall short on specific criteria, a nuanced evaluation using relative criteria such as the correlation coefficient (R) and the mean absolute percentage error (MAPE) underscores the overall robustness of the predictive models. Table 20 summarizes the detailed results, highlighting the efficacy of the RF model across the water parameters.
Evaluation of the R values shows satisfactory performance for all models except for pH prediction. Despite somewhat elevated MAPE values for five parameters (SAR, Na+, SO42−, Cl−, TDS), the core research objective, unraveling the importance of individual input variables within the data constraints, was largely achieved.
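The two evaluation criteria can be made concrete with a short sketch. The study itself worked in MATLAB; the following Python implementation is purely illustrative, and the sample station values are hypothetical:

```python
import math

def pearson_r(obs, pred):
    """Pearson's linear correlation coefficient (R) between observed and predicted values."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    return cov / (so * sp)

def mape(obs, pred):
    """Mean absolute percentage error, in percent; observed values must be nonzero."""
    return 100.0 / len(obs) * sum(abs((o - p) / o) for o, p in zip(obs, pred))

# Hypothetical SAR values at five stations
obs = [1.2, 0.8, 2.5, 1.9, 0.6]
pred = [1.1, 0.9, 2.2, 2.0, 0.7]
print(round(pearson_r(obs, pred), 3))  # 0.982
print(round(mape(obs, pred), 1))       # 11.0
```

Note that MAPE is scale-free but sensitive to observed values near zero, which is one reason a model can show a high R yet an elevated MAPE on low-concentration parameters.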
This accomplishment paves the way for selecting and implementing optimal models from a broader ML spectrum. To further improve model accuracy, future research will focus on expanding the dataset, a strategic step toward addressing current limitations, particularly for the parameters exhibiting slight deviations.
The significance of individual input variables, as outlined in Table 21, provides crucial insight into their roles in influencing water parameters. Forest cover, catchment area characteristics, stream order, barren land, and urban areas are pivotal factors shaping water quality.
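One common way such variable-significance scores are computed for forest-type models is permutation importance: a variable matters to the extent that randomly shuffling its values degrades predictions. The following is a minimal pure-Python illustration on a synthetic model; the data, the stand-in model, and all names are hypothetical, not the study's actual pipeline:

```python
import random

def permutation_importance(model, X, y, col, n_repeats=20, seed=0):
    """Average increase in MSE when column `col` of X is randomly shuffled."""
    rng = random.Random(seed)
    def mse(rows):
        return sum((model(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)
    base = mse(X)
    total = 0.0
    for _ in range(n_repeats):
        vals = [row[col] for row in X]
        rng.shuffle(vals)
        shuffled = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, vals)]
        total += mse(shuffled) - base
    return total / n_repeats

# Synthetic example: the target depends on feature 0 only, not on feature 1
rng = random.Random(42)
X = [[rng.uniform(0, 10), rng.uniform(0, 10)] for _ in range(200)]
y = [3.0 * row[0] for row in X]
model = lambda row: 3.0 * row[0]  # a "perfect" model standing in for a trained RF
imp0 = permutation_importance(model, X, y, col=0)
imp1 = permutation_importance(model, X, y, col=1)
print(imp0 > imp1)  # True: shuffling the relevant feature hurts, the irrelevant one does not
```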
Incorporating these insights into decision-making processes presents transformative opportunities for strategic resource allocation and environmental impact mitigation. It empowers decision-makers to adopt targeted strategies for fostering environmental sustainability, contributing to the broader goal of cultivating resilient water ecosystems and offering a practical pathway toward balancing human activities with environmental preservation.

Figure 1. The partitioning of an input space into distinct regions and the representation of a 3D regression surface within a regression tree [26].

Figure 4. The study region and catchment areas situated within the southern Caspian Sea basin.

Figure 5. Applied methodology for creating prediction models.

Figure 6. An optimal individual model for SAR parameter prediction based on a regression tree.

Figure 7. Comparison of different accuracy criteria for the RF model for the SAR parameter as a function of the number of randomly selected splitting variables and minimum leaf size: (a) RMSE, (b) MAE, (c) MAPE, (d) R.

Figure 8. Dependence of the MSE value on the reduction parameter λ and the number of trees (base models) in the boosted tree model for the SAR parameter.

Figure 9. Significance of individual variables for SAR parameter prediction in an optimal RF model.

Figure 10. Comparison of different accuracy criteria for the RF model for the Na+ parameter as a function of the number of randomly selected splitting variables and minimum leaf size: (a) RMSE, (b) MAE, (c) MAPE, (d) R.

Figure 11. The significance of individual variables for Na+ parameter prediction in an optimal RF model.

Figure 12. Comparison of different accuracy criteria for the RF model for the Mg2+ parameter as a function of the number of randomly selected splitting variables and minimum leaf size: (a) RMSE, (b) MAE, (c) MAPE, (d) R.

Figure 13. Significance of individual variables for Mg2+ parameter prediction in an optimal RF model.

Figure 14. Comparison of different accuracy criteria for the RF model for the Ca2+ parameter as a function of the number of randomly selected splitting variables and minimum leaf size: (a) RMSE, (b) MAE, (c) MAPE, (d) R. The assessment of the significance of the individual input variables was performed on the obtained model with the highest accuracy (Figure 15).

Figure 15. Significance of individual variables for Ca2+ parameter prediction in an optimal RF model.

Figure 16. Comparison of different accuracy criteria for the RF model for the SO42− parameter as a function of the number of randomly selected splitting variables and minimum leaf size: (a) RMSE, (b) MAE, (c) MAPE, (d) R.

Figure 17. Significance of the individual variables for SO42− parameter prediction in an optimal RF model.

Figure 18. Comparison of different accuracy criteria for the RF model for the Cl− parameter as a function of the number of randomly selected splitting variables and minimum leaf size: (a) RMSE, (b) MAE, (c) MAPE, (d) R.

Figure 19. Significance of the individual variables for Cl− parameter prediction in an optimal RF model.

Figure 20. Comparison of different accuracy criteria for the RF model for the HCO3− parameter as a function of the number of randomly selected splitting variables and minimum leaf size: (a) RMSE, (b) MAE, (c) MAPE, (d) R.

Figure 21. Significance of individual variables for HCO3− parameter prediction in an optimal RF model.

Figure 22. Comparison of different accuracy criteria for the RF model for the K+ parameter as a function of the number of randomly selected splitting variables and minimum leaf size: (a) RMSE, (b) MAE, (c) MAPE, (d) R.

Figure 24. Comparison of different accuracy criteria for the RF model for the pH parameter as a function of the number of randomly selected splitting variables and minimum leaf size: (a) RMSE, (b) MAE, (c) MAPE, (d) R.

Figure 25. Significance of the individual variables for pH parameter prediction in an optimal RF model.

Figure 26. Comparison of different accuracy criteria for the RF model for the EC parameter as a function of the number of randomly selected splitting variables and minimum leaf size: (a) RMSE, (b) MAE, (c) MAPE, (d) R.

Figure 27. Significance of the individual variables for EC parameter prediction in an optimal RF model.

Figure 28 .
Figure 28.Comparison of different accuracy criteria for the RF model for the TDS parameter as a function of the number of randomly selected splitting variables and minimum leaf size: (a) RMSE, (b) MAE, (c) MAPE, (d) R.

Figure 29 .
Figure 29.Significance of the individual variables for TDS parameter prediction in an optimal RF model.

Figure 28 .Figure 28 .
Figure 28.Comparison of different accuracy criteria for the RF model for the TDS parameter as a function of the number of randomly selected splitting variables and minimum leaf size: (a) RMSE, (b) MAE, (c) MAPE, (d) R.The analysis of the significance of the individual input variables of the model was performed on the optimal RF model and shown in Figure29.

Figure 29 .
Figure 29.Significance of the individual variables for TDS parameter prediction in an optimal RF model.

Figure 29 .
Figure 29.Significance of the individual variables for TDS parameter prediction in an optimal RF model.

Table 1. Statistical properties of the input variables used for modeling.

Table 2. Statistical properties of the output variables used for modeling.

Table 3. Influence of the minimum leaf size on regression tree (RT) model accuracy.

Regression tree (RT) model: the minimum number of samples per tree leaf was analyzed for values ranging from 2 to 15, with a step of 1 sample. The standard MATLAB setting for regression is 5 samples per leaf, but a broader range was examined here to assess its impact on model generalization.

Random Forest (RF) model: (1) The number of generated trees (B) was analyzed within a range of 100 to 500 trees, with 100 as the standard MATLAB setting; cumulative MSE values were presented for all base models in the ensemble. Trees were created via bootstrap aggregation, generating 181 samples per iteration, and the study explored an extended ensemble of up to 500 regression trees, in line with recommended Random Forest practice. (2) For the number of variables used for tree splitting, guidance suggests a subset of approximately √p predictors for branching, where p is the number of input variables; with 16 predictors this translates to a subset of 4 variables, but in this research a wider range, from 1 to 16 variables, was investigated.
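The search space described above can be written out explicitly. The following Python sketch is illustrative only (the study ran the equivalent search in MATLAB); it enumerates the stated ranges and the √p rule of thumb:

```python
import math
from itertools import product

n_predictors = 16                        # p: number of catchment input variables
leaf_sizes = range(2, 16)                # minimum samples per leaf, 2..15 in steps of 1
n_trees = range(100, 501, 100)           # ensemble sizes B = 100..500
split_vars = range(1, n_predictors + 1)  # variables tried at each split, 1..16

# Rule-of-thumb subset size quoted above: about sqrt(p) predictors per split
default_mtry = round(math.sqrt(n_predictors))
print(default_mtry)  # 4

# Full grid of candidate RF configurations to be evaluated
grid = list(product(n_trees, split_vars, leaf_sizes))
print(len(grid))  # 5 * 16 * 14 = 1120 configurations
```

Enumerating the grid this way makes the cost of the sweep explicit before any model is trained.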

Table 4. Accuracy of obtained models for SAR parameter prediction according to defined criteria.

Table 5. Values of optimal parameters in GPR models with different covariance functions.

Table 6. Values of optimal parameters in GPR ARD models with different covariance functions.

Table 7. Comparative analysis of the results of different machine learning models for the SAR prediction.

Table 8. Accuracy of obtained models for Na+ parameter prediction according to defined criteria.

Table 9. Determining the optimal prediction model for the Na+ parameter using Simple Multi-Criteria Ranking.

Table 10. The accuracy of the obtained models for Mg2+ prediction according to defined criteria.

Table 11. Determining the optimal prediction model for the Mg2+ parameter using Simple Multi-Criteria Ranking.

Table 12. Accuracy of the obtained models for Cl− prediction according to defined criteria.

Table 13. Determining the optimal prediction model for the Cl− parameter using Simple Multi-Criteria Ranking.

Table 16. Accuracy of the obtained models for pH parameter prediction according to defined criteria. Using the weighted sum method, an aggregated value is calculated for each model, taking into account all four normalized metrics.

Table 17. Determining the optimal prediction model for the pH parameter using Simple Multi-Criteria Ranking.
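The Simple Multi-Criteria Ranking with the weighted sum of normalized metrics can be sketched as follows. This Python version is illustrative (the study used MATLAB); the metric values are hypothetical, and equal weights are assumed:

```python
def rank_models(metrics, lower_is_better, weights=None):
    """Weighted-sum ranking: min-max normalize each criterion so that higher
    is better, then aggregate with the given weights (equal by default)."""
    names = list(metrics)
    n_crit = len(next(iter(metrics.values())))
    weights = weights or [1.0 / n_crit] * n_crit
    scores = {name: 0.0 for name in names}
    for j in range(n_crit):
        col = [metrics[name][j] for name in names]
        lo, hi = min(col), max(col)
        for name in names:
            v = (metrics[name][j] - lo) / (hi - lo) if hi > lo else 1.0
            if lower_is_better[j]:
                v = 1.0 - v  # flip error criteria so higher normalized value is better
            scores[name] += weights[j] * v
    return sorted(names, key=lambda n: -scores[n])

# Hypothetical criteria per model: (RMSE, MAE, MAPE, R)
metrics = {
    "RT":  (0.95, 0.70, 24.0, 0.78),
    "RF":  (0.62, 0.45, 15.0, 0.91),
    "BT":  (0.70, 0.50, 18.0, 0.88),
    "GPR": (0.75, 0.55, 19.0, 0.86),
}
ranking = rank_models(metrics, lower_is_better=[True, True, True, False])
print(ranking[0])  # RF: it dominates every criterion in this made-up example
```

Normalizing before summing keeps criteria with different units (percent error vs. a dimensionless R) from silently dominating the aggregate score.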

Table 18. Accuracy of the obtained models for TDS parameter prediction according to defined criteria.

Table 19. Determining the optimal prediction model for the TDS parameter using Simple Multi-Criteria Ranking.

Table 20. Accuracy of the ML models in predicting individual water parameters.

Table 21. The most influential input variables for predicting water parameters.