Article

Evolutionary-Assisted Data-Driven Approach for Dissolved Oxygen Modeling: A Case Study in Kosovo

1 Department of Computer Science, Federal University of Lavras, Lavras 36036-900, MG, Brazil
2 Computational Modeling Program, Federal University of Juiz de Fora, Juiz de Fora 36036-900, MG, Brazil
3 Department of Computational Modeling, Polytechnic Institute, Rio de Janeiro State University, Nova Friburgo 22000-900, RJ, Brazil
4 Food Science and Biotechnology, University for Business and Technology, 10000 Prishtina, Kosovo
5 Department of Agricultural and Biological Engineering, Mississippi State University, Starkville, MS 39762, USA
6 Laboratoire Eaux, Hydro-Systèmes et Agriculture (LEHSA), Institut International d’Ingénierie de l’Eau et de l’Environnement (2iE), Rue de la Science, Ouagadougou 01 B.P. 594, Burkina Faso
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Earth 2025, 6(3), 81; https://doi.org/10.3390/earth6030081
Submission received: 31 May 2025 / Revised: 5 July 2025 / Accepted: 9 July 2025 / Published: 21 July 2025

Abstract

Dissolved oxygen (DO) is widely recognized as a fundamental parameter in assessing water quality, given its critical role in supporting aquatic ecosystems. Accurate estimation of DO levels is crucial for effective management of riverine environments, especially in anthropogenically stressed regions. In this study, a hybrid machine learning (ML) framework is introduced to predict DO concentrations, where optimization is performed through Genetic Algorithm Search with Cross-Validation (GASearchCV). The methodology was applied to a dataset collected from the Sitnica River in Kosovo, comprising more than 18,000 observations of temperature, conductivity, pH, and dissolved oxygen. The ML models Elastic Net (EN), Support Vector Regression (SVR), and Light Gradient Boosting Machine (LGBM) were fine-tuned using cross-validation and assessed using five performance metrics: coefficient of determination (R²), root mean square error (RMSE), mean absolute error (MAE), mean absolute relative error (MARE), and mean square error (MSE). Among them, the LGBM model yielded the best predictive results, achieving an R² of 0.944 and an RMSE of 8.430 mg/L on average. A Monte Carlo Simulation-based uncertainty analysis further confirmed the model’s robustness, enabling comparison of the trade-off between uncertainty and predictive precision. Comparison with recent studies confirms the proposed framework’s competitive performance, demonstrating the effectiveness of automated tuning and ensemble learning in achieving reliable and real-time water quality forecasting. The methodology offers a scalable and reliable solution for advancing data-driven water quality forecasting, with direct applicability to real-time environmental monitoring and sustainable resource management.

1. Introduction

The amount of dissolved oxygen (DO) in watercourses is an essential indicator for understanding phenomena such as self-purification, microorganism respiration, and the metabolism of aquatic ecosystems. Considerable decreases in DO levels generally occur due to the biological oxidation of organic matter, which is intensified by the discharge of domestic and industrial effluents and the leaching of fertilizers. These circumstances promote phenomena such as eutrophication and anoxia, negatively affecting biodiversity and the biogeochemical balance of the aquatic environment [1]. Effective monitoring and management of freshwater systems have become central to global sustainability efforts. In this context, ensuring good ecological status of rivers is directly aligned with several targets under the United Nations Sustainable Development Goals (SDGs), most notably SDG 6 (related to Clean Water and Sanitation), which advocates for the availability and sustainable management of water resources, and SDG 14 (Life Below Water), which aims to reduce pollution and protect aquatic ecosystems.
A particularly important indicator of riverine health is DO, a measure of the oxygen available for aquatic organisms. DO is a key integrative parameter, as it reflects the cumulative effects of physical, chemical, and biological processes occurring in aquatic systems [2]. Low DO levels, often resulting from anthropogenic stressors such as untreated sewage discharge, agricultural runoff, and industrial effluents, can lead to hypoxia, ecosystem degradation, and the collapse of sensitive aquatic species [3,4,5,6]. Thus, the ability to accurately monitor and predict DO concentrations in real time is essential for safeguarding riverine ecosystems and supporting data-driven water management.
Machine learning-based approaches for predicting dissolved oxygen in watercourses have advantages when combined with traditional field measurements. ML models enable real-time and continuous predictions, facilitating the early detection of hypoxic or anoxic conditions [7]. This predictive capability is more efficient and often more cost-effective than intensive monitoring with physical sampling, especially on a large scale or in hard-to-reach locations [8]. Furthermore, the adoption of these models can improve data-driven decision-making for the protection of aquatic ecosystems [9,10].
Recent developments in ML have improved DO modeling across a variety of hydrological conditions. Studies employing ensemble methods and deep learning architectures such as LSTM and hybrid neural networks have shown good predictive accuracy across Asia, especially in China and India [8,9,10,11,12,13,14,15]. Similarly, studies conducted in North America have effectively used cutting-edge deep learning models to capture DO changes in dynamic and complex environmental settings [6,16,17]. Together, these initiatives demonstrate ML’s scalability and versatility while highlighting the global trend toward incorporating AI in aquatic environmental monitoring.
The majority of research has focused on areas with robust monitoring systems and a wealth of datasets. By contrast, the literature continues to underrepresent under-monitored regions, such as Eastern Europe and the Western Balkans. Dodig et al. [18] used ML techniques, specifically long short-term memory (LSTM) networks, to predict the water quality of the Sava River, located in southeastern Europe and part of the Danube river basin. In the work of He et al. [19], ML techniques were used to predict DO using data from a 45 km stretch, running from west to east, along the River Thames. Krivoguz et al. [20] conducted a study on DO prediction, applying Random Forest (RF), in the Black Sea area, which borders the Balkan region.
These areas present particular challenges that require specialized solutions, as they are often characterized by significant anthropogenic pressure and limited data availability. Validating and modifying ML-based DO prediction techniques in environments with limited data is therefore urgently needed. This study addresses a critical knowledge gap through the application of an interpretable and evolutionarily optimized machine learning framework to the Sitnica River in Kosovo, a system historically lacking comprehensive data.
Optimization-enhanced ML models have demonstrated further promise. Yang [21] evaluated multiple training strategies, including Teaching-Learning-Based Optimization (TLBO), the Sine Cosine Algorithm (SCA), the Water Cycle Algorithm (WCA), and Electromagnetic Field Optimization (EFO), for training Multilayer Perceptron Neural Networks (MLPNNs). Their results highlighted the EFO-MLPNN as the most efficient (mean absolute error (MAE) = 1.0002, root mean square error (RMSE) = 1.2903, and R = 0.88154), outperforming previous efforts such as the Multi-Verse Optimizer (MVO) [22] and Bayesian Model Averaging (BMA) [23]. Ziyad Sami et al. [7], in turn, utilized an ANN to predict DO levels in the Feitsui Reservoir in Taiwan, optimizing the number of neurons to achieve accurate results (coefficient of determination R² = 0.98).
DO prediction has also been effectively achieved using ensemble and boosting methods. For instance, Moon et al. [24] used AdaBoost, RF, and Gradient Boosting algorithms to predict DO in the Hwanggujicheon region, with AdaBoost achieving superior performance (RMSE = 0.015, MAE = 0.009, and R² = 0.912). Similarly, Qambar and Al Khalidy [25] demonstrated exceptional prediction accuracy and reduced energy costs using boosted algorithms.
Hybrid models further enhance predictive capacity by integrating complementary learning strategies. In the Yamuna River case study, Arora and Keshari [26] employed Adaptive Neuro-Fuzzy Inference Systems (ANFIS) with grid partitioning (ANFIS-GP) and subtractive clustering (ANFIS-SC), achieving an R² of 0.953 with ANFIS-GP. Khan and Byun [27] developed the GA-XGCBXT model, combining Genetic Algorithms (GA) with eXtreme Gradient Boosting (XGB), CatBoost (CB), and eXtra Trees (XT), yielding a mean square error (MSE) of 0.310.
Stacked ensemble approaches have also proven highly effective. Kozhiparamban et al. [15] proposed a stacked model that implemented Kernel Ridge Regression (KRR), Elastic Net (EN), and Light Gradient Boosting Machine (LGBM), achieving substantial performance gains (MAE = 0.0176, RMSE = 0.0319) over individual models. Guo et al. [28] evaluated classical ML models such as Decision Trees (DT), Multilayer Perceptrons (MLP), Naive Bayes (NB), K-Nearest Neighbors (KNN), and Support Vector Machines (SVM). The results indicate that the DT (C4.5) and MLP models offered the best performance (RMSE = 0.068 and 0.055, respectively).
Recently, interpretable machine learning models have emerged, enabling both accurate prediction and transparent variable assessment [29]. Chen et al. [8] developed an ensemble framework for six Chinese estuaries using SHapley Additive Explanations (SHAP) analysis to evaluate feature importance, emphasizing variables such as pH, electrical conductivity (EC), and nutrient loads. Their approach highlighted both local interactions and lagged dependencies, improving the interpretability of DO prediction models. Hybrid architectures integrating signal processing and evolutionary computation have further advanced model accuracy. Zhao and Chen [30] introduced the DWT-KPCA-GWO-XGBoost model, which incorporates the Discrete Wavelet Transform (DWT) for denoising, Kernel Principal Component Analysis (KPCA) for feature reduction, and Grey Wolf Optimization (GWO) for hyperparameter tuning. This model significantly outperformed conventional approaches in forecasting DO in the Yangtze River basin.
Attention-based models have become increasingly prominent in water quality prediction, especially for forecasting DO and other key parameters. Li et al. [31] developed a transformer-based framework incorporating multi-scale temporal fusion and dynamic time-series decomposition to handle the nonstationarity and multi-scale nature of DO dynamics, outperforming seven DL baselines in accuracy and robustness. Building on this, Zhao and Chen [32] proposed a hybrid model combining wavelet convolution, variational mode decomposition (VMD), and a frequency-enhanced attention mechanism with Shapley Additive Explanations (SHAP), allowing for the interpretation of interactions between meteorological and water quality variables.
Recent advancements have also emphasized the integration of domain knowledge into data-driven models through physics-informed machine learning (PIML) and transfer learning. Koksal and Aydin [33] developed a hybrid framework combining transfer learning with physics-informed modeling to predict DO concentrations in an industrial wastewater treatment plant. Their approach leveraged knowledge from an open-source physics-based simulation and a real-world plant characterized by noisy and incomplete data. The proposed model improved prediction performance by up to 59% in validation scenarios.
Interpretability has also become a key focus in recent work, with techniques like SHAP being employed to quantify feature contributions and enhance transparency [29,34]. By coupling explainable AI with physical domain knowledge, models such as the PKBiLSTM [34] and DWT-KPCA-GWO-XGBoost [35] have successfully captured nonlinear interactions and seasonality in DO trends, while offering insights into model behavior.
Among numerous physicochemical parameters, DO remains the most sensitive and integrative indicator of aquatic ecosystem quality. Accurate modeling of DO dynamics enables risk anticipation and informs water quality control strategies. Given the challenges of manual parameter interaction analysis, data-driven and automated ML techniques offer a scalable and precise alternative.
Despite these advances, few studies have tested these advanced techniques in under-monitored or data-scarce regions, where robust and scalable models are especially needed. The Sitnica River was selected as a case study due to several factors that make it important for assessing anthropogenic impacts on inland aquatic ecosystems. The river is exposed to a wide range of anthropogenic pollutants, including untreated urban wastewater, industrial discharges, and agricultural runoff, making it a representative model of polluted rivers in the Western Balkans. In addition, there is a significant lack of comprehensive ecological and microbiological data for the river, despite its environmental importance and the continuous pressures it faces. This data gap limits the development of effective measures for its management and protection. Therefore, the study of the Sitnica River addresses both a scientifically relevant case of human impact and a need for baseline ecological data in an under-researched region.
The Sitnica River drains an area of 2861 km² and flows through the majority of the Kosovo Plain. It is the only major river that flows entirely (approximately 90 km in length) within the borders of the Republic of Kosovo [36]. A typical plain river, it is characterized by frequent changes in its course and by flooding. Although the Sitnica does not have a distinct spring, it takes its name from the location where the Shtime stream meets the Sazli stream on its left side, near the village of Robovc. Its source is considered to be Topila, which originates at the northern end of Derman Peak (1364 m). From Topila (1280 m) to the point where the Sitnica meets the Ibër, the elevation decreases to 497.2 m; its total drop is therefore 782.8 m, while its relative drop is 7.2%. The average elevation of the Sitnica River is 734 m, with only 7.8% of its course exceeding 1000 m. The river network density of the Sitnica, calculated using the Neumann formula, is 824 m/km² on the right bank and 512.8 m/km² on the left [37]. The river is known for its calm hydrological regime, with an average flow rate of 12.9 m³/s. The Sitnica is thus one of Kosovo’s main rivers and the main tributary of the Ibër, which eventually drains into the Black Sea [36].
The novelty of this study lies in the integration of a robust and interpretable machine learning framework combining evolutionary optimization (GASearchCV), uncertainty quantification via Monte Carlo Simulation, and SHAP-based feature attribution to predict DO in a data-scarce and environmentally stressed region. Rather than proposing new algorithms, the innovation stems from adapting and validating this scalable pipeline in the Sitnica River, where high-frequency yet limited-variable monitoring presents practical challenges rarely addressed in the literature.
The remainder of this study is organized as follows: Section 2 describes the dataset, the machine learning models employed, the optimization algorithm, and the performance evaluation metrics. Section 3 covers the computational experiments and a discussion of the results achieved. Finally, the main conclusions are summarized in Section 4.

2. Materials and Methods

The Sitnica River is the longest watercourse entirely contained within the territory of Kosovo, with an approximate length of 90 km [38]. It originates near the Sazli lagoon, north of Ferizaj, and flows into the Ibar River in Mitrovica at an elevation of 499 m above sea level. The drainage basin spans between 2861 km² and 3129 km², corresponding to roughly 26% of Kosovo’s land area [36,39]. The spatial configuration of the hydrographic network is illustrated in Figure 1, highlighting the Sitnica River within the broader river system of Kosovo [36].
The basin is characterized by predominantly flat terrain, which contributes to the river’s meandering path. A dendritic and partially parallel drainage pattern has been identified, with approximately 70% of tributaries located on the right bank, forming confluence angles of 2° to 5°. Left-bank tributaries typically exhibit narrower angles ranging from 1° to 2°, suggesting topographic and structural asymmetry across the basin [40,41].

2.1. Dataset

The dataset used in this investigation [42] was extracted from part of the InWaterSense project [43], which deployed a wireless sensor network (WSN) on the bank of the Sitnica River in the village of Plemetin, near the capital of Kosovo, Prishtina (Figure 2). Monitoring was remote, continuous, and conducted in real time.
The raw water quality dataset consisted of high-frequency measurements recorded between 1 May 2015 and 21 December 2015. The dataset included dissolved oxygen (DO), temperature (T), pH, and electrical conductivity (EC), acquired through in situ sensors. A total of 29,842 samples were initially available. To ensure model reliability, a cleaning and segmentation procedure was applied to the DO time series.
Figure 3 illustrates the DO time series before and after cleaning. The blue curve corresponds to the accepted data subset used in the modeling phase, and the red curve represents the discarded portion, which shows abrupt fluctuations and irregular behavior likely caused by sensor drift or interference. The first 18,360 samples, as shown in Figure 4, were retained for modeling due to their relative consistency and plausibility, while the remaining records were discarded.
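As a minimal sketch of this cleaning and segmentation step, the snippet below retains the first 18,360 records of the series; the file name and column names are assumptions for illustration and may differ from the published dataset schema [42].

```python
import pandas as pd

# Minimal sketch of the cleaning/segmentation step (file and column names are
# hypothetical; the published dataset may use a different schema).
df = pd.read_csv("sitnica_wsn.csv", parse_dates=["timestamp"])
df = df.sort_values("timestamp").reset_index(drop=True)

accepted = df.iloc[:18360].copy()   # consistent segment retained for modeling
discarded = df.iloc[18360:]         # segment with abrupt fluctuations, dropped

print(f"Accepted: {len(accepted)} samples; discarded: {len(discarded)} samples")
```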
Table 1 presents the basic statistics of the input variables in the dataset. The dataset spans a wide range of environmental conditions. Temperature values range from 2.38 °C to 28.17 °C, with a mean of 14.79 °C, reflecting seasonal variation during the monitoring period. Electrical conductivity exhibits moderate dispersion (SD = 90.55 μS/cm), with values ranging from 22.80 to 690.80 μS/cm, suggesting potential variation in ion concentration due to runoff or industrial influence. The pH values show a broader range (0.02 to 11.50), although the interquartile range (IQR = 7.40 − 4.81 = 2.59) suggests that most readings fall within a realistic environmental window. However, extreme values (e.g., pH near 0 or above 11) likely indicate brief sensor errors or chemical discharges and were flagged in preprocessing.
Figure 5 displays their correlation matrix and p-value matrix. Temperature has the highest correlation with DO, followed by EC and pH. However, other characteristics, such as EC, also exhibit a high correlation with T, which in turn influences DO prediction. This information provides insight into the relationships between these characteristics and their influence on DO. With the exception of the association between T and pH (p = 0.074), which is not statistically significant at the traditional 5% threshold, the remaining associations are highly significant (p < 0.001).
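A sketch of how such correlation and p-value matrices can be computed is given below, assuming the cleaned data are held in a DataFrame named accepted with columns T, EC, pH, and DO (assumed names).

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Pairwise Pearson correlations and p-values (column names are assumed).
cols = ["T", "EC", "pH", "DO"]
corr = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
pval = pd.DataFrame(np.zeros((len(cols), len(cols))), index=cols, columns=cols)

for a in cols:
    for b in cols:
        if a != b:
            r, p = pearsonr(accepted[a], accepted[b])
            corr.loc[a, b], pval.loc[a, b] = r, p

print(corr.round(3))
print(pval.round(4))
```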
Figure 6 presents a visualization of the sample distribution in the dataset. The pairwise scatter does not follow a straight line, suggesting nonlinear and complex relationships among the variables, characterized by dispersed patterns.

2.2. Machine Learning Models

This section presents the ML algorithms used in this work. Elastic Net was selected due to its effectiveness in dealing with multicollinearity, SVR for its ability to capture nonlinear relationships in noisy datasets, and LGBM for its computational efficiency and performance on multivariate data.

2.2.1. Elastic Net

The EN model, proposed by Zou and Hastie [44], is a hybrid regression technique based on penalized least squares, which performs regularization and feature selection. EN combines two shrinkage regression methods: Ridge (L2 penalty), which addresses problems with strong multicollinearity, and LASSO (L1 penalty), which selects features by shrinking regression coefficients [45].
The underlying multiple linear regression model, which relates the response variable to the predictor variables, is given by Equation (1):
$$ y_i = \beta_0 + x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{ip}\beta_p + \epsilon_i \qquad (1) $$
where $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, p$. In this context, $y_i$ denotes the response corresponding to the ith observation, and the intercept term is represented by $\beta_0$. The variable $x_{ij}$ denotes the jth predictor associated with the ith sample, and $\beta_j$ indicates the regression coefficient of the jth predictor. This coefficient reflects the expected change in the response variable $y_i$ for a one-unit change in the jth predictor $x_{ij}$. It is assumed that the predictors are standardized by centering (subtracting the mean) and scaling (dividing by the standard deviation), resulting in variables with zero mean and unit variance [46].
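For illustration, a minimal Elastic Net sketch using scikit-learn is shown below; the standardization mirrors the assumption above, and the alpha and l1_ratio values are placeholders rather than the tuned hyperparameters.

```python
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Elastic Net with standardized predictors; alpha and l1_ratio are illustrative.
en_model = make_pipeline(
    StandardScaler(),                     # zero-mean, unit-variance predictors
    ElasticNet(alpha=1.0, l1_ratio=0.5),  # blends the L1 (LASSO) and L2 (Ridge) penalties
)
# en_model.fit(X_train, y_train); y_pred = en_model.predict(X_test)
```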

2.2.2. Support Vector Regression

SVR is an ML technique applied to regression problems [47]. It can model both linear and nonlinear relationships through the choice of kernel function. The technique considers a dataset of points $(x_1, y_1), \ldots, (x_l, y_l)$, where $x_i \in \mathbb{R}^n$ is an input vector and $y_i \in \mathbb{R}$ is the corresponding target value. The standard formulation of SVR is expressed as an optimization problem through the hyperparameters $\epsilon > 0$ and $C > 0$, as shown in Equation (2):
$$ \min_{w,\,b,\,\xi,\,\xi^*} \ \frac{1}{2} w^{T} w + C \sum_{i=1}^{l} \xi_i + C \sum_{i=1}^{l} \xi_i^{*} \qquad (2) $$
subject to
$$ w^{T}\phi(x_i) + b - y_i \le \epsilon + \xi_i, \qquad y_i - w^{T}\phi(x_i) - b \le \epsilon + \xi_i^{*}, \qquad \xi_i,\ \xi_i^{*} \ge 0, \quad i = 1, \ldots, l. $$
The optimization problem may be expressed in its dual form as Equation (3):
$$ \min_{\alpha,\,\alpha^*} \ \frac{1}{2} (\alpha - \alpha^{*})^{T} K(x_i, x_j) (\alpha - \alpha^{*}) + \sum_{i=1}^{l} (y_i + \varepsilon)(\alpha_i - \alpha_i^{*}) \qquad (3) $$
subject to
$$ e^{T}(\alpha - \alpha^{*}) = 0, \qquad 0 \le \alpha_i,\ \alpha_i^{*} \le C, \quad i = 1, \ldots, l, $$
where $K(x_i, x_j) = \phi(x_i)^{T}\phi(x_j)$ is the kernel function and $\phi(\cdot)$ is the associated feature mapping. By solving Equation (3), the parameters needed to generate the SVR approximation are obtained. The SVR estimate is then computed using Equation (4):
$$ \hat{y} = \sum_{i=1}^{l} (\alpha_i^{*} - \alpha_i) K(x_i, x) + b. \qquad (4) $$
Real data are often not linearly separable in the original feature space. To overcome this limitation, SVR employs the kernel technique, which maps the data to a higher-dimensional space where linear separation becomes possible [48]. Furthermore, SVR is robust to noise, making it an appropriate choice in scenarios with noisy data or high variability and helping to maintain model accuracy in the face of uncertainties or failures in the input data. SVR was chosen for this work precisely because its kernel functions make such a mapping possible.
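A minimal sketch of an ε-SVR with an RBF kernel in scikit-learn is shown below; the values of C, gamma, and epsilon are illustrative placeholders, not the values later selected by GASearchCV.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Epsilon-SVR with an RBF kernel (Equations (2)-(4)); hyperparameters are placeholders.
svr_model = make_pipeline(
    StandardScaler(),
    SVR(kernel="rbf", C=1.0, gamma=0.1, epsilon=0.1),
)
# svr_model.fit(X_train, y_train); y_pred = svr_model.predict(X_test)
```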

2.2.3. Light Gradient Boosting Machine

The LGBM algorithm can be seen as an enhanced version of the Gradient Boosting Decision Tree (GBDT). It utilizes decision trees (DT) in which the leaves expand vertically (leaf-wise), as shown in Figure 7, which increases the capacity and scalability of the algorithm [49]. It uses two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). With EFB, computational resources are reduced without compromising the precision of split point detection [50]. With GOSS, instances with larger gradients are preferentially selected to estimate the information gain while accuracy is preserved.
LGBM employs a leaf-wise tree growth method that enables the trees to converge faster, albeit at the risk of increased overfitting. The most significant hyperparameter of LGBM is the number of leaves, which regulates the algorithm’s complexity; the depth of the tree and the number of leaves of the weak learners are related. The estimated gain of a split may be stated using Equation (5), where $x_1, \ldots, x_n$ denote the training data instances and $g_1, \ldots, g_n$ denote the negative gradients of the loss function in each iteration [50]:
$$ \tilde{V}_j(d) = \frac{1}{n}\left[ \frac{\left( \sum_{x_i \in A_l} g_i + \frac{1-a}{b} \sum_{x_i \in B_l} g_i \right)^{2}}{n_l^{j}(d)} + \frac{\left( \sum_{x_i \in A_r} g_i + \frac{1-a}{b} \sum_{x_i \in B_r} g_i \right)^{2}}{n_r^{j}(d)} \right] \qquad (5) $$
in which $A_l = \{x_i \in A : x_{ij} \le d\}$, $A_r = \{x_i \in A : x_{ij} > d\}$, $B_l = \{x_i \in B : x_{ij} \le d\}$, $B_r = \{x_i \in B : x_{ij} > d\}$, and $\frac{1-a}{b}$ is the normalization coefficient.
The ability of the LGBM algorithm to model complex nonlinear interactions is a crucial feature when working with data where the relationships between the variables are not linear or easily separated [51]. The model employs a leaf-wise growth strategy that, instead of growing the trees in a balanced (level-wise) manner, focuses on splitting the nodes with the greatest error, making it possible to identify these interactions more effectively than other traditional models [52].
In addition, LGBM implements a boosting-based ensemble approach, in which each subsequent learner aims to correct the residual errors of the previous ones. This iterative procedure enables incremental improvements in predictive performance as new estimators are added [53]. In general, increasing the number of estimators enhances model accuracy, albeit at the cost of higher computational demands during training. Compared to non-ensemble models, LGBM provides predictions with robustness and generalization, making it well-suited for modeling complex, nonlinear patterns in water quality prediction.
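A minimal sketch of an LGBM regressor is given below; the hyperparameter values are illustrative and, in the proposed pipeline, would instead be selected by GASearchCV from the ranges in Table 5.

```python
from lightgbm import LGBMRegressor

# LightGBM regressor with leaf-wise growth; values shown are illustrative only.
lgbm_model = LGBMRegressor(
    boosting_type="gbdt",
    num_leaves=31,       # main complexity control for leaf-wise growth
    learning_rate=0.09,
    n_estimators=300,
    reg_alpha=0.2,       # L1 regularization
    reg_lambda=1.0,      # L2 regularization
)
# lgbm_model.fit(X_train, y_train); y_pred = lgbm_model.predict(X_test)
```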

2.3. Optimization Algorithm

This section presents the optimization strategy employed to enhance the performance of ML models for DO prediction. To this end, a Genetic Algorithm Search with Cross-Validation was adopted as a robust metaheuristic approach for hyperparameter tuning. The technique enables the automated selection of hyperparameter configurations that improve model generalization by minimizing regression error metrics.
Cross-validation (CV) is a model evaluation technique in which the dataset is split into several parts (or “folds”) to ensure the model is tested on different subsets of the data, thereby increasing its reliability and reducing the risk of overfitting. Using this technique, it is possible to analyze the performance of algorithms more reliably, as the entire dataset is trained and tested through this approach [54].
K-Fold is one of the most used cross-validation approaches. It divides the available data into K parts; some of these parts are used for training the model, and others for testing the model’s performance [54]. In this work, three folds were used for cross-validation (CV). Figure 8 illustrates an example of a 3-fold CV.
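The snippet below sketches this 3-fold scheme with scikit-learn, assuming X_train and y_train hold the training predictors and DO targets; the LGBM regressor is used only as an example estimator.

```python
from lightgbm import LGBMRegressor
from sklearn.model_selection import KFold, cross_val_score

# 3-fold cross-validation of an example estimator (X_train/y_train assumed to exist).
cv = KFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(LGBMRegressor(), X_train, y_train, cv=cv, scoring="r2")
print(scores.mean(), scores.std())
```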
GASearchCV is implemented in Python in the sklearn-genetic-opt library [55] and is based on the genetic algorithm, an evolutionary algorithm inspired by Darwinian natural selection. It is a stochastic algorithm in which candidate solutions are produced at each iteration, the strongest solutions are preserved, and the weakest ones are eliminated. It offers an alternative way to define the hyperparameters that maximize performance metrics for classification problems (such as accuracy, recall, and precision) or minimize error metrics for regression problems (such as MSE, MAE, and RMSE).
The GASearchCV steps involve selecting random sets of hyperparameters from a defined search space, with a fixed population size. The model is fitted for each set of hyperparameters, and the fitness function value is calculated at each cross-validation step. Candidates are then evaluated according to their fitness, and new generations are created by combining the best candidates into new hyperparameter sets with different configurations. These steps are repeated until the number of generations is reached or another stopping criterion is met [56]. The best hyperparameters are chosen according to the best cross-validation score.
GASearchCV has been used in several applications. Rimal and Sharma [57] used GASearchCV to optimize the parameters of ML algorithms for predicting heart disease. Oliveira et al. [58] explored the technique for detecting failures in industrial processes. Sun et al. [59] used GASearchCV in the prognosis and diagnosis of COVID-19. Feroz et al. [60] have already applied the technique to aid in predicting earthquake damage under climate change. Furthermore, Perdio et al. [61] used GASearchCV to predict house prices, in order to assist buyers and sellers in the real estate sector. Figure 9 shows a flowchart that illustrates the steps of GASearchCV and the computational methodology proposed.
Finally, the ML models were evaluated according to five performance metrics: R², RMSE, MAE, mean absolute relative error (MARE), and MSE, whose mathematical definitions are provided in Table 2. The predicted output is denoted by ŷ, the observed value by y, the mean of the observed values by ȳ, and the total number of samples by N.
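As a sketch consistent with these definitions, the helper below computes the five metrics; the MARE formula used here (mean of |y − ŷ|/y) is an assumption to be checked against Table 2.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Five evaluation metrics; the MARE definition is an assumption (see Table 2).
def evaluate(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mse = mean_squared_error(y_true, y_pred)
    return {
        "R2": r2_score(y_true, y_pred),
        "RMSE": np.sqrt(mse),
        "MAE": mean_absolute_error(y_true, y_pred),
        "MARE": np.mean(np.abs(y_true - y_pred) / y_true),
        "MSE": mse,
    }
```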

3. Computational Experiments

3.1. Computational Settings

The computational experiments were conducted in Python using the Pandas [62] and Scikit-learn [63] libraries. All experiments were run on a computer with the following specifications: a 13th-generation Intel Core™ i7-13700 processor (16 cores, 24 threads, 30 MB cache, 2.10 GHz to 5.20 GHz Turbo, 65 W) and the Ubuntu 22.04 operating system. Table 3 shows the libraries used and their respective versions.
The data were divided into 70% for training and 30% for the testing set, as shown in Figure 10. Fifty runs were executed, indexed from 0 to 49; at each iteration a different random seed was used to produce different k-fold splits (k = 3) within the training set.
Table 3. Python (version 3.8) libraries and versions used.

Library        Version
GASearch       0.06
Scikit-learn   1.2
Pandas         2.1
Hydroeval      0.1.0
Lightgbm       4.6
Figure 10. Training and testing sets.
The parameters used in the GASearchCV optimization algorithm are presented in Table 4. The range of values for each hyperparameter of the EN, SVR, and LGBM models to be tuned by metaheuristics is shown in Table 5.
Following best practices for balancing exploration and convergence in hyperparameter optimization [64,65], we set a population size of 50 individuals, 30 generations, and a mutation probability of 0.09, informed by empirical testing [66]. A population of 50 individuals provides enough variation to scan the search space efficiently while requiring a reasonable amount of computing power. Convergence was achieved after 30 generations, as performance indicators stabilized at this stage and further iterations yielded minimal improvement. A balanced rate of genetic variety was achieved through a mutation probability of 0.09, low enough to ensure steady model improvement and high enough to prevent premature convergence.
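A minimal sketch of how GASearchCV might be wired up with these settings is shown below; the search-space bounds are illustrative stand-ins for the ranges in Table 5, and X and y are assumed to hold the predictors and DO targets.

```python
from lightgbm import LGBMRegressor
from sklearn.model_selection import KFold, train_test_split
from sklearn_genetic import GASearchCV
from sklearn_genetic.space import Categorical, Continuous, Integer

# 70/30 split and GA-driven hyperparameter search (bounds below are illustrative).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

param_grid = {
    "learning_rate": Continuous(0.01, 0.1),
    "n_estimators": Integer(100, 300),
    "reg_lambda": Continuous(0.0, 2.0),
    "boosting_type": Categorical(choices=["gbdt", "dart"]),
}
search = GASearchCV(
    estimator=LGBMRegressor(),
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    population_size=50,
    generations=30,
    mutation_probability=0.09,
)
# search.fit(X_train, y_train); best_params = search.best_params_
```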

3.2. Performance Comparison Analysis

The performance metrics shown in Table 6 demonstrate significant differences among the three tested models. A total of 50 runs were performed for each model to ensure statistical robustness, with the average values and standard deviations reported for R², RMSE, MAE, MARE, and MSE.
The analysis of the boxplots in Figure 11, together with the results in Table 6, reveals that the EN model has the lowest performance among the evaluated models, with an average R² = 0.517 ± 0.029, indicating limited explanatory power and weak predictive capability. Its RMSE and MAE values likewise suggest that the model struggles to make accurate predictions, as evidenced by the high MARE (0.480 ± 0.026). On the other hand, LGBM exhibits the best performance across all metrics, achieving R² = 0.944 ± 0.005, which indicates an excellent fit and high predictive accuracy. The relatively small errors and standard deviations in all metrics confirm the model’s precision and consistency. Finally, although SVR showed the greatest variability among the models, its performance across all metrics was moderate, reaching R² = 0.605 ± 0.097.
Regarding the performance metrics, it is essential to examine the source of errors in relation to the bias-variance trade-off. The LGBM model achieved the lowest RMSE (8.43 mg/L) and the smallest standard deviation across 50 runs, indicating both high accuracy (low bias) and strong generalization (low variance). This reflects LGBM’s boosting-based ensemble structure (see Equation (5)), where each decision tree sequentially corrects residuals from prior learners by fitting the negative gradient of the loss function. This iterative process effectively reduces both bias and variance, making LGBM robust to nonlinearity and noise. In contrast, the EN model’s higher RMSE and low variability suggest underfitting (high bias), while SVR’s moderate accuracy but higher variability reflect sensitivity to training data splits (higher variance).

3.3. Parametric Analysis

The distributions of the EN and SVR parameters are displayed in Figure 12 and Figure 13, respectively. For EN, the α parameter varied mainly between 0 and 20, while the L1 ratio ranged approximately from 0.0 to 0.4. The positive parameter was set to False in 33 of the runs, and fit_intercept was True in all runs. This behavior indicates that the model consistently relied on a bias term but lacked a consistent penalty pattern, resulting in instability in its predictive performance. For SVR, the regularization parameter C was concentrated approximately between 0.0 and 0.8, γ between 7.5 and 9, and ε between 0.12 and 0.15. The SVR model showed tighter convergence, suggesting it achieved moderate consistency in tuning; yet its overall performance was constrained by the kernel’s capacity to model nonlinearity effectively.
Figure 14 shows the distribution of the LGBM parameters across multiple runs, indicating a clear convergence toward specific configurations. The learning rate (LR) was predominantly concentrated between 0.085 and 0.095, while the regularization parameter λ varied within the range [0, 2], and the number of estimators stabilized between 289 and 299. The regularization parameter reg_alpha ranged approximately between 0.1 and 0.3. The max_depth parameter was 10 in 42 of the runs, and boosting_type was consistently set to gbdt. This suggests that the model has high stability, corroborating the robustness observed in the predictive performance.

3.4. Uncertainty Assessment

In the current study, a normal distribution represented the variability of the input parameters, with lower and upper limits defined by the minimum and maximum values listed in Table 1. For each model developed in this manuscript, deterministic results were calculated for every individual Monte Carlo Simulation (MCS) run, resulting in 25,000 computed outcomes for DO. The Mean Absolute Deviation (MAD), computed with respect to the median of the output distribution, is defined according to Equation (6), and the model output’s uncertainty is in turn specified by Equation (7):
$$ MAD = \frac{1}{25{,}000} \sum_{i=1}^{25{,}000} \left| DO_i - \mathrm{median}(DO) \right| \qquad (6) $$
$$ \mathrm{Uncertainty}\ (\%) = 100 \times \frac{MAD}{\mathrm{median}(DO)} \qquad (7) $$
where $DO_i$ corresponds to the predicted dissolved oxygen value for the ith sample. The MCS methodology is comprehensively described in [67].
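A short sketch of Equations (6) and (7), applied to the array of 25,000 MCS outputs, is given below.

```python
import numpy as np

# MAD and percentage uncertainty (Equations (6)-(7)) over the MCS outputs.
def mcs_uncertainty(do_pred):
    do_pred = np.asarray(do_pred, float)   # 25,000 simulated DO outcomes
    med = np.median(do_pred)
    mad = np.mean(np.abs(do_pred - med))   # mean absolute deviation from the median
    return mad, 100.0 * mad / med          # (MAD, uncertainty in %)
```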
Table 7 presents the results of the uncertainty analysis for DO predictions using the proposed models. Among them, the EN model exhibited the lowest uncertainty, while LGBM demonstrated the lowest RMSE (7.5%), as illustrated in Figure 15. These results suggest that LGBM is the most reliable model for practical applications, despite its comparatively higher uncertainty, as its predictions remain stable under variations in conductivity, pH, and temperature. In contrast, EN showed the highest RMSE (23.8%), reflecting poor generalization under input fluctuations. These results demonstrate that environmental modeling requires more than low error alone; the accuracy and stability of LGBM support its application in operational dissolved oxygen forecasting systems.
An ANOVA test was applied and showed a significant difference among the results obtained by the methods. These results confirm that LGBM consistently outperforms the other models in terms of accuracy and robustness.
In Figure 16, scatterplots illustrate the alignment between the measured and predicted DO values for the best-performing models. The solid line is the ideal fit (1:1 reference), while the dots represent the estimates generated by the machine learning models based on their respective input features. The EN model deviates markedly from the 1:1 line, indicating low predictive accuracy (R² = 0.543, RMSE = 23.780). SVR achieves better results than EN, although it still shows noticeable scatter, particularly at elevated DO levels, consistent with its moderate performance (R² = 0.707). The LGBM model stands out by aligning closely with the 1:1 line, with predictions tightly clustered around the observed values. This visual evidence underscores its metrics (R² = 0.956, RMSE = 7.474), highlighting its robustness in capturing the nonlinear behavior of DO. These findings confirm LGBM as the most accurate and reliable model for predicting DO.
A key component of the proposed methodology is the use of a genetic algorithm for hyperparameter optimization in the hybrid model. Replacing the genetic algorithm with other approaches, such as Particle Swarm Optimization or Ant Colony Algorithms, may be advantageous, particularly in high-dimensional problems. However, in typical machine learning applications involving a limited number of parameters (typically up to five), simpler strategies may be sufficient, reducing computational overhead without compromising performance.
The optimization task becomes significantly more challenging when feature selection is combined with hyperparameter search. In such cases, increasing the number of generations in a genetic algorithm does not guarantee better outcomes. It may result in excessive computational costs, highlighting a critical issue in environmental modeling contexts. However, for the ML models combined with the dataset in this paper, the algorithm has proven effective in providing a solution to the problem, as observed from the results of independent runs.

3.5. Feature Importance

The findings of the SHAP (SHapley Additive exPlanations) [68] analysis, which was used to assess the respective contributions of temperature, conductivity, and pH as input variables to the predicted DO levels, are shown in Figure 17 for the three machine learning models (EN, SVR, and LGBM). Particularly in environmental modeling, this interpretability stage is crucial for verifying that model choices align with domain expertise and for providing stakeholders with meaningful insights.
Figure 17 shows that, across all models, temperature displays the greatest SHAP value, demonstrating its dominant influence on DO prediction. This is consistent with the fact that oxygen solubility in water decreases as temperature increases [69]. The second most important factor is conductivity, which likely reflects the influence of ion concentration and possible pollution sources, such as industrial or agricultural runoff [70]. Despite having a smaller total contribution, pH has a noticeable impact, especially in the SVR and LGBM models.
In contrast to EN and SVR, the LGBM model exhibits a more stable and distinct separation of feature importances, confirming its suitability for capturing complex interactions and nonlinear relationships between input variables. The model’s transparency is enhanced by its consistent ranking of variable importance, a crucial prerequisite for incorporation into management or regulatory frameworks.
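A minimal sketch of this SHAP step for the tree-based model is shown below; TreeExplainer is assumed for LGBM, whereas a model-agnostic explainer (e.g., KernelExplainer) would be needed for EN and SVR, and the feature names are illustrative assumptions.

```python
import shap
from lightgbm import LGBMRegressor

# SHAP attribution for the tree-based model (tuned hyperparameters omitted for brevity).
model = LGBMRegressor().fit(X_train, y_train)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, feature_names=["T", "EC", "pH"])
```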

3.6. Study Limitations

The proposed modeling approach showed good predictive ability and robustness despite using only three input variables (temperature, conductivity, and pH). This parsimonious input set has several real-world benefits, especially in situations where sensor infrastructure is sparse or expensive. It is nevertheless important to acknowledge that the dataset does not include other critical determinants of dissolved oxygen dynamics, such as turbidity, flow velocity, nutrient loading, and biological oxygen demand. These factors are well known to affect DO through intricate biochemical and physical mechanisms. Although the model performs well under the present conditions, this limitation may render it less applicable to water bodies that are more complex or subject to different pressures. Incorporating additional environmental characteristics or combining the framework with mechanistic models could improve its adaptability in future developments. Nevertheless, the current work provides a strong foundation for cost-effective and scalable DO monitoring, particularly in regions with limited data, such as the Western Balkans.

3.7. Practical Implications and Regional Relevance

The results obtained in this study have direct applicability for engineers, environmental managers, and practitioners involved in water quality monitoring and management, particularly within the Sitnica River basin in Kosovo. The region is characterized by substantial industrial runoff, urban wastewater discharge, and agricultural effluent, generating anthropogenic pressures that significantly impact dissolved oxygen dynamics. Moreover, Kosovo faces infrastructural constraints, with limited coverage of traditional monitoring systems [71]. In this context, the proposed machine learning framework offers a practical and scalable solution. It leverages input features that are commonly measured by low-cost, in situ sensors, requires minimal manual calibration due to its automated tuning strategy, and maintains strong performance under uncertainty. These attributes make the model particularly well-adapted to the resource limitations and environmental complexities of the region, thereby supporting more effective, data-driven decision-making.
The proposed hybrid modeling framework, incorporating automated hyperparameter tuning through GASearchCV, can be integrated into existing sensor networks and environmental decision-support systems. For civil and environmental engineers managing wastewater treatment plants, the model enables the forecasting of critical dissolved oxygen levels, allowing for the timely adjustment of effluent discharges to avoid hypoxic conditions [72]. Similarly, water resource planners and hydrologists can use the model to simulate future DO trends under different scenarios of climate variability, land use change, or pollution events [73].
The robustness of the framework against nonlinear relationships and multicollinearity, confirmed by correlation analysis, makes it suitable for real-world deployment in dynamic and complex hydrological systems. The inclusion of temperature, conductivity, and pH as key predictors ensures that the model leverages parameters commonly measured by standard in situ sensors, facilitating operational integration.
In the specific context of the Sitnica River, which flows through regions affected by agricultural runoff, industrial discharge, and urban development, accurate DO prediction is critical for ecological preservation and public health. By utilizing locally collected high-frequency data, this study provides evidence-based support for environmental management and contributes to the digital transformation of monitoring practices in the Western Balkans. The proposed approach is scalable and transferable, offering a valuable blueprint for other data-scarce or developing regions aiming to strengthen their water quality governance.

4. Conclusions

This study introduces a novel hybrid machine learning framework for predicting DO concentrations in inland water systems, with a case study focused on the Sitnica River in Kosovo. The key innovation lies in the integration of evolutionary optimization via Genetic Algorithm Search with Cross-Validation (GASearchCV) into the model development pipeline, enabling automated and effective hyperparameter tuning for three distinct regressors (EN, SVR, and LGBM).
A significant contribution of this work is the adaptation of the framework to a real-world context characterized by anthropogenic stressors (e.g., industrial and agricultural runoff) and limited monitoring infrastructure. Despite using only three commonly available input parameters (temperature, conductivity, and pH), the LGBM model achieved high accuracy (R² = 0.956 and RMSE = 7.474 mg/L) and demonstrated strong robustness under uncertainty, as verified through Monte Carlo Simulation. To our knowledge, this is one of the first studies to implement an evolutionary-assisted ML approach for DO prediction in the Western Balkans. The methodology is not only scalable and reproducible but can also operate in data-scarce and high-variability environments. In practical terms, the proposed framework serves as a reliable and low-cost alternative for environmental monitoring and early warning systems in under-resourced regions.
Although the findings show promise for data-driven water quality monitoring, there are obstacles to practical application, such as the requirement for regular sensor calibration to preserve data quality, the computational limitations for real-time deployment on edge devices, and the dependence on continuous high-frequency data streams that might not be accessible in areas with limited resources. However, the approach offers a useful starting point for developing sustainable water management, especially when combined with domain-specific validation and existing monitoring infrastructure.
Future research should consider expanding the model’s scope to include additional water quality indicators (e.g., nutrient loads, turbidity), incorporating spatial variability across monitoring locations, and embedding explainable AI techniques to further enhance interpretability. Furthermore, coupling this framework with remote sensing data or physical process-based models could enable hybrid systems that combine the strengths of data-driven learning with domain-specific knowledge in hydrology. Ultimately, such developments will contribute to more resilient, transparent, and sustainable water resource management systems aligned with global environmental goals.

Author Contributions

Conceptualization: L.G. and P.C.; Methodology: L.G., B.d.S.M., L.L., D.L.F. and T.H.A.B.; Software: B.d.S.M., L.L., D.L.F. and T.H.A.B.; Validation: R.Y., O.F. and U.R.V.A.; Formal Analysis: O.F., E.H., P.B., U.R.V.A. and R.Y.; Investigation: L.G., P.C., C.M.S., O.F., E.H., P.B., B.d.S.M., U.R.V.A., L.L., D.L.F. and T.H.A.B.; Resources: L.G., P.C., R.Y. and E.H.; Funding: L.G.; Data Curation: B.d.S.M., L.L., P.B., O.F., D.L.F. and T.H.A.B.; Writing—original draft: B.d.S.M., L.L., D.L.F. and T.H.A.B.; Writing—review & editing: L.G., P.C., O.F., E.H., P.B., U.R.V.A., R.Y. and C.M.S.; Supervision: P.C. and L.G. All authors have read and agreed to the published version of the manuscript.

Funding

The authors acknowledge the financial support provided by the funding agencies CNPq (grants 307688/2022-4, 409433/2022-5 and 304646/2025-3), Fapemig (grants APQ-02513-22, APQ-04458-23 and BPD-00083-22), Finep (grant SOS Equipamentos 2021 AV02 0062/22), FAPERJ (grant 10.432/2024-APQ1) and Capes (Finance Code 001). This work has been supported by UFJF’s High-Speed Integrated Research Network (RePesq) https://www.repesq.ufjf.br/ (accessed on 8 July 2025).

Data Availability Statement

The dataset is openly available in reference [42]. The source code for data cleaning is available at https://github.com/LGoliatt/paper_dissolved_oxygen_kosovo_open (accessed on 8 July 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Antanasijevic, D.; Pocajt, V.; Peric-Grujic, A.; Ristic, M. Modelling of dissolved oxygen in the Danube River using artificial neural networks and Monte Carlo Simulation uncertainty analysis. J. Hydrol. 2014, 519, 1895–1907. [Google Scholar] [CrossRef]
  2. Stajkowski, S.; Zeynoddin, M.; Farghaly, H.; Gharabaghi, B.; Bonakdari, H. A Methodology for Forecasting Dissolved Oxygen in Urban Streams. Water 2020, 12, 2568. [Google Scholar] [CrossRef]
  3. Shala, A.; Sallaku, F.; Meta, A.; Shala, A.; Ukaj, S. Assessment of the Water Quality from the Sitnica River as a Result of Urban Discharges. Albanian J. Agric. Sci. 2015, 14, 286. [Google Scholar]
  4. Lewis, M. Dissolved oxygen. In Water-Resource Investing; U.S. Geological Survey Technology: Reston, VA, USA, 2006; Volume 9. [Google Scholar]
  5. Ouma, Y.O.; Okuku, C.O.; Njau, E.N. Use of artificial neural networks and multiple linear regression model for the prediction of dissolved oxygen in rivers: Case study of hydrographic basin of River Nyando, Kenya. Complexity 2020, 2020, 9570789. [Google Scholar] [CrossRef]
  6. Bolick, M.M.; Post, C.J.; Naser, M.Z.; Mikhailova, E.A. Comparison of machine learning algorithms to predict dissolved oxygen in an urban stream. Environ. Sci. Pollut. Res. 2023, 30, 78075–78096. [Google Scholar] [CrossRef] [PubMed]
  7. Ziyad Sami, B.F.; Latif, S.D.; Ahmed, A.N.; Chow, M.F.; Murti, M.A.; Suhendi, A.; Ziyad Sami, B.H.; Wong, J.K.; Birima, A.H.; El-Shafie, A. Machine learning algorithm as a sustainable tool for dissolved oxygen prediction: A case study of Feitsui Reservoir, Taiwan. Sci. Rep. 2022, 12, 3649. [Google Scholar] [CrossRef] [PubMed]
  8. Chen, X.; Zhao, C.; Chen, J.; Jiang, H.; Li, D.; Zhang, J.; Han, B.; Chen, S.; Wang, C. Water quality parameters-based prediction of dissolved oxygen in estuaries using advanced explainable ensemble machine learning. J. Environ. Manag. 2025, 380, 125146. [Google Scholar] [CrossRef] [PubMed]
  9. Li, X.Y.; Wang, H.; Wang, Y.Q.; Zhang, L.J.; Wu, Y. Machine Learning-based Dissolved Oxygen Prediction Modeling and Evaluation in the Yangtze River Estuary. Huanjing Kexue/Environ. Sci. 2024, 45, 7123–7133. [Google Scholar] [CrossRef]
  10. Roger Rozario, A. Prediction of Dissolved Oxygen in Shrimp Pond using Dolphin Glow Worm Optimization Based Radial Basis Function Neural Network. In Proceedings of the 2024 IEEE World AI IoT Congress (AIIoT), Seattle, WA, USA, 29–31 May 2024; pp. 286–291. [Google Scholar] [CrossRef]
  11. Wu, J.; Wang, Z.; Dong, J.; Yao, Z.; Chen, X.; Fan, H. Multi-step ahead dissolved oxygen concentration prediction based on knowledge guided ensemble learning and explainable artificial intelligence. J. Hydrol. 2024, 636, 131297. [Google Scholar] [CrossRef]
  12. Chen, X.; Sun, W.; Jiang, T.; Ju, H. Enhanced prediction of river dissolved oxygen through feature- and model-based transfer learning. J. Environ. Manag. 2024, 372, 123310. [Google Scholar] [CrossRef] [PubMed]
  13. Yang, H.; Sun, M.; Liu, S. A hybrid intelligence model for predicting dissolved oxygen in aquaculture water. Front. Mar. Sci. 2023, 10, 1126556. [Google Scholar] [CrossRef]
  14. Shiri, N.; Kisi, O.; Karimi, S.; Shiri, J. Towards seasonal-based assessment of machine learning models in river dissolved oxygen simulations with different flow ranges. ISH J. Hydraul. Eng. 2025, 31, 461–482. [Google Scholar] [CrossRef]
  15. Kozhiparamban, R.A.H.; Swetha, P.; Harigovindan, V. Accurate Dissolved Oxygen Prediction for Aquaculture Using Stacked Ensemble Machine Learning Model. Natl. Acad. Sci. Lett. 2023, 46, 203–207. [Google Scholar] [CrossRef]
  16. Hu, Y.; Liu, C.; Wollheim, W.M. Prediction of riverine daily minimum dissolved oxygen concentrations using hybrid deep learning and routine hydrometeorological data. Sci. Total Environ. 2024, 918, 170383. [Google Scholar] [CrossRef] [PubMed]
  17. Liu, Q.; Li, Y.; Yang, J.; Deng, M.; Li, J.; An, K. Physics-guided spatio–temporal neural network for predicting dissolved oxygen concentration in rivers. Int. J. Geogr. Inf. Sci. 2024, 38, 1207–1231. [Google Scholar] [CrossRef]
  18. Dodig, A.; Ricci, E.; Kvascev, G.; Stojkovic, M. A novel machine learning-based framework for the water quality parameters prediction using hybrid long short-term memory and locally weighted scatterplot smoothing methods. J. Hydroinform. 2024, 26, 1059–1079. [Google Scholar] [CrossRef]
  19. He, H.; Boehringer, T.; Schäfer, B.; Heppell, K.; Beck, C. Analyzing spatio-temporal dynamics of dissolved oxygen for the River Thames using superstatistical methods and machine learning. Sci. Rep. 2024, 14, 21288. [Google Scholar] [CrossRef] [PubMed]
  20. Krivoguz, D.; Semenova, A.; Malko, S. Performance of machine learning algorithms in predicting dissolved oxygen concentration. In Proceedings of the International Scientific Conference on Agricultural Machinery Industry “Interagromash”, Rostov-on-Don, Russia, 25–27 May 2022; Springer: Cham, Switzerland, 2022; pp. 1137–1144. [Google Scholar]
  21. Yang, J. Predicting water quality through daily concentration of dissolved oxygen using improved artificial intelligence. Sci. Rep. 2023, 13, 20370. [Google Scholar] [CrossRef] [PubMed]
  22. Yang, F.; Moayedi, H.; Mosavi, A. Predicting the degree of dissolved oxygen using three types of multi-layer perceptron-based artificial neural networks. Sustainability 2021, 13, 9898. [Google Scholar] [CrossRef]
  23. Kisi, O.; Alizamir, M.; Docheshmeh Gorgij, A.R. Dissolved oxygen prediction using a new ensemble method. Environ. Sci. Pollut. Res. 2020, 27, 9589–9603. [Google Scholar] [CrossRef] [PubMed]
  24. Moon, J.; Lee, J.; Lee, S.; Yoon, H. Urban River Dissolved Oxygen Prediction Model Using Machine Learning. Water 2022, 14, 1899. [Google Scholar] [CrossRef]
  25. Qambar, A.S.; Al Khalidy, M.M. Optimizing dissolved oxygen requirement and energy consumption in wastewater treatment plant aeration tanks using machine learning. J. Water Process Eng. 2022, 50, 103237. [Google Scholar] [CrossRef]
  26. Arora, S.; Keshari, A.K. Dissolved oxygen modelling of the Yamuna River using different ANFIS models. Water Sci. Technol. 2021, 84, 3359–3371. [Google Scholar] [CrossRef] [PubMed]
  27. Khan, P.W.; Byun, Y.C. Optimized Dissolved Oxygen Prediction Using Genetic Algorithm and Bagging Ensemble Learning for Smart Fish Farm. IEEE Sens. J. 2023, 23, 15153–15164. [Google Scholar] [CrossRef]
  28. Guo, P.; Liu, H.; Liu, S.; Xu, L. Numeric Prediction of Dissolved Oxygen Status Through Two-Stage Training for Classification-Driven Regression. In Proceedings of the 2019 International Conference on Machine Learning and Cybernetics (ICMLC), Kobe, Japan, 7–10 July 2019; IEEE Computer Society: Washington, DC, USA, 2019. [Google Scholar]
  29. Yonaba, R.; Kiema, A.; Tazen, F.; Belemtougri, A.; Cissé, M.; Mounirou, L.A.; Bodian, A.; Koïta, M.; Karambiri, H. Accuracy and interpretability of machine learning-based approaches for daily ETo estimation under semi-arid climate in the West African Sahel. Earth Sci. Inform. 2025, 18, 87. [Google Scholar] [CrossRef]
  30. Zhao, Y.; Chen, M. Prediction of river dissolved oxygen (DO) based on multi-source data and various machine learning coupling models. PLoS ONE 2025, 20, e0319256. [Google Scholar] [CrossRef] [PubMed]
  31. Li, D.; Ji, X.; Liu, L. An accurate forecasting model for key water quality factors based on Transformer with multi-scale attention mechanism. Environ. Model. Softw. 2025, 191, 106491. [Google Scholar] [CrossRef]
  32. Zhao, Y.; Chen, M. Development of a river dissolved oxygen prediction model integrating spatial effects and multiple deep learning algorithm. Ecol. Inform. 2025, 103234, in press. [Google Scholar] [CrossRef]
  33. Koksal, E.S.; Aydin, E. A hybrid approach of transfer learning and physics-informed modelling: Improving dissolved oxygen concentration prediction in an industrial wastewater treatment plant. Chem. Eng. Sci. 2025, 304, 121088. [Google Scholar] [CrossRef]
  34. Wu, J.; Chen, X.; Dong, J.; Tan, N.; Liu, X.; Chatzipavlis, A.; Yu, P.L.; Velegrakis, A.; Wang, Y.; Huang, Y.; et al. Dissolved oxygen prediction in the Dianchi River basin with explainable artificial intelligence based on physical prior knowledge. Environ. Model. Softw. 2025, 171, 106412. [Google Scholar] [CrossRef]
  35. Yang, W.; Liu, W.; Gao, Q. Prediction of dissolved oxygen concentration in aquaculture based on attention mechanism and combined neural network. Math. Biosci. Eng. 2023, 20, 998–1017. [Google Scholar] [CrossRef] [PubMed]
  36. Abazi, A.S.; Durmishi, B.; Sallaku, F.; Çadraku, H.; Fetoshi, O.; Ymeri, P.; Bytyc, P. Assessment of water quality of Sitnica River by using water quality index (WQI). Rasayan J. Chem. 2020, 13, 146–159. [Google Scholar] [CrossRef]
  37. Demaku, S.; Kastrati, G.; Halili, J. Assessment of contamination with heavy metals in the environment. Water, sediment, and soil around Kosovo power plants. Environ. Prot. Eng. 2022, 48, 15–27. [Google Scholar] [CrossRef]
  38. Troni, N.; Hoti, R.; Halili, J.; Omanovic, D.; Laha, F.; Gashi, F. Water Quality Examination in the Stream Sediment of River Sitnica—The Assessment of Toxic Trace Elements. J. Ecol. Eng. 2021, 22, 234–243. [Google Scholar] [CrossRef]
  39. Berisha, A. Water Regime and Flow Trends of Sitnica River. Eur. J. Environ. Earth Sci. 2021, 2, 11–14. [Google Scholar] [CrossRef]
  40. Koraqi, H.; Luzha, I.; Termkolli, F. An Assessment of the Water Quality and Ecological Status of Sitnica River, Kosovo; Studia Universitatis Babes-Bolyai, Chemia: Cluj-Napoca, Romania, 2016; Volume 61.
  41. Gashi, F.; Korca, B.; Kurteshi, K.; Faiku, F.; Domuzeti, M.; Gashi, S. Evaluation of the chemical and microbiological contamination of the River Sitnica waters (Kosovo): A statistical approach. Environ. Chem. Bull. 2015, 4, 463–467. [Google Scholar]
  42. Ahmedi, F.; Ahmedi, L. Dataset on water quality monitoring from a wireless sensor network in a river in Kosovo. Data Brief 2022, 44, 108486. [Google Scholar] [CrossRef] [PubMed]
  43. Ahmedi, F.; Ahmedi, L.; O’Flynn, B.; Kurti, A.; Tahirsylaj, S.; Bytyçi, E.; Sejdiu, B.; Salihu, A. Inwatersense: An intelligent wireless sensor network for monitoring surface water quality to a river in Kosovo. Int. J. Agric. Environ. Inf. Syst. IJAEIS 2018, 9, 39–61. [Google Scholar] [CrossRef]
  44. Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 2005, 67, 301–320. [Google Scholar] [CrossRef]
  45. Wang, Y.; Kong, L.; Jiang, B.; Zhou, X.; Yu, S.; Zhang, L.; Heo, G. Wavelet-based LASSO in functional linear quantile regression. J. Stat. Comput. Simul. 2019, 89, 1111–1130. [Google Scholar] [CrossRef]
  46. Melkumova, L.; Shatskikh, S.Y. Comparing Ridge and LASSO estimators for data analysis. Procedia Eng. 2017, 201, 746–755. [Google Scholar] [CrossRef]
  47. Smola, A. Learning with Kernels. Ph.D. Thesis, Technische Universitat Berlin, Berlin, Germany, 1998. [Google Scholar]
  48. Du, K.L.; Jiang, B.; Lu, J.; Hua, J.; Swamy, M. Exploring Kernel Machines and Support Vector Machines: Principles, Techniques, and Future Directions. Mathematics 2024, 12, 3935. [Google Scholar] [CrossRef]
  49. Fan, J.; Ma, X.; Wu, L.; Zhang, F.; Yu, X.; Zeng, W. Light Gradient Boosting Machine: An efficient soft computing model for estimating daily reference evapotranspiration with local and external meteorological data. Agric. Water Manag. 2019, 225, 105758. [Google Scholar] [CrossRef]
  50. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. Lightgbm: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3146–3154. [Google Scholar]
  51. Demirtürk, D.; Mintemur, Ö.; Arslan, A. Optimizing LightGBM and XGBoost Algorithms for Estimating Compressive Strength in High-Performance Concrete. Arab. J. Sci. Eng. 2025, 1–23. [Google Scholar] [CrossRef]
  52. Zheng, Q.; Yu, C.; Cao, J.; Xu, Y.; Xing, Q.; Jin, Y. Advanced payment security system: Xgboost, lightgbm and smote integrated. In Proceedings of the 2024 IEEE International Conference on Metaverse Computing, Networking, and Applications (MetaCom), Hong Kong, China, 12–14 August 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 336–342. [Google Scholar]
  53. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning, 2nd ed.; Springer Series in Statistics; Springer: Cham, Switzerland, 2009. [Google Scholar] [CrossRef]
  54. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning: With Applications in R; Springer: Cham, Switzerland, 2013; Volume 1. [Google Scholar]
  55. Gómez, R.A. Sklearn-Genetic-Opt: Hyperparameter Tuning and Feature Selection with Evolutionary Algorithms. 2024. Available online: https://sklearn-genetic-opt.readthedocs.io/en/stable/tutorials/basic_usage.html (accessed on 29 June 2025).
  56. Oyedele, A.A.; Ajayi, A.O.; Oyedele, L.O.; Bello, S.A.; Jimoh, K.O. Performance evaluation of deep learning and boosted trees for cryptocurrency closing price prediction. Expert Syst. Appl. 2023, 213, 119233. [Google Scholar] [CrossRef]
  57. Rimal, Y.; Sharma, N. Hyperparameter optimization: A comparative machine learning model analysis for enhanced heart disease prediction accuracy. Multimed. Tools Appl. 2024, 83, 55091–55107. [Google Scholar] [CrossRef]
  58. Oliveira, R.M.A.; Sant’Anna, Â.M.O.; da Silva, P.H.F. Explainable machine learning models for defects detection in industrial processes. Comput. Ind. Eng. 2024, 192, 110214. [Google Scholar] [CrossRef]
  59. Sun, L.; Liu, Y.; Han, L.; Chang, Y.; Du, M.; Zhao, Y.; Zhang, J. GACEMV: An ensemble learning framework for constructing COVID-19 diagnosis and prognosis models. Biomed. Signal Process. Control 2024, 94, 106305. [Google Scholar] [CrossRef]
  60. Feroz, S.B.; Sharmin, N.; Sevas, M.S. An empirical analysis of hyperparameter tuning impact on ensemble machine learning algorithm for earthquake damage prediction. Asian J. Civ. Eng. 2024, 25, 3521–3547. [Google Scholar] [CrossRef]
  61. Perdio, L.D.I.; Rosales, M.A.; de Luna, R.G. Manila City House Prices: A Machine Learning Analysis of the Current Market Value for Improvements. In Proceedings of the International Conference on Intelligent Computing & Optimization, Phnom Penh, Cambodia, 27–28 October 2023; Springer: Cham, Switzerland, 2023; pp. 304–314. [Google Scholar]
  62. McKinney, W. Pandas: A foundational Python library for data analysis and statistics. Python High Perform. Sci. Comput. 2011, 14, 1–9. [Google Scholar]
  63. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  64. Saeedi, E.; Mashhadinejad, M.; Tavallaii, A. Development of a machine learning model for prediction of intraventricular hemorrhage in premature neonates. Child’s Nerv. Syst. 2025, 41, 1–9. [Google Scholar] [CrossRef] [PubMed]
  65. Waldbieser, J.; D’Antonio, R.; Wubben, M.J.; Brooks, J.P. Ensemble Feature Selection of Cotton Hyperspectral Reflectance to Predict Soil Health Genes. In Proceedings of the SoutheastCon 2025, Concord, NC, USA, 22–30 March 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 463–468. [Google Scholar]
  66. Eiben, A.E.; Smith, J.E. Introduction to Evolutionary Computing; Springer: Cham, Switzerland, 2015. [Google Scholar]
  67. Sattar, A.M.; Gharabaghi, B. Gene expression models for prediction of longitudinal dispersion coefficient in streams. J. Hydrol. 2015, 524, 587–596. [Google Scholar] [CrossRef]
  68. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 4768–4777. [Google Scholar]
  69. Rajwa-Kuligiewicz, A.; Bialik, R.J.; Rowinski, P.M. Dissolved oxygen and water temperature dynamics in lowland rivers over various timescales. J. Hydrol. Hydromech. 2015, 63, 353. [Google Scholar] [CrossRef]
  70. de Sousa, D.N.R.; Mozeto, A.A.; Carneiro, R.L.; Fadini, P.S. Electrical conductivity and emerging contaminant as markers of surface freshwater contamination by wastewater. Sci. Total Environ. 2014, 484, 19–26. [Google Scholar] [CrossRef] [PubMed]
  71. Blair, J. Computer Vision as a Tool to Automate Specimen Classification in Large-Scale Ecological Research. Ph.D. Thesis, University of British Columbia, Vancouver, BC, Canada, 2024. [Google Scholar]
  72. Schütze, M.; Butler, D.; Beck, B.M. Modelling, Simulation and Control of Urban Wastewater Systems; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2002. [Google Scholar]
  73. Dunn, S.; Brown, I.; Sample, J.; Post, H. Relationships between climate, water resources, land use and diffuse pollution and the significance of uncertainty in climate change. J. Hydrol. 2012, 434, 19–35. [Google Scholar] [CrossRef]
Figure 1. Hydrographic network of Kosovo with the Sitnica River highlighted.
Figure 2. Static sensing station in the Sitnica River, comprising wireless sensors (in the middle of the underwater area in the photo). Figure reproduced from Ahmedi and Ahmedi [42] under the terms of the Creative Commons Attribution License (CC BY 4.0, https://creativecommons.org/licenses/by/4.0/, accessed on 5 May 2025).
Figure 3. Time series of DO measurements from the Sitnica River monitoring station in 2015. The blue line represents the portion of the dataset (18,360 samples) retained for modeling due to signal consistency and sensor stability. The red line corresponds to the discarded samples.
Figure 4. Observed environmental variables.
Figure 5. Pair-wise correlations between the dataset variables, with associated p-values.
Figure 6. Pairplot showing the pairwise relationships and distributions of the variables in the dataset.
Figure 7. Schematic representation of leaf-wise tree growth.
Figure 8. Schematic representation of 3-fold cross-validation.
Figure 9. GASearchCV optimization algorithm flowchart.
Figure 11. Boxplot for the metrics R 2 , RMSE, MAE, MSE, and MARE (n = 50).
Figure 12. Elastic Net parameter distribution over the 50 runs.
Figure 13. Support Vector Regressor parameter distribution over the 50 runs.
Figure 14. LGBM parameter distribution over the 50 runs.
Figure 15. Visualization of uncertainty and prediction error for the proposed models on the test set.
Figure 16. Comparison between the predicted (test set) and true DO for the EN, LGBM, and SVR models, respectively. The $R^2$ and RMSE values of each experiment are presented for comparison.
Figure 17. Feature importance according to SHAP Values.
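For readers who wish to reproduce a SHAP importance ranking such as the one in Figure 17, the following minimal sketch uses the shap library of Lundberg and Lee [68]; the fitted model object best_lgbm and the feature matrix X_test are illustrative placeholders, not names taken from the paper.

```python
import shap  # SHAP package introduced by Lundberg and Lee [68]

# best_lgbm and X_test are placeholders for the tuned LGBM regressor and the
# test-set features (e.g., a pandas DataFrame with temperature, conductivity, pH).
explainer = shap.TreeExplainer(best_lgbm)     # tree-specific SHAP explainer
shap_values = explainer.shap_values(X_test)   # per-sample, per-feature contributions

# Bar summary ranking features by mean absolute SHAP value, as in Figure 17.
shap.summary_plot(shap_values, X_test, plot_type="bar")
```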
Table 1. Overview of statistical measures for the dataset variables.
Variables | Mean | Std | Min | 25% | 50% | 75% | Max
Temperature (°C) | 14.79 | 5.29 | 2.38 | 10.18 | 15.59 | 18.39 | 28.17
Conductivity (μS/cm) | 337.75 | 90.55 | 22.80 | 274.90 | 317.05 | 363.70 | 690.80
pH | 5.95 | 1.87 | 0.02 | 4.81 | 6.24 | 7.40 | 11.50
Dissolved Oxygen (mg/L) | 39.60 | 35.79 | 3.70 | 11.40 | 21.30 | 61.40 | 149.20
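The layout of Table 1 matches the output of the describe() summary in pandas [62]; a minimal sketch is shown below, assuming a hypothetical CSV file name and column labels, which are not given in the paper.

```python
import pandas as pd

# Hypothetical file and column names, used only for illustration.
df = pd.read_csv("sitnica_water_quality.csv")
cols = ["Temperature", "Conductivity", "pH", "Dissolved Oxygen"]

# describe() yields count, mean, std, min, quartiles and max; transpose and
# reorder the columns so the result mirrors Table 1.
summary = df[cols].describe().T[["mean", "std", "min", "25%", "50%", "75%", "max"]]
print(summary.round(2))
```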
Table 2. Mathematical formulation of the performance metrics.
Metric Acronym | Expression
MAE | $\frac{1}{N}\sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|$
MARE | $\frac{1}{N}\sum_{i=1}^{N} \frac{\left| y_i - \hat{y}_i \right|}{y_i}$
MSE | $\frac{1}{N}\sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^{2}$
RMSE | $\sqrt{\frac{1}{N}\sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^{2}}$
$R^2$ | $1 - \frac{\sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^{2}}{\sum_{i=1}^{N} \left( y_i - \bar{y} \right)^{2}}$
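As a cross-check of the expressions in Table 2, the following NumPy sketch computes all five metrics from observed and predicted DO vectors; the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, MARE, MSE, RMSE and R^2 as defined in Table 2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    mare = np.mean(np.abs(err) / y_true)          # assumes y_true > 0 (DO in mg/L)
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"MAE": mae, "MARE": mare, "MSE": mse, "RMSE": rmse, "R2": r2}

# Example: print(regression_metrics([10.0, 12.5, 8.0], [9.5, 13.0, 8.2]))
```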
Table 4. Parameters used in the configuration of the GASearchCV algorithm for hyperparameter optimization.
Parameters | Value
Population | 50
Num. Generations | 30
Tournament Size | 3
Crossover Probability | 0.8
Mutation Probability | 0.09
Algorithm | eaMuCommaLambda
Elitism | True
Cross-Validation Folds | 3
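A minimal sketch of how the settings in Table 4 could be passed to GASearchCV from the sklearn-genetic-opt package [55] is given below; the estimator and search space here are placeholders (the full per-model ranges are listed in Table 5), and X_train/y_train are assumed to be the prepared training arrays.

```python
from lightgbm import LGBMRegressor
from sklearn_genetic import GASearchCV
from sklearn_genetic.space import Continuous, Integer

# Placeholder search space; see Table 5 for the ranges used for each model.
param_grid = {
    "n_estimators": Integer(5, 300),
    "learning_rate": Continuous(1e-3, 1e-1, distribution="log-uniform"),
}

evolved_estimator = GASearchCV(
    estimator=LGBMRegressor(),
    param_grid=param_grid,
    cv=3,                         # 3-fold cross-validation (Table 4)
    scoring="r2",
    population_size=50,
    generations=30,
    tournament_size=3,
    crossover_probability=0.8,
    mutation_probability=0.09,
    elitism=True,
    algorithm="eaMuCommaLambda",
    n_jobs=-1,
)
evolved_estimator.fit(X_train, y_train)   # X_train, y_train assumed to be defined
print(evolved_estimator.best_params_)
```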
Table 5. Range of values for each hyperparameter of the machine learning models to be adjusted by the optimization algorithm.
Model | Parameter | Variation
EN | alpha ($\alpha$) | $[10^{-2}, 10^{2}]$
 | l1_ratio | $[0, 1]$
 | fit_intercept | [True, False]
 | positive | [True, False]
SVR | gamma ($\gamma$) | $[10^{-2}, 10^{1}]$
 | epsilon ($\epsilon$) | $[10^{-2}, 10^{2}]$
 | Regularization Parameter (C) | $[10^{-2}, 10^{4}]$
LGBM | no. estimators | $[5, 3 \times 10^{2}]$
 | boosting_type | [gbdt, dart, goss]
 | max_depth | $[2, 10^{1}]$
 | learning_rate | $[10^{-3}, 10^{-1}]$
 | reg_alpha | $[10^{-2}, 10^{1}]$
 | reg_lambda | $[10^{-2}, 10^{1}]$
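Using the same space primitives from sklearn-genetic-opt [55], the ranges in Table 5 could be encoded roughly as follows; this is a sketch under the assumption that the exponent signs reconstructed above are correct, with scikit-learn/LightGBM parameter names used in place of the table labels.

```python
from sklearn_genetic.space import Continuous, Integer, Categorical

# Elastic Net search space
en_space = {
    "alpha": Continuous(1e-2, 1e2, distribution="log-uniform"),
    "l1_ratio": Continuous(0.0, 1.0),
    "fit_intercept": Categorical([True, False]),
    "positive": Categorical([True, False]),
}

# Support Vector Regression search space
svr_space = {
    "gamma": Continuous(1e-2, 1e1, distribution="log-uniform"),
    "epsilon": Continuous(1e-2, 1e2, distribution="log-uniform"),
    "C": Continuous(1e-2, 1e4, distribution="log-uniform"),
}

# LightGBM search space
lgbm_space = {
    "n_estimators": Integer(5, 300),
    "boosting_type": Categorical(["gbdt", "dart", "goss"]),
    "max_depth": Integer(2, 10),
    "learning_rate": Continuous(1e-3, 1e-1, distribution="log-uniform"),
    "reg_alpha": Continuous(1e-2, 1e1, distribution="log-uniform"),
    "reg_lambda": Continuous(1e-2, 1e1, distribution="log-uniform"),
}
```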
Table 6. Mean values and standard deviations are reported, with the latter presented in parentheses. Each entry summarizes 50 independent experimental runs on the test set. The best average result for each performance metric is shown in boldface.
ML Model | $R^2$ | RMSE (mg/L) | MAE (mg/L) | MARE | MSE ((mg/L)²)
EN | 0.517 (0.029) | 24.742 (0.728) | 19.002 (1.044) | 0.480 (0.026) | 612.683 (36.825)
SVR | 0.605 (0.097) | 22.246 (2.548) | 18.465 (2.370) | 0.384 (0.040) | 501.379 (123.022)
LGBM | 0.944 (0.005) | 8.430 (0.352) | 4.339 (0.218) | 0.110 (0.006) | 71.188 (5.947)
Table 7. Summary of uncertainty analysis results on the test set for each model using Monte Carlo Simulation.
Model | Median | MAD | Uncertainty | RMSE
EN | 39.4 | 20.8 | 52.0 | 23.8
LGBM | 48.8 | 34.7 | 71.1 | 7.5
SVR | 50.3 | 12.5 | 24.9 | 19.2
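The exact Monte Carlo procedure behind Table 7 is defined in the methodology section of the paper; as a rough, non-authoritative illustration of how such statistics can be produced, the sketch below resamples the test set with replacement, pools the predictions, and summarizes them by their median, median absolute deviation (MAD), a 95% spread as an uncertainty proxy, and the average RMSE. All function and variable names are illustrative, and X_test is assumed to be a NumPy array.

```python
import numpy as np

def monte_carlo_uncertainty(model, X_test, y_test, n_sim=1000, seed=42):
    """Bootstrap-style Monte Carlo resampling of the test set (illustrative only)."""
    rng = np.random.default_rng(seed)
    n = len(y_test)
    all_preds, rmses = [], []
    for _ in range(n_sim):
        idx = rng.integers(0, n, size=n)             # resample with replacement
        preds = model.predict(X_test[idx])
        all_preds.append(preds)
        rmses.append(np.sqrt(np.mean((y_test[idx] - preds) ** 2)))
    pooled = np.concatenate(all_preds)
    median = np.median(pooled)
    mad = np.median(np.abs(pooled - median))
    spread95 = np.percentile(pooled, 97.5) - np.percentile(pooled, 2.5)
    return {"Median": median, "MAD": mad,
            "Uncertainty": spread95, "RMSE": float(np.mean(rmses))}
```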
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
