1. Introduction
The amount of dissolved oxygen (DO) in watercourses is an essential indicator for understanding phenomena such as self-purification, microorganism respiration, and the metabolism of aquatic ecosystems. Considerable decreases in DO levels generally occur due to the biological oxidation of organic matter, which is intensified by the discharge of domestic and industrial effluents and the leaching of fertilizers. These circumstances promote phenomena such as eutrophication and anoxia, negatively affecting biodiversity and the biogeochemical balance of the aquatic environment [1]. Effective monitoring and management of freshwater systems have become central to global sustainability efforts. In this context, ensuring good ecological status of rivers is directly aligned with several targets under the United Nations Sustainable Development Goals (SDGs), most notably SDG 6 (Clean Water and Sanitation), which advocates for the availability and sustainable management of water resources, and SDG 14 (Life Below Water), which aims to reduce pollution and protect aquatic ecosystems.
A particularly important indicator of riverine health is DO, a measure of the oxygen available for aquatic organisms. DO is a key integrative parameter, as it reflects the cumulative effects of physical, chemical, and biological processes occurring in aquatic systems [2]. Low DO levels, often resulting from anthropogenic stressors such as untreated sewage discharge, agricultural runoff, and industrial effluents, can lead to hypoxia, ecosystem degradation, and the collapse of sensitive aquatic species [3,4,5,6]. Thus, the ability to accurately monitor and predict DO concentrations in real time is essential for safeguarding riverine ecosystems and supporting data-driven water management.
Machine learning-based approaches for predicting dissolved oxygen in watercourses have advantages when combined with traditional field measurements. ML models enable real-time and continuous predictions, facilitating the early detection of hypoxic or anoxic conditions [7]. This predictive capability is more efficient and often more cost-effective than intensive monitoring with physical sampling, especially on a large scale or in hard-to-reach locations [8]. Furthermore, the adoption of these models can improve data-driven decision-making for the protection of aquatic ecosystems [9,10].
Dissolved oxygen (DO) modeling has been improved in a variety of hydrological conditions thanks to recent developments in machine learning (ML). Studies employing ensemble methods and deep learning architectures such as LSTM and hybrid neural networks have shown good predictive accuracy across Asia, especially in China and India [8,9,10,11,12,13,14,15]. Similarly, studies conducted in North America have effectively used cutting-edge deep learning models to capture DO changes in dynamic and complex environmental settings [6,16,17]. Together, these initiatives demonstrate ML’s scalability and versatility while highlighting the global trend toward incorporating AI in aquatic environmental monitoring.
The majority of research has focused on areas with robust monitoring systems and abundant datasets. In contrast, under-monitored regions such as Eastern Europe and the Western Balkans remain underrepresented in the literature. Dodig et al. [18] used ML techniques, specifically long short-term memory (LSTM) networks, to predict the water quality of the Sava River, which is located in southeastern Europe and is part of the Danube river basin. He et al. [19] used ML techniques to predict DO from data collected along a 45 km west-to-east stretch of the River Thames. Krivoguz et al. [20] applied Random Forest (RF) to DO prediction in the Black Sea area, which borders the Balkan region.
These areas present particular challenges that require specialized solutions, as they are often characterized by significant anthropogenic pressure and limited data availability. Validating and modifying ML-based DO prediction techniques in environments with limited data is therefore urgently needed. This study addresses a critical knowledge gap through the application of an interpretable and evolutionarily optimized machine learning framework to the Sitnica River in Kosovo, a system historically lacking comprehensive data.
Optimization-enhanced ML models have demonstrated further promise. Yang [21] evaluated multiple training strategies, including Teaching-Learning-Based Optimization (TLBO), Sine Cosine Algorithm (SCA), Water Cycle Algorithm (WCA), and Electromagnetic Field Optimization (EFO), for training Multilayer Perceptron Neural Networks (MLPNNs). The results highlighted the EFO-MLPNN as the most efficient (mean absolute error (MAE) = 1.0002, root mean square error (RMSE) = 1.2903, and R = 0.88154), outperforming previous efforts such as the Multi-Verse Optimizer (MVO) [22] and Bayesian Model Averaging (BMA) [23]. Separately, Ziyad Sami et al. [7] utilized an artificial neural network (ANN) to predict DO levels in the Feitsui reservoir in Taiwan, optimizing the number of neurons to achieve accurate results (coefficient of determination R² = 0.98).
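For reference, the error metrics quoted throughout this review are assumed here to follow their conventional definitions. The notation is introduced only for this illustration and is not taken from the cited studies: $y_i$ is the observed DO, $\hat{y}_i$ the predicted DO, $\bar{y}$ the mean of the observations, and $n$ the number of samples.

```latex
\begin{align}
\mathrm{MAE}  &= \frac{1}{n}\sum_{i=1}^{n}\left| y_i - \hat{y}_i \right|, &
\mathrm{MSE}  &= \frac{1}{n}\sum_{i=1}^{n}\left( y_i - \hat{y}_i \right)^{2}, \\
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}}, &
R^{2} &= 1 - \frac{\sum_{i=1}^{n}\left( y_i - \hat{y}_i \right)^{2}}{\sum_{i=1}^{n}\left( y_i - \bar{y} \right)^{2}}.
\end{align}
```

MAE and RMSE carry the units of the target variable, so values reported by different studies are only directly comparable when DO is expressed on the same scale (for example, raw mg/L versus normalized values).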
DO prediction has also been achieved effectively using ensemble and boosting methods. For instance, Moon et al. [24] used AdaBoost, RF, and Gradient Boosting algorithms to predict DO in the Hwanggujicheon region, with AdaBoost achieving superior performance (reported metric values of 0.015, 0.009, and 0.912). Similarly, Qambar and Al Khalidy [25] demonstrated exceptional prediction accuracy and reduced energy costs using boosted algorithms.
Hybrid models further enhance predictive capacity by integrating complementary learning strategies. In the Yamuna River case study, Arora and Keshari [26] employed Adaptive Neuro-Fuzzy Inference Systems (ANFIS) with grid partitioning (ANFIS-GP) and subtractive clustering (ANFIS-SC), with performance reported for ANFIS-GP. Khan and Byun [27] developed the GA-XGCBXT model, combining Genetic Algorithms (GA) with eXtreme Gradient Boosting (XGB), CatBoost (CB), and eXtra Trees (XT), and evaluated it in terms of mean square error (MSE).
Stacked ensemble approaches have also proven highly effective. Kozhiparamban et al. [15] proposed a stacked model that combined Kernel Ridge Regression (KRR), Elastic Net (EN), and Light Gradient Boosting Machine (LGBM), achieving substantial performance gains (MAE = 0.0176, RMSE = 0.0319) over individual models. Guo [28] evaluated classical ML models such as Decision Tree (DT), Multilayer Perceptron (MLP), Naive Bayes (NB), K-Nearest Neighbors (KNN), and Support Vector Machines (SVM). The results indicate that the DT (implemented as C4.5) and MLP models offered the best performance (RMSE = 0.068 and 0.055, respectively).
Recently, interpretable machine learning models have emerged, enabling both accurate prediction and transparent variable assessment [29]. Chen et al. [8] developed an ensemble framework for six Chinese estuaries using SHapley Additive Explanations (SHAP) analysis to evaluate feature importance, emphasizing variables such as pH, electrical conductivity (EC), and nutrient loads. Their approach highlighted both local interactions and lagged dependencies, improving the interpretability of DO prediction models. Hybrid architectures integrating signal processing and evolutionary computation have further advanced model accuracy. Zhao and Chen [30] introduced the DWT-KPCA-GWO-XGBoost model, which incorporates the Discrete Wavelet Transform (DWT) for denoising, Kernel Principal Component Analysis (KPCA) for feature reduction, and Grey Wolf Optimization (GWO) for hyperparameter tuning. This model significantly outperformed conventional approaches in forecasting DO in the Yangtze River basin.
Attention-based models have become increasingly prominent in water quality prediction, especially for forecasting DO and other key parameters. Li et al. [31] developed a transformer-based framework incorporating multi-scale temporal fusion and dynamic time-series decomposition to handle the nonstationarity and multi-scale nature of DO dynamics, outperforming seven DL baselines in accuracy and robustness. Building on this, Zhao and Chen [32] proposed a hybrid model combining wavelet convolution, variational mode decomposition (VMD), and a frequency-enhanced attention mechanism with SHapley Additive Explanations (SHAP), allowing for the interpretation of interactions between meteorological and water quality variables.
Recent advancements have also emphasized the integration of domain knowledge into data-driven models through physics-informed machine learning (PIML) and transfer learning. Koksal and Aydin [33] developed a hybrid framework combining transfer learning with physics-informed modeling to predict DO concentrations in an industrial wastewater treatment plant. Their approach leveraged knowledge from an open-source physics-based simulation and a real-world plant characterized by noisy and incomplete data. The proposed model improved prediction performance by up to 59% in validation scenarios.
Interpretability has also become a key focus in recent work, with techniques like SHAP being employed to quantify feature contributions and enhance transparency [29,34]. By coupling explainable AI with physical domain knowledge, models such as the PKBiLSTM [34] and DWT-KPCA-GWO-XGBoost [35] have successfully captured nonlinear interactions and seasonality in DO trends, while offering insights into model behavior.
Among numerous physicochemical parameters, DO remains the most sensitive and integrative indicator of aquatic ecosystem quality. Accurate modeling of DO dynamics enables risk anticipation and informs water quality control strategies. Given the challenges of manual parameter interaction analysis, data-driven and automated ML techniques offer a scalable and precise alternative.
Despite these advances, few studies have tested these advanced techniques in under-monitored or data-scarce regions, where robust and scalable models are especially needed. The Sitnica River was selected as a case study due to several factors that make it important for assessing anthropogenic impacts on inland aquatic ecosystems. The river is exposed to a wide range of anthropogenic pollutants, including untreated urban wastewater, industrial discharges, and agricultural runoff, making it a representative model of polluted rivers in the Western Balkans. In addition, there is a significant lack of comprehensive ecological and microbiological data for the river, despite its environmental importance and the continuous pressures it faces. This data gap limits the development of effective measures for its management and protection. Therefore, the study of the Sitnica River addresses both a scientifically relevant case of human impact and a need for baseline ecological data in an under-researched region.
The Sitnica River drains an area of 2861 km² and flows through the majority of the Kosovo Plain. It is the only major river that flows entirely (approximately 90 km in length) within the borders of the Republic of Kosovo [36]. Known as a plain river, it is characterized by frequent changes in its course and flooding. Although the Sitnica does not have a distinct spring, it takes its name from the point where the Shtime stream meets the Sazli stream on its left side, near the village of Robovc. Its source is considered to be Topila, which originates at the northern end of Derman Peak (1364 m). From Topilla (1280 m), the river descends to 497.2 m at its confluence with the Ibër, giving a total drop of 782.8 m and a relative drop of 7.2%. The average elevation of the Sitnica River is 734 m, with only 7.8% of its course exceeding 1000 m. The river network density of the Sitnica, calculated using the Neumann formula, is 824 m/km² on the right bank and 512.8 m/km² on the left [37]. The river is known for its calm hydrological regime, with an average flow rate of 12.9 m³/s. The Sitnica is one of Kosovo’s main rivers and the main tributary of the Ibër, which eventually drains into the Black Sea [36].
The novelty of this study lies in the integration of a robust and interpretable machine learning framework combining evolutionary optimization (GASearchCV), uncertainty quantification via Monte Carlo Simulation, and SHAP-based feature attribution to predict DO in a data-scarce and environmentally stressed region. Rather than proposing new algorithms, the innovation stems from adapting and validating this scalable pipeline in the Sitnica River, where high-frequency yet limited-variable monitoring presents practical challenges rarely addressed in the literature.
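To make the idea of SHAP-based feature attribution concrete, the following minimal sketch computes exact Shapley values for a three-input predictor by enumerating every feature coalition. It is an illustration under stated assumptions, not the procedure used in this study: `toy_do_model`, its coefficients, the sample `x`, and the single reference point `baseline` are all hypothetical, and features outside a coalition are simply replaced by the baseline value, whereas SHAP implementations typically average over a background dataset.

```python
from itertools import combinations
from math import factorial

FEATURES = ("temperature", "conductivity", "pH")


def toy_do_model(z):
    """Hypothetical DO predictor in mg/L (illustrative only, not the study's model)."""
    temp_c, cond_us_cm, ph = z
    return (14.0 - 0.25 * temp_c - 0.002 * cond_us_cm
            + 0.3 * (ph - 7.0) - 0.01 * temp_c * (ph - 7.0))


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    n = len(x)

    def value(coalition):
        # Features in the coalition take their real value, the rest the baseline.
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contribution += weight * (value(s | {i}) - value(s))
        phi.append(contribution)
    return phi


if __name__ == "__main__":
    x = [22.0, 650.0, 8.1]          # assumed sample: deg C, uS/cm, pH
    baseline = [12.0, 400.0, 7.5]   # assumed reference conditions
    for name, phi_i in zip(FEATURES, shapley_values(toy_do_model, x, baseline)):
        print(f"{name}: {phi_i:+.3f} mg/L")
```

By construction the attributions sum to the difference between the prediction at `x` and the prediction at `baseline`, which is the property that makes Shapley-style explanations additive.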
The remainder of this paper is organized as follows. Section 2 describes the dataset, the machine learning models employed, the optimization algorithm, and the performance evaluation metrics. Section 3 presents the computational experiments and discusses the results. Finally, the main conclusions are summarized in Section 4.
4. Conclusions
This study introduces a novel hybrid machine learning framework for predicting DO concentrations in inland water systems, with a case study focused on the Sitnica River in Kosovo. The key innovation lies in the integration of evolutionary optimization via Genetic Algorithm Search with Cross-Validation (GASearchCV) into the model development pipeline, enabling automated and effective hyperparameter tuning for three distinct regressors (EN, SVR, and LGBM).
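The general idea of evolutionary hyperparameter tuning with cross-validation can be sketched in a few lines. The example below is a hand-rolled genetic algorithm written with the Python standard library only. It is not the GASearchCV implementation used in this study, and the toy data, the one-feature ridge model, and all GA settings (population size, number of generations, mutation scale, search range) are assumptions chosen for illustration.

```python
import random


def make_toy_data(n=60, seed=0):
    """Synthetic data y = 2x + 1 + noise (assumed, for illustration only)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def fit_ridge_1d(xs, ys, alpha):
    """Closed-form fit of y = w*x + b with an L2 penalty alpha on w."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    w = sxy / (sxx + alpha)
    return w, my - w * mx


def cv_rmse(xs, ys, alpha, k=5):
    """k-fold cross-validated RMSE; fold f holds out indices f, f+k, f+2k, ..."""
    n = len(xs)
    sq_err = 0.0
    for fold in range(k):
        test_idx = range(fold, n, k)
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        w, b = fit_ridge_1d(train_x, train_y, alpha)
        sq_err += sum((ys[i] - (w * xs[i] + b)) ** 2 for i in test_idx)
    return (sq_err / n) ** 0.5


def ga_search(xs, ys, pop_size=12, generations=15, seed=1):
    """Evolve log10(alpha) in [-3, 3] to minimise the cross-validated RMSE."""
    rng = random.Random(seed)

    def fitness(gene):
        return cv_rmse(xs, ys, 10.0 ** gene)

    population = [rng.uniform(-3.0, 3.0) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness)
        parents = ranked[: pop_size // 2]      # truncation selection
        children = [ranked[0]]                 # elitism: keep the best as is
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = 0.5 * (a + b) + rng.gauss(0.0, 0.3)  # crossover + mutation
            children.append(min(3.0, max(-3.0, child)))
        population = children
    best = min(population, key=fitness)
    return 10.0 ** best, fitness(best)


if __name__ == "__main__":
    xs, ys = make_toy_data()
    alpha, score = ga_search(xs, ys)
    print(f"best alpha = {alpha:.4g}, CV RMSE = {score:.4f}")
```

Searching in log space is a common choice for regularization strengths because plausible values span several orders of magnitude. Keeping the best individual unchanged in each generation ensures the best cross-validated score never gets worse from one generation to the next.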
A significant contribution of this work is the adaptation of the framework to a real-world context characterized by anthropogenic stressors (e.g., industrial and agricultural runoff) and limited monitoring infrastructure. Despite using only three commonly available input parameters (temperature, conductivity, and pH), the LGBM model achieved high accuracy (RMSE = 7.474 mg/L) and demonstrated strong robustness under uncertainty, as verified through Monte Carlo Simulation. To our knowledge, this is one of the first studies to implement an evolutionary-assisted ML approach for DO prediction in the Western Balkans. The methodology is not only scalable and reproducible but can also operate in data-scarce and high-variability environments. In practical terms, the proposed framework serves as a reliable and low-cost alternative for environmental monitoring and early warning systems in under-resourced regions.
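One common way to probe robustness under input uncertainty with Monte Carlo simulation is to perturb the measured inputs repeatedly and record how the error metric spreads. The sketch below shows that general pattern only. The exact procedure used in this study is not restated here, and the stand-in predictor `toy_do_model`, the sample values, and the noise levels are all hypothetical.

```python
import random
import statistics


def toy_do_model(temp_c, cond_us_cm, ph):
    """Hypothetical stand-in predictor in mg/L (NOT the study's LGBM model)."""
    return 14.0 - 0.25 * temp_c - 0.002 * cond_us_cm + 0.3 * (ph - 7.0)


def rmse(pred, obs):
    """Root mean square error between two equally long sequences."""
    return (sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)) ** 0.5


def monte_carlo_rmse(samples, observed, noise_sd, n_runs=500, seed=42):
    """Distribution of RMSE when each input receives Gaussian measurement noise.

    samples  : list of (temperature, conductivity, pH) tuples
    noise_sd : assumed standard deviation of the noise for each input
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        preds = []
        for row in samples:
            noisy = [v + rng.gauss(0.0, sd) for v, sd in zip(row, noise_sd)]
            preds.append(toy_do_model(*noisy))
        scores.append(rmse(preds, observed))
    scores.sort()
    return {
        "mean": statistics.mean(scores),
        "p05": scores[int(0.05 * (n_runs - 1))],   # approximate 5th percentile
        "p95": scores[int(0.95 * (n_runs - 1))],   # approximate 95th percentile
    }


if __name__ == "__main__":
    # Assumed sample readings: deg C, uS/cm, pH.
    samples = [(18.0, 520.0, 7.8), (24.5, 610.0, 8.2),
               (9.0, 430.0, 7.4), (15.5, 700.0, 7.9)]
    observed = [toy_do_model(*row) for row in samples]
    print(monte_carlo_rmse(samples, observed, noise_sd=(0.5, 20.0, 0.1)))
```

A narrow interval between the 5th and 95th percentiles would suggest that predictions are relatively insensitive to sensor noise of the assumed magnitude, while a wide interval would indicate that measurement quality dominates the error.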
Although the findings show promise for data-driven water quality monitoring, there are obstacles to practical application, such as the requirement for regular sensor calibration to preserve data quality, the computational limitations for real-time deployment on edge devices, and the dependence on continuous high-frequency data streams that might not be accessible in areas with limited resources. However, the approach offers a useful starting point for developing sustainable water management, especially when combined with domain-specific validation and existing monitoring infrastructure.
Future research should consider expanding the model’s scope to include additional water quality indicators (e.g., nutrient loads, turbidity), incorporating spatial variability across monitoring locations, and embedding explainable AI techniques to further enhance interpretability. Furthermore, coupling this framework with remote sensing data or physical process-based models could enable hybrid systems that combine the strengths of data-driven learning with domain-specific knowledge in hydrology. Ultimately, such developments will contribute to more resilient, transparent, and sustainable water resource management systems aligned with global environmental goals.