Water Quality Identification: Integrating IoT Sensors and Deep Learning for Near-Real-Time Water Quality Assessment

Tsolaki, Christina; Kokkonis, George; Valsamidis, Stavros; Kontogiannis, Sotirios

doi:10.3390/app16104868

¹ Laboratory Team of Distributed Microcomputer Systems, Department of Mathematics, University of Ioannina, 45110 Ioannina, Greece

² Department of Information and Electronic Engineering, International Hellenic University, 57001 Thessaloniki, Greece

³ Department of Accounting and Finance, Democritus University of Thrace, Ag. Loukas, 65404 Kavala, Greece

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(10), 4868; https://doi.org/10.3390/app16104868

Submission received: 10 April 2026 / Revised: 7 May 2026 / Accepted: 10 May 2026 / Published: 13 May 2026

(This article belongs to the Special Issue Applications of Industrial Internet of Things (IIoT) Platforms: 2nd Edition)

Abstract

The increasing demand for sustainable, affordable smart city infrastructure has heightened the need for low-cost near-real-time water quality monitoring systems. In this study, we propose Water-QI, a low-cost Internet of Things (IoT)-based environmental monitoring platform that combines budget-friendly sensors with deep learning for water quality index (WQI) assessment and forecasting. The sensing platform measures five key physicochemical parameters, namely temperature, total dissolved solids (TDS), pH, turbidity, and electrical conductivity, enabling continuous multi-parameter monitoring in urban water environments. To model temporal variations in water quality under both cloud-based and edge-oriented deployment scenarios, we evaluate multiple gated recurrent unit (GRU) architectures with different widths and depths. Experiments are conducted at two temporal resolutions, hourly and minute-level, in order to examine the trade-off between predictive accuracy and edge computational latencies. In the hourly scenario, the single-layer GRU with 64 units achieved the best overall balance, reaching a validation RMSE of 0.0281 and a test $R^{2}$ of 0.9820, while deeper stacked GRU models degraded performance substantially. In the minute-resolution scenario, shallow wider GRU models produced the best results, with the single-layer GRU with 512 units attaining the lowest validation RMSE (0.025548) and the 256-unit variant achieving nearly identical accuracy with much lower inference cost. The results show that increasing the GRU model width can yield improvements at high temporal granularity, whereas increasing the GRU layer depth consistently harms convergence and generalization. Overall, the findings indicate that shallow GRU architectures provide the most practical solution for accurate, low-cost, and scalable water quality forecasting. In particular, the 64-unit GRU is the most suitable choice for hourly periodic interval operation, while the 256-unit GRU offers the best edge computational speed and accuracy trade-off for minute-level near-real-time inference on resource-constrained devices.

Keywords:

smart cities; water quality; IoT sensors; machine learning; real-time monitoring; environmental monitoring; deep learning; low-cost sensors

1. Introduction

Modern cities face increasing pressure from population growth, infrastructure demand, and water pollution. Since conventional laboratory-based water quality assessment is costly and unsuitable for real-time monitoring, IoT and cloud technologies offer a practical alternative for continuous environmental data collection and processing. This research explores how low-cost sensors, combined with machine learning and artificial intelligence, can support near-real-time water quality monitoring in urban environments, enabling timely preventive actions and improving urban quality of life.

Recent developments in sensor technology have significantly reduced the cost barriers to environmental monitoring. The study [1] presents a low-cost IoT-based water quality monitoring system that combines Arduino microcontrollers and cloud-connected sensors to measure real-time parameters, including pH, turbidity, temperature, and total dissolved solids (TDS). This system achieves 91% accuracy, 20% higher than traditional models, through a feedforward artificial neural network optimized using a hybrid genetic algorithm–particle swarm optimization (GA–PSO) approach. Its affordability and automation capabilities make it highly applicable across various domains, including drinking-water safety, agriculture, aquaculture, industrial water monitoring, and smart city infrastructure.

In support of the movement toward accessible monitoring solutions, Kyritsakas et al. [2] highlight low-cost machine learning and IoT-based technologies for real-time water quality assessment. The research identifies several cost-effective strategies, including solar disinfection (SODIS), ceramic and bamboo charcoal filters, treadle and rope pumps, low-cost drip irrigation systems, underground storage tanks, and nature-based solutions, such as microalgae filtration. Furthermore, carbon nanotube-based chemical sensors and community-based water management practices enhance the accessibility and sustainability of water quality management. Practical IoT implementation is also presented in [3] through the WaterS system, which uses Sigfox for IoT connectivity. This open-source approach supports scalability and collaborative development, addressing challenges such as increased packet error rates in dense deployments by evaluating protocols and optimizing communication.

Low-cost sensor technologies typically measure essential water quality parameters, offering valuable insights without requiring complex equipment. According to [4], variables such as pH, temperature, and electrical conductivity serve as fundamental inputs for machine learning models that forecast groundwater quality for irrigation. By reducing the need for extensive laboratory investigations, this approach enables near-real-time cost-effective prediction of critical irrigation indicators, including total dissolved solids (TDS), electrical conductivity, turbidity, and potential salinity.

In the same vein, the authors of [5] position machine/deep learning as a transformative tool for urban water management, enabling rapid responses to flooding, contamination, and system failures while reducing infrastructure costs. The research underscores the value of low-cost surrogate models for cities with budget constraints and recommends that future studies focus on enhancing model transferability across diverse urban contexts and adapting to evolving infrastructure conditions.

Despite their advantages, low-cost IoT sensor systems face several challenges. The authors of [6] observe that, although traditional laboratory methods are more precise, they are also costly and time-consuming. In contrast, low-cost systems provide real-time monitoring, remote data transmission, and instant alerts but require periodic calibration to maintain accuracy. The study identifies the key barriers to implementation, including insufficient data management, limited model explainability, and low reproducibility. Consequently, integrating low-cost water quality monitoring systems into smart city frameworks offers significant benefits for urban management and sustainability. As noted by [1], affordable and automated monitoring systems are highly applicable to drinking-water safety, agriculture, aquaculture, industrial water monitoring, and broader smart city infrastructure.

This paper introduces a distributed platform, a water quality identification IoT system (Water-QI), designed for periodic, hourly, or near-real-time minute-level monitoring of water quality attributes at the source. The platform leverages low-cost sensors and a high spatial density of GPS-based IoT nodes to monitor qualitative drinking-water attributes in urban environments. Additionally, it leverages existing city Wi-Fi infrastructure, incorporates WQI predictive models at the device level or in the cloud, and employs a device-level correlation function to immediately calculate the water quality index (WQI). The integration of deep learning GRU models for measurement prediction and WQI calculations further enhances the platform’s suitability for edge-level computations. At the core of the Water-QI predictive component are the inference of predictive sensory measurements and the calculation of the water quality index (WQI). Rather than tracking individual sensor fluctuations, the platform’s device-level correlation function compresses five distinct variables, pH, temperature, total dissolved solids (TDS), electrical conductivity (EC), and turbidity, into a single straightforward score indicating overall water health. The mathematical foundation of this score is based on the established Horton and NSF-WQI frameworks, utilizing a weighted sum of normalized values. To maintain focus on architectural deployment, the complete mathematical formulation, including step-by-step equations and the specific weighting factors, is detailed in Section 3.2.
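To make the idea of a weighted sum of normalized values concrete, the following is a minimal Python sketch of a five-parameter index. It is not the Water-QI formulation: the ideal values, tolerated deviations, equal weights, and the linear rating function are all placeholder assumptions chosen for illustration, since the actual equations and weighting factors are given in Section 3.2.

```python
from typing import Dict, Tuple

# Placeholder (ideal value, tolerated deviation) pairs; NOT taken from the paper.
PLACEHOLDER_LIMITS: Dict[str, Tuple[float, float]] = {
    "ph": (7.0, 1.5),
    "temperature_c": (15.0, 15.0),
    "tds_mg_l": (0.0, 500.0),
    "ec_us_cm": (0.0, 1000.0),
    "turbidity_ntu": (0.0, 5.0),
}

# Placeholder equal weights; the paper's own weights are defined in its Section 3.2.
PLACEHOLDER_WEIGHTS: Dict[str, float] = {name: 0.2 for name in PLACEHOLDER_LIMITS}


def sub_index(value: float, ideal: float, tolerance: float) -> float:
    """Map a raw reading to a 0-100 rating (100 = ideal), assuming a linear penalty."""
    deviation = abs(value - ideal) / tolerance
    return 100.0 * max(0.0, 1.0 - deviation)


def weighted_wqi(
    readings: Dict[str, float],
    limits: Dict[str, Tuple[float, float]] = PLACEHOLDER_LIMITS,
    weights: Dict[str, float] = PLACEHOLDER_WEIGHTS,
) -> float:
    """Weighted sum of normalized ratings, divided by the total weight used."""
    total_weight = sum(weights[name] for name in readings)
    weighted_sum = sum(
        weights[name] * sub_index(readings[name], *limits[name]) for name in readings
    )
    return weighted_sum / total_weight


sample = {
    "ph": 7.0,
    "temperature_c": 15.0,
    "tds_mg_l": 250.0,
    "ec_us_cm": 500.0,
    "turbidity_ntu": 2.5,
}
print(round(weighted_wqi(sample), 2))
```

Because the score depends only on five readings and a handful of multiplications, a function of this shape is cheap enough to run on the sensing node itself, which is the property the device-level correlation function relies on.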

The primary novelty of this study is the Water-QI system architecture that couples affordable IoT sensing with GRU-based deep learning for short-term WQI forecasting. In contrast to most low-cost IoT water monitoring platforms, which primarily operate as telemetry systems that forward measurements to remote servers, the proposed Water-QI platform is designed to support predictive intelligence at the sensing-node level. This is achieved through a lightweight WQI formulation based on the previously mentioned five measurable physicochemical parameters and through the optimization of GRU architectures that are suitable for deployment on constrained embedded hardware. Moreover, the study provides a systematic comparison between hourly and minute-resolution forecasting, showing how temporal granularity, model width, depth, and inference latency jointly affect practical deployment. While several low-cost IoT solutions for water monitoring have been proposed in the literature, many of these systems function primarily as data relay nodes that rely on remote cloud infrastructure for complex processing. The Water-QI framework distinguishes itself by integrating edge intelligence, shifting the predictive heavy lifting from the cloud to a local sensory node. By optimizing GRU models for minute-level forecasting on the Raspberry Pi Zero 2W Water-QI node, a proactive alternative to traditional passive loggers is provided. This approach anticipates future local quality shifts, offering a more resilient decentralized solution for modern smart city infrastructure.

The structure of this paper is as follows: Section 2 presents a review of the related work, emphasizing the differentiation and potential of machine learning and deep learning methods for predicting and classifying water quality attributes. Section 3 details the proposed approach, Section 4 discusses the experimental results obtained using deep learning models, and Section 5 summarizes these findings. Finally, Section 6 provides a conclusion.

2. Related Work

This section reviews the existing literature on machine learning and deep learning approaches for water quality monitoring, categorizing them into traditional methods and advanced deep learning architectures, and then compares their performance. The performance metrics evaluated across these studies are consistent with those summarized in the review by [7]. This survey shows that regression tasks commonly use $R^{2}$ and RMSE, whereas classification tasks rely on accuracy.

2.1. Machine Learning Models for Water Quality Assessment

In recent years, machine learning methods have been widely applied to predict and classify water quality parameters. To assess model performance, researchers commonly utilize established metrics. For regression tasks, the primary indicators include the coefficient of determination ($R^{2}$) and root mean square error (RMSE). For classification tasks, standard metrics comprise accuracy, among others.
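For reference, the three metrics named above can be written in a few lines of plain Python. This is a minimal sketch using the standard textbook definitions; the function names are illustrative, and the reviewed studies may compute these quantities with their own tooling or on differently scaled data.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Undefined (division by zero) when all observed values are identical.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def accuracy(labels_true: Sequence[int], labels_pred: Sequence[int]) -> float:
    """Fraction of class labels predicted correctly."""
    correct = sum(1 for t, p in zip(labels_true, labels_pred) if t == p)
    return correct / len(labels_true)


observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
print(rmse(observed, predicted), r2_score(observed, predicted))
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

Note that RMSE carries the units of the target variable, so values from different studies are only comparable when the targets share a scale, a point that matters for the score introduced later in this section.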

Iyer et al. [8] investigated the application of machine learning algorithms, including SVM, random forest, and decision tree, for water quality classification. Random forest achieved the highest accuracy at 68%. Karthick et al. [9] assessed machine learning algorithms for water quality classification, utilizing advanced preprocessing techniques, including the Yeo–Johnson transformation, principal component analysis (PCA), and the synthetic minority over-sampling technique (SMOTE). Model performance was evaluated using accuracy, with the XGBoost model achieving the highest accuracy of 96.31% without SMOTE.

Khan et al. [10] developed a water quality classification system using a support vector machine (SVM) classifier, which outperformed XGBoost and AdaBoost. The reported accuracy values ranged from 92% to 95%. However, Figure 6 of the manuscript indicates that the highest observed accuracy was approximately 70%. Garcia et al. [11] assessed the performance of decision tree (DT), random forest (RF), and k-nearest neighbors (KNN) algorithms for classifying groundwater quality based on total dissolved solids (TDS). Both DT and RF achieved perfect scores (100%) across all the evaluation metrics, while KNN attained an accuracy of 92.9%. Patil et al. [12] compared multiple machine learning algorithms for water potability classification, with SVM achieving the highest accuracy of 64%.

Prabu et al. [13] investigated anomaly detection in water treatment plants by employing a modified quality index (QI) and an encoder–decoder architecture. The proposed model achieved an accuracy of 89.18%, indicating robust performance in real-time anomaly detection and water quality monitoring. The study by [14] addresses the common challenge of class imbalance in water quality datasets by using an adaptive synthetic sampling (ADASYN) approach to generate synthetic instances above the threshold. Their evaluation of four machine learning models using KNN, boosting decision trees (BDTs), SVM, and neural networks (MLP–ANN) showed that BDT and MLP–ANN achieved the highest accuracy, over 90%.

The study [15] evaluates several machine learning models for water quality index (WQI) regression and water quality classification (WQC). The results indicate that the ANFIS model produced a low root mean square error (RMSE) of 0.054 for WQI prediction, while the NN–KNN classifier achieved 100% accuracy for WQC. Padmaja et al. [16] evaluated multiple machine learning algorithms, such as logistic regression, support vector machine (SVM), naive Bayes, k-nearest neighbors (KNN), decision tree, random forest (RF), AdaBoost, and XGBoost, for water quality assessment. Among these, random forest achieved the highest classification accuracy (85.06%), while the gradient boosting regressor obtained the best regression results with $R^{2} = 0.74$ and RMSE = 2.68.

The comparative analysis in [17] investigates residual chlorine prediction in drinking water using various machine learning methods. The study also explores multi-model ensembles (MMEs), demonstrating that optimal model combinations achieve an $R^{2}$ value of 0.736 and an RMSE of 0.054. Additionally, research in [18] finds that multiple linear regression (MLR) yields strong predictive performance ($R^{2}$ = 0.9992; RMSE = 0.338) for drinking-water quality in seawater desalination plants.

Ding et al. [19] enhanced water quality index (WQI) modeling by using several machine learning-based weighting schemes, including CatBoost, SVM, logistic regression (LR), XGBoost, and LightGBM, together with aggregation functions for WQI calculation. The study does not report conventional classification accuracy since these algorithms were not evaluated as standalone classifiers. Instead, model performance was assessed using regression-related indicators, including $R^{2}$ and RMSE. Among the CatBoost-based WQI models, the best performance was obtained by $WQM_{AC}$, which combines AHP–CatBoost weights with the weighted quadratic mean aggregation function, achieving $R^{2} = 0.842$ and an RMSE of 3.87.

Walczak and Walczak [20] performed a comparative analysis of machine learning algorithms, including neural networks, random forests, k-nearest neighbors (KNN), and multivariate linear regression, for predicting the water quality index (WQI). Model performance was evaluated using RMSE and $R^{2}$. Both the neural network and linear regression models reported an $R^{2}$ value of 1, which was considered superficial. The linear regression model achieved the lowest RMSE value of 0.028. Similarly, the authors in [21] demonstrated that the proposed WDT–ANFIS model, which integrates wavelet denoising techniques with adaptive neuro-fuzzy inference systems, significantly enhances prediction accuracy for water quality parameters, achieving $R^{2}$ values of at least 0.9. The lowest RMSE was reported for pH prediction, with a value of 0.15.

Shams et al. [22] evaluated several machine learning models for predicting both the water quality index (WQI) and water quality classification (WQC). The multi-layer perceptron (MLP) model demonstrated the highest regression performance, achieving an $R^{2}$ of 99.8% and a mean squared error (MSE) of $2.8 \times 10^{-5}$. For classification tasks, the gradient boosting model attained an accuracy of 99.5%.

The authors of [23] introduced two hybrid models, CEEMDAN–XGBoost and CEEMDAN–RF, for short-term water quality prediction by combining decision tree algorithms with the CEEMDAN denoising method. These models exhibited superior accuracy and stability compared to LSTM and SVM. CEEMDAN–RF and CEEMDAN–XGBoost achieved RMSEs ranging from 0.02 to 1.27 across six water quality indicators, with an arithmetic mean of 0.328. CEEMDAN–RF yielded the best RMSE for temperature, dissolved oxygen, and specific conductance, while CEEMDAN–XGBoost produced the best RMSE for pH, turbidity, and FDOM.

Furthermore, an ensemble model that integrates XGBoost, CatBoost, random forest, gradient boosting, extra trees, and AdaBoost, as described in [24], achieved $R^{2}$ values close to 0.99 and RMSE values of 1.07. Additional quantitative evidence of high XGBoost and LSTM performance is provided by [25], where XGBoost is applied for water quality classification (WQC), achieving 99.83 to 99.99% accuracy, and LSTM is used for water quality index (WQI) regression, attaining $R^{2} = 0.9999$ and RMSE = 0.0378.

As reported in [26], the random forest classifier consistently achieved the highest accuracy, reaching 98.2%. Furthermore, [26] reports that models such as the extra tree regressor (ETR) and random forest regressor (RFR) achieve $R^{2}$ values close to 0.99, with RMSE values between 1.55 and 1.71.

The authors of [27] applied the LTSF-Linear model for water quality prediction, focusing on dissolved oxygen (DO), pH, and turbidity. According to Table 2, the LTSF-Linear model achieved the best performance across the evaluated indicators, corresponding to a derived RMSE of 0.2341. Finally, the comprehensive review in [28] presents a classification and regression tree model with over 98% accuracy, and random forest (RF)/decision tree (DT)/SVM classification results for Lake Taal, with RF and DT at 95.0% and SVM at 93.33%.

To evaluate the prediction and classification models proposed in the literature on a common scale, the authors introduce a new performance metric, the water quality score (WQS). The score is computed according to Equation (1):

$$\mathrm{WQS} = \alpha \cdot (1 - \mathrm{RMSE}) + (1 - \alpha) \cdot R^{2} \tag{1}$$

where $\mathrm{RMSE}$ represents the normalized root mean square error of the model predictions, $R^{2}$ is the coefficient of determination, and $\alpha \in [0.5, 1)$ is a weighting factor. Setting $\alpha = 0.5$ assigns equal importance to the error and goodness-of-fit terms, while $\alpha = 1$ eliminates the contribution of $R^{2}$ entirely. The authors adopt $\alpha = 0.8$, emphasizing predictive accuracy over explained variance. This choice is motivated by the nature of water quality datasets, in which measurements are often noisy and acquired at relatively long probing intervals, commonly greater than 24 h, making variance-based metrics less informative. A WQS value above 1 is treated either as an underfitted result or as a sign that the attributes contributing to the RMSE value have not been normalized.
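Equation (1) translates directly into code. The sketch below is one possible implementation, assuming the RMSE has already been normalized by the caller; the function name and the input validation are illustrative choices, and the sample inputs are made-up values rather than results from any reviewed study.

```python
def water_quality_score(rmse_norm: float, r2: float, alpha: float = 0.8) -> float:
    """Compute WQS = alpha * (1 - RMSE) + (1 - alpha) * R^2, as in Equation (1).

    rmse_norm is assumed to be a normalized RMSE; alpha = 0.8 is the value
    adopted in the text. alpha = 1 is accepted here as the limiting case
    that removes the R^2 contribution.
    """
    if not 0.5 <= alpha <= 1.0:
        raise ValueError("alpha is expected to lie between 0.5 and 1")
    return alpha * (1.0 - rmse_norm) + (1.0 - alpha) * r2


# Illustrative inputs only.
print(round(water_quality_score(0.05, 0.95), 4))  # well-fitted model, normalized RMSE
print(round(water_quality_score(1.5, 0.90), 4))   # RMSE above 1 pushes the score negative
```

The second call shows why normalization matters: an RMSE reported in raw units can exceed 1 and drive the score below zero even when $R^{2}$ is high.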

Exploratory analysis of the proposed water quality score (WQS) demonstrates that its behavior is primarily determined by the normalized root mean square error (RMSE) term. When $\alpha = 0.8$, predictive error constitutes 80% of the final score, while $R^{2}$ contributes 20%. The parameter $\alpha$ was selected to emphasize the greater importance of RMSE relative to $R^{2}$ and was held constant throughout both the comparative evaluation and experimental procedures. In the intended application, RMSE is normalized and typically remains within the interval $[0, 1]$. Under these conditions, $1 - \mathrm{RMSE}$ is nonnegative, and, since $R^{2}$ is also generally bounded by $[0, 1]$, WQS is expected to fall approximately within $[0, 1]$. Higher WQS values indicate superior predictive performance. Values near 1 reflect both low prediction error and strong explanatory power, while moderate values represent a balance between acceptable fit and non-negligible error. Negative WQS values arise when the normalized RMSE exceeds 1, resulting in a negative $(1 - \mathrm{RMSE})$ term and causing the error component to outweigh the positive contribution of $R^{2}$. Such outcomes indicate either very poor predictive accuracy or improper normalization of RMSE. Conversely, WQS values greater than 1 should not occur if RMSE is correctly normalized and $R^{2} \leq 1$. Therefore, values above 1 strongly suggest a scaling inconsistency, a calculation error, or the use of a non-normalized RMSE. The literature review data are summarized in Table 1.
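The range statements above follow from a short rearrangement of Equation (1). The lines below are a worked sketch under the stated assumptions ($0 \le \mathrm{RMSE} \le 1$, $0 \le R^{2} \le 1$, and $\alpha > 0$); the explicit threshold for a negative score is derived here and is not quoted from the source.

```latex
% Bounds when both inputs are normalized:
0 \;\le\; \alpha\,(1 - \mathrm{RMSE}) + (1 - \alpha)\,R^{2} \;\le\; \alpha + (1 - \alpha) = 1

% Condition for a negative score, obtained by solving WQS < 0 for RMSE:
\mathrm{WQS} < 0 \;\iff\; \mathrm{RMSE} > 1 + \frac{1 - \alpha}{\alpha}\,R^{2}

% With the adopted weighting, alpha = 0.8:
\mathrm{WQS} < 0 \;\iff\; \mathrm{RMSE} > 1 + 0.25\,R^{2}
```

In other words, an RMSE only slightly above 1 can still be offset by a positive $R^{2}$ term; the score turns negative once the RMSE exceeds 1 by more than a quarter of $R^{2}$.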

When analyzing high-performance scores reported in the literature, a degree of scientific skepticism is necessary. While a coefficient of determination ($R^{2}$) of 0.9999 might seem like an impressive achievement, in the messy world of real-world environmental sensors, such high values are often a red flag. Near-perfect values can be superficial, frequently indicating that a model has overfitted to a very specific noise-free dataset or that the experimental conditions were too controlled to be realistic. In a true urban environment, sensor drift, biofouling, and sudden weather changes introduce unpredictability that makes absolute perfection nearly impossible. Therefore, in our research, we prioritize models that demonstrate high reliability in the 94–96% range as they tend to be more robust and better at generalizing to the actual complexity of city water networks. By focusing on realistic loss metrics like RMSE and curve-fit metrics like $R^{2}$, we also ensure that our proposed system can be easily re-evaluated by others. Section 2.2 presents a literature review of the deep learning models.

2.2. Deep Learning Models for Water Quality Assessment

Deep learning has emerged as a powerful tool for predicting water quality, enabling models to capture complex temporal dependencies and nonlinear relationships in dense multivariate time series. The authors of [29] present an integrated Artificial Ecosystem Optimization with Deep Learning Enabled Water Quality Prediction and Classification (AEODL–WQPC) model. The approach employs an optimal stacked bidirectional gated recurrent unit (OSBiGRU) for water quality index prediction and an Improved Elman Neural Network (IENN) for classification. This model achieved an RMSE value of 0.0458 and an accuracy approaching 96%.

The study in [30] examined the use of artificial neural networks (ANNs) to predict six water quality parameters of the Langat River, Malaysia. The reported results achieved $R^{2}$ values of 0.88–0.99 during the testing phase and very low RMSE values ($7.7 \times 10^{-23}$). The ANN with 10 hidden NN layers and the Levenberg–Marquardt algorithm, in particular, yielded the lowest errors in predicting phosphate and TSS.

Wang et al. [31] proposed a comprehensive weighting method combining entropy weighting and Pearson correlation for feature selection in water quality prediction. They compared models including SVM, MLP, RF, XGBoost, and LSTM. The LSTM model outperformed the other models, especially in predicting dissolved oxygen (DO), achieving an $R^{2}$ of 0.882 and an RMSE of 1.827. Similarly, the authors of [32] evaluated automated deep learning (AutoDL) for water quality assessment and compared it with conventional deep learning models, including ANN, RNN, LSTM, and CNN. Their findings show that LSTM achieved 92% accuracy for binary classification and 94% for multiclass classification, while CNN achieved 95% for binary classification and 91% for multiclass classification.

In the context of river water quality, the authors of [33] developed an LSTM model enhanced with an attention mechanism (AT-LSTM) to predict dissolved oxygen in the Burnett River, Australia. The model was evaluated using RMSE and $R^{2}$, achieving values of 0.130 and 0.953, respectively. The inclusion of the attention mechanism significantly improved the model’s ability to focus on relevant temporal features compared to the standard LSTM.

For aquaculture applications, Gandhi et al. [34] applied both LSTM and GRU models to predict key water quality parameters, including salinity, pH, DO, and temperature. Through extensive hyperparameter optimization, the GRU model achieved an RMSE of 0.036 for DO prediction.

Addressing the challenge of incomplete datasets due to spurious measurements, the authors of [35] proposed a Kalman filter-based LSTM encoder–decoder model (KF-LSTM) incorporating an attention mechanism. This approach effectively reconstructed missing values and captured long-term dependencies, achieving an RMSE of 0.40 and an $R^{2}$ of 0.94 on the test set.

To improve forecasting accuracy, the authors of [36] introduced a hybrid model combining ensemble empirical mode decomposition (EEMD), multivariate linear regression (MLR), and LSTM (EEMD–MLR–LSTM) to predict phytoplankton levels. The model achieved an RMSE of 0.0489 for a 6-h prediction horizon. The integration of EEMD improved feature extraction from non-stationary signals, yielding superior performance compared to standalone LSTM models.

The authors of [37] proposed a methodology that combines an LSTM model with the grasshopper optimization algorithm (GOA) for automatic hyperparameter tuning. This technique achieved an average test-set accuracy of 92.22%, outperforming conventional methods such as SVM and decision trees. This underscores the importance of automated optimization for fully leveraging the potential of deep learning models. The authors in [38] introduced a hybrid LSTM model combined with gray wolf optimization (GWO) and fish swarm optimization (FSO) for predicting parameters in the Thamirabarani River. The LSTM–GWO–FSO model achieved an RMSE of 0.083 and an $R^{2}$ of 0.94 for DO, outperforming traditional ANN, BPNN, and RNN models.

Finally, for nationwide tap water quality monitoring in South Korea, the authors of [39] developed and compared deep learning models, including LSTM, GRU, and SCINet, against an ARIMA baseline. The optimized SCINet model achieved the best results, with an average RMSE of 0.0006, demonstrating the potential of advanced deep learning architectures for high-accuracy real-time water quality forecasting in large-scale supply systems. Table 2 presents the evaluation summary results for the DL models in the recent literature.

According to the literature review, LSTM and GRU models have demonstrated strong performance in complex time-series regression and forecasting tasks. Based on the best-performing regression results reported in Table 2, the highest water quality score (WQS) is achieved by a deep NN model with 10 hidden layers [30], followed by the standard LSTM [25], the GRU-modified OSBiGRU [29], GRU [34], and hybrid LSTM [36]. The top score of the deep feedforward NN [30], similar to the standard NN [40], suggests that, in some water quality prediction settings, carefully optimized dense NN architectures can outperform more complex sequential models when the underlying relationships are strongly nonlinear but do not require sophisticated temporal decomposition; however, their increased depth may degrade their performance and narrow it to specific temporal intervals [40]. Nevertheless, the strong performance of the standard LSTM and the close-to-LSTM performance of the GRU, as a lighter edge-computational candidate, further confirm their effectiveness as a reliable regression baseline for water quality forecasting, particularly in capturing temporal dependencies.

From the regression models in Table 2, GRU is especially relevant for sensor-based water quality forecasting because it offers a simpler gated recurrent architecture than LSTM, with fewer parameters and lower computational complexity, while often retaining comparable predictive accuracy in sequence modeling tasks. This makes GRU attractive for fast inference and resource-constrained edge deployments. Prior comparative work has shown that GRU can be comparable to LSTM and may even converge faster under equal parameter budgets, while hydrology and water-related forecasting studies likewise note that GRU is less complicated and uses fewer parameters than LSTM, making it a practical choice for regression on environmental time series [41].
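The parameter saving can be made concrete with a small calculation. The sketch below counts trainable weights for a single recurrent layer using the classical formulation (three gate blocks for GRU, four for LSTM, one bias vector per block); it assumes five input features, matching the five sensed parameters, which is an illustrative choice rather than a statement of the Water-QI input layout. Some framework implementations keep separate input and recurrent biases, which slightly changes the bias term.

```python
def gated_rnn_params(input_dim: int, units: int, gate_blocks: int) -> int:
    """Trainable parameters of one gated recurrent layer (classical formulation).

    Each gate block holds an input kernel (input_dim x units), a recurrent
    kernel (units x units), and a single bias vector (units).
    """
    per_block = units * (input_dim + units) + units
    return gate_blocks * per_block


def gru_params(input_dim: int, units: int) -> int:
    # GRU: update gate, reset gate, candidate state -> 3 blocks.
    return gated_rnn_params(input_dim, units, gate_blocks=3)


def lstm_params(input_dim: int, units: int) -> int:
    # LSTM: input, forget, and output gates plus cell candidate -> 4 blocks.
    return gated_rnn_params(input_dim, units, gate_blocks=4)


N_FEATURES = 5  # assumed: one input per sensed parameter

for width in (64, 256, 512):
    gru = gru_params(N_FEATURES, width)
    lstm = lstm_params(N_FEATURES, width)
    print(f"units={width}: GRU={gru}, LSTM={lstm}, ratio={gru / lstm:.2f}")
```

Under this formulation a GRU layer always has three quarters of the parameters of an LSTM layer of the same width, and the count grows roughly quadratically with the number of units, which is why layer width has a direct bearing on inference cost on constrained devices.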

For classification tasks, the importance of CNN models should be emphasized, particularly one-dimensional convolutional models, which can learn local temporal and cross-channel patterns directly from raw or minimally processed multivariate sensor sequences, avoiding heavy feature engineering while remaining highly effective for classification and anomaly-related decision tasks in water monitoring systems [42]. Hybrid GRU models may outperform CNNs, as shown in [29], but at the cost of a much larger set of hyperparameters and more optimization effort.

2.3. ML–DL Comparative Analysis

The choice between machine learning (ML) and deep learning (DL) is no longer a theoretical debate but a strategic decision based on the results established in Table 1 and Table 2. Our analysis reveals a clear performance hierarchy: for high-precision regression and forecasting, deep learning is the definitive leader, while, for robust classification of static data, traditional machine learning remains highly competitive. The evaluation of machine learning (ML) and deep learning (DL) architectures for water quality monitoring reveals a significant disparity between raw statistical performance and actual model reliability. While traditional models often report near-perfect metrics, a deeper inspection suggests that these results frequently stem from architectural limitations or data-specific artifacts rather than genuine predictive power.

The performance of traditional ML models in regression tasks is characterized by extreme variance. Ensemble regression models such as [17] achieve a remarkable WQS of 0.94288, suggesting that aggregating weaker learners can mitigate the nonlinearity of hydrological data, though less effectively than deep learning models. This shortfall is clearly illustrated by other ensemble or hybrid models used for regression tasks in Table 1. Moreover, critical skepticism must be applied to models reporting perfect or near-perfect scores. These failures are typically attributed to imbalanced datasets and shallow models’ inability to capture minority-class features in complex water quality matrices.

In contrast to traditional methods for regression tasks, deep learning (DL) models exhibit more consistent and honest performance metrics. The dominance of recurrent neural networks (RNNs), specifically long short-term memory (LSTM) networks, is evident across the literature. Standard LSTMs and their optimized lightweight variants, such as the hyperparameter-optimized GRU or other GRU hybrids, perform similarly, as shown in Table 2.

For classification tasks, DL models like the OSBiGRU [29] and CNN-based architectures [32] maintain high accuracies (above $0.95$), demonstrating superior robustness against the noise and complexity inherent in environmental time-series data, compared directly to their ML XGBoost and gradient boosting counterparts. Nevertheless, some machine learning models can still outperform deep learning models on classification tasks. The comprehensive review in [43] also consolidates evidence that recurrent neural network (RNN)-based models, particularly LSTM models, can achieve accuracies ranging from 96% to 98% and that GRU models may compete with LSTM because of their simpler structure.

3. Materials and Methods

This section introduces a distributed drinking-water monitoring system called the water quality identification IoT system (Water-QI). The subsequent subsections detail the end-to-end high-level system architecture, the IoT device, implemented communication methods and application protocols, and the proposed deep learning models for localized water quality index predictions. These models are designed for extensibility and edge predictability. Additionally, the evaluation metrics, dataset, proposed models, and training hyperparameters are described.

3.1. Proposed System Architecture

The proposed Water-QI platform is a cost-effective Internet of Things (IoT) system developed for real-time monitoring, visualization, and prediction of water quality, with a focus on the water quality index (WQI). The system architecture integrates a field IoT telemetry device, cloud-based data transmission, a web-based data management and visualization environment, and a mobile application. This configuration enables continuous monitoring of water conditions, reducing dependence on periodic laboratory analysis. The platform automatically collects measurements from the IoT sensing node, transmits data to the cloud via existing Wi-Fi infrastructure, and displays both raw measurements and the calculated WQI through intuitive user interfaces. Beyond real-time monitoring, the system offers historical data inspection, statistical analysis, alert management, and configurable parameter weighting for WQI calculation.

At the cloud level, the platform utilizes the open-source ThingsBoard application server (AS) [44] to manage device communication, data visualization, and remote supervision. Data storage is performed using the Cassandra NoSQL database provided by ThingsBoard [45]. The communication workflow links the end node to the cloud through telemetry services, while the application server hosts the predictive component. Specifically, a deep learning algorithm based on a variable-depth gated recurrent unit–recurrent neural network (GRU–RNN) model runs in a container on a cloud virtual server, in a manner similar to the thingsAI paradigm [46], to estimate and forecast WQI trends from incoming sensor data streams. This edge-to-cloud architecture enables the system to monitor current water conditions as a weighted cumulative index, facilitating early warning and proactive decision-making in smart city and environmental monitoring contexts. Figure 1 presents the proposed Water-QI system architecture.

The Water-QI system also includes a mobile monitoring application developed in Flutter/Dart, designed to provide real-time supervision of the Water-QI IoT device via a cross-platform Android and iOS interface. The mobile application acts as a companion to the open-source ThingsBoard application server, which is responsible for telemetry collection, device supervision, alert exchange, and parameter configuration [47]. Within this Water-QI architecture, the mobile application allows users to inspect live sensor measurements, review water quality history, and monitor the operational state of the field device through a portable interface, while the ThingsBoard backend manages data storage, dashboards, and server-side services.

Three protocols are used for data collection by each IoT end-node device of the Water-QI: (1) the MQTT beacon protocol, (2) the HTTP telemetry protocol, and (3) the HTTP request-back control protocol. The MQTT beacon protocol is a real-time protocol for sending beacons from an IoT device to the ThingsBoard AS broker. The beacon packet includes AES-128-encrypted information about the IoT device UUID, the device sensory measurement period $T_{m}$, the data transmission period to the AS $T_{p}$, the AS command update period for the device control protocol $T_{c}$, and the beacon location expressed in latitude and longitude coordinates. The HTTP over SSL telemetry protocol uses the POST method to submit a JSON-encoded string of measurements to the Water-QI AS. Finally, the control protocol is an HTTP over SSL request–response protocol initiated periodically by the end node in order to receive any updated probing intervals (periods), WQI weight parameters, and, if the device does not include a GPS receiver for automatic location updates, latitude and longitude coordinates on the map. Section 3.2 provides additional information regarding the IoT device’s sensors, measurements, and protocols, including functionality and interoperability.
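As an illustration of the telemetry step, the following minimal sketch builds the kind of JSON-encoded measurement string that an end node could submit with an HTTP POST request. The key names, the example values, and the `build_telemetry_payload` helper are assumptions made for illustration; the payload schema actually used by the Water-QI device is not specified here and may differ.

```python
import json


def build_telemetry_payload(temperature_c, tds_mg_l, ec_us_cm, ph, turbidity_ntu):
    """Serialize one set of sensor readings as a JSON string.

    Key names are hypothetical; the deployed schema may use different keys.
    """
    readings = {
        "temperature": round(float(temperature_c), 2),
        "tds": round(float(tds_mg_l), 1),
        "ec": round(float(ec_us_cm), 1),
        "ph": round(float(ph), 2),
        "turbidity": round(float(turbidity_ntu), 2),
    }
    # sort_keys gives a stable string, which simplifies logging and testing.
    return json.dumps(readings, sort_keys=True)


# Example readings (made-up values).
payload = build_telemetry_payload(14.2, 310.0, 480.0, 7.4, 0.8)
```

The resulting string would form the body of the POST request; transport security and device authentication are outside the scope of this sketch.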

3.2. End-Node IoT Device

A primary objective in the design of the Water-QI IoT end-node device with edge capabilities was to demonstrate that high-fidelity environmental monitoring can be achieved using budget-friendly off-the-shelf components. The sensor suite was carefully curated to balance extreme affordability with the data reliability required for deep learning applications. For water temperature monitoring, we selected the DS18B20 digital stainless-steel probe (Analog Devices Inc./Maxim Integrated, Wilmington, MA, USA). This sensor provides a highly stable one-wire digital output at a fraction of the cost of industrial-grade thermocouples or thermometers, making it an ideal candidate for large-scale distributed urban deployments. Figure 2 illustrates the Water-QI IoT prototype.

To keep the Water-QI device edge-capable on a low-power multi-core ARM processor while still supporting low-cost multi-parametric analysis, we attached a series of analog sensors to the RPi Zero 2W board via an I2C ADC board (ADS1115, Texas Instruments Inc., Dallas, TX, USA), as illustrated in Figure 2a. The implemented prototype includes the DFR0300 sensor for electrical conductivity (EC; DFRobot/Zhiwei Robotics Corp., Shanghai, China) (see Figure 2b(3)), the SEN0244 sensor for total dissolved solids (TDS; DFRobot/Zhiwei Robotics Corp., Shanghai, China) (see Figure 2b(4)), the Grove V1.0 sensor meter (Seeed Studio/Seeed Technology Co., Ltd., Shenzhen, China) for turbidity measurements (see Figure 2b(6)), the SEN0161-V2 sensor (DFRobot/Zhiwei Robotics Corp., Shanghai, China) for pH measurements (see Figure 2b(5)), and the DS18B20 temperature sensor (see Figure 2b(2)). The device is powered via a 5 V USB type-A connector (see Figure 2b(7)) and uploads measurements to the cloud AS over the RPi’s onboard Wi-Fi interface. These probes were specifically chosen because they offer a cost-effective entry point into smart city infrastructure without sacrificing the precision needed to calculate an accurate water quality index (WQI), since we are mainly focusing on measurement deviations rather than absolute values. The main drawback of these low-cost analog sensors is that they require monthly calibration to perform adequately.

The low-cost sensors of the Water-QI prototype are calibrated and validated individually and then jointly at the device level. The SEN0161-V2 pH sensor is calibrated using standard buffer solutions, preferably pH 4.00, 7.00, and 10.00, so that the electrode offset and response slope can be adjusted before measurements are used in the WQI calculation. The DFR0300 EC sensor is calibrated using known electrical-conductivity reference solutions, while the SEN0244 TDS sensor is calibrated using TDS reference solutions or conductivity-derived TDS standards. Since both EC and TDS are temperature-dependent, their readings are interpreted alongside the DS18B20 temperature measurements. The Grove V1.0 turbidity sensor is calibrated using reference turbidity solutions in the expected operating range, with particular attention to low turbidity values relevant to drinking-water monitoring. The DS18B20 temperature probe is only validated against a reference thermometer.

Although monthly calibration is needed, opting for these accessible analog modules over expensive laboratory-grade equipment and focusing on real-time acquisition of measurement changes keeps the proposed system financially viable for municipalities with limited budgets, facilitating the transition toward pervasive and sustainable water management. Furthermore, each device can be localized either through an optional GPS receiver (NEO 6M GPS module) connected to the RPi’s serial port or through statically assigned GPS coordinates, which makes the Water-QI system’s distributed approach suitable for monitoring water quality deviations at city-district levels. Figure 2a shows the actual control device and its interface with the analog sensors via the analog-to-digital converter, while Figure 2b(1) illustrates the packaging of the actual proof-of-concept (PoC) implementation, which was tested without a GPS receiver, as shown in Figure 2a.

The probing Water-QI IoT node is built around the Raspberry Pi Zero 2W microprocessor, a compact single-board computer featuring a quad-core 64-bit ARM Cortex-A53 CPU at 1 GHz, 512 MB LPDDR2 RAM, integrated 2.4 GHz 802.11 b/g/n Wi-Fi, Bluetooth 4.2, mini-HDMI, micro-USB OTG, CSI camera connector, and a 40-pin GPIO header. The RPi zero 2W interfaces with an ADS1115 analog-to-digital converter over the I²C bus to acquire the outputs of the analog water quality probes. The ADS1115 is connected to the Raspberry Pi through GPIO2 (SDA) and GPIO3 (SCL), while its four analog 16-bit input channels are assigned as follows: AIN0 to the DFRobot SEN0161-V2 pH sensor, AIN1 to the Grove Turbidity Sensor Meter V1.0, AIN2 to the DFRobot SEN0244 TDS sensor, and AIN3 to the DFRobot DFR0300 electrical conductivity sensor. The pH conditioning board operates at 3.3–5.5 V with an analog output of 0–3.0 V, the TDS board operates at 3.3–5.5 V with an analog output of 0–2.3 V, and the EC board operates at 3.0–5.0 V with an analog output of 0–3.4 V. The Grove turbidity sensor supports 3.3 V/5 V operation and provides both analog and digital output; in the proposed setup, it is configured in analog mode and connected directly to AIN1. In addition, water temperature is measured using a DS18B20 digital sensor connected to GPIO4 via the Raspberry Pi 1-wire interface, with a 4.7 kΩ pull-up resistor between the data line and 3.3 V. All sensors share a common ground, and the DS18B20 temperature reading can also be used for compensation in conductivity and TDS-related calculations. Finally, the GPS receiver with an IPX uFL antenna included is connected via the GPIO 13-14 UART serial port of the RPi Zero 2W MPU.
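The channel assignment described above can be summarized in a short sketch that also converts raw ADS1115 conversion codes to volts. It assumes single-ended readings and the converter's ±4.096 V programmable-gain setting, since the gain actually used in the prototype is not stated; the names `CHANNEL_MAP` and `counts_to_volts` are hypothetical.

```python
# Channel assignment as described in the text (AIN0..AIN3).
CHANNEL_MAP = {
    0: "pH (SEN0161-V2)",
    1: "turbidity (Grove V1.0)",
    2: "TDS (SEN0244)",
    3: "EC (DFR0300)",
}

FULL_SCALE_VOLTS = 4.096  # assumed PGA setting; not stated in the text
MAX_CODE = 32768          # 2**15: the ADS1115 output is a signed 16-bit code


def counts_to_volts(raw_code, full_scale=FULL_SCALE_VOLTS):
    """Convert a signed 16-bit ADS1115 conversion code to volts."""
    if not -MAX_CODE <= raw_code < MAX_CODE:
        raise ValueError("raw_code is outside the signed 16-bit range")
    return raw_code * full_scale / MAX_CODE


# Example: a mid-scale code read on the pH channel.
ph_volts = counts_to_volts(16384)
```

With this gain setting, one code step corresponds to 125 µV, which is far finer than the 0–3.4 V output spans of the conditioning boards require.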

The National Sanitation Foundation Water Quality Index (NSF-WQI) was proposed by Brown et al. [48] as a refinement of the earlier index-based water quality assessment concept introduced by Horton [49]. Horton’s contribution is generally recognized as the first formal WQI framework, designed to compress multiple physicochemical observations into a single interpretable score for surface-water assessment. Brown and colleagues extended this idea into the NSF-WQI by adopting a multiplicative model of weighting parameters and rating procedure, which made the index easier to apply and helped to establish it as one of the most widely used WQI formulations for rivers and other surface waters. Like the Horton model, the NSF-WQI preserves the four basic components that characterize most classical water quality indices: (i) parameter selection, namely the choice of the physical, chemical, and biological variables to be included; (ii) transformation of raw measurements into sub-indices so that heterogeneous variables with different units can be mapped onto a common quality scale, (iii) parameter weighting, through which more influential variables receive greater importance in the final score, and (iv) aggregation of the weighted sub-indices into a single composite WQI value. These four elements remain the conceptual backbone of many later WQI variants [50,51].

The NSF-WQI has since been widely applied to evaluate surface-water quality across diverse environmental and management settings, including rivers affected by urban, agricultural, and industrial pressures. For example, Abrahao et al. [52] applied index-based analysis to assess a stream receiving industrial effluents, illustrating the practical use of WQI methods in pollution-impact studies. More broadly, the popularity of the NSF-WQI stems from its ability to reduce complex monitoring datasets into a concise communicable measure of overall water status while retaining the essential logic of Horton’s original formulation. The historical development of water quality indices, from Horton’s original formulation to the NSF-WQI and later variants, has been extensively reviewed in [53].

For the real-time edge-device implementation, the weighting strategy was derived by adapting nominal literature-based WQI coefficients to the reduced parameter set available in the proposed sensing platform. Specifically, NSF-WQI-type formulations assign expert-defined raw weights to several physicochemical variables, including pH ($w_{pH}^{raw} = 0.12$), temperature ($w_{temp}^{raw} = 0.10$), turbidity ($w_{Tb}^{raw} = 0.08$), and total solids ($w_{TDS}^{raw} = 0.08$) (see [54], Table 2). These coefficients, however, do not constitute a complete weighting scheme for the present five-parameter system since they originate from a broader multi-parameter index and sum to only 0.38 across the overlapping variables. Moreover, electrical conductivity is not explicitly included in the original NSF-WQI formulation and is therefore introduced here as an application-specific extension with raw coefficient $w_{EC}^{raw} = 0.08$. To obtain a valid edge-computable WQI, all raw coefficients are normalized based on Equation (2),

${\hat{w}}_{i} = \frac{w_{i}}{\sum_{j = 1}^{5} w_{j}}$

(2)

where ${\hat{w}}_{i}$ denotes the normalized weight of the i-th measured parameter, $w_{i}$ is the corresponding raw weight before normalization, $i \in \{1, \dots, 5\}$ indexes the five sensory attribute variables of the proposed Water-QI platform, and j is the summation index used to accumulate the raw weights of all five parameters in the denominator. Thus, the final weights satisfy $\sum_{i = 1}^{5} {\hat{w}}_{i} = 1$, or, equivalently, 100%. In this way, the final percentages are not directly copied from the bibliography but are obtained through proportional renormalization of literature-inspired coefficients over the subset of parameters actually measured at the IoT-device level.

According to the Horton model, which is one of the earliest and most influential weighted-arithmetic WQI formulations, five WQI classes are commonly used: very good (91–100), good (71–90), poor (51–70), bad (31–50), and very bad (0–30) [49,50]. Furthermore, there is also the canonical NSF-WQI, which evolved from Horton-type formulations and does not explicitly include electrical conductivity (EC) and uses total solids rather than total dissolved solids (TDS) among its standard variables [50,54]. Therefore, while the final WQI interpretation in this study follows an established five-class Horton-type scale for practical comparison, the individual sub-index equations for turbidity, pH, temperature, TDS, and EC are min–max tailored in the proposed Water-QI platform and measure attributive weights expressed as a quality score, where minimal values are better.

In depth, using the raw literature-inspired coefficients $w_{pH} = 0.12$, $w_{temp} = 0.10$, $w_{Tb} = 0.08$, $w_{TDS} = 0.08$, and the application-specific extension $w_{EC} = 0.08$, and based on Equation (2), the total raw weight becomes $\sum_{i = 1}^{5} w_{i} = 0.46$. The final normalized weights are then obtained as ${\hat{w}}_{i} = \frac{w_{i}}{0.46}$, which yields ${\hat{w}}_{pH} = 0.2609$, ${\hat{w}}_{temp} = 0.2174$, ${\hat{w}}_{Tb} = 0.1739$, ${\hat{w}}_{TDS} = 0.1739$, and ${\hat{w}}_{EC} = 0.1739$. The final weighting scheme for the Water-QI system becomes 26.09% for pH (set to 25%), 21.74% for temperature (set to 15% to denote the minimal significance of temperature over the other parameters since it is rather constant for underground water pipelines and city installations), and 17.39% each for turbidity, TDS, and EC (each set to 20% to denote their importance over temperature); the adjusted values sum exactly to 100%. Table 3 summarizes the WQI classes as well as the mathematical formulation for the selected parameters for the WQI index calculation performed by the Water-QI IoT device. Table 3 presents the Horton/NSF-WQI attribute classification with respect to the Horton classification and the Water-QI score based mainly on min–max normalization, the per-measure normalization process, and the final WQI index value acting as a classification index value that is inversely proportional to Horton classification values. Furthermore, the NSF-WQI is disregarded and the TDS metric for total solids is used, along with temperature and EC values, each with its min–max limitations, in accordance with the NSF-WQI classification.
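The renormalization arithmetic of Equation (2) can be checked with a few lines of code. The sketch below uses the raw coefficients quoted in the text; the dictionary keys and the `normalize_weights` helper are illustrative names, not part of the Water-QI software.

```python
# Raw literature-inspired coefficients quoted in the text.
RAW_WEIGHTS = {"pH": 0.12, "temp": 0.10, "Tb": 0.08, "TDS": 0.08, "EC": 0.08}


def normalize_weights(raw):
    """Apply Equation (2): divide each raw weight by the sum of all raw weights."""
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("raw weights must have a positive sum")
    return {name: weight / total for name, weight in raw.items()}


normalized = normalize_weights(RAW_WEIGHTS)
rounded = {name: round(weight, 4) for name, weight in normalized.items()}

# Adjusted operational weights adopted in the text (percent).
ADOPTED_PERCENT = {"pH": 25, "temp": 15, "Tb": 20, "TDS": 20, "EC": 20}
```

Rounding the normalized values to four decimals reproduces the figures reported above, and both the normalized weights and the adopted percentages add up to the whole.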

Our weighting strategy is not merely a mathematical convenience but is deeply rooted in the established theoretical foundations of the Horton and NSF-WQI models. These frameworks suggest that certain parameters, such as pH and turbidity, carry a disproportionate impact on overall water stability and consumer health. By aligning our custom weights with WHO and EPA safety thresholds, we ensure that the Water-QI system operates within a validated public health context. This theoretical alignment ensures that our predictive models are not just fitting numerical noise but are tracking the most critical biological and chemical risk factors defined by decades of environmental science research.

To ensure the Water-QI system reliability, specific operational thresholds were defined in accordance with WHO and Environmental Protection Agency guidelines [55,56]. In the proposed Water-QI implementation, the five monitored variables are combined through an application-specific weighted score rather than a canonical Horton or NSF-WQI formulation. The selection of the water quality index as a weighted score in this study was intended to be consistent with the air quality index metric, as mentioned in [57]. Nevertheless, the NSF-WQI can easily be implemented at the Water-QI nodes as an additional measure.

With respect to drinking-water suitability, turbidity should ideally remain below 1 NTU and, in practice, not exceed 5 NTU. The pH value is commonly considered acceptable in the range 6.5–8.5, and total dissolved solids (TDS) are typically limited to 500 mg/L. By contrast, neither electrical conductivity (EC) nor temperature has a single universal WHO/EPA health-based drinking-water limit in the same sense. In the present work, they are therefore treated as operational surrogate variables whose influence is set by the custom weighting scheme of this study. The assigned weights are $w_{pH} = 2.5$, $w_{T} = 1.5$, and $w_{Tb} = w_{TDS} = w_{EC} = 2.0$. Since the weights sum to 10, they correspond to normalized contributions of 25% for pH, 15% for temperature, and 20% each for turbidity, TDS, and EC. Selecting a larger weight for pH is similar to the NSF-WQI selection for water pH values, and the same reasoning applies to the weight chosen for temperature. Equal weights have been selected for all other measurements. Consequently, the resulting WQI index score is best described as a custom 0–100 water quality score derived from min–max-normalized measurements. In terms of class interpretation, the adopted bands, as mentioned in Table 3, run in the opposite direction to the NSF-WQI classification limits, where higher values denote better quality.
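To make the weighted-score idea concrete, the following sketch combines per-parameter min–max penalties (0 = best, 100 = worst) using the 25/15/20/20/20 weighting. The limits in `LIMITS`, the pH penalty based on distance from the midpoint of the 6.5–8.5 band, and all function names are placeholder assumptions for illustration; they are not the sub-index definitions of Table 3.

```python
WEIGHTS = {"pH": 0.25, "temp": 0.15, "Tb": 0.20, "TDS": 0.20, "EC": 0.20}

# (best, worst) limits per parameter: placeholder values for illustration only,
# NOT the Table 3 limits.
LIMITS = {
    "temp": (15.0, 25.0),  # degrees C
    "Tb": (0.0, 5.0),      # NTU
    "TDS": (0.0, 500.0),   # mg/L
    "EC": (0.0, 2000.0),   # uS/cm
}
PH_CENTER, PH_HALF_RANGE = 7.5, 1.0  # midpoint and half-width of 6.5-8.5


def min_max_penalty(value, best, worst):
    """Map a reading to a 0-100 penalty (0 = best), clipped at both ends."""
    fraction = (value - best) / (worst - best)
    return 100.0 * min(1.0, max(0.0, fraction))


def water_qi_score(reading):
    """Weighted sum of per-parameter penalties; lower scores mean better water."""
    penalties = {
        name: min_max_penalty(reading[name], *LIMITS[name]) for name in LIMITS
    }
    ph_deviation = abs(reading["pH"] - PH_CENTER) / PH_HALF_RANGE
    penalties["pH"] = 100.0 * min(1.0, ph_deviation)
    return sum(WEIGHTS[name] * penalties[name] for name in WEIGHTS)


# Example reading (made-up values).
sample = {"pH": 7.5, "temp": 15.0, "Tb": 2.5, "TDS": 250.0, "EC": 1000.0}
score = water_qi_score(sample)
```

Because every penalty is clipped to the 0–100 range and the weights sum to one, the combined score also stays within 0–100, with lower values indicating better quality.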

A critical design choice in the Water-QI architecture is the deployment of two separate physical sensors for EC and TDS. Although these parameters are theoretically correlated, where TDS (mg/L) is estimated as $k \times EC$ (μS/cm), with a typical conversion factor of $k \approx 0.98$, a single-sensor approach would introduce a static dependency that fails in complex environments. By utilizing distinct sensing elements, we overcome the limitations of pre-determined linear estimation. This redundancy allows the system to capture specific ionic fluctuations that a simple mathematical conversion might miss. For instance, one sensor may detect a spike in a specific mineral salt that alters the water’s conductive profile differently than total dissolved solids do. This dual-sensing strategy prevents blind spots in the detection logic, ensuring that, if one sensor reaches its sensitivity limit or encounters a specific type of ionic interference, the other remains as a fail-safe to maintain the integrity of the water quality index (WQI) calculation. According to regulations, TDS values above 500.0 ppm are considered medium/fair and set as very high for drinking water. Moreover, TDS values above 1200 ppm are considered unacceptable. In accordance, electrical conductivity (EC) is considered unacceptable for drinking water if a value of 2000.0 μS/cm and above is detected (see Table 3).

Temperature measurements for the Water-QI node are performed using a DS18B20 sensor. This is because thermal variations significantly affect ion mobility. Maintaining water temperature between 5 °C and 15 °C is considered ideal for palatability and the prevention of microbial regrowth, which becomes a significant risk at temperatures exceeding 25 °C or with temperature variations of 10 °C (penalty value of 100). The following Section 3.3 describes the metrics used in the authors’ experimentation.

3.3. Metrics Used

To evaluate the performance of our prediction models, we utilize standard regression metrics widely adopted in the literature for water quality forecasting. Specifically, our evaluation is based on the root mean square error (RMSE) and the coefficient of determination ($R^{2}$):

Root Mean Square Error (RMSE):
Indicates the standard deviation of prediction errors. It is highly sensitive to large errors and provides interpretability in the same unit as the scaled target variable. It is defined according to Equation (3),

$R M S E = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}$

(3)

where n denotes the total number of observations, $y_{i}$ is the actual value of the target variable for the i-th observation, and ${\hat{y}}_{i}$ is the corresponding predicted value produced by the model.
Coefficient of Determination ( $R^{2}$ ):
Measures the proportion of the variance in the dependent variable (WQI) that is predictable from the independent variables. A score closer to 1 indicates a better fit, with 1 corresponding to a perfect fit. It is defined according to Equation (4),

$R^{2} = 1 - \frac{\sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}{\sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}}$

(4)

where $y_{i}$ is the actual value of the target variable for the i-th observation, ${\hat{y}}_{i}$ is the predicted value for the i-th observation, $\bar{y}$ is the mean of the observed target values, and n is the total number of observations.
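Equations (3) and (4) translate directly into code. The following sketch is a plain NumPy rendering of the two definitions; the function names and the sample values are illustrative and are not taken from the experiments.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Equation (3): square root of the mean squared prediction error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r_squared(y_true, y_pred):
    """Equation (4): one minus residual over total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the observed values are constant")
    return float(1.0 - ss_res / ss_tot)


# Made-up scaled target values and predictions.
y_obs = [0.20, 0.25, 0.30, 0.35]
y_hat = [0.22, 0.24, 0.29, 0.37]
example_rmse = rmse(y_obs, y_hat)
example_r2 = r_squared(y_obs, y_hat)
```

Note that $R^{2}$ is undefined for a constant target series, which is why the sketch raises an error in that case rather than returning a value.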

3.4. Proposed Deep Learning Models for WQI Prediction

Following the extensive comparative analysis of the existing literature, we designed a targeted experimental framework. While complex hybrid models are popular, they often require significant computational power, which contradicts the philosophy of low-cost IoT smart city deployments. Instead, our approach focuses exclusively on gated recurrent units (GRUs). GRUs offer a streamlined alternative to LSTMs, requiring fewer computational resources and memory while maintaining excellent retention of temporal dependencies in time-series data.

Heavier architectures, such as temporal convolutional networks (TCNs) or transformer-based models, were excluded from this study in order to focus on GRU models that run adequately on edge hardware as well as in the cloud, with predictive performance close to that of LSTM models. Although effective in high-performance cloud environments, the self-attention mechanisms of transformers and the heavy parameter profiles of such architectures often lead to memory overflows or unsustainable latency on resource-constrained hardware like the Raspberry Pi Zero 2W, which is the cornerstone of the Water-QI framework.

To thoroughly evaluate the trade-off between predictive accuracy, temporal granularity, and computational cost, we developed and trained three distinct GRU architectures:

The Standard Model:
Designed with practical low-cost IoT deployment in mind, it consists of a single GRU layer with 64 units. This model is engineered to be computationally inexpensive, capable of running on edge devices, while capturing the general daily trend of the water quality index (WQI).
The High-Capacity/Heavy Model:
Built to capture maximum detail, this model significantly increases the network’s capacity to 256 units or more. It serves as our benchmark for maximum achievable accuracy, albeit at a higher computational cost.
The Deep GRU Model:
To test the limits of network depth and investigate potential diminishing returns, we constructed unusually deep multi-layer (up to 10) GRU architectures. This experimental model acts as a stress test to determine if extremely deep GRU layer networks justify their massive training times in the context of environmental monitoring.

To ensure training stability and prevent the models from simply memorizing the training data (overfitting), we applied rigorous regularization techniques across all three architectures. Batch normalization was applied after each GRU layer to stabilize learning, followed immediately by a 20% dropout rate to promote generalization. The final features are passed through a fully connected (dense) layer and reshaped to output the exact forecasted sequence (either the next 24 h or the next 1440 min) for the five key water parameters (temperature, TDS, EC, pH, and turbidity).

Employing a 1440-step sequence substantially increases input dimensionality, memory requirements, training costs, and inference latency compared to the 24-step hourly configuration. Extended recurrent sequences can also intensify optimization challenges, such as vanishing gradients and slower convergence rates. Consequently, GRU models were chosen instead of simple RNNs because their gating mechanisms enhance the retention of long-term dependencies. Additionally, batch normalization, dropout, learning-rate reduction, and early stopping were implemented to enhance training stability.

The models in the minute-resolution temporal scale also follow the previously mentioned classification of the standard, heavy, and deep GRU models. However, they have additionally been grouped according to their GRU layer width as follows:

Small models:
Models of a single layer and their corresponding multi-layer derivatives of 64 (standard model) to 128 GRUs per layer.
Medium models:
Models of 256–512 GRUs per layer and their corresponding stacked-layer counterparts. A single-layer entity is considered a heavy GRU model, while a multi-layered entity is considered a deep GRU model.
Large models:
Models of at least 512 GRUs per layer, with significant representatives being the 1024 and 2048 GRUs per layer. A single-layer entity is considered a heavy GRU model, while a multi-layered entity is considered a deep GRU model.

3.5. Data Collection and Preprocessing Steps

Effective data preprocessing and feature engineering are crucial for maximizing model performance, especially when working with environmental data that often contain noise and inconsistencies. The work in [58] demonstrates the importance of feature selection in the application of a gated recurrent unit (GRU) neural network for measurement predictions. Similarly, [59] addresses the non-stationarity and jitter inherent in environmental data through multi-step forecasting strategies and various train–test splits. Along the same lines, the research in [60] further emphasizes the value of preprocessing techniques, including normalization and feature selection, in reducing computational overhead while enhancing predictive accuracy.

Raw sensory data are rarely clean. Issues like calibration drift and sudden signal dropouts are common in the field. To address this, we implemented a three-step cleaning pipeline on the dataset before feeding any data to the GRU models. First, we scrubbed the dataset for outliers by removing any readings that fell outside of realistic physical thresholds for each parameter. Next, we addressed data gaps using linear interpolation, effectively filling them to maintain the temporal sequence. Finally, to address the high-frequency jitter typical of budget sensors, we smoothed the data with a 30-sample rolling average. This process ensured that the deep learning models were learning from consistent trends rather than reacting to sensor noise or temporary hardware glitches.
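The three cleaning steps described above can be sketched with pandas as follows. The physical thresholds, the trailing (rather than centered) rolling window, and the `clean_series` helper are assumptions for illustration; the example uses a window of 3 instead of 30 only to keep the toy series short.

```python
import numpy as np
import pandas as pd


def clean_series(values, lower, upper, window=30):
    """Outlier removal, linear gap filling, and rolling-mean smoothing."""
    series = pd.Series(values, dtype=float)
    # Step 1: readings outside the physical range become missing values.
    series = series.where(series.between(lower, upper))
    # Step 2: fill gaps linearly; edge gaps take the nearest valid reading.
    series = series.interpolate(method="linear", limit_direction="both")
    # Step 3: smooth jitter with a trailing rolling average.
    return series.rolling(window=window, min_periods=1).mean()


# Example: pH readings with one dropout (NaN) and one impossible spike.
raw_ph = [7.0, 7.2, np.nan, 7.6, 25.0, 7.4]
cleaned = clean_series(raw_ph, lower=0.0, upper=14.0, window=3)
```

Applying the outlier filter before interpolation matters: otherwise the spike would be averaged into its neighbors by the rolling mean instead of being replaced by an interpolated value.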

Following these directions, the authors use the OpenData dataset of EYATH for the city of Thessaloniki, Greece [61]. The dataset contains per-month daily measurements from 49 selected areas in and around the city of Thessaloniki from 2021 to 2025. This structured dataset has been temporally interpolated using a linear membership function to map monthly trends into continuous minute-level sequences. Specifically, these daily records were temporally interpolated to generate hourly and minute-level sequences suitable for GRU-based sequence-to-sequence forecasting. This process provides a high-resolution data proxy that serves as a computational stress test for the IoT hardware rather than a claim of raw environmental capture. Therefore, the collected temporal sensory measurement data are partitioned into two distinct temporal resolutions to train the models accordingly:

High Resolution (minute by minute): The input sequence consists of 1440 time steps, representing every single minute of a 24-h period. This allows the model to observe micro-fluctuations and transient spikes in water quality.
Standard Resolution (hourly intervals): The input sequence is condensed into 24 time steps, representing hourly averages. This significantly reduces the data’s dimensionality, filtering out potential sensor noise.
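The temporal upsampling that produces these two resolutions can be illustrated with a minimal sketch, assuming plain linear interpolation between consecutive daily records. The exact interpolation procedure applied to the dataset may differ, and the `upsample_linear` helper and sample values are hypothetical.

```python
import numpy as np


def upsample_linear(daily_values, steps_per_day):
    """Linearly interpolate daily records onto a finer, evenly spaced grid."""
    daily_values = np.asarray(daily_values, dtype=float)
    day_index = np.arange(daily_values.size)  # one record per day
    # Fine grid from the first to the last daily record (inclusive).
    n_fine = (daily_values.size - 1) * steps_per_day + 1
    fine_index = np.arange(n_fine) / steps_per_day
    return np.interp(fine_index, day_index, daily_values)


# Three made-up daily pH records.
daily_ph = [7.0, 7.4, 7.2]
hourly_ph = upsample_linear(daily_ph, steps_per_day=24)    # 49 samples
minute_ph = upsample_linear(daily_ph, steps_per_day=1440)  # 2881 samples
```

Sequences generated this way are piecewise linear between the original records, which is consistent with treating the minute-level data as a computational proxy rather than as raw environmental capture.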

The following Section 4 and Section 5 present the authors’ experimental results and discussion.

4. Experimental Scenarios

To evaluate the effectiveness of our proposed GRU architectures, we structured our experiments around the data temporal resolutions described previously. The training data annotation process involved formatting the sequential datasets to predict the designated forecast horizon (24 steps for hourly; 1440 steps for minute-level).

4.1. Model Training and Hyperparameters

Table 4 summarizes the hyperparameters for the training model scenarios in both low- and high-resolution cases. The optimization setup uses the Adam optimizer with a learning rate of 0.001, a batch size of 16, and a maximum of 100 training epochs, with RMSE and MAE as monitoring metrics and $R^{2}$ computed on the testing dataset.

To ensure a consistent and valid comparison between the two temporal resolutions, both the hourly and minute-level models were trained using the exact same five sensory variables mentioned above, maintaining identical input features across all experimental scenarios.

The GRU forecasting models are trained as a multivariate sequence-to-sequence predictor using five water quality variables: temperature, TDS, EC, pH, and turbidity. The main architectural hyperparameters are the number of stacked recurrent layers (L = 1–10) and the GRU layer width; the recurrent block is followed by a batch normalization layer and a dropout layer with a dropout rate of 0.2. Finally, a dense projection–flatten layer maps the hidden representation to $24 \times 5 = 120$ neurons for the hourly-resolution case and $1440 \times 5 = 7200$ neurons for the minute-resolution case. Then, the output-value layer follows, with the same number of neurons indicating the temporal prediction length (hourly or minute-graded according to this paper’s performed experimentation).
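To give a feel for how the layer width and the prediction length drive model size, the sketch below counts trainable parameters for the recurrent block and a single dense projection. It assumes the standard three-gate GRU parameterization with two bias vectors per layer (the Keras default, `reset_after=True`) and a dense head that reads a hidden vector of width `units`. Batch normalization and the separate output-value layer are omitted, so the totals are rough lower bounds rather than the authors' exact figures. All function names are illustrative.

```python
N_FEATURES = 5  # temperature, TDS, EC, pH, turbidity


def gru_params(input_dim: int, units: int, reset_after: bool = True) -> int:
    """Trainable parameters of one GRU layer (update, reset, candidate gates)."""
    bias_sets = 2 if reset_after else 1
    return 3 * (input_dim * units + units * units + bias_sets * units)


def dense_params(in_dim: int, out_dim: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_dim * out_dim + out_dim


def head_size(pred_len: int) -> int:
    """Number of output neurons: one value per step and per variable."""
    return pred_len * N_FEATURES


def model_params(units: int, layers: int, pred_len: int) -> int:
    """GRU stack plus one dense projection (simplified, see assumptions above)."""
    total = gru_params(N_FEATURES, units)
    total += (layers - 1) * gru_params(units, units)
    total += dense_params(units, head_size(pred_len))
    return total


hourly_small = model_params(units=64, layers=1, pred_len=24)
minute_heavy = model_params(units=512, layers=1, pred_len=1440)
approx_mb_float32 = minute_heavy * 4 / 1e6  # 4 bytes per float32 weight
```

Under these assumptions, the dense projection dominates the parameter count in the minute-resolution case, because its size grows with the 7200-neuron output.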

In the low-temporal-resolution case, where the hourly measurement dataset is used, the input depth is SEQ_LEN = 24, meaning that each training sample contains 24 past hourly observations, corresponding to one full day of historical context. Similarly, the prediction horizon is set to PRED_LEN = 24, so the network forecasts the next 24 h hour by hour. Hence, the model learns a one-day-to-one-day mapping, using 24 past hours to predict the next 24 h of measurements.

In the near-real-time high-temporal-resolution case, the model has a very large temporal depth. The input depth is SEQ_LEN = 1440, meaning each training sample contains 1440 past time steps. Since the data are sampled at a minute resolution, this corresponds to one full day of historical context. The prediction horizon is also 1440 steps (PRED_LEN = 1440), so the network predicts the next 24 h minute by minute. Therefore, the model learns a one-day-to-one-day mapping: 1440 past minutes are used to forecast 1440 future minutes. The 24-h prediction horizon was specifically chosen to capture a full diurnal cycle, which is the standard operational window for urban water management and planning. In addition, the temporal sampling stride is 60, so neighboring training windows overlap heavily while advancing by one hour.
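The window construction described above can be illustrated with a short sketch. The helper `make_windows` and the synthetic three-day series are hypothetical; only SEQ_LEN, PRED_LEN, and the stride of 60 come from the text.

```python
from typing import List, Sequence, Tuple

Row = Sequence[float]  # one time step with the five measured variables


def make_windows(
    series: Sequence[Row], seq_len: int, pred_len: int, stride: int
) -> List[Tuple[Sequence[Row], Sequence[Row]]]:
    """Slice a series into (past window, future window) training pairs."""
    pairs = []
    last_start = len(series) - seq_len - pred_len
    for start in range(0, last_start + 1, stride):
        past = series[start:start + seq_len]
        future = series[start + seq_len:start + seq_len + pred_len]
        pairs.append((past, future))
    return pairs


# Synthetic minute-resolution series: three days, five variables per step.
series = [[float(t)] * 5 for t in range(3 * 1440)]

SEQ_LEN, PRED_LEN, STRIDE = 1440, 1440, 60
minute_pairs = make_windows(series, SEQ_LEN, PRED_LEN, STRIDE)
```

With a stride of 60, consecutive windows share 1380 of their 1440 input steps, which is the heavy overlap mentioned above.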

The preprocessing and training hyperparameters also play an important role. Before sequence generation, the high-temporal-resolution raw sensor data are smoothed with a 30-sample rolling window to reduce short-term noise. The dataset is then split chronologically, with 10% reserved for testing and 10% used for validation, preserving temporal order by setting shuffle to false. Training is further regulated by ReduceLROnPlateau with a factor of 0.5 and a patience of 2, and by early stopping with a patience of 8 and best-weight restoration. The following Section 4.2 and Section 4.3 summarize the experimental results.
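The smoothing and chronological split can be sketched in a few lines. The sketch assumes a trailing (causal) moving average, since the text specifies only the 30-sample window length, and it assumes that the remaining 80% of the samples are used for training. The function names are illustrative.

```python
from typing import List, Sequence, Tuple


def rolling_mean(values: Sequence[float], window: int = 30) -> List[float]:
    """Trailing moving average; early samples average over what is available."""
    out: List[float] = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the sample leaving the window
        out.append(running / min(i + 1, window))
    return out


def chronological_split(
    samples: Sequence, val_frac: float = 0.10, test_frac: float = 0.10
) -> Tuple[Sequence, Sequence, Sequence]:
    """Split without shuffling: oldest samples train, newest samples test."""
    n = len(samples)
    n_test = round(n * test_frac)
    n_val = round(n * val_frac)
    n_train = n - n_val - n_test
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test


smoothed = rolling_mean([0.0, 1.0, 2.0, 3.0, 4.0], window=3)
train, val, test = chronological_split(list(range(100)))
```

Keeping the split chronological means that the validation and test windows always lie after the training data in time, so the reported errors are not computed on windows that precede the training period.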

4.2. Scenario I: Low-Temporal-Resolution Data Experimentation

The authors trained three distinct GRU configurations: the standard small-scale model with 64 GRUs, the heavy model with 256 GRUs, and the deep large model with multiple GRU layers, each containing 64 GRUs. All the models have also been examined with different stacked-layer configurations (2, 4, and 10) on an hourly-averaged dataset over 100 epochs. The hourly resolution is highly representative of typical smart city IoT deployments, particularly when data transmission and power consumption must be carefully balanced. The learning curves, which illustrate both training and validation RMSE, reveal significant insights into how network complexity affects environmental time-series forecasting.

As shown in Figure 3, the small GRU model with 64 units (standard GRU) achieved the best results for its size and the hourly dataset, converging quickly and yielding a validation RMSE of approximately 0.028, with an $R^{2}$ above 0.98. This result suggests that a lightweight recurrent architecture is sufficient to capture the dominant temporal patterns in hourly water quality data. Table 5 summarizes the validation RMSE and test $R^{2}$ values for all the hourly GRU configurations.

Interestingly, drastically increasing the network’s size to the medium scale of 256 GRUs (heavy GRU) yielded worse results than the standard GRU model, raising the validation RMSE by roughly 0.0084 in absolute terms (0.84%, i.e., less than 1%), from 0.0281 to 0.0365. Despite the heavy model’s larger capacity, this poorer accuracy can be explained by the lower resolution of the training dataset, which contains fewer short- and long-term characteristics and can therefore be captured easily by a less-dense GRU. Increasing the number of units from 64 to 256 thus did not improve performance. On the contrary, the heavy GRU achieved a higher validation RMSE and a slightly lower test $R^{2}$. This indicates that the additional model capacity did not translate into better generalization for the hourly dataset.

The most revealing finding came from the deep large stackable GRU model. Despite its 10 layers of depth, the model struggled with diminishing returns and inherent instability, ultimately plateauing at a significantly higher validation RMSE of 0.053. This provides empirical evidence that blindly adding depth to recurrent neural networks for standard environmental forecasting can be counterproductive, leading to optimization hurdles without improving generalization.

Beyond the error metrics, we evaluated the models’ practical utility by simulating a 24-h-ahead forecasting scenario. The predictions were converted back into their real-world values to calculate the final water quality index (WQI), as illustrated in Figure 4.

As shown in Figure 4, all three hourly-resolution models consistently classified the forecasted water quality within the good zone according to the characterization adopted in this work (WQI = 31–50). The predicted values do not fall in a very poor regime; instead, they remain in a relatively narrow interval of approximately 42–46. The standard 64-unit GRU in Figure 4a provides the closest agreement with the true daily WQI series. Its predictions remain nearly flat around 42.3–42.6 and closely follow the observed mild upward trend. This behavior is physically reasonable. Daily averaged water quality measurements usually exhibit substantial inertia and do not change abruptly unless there are major contamination events. By comparison, the heavier GRU model in Figure 4b shows a systematic positive drift. The predicted WQI rises from about 42.0 to 43.7, whereas the true series remains much more stable. The deep GRU model in Figure 4c amplifies this effect even further. It produces a stronger monotonic overestimation, reaching approximately 45.6 by the end of the forecasting horizon. Therefore, all three models preserve the same category-level interpretation: good water quality throughout the 30-day horizon. However, the standard GRU clearly offers the best practical trade-off between forecast stability, category consistency, and numerical fidelity to the observed daily WQI trajectory. This makes it the most suitable candidate for deployment on resource-constrained edge or end-node Water-QI devices. In such cases, reliable category-level monitoring and low computational overhead are more important than unnecessarily complex architectures. The following Section 4.3 examines the three representative model categories using a minute-resolution temporal dataset.

4.3. Scenario II: High-Temporal-Resolution Data Experimentation

Although the hourly models proved highly efficient for general trend monitoring, relying solely on averaged data might, in theory, obscure critical short-lived anomalies. To investigate whether high-frequency sampling offers a strategic advantage, we trained the exact same three GRU architectures using minute-by-minute data (a massive sequence length of 1440 steps per sample). The most immediate observation from this experiment was the staggering computational toll. Transitioning from an hourly (24 steps) to a minute (1440 steps) resolution lengthened each sequence 60-fold and sharply increased the processing load. Table 6 summarizes the results.

Looking at the RMSE and $R^{2}$ values, the heavy GRU models of 256 and 512 units reached similar losses according to Table 6, with GRU-512 achieving the lowest overall validation RMSE of approximately 0.025548. The standard GRU model of 64 units closely followed with an RMSE of around 0.026. Just like in the hourly experiment, the deep GRU model struggled significantly, stabilizing at a much higher RMSE of around 0.078, reaffirming that excessive depth hinders learning in this context. Furthermore, the wider models (GRU-1024 and GRU-2048) performed similarly or slightly worse than the GRU-512 model. This indicates that, for the provided dataset, extending the GRU width beyond 512 units does not yield better performance (less than 1% improvement in RMSE). Figure 5 presents the training, validation, and evaluation RMSE curves over the training epochs for the representative models (standard, heavy, and deep).

The minute-resolution experiment shows that shallow GRU architectures remain the most effective even under very high temporal granularity. As seen in Figure 5, both the standard GRU (1 layer, 64 units) and the wider shallow variants converge rapidly within the first few epochs and stabilize at very low error levels. The best overall validation RMSE was achieved by the heavy GRU (1 layer, 512 units) with 0.025548, followed almost identically by the 1-layer 256-unit model with 0.025552. Compared with the standard 1-layer 64-GRU model (0.025981), these correspond to small RMSE reductions of 1.67% and 1.65%, respectively, nevertheless above 1%, indicating that increasing width still provides a marginal benefit.

In contrast, the deep GRU (10 layers, 64 units) performed substantially worse, yielding a validation RMSE of 0.078124, which is 200.70% higher than the standard model and 205.79% higher than the heavy model. A similar pattern is observed in the test $R^{2}$ values: the heavy and 256-unit shallow models provide small improvements over the standard architecture, whereas the deep model drops sharply to 0.849364. Overall, these results confirm that, for minute-resolution sequences, widening a shallow GRU still offers minor gains, while excessive depth severely impairs convergence and generalization. The following Section 4.4 explores the use of the examined best-case models and their performance when offering edge inference capabilities to the end-node Water-QI device, and Section 5 provides a summary and discussion of the experimentation.

4.4. Scenario III: Edge Computation Performance of Minute-Resolution Models

Using an ESP32 microcontroller as the central processing unit for on-device GRU inference, our preliminary experimentation showed that only relatively small recurrent models, approximately in the range of 10–32 GRU cells together with their associated parameters, can be loaded within the memory limits of a dual-core 32-bit ESP32 platform with 4–8 MB RAM. Under these constraints, the device can support only hourly-scale inference, typically with a temporal input window of 12–24 past hours, to produce a forecast horizon of 10–24 future hours for a single measurement variable. Consequently, ESP32-class microcontrollers are considered insufficient for multivariate predictive inference with minute-resolution data and subsequent WQI estimation at the edge.

The Raspberry Pi Zero 2W platform was selected for the proposed Water-QI edge prototype due to its 64-bit quad-core ARM processor and 512 MB of RAM, which support the deployment of more demanding minute-resolution GRU models. To evaluate a lower-bound embedded execution scenario, experiments were conducted on this hardware using a 32-bit Raspberry Pi OS configuration, a custom build of TensorFlow 2.4.0 [62], and Python 3.7. Table 7 presents the measured memory footprint and inference time for the GRU architectures, each executed at least 10 times.

Although cloud services provide computation times of less than 0.5 s and millisecond-scale network latencies, this scenario investigates edge inference on the Raspberry Pi Zero 2W to demonstrate that Water-QI nodes can operate independently and deliver both measurement data and water quality index (WQI) predictions. The measured power consumption of the Water-QI node is approximately 0.5 W when idle and nearly 2 W during inference and data transmission over Wi-Fi. Device autonomy is not a critical requirement for urban deployment; however, these results confirm that the Water-QI node remains energy-efficient for long-term autonomous monitoring.
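As a rough worked example of what these power figures imply, the sketch below estimates the average power draw of a node that runs one forecast per update period. It assumes that the node draws 2 W only while inference is running and 0.5 W otherwise, and it ignores the additional Wi-Fi transmission time. The 60 s update period and the 3.872 s inference time (the GRU-256 measurement reported in this section) are used purely for illustration.

```python
IDLE_W = 0.5    # reported idle power of the Water-QI node
ACTIVE_W = 2.0  # reported power during inference and Wi-Fi transmission


def average_power_w(active_s: float, period_s: float) -> float:
    """Duty-cycle-weighted average power over one update period."""
    if not 0.0 <= active_s <= period_s:
        raise ValueError("active time must lie within the update period")
    duty = active_s / period_s
    return IDLE_W + (ACTIVE_W - IDLE_W) * duty


def daily_energy_wh(active_s: float, period_s: float) -> float:
    """Energy consumed over 24 h at the average power level."""
    return average_power_w(active_s, period_s) * 24.0


# Illustrative case: a 3.872 s forecast repeated every 60 s.
avg_w = average_power_w(active_s=3.872, period_s=60.0)
day_wh = daily_energy_wh(active_s=3.872, period_s=60.0)
```

Under these assumptions the average draw stays close to the idle level, because inference occupies only a small fraction of each update period.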

Comparing the inference-time measurements of Table 7 with the minute-resolution validation errors reported in Table 6, a clear speed–accuracy trade-off emerges for the single-layer models. The best numerical validation RMSE is achieved by the GRU-512 model (0.025548), but the GRU-256 model is only 0.0157% worse in RMSE (0.025552) while completing inference 70.70% faster (3.872 s versus 13.215 s). Likewise, the GRU-64 model is 93.71% faster than GRU-512, at the cost of only a 1.69% increase in RMSE. In contrast, increasing the model size beyond 512 units does not yield a meaningful accuracy benefit: GRU-1024 is 275.69% slower than GRU-512, while its RMSE is 0.235% worse; GRU-2048 is 1409.72% slower, and its RMSE is 3.38% worse. Therefore, from an edge-computing perspective, the GRU-256 configuration provides the most favorable practical balance between predictive accuracy and execution speed, followed by GRU-512, which marginally improves accuracy at the cost of a longer inference time, within the marginal context of minute-level inference. A similarly strong conclusion is obtained for deep stacked models. The 10-layer GRU with 64 units per layer requires 6.370 s for a 24-h minute-resolution forecast, which is 666.55% slower than the single-layer GRU-64 model (0.831 s), while its validation RMSE increases from 0.025981 to 0.078124, corresponding to a 200.70% error increase. Hence, deeper stacking is disadvantageous not only in predictive quality but also in edge-execution efficiency. Moreover, for near-real-time minute-scale deployment, a full 1440-point forecast should complete within 60 s to sustain timely rolling updates. Under this criterion, models whose inference time exceeds 60 s cannot provide near-real-time minute-level operation; therefore, the GRU-2048 model with 199.51 s inference time is unsuitable for practical minute-scale edge inference, while GRU-1024 at 49.647 s is close to the operational limit.
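The relative figures quoted in this comparison follow from simple ratios, which the sketch below recomputes from the inference times and validation RMSE values stated in the text. The dictionary keys and helper names are illustrative, and the 60 s budget is the near-real-time criterion described above.

```python
# Inference times (s) and validation RMSE values as quoted in the text.
INFERENCE_S = {
    "GRU-64": 0.831,
    "GRU-256": 3.872,
    "GRU-512": 13.215,
    "GRU-1024": 49.647,
    "GRU-2048": 199.51,
    "GRU-64x10": 6.370,  # 10 stacked layers of 64 units
}
VAL_RMSE = {
    "GRU-64": 0.025981,
    "GRU-256": 0.025552,
    "GRU-512": 0.025548,
    "GRU-64x10": 0.078124,
}
REAL_TIME_BUDGET_S = 60.0


def pct_faster(t: float, t_ref: float) -> float:
    """Time saved relative to the reference, in percent."""
    return (t_ref - t) / t_ref * 100.0


def pct_increase(v: float, v_ref: float) -> float:
    """Relative increase over the reference, in percent."""
    return (v - v_ref) / v_ref * 100.0


ref_t = INFERENCE_S["GRU-512"]
speedup_256 = pct_faster(INFERENCE_S["GRU-256"], ref_t)
speedup_64 = pct_faster(INFERENCE_S["GRU-64"], ref_t)
slowdown_1024 = pct_increase(INFERENCE_S["GRU-1024"], ref_t)
slowdown_2048 = pct_increase(INFERENCE_S["GRU-2048"], ref_t)
slowdown_deep = pct_increase(INFERENCE_S["GRU-64x10"], INFERENCE_S["GRU-64"])
rmse_cost_64 = pct_increase(VAL_RMSE["GRU-64"], VAL_RMSE["GRU-512"])

# Models that finish a full 1440-point forecast within the 60 s budget.
feasible = [name for name, t in INFERENCE_S.items() if t <= REAL_TIME_BUDGET_S]
```

Expressing both speed and error relative to GRU-512 makes the asymmetry visible: GRU-256 saves most of the inference time for a negligible change in RMSE, whereas the wider models multiply the inference time without reducing the error.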

5. Discussion of the Results

To provide a clear comparative evaluation of the reported experiments, Table 8 summarizes the validation RMSE and test $R^{2}$ values achieved by representative GRU architectures across both temporal-resolution scenarios.

For the minute-resolution scenario, increasing model capacity from 64 to 256 GRUs yields a small but measurable improvement in validation accuracy: the validation RMSE decreases from 0.0259 to 0.0255 (a 1.54% improvement), and the test $R^{2}$ rises from 0.9840 to 0.985448. This indicates that modest capacity increases lead to slightly better performance.

Further increasing the number of units to 2048 does not improve RMSE. While the 2048-unit model attains the numerically highest test $R^{2}$ (0.985454), its advantage over the 256-unit model is negligible, and its validation RMSE is approximately 3.57% worse. This saturation effect suggests that larger single-layer GRU models offer no meaningful practical gains for this dataset.

The deep stacked GRU model performs substantially worse than the shallow minute-resolution models, with a validation RMSE of 0.0781 and a test $R^{2}$ of 0.8490. These findings reinforce the conclusion that increasing layer depth is not beneficial under the examined conditions and that shallow GRU architectures generalize more effectively than deeper stacked variants.

Minute-resolution shallow models achieve slightly better validation RMSE than their hourly counterparts. For example, the standard 64-unit GRU improves from 0.0281 in the hourly scenario to 0.0259 in the minute scenario, an RMSE reduction of approximately 7.83%. Likewise, the 256-unit GRU improves from 0.0365 to 0.0255, reducing RMSE by approximately 30.14%. In both settings, shallow architectures outperform deeper stacked variants.

A direct comparison between the best hourly and minute-resolution models further highlights the benefit of finer temporal granularity. The best hourly model, namely the standard GRU with 64 units, achieves a validation RMSE of 0.0281 and a test $R^{2}$ of 0.9820. In contrast, the best minute-resolution model, namely the single-layer GRU with 256 units, achieves a lower validation RMSE of 0.0255 and a higher test $R^{2}$ of 0.985448. This corresponds to an absolute RMSE reduction of 0.0026, or approximately 9.25%, together with an absolute increase of 0.003448 in test $R^{2}$. These results suggest that the minute-resolution setting offers a modest but consistent predictive advantage over the best-performing hourly configuration.

Among the minute-resolution models reported in Table 8, the single-layer 256-unit GRU provides the best trade-off between predictive accuracy and model complexity. Although the 2048-unit model yields a marginally higher test $R^{2}$, its validation RMSE is worse, and its practical advantage is negligible. Therefore, the final results support the use of a shallow single-layer GRU architecture and indicate that performance saturates beyond the moderate-capacity regime, while deeper stacking consistently degrades prediction accuracy.

In the minute-resolution data scenario, the experiments in Section 4.3 show a clear, consistent effect of network layer depth when the number of GRU cells is small. In the 64-cell configurations, increasing the number of layers from 1 to 2, 4, and 10 leads to a progressive deterioration in predictive accuracy, as evidenced by the increase in test RMSE from 0.026 to 0.027, 0.034, and 0.082, respectively, together with the corresponding decrease in $R^{2}$ from 0.984 to 0.983, 0.974, and 0.84. Therefore, deeper stacking is not beneficial for this dataset and instead introduces substantial performance degradation.

For medium-sized (heavy) single-layer models, the experimental results indicate a gradual improvement in predictive accuracy as the number of GRU cells increases from 128 to 256 and 512, while the test $R^{2}$ values remain very high in all three cases. However, these gains are extremely small, especially between the 256-cell and 512-cell models, where the relative RMSE improvement of the 512-cell model on the minute dataset is only about 1%. This suggests that increasing the number of cells beyond 256 yields only marginal benefits in this performance region. The best trade-off for this dataset is achieved by a single-layer GRU model with 512 cells, or equivalently by models in the same saturation region since their predictive differences are minimal. Although the 2048-cell model yields the lowest numerical test RMSE, its advantage over the 512-cell model is too small to justify the four-fold increase in GRU cells. Therefore, the final results support the use of a shallow single-layer architecture and indicate that performance improvement follows a saturation pattern with diminishing returns, while deeper stacking consistently degrades prediction accuracy under the examined experimental conditions.

To further evaluate the proposed models, the best-performing configurations from Scenario I and Scenario II were compared with the best-performing ML model and with a directly relevant GRU-based DL reference model from the literature, as summarized in Table 1 and Table 2, respectively. The water quality score (WQS) was computed using the same formulation adopted throughout this work, with $\alpha = 0.8$, thereby emphasizing the normalized RMSE term while preserving the contribution of the coefficient of determination. For Scenario I, the best-performing model was the standard GRU with one layer and 64 units, as reported in Table 5. This model achieved a validation RMSE of 0.0281 and a test $R^{2}$ of 0.9820, corresponding to a WQS of 0.9739. For Scenario II, the best-performing model was the shallow heavy GRU with one layer and 512 units, as reported in Table 6. This configuration achieved the lowest validation RMSE of 0.025548 and a test $R^{2}$ of 0.985448, resulting in the highest proposed WQS of 0.9767.

Compared with the GRU-based DL reference model in Table 2, reported by [34], which achieved an $R^{2}$ of 0.908, an RMSE of 0.036, and a WQS of 0.9528, both proposed scenarios demonstrate superior performance. Scenario I improves the WQS by approximately 0.0211, while Scenario II improves it by approximately 0.0239. This indicates that the proposed GRU configurations provide stronger regression performance than the hyperparameter-optimized GRU benchmark reported in the literature while maintaining the same recurrent modeling philosophy.
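The WQS values quoted above can be reproduced with a short sketch. It assumes that the score is the convex combination $\alpha (1 - \mathrm{RMSE}) + (1 - \alpha) R^{2}$ of the normalized RMSE and the coefficient of determination. This form is inferred here because it matches the three reported scores with $\alpha = 0.8$, and it should be checked against the formulation defined earlier in this work.

```python
ALPHA = 0.8  # weighting used throughout this work


def wqs(rmse_norm: float, r2: float, alpha: float = ALPHA) -> float:
    """Assumed score: alpha * (1 - normalized RMSE) + (1 - alpha) * R^2."""
    return alpha * (1.0 - rmse_norm) + (1.0 - alpha) * r2


scenario_1 = wqs(rmse_norm=0.0281, r2=0.9820)      # 1 layer, 64 units, hourly
scenario_2 = wqs(rmse_norm=0.025548, r2=0.985448)  # 1 layer, 512 units, minute
reference = wqs(rmse_norm=0.036, r2=0.908)         # GRU benchmark of [34]

gain_1 = scenario_1 - reference
gain_2 = scenario_2 - reference
```

Because $\alpha = 0.8$ places most of the weight on the error term, two models with similar RMSE values receive similar scores even when their $R^{2}$ values differ noticeably.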

Compared with the best ML result reported in Table 1, namely the MLP model of [22], the proposed models achieve slightly lower WQS values. The MLP model reports a WQS of 0.9954, which is higher than both Scenario I and Scenario II. However, as discussed previously, this result was identified as potentially inflated due to overfitting. Therefore, although the reported ML score is numerically higher, the proposed GRU-based models provide a more realistic and practically reliable performance profile for water quality time-series forecasting.

It is also important to compare the experimental results with other machine learning methods, such as XGBoost, CatBoost and gradient boosting, which have dominated the water quality literature. These models are recognized for their speed and accuracy for classification tasks. However, they often fail to capture the temporal dynamics present in time-series sensor measurements, as shown in Table 1. In the conducted benchmarks, the GRU models achieved validation RMSE values as low as 0.0255, demonstrating competitive performance relative to the higher error margins typically observed in ensemble-based forecasting for high-frequency data. Selecting a recurrent architecture instead of a standard gradient-boosting approach involved trading some training simplicity for a more comprehensive understanding of how water parameters change over short time intervals. This decision was crucial for maintaining a test $R^{2}$ above 0.98, indicating that, for near-real-time IoT streams, modeling the temporal sequence is as important as the accuracy of the prediction itself.

The uncertainty of the proposed Water-QI platform originates from three main sources: (a) sensor-level measurement uncertainty, (b) uncertainty propagated through the WQI calculation, and (c) predictive uncertainty introduced by the GRU forecasting models. Since the Water-QI node uses low-cost sensing modules, the reported values should be interpreted as calibrated monitoring estimates rather than certified laboratory measurements. For each measured variable, uncertainty is affected by sensor calibration residuals relative to reference solutions or instruments, as well as by long-term deviations caused by probe aging, fouling, or environmental exposure. These uncertainties are propagated to the final WQI score through the weighted min–max formulation used in this study. As a result, the WQI is more sensitive to pH and turbidity deviations than to temperature, TDS, or EC variations because pH has the highest weight and a narrow acceptable operating range, while turbidity is normalized over a relatively small NTU interval. The GRU prediction uncertainty is represented by the residual error between predicted and observed sequences, with RMSE and $R^{2}$ as the principal indicators, and WQS is used to combine normalized error and goodness of fit. In this context, the best hourly model, namely the single-layer 64-unit GRU, and the best minute-resolution model, namely the single-layer 512-unit GRU, exhibit low normalized predictive error and strong goodness of fit. However, because the training sequences were derived from secondary data rather than native Water-QI field measurements, the reported uncertainty should be interpreted as best-effort minima within the constructed temporal-resolution scenarios.
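The sensitivity argument in this paragraph can be made explicit with a first-order propagation sketch. It assumes that the WQI is a weighted sum of min–max normalized sub-indices and that the sensor errors are independent. The symbols $w_i$, $x_i^{\min}$, $x_i^{\max}$, and $\sigma_i$ are illustrative and may not match the notation or scaling of the formulation defined earlier in this work.

```latex
\mathrm{WQI} = \sum_{i=1}^{5} w_i \, \frac{x_i - x_i^{\min}}{x_i^{\max} - x_i^{\min}},
\qquad
\left| \frac{\partial\, \mathrm{WQI}}{\partial x_i} \right| = \frac{w_i}{x_i^{\max} - x_i^{\min}},
\qquad
\sigma_{\mathrm{WQI}}^{2} \approx \sum_{i=1}^{5} \left( \frac{w_i}{x_i^{\max} - x_i^{\min}} \right)^{2} \sigma_i^{2}
```

Under these assumptions, a variable contributes more to the WQI uncertainty when its weight $w_i$ is large or its normalization range $x_i^{\max} - x_i^{\min}$ is narrow, which matches the behavior described for pH and turbidity.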

6. Conclusions

In this study, we investigated the integration of low-cost IoT sensing with GRU-based deep learning models for near-real-time periodic water quality assessment in smart city environments. The proposed Water-QI platform combines affordable hardware, cloud-supported telemetry, and predictive analytics to estimate water quality behavior using five measured parameters: temperature, TDS, EC, pH, and turbidity. The results confirm that reliable forecasting can be achieved without resorting to excessively complex architectures, which is especially important for practical deployment in budget-constrained urban infrastructures.

The experimental evaluation across hourly and minute-resolution scenarios showed that shallow GRU models consistently outperform deeper stacked alternatives. In the hourly case, the single-layer 64-unit GRU achieved the best overall performance, with a validation RMSE of 0.0281 and a test $R^{2}$ of 0.9820, making it the most suitable solution for low-cost and computationally efficient periodic monitoring. In the minute-resolution case, wider, shallower models provided slightly better predictive accuracy, with the 512-unit GRU achieving the lowest validation RMSE and the 256-unit GRU delivering nearly identical performance at substantially faster inference. These findings indicate that increasing model width yields small gains at very fine temporal granularity, whereas increasing recurrent depth leads to clear degradation in both convergence behavior and generalization.

From a practical edge-computing perspective, the results highlight a clear trade-off between predictive performance and execution cost. Although the 512-unit model achieved the best numerical validation accuracy, the 256-unit model emerged as the most balanced configuration for minute-level forecasting on embedded ARM-based hardware. In contrast, very large or deeply stacked GRU models introduced substantial computational overhead without providing meaningful predictive benefits. Therefore, the experimental evidence supports deploying shallow GRU architectures as the most effective design choice for scalable and resource-aware real-time water quality monitoring systems.

This study has several limitations that should be acknowledged. First, we derived the dataset from monthly open data records and temporally interpolated it to produce hourly and minute-level sequences; although this preprocessing enabled controlled forecasting experiments, the resulting high-resolution series do not fully replicate the behavior of truly continuous field measurements. Specifically, while interpolating the time series to the minute scale provides the necessary data volume for inference testing, the interpolated data inherently lack the stochastic noise, signal drift, and sudden hardware-induced anomalies that are characteristic of long-term field deployments. Consequently, the reported results should be viewed as a baseline for near-real-time performance, which we intend to validate further through genuine high-frequency sensor streams in subsequent research phases. Second, we focused our experiments on a reduced set of five physicochemical parameters in a single geographical context, which may limit the direct generalizability of the findings to other water networks or hydro-environmental conditions. Third, the proposed forecasting framework primarily models normal temporal evolution and does not explicitly address rare contamination incidents, abrupt anomalies, or sensor failures. Furthermore, while the Water-QI framework integrates a dedicated three-step cleaning pipeline to mitigate sensor jitter and minor signal dropouts, its current performance is baseline-oriented toward normal diurnal cycles. Verifying robustness against catastrophic sensor failures or rapid-onset pollution events through abnormality simulation remains a priority for future iterations. These anti-interference experiments will be essential to ensure that the GRU models can distinguish between hardware-induced noise and genuine ecological emergencies in dynamic urban settings. Finally, although we evaluated edge inference on a representative Water-QI node, long-term field validation under real operating conditions, including sensor drift, calibration degradation, communication instability, and environmental interference, remains limited.

Taking the Water-QI framework out of the lab and scaling it across diverse aquatic environments brings up new practical realities. For instance, while the models perform reliably in relatively stable drinking-water setups, deploying these same nodes in urban sewer networks would be entirely different. Sewer environments are highly dynamic, meaning the sensors would face aggressive chemical shifts and severe biofouling, requiring much stricter maintenance schedules and recalibration compared to a calm reservoir. On the algorithmic side, adapting to sudden ecological anomalies remains an open challenge. Predictive deep learning models naturally struggle with concept drift, that is, situations where sudden weather events or toxic spills shift the baseline of what is considered normal water quality. Looking forward, a true city-wide rollout must solve these physical and algorithmic constraints together. Future iterations will need online learning mechanisms to catch these sudden shifts, along with self-sustaining solutions, like small solar setups, to keep the remote nodes running autonomously without constant battery replacements.

Future research will focus on extending the proposed Water-QI system toward real multi-node spatial–temporal deployments across broader urban water networks. A first priority is the collection of genuine high-frequency sensor data from the distributed Water-QI IoT nodes to validate the models under fully realistic operating conditions and reduce reliance on external sources and interpolated sequences. In addition, future work will investigate other model approaches with significant performance footprints, like hybrid 1DCNN, LSTM, and GRU combined with NN learning approaches, for jointly modeling temporal evolution and spatial dependencies among sensing locations. Further directions also include incorporating anomaly detection mechanisms for sudden contamination events, uncertainty-aware prediction, adaptive calibration and drift compensation strategies, and online or federated learning schemes that enable models to continuously improve while maintaining low communication overhead. These extensions will strengthen the robustness, transferability, and operational value of the Water-QI platform for smart city water management.

Finally, this work demonstrates that low-cost IoT sensing, combined with carefully selected shallow GRU models, can provide accurate, computationally feasible water quality forecasting. The study shows that practical predictive performance is achieved not by maximizing architectural complexity but by balancing temporal resolution, model capacity, and deployment constraints, specifically inference speed. In this sense, the proposed Water-QI framework offers a realistic pathway toward scalable, intelligent, and proactive water quality monitoring in smart city environments.

Author Contributions

Conceptualization, S.K.; methodology, S.K. and G.K.; software, S.K. and C.T.; validation, C.T. and S.V.; formal analysis, S.K. and G.K.; investigation, C.T.; resources, S.K. and C.T.; data curation, S.K. and C.T.; writing—original draft preparation, C.T.; review and editing, S.K., S.V. and G.K.; visualization, C.T.; supervision, S.K.; project administration, S.K. and C.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AC	AHP–CatBoost (combined weighting scheme)
ADC	Analog-to-Digital Converter
ADASYN	Adaptive Synthetic Sampling
AE	AHP–EWM (combined weighting scheme)
AHP	Analytic Hierarchy Process
ANN	Artificial Neural Network
ARIMA	Autoregressive Integrated Moving Average
ARM	Advanced RISC Machine
AS	Application Server
CART	Classification and Regression Tree
CatBoost	Categorical Boosting
CNN	Convolutional Neural Network
CPU	Central Processing Unit
DL	Deep Learning
DO	Dissolved Oxygen
DT	Decision Tree
EC	Electrical Conductivity
EWMA	Exponentially Weighted Moving Average
FDOM	Fluorescent Dissolved Organic Matter
GPIO	General-Purpose Input/Output
GRU	Gated Recurrent Unit
HTTP	Hypertext Transfer Protocol
I2C	Inter-Integrated Circuit
IoT	Internet of Things
JSON	JavaScript Object Notation
KNN	K-Nearest Neighbors
LightGBM	Light Gradient Boosting Machine
LR	Logistic Regression
LSTM	Long Short-Term Memory
LTSF	Long-Term Time Series Forecasting
MAE	Mean Absolute Error
ML	Machine Learning
MLP	Multi-Layer Perceptron
MQTT	Message Queuing Telemetry Transport
MSE	Mean Squared Error
NN	Neural Network
NSF-WQI	National Sanitation Foundation Water Quality Index
pH	Potential of Hydrogen
$R^{2}$	Coefficient of Determination
RBF	Radial Basis Function
RF	Random Forest
RMSE	Root Mean Square Error
SCINet	Sample Convolution and Interaction Network
SMOTE	Synthetic Minority Over-Sampling Technique
SSL	Secure Sockets Layer
SVM	Support Vector Machine
TDS	Total Dissolved Solids
TSS	Total Suspended Solids
WQI	Water Quality Index
WQM	Weighted Quadratic Mean
$WQM_{AC}$	Weighted Quadratic Mean model using AHP–CatBoost weights
WQS	Water Quality Score
XGBoost	eXtreme Gradient Boosting

References

1. Bamini, A.; Jengan, C.; Agarwal, S.; Kim, H.; Stephan, P.; Stephan, T. IoT-Based Automatic Water Quality Monitoring System with Optimized Neural Network. KSII Trans. Internet Inf. Syst. 2024, 18, 46–63.
2. Kyritsakas, G. Exploring Machine Learning Applications for Improving Drinking Water Quality. Ph.D. Dissertation, The University of Sheffield, Sheffield, UK, 2021. Available online: https://etheses.whiterose.ac.uk/id/eprint/30179/ (accessed on 12 March 2024).
3. Boccadoro, P.; Daniele, V.; Di Gennaro, P.; Lofù, D.; Tedeschi, P. Water Quality Prediction on a Sigfox-compliant IoT Device: The Road Ahead of WaterS. Ad Hoc Netw. 2022, 126, 102749.
4. El Bilali, A.; Taleb, A.; Brouziyne, Y. Groundwater Quality Forecasting Using Machine Learning Algorithms for Irrigation Purposes. Agric. Water Manag. 2021, 245, 106625.
5. Garzón, A.; Kapelan, Z.; Langeveld, J.; Taormina, R. Machine Learning-Based Surrogate Modeling for Urban Water Networks: Review and Future Research Directions. Water Resour. Res. 2022, 58, e2021WR031808.
6. Lowe, M.; Qin, R.; Mao, X. A Review on Machine Learning, Artificial Intelligence, and Smart Technology in Water Treatment and Monitoring. Water 2022, 14, 1384.
7. Zhu, M.; Wang, J.; Yang, X.; Zhang, Y.; Zhang, L.; Ren, H.; Wu, B.; Ye, L. A Review of the Application of Machine Learning in Water Quality Evaluation. Eco-Environ. Health 2022, 1, 107–116.
8. Iyer, S.; Kaushik, S.; Nandal, P. Water Quality Prediction Using Machine Learning. MR Int. J. Eng. Technol. 2023, 10, 60–62.
9. Karthick, K.; Krishnan, S.; Manikandan, R. Water Quality Prediction: A Data-Driven Approach Exploiting Advanced Machine Learning Algorithms with Data Augmentation. J. Water Clim. Change 2024, 15, 431–452.
10. Khan, P.F.; Zaheen, S.Z.; Sunder, D.P.S.; Shirisha, M.K.; Kotoju, D.R.; Ayvappa, R.M.K. Water Quality Prediction and Classification Using Machine Learning. Int. J. Res. Publ. Rev. 2025, 6, 8425–8435. Available online: https://ijrpr.com/uploads/V6ISSUE5/IJRPR45788.pdf (accessed on 23 January 2026).
11. Garcia, J.; Heo, J.; Kim, C. Machine Learning Algorithms for Water Quality Management Using Total Dissolved Solids (TDS) Data Analysis. Water 2024, 16, 2639.
12. Patil, S.V.; Wankhade, N.R.; Bagal, S.B.; Patel, M.T. Water Quality Analysis and Prediction Using Machine Learning. J. Inf. Syst. Eng. Manag. 2025, 10, 1069–1073.
13. Prabu, P.; Alluhaidan, A.S.; Aziz, R.; Basheer, S. Comparative analysis of machine learning models for detecting water quality anomalies in treatment plants. Sci. Rep. 2025, 15, 30453.
14. Xu, T.; Coco, G.; Neale, M. A Predictive Model of Recreational Water Quality Based on Adaptive Synthetic Sampling Algorithms and Machine Learning. Water Res. 2020, 177, 115788.
15. Hmoud Al-Adhaileh, M.; Waselallah Alsaade, F. Modelling and Prediction of Water Quality by Using Artificial Intelligence. Sustainability 2021, 13, 4259.
16. Padmaja, P.; Sai, C.S.D.; Teja, V.K.; Ragav, A.P.; Babji, P. Water Quality Prediction Using Machine Learning Algorithms. J. Emerg. Technol. Innov. Res. 2023, 10, c711–c721. Available online: https://www.jetir.org/papers/JETIR2304287.pdf (accessed on 10 December 2025).
17. Onyutha, C. Multiple Statistical Model Ensemble Predictions of Residual Chlorine in Drinking Water: Applications of Various Deep Learning and Machine Learning Algorithms. J. Environ. Public Health 2022, 2022, 7104752.
18. Sharaan, M.; Elshemy, M.M.; Fujii, M.; Ibrahim, M.G.; Nada, A.M. Water Quality Prediction and Classification for Drinking Water from Seawater Desalination Plants Using Machine Learning Algorithms. SSRN 2024, 4999808.
19. Ding, F.; Hao, S.; Zhang, W.; Jiang, M.; Chen, L.; Yuan, H.; Wang, N.; Li, W.; Xie, X. Using Multiple Machine Learning Algorithms to Optimize the Water Quality Index Model and Their Applicability. Ecol. Indic. 2025, 172, 113299.
20. Walczak, N.; Walczak, Z. Assessing the Feasibility of Using Machine Learning Algorithms to Determine Reservoir Water Quality Based on a Reduced Set of Predictors. Ecol. Indic. 2025, 175, 113556.
21. Najah Ahmed, A.; Binti Othman, F.; Abdulmohsin Afan, H.; Khaleel Ibrahim, R.; Ming Fai, C.; Shabbir Hossain, M.; Ehteram, M.; Elshafie, A. Machine Learning Methods for Better Water Quality Prediction. J. Hydrol. 2019, 578, 124084.
22. Shams, M.Y.; Elshewey, A.M.; El-kenawy, E.S.M.; Ibrahim, A.; Talaat, F.M.; Tarek, Z. Water quality prediction using machine learning models based on grid search method. Multimed. Tools Appl. 2024, 83, 35307–35334.
23. Lu, H.; Ma, X. Hybrid Decision Tree-Based Machine Learning Models for Short-Term Water Quality Prediction. Chemosphere 2020, 249, 126169.
24. Choudhary, R.; Kumar, A.; C., P.; Naik, M.M.; Choudhury, M.; Khan, N.A. Predicting water quality index using stacked ensemble regression and SHAP based explainable artificial intelligence. Sci. Rep. 2025, 15, 31139.
25. Elmotawakkili, A.; Enneya, N.; Bhagat, S.K.; Ouda, M.M.; Kumar, V. Advanced Machine Learning Models for Robust Prediction of Water Quality Index and Classification. J. Hydroinform. 2025, 27, 299–319.
26. Lokman, A.; Ismail, W.Z.W.; Aziz, N.A.A. A Review of Water Quality Forecasting and Classification Using Machine Learning Models and Statistical Analysis. Water 2025, 17, 2243.
27. Chen, J.; Wei, X.; Liu, Y.; Zhao, C.; Liu, Z.; Bao, Z. Deep Learning for Water Quality Prediction—A Case Study of the Huangyang Reservoir. Appl. Sci. 2024, 14, 8755.
28. Yan, X.; Zhang, T.; Du, W.; Meng, Q.; Xu, X.; Zhao, X. A Comprehensive Review of Machine Learning for Water Quality Prediction over the Past Five Years. J. Mar. Sci. Eng. 2024, 12, 159.
29. Islam, N.; Irshad, K. Artificial Ecosystem Optimization with Deep Learning Enabled Water Quality Prediction and Classification Model. Chemosphere 2022, 309, 136615.
30. Rizal, N.N.M.; Hayder, G.; Yusof, K.A. Water Quality Predictive Analytics Using an Artificial Neural Network with a Graphical User Interface. Water 2022, 14, 1221.
31. Wang, X.; Li, Y.; Qiao, Q.; Tavares, A.; Liang, Y. Water Quality Prediction Based on Machine Learning and Comprehensive Weighting Methods. Entropy 2023, 25, 1186.
32. Prasad, D.V.V.; Venkataramana, L.Y.; Kumar, P.S.; Prasannamedha, G.; Harshana, S.; Srividya, S.J.; Harrinei, K.; Indraganti, S. Analysis and Prediction of Water Quality Using Deep Learning and Auto Deep Learning Techniques. Sci. Total Environ. 2022, 821, 153311.
33. Chen, H.; Yang, J.; Fu, X.; Zheng, Q.; Song, X.; Fu, Z.; Wang, J.; Liang, Y.; Yin, H.; Liu, Z.; et al. Water Quality Prediction Based on LSTM and Attention Mechanism: A Case Study of the Burnett River, Australia. Sustainability 2022, 14, 13231.
34. Rahul Gandh, D.; Rasheed Abdul Haq, K.P.; Harigovindan, V.P.; Bhide, A. LSTM and GRU based Accurate Water Quality Prediction for Smart Aquaculture. J. Phys. Conf. Ser. 2023, 2466, 012027.
35. Cai, H.; Zhang, C.; Xu, J.; Wang, F.; Xiao, L.; Huang, S.; Zhang, Y. Water Quality Prediction Based on the KF-LSTM Encoder-Decoder Network: A Case Study with Missing Data Collection. Water 2023, 15, 2542.
36. Eze, E.; Kirby, S.; Attridge, J.; Ajmal, T. Aquaculture 4.0: Hybrid Neural Network Multivariate Water Quality Parameters Forecasting Model. Sci. Rep. 2023, 13, 16129.
37. Sathya Preiya, V.M.; Subramanian, P.; Soniya, M.; Pugalenthi, R.; M, S.P.V. Water Quality Index Prediction and Classification Using Hyperparameter Tuned Deep Learning Approach. Glob. NEST J. 2024, 26, 1–8.
38. Perumal, B.; Rajarethinam, N.; Velusamy, A.D.; Sundramurthy, V.P. Water Quality Prediction Based on Hybrid Deep Learning Algorithm. Adv. Civ. Eng. 2023, 2023, 6644681.
39. Im, Y.; Song, G.; Lee, J.; Cho, M. Deep Learning Methods for Predicting Tap-Water Quality Time Series in South Korea. Water 2022, 14, 3766.
40. Kontogiannis, S.; Gkamas, T.; Pikridas, C. Deep Learning Stranded Neural Network Model for the Detection of Sensory Triggered Events. Algorithms 2023, 16, 202.
41. Gao, S.; Huang, Y.; Zhang, S.; Han, J.; Wang, G.; Zhang, M.; Lin, Q. Short-term runoff prediction with GRU and LSTM networks without requiring time step optimization during sample generation. J. Hydrol. 2020, 589, 125188.
42. Tornyeviadzi, H.M.; Seidu, R. Leakage detection in water distribution networks via 1D CNN deep autoencoder for multivariate SCADA data. Eng. Appl. Artif. Intell. 2023, 122, 106062.
43. Jaffar, A.; Thamrin, N.M.; Ali, M.S.A.M.; Misnan, M.F.; Yassin, A.I.M. Water Quality Prediction Using LSTM-RNN: A Review. J. Sustain. Sci. Manag. 2022, 17, 204–225.
44. ThingsBoard. ThingsBoard Open-source IoT Platform. 2019. Available online: https://thingsboard.io/ (accessed on 10 November 2021).
45. Apache Foundation. Cassandra, Open Source NoSQL Database. 2015. Available online: https://cassandra.apache.org/ (accessed on 1 August 2021).
46. Kontogiannis, S.; Koundouras, S.; Pikridas, C. Proposed Fuzzy-Stranded-Neural Network Model That Utilizes IoT Plant-Level Sensory Monitoring and Distributed Services for the Early Detection of Downy Mildew in Viticulture. Computers 2024, 13, 63.
47. ThingsBoard. ThingsBoard Mobile Application. 2024. Available online: https://github.com/thingsboard/flutter_thingsboard_app (accessed on 20 September 2025).
48. Brown, R.M.; McClelland, N.I.; Deininger, R.A.; Tozer, R.G. A Water Quality Index—Crashing the Psychological Barrier. Water Sew. Work. 1970, 117, 339–343.
49. Horton, R.K. An Index Number System for Rating Water Quality. J. Water Pollut. Control. Fed. 1965, 37, 300–306.
50. Uddin, M.G.; Nash, S.; Olbert, A.I. A review of water quality index models and their use for assessing surface water quality. Ecol. Indic. 2021, 122, 107218.
51. Patel, D.D.; Mehta, D.J.; Azamathulla, H.M.; Shaikh, M.M.; Jha, S.; Rathnayake, U. Application of the Weighted Arithmetic Water Quality Index in Assessing Groundwater Quality: A Case Study of the South Gujarat Region. Water 2023, 15, 3512.
52. Abrahão, R.; Carvalho, M.; da Silva, W.R., Jr.; Machado, T.T.V.; Gadelha, C.L.M.; Hernandez, M.I.M. Use of Index Analysis to Evaluate the Water Quality of a Stream Receiving Industrial Effluents. Water SA 2007, 33, 459–466.
53. Lumb, A.; Sharma, T.C.; Bibeault, J.F. A Review of Genesis and Evolution of Water Quality Index (WQI) and Some Future Directions. Water Qual. Expo. Health 2011, 3, 11–24.
54. Garcia, C.A.B.; Silva, I.S.; Mendonça, M.C.S.; Garcia, H.L. Evaluation of Water Quality Indices: Use, Evolution and Future Perspectives. In Advances in Environmental Monitoring and Assessment; Sarvajayakesavalu, S., Ed.; IntechOpen: London, UK, 2018; Chapter 2.
55. World Health Organization. Guidelines for Drinking-Water Quality: Fourth Edition Incorporating the First and Second Addenda. 2022. Available online: https://www.who.int/publications/i/item/9789240045064 (accessed on 15 November 2025).
56. United States Environmental Protection Agency. Drinking Water Regulations and Contaminants. 2025. Available online: https://www.epa.gov/ground-water-and-drinking-water/national-primary-drinking-water-regulations (accessed on 10 December 2025).
57. Psaropa, M.X.; Kontogiannis, S.; Lolis, C.J.; Hatzianastassiou, N.; Pikridas, C. A Proposed Deep Learning Framework for Air Quality Forecasts, Combining Localized Particle Concentration Measurements and Meteorological Data. Appl. Sci. 2025, 15, 7432.
58. Jiang, Y.; Li, C.; Sun, L.; Guo, D.; Zhang, Y.; Wang, W. A Deep Learning Algorithm for Multi-Source Data Fusion to Predict Water Quality of Urban Sewer Networks. J. Clean. Prod. 2021, 318, 128533.
59. Nong, X.; He, Y.; Chen, L.; Wei, J. Machine Learning-Based Evolution of Water Quality Prediction Model: An Integrated Robust Framework for Comparative Application on Periodic Return and Jitter Data. Environ. Pollut. 2025, 369, 125834.
60. Ahmed, U.; Mumtaz, R.; Anwar, H.; Shah, A.A.; Irfan, R.; García-Nieto, J. Efficient Water Quality Prediction Using Supervised Machine Learning. Water 2019, 11, 2210.
61. EYATH, S.A. Water Measurements in the Area of Thessaloniki, Greece. Public Page Linking to Area-Level Water Quality Measurements and Historical Data. 2026. Available online: https://quality.eyath.gr/ (accessed on 12 January 2026).
62. Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI’16, Savannah, GA, USA, 2–4 November 2016; pp. 265–283. Available online: https://dl.acm.org/doi/10.5555/3026877.3026899 (accessed on 9 May 2026).

Figure 1. Proposed system architecture of the Water-QI system.

Figure 2. IoT water-quality sensing node architecture and physical prototype: (a) connectivity diagram, where the Raspberry Pi Zero 2 W communicates with the ADS1115 analog-to-digital converter through the I²C interface; (b) proof-of-concept implementation of the sensing node, showing the connectivity of the analog sensors to the Raspberry Pi Zero 2 W.

Figure 3. Training and validation RMSE across 100 epochs for the three evaluated GRU architectures on hourly resolution data: (a) Standard GRU with one layer and 64 units; (b) Heavy GRU with one layer and 256 units; (c) Deep GRU with 10 layers and 64 GRU units per layer.

Figure 4. Next 24-h WQI prediction using the hourly-resolution models: (a) Standard GRU forecast with one layer and 64 units; (b) Heavy GRU forecast with one layer and 256 units; (c) Deep GRU forecast with 10 layers and 64 GRU units per layer.

Figure 5. Training and validation RMSE for the minute-resolution models across 100 epochs: (a) Standard GRU with one layer and 64 units; (b) Heavy GRU with one layer and 512 units; (c) Deep GRU with 10 layers and 64 GRU units per layer.

Table 1. Performance of machine learning models and shallow neural network architectures in water quality prediction and classification tasks.

Regression Tasks
$R^{2}$ Value	RMSE Value	Score (WQS)	Model Architecture	Superficial	Study
0.9239	0.0540	0.9416	ANFIS (5 hidden layers, NN/Sugeno fuzzy)	Probably-simple model	[15] (Table 4)
0.9992	0.3377	0.7299	MLR (linear regression, 20 input parameters)	Yes, underfitting	[18] (Table 6)
0.998	0.00529	0.9954	MLP (numerous hidden layers, unspecified)	Yes, overfitting	[22] (Table 9)
0.99	1.55	−0.242	Extra trees regressor	Yes, high RMSE/non-normalized attributes	[26] (Table 4)
0.7485	2.6835	−1.1971	Gradient boosting (WQI regression, 4 parameters)	High RMSE/non-normalized attributes	[16] (Table 5)
0.99	1.07	0.142	Ensemble XGBoost, CatBoost, RF, gradient boosting, extra trees and AdaBoost	No	[24] (Figure 4 & Table 11S)
0.000	0.028	0.78	Linear regression model (LRM)	Small dataset, superficial fit of $R^{2} = 1$, set to zero	[20] (Table 4)
- *	0.241	0.607	LTSF-Linear	Simple architecture	[27] (Table 2)
0.94	0.15	0.868	WDT–ANFIS	No	[21] (Table 5 & Figure 12)
0.736	0.054	0.904	Multi-model ensembles (RF+NN+Gaussian process regressor)	No	[17] (Table 1)
- *	0.59/0.35	0.52	CEEMDAN–XGBoost/CEEMDAN–RF	No (hybrid)	[23] (Table 3)
0.842	0.0387	0.9279	$WQM_{AC}$ (WQI model using AHP–CatBoost weights and weighted quadratic mean aggregation)	Best CatBoost-based case	[19] (Figure 8e)
Classification Tasks
	Metric	Metric Value	Model Architecture	Superficial	Study
	Accuracy	0.982	Random forest classifier	Yes, small dataset	[26] (Table 5)
	Accuracy	0.963	XGBoost (without SMOTE)	No	[9] (Table 5)
	Accuracy	0.9996	XGBoost	Possible overfitting/very high accuracy	[25] (Table 2)
	Accuracy	0.64	Support vector machine (SVM)	Yes (imbalanced dataset)	[12] (Results Section)
	Accuracy	0.8506	Random forest	No	[16] (Table 7)
	Accuracy	0.995	Gradient boosting (GB)	No	[22] (Table 6)
	Accuracy	0.69	Support vector machine (SVM)	Yes (poor minority class prediction)	[8] (Table 1/Results & Discussion Section)
	Accuracy	0.8918	Encoder–decoder (anomaly detection)	No	[13] (Table 9)
	Accuracy	0.92	MLP–ANN	No	[14] (Section 4.4)
	Accuracy	0.95	Support vector machine (SVM)	No	[10] (Section III & Figure 6)
	Accuracy	1	Decision tree & random forest	Yes (multicollinearity & data leakage/overfitting)	[11] (Tables 3 and 4)

* Values with no calculated $R^{2}$ are considered as $R^{2} \to 0$. Best WQS score values are bold.

Table 2. Performance of deep learning models in water quality prediction and classification tasks.

Regression Tasks
$R^{2}$ Value	RMSE Value	Score (WQS)	Model Architecture	Superficial	Study
0.953	0.130	0.8866	AT-LSTM (attention mechanism)	No	[33] (Table 4)
0.94	0.40	0.668	KF-LSTM (Kalman filter)	No	[35] (Table 3)
- *	0.008	0.7936	SCINet (1D CNN–NN hybrid model)	No	[39] (Table 7 mean values)
0.964	0.043	0.958	OSBiGRU	GRU hybrid	[29] (Tables 1 and 2)
0.908	0.036	0.9528	GRU (hyperparameter-optimized)	No	[34] (Figure 4)
0.957	0.0489	0.9523	EEMD–MLR–LSTM (hybrid)	No	[36] (Table 3)
0.94	0.083	0.9216	LSTM–GWO–FSO (metaheuristic)	No	[38] (Table 1)
0.882	1.827	−0.4852	LSTM (temporal modeling)	Yes, high RMSE	[31] (Table 5)
0.97	0.019	0.9782	NN-10 hidden layers	Small NN model	[30] (Table 1 mean values)
0.985	0.0378	0.9668	LSTM (standard)	No	[25] (Table 3)
Classification Tasks
	Metric	Metric Value	Model Architecture	Superficial	Study
	Accuracy	0.96	OSBiGRU (hybrid optimization)	No	[29] (Table 4)
	Accuracy	0.951	CNN (convolutional)	No	[32] (Table 3)
	Accuracy	0.926	LSTM (binary classification)	No	[32] (Table 3)
	Accuracy	0.9222	LSTM-GOA (grasshopper opt.)	No	[37] (Table 2)

* Values with no calculated $R^{2}$ are considered as $R^{2} \to 0$. Best WQS score values are bold.

Table 3. WQI interpretation classes and parameter sub-index formulas used in the proposed Water-QI edge-device implementation.

Category	Water Quality classification Index (WQI), in this paper	Interpretation/Formula
Excellent	0–30	Water quality is considered very good.
Good	31–50	Water quality is acceptable with minor concerns.
Poor	51–70	Water quality shows noticeable degradation.
Bad	71–90	Water quality is unsuitable without treatment.
Very bad	91–100	Water quality is severely degraded.
NSF-WQI attribute indices (value 1.0 is better)
Turbidity	0–5 NTU	$Q_{Tb} = 100 \cdot \max\left(0, \min\left(1, \frac{5 - Tb}{5}\right)\right)$
pH	6.5–8.5	$Q_{pH} = 100 \cdot \max\left(0, 1 - \frac{|pH - 7.0|}{1.5}\right)$
Temp	0–40 °C	$Q_{T} = 100 \cdot \max\left(0, \min\left(1, \frac{40 - T}{40}\right)\right)$
TDS	0–500 mg/L	$Q_{TDS} = 100 \cdot \max\left(0, \min\left(1, \frac{500 - TDS}{500}\right)\right)$
EC	0–2000 μS/cm	$Q_{EC} = 100 \cdot \max\left(0, \min\left(1, \frac{2000 - EC}{2000}\right)\right)$
Min–max-normalized implementation used in this work (value 0.0 is better)
Turbidity	0–5 NTU	$Tb^{norm} = \frac{Tb - Tb_{min}}{Tb_{max} - Tb_{min}}$
pH	6.5–8.5	$pH^{norm} = \frac{|pH - 7.5|}{1.5}$
Temp	0–40 °C	$T^{norm} = \frac{T - T_{min}}{T_{max} - T_{min}}$
TDS	0–500 mg/L	$TDS^{norm} = \frac{TDS - TDS_{min}}{TDS_{max} - TDS_{min}}$
EC	0–2000 μS/cm	$EC^{norm} = \frac{EC - EC_{min}}{EC_{max} - EC_{min}}$
WQI index	$WQI = 100 \cdot \frac{1.5\, T^{norm} + 2.0\, TDS^{norm} + 2.0\, EC^{norm} + 2.5\, pH^{norm} + 2.0\, Tb^{norm}}{10}$
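
The min–max-normalized WQI of Table 3 can be illustrated with a minimal Python sketch. It is not the authors' edge-device code. It assumes that the minimum and maximum of each parameter are the range limits listed in the table, that out-of-range readings are clipped to [0, 1], and that the upper bound of each class is inclusive; the names `BOUNDS`, `WEIGHTS`, `compute_wqi`, and `classify` are hypothetical.

```python
# Parameter bounds from the ranges listed in Table 3. Treating them as the
# min/max of the normalization is an assumption of this sketch.
BOUNDS = {
    "temp": (0.0, 40.0),    # degrees Celsius
    "tds": (0.0, 500.0),    # mg/L
    "ec": (0.0, 2000.0),    # microsiemens per cm
    "tb": (0.0, 5.0),       # NTU
}
# Weights from the WQI row of Table 3; they sum to 10.
WEIGHTS = {"temp": 1.5, "tds": 2.0, "ec": 2.0, "ph": 2.5, "tb": 2.0}


def clip01(x):
    """Clamp to [0, 1] (assumption: out-of-range readings saturate)."""
    return max(0.0, min(1.0, x))


def compute_wqi(temp, tds, ec, ph, tb):
    """Return the WQI in [0, 100]; lower values mean better water quality."""
    raw = {"temp": temp, "tds": tds, "ec": ec, "tb": tb}
    norm = {}
    for name, value in raw.items():
        lo, hi = BOUNDS[name]
        norm[name] = clip01((value - lo) / (hi - lo))
    # pH is scored by its distance from 7.5, scaled by 1.5 (Table 3).
    norm["ph"] = clip01(abs(ph - 7.5) / 1.5)
    weighted = sum(WEIGHTS[name] * norm[name] for name in WEIGHTS)
    return 100.0 * weighted / 10.0


def classify(wqi):
    """Map a WQI value to the Table 3 classes (inclusive upper bounds assumed)."""
    for upper, label in ((30, "Excellent"), (50, "Good"), (70, "Poor"), (90, "Bad")):
        if wqi <= upper:
            return label
    return "Very bad"


reading = {"temp": 20.0, "tds": 250.0, "ec": 1000.0, "ph": 7.5, "tb": 2.5}
wqi = compute_wqi(**reading)
print(wqi, classify(wqi))
```

For this mid-range reading, every normalized term except pH equals 0.5, so the weighted sum is 3.75 and the index is 37.5, which falls in the "Good" class.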

Table 4. Training hyperparameters of the GRU forecasting models.

Hyperparameter	Value	Description
Historical depth window (SEQ_LEN)	1440 (minute)/24 (hourly)	Number of past observations used as input. This corresponds to 1440 min (24 h) for minute-resolution data or 24 hourly samples (24 h) for hourly-resolution data.
Prediction horizon (PRED_LEN)	1440 (minute)/24 (hourly)	Number of future observations predicted by the model. This corresponds to forecasting the next 1440 min (24 h) for minute data or the next 24 hourly steps (24 h) for hourly data.
Number of input features	5	Multivariate input composed of temperature, TDS, EC, pH, and turbidity.
Number of GRU layers (L)	1, 2, 4, or 10	Number of stacked GRU layers. The baseline architecture uses a single GRU layer; the deep variants reported in Tables 5 and 6 stack 2, 4, or 10 layers.
GRUs/layer (U)	64, 128, 256, 512, 1024, or 2048	The number of GRUs/layer.
Batch normalization	Yes	Applied after the GRU layer to stabilize learning.
Dropout rate	0.2	Dropout applied after batch normalization for regularization.
Optimizer	Adam	Optimization algorithm used for training.
Learning rate	0.001	Initial learning rate of the Adam optimizer.
Epochs	100	Maximum number of training epochs.
Batch size	16	Number of samples per gradient update.
Dense output size	$1440 \times 5$ for minute resolution, $24 \times 5$ for hour resolution	Final fully connected layer producing all future values before reshaping to $(1440, 5)$ or $(24, 5)$.
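
To make the roles of SEQ_LEN, PRED_LEN, and the dense output size in Table 4 concrete, the following sketch builds input/target pairs from a multivariate series. It is only an illustration under stated assumptions: the function name `make_windows`, the stride of one step between consecutive windows, and the synthetic 72-hour series are hypothetical and are not taken from the paper.

```python
N_FEATURES = 5  # temperature, TDS, EC, pH, turbidity (Table 4)


def make_windows(series, seq_len, pred_len):
    """Split a series (list of feature rows) into (input, target) pairs.

    Each input holds `seq_len` consecutive rows and each target holds the
    `pred_len` rows that follow. A stride of one step is assumed here.
    """
    samples = []
    n_windows = len(series) - seq_len - pred_len + 1
    for start in range(max(0, n_windows)):
        split = start + seq_len
        samples.append((series[start:split], series[split:split + pred_len]))
    return samples


# Synthetic placeholder data: 72 hourly rows with 5 features each.
hourly = [[float(t)] * N_FEATURES for t in range(72)]
pairs = make_windows(hourly, seq_len=24, pred_len=24)

x0, y0 = pairs[0]
# Flattening the target gives the dense output size of Table 4 (24 x 5 values).
flat_target = [value for row in y0 for value in row]
print(len(pairs), len(x0), len(y0), len(flat_target))
```

With 72 hourly rows and SEQ_LEN = PRED_LEN = 24, the sketch yields 25 windows, and each flattened target has 120 values. The same function with SEQ_LEN = PRED_LEN = 1440 gives flattened targets of 7200 values for the minute-resolution setup.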

Table 5. Detailed performance evaluation for all GRU architectures using the hourly resolution dataset.

Model Architecture (Hourly)	Validation RMSE	Test $R^{2}$	WQS
Standard GRU (1 layer—64 units)	0.0281	0.9820	0.9739
Heavy GRU (256 units)	0.0365	0.9796	0.9667
Deep GRU (2 layers—64 units)	0.0389	0.9756	0.9640
Deep GRU (4 layers—64 units)	0.0405	0.9541	0.9584
Deep GRU (10 layers—64 units)	0.0529	0.9246	0.9426

Note: Bold values indicate the best-performing results for each evaluation metric.
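
The WQS column can be cross-checked against the reported RMSE and $R^{2}$ values. The tabulated scores in Tables 5 and 6 are consistent with a weighted combination $0.2 \cdot R^{2} + 0.8 \cdot (1 - RMSE)$. The sketch below assumes that weighting, which is inferred here from the tabulated numbers rather than restated in this section, and it applies the footnote convention of Tables 1 and 2 that a missing $R^{2}$ is treated as zero. The function name `wqs` is hypothetical.

```python
def wqs(rmse, r2=None, w_r2=0.2, w_rmse=0.8):
    """Water quality score from RMSE and R^2.

    The 0.2/0.8 weights are an assumption inferred from the tabulated
    values. A missing R^2 is treated as 0, following the table footnotes.
    """
    if r2 is None:
        r2 = 0.0
    return w_r2 * r2 + w_rmse * (1.0 - rmse)


# Hourly models of Table 5: (validation RMSE, test R^2, reported WQS).
rows = [
    ("Standard GRU (1 layer, 64 units)", 0.0281, 0.9820, 0.9739),
    ("Heavy GRU (256 units)", 0.0365, 0.9796, 0.9667),
    ("Deep GRU (10 layers, 64 units)", 0.0529, 0.9246, 0.9426),
]
for name, rmse, r2, reported in rows:
    print(f"{name}: computed {wqs(rmse, r2):.4f}, reported {reported:.4f}")
```

Because the RMSE term carries most of the weight, a model with a high $R^{2}$ but a large, non-normalized RMSE receives a low or even negative score, which matches the negative WQS entries in Tables 1 and 2.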

Table 6. Detailed performance evaluation for all GRU architectures using the minute-resolution dataset.

Model Architecture	Layers	Validation RMSE	Test $R^{2}$	WQS
GRU (64 units)	1 (Standard GRU)	0.025981	0.984846	0.9762
	2	0.027072	0.983401	0.9750
	4	0.035431	0.974196	0.9665
	10 (Deep GRU)	0.078124	0.849364	0.9074
GRU (128 units)	1	0.025697	0.985331	0.9765
	4	0.031230	0.976415	0.9703
GRU (256 units)	1	0.025552	0.985445	0.9766
	4	0.028994	0.937943	0.9644
GRU (512 units)	1 (Heavy GRU)	0.025548	0.985448	0.9767
	2	0.027008	0.976421	0.9737
GRU (1024 units)	1	0.025608	0.985260	0.9766
GRU (2048 units)	1	0.026411	0.985454	0.9760

Note: Bold values indicate the best-performing results for each evaluation metric.

Table 7. Inference performance of the examined GRU architectures on a quad-core 32-bit edge device for a 24-h forecasting horizon using the minute-level setup that predicts $1440 \times 5$ minute-resolution samples. Memory values correspond to the approximate FP32 footprint of the loaded model, while inference times are rough ARM CPU-only estimates.

Model	Loaded Model Memory (MB)	Minute-Resolution 24 h (1440-Point) Inference (s)
GRU-64	15.23	0.831
GRU-256	25.74	3.872
GRU-512	35.13	13.215
GRU-1024	61.13	49.647
GRU-2048	113.61	199.51
Stacked GRU (10 × 64)	80.08	6.370
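
The trade-off between accuracy and edge latency can be sketched as a simple selection rule over the values of Tables 6 and 7: among the models that fit a latency budget and a memory budget, pick the one with the highest WQS. This is a hypothetical illustration, not a procedure described in the paper; the names `Candidate`, `CANDIDATES`, and `pick_model`, the example budgets, and the tie-breaking by lower latency are assumptions.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    memory_mb: float   # loaded model memory (Table 7)
    latency_s: float   # 24-h minute-resolution inference time (Table 7)
    wqs: float         # minute-resolution WQS (Table 6)


CANDIDATES = [
    Candidate("GRU-64", 15.23, 0.831, 0.9762),
    Candidate("GRU-256", 25.74, 3.872, 0.9766),
    Candidate("GRU-512", 35.13, 13.215, 0.9767),
    Candidate("GRU-1024", 61.13, 49.647, 0.9766),
    Candidate("GRU-2048", 113.61, 199.51, 0.9760),
    Candidate("Stacked GRU (10 x 64)", 80.08, 6.370, 0.9074),
]


def pick_model(max_latency_s: float,
               max_memory_mb: float = float("inf")) -> Optional[Candidate]:
    """Return the feasible candidate with the highest WQS, or None.

    Ties in WQS are broken in favor of the lower latency (an assumption).
    """
    feasible = [c for c in CANDIDATES
                if c.latency_s <= max_latency_s and c.memory_mb <= max_memory_mb]
    if not feasible:
        return None
    return max(feasible, key=lambda c: (c.wqs, -c.latency_s))


for budget in (1.0, 60.0):
    choice = pick_model(budget)
    print(budget, choice.name if choice else None)
```

With a 1 s budget only GRU-64 is feasible, whereas a 60 s budget admits GRU-512, whose WQS is higher by only 0.0005 at roughly sixteen times the inference time.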

Table 8. Performance metrics of representative GRU architectures for WQI prediction.

Scenario (Resolution)	Model Architecture	Validation RMSE	Test $R^{2}$
Scenario I (Hourly)	Standard GRU (64 units)	0.0281	0.9820
	Heavy GRU (256 units)	0.0365	0.9796
	Deep GRU (10 layers—64 units)	0.0529	0.9246
Scenario II (Minute)	Standard GRU (64 units)	0.025981	0.984846
	Heavy GRU (512 units)	0.025548	0.985448
	Very heavy GRU (2048 units)	0.026411	0.985454
	Deep GRU (10 layers—64 units)	0.078124	0.849364

Note: Bold values indicate the best-performing results for each evaluation metric.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
