WaQuPs: A ROS-Integrated Ensemble Learning Model for Precise Water Quality Prediction

Water presents challenges in swiftly and accurately assessing its quality due to its intricate composition, diverse sources, and the emergence of new pollutants. Current research tends to oversimplify water quality, categorizing it as potable or not, despite its complexity. To address this, we developed a water quality prediction system (WaQuPs), a sophisticated solution tackling the intricacies of water quality assessment. WaQuPs employs advanced machine learning, including an ensemble learning model, categorizing water quality into nuanced levels: potable, lightly polluted, moderately polluted, and heavily polluted. To ensure rapid and precise dissemination of information, WaQuPs integrates an Internet of Things (IoT)-based communication protocol for the efficient delivery of detected water quality results. In its development, we utilized advanced techniques, such as random oversampling (ROS) for dataset balance. We used a correlation coefficient to select relevant features for the ensemble learning algorithm based on the Random Forest algorithm. Further enhancements were made through hyperparameter tuning to improve the prediction accuracy. WaQuPs exhibited impressive metrics, achieving an accuracy of 83%, precision of 82%, recall of 83%, and an F1-score of 82%. Comparative analysis revealed that WaQuPs with the Random Forest model outperformed both the XGBoost and CatBoost models, confirming its superiority in predicting water quality.


Introduction
Water is an irreplaceable natural resource, serving as a foundational element for the human body and playing an indispensable role in our survival. Beyond sustaining bodily functions, water is integral to a myriad of daily activities, including cooking, washing, and bathing. Nevertheless, swiftly and accurately assessing water quality poses a significant challenge due to its intricate composition, diverse sources, and the introduction of new pollutants. The task of promptly and precisely evaluating processed water quality remains an enduring challenge yet to be fully resolved. To address this issue, one promising avenue involves the application of machine learning techniques.
Numerous studies have explored the application of machine learning in the classification and prediction of water quality. Iyer et al. [1] conducted research on predicting water quality using machine learning, employing SVM, Random Forest, and Decision Tree models. Their findings revealed that the Random Forest model surpassed the other models, achieving an accuracy rate of 68%. However, this accuracy remained below the threshold of 70%.
Continuing the exploration of water quality prediction, Sen et al. [2] conducted further research focusing on addressing challenges in aquaculture through the introduction of an intelligent machine learning and IoT-based biofloc system. This research aimed to enhance efficiency, production, water recycling, and automatic feeding within the aquaculture domain. The system integrates water quality prediction capabilities using machine learning models, including Decision Tree classification and Random Forest. Moreover, real-time monitoring is facilitated through an Android app. The outcomes of this study were promising, with Random Forest achieving an accuracy rate of 73.76%.
Xin and Mou [3] conducted research on water quality using a multimodal-based machine learning algorithm. The categories used in this research consisted of only two labels, namely potable or not. The researchers utilized the LGBM, CatBoost, and XGBoost algorithms to develop an ensemble learning model from a dataset containing metal elements in water. Employing a 10-fold cross-validation approach and fine-tuning the hyperparameters, XGBoost demonstrated the highest accuracy, with an average of 79%. While this study exhibited relatively robust accuracy, there remains room for potential enhancements in this domain.
Subsequently, Patel et al. [4] conducted an extensive analysis, evaluating the performance of 15 machine learning models for water quality classification. Similar to prior studies, they classified water quality according to a binary distinction, i.e., potable or non-potable. Following their initial assessment, the top five models exhibiting the highest accuracy were selected and further refined through hyperparameter optimization. Among these models, Random Forest emerged as the top performer, achieving an accuracy rate of 81%. While this accuracy surpassed that of previous research, exceeding 80%, it is important to note that the classification in this study remained limited to only two categories: potable or not.
In another study, Ahmed et al. [5] addressed the critical issue of declining water quality attributed to rapid urbanization and industrialization, posing considerable health risks. The study considered the application of supervised machine learning algorithms to predict a water quality index (WQI) and a water quality class (WQC) based on four input parameters: temperature, turbidity, pH, and total dissolved solids. Random Forest was among the classification algorithms employed in this research. However, it achieved an accuracy of only 76% in this particular study.
Furthermore, Wong et al. [6] conducted a comprehensive exploration and analysis of 17 novel input features. Their goal was to formulate an enhanced water quality index (WQI) capable of adapting to the land use activities surrounding the river. For model selection, the researchers employed five regression algorithms, specifically Random Forest, AdaBoost, Support Vector Regression, Decision Tree Regression, and Multilayer Perceptron. Among these algorithms, Random Forest exhibited superior prediction performance. This study introduced a modified Random Forest method that incorporated the synthetic minority oversampling technique, yielding accuracy values reaching 77.68%. While Ahmed et al. [5] and Wong et al. [6] classified water into five distinct classes in their research, the accuracy rate fell slightly below the 80% benchmark.
To address these challenges, we developed a water quality prediction system (WaQuPs), a sophisticated solution tackling the intricacies of water quality assessment employing advanced machine learning, including an ensemble learning model. WaQuPs categorizes water quality into nuanced levels: potable, lightly polluted, moderately polluted, and heavily polluted, aligning with the guidelines set by Government Regulation 82 of 2001. To ensure rapid and precise dissemination of information, WaQuPs integrates an Internet of Things (IoT)-based communication protocol for the efficient delivery of the detected water quality results.
In its development, WaQuPs leverages ensemble machine learning techniques, specifically combining multiclass classification with random oversampling (ROS) rather than using the synthetic minority oversampling technique (SMOTE) [7], to enhance performance. The main classification algorithm employed in our model is Random Forest, serving as the primary method of analysis. Additionally, we conduct comparative assessments with other ensemble learning algorithms, such as XGBoost and CatBoost, to evaluate their effectiveness in this context. To optimize our classification model, we implement cross-validation techniques and carry out hyperparameter tuning.
The adoption of ensemble machine learning in our study is driven by its acknowledged superiority over classical machine learning methodologies. The primary advantage of ensemble learning lies in its utilization of multiple algorithms simultaneously, enhancing overall proficiency and robustness [8,9]. Moreover, ensemble learning, marked by the fusion of predictions from multiple models, leads to an elevated level of predictive accuracy [10]. This strategic choice is underscored by the findings presented in Ajayi's paper [10], where the ensemble learning model demonstrated superior accuracy compared to classical machine learning. In our pursuit of water quality prediction, we embraced ensemble learning and specifically chose Random Forest as the primary model to handle multi-class classification cases. This decision was informed by insights from prior research by Patel et al. [4].
Previous studies have underscored Random Forest's superiority within ensemble learning, exhibiting higher accuracy values in classifying water quality into two classes [4]. Moreover, Random Forest has demonstrated suitability for handling multi-class classification challenges [11]. While Random Forest has proven effective for multi-class classification, it has been underutilized in water quality studies. To date, we have found only two studies [5,6] that explore its application to multi-class cases within the realm of water quality.
For comparative analysis, we selected the XGBoost and CatBoost models, both falling within the ensemble learning paradigm and chosen based on insights from prior research [3]. CatBoost is renowned for its adeptness in handling categorical and heterogeneous data [12], aligning well with our dataset, which involves categorizing water quality into specific levels: potable, lightly polluted, moderately polluted, and heavily polluted. On the other hand, XGBoost excels in terms of operating speed, scalability, and effectiveness with large datasets [13], attributes that resonate with the size and complexity of our dataset. Considering these factors, we decided to compare the XGBoost and CatBoost models with our primary model, Random Forest.

Related Works
Numerous prior studies have investigated the realm of water quality prediction. In a notable contribution, Torky and colleagues [14] conducted an extensive investigation aimed at ensuring the delivery of safe drinking water and estimating water quality indices through the implementation of machine learning techniques. Their research resulted in the development of a dual-component system: the first component assesses the potability of water, while the second component predicts water quality index (WQI) values through regression analysis. The first component employs a range of machine learning algorithms to facilitate water classification. The study strongly emphasizes the essential role of machine learning in improving the precision of water quality prediction models. As a result, it provides valuable insights that can advance the application of machine learning in the field of water quality assessment. The research is significant in highlighting the critical need to ensure safe drinking water for the well-being of communities and the preservation of environmental sustainability.
In 2022, Krtolica et al. [15] conducted a study concerning a crucial aspect of water quality assessment through the presence of macrophytes. Their research was grounded in existing literature pertaining to water quality assessment, highlighting the significance of considering macrophytes as valuable indicators of water quality. They employed a variety of machine learning algorithms to construct a robust water quality assessment model. Their experimental results revealed that the SVM algorithm achieved the highest accuracy rate, 88%, in evaluating water quality based on the presence of macrophytes. This achievement underscores the substantial potential of machine learning in enhancing the precision and reliability of water quality detection methodologies, particularly in ecosystems where macrophytes play a vital role.
In their notable work, Uddin et al. [16] conducted a comprehensive analysis of the performance of a water quality index model designed for predicting water conditions, leveraging machine learning techniques. The study, situated within the existing body of research, provides a substantial contribution to the ongoing exploration of water quality assessment and prediction. It underscores the critical importance of robust modeling and machine learning techniques for better understanding and forecasting water conditions. The implications of their findings extend to enhancing the precision and reliability of water quality prediction models, thereby facilitating more informed decision-making and effective water resource management. This research augments the growing body of knowledge in the field of environmental science, with a specific focus on advancing water quality assessment through the application of machine-learning-based approaches.

Furthermore, S. Kaddoura [17] assessed the efficacy of employing machine learning (ML) techniques for water quality prediction. In this investigation, a machine learning classifier model was developed to predict water quality utilizing real-world datasets, categorizing the water as potable or not. Across the multiple algorithms employed, the findings indicate that the support vector machine and k-nearest neighbor models outperformed the others, as evidenced by higher F1-score and ROC AUC values. On the other hand, LASSO LARS and stochastic gradient descent exhibited superior performance when considering recall values.
Additionally, Zhu et al. [18] investigated the potential of machine learning algorithms in assessing water quality across diverse environmental settings, including surface water, groundwater, and seawater. Their research not only offers a conceptual framework for potential applications in water environments but also suggests the use of specific algorithms, such as Support Vector Regression (SVR), Random Forest (RF), and deep neural networks (DNNs). It is crucial to clarify that the proposed work is primarily theoretical, offering a proposal for future research directions, and it does not present concrete evaluation results or empirical findings. For a practical examination of the subject, Hassan and Woo [19] have undertaken a separate study in which they systematically assess and synthesize satellite data for predicting water quality. Their work consolidates findings from various literature sources, including Scopus, Web of Science, and IEEE, to assist environmental restoration programs in aligning with regulatory guidelines. They present several case studies that explore potential parameters and algorithms for predicting water quality, offering valuable insights into the intersection of technology and environmental policy.

Random Forest
Random Forest, our primary ensemble model, is a versatile technique that builds an ensemble of decision trees. Each tree independently classifies the data, and the final prediction is obtained by aggregating the outputs of the individual trees. This approach is well recognized for its efficacy in mitigating overfitting, handling high-dimensional data, and facilitating feature selection. Each tree is trained on a subset of the dataset, so the 'out-of-bag' samples, which were not used to train a given tree, can serve as an internal test set for evaluating the performance of the model. To configure Random Forest, a few essential parameters must be set, such as the maximum depth of the trees, the number of predictor variables considered at each node, and the total number of trees in the ensemble [20].
Let us consider an $n$-dimensional random vector, $A = (A_1, A_2, \ldots, A_n)^T$, representing the predictor variables, and a random variable, $B$, signifying the output. Random Forest (RF) seeks a prediction function, $f(A)$, that accurately forecasts $B$. This prediction function is determined through a loss function, $L(B, f(A))$, with the objective of minimizing the expected value of the loss.
Usually, in a classification problem, such as facies or fracture classification, the loss is binary, taking a value of either 0 or 1. Denoting the set of potential values of $B$ as $\mathcal{B}$, the objective is to minimize the expected value of $L(B, f(A))$ under this 0-1 loss.
In classification problems, $f(a)$ is the most frequently predicted class, determined through a voting scheme in which $h(a)$ represents the base learners.
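For reference, the quantities described in the three paragraphs above can be written in the standard Random Forest notation. This is a sketch under assumed notation: the number of trees $J$, the indexed base learners $h_1, \ldots, h_J$, and the indicator function $I(\cdot)$ are introduced here for illustration.

```latex
\begin{align}
  \text{expected loss:} \quad
    & E_{AB}\bigl[ L(B, f(A)) \bigr] \\
  \text{0--1 loss:} \quad
    & L(B, f(A)) = I\bigl(B \neq f(A)\bigr) =
      \begin{cases}
        0 & \text{if } B = f(A) \\
        1 & \text{otherwise}
      \end{cases} \\
  \text{majority vote:} \quad
    & f(a) = \arg\max_{b \in \mathcal{B}} \sum_{j=1}^{J} I\bigl(b = h_j(a)\bigr)
\end{align}
```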

XGBoost
XGBoost, on the other hand, is an optimized gradient-boosting algorithm. It uses a gradient-boosting framework to train multiple decision trees sequentially, each of which corrects the errors of its predecessors. XGBoost is renowned for its efficiency, scalability, and robustness, making it an excellent choice for many machine learning tasks. What sets XGBoost apart from other gradient-boosting algorithms is its adoption of a more regularized model formulation, which mitigates the risk of overfitting and, in turn, contributes to superior performance [21]. This entails learning individual functions, each structured as a tree with associated leaf scores. Given a dataset comprising $m$ samples and $n$ features, $D = \{(X_j, y_j)\}$ with $|D| = m$, $X_j \in \mathbb{R}^n$, and $y_j \in \mathbb{R}$, a tree ensemble model uses a collection of $L$ additive functions to make predictions, as given in Equation (4):

$$\hat{y}_j = \sum_{l=1}^{L} h_l(X_j), \quad h_l \in H, \tag{4}$$

where the $h_l$ are the base learners. In this context, $H = \{h(X) = w_{q(X)}\}$ (with $q : \mathbb{R}^n \to U$ and $w \in \mathbb{R}^U$) represents the space of regression trees. Here, $q$ characterizes the structure of each tree, mapping a sample to its leaf index, with $U$ representing the total number of leaves within the tree. Each $h_l$ corresponds to an independent tree structure $q$ and leaf weights $w$.

CatBoost
In addition to Random Forest and XGBoost, we include CatBoost in our ensemble. CatBoost, a gradient-boosting algorithm developed by Yandex, excels at handling the categorical features frequently encountered in real-world datasets. It uses binary decision trees as its base predictors [21] to improve training speed and prediction accuracy while minimizing overfitting. One of the key strengths of CatBoost lies in its capability to manage categorical features directly, eliminating the need for extensive data preprocessing. Let us consider a dataset of observations, $D = \{(X_j, y_j)\}$ for $j = 1, 2, \ldots, m$, where $X_j = (x_j^1, x_j^2, \ldots, x_j^n)$ represents a vector of $n$ features, and the response feature $y_j \in \mathbb{R}$ is either binary (e.g., 'yes' or 'no') or encoded as a numerical feature (0 or 1) [21]. These samples, denoted as $(X_j, y_j)$, are independently and identically distributed, following an unknown distribution $p(\cdot, \cdot)$. The primary objective of this learning task is to train a function, $H : \mathbb{R}^n \to \mathbb{R}$, which seeks to minimize the expected loss, as defined in Equation (5).
Here, $L(\cdot, \cdot)$ represents a differentiable loss function, and $(X, y)$ pertains to testing data sampled from the training dataset, $D$.
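In the gradient-boosting literature, the expected loss described above is commonly written as shown below. This is an assumed form given for orientation, using the symbols $H$, $L$, $X$, and $y$ from the preceding paragraph.

```latex
\begin{equation}
  \mathcal{L}(H) := \mathbb{E}\, L\bigl(y, H(X)\bigr)
\end{equation}
```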

Dataset
Our study leverages data from the Jakarta Open Data website [22] as our primary data source, comprising a dataset with 267 entries. Each entry encompasses 16 physical and chemical parameters for water quality assessment, including metrics such as TDS, turbidity, iron content, fluoride levels, and total hardness. This dataset is specific to the Jakarta, Indonesia area, as illustrated in Figure 1 [23]. The dataset is available in CSV format and can be downloaded directly from the website. Our data collection adheres to the pollution index method in accordance with Indonesian regulations, ensuring water quality compliance with established safety thresholds. These quality standards are outlined in the Regulation of the Minister of Health of the Republic of Indonesia No. 492/MENKES/PER/IV/2010, which addresses drinking water quality requirements [24]. A breakdown of the parameters and their respective thresholds can be found in Table 1.

Within this dataset, in addition to the 16 parameters, a key variable is the pollution index value, which determines the class assigned to each water sample. To assign the appropriate class to each sample, we adhere to the guidelines set forth in Government Regulation 82 of 2001, governing water quality management and water pollution control in Indonesia through the pollution index method [25]. This regulation introduces a water quality classification system comprising four classes: 'Meeting Quality Standards', 'Lightly Polluted', 'Moderately Polluted', and 'Heavily Polluted'. The calculation of the pollution index (IP) value is given by Equation (6) [25], and the specifications for these classes can be found in Table 2.
Within this context, the following variables are defined:
• $IP_j$ = pollution index for designation $j$
• $C_i$ = concentration of water quality parameter $i$
• $L_{ij}$ = concentration of water quality parameter $i$ listed in water designation standard $j$

Furthermore, our research entailed dataset preprocessing, a crucial step in data analysis that significantly impacts overall data quality. Dataset preprocessing plays a pivotal role in enhancing data accuracy and ensuring the robustness of insights derived from the data. In essence, data preprocessing involves "the aggregation and modification of data components to derive meaningful insights", as defined succinctly by Patel et al. [4]. In our study, we identified an imbalance in the class distribution within the dataset we employed. To rectify this imbalance, we adopted the ROS method. For a more detailed explanation of ROS, please consult Section 4.2.
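To make the role of $IP_j$, $C_i$, and $L_{ij}$ concrete, the following minimal sketch computes a pollution index and maps it to one of the four classes. It assumes the commonly used form of the index, the square root of half the sum of the squared maximum and squared mean of the ratios $C_i / L_{ij}$, and assumes class boundaries at 1, 5, and 10; Equation (6) and Table 2 remain the authoritative definitions. The function names, the sample concentrations, and the standards in the example are illustrative, and any regulation-specific adjustment of individual ratios is omitted.

```python
import math

def pollution_index(concentrations, standards):
    """Pollution index for one sample (assumed form of Equation (6)).

    concentrations: dict mapping parameter name -> measured C_i
    standards:      dict mapping parameter name -> threshold L_ij
    """
    ratios = [concentrations[p] / standards[p] for p in standards]
    max_ratio = max(ratios)
    mean_ratio = sum(ratios) / len(ratios)
    return math.sqrt((max_ratio ** 2 + mean_ratio ** 2) / 2)

def water_class(ip):
    """Map an index value to a class; the boundaries are assumptions."""
    if ip <= 1.0:
        return "Meeting Quality Standards"
    if ip <= 5.0:
        return "Lightly Polluted"
    if ip <= 10.0:
        return "Moderately Polluted"
    return "Heavily Polluted"

# Illustrative sample with two parameters (values are made up).
sample = {"tds": 250.0, "iron": 0.6}
limits = {"tds": 500.0, "iron": 0.3}
ip = pollution_index(sample, limits)
label = water_class(ip)
```

In this example the ratios are 0.5 and 2.0, so the index lies between 1 and 5 and the sample falls in the second class under the assumed boundaries.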
In this study, we partitioned the dataset into two distinct subsets: a training set and a validation set, employing a predefined split ratio of 80% for the training set and 20% for the validation set.The creation of these training and validation datasets is a fundamental step in model development [3], as it allows us to assess and fine-tune the model's performance effectively.
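The 80/20 partition can be sketched as follows. This is a minimal illustration using only the Python standard library; the shuffling, the fixed seed, and the function name are assumptions, since the text does not state how rows were assigned to each subset.

```python
import random

def train_validation_split(rows, train_fraction=0.8, seed=42):
    """Shuffle the rows and cut them into training and validation subsets."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Row indices stand in for the 267 dataset entries.
rows = list(range(267))
train_rows, validation_rows = train_validation_split(rows)
```

With 267 entries and no oversampling, this yields 214 training rows and 53 validation rows.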

Random Oversampling (ROS)
In the domain of dataset preprocessing to address class imbalance in classification problems, random oversampling (ROS) is a prominent technique. This method operates by augmenting the number of instances in the minority class within the training set through random replication of existing minority class members [26]. ROS is a well-established oversampling technique known for its efficacy in mitigating class imbalances, especially in multiclass scenarios [7].
Empirical studies consistently underscore the favorable performance of ROS, demonstrating its effectiveness even when benchmarked against more intricate oversampling methods [27]. ROS serves as the foundational method for various oversampling techniques, including SMOTE [28]. To gain a visual understanding of ROS's operation, refer to Figure 2.

The random oversampling (ROS) technique functions by randomly duplicating minority class data points with replacement until a balanced distribution is achieved between the classes. In essence, the instances belonging to the minority class in the target variable are replicated at random until their count aligns with that of the majority class [28]. This augmentation of minority class examples contributes to enhanced predictive model accuracy, offering researchers a more extensive dataset for the minority class [28]. This approach addresses the challenge of class imbalance, ultimately improving the model's ability to handle diverse scenarios [28].
Figure 3 displays visual representations illustrating the dataset's condition both before and after the application of ROS. The oversampling technique involves analyzing the training data for one class and assigning a similar probability to both $Y_0$ and $Y_1$. In the ROS algorithm, new samples are evaluated based on their neighbors, and the sample width for $H_j$ is determined. $K_{H_j}$ is selected according to a unimodal symmetric distribution. In the ROS algorithm, as outlined in [29], each new sample is drawn from $K_{H_j}$, a probability distribution centered at $x_i$ and depending on a scale matrix $H_j$.
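The duplication-based form of ROS described in this subsection can be sketched as follows. This is a minimal standard-library illustration, not the kernel-based variant outlined in [29]; the function name, the seed, and the toy data are assumptions. Every class is topped up to the size of the largest class, which extends the two-class description to the multiclass case.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority samples until all classes match
    the size of the largest class (sampling with replacement)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))   # replicate an existing member
            y_out.append(label)
    return X_out, y_out

# Toy data: three classes with 3, 2, and 1 samples.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
y = ["potable", "potable", "potable", "light", "light", "heavy"]
X_balanced, y_balanced = random_oversample(X, y)
balanced_counts = Counter(y_balanced)
```

After the call, each class has three samples, and every added row is an exact copy of an existing row of the same class.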

Evaluation Metrics
The evaluation metrics employed during the algorithm testing phase align with those utilized in previous research, particularly in the study [3]. These established metrics comprise accuracy, precision, recall, and F1-score, which are essential for assessing the model's effectiveness and reliability [30]. Together, they provide a framework for assessing the model's capacity to accurately classify water quality. The application of these well-established measures allows for an objective and rigorous evaluation of the ensemble model's performance and dependability, ensuring its alignment with the desired criteria for practical applications in water quality assessment and monitoring.
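The four metrics can be computed for a multiclass problem as in the sketch below. Macro averaging (an unweighted mean over classes) is an assumption made here for illustration, since the text does not state which averaging scheme was used; the function name and the toy labels are likewise illustrative.

```python
def evaluate(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall, and F1-score."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1_scores = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)

    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1_scores) / n,
    }

# Toy example with three classes encoded as integers.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
scores = evaluate(y_true, y_pred)
```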

Methods to Develop Ensemble-Learning-Based Models for WaQuPs
The research methodology encompasses two distinct scenarios: the first scenario employs default parameters, while the second scenario involves hyperparameter tuning. These scenarios are described in more detail in this sub-subsection.
These scenarios serve as a practical demonstration of ensemble learning techniques and offer valuable insights into the research approach.

Default Parameter
As depicted in Figure 4, the implementation of the first scenario encompasses a sequence of essential steps. It commences with the retrieval of the dataset employed in this research, followed by the removal of less pertinent columns, such as location names. Subsequently, the dataset undergoes ROS to achieve class balance. The next stage involves a feature selection process to identify and retain the top 10 features within the dataset. Feature selection is instrumental in enhancing the efficiency and efficacy of our machine learning models. Specifically, we employed the correlation coefficient method with a predefined threshold of 0.25. This threshold functions as a filtering criterion, allowing us to retain features exhibiting a significant correlation with the target variable while eliminating redundant or less informative attributes. As a result, we identified and retained ten highly correlated features, which were deemed pivotal in driving the accuracy and overall performance of our machine learning model. The selected features, along with their respective correlation values, are detailed in Table 3. The dataset is then divided into distinct test and training sets. The initial scenario culminates with the evaluation of performance metrics, specifically accuracy, precision, recall, and the F1-score.
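The threshold-based feature selection step can be illustrated with the sketch below. It assumes the Pearson correlation coefficient, an integer encoding of the four classes as the target, and a comparison on the absolute value of the coefficient; none of these details are stated in the text, and the column names and values are made up for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y) if std_x and std_y else 0.0

def select_features(columns, target, threshold=0.25):
    """Keep the features whose |correlation| with the target meets the threshold."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return {name: r for name, r in scores.items() if abs(r) >= threshold}

# Toy data: the target is the class encoded as 0..3 (an assumed encoding).
target = [0, 1, 2, 3]
columns = {
    "tds": [100.0, 200.0, 300.0, 400.0],   # rises with the class
    "ph": [7.0, 7.1, 7.1, 7.0],            # unrelated to the class
}
selected = select_features(columns, target)
```

In this toy example only `tds` is retained, because its correlation with the target is 1 while that of `ph` is approximately 0.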

Hyperparameter Tuning
As illustrated in Figure 5, following the completion of the initial steps in the first scenario, the subsequent phase entails the refinement of hyperparameters through GridSearchCV. After implementing these hyperparameter enhancements, the research outcomes include accuracy, precision, recall, F1-score, and the optimized parameters aimed at enhancing the evaluation metrics. This step is significant for evaluating the variations in the test results based on predetermined evaluation criteria. Hyperparameter tuning is of paramount importance in the development of machine learning models, including the ensemble models utilized in this research. The optimization of hyperparameters is a critical step that can significantly enhance a model's performance and its ability to generalize. These hyperparameters govern diverse aspects of model behavior, encompassing complexity, regularization, and overall efficacy. In our study, we employed the grid search cross-validation (GridSearchCV) technique as our preferred method for hyperparameter optimization. GridSearchCV is a systematic approach that explores a grid of predefined hyperparameter combinations, aiming to identify the best configuration [31]. This method offers several advantages, including simplicity, comprehensiveness, and the assurance that no hyperparameter setting within the predefined grid is overlooked. Through the application of GridSearchCV, our objective is to fine-tune our ensemble model, achieving the highest level of accuracy and effectiveness in classifying water quality across multiclass categories.
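The idea behind grid search with cross-validation can be shown with a small standard-library sketch. It is not scikit-learn's `GridSearchCV` and it does not tune a Random Forest: a toy nearest-neighbour classifier on one-dimensional data stands in for the model, and the hyperparameter name `n_neighbors`, the grid values, and the five contiguous folds are assumptions chosen to keep the example short.

```python
from collections import Counter
from itertools import product

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

def knn_fit_predict(params, X_train, y_train, X_val):
    """Toy stand-in model: majority vote among the nearest training points."""
    k = params["n_neighbors"]
    predictions = []
    for x in X_val:
        order = sorted(range(len(X_train)), key=lambda i: abs(X_train[i] - x))
        votes = Counter(y_train[i] for i in order[:k])
        predictions.append(votes.most_common(1)[0][0])
    return predictions

def grid_search_cv(param_grid, fit_predict, X, y, cv=5):
    """Try every combination in the grid; return the best by mean CV accuracy."""
    names = list(param_grid)
    best_params, best_score = None, -1.0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for train_idx, val_idx in k_fold_indices(len(X), cv):
            preds = fit_predict(
                params,
                [X[i] for i in train_idx],
                [y[i] for i in train_idx],
                [X[i] for i in val_idx],
            )
            correct = sum(p == y[i] for p, i in zip(preds, val_idx))
            fold_scores.append(correct / len(val_idx))
        mean_score = sum(fold_scores) / len(fold_scores)
        if mean_score > best_score:          # ties keep the earlier combination
            best_params, best_score = params, mean_score
    return best_params, best_score

# Two well-separated classes, interleaved so each fold contains both.
X = [0.0, 10.0, 0.5, 10.5, 1.0, 11.0, 1.5, 11.5, 2.0, 12.0]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
best_params, best_score = grid_search_cv({"n_neighbors": [1, 3, 7]}, knn_fit_predict, X, y)
```

The same loop structure applies when the stand-in model is replaced by an ensemble classifier and the grid lists its hyperparameters, such as the number of trees or the maximum depth.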

Methods to Develop the WaQuPs Application
The central aim of our research is to integrate ensemble learning and IoT concepts into a unified application. This integration enables us to categorize water quality into more nuanced levels, distinguishing between potable, lightly polluted, moderately polluted, and heavily polluted water sources. We developed our application using the machine learning model that exhibited the highest accuracy in the prior scenario experiments. This user-centric solution takes the form of both a web-based platform designed for administrators and a mobile application customized for the local community. Furthermore, our approach harnesses the Internet of Things (IoT) paradigm, making use of MQTT as a communication protocol to facilitate interactions between servers and mobile devices. For a detailed overview of our system's development, please refer to Figure 6.

Data Processing
The WaQuPs application facilitates the submission of water testing procedures to the Water Quality Laboratory within the Faculty of Civil and Environmental Engineering at the Bandung Institute of Technology. This approach is particularly pertinent due to the limitations of expensive sensors for assessing various parameters. Furthermore, leveraging the laboratory for testing significantly enhances the precision and accuracy of the data. From the laboratory test results, only the top 10 features, as identified in the previous subsection, are extracted and employed in the analysis. The analysis adheres to the reference guidelines outlined in the "Standard Methods For The Examination of Water and Wastewater, 23rd Edition, 2018 (APHA)" and the "Indonesian National Standards 2004 (SNI)". Additionally, an overview of the per-parameter test methods employed by the laboratories in water analysis is presented in Table 4. This systematic approach ensures data accuracy and consistency in water quality assessment.

Web-Based Application
In the realm of web-based applications, WaQuPs leverages React.js [32], a renowned JavaScript library for crafting dynamic and interactive user interfaces (UI) in web development. Within our web application, administrators can access four principal pages and one essential component. This web application enables administrators to input new water quality data or modify previously registered information within the system. Furthermore, administrators can view the full list of registered water quality data on the dedicated water quality list page. System access is restricted to authenticated users who have previously registered their credentials. Hence, users must undergo the authentication process by providing their username and password on the login page before gaining entry into the main system.

Web Services
In the realm of web services, WaQuPs employs FastAPI [33], a Python framework for building high-performance APIs (application programming interfaces). FastAPI stands out for its blend of speed and developer productivity when building web APIs. In our application, the API serves as the communication bridge linking the web application with the server. This connection enables various functionalities, including user login, the addition of new water quality data, and the editing of existing information. The choice of FastAPI as the application development framework is primarily due to its compatibility with the Python language. This compatibility ensures integration with servers employing Python for machine learning purposes, enhancing the overall system efficiency.

Water Quality Server
As illustrated in Figure 7, the water quality server is a pivotal component, offering backend services and database functionality that are readily accessible to clients. The backend operations are supported by a Python framework [34], while the chosen database system is MySQL [35]. The server is central to the water quality prediction phase, which is triggered by new data entered by the administrator via the website application. The newly acquired data and their corresponding prediction results are stored in the database, making them available for retrieval by both the website and mobile applications. For prediction, we use the Random Forest model, trained on a dataset that has undergone preprocessing and feature selection.

Database
WaQuPs utilizes MySQL [35] as its database system, a widely recognized open-source relational database management system (RDBMS). An RDBMS is the backbone for storing, managing, and structuring interconnected data in tables. In WaQuPs, the database stores crucial data, including water quality datasets, newly added water quality information, and user credential details within the website application. Noteworthy advantages of MySQL include:
1. Reliability: MySQL has proven to be dependable for stable data management. In recent years, its owner company, Oracle, has continually improved system stability and performance.
2. Scalability: MySQL can handle large amounts of data and high traffic. This versatility allows it to be used in various applications, ranging from simple websites to large-scale systems.
3. Speed: MySQL is designed for high performance. Under suitable conditions, MySQL can execute queries and transactions swiftly, providing rapid responses in application usage.

MQTT
The MQTT protocol [36] is constructed with a focus on efficiency, reliability, and minimal resource usage, making it well-suited for constrained or low-power environments, such as microcontroller devices. MQTT follows a publish-subscribe model, where one or more devices can publish messages to specific topics, and other interested devices can subscribe to those topics to receive the messages.
Within WaQuPs, the water quality server operates as the publisher, while the mobile application users function as the subscribers. To illustrate this interaction, consider user AB, who subscribes to water quality information 1. When an administrator, using the website application, modifies and saves certain data related to water quality information, the changes are first transmitted to the server via FastAPI. The server then processes the new classification, updates the database, and publishes the altered information for water quality 1. As a result, all subscribed users receive real-time updates on water quality 1, ensuring they stay informed with the latest information. This real-time functionality enhances the overall user experience and facilitates the rapid dissemination of critical data.
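The publish-subscribe interaction described above can be sketched with a small in-memory stand-in for a broker. This is only an illustration of the topic-based routing idea, not the MQTT protocol or the WaQuPs implementation: the class `InMemoryBroker`, the topic names such as `water-quality/1`, and the payload strings are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

# A handler receives the topic name and the message payload.
Handler = Callable[[str, str], None]


class InMemoryBroker:
    """Toy stand-in for a broker: routes messages by exact topic name."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        # Register a callback that fires whenever `topic` is published.
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: str) -> int:
        # Deliver the payload to every subscriber of this topic and
        # return how many subscribers received it.
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(topic, payload)
        return len(handlers)


broker = InMemoryBroker()

# User AB subscribes to updates for water quality record 1 (hypothetical topic).
inbox_ab = []
broker.subscribe("water-quality/1", lambda topic, payload: inbox_ab.append((topic, payload)))

# The server publishes a new classification after the administrator saves an edit.
delivered = broker.publish("water-quality/1", "lightly polluted")

# A message on a topic nobody subscribed to reaches no one.
ignored = broker.publish("water-quality/2", "potable")
```

In a real deployment the broker would be a separate network service and the server and mobile clients would connect to it through an MQTT client library; the sketch only shows why a subscriber receives updates for its own topics and nothing else.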

Mobile Application
For the mobile application, WaQuPs uses React Native [37], a framework for building mobile applications with JavaScript on the foundation of React.js. The mobile application is open to the entire community, enabling users to access comprehensive water quality information within the system. The mobile application comprises three primary pages: the main page, the water quality search page, and the detailed water quality information page. The mobile application, built with React Native, functions as a subscriber, ensuring users receive real-time updates whenever their subscribed information changes. This real-time functionality enhances user engagement and keeps users informed about the latest developments.

Comparison
In this study, we conducted comparisons using two distinct methodologies. Firstly, we compared the Random Forest model with two alternative ensemble learning algorithms, namely XGBoost and CatBoost. Secondly, we compared our proposed Random Forest model with insights derived from several prior studies. For a more detailed examination of the advantages and disadvantages associated with the ensemble models used, please see Table 5.

Model | Advantages | Disadvantages
Random Forest | 1. Variable importance and accuracy are generated automatically [38]. 2. Addresses the issue of overfitting [38]. | 1. Needs a huge amount of labeled data to achieve good performance [39].
CatBoost | 1. Reduces the probability of misclassifying sample data [40]. 2. Well-suited to machine learning tasks involving categorical, heterogeneous data [12]. | 1. Model training is slow, and performance can be poor on high-dimensional data [13].
XGBoost | 1. Works effectively with fast operation speed, scalability, and large data [13]. 2. High prediction performance for high-dimensional data and sparse data [13]. | 1. Low performance with high memory usage and imbalanced data [13].

Background information from previous research on water quality detection using machine learning is presented in Table 6. This information, first raised in the Introduction, serves as a foundational reference for understanding the context and significance of our current study.

Experiment Results on Dataset Preprocessing
This section reports the experimental outcomes obtained from the original dataset and from the dataset balanced with the ROS technique, using one of the machine learning models, Random Forest. In this experiment, we use the recall value as the benchmark for evaluation. The experimental findings indicate that, following the application of ROS, the Random Forest model achieved an average recall of 83%, compared with 82% before ROS. Table 7 shows the results of the experiment before and after implementing ROS. This improvement indicates that ROS is effective in addressing imbalanced data challenges.
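To make the balancing step concrete, the following is a minimal sketch of random oversampling: minority classes are padded with randomly chosen duplicates of their own rows until every class matches the largest one. The function name `random_oversample`, the toy rows, and the shortened class labels are assumptions for illustration and do not reproduce the WaQuPs dataset or code.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes are equal in size."""
    rng = random.Random(seed)
    target = max(Counter(labels).values())  # size of the largest class

    # Group the rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    out_rows, out_labels = list(rows), list(labels)
    for label, members in by_class.items():
        # Draw with replacement from the existing rows of this class.
        for _ in range(target - len(members)):
            out_rows.append(rng.choice(members))
            out_labels.append(label)
    return out_rows, out_labels


# Toy single-feature rows with an imbalanced label distribution (hypothetical values).
rows = [[7.1], [6.9], [7.3], [5.2], [8.8], [4.1]]
labels = ["potable", "potable", "potable", "lightly", "lightly", "heavily"]

balanced_rows, balanced_labels = random_oversample(rows, labels)
class_counts = Counter(balanced_labels)
```

Because oversampling only repeats existing rows, it adds no new information. A common precaution, stated here as general practice rather than as something the source describes, is to oversample only the training portion of each split so that duplicated rows do not appear in both training and validation data.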

Experiment Results on Developing Ensemble-Learning-Based Models
This section offers a comprehensive performance evaluation of the Random Forest model, accompanied by an in-depth comparison with two alternative algorithms, XGBoost and CatBoost. Two scenarios were examined: default parameters (scenario 1) and hyperparameter tuning (scenario 2). In both scenarios, the experiments involved an extensive preprocessing phase, which included data balancing through ROS, allowing for classification of water quality into four predefined categories. In scenario 2, the model was further refined via hyperparameter optimization, resulting in commendable accuracy outcomes. The effectiveness of the model is assessed using a set of standard evaluation metrics, including accuracy, precision, recall, and the F1-score. In addition to the machine learning model experiments, this section presents the results of creating both website and mobile applications, demonstrating a holistic approach to addressing water quality management.

Performance of Ensemble-Learning-Based Default Parameters
This phase unveils the outcomes of the training and validation procedures performed on the Random Forest model, offering valuable comparisons with two alternative algorithms, namely XGBoost and CatBoost.
Figure 8 presents the best confusion matrix obtained during the five-fold training and validation sessions conducted for the Random Forest model. Prediction errors were few, with only nine misclassified instances, indicating that the model predicted the majority of data points correctly.
In Figure 9, we present the best confusion matrix resulting from the five-fold training and validation procedures applied to the XGBoost model. The model produced only 11 prediction errors, while the majority of data points were accurately classified.
Figure 10 presents the best confusion matrix derived from the five-fold training and validation cycles conducted for the CatBoost model. This matrix, arising from the same cross-validation process, shows only nine prediction errors, again indicating that the model classified the majority of data points correctly.
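As a rough illustration of how a five-fold split partitions the data, the sketch below generates train and validation index lists so that every sample is validated exactly once. The helper `k_fold_indices` is hypothetical, the split is contiguous with no shuffling or stratification, and the sample count of 267 is used only because it is the dataset size reported in this paper; the size after ROS would differ.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs; each sample lands in exactly one validation fold."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(267, k=5))
val_sizes = [len(val_idx) for _, val_idx in folds]
```

Each model is trained on the training indices of a fold and evaluated on the held-out indices, which yields one confusion matrix per fold; the figures here report the matrix from a single fold.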

Performance of Ensemble-Learning-Based Hyperparameter Tuning
This phase presents the outcomes of implementing hyperparameter tuning after conducting the training and validation processes on the Random Forest model, along with insights from comparisons with two alternative algorithms, XGBoost and CatBoost.
Figure 11 illustrates the best confusion matrix obtained during the five-fold training and validation phase for the Random Forest model, following hyperparameter tuning using GridSearchCV. The number of prediction errors remains low, with only eight occurrences, showing that the model correctly classified the majority of data points.
Figure 12 presents the best confusion matrix resulting from the five-fold training and validation phase applied to the XGBoost model following hyperparameter tuning through GridSearchCV. The number of prediction errors remains low, with just nine instances, highlighting the model's accuracy in correctly classifying the majority of data points.
Figure 13 shows one of the best confusion matrices achieved during the five-fold training and validation phase for the CatBoost model, subsequent to hyperparameter tuning via GridSearchCV. Again, this confusion matrix was a product of the five-fold cross-validation process. The number of prediction errors is small, with only 12 occurrences, showing the model's ability to accurately predict the majority of data points.
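The idea behind an exhaustive grid search can be sketched in a few lines: every combination of candidate hyperparameter values is scored, and the best-scoring combination is kept. The grid values and the `toy_score` function below are purely hypothetical placeholders; in the tuning described above, the score would instead be the mean cross-validated performance of a model trained with those parameters, and the grid actually used is not given in this section.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every parameter combination and return the best one with its score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical candidate values, not the grid used in this study.
param_grid = {"n_estimators": [100, 200, 300], "max_depth": [5, 10, 20]}


def toy_score(params):
    # Placeholder for a cross-validated score; by construction it peaks
    # at n_estimators=200 and max_depth=10.
    return 1.0 - abs(params["n_estimators"] - 200) / 1000 - abs(params["max_depth"] - 10) / 100


best_params, best_score = grid_search(param_grid, toy_score)
```

The number of combinations grows multiplicatively with each added hyperparameter (here 3 x 3 = 9), and each combination requires a full round of cross-validation, which is why tuning can be computationally expensive.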

Proposed WaQuPs Application for Detecting Water Quality

WaQuPs Application
Having identified Random Forest as the optimal model and fine-tuned it with the most suitable parameters, we developed an integrated application. This application offers a web-based interface for the input of water quality data and a mobile application that gives users access to comprehensive water-related information. The process flow of the WaQuPs application for predicting water quality is as follows.

1. The administrator inputs new data using the provided form and must ensure that all fields are filled in. Refer to Figure 14 for a visual representation.
2. Once all fields have been completed, the administrator presses the 'Save' button. Refer to Figure 15 for a visual representation.
3. Pressing the 'Save' button transmits the data to the water quality server via web services. The system then executes the water quality prediction algorithm.
4. Once the prediction results are generated, the water quality server disseminates the data to subscribers in the mobile application using MQTT. Refer to Figure 16 for a visual representation.

Prior to hyperparameter tuning, the training and testing results of the three initial machine learning models revealed that Random Forest outperformed XGBoost and CatBoost, achieving the highest accuracy of 83%. For the full model comparison, including accuracy, precision, recall, and F1-score, please refer to Table 8.

After hyperparameter tuning, the accuracy values for CatBoost and XGBoost increased modestly from 81% to 82%, while Random Forest remained at 83%. This experiment highlights that hyperparameter tuning may not uniformly enhance model accuracy in every instance. It underscores the importance of careful consideration when experimenting with various hyperparameter combinations. Identifying the optimal hyperparameters demands a significant investment of time and rigorous experimentation, often requiring appropriate computational resources. For the model comparison after hyperparameter tuning, including accuracy, precision, recall, and F1-score, please refer to Table 9.

Upon examining Table 9, both Random Forest and CatBoost emerge as strong contenders. Nevertheless, we prefer Random Forest, based on the evaluation metrics observed both before and after hyperparameter tuning. Our emphasis is on the accuracy metric, and the accuracy of Random Forest is the highest in both pre-tuning and post-tuning conditions.
In the analysis, it is notable that the F1-score value is lower than the accuracy value in the Random Forest model, while conversely, the F1-score value is higher than the accuracy value in the CatBoost model. However, such discrepancies are expected and not unusual. Accuracy gauges the overall proportion of correct predictions, while the F1-score, encompassing both precision and recall, provides a more nuanced assessment of model performance. Examples of research that found the F1-score value to be lower than the accuracy value include that by Xin and Mou [3] and Xu et al. [41]. However, examples of research that found the F1-score value to be higher than the accuracy value include that by Tasnim et al. [42] and Patel et al. [4].
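A small worked example may help show why accuracy and the F1-score can diverge. The sketch below computes accuracy and macro-averaged precision, recall, and F1 from a four-class confusion matrix. The matrix values are invented for illustration and are not taken from Figures 8 to 13, and macro averaging is an assumption, since the averaging scheme used for the reported metrics is not stated here.

```python
def macro_metrics(cm):
    """Return accuracy and macro-averaged precision, recall, F1.

    cm[i][j] is the number of samples whose true class is i and predicted class is j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total

    precisions, recalls, f1s = [], [], []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, but true class differs
        fn = sum(cm[c]) - tp                       # true c, but predicted otherwise
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)

    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical matrix; rows and columns are ordered as
# potable, lightly polluted, moderately polluted, heavily polluted.
cm = [
    [9, 1, 0, 0],
    [1, 8, 1, 0],
    [0, 1, 3, 1],
    [0, 0, 1, 1],
]

accuracy, macro_precision, macro_recall, macro_f1 = macro_metrics(cm)
```

In this toy matrix the two small classes are predicted less reliably than the two large ones. Accuracy is dominated by the large classes, whereas the macro F1 weights every class equally, so it comes out lower than accuracy. With a different error pattern or averaging scheme the ordering can reverse.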

Comparison to State-of-the-Art of Other Studies on Water Quality Detection
Table 10 presents a comparative analysis of our water quality classification results in relation to the findings of previous researchers. Earlier investigations predominantly focused on binary classification tasks. Notably, the NuSVC machine learning model achieved an accuracy of 75%, while the XGBoost machine learning model delivered an accuracy of about 78% [4]. Furthermore, the CatBoost machine learning model exhibited an accuracy of approximately 79% [4]. In past studies, the Random Forest machine learning model achieved an accuracy of about 81% [3]. In our comparative analysis with previous research, we noted significant differences in water quality classification and improvements in accuracy. These improvements included our study's departure from binary classification in favor of a multiclass approach, the use of the ROS data balancing technique, and an increased accuracy of approximately 83% for the Random Forest machine learning model. Additionally, the XGBoost machine learning model reached an accuracy of approximately 81%, and the CatBoost machine learning model reached nearly 81%. These findings underscore the efficacy of our approach in advancing water quality classification accuracy and establishing a more fine-grained multiclass classification framework.

Discussion
In this research, we successfully developed a water quality prediction system named WaQuPs. The dataset, comprising 267 entries and encompassing 16 physical and chemical parameters, was sourced from the Jakarta Open Data site [22]. Employing advanced machine learning techniques, particularly ensemble learning, we aimed to categorize water quality into nuanced levels: potable, lightly polluted, moderately polluted, and heavily polluted. WaQuPs was developed with Random Forest as the primary model, and we further explored alternative ensemble models, such as XGBoost and CatBoost, comparing their performance. To address potential imbalances in the dataset, we implemented the random oversampling (ROS) technique.
The implementation of ROS to tackle data imbalances improved recall values, as illustrated in Table 7. In addition to its positive impact on recall, the integration of ROS with the Random Forest model has consistently shown an improvement in accuracy, positioning it as the leading model across multiple studies [9,43]. This substantiates the efficacy of ROS in addressing imbalances within the classification of the dataset highlighted in [44].
The comparison of our primary ensemble learning model, Random Forest, with alternative models such as XGBoost and CatBoost involved two scenarios: one employing default parameters and the other incorporating hyperparameter tuning. The analysis of accuracy values in this study consistently demonstrates the superior performance of Random Forest compared to CatBoost and XGBoost, both before and after hyperparameter tuning. Specifically, Random Forest achieved an accuracy of 83%, precision of 82%, recall of 83%, and an F1-score of 82%, as detailed in Tables 8 and 9. This superiority is attributed to Random Forest's inherent advantages in mitigating overfitting, as highlighted in the works of Matsuki et al. and Ali et al. [38,45].
Our research compares favorably with prior studies in the field of water quality detection, as indicated in Table 10. This performance underscores our contribution to advancing water quality prediction and the successful development of WaQuPs. This comprehensive system includes both a web application and a mobile application, achieved through the integration of machine learning principles and the Internet of Things (IoT) concept.
Nevertheless, it is important to acknowledge that the performance of predictive models is influenced significantly by various factors, such as data quality and complexity. Effectively addressing these aspects and selecting appropriate classification algorithms play a pivotal role in enhancing the accuracy and overall effectiveness of water quality predictions.

Future Work
In our research, we developed the water quality prediction system (WaQuPs) by leveraging advanced ensemble learning technology with a focus on Internet of Things (IoT) integration. This research yielded several notable advantages, including the categorization of water quality into four distinct classes, substantial improvements in accuracy metrics, and the creation of applications that incorporate machine learning and IoT. To advance further research, we have several aspirations aimed at refining and expanding the scope of our study:
1. Enhancing Real-Time Data Integration: A primary focus lies in augmenting our system by integrating real-time data sourced from sensors. This endeavor seeks to reinforce the accuracy and immediacy of WaQuPs, enabling a more comprehensive and dynamic assessment of water quality conditions.
2. Comparative Assessment of Pollution Standards: We aim to compare our system's performance using the pollution index method aligned with Indonesian regulations against the CCME standards, a globally recognized benchmark. This comparative analysis aims to evaluate the efficacy and alignment of our predictive model with international standards, contributing to its robustness and adaptability beyond regional frameworks.
3. Implementation of Public Alert Mechanism: Should the integration of real-time sensor data be realized, a novel feature in our system will provide warning signals to the public when the water quality index indicates a significant decline. This proactive measure aims to empower communities by providing timely and crucial information about water quality fluctuations, fostering awareness and prompt actions to safeguard public health.

These envisioned advancements hold promise not only for refining the accuracy and scope of our predictive system but also for aligning it with global standards and empowering communities with actionable information for informed decision-making regarding water quality concerns.

Limitations
This research is subject to several limitations that warrant acknowledgment. Firstly, there is a reliance on laboratory tests for obtaining precise results, as opposed to utilizing dedicated on-site sensors. This reliance primarily reflects the cost of specialized sensors. Secondly, the dataset utilized in this study remains relatively modest in size, comprising only 267 rows of data. The restricted scale of the dataset can influence the attainable levels of the evaluation metrics, potentially resulting in suboptimal outcomes in certain instances.

Conclusions
Our research showcases the successful development of the water quality prediction system (WaQuPs), a platform encompassing both a website and a mobile application. By leveraging advanced machine learning techniques, particularly ensemble learning models like Random Forest, the system can categorize water quality into distinct levels. Integration of Internet of Things (IoT) principles, with MQTT as the communication protocol, facilitates interactions between servers and mobile devices, thereby expanding the system's potential for real-world applications. Our research methodology incorporates ROS to address class imbalance, five-fold cross-validation, and hyperparameter tuning, collectively enhancing the robustness and precision of our machine learning model.
WaQuPs, built on the Random Forest model, achieved an accuracy of 83%, along with precision of 82%, recall of 83%, and an F1-score of 82%. Comparative analysis revealed that WaQuPs with the Random Forest model outperformed both the XGBoost and CatBoost models, affirming its superiority in predicting water quality. In the Future Work section, we outlined key points for advancing future research in this field. Implementing these suggestions has the potential to significantly elevate the quality and depth of forthcoming studies within the domain of water quality prediction and machine learning applications.

Figure 3. Diagram before and after implementing ROS.

Within this context, the following variables are defined:
• TP (True Positive): the count of instances correctly classified as healthy water quality.
• TN (True Negative): the count of instances correctly classified as non-healthy water quality.
• FP (False Positive): the count of instances misclassified as healthy water quality when they are, in fact, non-healthy.
• FN (False Negative): the count of instances misclassified as non-healthy water quality when they are, in reality, healthy.

Figure 6. Overview of the proposed system.

Figure 7. Overview of water quality server.

MQTT (message queuing telemetry transport) is a lightweight and open communication protocol specifically designed for devices within the Internet of Things (IoT) network [36].

Figure 8. The confusion matrix of Random Forest at fold 5 before tuning.

Figure 9. The confusion matrix of XGBoost at fold 5 before tuning.

Figure 10. The confusion matrix of CatBoost at fold 5 before tuning.

Figure 11. The confusion matrix of Random Forest at fold 5 after tuning.

Figure 12. The confusion matrix of XGBoost at fold 5 after tuning.

Figure 13. The confusion matrix of CatBoost at fold 5 after tuning.

Figure 14. Visualization for form page.

Table 2. Evaluation of the pollution index value.

Table 3. Selected features with high correlation.

Table 6. Previous research cases.

Table 7. Scenarios of ROS in Random Forest based on recall.

Table 8. Comparison of the overall results of each model.

Table 9. Comparison of the overall results of each model after tuning.

Table 10. Comparative analysis with previous studies.