K-Nearest Neighbors Model to Optimize Data Classification According to the Water Quality Index of the Upper Basin of the City of Huarmey

Vega-Huerta, Hugo; Pajuelo-Leon, Jean; De-la-Cruz-VdV, Percy; Calderón, David; Maquen-Niño, Gisella Luisa Elena; Rios-Castillo, Milton E.; Camara-Figueroa, Adegundo; Gil-Calvo, Rubén; Guerra-Grados, Luis; Benito-Pacheco, Oscar

doi:10.3390/app151810202

by Hugo Vega-Huerta ¹,*, Jean Pajuelo-Leon ¹, Percy De-la-Cruz-VdV ¹, David Calderón ¹, Gisella Luisa Elena Maquen-Niño ², Milton E. Rios-Castillo ¹, Adegundo Camara-Figueroa ¹,*, Rubén Gil-Calvo ¹, Luis Guerra-Grados ¹ and Oscar Benito-Pacheco ¹

¹ Department of Computer Science, Universidad Nacional Mayor de San Marcos, Lima 15081, Peru

² Department of Electronic and Computing Engineering, Universidad Nacional Pedro Ruiz Gallo, Lambayeque 14013, Peru

* Authors to whom correspondence should be addressed.

Appl. Sci. 2025, 15(18), 10202; https://doi.org/10.3390/app151810202

Submission received: 30 June 2025 / Revised: 4 August 2025 / Accepted: 15 August 2025 / Published: 19 September 2025

(This article belongs to the Special Issue AI in Wastewater Treatment)

Abstract

Water quality in Peru is an increasing concern, particularly in the upper Huarmey watershed, which is affected by heavy metal contamination and untreated wastewater. This study proposes an automated classification approach using three supervised machine learning algorithms—K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Random Forest (RF)—to assess the water quality based on the Water Quality Index (WQI) of Peru. The experimental results show that KNN outperforms other methods, reaching an accuracy of 95.2%. The proposed system automates and improves the classification accuracy compared with manual methods based on Microsoft Excel. The methodology, performance metrics, dataset characteristics, and geographical context are detailed to ensure replicability. This algorithm assists decision-makers with environmental monitoring and public health protection.

Keywords: water quality index; physicochemical parameters; machine learning; K-Nearest Neighbors

1. Introduction

Water quality monitoring is essential for public health, ecological integrity, and sustainable resource governance. In Peru, freshwater ecosystems are increasingly threatened by mining, industrial discharge, and insufficient wastewater treatment [1,2].

The National Water Authority of Peru (ANA) utilizes the Water Quality Index (WQI) to synthesize multiple environmental parameters into categorical water quality ratings [3]. However, the current implementation relies on Excel macros, which are time-intensive, prone to human error, and unsuited for real-time monitoring [4].

Supervised machine learning (ML) algorithms have demonstrated strong potential to automate environmental classification, especially under complex, nonlinear conditions [5]. K-Nearest Neighbors (KNN) is effective for structured environmental datasets due to its instance-based, non-parametric learning model [6,7,8].

Therefore, this study aims to develop and validate a machine learning pipeline centered on the KNN algorithm to classify water quality samples from the Huarmey River Basin according to the Peruvian WQI system. The pipeline integrates normalization, imputation, and label encoding, classifies samples into WQI Categories 1-A2 and 3-D2, and is evaluated against the traditional Excel-based approach.

This approach supports scalable, automated environmental monitoring aligned with Peru’s digital transformation agenda. Beyond technical innovation, the proposed system contributes to the digital transformation of environmental monitoring in Peru by providing an open-source, scalable, and automated tool that is deployable on platforms such as Google Colab. This aligns with regional goals for sustainable water management and strengthens institutional capacity for timely environmental decision-making [9].

Previous studies [10,11,12,13,14,15] emphasize that accurate classification requires datasets with sufficient and consistent information per class, as well as careful data cleaning that discards noisy or erroneous records that could distort the classification process.

Figure 1 shows the current process used to classify water quality using the WQI via Excel macros. While effective for basic tasks, this manual approach suffers from limited scalability, high error susceptibility, and long processing times, and its rigid structure is inadequate for modeling complex, nonlinear interactions between water quality parameters. The figure therefore serves as a reference point for assessing the improvements introduced by machine learning methods, and it highlights the need to transition to automated, data-driven systems capable of improving classification accuracy, reducing latency, and enabling real-time environmental monitoring.

Figure 2 presents the proposed flow for water quality classification using the KNN algorithm. The system begins with the collection of physicochemical and microbiological data in CSV format, which enters the processing module. This module applies three stages: normalization of numerical features to the range [0, 1]; imputation of missing values with the per-class mean; and encoding of categorical labels as integers. The processed data feed into the KNN model, which classifies samples according to the WQI of Peru. The results are exported for validation and communication to the competent entities. This architecture automates the process, reducing human error and classification time.

2. Background

The Peruvian Water Quality Index (WQI) provides a standardized method for classifying surface water by aggregating physicochemical and microbiological parameters [3]. However, its current spreadsheet-based implementation limits reproducibility, scalability, and speed—factors that are essential for modern environmental decision-making [4]. The classification ranges defined by the Peruvian WQI, including the corresponding interpretations of water quality status, are summarized in Table 1.

Supervised ML models, such as KNN, Random Forest (RF), and Support Vector Machine (SVM), have shown strong performance in water quality prediction, surpassing traditional rule-based systems in both accuracy and computational efficiency [5,6].

KNN is particularly well-suited for environmental applications due to its ability to classify based on proximity in feature space without requiring distributional assumptions [7,8].

While international studies have applied ML to riverine systems [16], groundwater classification [17], and anomaly detection [18], few implementations exist in Peru. To address this, we propose a KNN-based classifier using official Peruvian WQI datasets. The system emphasizes reproducibility and scalability by leveraging open-source tools and publicly available data for national deployment in the water governance infrastructure.

The WQI of Peru is also linked to a five-color scale to aid interpretation, ranging from green (excellent) to red (very poor). The index is calculated from parameters such as the pH, DO, BOD, and heavy metals (e.g., Pb, Zn, As, Cd), using weighted formulas to normalize the influence of each component [19].

Given the increasing complexity and volume of environmental data, ML models can recognize complex patterns, classify unseen data, and adapt to evolving datasets [6].

Among supervised learning algorithms, KNN is a non-parametric technique that assigns a class based on the majority label among a data point’s closest neighbors. Because new samples are categorized by their similarity to existing data points [7], it remains robust for datasets where class boundaries are not linear or predefined.
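The voting rule described above can be illustrated with a minimal sketch in plain Python. The function name `knn_predict` and the toy samples are assumptions made for illustration; they are not taken from the study's code or from the ANA dataset.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Keep the labels of the k nearest points and take the most common one.
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy, already-normalized samples (hypothetical values, not from the ANA dataset).
train_X = [(0.10, 0.90), (0.15, 0.85), (0.20, 0.80), (0.80, 0.20), (0.85, 0.10)]
train_y = ["1-A2", "1-A2", "1-A2", "3-D2", "3-D2"]

print(knn_predict(train_X, train_y, (0.12, 0.88), k=3))
print(knn_predict(train_X, train_y, (0.90, 0.15), k=3))
```

No model is fitted in advance: every prediction is computed directly from the stored samples, which is what makes the method instance-based.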

KNN also integrates well with cross-validation strategies, reducing the risk of overfitting. Recent studies have demonstrated its superior performance in river water quality classification [8], groundwater classification [17], and spatial anomaly detection [18]. Other models such as RF and SVM also offer competitive accuracy. RF is valued for its ensemble decision tree approach and feature importance interpretability [20]. SVM, though sensitive to kernel selection and parameter tuning, performs well in high-dimensional spaces and is effective when training data is limited [21].

Use cases where ML outperforms traditional methods:

Multivariate water quality prediction [22].
Real-time anomaly detection in contamination events [23].
Seasonal pattern recognition in water ecosystems [16].

By transitioning from rule-based systems to data-driven models, agencies can enhance the reproducibility, accuracy, and timeliness of water quality assessments. This study builds on those findings and proposes a robust KNN-based classifier, validated through real field data from Huarmey’s watershed.

Building on this context, the following section describes the dataset, preprocessing pipeline, and implementation details of the proposed KNN-based classification system.

3. Materials and Methods

This section describes the study area, dataset structure, preprocessing pipeline, model implementation, and validation strategy used for water quality classification using ML techniques. The goal is to achieve reliable classification between water samples suitable for human consumption (1-A2) and livestock supply (3-D2), as defined by the WQI of Peru.

3.1. Study Area and Dataset

The Huarmey River Basin in Ancash, Peru, is subject to continuous monitoring by the ANA. Water quality data were collected during campaigns in 2020, 2021, and 2023, from multiple sampling points across the basin [24].

Figure 3 illustrates the geographical map of the Huarmey River Basin. It shows the main canals, sampling points, irrigation systems, and areas considered vulnerable to contamination.

The map illustrates the Huarmey River Basin; the main and lateral drainage canals (cyan); and the upgraded irrigation systems, including wells and canals (yellow). Sampling locations are distributed across the basin grid, with elevation contours and shallow water table zones marked to indicate areas vulnerable to contamination.

A total of 1204 water quality records were collected across the three monitoring campaigns, including both physicochemical and microbiological parameters:

pH—indicates acidity/alkalinity.
Dissolved oxygen (DO)—essential for aquatic health.
Biochemical oxygen demand (BOD)—measures organic pollutant levels.
Heavy metals—the dataset includes concentrations of lead (Pb), zinc (Zn), arsenic (As), cadmium (Cd), and copper (Cu), which are key indicators of industrial or agricultural contamination under the WQI of Peru standards.
Coliforms—indicate fecal and microbiological pollution.

These samples represent various hydrological and ecological zones across the Huarmey River Basin. Data used in this study are publicly available from the Peruvian ANA at the following link: https://repositorio.ana.gob.pe/handle/20.500.12543/2440 (accessed on 10 June 2025)

Table 2 details the main parameters considered in the classification process, including their type, units, and relevance to the WQI of Peru methodology. Only three metals (Pb, Zn, Cd) appear in the table, because only these had available and consistent records for the years 2021 to 2023.

3.2. Data Preprocessing

Data were preprocessed using Python (v3.10) in Google Colab. The main steps included the following:

Normalization: Min–Max scaling was applied to all numerical variables, including the pH, DO, BOD, and concentrations of heavy metals (Pb, Zn, As, Cd, Cu). This ensured that all parameters were standardized to a [0, 1] range, preventing variables with larger scales from dominating the distance metrics used by the KNN classifier.
Imputation: Missing values were detected in variables such as coliforms, BOD, and heavy metals. These were imputed using the mean value per parameter and per class to preserve the class-specific characteristics and avoid biasing the model.
Label encoding: The WQI target classes were mapped numerically to allow classification. Specifically, 1-A2 (water for human consumption) was encoded as Class 0, and 3-D2 (water for livestock) as Class 1.

Figure 4 shows the data preprocessing applied to the set of water quality samples. From the CSV files with raw data, a processing module is implemented that includes three main stages: normalization, where all numerical variables are scaled to the range [0, 1] to avoid bias in the calculation of distances; imputation, which replaces missing values by stratified mean by class, preserving intra-class characteristics; and coding, in which categorical labels (such as types of water use) are converted to integers to allow their use by classification algorithms. The result is a processed dataset, ready for KNN model training.
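The three preprocessing stages can be sketched with pandas as follows. This is a minimal illustration under stated assumptions: the column names (`pH`, `DO`, `category`) and the four records are hypothetical, and the study's actual code is not reproduced here.

```python
import pandas as pd

# Hypothetical raw records; column names and values are illustrative only.
raw = pd.DataFrame({
    "pH": [7.0, 8.0, None, 6.0],
    "DO": [8.0, None, 4.0, 2.0],
    "category": ["1-A2", "1-A2", "3-D2", "3-D2"],
})
features = ["pH", "DO"]

# Normalization: Min-Max scaling of each numeric column to [0, 1].
# pandas min() and max() skip missing values by default.
col_min = raw[features].min()
col_max = raw[features].max()
processed = raw.copy()
processed[features] = (raw[features] - col_min) / (col_max - col_min)

# Imputation: fill each missing value with the mean of that parameter
# computed within the sample's own class.
processed[features] = processed.groupby("category")[features].transform(
    lambda col: col.fillna(col.mean())
)

# Label encoding: 1-A2 -> 0 and 3-D2 -> 1, as described in the text.
processed["label"] = processed["category"].map({"1-A2": 0, "3-D2": 1})

print(processed)
```

Because class-wise imputation needs the class label, this step applies to labeled training records; a sample whose class is still unknown would need a different fill rule, which the text does not specify.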

3.3. KNN Model Implementation

The selection of KNN, SVM, and RF was based on their proven effectiveness in environmental classification tasks involving multivariate, nonlinear datasets. KNN is favored for its simplicity and robustness with structured data [7,8], SVM excels in high-dimensional spaces with small sample sizes [21], and RF offers strong performance in noisy environments through ensemble learning [20]. These models have consistently outperformed traditional rule-based systems in water quality classification scenarios [6]. Therefore, this study compares the performances of KNN, SVM, and RF for classifying water quality in the Huarmey watershed.

The KNN algorithm was selected as the primary classification method due to its simplicity, interpretability, and proven effectiveness in structured environmental datasets [6]. KNN is a non-parametric, instance-based learning algorithm that assigns class labels based on the majority vote of the k-nearest data points, typically using Euclidean distance.

The implementation was carried out in Python 3.10 using the scikit-learn library within a Google Colab environment. The training data consisted of physicochemical and microbiological parameters defined by the WQI of Peru, including pH; turbidity; conductivity; dissolved oxygen (DO); biochemical oxygen demand (BOD); fecal coliforms; and concentrations of heavy metals, such as lead (Pb), cadmium (Cd), and arsenic (As).

All data were preprocessed through Min–Max normalization, missing value imputation based on class means, and categorical encoding. Subsequent training and evaluation procedures, including hyperparameter tuning and the validation strategy, are described in Section 3.3.1 and Section 3.3.2, respectively.
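A minimal scikit-learn sketch of such a classifier is given below, assuming Min–Max scaling followed by KNN with Euclidean distance and the k = 8 reported later in the text. The three feature columns and the synthetic values are stand-ins chosen for illustration, not the ANA records.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data with three hypothetical columns (pH, DO, BOD).
class_0 = rng.normal(loc=[7.5, 8.0, 2.0], scale=0.3, size=(40, 3))  # encoded 1-A2
class_1 = rng.normal(loc=[6.5, 4.0, 9.0], scale=0.3, size=(40, 3))  # encoded 3-D2
X = np.vstack([class_0, class_1])
y = np.array([0] * 40 + [1] * 40)

# Scaling inside the pipeline means the same transformation is applied
# automatically to any sample passed to predict().
model = make_pipeline(
    MinMaxScaler(),
    KNeighborsClassifier(n_neighbors=8, metric="euclidean"),
)
model.fit(X, y)

new_samples = [[7.4, 7.9, 2.2], [6.4, 4.1, 8.8]]
predictions = model.predict(new_samples)
print(predictions)
```

Wrapping the scaler and the classifier in one pipeline is a design choice of this sketch; it keeps the scaling parameters tied to the training data rather than recomputing them for new samples.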

3.3.1. Hyperparameter Optimization

To enhance the classification performance of the KNN algorithm, hyperparameter tuning was conducted using a grid search approach via the GridSearchCV module from scikit-learn. The primary hyperparameter of interest was the number of neighbors (k), which determines how many surrounding points influence the classification decision of a new instance.

The grid search explored odd values of k ranging from 1 to 15 to avoid tie situations in the majority vote. The evaluation metric used during the search was the F1-score, as it balances both precision and recall, making it especially suitable for datasets with a slight class imbalance. The tuning process was integrated with a 5-fold cross-validation scheme to ensure that the selected k value provided stable performance across all data partitions.

The optimal configuration was found at k = 8, which yielded the highest F1-score on average across folds. This value was used in the final model for all subsequent training and testing procedures.

The dataset used for this study is moderately imbalanced, with a slightly higher number of samples in Category 1-A2 compared with 3-D2. For this reason, the F1-score was selected as a key evaluation metric, as it balances precision and recall, making it more appropriate than accuracy for imbalanced classification problems.

This systematic optimization step ensured that the model achieved a balance between bias and variance, avoiding both overfitting (k too small) and underfitting (k too large), as emphasized by [7,25]. Additionally, ref. [23] highlighted that proper tuning of k is especially critical in water quality datasets characterized by heterogeneous spatial distributions and overlapping class boundaries.
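The tuning procedure can be sketched with GridSearchCV as shown below. The data are synthetic (generated with `make_classification`), and the step names `scale` and `knn` are arbitrary. The text describes a grid of odd k values from 1 to 15 while reporting k = 8; this sketch searches every integer from 1 to 15, which is an assumption made here so that the reported value lies inside the grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, mildly imbalanced binary data standing in for the WQI records.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    weights=[0.6, 0.4], random_state=0,
)

pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("knn", KNeighborsClassifier()),
])

# Assumption: all integers 1..15 are searched (see the note above).
param_grid = {"knn__n_neighbors": list(range(1, 16))}

# F1 is the selection metric and 5-fold cross-validation scores each candidate.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5)
search.fit(X, y)

best_k = search.best_params_["knn__n_neighbors"]
print(best_k, round(search.best_score_, 3))
```

The k selected on synthetic data has no bearing on the value reported in the study; the sketch only shows how the search is wired together.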

3.3.2. Validation Strategy

To ensure the robustness and generalizability of the KNN classifier, a 5-fold cross-validation (CV) approach was employed. This strategy involves partitioning the dataset into five equal subsets (folds). In each iteration, one fold is held out as the validation set, while the remaining four are used for training. This process is repeated five times so that each subset serves as validation exactly once. The final performance metrics are obtained by averaging the results across all folds. The structure of the cross-validation and the evaluation metrics used are shown in Figure 5.

This method was selected to mitigate overfitting and to obtain a more accurate estimate of the model’s performance on unseen data, especially given the moderate sample size and potential for class imbalance.

The evaluation relied on five key performance indicators:

Accuracy: Proportion of correctly classified samples over the total number of samples.
Precision: Proportion of true positives over all predicted positives.
Recall: Proportion of true positives over all actual positives.
F1-score: Harmonic mean of precision and recall; useful when classes are imbalanced.
Coefficient of determination (R²): Although more commonly associated with regression, R² was used here to assess the variance explained by the model in a regression-style interpretation of the class probabilities, as applied in prior environmental studies [26].
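The five indicators listed above can be computed for a binary problem as in the following plain-Python sketch. Two assumptions are made: class 1 is treated as the positive class, and R² is computed on the encoded 0/1 labels against hard predictions, whereas the text refers to class probabilities. The label vectors are invented for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, F1, and R² for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # R² = 1 - SS_res / SS_tot, treating the encoded labels as numeric targets.
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in pairs)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot if ss_tot else float("nan")

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "r2": r2}


# Invented labels: one false negative and one false positive.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```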

Figure 5 presents the cross-validation strategy applied to train and evaluate the KNN model, along with the main performance metrics.

Cross-validation is a widely adopted statistical method for assessing model performance and mitigating overfitting [27,28]. In this study, the 5-fold cross-validation strategy was implemented using the “KFold” class from the scikit-learn library, ensuring that each sample served as both training and validation data across different iterations. This design allowed for a robust performance evaluation while minimizing the risk of model bias linked to specific data partitions.

Beyond classification metrics, the coefficient of determination (R²) was calculated for each fold to assess the variance explained by the model in a regression-style manner—an approach commonly used in environmental classification problems [19].

Additionally, the training data were sourced from validated WQI records collected during field-monitoring campaigns in 2020, 2021, and 2023. These temporal datasets were independently verified using the official Peruvian WQI guidelines, further reinforcing the robustness and representativeness of the model evaluation.

As shown in Figure 5, this validation framework helped ensure that the model’s performance was not dependent on a single data slice and supported the generalization of KNN predictions to unseen water quality scenarios.
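The fold rotation described in this subsection can be sketched with the KFold class as follows. The data are synthetic, and shuffling with a fixed seed is an assumption of this sketch; the text does not state whether the folds were shuffled.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data standing in for the preprocessed WQI records.
X, y = make_classification(n_samples=200, n_features=6, random_state=1)

kfold = KFold(n_splits=5, shuffle=True, random_state=1)  # shuffle is an assumption
fold_scores = []
validated = []

for train_idx, val_idx in kfold.split(X):
    # Train on four folds and validate on the held-out fold.
    model = KNeighborsClassifier(n_neighbors=8)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(f1_score(y[val_idx], model.predict(X[val_idx])))
    validated.extend(val_idx.tolist())

# The reported figure is the average across the five folds.
mean_f1 = float(np.mean(fold_scores))
print([round(s, 3) for s in fold_scores], round(mean_f1, 3))
```

Collecting the validation indices makes it easy to confirm that every sample is held out exactly once across the five iterations.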

3.4. System Implementation and Workflow

The water quality classification system developed in this study follows an end-to-end modular architecture comprising five interconnected components. The system was implemented in Python 3.10 within the Google Colab environment, leveraging key libraries such as pandas, NumPy, scikit-learn, and matplotlib for data ingestion, processing, modeling, and visualization. This modular approach promotes transparency, reproducibility, and scalability, facilitating integration with broader environmental data systems.

The system workflow includes the following components:

1. Data Input Module

Responsible for importing structured water quality data from the Huarmey watershed, this module handles files in the CSV or Excel (XLSX) format. Each record includes physicochemical and microbiological parameters, such as pH, dissolved oxygen (DO), biochemical oxygen demand (BOD), turbidity, and concentrations of heavy metals (Pb, Zn, Cd, As), as reported by the Peruvian ANA.

2. Preprocessing Module

Data preprocessing comprises three main operations:

Normalization: All numeric variables are scaled to a [0, 1] range using Min–Max normalization to standardize the magnitude.
Missing value imputation: Missing entries are filled using the mean of the corresponding variable, stratified by class.
Label encoding: Target classes (e.g., 1-A2, 3-D2) are converted into binary labels to meet model input requirements.

3. Training Module

The KNN algorithm was applied with hyperparameter optimization (detailed in Section 3.3.1). A 5-fold cross-validation scheme was used to select the optimal k, which was determined to be 8.

4. Classification Module

This module applies the trained model to new samples, assigning class labels based on the majority vote of the k nearest neighbors under the Euclidean distance metric.

5. Visualization Module

The final component is responsible for generating visual outputs that support model interpretation and stakeholder reporting. These outputs include bar plots of classification results, heatmaps representing feature importance, and time-series trends for key parameters.

The entire classification workflow, including all system modules, is illustrated in Figure 6. This architecture is designed for future integration into municipal dashboards or national water monitoring platforms. Its deployment in a cloud-based environment ensures accessibility and scalability, particularly in resource-constrained settings.

Table 3 compares the performance of the Excel macro method and the KNN model, highlighting improvements in the accuracy, speed, and scalability.

3.5. Results of Cross-Validation

To evaluate the robustness and generalization capacity of the KNN model, a 5-fold cross-validation procedure was applied across the full dataset. This strategy divides the data into five equal parts, ensuring that each subset is used once for validation while the others are used for training, allowing the model to be assessed under different data configurations.

Figure 7 shows the coefficient of determination (R²) obtained in each of the five folds of the cross-validation applied to the KNN model. Predictive performance is highly consistent, with R² values greater than 0.99 in all folds and values of 1.00 in folds 2 and 3, indicating perfect agreement between the predictions and the actual values in those folds. These results reflect the reliability and generalizability of the model across different partitions of the dataset and support its robustness for classification under the Water Quality Index (WQI) in real environmental contexts. They are particularly significant given the environmental variability and multivariable nature of the dataset, which included both physicochemical and microbiological indicators.

The high stability observed across all the folds is also supported by the volume and quality of the training data: the model was trained on more than 8000 validated samples gathered from multi-year monitoring campaigns. This sample size provided sufficient representation of class distributions and environmental conditions, contributing to the strong statistical performance and resilience of the KNN classifier.

Overall, these validation results strengthen the case for deploying the model in real-world settings, where consistent behavior across changing data scenarios is essential.

3.6. Comparison with Traditional WQI Method

To evaluate the improvement provided by the ML approach, a direct comparison was made between the KNN classifier and the traditional Excel macros used by the ANA. Table 4 compares the classification efficiency of the traditional Excel macro method and the ML-based approach using the KNN model, highlighting improvements in the accuracy and processing time.

These results demonstrate a significant improvement in both the processing time and classification reliability. The classification training structure for both categories is presented in Figure 8, while the performance indicators are summarized in Figure 9.

Figure 9 presents the performance metrics obtained by the KNN model, highlighting its high predictive accuracy in the classification of water quality samples.

4. Results

This section presents the key outcomes of implementing the KNN algorithm for classifying water quality samples from the Huarmey River Basin. The results are analyzed in terms of performance metrics, temporal generalization, validation stability, and comparative accuracy against the traditional WQI classification method.

For model development, the dataset was divided into three subsets: 70% was used for training, 15% for validation (used during hyperparameter tuning), and 15% for final testing. This approach ensured a fair evaluation of model generalization and reduced the risk of overfitting.
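A 70/15/15 partition of this kind can be obtained with two successive calls to `train_test_split`, as in the sketch below. The stratification by class and the fixed seed are assumptions of this sketch, since the text does not say how the split was drawn, and the 100 placeholder samples are invented.

```python
from sklearn.model_selection import train_test_split

# Placeholder samples: 100 record identifiers with a 60/40 class balance.
X = list(range(100))
y = [0] * 60 + [1] * 40

# First split: 70% training, 30% held back.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Second split: halve the held-back part into validation and test (15% each).
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

Stratifying keeps the class proportions similar in all three subsets, which matters when one category has fewer samples than the other.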

4.1. Classification Accuracy and Performance Metrics

The optimized KNN model demonstrated a consistent and high classification accuracy across both water quality categories: 1-A2 (human consumption) and 3-D2 (livestock consumption). The results are summarized in Table 5.

As shown in Table 5, the KNN model outperformed the traditional Excel method in all evaluated aspects. Notably, the F1-score improved by 15 percentage points, and the processing time decreased from several hours to just a few minutes.

Figure 10 compares the time required for classification and the F1-score achieved by the KNN model versus the traditional Excel macro method, emphasizing the efficiency and accuracy gains.

4.2. Cross-Validation and Model Stability

To ensure model generalizability and prevent overfitting, a 5-fold cross-validation strategy was implemented. The results demonstrate high stability:

Without cross-validation: High variance in predictions across different periods.
With cross-validation: Accuracy consistently remained within 89–91%, confirming robustness.

The optimal number of neighbors (k = 8) was selected using GridSearchCV, maximizing the classification performance across folds [23]. Table 6 presents the F1-score values obtained during the five-fold cross-validation process, demonstrating the stability and consistency of the KNN model across different data partitions.

These findings confirm that the model performs well regardless of which year’s data is used for testing (2020, 2021, or 2023), reinforcing its reliability for multiyear deployment.

4.3. Analysis of 2020 Dataset—Classification Output

To further evaluate the robustness of the classification models, a detailed analysis was performed on the 2020 physicochemical dataset for water samples classified under Category 3-D2. The evaluation was carried out using the same three ML models: SVM, RF, and KNN.

All models consistently classified the selected sample as “Regular” (WQI = 47.82), matching the official manual classification by the Peruvian ANA. This consistency across both automated and manual methods reinforces the models’ reliability and confirms their alignment with established environmental assessment practices. Figure 11 provides a detailed multivariable analysis of water quality indicators for Category 3-D2 using the 2020 dataset. It visualizes key physicochemical and biological parameters across nine sampling locations, supporting the model’s ability to detect local contamination patterns.

This multivariable visualization confirms the model’s ability to classify samples under diverse chemical scenarios, reinforcing its utility in detecting spatial anomalies and supporting real-time, data-driven environmental decision-making.

The KNN, SVM, and RF models correctly classified a Category 3-D2 sample with a WQI of 47.82, matching the official ANA classification. This consistency across multiple models confirms the system’s reliability and its alignment with national environmental standards for water quality classification.

4.4. Comparative Evaluation of Machine Learning Models

To validate the performance of the proposed KNN classification model, a comparative analysis was conducted using two additional ML algorithms: SVM and RF. All models were evaluated using the same preprocessed datasets and underwent 5-fold cross-validation to ensure consistency and robustness in the results.

Table 7 summarizes the key performance metrics of the three models. The KNN algorithm achieved the highest accuracy (95.2%), precision (0.96), and F1-score (0.94), outperforming SVM and RF. These results are consistent with findings reported by Tahraoui et al. (2023) [8], who demonstrated KNN’s efficiency in water quality applications when paired with proper hyperparameter tuning and normalization strategies. Similarly, Nasir et al. (2022b) [7] highlighted the model’s reliability in multivariable environments, where the decision boundaries are well-defined.

On the other hand, SVM achieved slightly lower scores, which may be attributed to its sensitivity to kernel parameter settings in highly nonlinear datasets, as also observed by [30]. RF demonstrated competitive results, particularly in recall, confirming its strength in ensemble-based classification tasks involving noisy data.

These results support the selection of KNN as the optimal model for the Huarmey water quality classification problem, given its strong balance between precision and computational feasibility in medium-scale datasets.

4.5. Comparative Evaluation with Traditional Method

The classification approach proposed in this study represents a significant advancement over the current methodology used by the Peruvian ANA, which relies on manually programmed Excel macros. While functional for small-scale applications, the traditional approach presents several limitations in terms of scalability, automation, and error control.

Table 8 highlights key comparative aspects between the Excel-based method and the KNN model developed in this study. Notably, the KNN implementation allows for flexible input formats (e.g., CSV, XLSX), automatic classification without user intervention, and integration potential with external systems, such as APIs or IoT-based monitoring tools. In contrast, Excel macros require strict formatting and manual input, which increases the likelihood of user error and delays.

From an operational perspective, the KNN model reduces the classification time from hours to under five minutes, enabling real-time decision support in environmental monitoring systems.

4.6. Strategic Impact of KNN in Environmental Monitoring

The implementation of the KNN model as the core classification engine presents significant advantages for modern environmental monitoring systems. Among its key contributions are the following:

Real-time alert generation in scenarios involving critical contamination, enabling authorities to respond proactively to health and ecological risks.
Scalability at the national level, facilitating integration with Internet of Things (IoT) networks and remote water quality sensors.
Incremental learning capacity, allowing the model to be updated continuously as new data becomes available, thereby improving the accuracy and adaptability over time.

Moreover, reducing the processing time from hours to minutes enables faster and more consistent reporting and decision-making by environmental agencies, aligning with the growing demand for digital transformation in public sector monitoring [29,31].

5. Discussion

The implementation of ML models for water quality classification represents a significant advance in the modernization of environmental monitoring systems in Peru. In this study, the KNN algorithm demonstrated a robust capacity for classifying water samples according to the WQI of Peru, outperforming traditional Excel-based methods in both accuracy and efficiency.

5.1. Interpretation of Results

The KNN model achieved an F1-score of 94%, surpassing the 75% score of the Excel macro-based classification. The high R² values across all cross-validation folds (ranging from 0.99 to 1.00) confirmed the stability and generalization capabilities of the model across diverse annual datasets. These results align with similar findings from prior studies in environmental classification using KNN [7,8].
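As a consistency check, the F1-score is the harmonic mean of precision and recall. The short sketch below applies that formula to the precision and recall values reported in Table 7; it assumes those values are aggregate figures for each model, and the function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in Table 7.
reported = {
    "KNN": (0.96, 0.93),
    "Random Forest": (0.91, 0.93),
    "SVM": (0.88, 0.90),
}

f1_by_model = {name: round(f1_score(p, r), 2) for name, (p, r) in reported.items()}
```

Under this assumption, the computed values (0.94, 0.92, and 0.89) match the F1-scores listed in Table 7.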

Furthermore, the model’s consistent performance across data from 2020, 2021, and 2023 suggests that it can be deployed in longitudinal monitoring systems. The processing time reduction—from hours to minutes—makes it viable for real-time water quality assessment and decision-making.

Similar results were reported by Tahraoui et al. [8], who applied a KNN model to dry residue prediction and obtained F1-scores above 0.94 when proper normalization and optimization were used. Likewise, Nasir et al. [7] highlighted the adaptability of KNN for multivariable water quality classification with imbalanced datasets. Compared with the F1-scores of 0.85 to 0.89 reported in [7], the present study achieved a higher performance (0.94), likely due to the use of class-based mean imputation and grid search optimization. These findings confirm the effectiveness of the proposed pipeline and reinforce its contribution to existing knowledge in the domain of environmental monitoring with ML.
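Class-based mean imputation, mentioned above, can be sketched as follows: each missing value is replaced by the mean of the same parameter among samples that share the row's class label. This is a minimal illustration rather than the code used in this study; the function name `impute_by_class_mean`, the use of `None` for missing values, and the toy data are assumptions.

```python
from collections import defaultdict


def impute_by_class_mean(rows, labels):
    """Replace None entries with the mean of that feature within the row's class."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            if value is not None:
                sums[label][j] += value
                counts[label][j] += 1

    imputed = []
    for row, label in zip(rows, labels):
        new_row = []
        for j, value in enumerate(row):
            if value is None:
                if counts[label][j] == 0:
                    raise ValueError(f"no observed value for feature {j} in class {label!r}")
                value = sums[label][j] / counts[label][j]
            new_row.append(value)
        imputed.append(new_row)
    return imputed


# Hypothetical toy data: columns are (pH, dissolved oxygen in mg/L).
rows = [[7.0, 8.0], [7.4, None], [6.0, 4.0], [None, 5.0]]
labels = ["Good", "Good", "Poor", "Poor"]
filled = impute_by_class_mean(rows, labels)
```

Since this strategy relies on class labels, it applies to labeled training data only; unlabeled samples arriving at prediction time would need a different rule, such as the overall training mean.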

5.2. Comparative Advantages over Traditional Methods

Traditional manual classification systems, although functional, present several drawbacks:

High processing time due to manual calculations.
Human error risk in data entry and formula handling.
Low scalability for large datasets.

By contrast, the KNN approach is scalable, reproducible, and suitable for deployment in cloud platforms like Google Colab, making it a cost-effective tool for public agencies. Additionally, the ability to adapt to new datasets allows the model to improve over time, enabling predictive updates and re-training.

These advantages align with recent international literature supporting the use of AI in environmental governance and water security. Ref. [29] showed that ML models optimized via grid search significantly improve real-time water quality prediction. Ref. [18] applied ML to groundwater assessment, enhancing decision-making in irrigation planning. These studies confirm AI’s growing role in water monitoring, offering accurate, scalable, and interpretable solutions.

5.3. Strengths and Limitations of the KNN Algorithm

Although the KNN model demonstrated superior classification accuracy and robustness, it is important to critically evaluate both its strengths and weaknesses. One of the primary advantages of KNN is its non-parametric and intuitive nature. Unlike models that require explicit training or functional assumptions, KNN adapts directly to data structures by evaluating distances between instances. This makes it highly effective in multivariate, real-world environmental datasets, where decision boundaries are often irregular. As noted by [7], KNN performed consistently across diverse water quality scenarios, demonstrating resilience and generalizability when appropriately tuned.
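The distance-based decision rule described above can be sketched in a few lines. The example assumes Euclidean distance and simple majority voting, which are common defaults but are not confirmed in this section as the exact configuration used; the function name `knn_predict` and the toy feature values are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training samples."""
    # One distance per stored sample: the cost of a query grows with the training set.
    distances = [(math.dist(x, query), y) for x, y in zip(train_x, train_y)]
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, already normalized features (e.g., scaled pH and dissolved oxygen).
train_x = [(0.9, 0.8), (0.8, 0.9), (0.85, 0.85), (0.2, 0.1), (0.1, 0.3), (0.15, 0.2)]
train_y = ["Good", "Good", "Good", "Poor", "Poor", "Poor"]

prediction_high = knn_predict(train_x, train_y, (0.8, 0.8), k=3)
prediction_low = knn_predict(train_x, train_y, (0.2, 0.2), k=3)
```

No model parameters are fitted here: the stored samples are the model, which is why the method adapts to irregular decision boundaries but must keep the full training set available at prediction time.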

Another key benefit is KNN’s simplicity and transparency, which aligns well with public agency needs for traceable and interpretable models in water governance [8].

Although KNN demonstrated high accuracy in this study, it presents limitations, particularly when applied to large-scale datasets. Because a brute-force search compares each query with every stored sample, its inference cost scales linearly with the number of training instances, which limits speed and scalability. Furthermore, KNN is sensitive to noisy or irrelevant features that can distort distance metrics, reducing the classification performance. As [25] highlighted, effective use of KNN in hydrological applications requires careful feature selection and dimensionality reduction. To address these challenges, this study implemented Min–Max normalization, class-based imputation, and hyperparameter optimization via GridSearchCV. While these strategies enhanced the performance, future deployments should explore approximate nearest-neighbor search or spatial index structures (e.g., KD-trees) to improve the computational efficiency and maintain robustness in high-dimensional, real-world environmental monitoring scenarios.
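The combination of Min–Max normalization and a cross-validated search over k can be illustrated with a dependency-free sketch. It is a simplified stand-in for scikit-learn's GridSearchCV, not the pipeline used in this study: the helper names, the interleaved fold assignment, accuracy as the selection metric, the candidate values of k, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def min_max_scale(rows):
    """Rescale each column to [0, 1]; constant columns map to 0.0."""
    lows = [min(col) for col in zip(*rows)]
    highs = [max(col) for col in zip(*rows)]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    nearest = sorted(zip(train_x, train_y), key=lambda s: math.dist(s[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cv_accuracy(x, y, k, n_folds):
    """Mean accuracy of k-NN over interleaved folds (fold i holds indices i, i+n, ...)."""
    scores = []
    for fold in range(n_folds):
        test_idx = set(range(fold, len(x), n_folds))
        train_x = [x[i] for i in range(len(x)) if i not in test_idx]
        train_y = [y[i] for i in range(len(x)) if i not in test_idx]
        hits = sum(knn_predict(train_x, train_y, x[i], k) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / n_folds


def grid_search_k(x, y, candidates, n_folds=5):
    """Return the candidate k with the best cross-validated accuracy."""
    results = {k: cv_accuracy(x, y, k, n_folds) for k in candidates}
    best_k = max(results, key=results.get)
    return best_k, results


# Hypothetical raw data: (conductivity in uS/cm, pH); labels are illustrative.
raw = [(150, 7.1), (160, 7.3), (170, 7.0), (155, 7.2), (165, 7.4),
       (900, 5.9), (950, 6.1), (880, 6.0), (920, 5.8), (940, 6.2)]
labels = ["Good"] * 5 + ["Poor"] * 5

scaled = min_max_scale(raw)
best_k, results = grid_search_k(scaled, labels, candidates=[1, 3, 5])
```

Without scaling, the conductivity column would dominate the Euclidean distance simply because its numeric range is hundreds of times larger than that of pH. For brevity, the sketch scales the full dataset before splitting; a stricter protocol would compute the scaling bounds on the training folds only, so that no information from the held-out fold reaches the model.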

5.4. Limitations of the Study

Despite its promising results, several limitations should be acknowledged:

Imbalanced datasets: Although the model performed well, additional balancing strategies (e.g., SMOTE) could be tested.
Memory requirements: KNN requires full dataset storage in memory during inference, which may not scale well to millions of samples without optimization.
Single-basin scope: The findings are specific to the Huarmey watershed; generalization to other basins would require retraining.

These limitations are shared by other supervised learning applications in environmental data science [26,27].

5.5. Practical Implications

The successful implementation of KNN for WQI classification provides a scalable and transparent alternative to manual systems. Its integration into local governments’ workflows could allow the following:

Reducing delays in contamination alerts.
Enabling large-scale data processing from national monitoring systems.
Supporting early warning systems in high-risk areas.

Furthermore, visualization tools incorporated into the Colab prototype can be extended for public use or integrated into municipal dashboards for real-time reporting.

5.6. Future Work

Future improvements should focus on the following:

Expanding model testing to other river basins and regions.
Comparing KNN with deep learning methods (e.g., CNNs, LSTM) for time-series forecasting.
Incorporating geospatial data and satellite-based monitoring.
Evaluating the model’s performance under seasonal and climate variability conditions.

Such efforts will further validate the model’s applicability and guide policymakers in developing digital solutions for environmental resource management.

6. Conclusions

This study demonstrates that ML, particularly the KNN algorithm, offers a reliable and scalable alternative to traditional methods for classifying water quality under the WQI of Peru standard. Using real-world datasets from the Huarmey River Basin across three monitoring years (2020, 2021, and 2023), the KNN model consistently achieved a high classification performance, with F-scores reaching 94% and R² values exceeding 0.99 across validation folds.

These results highlight the potential of automated tools like KNN to enhance national water monitoring systems.

The results affirm that data-driven models not only enhance operational efficiency but also strengthen the ability of environmental agencies to monitor water quality in real time. Additionally, the integration of the KNN system into open platforms like Google Colab ensures accessibility and encourages the adoption of replicable tools in government workflows.

Despite the success of the model, limitations remain. These include the need for balanced datasets, computational efficiency for larger-scale implementations, and testing in other hydrological contexts. However, these are addressable through future research.

The model achieved perfect R² scores (1.00) in two of the five validation folds, indicating exceptional predictive alignment with actual values. This result reinforces the model’s stability and robustness across varying data partitions, supporting its deployment in real-world, multiyear environmental monitoring scenarios.

In conclusion, the application of ML to water quality classification represents a forward-thinking step in Peru’s digital transformation of environmental governance. With further development and validation, such systems can play a critical role in ensuring safe water access, timely contamination responses, and sustainable management of natural resources across the country and beyond.

Author Contributions

Conceptualization, H.V.-H. and J.P.-L.; methodology, H.V.-H., J.P.-L., and P.D.-l.-C.-V.; software, J.P.-L. and G.L.E.M.-N.; validation, D.C. and M.E.R.-C.; formal analysis, H.V.-H., A.C.-F., L.G.-G., and R.G.-C.; investigation, J.P.-L.; resources, G.L.E.M.-N. and O.B.-P.; data curation, J.P.-L.; writing—original draft preparation, J.P.-L. and H.V.-H.; writing—review and editing, G.L.E.M.-N. and D.C.; supervision, H.V.-H.; funding acquisition, H.V.-H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Universidad Nacional Mayor de San Marcos—RR_005446-2025-R and Project number C25202481—Project type PCONFIGI 2025.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Acknowledgments

We extend our gratitude to the ANA for providing access to critical water quality data. Special thanks to the academic advisors and research team for their invaluable guidance throughout the project.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Organización Mundial de la Salud. Water, Sanitation and Hygiene Links to Health: Facts and Figures; WHO: Geneva, Switzerland, 2004; Available online: https://iris.who.int/handle/10665/69489 (accessed on 7 June 2025).
2. La República. Áncash: Moradores de Ocho Centros Poblados de Huarmey Consumen Agua sin Cloración. La República, 17 November 2021, pp. 1–12. Available online: https://larepublica.pe/sociedad/2021/11/17/ancash-agua-sin-cloracion-consumen-pobladores-de-ocho-centros-poblados-de-huarmey-lrnd (accessed on 7 June 2025).
3. Autoridad Nacional del Agua. Metodología para la Determinación del Índice de Calidad de Agua Ica-PE, Aplicado a los Cuerpos de Agua Continentales Superficiales; Autoridad Nacional del Agua: Lima, Peru, 2018; pp. 1–55. Available online: https://hdl.handle.net/20.500.12543/2440 (accessed on 7 June 2025).
4. Castillo Suarez, D. Diseño del Sistema de Abastecimiento de Agua Potable para la Mejora de la Condición Sanitaria del Caserío Molinopampa, Distrito de Malvas, Provincia De Huarmey, Región Ancash—2020. Bachelor’s Thesis, Universidad Católica Los Ángeles de Chimbote, Chimbote, Peru, 2020.
5. Dritsas, E.; Trigka, M. Efficient Data-Driven Machine Learning Models for Water Quality Prediction. Computation 2023, 11, 16.
6. Zounemat-Kermani, M.; Batelaan, O.; Fadaee, M.; Hinkelmann, R. Ensemble machine learning paradigms in hydrology: A review. J. Hydrol. 2021, 598, 126266.
7. Nasir, N.; Kansal, A.; Alshaltone, O.; Barneih, F.; Sameer, M.; Shanableh, A.; Al-Shamma’a, A. Water quality classification using machine learning algorithms. J. Water Process Eng. 2022, 48, 102920.
8. Tahraoui, H.; Toumi, S.; Hassein-Bey, A.H.; Bousselma, A.; Sid, A.N.E.H.; Belhadj, A.E.; Triki, Z.; Kebir, M.; Amrane, A.; Zhang, J.; et al. Advancing Water Quality Research: K-Nearest Neighbor Coupled with the Improved Grey Wolf Optimizer Algorithm Model Unveils New Possibilities for Dry Residue Prediction. Water 2023, 15, 2631.
9. Gestión Sostenible del Agua. Sustainable Water Management. 2024. Available online: https://www.agry.purdue.edu/hydrology/projects/nexus-swm/es/Tools/WaterQualityCalculator.php (accessed on 7 June 2025).
10. López-Córdova, F.; Vega-Huerta, H.; Maquen-Niño, G.; Cáceres-Pizarro, J.; Adrianzén-Olano, I.; Benito-Pacheco, O. Construction of a New Data Set of Pleural Fluid Cytological Images for Research. Int. J. Online Biomed. Eng. 2025, 21, 138–151.
11. Vega-Huerta, H.; Pantoja-Pimentel, K.; Quintanilla Jaimes, S.; Maquen-Niño, G.; De-La-Cruz-VdV, P.; Guerra-Grados, L. Classification of Alzheimer’s Disease Based on Deep Learning Using Medical Images. Int. J. Online Biomed. Eng. 2024, 20, 101–114.
12. Vega-Huerta, H.; Villanueva-Alarcón, R.; Mauricio, D.; Moreno, J.G.; Vilca, H.D.C.; Rodriguez, D.; Rodriguez, C. Convolutional neural networks on assembling classification models to detect melanoma skin cancer. Int. J. Online Biomed. Eng. 2020, 18, 59–76.
13. Yauri, J.; Lagos, M.; Vega-huerta, H.; De-la-Cruz-VdV, P.; Maquen-niño, G.L.E.; Condor-tinoco, E. Detection of Epileptic Seizures Based-on Channel Fusion and Transformer Network in EEG Recordings. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 1067–1074.
14. Vega-Huerta, H.; Rivera-Obregón, M.; Maquen-Niño, G.L.E.; De-la-Cruz-VdV, P.; Lázaro-Guillermo, J.C.; Pantoja-Collantes, J.; Cámara-Figueroa, A. Classification Model of Skin Cancer Using Convolutional Neural Network. Ing. Des Syst. D′Inf. 2025, 30, 387–394.
15. De-la-Cruz-VdV, P.; Cadenillas-Rivera, D.; Vega-Huerta, H. Diagnosis of Brain Tumors Using a Convolutional Neural Network. In Perspectives and Trends in Education and Technology, Proceedings of the ICITED 2023, Manaus, Brazil, 29–30 June 2023; Springer: Berlin/Heidelberg, Germany, 2023.
16. Wang, L.; Li, Y.; Li, J.; Tan, L.; Rizo, E.Z.; Han, B.P. The seasonal patterns of aquatic insect composition and diversity in a disturbed (sub)tropical river revealed by linear and nonlinear approaches with occurrence and abundance data. Hydrobiologia 2023, 850, 3949–3963.
17. Aish, A.M.; Zaqoot, H.A.; Sethar, W.A.; Aish, D.A. Prediction of groundwater quality index in the Gaza coastal aquifer using supervised machine learning techniques. Water Pract. Technol. 2023, 18, 501–521.
18. Hussein, E.E.; Derdour, A.; Zerouali, B.; Almaliki, A.; Wong, Y.J.; los Santos, M.B.; Ngoc, P.M.; Hashim, M.A.; Elbeltagi, A. Groundwater Quality Assessment and Irrigation Water Quality Index Prediction Using Machine Learning Algorithms. Water 2024, 16, 264.
19. Ochoa, L.L. Evaluation of Classification Algorithms using Cross Validation. In Proceedings of the 17th LACCEI International Multi-Conference for Engineering, Education and Technology: “Industry, Innovation, and Infrastructure for Sustainable Cities and Communities”, Montego Bay, Jamaica, 24–26 July 2019; pp. 24–26.
20. Kanyama, M.N.; Shava, F.B.; Gamundani, A.M.; Hartmann, A. Machine learning applications for anomaly detection in Smart Water Metering Networks: A systematic review. Phys. Chem. Earth 2024, 134, 103558.
21. Chen, P. Unlocking policy effects: Water resources management plans and urban water pollution. J. Environ. Manag. 2024, 365, 121642.
22. Wang, Y.; Ho, I.W.H.; Chen, Y.; Wang, Y.; Lin, Y. Real-Time Water Quality Monitoring and Estimation in AIoT for Freshwater Biodiversity Conservation. IEEE Internet Things J. 2022, 9, 14366–14374.
23. Ni, Q.; Cao, X.; Zhao, Z.; Yuan, J.; Tan, C. An unsupervised water quality anomaly detection method based on a combination of time-frequency analysis and clustering. Environ. Sci. Pollut. Res. 2024, 31, 15920–15931.
24. Autoridad Nacional del Agua; Comité de Monitoreo-Vigilancia y Fiscalización Ambiental de Huarmey. Monitoreo de parametros cuenca Huarmey 2023. Available online: https://www.comitedemonitoreohuarmey.com/ (accessed on 31 May 2025).
25. Dong, L.; Zuo, X.; Xiong, Y. Prediction of hydrological and water quality data based on granular-ball rough set and k-nearest neighbor analysis. PLoS ONE 2024, 19, e0298664.
26. Elmeddahi, Y.; Ragab, R. Prediction of the groundwater quality index through machine learning in Western Middle Cheliff plain in North Algeria. Acta Geophys. 2022, 70, 1797–1814.
27. Fernández-Hernández, J.L.; Herranz-Hernández, P.; Segovia-Torres, L. Validación cruzada sobre una misma muestra: Una práctica sin fundamento. REMA Rev. Electrón. Metodol. Apl. 2022, 24, 38–40.
28. López Lozano, L.; Palazón Bru, I.; Palazón Bru, A.; Arroyo Fernández, M.; González-Estecha, M. Procedimiento de validación de un método para cuantificar cobalto en suero por espectroscopia de absorción atómica con atomización electrotérmica. Rev. Lab. Clin. 2015, 8, 46–51.
29. Shams, M.Y.; Elshewey, A.M.; El-kenawy, E.S.M.; Ibrahim, A.; Talaat, F.M.; Tarek, Z. Water quality prediction using machine learning models based on grid search method. Multimed. Tools Appl. 2024, 83, 35307–35334.
30. Uddin, M.G.; Nash, S.; Rahman, A.; Olbert, A.I. Performance analysis of the water quality index model for predicting water state using machine learning techniques. Process Saf. Environ. Prot. 2023, 169, 808–828.
31. Seyedmohammadi, J.; Zeinadini, A.; Navidi, M.N.; McDowell, R.W. A new robust hybrid model based on support vector machine and firefly meta-heuristic algorithm to predict pistachio yields and select effective soil variables. Ecol. Inform. 2023, 74, 102002.

Figure 1. Current process for classifying water quality using WQI with Excel macro techniques.

Figure 2. Proposed process for water quality classification using WQI and KNN.

Figure 3. Geographic map of the Huarmey River Basin illustrating key hydrological and infrastructure components relevant to water management; sampling locations correspond to work campaigns conducted by the ANA in 2020, 2021, and 2023.

Figure 4. Preprocessing pipeline, including normalization, imputation, and encoding.

Figure 5. Cross-validation structure and scoring metrics used.

Figure 6. System workflow for KNN-based classification.

Figure 7. R² scores by fold in KNN validation.

Figure 8. Training architecture for Categories 1-A2 and 3-D2 classifications.

Figure 9. Performance metrics of KNN model showing high predictive accuracy [29].

Figure 10. Comparison of classification time and F-score between KNN and Excel macro method.

Figure 11. Multivariable classification of water quality indicators for Category 3-D2 using the 2020 dataset. This figure illustrates the behavior of key physicochemical and biological parameters across nine sampling locations during 2020. Panel (a) presents the conductivity, coliforms, and aluminum (Al), where the conductivity peaks at samples 3 and 5 suggest possible ionic contaminant discharges; the coliform levels rise sharply at sample 3, indicating localized microbial pollution; and aluminum concentrations remain consistently low. Panel (b) shows the biochemical oxygen demand (BOD), manganese (Mn), and zinc (Zn); the BOD increases significantly at sample 6, implying high organic matter content, while Mn and Zn maintain relatively stable trends with minor elevations at samples 3 and 5. Panel (c) displays the pH, dissolved oxygen (DO), and copper (Cu), highlighting stable pH values, moderate DO fluctuations, and consistently low copper concentrations. Finally, panel (d) illustrates the levels of arsenic (As), cadmium (Cd), mercury (Hg), and lead (Pb), where sample 3 registers elevated Cd and Pb values—potentially indicating industrial or mining-related contamination—while As and Hg remain within safe regulatory limits.

Table 1. Categorization of water quality according to the WQI of Peru.

WQI of Peru	Rating	Interpretation
100–90	Excellent	The water quality is protected with no threats or damage. Conditions are very close to natural or desirable levels.
89–75	Good	The water quality deviates slightly from its natural state. However, desirable conditions may be affected by minor threats or damage.
74–45	Regular	Natural water quality is occasionally threatened or degraded. Water quality often deviates from desirable values. Many uses require treatment.
44–30	Poor	The water quality does not meet quality objectives, and desirable conditions are frequently threatened or degraded. Many uses require treatment.
29–0	Very Poor	The water quality does not meet quality objectives, is almost always threatened or degraded, and all uses require prior treatment.
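The rating bands in Table 1 amount to a threshold lookup, which can be sketched as follows. The function name `wqi_rating` is chosen here for illustration, and the treatment of non-integer scores (assigned to the band whose lower bound they reach) is an assumption, since the table lists integer ranges only.

```python
# Lower bounds of each rating band, taken from Table 1.
WQI_BANDS = [
    (90, "Excellent"),
    (75, "Good"),
    (45, "Regular"),
    (30, "Poor"),
    (0, "Very Poor"),
]


def wqi_rating(score):
    """Map a WQI score in [0, 100] to its Table 1 rating."""
    if not 0 <= score <= 100:
        raise ValueError("WQI score must lie between 0 and 100")
    for lower_bound, rating in WQI_BANDS:
        if score >= lower_bound:
            return rating


ratings = [wqi_rating(s) for s in (95, 89, 74.5, 44, 10)]
```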

Table 2. Key attributes used for WQI classification and their relevance.

Parameter	Type	Unit	Relevance to WQI of Peru
pH	Physicochemical	-	Indicates water acidity/alkalinity
Dissolved Oxygen (DO)	Physicochemical	mg/L	Essential for aquatic life
Biochemical Oxygen Demand (BOD)	Chemical	mg/L	Indicates organic pollution
Coliforms	Microbiological	MPN/100 mL	Indicates fecal contamination
Lead (Pb)	Chemical—Heavy Metal	mg/L	Industrial/agricultural contamination indicator
Zinc (Zn)	Chemical—Heavy Metal	mg/L	Trace element affecting water quality
Cadmium (Cd)	Chemical—Heavy Metal	mg/L	Highly toxic at low concentrations

Table 3. Excel vs. KNN: Processing performance and classification efficiency.

Metric	Excel Macros	KNN Classification
Accuracy (F-Score)	75%	94%
Processing Time	Hours	Minutes
Error Margin	High	Low
Scalability	Limited	High
Human Intervention	Required	None (Post-training)

Table 4. WQI classification efficiency: Traditional vs. ML-based methods.

Method	Processing Time	Accuracy (F1-Score)
Excel Macros	Hours	75%
KNN (k = 8)	Minutes	94%

Table 5. Performance comparison between the KNN model and the traditional Excel-based method (manual WQI).

Metric	Excel Macros	KNN Model (k = 8)
F-Score (%)	75%	94%
Processing Time	Hours	Minutes
Scalability	Low	High
Automation	No	Yes
Accuracy Consistency	Variable	High (stable)

Table 6. R² scores per fold for KNN during cross-validation.

Fold	R² Obtained
1	0.92
2	0.91
3	0.93
4	0.90
5	0.92
Average	0.92
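For reference, the coefficient of determination is computed as one minus the ratio of the residual sum of squares to the total sum of squares, and the average row in Table 6 is the arithmetic mean of the per-fold values. The sketch below reproduces that average and shows the formula on a toy example. How the class labels were encoded numerically for the R² calculation is not specified here, so the ordinal integer encoding in the example is an assumption, as is the function name `r2_score`.

```python
from statistics import mean


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    return 1 - ss_res / ss_tot


# Per-fold scores as reported in Table 6.
fold_scores = [0.92, 0.91, 0.93, 0.90, 0.92]
average_r2 = round(mean(fold_scores), 2)

# Hypothetical example: classes encoded as integers, one of five predictions off by one.
example_r2 = r2_score([0, 1, 2, 3, 4], [0, 1, 2, 3, 3])
```

The mean of the five fold scores is 0.916, which rounds to the 0.92 average shown in the table.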

Table 7. Comparative performance metrics of KNN, Random Forest, and SVM models.

Model	Accuracy	Precision	Recall	F1-Score	R²
KNN	95.2%	0.96	0.93	0.94	0.91
Random Forest	93.7%	0.91	0.93	0.92	0.88
SVM	91.4%	0.88	0.90	0.89	0.85

Table 8. Comparative evaluation of Excel macros versus the KNN model for WQI classification.

Criteria	Excel Macros	KNN Model
Input Flexibility	Rigid (.xls only)	Any structured format (CSV, XLSX)
Error Rate	High (manual input)	Low (automated)
Integration	Manual-only	Scalable to APIs/IoT
Time to Classify	Hours	<5 min

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
