Integrated Machine Learning Based Groundwater Quality Prediction in a Peri-Urban Area: The Case of Attica Region, Greece

Pyrgaki, Konstantina; Ntona, Maria Margarita; Bhagat, Suraj Kumar

doi:10.3390/urbansci10060323


by Konstantina Pyrgaki ¹, Maria Margarita Ntona ² and Suraj Kumar Bhagat ³,*

¹ Department of Geology & Geoenvironment, National and Kapodistrian University of Athens, Panepistimiopolis Zographou, 15784 Athens, Greece

² Laboratory of Engineering Geology & Hydrogeology, School of Geology, Aristotle University of Thessaloniki, 54636 Thessaloniki, Greece

³ Centre for Interdisciplinary Research, SRM University-AP, Amaravati, Andhra Pradesh 522240, India

* Author to whom correspondence should be addressed.

Urban Sci. 2026, 10(6), 323; https://doi.org/10.3390/urbansci10060323

Submission received: 24 April 2026 / Revised: 3 June 2026 / Accepted: 5 June 2026 / Published: 10 June 2026

(This article belongs to the Special Issue Sustainable Groundwater Management in Urban Areas)


Abstract

Groundwater quality assessment in urban and peri-urban environments is often constrained by incomplete monitoring records, irregular sampling frequencies, and heterogeneous environmental datasets. The primary objective of this study is to predict the Water Quality Index (WQI) in the Attica River Basin, Greece, using advanced machine learning (ML) techniques. A groundwater quality dataset comprising 958 observations from 80 monitoring stations was analyzed using six physicochemical parameters, namely electrical conductivity, ammonium, nitrate, nitrite, chloride, and sulphate. Three modeling approaches, namely TabNet (with Winsorization), SVM, and Gradient Boosting Machines (GBM), were implemented to estimate groundwater quality conditions. To address the challenge of missing data, Multiple Imputation by Chained Equations (MICE) with Predictive Mean Matching (PMM) was implemented and systematically compared against conventional imputation approaches, including smoothed averages, interpolation, and forward-fill methods. The novelty of this study lies in the integration of open-access groundwater chemistry data, advanced multivariate imputation (MICE-PMM), and attention-based deep learning (TabNet) for groundwater quality prediction in a Mediterranean peri-urban area under data-scarce conditions. Using a multi-year groundwater monitoring dataset, the results indicate that the integrated MICE-PMM and TabNet framework achieved the highest predictive performance, with R² = 0.91, NSE = 0.91, RMSE = 52.21, and MAE = 25.68. Feature importance and sensitivity analyses identified nitrate as the dominant driver of WQI variability, highlighting the strong influence of anthropogenic nutrient loading on groundwater quality. 
Overall, the proposed framework provides a transferable, data-driven approach for groundwater quality prediction, environmental monitoring, and groundwater resource management in urban and peri-urban aquifer systems characterized by incomplete environmental datasets.

Keywords: groundwater quality index; MICE-PMM; TabNet; Gradient Boosting; peri-urban area; groundwater pollution; machine learning


1. Introduction

Urban water systems present a multitude of scientific challenges due to their intrinsic complexity, spatial heterogeneity, and the dynamic interactions between anthropogenic activities and natural processes [1]. In particular, groundwater quality in urban environments is influenced by a variety of factors, including land use change, wastewater discharge, climate variability, and socio-economic pressures [2]. These interacting drivers result in highly variable hydrochemical conditions, often characterized by nonlinear responses and strong spatial and temporal heterogeneity, which complicate robust assessment and predictive modeling [3].

The evaluation of groundwater quality through composite indices, such as the Water Quality Index (WQI), has been widely adopted as an effective tool for summarizing complex physicochemical information into a single, interpretable metric [4]. WQI facilitates communication between scientists, policymakers, and stakeholders, supporting water resource management and decision-making processes. However, despite its widespread application, WQI-based assessments remain sensitive to data quality, parameter selection, and temporal resolution, while their reliability is often constrained by the limited availability of high-frequency monitoring data [5]. Consequently, the modeling and prediction of water quality indices (WQIs) under such multifactorial conditions require robust analytical frameworks capable of accommodating data limitations while preserving predictive accuracy [6]. This limitation is particularly critical in urban environments, where rapid environmental changes and contamination events are difficult to capture using sparse or irregular datasets [7].

One of the fundamental challenges in contemporary urban hydrology is therefore the need for high-temporal-resolution data capable of capturing transient processes and short-term variability [8]. However, environmental datasets are frequently characterized by missing values, inconsistent sampling intervals, and heterogeneous data structures, which introduce uncertainty and limit the applicability of conventional analytical approaches [9]. Addressing these data limitations is essential for improving the robustness and predictive capability of groundwater quality assessments.

In recent years, the increasing availability of open-access environmental datasets has created new opportunities for addressing these challenges. The use of standardized global datasets enables harmonized analyses across different geographical regions, facilitating cross-country comparisons and integrated environmental assessments. Nevertheless, open data sources are often characterized by heterogeneous spatial and temporal resolutions, data gaps, inconsistent metadata, and uncertainties related to data quality and governance, thereby necessitating the development of robust methodologies capable of effectively handling incomplete and heterogeneous datasets [10].

To address the complexity of urban hydrological and hydrogeological systems, machine learning (ML) techniques have been increasingly adopted as powerful tools for environmental modelling. These systems are governed by the interaction between surface-water processes, groundwater flow, aquifer characteristics, and anthropogenic pressures, resulting in highly complex environmental responses. Algorithms such as Random Forest (RF), Support Vector Machine (SVM), Artificial Neural Network (ANN), and boosting methods have demonstrated strong capabilities in capturing nonlinear relationships and handling high-dimensional datasets [11,12,13]. Among these approaches, SVM is widely recognized for its robustness in modeling nonlinear relationships and its effectiveness on relatively small datasets [6,14], while Gradient Boosting Machines (GBM) provide enhanced predictive performance through ensemble learning and iterative error minimization [15,16,17]. More recently, deep learning architectures such as TabNet have been introduced, offering improved performance for tabular data by leveraging sequential attention mechanisms while retaining a degree of interpretability that conventional neural networks lack [18].

Recent studies have shown that ML approaches can outperform traditional statistical and physically based models, particularly in complex and data-limited environments, due to their ability to learn intricate patterns from heterogeneous datasets [19,20]. Nevertheless, despite these advancements, several challenges remain. ML models are often sensitive to data quality and may suffer from overfitting or reduced generalization when applied to datasets with missing values or limited temporal resolution. Furthermore, the lack of interpretability in many ML approaches has raised concerns regarding their applicability in environmental decision-making contexts [9,21]. In addition, the combined impact of missing data and irregular temporal resolution on ML model performance remains insufficiently investigated in groundwater quality studies, particularly when integrated with composite indices such as WQI [22].

Despite the growing application of machine learning techniques in groundwater quality assessment, relatively few studies have simultaneously addressed the challenges of incomplete environmental datasets, groundwater quality index prediction, and model interpretability within highly urbanized Mediterranean aquifer systems. In this context, the present study introduces an integrated framework that combines open-access groundwater chemistry datasets, advanced multivariate imputation through Multiple Imputation by Chained Equations with Predictive Mean Matching (MICE-PMM), and the attention-based TabNet deep learning architecture. The proposed approach enables robust groundwater quality prediction under data-scarce conditions while preserving model interpretability and supporting hydrochemical understanding of groundwater quality drivers. The novelty of this work lies in the simultaneous integration of data-gap handling, explainable deep learning, and groundwater quality assessment within a Mediterranean peri-urban aquifer system. By combining statistical analysis, hydrochemical interpretation, sensitivity assessment, and machine learning explainability, the study provides a transferable framework for groundwater quality monitoring and management in regions characterized by fragmented environmental datasets.

The Attica River Basin in Greece represents a characteristic example of a highly stressed urban hydrological system. This area reflects common hydrological challenges observed in urban Mediterranean environments, where rapid urbanization has significantly altered natural hydrological processes. The region is affected by intense urbanization, industrial activities, agricultural pressures, and increasing water demand, which collectively impact groundwater quality [23]. Additionally, climate variability and irregular precipitation patterns further exacerbate the vulnerability of groundwater resources, highlighting the need for reliable monitoring and predictive tools [24]. Despite its importance, groundwater quality assessment in the region is often limited by fragmented datasets, missing observations, and a lack of high-frequency monitoring, making it an ideal case study for the application of data-driven methodologies.

Building upon these challenges, this study integrates open geospatial datasets with advanced machine learning approaches to model and predict groundwater quality dynamics in a complex urban environment, while explicitly examining the effects of data scarcity, missing values, and temporal irregularity. Specifically, it aims to: (i) develop and compare machine learning models (TabNet, SVM, and Gradient Boosting Machines) for forecasting groundwater quality based on WQI, (ii) evaluate the effectiveness of advanced imputation techniques (MICE-PMM) against conventional methods, (iii) identify the key environmental drivers influencing groundwater quality variability, and (iv) evaluate the potential of open-access data to support scalable and transferable urban hydrological modeling frameworks.

2. Materials and Methods

2.1. Study Area

This study focuses on the river basin district of Attica, located in central Greece and encompassing the metropolitan area of Athens. The Attica river basin district covers an area of approximately 3187 km² and hosts a population exceeding 3.8 million inhabitants, representing nearly 35% of Greece’s total population [25]. The study area constitutes a highly urbanized system characterized by intense anthropogenic pressures, rapid population growth, and continuous land-use transformations, particularly during the second half of the twentieth century (Figure 1) [26].

The climatic conditions of the Attica region are characteristic of a Mediterranean (Csa) climate, defined by warm, dry summers and mild, wet winters [27]. The mean annual temperature is around 18.5 °C, while mean annual precipitation varies spatially between 350 and 750 mm, increasing from the coastal plains towards the surrounding mountainous zones such as Parnitha (1413 m a.s.l.), Penteli (1109 m a.s.l.), and Hymettus (1026 m a.s.l.). Precipitation is typically characterized by high-intensity, short-duration storm events, which frequently trigger flash flooding in urbanized lowland areas [28]. Wildfires represent an additional environmental pressure in the Attica region, significantly altering land cover, enhancing soil erosion, and modifying surface runoff patterns, groundwater recharge processes, and the overall hydrogeological balance [29]. From a geomorphological perspective, the Attica Basin is a semi-closed hydrological system bounded by mountainous terrain to the north and east and opening towards the Saronic Gulf to the southwest.

The hydrographic network of the Attica region is primarily composed of ephemeral and intermittent streams, with the Kifissos and Ilissos rivers representing the main drainage axes. However, extensive urban development, channel modifications, and the widespread culverting of natural watercourses by transport and drainage infrastructure have significantly altered the natural hydrological regime, reducing infiltration capacity and modifying flow dynamics. In addition to these structural changes, multiple anthropogenic pollution sources further degrade urban water quality, including industrial discharges, effluents from wastewater treatment systems, agricultural runoff, and diffuse inputs associated with recreational and illegal activities [26,30].

The geological structure of the area is dominated by Mesozoic limestones and schists, interbedded with Neogene marls and Quaternary alluvial deposits, which exert a strong control on subsurface flow patterns and aquifer recharge processes. Urban imperviousness, together with soil compaction and the reduction in permeable surfaces, has intensified hydrological extremes, particularly in western Attica, where flood-prone zones such as Mandra and Nea Peramos have been repeatedly affected by catastrophic events in recent decades [31,32]. Furthermore, the Attica region is characterized by significant seismic activity associated with a dense network of active faults, which may locally influence groundwater flow paths, fracture permeability, and hydrogeological connectivity [33,34].

Figure 1. Land use map of Attica River Basin District based on Corine dataset.

2.2. Dataset Origin and Overall Methodology

The dataset used in this study was obtained for the study area from the European Environment Agency (EEA) via its official data platform, Discodata (https://discodata.eea.europa.eu/). This open-access dataset includes 958 groundwater quality measurements covering 15 parameters from 80 monitoring stations. However, because several parameters, including the major cations, had substantial missing values (>25%) at numerous monitoring points, the analysis was restricted to six key variables, namely electrical conductivity (EC), chloride (Cl⁻), sulphate (SO₄²⁻), nitrate (NO₃⁻), nitrite (NO₂⁻), and ammonium (NH₄⁺), which were subsequently used for Water Quality Index (WQI) calculation and prediction. The dataset represents the complete set of publicly available groundwater quality observations in the EEA monitoring database for the Attica River Basin during the study period. Consequently, the sample size was determined by the availability of monitoring records within the official groundwater monitoring network rather than by a predefined sampling design, and procedures such as randomization and blinding were not applicable. Potential sources of bias were minimized through standardized data preprocessing, objective missing-data treatment, train-test separation, and the use of multiple performance metrics for model evaluation.

The methodological framework adopted in this study (Figure 2) follows a structured workflow for groundwater quality prediction using machine learning techniques. Initially, groundwater quality data for the Attica River Basin, including key physicochemical parameters (NH₄⁺, Cl⁻, EC, NO₃⁻, NO₂⁻, SO₄²⁻) and WQI, were collected and assessed for missing values. Data gaps (ranging between 8 and 13% in the retained variables) were addressed using Multiple Imputation by Chained Equations (MICE) with Predictive Mean Matching (PMM), while imputation performance was assessed using Kolmogorov–Smirnov (K–S) tests and Standardized Mean Differences (SMD < 0.1), ensuring statistical consistency between observed and imputed distributions. Subsequently, exploratory data analysis (EDA) was performed, including correlation matrix evaluation, descriptive statistics, and assessment of temporal variability (trend and seasonality). Preprocessing steps, including winsorization for outlier mitigation and Min–Max feature scaling, were applied prior to model development. The processed dataset was then partitioned into training (80%) and testing (20%) subsets to enable model evaluation under unseen conditions. No observations were removed from the dataset based on outlier detection criteria. Instead, winsorization was applied to reduce the influence of extreme values while preserving all available groundwater quality observations. This approach was considered appropriate because extreme values may reflect genuine hydrochemical conditions or localized contamination events rather than measurement errors.
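The preprocessing steps described above (winsorization, Min–Max scaling, and an 80/20 partition) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the 5th/95th percentile limits, the nearest-rank quantile rule, the random seed, and all function names are assumptions, since the text does not state the winsorization limits or the software used.

```python
import random


def winsorize(values, lower=0.05, upper=0.95):
    """Clip values to empirical quantiles (limits are assumed, not from the paper)."""
    ordered = sorted(values)
    n = len(ordered)
    low_limit = ordered[int(lower * (n - 1))]
    high_limit = ordered[int(upper * (n - 1))]
    # Extreme values are pulled to the limits; no observation is dropped.
    return [min(max(v, low_limit), high_limit) for v in values]


def min_max_scale(values):
    """Rescale values linearly to the [0, 1] interval."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Randomly partition rows into training and testing subsets."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx = set(indices[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test
```

With 958 observations and a 20% test fraction, this split yields 766 training and 192 testing rows.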

Three machine learning models—TabNet, Gradient Boosting Machine (GBM), and Support Vector Machine (SVM)—were implemented to predict groundwater quality based on WQI. Model performance was evaluated using multiple statistical metrics, including the coefficient of determination (R²), root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and Nash–Sutcliffe efficiency (NSE), providing a comprehensive assessment of predictive accuracy and model robustness. Residual diagnostics, including Q–Q plots and distribution analysis, were further employed to evaluate model behavior. Finally, post-prediction analysis was conducted using global feature importance (derived from TabNet) and sensitivity analysis (tornado plots), enabling the identification of key variables influencing WQI.

Figure 2. Methodological approach adopted in the research.

2.3. Missing Data Handling

To address missing values in the groundwater quality time series, this study employed Multiple Imputation by Chained Equations (MICE), a widely used multivariate imputation framework that accounts for statistical relationships among variables during the imputation process. The imputation was performed using the Predictive Mean Matching (PMM) algorithm, which is particularly suitable for environmental datasets characterized by non-normal distributions and skewed values.

MICE operates through an iterative procedure in which each variable with missing values is sequentially imputed using regression models based on the remaining variables. In this context, PMM selects a set of candidate donor values from observed cases whose predicted values are closest to those of the missing observations [35]. Formally, for a missing value y_{j}, a predicted value \hat{y}_{j} is first estimated, and donor values are identified from observed cases y_{i} with similar predicted values \hat{y}_{i}, as expressed in Equation (1):

\hat{y}_{j} = y_{i_{j}}, \quad j = 1, \dots, n_{0}

(1)

where y_{i_{j}} is the observed donor value assigned to the j-th missing entry, and n_{1} and n_{0} denote the number of observed and missing values, respectively. One of the observed donor values is then randomly selected and used to impute the missing entry, thereby preserving the original data distribution.
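The donor-matching step can be illustrated with a simplified, single-variable sketch. It assumes one incomplete variable, one complete predictor, an ordinary least-squares fit, and k = 5 candidate donors; these choices and the function names are hypothetical. A full MICE-PMM implementation additionally draws the regression parameters at random, cycles over all incomplete variables, and produces several imputed datasets, none of which is shown here.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pmm_impute(xs, ys, k=5, seed=0):
    """Fill None entries of ys with observed values from the k nearest donors."""
    rng = random.Random(seed)
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([x for x, _ in observed],
                                [y for _, y in observed])
    # Each donor is stored as (predicted value, observed value).
    donors = [(intercept + slope * x, y) for x, y in observed]
    completed = []
    for x, y in zip(xs, ys):
        if y is not None:
            completed.append(y)
            continue
        target = intercept + slope * x  # predicted value for the missing case
        nearest = sorted(donors, key=lambda d: abs(d[0] - target))[:k]
        completed.append(rng.choice(nearest)[1])  # impute an observed value
    return completed
```

Because every imputed entry is copied from an observed case, the completed series never contains values outside the observed range, which is the property that makes PMM attractive for skewed hydrochemical data.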

The accuracy of the MICE imputation was validated using Standardized Mean Differences (SMD) (Figure 3A). All variables displayed an SMD below the 0.1 threshold, with an average accuracy score of 95.02%, indicating that the distributional integrity of the water quality parameters was preserved (Figure 3B). The MICE algorithm thus reproduced the central tendency and variance of the water quality samples. In addition, the density distribution comparison (Figure 3C) confirms that the statistical “fingerprint” of the chemical parameters remained stable after the imputation process.

Figure 3. Validation of MICE Imputation Accuracy. (A) Standardized Mean Difference (SMD) across all numeric variables; (B) Consistency of WQI categorical descriptions between original (colored in blue) and modeled datasets (colored in red). (C) Density distribution comparison ensuring that the statistical “fingerprint” of the chemical parameters remained stable after the imputation process.
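As a rough illustration of the SMD check, the sketch below compares a variable before and after imputation. It assumes the common pooled-standard-deviation definition of the SMD, which the text does not spell out; the function name and the example values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(original, imputed):
    """Absolute mean difference divided by the pooled standard deviation."""
    pooled_sd = sqrt((variance(original) + variance(imputed)) / 2)
    if pooled_sd == 0:
        return 0.0
    return abs(mean(original) - mean(imputed)) / pooled_sd


# Hypothetical nitrate values (mg/L): observed only vs. completed series.
observed_only = [10.0, 20.0, 30.0, 40.0]
completed = [10.0, 20.0, 25.0, 30.0, 40.0]
smd = standardized_mean_difference(observed_only, completed)
passes_threshold = smd < 0.1  # threshold used in the study
```

A value below 0.1 indicates that the imputed series has not shifted the mean of the variable by more than a tenth of a pooled standard deviation.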

2.4. Dataset Statistical Analysis

The correlation matrix (Figure 4) provides insight into the relationships among the selected physicochemical parameters and the WQI. A strong positive correlation is observed between electrical conductivity and chloride (r = 0.96), suggesting a common origin associated with salinity and dissolved ionic content. Sulphate also exhibits moderate to strong correlations with electrical conductivity (r = 0.68) and chloride (r = 0.61), indicating the influence of mineralization processes and possible geogenic contributions.

Since WQI is a composite index derived from the weighted contribution of EC, Cl⁻, SO₄²⁻, NO₃⁻, NO₂⁻ and NH₄⁺, the correlations shown in Figure 4 reflect the relative influence of each physicochemical parameter on overall groundwater quality and drinking-water suitability. Higher correlation coefficients indicate a stronger contribution of the corresponding parameter to the variability of the calculated WQI values.

Among the nutrient-related parameters, nitrite shows the strongest correlation with WQI (r = 0.85), followed by nitrate (r = 0.52), highlighting the importance of nitrogen species in influencing groundwater quality conditions. In contrast, ammonium displays weak correlations with both WQI (r = 0.20) and other variables, suggesting a more localized or variable behavior, potentially linked to specific sources or redox conditions. Overall, the results indicate that both nutrient loading and ionic composition contribute to groundwater quality variability, with nitrogen compounds emerging as key controlling factors in the dataset.

Figure 4. Correlation matrix of the physicochemical parameters.
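For reference, the pairwise coefficients in such a matrix are Pearson correlations, which can be computed as in the following sketch. The column names and the helper functions are illustrative only and do not reproduce the study's data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return {(name_a, name_b): r} for every pair of named columns."""
    names = list(columns)
    return {(a, b): pearson_r(columns[a], columns[b])
            for a in names for b in names}
```

Calling `correlation_matrix({"EC": ec_values, "Cl": cl_values})` with two hypothetical lists returns the four entries of a 2 × 2 matrix, with ones on the diagonal.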

Principal Component Analysis (PCA) was further applied to explore the underlying structure of the dataset and to identify the dominant factors controlling water quality variability (Figure 5). The scree plot (top panel) shows the proportion of variance explained by each principal component. The first principal component (PC1) accounts for 44.1% of the total variance, followed by PC2 explaining 21.4%. Together, the first two components explain approximately 65.5% of the total variance, indicating that they capture the majority of the dataset variability. Subsequent components contribute progressively less, with PC3 (14.9%) and PC4 (12.9%) showing moderate contributions, while the remaining components explain less than 7% each.

The biplot (Figure 5, bottom panel) illustrates the loadings of the variables on the first two principal components. Variables such as WQI, nitrate, and nitrite exhibit strong positive loadings on both PC1 and PC2, indicating their significant contribution to overall water quality variability. In contrast, electrical conductivity, sulphate, and chloride are positively associated with PC1 but negatively associated with PC2, suggesting the presence of a secondary hydrochemical gradient, likely related to ionic composition and mineral dissolution processes. Ammonium shows a comparatively weaker contribution, as indicated by its proximity to the origin.

Overall, the PCA results indicate that nutrient-related parameters (particularly nitrogen species) constitute the primary source of variability in the dataset, while chloride and sulphate represent a secondary hydrochemical gradient associated with groundwater mineralization. Electrical conductivity, which reflects the overall dissolved ionic content of groundwater, is strongly associated with this gradient and serves as an integrated indicator of ionic concentration.

Figure 5. Principal component analysis of water quality parameters.

2.5. Machine Learning Models Methodology

2.5.1. Tabular Network (TabNet) Model

TabNet is a deep learning architecture designed for tabular data, enabling the effective learning of high-dimensional feature representations [36,37]. The model employs a sparse, instance-wise feature selection mechanism, which enhances learning efficiency during the training process, as described in Equation (2).

M[i] = \mathrm{entmax}(P[i-1] \cdot h_{i}(a[i-1]))

(2)

where h_{i}(·) denotes a trainable transformation composed of fully connected (FC) and batch normalization (BN) layers, while P[i − 1] represents the prior scale information from the previous decision step. Feature selection is achieved through a sparsity-inducing normalization function (sparsemax, or its generalization entmax as written in Equation (2)), which performs coefficient normalization and promotes sparsity in the selected features.

The learnable mask M[i] ∈ R^(B×D) is used for sparse selection of the most salient features. Since the masking process is multiplicative (i.e., M[i] · f), an attentive transformer is used to obtain the masks from the processed features of the previous step (i.e., a[i − 1]). The sparsity of the selected features is further controlled by an entropy-based regularization term, given in Equation (3).

L_{sparse} = \sum_{i=1}^{N_{steps}} \sum_{b=1}^{B} \sum_{j=1}^{D} \frac{-M_{b,j}[i]}{N_{steps} \cdot B} \log(M_{b,j}[i] + \epsilon)

(3)

where N_{steps} is the number of decision steps, B is the batch size, D is the number of features, and \epsilon is a small constant added for numerical stability.
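To make the masking and regularization steps concrete, the following sketch implements sparsemax and the entropy-based sparsity term for small Python lists. It is not the TabNet implementation used in the study: sparsemax is chosen here as the simplest member of the entmax family (entmax with α = 2), and the nested-list mask layout, the default ε, and the function names are assumptions made for illustration.

```python
from math import log


def sparsemax(scores):
    """Project scores onto the probability simplex; small scores become exactly 0."""
    ordered = sorted(scores, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, z in enumerate(ordered, start=1):
        cumulative += z
        # The support condition holds for k = 1..k*; keep the last valid threshold.
        if 1 + k * z > cumulative:
            tau = (cumulative - 1) / k
    return [max(z - tau, 0.0) for z in scores]


def sparsity_loss(masks, eps=1e-15):
    """Entropy of the masks, averaged over decision steps and samples.

    masks[i][b][j] is the mask value at step i, sample b, feature j.
    """
    n_steps = len(masks)
    batch_size = len(masks[0])
    total = 0.0
    for step_masks in masks:
        for row in step_masks:
            for m in row:
                total += -m * log(m + eps)
    return total / (n_steps * batch_size)
```

For example, `sparsemax([2.0, 1.0, 0.1])` places all weight on the first feature, whereas a softmax would assign non-zero weight to every feature. A one-hot mask gives a sparsity loss of essentially zero, while a mask spread evenly over two features gives log 2, so minimizing the term pushes each decision step toward fewer features.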

2.5.2. Generalized Boosted Regression Modeling (GBM)

GBM is employed due to its strong predictive performance and flexibility in handling structured and tabular data [38]. The boosting approach builds an ensemble of weak learners in a sequential manner, where each subsequent model aims to correct the errors of the previous one, typically using decision trees as base learners [39]. This iterative process reduces overall prediction error and enhances model accuracy and generalization performance. Given its ability to capture nonlinear relationships and handle heterogeneous tabular datasets, GBM is particularly suitable for the present application. The resulting additive model is given in Equation (4).

\hat{y}(x) = F_{0}(x) + \lambda \sum_{m=1}^{M} f_{m}(x),

(4)

where F₀(x) denotes the initial model (e.g., the mean of the response variable), f_m(x) represents the base learner, typically a decision tree, fitted at iteration m, and λ∈(0, 1] is the learning rate (shrinkage parameter), which controls the contribution of each learner to the final model. The parameter M denotes the total number of boosting iterations.

At each iteration m, the model computes the negative gradient (pseudo-residuals) of the loss function with respect to the current prediction F_{m-1}(x). A base learner f_m(x) is then fitted to these residuals, and the model is updated accordingly, as shown in Equation (5). This iterative procedure continues until either convergence is achieved (i.e., no significant improvement in the loss function) or a predefined maximum number of iterations M is reached.

F_{m}(x) = F_{m-1}(x) + \lambda f_{m}(x)

(5)
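The update rule in Equations (4) and (5) can be sketched with a minimal boosting loop. The sketch assumes a squared-error loss (so the pseudo-residuals are simply the observed minus predicted values), a single input feature, and one-split regression stumps as base learners; the study's actual GBM configuration is not reported in this section, and all names below are illustrative.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump (one split) on a single feature."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: fall back to a constant
        constant = sum(residuals) / len(residuals)
        return lambda x: constant
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Return a predictor F_M(x) = F_0 + learning_rate * sum of fitted stumps."""
    f0 = sum(ys) / len(ys)  # F_0: mean of the response
    learners = []
    predictions = [f0] * len(xs)
    for _ in range(n_rounds):
        # Negative gradient of the squared-error loss = current residuals.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        learners.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for x, p in zip(xs, predictions)]

    def predict(x):
        return f0 + learning_rate * sum(f(x) for f in learners)

    return predict
```

A smaller learning rate shrinks each stump's contribution, so more rounds are needed to reach the same training error, which is the usual trade-off between λ and M.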

2.5.3. Support Vector Machine Model (SVM)

Support Vector Machines (SVM) are supervised learning models that construct an optimal hyperplane to separate data in the feature space. The decision boundary is defined such that it maximizes the margin between support vectors belonging to different classes. Although originally developed for classification tasks, SVMs can be effectively extended to regression problems through Support Vector Regression (SVR) [40]. In this study, SVM is applied for groundwater quality prediction based on WQI. The method is particularly effective in capturing nonlinear relationships by employing kernel functions, which map the input data into a higher-dimensional feature space where linear separation becomes feasible. The model formulation is described in Equation (6).

f(x) = \sum_{i \in SV} \alpha_{i} y_{i} K(x_{i}, x) + b

(6)

where α_i are the Lagrange multipliers (non-zero only for support vectors), y_i are the target values, K(x_i, x) is the kernel function (e.g., linear), SV is the set of support vectors, and b is the bias term.
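Equation (6) can be evaluated directly once the support vectors and their multipliers are known. The sketch below only performs this evaluation; it does not train a model, and the support vectors, multipliers, bias, and the RBF `gamma` value are hypothetical numbers chosen for illustration.

```python
from math import exp


def linear_kernel(u, v):
    """Dot product of two feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel; gamma is an assumed value."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def decision_function(x, support_vectors, alphas, targets, bias,
                      kernel=linear_kernel):
    """Evaluate f(x) = sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(a * y * kernel(sv, x)
               for sv, a, y in zip(support_vectors, alphas, targets)) + bias


# Hypothetical fitted quantities, for illustration only.
support_vectors = [(1.0, 0.0), (0.0, 1.0)]
alphas = [0.5, 0.5]
targets = [1.0, -1.0]
value = decision_function((2.0, 1.0), support_vectors, alphas, targets, bias=0.1)
```

Swapping `kernel=rbf_kernel` into the call changes only the similarity measure, which is how kernel functions let the same formulation capture nonlinear relationships.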

2.6. Water Quality Index (WQI) Calculation

A weighted arithmetic WQI method was applied to evaluate groundwater quality. Six physicochemical parameters were considered: electrical conductivity (EC), chloride (Cl⁻), sulphate (SO₄²⁻), nitrate (NO₃⁻), nitrite (NO₂⁻), and ammonium (NH₄⁺). To assess the suitability of groundwater for drinking purposes, the drinking-water standards recommended by the World Health Organization [41] and the Drinking Water Directive (98/83/EC) were used in the calculation of WQI, which involves three steps. In the first step, each parameter was assigned a weight (wi) on a scale of 1–5, reflecting its relative importance in water quality assessment and its potential impact on human health. In the second step, the relative weights (Wi) were calculated through Equation (7).

W_{i} = \frac{w_{i}}{\sum_{i=1}^{n} w_{i}}

(7)

where Wi is the relative weight, wi is the weight of each parameter, and n is the number of parameters. In the third step, the quality rating (Qi) of each parameter is computed by dividing its concentration in each groundwater sample by the corresponding drinking water quality standard and multiplying by 100, as in Equation (8).

Q_{i} = \frac{C_{i}}{S_{i}} \times 100

(8)

where Qi is the quality rating, Ci is the concentration of each chemical parameter in each water sample in milligrams per liter (mg/L), and Si is the corresponding standard taken from the European Drinking Water Directive (98/83/EC). Finally, the water quality sub-index (SIi) of each chemical parameter and the overall WQI are computed by Equation (9).

SI_{i} = W_{i} \times Q_{i}, \qquad WQI = \sum_{i=1}^{n} SI_{i}

(9)

where SIi is the sub-index of the ith parameter, Qi is the rating based on the concentration of ith parameter and n is the total number of parameters.

The overall WQI for each groundwater sample was calculated as the sum of the weighted quality ratings of all selected parameters. Subsequently, mean WQI values were computed for each monitoring site and year to facilitate spatial analysis and mapping. Based on the calculated WQI values, groundwater quality was classified into five categories: excellent (WQI < 50), good (50–100), poor (100–200), very poor (200–300), and unsuitable for drinking purposes (WQI > 300).
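The three calculation steps and the classification can be combined into a short worked sketch. The standards below are the parametric values of Directive 98/83/EC (EC in µS/cm, all others in mg/L), but the text does not list the exact values or the parameter weights used, so the weights here are purely hypothetical, as are the dictionary keys and the treatment of class boundaries (upper bounds taken as inclusive).

```python
# Parametric values from Directive 98/83/EC; the study's exact inputs are not listed.
STANDARDS = {"EC": 2500.0, "Cl": 250.0, "SO4": 250.0,
             "NO3": 50.0, "NO2": 0.5, "NH4": 0.5}

# Hypothetical weights on the 1-5 scale, for illustration only.
ASSUMED_WEIGHTS = {"EC": 4, "Cl": 3, "SO4": 3, "NO3": 5, "NO2": 5, "NH4": 4}


def compute_wqi(sample, weights=ASSUMED_WEIGHTS, standards=STANDARDS):
    """Weighted arithmetic WQI: sum over parameters of W_i * Q_i."""
    total_weight = sum(weights.values())
    wqi = 0.0
    for name, concentration in sample.items():
        relative_weight = weights[name] / total_weight            # Equation (7)
        quality_rating = concentration / standards[name] * 100    # Equation (8)
        wqi += relative_weight * quality_rating                   # Equation (9)
    return wqi


def classify_wqi(wqi):
    """Map a WQI value to the five classes used in the study."""
    if wqi < 50:
        return "excellent"
    if wqi <= 100:
        return "good"
    if wqi <= 200:
        return "poor"
    if wqi <= 300:
        return "very poor"
    return "unsuitable"


# Hypothetical sample: every parameter at half of its standard.
sample = {name: limit / 2 for name, limit in STANDARDS.items()}
sample_wqi = compute_wqi(sample)
```

Because the relative weights sum to one, a sample in which every parameter sits exactly at its standard has a WQI of 100 regardless of the weights chosen, which is why 100 marks the boundary between good and poor water.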

3. Results and Discussion

3.1. WQI Prediction and Machine Learning Model Performance

The performance of the implemented machine learning models highlights the potential of data-driven approaches for groundwater quality prediction in complex urban environments [42]. Among the evaluated models, the TabNet architecture, combined with winsorization, feature scaling, and MICE-based imputation, demonstrated the highest overall predictive performance. The model achieved a coefficient of determination (R²) of 0.91, a Nash–Sutcliffe Efficiency (NSE) of 0.91, a Root Mean Square Error (RMSE) of 52.21, and a Mean Absolute Error (MAE) of 25.68 in predicting WQI values, compared with R² = 0.83 for both SVM and GBM (Table 1). SVM nevertheless attained a lower MAE (24.15) and MAPE (7.25) than TabNet, so the advantage of TabNet lies mainly in the variance-based metrics (R², NSE, and RMSE).

The enhanced performance of the TabNet model can be attributed to its increased model capacity and optimized hyperparameter configuration, which enabled effective learning of nonlinear relationships between physicochemical parameters and WQI. In comparison to conventional approaches, such as Support Vector Machines (SVM) and Gradient Boosting Machines (GBM), TabNet exhibited improved generalization capability and greater robustness under conditions of incomplete and heterogeneous environmental data. These findings are consistent with recent studies that emphasize the advantages of deep learning architectures in capturing complex hydrochemical interactions and nonlinear patterns in environmental datasets [12,43].

Residual diagnostics further support the reliability of the model. The residual plots do not exhibit strong systematic patterns, suggesting the absence of significant model bias. However, the Shapiro–Wilk test indicates that the residuals deviate from normality (p < 0.05). Such behavior is commonly observed in environmental datasets, where nonlinear processes, heterogeneity, and the presence of extreme values often lead to departures from normality assumptions [11].

Model validation results (Figure 6) indicate a strong agreement between predicted and observed WQI values. The overall predictive performance of the TabNet model remains robust, supporting its suitability for groundwater quality forecasting in complex urban environments. The ability of the model to maintain high predictive performance despite data gaps further highlights the effectiveness of the MICE imputation approach, which preserves the statistical properties of the dataset.

Table 1. ML Model performance metrics.

Model	R²	RMSE	MAE	MAPE	NSE	MD	Huber
TabNet	0.91	52.21	25.68	14.56	0.91	0.89	25.18
SVM	0.83	72.79	24.15	7.25	0.83	0.90	23.68
GBM	0.83	71.28	29.94	12.82	0.83	0.88	29.45

R²: coefficient of determination; RMSE: root mean square error; MAE: mean absolute error; MAPE: mean absolute percentage error; NSE: Nash–Sutcliffe efficiency; MD: Modified index of agreement; Huber: Huber loss.
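For readers who wish to reproduce metrics of the kind listed in Table 1, the following is a minimal sketch using only the Python standard library. The function name, the example values, and the Huber threshold delta = 1.0 are assumptions; the paper does not state the exact definitions or settings it used, and the modified index of agreement (MD) is omitted here.

```python
import math


def regression_metrics(observed, predicted, delta=1.0):
    """Return RMSE, MAE, MAPE (%), NSE, and mean Huber loss for paired values."""
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]
    mean_obs = sum(observed) / n

    sse = sum(e ** 2 for e in errors)                    # sum of squared errors
    sst = sum((o - mean_obs) ** 2 for o in observed)     # total sum of squares

    # Huber loss: quadratic for small errors, linear for large ones.
    huber = sum(
        0.5 * e ** 2 if abs(e) <= delta else delta * (abs(e) - 0.5 * delta)
        for e in errors
    ) / n

    return {
        "RMSE": math.sqrt(sse / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "MAPE": 100.0 * sum(abs(e / o) for e, o in zip(errors, observed)) / n,
        "NSE": 1.0 - sse / sst,
        "Huber": huber,
    }


# Hypothetical observed and predicted WQI values.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
metrics = regression_metrics(observed, predicted)
print({name: round(v, 3) for name, v in metrics.items()})
```

When R² is computed as 1 − SSE/SST on the evaluation data, it coincides with NSE, which would be consistent with the identical R² and NSE columns in Table 1. Similarly, when all absolute errors exceed delta = 1, the mean Huber loss equals MAE − 0.5, which is close to the pattern in the table (for example, 25.68 versus 25.18 for TabNet). This suggests, but does not confirm, the settings used.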

Figure 6. Predicted versus observed WQI values for GBM, SVM, and TabNet models; the dashed line represents the 1:1 agreement line.

The global feature importance analysis (Figure 8) reveals that nitrate (NO₃⁻) is the dominant predictor of WQI, accounting for nearly 50% of the model’s predictive power. This result underscores the critical role of nutrient loading in groundwater quality degradation, which is widely associated with agricultural runoff, wastewater discharge, and urban pollution sources. Ammonium (NH₄⁺) and nitrite (NO₂⁻) were identified as the second and third most influential variables, respectively. Collectively, nitrogen-based compounds account for approximately 75% of the total feature importance, suggesting a strong influence of anthropogenic activities on groundwater quality, particularly in densely urbanized regions such as Attica [44].

Figure 7. Q-Q plots of residuals for GBM (blue), SVM (red), and TabNet (yellow) models; the red line represents the theoretical normal distribution.

In contrast, electrical conductivity and Cl⁻ exhibited moderate importance, reflecting their association with salinization processes, while sulphate showed minimal contribution. These results are consistent with previous hydrochemical studies, which indicate that nutrient enrichment is often the primary driver of groundwater quality variability in anthropogenically impacted aquifers [45].

Importantly, these findings are in agreement with the PCA results (Figure 5), where nutrient-related variables (nitrate and nitrite) and WQI were identified as major contributors to dataset variability, confirming the consistency between statistical and machine learning analyses.

3.2. Sensitivity Analysis and Model Behavior

The sensitivity analysis (Figures 9 and 10) provides further insight into the influence of key variables on WQI predictions. The results show a strong, near-linear relationship between nitrate concentration and WQI: increases in nitrate levels correspond directly to groundwater quality deterioration. Quantitatively, a 10% increase in nitrate concentration resulted in an increase of approximately 22.49% in WQI, while a 50% increase led to a 47.7% rise in predicted WQI. These findings highlight the high sensitivity of groundwater quality to nutrient enrichment and support the use of nitrate as a key monitoring indicator [12,46].
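A one-at-a-time perturbation of this kind can be sketched as follows. The helper percent_change and the linear stand-in toy_predict are hypothetical; in practice the prediction function would be the fitted model, which is not reproduced here.

```python
def percent_change(predict, sample, feature, relative_increase):
    """Percent change in the prediction when one feature is scaled up."""
    baseline = predict(sample)
    perturbed = dict(sample)  # copy so the original sample is not modified
    perturbed[feature] = sample[feature] * (1.0 + relative_increase)
    return 100.0 * (predict(perturbed) - baseline) / baseline


def toy_predict(sample):
    """Hypothetical linear stand-in for a trained WQI model."""
    return 0.8 * sample["NO3"] + 0.2 * sample["Cl"]


# Hypothetical baseline sample (mg/L).
sample = {"NO3": 50.0, "Cl": 100.0}
change_10 = percent_change(toy_predict, sample, "NO3", 0.10)
change_50 = percent_change(toy_predict, sample, "NO3", 0.50)
print(round(change_10, 2), round(change_50, 2))
```

For a strictly linear response, the percent change scales in proportion to the size of the perturbation, so comparing the responses at 10% and 50% offers a quick check on how close to linear a fitted model behaves around a given baseline.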

Figure 8. Global feature importance derived from the TabNet model, showing the relative contribution of each physicochemical parameter to WQI prediction.

In contrast, nitrite exhibited a different behavior. Although its direct contribution to WQI variation was relatively low, it showed a near-perfect correlation (r ≈ 0.98) with model prediction errors (Figure 9). This suggests that nitrite acts as a “stochastic disruptor”, introducing localized nonlinearities that reduce model accuracy during extreme pollution events. Such behavior has also been observed in environmental modeling studies, where rare but high-intensity contamination events significantly affect model performance due to their deviation from typical data distributions [47].
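A correlation of this type can be computed as sketched below. Pairing nitrite concentrations with absolute prediction errors is an assumption, since the text does not specify which error measure was used, and all values shown are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical nitrite concentrations (mg/L) and observed/predicted WQI values.
nitrite = [0.01, 0.02, 0.05, 0.40, 1.20]
observed = [60.0, 75.0, 110.0, 300.0, 900.0]
predicted = [58.0, 78.0, 104.0, 260.0, 700.0]

abs_errors = [abs(o - p) for o, p in zip(observed, predicted)]
r = pearson_r(nitrite, abs_errors)
print(round(r, 3))
```

Because Pearson's r is sensitive to a few extreme points, a high value in such an analysis may be driven largely by the samples with the highest nitrite concentrations.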

Figure 9. Tornado plot illustrating the sensitivity of WQI to a 10% increase in each physicochemical parameter (parameters in red correspond to positive influence while green to negative influence).

The scenario analysis (Figure 10 and Figure 11) provides insight into the sensitivity of the TabNet model to nutrient loading. A 50% increase in nitrate concentration resulted in a 47.7% increase in the predicted WQI, shifting water quality conditions from the ‘Poor’ category towards the ‘Very Poor’ threshold. In contrast, the model exhibited limited sensitivity to nitrite increases within this range. Despite the strong correlation between nitrite and prediction error, its relatively low influence on WQI suggests that nitrite may introduce localized nonlinear effects rather than acting as a primary predictive variable. From a practical perspective, these findings indicate that nitrate represents a reliable indicator for WQI-based forecasting, while nitrite concentrations may require independent monitoring, particularly under high-pollution conditions, to reduce potential uncertainties in model predictions.

Figure 10. Sensitivity analysis showing the effect of nitrate (NO₃⁻) concentration on predicted WQI values.

Figure 11. Impact of pollutant spikes on predicted WQI values under baseline conditions and 50% increases in nitrate and nitrite concentrations.

3.3. Temporal Variability and WQI Spatial Distribution

The boxplot analysis of WQI values (Figure 12) reveals significant interannual variability in groundwater quality across the study period. Several years exhibit increased dispersion and the presence of extreme outliers, corresponding to samples classified as “very poor” or “unsuitable” for drinking purposes. The progressive increase in variability from excellent to unsuitable classes confirms the sensitivity of the WQI methodology in capturing cumulative hydrochemical effects (Table 2). The clear statistical separation between classes further validates the robustness of the classification scheme [45]. This temporal variability can be attributed to inconsistent sampling of the same monitoring stations from year to year and to fluctuations in anthropogenic pressures, including seasonal agricultural practices, urban runoff, and wastewater discharge.

The observed increase in dispersion from excellent to unsuitable classes is consistent with the conceptual basis of the WQI method, whereby cumulative weighting of multiple physicochemical parameters amplifies the influence of localized contamination and hydrogeochemical processes [45,48]. The distinct separation and progressive widening of variability across classes confirm that the applied WQI thresholds effectively discriminate groundwater quality conditions and are not the result of arbitrary classification.

Table 2. WQI descriptive statistics for the available monitoring years.

Year	Mean	Std Dev	Min	Q1 (25%)	Median	Q3 (75%)	Max
2006	95.4	61.8	17.7	37	93.1	119.4	265.2
2007	92.1	63.1	8.73	33.8	85.2	128.6	259.7
2008	71.2	43.4	8.73	29.9	75	98.4	156
2013	288.8	432.5	22.6	89.2	164.3	335	2915
2014	224.8	427.5	22.5	82	139.5	212.7	3586.5
2015	274.7	279.5	18	86.9	172	373.3	1282.5

Std Dev: Standard Deviation; Min: Minimum value; Q1 (25%): First Quartile (25th percentile); Q3 (75%): Third Quartile (75th percentile); Max: Maximum value.
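Summary statistics of the kind reported in Table 2 can be derived per monitoring year as in the sketch below. The records are hypothetical, and the use of the sample standard deviation and the "inclusive" quartile method are assumptions, since the paper does not state which definitions were applied.

```python
import statistics
from collections import defaultdict


def describe(values):
    """Mean, sample standard deviation, quartiles, and range of WQI values."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


def describe_by_year(records):
    """Group (year, wqi) pairs by year and summarise each group."""
    grouped = defaultdict(list)
    for year, wqi in records:
        grouped[year].append(wqi)
    return {year: describe(values) for year, values in sorted(grouped.items())}


# Hypothetical (year, WQI) records, not taken from the study dataset.
records = [
    (2006, 40.0), (2006, 90.0), (2006, 120.0),
    (2013, 80.0), (2013, 160.0), (2013, 340.0), (2013, 900.0),
]
summary = describe_by_year(records)
for year, stats in summary.items():
    print(year, {name: round(v, 1) for name, v in stats.items()})
```

A mean well above the median, as seen for 2013–2015 in Table 2, indicates a right-skewed distribution in which a few extreme samples pull the mean upward.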

The spatial distribution of sampling-based WQI classes and the official groundwater status reported in the River Basin Management Plan for the Attica Water District (GR06) is provided in Figure 13. Several clusters of poor WQI classes (i.e., high WQI values) coincide with groundwater bodies classified as having poor chemical status. According to the GR06 management plan, groundwater bodies such as Megaron–Alepochoriou, Thriasio Pedio, Lekani Kifisou, and Marathonas are characterized by poor chemical status, primarily due to salinization, nitrate contamination associated with agricultural activities and sewage discharges, and, in some cases, the presence of trace metals linked to industrial sources.

Figure 12. Distribution of WQI values across monitoring years, illustrating temporal variability in groundwater quality (asterisks correspond to outlier values).

Overall, the comparison indicates that the WQI derived in this study is consistent with the broader spatial patterns of groundwater degradation reported in the official management plan. At the same time, the sampling-based WQI provides higher-resolution information at the monitoring-point scale, revealing intra-body variability that is not fully captured by basin-scale classifications.

These findings suggest that the proposed WQI approach can complement the GR06 assessment framework by providing a more detailed and spatially refined representation of groundwater quality conditions, thereby supporting improved monitoring and management strategies.

Figure 13. Spatial distribution of Water Quality Index (WQI) values across the Attica Basin, combined with groundwater quality classification from River Basin Management Plans (GR06).

3.4. Implications for Groundwater Management

The findings of this study provide important insights for groundwater monitoring and management in urbanized basins such as Attica. The dominance of nitrogen-based compounds highlights the urgent need for targeted pollution control strategies focusing on agricultural nutrient management and wastewater treatment systems. This observation is consistent with previous studies emphasizing the role of nutrient pollution in groundwater degradation [49].

The strong predictive performance of the TabNet model demonstrates its potential as a cost-effective tool for real-time water quality monitoring. By relying on a limited number of key parameters, the model enables efficient resource allocation and optimization of monitoring networks.

However, the observed sensitivity of model errors to nitrite spikes suggests that this parameter requires special attention in monitoring programs, as it may lead to underestimation of extreme contamination events. Overall, the integration of machine learning with open-access environmental data provides a scalable and transferable framework for groundwater quality assessment, supporting adaptive and data-driven water resource management in urban environments [50].

4. Conclusions

This study presented an integrated machine learning framework for groundwater quality prediction in the Attica basin, combining open-access environmental data, MICE imputation, and the TabNet model. The results indicate that this approach provides reliable WQI predictions in data-scarce and heterogeneous urban environments, with TabNet achieving improved predictive performance and generalization compared to conventional models.

Feature importance and PCA analyses consistently identified nitrate as the dominant driver of groundwater quality variability, while nitrogen-based compounds overall accounted for most of the model importance, reflecting the strong influence of anthropogenic pressures such as agricultural runoff and wastewater discharge. Sensitivity analysis further confirmed a strong and near-linear relationship between nitrate concentration and WQI.

In contrast, nitrite exhibited a distinct behavior, contributing to prediction uncertainty under high-concentration conditions, highlighting the importance of accounting for nonlinear and episodic contamination effects. Temporal analysis also indicated increased variability and potential deterioration of groundwater quality in the later monitoring years (2013–2015) relative to 2006–2008 (Table 2).

Although the study focuses on the Attica River Basin, the proposed framework may be applicable to other groundwater systems characterized by incomplete monitoring records, missing observations, and heterogeneous environmental datasets. Overall, the proposed framework supports cost-effective monitoring strategies by focusing on key parameters and provides a scalable and transferable tool for data-driven groundwater management in urban environments.

Author Contributions

Conceptualization, K.P. and M.M.N.; methodology, K.P. and S.K.B.; data curation, S.K.B., K.P. and M.M.N.; writing—original draft preparation, K.P., M.M.N. and S.K.B.; writing—review and editing, K.P., M.M.N. and S.K.B.; visualization, S.K.B. and M.M.N.; supervision, K.P. and S.K.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used is openly available at https://discodata.eea.europa.eu/ (accessed on 18 April 2025). The modeled datasets are available upon request.

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT 5-5 for text generation. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Pitidis, V.; Tapete, D.; Coaffee, J.; Kapetas, L.; Porto de Albuquerque, J. Understanding the Implementation Challenges of Urban Resilience Policies: Investigating the Influence of Urban Geological Risk in Thessaloniki, Greece. Sustainability 2018, 10, 3573.
2. Abanyie, S.K.; Apea, O.B.; Abagale, S.A.; Amuah, E.E.Y.; Sunkari, E.D. Sources and Factors Influencing Groundwater Quality and Associated Health Implications: A Review. Emerg. Contam. 2023, 9, 100207.
3. Lu, J.; Liu, Y.; Liu, T.; Shi, X.; Zhao, S.; Mi, J.; Shi, Z.; Li, Z. Hydrochemical Characteristics and Coupled Driving Mechanisms of Fluoride Enrichment in the Daihai Basin: Insights from Hydrogeochemical Methods and Machine Learning Models. Environ. Technol. Innov. 2025, 40, 104578.
4. Kumar, D.; Kumar, R.; Sharma, M.; Awasthi, A.; Kumar, M. Global Water Quality Indices: Development, Implications, and Limitations. Total Environ. Adv. 2024, 9, 200095.
5. Uddin, M.G.; Rahman, A.; Rosa Taghikhah, F.; Olbert, A.I. Data-Driven Evolution of Water Quality Models: An in-Depth Investigation of Innovative Outlier Detection Approaches-A Case Study of Irish Water Quality Index (IEWQI) Model. Water Res. 2024, 255, 121499.
6. Guido, R.; Ferrisi, S.; Lofaro, D.; Conforti, D. An Overview on the Advancements of Support Vector Machine Models in Healthcare Applications: A Review. Information 2024, 15, 235.
7. Rahman, A.; Chung, G.C.; Ng, Y.H. Applications, Challenges, and Future Trends of Artificial Intelligence of Things (AIoT)-Enabled Water Quality and Resource Management. Water 2026, 18, 919.
8. Guo, D.; Thomas, J.; Lazaro, A.; Mahundo, C.; Lwetoijera, D.; Mrimi, E.; Matwewe, F.; Johnson, F. Understanding the Impacts of Short-Term Climate Variability on Drinking Water Source Quality: Observations from Three Distinct Climatic Regions in Tanzania. Geohealth 2019, 3, 84–103.
9. Xie, J.; Sun, L.; Zhao, Y.F. On the Data Quality and Imbalance in Machine Learning-Based Design and Manufacturing—A Systematic Review. Engineering 2025, 45, 105–131.
10. Addor, N.; Do, H.X.; Alvarez-Garreton, C.; Coxon, G.; Fowler, K.; Mendoza, P.A. Large-Sample Hydrology: Recent Progress, Guidelines for New Datasets and Grand Challenges. Hydrol. Sci. J. 2020, 65, 712–725.
11. Che Nordin, N.F.; Mohd, N.S.; Koting, S.; Ismail, Z.; Sherif, M.; El-Shafie, A. Groundwater Quality Forecasting Modelling Using Artificial Intelligence: A Review. Groundw. Sustain. Dev. 2021, 14, 100643.
12. Nourani, V.; Ghaffari, A.; Behfar, N.; Foroumandi, E.; Zeinali, A.; Ke, C.-Q.; Sankaran, A. Spatiotemporal Assessment of Groundwater Quality and Quantity Using Geostatistical and Ensemble Artificial Intelligence Tools. J. Environ. Manag. 2024, 355, 120495.
13. Msaddek, M.H.; Alaya, M.B.; Zouhri, L.; Moumni, Y.; Abdelkarim, B. Geo-AI Ensemble Modeling Framework for Assessing Groundwater Contamination Under Anthropogenic Pressures in an Extensive Peri-Urban Agricultural Aquifer to Support Sustainable Groundwater Management. Water 2026, 18, 937.
14. Maggioni, F.; Spinelli, A. A Novel Robust Optimization Model for Nonlinear Support Vector Machine. Eur. J. Oper. Res. 2025, 322, 237–253.
15. Raheja, H.; Goel, A.; Pal, M. Prediction of Groundwater Quality Indices Using Machine Learning Algorithms. Water Pract. Technol. 2021, 17, 336–351.
16. Khoi, D.N.; Quan, N.T.; Linh, D.Q.; Nhi, P.T.T.; Thuy, N.T.D. Using Machine Learning Models for Predicting the Water Quality Index in the La Buong River, Vietnam. Water 2022, 14, 1552.
17. Hridoy, M.A.A.M.; Shawkat, A.I.; Bordin, C.; Acharjee, M.R.; Masood, A.; Baki, A.O.; Al Mamun, M.A. Advanced Machine Learning Models for Accurate Water Quality Classification and WQI Prediction: Implications for Aquatic Disease Risk Management. Sci. Total Environ. 2025, 1008, 180965.
18. Kebede, Y.B.; Yang, M.-D. A Hybrid Framework of Attention-Based Tabular Network and Ensemble Learning with Generative Adversarial Network for Stiffness Modulus Prediction. Case Stud. Constr. Mater. 2025, 23, e05401.
19. Schmid, L.; Roidl, M.; Kirchheim, A.; Pauly, M. Comparing Statistical and Machine Learning Methods for Time Series Forecasting in Data-Driven Logistics—A Simulation Study. Entropy 2024, 27, 25.
20. Hall, T.; Rasheed, K. A Survey of Machine Learning Methods for Time Series Prediction. Appl. Sci. 2025, 15, 5957.
21. Ahmed Osman, A.I.; AlDahoul, N.; Chong, K.L.; Huang, Y.F.; Ng, J.L.; Elshafie, A.; Sherif, M.; Ahmed, A.N. A Review on Machine Learning Models for Drought Monitoring and Forecasting. Clim. Risk Manag. 2025, 50, 100758.
22. Nogueira, L.S.R.; de Carvalho, M.A.S.; Santos, B.D.O.; Yonaba, R.; Bamal, A.; Uddin, M.G.; Bodini, M.; Goliatt, L. A Comparative Study of Ensemble and Non-Ensemble Machine Learning Methods for Predicting River Pollution Index. Ecol. Inform. 2026, 94, 103617.
23. Kourgialas, N.N. A Critical Review of Water Resources in Greece: The Key Role of Agricultural Adaptation to Climate-Water Effects. Sci. Total Environ. 2021, 775, 145857.
24. Rosińska, W.; Jurasz, J.; Przestrzelska, K.; Wartalska, K.; Kaźmierczak, B. Climate Change’s Ripple Effect on Water Supply Systems and the Water-Energy Nexus–A Review. Water Resour. Ind. 2024, 32, 100266.
25. ELSTAT. Hellenic Statistical Authority. Census Results. 2022. Available online: https://www.statistics.gr/ (accessed on 5 January 2026).
26. Ministry of Environment and Energy. 1st Revision of the River Basin Management Plan of the Attica River Basin District (EL06): Summary Report; Special Secretariat for Water, Ministry of Environment and Energy: Athens, Greece, 2021. Available online: https://wfdver.ypeka.gr/wp-content/uploads/2021/12/EL06_1REV_P22b_Perilipsi_EN.pdf (accessed on 25 January 2026).
27. The Mediterranean Climate: An Overview of the Main Characteristics and Issues. In Developments in Earth and Environmental Sciences; Elsevier: Amsterdam, The Netherlands, 2006; Volume 4, pp. 1–26.
28. Georganta, C.; Feloni, E.; Nastos, P.; Baltas, E. Critical Rainfall Thresholds as a Tool for Urban Flood Identification in Attica Region, Greece. Atmosphere 2022, 13, 698.
29. Arianoutsou, M.; Athanasakis, G.; Kazanis, D.; Christopoulou, A. Attica: A Hot Spot for Forest Fires in Greece. Fire 2024, 7, 467.
30. Papadaki, C.; Lagogiannis, S.; Dimitriou, E. Preliminary Analysis of the Water Quality Status in an Urban Mediterranean River. Appl. Sci. 2023, 13, 6698.
31. Stamatis, G.; Lambrakis, N.; Alexakis, D.; Zagana, E. Groundwater Quality in Mesogea Basin in Eastern Attica (Greece). Hydrol. Process. 2006, 20, 2803–2818.
32. Tavoularis, N.; Papathanassiou, G.; Ganas, A.; Argyrakis, P. Development of the Landslide Susceptibility Map of Attica Region, Greece, Based on the Method of Rock Engineering System. Land 2021, 10, 148.
33. Ganas, A.; Pavlides, S.; Karastathis, V. DEM-Based Morphometry of Range-Front Escarpments in Attica, Central Greece, and Its Relation to Fault Slip Rates. Geomorphology 2005, 65, 301–319.
34. Deligiannakis, G.; Papanikolaou, I.D.; Roberts, G. Fault Specific GIS Based Seismic Hazard Maps for the Attica Region, Greece. Geomorphology 2018, 306, 264–282.
35. Hallam, A.; Mukherjee, D.; Chassagne, R. Multivariate Imputation via Chained Equations for Elastic Well Log Imputation and Prediction. Appl. Comput. Geosci. 2022, 14, 100083.
36. Shah, C.; Du, Q.; Xu, Y. Enhanced TabNet: Attentive Interpretable Tabular Learning for Hyperspectral Image Classification. Remote Sens. 2022, 14, 716.
37. Joseph, L.P.; Joseph, E.A.; Prasad, R. Explainable Diabetes Classification Using Hybrid Bayesian-Optimized TabNet Architecture. Comput. Biol. Med. 2022, 151, 106178.
38. Elshaarawy, M.K. Stacked-Based Hybrid Gradient Boosting Models for Estimating Seepage from Lined Canals. J. Water Process Eng. 2025, 70, 106913.
39. Xu, F.; Huang, Y.; Wang, H.; Fan, Z. A Novel Heterogeneous Data Classification Approach Combining Gradient Boosting Decision Trees and Hybrid Structure Model. Pattern Recognit. 2025, 165, 111614.
40. Jalili, A.; Saleki, Z.; Luo, Y.A.; Pan, F.; Chen, A.X.; Draayer, J.P. Performance of Various Kernel Functions for Mass Prediction with Support Vector Machine. Eur. Phys. J. A 2025, 61, 143.
41. Guidelines for Drinking-Water Quality: Fourth Edition Incorporating the First and Second Addenda. Available online: https://www.who.int/publications/i/item/9789240045064 (accessed on 21 April 2026).
42. Sham, F.; Ammar, F.; El-Shafie, A.; Jaafar, W.Z.B.W.; Adarsh, S.; Ahmed, A.N. Machine Learning-Based Model for Groundwater Quality Prediction: A Comprehensive Review and Future Time–Cost Effective Modelling Vision. Arch. Comput. Methods Eng. 2025, 32, 3593–3608.
43. Shikur, H.D.; Yang, M.-D.; Kebede, Y.B. Explainable Pavement Surface Condition Classification Using a TabNet-CatBoost Hybrid Machine Learning Framework. Case Stud. Constr. Mater. 2025, 23, e05333.
44. Abascal, E.; Gómez-Coma, L.; Ortiz, I.; Ortiz, A. Global Diagnosis of Nitrate Pollution in Groundwater and Review of Removal Technologies. Sci. Total Environ. 2022, 810, 152233.
45. Adimalla, N.; Qian, H. Groundwater Quality Evaluation Using Water Quality Index (WQI) for Drinking Purposes and Human Health Risk (HHR) Assessment in an Agricultural Region of Nanganur, South India. Ecotoxicol. Environ. Saf. 2019, 176, 153–161.
46. Jun, C.; Kim, D.; Bateni, S.M.; Biyari, M.; Salwana, E.; Sajedi Hosseini, F.; Mosavi, A.; Pai, H.-T.; Choubin, B. Aquifer Vulnerability Assessment in Data-Scarce Areas: A Spatially Explicit Assessment. Geomat. Nat. Hazards Risk 2025, 16, 2487816.
47. Carvalho, T.P.; Soares, F.A.A.M.N.; Vita, R.; Francisco, R.D.P.; Basto, J.P.; Alcalá, S.G.S. A Systematic Literature Review of Machine Learning Methods Applied to Predictive Maintenance. Comput. Ind. Eng. 2019, 137, 106024.
48. Shakeri, A.; Hosseini, H.; Rastegari Mehr, M.; Dashti Barmaki, M. Groundwater Quality Evaluation Using Water Quality Index (WQI) and Human Health Risk (HHR) Assessment in Herat Aquifer, West Afghanistan. Hum. Ecol. Risk Assess. An. Int. J. 2022, 28, 711–733.
49. Wakida, F.T.; Lerner, D.N. Non-Agricultural Sources of Groundwater Nitrate: A Review and Case Study. Water Res. 2005, 39, 3–16.
50. Yu, Z.; Bu, C.; Li, Y. Machine Learning for Ecological Analysis. Chem. Eng. J. 2025, 507, 160780.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

