1. Introduction
Urban water systems present a multitude of scientific challenges due to their intrinsic complexity, spatial heterogeneity, and the dynamic interactions between anthropogenic activities and natural processes [1]. In particular, groundwater quality in urban environments is influenced by a variety of factors, including land use change, wastewater discharge, climate variability, and socio-economic pressures [2]. These interacting drivers result in highly variable hydrochemical conditions, often characterized by nonlinear responses and strong spatial and temporal heterogeneity, which complicate robust assessment and predictive modeling [3].
The evaluation of groundwater quality through composite indices, such as the Water Quality Index (WQI), has been widely adopted as an effective tool for summarizing complex physicochemical information into a single, interpretable metric [4]. WQI facilitates communication between scientists, policymakers, and stakeholders, supporting water resource management and decision-making processes. However, despite its widespread application, WQI-based assessments remain sensitive to data quality, parameter selection, and temporal resolution, while their reliability is often constrained by the limited availability of high-frequency monitoring data [5]. Consequently, the modeling and prediction of water quality indices (WQIs) under such multifactorial conditions require robust analytical frameworks capable of accommodating data limitations while preserving predictive accuracy [6]. This limitation is particularly critical in urban environments, where rapid environmental changes and contamination events are difficult to capture using sparse or irregular datasets [7].
One of the fundamental challenges in contemporary urban hydrology is therefore the need for high-temporal-resolution data capable of capturing transient processes and short-term variability [8]. However, environmental datasets are frequently characterized by missing values, inconsistent sampling intervals, and heterogeneous data structures, which introduce uncertainty and limit the applicability of conventional analytical approaches [9]. Addressing these data limitations is essential for improving the robustness and predictive capability of groundwater quality assessments.
In recent years, the increasing availability of open-access environmental datasets has created new opportunities for addressing these challenges. The use of standardized global datasets enables harmonized analyses across different geographical regions, facilitating cross-country comparisons and integrated environmental assessments. Nevertheless, open data sources are often characterized by heterogeneous spatial and temporal resolutions, data gaps, inconsistent metadata, and uncertainties related to data quality and governance, thereby necessitating the development of robust methodologies capable of effectively handling incomplete and heterogeneous datasets [10].
To address the complexity of urban hydrological and hydrogeological systems, machine learning (ML) techniques have been increasingly adopted as powerful tools for environmental modelling. These systems are governed by the interaction between surface-water processes, groundwater flow, aquifer characteristics, and anthropogenic pressures, resulting in highly complex environmental responses. Algorithms such as Random Forest (RF), Support Vector Machine (SVM), Artificial Neural Network (ANN), and boosting methods have demonstrated strong capabilities in capturing nonlinear relationships and handling high-dimensional datasets [11,12,13]. Among these approaches, SVM is widely recognized for its robustness in modeling nonlinear relationships and its effectiveness in relatively small datasets [6,14], while Gradient Boosting Machines (GBM) provide enhanced predictive performance through ensemble learning and iterative error minimization [15,16,17]. More recently, deep learning architectures such as TabNet have been introduced, offering improved performance for tabular data by leveraging sequential attention mechanisms, while retaining a degree of interpretability compared to conventional neural networks [18].
Recent studies have shown that ML approaches can outperform traditional statistical and physically based models, particularly in complex and data-limited environments, due to their ability to learn intricate patterns from heterogeneous datasets [19,20]. Nevertheless, despite these advancements, several challenges remain. ML models are often sensitive to data quality and may suffer from overfitting or reduced generalization when applied to datasets with missing values or limited temporal resolution. Furthermore, the lack of interpretability in many ML approaches has raised concerns regarding their applicability in environmental decision-making contexts [9,21]. In addition, the combined impact of missing data and irregular temporal resolution on ML model performance remains insufficiently investigated in groundwater quality studies, particularly when integrated with composite indices such as WQI [22].
Despite the growing application of machine learning techniques in groundwater quality assessment, relatively few studies have simultaneously addressed the challenges of incomplete environmental datasets, groundwater quality index prediction, and model interpretability within highly urbanized Mediterranean aquifer systems. In this context, the present study introduces an integrated framework that combines open-access groundwater chemistry datasets, advanced multivariate imputation through Multiple Imputation by Chained Equations with Predictive Mean Matching (MICE-PMM), and the attention-based TabNet deep learning architecture. The proposed approach enables robust groundwater quality prediction under data-scarce conditions while preserving model interpretability and supporting hydrochemical understanding of groundwater quality drivers. The novelty of this work lies in the simultaneous integration of data-gap handling, explainable deep learning, and groundwater quality assessment within a Mediterranean peri-urban aquifer system. By combining statistical analysis, hydrochemical interpretation, sensitivity assessment, and machine learning explainability, the study provides a transferable framework for groundwater quality monitoring and management in regions characterized by fragmented environmental datasets.
The Attica river basin in Greece represents a characteristic example of a highly stressed urban hydrological system. This area reflects common hydrological challenges observed in urban Mediterranean environments, where rapid urbanization has significantly altered natural hydrological processes. The region is affected by intense urbanization, industrial activities, agricultural pressures, and increasing water demand, which collectively impact groundwater quality [23]. Additionally, climate variability and irregular precipitation patterns further exacerbate the vulnerability of groundwater resources, highlighting the need for reliable monitoring and predictive tools [24]. Despite its importance, groundwater quality assessment in the region is often limited by fragmented datasets, missing observations, and a lack of high-frequency monitoring, making it an ideal case study for the application of data-driven methodologies.
Building upon these challenges, this study integrates open geospatial datasets with advanced machine learning approaches to model and predict groundwater quality dynamics in a complex urban environment, while explicitly examining the effects of data scarcity, missing values, and temporal irregularity. Specifically, it aims to: (i) develop and compare machine learning models (TabNet, SVM, and Gradient Boosting Machines) for forecasting groundwater quality based on WQI, (ii) evaluate the effectiveness of advanced imputation techniques (MICE-PMM) against conventional methods, (iii) identify the key environmental drivers influencing groundwater quality variability, and (iv) evaluate the potential of open-access data to support scalable and transferable urban hydrological modeling frameworks.
2. Materials and Methods
2.1. Study Area
This study focuses on the river basin district of Attica, located in central Greece and encompassing the metropolitan area of Athens. The Attica river basin district covers an area of approximately 3187 km² and hosts a population exceeding 3.8 million inhabitants, representing nearly 35% of Greece’s total population [25]. The study area constitutes a highly urbanized system characterized by intense anthropogenic pressures, rapid population growth, and continuous land-use transformations, particularly during the second half of the twentieth century (Figure 1) [26].
The climatic conditions of the Attica region are characteristic of a Mediterranean (Csa) climate, defined by warm, dry summers and mild, wet winters [27]. The mean annual temperature is around 18.5 °C, while mean annual precipitation varies spatially between 350 and 750 mm, increasing from the coastal plains towards the surrounding mountainous zones such as Parnitha (1413 m a.s.l.), Penteli (1109 m a.s.l.), and Hymettus (1026 m a.s.l.). Precipitation is typically characterized by high-intensity, short-duration storm events, which frequently trigger flash flooding in urbanized lowland areas [28]. Wildfires represent an additional environmental pressure in the Attica region, significantly altering land cover, enhancing soil erosion, and modifying surface runoff patterns, groundwater recharge processes, and the overall hydrogeological balance [29]. From a geomorphological perspective, the Attica Basin is a semi-closed hydrological system bounded by mountainous terrain to the north and east and opening towards the Saronic Gulf to the southwest.
The hydrographic network of the Attica region is primarily composed of ephemeral and intermittent streams, with the Kifissos and Ilissos rivers representing the main drainage axes. However, extensive urban development, channel modifications, and the widespread culverting of natural watercourses by transport and drainage infrastructure have significantly altered the natural hydrological regime, reducing infiltration capacity and modifying flow dynamics. In addition to these structural changes, multiple anthropogenic pollution sources further degrade urban water quality, including industrial discharges, effluents from wastewater treatment systems, agricultural runoff, and diffuse inputs associated with recreational and illegal activities [26,30].
The geological structure of the area is dominated by Mesozoic limestones and schists, interbedded with Neogene marls and Quaternary alluvial deposits, which exert a strong control on subsurface flow patterns and aquifer recharge processes. Urban imperviousness, together with soil compaction and the reduction in permeable surfaces, has intensified hydrological extremes, particularly in western Attica, where flood-prone zones such as Mandra and Nea Peramos have been repeatedly affected by catastrophic events in recent decades [31,32]. Furthermore, the Attica region is characterized by significant seismic activity associated with a dense network of active faults, which may locally influence groundwater flow paths, fracture permeability, and hydrogeological connectivity [33,34].
Figure 1.
Land use map of Attica River Basin District based on Corine dataset.
2.2. Dataset Origin and Overall Methodology
The dataset used in this study was obtained for the study area from the European Environment Agency (EEA) via its official data platform, Discodata (https://discodata.eea.europa.eu/). This open-access dataset includes 958 groundwater quality measurements, covering 15 parameters from 80 monitoring stations. However, because several parameters, including the major cations, had substantial missing values (>25%) at numerous monitoring points, the analysis was restricted to six key variables, namely electrical conductivity (EC), chloride (Cl⁻), sulphate (SO₄²⁻), nitrate (NO₃⁻), nitrite (NO₂⁻), and ammonium (NH₄⁺), which were subsequently used for Water Quality Index (WQI) calculation and prediction. The dataset represents the complete set of publicly available groundwater quality observations in the EEA monitoring database for the Attica River Basin during the study period. Consequently, the sample size was determined by the availability of monitoring records within the official groundwater monitoring network rather than by a predefined sampling design, and procedures such as randomization and blinding were not applicable. Potential sources of bias were minimized through standardized data preprocessing, objective missing-data treatment, train-test separation, and the use of multiple performance metrics for model evaluation.
The methodological framework adopted in this study (Figure 2) follows a structured workflow for groundwater quality prediction using machine learning techniques. Initially, groundwater quality data for the Attica River Basin, including key physicochemical parameters (NH₄⁺, Cl⁻, EC, NO₃⁻, NO₂⁻, SO₄²⁻) and WQI, were collected and assessed for missing values. Data gaps (ranging between 8 and 13% in the retained variables) were addressed using Multiple Imputation by Chained Equations (MICE) with Predictive Mean Matching (PMM), while imputation performance was assessed using Kolmogorov–Smirnov (K–S) tests and Standardized Mean Differences (SMD < 0.1), ensuring statistical consistency between observed and imputed distributions. Subsequently, exploratory data analysis (EDA) was performed, including correlation matrix evaluation, descriptive statistics, and assessment of temporal variability (trend and seasonality). Preprocessing steps, including winsorization for outlier mitigation and Min–Max feature scaling, were applied prior to model development. The processed dataset was then partitioned into training (80%) and testing (20%) subsets to enable model evaluation under unseen conditions. No observations were removed from the dataset based on outlier detection criteria. Instead, winsorization was applied to reduce the influence of extreme values while preserving all available groundwater quality observations. This approach was considered appropriate because extreme values may reflect genuine hydrochemical conditions or localized contamination events rather than measurement errors.
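The preprocessing steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study: the winsorization percentiles (5th and 95th), the nearest-rank percentile rule, the random seed, and all function names are assumptions, since the text does not report them.

```python
import random


def winsorize(values, lower_pct=0.05, upper_pct=0.95):
    """Clip values to empirical percentiles (assumed limits, nearest-rank rule).

    No observation is dropped: extremes are replaced by the percentile value.
    """
    ordered = sorted(values)
    n = len(ordered)
    low = ordered[int(lower_pct * (n - 1))]
    high = ordered[int(upper_pct * (n - 1))]
    return [min(max(v, low), high) for v in values]


def min_max_scale(values):
    """Rescale values linearly to the [0, 1] interval."""
    low, high = min(values), max(values)
    if high == low:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Randomly partition rows into training and testing subsets (80/20 by default)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(test_fraction * len(rows)))
    test_idx = set(indices[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test
```

One design choice worth considering when adapting this sketch: deriving the winsorization limits and the scaling range from the training subset only, and then applying them to the test subset, avoids passing information from the test data into model development.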
Three machine learning models—TabNet, Gradient Boosting Machine (GBM), and Support Vector Machine (SVM)—were implemented to predict groundwater quality based on WQI. Model performance was evaluated using multiple statistical metrics, including the coefficient of determination (R2), root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and Nash–Sutcliffe efficiency (NSE), providing a comprehensive assessment of predictive accuracy and model robustness. Residual diagnostics, including Q–Q plots and distribution analysis, were further employed to evaluate model behavior. Finally, post-prediction analysis was conducted using global feature importance (derived from TabNet) and sensitivity analysis (tornado plots), enabling the identification of key variables influencing WQI.
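For reference, the five evaluation metrics listed above can be computed as in the following sketch. Two points are assumptions rather than statements from the study: R² is taken here as the squared Pearson correlation between observed and predicted values (if it were defined as one minus the ratio of residual to total sum of squares, it would coincide with NSE), and MAPE is computed only for non-zero observations. The function name is hypothetical.

```python
import math


def regression_metrics(observed, predicted):
    """Return R2, RMSE, MAE, MAPE (%) and NSE for paired observed/predicted values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    errors = [p - o for o, p in zip(observed, predicted)]

    sse = sum(e * e for e in errors)                   # residual sum of squares
    sst = sum((o - mean_obs) ** 2 for o in observed)   # total sum of squares
    spp = sum((p - mean_pred) ** 2 for p in predicted)
    cov = sum((o - mean_obs) * (p - mean_pred)
              for o, p in zip(observed, predicted))

    return {
        # Assumption: R2 as squared Pearson correlation.
        "R2": (cov * cov) / (sst * spp),
        "RMSE": math.sqrt(sse / n),
        "MAE": sum(abs(e) for e in errors) / n,
        # MAPE is undefined when an observed value is zero.
        "MAPE": 100.0 / n * sum(abs(e / o) for e, o in zip(errors, observed)),
        # Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean.
        "NSE": 1.0 - sse / sst,
    }
```

A constant bias illustrates why reporting several metrics is useful: predictions that are uniformly offset from the observations keep a squared correlation of 1 while NSE, RMSE, and MAE all register the error.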
Figure 2.
Methodological approach adopted in the research.
2.3. Missing Data Handling
To address missing values in the groundwater quality time series, this study employed Multiple Imputation by Chained Equations (MICE), a widely used multivariate imputation framework that accounts for statistical relationships among variables during the imputation process. The imputation was performed using the Predictive Mean Matching (PMM) algorithm, which is particularly suitable for environmental datasets characterized by non-normal distributions and skewed values.
MICE operates through an iterative procedure, where each variable with missing values is sequentially imputed using regression models based on the remaining variables. In this context, PMM selects a set of candidate donor values from observed cases whose predicted values are closest to those of the missing observations [35]. Formally, for a missing value $y_j^{mis}$, a predicted value $\hat{y}_j^{mis}$ is first estimated, and donor values are identified from observed cases $y_i^{obs}$ with similar predicted values $\hat{y}_i^{obs}$, as expressed in Equation (1):

$$d_{ij} = \left| \hat{y}_i^{obs} - \hat{y}_j^{mis} \right|, \quad i = 1, \dots, n_{obs}, \quad j = 1, \dots, n_{mis} \tag{1}$$

where $n_{obs}$ and $n_{mis}$ denote the number of observed and missing values, respectively, and the observed cases with the smallest $d_{ij}$ form the donor pool for missing entry $j$. One of the observed donor values is then randomly selected and used to impute the missing entry, thereby preserving the original data distribution.
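The donor-selection idea behind PMM can be illustrated with a deliberately simplified sketch. It is not the procedure used in the study: it imputes a single variable from a single complete predictor with an ordinary least-squares line, performs one pass instead of the chained iterations of MICE, produces one completed dataset instead of several, and omits the random draw of regression parameters that full PMM implementations include. The donor pool size `k`, the seed, and all function names are assumptions.

```python
import random


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pmm_impute(x, y, k=5, seed=0):
    """Fill None entries of y by predictive mean matching on predictor x.

    For each missing entry, the k observed cases with the closest predicted
    values form the donor pool, and one observed donor value is drawn at random.
    """
    rng = random.Random(seed)
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    intercept, slope = fit_line([o[0] for o in observed],
                                [o[1] for o in observed])

    def predict(xi):
        return intercept + slope * xi

    # Pair each observed value with its own predicted value.
    donors = [(predict(xi), yi) for xi, yi in observed]

    completed = []
    for xi, yi in zip(x, y):
        if yi is not None:
            completed.append(yi)
            continue
        target = predict(xi)
        pool = sorted(donors, key=lambda d: abs(d[0] - target))[:k]
        completed.append(rng.choice(pool)[1])  # an actually observed value
    return completed
```

Because every imputed entry is copied from an observed case, imputed values can never fall outside the observed range, which is why PMM suits skewed, non-negative concentration data.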
The MICE imputation was validated using Standardized Mean Differences (SMD) (Figure 3A). All variables displayed an SMD below the 0.1 threshold, with an average accuracy score of 95.02%, indicating that the distributional integrity of the water quality parameters was preserved (Figure 3B). The MICE algorithm reproduced the central tendency and variance of the water quality samples. In addition, the density distribution comparison (Figure 3C) indicates that the statistical “fingerprint” of the chemical parameters remained stable after the imputation process.
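The two diagnostics named in the methodology, the SMD and the Kolmogorov–Smirnov statistic, can be computed as sketched below. The exact SMD formula used in the study is not stated, so the pooled-standard-deviation form shown here is an assumption, as are the function names. The K–S function returns only the test statistic (the maximum distance between the two empirical distribution functions), not a p-value.

```python
import math
import statistics


def standardized_mean_difference(observed, completed):
    """Absolute mean difference divided by the pooled standard deviation (assumed form)."""
    mean_diff = statistics.fmean(completed) - statistics.fmean(observed)
    pooled_sd = math.sqrt(
        (statistics.variance(observed) + statistics.variance(completed)) / 2
    )
    return abs(mean_diff) / pooled_sd if pooled_sd > 0 else 0.0


def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: largest gap between the empirical CDFs."""
    def ecdf(sample, value):
        return sum(1 for v in sample if v <= value) / len(sample)

    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf(sample_a, p) - ecdf(sample_b, p)) for p in points)
```

Under this convention, a variable would pass the check reported in the text when `standardized_mean_difference(observed, completed)` is below 0.1.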
Figure 3.
Validation of MICE Imputation Accuracy. (A) Standardized Mean Difference (SMD) across all numeric variables; (B) Consistency of WQI categorical descriptions between original (colored in blue) and modeled datasets (colored in red). (C) Density distribution comparison ensuring that the statistical “fingerprint” of the chemical parameters remained stable after the imputation process.
2.4. Dataset Statistical Analysis
The correlation matrix (Figure 4) provides insight into the relationships among the selected physicochemical parameters and the WQI. A strong positive correlation is observed between electrical conductivity and chloride (r = 0.96), suggesting a common origin associated with salinity and dissolved ionic content. Sulphate also exhibits moderate to strong correlations with electrical conductivity (r = 0.68) and chloride (r = 0.61), indicating the influence of mineralization processes and possible geogenic contributions.
Since WQI is a composite index derived from the weighted contribution of EC, Cl⁻, SO₄²⁻, NO₃⁻, NO₂⁻ and NH₄⁺, the correlations shown in Figure 4 reflect the relative influence of each physicochemical parameter on overall groundwater quality and drinking-water suitability. Higher correlation coefficients indicate a stronger contribution of the corresponding parameter to the variability of the calculated WQI values.
Among the nutrient-related parameters, nitrite shows the strongest correlation with WQI (r = 0.85), followed by nitrate (r = 0.52), highlighting the importance of nitrogen species in influencing groundwater quality conditions. In contrast, ammonium displays weak correlations with both WQI (r = 0.20) and other variables, suggesting a more localized or variable behavior, potentially linked to specific sources or redox conditions. Overall, the results indicate that both nutrient loading and ionic composition contribute to groundwater quality variability, with nitrogen compounds emerging as key controlling factors in the dataset.
Figure 4.
Correlation matrix of the physicochemical parameters.
Principal Component Analysis (PCA) was further applied to explore the underlying structure of the dataset and to identify the dominant factors controlling water quality variability (Figure 5). The scree plot (top panel) shows the proportion of variance explained by each principal component. The first principal component (PC1) accounts for 44.1% of the total variance, followed by PC2 explaining 21.4%. Together, the first two components explain approximately 65.5% of the total variance, indicating that they capture the majority of the dataset variability. Subsequent components contribute progressively less, with PC3 (14.9%) and PC4 (12.9%) showing moderate contributions, while the remaining components explain less than 7% each.
The biplot (Figure 5, bottom panel) illustrates the loadings of the variables on the first two principal components. Variables such as WQI, nitrate, and nitrite exhibit strong positive loadings on both PC1 and PC2, indicating their significant contribution to overall water quality variability. In contrast, electrical conductivity, sulphate, and chloride are positively associated with PC1 but negatively correlated with PC2, suggesting the presence of a secondary hydrochemical gradient, likely related to ionic composition and mineral dissolution processes. Ammonium shows a comparatively weaker contribution, as indicated by its proximity to the origin.
Overall, the PCA results indicate that nutrient-related parameters (particularly nitrogen species) constitute the primary source of variability in the dataset, while chloride and sulphate represent a secondary hydrochemical gradient associated with groundwater mineralization. Electrical conductivity, which reflects the overall dissolved ionic content of groundwater, is strongly associated with this gradient and serves as an integrated indicator of ionic concentration.
Figure 5.
Principal component analysis of water quality parameters.
2.5. Machine Learning Models Methodology
2.5.1. Tabular Network (TabNet) Model
TabNet is a deep learning architecture designed for tabular data, enabling the effective learning of high-dimensional feature representations [36,37]. The model employs a sparse, instance-wise feature selection mechanism, which enhances learning efficiency during the training process, as described in Equation (2).

$$M[i] = \mathrm{sparsemax}\left( P[i-1] \cdot h_i(a[i-1]) \right) \tag{2}$$

where $h_i(\cdot)$ denotes a trainable transformation composed of fully connected (FC) and batch normalization (BN) layers, while $P[i-1]$ represents the prior scale information from the previous decision step. Feature selection is achieved through the application of a sparsemax activation function, which performs coefficient normalization and promotes sparsity in the selected features.
The learnable mask M[i]∈R^(B×D) is used for sparse selection of the most salient features. Since the masking process is multiplicative (i.e., M[i]×f), an attentive transformer is used to obtain the masks using the processed features from the previous step (i.e., a[i − 1]). The resulting representation is then transformed according to Equation (3).
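The sparsemax activation used to build the feature mask can be written compactly. The sketch below follows the standard sort-and-threshold formulation of sparsemax for a single sample; it is illustrative only and is not taken from any TabNet library. The helper `attentive_mask` and its argument names are hypothetical, and the trainable transformation is represented simply by a list of already transformed scores.

```python
def sparsemax(scores):
    """Project scores onto the probability simplex, yielding exact zeros.

    Unlike softmax, low-scoring entries receive a weight of exactly 0,
    which is what makes the resulting feature mask sparse.
    """
    ordered = sorted(scores, reverse=True)
    cumulative = 0.0
    support_size, support_sum = 0, 0.0
    for j, z in enumerate(ordered, start=1):
        cumulative += z
        # Entry j stays in the support while 1 + j * z_(j) > sum of the top j scores.
        if 1 + j * z > cumulative:
            support_size, support_sum = j, cumulative
    threshold = (support_sum - 1) / support_size
    return [max(z - threshold, 0.0) for z in scores]


def attentive_mask(prior_scale, transformed_scores):
    """Mask for one decision step: sparsemax of prior scale times transformed scores."""
    return sparsemax([p * h for p, h in zip(prior_scale, transformed_scores)])
```

The returned weights are non-negative and sum to one, so the mask can be read directly as the share of attention each input feature receives at that decision step.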
2.5.2. Generalized Boosted Regression Modeling (GBM)
GBM is employed due to its strong predictive performance and flexibility in handling structured and tabular data [38]. The boosting approach builds an ensemble of weak learners in a sequential manner, where each subsequent model aims to correct the errors of the previous one, typically using decision trees as base learners [39]. This iterative process reduces overall prediction error and enhances model accuracy and generalization performance. Given its ability to effectively capture nonlinear relationships and handle heterogeneous tabular datasets, GBM is particularly suitable for the present application, as described in Equation (4).

$$F_M(x) = F_0(x) + \lambda \sum_{m=1}^{M} f_m(x) \tag{4}$$

where $F_0(x)$ denotes the initial model (e.g., the mean of the response variable), $f_m(x)$ represents the base learner, typically a decision tree, fitted at iteration $m$, and $\lambda \in (0, 1]$ is the learning rate (shrinkage parameter), which controls the contribution of each learner to the final model. The parameter $M$ denotes the total number of boosting iterations.
At each iteration $m$, the model computes the negative gradient (pseudo-residuals) of the loss function with respect to the current prediction $F_{m-1}(x)$. A base learner $f_m(x)$ is then fitted to these residuals, and the model is updated accordingly, as shown in Equation (5). This iterative procedure continues until either convergence is achieved (i.e., no significant improvement in the loss function) or a predefined maximum number of iterations $M$ is reached.

$$F_m(x) = F_{m-1}(x) + \lambda f_m(x) \tag{5}$$
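The boosting loop described above can be made concrete with a small sketch. It is a toy version under stated assumptions, not the GBM configuration used in the study: it uses a squared-error loss (for which the pseudo-residuals are simply the observed values minus the current predictions), a single input feature, and one-split regression stumps as base learners. The function names, the number of iterations, and the learning rate are illustrative.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression stump minimizing squared error on one feature.

    Requires at least two distinct values in x.
    """
    best = None
    levels = sorted(set(x))
    for left_level, right_level in zip(levels, levels[1:]):
        split = (left_level + right_level) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= split]
        right = [r for xi, r in zip(x, residuals) if xi > split]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, left_mean, right_mean)
    _, split, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= split else right_mean


def gradient_boost(x, y, n_iter=50, learning_rate=0.1):
    """Return a predictor built as F_0 plus shrunken base learners."""
    f0 = sum(y) / len(y)                 # initial model: mean of the response
    learners = []
    current = [f0] * len(y)
    for _ in range(n_iter):
        # Negative gradient of the squared-error loss = ordinary residuals.
        residuals = [yi - fi for yi, fi in zip(y, current)]
        stump = fit_stump(x, residuals)
        learners.append(stump)
        # Update step: add the new learner scaled by the learning rate.
        current = [fi + learning_rate * stump(xi)
                   for xi, fi in zip(x, current)]

    def predict(xi):
        return f0 + learning_rate * sum(s(xi) for s in learners)

    return predict
```

With a learning rate of 0.1, each iteration removes only a tenth of the remaining residual captured by the stump, which illustrates the trade-off between shrinkage and the number of iterations needed.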
2.5.3. Support Vector Machine Model (SVM)
Support Vector Machines (SVM) are supervised learning models that construct an optimal hyperplane to separate data in the feature space. The decision boundary is defined such that it maximizes the margin between support vectors belonging to different classes. Although originally developed for classification tasks, SVMs can be effectively extended to regression problems through Support Vector Regression (SVR) [40]. In this study, SVM is applied for groundwater quality prediction based on WQI. The method is particularly effective in capturing nonlinear relationships by employing kernel functions, which map the input data into a higher-dimensional feature space where linear separation becomes feasible. The model formulation is described in Equation (6).

$$f(x) = \sum_{i \in SV} \alpha_i \, y_i \, K(x_i, x) + b \tag{6}$$

where $\alpha_i$ are the Lagrange multipliers (non-zero only for support vectors), $y_i$ are the target labels, $K(x_i, x)$ is the kernel function (e.g., linear), $SV$ is the set of support vectors, and $b$ is the bias term.
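How a fitted support vector model produces a prediction can be shown by evaluating the kernel expansion directly. The sketch below assumes that training has already produced the support vectors, one coefficient per support vector, and a bias term; it does not solve the underlying optimization problem. In regression (SVR), the coefficient of each kernel term is the difference between two dual variables rather than a multiplier times a class label. The function names and the kernel parameter `gamma` are illustrative assumptions, and the study does not state which kernel was selected.

```python
import math


def linear_kernel(u, v):
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(u, v, gamma=0.5):
    """Radial basis function kernel; gamma is an assumed value."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def kernel_expansion(x, support_vectors, coefficients, bias=0.0,
                     kernel=rbf_kernel):
    """Evaluate the weighted sum of kernel values against the support vectors.

    Each coefficient is the weight attached to one support vector
    (multiplier times label in classification, a dual difference in SVR).
    """
    return sum(c * kernel(sv, x)
               for sv, c in zip(support_vectors, coefficients)) + bias
```

Only the support vectors enter the sum, so prediction cost scales with their number rather than with the size of the training set.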
2.6. Water Quality Index (WQI) Calculation
A weighted arithmetic WQI method was applied to evaluate groundwater quality. Six physicochemical parameters were considered: electrical conductivity (EC), chloride (Cl⁻), sulphate (SO₄²⁻), nitrate (NO₃⁻), nitrite (NO₂⁻), and ammonium (NH₄⁺). To assess the suitability of groundwater for drinking purposes, the standards recommended by the World Health Organization [41] and the Drinking Water Directive (98/83/EC) were used for the calculation of WQI, which involves three steps. In the first step, each parameter was assigned a weight ($w_i$) on a scale of 1–5, reflecting its relative importance in water quality assessment and its potential impact on human health. In the second step, relative weights ($W_i$) are calculated through Equation (7).

$$W_i = \frac{w_i}{\sum_{i=1}^{n} w_i} \tag{7}$$

where $W_i$ is the relative weight, $w_i$ is the weight of each parameter, and $n$ is the number of parameters. In the third step, the quality rating ($Q_i$) of each parameter is computed by dividing its concentration in each groundwater sample by the corresponding drinking water quality standard and multiplying the result by 100, as in Equation (8).

$$Q_i = \frac{C_i}{S_i} \times 100 \tag{8}$$

where $Q_i$ is the quality rating, $C_i$ is the concentration of each chemical parameter in each water sample in milligrams per liter (mg/L; μS/cm for EC), and $S_i$ is the corresponding standard taken from the European Drinking Water Directive (98/83/EC). Finally, the water quality sub-index ($SI_i$) for each chemical parameter is computed by Equation (9), and the overall WQI is determined by Equation (10).

$$SI_i = W_i \times Q_i \tag{9}$$

$$WQI = \sum_{i=1}^{n} SI_i \tag{10}$$

where $SI_i$ is the sub-index of the ith parameter, $Q_i$ is the rating based on the concentration of the ith parameter, and $n$ is the total number of parameters.
The overall WQI for each groundwater sample was calculated as the sum of the weighted quality ratings of all selected parameters. Subsequently, mean WQI values were computed for each monitoring site and year to facilitate spatial analysis and mapping. Based on the calculated WQI values, groundwater quality was classified into five categories: excellent (WQI < 50), good (50–100), poor (100–200), very poor (200–300), and unsuitable for drinking purposes (WQI > 300).
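The three-step calculation and the five-class scheme can be sketched as follows. The weights in `EXAMPLE_WEIGHTS` are hypothetical placeholders on the 1–5 scale, since the values assigned in the study are not listed in this section. The limits in `DIRECTIVE_LIMITS` are the parametric values of Directive 98/83/EC as understood here (EC in μS/cm, all others in mg/L) and should be checked against the table used by the authors. The treatment of values falling exactly on a class boundary is also an assumption.

```python
# Parametric values from Directive 98/83/EC (EC in uS/cm, others in mg/L).
DIRECTIVE_LIMITS = {"EC": 2500.0, "Cl": 250.0, "SO4": 250.0,
                    "NO3": 50.0, "NO2": 0.5, "NH4": 0.5}

# Hypothetical weights on a 1-5 scale; NOT the values used in the study.
EXAMPLE_WEIGHTS = {"EC": 4, "Cl": 3, "SO4": 3, "NO3": 5, "NO2": 5, "NH4": 4}


def compute_wqi(sample, weights=EXAMPLE_WEIGHTS, limits=DIRECTIVE_LIMITS):
    """Weighted arithmetic WQI for one sample (dict of parameter -> concentration)."""
    total_weight = sum(weights.values())
    wqi = 0.0
    for name, weight in weights.items():
        relative_weight = weight / total_weight          # W_i
        quality_rating = sample[name] / limits[name] * 100  # Q_i
        wqi += relative_weight * quality_rating          # SI_i summed into WQI
    return wqi


def classify_wqi(wqi):
    """Map a WQI value to the five classes (boundary handling is assumed)."""
    if wqi < 50:
        return "excellent"
    if wqi < 100:
        return "good"
    if wqi < 200:
        return "poor"
    if wqi <= 300:
        return "very poor"
    return "unsuitable"
```

Because the relative weights sum to one, a sample in which every parameter sits exactly at its limit yields a WQI of 100 regardless of the weights chosen, which gives a convenient sanity check for any implementation.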
4. Conclusions
This study presented an integrated machine learning framework for groundwater quality prediction in the Attica basin, combining open-access environmental data, MICE imputation, and the TabNet model. The results indicate that this approach provides reliable WQI predictions in data-scarce and heterogeneous urban environments, with TabNet achieving improved predictive performance and generalization compared to conventional models.
Feature importance and PCA analyses consistently identified nitrate as the dominant driver of groundwater quality variability, while nitrogen-based compounds overall accounted for most of the model importance, reflecting the strong influence of anthropogenic pressures such as agricultural runoff and wastewater discharge. Sensitivity analysis further confirmed a strong and near-linear relationship between nitrate concentration and WQI.
In contrast, nitrite exhibited a distinct behavior, contributing to prediction uncertainty under high-concentration conditions, highlighting the importance of accounting for nonlinear and episodic contamination effects. Temporal analysis also indicated increased variability and potential deterioration of groundwater quality in recent years.
Although the study focuses on the Attica River Basin, the proposed framework may be applicable to other groundwater systems characterized by incomplete monitoring records, missing observations, and heterogeneous environmental datasets. Overall, the proposed framework supports cost-effective monitoring strategies by focusing on key parameters and provides a scalable and transferable tool for data-driven groundwater management in urban environments.