Groundwater Salinity Prediction in Deep Desert-Stressed Aquifers Using a Novel Multi-Stage Modeling Framework Integrating Enhanced Ensemble Learning and Hybrid AI Techniques

Msaddek, Mohamed Haythem; Abdelkarim, Bilel; Zouhri, Lahcen; Moumni, Yahya

doi:10.3390/w17162452

¹ Faculty of Sciences of Tunis, University of Tunis El Manar, Tunis 2092, Tunisia

² Institute of Earth Sciences, Pole of University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal

³ Fiber Materials and Environmental Technologies (FibEnTech-UBI), Universidade da Beira Interior, R. Marquês de D’Ávila e Bolama, 6201-001 Covilhã, Portugal

⁴ AGHYLE, Institut Polytechnique UniLaSalle Beauvais, 19 Rue Pierre Waguet, 60026 Beauvais, France

⁵ Department of Earth Sciences, Faculty of Sciences of Bizerte, University of Carthage, Bizerte 7120, Tunisia

* Authors to whom correspondence should be addressed.

Water 2025, 17(16), 2452; https://doi.org/10.3390/w17162452

Submission received: 30 June 2025 / Revised: 15 August 2025 / Accepted: 16 August 2025 / Published: 19 August 2025

(This article belongs to the Section Water Quality and Contamination)

Abstract

Salinization of deep groundwater is a significant environmental and economic challenge in arid and desert zones, driven by both natural processes and human activities. Understanding the causes and dynamics of groundwater salinity is essential for protecting water quality and ensuring sustainable resource use. This study presents a novel approach, using hybrid artificial intelligence methods built upon enhanced ensemble decision tree models (EdTE-ML), including CatBoost (CatBR-m), ExtraTrees (ExTR-m), and custom Bootstrapping Regressor (BsTR-m), within a two-stage predictive framework. This study focuses on a deep, stressed aquifer in the oasis zone of Kebili, in southwestern Tunisia’s desert region. In the first stage, CatBR-m and ExTR-m served as base models, generating predictive features for the BsTR-m model in the second stage. Despite relying on limited hydrochemical data from a small number of wells, both base models produced satisfactory results. The BsTR-m model in the second stage outperformed individual models in terms of accuracy, generalization to unseen data, and spatial identification of salinity-affected zones. The proposed methodology accurately predicts groundwater salinity levels, providing an effective tool for early detection of water quality degradation. This predictive capability supports more proactive and sustainable groundwater management strategies in vulnerable desert aquifer systems.

Keywords: groundwater salinization; ensemble machine learning; hybrid artificial intelligence models; salinity prediction; desert aquifer management

1. Introduction

Groundwater constitutes a vital resource for both irrigation and household needs. Assessing its chemical and physical characteristics is essential to ensure sustainable utilization and effective resource planning. With the growing pressure on water supplies, proactive evaluation of groundwater quality is crucial to secure its suitability for long-term human use [1,2]. Numerous investigations have employed computational, probabilistic, and analytical approaches to assess indicators of water quality [3,4]. The salinity of groundwater is typically assessed through total dissolved solids (TDS), a key metric reflecting water quality. In hyper-arid desert environments, salinization stands out as a critical challenge, posing a significant threat to the long-term viability of deep, stress-impacted aquifer systems [5]. Desert aquifers under extreme stress face heightened vulnerability due to limited natural recharge and persistent exposure to intense evaporation and salinity accumulation.

In deep desert-stressed aquifers, salinization is driven primarily by natural geochemical processes and prolonged hydrological isolation. The dissolution of soluble minerals such as halite, gypsum, and other evaporites into the groundwater matrix increases ionic concentrations over time. Upward leakage or migration of deeper, highly mineralized water layers through faults or fractures can further elevate salinity levels in overlying aquifers. In hyper-arid climates, minimal recharge combined with intense evapoconcentration accelerates the accumulation of salts, while the long residence time of groundwater prevents dilution and promotes progressive mineralization.

Anthropogenic pressures can exacerbate these natural mechanisms. Excessive abstraction for irrigation and domestic supply can alter hydraulic gradients, facilitating the upward movement of saline water from deeper strata. Agricultural return flows, especially when using marginal-quality water, can leach salts from the soil back into the aquifer system. The over-application of fertilizers introduces additional dissolved ions, while land-use changes may reduce natural infiltration areas, disrupting the balance between recharge and discharge. These combined pressures lead to a gradual but persistent rise in groundwater salinity, which diminishes its suitability for human consumption, irrigation, and industrial use, and may degrade soil productivity in dependent agricultural zones.

In such environments, deep freshwater reserves are particularly at risk from upward migration of saline water from underlying layers. Moreover, rapid demographic growth and intensified agricultural and industrial demands in arid zones further strain these already fragile groundwater systems [6]. Reliable forecasting of salinity trends is vital for preserving the integrity of freshwater aquifers. An increase in salinity levels can severely limit the usability of groundwater for both drinking purposes and agricultural irrigation. As a result, simulating salinity variations is fundamental to effective water resource management, strategic hydrological planning, and the promotion of sustainable groundwater utilization [7].

Groundwater quality assessment often relies on numerical and deterministic modeling techniques [4,8]. Nonetheless, the inherent complexity of aquifer systems, characterized by spatial heterogeneity, dynamic hydrochemical interactions, and variability over time and space, introduces major obstacles to achieving accurate predictions through conventional, model-driven approaches [9]. These complexities have been addressed by employing a range of strategies to investigate groundwater quality, including on-site measurements, laboratory-based analyses, and simulation-based modeling of aquifer behavior [10,11].

Recently, cutting-edge artificial intelligence (AI) and machine learning (ML) methods have increasingly gained global recognition for their ability to predict groundwater quality [10,12,13,14]. These approaches offer superior accuracy, user accessibility, cost-effectiveness, and faster processing times compared to traditional numerical modeling [15]. They utilize a variety of input data, including both chemical and physical variables, to build robust predictive models [16]. Extensive studies have shown that nitrate concentrations are among the most accurately monitored and forecasted parameters, followed by electrical conductivity (ElC), water quality index (WaQIx), and salinity [12,14]. Vulnerability assessments for these indicators have been carried out using diverse ML-based approaches [17,18,19,20], although most work has focused on single-model applications rather than integrated or hybrid approaches [21,22,23,24,25].

Machine learning (ML) has become an essential tool for predicting groundwater salinity variations across spatial and temporal scales. Early studies employed conventional models such as linear regression (LR) [20,26,27], naïve Bayes (NB) [28,29,30], k-means clustering [31,32], and the perceptron algorithm [20,33], which, while foundational, often struggled with overfitting, convergence to local minima, and limited ability to model complex non-linear interactions [18,25]. To overcome these shortcomings, recent research has shifted toward hybrid and ensemble machine learning (H-EML) techniques, which capture complex patterns more effectively and show improved predictive resilience. Intelligent optimization strategies, including evolutionary computation and swarm intelligence, have also been applied to fine-tune model parameters [34] in advanced frameworks such as deep belief networks (DBNs) [28,35,36], probabilistic neural networks (PNNs) [25,28,33], fuzzy systems (FSs) [10,18,25,37,38], and relevance vector machines (RVMs) [28,39].

Several recent studies have demonstrated that H-EML can enhance the performance of ensemble decision tree models (EdTE-ML) [28,30,40,41,42]. However, these improvements have sometimes been accompanied by overfitting and excessive overprediction of nitrate concentrations [40,41]. One key limitation is the incompatibility of certain optimization strategies with EdTE-ML frameworks that use discrete hyperparameter spaces [28,42]. Alternative approaches such as particle swarm optimization (PSO) and simulated annealing (SA) may provide more effective tuning for EdTE-ML hyperparameters [40,41]. Despite promising results, the application of EdTE-ML models for groundwater salinity prediction remains underexplored. Notable algorithms in this category include CatBoost Regressor (CatBR-m), ExtraTrees Regressor (ExTR-m), and Bootstrapping Regressor (BsTR-m) [40,41,42,43,44], which have shown strong performance in other engineering domains but limited testing in salinity modeling. These models often outperform standalone algorithms such as gradient boosting machines (GBM) and extreme gradient boosting (XGBoost) [25,28,43,44], especially in arid-zone aquifers where data scarcity is a challenge.

Feature selection and model optimization are also critical. The random decision forest (RDF) algorithm is widely recognized for identifying relevant features [10,28,44], while CatBoost excels in handling high-dimensional and categorical data [40,41,43,44]. ExtraTrees Regressor is valued for its simplicity and consistent performance [40,41,44], and Bootstrapping Regressor is effective at minimizing overfitting and boosting prediction precision [43]. However, detailed sensitivity analyses of hyperparameters in groundwater salinity prediction are still rare [40,41,42]. GridSearchCV (GSCV) remains a popular and effective method for tuning EdTE-ML models in small datasets [25,38,43,44], due to its ease of implementation and ability to systematically explore parameter spaces.

This study advances current methods by using optimized ensemble decision tree-based machine learning (EdTE-ML) algorithms to predict groundwater salinity in deep desert aquifers. In harsh arid environments with limited and sparse groundwater quality data, groundwater often serves as the primary, and sometimes sole, freshwater source, but is heavily threatened by salinization driven by natural mineral dissolution, limited recharge, high evaporation, and over-extraction. These stresses are acute in desert aquifers, where very low recharge means even small salinity increases can cause irreversible water quality degradation, affecting drinking water, agriculture, and long-term socio-economic stability. A key innovation here is a dual-stage modeling strategy that improves prediction accuracy and robustness, especially under data-scarce conditions. The first tier employs CatBoost (CatBR-m) and ExtraTrees Regressor (ExTR-m) models to generate initial salinity predictions, which are then combined by the Ensemble Bootstrapping Regressor (BsTR-m) in the second tier to produce a more precise and robust final output. This study contributes by (i) demonstrating the effectiveness of optimized EdTE-ML models in delivering high-precision predictions despite small and weak datasets, and (ii) proposing a novel, structured modeling strategy specifically designed for groundwater salinity prediction in deeply stressed desert aquifers.

The primary objectives of this research are fivefold: (i) using the random decision forest (RDF) algorithm for feature selection to identify the most relevant groundwater quality parameters for salinity prediction; (ii) conducting a thorough sensitivity analysis with GridSearchCV (GSCV) to find optimal hyperparameters for EdTE-ML models; (iii) applying a two-tier modeling strategy with optimized EdTE-ML algorithms to improve predictive accuracy; (iv) assessing and comparing the performance of individual EdTE-ML models against their combined forms within the dual-stage framework; and (v) generating a detailed spatial distribution map to visualize and predict groundwater salinity across the study area, addressing challenges posed by scarce and limited-quality data.

2. Study Area and Data Collection

2.1. Study Area

The study area is located in the East Kebili region of southwestern Tunisia (oasis area), a deep desert environment where groundwater resources play a critical role in sustaining life and agricultural activity. This arid region is geographically bounded by the saline depressions of Chott El Djerid and Chott El Fejej (Figure 1). These chotts, acting as terminal discharge zones for regional groundwater flow, not only influence the hydraulic regime but also intensify the risk of groundwater salinization through upward leakage of mineralized water and surface evaporation-driven salt accumulation [6,45,46].

The East Kebili region is marked by extreme climatic conditions, with strong seasonal temperature variability, ranging from 13 °C in winter to over 32 °C during summer months. Annual precipitation is sparse and erratic, generally below 100 mm. These factors reinforce a hyper-arid hydrological context where natural groundwater recharge is severely constrained [45,46].

Despite these challenges, the region supports the most extensive and productive oases in Tunisia. Traditional date palm cultivation and the recent expansion of greenhouse agriculture are highly dependent on the extraction of deep groundwater, mainly from the Complex Terminal (CT) aquifer. This aquifer, consisting of Upper Cretaceous carbonates and Tertiary continental sediments, reaches depths of over 300 m and thicknesses up to 200 m. Water temperatures in these deep reservoirs range from 27 °C to over 45 °C and often require cooling prior to irrigation use [45].

Geologically, the East Kebili region lies on the northern edge of the Saharan Platform and is composed of a thick sedimentary sequence ranging from the Jurassic to the Quaternary. Deep boreholes reveal Jurassic carbonates, mainly limestones and dolomites with marl layers, unconformably overlain by Lower Cretaceous fluvio-deltaic deposits of clays, sandstones, and evaporites. These are succeeded by Upper Cretaceous marine carbonates and shales, marking a major transgressive event, followed by a regional unconformity caused by tectonic uplift during the Paleocene–Eocene. Overlying this are Neogene and Quaternary continental deposits, unconsolidated sands and gravels, forming the Plio-Quaternary aquifer with limited recharge capacity. Structurally, the region is affected by NE–SW and E–W trending fault systems related to Mesozoic and Alpine tectonic activity [45,46]. These faults, along with horst–graben structures, influence aquifer geometry and facilitate upward migration of saline water, particularly near the Chott El Djerid and Chott El Fejej depressions. The interplay between lithological variation and structural complexity governs groundwater flow, salinization risk, and the connectivity of deep desert aquifer systems in the region [6,45,46].

The hydrogeological framework of the region is characterized by a complex multilayer aquifer system composed of three vertically interconnected units: the Plio-Quaternary, the Complex Terminal (CT), and the Continental Intercalary (CI) aquifers [45,46]. These aquifers collectively form one of the most important groundwater reserves in the region, supporting extensive agricultural activities, particularly oasis cultivation. However, despite their substantial storage capacity, these aquifers exhibit limited natural recharge rates due to the arid climate, low precipitation, and high evaporation typical of the region. Consequently, the balance between recharge and discharge is fragile, making these groundwater resources highly vulnerable to overexploitation and salinization. The increasing groundwater extraction to meet agricultural and domestic demands, coupled with natural salinity inputs, poses a significant risk to the long-term sustainability of local oasis agriculture and the livelihoods it supports.

In the study area, the Plio-Quaternary aquifer and the Complex Terminal (CT) aquifer are hydraulically connected, forming a single multilayered aquifer. In the Chott Djerid region, the Mio-Plio-Quaternary formations are generally not differentiated in hydrogeological investigations and are, in most cases, considered part of a multilayered aquifer system hydraulically attached to the CT aquifer. Groundwater exchange between these units occurs through semi-permeable layers, enabling vertical mixing of waters of different ages and salinities. All water samples analyzed in this study were collected from wells tapping into this multilayered Mio-Plio-Quaternary–CT aquifer; therefore, the TDS values presented in Figure 2 reflect the integrated water quality of this combined system rather than that of a single stratigraphic unit [47].

Hydrodynamically, groundwater flow within the Complex Terminal aquifer is driven predominantly by recharge occurring in the elevated southern ranges of the Algerian Atlas, where infiltration is facilitated by more favorable climatic and geological conditions. From these recharge zones, groundwater migrates northward through permeable sedimentary formations towards discharge areas located in the chott depressions—large endorheic salt basins characterized by high evaporation rates [45,46]. This natural flow regime, essential for replenishing aquifers and maintaining water quality, is increasingly disrupted by intensive groundwater pumping for irrigation. The excessive abstraction exceeds the natural recharge capacity, resulting in declining piezometric levels across many monitoring wells. This decline enhances the risk of vertical flow reversals, whereby deeper, more saline waters migrate upwards into shallower aquifer layers, further degrading water quality. The combined effect is a progressive increase in salinity levels, reflected by rising total dissolved solids (TDS) concentrations, which compromises the usability of groundwater for both irrigation and potable use.

Unlike coastal aquifers, where seawater intrusion commonly drives salinization, the main factors influencing groundwater quality deterioration in the East Kebili oasis area are related to anthropogenic and natural hydrogeochemical processes. Intensive groundwater abstraction concentrates dissolved salts through evaporation in shallow aquifers, while saline water from deeper or adjacent formations intrudes upward along preferential pathways, particularly near the margins of the Chott El Djerid and Chott El Fejej basins.

The presence of naturally brackish or saline zones within the sedimentary sequence, coupled with long groundwater flow paths through mineral-rich strata, further contributes to the salinity problem. This situation is exacerbated by the region’s arid climate, which promotes evaporative concentration of salts near the surface. As a result, older oasis areas and regions adjacent to the chotts exhibit elevated TDS levels, often exceeding thresholds suitable for irrigation and human consumption (Figure 2) [45,46]. Without the implementation of integrated groundwater management strategies that balance abstraction with sustainable recharge, alongside the adoption of water-saving irrigation technologies, the region faces an increasing risk of irreversible degradation of its critical groundwater resources.

2.2. Data Collection

A comprehensive groundwater sampling campaign was conducted between 2022 and 2024 across the East Kebili region to investigate the hydrochemical characteristics of the deep desert aquifer system and to assess the progression of groundwater salinization. Given the highly arid and desert nature of the study area, groundwater resources are extremely scarce and localized, predominantly occurring near oasis zones where natural conditions allow for sustainable water extraction and human settlement. In these regions, boreholes are strictly limited to existing agricultural and inhabited areas, with no permission granted for new drilling outside these zones, due to both environmental constraints and regulatory restrictions designed to protect the fragile desert ecosystem.

To support this study, groundwater samples were collected from a network of 41 wells distributed across the study area, representing different aquifer levels and hydrogeological settings. The wells were selected to capture spatial variability and cover both recharge and discharge zones. Sampling campaigns were conducted between 2022 and 2024, following standardized protocols to ensure data quality. Hydrogeochemical analyses included measurement of major ions (Na⁺, K⁺, Ca²⁺, Mg²⁺, Cl⁻, SO₄²⁻, HCO₃⁻, CO₃²⁻), total dissolved solids (TDS), pH, and sodium adsorption ratio (SAR), performed using ICP-MS, ion chromatography, and titration protocols. Quality control procedures and replicates were used to verify the reliability of the results. These data provide a robust foundation for characterizing groundwater chemistry and understanding the processes controlling salinization.

The well depths range from 71 m to 210 m, targeting different hydrostratigraphic units within the Plio-Quaternary and Complex Terminal aquifers. Well locations and screened intervals were documented using GPS coordinates and drilling logs, ensuring precise spatial referencing. The majority of wells are production wells actively used for irrigation, while a subset includes dedicated monitoring boreholes installed to track groundwater quality over time.

Because drilling is restricted to the existing oasis zones, the number and spatial distribution of sampling points are inherently constrained, leading to a relatively small and clustered dataset centered around these oasis areas, comprising a total of 41 groundwater samples.

Field data collection focused on measuring key in situ parameters, including static water level, total well depth, groundwater temperature, pH, salinity, and electrical conductivity (EC), to establish the physical and chemical status of the resource. Groundwater samples were retrieved from a network of deep private wells currently exploited for irrigation purposes, as well as from selected observation boreholes installed specifically for monitoring. Extraction methods were adapted to the well type: operational production wells were sampled directly through electric pumping systems, while monitoring wells were sampled using a stainless-steel bailer after purging to ensure representative water quality data. The active use of these wells during the campaign period confirms that the sampled points are reflective of the water actively supplying the oases and agricultural zones, thereby providing critical insights despite the limited geographic spread.

At each sampling site, two separate groundwater aliquots were carefully collected in sterile, contamination-free plastic containers to ensure sample integrity and avoid cross-contamination. The first aliquot was immediately preserved using appropriate acidification techniques to stabilize dissolved cations and trace metals for subsequent laboratory analysis. This preservation step is crucial to prevent precipitation, adsorption, or transformation of sensitive metal species during storage and transport. The second aliquot was left untreated to allow accurate assessment of anions and general chemical parameters, which could be influenced by chemical preservation agents.

Immediately after collection, all samples were clearly labeled with unique identifiers and metadata, including date, time, and well characteristics, to maintain traceability. They were then stored in cooled containers, typically refrigerated at 4 °C, to inhibit microbial activity and chemical alteration before analysis. These precautions ensured the chemical composition remained as representative as possible of in situ groundwater conditions.

The comprehensive hydrochemical characterization encompassed a wide range of major ions essential for understanding water quality and salinization processes. Cations analyzed included sodium (Na⁺), magnesium (Mg²⁺), potassium (K⁺), and calcium (Ca²⁺), while key anions comprised chloride (Cl⁻), sulfate (SO₄²⁻), carbonate (CO₃²⁻), bicarbonate (HCO₃⁻), and nitrate (NO₃⁻). In addition to these, critical indicators of salinization such as total dissolved solids (TDS), electrical conductivity (EC), and sodium adsorption ratio (SAR) were measured to evaluate the degree of mineralization and potential impacts on soil and crop health (Table 1).
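For reference, SAR is conventionally computed from cation concentrations expressed in meq/L. The expression below is the standard definition and is given here as an assumption, since the text does not state which formula was applied:

```latex
\mathrm{SAR} = \frac{[\mathrm{Na}^{+}]}{\sqrt{\dfrac{[\mathrm{Ca}^{2+}] + [\mathrm{Mg}^{2+}]}{2}}}
```

As a purely hypothetical illustration, a sample containing 10 meq/L each of Na⁺, Ca²⁺, and Mg²⁺ would give SAR = 10/√10 ≈ 3.16.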

Alkalinity was also quantified to better understand the buffering capacity of the groundwater system and the carbonate equilibrium, which are important factors influencing the geochemical evolution and stability of the aquifer. To ensure the reliability of the analytical data, an ion balance was systematically calculated by comparing the sum of measured cations and anions. Most samples exhibited acceptable charge balance within ±5%, confirming the accuracy and consistency of the laboratory results and validating the data for subsequent interpretation and modeling.
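The ion balance check described above can be sketched as follows. This is a minimal illustration, assuming the common charge balance error convention (difference over sum of cation and anion totals in meq/L, expressed as a percentage); the text does not state the exact formula used, and the function names and the sample values are hypothetical.

```python
# Equivalent weights in g/eq (molar mass divided by ionic charge).
EQ_WEIGHT = {
    "Na": 22.99, "K": 39.10, "Ca": 40.08 / 2, "Mg": 24.305 / 2,
    "Cl": 35.45, "SO4": 96.06 / 2, "HCO3": 61.02, "CO3": 60.01 / 2,
    "NO3": 62.00,
}
CATIONS = ("Na", "K", "Ca", "Mg")
ANIONS = ("Cl", "SO4", "HCO3", "CO3", "NO3")


def to_meq(sample_mg_l):
    """Convert concentrations from mg/L to meq/L."""
    return {ion: mg / EQ_WEIGHT[ion] for ion, mg in sample_mg_l.items()}


def charge_balance_error(sample_mg_l):
    """Charge balance error in percent (assumed convention, see lead-in)."""
    meq = to_meq(sample_mg_l)
    cations = sum(meq.get(ion, 0.0) for ion in CATIONS)
    anions = sum(meq.get(ion, 0.0) for ion in ANIONS)
    return 100.0 * (cations - anions) / (cations + anions)


def is_acceptable(sample_mg_l, tolerance=5.0):
    """True when the error lies within +/- tolerance percent."""
    return abs(charge_balance_error(sample_mg_l)) <= tolerance


# Hypothetical sample, not taken from the study dataset.
balanced = {"Na": 229.9, "Cl": 354.5}       # about 10 meq/L on each side
unbalanced = {"Na": 229.9, "Cl": 177.25}    # about 10 vs 5 meq/L
print(is_acceptable(balanced), is_acceptable(unbalanced))
```

Samples failing the tolerance would normally be re-analyzed or flagged before modeling; how the study handled such samples is not specified.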

The northern part of the study area remains largely unsampled because no accessible wells exist there, reflecting both the natural scarcity of exploitable groundwater and the prohibition on drilling in these desert expanses. This limitation is inherent to the environmental and regulatory context of deep desert aquifers. The sampled points thus represent the only feasible and sustainable groundwater sources currently utilized, making them the most relevant for assessing the hydrochemical status and salinization trends within the system. This focused sampling approach ensures that the collected dataset, although limited in spatial extent, is both representative and valuable for understanding water quality variations where human and agricultural activity depend on groundwater availability.

The main goal of this research is to address the challenge of groundwater salinization prediction in such harsh desert environments where data are scarce and the database is limited. To this end, this study integrates advanced artificial intelligence (AI) and machine learning (ML) techniques to leverage the limited available data effectively, improving prediction accuracy and providing a valuable tool for resource management despite data constraints.

This integrated chemical dataset forms a critical component of salinity monitoring in the East Kebili region, offering insights into water–rock interactions, geochemical evolution, and the mobilization of salts under intensive groundwater abstraction. The results help trace salinization trends both laterally and vertically within the aquifer system, revealing zones where water quality degradation is accelerating. Ultimately, this approach supports the development of diagnostic tools for the sustainable management of deep groundwater reserves that underpin oasis agriculture in arid environments.

3. Machine Learning (ML) Models

3.1. Stage 0: Random Decision Forest (RDF)

The random decision forest (RDF) is an ensemble-based machine learning technique that aggregates the outputs of numerous decision trees to generate predictive outcomes. Within this framework, each individual tree is independently trained using randomly selected subsets of both input features and training data. Once trained, every tree contributes a prediction, either a numerical value in regression tasks or a class label in classification tasks. The model’s final output is then determined through either majority voting (for classification) or averaging the predictions (for regression). This method operates through an ensemble of decision-based classifiers, as mathematically represented in Equation (1):

f_{a} (x) \quad \text{for } a = 1, \dots, n

(1)

where n is the number of decision trees in the ensemble, with each individual decision tree f_a contributing a single vote toward the prediction associated with the input instance x.
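The aggregation step behind Equation (1) can be illustrated with a short sketch. Each tree is represented here by a plain Python callable; the function names and the toy trees in the usage lines are hypothetical and stand in for trained decision trees.

```python
from collections import Counter
from statistics import mean


def forest_regress(trees, x):
    """Regression: average the numerical outputs f_a(x) of all trees."""
    return mean(tree(x) for tree in trees)


def forest_classify(trees, x):
    """Classification: return the label receiving the most votes."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Toy stand-ins for trained trees (illustrative only).
reg_trees = [lambda x: 1.0, lambda x: 2.0, lambda x: 6.0]
cls_trees = [lambda x: "saline", lambda x: "fresh", lambda x: "saline"]
print(forest_regress(reg_trees, None), forest_classify(cls_trees, None))
```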

At this stage, the input data consist of the full set of candidate hydro-physical and geochemical parameters measured from groundwater samples. These include major ions, salinity indicators, well depth, and other physical features collected in the field.

The output of RDF at this stage is the ranked importance of each input feature in relation to groundwater salinity. This feature importance ranking enables the identification of the most influential predictors, thus serving as an essential data preprocessing step.
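A rough sense of how such a ranking arises can be given with a simplified proxy. In a forest, impurity-based importance accumulates the error reduction of every split over all trees; the sketch below, under that simplifying assumption, scores each feature only by the best single split it allows. It is not the RDF procedure itself, and the feature names and data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split_gain(column, targets):
    """Largest SSE reduction obtainable with one threshold on this column."""
    parent = sse(targets)
    best = 0.0
    for thr in sorted(set(column))[:-1]:
        left = [t for c, t in zip(column, targets) if c <= thr]
        right = [t for c, t in zip(column, targets) if c > thr]
        best = max(best, parent - sse(left) - sse(right))
    return best


def rank_features(rows, targets, names):
    """Return (name, normalized score) pairs, most informative first."""
    gains = [best_split_gain([r[j] for r in rows], targets)
             for j in range(len(names))]
    total = sum(gains) or 1.0  # avoid division by zero when no split helps
    scores = {n: g / total for n, g in zip(names, gains)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical data: the first column drives the target, the second does not.
rows = [(i, (i * 7) % 5) for i in range(10)]
targets = [0.0 if i < 5 else 10.0 for i in range(10)]
ranking = rank_features(rows, targets, ["Cl", "depth"])
print(ranking)
```

Features at the bottom of such a ranking are the natural candidates for removal before Stage 1.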

By aggregating the outputs of multiple trees, this approach enhances model generalization and mitigates the risk of overfitting, which is a common limitation of standalone decision trees. The random decision forest (RDF) algorithm is extensively applied in tasks such as regression analysis, variable importance assessment, and dimensionality reduction. Renowned for its robustness, adaptability, and user-friendly implementation, RDF offers a favorable trade-off between predictive accuracy and model stability, establishing it as a widely preferred technique for solving regression-based problems.

3.2. Stage 1: Ensemble Learning and Hybrid AI Techniques

3.2.1. CatBoost (CatBR-m)

The CatBoost regression model (CatBR-m) is an advanced gradient boosting technique developed as an open-source solution for tackling both classification and regression tasks. Engineered to efficiently process high-dimensional datasets and categorical variables without extensive preprocessing, CatBR-m is well-suited for complex, real-world predictive scenarios. Its learning mechanism involves the sequential integration of multiple weak learners, each trained to correct the residual errors of its predecessors. Through this iterative process, the algorithm progressively refines model accuracy by fitting each new tree to the errors left by the current ensemble. The final predictive function emerges as the sum of these base learners, with each tree's contribution scaled by the learning rate. The mathematical formulation of the CatBR-m model is expressed in Equation (2):

\hat{c} = \sum_{t = 1}^{T} f_{t} (y)

(2)

where ĉ denotes the estimated response variable, T indicates the total number of decision trees incorporated within the CatBR-m ensemble, and f_t(y) refers to the prediction generated by the t-th tree, which processes the input feature vector y to produce an individual output estimate.

The refined subset of input variables selected from Stage 0 (RDF) is fed into CatBR-m, along with the corresponding groundwater salinity target values for supervised learning.

CatBR-m outputs a predictive regression model that estimates groundwater salinity with improved accuracy by sequentially learning from residuals. Intermediate results include feature effects and the correction of previous prediction errors.
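The residual-fitting idea behind Equation (2) can be sketched with a generic gradient boosting loop. This is a minimal illustration using one-split trees (stumps) and squared error on a single feature; it does not reproduce CatBoost's specific tree construction or categorical handling, and all names, the learning rate, and the toy data are assumptions.

```python
def fit_stump(xs, residuals):
    """Find the threshold and leaf values that best fit the residuals."""
    best = None
    for thr in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lv) ** 2 for r in left)
               + sum((r - rv) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, thr, lv, rv)
    return best[1:]


def boost(xs, ys, n_trees=50, lr=0.3):
    """Fit stumps sequentially, each one on the current residuals."""
    base = sum(ys) / len(ys)              # initial constant prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        thr, lv, rv = fit_stump(xs, residuals)
        stumps.append((thr, lr * lv, lr * rv))   # shrink by learning rate
        preds = [p + (lr * lv if x <= thr else lr * rv)
                 for x, p in zip(xs, preds)]
    return base, stumps


def predict(model, x):
    """Sum of all stump outputs f_t plus the initial constant."""
    base, stumps = model
    return base + sum(lv if x <= thr else rv for thr, lv, rv in stumps)


def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


# Toy data (illustrative only): a smooth non-linear response.
xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]
model = boost(xs, ys)
mse_constant = mse([sum(ys) / len(ys)] * len(ys), ys)
mse_boosted = mse([predict(model, x) for x in xs], ys)
print(mse_constant, mse_boosted)
```

Each added stump can only reduce or maintain the training error, which is why the number of trees and the learning rate are usually tuned against held-out data rather than training error.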

3.2.2. ExtraTrees (ExTR-m)

The ExtraTrees regression model (ExTR-m) represents an enhanced variant of the random decision forest (RDF) algorithm, specifically engineered for high-dimensional datasets and efficient regression analysis. As an ensemble-based learning approach, ExTR-m integrates multiple independently constructed decision trees to generate a robust aggregated prediction. Unlike RDF, the ExTR-m model employs an extreme randomization strategy during tree construction, which significantly mitigates overfitting. This strategy involves injecting randomness into both the selection of training subsets and the determination of split thresholds across feature spaces. The structural distinction of ExTR-m lies in its use of fully randomized splits at each node, thereby increasing diversity among trees and improving generalization performance. The formal representation of the ExTR-m model is provided in Equation (3):

\hat{s} (z) = \frac{1}{T} \sum_{t = 1}^{T} f_{t} (z)

(3)

where ŝ(z) denotes the estimated salinity value corresponding to the input feature vector z, T represents the number of decision trees comprising the ExtraTrees ensemble, and f_t(z) signifies the output generated by the t-th tree, which is constructed using randomized split thresholds and trained on a randomly sampled subset of the input space.

Introducing randomness during tree construction enhances model variability and promotes structural diversity across the ensemble, thereby effectively minimizing the likelihood of overfitting.

Similar to CatBR-m inputs, ExTR-m uses the feature subset determined in Stage 0 and the training data to build a predictive model robust against overfitting.

The output is an independent salinity prediction model that benefits from enhanced structural randomness, improving overall model variance and bias trade-offs.
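The averaging in Equation (3) and the role of randomized split thresholds can be sketched as follows. This is a simplified illustration with single-split trees on one feature, not the full ExtraTrees algorithm; the toy data and the names `random_stump` and `extra_trees` are hypothetical.

```python
import random
from statistics import mean

def random_stump(xs, ys, rng):
    """One-split tree whose threshold is drawn at random rather than optimized."""
    thr = rng.uniform(min(xs), max(xs))
    left = [y for x, y in zip(xs, ys) if x <= thr]
    right = [y for x, y in zip(xs, ys) if x > thr]
    overall = mean(ys)
    lv = mean(left) if left else overall    # fall back if a side is empty
    rv = mean(right) if right else overall
    return lambda x: lv if x <= thr else rv

def extra_trees(xs, ys, n_trees=200, seed=0):
    """Average the outputs of many randomized trees, as in Equation (3)."""
    rng = random.Random(seed)
    trees = [random_stump(xs, ys, rng) for _ in range(n_trees)]
    return lambda z: sum(t(z) for t in trees) / len(trees)

# Hypothetical toy data, not from the study
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 1.1, 3.0, 3.2, 3.1]
model = extra_trees(xs, ys)
print(model(2.0), model(5.0))
```

Each individual tree is a poor predictor, but because the thresholds differ from tree to tree, their average is smoother and less tied to any single split.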

3.3. Stage 2: Custom Bootstrapping Regressor (BsTR-m)

The Bootstrapping Regressor (BsTR-m) is an ensemble ML algorithm that aims to improve the performance and robustness of regression models. Its foundation lies in bootstrap sampling, which draws data points randomly from the original dataset with replacement. For each bootstrap sample, a base regression model (usually a decision tree) is trained independently, so that each base model sees a different version of the training data. The aggregated model is expressed in Equation (4):

\hat{u} (g) = \frac{1}{T} \sum_{t = 1}^{T} f_{t} (g)

(4)

where û(g) represents the estimated salinity value corresponding to the input vector g, T denotes the number of base estimators within the ensemble framework, and f_t(g) refers to the output of the t-th learner, which is constructed using a bootstrap-resampled subset of the original training data.

The predictions of the Stage 1 base models (CatBR-m and ExTR-m) serve as inputs here. Bootstrapped datasets are generated from the original training samples to train multiple base estimators.

The BsTR-m produces an aggregated final predictive model that combines base estimators’ outputs, reducing variance and enhancing robustness against overfitting and data sparsity.
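A minimal sketch of Equation (4) applied to Stage 1 outputs is given below. It assumes that each input vector g holds the two Stage 1 predictions, and it swaps the decision-tree base learner for a one-nearest-neighbour rule to stay short. The numbers and the names `fit_1nn` and `bagging` are hypothetical, and the sketch does not address how the study avoided reusing training-set predictions between stages.

```python
import random

def fit_1nn(G, y):
    """Base learner (stand-in for a tree): return the target of the closest row."""
    def predict(g):
        dists = [sum((a - b) ** 2 for a, b in zip(row, g)) for row in G]
        return y[dists.index(min(dists))]
    return predict

def bagging(G, y, n_estimators=100, seed=0):
    """Train one base learner per bootstrap resample and average their outputs."""
    rng = random.Random(seed)
    n = len(y)
    learners = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        learners.append(fit_1nn([G[i] for i in idx], [y[i] for i in idx]))
    return lambda g: sum(f(g) for f in learners) / len(learners)

# Hypothetical Stage 1 outputs: (CatBR-m prediction, ExTR-m prediction) per sample
stage1_preds = [(1.1, 0.9), (1.9, 2.2), (3.2, 2.9), (4.1, 3.8), (4.8, 5.1)]
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
stage2 = bagging(stage1_preds, observed)
print(stage2((3.0, 3.0)))
```

Because every learner sees a different resample, a single unusual sample influences only part of the ensemble, which is the variance-reduction effect described above.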

3.4. GridSearchCV (GSCV)

GridSearchCV (GSCV) is a popular optimization technique used to find the optimal hyperparameters for an ML model. GSCV involves defining a grid of hyperparameters. For each combination of hyperparameters within this grid, the model is trained and evaluated on held-out validation folds. The combination that yields the best average validation performance is then chosen as the optimal set of hyperparameters for the model. The fundamental concept behind GSCV is to systematically explore the search space of hyperparameters and to find the combination that provides the best performance. Because it is simple to implement, the technique is widely used across ML models. The selection criterion is given in Equation (5):

\theta^{*} = \underset{\theta \in \Theta}{\arg\min} \frac{1}{K} \sum_{k = 1}^{K} L (f_{\theta}^{(k)} (x_{val}^{(k)}), y_{val}^{(k)})

(5)

where θ^* is the optimal set of hyperparameters selected by GridSearchCV, Θ is the defined hyperparameter search space, K is the number of cross-validation folds, L is the loss function, f_θ^(k) is the model trained with parameters θ on the training fold, x_val^(k) and y_val^(k) are the validation inputs and true outputs for fold k.

The training data (input), along with a defined search space for model hyperparameters (such as tree depth, learning rate, number of estimators), are supplied to GSCV.

The output is the optimal hyperparameter configuration θ^* that minimizes the cross-validated loss, which is then used to train the final predictive model. This step helps guard against both underfitting and overfitting and supports generalization.
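Equation (5) can be made concrete with a small, dependency-free sketch. In practice this search is usually run with scikit-learn's `GridSearchCV`; the code below only mirrors the logic, and it makes several assumptions: a k-nearest-neighbour regressor stands in for the tree ensembles, the number of neighbours is the only hyperparameter, the loss L is the mean absolute error, and the data are synthetic. The function names are hypothetical.

```python
def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def cv_loss(xs, ys, k_neighbors, n_folds):
    """Mean validation loss over the folds (the inner average of Equation (5))."""
    fold_losses = []
    for fold in range(n_folds):
        val_idx = set(range(fold, len(xs), n_folds))  # every n_folds-th sample
        tr_x = [x for i, x in enumerate(xs) if i not in val_idx]
        tr_y = [y for i, y in enumerate(ys) if i not in val_idx]
        errs = [abs(ys[i] - knn_predict(tr_x, tr_y, xs[i], k_neighbors))
                for i in val_idx]
        fold_losses.append(sum(errs) / len(errs))
    return sum(fold_losses) / n_folds

def grid_search(xs, ys, grid, n_folds=4):
    """Return the grid value with the lowest cross-validated loss (the arg min)."""
    scores = {theta: cv_loss(xs, ys, theta, n_folds) for theta in grid}
    best = min(scores, key=scores.get)
    return best, scores

# Synthetic data: a linear trend with an alternating perturbation
xs = [float(i) for i in range(12)]
ys = [2.0 * x + (1.0 if i % 2 else -1.0) for i, x in enumerate(xs)]

for n_folds in (2, 4, 6):
    best_k, cv_scores = grid_search(xs, ys, grid=[1, 3, 5], n_folds=n_folds)
    print(n_folds, best_k, cv_scores)
```

The search is exhaustive, so its cost grows with the product of the grid sizes, which is one reason to keep the number of tuned hyperparameters small.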

4. Data Processing Framework for Predicting Salinity

The methodological framework for groundwater salinity prediction is structured into four comprehensive stages, each playing a critical role in ensuring robust and reliable model outcomes.

At the outset, the key objective of this framework is to progressively transform raw hydrogeochemical data into actionable spatial predictions of groundwater salinity, particularly in an area characterized by a limited database. This proposed AI and machine learning approach is specifically designed to predict salinization in data-poor zones and serves as a more advanced and accurate tool for interpolation and spatial prediction compared to standard methods.

First, dataset conceptualization involves the careful selection of fundamental hydro-physical and geochemical parameters to be used as input features for modeling. These parameters are chosen based on their known influence on groundwater quality and salinity processes, such as concentrations of major ions, salinity indicators, and relevant physical characteristics like well depth and water table levels.

The initial dataset (input) comprised 41 groundwater samples, each characterized by concentrations of major ions (Na⁺, Cl⁻, SO₄²⁻, Ca²⁺, Mg²⁺, K⁺, HCO₃⁻, CO₃²⁻, NO₃⁻), along with physical parameters including sampling depth and location coordinates.

To systematically evaluate and prioritize these input variables, an initial screening is conducted using the random decision forest (RDF) algorithm in a preliminary stage (Stage 0). This step ranks features according to their relative importance, helping to reduce dimensionality, avoid overfitting, and focus the modeling effort on the most influential predictors.

The outcome of this stage (output) was a ranked list of hydrochemical and physical variables, allowing the model to prioritize inputs that significantly control salinity variations.

Second, enhanced ensemble decision tree models (EdTE-ML) are employed to model the complex and often nonlinear relationships between input parameters and groundwater salinity. In Stage 1, two powerful algorithms—CatBoost (CatBR-m) and ExtraTrees Regressor (ExTR-m)—are independently trained on the selected dataset. Both models are designed to handle nonlinear interactions and feature dependencies effectively, each bringing complementary strengths to the analysis. CatBoost excels in dealing with categorical variables and reducing prediction bias, while ExtraTrees emphasizes variance reduction through randomization in tree construction.

Following this, Stage 2 implements the Bootstrapping Regressor (BsTR-m), an ensemble combiner that aggregates predictions from the base models to reduce variance, mitigate overfitting, and improve overall model stability.

The models were trained using 30 samples and tested on 11 samples, demonstrating robust predictive capacity and improved accuracy through ensemble learning.

This dual-tier modeling approach enables a more nuanced and resilient prediction framework capable of capturing the intricate patterns influencing salinity distribution.

Results at this stage included reduced prediction errors and enhanced model generalization compared to single-model approaches.

Third, a rigorous hyperparameter optimization process is performed using the GridSearchCV (GSCV) technique. This automated tuning systematically explores a predefined range of hyperparameter values for each model to identify the optimal configuration that maximizes predictive accuracy while minimizing error.

This optimization ensures that the models are neither underfit nor overfit and that they generalize well to unseen data, which is particularly important given the limited and sparse nature of groundwater datasets in arid desert environments.

Through this process, optimal hyperparameters were selected that balanced model complexity and performance, as validated by improved metrics during cross-validation.

Fourth, the predictive performance of the models is thoroughly evaluated using multiple statistical metrics. These include mean absolute error (MAE), which measures average prediction errors; adjusted R², indicating the proportion of variance explained while accounting for model complexity; Kling–Gupta efficiency (KGE), which assesses model skill by integrating correlation, bias, and variability; and normalized root mean square error (nRMSE), which provides a scale-independent error measure.

This multi-metric evaluation offers a comprehensive view of model reliability and accuracy.

Evaluation results confirmed the strong performance of the ensemble approach, with high explanatory power (adjusted R²), low prediction errors (MAE, nRMSE), and balanced model skill (KGE).

Finally, the optimized and validated ensemble model is used to produce a high-resolution groundwater salinity prediction map of the study area, as illustrated in Figure 3. This spatial output is a critical tool for visualizing salinization patterns, guiding resource management, and supporting decision-making processes aimed at sustainable groundwater use in deep desert aquifers with scarce data availability.

The final and most important finding of this study is the successful prediction of groundwater salinization patterns in a data-scarce environment, demonstrating the effectiveness of the proposed machine learning framework for salinity assessment under limited dataset conditions.

4.1. Initial Structuring of Feature Variables

To investigate the statistical distribution of the groundwater dataset and assess the relationships between salinity and geochemical indicators, kernel density estimation (KDE) was applied, as shown in Figure 4. Prior to model training, input variables were standardized using the Z-score normalization method, which centers the data around a mean of zero and scales it to unit variance.

This standardization approach is widely recognized for improving algorithmic convergence, minimizing overfitting tendencies, and enhancing model robustness.

These preprocessing steps ensured that all input features were on comparable scales and that the machine learning algorithms could effectively learn the underlying patterns without bias toward variables with larger numeric ranges.

The machine learning algorithms, CatBR-m and ExTR-m (Stage 1) and BsTR-m (Stage 2), were subsequently developed and evaluated using a total of 41 groundwater samples, with 30 samples allocated for training and 11 for testing.

This data split allowed for robust model training while preserving a subset for unbiased evaluation of predictive performance.
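The standardization and the 30/11 split can be sketched as follows. The sketch assumes that the Z-score parameters are estimated on the training subset and then applied to the test subset, which is a common precaution that the text does not state explicitly. It also assumes the population standard deviation, a random split, and synthetic rows in place of the measured samples. The function names are hypothetical.

```python
import random
from statistics import mean, pstdev

def split_rows(rows, n_train, seed=0):
    """Shuffle a copy of the rows and split it into training and test parts."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

def zscore_params(rows):
    """Per-column mean and standard deviation."""
    cols = list(zip(*rows))
    return [mean(c) for c in cols], [pstdev(c) for c in cols]

def standardize(rows, means, stds):
    """Apply z = (value - mean) / std column by column."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Synthetic stand-in for 41 samples with two features each
rows = [[float(i), float((i * 7) % 13)] for i in range(41)]

train, test = split_rows(rows, n_train=30)
means, stds = zscore_params(train)          # estimated on training rows only
train_z = standardize(train, means, stds)
test_z = standardize(test, means, stds)     # reuses the training parameters
print(len(train_z), len(test_z))
```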

The selection of hydrochemical parameters for this study was based on their established relevance to groundwater salinity processes and their diagnostic value in arid environments. Major ions such as sodium (Na⁺), chloride (Cl⁻), sulfate (SO₄²⁻), calcium (Ca²⁺), magnesium (Mg²⁺), potassium (K⁺), bicarbonate (HCO₃⁻), carbonate (CO₃²⁻), and nitrate (NO₃⁻) are critical in characterizing the geochemical signature of groundwater. These ions influence salinity levels through natural processes such as mineral dissolution, ion exchange, and evaporation concentration.

For instance, Na⁺ and Cl⁻ are primary contributors to salinity, often elevated due to rock–water interactions and evaporative concentration in desert aquifers. Sulfate and bicarbonate concentrations provide insight into redox conditions and carbonate equilibria, which affect water chemistry stability. Trace metals and cations also serve as indicators of anthropogenic impact and mineralogical sources, which may contribute to salinity variations.

Moreover, the selection was informed by previous hydrogeological studies in arid and semi-arid regions, where these parameters have proven effective in detecting and monitoring salinization trends.

By incorporating a comprehensive suite of ions and chemical indicators, the model can capture both direct salinity drivers and indirect factors influencing groundwater quality. This holistic approach improves the predictive power and interpretability of the machine learning models, allowing for better discrimination of spatial and temporal salinity patterns within the limited dataset.

4.2. Identification and Matching

To identify the most effective combination of input parameters for predictive modeling, a random decision forest (RDF) algorithm was employed as a feature selection mechanism [10,28,44]. The accuracy and efficiency of any machine learning-based predictive model heavily depend on the quality and relevance of the selected input variables. Incorporating unnecessary or redundant attributes can significantly complicate the model without enhancing its predictive power [44]. Given the absence of a universal protocol for input variable selection in machine learning applications for groundwater salinity forecasting, especially under data-scarce conditions, RDF was chosen for its robustness. This approach is particularly well-suited to limited datasets, as it efficiently handles nonlinear relationships and interdependencies among the variables with minimal sensitivity to data volume.

At this stage, the task was to systematically reduce the dimensionality of the input dataset by ranking variables according to their predictive importance.

Using 41 groundwater samples, RDF analysis revealed the most influential hydrochemical and physical parameters contributing to salinity variations, allowing the model to focus on these key predictors.

This selection process improved model interpretability and reduced the risk of overfitting due to redundant or irrelevant features.

4.3. Selection of Optimal Hyperparameters

The GridSearchCV (GSCV) approach was employed to fine-tune the hyperparameters of ensemble decision tree-based machine learning (EdTE-ML) models, applied across two tiers of the modeling framework [25,38,43,44]. Selecting an appropriate optimization algorithm is critical to prevent entrapment in local minima and to enhance the convergence rate. The effectiveness of an optimization strategy is influenced by factors such as dataset size, problem complexity, and the dimensionality of the hyperparameter space. GSCV proved to be a robust method for EdTE-ML optimization, particularly when dealing with a limited set of hyperparameters and seeking the most effective parameter combinations [44].

The task at this stage was to systematically explore the hyperparameter space to identify optimal model configurations that improve prediction accuracy and generalization.

During the initial stage of the optimization procedure, all potential parameters were integrated into the GridSearchCV (GSCV) space for comprehensive evaluation. Subsequently, the two most impactful hyperparameters from each model, those that demonstrated variability in outcomes, were identified and retained. Parameters that showed negligible influence or remained constant within the search space were excluded from further analysis [43,44]. The selected top two hyperparameters for each modeling tier were then subjected to an in-depth sensitivity analysis across their respective value ranges. This restriction was deliberate, since investigating many hyperparameters within a broad search space can significantly increase both computational time and data collection costs.

For the CatBR-m model, critical parameters such as tree depth and learning rate were optimized through iterative grid-based trial-and-error exploration.

Adjusting the learning rate proved vital for balancing convergence speed and prediction stability, while tree depth controlled model complexity and overfitting risk.

Similarly, for ExTR-m and BsTR-m algorithms, the total number of estimators and the maximum number of features considered per split were the dominant tuning parameters affecting model performance.

Tree depth again played a key role in shaping the predictive behavior and generalization capacity of these models.

The dominant tuning parameters for each EdTE-ML model were identified through the GridSearchCV framework, which employed a repeated data-splitting validation method with the number of folds varied from 2 to 6 in steps of two (2, 4, and 6 folds). This resampling strategy offers a consistent and reliable way to assess how well the ensemble models generalize to unseen data. Such iterative partitioning techniques are widely used in hyperparameter optimization to improve performance estimation, reduce overfitting risk, and ensure robust model validation on external datasets.

The results of this hyperparameter tuning and validation process were optimized model versions with enhanced predictive accuracy and robustness, crucial for reliable salinity forecasting under limited data conditions.

4.4. Assessment of Predictive Performance

Model performance was assessed using a range of statistical indicators to identify the most accurate predictive model. These included the mean absolute error (MAE), adjusted coefficient of determination (adjusted R²), Kling–Gupta efficiency (KGE), and normalized root mean square error (nRMSE) [25,38,43,44]. The MAE, adjusted R², and KGE metrics are particularly suited for evaluating machine learning algorithms applied to relatively small datasets, while nRMSE offers the advantage of normalizing prediction errors relative to the mean of the observed data. The selection of these evaluation criteria was guided by considerations such as minimizing both over- and underestimation, effectively representing extreme values, and achieving an optimal balance between accuracy and model interpretability. The mathematical formulations of these metrics are provided in Equations (6)–(9).

The primary task in this stage was to quantitatively evaluate and compare model predictions against observed groundwater salinity data using multiple complementary metrics.

This multi-criteria approach ensures a robust understanding of model strengths and weaknesses, particularly in relation to prediction accuracy, bias, and variability representation.

MAE = \frac{1}{n} \sum_{i = 1}^{n} |y_{i} - {\hat{y}}_{i}|

(6)

where n denotes the total count of observations in the dataset, y_i represents the true observed value corresponding to the ith data point, and ŷ_i signifies the predicted value produced by the model for the same ith instance.

R_{adj}^{2} = 1 - (\frac{(1 - R^{2}) (n - 1)}{n - p - 1})

(7)

where n represents the total number of observations, p indicates the number of predictor variables, and R² denotes the conventional coefficient of determination.

KGE = 1 - \sqrt{{(r - 1)}^{2} + {(β - 1)}^{2} + {(γ - 1)}^{2}}

(8)

where r denotes the linear correlation coefficient between observed and predicted values, β represents the bias ratio, and γ signifies the variability ratio.

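nRMSE = \frac{\sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}}{\bar{y}} \times 100

(9)

where ȳ denotes the mean of the observed values, so that nRMSE is expressed as a percentage of that mean.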

Applying these metrics to the testing subset of groundwater samples allowed for an objective assessment of the predictive performance under limited data availability.

Model outputs were evaluated to determine whether they met commonly accepted thresholds in hydrogeological modeling, providing confidence in their practical applicability.

In hydrogeological and groundwater research, models are generally considered satisfactory when the normalized root mean square error (nRMSE) is below 10% and the Kling–Gupta efficiency (KGE) is equal to or exceeds 0.7.

The results showed that the developed ensemble models achieved nRMSE values below this 10% threshold and KGE values greater than 0.7, confirming their reliability in predicting groundwater salinity despite the challenges posed by the small dataset.

This outcome validates the proposed modeling framework as an effective tool for salinization assessment in data-scarce desert aquifers.
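For reference, Equations (6)–(9) can be computed with a few lines of standard-library Python. The sketch assumes that β is the ratio of predicted to observed means and that γ is the ratio of the predicted to observed coefficients of variation. The text only names these terms as the bias and variability ratios, so both definitions are assumptions. The sample values are hypothetical, and the final check uses the nRMSE and KGE thresholds quoted above.

```python
from math import sqrt
from statistics import mean, pstdev

def mae(obs, pred):
    """Equation (6): mean absolute error."""
    return mean([abs(o - p) for o, p in zip(obs, pred)])

def adjusted_r2(obs, pred, n_predictors):
    """Equation (7): R² penalized for the number of predictors p."""
    n = len(obs)
    obs_mean = mean(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - obs_mean) ** 2 for o in obs)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)

def kge(obs, pred):
    """Equation (8), with beta and gamma defined as assumed in the lead-in."""
    mo, mp = mean(obs), mean(pred)
    so, sp = pstdev(obs), pstdev(pred)
    cov = mean([(o - mo) * (p - mp) for o, p in zip(obs, pred)])
    r = cov / (so * sp)                 # linear correlation
    beta = mp / mo                      # bias ratio
    gamma = (sp / mp) / (so / mo)       # variability ratio (CV ratio)
    return 1 - sqrt((r - 1) ** 2 + (beta - 1) ** 2 + (gamma - 1) ** 2)

def nrmse(obs, pred):
    """Equation (9): RMSE as a percentage of the observed mean."""
    rmse = sqrt(mean([(o - p) ** 2 for o, p in zip(obs, pred)]))
    return rmse / mean(obs) * 100

# Hypothetical observed and predicted salinity values
observed = [2.1, 3.4, 4.0, 5.2, 6.1]
predicted = [2.0, 3.5, 4.2, 5.0, 6.3]

satisfactory = nrmse(observed, predicted) < 10 and kge(observed, predicted) >= 0.7
print(mae(observed, predicted), adjusted_r2(observed, predicted, 2), satisfactory)
```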

5. Results

A comprehensive overview of hydro-physical and geochemical parameters from groundwater samples in the desert aquifer of Kebili oases highlights both concentration ranges and statistical variability (Table 2 and Figure 5). Major ions such as sodium (Na), calcium (Ca), chloride (Cl), sulfate (SO₄), and bicarbonate (HCO₃) exhibit relatively stable concentrations, evidenced by low coefficients of variation around 0.2, indicating minor fluctuations across the study area. Magnesium (Mg) displays moderate variability (CV ~0.3), while potassium (K) and nitrate (NO₃) show high variability (CVs of 0.8 and 1.0, respectively), suggesting localized or episodic influences. Chloride’s large variance and standard deviation reveal significant spatial differences in salinity, typical of arid environments. Carbonate (CO₃) has the highest variability (CV = 2.5) due to many low or zero values and a few elevated measurements, reflecting uneven distribution. Total dissolved solids (TDS) and sodium adsorption ratio (SAR) demonstrate moderate variation (CV ~0.4), consistent with fluctuating salinity and sodium hazard levels. The pH remains quite stable, with low variance and CV (0.1), indicating consistently slightly alkaline groundwater. Overall, these data reflect groundwater chemistry shaped by evaporation-driven concentration, mineral dissolution, and ion exchange processes typical of desert aquifers, with some parameters showing homogeneity and others reflecting geological and hydrological heterogeneity.

The spatial distribution of sodium (Na) concentrations in the study area reveals clear variability, with the highest levels (>60 mg/L) predominantly located in the central-western region, forming an irregular zone of elevated sodium (Figure 6a). Surrounding this core, moderate concentrations (40–60 mg/L) extend toward the north and southeast, while the eastern and western fringes exhibit much lower Na concentrations (<40 mg/L). This pattern suggests localized sources or accumulation processes influencing sodium levels centrally, with a gradual to sometimes steep gradient toward the periphery.

Magnesium (Mg) concentrations show a similar spatial pattern to sodium, with the highest values (>40 mg/L) concentrated in the central-western part and overlapping with Na-rich zones (Figure 6b). Moderate Mg levels (20–40 mg/L) spread into central and northern parts, whereas the eastern and southwestern boundaries have consistently lower concentrations (<20 mg/L). The strong spatial correlation between Mg and Na implies shared hydrogeochemical processes or sources affecting both ions.

Potassium (K) distribution differs markedly, with most of the study area, especially the central, eastern, and southern parts, showing low to moderate concentrations (<2.5 mg/L) (Figure 6c). However, several isolated pockets in the central-northern and southeastern areas display elevated K levels (2–5.5 mg/L), including a very localized, intense hotspot (>5.5 mg/L) in the central-northern sector. This patchy distribution suggests potassium enrichment is limited to specific geological or anthropogenic factors, distinct from the broader patterns seen for Na and Mg.

Calcium (Ca) concentrations resemble those of sodium and magnesium, with high values (>40 mg/L) clustered mainly in the central-western area and extending northward (Figure 6d). Lower Ca concentrations (<30 mg/L) dominate the eastern and southwestern edges. Gradual transitions with intermediate zones (30–40 mg/L) imply that the geological or hydrogeological settings promoting elevated Na and Mg similarly favor increased Ca levels.

Chloride (Cl) also follows the spatial trends of Na, Mg, and Ca, with the highest concentrations (>200 mg/L) forming a large contiguous zone in the central-western region (Figure 6e). Surrounding this core, moderately high Cl levels (140–200 mg/L) cover much of the central and northern portions, while the peripheries to the east and southwest maintain lower chloride (<140 mg/L). The prominent chloride concentrations reinforce the significance of salinity influences in the central-western sector.

Sulfate (SO₄) displays a distribution pattern consistent with other major ions, showing the highest concentrations (>90 mg/L) in the central-western zone, with intermediate concentrations (the orange and yellow classes on the map) extending broadly into central and northern parts (Figure 6f). Lower SO₄ values (<60 mg/L) are found near the eastern and southwestern boundaries. This extensive high-sulfate area supports the interpretation of mineral dissolution or evaporite influence shaping groundwater chemistry centrally.

In contrast, carbonate (CO₃) concentrations exhibit a markedly different pattern, with very low levels (<0.4 mg/L) dominating most of the study area, especially the central, eastern, and southern parts (Figure 6g). Small, isolated pockets of slightly higher concentrations (0.4–1.2 mg/L) appear in central-northern and southeastern zones, including a localized intense spot (>1.2 mg/L) mirroring the pattern seen in potassium. This patchy distribution indicates that carbonate enrichment is localized, likely linked to specific geological formations or processes not widespread across the region.

Bicarbonate (HCO₃) presents an inverse spatial pattern relative to major ions like sodium, chloride, and sulfate. The highest bicarbonate concentrations (>16 mg/L) are primarily located in the central-eastern part of the study area, extending northeast, surrounded by moderate levels (10–16 mg/L) (Figure 6h). Lower bicarbonate values (<10 mg/L) are found in the central-western and southwestern parts, which coincide with areas of elevated salinity. This inverse relationship suggests differing dominant geochemical processes or hydrogeological regimes, with bicarbonate-rich zones corresponding to less saline, more alkaline conditions.

Nitrate (NO₃) concentrations are generally low throughout the area (<4 mg/L), with scattered, isolated pockets of higher values (4–8 mg/L) primarily in the central-northern and southeastern parts, including a small intense hotspot (>8 mg/L) (Figure 6i). These localized nitrate enrichments likely reflect specific contamination sources such as agriculture or septic inputs rather than natural background levels.

The sodium adsorption ratio (SAR) spatial pattern closely follows that of sodium, magnesium, calcium, chloride, and sulfate. Highest SAR values (>16) are concentrated in the central-western region, indicating elevated sodium hazard for soils and agriculture, with moderate to high values extending into the central and northern zones (Figure 6j). Conversely, lower SAR values (<8) dominate the eastern and southwestern peripheries, consistent with their lower major ion concentrations and reduced sodium hazard.

Finally, the pH distribution varies across the study area, with lower values (<7) observed mainly in the central-western and southwestern parts, overlapping with zones of high salinity and SAR (Figure 6k). Higher pH levels (7.5–8 and above) are found in the central-eastern and northern sectors, corresponding to regions with higher bicarbonate and lower major ion concentrations. This pattern indicates different geochemical environments, where bicarbonate buffering leads to alkaline conditions, while areas of high salinity tend to have slightly lower pH values.

Overall, the spatial distributions reveal a central-western zone characterized by elevated concentrations of major ions (Na, Mg, Ca, Cl, SO₄), high sodium hazard, and lower pH, likely driven by mineral dissolution, evaporation, and salinity influences typical of arid aquifers. In contrast, peripheral areas, particularly to the east and southwest, show lower salinity, higher bicarbonate, and more alkaline conditions, reflecting distinct hydrogeochemical regimes and geological heterogeneity within the study area.

The correlation analysis of the groundwater dataset (Figure 7) reveals distinct geochemical relationships that reflect both natural mineralization processes and specific sources of chemical constituents. Total dissolved solids (TDS) appears as the central integrator of water chemistry, showing very strong positive correlations with chloride (Cl), sodium (Na), calcium (Ca), magnesium (Mg), and sulfate (SO₄), indicating that these ions are the principal contributors to overall salinity. Such a pattern is characteristic of mineralization dominated by rock–water interactions, particularly the dissolution of evaporitic minerals such as halite, which supplies Na and Cl, and gypsum or anhydrite, which contribute Ca and SO₄, as well as the dissolution of carbonate rocks supplying Ca and Mg.

The strong association between Na and Cl, and the positive link between Cl and SO₄, suggest a common origin from evaporitic strata or possible mixing with saline groundwater of marine influence. The sodium adsorption ratio (SAR) is also strongly correlated with Na and, to a slightly lesser degree, with TDS and Cl, confirming that sodium enrichment in these waters is a dominant driver of SAR values and that high-sodium waters tend to be more mineralized.

In contrast, bicarbonate (HCO₃) exhibits weaker correlations with most major ions, implying that its variability is more closely related to carbonate equilibria and CO₂-driven weathering processes than to the same salinity sources influencing Cl and Na. The pH values also display generally low correlations with major ions, suggesting that alkalinity–acidity balance is primarily controlled by buffering mechanisms rather than ionic strength. Nitrate (NO₃) shows little to no correlation with the major ions and TDS, indicating an origin largely independent from geogenic mineralization, most likely linked to localized anthropogenic inputs such as agricultural fertilizers or wastewater infiltration.

Overall, the correlation structure points to the presence of two main hydrogeochemical signatures: a salinity-driven group formed by TDS, Cl, Na, SO₄, Ca, Mg, and SAR, which reflects mineralization from rock dissolution and potential saline mixing, and a second group comprising HCO₃, NO₃, and pH, which is influenced by carbonate buffering and anthropogenic contamination, largely independent from the processes controlling overall salinity.

The results from the random decision forest (RDF) model reveal that among various input combinations tested for predicting groundwater salinity (Table 3 and Table 4), the configuration containing potassium (K) and chloride (Cl) consistently yielded the best performance with the lowest mean absolute error (MAE) of 0.0138. Adding other key ions such as sodium (Na), calcium (Ca), nitrate (NO₃), and sulfate (SO₄), along with parameters like pH and sodium adsorption ratio (SAR), improved prediction accuracy to some extent, as seen in configurations including multiple variables (e.g., C5 and C9), but the gains diminished with larger input configurations. Conversely, relying on sodium alone resulted in the poorest performance, highlighting its limited predictive power when used in isolation. The rankings demonstrate that while incorporating a balanced combination of major cations, anions, and hydrochemical parameters enhances model reliability, including too many variables may introduce noise without significantly reducing error. These findings emphasize the critical role of K and Cl in salinity prediction and provide guidance for selecting the most informative and efficient input parameters for groundwater quality modeling in arid environments.

The optimization process at the initial modeling stage, conducted through the GridSearchCV (GSCV) algorithm, revealed distinct patterns in parameter behavior across both models. In the CatBoost regressor (CatBR-m), variations in learning rate had a pronounced effect on performance: data-splitting validation (DSV) scores rose rapidly to a maximum at moderate learning rates, followed by a slow decline as the rate increased further. Regarding tree depth, the validation accuracy exhibited a steady decrease with deeper models, while training accuracy remained unaffected, indicating potential overfitting at greater depths. In the ExtraTrees regressor (ExTR-m), increasing the number of estimators initially led to a sharp improvement in DSV performance during testing, peaking at an optimal point before gradually declining with further additions. In contrast, the training performance curve remained largely stable regardless of estimator count. These trends underscore the importance of precise hyperparameter tuning to enhance model reliability and avoid performance degradation.

In the second-stage Bootstrapping Regressor model (BsTR-m), both the training and testing phases displayed closely aligned performance patterns. Initially, the data-splitting validation (DSV) scores rose significantly with an increasing number of estimators, eventually reaching a plateau where further additions no longer enhanced performance. The consistently high DSV scores across key parameters suggest that the BsTR-m model in Stage 2 was effectively calibrated to ensure robust generalization during testing. Overall, the findings confirm that the predictive performance of the CatBoost (CatBR-m), ExtraTrees (ExTR-m), and Stage-2 BsTR-m models exhibited notable sensitivity to hyperparameter configurations, highlighting the importance of careful tuning in the modeling process.

In the training phase (Table 5), the performance of the three enhanced ensemble decision tree machine learning models (EdTE-ML)—CatBoost Regressor (CatBR-m) and ExtraTrees Regressor (ExTR-m) in Stage 1, and Bootstrapping Regressor (BsTR-m) in Stage 2—was evaluated using four metrics: mean absolute error (MAE), adjusted R² (R²adj), Kling–Gupta efficiency (KGE), and normalized root mean square error (nRMSE). The CatBR-m model demonstrated excellent performance, with an MAE of 0.0034, R²adj of 0.9979, a perfect KGE of 1.0, and a very low nRMSE of 0.0042, indicating highly accurate learning of the training data. The ExTR-m model showed even more extreme values, with near-zero MAE and nRMSE, and perfect scores for R²adj and KGE—suggesting a complete fit to the training data. The BsTR-m model in Stage 2 also achieved ideal metrics across all indicators, with all error measures reduced to zero and perfect fit scores. This progression from CatBR-m to BsTR-m reflects a refinement in predictive learning across stages. However, the nearly flawless performance of ExTR-m and BsTR-m raises concerns about overfitting, highlighting the need for rigorous validation to ensure model generalization beyond the training dataset.

The validation phase (Table 6) provides a comprehensive assessment of the predictive performance of the EdTE-ML models, clearly illustrating the benefits of the two-stage modeling approach. In Stage 1, the CatBoost Regressor (CatBR-m) demonstrates reasonable predictive capacity, with an MAE of 0.04295, an adjusted R² of 0.9457, a KGE of 0.9965, and an nRMSE of 0.05385. These values suggest that while the model captures the general trend of the data, it still exhibits notable deviations from the observed salinity values. The ExtraTrees Regressor (ExTR-m), also in Stage 1, outperforms CatBR-m, yielding lower error values (MAE = 0.01953, nRMSE = 0.02449) and a higher adjusted R² of 0.9671, indicating improved alignment between predicted and actual data and better robustness. However, it is in Stage 2 that the Bootstrapping Regressor (BsTR-m) achieves the most accurate results, with the lowest MAE (0.01382), the highest adjusted R² (0.9937), and the lowest nRMSE (0.01732), coupled with a near-perfect KGE of 0.9998. These performance gains highlight the added value of integrating predictions from Stage 1 models into a refined secondary modeling process, allowing BsTR-m to leverage prior information for more accurate generalization. Overall, the progressive enhancement across the modeling stages confirms the strength of the two-tier EdTE-ML strategy in capturing complex salinity patterns within desert aquifer systems.
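For readers who wish to reproduce the four indicators, one possible NumPy implementation is given below. The KGE follows the standard Gupta et al. (2009) form; normalizing the RMSE by the observed range is an assumption, since the normalization used in the study is not stated in this section, and the function names are hypothetical.

```python
import numpy as np


def mae(obs, pred):
    """Mean absolute error."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.mean(np.abs(obs - pred)))


def adjusted_r2(obs, pred, n_features):
    """R² penalized for the number of predictors (requires n > n_features + 1)."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    n = obs.size
    ss_res = np.sum((obs - pred) ** 2)
    ss_tot = np.sum((obs - obs.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return float(1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1))


def kge(obs, pred):
    """Kling-Gupta efficiency: correlation, variability ratio, and bias ratio."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    r = np.corrcoef(obs, pred)[0, 1]
    alpha = pred.std() / obs.std()
    beta = pred.mean() / obs.mean()
    return float(1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2))


def nrmse(obs, pred):
    """RMSE divided by the observed range (assumed normalization)."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    rmse = np.sqrt(np.mean((obs - pred) ** 2))
    return float(rmse / (obs.max() - obs.min()))


# Usage: TDS (g/L) of wells 1 to 5 from Table 1 against hypothetical predictions.
observed = [6.90, 6.80, 6.80, 7.90, 6.70]
predicted = [6.85, 6.90, 6.75, 7.80, 6.72]  # illustrative values only
scores = {
    "MAE": mae(observed, predicted),
    "R2adj": adjusted_r2(observed, predicted, n_features=2),
    "KGE": kge(observed, predicted),
    "nRMSE": nrmse(observed, predicted),
}
```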

A comparative assessment of machine learning models for groundwater salinity prediction was carried out using both training and testing datasets. During the model training phase, predictions generated by CatBR-m (Stage 1), ExTR-m (Stage 1), and BsTR-m (Stage 2) were nearly identical to the actual salinity measurements, indicating high accuracy and effective model calibration. However, training data alone are insufficient to fully assess the predictive capabilities of these models. In the independent testing phase, which included 20 validation samples, slight overestimations of salinity concentrations were observed in only three instances (samples 29 to 31), specifically with CatBR-m and ExTR-m from Stage 1. On the other hand, BsTR-m (Stage 2) replicated observed salinity levels with minimal deviation, highlighting its superior generalization performance compared to the Stage 1 models.

A detailed analysis of relative errors in salinity prediction revealed that, during the model training stage, all tested algorithms produced minimal deviations, with errors closely approaching zero. However, performance distinctions became more apparent in the testing phase. Among all models, BsTR-m (Stage 2) exhibited the lowest relative error, clearly surpassing the others in predictive precision. Evaluation of the models using statistical performance indicators, as presented in Table 6, confirmed that BsTR-m (Stage 2) achieved the highest ranking, followed by ExTR-m and CatBR-m (both Stage 1), respectively. These findings underscore the enhanced reliability and accuracy of the BsTR-m (Stage 2) model, particularly in generalizing to unseen data within the adopted dual-phase modeling strategy.

The predicted salinity distribution maps generated from the different modeling approaches reveal distinct variations in spatial accuracy. The first-stage models, CatBR-m and ExTR-m, exhibited noticeable tendencies toward overprediction and underprediction of groundwater salinity across the study area. In contrast, the BsTR-m model developed in Stage 2 demonstrated a marked improvement by effectively integrating the strengths of both preceding models. Its spatial output showed a high level of agreement with known patterns of saline and non-saline groundwater zones. This spatial consistency aligned with the performance rankings derived from the statistical evaluation summarized in Table 6. The ensemble-based hybrid modeling implemented in this work clearly outperformed single-model approaches, particularly in terms of mapping precision. As such, the BsTR-m (Stage 2) model proves to be a valuable tool for generating reliable salinity maps, especially under data-scarce conditions where predictive robustness is essential.

6. Discussion

Although the CatBoost (CatBR-m) and ExtraTrees (ExTR-m) models exhibit comparable statistical metrics for normalized root mean square error (nRMSE) and Kling–Gupta efficiency (KGE), as presented in Table 5 and Table 6 and illustrated in Figure 8a,b, distinct discrepancies between the two models are evident upon visual inspection. The two-stage machine learning modeling strategy proved essential for capturing and analyzing the patterns and discrepancies illustrated in Figure 8a,b. In the second stage, the customized Bootstrapping Regressor model (BsTR-m) used the salinity predictions generated by the first-stage models, CatBoost (CatBR-m) and ExtraTrees (ExTR-m), as input features, while the normalized salinity measurements served as the predictive target, allowing the model to refine the salinity estimates. The outputs generated by the BsTR-m model in Stage 2 represent learned values that encapsulate information derived from both the input features and the target salinity data. As such, the BsTR-m framework conditions on the salinity predictions obtained from the Stage 1 models to enhance overall predictive accuracy. Considering the variability in observed contaminant levels, this modeling approach demonstrates adaptability by incorporating and learning from all available contaminant parameters included in the training process.
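The two-stage structure described above can be sketched as follows. Several assumptions apply: the data are synthetic; scikit-learn's `GradientBoostingRegressor` is used as a stand-in for CatBoost so that the example needs only one library; `BaggingRegressor` is assumed as an approximation of the custom Bootstrapping Regressor, whose implementation is not given in the text; and the use of out-of-fold Stage 1 predictions for training Stage 2 is a design choice of this sketch, not a documented step of the study.

```python
import numpy as np
from sklearn.ensemble import (
    BaggingRegressor,
    ExtraTreesRegressor,
    GradientBoostingRegressor,
)
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic, normalized hydrochemical predictors and salinity target (hypothetical).
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(60, 4))
y = (0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.1 * X[:, 2] * X[:, 3]
     + rng.normal(0.0, 0.02, size=60))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Stage 1: two base learners trained on the hydrochemical inputs.
stage1 = {
    "boost": GradientBoostingRegressor(random_state=0),  # stand-in for CatBR-m
    "extra": ExtraTreesRegressor(n_estimators=200, random_state=0),  # ExTR-m analogue
}

# Out-of-fold predictions become the Stage 2 training features, so Stage 2
# never sees a Stage 1 prediction made on a sample that model was fitted on.
Z_train = np.column_stack(
    [cross_val_predict(m, X_train, y_train, cv=5) for m in stage1.values()]
)
for model in stage1.values():
    model.fit(X_train, y_train)
Z_test = np.column_stack([m.predict(X_test) for m in stage1.values()])

# Stage 2: bagged decision trees learn the target from the Stage 1 predictions.
stage2 = BaggingRegressor(n_estimators=100, random_state=0)
stage2.fit(Z_train, y_train)
y_pred = stage2.predict(Z_test)
mae_stage2 = mean_absolute_error(y_test, y_pred)
```

Each column of `Z_train` holds one Stage 1 model's salinity prediction, so the Stage 2 learner sees only two features and the measured target, which mirrors the feature and target roles described in the paragraph above.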

The findings of this research validate the concept of applying machine learning to extract deeper insights by integrating two advanced variants of ensemble decision tree models (EdTE-ML), namely CatBoost (CatBR-m) and ExtraTrees (ExTR-m), alongside normalized salinity data. While CatBR-m and ExTR-m, commonly employed as decision-support tools across various engineering disciplines [40,41,42,43,44], yielded results in Stage 1 that may be considered insufficiently robust or conclusive by some, the implementation of the Bootstrapping Regressor model (BsTR-m) in Stage 2 offers a more justifiable and reliable alternative. This defensibility is grounded in two key aspects: (i) the convergence of predictions from Stage 1 models contributed to improved statistical metrics, particularly low nRMSE and high adjusted R² values; and (ii) the clear differentiation between model outputs, as visualized in Figure 8, underscores the added value of the two-stage learning approach.

In comparison with other hybrid models previously applied for groundwater salinity prediction worldwide, the BsTR-m model in Stage 2 achieved superior performance, with an adjusted R² value of 0.9937, surpassing recent hybrid approaches such as deep belief networks (DBNs), probabilistic neural networks (PNNs), fuzzy systems (FSs), and relevance vector machines (RVMs) [10,18,25,28,33,35,36,37]. This marked improvement is primarily attributed to the BsTR-m model’s ability to extract more informative signal patterns from the outputs of CatBoost (CatBR-m) and ExtraTrees (ExTR-m), as opposed to the aforementioned studies, where metaheuristic optimization algorithms were employed solely to tune hyperparameters of DBNs, PNNs, FSs, and RVMs in order to prevent convergence to local optima. In contrast, the present study employed GridSearchCV (GSCV) for effective hyperparameter tuning, while the Stage 2 BsTR-m model further enhanced the predictive accuracy by learning from and refining Stage 1 outputs. Moreover, the resampling nature of the BsTR-m framework contributes to minimizing both variance and bias, an essential advantage when dealing with limited datasets in environmental modeling contexts [25,38,43,44].

Implementing the EdTE-ML algorithms through a two-stage modeling strategy demonstrated outstanding predictive accuracy, rapid convergence, and strong performance with limited datasets. This innovative framework thus offers a valuable and practical tool for researchers and policymakers aiming to safeguard groundwater from salinization in severely arid, desert-stressed aquifer systems worldwide. Nonetheless, the success of such models depends heavily on the availability of extensive and high-quality data. Expanding the collection of large-scale datasets remains essential to strengthening model robustness. Continuous monitoring of target contaminants within the watershed is also crucial, as their concentrations can vary significantly over time due to processes such as hydrodynamic dispersion and the inflow of water carrying dissolved substances [48].

The spatial distribution characteristics of groundwater salinization observed in the Kebili oasis region can be attributed to several hydrogeological and anthropogenic factors. Firstly, natural processes such as mineral dissolution from the aquifer matrix, limited recharge in arid desert conditions, and high evaporation rates at the surface contribute to the progressive increase in salinity, especially in low-lying or stagnant groundwater zones. The proximity of boreholes to the oasis zones, where groundwater extraction is concentrated, further intensifies salinization by inducing saltwater intrusion and altering the natural groundwater flow regime. Additionally, excessive pumping for irrigation exacerbates the depletion of fresher water layers, causing the upward migration of deeper saline water. The limited recharge and scarce rainfall typical of desert environments reduce the natural dilution capacity, causing salts to accumulate over time.

Furthermore, geological heterogeneities, such as fault zones or variations in aquifer permeability, influence the spatial variability of salinity by creating preferential pathways or barriers to flow, which affect solute transport. Anthropogenic factors like land-use changes, improper irrigation practices, and lack of effective drainage systems contribute to local salinity hotspots. These combined natural and human-driven mechanisms explain why salinity patterns are not uniform across the region but exhibit clear spatial heterogeneity, with certain areas more vulnerable to quality degradation. Understanding these causes is critical for designing targeted mitigation strategies and guiding sustainable groundwater management.

A key achievement of the proposed methodology is its successful application in predicting aquifer salinization despite the constraints posed by a limited and spatially clustered database. The proposed AI and machine learning approach is specifically designed to predict salinization in data-poor zones, where it serves as a more advanced and accurate tool for interpolation and spatial prediction than standard models. This capability demonstrates the strength of advanced artificial intelligence and machine learning techniques in handling the sparse datasets typical of arid, data-scarce environments. By leveraging available hydrogeochemical data from a restricted well network, the methodology provides reliable salinization forecasts that can inform sustainable groundwater management across the entire study area, including unsampled regions. This success highlights the potential of data-driven approaches to overcome traditional limitations in groundwater quality assessment, offering a valuable tool for monitoring and mitigating salinity risks in similar desert aquifer systems worldwide.

Based on the spatial predictions of groundwater salinity provided by the proposed EdTE-ML framework, several practical recommendations can be considered to mitigate salinization risks in the deep aquifer system of the Kebili oasis region. These include optimizing groundwater extraction patterns to reduce stress on vulnerable zones, promoting the use of controlled irrigation techniques to limit saline water intrusion, and encouraging crop selection based on salt tolerance. In parallel, managed aquifer recharge and improved drainage infrastructure could help reduce salt accumulation over time. The identification of high-risk areas also supports better land-use planning and the prioritization of monitoring efforts. Looking ahead, future research could focus on integrating remote sensing indicators (e.g., vegetation indices or land surface temperature) and time-series climatic variables to improve the temporal resolution of predictions. Furthermore, incorporating additional hydrogeological and geophysical data, such as aquifer permeability or fault mapping, may enhance model performance. Testing the transferability of the two-stage EdTE-ML framework to other arid aquifer systems and developing GIS-based decision support tools would also contribute to broader regional applications and more informed groundwater governance in salinity-prone desert environments.

7. Conclusions

This research presents a novel two-stage modeling framework using advanced ensemble decision tree-based machine learning (EdTE-ML) to predict groundwater salinity in the hyper-arid East Kebili oasis. It highlights the critical need for precise groundwater quality monitoring and management in arid regions to ensure sustainable water use. The proposed AI and machine learning approach is particularly effective in predicting salinization in zones with poor and sparse databases, serving as a more advanced and accurate tool for interpolation and spatial prediction than conventional models.

In the first phase, a random decision forest (RDF) identified potassium (K) and chloride (Cl) as key predictors, explaining over 85% of salinity variability. Hyperparameter tuning via GridSearchCV improved model performance, reducing normalized root mean square error (nRMSE) by about 15%. The Stage 2 Bootstrapping Regressor (BsTR-m) outperformed the Stage 1 CatBoost and ExtraTrees models, achieving an adjusted R² of 0.9937 and a mean absolute error (MAE) below 0.05 g/L, demonstrating high predictive accuracy.

Spatial predictions from BsTR-m corrected earlier inaccuracies and identified localized zones with salinity exceeding 2.5 g/L, posing risks to irrigation and ecosystem health.

Importantly, this study demonstrates the robustness and scalability of the EdTE-ML framework in handling small, sparse datasets common in arid, data-scarce regions. By leveraging ensemble learning and a two-stage approach, the method reduces variance and bias, enhancing prediction reliability despite limited data—critical for groundwater management where data collection is challenging.

Overall, the hybrid EdTE-ML approach provides a reliable tool for forecasting salinity changes, optimizing resource use, and guiding interventions. Future work should integrate hydrogeological models and expand monitoring to further improve prediction accuracy.

Author Contributions

Conceptualization, M.H.M.; Methodology, M.H.M.; Software, M.H.M. and B.A.; Formal analysis, B.A.; Resources, Y.M.; Data curation, Y.M.; Writing—original draft, M.H.M.; Writing—review & editing, B.A.; Supervision, L.Z.; Project administration, L.Z.; Funding acquisition, L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Data Availability Statement

The data used as a reference framework in this study are derived from the agricultural map designed by the Regional Commissary for Agricultural Development Kebili (CRDA Kebili) and the Directorate General of Water Resources (DGRE). The hydrogeochemical data presented in this work were independently collected by the authors through field sampling and laboratory analysis during 2022–2024.

Acknowledgments

The authors appreciate the collaboration between Tunis El Manar University and AGHYLE, Institut Polytechnique UniLaSalle Beauvais, SFR Condorcet FR CNRS 3417 19 Rue Pierre Waguet, 60026 Beauvais, France.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Asadi, E.; Isazadeh, M.; Samadianfard, S.; Ramli, M.F.; Mosavi, A.; Nabipour, N.; Shamshirband, S.; Hajnal, E.; Chau, K.-W. Groundwater Quality Assessment for Sustainable Drinking and Irrigation. Sustainability 2020, 12, 177.
2. Ibrahim, H.; Yaseen, Z.M.; Scholz, M.; Ali, M.; Gad, M.; Elsayed, S.; Khadr, M.; Hussein, H.; Ibrahim, H.H.; Eid, M.H.; et al. Evaluation and Prediction of Groundwater Quality for Irrigation Using an Integrated Water Quality Indices, Machine Learning Models and GIS Approaches: A Representative Case Study. Water 2023, 15, 694.
3. Norouzi Khatiri, K.; Nematollahi, B.; Hafeziyeh, S.; Niksokhan, M.H.; Nikoo, M.R.; Al-Rawas, G. Groundwater Management and Allocation Models: A Review. Water 2023, 15, 253.
4. Yang, W.; Zhang, Z.; Song, D.; Zhang, B.; Zhou, Y.; Zhang, N.; Zhao, M.; Song, D.; Yuan, H.; Pang, Q. Pollution risk evaluation of groundwater wells based on stochastic and deterministic simulation of aquifer lithology. Ecotoxicol. Environ. Saf. 2024, 285, 117027.
5. Shin, S.; Aziz, D.; El-sayed, M.E.A.; Hazman, M.; Almas, L.; McFarland, M.; El Din, A.S.; Burian, S.J. Systems Thinking for Planning Sustainable Desert Agriculture Systems with Saline Groundwater Irrigation: A Review. Water 2022, 14, 3343.
6. Msaddek, M.H.; Souissi, D.; Moumni, Y.; Chenini, I.; Bouaziz, N.; Dlala, M. Groundwater potentiality assessment in an arid zone using a statistical approach and multi-criteria evaluation, southwestern Tunisia. Geol. Q. 2019, 63, 10–15.
7. Pulido-Bosch, A.; Rigol-Sanchez, J.P.; Vallejos, A.; Andreu, J.M.; Cerón, J.C.; Molina-Sanchez, L.; Sola, F. Impacts of agricultural irrigation on groundwater salinity. Environ. Earth Sci. 2018, 77, 97.
8. Moghaddam, A.; Moteallemi, A.; Joulaei, F.; Peirovi, R. A spatial variation study of groundwater quality parameters in the Gonabad Plain using deterministic and geostatistical models. Desalination Water Treat. 2018, 103, 261–269.
9. Aksoy, A.; Culver, T.B. Impacts of physical and chemical heterogeneities on aquifer remediation design. J. Water Resour. Plan. Manag. 2004, 130, 311–320.
10. Msaddek, M.H.; Moumni, Y.; Zouhri, L.; Chenini, I.; Zghibi, A. Groundwater Quality Evaluation of Fractured Aquifers Using Machine Learning Models and Hydrogeochemical Approaches to Sustainable Water-Irrigation Security in Arid Climate (Central Tunisia). Water 2023, 15, 3332.
11. Zhang, Z.; Yang, J.; Gong, C.; Wang, W.; Ran, B.; Wang, G.; Zhang, Q.; Wang, Y.L. Enhancing predictions of remedial reagent transport via a vertical groundwater circulation well with high-resolution aquifer characterization. Sci. Total Environ. 2024, 921, 171041.
12. Hussein, E.A.; Thron, C.; Ghaziasgar, M.; Bagula, A.; Vaccari, M. Groundwater Prediction Using Machine-Learning Tools. Algorithms 2020, 13, 300.
13. Singha, S.; Pasupuleti, S.; Singha, S.S.; Singh, R.; Kumar, S. Prediction of groundwater quality using efficient machine learning technique. Chemosphere 2021, 276, 130265.
14. Osman, A.I.A.; Ahmed, A.N.; Huang, Y.F.; Kumar, P.; Birima, A.H.; Sherif, M.; Sefelnasr, A.; Ebraheemand, A.A.; El-Shafie, A. Past, present and perspective methodology for groundwater modeling-based machine learning approaches. Arch. Comput. Methods Eng. 2022, 29, 3843–3859.
15. Msaddek, M.H.; Ben Alaya, M.; Moumni, Y.; Ayari, A.; Chenini, I. Enhanced machine learning model to estimate groundwater spring potential based on digital elevation model parameters. Geocarto Int. 2022, 37, 8815–8841.
16. Msaddek, M.H.; Moumni, Y.; Ayari, A.; El May, M.; Chenini, I. Artificial intelligence modelling framework for mapping groundwater vulnerability of fractured aquifer. Geocarto Int. 2022, 37, 10480–10510.
17. Menció, A.; Mas-Pla, J.; Otero, N.; Regàs, O.; Boy-Roura, M.; Puig, R.; Bach, J.; Domènech, C.; Zamorano, M.; Brusi, D.; et al. Nitrate pollution of groundwater; all right…, but nothing else? Sci. Total Environ. 2016, 539, 241–251.
18. Xin, J.; Wang, Y.; Shen, Z.; Liu, Y.; Wang, H.; Zheng, X. Critical review of measures and decision support tools for groundwater nitrate management: A surface-to-groundwater profile perspective. J. Hydrol. 2021, 598, 126386.
19. Chen, K.; Liu, Q.; Yang, T.; Ju, Q.; Zhu, M. Risk assessment of nitrate groundwater contamination using GIS-based machine learning methods: A case study in the northern Anhui plain, China. J. Contam. Hydrol. 2024, 261, 104300.
20. Poursaeid, M.; Mastouri, R.; Shabanlou, S.; Najarchi, M. Estimation of total dissolved solids, electrical conductivity, salinity and groundwater levels using novel learning machines. Environ. Earth Sci. 2020, 79, 453.
21. Khadra, F.W.; El Sibai, R.; Khadra, W.M. Deriving groundwater major ions from electrical conductivity using artificial neural networks supported by analytical hydrochemical solutions. Groundw. Sustain. Dev. 2024, 24, 101056.
22. Sakizadeh, M. Artificial intelligence for the prediction of water quality index in groundwater systems. Model. Earth Syst. Environ. 2016, 2, 8.
23. Kulisz, M.; Kujawska, J.; Przysucha, B.; Cel, W. Forecasting Water Quality Index in Groundwater Using Artificial Neural Network. Energies 2021, 14, 5875.
24. Taşan, S. Estimation of groundwater quality using an integration of water quality index, artificial intelligence methods and GIS: Case study, Central Mediterranean Region of Turkey. Appl. Water Sci. 2023, 13, 15.
25. Roy, D.K.; Sarkar, T.K.; Munmun, T.H.; Paul, C.R.; Datta, B. A review on the applications of machine learning and deep learning to groundwater salinity modeling: Present status, challenges, and future directions. Discov. Water 2025, 5, 16.
26. Mondal, N.C.; Singh, V.P.; Singh, V.S.; Saxena, V.K. Determining the interaction between groundwater and saline water through groundwater major ions chemistry. J. Hydrol. 2010, 388, 100–111.
27. Yazdanpanah, N. Spatiotemporal mapping of groundwater quality for irrigation using geostatistical analysis combined with a linear regression method. Model. Earth Syst. Environ. 2016, 2, 18.
28. Mosavi, A.; Hosseini, F.S.; Choubin, B.; Goodarzi, M.; Dineva, A.A. Groundwater salinity susceptibility mapping using classifier ensemble and Bayesian machine learning models. IEEE Access 2020, 8, 145564–145576.
29. Karimi-Rizvandi, S.; Goodarzi, H.V.; Afkoueieh, J.H.; Chung, I.-M.; Kisi, O.; Kim, S.; Linh, N.T.T. Groundwater-Potential Mapping Using a Self-Learning Bayesian Network Model: A Comparison among Metaheuristic Algorithms. Water 2021, 13, 658.
30. Pham, B.T.; Jaafari, A.; Van Phong, T.; Mafi-Gholami, D.; Amiri, M.; Van Tao, N.; Duong, V.H.; Prakash, I. Naïve Bayes ensemble models for groundwater potential mapping. Ecol. Inform. 2021, 64, 101389.
31. Marín Celestino, A.E.; Martínez Cruz, D.A.; Otazo Sánchez, E.M.; Gavi Reyes, F.; Vásquez Soto, D. Groundwater Quality Assessment: An Improved Approach to K-Means Clustering, Principal Component Analysis and Spatial Analysis: A Case Study. Water 2018, 10, 437.
32. Eid, M.H.; Eissa, M.; Mohamed, E.A.; Ramadan, H.S.; Czuppon, G.; Kovács, A.; Szűcs, P. Application of stable isotopes, mixing models, and K-means cluster analysis to detect recharge and salinity origins in Siwa Oasis, Egypt. Groundw. Sustain. Dev. 2024, 25, 101124.
33. Barzegar, R.; Asghari Moghaddam, A. Combining the advantages of neural networks using the concept of committee machine in the groundwater salinity prediction. Model. Earth Syst. Environ. 2016, 2, 26.
34. Wu, Z.; Moayedi, H.; Salari, M.; Le, B.N.; Ahmadi Dehrashid, A. Assessment of sodium adsorption ratio (SAR) in groundwater: Integrating experimental data with cutting-edge swarm intelligence approaches. Stoch. Environ. Res. Risk Assess. 2024, 1–18.
35. Chen, Y.; Liu, G.; Huang, X.; Meng, Y. Groundwater remediation design underpinned by coupling evolution algorithm with deep belief network surrogate. Water Resour. Manag. 2022, 36, 2223–2239.
36. Wang, B.; Tan, Z.; Sheng, W.; Liu, Z.; Wu, X.; Ma, L.; Li, Z. Identification of Groundwater Contamination Sources Based on a Deep Belief Neural Network. Water 2024, 16, 2449.
37. Khader, A.I.; McKee, M. Use of a relevance vector machine for groundwater quality monitoring network design under uncertainty. Environ. Model. Softw. 2014, 57, 115–126.
38. Ashouri, R.; Emamgholizadeh, S.; Haji Kandy, H.; Mehdizadeh, S.S.; Jamali, S. Estimation of land subsidence using coupled particle swarm optimization and genetic algorithm: The case of Damghan aquifer. Water Supply 2024, 24, 416–435.
39. Abba, S.I.; Benaafi, M.; Usman, A.G.; Ozsahin, D.U.; Tawabini, B.; Aljundi, I.H. Mapping of groundwater salinization and modelling using meta-heuristic algorithms for the coastal aquifer of eastern Saudi Arabia. Sci. Total Environ. 2023, 858, 159697.
40. Barzegar, R.; Moghaddam, A.A.; Deo, R.; Fijani, E.; Tziritis, E. Mapping groundwater contamination risk of multiple aquifers using multi-model ensemble of machine learning algorithms. Sci. Total Environ. 2018, 621, 697–712.
41. Tosan, M.; Nourani, V.; Kisi, O.; Dastourani, M. Evolution of ensemble machine learning approaches in water resources management: A review. Earth Sci. Inform. 2025, 18, 416.
42. Kaur, H.; Bansod, B.S.; Khungar, P.; Dhawan, C. Combining clustering and ensemble learning for groundwater quality monitoring: A data-driven framework for sustainable water management. Environ. Sci. Pollut. Res. 2025, 32, 13862–13903.
43. Sahour, H.; Gholami, V.; Vazifedan, M. A comparative analysis of statistical and machine learning techniques for mapping the spatial distribution of groundwater salinity in a coastal aquifer. J. Hydrol. 2020, 591, 125321.
44. Tran, D.A.; Tsujimura, M.; Ha, N.T.; Nguyen, V.T.; Van Binh, D.; Dang, T.D.; Doan, Q.V.; Bui, D.T.; Ngoc, T.A.; Phu, L.V.; et al. Evaluating the predictive power of different machine learning algorithms for groundwater salinity prediction of multi-layer coastal aquifers in the Mekong Delta, Vietnam. Ecol. Indic. 2021, 127, 107790.
45. Ben Brahim, F.; Boughariou, E.; Hajji, S.; Bouri, S. Assessment of groundwater quality with analytic hierarchy process, Boolean logic and clustering analysis using GIS platform in the Kebili’s complex terminal groundwater, SW Tunisia. Environ. Earth Sci. 2022, 81, 419.
46. Trigui, M.R.; Trabelsi, R.; Zouari, K.; Agoun, A. Implication of hydrogeological and hydrodynamic setting on water quality of the Complex Terminal Aquifer in Kebili (southern Tunisia): The use of geochemical indicators and modelling. J. Afr. Earth Sci. 2021, 176, 104121.
47. Kamel, S.; Dassi, L.; Zouari, K. Approche hydrogéologique et hydrochimique des échanges hydrodynamiques entre aquifères profond et superficiel du bassin du Djérid, Tunisie. Hydrol. Sci. J. 2006, 51, 713–730.
48. Ntona, M.M.; Busico, G.; Mastrocicco, M.; Kazakis, N. Modeling groundwater and surface water interaction: An overview of current status and future challenges. Sci. Total Environ. 2022, 846, 157355.

Figure 1. Study area.

Figure 2. Spatial distribution of TDS (g/L) in the study area.

Figure 3. Flowchart of the proposed methodology.

Figure 4. Kernel density (KDE) plots illustrating the variability of predictor and target variables: (a) Na; (b) Mg; (c) K; (d) Ca; (e) Cl; (f) SO₄; (g) CO₃; (h) HCO₃; (i) NO₃; (j) SAR; (k) pH; and (l) TDS.

Figure 5. Comparative boxplot visualization of groundwater physico-chemical parameters: (a) Na; (b) Mg; (c) K; (d) Ca; (e) Cl; (f) SO₄; (g) CO₃; (h) HCO₃; (i) NO₃; (j) SAR; (k) pH; and (l) TDS.

Figure 6. Spatial distribution of different hydrochemical parameters in the study area: (a) Na; (b) Mg; (c) K; (d) Ca; (e) Cl; (f) SO₄; (g) CO₃; (h) HCO₃; (i) NO₃; (j) SAR; and (k) pH.

Figure 7. Correlation matrix heatmap of the hydrochemical groundwater dataset.

Figure 8. Groundwater salinity distribution maps through EdTE-ML modeling approaches: (a) CatBR-m (Stage 1); (b) ExTR-m (Stage 1); and (c) BsTR-m (Stage 2).

Table 1. Hydrogeological characteristics and chemical analyses of groundwater samples.

Well	X (UTM 32N)	Y (UTM 32N)	Depth (m)	pH	Cl (mg/L)	SO₄ (mg/L)	CO₃ (mg/L)	HCO₃ (mg/L)	NO₃ (mg/L)	Na (mg/L)	Mg (mg/L)	K (mg/L)	Ca (mg/L)	TDS (g/L)	SAR
1	474,133	3,753,067	172	7.50	129.40	86.40	0.00	16.40	2.10	36.20	28.40	0.80	38.40	6.90	6.30
2	470,466	3,752,376	179	6.40	129.40	85.70	0.00	16.10	1.10	38.60	20.30	1.20	34.30	6.80	7.40
3	476,109	3,748,170	154	6.40	135.40	80.80	0.00	13.50	2.00	35.30	27.00	1.80	35.10	6.80	6.30
4	481,310	3,750,429	181	7.80	134.00	89.30	0.00	11.80	9.10	59.20	25.80	5.90	33.70	7.90	11.90
5	470,502	3,747,311	148	6.20	127.90	87.00	0.00	16.70	1.10	35.50	30.50	2.00	35.50	6.70	6.20
6	474,127	3,745,684	150	7.20	147.40	57.00	0.00	13.40	1.40	28.90	30.50	0.20	42.70	6.60	4.80
7	481,385	3,746,698	127	6.30	131.60	85.50	0.00	16.00	0.50	40.30	23.30	1.40	37.50	6.30	7.30
8	487,458	3,746,532	125	6.30	150.60	75.80	0.00	16.80	1.10	51.50	25.70	1.40	37.50	7.60	9.10
9	490,293	3,745,567	131	7.70	147.50	89.10	0.20	14.80	1.10	50.00	29.90	1.30	37.90	7.50	8.60
10	485,798	3,752,378	192	6.40	155.30	97.70	0.00	16.00	2.10	42.10	40.60	2.10	34.00	9.30	10.00
11	477,544	3,756,665	210	6.30	171.50	101.00	0.00	16.00	1.10	61.40	37.90	2.30	41.80	11.00	12.90
12	497,247	3,739,994	103	6.30	132.60	72.10	0.00	15.00	1.30	40.80	21.10	1.60	28.80	6.10	8.10
13	500,696	3,740,209	111	6.80	129.60	76.80	0.00	16.20	0.60	35.40	19.50	0.10	40.00	6.00	6.50
14	504,176	3,738,469	107	7.40	119.80	74.10	0.40	13.90	0.00	52.90	12.00	1.80	36.20	5.00	4.70
15	506,379	3,736,772	95	7.90	120.10	87.30	0.80	14.20	1.10	46.70	14.00	1.50	38.70	5.00	4.20
16	510,867	3,738,644	106	7.70	142.40	87.70	0.00	16.40	1.10	34.10	38.80	0.50	37.90	8.10	7.10
17	506,369	3,741,815	123	7.80	188.20	72.90	0.00	12.10	3.10	61.50	25.20	1.80	33.60	8.20	13.10
18	514,697	3,734,255	87	7.90	110.00	70.30	0.00	11.40	1.80	46.50	11.20	0.70	28.20	3.70	3.70
19	520,594	3,741,533	104	6.40	164.40	81.00	0.00	15.00	1.10	48.10	23.60	3.20	31.90	8.20	12.90
20	523,537	3,738,674	98	7.80	140.60	84.90	1.40	13.30	0.80	44.00	29.60	1.40	29.90	7.10	8.00
21	523,919	3,732,726	71	8.10	102.70	37.60	0.00	11.80	1.50	34.50	13.00	0.20	26.10	0.90	2.10
22	513,425	3,743,369	130	6.40	154.50	94.70	0.00	18.20	2.10	59.70	36.70	2.10	34.20	9.10	10.00
23	487,829	3,743,632	95	7.70	124.00	82.40	0.00	11.90	0.00	34.00	20.50	4.30	32.80	6.20	6.60
24	494,105	3,742,472	100	6.30	129.30	82.20	0.00	15.40	1.10	35.40	22.80	2.10	35.70	6.10	6.50
25	521,053	3,738,448	132	6.30	137.60	80.60	0.00	16.10	0.70	45.30	22.00	0.80	36.10	7.10	8.40
26	527,875	3,734,355	86	8.00	108.50	40.30	0.60	12.10	1.40	38.80	14.50	0.20	20.10	1.60	3.30
27	482,985	3,748,066	125	6.40	147.20	84.80	0.00	18.60	1.10	56.00	27.40	1.40	39.80	7.80	9.60
28	492,235	3,747,561	150	7.20	165.60	77.00	0.00	16.60	1.90	48.70	30.10	1.70	43.10	9.10	9.70
29	488,253	3,749,886	160	7.60	167.60	102.40	0.00	12.80	2.10	38.30	40.20	5.80	44.50	9.30	10.50
30	496,856	3,743,260	139	6.30	137.30	83.50	0.00	14.70	0.90	43.10	31.90	1.30	37.60	7.30	7.30
31	489,445	3,741,690	90	7.60	137.40	81.70	0.00	15.00	0.30	40.00	23.50	1.00	37.20	6.20	7.20
32	500,896	3,737,476	85	7.00	130.50	81.20	0.80	14.70	0.70	44.30	20.00	2.00	34.60	6.00	6.60
33	508,395	3,736,314	97	7.50	117.60	69.40	0.00	11.20	2.70	36.00	19.70	1.90	33.40	4.10	3.10
34	521,997	3,732,594	75	7.70	102.90	50.00	0.00	12.00	0.00	39.80	12.30	0.20	22.10	0.50	1.80
35	526,153	3,736,140	100	7.70	116.90	45.80	0.00	12.10	1.50	52.10	15.60	0.00	20.90	2.30	3.30
36	528,420	3,735,878	92	7.70	108.10	42.80	0.00	12.00	6.90	39.60	17.10	0.30	20.50	1.90	3.20
37	513,610	3,739,959	130	6.30	148.50	89.30	0.00	17.70	3.20	45.00	34.20	1.70	36.10	8.10	9.20
38	479,362	3,755,086	200	7.20	218.60	93.00	0.00	15.80	1.20	59.00	38.40	1.70	53.90	10.70	14.50
39	482,214	3,753,886	195	7.70	194.70	82.50	0.00	12.60	0.80	61.70	22.90	2.90	42.90	10.00	15.90
40	509,696	3,740,400	113	7.30	161.40	98.80	1.00	15.90	2.40	36.50	42.80	1.80	45.00	8.80	8.50
41	527,255	3,739,990	110	6.30	135.80	83.30	0.00	5.30	1.10	45.00	23.00	1.90	38.10	7.00	8.10
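
Table 1 reports SAR alongside the major cations. As a minimal sketch, assuming the conventional definition SAR = Na / sqrt((Ca + Mg) / 2) with all concentrations expressed in meq/L, the helper below shows how SAR can be obtained from mg/L concentrations. This section does not state how the tabulated SAR values were derived, so the function names, the rounded equivalent weights, and the sample concentrations are illustrative assumptions only.

```python
import math

# Approximate equivalent weights in mg/meq (atomic mass / ionic charge).
EQ_WEIGHT = {"Na": 22.99, "Ca": 20.04, "Mg": 12.15}


def sar_from_meq(na, ca, mg):
    """Sodium adsorption ratio from concentrations already in meq/L."""
    return na / math.sqrt((ca + mg) / 2.0)


def sar_from_mg_per_l(na, ca, mg):
    """Convert mg/L to meq/L, then apply the conventional SAR definition."""
    return sar_from_meq(
        na / EQ_WEIGHT["Na"],
        ca / EQ_WEIGHT["Ca"],
        mg / EQ_WEIGHT["Mg"],
    )


# Hypothetical sample (not a well from Table 1), concentrations in mg/L.
example_sar = sar_from_mg_per_l(na=230.0, ca=200.0, mg=122.0)
```

Keeping the unit conversion separate from the ratio makes it explicit which unit system a given input is assumed to use.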

Table 2. Descriptive overview of hydro-physical and geochemical parameters derived from groundwater samples.

Parameters	Min	Max	Mean	Median	Variance	Standard Deviation	Coefficient of Variation
Na (mg/L)	28.9	61.7	44.4	43.1	80.4	9.0	0.2
Mg (mg/L)	11.2	42.8	25.5	23.6	74.0	8.6	0.3
K (mg/L)	0.0	5.9	1.7	1.6	1.7	1.3	0.8
Ca (mg/L)	20.1	53.9	35.3	36.1	47.9	6.9	0.2
Cl (mg/L)	102.7	218.6	140.4	135.8	610.6	24.7	0.2
SO₄ (mg/L)	37.6	102.4	78.7	82.4	257.1	16.0	0.2
CO₃ (mg/L)	0.0	1.4	0.1	0.0	0.1	0.3	2.5
HCO₃ (mg/L)	5.3	18.6	14.3	14.8	6.2	2.5	0.2
NO₃ (mg/L)	0.0	9.1	1.6	1.1	2.8	1.7	1.0
TDS (g/L)	0.5	11.0	6.6	6.9	6.3	2.5	0.4
SAR	1.8	15.9	7.7	7.3	11.3	3.4	0.4
pH	6.2	8.1	7.1	7.2	0.4	0.7	0.1
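
The summary statistics in Table 2 can be computed with a few lines of code. The following is a minimal sketch using the Python standard library; it assumes the sample (n − 1) form of the variance and standard deviation, which this section does not confirm, and the function name `describe` is hypothetical. Only the pH values of wells 1 to 5 from Table 1 are used, so the output is not expected to match the full-dataset figures in Table 2.

```python
import statistics


def describe(values):
    """Return the summary statistics listed in the Table 2 column headers."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample SD (n - 1): an assumption
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "median": statistics.median(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "std": std,
        # Coefficient of variation = standard deviation / mean.
        "cv": std / mean if mean != 0 else float("nan"),
    }


# pH of wells 1-5, copied from Table 1.
ph_first_five = [7.50, 6.40, 6.40, 7.80, 6.20]
summary = describe(ph_first_five)
```

The coefficient of variation is dimensionless, which is why it allows variability to be compared across parameters with different units and magnitudes.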

Table 3. RDF-based ranking of input configurations using test data.

Configuration	Input Variables	MAE
C1	Na	0.1175
C2	K, Cl	0.0138
C3	Na, Mg, K	0.0607
C4	Na, Mg, Ca, SAR, pH	0.0399
C5	Na, K, Ca, Cl, NO₃	0.0178
C6	Na, Mg, Ca, Cl, SO₄, HCO₃, pH	0.0309
C7	Na, Mg, K, Ca, Cl, SO₄, SAR	0.0239
C8	Na, Mg, Ca, Cl, SO₄, CO₃, NO₃, SAR	0.0239
C9	Na, K, Ca, Cl, SO₄, CO₃, NO₃, SAR, pH	0.0211
C10	Na, Mg, K, Ca, Cl, SO₄, CO₃, HCO₃, NO₃, pH	0.0349
C11	Na, Mg, K, Ca, Cl, SO₄, CO₃, HCO₃, NO₃, SAR, pH	0.0276

Table 4. RDF-based ranking of optimal input configurations using test data (MAE-sorted criteria).

Rank	Configuration	Input Variables	MAE
1	C2	K, Cl	0.0138
2	C5	Na, K, Ca, Cl, NO₃	0.0178
3	C9	Na, K, Ca, Cl, SO₄, CO₃, NO₃, SAR, pH	0.0211
4	C7/C8	Na, Mg, K, Ca, Cl, SO₄, SAR/Na, Mg, Ca, Cl, SO₄, CO₃, NO₃, SAR	0.0239
5	C11	Na, Mg, K, Ca, Cl, SO₄, CO₃, HCO₃, NO₃, SAR, pH	0.0276
6	C6	Na, Mg, Ca, Cl, SO₄, HCO₃, pH	0.0309
7	C10	Na, Mg, K, Ca, Cl, SO₄, CO₃, HCO₃, NO₃, pH	0.0349
8	C4	Na, Mg, Ca, SAR, pH	0.0399
9	C3	Na, Mg, K	0.0607
10	C1	Na	0.1175
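
The ordering in Table 4 follows from sorting the Table 3 MAE values in ascending order, with tied configurations sharing a rank. The sketch below reproduces that ordering from the tabulated values; the function name `rank_by_mae` and the tie-handling rule (tied entries share one rank and the next rank follows consecutively) are assumptions inferred from the table layout rather than a description of the authors' code.

```python
from itertools import groupby

# MAE values copied from Table 3.
mae_by_config = {
    "C1": 0.1175, "C2": 0.0138, "C3": 0.0607, "C4": 0.0399,
    "C5": 0.0178, "C6": 0.0309, "C7": 0.0239, "C8": 0.0239,
    "C9": 0.0211, "C10": 0.0349, "C11": 0.0276,
}


def rank_by_mae(scores):
    """Sort configurations by ascending MAE; equal MAE values share a rank."""
    ordered = sorted(scores.items(), key=lambda item: item[1])
    ranking = []
    grouped = groupby(ordered, key=lambda item: item[1])
    for rank, (mae, group) in enumerate(grouped, start=1):
        ranking.append((rank, [name for name, _ in group], mae))
    return ranking


ranking = rank_by_mae(mae_by_config)
for rank, names, mae in ranking:
    print(rank, "/".join(names), mae)
```

With eleven configurations and one tie (C7 and C8), this yields ten ranks, consistent with Table 4.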

Table 5. Evaluation of EdTE-ML model precision across dual modeling stages using performance metrics during the training phase.

	EdTE-ML Training Phase
Assessment Criterion	Stage 1: CatBR-m	Stage 1: ExTR-m	Stage 2: BsTR-m
MAE	0.0034	4.96 × 10⁻¹⁶	0.0
Adjusted R²	0.9979	1.0	1.0
KGE	1.0	1.0	1.0
nRMSE	0.0042	6.23 × 10⁻¹⁶	0.0

Table 6. Evaluation of EdTE-ML model precision across dual modeling stages using performance metrics during the validation phase.

	EdTE-ML Validation Phase
Assessment Criterion	Stage 1: CatBR-m	Stage 1: ExTR-m	Stage 2: BsTR-m
MAE	0.04295	0.01953	0.01382
Adjusted R²	0.9457	0.9671	0.9937
KGE	0.9965	0.9993	0.9998
nRMSE	0.05385	0.02449	0.01732
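
For readers who wish to recompute the four criteria in Tables 5 and 6, the following is a minimal sketch in plain Python. Several choices are assumptions not confirmed by this section: nRMSE is normalised by the observed range, KGE uses the original three-component form (correlation, variability ratio, and bias ratio), and adjusted R² takes the number of predictors as an explicit argument. The function names and sample values are illustrative.

```python
import math


def mae(obs, sim):
    """Mean absolute error."""
    return sum(abs(o - s) for o, s in zip(obs, sim)) / len(obs)


def nrmse(obs, sim):
    """RMSE divided by the observed range (normalisation is an assumption)."""
    rmse = math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))
    return rmse / (max(obs) - min(obs))


def adjusted_r2(obs, sim, n_features):
    """R² penalised for the number of predictors."""
    n = len(obs)
    mean_obs = sum(obs) / n
    ss_res = sum((o - s) ** 2 for o, s in zip(obs, sim))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)


def kge(obs, sim):
    """Kling-Gupta efficiency from correlation, variability, and bias ratios."""
    n = len(obs)
    mean_o, mean_s = sum(obs) / n, sum(sim) / n
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in obs) / n)
    sd_s = math.sqrt(sum((s - mean_s) ** 2 for s in sim) / n)
    cov = sum((o - mean_o) * (s - mean_s) for o, s in zip(obs, sim)) / n
    r = cov / (sd_o * sd_s)   # linear correlation
    alpha = sd_s / sd_o       # variability ratio
    beta = mean_s / mean_o    # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


# Hypothetical observed and predicted values (not taken from the study).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
scores = {
    "MAE": mae(observed, predicted),
    "Adjusted R2": adjusted_r2(observed, predicted, n_features=1),
    "KGE": kge(observed, predicted),
    "nRMSE": nrmse(observed, predicted),
}
```

A perfect prediction gives MAE = 0, nRMSE = 0, adjusted R² = 1, and KGE = 1, which is the pattern reported for the training phase in Table 5.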

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
