Assessment of Marine Water Quality Using Integrated Indices and Machine Learning Framework in the Arabian Gulf Region

Gad, Mohamed; Ata, Ahmed Ali El-Sayed M.; Fattah, Mohamed K.; El-Fadaly, Ezzat A.; El-baki, Mohamed S. Abd; Gaagai, Aissam; Eid, Mohamed Hamdy; Elsherbiny, Osama; Taha, Mohamed Farag; Elsayed, Salah

doi:10.3390/su18126140


by Mohamed Gad ¹,*, Ahmed Ali El-Sayed M. Ata ², Mohamed K. Fattah ¹, Ezzat A. El-Fadaly ³, Mohamed S. Abd El-baki ⁴, Aissam Gaagai ⁵, Mohamed Hamdy Eid ⁶,⁷, Osama Elsherbiny ⁸, Mohamed Farag Taha ⁹,¹⁰,* and Salah Elsayed ¹¹

¹ Hydrogeology, Evaluation of Natural Resources Department, Environmental Studies and Research Institute, University of Sadat City, Sadat City 32897, Egypt

² Chemistry, Evaluation of Natural Resources Department, Environmental Studies and Research Institute, University of Sadat City, Sadat City 32897, Egypt

³ Inorganic Chemistry, Evaluation of Natural Resources Department, Environmental Studies and Research Institute, University of Sadat City, Sadat City 32897, Egypt

⁴ Agricultural Engineering Department, Faculty of Agriculture, Mansoura University, Mansoura 35516, Egypt

⁵ Scientific and Technical Research Center on Arid Regions (CRSTRA), Biskra 07000, Algeria

⁶ Institute of Environmental Management, Faculty of Earth Science, University of Miskolc, 3515 Miskolc, Hungary

⁷ Geology Department, Faculty of Science, Beni-Suef University, Beni-Suef 65211, Egypt

⁸ Interdisciplinary Research Center for Membranes and Water Security, King Fahd University of Petroleum and Minerals, Dhahran 31261, Saudi Arabia

⁹ Department of Soil and Water Sciences, Faculty of Environmental Agricultural Sciences, Arish University, Arish 45516, Egypt

¹⁰ College of Biosystems Engineering and Food Science, Zhejiang University, Hangzhou 310058, China

¹¹ Agricultural Engineering, Evaluation of Natural Resources Department, Environmental Studies and Research Institute, University of Sadat City, Sadat City 32897, Egypt


* Authors to whom correspondence should be addressed.

Sustainability 2026, 18(12), 6140; https://doi.org/10.3390/su18126140 (registering DOI)

Submission received: 5 April 2026 / Revised: 29 May 2026 / Accepted: 6 June 2026 / Published: 15 June 2026

(This article belongs to the Special Issue Sustainable Environmental Science and Water/Wastewater Treatment: 2nd Edition)


Abstract

This study presents an integrated computational framework for quantifying industrial impacts on marine ecosystems through the combined assessment of multiple environmental quality indices. The Aquatic Water Quality Index (AWQI) and four diagnostic pollution indices, namely the Heavy Metal Pollution Index (HPI), Metal Index (MI), Degree of Contamination (C_d), and Pollution Index (PI), were applied across 23 offshore sites in Mesaieed Industrial City, Qatar, to establish a high-resolution baseline for evaluating the effects of industrial effluents and brine discharge. Multivariate statistical analyses, including Principal Component Analysis (PCA) and Cluster Analysis (CA), identified Cr, Pb, Mn, Ni, and Zn as the principal drivers of water quality variability, effectively distinguishing anthropogenic influences from natural background conditions. To enable rapid and automated marine environmental assessment, three machine learning models—Artificial Neural Networks (ANN), Random Forest (RF), and Decision Trees (DT)—were developed and evaluated for predicting the investigated indices. Model performance was assessed through rigorous training–testing validation and the Diebold–Mariano test. The results demonstrated that model selection significantly influences predictive accuracy. Among the evaluated algorithms, RF achieved the highest predictive performance for AWQI (R² = 0.88) and C_d (R² = 0.92), whereas ANN performed best for HPI (R² = 0.89), and DT yielded the most accurate predictions for MI (R² = 0.82). Despite the index-specific strengths of individual models, RF emerged as the most robust and generalizable approach, consistently providing superior performance across heterogeneous environmental datasets. 
The proposed framework advances marine water quality assessment from conventional descriptive monitoring toward a proactive, data-driven paradigm, offering a scalable and cost-effective decision support tool for environmental management, pollution mitigation, and evidence-based coastal governance in industrialized coastal regions.

Keywords: machine learning (ML); multivariate methods; pollution indices (PIs); Mesaieed Industrial City (MIC); Arabian Gulf Region

1. Introduction

The natural environment has undergone major changes because of industrialization and uncontrolled urban development. Seawater hosts some of the most diverse, dynamic, and productive ecosystems on Earth [1]. An aquatic ecosystem consists of physicochemical components, the biological community, and their mutual interactions, and it is shaped by complex biological and physical processes [2].

Ecosystems also change over time as species adapt to their environment [2]. In recent years, the potential for toxic effects, persistence, and bioaccumulation that may damage aquatic ecosystems has drawn considerable attention to seawater quality indicators in water environment research [3]. Activities related to the petroleum industry, other industrial processes, and urbanization can damage the environment and contaminate water ecosystems, putting humans and aquatic biota at risk [4,5]. Because seawater management depends heavily on water quality, assessing seawater quality for aquatic ecosystems in underdeveloped countries has become a major concern in recent years [4].

The petroleum industry is renowned for its intricate operations, which encompass the extraction, refinement, and distribution of hydrocarbon resources. In the process, substantial quantities of industrial wastewater are generated, necessitating treatment before release into the surrounding environment [6]. The Arabian Gulf, with its strategic location and vast reserves of oil and gas, has become a hotspot for petroleum-related activities [7]. Consequently, the Gulf’s marine ecosystem is subjected to continuous exposure to treated industrial wastewater effluents, raising concerns about the long-term sustainability of this fragile environment.

Mesaieed Industrial City (MIC) serves as Qatar’s primary industrial hub and the central location for petrochemical and oil refining operations. Originally established in 1949 as a modest port facility, Mesaieed has undergone substantial expansion to accommodate a diverse array of major industrial enterprises. This rapid industrial and urban development within MIC has generated significant environmental pressures, particularly on marine water quality and the sensitive habitats that depend on these waters, through the continuous discharge of industrial wastewater effluents [8].

Water quality assessment is a fundamental component of effective seawater management strategies. Consequently, the evaluation of seawater quality in aquatic environments within developing nations has emerged as a pressing contemporary concern [9]. The region’s most accessible seawater resources face increasing vulnerability due to potential contamination from intensive industrial activities [10]. This situation necessitates comprehensive periodic and seasonal monitoring programs for seawater quality assessment, which enable the evaluation of water suitability for various applications while addressing degradation challenges to maintain sustainable seawater conditions throughout the Arabian Gulf Region. However, conventional approaches to seawater quality assessment, which encompass sample collection, preservation, and laboratory analysis, have become increasingly challenging and economically burdensome. These limitations highlight the urgent need for innovative assessment methodologies [11]. Industrial development and uncontrolled urbanization have significantly altered the natural environment. Marine ecosystems rank among the world’s most productive, biodiverse, and interconnected systems. This marine environment supports diverse activities, including commercial fishing and tourism, and serves as a crucial habitat for migratory bird populations during both summer and winter seasons [12].

While previous works have applied similar frameworks to groundwater and surface water systems, this study is distinguished by its application to marine seawater in a hypersaline industrial environment (mean salinity 45.60 psu), its use of marine ecosystem indices with CCME aquatic life standards, its introduction of Diebold–Mariano validation for machine learning (ML) model selection, and its regional benchmarking with Gulf Cooperation Council policy integration (TMDL, ROPME, EBM). These adaptations confirm that the methodological framework has been substantively recalibrated for marine environmental assessment and regional regulatory contexts. The main objective of this research is to develop an integrated assessment framework for marine water quality evaluation at MIC that combines index-based quantification, multivariate source identification, and ML prediction to support evidence-based management decisions. This is addressed through three specific objectives: (i) evaluating marine water quality using AWQI and complementary pollution indices (HPI, MI, C_d, PI) with CCME standards; (ii) identifying pollution sources and spatial patterns through PCA and CA; and (iii) comparing ANN, RF, and DT models for automated index prediction, validated by the Diebold–Mariano test. Each component serves a distinct analytical purpose, collectively enabling cost-effective real-time monitoring without laboratory recalculations. This research brings essential progress toward environmental sustainability in the Arabian Gulf region.

2. Materials and Methods

2.1. Study Area

Mesaieed is an industrial city in Al Wakrah Municipality in the State of Qatar, approximately 36 km (22 mi) south of Doha with coordinates 23.9820° N, 51.5526° E. It was one of the most important cities in Qatar during the 20th century, having gained recognition as a prime industrial zone and tanking center for petroleum received from Dukhan. Both Mesaieed and its industrial area were administered by a subdivision of “Qatar Energy” called “Mesaieed Industry City (MIC) Management”, which was established in 1996.

Mesaieed was established in 1949 as a simple port facility and has since grown to support a wide range of major industries. The accelerated industrial and urban expansion within MIC has placed stress on the natural environment, particularly on marine water quality and associated sensitive habitats, through the discharge of industrial wastewater streams. The case study of the MIC marine area (Figure 1) assessed the impact of treated industrial wastewater (TIW) and brine discharged to the sea through sampling and dispersion modeling. The model was run to identify the potential impact area of the TIW and brine streams in the receiving waters of the Arabian Gulf and to identify mitigation measures.

The MIC marine area is a key site for assessing industrial impacts on seawater quality. Unlike prior applications of this integrated framework to groundwater or lacustrine surface water, the marine environment presents unique methodological challenges. Salinities exceeding 45 psu, brine discharge from desalination plants, and petroleum industry effluents fundamentally alter trace metal speciation, index weighting validity, and ML feature importance. The CCME marine water quality standards applied in this study differ substantively from FAO irrigation guidelines employed in previous studies, requiring complete recalibration of index formulations and interpretive thresholds. The location and ongoing expansion of MIC make it crucial for studying pollution dispersion and ecological stressors. With multiple discharge sources, including treated industrial wastewater and brine, the area provides a representative setting for evaluating industrial effects and developing management strategies.

2.2. Sampling and Analysis

Seawater samples were collected from 23 strategically selected locations surrounding MIC during both summer and winter seasons over the study period of 2022 and 2023. At each sampling location, both surface (top) and bottom water samples were collected to assess vertical water column variability and to provide a comprehensive understanding of water quality conditions throughout the water column. The sampling plan for the field survey and detailed work schedule are presented in Figure 1, which illustrates the spatial distribution of sampling points and the systematic approach employed for data collection.

Sampling and measurements for all locations were strategically scheduled to occur on the same day whenever possible, with timing dependent on favorable weather conditions and optimal tidal states to ensure consistency and comparability of results. Due to the large number of sampling locations and the need to capture tidal variations, the sampling campaign was divided into two consecutive days. The first day focused on low tide sampling to capture conditions when industrial effluents might be more concentrated, followed by high tide sampling on the second day to assess dilution effects and mixing patterns in the marine environment.

The samples were collected following standardized protocols as outlined in the American Public Health Association 9 to ensure consistency and reliability of the sampling procedures. The precise location of each sample was determined using UTM coordinates obtained with a handheld MAGELLAN GPS 315 unit (Magellan Navigation, Inc., San Dimas, CA, USA), providing accurate spatial referencing for all field sampling locations and measuring points shown in Figure 1. This precise positioning enabled accurate mapping of water quality variations and facilitated correlation with potential pollution sources in the industrial area.

Physical characteristics of the water samples, including salinity, pH, and temperature (°C), were determined in situ using a calibrated YSI Professional Plus portable multi-parameter analyzer (YSI Incorporated, Yellow Springs, OH, USA) to capture real-time conditions and minimize potential changes that could occur during sample transport and storage. These immediate measurements provided essential baseline data for understanding the physicochemical environment at each sampling location and served as quality control indicators for subsequent laboratory analyses.

Seawater samples were collected in pre-labeled 500 mL high-density polyethylene bottles that had been pre-cleaned according to standard protocols for trace metal analysis. The samples were immediately acidified with concentrated nitric acid to a pH below 2.0 to prevent metal precipitation or adsorption losses. The acidified bottles were then sealed and stored in a refrigerator at 4 °C until laboratory analysis. This preservation method protected the samples from degradation during storage and transportation.

The analysis followed the Standard Methods for the Examination of Water and Wastewater [13], with appropriate analytical techniques selected for each parameter group. Chlorophyll ‘a’ and ammonia (NH₃), which serve as biological and nutrient indicators, were determined quantitatively with a Hach DR6000 spectrophotometer (Hach Company, Loveland, CO, USA). Nitrate (NO₃), nitrite (NO₂), total phosphorus (TP), and the trace elements and heavy metals, namely hexavalent chromium (Cr-VI), aluminum (Al), barium (Ba), cadmium (Cd), total chromium (Cr), copper (Cu), iron (Fe), lead (Pb), manganese (Mn), mercury (Hg), nickel (Ni), and zinc (Zn), were analyzed with an inductively coupled plasma mass spectrometer (ICAP TQ ICP-MS, Thermo Fisher Scientific Inc., Waltham, MA, USA). This technique provided the sensitivity and precision needed to measure trace elements accurately at environmentally relevant concentrations.

Strict chain-of-custody documentation was used to transfer all samples to an EXOVA L.L.C. approved and accredited laboratory in Doha, Qatar, for complete analysis. The chain-of-custody procedures ensured complete traceability and integrity of samples from collection through final analysis, maintaining the legal and scientific validity of the analytical results. The laboratory’s accreditation and quality management systems provided additional assurance of analytical reliability and adherence to international standards for environmental testing.

A comprehensive quality assurance and quality control (QA/QC) protocol required duplicate analyses of all seawater samples to improve data confidence and ensure reliability throughout the analytical program. The duplicate analyses provided statistical measures of precision and helped detect systematic errors in the analytical procedures. Accuracy was tested by analyzing certified reference materials (ERM-CA713, JRC-IRMM, Geel, Belgium) in each analytical batch, which confirmed that measured values were within certified limits and that the analytical methods were operating correctly. The QA/QC program confirmed that all analytical results fulfilled the established data quality objectives and supported the reliability of the dataset used for statistical analysis and environmental assessment.

2.3. Multivariate Statistics

2.3.1. Cluster Analysis (CA)

According to Ghodbane et al. [14], hierarchical CA is a reliable data mining technique that identifies homogeneous groups of variables and reveals the structure of complex environmental datasets. The technique builds a binary tree that merges similar data points successively according to their statistical proximity. The resulting clusters show high intra-cluster homogeneity and significant inter-cluster heterogeneity, so that each cluster contains similar samples that differ from the samples in other clusters [15].

Multiple analytical methods were used to develop and unite water sample groups into meaningful clusters which revealed spatial similarity patterns and clustering relationships between sampling stations throughout the study area. The clustering analysis used Ward’s linkage criterion to form optimal clusters by minimizing the within-cluster sum of squares. The analysis results appear as a dendrogram which displays hierarchical sample group relationships through a two-dimensional tree diagram to show the clustering structure of the dataset.
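As an illustration only, Ward-linkage clustering of the kind described above can be sketched in Python with SciPy. The station values are synthetic, and the two-cluster cut is an assumption made for the example; neither is taken from the study.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical data: 6 stations x 3 standardized parameters (not study data).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.1, size=(3, 3))
group_b = rng.normal(loc=5.0, scale=0.1, size=(3, 3))
stations = np.vstack([group_a, group_b])

# Ward's criterion merges, at each step, the pair of clusters whose union
# gives the smallest increase in the within-cluster sum of squares.
merge_tree = linkage(stations, method="ward")

# Cut the tree into two flat clusters (assumed number, for illustration).
labels = fcluster(merge_tree, t=2, criterion="maxclust")
```

The `merge_tree` array can be passed to `scipy.cluster.hierarchy.dendrogram` to draw the tree diagram.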

2.3.2. Principal Component Analysis (PCA)

PCA is a widely used exploratory data analysis technique that reduces the dimensionality of a dataset while retaining as much of the original variance as possible. It converts multiple, potentially correlated variables into uncorrelated principal components (PCs) that represent the fundamental patterns and relationships in the original dataset. Each principal component is a linear combination of the original variables, with the eigenvector entries giving the weighting of each variable, and the components are mutually uncorrelated.

The first principal component explains the largest amount of total variance in the dataset while showing the most important pattern of variation and subsequent principal components explain progressively smaller amounts of remaining variance. The principal components are arranged in sequential order based on their contribution to the overall variability, with each successive component contributing less to the total variance than the previous one, thereby providing a hierarchical understanding of the data structure [16]. This analytical approach enables researchers to identify the most important variables driving the observed patterns in water quality and to reduce the complexity of multivariate datasets while retaining the essential information needed for environmental assessment.
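The ordering of components by explained variance can be sketched with NumPy alone. This is a minimal example on synthetic data, assuming the variables are standardized and the correlation matrix is decomposed; the study's actual PCA settings are not stated in this section.

```python
import numpy as np

# Hypothetical data: 50 samples x 4 variables, two of them strongly correlated.
rng = np.random.default_rng(1)
base = rng.normal(size=(50, 1))
data = np.hstack([
    base + 0.05 * rng.normal(size=(50, 1)),
    base + 0.05 * rng.normal(size=(50, 1)),
    rng.normal(size=(50, 2)),
])

# Standardize so that each variable contributes on the same scale.
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# Eigen-decomposition of the correlation matrix.
corr = np.corrcoef(z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)

# eigh returns ascending eigenvalues; reverse so PC1 explains the most variance.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
loadings = eigenvectors[:, order]

explained_ratio = eigenvalues / eigenvalues.sum()
scores = z @ loadings  # sample coordinates on the principal components
```

The columns of `loadings` hold the weightings of the original variables, and the columns of `scores` are mutually uncorrelated.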

2.4. Indexing Approaches

2.4.1. Arithmetic Water Quality Index (AWQI)

The AWQI offers a holistic evaluation framework that assesses water quality by integrating physicochemical parameters commonly monitored in routine environmental surveillance programs. The index is used here to assess surface water conditions for aquatic ecosystem support, following the mathematical approach of Brown et al. [17]. The AWQI calculation employs a weighted arithmetic method that follows the basic mathematical relationship (Equation (1)).

AWQI = \sum_{i = 1}^{n} Q_{i} W_{i}

(1)

The equation uses Q_i to represent sub-quality indexes of variables and W_i to represent weight units of specified variables and n to represent the total number of physicochemical characteristics analyzed. The analysis included seventeen physicochemical characteristics (n = 17) which were expressed in milligrams per liter (mg/L) to maintain consistent and comparable results.

The Canadian Council of Ministers of the Environment established standards to calculate Q_i values through observed surface water concentrations (C_i) and their respective environmental benchmarks (S_i) for aquatic ecosystem protection parameters as specified by CCME [18]. The mathematical relationship appears in the following equation (Equation (2)):

Q_{i} = \frac{C_{i}}{S_{i}} \times 100

(2)

The weight (W_i) for each variable or parameter is considered using the normalization equation presented below (Equation (3)), which ensures that the sum of all weights equals unity:

W_{i} = \frac{w_{i}}{\sum w_{i}}

(3)

The individual weight (w_i) for each parameter is determined from the recommended standards according to the following proportional relationship (Equation (4)):

w_{i} = \frac{K}{S_{i}}

(4)

In this equation, K represents the proportionality constant that ensures appropriate scaling of the weight values relative to the environmental standards.

For effective AWQI computation, individual weights (w_i) must be allocated to each surface water parameter, requiring systematic calculation of both relative weights (W_i) and quality rating values (Q_i) for all analytical parameters incorporated in the assessment. The W_i values for selected physicochemical variables are detailed in Table 1, with corresponding individual weights (w_i) determined through the proportional methodology outlined in Equation (4). A weighted arithmetic methodology was utilized to establish weight assignments that accurately represent each parameter’s significance in overall water quality determination. The derived weights (w_i) and relative weights (W_i) for all water quality variables are thoroughly documented in Table 1, accompanied by evaluation standards for seawater quality metrics and their influence on trace element levels, adhering to the analytical framework developed by Brown et al. [17].
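The chain from Equation (4) to Equation (1) can be traced with a short Python sketch. The parameter names "A" and "B" and their values are hypothetical placeholders, not entries from Table 1.

```python
def awqi(concentrations, standards, k=1.0):
    """Weighted arithmetic index following Equations (1)-(4)."""
    # Equation (4): raw weight inversely proportional to the standard.
    raw_weights = {p: k / standards[p] for p in standards}
    total = sum(raw_weights.values())
    # Equation (3): relative weights that sum to one.
    rel_weights = {p: w / total for p, w in raw_weights.items()}
    # Equation (2): quality rating as a percentage of the standard.
    ratings = {p: concentrations[p] / standards[p] * 100 for p in standards}
    # Equation (1): weighted sum of the ratings.
    return sum(ratings[p] * rel_weights[p] for p in standards)


# Hypothetical standards and concentrations in mg/L, for illustration only.
standards = {"A": 1.0, "B": 0.5}
concentrations = {"A": 0.5, "B": 0.1}
value = awqi(concentrations, standards)
```

Because the weights are normalized in Equation (3), the constant K cancels, so any positive `k` gives the same index.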

2.4.2. Pollution Indices (PIs)

This research employed four separate contamination evaluation methodologies to deliver a thorough assessment of aquatic environmental quality. The analytical approaches encompass the HPI initially introduced by Prasad and Bose [19], the MI formulated by Tamasi and Cini [20], the C_d developed by Backman et al. [21], and the PI established by Caeiro et al. [22]. These contamination metrics, specifically HPI, MI, C_d, and PI, were methodically evaluated based on the concentrations of ten designated trace elements detailed in Table 1, employing the mathematical formulations and analytical protocols outlined in the subsequent equations and methodological approaches.

Heavy Metal Pollution Index (HPI)

The HPI is a composite assessment tool in which each selected parameter is assigned a specific rating or weight (W_i) to construct the overall HPI value, following the methodology established by Backman et al. [21]. This toxicity index uses weighted trace element concentrations to reflect overall water quality relative to the recommended standard guideline (S_i) for each metal in aquatic environments, as specified by the Canadian Council of Ministers of the Environment [18]. To compute the HPI values, the standard (S_i) and maximum desired (I_i) concentration limits for each parameter were sourced from CCME [18]. The HPI values were estimated according to the following equation (Equation (5)):

HPI = \frac{\sum_{i = 1}^{n} W_{i} Q_{i}}{\sum_{i = 1}^{n} W_{i}}

(5)

In this equation, W_i and Q_i represent the unit weights and sub-indices for the selected trace elements, respectively, while n represents the total number of trace elements being monitored, which equals 10 in this study. The unit weights (W_i) and sub-indices (Q_i) are calculated using the following mathematical relationships (Equations (6) and (7)):

W_{i} = \frac{K}{S_{i}} = \frac{1}{S_{i}}

(6)

Q_{i} = \frac{|M_{i} - I_{i}|}{S_{i} - I_{i}}

(7)

In these formulations, K is the proportionality constant and S_i is the acceptable threshold concentration for each parameter based on regulatory standards. M_i, I_i, and S_i denote the observed heavy metal concentration, the optimal concentration, and the reference concentration of the ith variable, respectively. The difference M_i − I_i is taken as an absolute value, with the sign disregarded during computation. HPI values are classified into three contamination categories: minimal trace element contamination (HPI < 100), trace element contamination at the critical level (HPI = 100), and severe heavy metal contamination (HPI > 100), which provides a basis for environmental evaluation and regulatory decision-making.
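A minimal Python sketch of the HPI calculation is given below. It assumes K = 1, takes the absolute deviation from the optimal value as stated above, and applies a factor of 100 to the sub-index. That factor is an assumption borrowed from the conventional formulation, under which a sample exactly at its standards scores HPI = 100. The metals "X" and "Y" and their limits are hypothetical.

```python
def hpi(measured, ideal, standard, scale=100.0):
    """Heavy Metal Pollution Index following Equations (5)-(7).

    `scale` is an assumption (conventional factor of 100), not a value
    stated in the text.
    """
    numerator = 0.0
    denominator = 0.0
    for metal in standard:
        weight = 1.0 / standard[metal]  # Equation (6) with K = 1
        # Equation (7): absolute deviation from the optimal value, relative
        # to the gap between the standard and the optimal value.
        sub_index = (
            abs(measured[metal] - ideal[metal])
            / (standard[metal] - ideal[metal])
            * scale
        )
        numerator += weight * sub_index
        denominator += weight
    return numerator / denominator  # Equation (5)


# Hypothetical limits, for illustration only.
standard = {"X": 2.0, "Y": 4.0}
ideal = {"X": 0.0, "Y": 0.0}
at_limit = hpi(standard, ideal, standard)
at_half = hpi({"X": 1.0, "Y": 2.0}, ideal, standard)
```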

Metal Index (MI)

The MI functions as an integrated methodology for evaluating the comprehensive water quality with particular emphasis on metallic pollutant concentrations. The assessment tool evaluates existing environmental conditions to provide substantial knowledge about aquatic system quality under metal-related contamination pressures [23]. The MI represents water quality conditions under metal stress and is calculated according to the following equation (Equation (8)):

MI = \sum_{i = 1}^{n} \frac{H_{c}}{H_{\max}}

(8)

In this relationship, H_c represents the measured concentration of a trace element in a water sample, H_max represents the maximum allowed concentration of that element under environmental standards, and i indexes the n trace elements. The MI value quantifies the cumulative metallic contamination pressure on aquatic systems, giving a numerical assessment of the combined impact of multiple metals on marine and freshwater environments.
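Equation (8) reduces to a sum of concentration-to-limit ratios, as in the following sketch. It assumes the sum runs over the metals measured in one sample; the metal names and values are hypothetical.

```python
def metal_index(measured, max_allowed):
    """Metal Index (Equation (8)): sum of H_c / H_max over the metals."""
    return sum(measured[m] / max_allowed[m] for m in max_allowed)


# Hypothetical concentrations and limits in the same units.
max_allowed = {"X": 1.0, "Y": 4.0}
measured = {"X": 0.5, "Y": 1.0}
mi_value = metal_index(measured, max_allowed)  # 0.5 + 0.25
```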

Degree of Contamination (C_d)

The C_d is determined through methodical evaluation of pollution factors for specific trace elements that surpass the allowable limit concentrations defined by environmental regulatory standards. The calculation uses a two-step mathematical approach as described by Edet and Offiong [23], with the degree of contamination classified on a three-level scale where C_d < 1 indicates low contamination, C_d between 1 and 3 represents medium contamination, and C_d > 3 signifies high contamination. The degree of contamination is calculated using Equations (9) and (10):

C_{d} = \sum_{i = 1}^{n} C_{fi}

(9)

The pollution factor for each specific trace element is calculated through the subsequent mathematical relationship:

C_{fi} = \frac{C_{Ai}}{C_{Ni}} - 1

(10)

The mathematical framework uses C_fi to represent pollution factors for individual trace elements while C_Ai shows laboratory-measured metal concentrations in water samples and C_Ni represents the environmental regulations’ allowed metal concentration limits. The evaluation method provides complete pollution assessment by analyzing both individual metal effects and their cumulative impact on water quality standards.

When measured metal concentrations (C_Ai) are consistently and substantially below the normative values (C_Ni), each contamination factor (C_fi) yields a negative value. Consequently, the sum (C_d) can become strongly negative. In the context of marine water quality assessment using CCME guidelines as the normative reference, strongly negative values indicate that the water quality is well within the acceptable range for all analyzed metals. This is mathematically consistent with the original formulation by Backman et al. [21], which was designed to detect contamination by identifying parameters that exceed their permissible limits; when none do, the index is inherently negative.
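The sign behaviour described above can be checked with a short sketch of Equations (9) and (10) and the three-level scale. The metals and limits are hypothetical; the handling of the boundary values 1 and 3 follows the stated scale.

```python
def degree_of_contamination(measured, limits):
    """Return C_d (Equation (9)) and the per-metal factors (Equation (10))."""
    factors = {m: measured[m] / limits[m] - 1.0 for m in limits}
    return sum(factors.values()), factors


def classify_cd(cd):
    """Three-level scale: low (< 1), medium (1 to 3), high (> 3)."""
    if cd < 1:
        return "low"
    if cd <= 3:
        return "medium"
    return "high"


# Hypothetical limits; the first sample is far below them, the second exceeds them.
limits = {"X": 1.0, "Y": 2.0}
cd_clean, _ = degree_of_contamination({"X": 0.1, "Y": 0.2}, limits)
cd_polluted, _ = degree_of_contamination({"X": 3.0, "Y": 5.0}, limits)
```

In the first case every factor is negative, so C_d is negative and the sample falls in the low class.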

Pollution Index (PI)

The PI functions as a tool to evaluate the contamination effects of trace elements on surface water quality by providing a standardized measurement of pollution impacts. The calculation of PI values occurs separately for each metal before they receive classification into five distinct categories based on their pollution intensity and ecological significance. The metric shows how particular trace elements affect surface water quality and provides vital information for environmental protection and restoration planning. The PI calculation process uses the following mathematical formula (Equation (11)):

PI = \frac{\sqrt{[{(\frac{C_{i}}{s_{i}})}_{\max}^{2} + {(\frac{C_{i}}{s_{i}})}_{\min}^{2}]}}{2}

(11)

The formula includes C_i which represents the observed metallic concentration in aquatic specimens and s_i which represents the reference metal threshold that corresponds to the permissible metallic concentration in water according to regulatory standards. The calculation method evaluates both maximum and minimum concentration ratios to provide a balanced assessment of contamination levels.

The PI is calculated individually for each metal, and the resulting values are classified into five contamination categories: PI < 1 indicates no significant adverse influence; 1 ≤ PI < 2 indicates slight adverse influence; 2 ≤ PI < 3 indicates moderate adverse influence; 3 ≤ PI < 5 indicates strong adverse influence; and PI ≥ 5 indicates very strong adverse influence on the aquatic ecosystem.
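The following sketch evaluates Equation (11) as printed, with the division by 2 outside the square root, and applies the five classes. It assumes the maximum and minimum ratios are taken over the samples available for one metal, which is an interpretation rather than something stated in the text. The concentrations and threshold are hypothetical.

```python
import math


def pollution_index(concentrations, threshold):
    """PI for one metal, following Equation (11) as printed."""
    ratios = [c / threshold for c in concentrations]
    return math.sqrt(max(ratios) ** 2 + min(ratios) ** 2) / 2


def classify_pi(pi):
    """Five-level scale of adverse influence on the aquatic ecosystem."""
    if pi < 1:
        return "no significant"
    if pi < 2:
        return "slight"
    if pi < 3:
        return "moderate"
    if pi < 5:
        return "strong"
    return "very strong"


# Hypothetical concentrations of one metal across samples, same units as threshold.
pi_value = pollution_index([3.0, 4.0], threshold=1.0)  # sqrt(16 + 9) / 2
pi_class = classify_pi(pi_value)
```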

2.5. Machine Learning Methods

Current water quality evaluation methodologies for indices such as AWQI, HPI, MI, and C_d demonstrate a significant drawback as they demand thorough knowledge of weighting parameters, potentially resulting in unclear and biased outcomes in environmental evaluation processes [24]. These metrics are computed through the integration of diverse physicochemical parameter values into a unified numerical score that indicates the general appropriateness of water quality for particular applications. Scientists have investigated multiple approaches to reduce the natural bias of this methodology by integrating essential ionic weights derived from entropy-based computations, thus improving the precision and dependability of the assessment framework.

Nevertheless, conventional approaches utilizing mathematical formulations for calculating these indices require extensive data gathering processes, thorough laboratory examinations, intricate data management, and stringent validation procedures, making them exceptionally time-intensive and resource-demanding for standard environmental surveillance programs. Conversely, ML algorithms offer a more streamlined and effective alternative methodology for aquatic quality evaluation [25].

To forecast WQIs effectively, three distinct ML models were developed and implemented: DT, RF, and ANN. These ML models were created using the Spyder software version 6.1 environment and the Python scikit-learn library version 1.8.0, which provides comprehensive tools for ML applications [26].

The approach used in this study follows earlier work by Gad et al. [27], Khoi et al. [28], and Hassan et al. [29], which demonstrated the effectiveness of the feature selection strategy described in Section 2.5.1.

2.5.1. Feature Selection

Based on correlation analysis above a predetermined threshold value, the following features were selected as input variables for the ML models: Chlorophyll ‘a’ (C₅₅H₇₂MgN₄O₅), ammonia (NH₃), nitrate (NO₃), nitrite (NO₂), total phosphorus (TP), hexavalent chromium (Cr-VI), aluminum (Al), barium (Ba), cadmium (Cd), total chromium (Cr), copper (Cu), iron (Fe), lead (Pb), manganese (Mn), mercury (Hg), nickel (Ni), and zinc (Zn).
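A correlation-threshold filter of this kind might look like the following minimal sketch. The threshold value is not reported in the text, so the value used here, the helper name `select_by_correlation`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def select_by_correlation(X, y, names, threshold):
    """Return the names of features whose |Pearson r| with the target is >= threshold."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    selected = []
    for j, name in enumerate(names):
        r = np.corrcoef(X[:, j], y)[0, 1]  # Pearson correlation with the target
        if abs(r) >= threshold:
            selected.append(name)
    return selected


# Synthetic stand-in data: two informative columns and one unrelated column.
rng = np.random.default_rng(0)
n = 92
no3 = rng.normal(size=n)
fe = rng.normal(size=n)
unrelated = rng.normal(size=n)
index = 2.0 * no3 - 1.5 * fe + 0.1 * rng.normal(size=n)
X = np.column_stack([no3, fe, unrelated])

# The threshold of 0.3 is an assumed value, not the one used in the study.
print(select_by_correlation(X, index, ["NO3", "Fe", "unrelated"], threshold=0.3))
```

The absolute value is used so that strongly negative correlations are retained as well as strongly positive ones.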

2.5.2. Data Preprocessing and Splitting

The dataset was preprocessed to handle null values and outliers. Rows containing null values were removed to avoid any adverse impact on model training, and outliers were replaced by imputation with the most typical values. Preprocessing was carried out with the scikit-learn preprocessing module in Python. The dataset was then randomly partitioned into a training set (70%, 65 samples), used for model development with five-fold cross-validation, and a hold-out test set (30%, 27 samples), used solely to assess generalization performance.

To facilitate model training, all input features were subsequently standardized using the StandardScaler method, which removes the mean and scales each feature to unit variance (Equation (12)), so that every feature has a mean of zero and a standard deviation of one. Here, z denotes the transformed value, x the original value, µ the mean of the feature, and σ its standard deviation.

z = \frac{x - µ}{σ}

(12)
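The split and the standardization in Equation (12) can be sketched as follows. The data are synthetic stand-ins with the same shape as the study dataset (92 samples, 17 features). Fitting the scaler on the training rows only is a common precaution against leakage and is an assumption here, since the text does not state which rows were used to estimate µ and σ.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.lognormal(size=(92, 17))  # synthetic stand-in for the 17 input features
y = rng.normal(size=92)           # synthetic stand-in for a water quality index

# Hold out 27 samples for testing, leaving 65 samples for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=27, random_state=42
)

# Estimate the mean and standard deviation on the training set only (assumption),
# then apply the same transformation to both sets.
scaler = StandardScaler().fit(X_train)
X_train_z = scaler.transform(X_train)
X_test_z = scaler.transform(X_test)

# Equation (12) written out by hand, for comparison with the library result.
z_manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(X_train_z, z_manual))
```

With this choice the test features are not exactly zero-mean, which is expected: they are scaled with statistics they did not contribute to.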

2.5.3. Cross Validation Procedure

Model stability was assessed using 5-fold cross-validation. The training data were divided into five folds of approximately equal size, and each model was trained and evaluated five times. In each iteration, four folds were used for training and the remaining fold for validation, with the validation fold rotated so that every fold served as the validation set exactly once. Performance metrics were recorded for each iteration, and their mean over the five iterations was used to compare the ML models. The cross-validation procedure was performed within the training set only; the hold-out test set was not used for any model selection, hyperparameter tuning, or validation during this process, thereby preserving its integrity as an unbiased final evaluation set. This separation helps ensure that test set performance reflects how the models generalize to unseen data.
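The rotation described above can be reproduced with scikit-learn's `KFold` and `cross_validate`. This is a minimal sketch on synthetic training data. The shuffling, the random seeds, and the choice of a decision tree as the example estimator are assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(65, 17))  # synthetic stand-in for the training set
y_train = 2.0 * X_train[:, 0] + rng.normal(scale=0.1, size=65)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    DecisionTreeRegressor(max_depth=3, random_state=0),
    X_train,
    y_train,
    cv=cv,
    scoring=("r2", "neg_root_mean_squared_error", "neg_mean_absolute_error"),
)

# scikit-learn reports error metrics negated, so the sign is flipped back here.
mean_r2 = scores["test_r2"].mean()
mean_rmse = -scores["test_neg_root_mean_squared_error"].mean()
mean_mae = -scores["test_neg_mean_absolute_error"].mean()

# Each training sample falls into exactly one validation fold.
fold_sizes = [len(val_idx) for _, val_idx in cv.split(X_train)]
print(fold_sizes, mean_r2, mean_rmse, mean_mae)
```

Only the 65 training samples enter this loop, so the hold-out test set stays untouched until the final evaluation.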

2.5.4. Model Architecture and Training

The training dataset was used to evaluate multiple hyperparameter combinations and to optimize the performance and generalization capability of the three ML models, using the scikit-learn library’s grid search combined with 5-fold cross-validation. For each model, the best configuration was the one that achieved the highest coefficient of determination (R²) and the lowest root mean squared error (RMSE) during validation. Hyperparameters are configuration settings that are fixed before training and are not learned from the data. Their appropriate selection and optimization is essential to the model’s overall performance and its ability to generalize to new, unseen data [30].

Artificial Neural Network (ANN)

The ANN models underwent thorough development to build their complete structure which includes input neural layers, hidden neural layers and output neural layers for understanding complex water quality parameter–environmental index relationships. Each perceptron unit in these networks behaves like a multiple linear regression model yet they surpass linear capability through their interconnected structure and activation functions. The study adopted the quasi-Newton optimization method as its main optimization technique for environmental modeling applications (Equation (13)) because of its superior convergence and computational efficiency characteristics.

The quasi-Newton method iteratively adjusts the connection weights between neurons, using gradient information to reduce prediction error, as explained by Yang et al. [31]. This strategy delivers reliable weight optimization, although, like other gradient-based methods, it does not guarantee convergence to a global minimum. In practice, the quasi-Newton update rule tends to converge consistently across datasets and environmental settings.

Several architectural and training settings that affect model performance and generalization were tuned systematically. The number of hidden layers was varied from 1 to 5 to determine how much depth was needed to capture the complexity of the water quality relationships without excessive overfitting. The number of neurons per hidden layer was varied from 2 to 10 to balance model complexity against computation time, limiting overfitting while retaining sufficient representational capacity.

Four activation functions were evaluated, namely Hyperbolic Tangent (Tanh), Logistic (Sigmoid), Rectified Linear Unit (ReLU), and Linear (Identity), to test different non-linear transformation capabilities in the network architecture [32]. A learning rate of 0.0001 was used to keep weight updates stable, avoiding oscillations and promoting smooth convergence. The maximum number of iterations was varied from 500 to 1000 to identify a training length that achieved convergence without excessive computation or overfitting to the training data.

The overall structure and configuration of an ANN is typically determined through systematic experimentation, empirical testing, and validation procedures, as emphasized by Mijwil [33], requiring careful balance between model complexity and generalization performance. This iterative approach ensures that the final model architecture is optimally suited for the specific characteristics of the water quality dataset and the environmental prediction tasks at hand.
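A search over the ANN settings described in this subsection could be set up as in the sketch below. It assumes scikit-learn's `MLPRegressor` with `solver="lbfgs"`, a limited-memory quasi-Newton solver, as a stand-in for the optimizer. The grid is deliberately reduced to keep the example fast, and the data are synthetic. In scikit-learn, `learning_rate_init` is only used by the `sgd` and `adam` solvers, so the learning rate of 0.0001 is not set here.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_train = rng.normal(size=(65, 17))  # synthetic stand-in for the training set
y_train = (
    np.tanh(X_train[:, 0]) + 0.5 * X_train[:, 1] + rng.normal(scale=0.05, size=65)
)

# Scaling sits inside the pipeline so that it is refit on each training fold.
pipe = make_pipeline(StandardScaler(), MLPRegressor(solver="lbfgs", random_state=1))

# Reduced illustrative grid: the study reports 1 to 5 hidden layers with
# 2 to 10 neurons each; only two layouts are searched here.
param_grid = {
    "mlpregressor__hidden_layer_sizes": [(2,), (10, 10)],
    "mlpregressor__activation": ["tanh", "logistic", "relu", "identity"],
    "mlpregressor__max_iter": [500, 1000],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)
print(search.best_params_)
```

The step name `mlpregressor` is generated automatically by `make_pipeline` from the estimator class name, which is why the grid keys carry that prefix.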

The quasi-Newton optimization process follows the mathematical relationship presented in Equation (13), which governs the iterative weight adjustment mechanism throughout the training process:

ω_{j + 1} ≔ ω_{j} - α \cdot L \cdot {(\frac{\partial L}{\partial ω_{j}})}^{- 1}

(13)

where ω_{j + 1} denotes the weights at the next iteration, ω_{j} the weights at the current iteration, α the learning rate, and \frac{\partial L}{\partial ω_{j}} the first partial derivative of the loss function L.

Decision Tree Model (DT)

The DT models are particularly appropriate for exploratory knowledge discovery applications as they do not require extensive parameter tuning or specialized domain expertise for effective implementation. The decision tree algorithm comprises several fundamental components including a root node, branches, decision nodes, and leaf nodes, all organized in a hierarchical tree-like structure that facilitates intuitive interpretation of the decision-making process. The decision nodes serve as critical decision points that terminate at the leaf nodes, which contain the final predictions or classifications. Each decision node makes specific decisions that dictate the path connecting one node to another, creating a logical flow of decision-making that mirrors human reasoning processes [34].

Parameter optimization was conducted methodically throughout model development, and the resulting optimal settings were used to build the best-performing model according to established tuning protocols [35]. Based on the analytical frameworks developed by Ahmed et al. [36], two essential hyperparameters were examined during decision tree model development. The maximum depth of the tree was varied from 1 to 20 levels to identify the best compromise between computational complexity and model generalizability. Furthermore, the mean squared error (MSE, the square of the RMSE defined in Equation (14)) and the mean absolute error (MAE, defined in Equation (16)) were employed as split criteria to assess the quality of individual splits within the tree, so that each branching choice contributes as much as possible to overall predictive performance.

Random Forest Model (RF)

The RF constitutes a widely adopted and exceptionally efficient ML algorithm utilized for regression and classification purposes in environmental analysis applications. This ensemble technique employs an integration of numerous decision trees to substantially improve the accuracy and dependability of forecasts when compared to single tree algorithms. At its fundamental core, a random forest comprises an ensemble of decision trees generated from random subsets of the available training data, which helps to reduce overfitting and improve generalization to new datasets.

This study systematically considered three important hyperparameters for the RF configuration. The number of trees in the ensemble was varied from 1 to 20 to identify the configuration that best balances computational cost and predictive precision. The maximum depth of each tree was likewise varied from 1 to 20 levels to control the complexity of every component tree. The criterion functions used for split evaluation were the MSE and MAE, as for the DT model. As recommended by Breiman (2001), the final random forest prediction was computed as the mean of the outputs of all component trees; this averaging improves overall accuracy and the ability of the model to generalize to new data, delivering stable and dependable forecasts for aquatic quality evaluation [37].

2.6. Models Evaluation

The three ML models (ANN, RF, and DT) were evaluated using established performance metrics: the coefficient of determination (R²), root mean squared error (RMSE), and mean absolute error (MAE). The metrics were calculated according to Equations (14)–(16) to quantify the discrepancy between observed values and the values predicted by each model, providing objective measures of model performance and reliability.

The mathematical relationships for these evaluation metrics are expressed as follows (Equations (14)–(16)):

RMSE = \sqrt{\frac{\sum_{i = 1}^{N} {(Y_{a} - Y_{p})}^{2}}{N}}

(14)

R^{2} = 1 - \frac{\sum {(Y_{a} - Y_{p})}^{2}}{\sum {(Y_{a} - \bar{Y})}^{2}}

(15)

MAE = \frac{1}{N} \sum_{i = 1}^{N} |Y_{a} - Y_{p}|

(16)

In these equations, Y_a denotes the observed values, Y_p the values predicted by the model, \bar{Y} the mean of the observed values, and N the number of data points being evaluated. Together, these metrics provide a comprehensive assessment of model accuracy and precision across different prediction scenarios.
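Equations (14)–(16) translate directly into code. The following sketch uses only the standard library; the function names and the four-point example are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, Equation (14)."""
    n = len(y_true)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / n)


def r2(y_true, y_pred):
    """Coefficient of determination, Equation (15)."""
    mean_obs = sum(y_true) / len(y_true)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
    ss_tot = sum((a - mean_obs) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


def mae(y_true, y_pred):
    """Mean absolute error, Equation (16)."""
    n = len(y_true)
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / n


# Hypothetical observed and predicted values; only the last prediction is off.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
print(rmse(observed, predicted), r2(observed, predicted), mae(observed, predicted))
```

In this example a single large error raises the RMSE to twice the MAE, which illustrates why RMSE is the more sensitive of the two to outlying predictions.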

2.7. Data Analysis

The physicochemical variables and WQIs underwent methodical examination through extensive statistical procedures to calculate fundamental statistical measures such as minimum, maximum, and average values for all assessed parameters. The Pearson correlation coefficient was methodically applied to determine the associations between WQI and physicochemical properties of water specimens, with statistical significance assessed at 0.05 and 0.001 probability levels to guarantee reliable statistical conclusions.

For thorough aquatic quality assessments, CA and PCA were implemented to improve the detection of significant pollutant elements in marine water specimens. This methodology converts intricate analytical data into identifiable patterns that support environmental understanding, following the analytical frameworks developed by Matiatos et al. and Rakotondrabe et al. [38,39]. CA and PCA were employed to identify the origins or influences responsible for the documented water quality variations; PCA does so by transforming the initial variables into a new set of uncorrelated components that capture the principal variance within the data.

PAST software (version 2.25) was methodically utilized to conduct statistical examination of physicochemical variables and WQIs, encompassing Pearson correlation coefficient calculations and thorough evaluation of analytical chemical results for both CA and PCA implementations. Spatial distribution charts were developed using Geographic Information System (GIS) approach version 10, utilizing inverse distance weighted interpolation (IDW) methods. IDW constitutes one of the most basic and commonly applied interpolation techniques for charting diverse environmental features across geographical areas, as evidenced in research by Hfaiedh et al., Burrough, and Watson [4,40,41]. Through ArcGIS’s IDW functionality (ArcGIS Pro 2.8.8), the statistical connections between established sampling sites were methodically determined, and the spatial distributions of trace elements across the study region were computed and displayed, delivering thorough spatial comprehension of pollution patterns and environmental circumstances.
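The idea behind IDW can be illustrated outside ArcGIS with a few lines of NumPy. This is a minimal sketch, not the ArcGIS implementation: the function name `idw`, the coordinates, the concentrations, and the distance power of 2 are all assumptions.

```python
import numpy as np


def idw(xy_known, values, xy_query, power=2.0):
    """Inverse distance weighted estimate at each query point."""
    xy_known = np.asarray(xy_known, dtype=float)
    values = np.asarray(values, dtype=float)
    xy_query = np.asarray(xy_query, dtype=float)

    # Pairwise Euclidean distances, shape (n_query, n_known).
    dist = np.linalg.norm(xy_query[:, None, :] - xy_known[None, :, :], axis=2)

    estimates = np.empty(len(xy_query))
    for i, row in enumerate(dist):
        coincident = row == 0.0
        if coincident.any():
            # The query point sits on a sampling site: return the measured value.
            estimates[i] = values[coincident][0]
        else:
            weights = 1.0 / row**power
            estimates[i] = np.sum(weights * values) / np.sum(weights)
    return estimates


# Hypothetical sampling sites (x, y) and concentrations in mg/L.
sites = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
conc = [0.010, 0.030, 0.020]
print(idw(sites, conc, [(1.0, 0.0), (0.5, 0.5), (0.0, 0.0)]))
```

Every estimate is a weighted average of the measured values, so an IDW surface never exceeds the sampled maximum or falls below the sampled minimum.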

3. Results and Discussion

3.1. Physicochemical Data

The assessment of seawater quality and of industrial impacts depends on understanding the physicochemical properties of the water. These parameters reflect natural variations in seawater composition and help identify pollution sources and their effects on marine ecosystems. Physicochemical data were analyzed to assess changes in temperature, pH, and salinity in the Mesaieed discharge areas over two years. The results provide essential information on how climatic factors, industrial waste, and natural oceanographic processes influence seawater quality.

Physicochemical parameters are essential tools for understanding water chemistry and quality. Table 2 presents the statistical description of trace elements and heavy metals in seawater samples collected near the Mesaieed discharge locations over two years. Temperature is one of the factors influencing seawater quality, since it regulates biological, physical, and chemical activity in seawater, and it is a crucial factor for aquatic life.

Temperature in natural water bodies varies significantly with climatic factors and geographical position, including air temperature, latitude, solar altitude, season, wind patterns, water depth, and heat exchange processes, particularly in shallow areas near land masses. Seawater temperature in this study showed considerable seasonal variation, ranging from 26.15 °C to 32.40 °C in summer (mean 29.421 °C) and from 16.13 °C to 19.5 °C in winter (mean 18.742 °C) across the two-year study period, as shown in Table 2. While the recorded temperatures typically fall within the favorable range for the majority of aquatic species, sharp thermal variations may have direct adverse impacts on fish communities according to the CCME [18] aquatic ecosystem protection standards.

Hydrogen ion concentration (pH) represents one of the most critical parameters affecting aquatic biota due to its fundamental influence on biochemical processes and ecosystem health. Living organisms demonstrate high dependency and sensitivity to pH variations, making it a crucial water quality indicator. Seawater pH values demonstrated stable alkaline conditions, varying from 8.48 to 8.72 with a mean of 8.60 during summer, and from 8.43 to 8.62 with a mean of 8.61 during winter across the study period as detailed in Table 2. The pH values consistently met the CCME [18] guidelines for aquatic life, confirming that the marine environment provides chemically suitable conditions for diverse ecosystems.

Water salinity in the Gulf ranges from 37 psu at the Strait of Hormuz to about 43 psu in the central part of the Arabian Gulf [42]. Higher salinity values are observed in the shallow intertidal lagoons and at Salwa Bay, where salinity frequently reaches 70 psu or above. The high evaporation rate in the Arabian Gulf and its circulation pattern are the most important factors controlling salinity along the Qatari coast. Seawater salinity measurements obtained during the present study are summarized in Table 2. The mean salinity was 45.60 psu across the summer and winter seasons over the two-year study period, with values in the collected samples ranging from 43.21 to 45.81 psu. These values indicate that seawater at Mesaieed is highly saline, owing to evaporation, which concentrates dissolved solutes, and to ongoing input from industrial wastewater discharge.

The research focused on trace elements and nutrients, including Chlorophyll ‘a’, NH₃, NO₃, NO₂, TP, Cr-VI, Al, Ba, Cd, Cr, Cu, Fe, Pb, Mn, Hg, Ni, and Zn. Table 2 summarizes the statistical description of these seawater quality parameters in MIC over the two years 2022–2023, while the raw data are provided in Tables S1–S4 in the Supplementary Materials.

In Mesaieed Industrial City (MIC), the two-year mean concentrations of Chlorophyll ‘a’, NH₃, NO₃, NO₂, TP, Cr-VI, Al, Ba, Cd, Cr, Cu, Fe, Pb, Mn, Hg, Ni, and Zn were 0.012, 0.021, 0.223, 0.017, 0.011, 0.00015, 0.0039, 0.011, 0.0011, 0.0101, 0.0005, 0.007, 0.0001, 0.0011, 0.0001, 0.0001, and 0.011 mg/L, respectively, following the trend NO₃ > NH₃ > NO₂ > Chlorophyll ‘a’ > TP > Ba > Zn > Cr > Fe > Al > Cd > Mn > Cu > Cr-VI > Pb > Hg > Ni. Based on current understanding, trace elements and heavy metals in marine waters originate from two primary sources: natural processes (geological weathering and soil erosion) and human activities (discharge of treated industrial wastewater effluents). The concentrations of trace elements in the analyzed water specimens varied considerably among samples, suggesting that the seawater experienced moderate contamination from these trace elements at concentrations approaching the threshold of recommended acceptable limits for aquatic ecosystem protection as established by CCME [18]. These outcomes, along with the findings of previous research, emphasize the need for effective long-term strategies to manage and reduce metal pollution in the seawater, as such pollution may pose a significant threat to the aquatic ecosystem and the overall environmental health of the Gulf region. Strengthening regulatory frameworks, improving wastewater treatment processes, and enhancing monitoring programs are crucial steps toward safeguarding marine biodiversity and ensuring sustainable water quality in MIC.

3.2. Aquatic Water Quality Indices (AWQI)

Standardized indices for water quality assessment enable scientists to evaluate the complete health status of aquatic ecosystems while detecting pollution source impacts on the environment. The AWQI evaluates seawater conditions through standardized assessment of multiple physicochemical parameters which produces a single numerical value for environmental assessment and management purposes [43]. The index functions as a vital instrument to determine if seawater conditions support aquatic life or need immediate environmental restoration and cleanup actions. The research evaluated AWQI values through systematic analysis during two years of monitoring to study seasonal patterns and spatial distribution patterns in MIC seawater environments [44].

Table 3 presents the AWQI statistics calculated over the two-year period for the summer and winter monitoring campaigns (top and bottom water samples) in MIC seawater. The AWQI values ranged from 82.00 to 108.32 across all sampling locations and periods. According to the AWQI assessment criteria detailed in Table 3, all samples collected from the 23 sampling points were classified as unfit for supporting healthy aquatic environments and are not recommended for aquatic life.

The spatial distribution maps show that AWQI scores within the MIC study area increase steadily and notably from the northeastern sector toward the eastern and southern parts of the investigation area [43]. This geographic trend suggests that marine water quality deterioration is most severe in locations adjacent to the downstream drainage system, particularly in zones where industrial effluent channels merge and at outfall locations where combined wastewater from MIC industrial facilities is discharged into the marine environment, as depicted in Figure 2a, Figure 3a, Figure 4a and Figure 5a. The continuous discharge of both treated and untreated industrial wastewater into the seawater system contributes significantly to the observed environmental degradation and declining water quality conditions.

These comprehensive findings emphasize the urgent need for implementing stricter environmental regulations and developing improved wastewater management strategies within the MIC industrial complex. Advanced treatment technologies, more stringent discharge controls, and regular comprehensive monitoring programs are crucial for reducing contamination levels and protecting marine ecosystems. Future research should concentrate on identifying particular pollution sources and developing new technologies to improve treatment efficiency and studying ecosystem-based restoration methods for marine environment rehabilitation. The Arabian Gulf region needs proactive environmental measures to protect marine life diversity while supporting sustainable industrial development that unites economic growth with environmental conservation.

3.3. Water Quality Indices (WQIs)

The WQIs function as fundamental analytical instruments which enable complete evaluation of heavy metal contamination in water bodies while providing standardized environmental health assessments. The HPI, MI and C_d serve as essential assessment tools which deliver crucial information about trace elements found in seawater ecosystems including their distribution and environmental effects. The indices enable the assessment of metal concentration risks to aquatic life while pinpointing locations that need immediate environmental management and remediation actions. This extensive research evaluates WQIs in MIC seawater throughout two years by analyzing seasonal patterns and spatial distribution to understand contamination dynamics thoroughly.

The HPI values ranged between 82.587 and 108.47, with an average of 83.70, across the summer and winter monitoring periods (top and bottom samples) in MIC seawater. Most seawater samples stayed below the critical HPI threshold, which indicates minimal metal pollution (Figure 2b, Figure 3b, Figure 4b and Figure 5b). The seawater samples exhibited MI values ranging from 1.705 to 5.82, indicating variable levels of metal contamination across different sampling locations and temporal periods.

The MI results indicate that metals had a significant impact on certain seawater samples. Spatial distribution mapping of the MI for the summer and winter monitoring periods (top and bottom samples) shows that metals exerted a larger impact on areas extending from the northeastern region toward the eastern and southern areas of the MIC seawater environment, as shown in Figure 2c, Figure 3c, Figure 4c and Figure 5c. These affected areas lie closer to industrial discharge points, suggesting an association between industrial activities and metal contamination levels.

The study region demonstrates a gradual escalation in metallic pollutant contamination extending from the northeastern zone toward the eastern and southern sectors, as thoroughly illustrated in Figure 2b,c, Figure 3b,c, Figure 4b,c and Figure 5b,c.

The computed C_d values ranged from −10.29 to −6.17 across all sampling seasons and depths. These strongly negative values arise because the measured concentrations of all analyzed metals were substantially below their respective CCME normative limits, causing each individual contamination factor (C_fi) to be negative. Mathematically, when C_Ai < C_Ni, the term (C_Ai/C_Ni − 1) is negative, and summing these negative factors over the analyzed metals yields a strongly negative aggregate C_d. This result confirms that the marine waters at MIC are currently uncontaminated with respect to the analyzed heavy metals when evaluated against CCME marine water quality guidelines. The classification threshold of C_d < 1 for low contamination encompasses all negative values, as they represent conditions where no metal exceeds its permissible limit (Figure 2d, Figure 3d, Figure 4d and Figure 5d).
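The arithmetic behind a negative C_d can be illustrated with a short sketch. The concentrations and limits below are hypothetical placeholders, not CCME guideline values or study measurements, and the function name `contamination_degree` is likewise an assumption.

```python
def contamination_degree(measured, limits):
    """Return (C_d, factors), where each factor is C_Ai / C_Ni - 1."""
    factors = {metal: measured[metal] / limits[metal] - 1.0 for metal in measured}
    return sum(factors.values()), factors


# Hypothetical concentrations and permissible limits in mg/L (illustration only).
measured = {"Cd": 0.001, "Pb": 0.002, "Zn": 0.010}
limits = {"Cd": 0.010, "Pb": 0.010, "Zn": 0.100}

c_d, factors = contamination_degree(measured, limits)
print(factors)
print(c_d)
```

Because concentrations cannot be negative, each factor is bounded below by −1, so C_d cannot fall below minus the number of metals included in the sum.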

Furthermore, all samples (100%) displayed minimal contamination levels with negative C_d outcomes (C_d < 1), which additionally indicates that water quality concerning trace elements has stayed within permissible boundaries for sustaining aquatic ecosystems. The C_d calculation results offer important insights regarding the scope of metallic contamination trends throughout the two-year investigation period. This pollution distribution is linked to the persistent and continual discharge of effluents, both processed and unprocessed, from industrial drainage channels and outfall locations, particularly concentrated within the MIC marine coastal zones, as evidenced in Figure 2d, Figure 3d, Figure 4d and Figure 5d.

The spatial distribution maps of the AWQI (Figure 2, Figure 3, Figure 4 and Figure 5) and the PI outcomes (Figure 6) for the summer and winter sampling periods (surface and bottom specimens) indicate a troubling deterioration in marine water quality for aquatic ecosystem support. The spatial pattern of the WQIs for MIC marine waters remained comparatively stable across the two-year investigation period, while the physicochemical variables showed a persistent adverse increase. The degradation of water quality in MIC marine waters, as demonstrated through HPI and MI evaluations, reveals heightened contamination levels and a considerable impact of heavy metals on the aquatic environment.

However, there are noticeable variations in evaluation approaches for analyzed parameters and metallic concentrations, specifically for hexavalent chromium (Cr-VI), aluminum (Al), barium (Ba), cadmium (Cd), total chromium (Cr), copper (Cu), iron (Fe), lead (Pb), manganese (Mn), mercury (Hg), nickel (Ni), and zinc (Zn), when compared to the Canadian Council of Ministers of the Environment (CCME) standards [18], which indicate acceptable levels for most parameters.

According to the classification of Pollution Index levels, the analytical data for seawater collected during summer (top and bottom samples) and winter (top and bottom samples) monitoring periods reveals the extent of trace element effects, as detailed in Table 4 and Table 5. The PI values demonstrate that no significant adverse influence was detected in the analyzed samples (Figure 6) during both seasonal monitoring periods, with PI values for measured parameters and trace elements remaining within acceptable ranges as presented in Table 4, Table 5 and Table 6.

The contamination index findings indicate that heavy metal concentrations in marine water specimens can be linked to processed and unprocessed industrial effluents from diverse manufacturing operations, insufficient wastewater treatment systems, and the combination of industrial discharges with uncontrolled anthropogenic activities. These results align with earlier investigations, including research by Goher et al. [45], which employed contamination indices to assess aquatic quality conditions in El Manzala Lake and determined that selected surface water specimens faced significant risks from metallic pollution.

The thorough results indicate that MIC marine water contamination levels tend to increase with time because industrial effluents from different sources continue to enter the environment without control. The AWQI and contamination indices provide a useful and practical method to evaluate marine water quality in aquatic ecosystems by analyzing both physicochemical properties and trace element levels.

The findings demonstrate the need for ongoing monitoring programs and effective pollution control measures to manage heavy metal contamination in MIC seawater. Protecting marine ecosystems for sustainable environmental management requires three essential actions: improving industrial wastewater treatment systems, enforcing stricter discharge limits, and enhancing monitoring frameworks.

3.4. Multivariate Analysis

3.4.1. Cluster Analysis (CA)

Multivariate statistical techniques are widely used in environmental studies to analyze complex datasets and identify patterns in water quality. These methods help in distinguishing pollution sources, assessing spatial and temporal variations, and understanding the interactions between different water quality parameters. The CA is a key technique that groups similar sampling locations based on shared characteristics, providing valuable insights into the contamination trends and underlying factors affecting seawater quality in MIC.

CA is a fundamental quantitative approach for evaluating similarities. Following hierarchical cluster analysis, the results were visualized as a dendrogram [17], which shows which clusters were combined at each analytical stage and the distance between clusters at each merge. The CA organized the investigated sampling locations into groups based on similarities within each group and differences between distinct groups (Figure 7a). The R-mode methodology was applied to generate the CA. These techniques were used to combine coherent sets of marine water specimens into meaningful clusters and to evaluate spatial similarities and location grouping among the sampling sites [46]. Ward’s linkage criterion was employed for the clustering procedure, with results displayed as a dendrogram and a two-dimensional diagram. The R-mode cluster analysis performed on chemical elements in the seawater samples generated three clusters (Figure 7a).
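An R-mode clustering with Ward's criterion can be sketched with SciPy, which is used here in place of PAST. The data are synthetic, the variable names are a hypothetical subset, and standardizing each variable before clustering is an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import zscore

rng = np.random.default_rng(4)
n_samples = 92

# Two synthetic latent sources, each driving three measured variables.
source_a = rng.normal(size=n_samples)
source_b = rng.normal(size=n_samples)
names = ["NH3", "TP", "Fe", "Cu", "Pb", "Ni"]  # hypothetical subset of variables
data = np.column_stack(
    [source_a + 0.1 * rng.normal(size=n_samples) for _ in range(3)]
    + [source_b + 0.1 * rng.normal(size=n_samples) for _ in range(3)]
)

# R-mode: the variables are clustered, so each standardized column
# becomes one observation (one row) for the linkage routine.
variables = zscore(data, axis=0).T
Z = linkage(variables, method="ward", metric="euclidean")

# Cut the tree into two groups and report the membership of each variable.
labels = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(names, labels)))
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` would draw the corresponding tree, with merge heights given by the third column of `Z`.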

The seventeen variables produced a dendrogram with two groups dominated by NO₂. Cluster 1 primarily comprises nutrient-related parameters, including NO₂, NH₃, TP, and Chlorophyll ‘a’ (Chl), together with the metals Fe, Al, Zn, and Ba. The second cluster was characterized by Mn, Ni, Cu, Cr, Pb, Cd, Hg, and Cr-VI. Although hazardous substances are present in both groups, their concentrations vary in a similar way, and the dendrogram reveals minimal Euclidean distances between the groupings [47]. The specimens were collected from different depths in the marine environment, indicating that sediments can absorb and retain substantial quantities of toxic pollutants such as heavy metals from the water column, to varying degrees throughout the aquatic ecosystem. The adsorption capacity is influenced by numerous factors within the sediment–water system, including pH, temperature, cation exchange capacity, ionic strength, surface area, particle size, mineralogical characteristics, and benthic organism activity.

These findings reinforce the importance of sediment monitoring in assessing metal contamination in marine environments. Understanding the interactions between seawater and sediments can help develop targeted remediation strategies to minimize heavy metal pollution. Future research should explore sediment dynamics further and assess their long-term impact on water quality and marine ecosystems in MIC.

3.4.2. Principal Component Analysis (PCA)

Multivariate statistical techniques such as PCA are widely applied in environmental studies to analyze complex datasets and identify patterns of contamination. PCA helps reduce dimensionality by transforming large datasets into a smaller number of components that explain most of the variance. This approach is particularly useful in assessing the influence of multiple factors on seawater quality and distinguishing between natural and anthropogenic sources of pollution. By applying PCA, this study aims to determine the most significant variables contributing to heavy metal contamination in MIC seawater.

PCA can analyze multivariate relationships and explain data variation by reducing a large number of variables to a small number of groupings based on principal component scores [48]. As described by Rencher, this methodology converts a dataset with several variables into a set of comprehensive principal components and is closely related to correlation and regression analysis. Researchers have used PCA in several fields because it enables a significant decrease in the number of variables and reveals structure in the interactions between variables [49]. The first step in using PCA to assess heavy metal contamination is to identify the principal components of the dataset. Because the principal components capture the bulk of the variance in the assessed indices, they can properly represent the levels of heavy metal contamination in the water. PCA seeks the linear combination of the variables in the dataset that has maximum variance. Each principal component score is a weighted sum of the measured variables, so the heavy metal concentrations in the seawater can be used to calculate the component scores and, from them, the level of heavy metal pollution.

The PCA of the metals yielded eight principal components that together explain 90.33% of the variance (Table 7 and Figure 7a–h). The eight factors (F1, F2, F3, F4, F5, F6, F7, and F8) explain 27.02, 18.09, 12.03, 9.90, 7.14, 5.79, 5.26, and 5.10% of the variance, respectively. F1 is represented by Cr (−0.573), Cu (−0.858), Mn (−0.961), Ni (−0.937), and Zn (−0.843). F2 consists of TP (−0.935), Ba (−0.927), and Cd (−0.864). F3, F4, F5, F6, F7, and F8 are represented by NO₂ (0.612), Hg (−0.603), Chl (0.611), NH₃ (0.572), NO₃ (0.675), and Al (−0.587), respectively. However, owing to their low concentrations in the seawater at both the top and bottom levels, some metals, such as iron and lead, contribute little to any component. Individual metal contamination of marine networks may also be caused by human activities, the natural dispersion of clay minerals in sediment, and the interaction between soil and water.

The strong negative loadings of Cr (−0.573), Cu (−0.858), Mn (−0.961), Ni (−0.937), and Zn (−0.843) in Factor 1 indicate a strong association of these metals with the same pollution source or geochemical process. In PCA, the sign of the loading reflects the direction of the relationship relative to the component axis rather than the strength or environmental significance of contamination. Therefore, the negative loadings observed in Factor 1 should be interpreted as an inverse orientation along the PCA axis and not as a reduction in pollution impact. The axis orientation in PCA is mathematically arbitrary and may be reversed without affecting the interpretation of the results.
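The point about sign can be checked numerically. The sketch below is illustrative only: it uses NumPy with synthetic data (three variables sharing one hypothetical source plus one unrelated variable), runs PCA on the correlation matrix, and confirms that reversing the sign of a component's loadings leaves its explained variance unchanged. Whether the authors used a correlation-based PCA is an assumption. The first lines also confirm that the eight reported factor shares add up to the stated 90.33%.

```python
import numpy as np

# Reported shares of variance for F1..F8 (values taken from the text above).
reported = [27.02, 18.09, 12.03, 9.90, 7.14, 5.79, 5.26, 5.10]
total_reported = round(sum(reported), 2)

rng = np.random.default_rng(1)
n = 200
source = rng.normal(size=n)  # hypothetical common source

# Three co-varying variables and one independent variable (synthetic).
X = np.column_stack(
    [source + 0.3 * rng.normal(size=n) for _ in range(3)] + [rng.normal(size=n)]
)

# PCA via eigendecomposition of the correlation matrix.
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]           # sort components by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

loadings = eigvecs * np.sqrt(eigvals)       # variable-component correlations
explained = eigvals / eigvals.sum()

# Flip the axis of component 1: every loading changes sign.
flipped = -loadings[:, 0]

# Share of variance = sum of squared loadings / number of variables.
var_original = np.sum(loadings[:, 0] ** 2) / X.shape[1]
var_flipped = np.sum(flipped ** 2) / X.shape[1]

print("reported total:", total_reported)
print("F1 loadings:", np.round(loadings[:, 0], 3))
print("explained (original, flipped):", var_original, var_flipped)
```

The three co-varying variables always share the same sign on the first component, whichever direction the solver happens to return; only their common sign is arbitrary.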

The grouping of Cr, Cu, Mn, Ni, and Zn within the same factor suggests a common origin, likely related to anthropogenic activities surrounding the study area, including industrial discharge, shipping operations, antifouling paints, fuel combustion, urban runoff, and metal-processing activities. Similar associations among these metals have been widely reported in marine environments influenced by industrial and harbor activities. The high absolute loading values indicate that these metals are major contributors to the variability explained by Factor 1 and therefore play an important role in controlling seawater quality in the investigated area.

The PI analysis revealed that cadmium and copper exerted small but measurable effects on the aquatic ecosystem, while chromium, lead, manganese, nickel, and zinc, despite being the primary drivers of variance in the PCA (Factor 1), demonstrated only low-to-moderate PI values, indicating that their current concentrations, though elevated relative to background levels, have not yet reached the threshold for severe contamination as defined by CCME guidelines. Iron demonstrated negligible contaminating influence. This distinction between statistical prominence in multivariate analysis and absolute contamination severity is important for accurate environmental risk communication.

It is important to note that the high PCA loadings for Mn, Ni, Cu, Zn, and Cr in Factor 1 reflect their strong co-variation across sampling sites, likely driven by a common industrial source, rather than necessarily indicating severe absolute contamination. This interpretation is consistent with the PI results, which classify these metals within the low-to-moderate pollution range.

These findings underscore the importance of identifying pollutant sources to improve water quality management. Future research should focus on long-term monitoring, source apportionment, and ecological risk assessments to develop effective mitigation strategies. Addressing contamination at its source will play a crucial role in ensuring sustainable marine ecosystem health in the MIC region.

3.4.3. The Performance of Machine Learning (ML) Models to Predict the WQI

The growing complexity of environmental data calls for sophisticated analytical methods that improve prediction precision and support decision-making. ML models have proven to be strong analytical tools for water quality assessment. They process extensive datasets to reveal concealed patterns, which enables dependable WQI predictions from multiple environmental parameters [50]. This research combines ML approaches to improve the efficiency of water quality monitoring and to support proactive environmental management strategies.

To forecast WQIs effectively, three distinct ML models were developed and implemented: DT, RF, and ANN. These ML models were created using the Spyder software environment and the Python scikit-learn module, which provides comprehensive tools for ML applications.

Based on correlation analysis (Figure 8) above a predetermined threshold value, the following features were selected as input variables for the ML models: Chlorophyll ‘a’ (C₅₅H₇₂MgN₄O₅), ammonia (NH₃), nitrate (NO₃), nitrite (NO₂), total phosphorus (TP), hexavalent chromium (Cr-VI), aluminum (Al), barium (Ba), cadmium (Cd), total chromium (Cr), copper (Cu), iron (Fe), lead (Pb), manganese (Mn), mercury (Hg), nickel (Ni), and zinc (Zn).

The optimal R² threshold for feature selection was determined separately for each WQI. For AWQI prediction, features were selected using R² thresholds of 0.01, 0.05, and 0.05 for ANN, RF, and DT, respectively. For HPI, the corresponding thresholds were 0.05, 0.20, and 0.20. For MI, the corresponding thresholds were 0.25, 0.10, and 0.10. For C_d, the thresholds were 0.25, 0.25, and 0.25 for the same three models.
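One plausible reading of this selection step is sketched below. The text does not specify how R² was computed, so the sketch assumes it is the squared Pearson correlation between each candidate feature and the target index, and that a feature is kept when its R² is at or above the threshold. The function name `select_by_r2` and the synthetic data are hypothetical; the thresholds 0.01 and 0.05 are the AWQI values quoted above.

```python
import numpy as np


def select_by_r2(X, y, names, threshold):
    """Keep features whose squared Pearson correlation with y meets the threshold."""
    selected = []
    for j, name in enumerate(names):
        r = np.corrcoef(X[:, j], y)[0, 1]
        if r ** 2 >= threshold:
            selected.append(name)
    return selected


rng = np.random.default_rng(2)
n = 120
names = ["Mn", "Ni", "TP", "noise"]  # hypothetical subset of the 17 inputs
X = rng.normal(size=(n, len(names)))

# Synthetic target: strongly driven by Mn, weakly by Ni and TP.
y = 2.0 * X[:, 0] + 0.8 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(scale=0.5, size=n)

loose = select_by_r2(X, y, names, threshold=0.01)   # ANN threshold for AWQI
strict = select_by_r2(X, y, names, threshold=0.05)  # RF and DT threshold for AWQI

print("R2 >= 0.01:", loose)
print("R2 >= 0.05:", strict)
```

A lower threshold admits more weakly related inputs, which may explain why the threshold was tuned per model and per index.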

The ML models showed strong predictive capabilities for WQI forecasting according to the results presented in Table 8. The models successfully detected intricate relationships between physicochemical parameters and WQI variations, which allowed them to predict water quality trends precisely. The testing-phase results in Figure 9a–d, which compare observed and predicted values, demonstrate the strong performance of the ML models in assessing seawater contamination levels.

The research demonstrates why ML applications remain essential for environmental monitoring and management. Future research needs to improve model performance through dataset expansion, input variable optimization, and the investigation of alternative ML techniques such as support vector machines (SVM). The integration of real-time monitoring systems with AI-driven predictive models will play a crucial role in safeguarding aquatic ecosystems and optimizing sustainable water resource management in industrial regions like MIC.

The artificial neural network (ANN-AWQI) model achieved R² values of 0.95 and 0.90, RMSE values of 0.86 and 1.28, MAE values of 0.85 and 1.27 during training and validation periods for AWQI prediction. The ANN-AWQI model uses two hidden layers with 5 neurons each and ReLU activation across 700 iterations as shown in Figure 10a. The ANN-HPI model achieved R² values of 0.95 and 0.89, RMSE values of 0.93 and 1.34, and MAE values of 0.41 and 0.99, respectively, for HPI prediction. The ANN-HPI architecture implements the ReLU activation function throughout 500 iterations and consists of two hidden layers with 4 neurons each which is illustrated in Figure 10b. The ANN-MI model achieved remarkable R² values of 0.99 and 0.98, RMSE values of 0.08 and 0.10, and MAE values of 0.05 and 0.06 for MI prediction. The ANN-MI architecture presents the Identity activation function during 500 iterations with one hidden layer containing 8 neurons (Figure 10c). The ANN-C_d model reached outstanding R² values of 0.99 and 0.94 along with RMSE values of 0.08 and 0.17 and MAE values of 0.05 and 0.16 for C_d prediction. The ANN-C_d architecture consists of a single hidden layer with 7 neurons and Identity activation function which runs through 500 iterations as depicted in Figure 10d.
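The four architectures reported above map naturally onto scikit-learn's `MLPRegressor`, as in the sketch below. Only the hidden-layer sizes, activation functions, and iteration counts come from the text; the input scaling, the default solver, the random seed, and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Architecture settings as reported in the text.
ann_settings = {
    "AWQI": dict(hidden_layer_sizes=(5, 5), activation="relu", max_iter=700),
    "HPI": dict(hidden_layer_sizes=(4, 4), activation="relu", max_iter=500),
    "MI": dict(hidden_layer_sizes=(8,), activation="identity", max_iter=500),
    "C_d": dict(hidden_layer_sizes=(7,), activation="identity", max_iter=500),
}

# Assumption: inputs are standardized before entering the network.
models = {
    index: make_pipeline(StandardScaler(), MLPRegressor(random_state=0, **kwargs))
    for index, kwargs in ann_settings.items()
}

# Synthetic stand-in for the 17 selected input variables.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 17))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=60)

# Fitting may emit a ConvergenceWarning on such a small synthetic sample.
models["AWQI"].fit(X, y)
mlp = models["AWQI"][-1]
print("layers (input + hidden + output):", mlp.n_layers_)
```

With an identity activation, as reported for MI and C_d, the network reduces to a linear model regardless of the number of hidden neurons.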

The research demonstrates that ANNs achieve high accuracy when predicting WQIs. ANN-based approaches demonstrate robustness and reliability in complex environmental data modeling through their high R² values and low RMSE and MAE scores in all models. The different network configurations and activation methods used for each WQI component show that optimizing network parameters according to WQIs components leads to better prediction outcomes. Real-time data integration combined with model architecture optimization along with ensemble learning techniques will enhance prediction accuracy and extend the environmental monitoring capabilities of ANN models. The advancements will enhance water resource management approaches that sustain marine ecosystems throughout industrial areas like MIC during extended periods.

The study confirmed the predictive strength of RF models for WQI forecasting. Table 9 and Figure 11a–d present the findings obtained from the testing process. The random forest (RF-AWQI) model used for AWQI prediction achieved R² values of 0.93 and 0.88, RMSE values of 1.05 and 1.37, and MAE values of 0.91 and 1.31 during the training and testing phases, respectively. It used 13 trees with a maximum depth of 5. The RF-HPI model achieved R² values of 0.95 and 0.91, RMSE values of 0.89 and 1.17, and MAE values of 0.37 and 0.86, respectively. The model consisted of 15 trees with a maximum depth of 6. The RF-MI model achieved R² values of 0.98 and 0.97, RMSE values of 0.12 and 0.11, and MAE values of 0.05 and 0.06, respectively. The model used 11 trees with a maximum depth of 7. The RF-C_d model attained R² values of 0.99 and 0.92, RMSE values of 0.07 and 0.19, and MAE values of 0.04 and 0.18, respectively, with 11 trees and a maximum depth of 5. These models used MSE as the criterion function.
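For readers who wish to reproduce this kind of table, the sketch below shows how the reported forest sizes and depths could be configured in scikit-learn and how R², RMSE, and MAE can be computed for the training and testing portions. The data are synthetic and the helper `report` is hypothetical; the squared-error split criterion is scikit-learn's default for regression forests, which matches the MSE criterion mentioned in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Tree counts and maximum depths as reported in the text.
rf_settings = {
    "AWQI": dict(n_estimators=13, max_depth=5),
    "HPI": dict(n_estimators=15, max_depth=6),
    "MI": dict(n_estimators=11, max_depth=7),
    "C_d": dict(n_estimators=11, max_depth=5),
}


def report(model, X, y):
    """Return the three metrics used in the paper for one data split."""
    pred = model.predict(X)
    return {
        "R2": r2_score(y, pred),
        "RMSE": float(np.sqrt(mean_squared_error(y, pred))),
        "MAE": float(mean_absolute_error(y, pred)),
    }


# Synthetic stand-in for the 17 inputs and one index.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 17))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.3, size=100)
X_train, y_train, X_test, y_test = X[:70], y[:70], X[70:], y[70:]

rf = RandomForestRegressor(random_state=0, **rf_settings["AWQI"]).fit(X_train, y_train)
train_metrics = report(rf, X_train, y_train)
test_metrics = report(rf, X_test, y_test)
print("train:", train_metrics)
print("test: ", test_metrics)
```

Because RMSE is never smaller than MAE on the same predictions, the two columns of such a table offer a quick consistency check.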

The good results of the RF models in predicting WQIs show that ensemble learning techniques are useful for modeling non-linear relationships and complex interactions among water quality parameters. The high R² values and low RMSE and MAE values across all models demonstrate that RF is an effective alternative to ANN for water quality prediction. The benefits of RF include its ability to handle missing values, its resistance to overfitting through bootstrapping, and its feature importance scores, which make it a useful tool for environmental monitoring. Optimizing hyperparameters, incorporating real-time datasets, and integrating RF with other ML techniques would enhance prediction accuracy and expand its applicability. These advancements will support data-driven decisions on sustainable water resource management and help reduce the industrial impact on marine ecosystems like MIC.

The research also tested DT as a possible method for forecasting WQIs for the seawater in MIC. Decision trees are widely adopted because they are simple to understand, easy to interpret, and can handle both categorical and continuous data. The method is particularly effective in modeling complex non-linear relationships between predictors and response variables, making it highly suitable for environmental data analysis. The DT models provided strong predictive results, which helped identify the water quality parameters of importance while offering a useful tool for environmental management decision-making. The results showed that decision tree algorithms can predict WQIs accurately, as presented in Table 10 and Figure 12a–d. The results were recorded during the evaluation phase. During the training and validation stages, the decision tree (DT-AWQI) model for AWQI prediction achieved R² values of 0.95 and 0.90, RMSE values of 0.86 and 1.28, and MSE values of 0.85 and 1.27, respectively. The DT-AWQI model employed a maximum depth of 5 levels. The DT-HPI model for HPI prediction achieved R² values of 0.95 and 0.93, RMSE values of 0.86 and 1.03, and MSE values of 0.24 and 0.75, respectively. The DT-HPI model used a maximum depth of 5 levels. The DT-MI model for MI prediction achieved an RMSE of 0.29 and an R² of 0.82 in both phases, and MSE values of 0.08 and 0.09 for training and testing, respectively. The DT-MI model implemented a maximum depth of 7 levels. The DT-C_d model for C_d prediction achieved R² values of 0.99 and 0.93, RMSE values of 0.06 and 0.18, and MSE values of 0.03 and 0.17, respectively. The DT-C_d model operated at a maximum depth of 2 levels. These algorithms used MSE as the evaluation metric.
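The interpretability argument can be made concrete with a short sketch. It configures scikit-learn decision trees at the maximum depths reported above and prints the learned rules for the shallowest one (depth 2, as reported for C_d). The data are synthetic and the feature labels simply reuse the 17 input names; nothing here reproduces the authors' fitted trees.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Maximum depths as reported in the text.
dt_depths = {"AWQI": 5, "HPI": 5, "MI": 7, "C_d": 2}

names = ["Chl", "NH3", "NO3", "NO2", "TP", "Cr-VI", "Al", "Ba", "Cd",
         "Cr", "Cu", "Fe", "Pb", "Mn", "Hg", "Ni", "Zn"]

# Synthetic stand-in: the target depends mainly on two of the inputs.
rng = np.random.default_rng(5)
X = rng.normal(size=(80, len(names)))
y = 1.5 * X[:, names.index("Mn")] + X[:, names.index("Ni")] \
    + rng.normal(scale=0.2, size=80)

tree = DecisionTreeRegressor(max_depth=dt_depths["C_d"], random_state=0).fit(X, y)

# A depth-2 tree has at most four leaves, so the full rule set is readable.
rules = export_text(tree, feature_names=names)
print(rules)
```

A tree this shallow can be inspected line by line, which is the practical meaning of "easy to interpret"; the deeper trees trade some of that transparency for flexibility.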

The DT models forecasted WQIs robustly because they are able to handle complex environmental data with precision. The consistent high R² values and low RMSE and MSE scores in all models demonstrate that decision trees can be a useful tool for water quality monitoring. The results confirm that decision tree models are versatile in environmental assessments and can be used as a reliable alternative to other ML methods like ANN and RF. Future research could investigate ensemble decision tree approaches or hybrid models to boost prediction accuracy especially in real-time monitoring systems for sustainable water management in marine areas like MIC.

The study of environmental science has witnessed increasing interest in ML models because they show promise for WQIs prediction. These models, ANN, RF, and DT, are recognized for their ability to handle complex, multidimensional data and their flexibility in modeling non-linear relationships. ML models learn from historical data to generate precise and trustworthy predictions of WQIs which helps optimize water quality monitoring operations and enables better decision-making. Our research findings contribute to this expanding body of knowledge by showing how ML models excel in environmental tasks.

To prevent the models from overfitting to the training data, cross-validation was applied for hyperparameter optimization. The dataset was randomly partitioned into five folds. In each validation round, one fold served as the validation set while the remaining four were used for training; this process was repeated until each fold had been used exactly once as the validation set. The final model was selected based on the average performance across folds, prioritizing high R² and low RMSE. GridSearchCV was employed for this purpose: it takes the ML model, the hyperparameter grid, and the number of folds as inputs and returns the best estimator with its optimal hyperparameters.

Thirty percent of the data was held out as a separate test set (70/30 train/test split). The training portion (70%) was used exclusively for cross-validation. Results of the final models on both training and test sets are reported in Table 8, Table 9 and Table 10. The small discrepancy between training and test performance (R², RMSE, and MAE) suggests that overfitting is unlikely, justifying our choice of k = 5 folds. In addition, Figure 10, Figure 11 and Figure 12 present scatter plots of predicted versus actual values, showing a clear positive linear trend (indicating low variance). While a few points deviate from the main cluster (suggesting minor bias), the overall relationship remains approximately linear.
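The workflow of the two preceding paragraphs (a 70/30 hold-out split followed by five-fold cross-validated grid search on the training portion only) might look as follows in scikit-learn. The grid values, random seeds, and synthetic data are illustrative assumptions; the text does not list the actual search grids.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Synthetic stand-in for the 17 inputs and one index.
rng = np.random.default_rng(6)
X = rng.normal(size=(100, 17))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.3, size=100)

# 70/30 split: the test set is never seen during the search.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

# Five folds, each used exactly once as the validation set.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Hypothetical grid spanning the tree counts and depths reported for RF.
param_grid = {"n_estimators": [11, 13, 15], "max_depth": [5, 6, 7]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=cv,
    scoring={"r2": "r2", "rmse": "neg_root_mean_squared_error"},
    refit="r2",  # the final estimator is refit using the best mean R2
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("mean CV R2:", search.best_score_)
print("hold-out R2:", search.best_estimator_.score(X_test, y_test))
```

Keeping the 30% test set outside `search.fit` is what makes the train/test comparison in the tables a fair indication of overfitting.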

Across the four indices (AWQI, HPI, MI, and C_d), the Diebold–Mariano (DM) test results show that no single model consistently outperforms the others; instead, the best model varies by index, as shown in Table 11. For AWQI, the DM statistics (ANN vs. RF: −1.1745, p = 0.2433; ANN vs. DT: 0.2319, p = 0.8171; RF vs. DT: 1.2370, p = 0.2193) indicate that RF is directionally preferred over both ANN and DT, though none of the differences is statistically significant. For HPI, ANN directionally outperforms RF (DM = 0.6134, p = 0.5412) and significantly outperforms DT (DM = 2.5119, p = 0.0138), while RF also beats DT directionally (DM = 0.4961, p = 0.6210), making ANN the preferred model for that index. For MI, the ANN vs. DT (−1.4646) and RF vs. DT (−1.3541) comparisons indicate that DT outperforms both, while the ANN vs. RF comparison (−1.0665) indicates that RF outperforms ANN. Thus, DT is the best-performing model for MI. For C_d, RF shows a highly significant advantage over ANN (ANN vs. RF: DM = −3.3748, p = 0.0011) and a directional advantage over DT (RF vs. DT: DM = 1.6586, p = 0.1006), while DT holds a near-significant advantage over ANN (ANN vs. DT: DM = −1.9668, p = 0.0523). The strongest evidence is for RF over ANN in C_d and for ANN over DT in HPI. In practice, the optimal model should be chosen per index: RF for AWQI and C_d, ANN for HPI, and DT for MI. When a single model must be selected across all indices, RF is the most defensible choice: it leads on two indices and remains competitive on the rest.
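To make the reading of these statistics explicit, the sketch below implements a basic form of the DM test in plain Python. Several choices are assumptions, since the text does not state them: squared-error loss, no autocorrelation correction (reasonable for cross-sectional samples), and a two-sided p-value from the normal approximation, so the p-values will not match Table 11 exactly. The sign convention is chosen to agree with the interpretation given above, where a positive statistic for "A vs. B" favors A and a negative one favors B.

```python
import math


def diebold_mariano(actual, pred_a, pred_b):
    """Compare model A against model B; positive statistic means A has lower loss."""
    # Loss differential per sample: loss of B minus loss of A (squared error).
    d = [(y - pb) ** 2 - (y - pa) ** 2 for y, pa, pb in zip(actual, pred_a, pred_b)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)
    stat = mean_d / math.sqrt(var_d / n)
    # Two-sided p-value under the standard normal approximation.
    p_value = math.erfc(abs(stat) / math.sqrt(2))
    return stat, p_value


# Hypothetical example: model A is close to the truth, model B is biased.
actual = [float(i) for i in range(30)]
pred_a = [y + 0.1 * (-1) ** i for i, y in enumerate(actual)]
pred_b = [y + 1.0 + 0.2 * (i % 3) for i, y in enumerate(actual)]

stat_ab, p_ab = diebold_mariano(actual, pred_a, pred_b)
stat_ba, p_ba = diebold_mariano(actual, pred_b, pred_a)
print("A vs. B:", stat_ab, p_ab)
print("B vs. A:", stat_ba, p_ba)
```

Swapping the order of the two models flips the sign of the statistic but not the p-value, which is why each pairwise comparison needs to be read together with its stated order.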

The ML algorithms demonstrate consistent performance during the training and validation phases, which indicates reliable and stable WQI forecasting ability. Recent research shows that advanced regression techniques, including ANN, RF, and DT, can precisely predict WQIs. Hassan et al. [29] used the RF algorithm to predict WQI from diverse trace elements and obtained 98.99% precision. Khoi et al. [28] studied WQI prediction with ANN, RF, and DT algorithms and reported R² values ranging between 0.68 and 0.99. Bui et al. [51] employed RF and DT algorithms to create forecasts that reached R² values of 0.93 and 0.87, respectively, using physicochemical properties. El Bilali et al. [52] achieved R² values of 0.92 for both ANN and RF algorithms when predicting WQIs from physical variables. This body of work demonstrates the ability of ML models to identify the input parameters that matter for precise predictions, thereby simplifying WQI modeling [53]. ML models serve as an efficient instrument for WQI evaluation and automated WQI calculation, which leads to substantial reductions in time and effort. This approach is a robust alternative to conventional WQI calculation, which demands intricate calculations together with multiple sub-index equations. The research recommends ML models to resource managers and water quality monitoring organizations because of their reliable and thorough results.

The ML models demonstrate robust WQI prediction ability because they maintain high performance during training and testing phases. ML models deliver a substantial advantage through their improved prediction accuracy and reduced computational needs which surpass traditional WQI calculation methods that demand complex calculations and multiple sub-index formulas. Our research along with previous studies demonstrates how these models assist resource managers and water quality monitoring agencies to optimize their monitoring programs while making better decisions. The future implementation of ML models for water quality assessments will play a crucial role in developing more effective and precise environmental management strategies.

3.4.4. Regional Benchmarking, Policy Implications, and Management Relevance

The heavy metal concentrations and pollution index values recorded at MIC acquire their full significance when placed in the context of comparable industrial coastal systems across the Arabian Gulf. Along the Al-Khobar coast, Alharbi et al. [54] found that average Zn, Fe, Mn, Cu, As, and Cr concentrations exceeded those of several worldwide seas and gulfs, with the highest levels concentrated in sheltered embayments near desalination plants and industrial facilities, a spatial pattern that mirrors the elevated HPI and MI values observed in the northeastern and eastern sectors of MIC in the present study. A broader assessment of 22 western Arabian Gulf coastal sites by Amin and Almahasheer [55] found that 82% of locations were non-polluted to slightly polluted according to the Pollution Index, with Cd emerging as the most frequently polluting metal, consistent with the present study’s PI results. In Kuwait Bay, Nour et al. [56] documented moderate heavy metal contamination in summer and low contamination in winter, the same seasonal intensification pattern observed at MIC, and attributed it to oil refining, fertilizer manufacturing, and shipping activities directly analogous to those operating at MIC. Within Qatar itself, Ghanimeh et al. [57] reported Cu and Ni contamination factor (CF) values of 12 and 60, respectively, in Doha Bay waters, both exceeding the high-risk threshold of CF = 6. The Gulf-wide meta-analysis of Swetha et al. [58] further confirmed that the Arabian Gulf is characterized by low-to-moderate contamination overall, with localized industrial hotspots, and recommended continuous monitoring and scientifically informed waste management strategies. Taken together, these comparisons position MIC as a boundary-state industrial coastal system: less severely impacted than Dammam or Doha Bay, but more impacted than the majority of western Gulf sites, and on a trajectory of worsening contamination given the ongoing industrial expansion.

These findings carry direct policy implications for environmental governance in Qatar. Ghanimeh et al. [57] called for the establishment of site-specific GCC background values and the enforcement of national pollution risk indicators through environmental licensing and industrial discharge permitting. These recommendations apply with equal force to MIC, given the absence of locally adapted marine water quality standards across the GCC. The MIC Environmental Guidelines specify individual discharge limits for heavy metals; however, the cumulative effect of simultaneous discharges from numerous industrial facilities creates a combined pollution load that individual permits are not designed to manage. We therefore recommend that the Qatar Ministry of Environment and Climate Change implement a Total Maximum Daily Load (TMDL) framework for MIC marine receiving waters, setting aggregate daily limits for Mn, Ni, Cr, and key nutrients. Saudi Arabia’s Executive Regulations for the Protection of Aqueous Media already employ a screening model for mixing zone determination in the Arabian Gulf, with dilution factors of 16 required for industrial areas. This regulatory approach could be adapted by Qatar. Following the approach demonstrated by Painting et al. [59], who used 15 years of monitoring data from 27 sites in Bahrain’s coastal waters to derive locally calibrated baseline thresholds, MIC should establish near-pristine offshore reference stations. These stations would enable detection of deterioration trends before absolute thresholds are breached. Under Qatar’s obligations as a signatory to the Kuwait Regional Convention and its Protocol for the Protection of the Marine Environment against Pollution from Land-Based Sources, MIC monitoring data should be formally reported to the Regional Organization for the Protection of the Marine Environment (ROPME) regional database. This reporting would enable Gulf-wide trend analysis consistent with the framework of Swetha et al. [58].

The ML models developed in this study provide ready-made, cost-effective tools for near-real-time enforcement of these targets: the RF model for AWQI and C_d prediction (R² = 0.882 and 0.920, respectively), and the ANN model for HPI prediction (R² = 0.887). ROPME is currently implementing a Regional Action Plan on Ecosystem-Based Management (EBM). We recommend that MIC adopt an EBM approach that considers not only physicochemical thresholds but also biological indicators (e.g., bivalve bioaccumulation, algal community shifts) to capture sub-lethal ecological effects. Given that PCA Factor 1 and Factor 2 likely reflect both water column and sediment-phase contamination, coupled water–sediment monitoring programs are essential, as marine sediments act as the ultimate sink for heavy metals in the Arabian Gulf. Future research should extend this integrated framework to other Qatari industrial zones (e.g., Ras Laffan) and across the GCC to develop a harmonized regional assessment protocol.

3.4.5. Limitations of the Study

Although the sampling campaign was carefully designed to minimize temporal variability, seawater samples were collected over two consecutive days to accommodate the large number of sampling stations and to capture both low- and high-tide conditions. Consequently, short-term environmental fluctuations, including minor variations in meteorological and hydrodynamic conditions, may have introduced limited temporal variability between sampling periods. However, sampling was conducted under relatively stable weather conditions and within a short time interval to reduce these effects. Therefore, the observed spatial and tidal variations are considered representative of the study area during the investigation period.

4. Conclusions

This study established an integrated assessment framework for evaluating marine water quality in the industrial coastal zone of Mesaieed Industrial City (MIC), Qatar, through the combined use of water quality indices, multivariate statistical techniques, and machine learning models. The results demonstrated that industrial effluents, desalination brine discharge, and associated nutrient enrichment are the primary factors shaping the spatial distribution of water quality deterioration across the study area. Areas located near major discharge outlets exhibited the highest levels of environmental stress, highlighting the influence of long-term industrial activities on the surrounding marine ecosystem. The application of multiple pollution indices provided a comprehensive understanding of contamination status and source attribution. While overall contamination levels remained within low-to-moderate categories relative to established regulatory guidelines, elevated contributions from Mn, Ni, Cr, Pb, and Zn revealed a distinct industrial signature. Furthermore, PCA explained 90.33% of the total dataset variance and successfully differentiated between metal-related and nutrient-related pollution sources, demonstrating the effectiveness of multivariate approaches for interpreting complex environmental datasets. The machine learning analysis confirmed the strong potential of data-driven approaches for rapid and reliable prediction of marine water quality conditions. Among the evaluated models, RF achieved the highest predictive accuracy for AWQI (R² = 0.88) and C_d (R² = 0.92), whereas ANN performed best for HPI (R² = 0.89), and DT yielded the most accurate predictions for MI (R² = 0.82). Although each algorithm exhibited strengths for specific indices, RF consistently provided the most stable and transferable performance across the entire assessment framework, supporting its suitability as a generalized predictive tool for marine environmental monitoring.

The practical significance of this work lies in demonstrating that machine learning models can effectively transform routinely measured physicochemical parameters into accurate estimates of environmental quality indices, substantially reducing the time, effort, and cost associated with conventional monitoring approaches. The proposed framework offers a scalable decision support system for environmental authorities and coastal managers seeking to improve pollution surveillance, risk assessment, and regulatory oversight in industrialized marine environments. Future studies should extend the monitoring period to capture long-term temporal dynamics, incorporate sediment quality and biological indicators, and explore advanced artificial intelligence techniques, including ensemble and deep learning models, to enhance predictive performance and support ecosystem-based management of coastal resources.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/su18126140/s1.

Author Contributions

Conceptualization, M.G., A.A.E.-S.M.A., M.K.F., E.A.E.-F. and S.E.; fieldwork, A.A.E.-S.M.A.; methodology, M.G., S.E., M.S.A.E.-b., A.G., M.H.E., A.A.E.-S.M.A., M.F.T. and O.E.; software, M.G., S.E., A.A.E.-S.M.A., M.S.A.E.-b. and A.G.; validation, M.K.F., O.E., M.F.T. and E.A.E.-F.; formal analysis, A.A.E.-S.M.A.; investigation, A.A.E.-S.M.A.; resources, A.A.E.-S.M.A., O.E. and M.F.T.; data curation, A.A.E.-S.M.A.; writing – original draft preparation, A.A.E.-S.M.A., M.K.F., E.A.E.-F., S.E., M.S.A.E.-b. and M.G.; writing – review and editing, A.A.E.-S.M.A., M.K.F., E.A.E.-F., S.E., M.G., M.F.T. and O.E.; supervision, M.K.F., E.A.E.-F., S.E. and M.G.; project administration, A.A.E.-S.M.A., M.K.F., E.A.E.-F., S.E. and M.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data are provided as tables and figures in the manuscript and Supplementary Materials (Tables S1–S4).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Haseeba, K.P.; Aboobacker, V.M.; Vethamony, P.; Al-Khayat, J.A. Water and Sediment Characteristics in the Avicennia marina Environment of the Arabian Gulf: A Review. Mar. Pollut. Bull. 2025, 216, 117963.
Shaheen, M.E.; Gagnon, J.E.; Barrette, J.C.; Keshta, A.E. Evaluation of Pollution Levels in Sediments from Lake Edku, Egypt Using Laser Ablation Inductively Coupled Plasma Mass Spectrometry. Mar. Pollut. Bull. 2024, 202, 116387.
Rose, J.B.; Örmeci, B.; Aw, T.G. Water Quality and Health: An Ecological Perspective. Water Ecol. 2025, 1, 100007.
Hfaiedh, E.; Gaagai, A.; Petitta, M.; Ben Moussa, A.; Mlayah, A.; Eid, M.H.; Szűcs, P.; Elsayed, S.; El-baki, M.S.A.; Elbeltagi, A.; et al. Hydrogeochemical Characterization and Water Quality Evaluation Associated with Toxic Elements Using Indexing Approaches, Multivariate Analysis, and Artificial Neural Networks in Morang, Tunisia. Environ. Earth Sci. 2025, 84, 361.
Eid, M.H.; Saeed, O.; Szűcs, P.; Kovács, A.; Székács, A.; Mörtl, M.; Alrakhami, M.S.; Al-Mashreki, M.H.; Elsherbiny, O.; Elsayed, S.; et al. Impacts and Sources of Potential Toxic Elements on Water Quality and Optimizing Machine Learning Models for Sustainable Management. Model. Earth Syst. Environ. 2025, 11, 375.
Al-Khayat, J.A.; Jones, D.A. A Comparison of the Macrofauna of Natural and Replanted Mangroves in Qatar. Estuar. Coast. Shelf Sci. 1999, 49, 55–63.
Alharbi, T.; Al-Kahtany, K.; Nour, H.E.; Giacobbe, S.; El-Sorogy, A.S. Contamination and Health Risk Assessment of Arsenic and Chromium in Coastal Sediments of Al-Khobar Area, Arabian Gulf, Saudi Arabia. Mar. Pollut. Bull. 2022, 185, 114255.
Almahasheer, H. Spatial Coverage of Mangrove Communities in the Arabian Gulf. Environ. Monit. Assess. 2018, 190, 85.
Gad, M.; El Hamed, R.A.; El Fadaly, E.A.; Mousa, I.E.; Gaagai, A.; Aouissi, H.A.; Eid, M.H.; Abukhadra, M.R.; Alqhtani, H.A.; Allam, A.A.; et al. New Approach to Predict Wastewater Quality for Irrigation Utilizing Integrated Indexical Approaches and Hyperspectral Reflectance Measurements Supported with Multivariate Analysis. Sci. Rep. 2025, 15, 16395.
Bărbulescu, A.; Barbeș, L. Water Quality Assessment in the Northern Part of the Romanian Black Sea Coastal Area Using an Integrated Index. Appl. Sci. 2026, 16, 4042.
Al-Khayat, J.A.; Giraldes, B.W. Burrowing Crabs in Arid Mangrove Forests on the Southwestern Arabian Gulf: Ecological and Biogeographical Considerations. Reg. Stud. Mar. Sci. 2020, 39, 101416.
Namukonde, N.; Simukonda, C.; Ganzhorn, J.U. Different Effects of Fire Age and Fire Recurrence on Grass and Woody Plant Chemistry in Kafue National Park, Zambia. Biotropica 2023, 55, 1165–1173.
Baird, R.; Eaton, A.D.; Rice, E.W.; Bridgewater, L. Standard Methods for the Examination of Water and Wastewater; American Public Health Association: Washington, DC, USA, 2017; ISBN 9780875532875.
Ghodbane, M.; Benaabidate, L.; Boudoukha, A.; Gaagai, A.; Adjissi, O.; Chaib, W.; Aouissi, H.A. Analysis of Groundwater Quality in the Lower Soummam Valley, North-East of Algeria. J. Water Land Dev. 2022, 54, 1–12.
Bhagat, A.; Kshirsagar, N.; Khodke, P.; Dongre, K.; Ali, S. Penalty Parameter Selection for Hierarchical Data Stream Clustering. Procedia Comput. Sci. 2016, 79, 24–31.
Çakir, U.; Buck, T. MEGS: Morphological Evaluation of Galactic Structure. Astron. Astrophys. 2024, 691, A320.
Brown, R.M.; McClelland, N.I.; Deininger, R.A.; O’Connor, M.F. A Water Quality Index: Crashing the Psychological Barrier. In Indicators of Environmental Quality; Springer: Boston, MA, USA, 1972; pp. 173–182.
Canadian Environmental Quality Guidelines; CCME: Winnipeg, MB, Canada, 1999; ISBN 9781896997346.
Prasad, B.; Bose, J. Evaluation of the Heavy Metal Pollution Index for Surface and Spring Water near a Limestone Mining Area of the Lower Himalayas. Environ. Geol. 2001, 41, 183–188.
Tamasi, G.; Cini, R. Heavy Metals in Drinking Waters from Mount Amiata (Tuscany, Italy). Possible Risks from Arsenic for Public Health in the Province of Siena. Sci. Total Environ. 2004, 327, 41–51.
Backman, B.; Bodiš, D.; Lahermo, P.; Rapant, S.; Tarvainen, T. Application of a Groundwater Contamination Index in Finland and Slovakia. Environ. Geol. 1998, 36, 55–64.
Caeiro, S.; Costa, M.H.; Ramos, T.B.; Fernandes, F.; Silveira, N.; Coimbra, A.; Medeiros, G.; Painho, M. Assessing Heavy Metal Contamination in Sado Estuary Sediment: An Index Analysis Approach. Ecol. Indic. 2005, 5, 151–169.
Edet, A.E.; Offiong, O.E. Evaluation of Water Quality Pollution Indices for Heavy Metal Contamination Monitoring. A Study Case from Akpabuyo-Odukpani Area, Lower Cross River Basin (Southeastern Nigeria). GeoJournal 2002, 57, 295–304.
Tiyasha; Tung, T.M.; Yaseen, Z.M. A Survey on River Water Quality Modelling Using Artificial Intelligence Models: 2000–2020. J. Hydrol. (Amst.) 2020, 585, 124670.
Azamathulla, H.M.; Haghiabi, A.H.; Parsaie, A. Prediction of Side Weir Discharge Coefficient by Support Vector Machine Technique. Water Supply 2016, 16, 1002–1016.
Zhou, L.; Wang, X.; Zhang, C.; Zhao, N.; Taha, M.F.; He, Y.; Qiu, Z. Powdery Food Identification Using NIR Spectroscopy and Extensible Deep Learning Model. Food Bioprocess Technol. 2022, 15, 2354–2362.
Gad, M.; Abou El-Safa, M.M.; Farouk, M.; Hussein, H.; Alnemari, A.M.; Elsayed, S.; Khalifa, M.M.; Moghanm, F.S.; Eid, E.M.; Saleh, A.H. Integration of Water Quality Indices and Multivariate Modeling for Assessing Surface Water Quality in Qaroun Lake, Egypt. Water 2021, 13, 2258.
Khoi, D.N.; Quan, N.T.; Linh, D.Q.; Nhi, P.T.T.; Thuy, N.T.D. Using Machine Learning Models for Predicting the Water Quality Index in the La Buong River, Vietnam. Water 2022, 14, 1552.
Hassan, M.M.; Hassan, M.M.; Akter, L.; Rahman, M.M.; Zaman, S.; Hasib, K.M.; Jahan, N.; Smrity, R.N.; Farhana, J.; Raihan, M.; et al. Efficient Prediction of Water Quality Index (WQI) Using Machine Learning Algorithms. Hum. Centric Intell. Syst. 2021, 1, 86.
Elgeldawi, E.; Sayed, A.; Galal, A.R.; Zaki, A.M. Hyperparameter Tuning for Machine Learning Algorithms Used for Arabic Sentiment Analysis. Informatics 2021, 8, 79.
Yang, K.; Hao, J.; Wang, Y. Switching Angles Generation for Selective Harmonic Elimination by Using Artificial Neural Networks and Quasi-Newton Algorithm. In Proceedings of the 2016 IEEE Energy Conversion Congress and Exposition (ECCE); IEEE: New York, NY, USA, 2016; pp. 1–5.
Sharma, S.; Sharma, S.; Athaiya, A. Activation Functions in Neural Networks. Int. J. Eng. Appl. Sci. Technol. 2020, 4, 310–316.
Mijwil, M.M. Artificial Neural Networks Advantages and Disadvantages. Mesopotamian J. Big Data 2021, 2021, 29–31.
Han, J.; Kamber, M.; Pei, J. Data Mining: Concepts and Techniques; Elsevier: Amsterdam, The Netherlands; Morgan Kaufmann: San Francisco, CA, USA, 2012; ISBN 9780123814791.
Xia, Y.; Liu, C.; Li, Y.; Liu, N. A Boosted Decision Tree Approach Using Bayesian Hyper-Parameter Optimization for Credit Scoring. Expert Syst. Appl. 2017, 78, 225–241.
Ahmed, M.; Mumtaz, R.; Hassan Zaidi, S.M. Analysis of Water Quality Indices and Machine Learning Techniques for Rating Water Pollution: A Case Study of Rawal Dam, Pakistan. Water Supply 2021, 21, 3225–3250.
Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
Rakotondrabe, F.; Ndam Ngoupayou, J.R.; Mfonka, Z.; Rasolomanana, E.H.; Nyangono Abolo, A.J.; Ako Ako, A. Water Quality Assessment in the Bétaré-Oya Gold Mining Area (East-Cameroon): Multivariate Statistical Analysis Approach. Sci. Total Environ. 2018, 610–611, 831–844.
Matiatos, I.; Alexopoulos, A.; Godelitsas, A. Multivariate Statistical Analysis of the Hydrogeochemical and Isotopic Composition of the Groundwater Resources in Northeastern Peloponnesus (Greece). Sci. Total Environ. 2014, 476–477, 577–590.
Burrough, P.A. Principles of Geographical Information Systems for Land Resources Assessment; Clarendon Press: Oxford, UK, 1986; ISBN 0198545924.
Watson, D.F. Contouring: A Guide to the Analysis and Display of Spatial Data: With Programs on Diskette; Pergamon Press: Oxford, UK, 2019; ISBN 9780080402864.
Nesterov, O. An Assessment of Seawater Desalination Impact on Salinities in the Arabian/Persian Gulf Using a 3D Circulation Model. Ocean Model. 2025, 194, 102503.
Banda, T.D.; Kumarasamy, M.V. Development of Water Quality Indices (WQIs): A Review. Pol. J. Environ. Stud. 2020, 29, 2011–2021.
Bora, M.; Goswami, D.C. Water Quality Assessment in Terms of Water Quality Index (WQI): Case Study of the Kolong River, Assam, India. Appl. Water Sci. 2017, 7, 3125–3135.
Goher, M.E.; Farhat, H.I.; Abdo, M.H.; Salem, S.G. Metal Pollution Assessment in the Surface Sediment of Lake Nasser, Egypt. Egypt. J. Aquat. Res. 2014, 40, 213–224.
Werdell, P.J.; McKinna, L.I.W.; Boss, E.; Ackleson, S.G.; Craig, S.E.; Gregg, W.W.; Lee, Z.; Maritorena, S.; Roesler, C.S.; Rousseaux, C.S.; et al. An Overview of Approaches and Challenges for Retrieving Marine Inherent Optical Properties from Ocean Color Remote Sensing. Prog. Oceanogr. 2018, 160, 186–212.
Okbah, M.A.; Nasr, S.M.; Soliman, N.F.; Khairy, M.A. Distribution and Contamination Status of Trace Metals in the Mediterranean Coastal Sediments, Egypt. Soil Sediment Contam. An. Int. J. 2014, 23, 656–676.
Everitt, B.; Dunn, G. Applied Multivariate Data Analysis; Wiley & Sons: Hoboken, NJ, USA, 2001; ISBN 9780470711170.
Rencher, A.C. Methods of Multivariate Analysis; John Wiley: Hoboken, NJ, USA, 2002; ISBN 9780471418894.
Luo, H.; Nong, X.; Xia, H.; Liu, H.; Zhong, L.; Feng, Y.; Zhou, W.; Lu, Y. Integrating Water Quality Index (WQI) and Multivariate Statistics for Regional Surface Water Quality Evaluation: Key Parameter Identification and Human Health Risk Assessment. Water 2024, 16, 3412.
Bui, D.T.; Khosravi, K.; Tiefenbacher, J.; Nguyen, H.; Kazakis, N. Improving Prediction of Water Quality Indices Using Novel Hybrid Machine-Learning Algorithms. Sci. Total Environ. 2020, 721, 137612.
El Bilali, A.; Taleb, A.; Brouziyne, Y. Groundwater Quality Forecasting Using Machine Learning Algorithms for Irrigation Purposes. Agric. Water Manag. 2021, 245, 106625.
Gazzaz, N.M.; Yusoff, M.K.; Aris, A.Z.; Juahir, H.; Ramli, M.F. Artificial Neural Network Modeling of the Water Quality Index for Kinta River (Malaysia) Using Water Quality Variables as Predictors. Mar. Pollut. Bull. 2012, 64, 2409–2420.
Alharbi, T.; Alfaifi, H.; El-Sorogy, A. Metal Pollution in Al-Khobar Seawater, Arabian Gulf, Saudi Arabia. Mar. Pollut. Bull. 2017, 119, 407–415.
Amin, S.A.; Almahasheer, H. Pollution Indices of Heavy Metals in the Western Arabian Gulf Coastal Area. Egypt. J. Aquat. Res. 2022, 48, 21–27.
Nour, H.E.; Ramadan, F.; Alsubaie, K.; Tawfik, M. Seasonal Variation and Assessment of Heavy Metals in Coastal Seawater of Kuwait Bay, Northeast Coast of Kuwait. EnvironmentAsia 2022, 15, 108–119.
Ghanimeh, S.; Dalloul, M.; Al-Naimi, M.; Almomani, F.; Hassan, H.; Semerjian, L.; Tariq, A. Seawater Pollution in the Arabian Gulf: Unveiling Risks and the Urgent Need for Local Standards. Earth Syst. Environ. 2026, 10, 1649–1663.
Swetha, S.; Veerasingam, S.; Rajendran, S.; Hassan, H.; Hashmi, M.Z.U.R.R.; Alsaadi, H.; Rangel-Buitrago, N.; Sadooni, F.N. Long-Term Trends in Heavy Metal Contamination of Marine Sediments in the Arabian Gulf: A Meta-Analysis. Environ. Monit. Assess. 2025, 197, 873.
Painting, S.J.; Smith, A.J.; Khamis, A.S.; Abdulla, K.H.; Le Quesne, W.J.F.; Lyons, B.P.; Devlin, M.J.; Garcia, L. Development of Standards for Assessing Water Quality in Marine Coastal Waters of Bahrain. Mar. Pollut. Bull. 2023, 196, 115560.

Figure 1. Location map of the Study Area.

Figure 2. (a–d) Geographic Distribution of Marine Water Quality Indices for Summer Surface Samples.

Figure 3. (a–d) Geographic Distribution of Marine Water Quality Indices for Summer Bottom Samples.

Figure 4. (a–d) Geographic Distribution of Marine Water Quality Indices for Winter Surface Samples.

Figure 5. (a–d) Geographic Distribution of Marine Water Quality Indices for Winter Bottom Samples.

Figure 6. The Relative Pollution Index in Seawater Sampling (a) Summer Top Results, (b) Summer Bottom Results, (c) Winter Top Results, (d) Winter Bottom Results.

Figure 7. (a) Cluster Dendrogram for Variables; (b–h) PCA.

Figure 8. Heatmaps of the coefficients of determination between WQIs and trace elements, where color intensity indicates the correlation magnitude.

Figure 9. The performance of different ANN models during testing phases for predicting (a) AWQI, (b) HPI, (c) MI, and (d) C_d.

Figure 10. ANN architecture, incorporating a combination of the optimal physicochemical parameters, for predicting (a) AWQI, (b) HPI, (c) MI, and (d) C_d.

Figure 11. The performance of different RF models during testing phases for predicting (a) AWQI, (b) HPI, (c) MI, and (d) C_d.

Figure 12. The effectiveness of various DT algorithms during validation phases for forecasting (a) AWQI, (b) HPI, (c) MI, and (d) C_d.

Table 1. Arithmetic rating method for computation of HPI, MI, C_d and PI.

Trace Element (mg/L)	S_i (mg/L) CCME [18]	MAC_i	Unit Weight (W_i)	Sub Index (Q_i)	W_i × Q_i
Hexavalent chromium (Cr-VI)	0.0015	0.1	0.05432	10	0.54320
Aluminum (Al)	0.1	0.039	0.00081	3.9	0.00318
Barium (Ba)	0.05	0.22	0.00163	22	0.03585
Cadmium (Cd)	0.001	1.1	0.08148	110	8.96287
Chromium (Cr)	0.01	1.01	0.00815	101	0.82295
Copper (Cu)	0.004	0.125	0.02037	12.5	0.25463
Iron (Fe)	0.3	0.023	0.00027	2.33	0.00063
Lead (Pb)	0.007	0.014	0.01164	1.42	0.01663
Manganese (Mn)	0.05	0.022	0.00163	2.2	0.00359
Mercury (Hg)	0.0001	1	0.81481	100	81.48062
Nickel (Ni)	0.025	0.004	0.00326	0.4	0.00130
			∑ (W_i) = 1		∑ (W_i × Q_i)
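
The arithmetic behind Table 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' script: the third column (printed as MAC_i) is read here as the concentration-to-standard ratio, because the printed sub-index Q_i is close to 100 times that column; the unit weights are taken as W_i = k/S_i, normalised over the eleven listed elements only, so they may differ slightly from the printed W_i. The names `TABLE_1`, `unit_weights`, `hpi` and `metal_index` are hypothetical.

```python
# Element -> (S_i in mg/L, ratio column printed as MAC_i), copied from Table 1.
TABLE_1 = {
    "Cr-VI": (0.0015, 0.1),
    "Al": (0.1, 0.039),
    "Ba": (0.05, 0.22),
    "Cd": (0.001, 1.1),
    "Cr": (0.01, 1.01),
    "Cu": (0.004, 0.125),
    "Fe": (0.3, 0.023),
    "Pb": (0.007, 0.014),
    "Mn": (0.05, 0.022),
    "Hg": (0.0001, 1.0),
    "Ni": (0.025, 0.004),
}


def unit_weights(table):
    """W_i = k / S_i, with k chosen so that the weights sum to 1 (assumption)."""
    k = 1.0 / sum(1.0 / s for s, _ in table.values())
    return {element: k / s for element, (s, _) in table.items()}


def hpi(table):
    """Weighted mean of the sub-indices: sum(W_i * Q_i) / sum(W_i)."""
    weights = unit_weights(table)
    # Assumption: Q_i = 100 * (concentration / standard).
    sub_index = {element: 100.0 * ratio for element, (_, ratio) in table.items()}
    weighted = sum(weights[element] * sub_index[element] for element in table)
    return weighted / sum(weights.values())


def metal_index(table):
    """MI as the plain sum of the concentration-to-standard ratios."""
    return sum(ratio for _, ratio in table.values())


if __name__ == "__main__":
    print(f"HPI = {hpi(TABLE_1):.2f}")
    print(f"MI  = {metal_index(TABLE_1):.3f}")
```

Because W_i is inversely proportional to S_i, mercury and cadmium, which have the strictest standards, dominate the weighted sum.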

Table 2. Statistical description of Seawater quality parameters in MIC (2022–2023).

Seawater Quality Parameters (2022–2023)
Statistic	T (°C)	pH	Salinity	Chl (C₅₅H₇₂MgN₄O₅)	NH₃	NO₃	NO₂	TP	Cr-VI	Al	Ba	Cd	Cr	Cu	Fe	Pb	Mn	Hg	Ni	Zn
1st Year—Summer Top 2022 (n = 23)
Min	27.81	8.50	43.21	0.01	0.02	0.04	0.016	0.01	0.00001	0.0029	0.01	0.0001	0.0001	0.0005	0.005	0.0001	0.0001	0.0001	0.0001	0.01
Max	32.40	8.65	43.63	0.01	0.02	0.09	0.02	0.01	0.00001	0.0099	0.01	0.0003	0.0015	0.0032	0.00768	0.0001	0.0059	0.0001	0.0038	0.023
Mean	29.421	8.60	43.55	0.01	0.02	0.09	0.02	0.01	0.00001	0.0054	0.0100	0.0001	0.0005	0.0015	0.0053	0.0001	0.0019	0.0001	0.0015	0.0119
1st Year—Summer Bottom 2022 (n = 23)
Min	26.15	8.48	43.23	0.01	0.02	0.04	0.016	0.01	0.00001	0.0031	0.01	0.0001	0.0001	0.0003	0.005	0.0001	0.0001	0.0001	0.0003	0.01
Max	29.20	8.72	43.34	0.01	0.02	0.04	0.02	0.01	0.00001	0.125	0.23	0.0035	0.0116	0.0286	0.12098	0.0023	0.0381	0.0023	0.0335	0.266
Mean	27.152	8.55	43.29	0.01	0.02	0.04	0.01652	0.01	0.00001	0.0054	0.0100	0.0002	0.0005	0.0012	0.0053	0.0001	0.0017	0.0001	0.0015	0.0116
2nd Year—Winter Top 2023 (n = 23)
Min	18.25	8.43	45.15	0.011	0.021	0.042	0.015	0.011	0.00001	0.0034	0.011	0.0001	0.0001	0.0005	0.005	0.0001	0.0011	0.0001	0.0001	0.011
Max	19.50	8.45	45.71	0.014	0.023	0.223	0.025	0.012	0.00015	0.0118	0.014	0.0031	0.0101	0.002	0.061	0.0031	0.0031	0.0001	0.0021	0.011
Mean	18.742	8.52	45.60	0.0118	0.0213	0.0944	0.0192	0.0111	0.0001	0.006	0.011	0.001	0.001	0.001	0.012	0.001	0.001	0.000	0.000	0.011
2nd Year—Winter Bottom 2023 (n = 23)
Min	16.13	8.55	45.25	0.01000	0.02000	0.09130	0.01774	0.01000	0.00001	0.0034	0.01	0.0001	0.0001	0.0005	0.005	0.0001	0.0001	0.0001	0.0001	0.01
Max	17.75	8.62	45.81	0.01000	0.02000	0.09130	0.01774	0.01000	0.00001	0.0117	0.01	0.0002	0.0014	0.0011	0.061	0.0001	0.0001	0.0001	0.0006	0.01
Mean	17.235	8.63	45.33	0.01000	0.02000	0.09130	0.01774	0.01000	0.00001	0.0060	0.0100	0.0001	0.0003	0.0007	0.0116	0.0001	0.0001	0.0001	0.0001	0.0100

Unless a unit is given in the column header, parameters are reported in mg/L; pH is dimensionless.

Table 3. AWQI Statistical Description in Seawater (2022–2023).

AWQI	Summer (Top)	Summer (Bottom)	Winter (Top)	Winter (Bottom)
Min	82.64	82.54	82.00	82.64
Max	83.66	85.39	108.32	82.75
Mean	82.45	82.45	88.33	82.89

Table 4. Assessment of WQIs for Seawater Quality Based on the Effects of Trace Elements.

Index	Range	Water Class
HPI	<100	Low polluted
	>100	High polluted
MI	<0.3	Very pure
	0.3–1.0	Pure
	1.0–2.0	Slightly affected
	2.0–4.0	Moderately affected
	4.0–6.0	Strongly affected
	>6.0	Seriously affected
C_d	<1	Low
	1–3	Medium
	>3	High
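
The class limits in Table 4 translate into a simple lookup, sketched below. The function names are hypothetical. The MI breakpoints are taken as 0.3, 1, 2, 4 and 6, and the handling of values that fall exactly on a shared boundary (for example HPI = 100 or MI = 1.0) is an assumption, since the table does not say which class they belong to.

```python
def classify_hpi(hpi):
    """HPI below 100 is 'Low polluted'; 100 and above is treated as 'High polluted'."""
    return "Low polluted" if hpi < 100 else "High polluted"


def classify_mi(mi):
    """Six MI classes; a value on a shared limit goes to the higher class (assumption)."""
    if mi < 0.3:
        return "Very pure"
    if mi < 1.0:
        return "Pure"
    if mi < 2.0:
        return "Slightly affected"
    if mi < 4.0:
        return "Moderately affected"
    if mi <= 6.0:
        return "Strongly affected"
    return "Seriously affected"


def classify_cd(cd):
    """C_d classes: below 1 is Low, 1 to 3 inclusive is Medium, above 3 is High."""
    if cd < 1:
        return "Low"
    if cd <= 3:
        return "Medium"
    return "High"


if __name__ == "__main__":
    # Illustrative inputs only, not values reported in the study.
    print(classify_hpi(92.3), "|", classify_mi(0.8), "|", classify_cd(0.4))
```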

Table 5. Assessment of Seawater Relative Pollution Index (PI) Based on the Effects of Trace Elements in Summer.

Trace Element	PI, Summer (Top)	PI, Summer (Bottom)
Hexavalent chromium (Cr-VI)	0.00	0.00
Aluminum (Al)	0.03	0.08
Barium (Ba)	0.14	0.14
Cadmium (Cd)	0.16	0.11
Chromium (Cr)	0.08	0.08
Copper (Cu)	0.41	0.32
Iron (Fe)	0.02	0.02
Lead (Pb)	0.01	0.01
Manganese (Mn)	0.05	0.04
Mercury (Hg)	0.71	0.71
Nickel (Ni)	0.08	0.08
Zinc (Zn)	0.25	0.22
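
A hedged sketch of how a relative pollution index of this kind can be computed for one element. The formula form used here, PI = sqrt((C_max/S)^2 + (C_min/S)^2) / 2, is an assumption about the definition; it is not restated in this part of the article. With the Summer (Top) minima and maxima from Table 2 and the S_i values from Table 1 it gives 0.71 for Hg, 0.16 for Cd and 0.14 for Ba, in line with Table 5, but it does not reproduce every entry exactly. The function name `pollution_index` is hypothetical.

```python
import math


def pollution_index(c_min, c_max, standard):
    """Assumed form: sqrt((c_max/S)^2 + (c_min/S)^2) / 2, all inputs in mg/L."""
    return math.sqrt((c_max / standard) ** 2 + (c_min / standard) ** 2) / 2.0


if __name__ == "__main__":
    # (min, max, S_i) for Summer (Top), taken from Tables 1 and 2.
    samples = {
        "Hg": (0.0001, 0.0001, 0.0001),
        "Cd": (0.0001, 0.0003, 0.001),
        "Ba": (0.01, 0.01, 0.05),
    }
    for element, (c_min, c_max, s) in samples.items():
        print(element, round(pollution_index(c_min, c_max, s), 2))
```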

Table 6. Assessment of Seawater Relative Pollution Index (PI) Based on the Effects of Trace Elements in Winter.

Trace Element	PI, Winter (Top)	PI, Winter (Bottom)
Hexavalent chromium (Cr-VI)	0.05	0.01
Aluminum (Al)	0.06	0.06
Barium (Ba)	0.16	0.14
Cadmium (Cd)	0.55	0.07
Chromium (Cr)	0.51	0.05
Copper (Cu)	0.14	0.14
Iron (Fe)	0.10	0.10
Lead (Pb)	0.15	0.01
Manganese (Mn)	0.02	0.00
Mercury (Hg)	0.71	0.71
Nickel (Ni)	0.00	0.00
Zinc (Zn)	0.16	0.14

Table 7. Correlation Between the Metal Parameters and Factors.

Parameters	Factor 1	Factor 2	Factor 3	Factor 4	Factor 5	Factor 6	Factor 7	Factor 8
Chl	−0.426	0.023	−0.254	0.279	0.611	−0.371	−0.143	−0.230
NH₃	0.188	−0.120	0.349	0.264	0.589	0.572	−0.099	−0.134
NO₃	−0.445	0.154	0.435	0.045	0.124	0.073	0.675	−0.004
NO₂	0.337	−0.098	0.612	0.007	−0.427	0.162	−0.238	−0.280
TP	0.119	−0.935	−0.082	0.030	−0.009	−0.035	0.093	0.105
Cr-VI	−0.476	−0.360	0.380	−0.347	−0.005	−0.359	−0.276	−0.337
Al	−0.199	−0.064	−0.256	0.525	−0.306	−0.097	0.323	−0.587
Ba	0.160	−0.927	0.031	0.110	0.134	−0.037	−0.041	0.149
Cd	0.086	−0.864	0.099	−0.148	0.009	0.214	0.156	−0.189
Cr	−0.573	−0.195	0.356	−0.500	0.168	−0.224	0.243	0.109
Cu	−0.858	−0.283	−0.064	0.025	−0.117	0.133	−0.214	−0.030
Fe	0.271	0.121	−0.576	−0.469	0.276	0.210	0.012	−0.342
Pb	0.298	−0.481	−0.447	0.440	−0.091	−0.123	0.051	0.183
Mn	−0.961	0.016	−0.107	0.108	0.000	0.069	−0.096	0.125
Hg	−0.105	−0.179	−0.603	−0.631	−0.155	0.182	0.109	−0.057
Ni	−0.937	−0.101	−0.094	0.138	−0.049	0.171	−0.055	0.006
Zn	−0.843	0.069	−0.148	0.100	−0.178	0.327	−0.070	0.098

Table 8. Performance of the ANN models for WQI prediction in training and testing.

ANN Model	Optimal Features	Hyperparameters (Z, L, N, I) *	Training R²	Training RMSE	Training MAE	Testing R²	Testing RMSE	Testing MAE
ANN-AWQI	C₅₅H₇₂MgN₄O₅, NO₃, NO₂, TP, Ba, Cd, Cr, Fe, Pb, Mn, Zn	(ReLU, 2, 5, 700)	0.953	0.864	0.847	0.897	1.282	1.270
ANN-HPI	Cr-VI, Ba, Cd, Cr, Pb, Mn	(ReLU, 2, 4, 500)	0.946	0.930	0.407	0.887	1.343	0.986
ANN-MI	Cr-VI, Ba, Cd, Cr, Pb, Mn	(Identity, 1, 8, 500)	0.986	0.084	0.047	0.980	0.098	0.064
ANN-C_d	Cr-VI, Ba, Cd, Cr, Pb, Mn	(Identity, 1, 7, 500)	0.986	0.084	0.047	0.938	0.170	0.160

* Z, L, N, and I represent the activation function employed, the number of layers, the number of neurons in each layer, and the number of iterations, respectively.
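
For reference, the three scores reported in the model tables can be computed as in the sketch below. It assumes R² is the coefficient of determination, 1 − SS_res/SS_tot; the study may instead report a squared correlation coefficient. The function name and the sample values are hypothetical.

```python
import math


def regression_metrics(observed, predicted):
    """Return R2, RMSE and MAE for paired observed and predicted values."""
    n = len(observed)
    residuals = [o - p for o, p in zip(observed, predicted)]
    mean_observed = sum(observed) / n
    ss_res = sum(r * r for r in residuals)  # residual sum of squares
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)  # total sum of squares
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(r) for r in residuals) / n,
    }


if __name__ == "__main__":
    # Made-up numbers for illustration; not data from the study.
    print(regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0]))
```

MAE can never exceed RMSE for the same residuals, which gives a quick consistency check on any row of reported scores.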

Table 9. Performance of the RF models for WQI prediction in training and testing.

RF Model	Optimal Features	Hyperparameters (C, D, T) *	Training R²	Training RMSE	Training MAE	Testing R²	Testing RMSE	Testing MAE
RF-AWQI	NO₂, Cr-VI, Ba, Cd, Cr, Pb, Mn	(MSE, 5, 13)	0.9307	1.051	0.905	0.8822	1.370	1.306
RF-HPI	Cr-VI, Ba, Cd, Cr, Pb	(MSE, 6, 15)	0.9507	0.888	0.371	0.9145	1.170	0.855
RF-MI	Cr-VI, Ba, Cd, Cr, Cu, Pb, Mn, Ni, Zn	(MSE, 7, 11)	0.9752	0.120	0.050	0.9719	0.114	0.063
RF-C_d	Cr-VI, Ba, Cd, Cr, Pb, Mn	(MSE, 5, 11)	0.9885	0.073	0.036	0.9203	0.193	0.184

* C, D, and T represent the criterion function, the maximum depth of individual trees, and the number of trees in the forest, respectively.

Table 10. Performance of the DT models for WQI prediction in training and testing.

DT Model	Optimal Features	Hyperparameters (C, D) *	Training R²	Training RMSE	Training MAE	Testing R²	Testing RMSE	Testing MAE
DT-AWQI	NO₂, Ba, Cd, Cr, Pb, Mn	(MSE, 5)	0.9537	0.859	0.847	0.8973	1.279	1.270
DT-HPI	Cr-VI, Ba, Cd, Cr, Pb	(MSE, 5)	0.9534	0.864	0.241	0.9334	1.033	0.749
DT-MI	Cr-VI, Ba, Cd, Cr, Cu, Pb, Ni, Zn	(MSE, 5)	0.8254	0.285	0.081	0.8248	0.286	0.093
DT-C_d	Cr-VI, Ba, Cd, Cr, Pb, Mn	(MSE, 7)	0.9927	0.058	0.031	0.9311	0.179	0.170

* C and D represent the criterion function and the maximum depth of the tree, respectively.

Table 11. Diebold–Mariano test results for pairwise model comparisons (ANN, RF, DT) across four indices (AWQI, HPI, MI, C_d).

WQIs	Comparison	DM Statistic	p-Value
AWQI	ANN vs. RF	−1.1745	0.2433
	ANN vs. DT	0.2319	0.8171
	RF vs. DT	1.2370	0.2193
HPI	ANN vs. RF	0.6134	0.5412
	ANN vs. DT	2.5119	0.0138
	RF vs. DT	0.4961	0.6210
MI	ANN vs. RF	−1.0665	0.2890
	ANN vs. DT	−1.4646	0.1465
	RF vs. DT	−1.3541	0.1790
C_d	ANN vs. RF	−3.3748	0.0011
	ANN vs. DT	−1.9668	0.0523
	RF vs. DT	1.6586	0.1006
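
A minimal sketch of a Diebold–Mariano comparison for two sets of prediction errors. Several choices here are assumptions, because this part of the article does not state them: squared-error loss, a one-step horizon with no autocovariance correction, and a two-sided p-value from the standard normal approximation. The function name `diebold_mariano` is hypothetical.

```python
import math


def diebold_mariano(errors_a, errors_b):
    """Return (DM statistic, two-sided p-value) for two error series.

    A positive statistic means model A has the larger mean squared error.
    """
    # Loss differential under squared-error loss (assumption).
    d = [ea ** 2 - eb ** 2 for ea, eb in zip(errors_a, errors_b)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((x - mean_d) ** 2 for x in d) / n
    stat = mean_d / math.sqrt(var_d / n)
    # Two-sided p-value from the standard normal distribution (assumption).
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return stat, p_value


if __name__ == "__main__":
    # Made-up error series for illustration; not data from the study.
    stat, p = diebold_mariano([1.0, 2.0, 1.0, 2.0], [0.5, 0.5, 0.5, 0.5])
    print(f"DM = {stat:.4f}, p = {p:.4f}")
```

Swapping the two arguments flips the sign of the statistic and leaves the p-value unchanged, so the sign in each row depends on which model is listed first.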

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Gad, M.; Ata, A.A.E.-S.M.; Fattah, M.K.; El-Fadaly, E.A.; El-baki, M.S.A.; Gaagai, A.; Eid, M.H.; Elsherbiny, O.; Taha, M.F.; Elsayed, S. Assessment of Marine Water Quality Using Integrated Indices and Machine Learning Framework in the Arabian Gulf Region. Sustainability 2026, 18, 6140. https://doi.org/10.3390/su18126140

