3.4.2. Principal Component Analysis (PCA)
Multivariate statistical techniques such as PCA are widely applied in environmental studies to analyze complex datasets and identify patterns of contamination. PCA helps reduce dimensionality by transforming large datasets into a smaller number of components that explain most of the variance. This approach is particularly useful in assessing the influence of multiple factors on seawater quality and distinguishing between natural and anthropogenic sources of pollution. By applying PCA, this study aims to determine the most significant variables contributing to heavy metal contamination in MIC seawater.
PCA analyzes multivariate relationships and explains data variation by reducing a large number of variables to a few principal components; samples can then be grouped according to their principal component scores [48]. As described by Rencher, the method converts a dataset with several variables into a set of comprehensive principal components and is closely related to correlation and regression analysis. Researchers have used PCA in many fields because it substantially reduces the number of variables and reveals structure in the interactions among them [49]. The first step in using PCA to assess heavy metal contamination is to identify the principal components of the dataset. Because the leading principal components capture most of the variance in the assessed indices, they can adequately represent the levels of heavy metal contamination in the water. Each principal component is the linear combination of the original variables that maximizes variance. Component scores are calculated as weighted sums of the measured variables, so the heavy metal concentrations in seawater can be used to calculate the level of heavy metal pollution.
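The variance-maximization step described above can be written compactly. The notation is an assumption introduced for illustration, since the text defines no symbols: x is the vector of p standardized variables, S is their covariance (or correlation) matrix, w_k is the weight vector of the k-th component, and λ_k is the corresponding eigenvalue of S.

```latex
\begin{aligned}
\mathbf{w}_1 &= \arg\max_{\lVert \mathbf{w} \rVert = 1}
                \operatorname{Var}\!\left(\mathbf{w}^{\top}\mathbf{x}\right)
              = \arg\max_{\lVert \mathbf{w} \rVert = 1}
                \mathbf{w}^{\top}\mathbf{S}\,\mathbf{w}, \\
z_k &= \mathbf{w}_k^{\top}\mathbf{x} = \sum_{j=1}^{p} w_{kj}\,x_j, \\
\text{share of variance explained by component } k
    &= \frac{\lambda_k}{\sum_{i=1}^{p} \lambda_i}.
\end{aligned}
```

Under this formulation, each later component solves the same problem subject to being orthogonal to the earlier ones, and the score z_k is the weighted sum referred to in the text.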
The PCA of the measured parameters yielded eight factors that together explain 90.33% of the variance (Table 7 and Figure 7a–h). The eight factors (F1, F2, F3, F4, F5, F6, F7, and F8) explain 27.02, 18.09, 12.03, 9.90, 7.14, 5.79, 5.26, and 5.10% of the variance, respectively. F1 is represented by Cr (−0.573), Cu (−0.858), Mn (−0.961), Ni (−0.937), and Zn (−0.843). F2 consists of TP (−0.935), Ba (−0.927), and Cd (−0.864). F3, F4, F5, F6, F7, and F8 are represented by NO2 (0.612), Hg (−0.603), Chl (0.611), NH3 (0.572), NO3 (0.675), and Al (−0.587), respectively. However, owing to their low concentrations in both surface and bottom seawater, some metals, such as iron and lead, do not contribute meaningfully to any factor. Individual metal contamination of marine systems may also be caused by human activities, the natural dispersion of clay minerals in sediment, and the interaction between soil and water.
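As a quick arithmetic check, the eight variance shares quoted above can be accumulated to confirm the reported total. This is only a sketch of the bookkeeping; the numbers are the ones given in the text.

```python
# Variance shares (%) of factors F1..F8 as quoted in the text.
explained = [27.02, 18.09, 12.03, 9.90, 7.14, 5.79, 5.26, 5.10]

# Running (cumulative) share after each factor, rounded to two decimals.
cumulative = []
running = 0.0
for share in explained:
    running += share
    cumulative.append(round(running, 2))

total = cumulative[-1]
print(cumulative)
print(f"total explained variance: {total:.2f}%")
```

The first two factors alone account for a little under half of the variance, which is why the discussion concentrates on F1 and F2.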
The strong negative loadings of Cr (−0.573), Cu (−0.858), Mn (−0.961), Ni (−0.937), and Zn (−0.843) in Factor 1 indicate a strong association of these metals with the same pollution source or geochemical process. In PCA, the sign of the loading reflects the direction of the relationship relative to the component axis rather than the strength or environmental significance of contamination. Therefore, the negative loadings observed in Factor 1 should be interpreted as an inverse orientation along the PCA axis and not as a reduction in pollution impact. The axis orientation in PCA is mathematically arbitrary and may be reversed without affecting the interpretation of the results.
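The point about arbitrary axis orientation can be illustrated with a small sketch. The data and the weight vector below are hypothetical (they are not the study's measurements or its actual first component); the example only shows that reversing the sign of a weight vector negates every score while leaving the variance along the axis unchanged.

```python
from statistics import pvariance

# Hypothetical standardized measurements for two co-varying metals (illustrative only).
samples = [(-1.2, -0.9), (-0.4, -0.6), (0.1, 0.3), (0.6, 0.4), (0.9, 0.8)]

def project(data, weights):
    """Return component scores: the weighted sum of each sample's variables."""
    return [sum(w * x for w, x in zip(weights, row)) for row in data]

# A unit-length weight vector (assumed for illustration) and its mirror image.
w = (0.6, 0.8)
w_flipped = tuple(-v for v in w)

scores = project(samples, w)
scores_flipped = project(samples, w_flipped)

# Flipping the axis negates every score but leaves the variance unchanged.
var_original = pvariance(scores)
var_flipped = pvariance(scores_flipped)
print(var_original, var_flipped)
```

Because only the variance along the axis enters the explained-variance calculation, a factor with all-negative loadings carries the same information as one with all-positive loadings.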
The grouping of Cr, Cu, Mn, Ni, and Zn within the same factor suggests a common origin, likely related to anthropogenic activities surrounding the study area, including industrial discharge, shipping operations, antifouling paints, fuel combustion, urban runoff, and metal-processing activities. Similar associations among these metals have been widely reported in marine environments influenced by industrial and harbor activities. The high absolute loading values indicate that these metals are major contributors to the variability explained by Factor 1 and therefore play an important role in controlling seawater quality in the investigated area.
The PI analysis revealed that cadmium and copper exerted small but measurable effects on the aquatic ecosystem. Chromium, lead, manganese, nickel, and zinc showed only low-to-moderate PI values, even though most of them are primary drivers of variance in the PCA (Factor 1); their current concentrations, though elevated relative to background levels, have not yet reached the threshold for severe contamination as defined by CCME guidelines. Iron demonstrated negligible contaminating influence. This distinction between statistical prominence in multivariate analysis and absolute contamination severity is important for accurate environmental risk communication.
It is important to note that the high PCA loadings for Mn, Ni, Cu, Zn, and Cr in Factor 1 reflect their strong co-variation across sampling sites, likely driven by a common industrial source, rather than necessarily indicating severe absolute contamination. This interpretation is consistent with the PI results, which classify these metals within the low-to-moderate pollution range.
These findings underscore the importance of identifying pollutant sources to improve water quality management. Future research should focus on long-term monitoring, source apportionment, and ecological risk assessments to develop effective mitigation strategies. Addressing contamination at its source will play a crucial role in ensuring sustainable marine ecosystem health in the MIC region.
3.4.3. The Performance of Machine Learning (ML) Models to Predict the WQI
The growing complexity of environmental data calls for sophisticated analytical methods that improve prediction precision and support decision-making. ML models have proven to be strong analytical tools for water quality assessment. They process extensive datasets to reveal hidden patterns, which enables dependable WQI predictions from multiple environmental parameters [50]. This research combines ML approaches to improve water quality monitoring efficiency and to support proactive environmental management strategies.
To forecast WQIs effectively, three distinct ML models were developed and implemented: DT, RF, and ANN. These ML models were created using the Spyder software environment and the Python scikit-learn module, which provides comprehensive tools for ML applications.
Based on the correlation analysis (Figure 8), features above a predetermined threshold value were selected as input variables for the ML models: chlorophyll ‘a’ (C55H72MgN4O5), ammonia (NH3), nitrate (NO3), nitrite (NO2), total phosphorus (TP), hexavalent chromium (Cr-VI), aluminum (Al), barium (Ba), cadmium (Cd), total chromium (Cr), copper (Cu), iron (Fe), lead (Pb), manganese (Mn), mercury (Hg), nickel (Ni), and zinc (Zn).
The optimal R2 threshold for feature selection was determined separately for each WQI. For AWQI prediction, features were selected using R2 thresholds of 0.01, 0.05, and 0.05 for ANN, RF, and DT, respectively. For HPI, the corresponding thresholds were 0.05, 0.20, and 0.20. For MI, they were 0.25, 0.10, and 0.10. For Cd, a threshold of 0.25 was used for all three models.
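A minimal sketch of threshold-based feature selection is given below. The thresholds are those quoted in the text; the squared-correlation values, the helper name `select_features`, and the choice of "greater than or equal to" as the cut-off rule are assumptions made for illustration.

```python
# R2 thresholds per index and model, as quoted in the text.
THRESHOLDS = {
    "AWQI": {"ANN": 0.01, "RF": 0.05, "DT": 0.05},
    "HPI": {"ANN": 0.05, "RF": 0.20, "DT": 0.20},
    "MI": {"ANN": 0.25, "RF": 0.10, "DT": 0.10},
    "Cd": {"ANN": 0.25, "RF": 0.25, "DT": 0.25},
}

def select_features(r2_by_feature, threshold):
    """Keep features whose squared correlation with the target meets the threshold."""
    return sorted(name for name, r2 in r2_by_feature.items() if r2 >= threshold)

# Hypothetical squared correlations between each feature and HPI (not from the study).
example_r2 = {"Mn": 0.62, "Ni": 0.48, "Zn": 0.21, "NH3": 0.07, "Fe": 0.02}

ann_inputs = select_features(example_r2, THRESHOLDS["HPI"]["ANN"])
rf_inputs = select_features(example_r2, THRESHOLDS["HPI"]["RF"])
print(ann_inputs)
print(rf_inputs)
```

With these made-up values, the lower ANN threshold admits one more feature than the RF threshold, which shows how the same correlation table can yield different input sets per model.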
The ML models showed strong predictive capabilities for WQI forecasting, according to the results presented in Table 8. The models captured intricate relationships between physicochemical parameters and WQI variations, which allowed them to predict water quality trends precisely. The testing-phase results in Figure 9a–d compare observed and predicted values and demonstrate the strong performance of the ML models in assessing seawater contamination levels.
This research demonstrates why ML applications remain essential for environmental monitoring and management. Future research should improve model performance through dataset expansion, input variable optimization, and the investigation of support vector machines (SVM) and random forest regression as alternative ML techniques. The integration of real-time monitoring systems with AI-driven predictive models will play a crucial role in safeguarding aquatic ecosystems and optimizing sustainable water resource management in industrial regions like MIC.
The artificial neural network model for AWQI (ANN-AWQI) achieved R2 values of 0.95 and 0.90, RMSE values of 0.86 and 1.28, and MAE values of 0.85 and 1.27 during the training and validation periods, respectively. The ANN-AWQI model uses two hidden layers with 5 neurons each and ReLU activation over 700 iterations, as shown in Figure 10a. The ANN-HPI model achieved R2 values of 0.95 and 0.89, RMSE values of 0.93 and 1.34, and MAE values of 0.41 and 0.99, respectively, for HPI prediction. Its architecture consists of two hidden layers with 4 neurons each and uses the ReLU activation function over 500 iterations (Figure 10b). The ANN-MI model achieved R2 values of 0.99 and 0.98, RMSE values of 0.08 and 0.10, and MAE values of 0.05 and 0.06 for MI prediction. Its architecture has one hidden layer with 8 neurons and uses the Identity activation function over 500 iterations (Figure 10c). The ANN-Cd model reached R2 values of 0.99 and 0.94, RMSE values of 0.08 and 0.17, and MAE values of 0.05 and 0.16 for Cd prediction. Its architecture consists of a single hidden layer with 7 neurons and the Identity activation function, run for 500 iterations (Figure 10d).
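The four architectures described above might be expressed in scikit-learn roughly as follows. This is a sketch under stated assumptions: the layer sizes, activations, and iteration counts come from the text, mapping "iterations" to `max_iter` is an interpretation, and the solver, learning rate, and `random_state` are unstated in the text and left at default or placeholder values.

```python
from sklearn.neural_network import MLPRegressor

# Architectures as reported in the text; all other settings are assumptions.
ANN_CONFIGS = {
    "AWQI": {"hidden_layer_sizes": (5, 5), "activation": "relu", "max_iter": 700},
    "HPI": {"hidden_layer_sizes": (4, 4), "activation": "relu", "max_iter": 500},
    "MI": {"hidden_layer_sizes": (8,), "activation": "identity", "max_iter": 500},
    "Cd": {"hidden_layer_sizes": (7,), "activation": "identity", "max_iter": 500},
}

def build_ann(index_name, random_state=0):
    """Create an unfitted MLPRegressor for one water quality index."""
    return MLPRegressor(random_state=random_state, **ANN_CONFIGS[index_name])

ann_models = {name: build_ann(name) for name in ANN_CONFIGS}
for name, model in ann_models.items():
    print(name, model.hidden_layer_sizes, model.activation, model.max_iter)
```

Note that with the identity activation a network of this kind reduces to a linear model, which suggests that MI and Cd may be close to linear functions of the selected inputs.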
These results show that ANNs predict WQIs with high accuracy. The high R2 values and low RMSE and MAE scores across all models indicate that ANN-based approaches are robust and reliable for modeling complex environmental data. The different network configurations and activation functions used for each index show that tuning network parameters per index leads to better prediction outcomes. Real-time data integration, further architecture optimization, and ensemble learning techniques could improve prediction accuracy and extend the environmental monitoring capabilities of ANN models. Such advances would support long-term water resource management that sustains marine ecosystems in industrial areas like MIC.
The study confirmed the predictive strength of RF models for WQI forecasting. Table 9 and Figure 11a–d present the findings obtained from the testing process. The random forest model for AWQI (RF-AWQI) achieved R2 values of 0.93 and 0.88, RMSE values of 1.05 and 1.37, and MAE values of 0.91 and 1.31 during the training and testing phases, respectively. It used 13 trees with a maximum depth of 5. The RF-HPI model achieved R2 values of 0.95 and 0.91, RMSE values of 0.89 and 1.17, and MAE values of 0.37 and 0.86, respectively. It consisted of 15 trees with a maximum depth of 6. The RF-MI model achieved R2 values of 0.98 and 0.97, RMSE values of 0.12 and 0.11, and MAE values of 0.05 and 0.06, respectively. It used 11 trees with a maximum depth of 7. The RF-Cd model attained R2 values of 0.99 and 0.92, RMSE values of 0.07 and 0.19, and MAE values of 0.04 and 0.18, respectively, with 11 trees and a maximum depth of 5. These models used MSE as the criterion function.
The good results of the RF models show that ensemble learning is useful for modeling non-linear relationships and complex interactions among water quality parameters. The high R2 values and low RMSE and MAE values across all models indicate that RF is an effective alternative to ANN for water quality prediction. RF can handle missing values, reduces overfitting through bootstrapping, and provides feature importance scores, which make it a useful tool for environmental monitoring. Optimizing hyperparameters, incorporating real-time datasets, and integrating RF with other ML techniques would enhance prediction accuracy and broaden its applicability. These advances would support data-driven, sustainable water resource management decisions that reduce industrial impacts on marine ecosystems like MIC.
The research also tested DT as a possible method to forecast WQIs for the seawater in MIC. Decision trees are widely adopted because they are simple to understand, easy to interpret, and can handle both categorical and continuous data. The method is particularly effective in modeling complex non-linear relationships between predictors and response variables, making it highly suitable for environmental data analysis. The DT models provided strong predictive results, which helped identify important water quality parameters and offer a useful tool for environmental management decision-making. The results, recorded during the evaluation phase, show that decision tree algorithms can predict WQIs accurately (Table 10 and Figure 12a–d). Throughout the training and validation stages, the decision tree model for AWQI (DT-AWQI) achieved R2 values of 0.95 and 0.90, RMSE values of 0.86 and 1.28, and MSE values of 0.85 and 1.27, respectively. The DT-AWQI model employed a maximum depth of 5 levels. The DT-HPI model achieved R2 values of 0.95 and 0.93, RMSE values of 0.86 and 1.03, and MSE values of 0.24 and 0.75, respectively. It used a maximum depth of 5 levels. The DT-MI model showed an RMSE of 0.29 and an R2 of 0.82 across both seasonal periods, and MSE values of 0.08 and 0.09 for training and testing, respectively. It implemented a maximum depth of 7 levels. The DT-Cd model achieved R2 values of 0.99 and 0.93, RMSE values of 0.06 and 0.18, and MSE values of 0.03 and 0.17, respectively. It operated at a maximum depth of 2 levels. These algorithms used MSE as the evaluation metric.
The DT models forecasted WQIs robustly because they are able to handle complex environmental data with precision. The consistent high R2 values and low RMSE and MSE scores in all models demonstrate that decision trees can be a useful tool for water quality monitoring. The results confirm that decision tree models are versatile in environmental assessments and can be used as a reliable alternative to other ML methods like ANN and RF. Future research could investigate ensemble decision tree approaches or hybrid models to boost prediction accuracy especially in real-time monitoring systems for sustainable water management in marine areas like MIC.
The study of environmental science has witnessed increasing interest in ML models because they show promise for WQIs prediction. These models, ANN, RF, and DT, are recognized for their ability to handle complex, multidimensional data and their flexibility in modeling non-linear relationships. ML models learn from historical data to generate precise and trustworthy predictions of WQIs which helps optimize water quality monitoring operations and enables better decision-making. Our research findings contribute to this expanding body of knowledge by showing how ML models excel in environmental tasks.
To prevent the models from overfitting the training data, cross-validation was applied for hyperparameter optimization. The dataset was randomly partitioned into five folds. In each validation round, one fold served as the validation set while the remaining four were used for training; this process was repeated until each fold had been used exactly once as the validation set. The final model was selected based on the average performance across folds, prioritizing high R2 and low RMSE. GridSearchCV was employed for this purpose: it takes the ML model, the hyperparameter grid, and the number of folds (k) as inputs and returns the best estimator with its optimal hyperparameters.
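The procedure above could look roughly like the following sketch. Only the use of `GridSearchCV` with five folds and the R2/RMSE criteria comes from the text; the data are synthetic, the grid values are hypothetical, and refitting on mean R2 is one possible reading of "prioritizing high R2 and low RMSE".

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the study's measurements).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=100)

# Hypothetical search grid; the text does not list the values that were tried.
param_grid = {"max_depth": [2, 3, 5, 7]}

# Five randomly assigned folds; each fold is the validation set exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid=param_grid,
    cv=cv,
    scoring={"r2": "r2", "rmse": "neg_root_mean_squared_error"},
    refit="r2",  # pick the setting with the best mean R2 across folds
)
search.fit(X, y)

best_model = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

The per-fold averages for both metrics are available in `search.cv_results_` (for example under `mean_test_r2` and `mean_test_rmse`), which allows RMSE to be inspected alongside the R2 used for selection.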
Thirty percent of the data was held out as a separate test set (70/30 train/test split). The training portion (70%) was used exclusively for cross-validation. Results of the final models on both the training and test sets are reported in Table 8, Table 9, and Table 10. The small discrepancy between training and test performance (R2, RMSE, and MAE) suggests that overfitting is unlikely, supporting our choice of k = 5 folds. In addition, Figure 10, Figure 11, and Figure 12 present scatter plots of predicted versus actual values, showing a clear positive linear trend (indicating low variance). While a few points deviate from the main cluster (suggesting minor bias), the overall relationship remains approximately linear.
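For reference, the three reported metrics can be computed as in the sketch below. The observed and predicted values are hypothetical; only the final gap calculation uses numbers quoted in the text (the ANN-AWQI training and validation R2).

```python
import math

def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical observed and predicted index values (illustrative only).
observed = [10.0, 12.0, 15.0, 11.0, 14.0]
predicted = [10.5, 11.5, 14.0, 11.5, 14.5]

metrics = {
    "R2": r2_score(observed, predicted),
    "RMSE": rmse(observed, predicted),
    "MAE": mae(observed, predicted),
}
print({name: round(value, 3) for name, value in metrics.items()})

# Train/validation gap in R2 for ANN-AWQI, using the values quoted in the text.
gap = round(0.95 - 0.90, 2)
print(gap)
```

A small gap between the training and held-out scores is the evidence the text relies on when arguing that overfitting is unlikely.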
Across the four indices (AWQI, HPI, MI, and Cd), the Diebold–Mariano (DM) test results show that no single model consistently outperforms the others; instead, the best model varies by index, as shown in Table 11. For AWQI, the DM statistics (ANN vs. RF: −1.1745, p = 0.2433; ANN vs. DT: 0.2319, p = 0.8171; RF vs. DT: 1.2370, p = 0.2193) indicate that RF is directionally preferred over both ANN and DT, though none of the differences is statistically significant. For HPI, ANN directionally outperforms RF (DM = 0.6134, p = 0.5412) and significantly outperforms DT (DM = 2.5119, p = 0.0138), while RF also beats DT directionally (DM = 0.4961, p = 0.6210), making ANN the preferred model for that index. For MI, the ANN vs. DT (−1.4646) and RF vs. DT (−1.3541) comparisons indicate that DT outperforms both, while the ANN vs. RF comparison (−1.0665) indicates that RF outperforms ANN. Thus, DT is the best-performing model for MI. For Cd, RF shows a highly significant advantage over ANN (DM = −3.3748, p = 0.0011) and a directional advantage over DT (RF vs. DT: DM = 1.6586, p = 0.1006), while DT holds a near-significant advantage over ANN (ANN vs. DT: DM = −1.9668, p = 0.0523). The strongest evidence is for RF over ANN in Cd and for ANN over DT in HPI. In practice, the optimal model should be chosen per index: RF for AWQI and Cd, ANN for HPI, and DT for MI. When a single model must be selected across all indices, RF is the most defensible choice: it leads on two indices and remains competitive on the rest.
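A minimal sketch of a Diebold–Mariano comparison is shown below for readers unfamiliar with the test. Several choices are assumptions rather than the study's stated procedure: squared-error loss, a one-step horizon with no autocorrelation correction, a standard normal reference distribution for the p-value, and a sign convention (positive favors the first model) chosen to match how the statistics are read in the text. The data are hypothetical.

```python
import math

def diebold_mariano(actual, pred_a, pred_b):
    """Return (DM statistic, two-sided p-value) under squared-error loss.

    Assumed sign convention: positive values favor model A,
    negative values favor model B.
    """
    # Loss differential: loss of B minus loss of A, observation by observation.
    d = [(y - b) ** 2 - (y - a) ** 2 for y, a, b in zip(actual, pred_a, pred_b)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((x - mean_d) ** 2 for x in d) / n  # lag-0 autocovariance
    dm = mean_d / math.sqrt(var_d / n)
    # Two-sided p-value from the standard normal approximation.
    p_value = math.erfc(abs(dm) / math.sqrt(2.0))
    return dm, p_value

# Hypothetical observations and predictions (illustrative only).
actual = [5.0, 6.2, 7.1, 5.8, 6.5, 7.4, 6.0, 5.5]
model_a = [5.1, 6.0, 7.0, 5.9, 6.4, 7.6, 6.1, 5.4]  # small errors
model_b = [5.6, 5.5, 7.9, 5.1, 7.2, 6.6, 6.7, 4.9]  # larger errors

dm_ab, p_ab = diebold_mariano(actual, model_a, model_b)
dm_ba, p_ba = diebold_mariano(actual, model_b, model_a)
print(round(dm_ab, 3), round(p_ab, 4))
```

Swapping the two models reverses the sign of the statistic and leaves the p-value unchanged, which is why each pairwise comparison needs to be reported only once.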
The ML algorithms perform consistently across the training and validation phases, which indicates reliable and stable WQI forecasting. Recent research shows that advanced multivariate regression techniques, including ANN, RF, and DT, can predict WQIs precisely. Hassan et al. [29] used the RF algorithm to predict WQI from diverse trace elements and reported 98.99% precision. Khoi et al. [28] studied WQI prediction with ANN, RF, and DT algorithms and obtained R2 values ranging between 0.68 and 0.99. Bui et al. [51] employed RF and DT algorithms with physicochemical properties as inputs and reached R2 values of 0.93 and 0.87. El Bilali et al. [52] achieved R2 values of 0.92 for both ANN and RF algorithms when predicting WQIs from physical variables. This work demonstrates that ML models can identify the input parameters that matter most for precise predictions and thereby simplify WQI modeling [53]. ML models serve as an efficient instrument for WQI evaluation and automated WQI calculation, which substantially reduces time and effort. This approach is a robust alternative to conventional WQI calculation, which demands intricate calculations together with multiple sub-index equations. We therefore recommend ML models to resource managers and water quality monitoring organizations because of their reliable and thorough results.
The ML models demonstrate robust WQI prediction ability because they maintain high performance during training and testing phases. ML models deliver a substantial advantage through their improved prediction accuracy and reduced computational needs which surpass traditional WQI calculation methods that demand complex calculations and multiple sub-index formulas. Our research along with previous studies demonstrates how these models assist resource managers and water quality monitoring agencies to optimize their monitoring programs while making better decisions. The future implementation of ML models for water quality assessments will play a crucial role in developing more effective and precise environmental management strategies.
3.4.4. Regional Benchmarking, Policy Implications, and Management Relevance
The heavy metal concentrations and pollution index values recorded at MIC acquire their full significance when placed in the context of comparable industrial coastal systems across the Arabian Gulf. Along the Al-Khobar coast, Alharbi et al. [54] found that average Zn, Fe, Mn, Cu, As, and Cr concentrations exceeded those of several worldwide seas and gulfs, with the highest levels concentrated in sheltered embayments near desalination plants and industrial facilities, a spatial pattern that mirrors the elevated HPI and MI values observed in the northeastern and eastern sectors of MIC in the present study. A broader assessment of 22 western Arabian Gulf coastal sites by Amin and Almahasheer [55] found that 82% of locations were non-polluted to slightly polluted according to the Pollution Index, with Cd emerging as the most frequently polluting metal, consistent with the present study’s PI results. In Kuwait Bay, Nour et al. [56] documented moderate heavy metal contamination in summer and low contamination in winter, the same seasonal intensification pattern observed at MIC, and attributed it to oil refining, fertilizer manufacturing, and shipping activities directly analogous to those operating at MIC. Within Qatar itself, Ghanimeh et al. [57] reported Cu and Ni contamination factor (CF) values of 12 and 60, respectively, in Doha Bay waters, both exceeding the high-risk threshold of CF = 6. The Gulf-wide meta-analysis of Swetha et al. [58] further confirmed that the Arabian Gulf is characterized by low-to-moderate contamination overall, with localized industrial hotspots, and recommended continuous monitoring and scientifically informed waste management strategies. Taken together, these comparisons position MIC as a boundary-state industrial coastal system: less severely impacted than Dammam or Doha Bay, but more impacted than the majority of western Gulf sites, and on a trajectory of worsening contamination given the ongoing industrial expansion.
These findings carry direct policy implications for environmental governance in Qatar. The MIC Environmental Guidelines specify individual discharge limits for heavy metals; however, the cumulative effect of simultaneous discharges from numerous industrial facilities creates a combined pollution load that individual permits are not designed to manage. We therefore recommend that the Qatar Ministry of Environment and Climate Change implement a Total Maximum Daily Load (TMDL) framework for MIC marine receiving waters, setting aggregate daily limits for Mn, Ni, Cr, and key nutrients. Saudi Arabia’s Executive Regulations for the Protection of Aqueous Media already employ a screening model for mixing zone determination in the Arabian Gulf, with dilution factors of 16 required for industrial areas. This regulatory approach could be adapted by Qatar. Ghanimeh et al. [57] called for the establishment of site-specific GCC background values and the enforcement of national pollution risk indicators through environmental licensing and industrial discharge permitting. These recommendations apply with equal force to MIC, given the absence of locally adapted marine water quality standards across the GCC.
Following the approach demonstrated by Painting et al. [59], who used 15 years of monitoring data from 27 sites in Bahrain’s coastal waters to derive locally calibrated baseline thresholds, MIC should establish near-pristine offshore reference stations. These stations would enable detection of deterioration trends before absolute thresholds are breached. Under Qatar’s obligations as a signatory to the Kuwait Regional Convention and its Protocol for the Protection of the Marine Environment against Pollution from Land-Based Sources, MIC monitoring data should be formally reported to the Regional Organization for the Protection of the Marine Environment (ROPME) regional database. This reporting would enable Gulf-wide trend analysis consistent with the framework of Swetha et al. [58].
The ML models developed in this study provide ready-made, cost-effective tools for near-real-time enforcement of these targets: the RF model for AWQI and Cd prediction (R2 = 0.882 and 0.920, respectively), and the ANN model for HPI prediction (R2 = 0.887). ROPME is currently implementing a Regional Action Plan on Ecosystem-Based Management (EBM). We recommend that MIC adopt an EBM approach that considers not only physicochemical thresholds but also biological indicators (e.g., bivalve bioaccumulation, algal community shifts) to capture sub-lethal ecological effects. Given that PCA Factor 1 and Factor 2 likely reflect both water column and sediment-phase contamination, coupled water–sediment monitoring programs are essential, as marine sediments act as the ultimate sink for heavy metals in the Arabian Gulf. Future research should extend this integrated framework to other Qatari industrial zones (e.g., Ras Laffan) and across the GCC to develop a harmonized regional assessment protocol.