A Comprehensive Review of Machine Learning for Water Quality Prediction over the Past Five Years

: Water quality prediction, a well-established ﬁ eld with broad implications across various sectors, is thoroughly examined in this comprehensive review. Through an exhaustive analysis of over 170 studies conducted in the last ﬁ ve years, we focus on the application of machine learning for predicting water quality. The review begins by presenting the latest methodologies for acquiring water quality data. Categorizing machine learning-based predictions for water quality into two primary segments—indicator prediction and water quality index prediction—further distinguishes be-tween single-indicator and multi-indicator predictions. A meticulous examination of each method’s technical details follows. This article explores current cu tt ing-edge research trends in machine learning algorithms, providing a technical perspective on their application in water quality prediction. It investigates the utilization of algorithms in predicting water quality and concludes by highlighting signi ﬁ cant challenges and future research directions. Emphasis is placed on key areas such as hydrodynamic water quality coupling, e ﬀ ective data processing and acquisition, and mitigating model uncertainty. The paper provides a detailed perspective on the present state of application and the principal characteristics of emerging technologies in water quality prediction.


Introduction
In recent years, the industrialization and urbanization of coastal areas have experienced increasing population pressures [1][2][3][4].A significant volume of wastewater generated by local residents is often discharged into the sea after undergoing rudimentary water treatment [5][6][7].The discharge of sewage into the receiving water body will significantly increase turbidity and organic and inorganic substances, thereby changing the living environment of marine organisms [8].Discrepancies between sewage treatment standards and marine water quality standards, particularly concerning specific key pollution values in certain countries, have led to the degradation of seawater quality due to the discharge of untreated sewage [9].Furthermore, the growing industrial sector and population necessitate additional land for infrastructure development [10][11][12], making coastal reclamation [13][14][15] a prominent topic.This activity, in turn, has instigated alterations in marine hydrodynamic factors in coastal regions [16,17].Reports illustrate the adverse effects of coastal erosion, hurricanes, typhoons [18,19], and coastal flooding [20].These occurrences not only have environmental consequences but also impact aquaculture development [21,22] and bathing areas [23,24].To address these challenges, considerable efforts have been directed towards predicting water quality and hydrodynamic movements in coastal regions.
Numerical models, which derive results from rules and data [25], have been a frequently employed method for predicting changes in water quality [8,[26][27][28] and hydrodynamic movement [29], with a variety of models available.Despite their capability to generate accurate simulations, their usage is significantly constrained by limitations.Firstly, accuracy heavily depends on the selection of model parameters [30,31], posing a challenge for beginners lacking a solid grasp of the underlying theories.Secondly, in an effort to minimize complexity, numerous assumptions are integrated into numerical models, often tailored to specific situations, rendering them less adaptable for direct application in different environments.Furthermore, long-term and large-scale predictions have proven to be exceedingly time consuming [32,33] and demand substantial computational space, making it challenging to achieve real-time and prompt results during unexpected situations.In response to these challenges, scientists have been actively seeking solutions to address these inherent shortcomings.
In recent years, advancements in computer science have enabled integrating machine learning into numerical simulations, offering high-speed and precise predictions [34,35].Initially, the application of machine learning in coastal simulation faced limitations due to the challenge of obtaining oceanic data.However, recent progress in satellite remote sensing and unmanned aerial vehicle observation has alleviated these data constraints [36][37][38].For instance, Nagur Cherukuru et al. [39] developed a semi-analytical remote sensing model, facilitating the retrieval of suspended sediment and dissolved organic carbon in coastal waters.This breakthrough enables the exploration of potential correlations between water quality metrics and satellite imagery.The substantial increase in available data, a fundamental aspect for training machine learning models [40,41], has played a pivotal role.Machine learning, as a field in computer science, seeks implicit relationships between input and output values, facilitating the rapid discovery of connections and the establishment of criteria for prediction without restrictive assumptions.The expanded dataset has significantly enhanced the effectiveness of machine learning applications.
This review primarily focuses on presenting the latest applications of machine learning technologies in the identification and prediction of water quality.Additionally, we provide an overview of the most recent methods for acquiring water quality data.Various cases of water quality prediction, including chlorophyll-a, salinity, dissolved oxygen, and water quality index prediction, are examined.Furthermore, the paper explores water quality predictions through the coupling of hydrodynamics.
The subsequent sections of the paper are organized as follows: Section 2 delves into data acquisition techniques, providing insights into the methods employed.Section 3 comprehensively discusses the most recent advancements in water quality prediction through machine learning.Finally, the conclusions drawn and directions for future work are explored in Section 4.

Acquiring Water Quality Data
The identification and collection of water pollution data constitute pivotal steps in understanding the status of water quality [42,43].Seawater quality is predominantly characterized by its chemical, physical, and biological properties.Various water parameters [5], including physical parameters (such as water temperature, total suspended solids, turbidity, and total dissolved solids), chemical parameters (such as chemical and biochemical oxygen demand, and dissolved oxygen) [44][45][46], biological parameters (such as Escherichia Coli and enterococci levels [47]), can be collected by water quality inspectors.
Remote sensing, particularly via satellites, offers broader spatial coverage [48,49] and requires less time compared to traditional field measurements.Leveraging the specific reflection wavelength characteristics of objects, remote sensing enables the extraction of data from images.For instance, water with a higher algae content exhibits the reflectance at wavelengths of 550 nm [50].Researchers have explored the potential correlation between water quality and satellite images, deriving water quality parameters from these images through empirical formulas [51,52].Remote sensing algorithms can convert spectral reflectance into chlorophyll concentration, and widely used sensors such as SeaWiFS [53][54][55][56][57], MODIS [53,57,58], CZCS, MERIS [57,59], and OLCI [60] facilitate this process.Figure 1 illustrates the steps in obtaining water quality data through remote sensing.Steps for inverting water quality data from remote sensing data.The establishment of spectral reference libraries is a prerequisite for extracting data from remote sensing monitoring [61].
In recent years, substantial strides have been conducted on leveraging optical features to obtain water quality information [62][63][64].Advancing water quality research necessitates more precise data [65,66].Consequently, methods that integrate optical features with water quality parameters have been developed.Demonstrating feasibility, the use of remote sensing to measure physical and chemical parameters in marine water has been validated, such as turbidity, suspended sediment, dissolved organic carbon, chemical oxygen demand (COD), ammonia nitrogen (NH3-N), dissolved oxygen (DO), Secchi disc depth, and total suspended solids.Vaibhav Garg et al. [67] have shown that it is possible to detect turbidity using Sentinel-2 multispectral remote sensing data through red and near-infrared wavelengths.Cherukuru et al. [39] have successfully developed a new semianalytical remote sensing inversion model for retrieving suspended sediment and dissolved organic carbon in coastal waters.High-performance inversion results were achieved for four water quality parameters: chemical oxygen demand (COD), turbidity, ammonia nitrogen (NH3-N), and dissolved oxygen (DO), indicating the potential application value of near-surface remote sensing in inland, coastal, and various water bodies [68].Yuan Fong Su et al. [69] established univariate and multivariate water quality evaluation models for retrieving sea surface reflectance using SPOT remote sensing images, applying them to SPOT multispectral images to generate distribution maps for three water quality variables: Secchi disc depth, turbidity, and total suspended solids.The study demonstrated the feasibility of utilizing satellite remote sensing images for coastal water quality mapping.
Machine learning has excelled in normalizing the difference in chlorophyll, turbidity, and the salinity index, successfully classifying water quality in Sentinel-2 images.The Classification and Regression Tree method has accurately identified macroscopic bloom locations with over 98% accuracy [70].Figure 2 shows the steps in building a machine learning model.It is crucial to emphasize that water quality indicators lacking optical mechanisms, which rely on mathematical models or intermittent methods, cannot be directly measured through remote sensing.Based on a high correlation between non-optical active parameters and optical active parameters, in a study by Hanyu Li et al. [71], Landsat 5/8 remote sensing images and measured total nitrogen (TN) and total phosphorus (TP) were utilized to investigate the modeling effects of machine learning methods.The results confirmed that machine learning algorithms are well suited for inverting non-optical activity parameters in coastal water bodies.

Utilization of Machine Learning in Water Quality Prediction
Changes in water quality in coastal areas are influenced by both natural and anthropogenic factors [72][73][74][75][76]. Pollutants in the marine environment are not only from sewage discharge but also from natural activities, whereby water bodies transport pollutants through inherent circulation processes [77,78], including rainfall, runoff [79], seawater intrusion, and tidal intrusion, ultimately merging into the ocean.Water quality pollution encompasses a broad spectrum of sources and complex causes, rendering accurate results challenging through mechanistic analysis.Nonetheless, given the extensive impact and threats posed to industrial production, human lives, and the ecological environment, realtime identification and prediction of water quality pollution are imperative.

Single Water Quality Prediction Using Machine Learning
Machine learning is well suited for predicting water quality as it can identify the factors causing changes and reveal potential complex relationships between variables [80,81] and their predicted outcomes.Machine learning models have found extensive applications across various fields [82][83][84][85].For instance, a neural network-based algorithm has been utilized to monitor turbidity in the marine environment [86].Jun Ma et al. [87] employed a combination of Deep Matrix Factorization and Deep Neural Network to accurately predict BOD values.Furthermore, Decision Forest, Decision Jungle, and Boosted Decision Tree achieved accuracy scores exceeding 99% for predicting Escherichia Coli and enterococci levels [47].

Prediction of Chlorophyll-a
Chlorophyll is a crucial water quality parameter commonly employed to assess biomass [88,89].Elevated chlorophyll values indicate eutrophication within a water body [90,91], often associated with increased pollutant input, diminished dissolved oxygen, and the emergence of toxic cyanobacteria blooms.Two primary sources provide chlorophyll data: on-site measurements, which possess inherent limitations discussed earlier, and numerical simulation technology.Despite numerous attempts to predict chlorophyll content through numerical simulations, challenges arise due to the complex interplay of physical and mixing processes involving other physicochemical parameters related to water quality, as well as external factors such as light and temperature.
Given the uncertainties associated with marine biochemical parameters, numerical simulations have not yielded entirely satisfactory results in simulating phytoplankton biomass.Growing efforts and promising results are emerging in the realm of data-driven modeling for water quality [92][93][94].For instance, Xin Yu et al. [95] employed Visible Infrared Imaging Radiometer Suite satellite data from 2011 to 2018 and to train a machine learning-based model and utilized a data interpolation empirical orthogonal function.Driven by external forcing including river discharge, nutrient loadings, solar radiation, wind, and air temperature, the data-driven model achieved an average root mean square error of 1.85 µg/L for the entire bay with overall satisfactory performance.
In a study by Yong Sung Kwon et al. [96], machine learning techniques utilizing bands from 1 to 4 obtained from Landsat-8 Operational Land Imager satellite images demonstrated satisfactory performance, highlighting the effectiveness of combining remote sensing and machine learning for estimating chlorophyll-a concentration.Huanmei Yao et al. [97] utilized the Gradient Boosting Decision Tree model to estimate Chl-a concentrations, combining Landsat 8 OLI satellite data with a nominal 30 m spatial resolution from the United States Geological Survey with field measurements.The Gradient Boosting Decision Tree model exhibited a higher accuracy (MAE = 0.998 µg/L, MAPE = 19.413%,and RMSE = 1.626 µg/L) compared to different physics models.
Acknowledging the efficacy of machine learning as a prediction method, scientists continually explore optimal machine learning-based models for prediction.Hae-Ran Kim et al. [98] determined that the integrated learning method Extreme Gradient Boosting, combined with the single model Support Vector Regression, achieved superior results compared to six other machine learning algorithms (Regression Tree, Support Vector Regression, Bagging, Random Forest, Gradient Boosting Machine, and Extreme Gradient Boosting) in predicting Chl-a concentration.In another study, Diego Gómez et al. [99] evaluated the performance of Random Forest, Support Vector Machine, Artificial Neural Network, and Deep Neural Network algorithms.ANN demonstrated better performance under specific conditions, but when considering more factors, the other three methods were preferable.The most successful outcome was achieved by the Random Forest algorithm without using any feature selection techniques, yielding R 2 = 0.92 and RMSE = 0.82 mg/m 3 .Mohebzadeh and Lee [100] employed three machine learning techniques (Support Vector Regression, Random Forest Regression, and Long Short-Term Memory) as a downscaling approach, concluding that second degree multiple polynomial regression and Support Vector Regression-Radial Basis Function could produce high-resolution Chla maps.Junan Lin et al. [101] used four machine learning models (Support Vector Regression, Random Forest Regression, Wavelet Analysis-Back Propagation Neural Network, and Wavelet Analysis-Long Short-Term Memory) to predict chlorophyll-a in coastal waters, successfully forecasting algal blooms, with Wavelet Analysis-Back Propagation Neural Network and Wavelet Analysis-Long Short-Term Memory outperforming the others.Jie Niu et al. [102] incorporated particulate organic carbon and particulate inorganic carbon as predictive factors in machine learning and deep learning models to estimate Chl-a concentration along the coast.The results showed that the Gaussian process regression model outperformed the deep learning model in terms of stability and robustness.Due to differences and varying data characteristics, a universally superior method cannot be determined; the appropriate approach should be selected based on data characteristics and local conditions.
Although the mentioned machine learning algorithms can yield highly satisfactory results, certain studies have noted limitations in algorithms like Support Vector Machines [103], attributing this to their low-computational efficiency due to the nonlinear relationship between variables and outputs.
In addition to commonly used traditional machine learning methods, increasingly specialized machine learning algorithms are being developed.Hua Su et al. [104] introduced LightGBM, surpassing traditional methods and OLCI Chl-a products.Nima Pahlevan et al. [105] presented the mixed density network (MDN) simulation applicable to MSI and OLCI data.

Prediction of Salinity
Salinity is a pivotal factor influencing the physical, chemical, and biological processes in the ocean [106][107][108].The salt content in seawater directly impacts its density [109,110], consequently influencing circulation and stratification.Salinity distribution plays a crucial role in the growth and reproduction of marine organisms.To gain a deeper understanding of salinity distribution, a method has been devised, relying on historical data to train machine learning models and extract developmental trends.Guillou et al. [111] delved into machine learning algorithms to simulate the nonlinear and intricate relationship between salinity and input parameters (such as tide-induced free-surface elevation, river discharges, and wind velocity).Priyanka Chawla et al. [112] developed an effective regression and machine learning model for predicting water quality salinity and forecasting future salinity levels based on historical records.Lal and Datta [113] created independent models (Artificial Neural Network, Gaussian Process, and Support Vector Regression) to construct both homogeneous and heterogeneous models capable of predicting salinity concentrations.The results highlight the superiority of the heterogeneous model over all independent models and numerical salt transport models.

Prediction of Dissolved Oxygen
Dissolved oxygen, representing molecular oxygen in the water environment [114,115], is influenced by factors such as atmospheric pressure, water temperature, and salt content.A decrease in atmospheric pressure, an increase in water temperature, or higher salt content can lead to a decrease in dissolved oxygen.The dissolved oxygen level in water results from the comprehensive effects of water quality and environmental conditions.Typically ranging between 5-10 mg/L in source water, it approaches saturation in natural water surfaces.Excessive reproduction of algae can cause dissolved oxygen to become supersaturated.In cases of pollution by organic and inorganic reducing substances, dissolved oxygen may decrease or approach zero, activating anaerobic bacteria and deteriorating water quality.Water hypoxia poses a severe environmental challenge in coastal areas worldwide [116], leading to significant economic losses when dissolved oxygen falls below a critical threshold, causing mass deaths of aquatic organisms [116,117].
The successful application of past cases has proven the possibility of machine learning for predicting dissolved oxygen.Eric Ariel L. Salas et al. [118] trained two machine learning algorithms, Random Forest and Support Vector Machine, to predict spatiotemporal variations in dissolved oxygen concentrations using spectral predictors derived from Sentinel-2 images, yielding accurate results.Manuel Valera et al. [119] compared the performance of Random Forest and Support Vector Machine, with Random Forest consistently performing slightly better, given its ease of tuning and training.
Despite the feasibility of machine learning-based dissolved oxygen prediction, acquiring water quality data in challenging environments remains a hurdle.Seongsik Park et al. [120] introduced redox potential as a preferred input variable, using machine learning to predict dissolved oxygen-a cost-effective method.

Prediction of Multiple Water Quality Parameters
Compared with the prediction of specific water quality parameters, methods for predicting multiple water quality parameters can achieve better results.For instance, Yuan-Fong Su et al. [69] discovered that a multivariate model considering the wavelength-dependent comprehensive impact of various seawater components on sea surface reflectance outperformed the univariate model.Identifying the model with multiple parameters which predicts the best results has been widely studied.Yong Hoon Kim et al. [121] used three machine learning methods (Random Forest, Cubist, Support Vector Regression) to predict chlorophyll-a and suspended particulate matter indicators in coastal environments, with Support Vector Regression showing superiority.Shang Tian et al. [122] compared four machine learning algorithms (Extreme Gradient Boost, Support Vector Regression, Random Forest, and Artificial Neural Network) in retrieving chlorophyll-a, dissolved oxygen, and ammonia nitrogen from inland reservoirs, with Extreme Gradient Boost showing superior performance.Patricia Jimeno-Sáez et al. [123] utilized machine learning (Multi-layer Neural Networks and Support Vector Regression) to predict chlorophyll-a levels based on target dataset information from nine different water quality parameters, demonstrating satisfactory results with the Support Vector Regression model outperforming them all.Xiaotong Zhu et al. [124] estimated chlorophyll-a, turbidity, and dissolved oxygen in the Shenzhen Bay area using an ensemble machine learning model based on Sentinel-2 satellite remote sensing images, yielding satisfactory performance.Nguyen et al. [125] applied three machine learning methods (Decision Tree, Random Forest, Gradient Augmented Regression, and Ada Augmented Regression) based on Sentinel-2 images to establish a seawater quality parameter model, with Random Forest producing the best results.Shengyue Chen et al. [126] trained Random Forest, Support Vector Machine, and Backpropagation Neural Network models using water temperature, hydrogen ion concentration, conductivity, dissolved oxygen, and turbidity as water quality datasets to estimate total phosphorus, total nitrogen, and ammonia nitrogen in small-scale coastal basins, with the Random Forest model outperforming the Support Vector Machine and Backpropagation Neural Network models in estimation performance.Table 1 lists the simulation parameters mentioned above and the best model in the simulation.Researchers select models based on the simulation area and simulation parameters.

Prediction of Coastal Water Quality Index Using Machine Learning
The water quality index, providing an overall assessment of a water body at a specific location and moment [127][128][129], is a method developed for analyzing the water quality of marine systems [130].Its primary advantage lies in the use of simple mathematical functions that can convert multifaceted information into a straightforward numerical expression, conveying the environmental status to the public.This method is relatively easy for non-professionals to comprehend.Typically, this technique comprises four crucial elements [131,132]: (i) indicator selection process [133]; (ii) sub-index process; (iii) indicator weighting process; and (iv) aggregation function.The calculation is as follows: where Qi is the water quality of a single parameter, Wi is the weight of the corresponding water quality parameter.Sangeeta Pati et al. [134] developed a water quality index employing cluster analysis to categorize data into three water quality characteristics.Discriminant analysis was then applied to generate discriminant functions, effectively measuring multiple parameters of Indian coastal waters.The results demonstrated the utility of the WQI method in handling complex nutrient data and identifying pollution sources.Values of water quality parameters corresponding to the water quality index value are shown in Figure 3. Currently, the uncertainty of the WQI model has been revealed by several studies [136][137][138][139][140]. The uncertainty of the water quality index model mainly arises from the indicator selection process and the indicator weighting process.Problems occur due to inappropriate sub-indexes, parameter weightings, inappropriate aggregation functions, or an overestimation of the WQI index that does not reflect the real information of water quality.To address these issues, some research has attempted to solve the uncertainty and provide accurate predictions of the water quality index [138,[141][142][143][144][145].Several studies have applied various machine learning algorithms such as Extreme Gradient Boosting, Support Vector Machine, Random Forest, and Decision Tree algorithms [146] to compare the algorithm performance in predicting WQIs correctly.
Michelle C. Tanega et al. [147] applied machine learning classification algorithms such as Random Forest, Decision Tree, and Support Vector to calculate the water quality index and water quality classification of Lake Taal in the Philippines.The results showed that the accuracy of Random Forest was the highest at 95.0%, followed by decision trees with the same accuracy.The accuracy of support vector machines was 93.33%.Md Galal Uddin et al. [137] proposed an improved Water Quality Index (WQI) model for predicting coastal water quality, which is more objective and data driven, and is less susceptible to masking and fuzzy errors.This model uses the machine learning algorithm Extreme Gradient Boosting to rank water quality indicators based on their impact and importance.Zohreh Sheikh Khozani et al. [148] attempted to use intelligent models with different functions to predict the water quality index.Using three different machine learning techniques, including Multi-layer Perceptron, Convolutional Neural Network, and Short-term Memory, to perform WQI prediction.All three models have shown good performance in predicting WQI, effectively shortening the calculation time and reducing errors in the derivation process of sub-indicators.Guize Liu et al. [149] proposed a prediction system based on the Support Vector Machine and Particle Swarm Optimization algorithm.The results show that the maximum error of the water pollution index prediction model for sample prediction is 2.41%, the average error is 1.24%, and the root mean square error is 5.36 × 10 −4 , with a correlation coefficient of 0.91 squared.The SVM-PSO algorithm has good sewage prediction ability.Md Galal Uddin et al. [134] conducted in-depth research on indicator selection techniques to reduce significant uncertainty in evaluation.Analyzing the effects of 18 different FS technologies, constructing 15 water quality indicator combinations, and testing the performance of the model using nine machine learning algorithms.The results indicate that the Random Forest algorithm can effectively select key water quality indicators.The Deep Neural Network algorithm predicts a subset more accurately.
These methods have achieved relatively effective results but have not fundamentally improved WQI.Md Galal Uddin et al. [136] used machine learning techniques to improve the newly developed Weighted Quadratic Mean-WQI model architecture to reduce model uncertainty.They used eight widely used machine learning algorithms, Decision Tree, Extra Tree, Extreme Gradient Boosting, Random Forest, Support Vector Machine, K-Nearest Neighbors, Linear Regression, and Gaussian Naïve Bayes, to reduce the uncertainty of the WQI model and improve the model architecture in modeling coastal WQIs.

Prediction of Water Quality through Coupling Hydrodynamics and Water Quality
The primary distinction between nearshore waters and natural rivers is that the upstream of natural river reservoirs is nearly unaffected by downstream water bodies, while nearshore water bodies are not only influenced by shore discharge but also by ocean water bodies due to tides and storms [150,151].Pollutants may be mixed and washed by seawater, making the changes in water quality more complex.Therefore, it is necessary to combine models with hydrodynamic factors to predict changes in ocean water quality.
Currently, hydrodynamic factor prediction has made outstanding progress [152][153][154], and simulation of hydrodynamic processes can yield relatively accurate results.Kai Fei et al. [155] established a numerical simulation model for storm surges coupled with hydrology and hydrodynamics, successfully calculating the relative contributions of each driving factor to water level using Extreme Gradient Boosting.Shamshirband et al. [156] proposed a nested grid numerical model that utilizes water depth and surface wind field data for wave height modeling.The combination of hydrodynamic models and machine learning can improve analysis reliability and computational efficiency [157,158].For example, historical data or simulated data from hydrodynamic models can be used to train machine learning models to predict wave heights [159,160], hurricane storm surge hazards [161,162], floods [163], erosion [164], and water level characteristics of storm surges [155].Notably, a nearshore wave and hydrodynamic prediction model by Wei and Davison [165] based on Convolutional Neural Networks can accurately predict the propagation and fragmentation of waves on nearshore slopes, including detailed wave peak bending and separation.Riaz et al. [166] achieved the prediction of near-bottom hydrodynamic conditions in rapidly changing water flows.This detailed simulation provides a reference for future analyses of water quality changes.
Based on the results of hydrodynamic prediction, it is possible to predict the water quality status of coastal areas through hydrodynamic coupling hydrology.Hung Vuong Pham et al. [167] has designed a coastal risk assessment model based on Bayesian networks, predicting the water quality parameters of seawater.This method proposes a multi-model chain approach, integrating regional and global climate models with machine learning and satellite images.It combines ocean fluid dynamics, wave fields, and coastline extraction models to predict suspended particulate matter.Importantly, this method has the potential to integrate climate and risk data, enabling more accurate predictions in the future.Bayesian [168] is a probability-based method that combines prior information about unknown parameters with sample information, using the Bayesian formula to obtain posterior information.Based on the posterior information, unknown parameters can be inferred, yielding good results in small samples.Therefore, it can be used in situations where coastal data are relatively insufficient.Cebe and Balas [169] studied the prediction of nitrite, nitrate, and dissolved oxygen concentrations using water quality coupled three-dimensional hydrodynamic methods.The hydrodynamic model simulates wind-driven nearshore water flow, cycling of nitrogen, phosphorus, and oxygen in ecological sub-models, as well as dominant aquatic organisms such as phytoplankton, zooplankton, and planktonic bacteria.

Discussion
This paper presents a comprehensive exploration of the challenges and advancements in predicting water quality in coastal areas, with an emphasis on the integration of machine learning into water quality parameter simulations.The identified challenges, including the impact of urbanization, industrialization, coastal reclamation, and environmental events such as erosion, hurricanes, and flooding, underscore the importance of accurate predictive models for maintaining and safeguarding coastal ecosystems.
This review rightly emphasizes the significant role of machine learning in overcoming the limitations of numerical models.Traditional models face challenges related to parameter selection, adaptability, and computational efficiency.The integration of machine learning, enabled by recent progress in satellite remote sensing and unmanned aerial vehicle observation, offers a promising solution.The ability to process large datasets and extract meaningful correlations between water quality metrics and satellite imagery represents a substantial leap forward.The review of water quality data acquisition methods highlights the pivotal role of remote sensing, particularly satellite technology.The exploration of optical features and the establishment of spectral reference libraries demonstrate innovative approaches to derive water quality information.Machine learning models trained on optical features prove effective in predicting water quality parameters, overcoming challenges associated with non-optical indicators.The presented steps for inverting water quality data from remote sensing maps provide a clear framework for researchers and practitioners.
This paper also provides a comprehensive overview of machine learning applications in predicting specific water quality parameters such as chlorophyll-a, salinity, and dissolved oxygen.We provide a thorough examination of various machine learning algorithms' performance in predicting these parameters.Noteworthy is the recognition that the choice of the appropriate algorithm depends on data characteristics and local conditions.The inclusion of studies comparing different algorithms for specific parameters adds valuable insights for researchers seeking optimal prediction models.
Different machine learning algorithms can be selected based on the simulated water quality parameters.The Classification and Regression Tree method can accurately identify macroscopic bloom locations [71].Decision Forest, Decision Jungle, and Boosted Decision Tree can be used to predict Escherichia Coli and enterococci levels [47].Extreme Gradient Boosting, combined with the single model Support Vector Regression and Long Short-Term Memory Long Short-Term Memory is better when estimating Chl-a concentrations.Artificial Neural Network, Gaussian Process, and Support Vector Regression can be used in predicting salinity concentrations.Random Forest performing slightly better than Support Vector Machine in the prediction of dissolved oxygen.Support Vector Regression, Extreme Gradient Boost, and Random Forest can be use in predicting multiple water quality parameters, and ensemble machine learning model is also a good choice.Random Forests, Extreme Gradient Boosting, Multi-layer Perceptron, Convolutional Neural Network, and Short-term Memory are all models which have shown good performance in predicting WQI, while the Random Forest algorithm can effectively select key water quality indicators.The Deep Neural Network algorithm predicts a subset more accurately.
The water quality index (WQI) is a crucial tool for assessing overall water quality, and this paper addresses the uncertainties associated with traditional WQI models.Machine learning algorithms, including Random Forest, Support Vector Machine, and Decision Tree, have been successfully applied to enhance WQI prediction.However, we point out that while these methods achieve effective results, they fall short of fundamentally improving WQI.Our discussion on ongoing efforts to reduce model uncertainty and improve architecture adds depth to the exploration of WQI prediction.
The integration of hydrodynamics with water quality prediction is a key focus of the current paper.The challenges posed by nearshore waters, influenced by both shore discharge and oceanic forces, necessitate a holistic approach.The success of hydrodynamic prediction models, coupled with machine learning, is evident in accurately simulating storm surges, wave heights, and other dynamic factors.Our discussion on the potential of Bayesian networks in coastal risk assessment, integrating climate and risk data, presents a forward-looking perspective on predicting water quality in coastal regions.
The presented research underscores the transformative impact of machine learning on water quality prediction in coastal areas.Our review on the limitations of current models, the need for diverse datasets, and the consideration of evolving environmental conditions points to avenues for future research.

Conclusions
This review provides a comprehensive overview of the recent advancements in machine learning applied to water quality prediction.Despite an extensive survey and comparison of existing literature, establishing a singular best-performing machine learning approach proves challenging.The efficacy of machine learning models tends to vary significantly across different parameters and regions.A promising avenue for further exploration involves a deeper analysis of water quality parameter characteristics, aiming to propose more universally applicable methodologies.Generalizing results for the intricate prediction of coastal water quality remains challenging when solely relying on machine learning models trained on data devoid of consideration for physical and chemical processes.Models with regional characteristics may lack the capacity to predict processes involving key factors beyond the training dataset.To enhance the prediction accuracy of machine learning, the following aspects can be addressed: (a) diversifying data sources and increasing data volume, where remote sensing satellite maps can serve as a reference for inverting optical characteristic parameters, necessitating further research for rapid and effective data acquisition on non-optical water quality parameters in coastal areas; (b) addressing missing data through interpolation methods such as univariate input, k-nearest neighbors' input, and multiple-input denoising techniques.This paper offers an exploration of the complexities and advancements in predicting water quality in coastal regions, providing a valuable resource for researchers, practitioners, and policymakers involved in environmental management and conservation.The integration of machine learning into numerical simulations emerges as a promising paradigm shift, offering more accurate and adaptable predictive models for safeguarding the delicate balance of coastal ecosystems.

Figure 1 .
Figure 1.Steps for inverting water quality data from remote sensing data.The establishment of spectral reference libraries is a prerequisite for extracting data from remote sensing monitoring[61].

Figure 2 .
Figure 2. Steps for building a model through machine learning.(1) Data collection: Identify the specific water quality parameters you want to predict.Gather a comprehensive dataset.(2) Data mugging: Clean the dataset by handling missing values, outliers, and irrelevant features.(3) Feature extraction: Identify relevant features that contribute to the prediction task.(4) Feature engineering: Transforming raw data into better representations of the essential features of a problem.(5) Model selection: Choose appropriate machine learning algorithms.(6) Model training: Train the selected model using the training dataset.(7) Model evaluation: Assess the model's performance on the testing dataset using appropriate evaluation metrics.

Figure 3 .
Figure 3. Values of water quality parameters corresponding to water quality index value (data from [135]).

Table 1 .
Comparison table of existing research performance.