Machine Learning Forecasts of Coastal Chlorophyll-a Based on Satellite and Model Data: A Case Assessment in the Northern Taiwan Strait

Wu, Yangcong; Jiang, Long; Lin, Heshan; Chen, Chun; Jiang, Degang

doi:10.3390/rs18121904


by Yangcong Wu ¹, Long Jiang ¹,²,*, Heshan Lin ³,⁴,⁵, Chun Chen ³,⁴,⁵ and Degang Jiang ³,⁴,⁵

¹ Key Laboratory of Marine Hazards Forecasting, Ministry of Natural Resources, Hohai University, Nanjing 210098, China
² Key Laboratory of Coastal Salt Marsh Ecosystems and Resources, Ministry of Natural Resources, Nanjing 210098, China
³ Island Research Center, Ministry of Natural Resources, Pingtan 350400, China
⁴ Fujian Provincial Key Laboratory of Island Conservation and Development, Island Research Center, Ministry of Natural Resources, Pingtan 350400, China
⁵ Observation and Research Station of Island and Coastal Ecosystem in the Western Taiwan Strait, Ministry of Natural Resources, Xiamen 361005, China
* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(12), 1904; https://doi.org/10.3390/rs18121904 (registering DOI)

Submission received: 9 April 2026 / Revised: 31 May 2026 / Accepted: 5 June 2026 / Published: 9 June 2026


Highlights

What are the main findings?

A systematic comparison of six machine learning models for chlorophyll-a (chl-a) forecast in the northern Taiwan Strait based on MODIS data showed that the models successfully captured the relatively stable seasonal chl-a variability in offshore regions, but performed poorly in forecasting the complex nearshore chl-a variability, particularly during algal blooms.
A hydrodynamic–biogeochemical model was developed for the Taiwan Strait, which successfully reproduced the variability of chl-a during blooms. Using the model outputs as inputs to the ML models further demonstrated the critical importance of data quality for improving ML-based chl-a forecasts.

What are the implications of the main findings?

This study provides a promising framework for ML emulation of mechanistic model outputs to improve computational efficiency of operational chl-a forecasts while maintaining the accuracy of mechanistic models.
This study develops models that could potentially be applied to the early warning of harmful algal blooms in the northern Taiwan Strait.

Abstract

The chlorophyll-a (chl-a) concentration is a major indicator of marine ecosystem status, harmful algal blooms, and marine primary productivity. In coastal waters, however, complex hydrodynamic and ecological conditions lead to highly variable chl-a dynamics, driven by diverse and interacting mechanisms, posing substantial challenges for chl-a forecasts. To assess the applicability of machine learning approaches in predicting chl-a under complex coastal environments, we present a case study in the Taiwan Strait, where harmful algal blooms occur a few times every year. Based on satellite remote sensing data, a spatiotemporal imputation and prediction framework (STIMP), temporal models (Transformer, CrossFormer, Tsmixer), and spatiotemporal models (MTGNN and PredRNN) were applied to simulate chl-a spatiotemporal variability. A hydrodynamic–biogeochemical model was compared with these machine learning approaches to assess the model skills in coastal chl-a simulations. Results indicate that machine learning models trained with satellite data exhibit reasonable predictive skill offshore with pronounced seasonal variability and low data missing ratio, while their performance weakens in regions where seasonal signals are masked by short-term chl-a fluctuations with more missing data. In contrast, the hydrodynamic–biogeochemical model represents short-term variations in chl-a in nearshore regions with higher temporal resolution and accounts for the underlying mechanisms of phytoplankton biomass accumulation and die-off. When trained with model output, the machine learning approach shows improved performance in coastal chl-a forecasts, with much higher computational efficiency compared to the hydrodynamic–biogeochemical model. 
This study highlights the complementary advantages of the two approaches: mechanistic models decipher the spatiotemporal scales and governing mechanisms of chl-a variability in coastal regions, whereas machine learning models extract spatiotemporal variability with high computational efficiency. With input data of sufficient temporal resolution (e.g., daily to 3 days) and duration (5–10 years), a combination of the machine learning and mechanistic modeling approaches is recommended for operational coastal phytoplankton bloom forecasting.

Keywords: MODIS; chlorophyll-a; machine learning; physical-biogeochemical models; the Taiwan Strait

1. Introduction

Chlorophyll-a (chl-a), the primary pigment in phytoplankton, is an indicator of marine phytoplankton biomass and is commonly measured to assess marine primary production, fish abundance, water quality, and ecosystem status [1,2,3,4]. Algal blooms, including harmful algal blooms, are generally associated with high chl-a concentrations. Coastal chl-a, which is highly variable in space and time, is widely sampled, surveyed, and modeled to quantify the risk of harmful algal blooms [5,6,7,8]. Forecasts of coastal chl-a are therefore highly desirable for scientists and ecosystem managers.

In coastal waters, the complexity of environmental conditions causes substantial spatiotemporal variability in chl-a in estuarine–coastal systems [9,10,11]. Phytoplankton growth is a function of nutrients, light, and temperature [12], which are driven by physical processes such as background circulation, river discharge, tides, upwelling, etc. Both physical and biogeochemical processes exert significant influences on the chl-a variability [13,14]. In the Strait of Georgia and Juan de Fuca Strait, the depth of winter halocline and summer mixing significantly affect phytoplankton abundance, exhibiting contrasting effects in the northern and central regions [15]. Across the South China Sea, light inhibition plays a dominant role in regional chl-a variability rather than carbon availability [16]. In the Pearl River Estuary, the dominant factors of chl-a variability exhibit significant depth-dependent differences, with total suspended solids dominating in nearshore (<10 m) and offshore (>30 m) regions, while sea surface temperature is the primary controlling factor in the intermediate zone [17]. Based on long-term observations from fixed monitoring stations in the North Sea, wind- and tide-driven horizontal and vertical transport are the dominant drivers of chl-a variability on hourly to daily timescales, whereas on weekly to monthly timescales, effective photosynthetically active radiation, nutrient availability, and thermal stratification play dominant roles [18]. Due to the diverse environmental drivers and complicated interactions, chl-a forecasts are especially challenging.

Currently, chl-a forecasts include both mechanistic and statistical approaches. Mechanistic forecasting relies primarily on hydrodynamic–biogeochemical models, in which physical processes are coupled with ecological models to simulate the spatiotemporal variations in temperature, velocity, nutrients, and phytoplankton biomass, and has been applied across a wide range of environments, including lakes, reservoirs, coastal waters, and the open ocean [19]. At regional scales, hydrodynamic–biogeochemical models are developed based on specific hydrological, meteorological, and ecological characteristics, allowing targeted optimization of simulation performance, and applied in marine primary productivity analysis, marine pollution assessment, and harmful algal bloom prevention [20,21,22,23,24]. Recently, statistical models were equipped with machine learning (ML) methods, which extract the complicated nonlinear relationship between phytoplankton biomass and environmental conditions and are common in operational forecasting applications. By learning nonlinear relationships directly from large volumes of satellite observations, in situ measurements, and environmental variables (e.g., sea surface temperature, photosynthetically active radiation, and dissolved oxygen), ML models capture complex spatiotemporal patterns in chl-a variability without explicitly prescribing underlying physical or biological processes, and demonstrate strong performance in short- to medium-term forecasting across diverse marine environments [25,26,27]. However, the comparison and combination of statistical and mechanistic models have not been widely discussed in the same region.

In both approaches, satellite chl-a data provide ideal model input owing to their broad spatial coverage and high temporal frequency. To assess the applicability and performance of both modeling approaches in chl-a forecasts, this study develops both mechanistic and ML models for the northern Taiwan Strait (Figure 1), a region controlled by coastal currents where harmful algal blooms occur several times each year. We incorporated remote sensing chl-a and multiple environmental variables as mechanistic and ML model inputs and used several ML methods to assess their applicability in forecasting short-term spring blooms and seasonal variations. In addition, hydrodynamic–biogeochemical model outputs were used to train an ML model to explore the potential for improving prediction skill and computational efficiency. The pros and cons of ML models based on satellite data are discussed in light of the model performance, and recommendations are given for applications in this and similar estuarine–coastal regions.

2. Materials and Methods

2.1. The Study Area

The Taiwan Strait connects the East and South China Seas and is bounded by Taiwan Island to the east and mainland China to the west. The strait circulation is largely shaped by the seasonal variations in the southward Zhe-min Coastal Current (ZMCC) and the Taiwan Strait Current (TSC) (Figure 1). Driven by the winter monsoon, the ZMCC carries cold, buoyant, and nutrient-rich waters from the Yangtze River along the western coast. The ZMCC weakens in spring and disappears in summer. In contrast, the TSC intensifies in summer and transports a mixture of warm and saline South China Sea and Kuroshio waters toward the East China Sea [28,29].

Driven by the seasonal alternation of water masses, chl-a in the Taiwan Strait displays marked spatial and seasonal variability [30,31]. At the basin scale, chl-a is generally higher in winter and lower in summer. In the northern Taiwan Strait, the highest surface chl-a occurs in nearshore regions during winter, while in the southern strait, particularly in the upwelling zone close to the western coast, chl-a peaks in summer.

2.2. Field and Remote Sensing Data

Local surface chl-a was measured by a fluorescence sensor (EXO Total Algae PC Smart Sensor, YSI, Yellow Springs, OH, USA) mounted on a buoy from January to June in 2023 and 2024 and was validated through calibration against lab-analyzed samples (Figure 1b). The hourly buoy chl-a data were averaged daily to suppress outliers and abrupt peaks in the time series. The chl-a used for the ML models was extracted from the Moderate Resolution Imaging Spectroradiometer (MODIS) Level-3 daily mapped chl-a product (4 km spatial resolution) generated by the OCI algorithm, covering the period from 2003 to 2024. Daily observations were subsequently aggregated into 8-day composites to reduce the effect of cloud cover and improve spatial coverage in the study region. No additional coastal-water correction was applied beyond the standard quality control procedures provided with the MODIS product.
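The 8-day compositing step can be illustrated with a short sketch. It assumes the daily MODIS fields for one year are held in a NumPy array of shape (time, lat, lon) with NaN marking cloud-covered cells, and that a composite is the arithmetic mean of the valid days in each window; the exact averaging used to build the composites is not specified here, and the function name is hypothetical.

```python
import numpy as np

def composite_8day(daily, window=8):
    """Average a (time, lat, lon) daily array over fixed windows, ignoring NaN gaps.

    Assumption: a composite is the mean of the valid days in each window.
    The last window of a year is shorter (5 or 6 days). Cells with no valid
    observation in a window stay NaN.
    """
    n_steps = int(np.ceil(daily.shape[0] / window))
    out = np.full((n_steps,) + daily.shape[1:], np.nan)
    for k in range(n_steps):
        block = daily[k * window:(k + 1) * window]
        valid = np.isfinite(block)
        count = valid.sum(axis=0)
        total = np.where(valid, block, 0.0).sum(axis=0)
        # Divide only where at least one valid day exists; elsewhere keep NaN.
        np.divide(total, count, out=out[k], where=count > 0)
    return out
```

Applied to a 365-day or 366-day year, this yields 46 time steps, which matches the 46-step annual sequences used in the forecasting experiments.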

Several environmental variables were collected for ML model training. Daily sea surface temperature (SST) at a resolution of 0.25° was obtained from the Optimum Interpolation Sea Surface Temperature (OISST) product. Daily sea surface salinity (SSS) at a resolution of 40 km was derived from the Soil Moisture Active Passive (SMAP) satellite product. Atmospheric forcing variables, including local precipitation over the Min River basin and winds at 10 m (u10 and v10), were extracted from the fifth-generation atmospheric reanalysis of the global climate (ERA5) by ECMWF (European Centre for Medium-Range Weather Forecasts), which offers hourly data at a spatial resolution of 0.25°. The daily mean photosynthetically active radiation (PAR) was derived from the MODIS Level-3 product. Monthly mean Yangtze River discharge data were obtained from the Changjiang Water Resources Commission of the Ministry of Water Resources. The above environmental variables were interpolated to the same spatial and temporal resolution as the chl-a data. Given the coarse spatial resolution and limited temporal range (2015–2024) of the SSS data, spatially averaged SSS over the ML model domain was employed, and climatological means were used in missing years. All environmental variables, whose spatial and temporal resolutions and data lengths are listed in Table 1, were regridded and temporally aggregated to match the spatiotemporal dimensions of the satellite chl-a dataset before ML model training. Precipitation and river discharge data are one-dimensional, i.e., without a spatial dimension.

2.3. Machine Learning Models

2.3.1. Imputation of Satellite Chlorophyll-a Data

Satellite chl-a was divided into training (2003–2015) and validation (2016–2024) periods. Owing to cloud cover, the mean chl-a missing rate is 64% during this period (Figure 2a). The spatiotemporal imputation and prediction (STIMP) model was employed to reconstruct missing gaps from incomplete satellite data based on the spatial and temporal dependencies in chl-a variability. STIMP is a Transformer-based deep-learning framework for spatiotemporal imputation and prediction, which demonstrates superior imputation performance across different missing rates compared with the traditional data interpolating empirical orthogonal function (DINEOF) method [27]. STIMP was trained by randomly masking 10–90% of the original valid MODIS data, and the imputation performance was assessed on the masked values. The STIMP-imputed chl-a was used as the input predictor, while MODIS data without imputation served as ground truth labels for both training and validation (Table 2).

2.3.2. Time-Series Forecasts Based on Satellite Data

Six statistical models, including persistence, climatology, seasonal climatology, linear regression, random forest, and autoregressive models, were implemented to forecast chl-a based on the time series of the MODIS product. All models were trained on the 2003–2015 dataset and designed to produce 1-year lead forecasts (46 time steps) throughout the period 2016–2024. The persistence model used the time series from the previous year as the forecast for a given year. The output of the climatology model was a constant, the average chl-a of the last ten years, while the seasonal climatology model calculated each time step as the mean chl-a at that time step over the last ten years. After training with the imputed MODIS dataset from 2003 to 2015, the linear regression, random forest, and autoregressive models forecast one year of chl-a based on input from the previous year (Table 2). Specifically, linear regression models the linear relationship between the input and output time series, random forest captures nonlinear relationships with an ensemble of decision trees, and the autoregressive model predicts based on temporal dependence in historical chl-a sequences.
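The three reference baselines described above are simple enough to sketch directly. The example assumes a one-dimensional chl-a series at a single location made up of whole years of 46 time steps each, with NaN for any remaining gaps; the function names are hypothetical and the code is an illustration, not the implementation used in this study.

```python
import numpy as np

STEPS_PER_YEAR = 46  # number of 8-day composites per year, as stated in the text

def persistence(history):
    """Forecast the next year as a copy of the most recent year."""
    return history[-STEPS_PER_YEAR:].copy()

def climatology(history, n_years=10):
    """Forecast a constant: the mean chl-a over the last n_years."""
    recent = history[-n_years * STEPS_PER_YEAR:]
    return np.full(STEPS_PER_YEAR, np.nanmean(recent))

def seasonal_climatology(history, n_years=10):
    """Forecast each time step as its mean over the last n_years."""
    recent = history[-n_years * STEPS_PER_YEAR:]
    # Rows are years, columns are time steps within the year.
    return np.nanmean(recent.reshape(-1, STEPS_PER_YEAR), axis=0)
```

Each function returns a 46-step forecast, so the outputs can be compared step by step with the validation year.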

2.3.3. Spatiotemporal Forecasts Based on Satellite Data

With STIMP included, five other ML models were applied for forecasts of the spatiotemporal chl-a variations based on the imputed data, including Transformer (a neural network architecture that excels at processing sequential data), Crossformer (a Transformer-based model that explicitly utilizes cross-dimension dependency for multivariate time-series forecasting), Tsmixer (Time-Series Mixer), PredRNN (recurrent neural networks for predictive learning), and MTGNN (Multivariate Time-Series Forecasting with Graph Neural Networks). Transformer and Crossformer rely on attention mechanisms to capture long-range and multiscale temporal relationships, with Crossformer providing an improved ability to capture interactions among different features [32,33]. Tsmixer is an MLP (Multi-layer Perceptron)-based multivariate time-series forecasting model [34]. PredRNN extends recurrent neural networks with spatiotemporal memory to model the evolution of spatial fields, while MTGNN integrates temporal convolution with graph neural networks to capture inter-variable and spatial dependencies [35,36].

The performance of ML models was evaluated by comparing model output with non-imputed MODIS chl-a data both inshore and offshore (Figure 1b), with distinct seasonal chl-a characteristics (Figure 2b,c).

All ML forecasting experiments were conducted under identical training conditions. Prior to training, all predictors were independently standardized using z-score normalization based on the statistics of the training dataset, and the outputs were transformed back to the original units through inverse normalization. The learning rate was set to 10⁻⁴, and both the batch size and the hidden size were set to 8. The Adaptive Moment Estimation (Adam) optimizer and the mean squared error (MSE) loss function were used for model training. Each model was trained for 120 epochs, and the model with the lowest validation mean absolute error (MAE) was retained. To obtain a preliminary estimate of forecast stability, each experiment was repeated using five fixed seeds under the same training conditions. Models were configured with an input sequence of 1 year. The output sequence was set to 1 year for long-term forecasting experiments. Detailed information on each ML model is given in Table 2 and Table 3.
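The normalization step can be sketched as follows. The example assumes the predictors are stacked in an array of shape (time, n_variables); the guard against zero variance and the function names are additions for illustration only.

```python
import numpy as np

def fit_zscore(train):
    """Compute per-variable mean and standard deviation from the training set only."""
    mean = np.nanmean(train, axis=0)
    std = np.nanstd(train, axis=0)
    # Assumption: constant predictors are left unscaled to avoid division by zero.
    std = np.where(std > 0, std, 1.0)
    return mean, std

def normalize(x, mean, std):
    """Standardize data using statistics fitted on the training set."""
    return (x - mean) / std

def denormalize(z, mean, std):
    """Map model outputs back to the original units."""
    return z * std + mean
```

Fitting the statistics on the training period alone and reusing them for the validation period keeps information from the validation years out of the model inputs.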

2.3.4. Including Environmental Variables in ML Forecasts

In the first forecasting experiments, only the satellite chl-a was used as input to the ML models. To potentially improve forecasting skill, environmental datasets including SST, PAR, winds (u10, v10), precipitation, and the Yangtze River discharge were incorporated to account for their influences. SST and PAR were included because of their controls on phytoplankton growth and photosynthesis. Winds may affect nutrient transport and water-column stratification and were incorporated to represent these potential influences on chl-a variability. Because long-term discharge data were unavailable, precipitation over the Min River basin was used as a proxy for the Min River discharge, which represents local terrestrial nutrient input. The Yangtze River discharge was included as a remote source of nutrients because the western coast of the Taiwan Strait is strongly influenced by coastal currents transporting nutrient-rich water originating from the Yangtze River plume. SSS data were incorporated to indicate the overall influence of freshwater and terrestrial input. Environmental variables were used as lagged predictors with the same input sequence as chl-a, and no future environmental observations from the forecasting period were used as model inputs.

In the STIMP model run incorporating all these environmental variables, the SHAP (SHapley Additive exPlanations) analysis was conducted to quantify the contribution of each input variable to chl-a prediction.
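A minimal sketch of how mean absolute SHAP values can be summarized into relative importance is given below. It assumes the SHAP values have already been computed and flattened into an array of shape (n_samples, n_features); expressing the result as a percentage share is an assumption for illustration, and the function name is hypothetical.

```python
import numpy as np

def relative_importance(shap_values, names):
    """Rank features by their share of the total mean absolute SHAP value.

    shap_values: precomputed array of shape (n_samples, n_features).
    Returns (name, percent) pairs sorted from most to least important.
    """
    mean_abs = np.mean(np.abs(shap_values), axis=0)
    share = 100.0 * mean_abs / mean_abs.sum()
    order = np.argsort(share)[::-1]
    return [(names[i], float(share[i])) for i in order]
```

For example, `relative_importance(values, ["chl-a", "SST", "SSS"])` returns the three predictors ordered by contribution, with shares that sum to 100.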

2.4. The Hydrodynamic–Biogeochemical Model

A three-dimensional hydrodynamic–biogeochemical model was implemented for the Taiwan Strait region (Figure 1a), downscaled from a model of the entire China Seas [37,38] and run from January to June during 2022–2024. The hydrodynamic processes were simulated using the General Estuarine Transport Model (GETM), which was online coupled with an ecological module through the Framework for Aquatic Biogeochemical Models (FABM) [39]. The model employed a horizontal resolution of 1 km with 10 vertical sigma layers, and bathymetry was derived from the General Bathymetric Chart of the Oceans (GEBCO_2019) and locally measured bathymetry around the Pingtan Island. Major river discharges were prescribed (Figure 1b), and open-boundary conditions for sea level, currents, temperature, and salinity were provided by the parent model. The meteorological forcing was obtained from the ERA5 reanalysis product.

Open-boundary conditions of water elevation, currents, temperature, and salinity were provided by the China Seas model [37,38], while the initial hydrodynamic conditions were derived from the Simple Ocean Data Assimilation (SODA) reanalysis data. The open-boundary and initial nutrient conditions were provided by World Ocean Atlas (WOA) climatological data. The monthly discharge data for the Yangtze River and rivers in Fujian Province were sourced from the Yangtze Water Resources Commission and the Fujian Water Resources Department.

The ecological component of the coupled model was based on a nitrogen-driven NPZD (Nutrient–Phytoplankton–Zooplankton–Detritus) framework implemented in FABM [40]. The model resolves key pelagic nitrogen pools, including dissolved inorganic nitrogen (DIN), phytoplankton, zooplankton, red Noctiluca scintillans (RNS), and detritus, together with a benthic detritus compartment (Figure 3). Biological source–sink processes, including DIN uptake, grazing of RNS on phytoplankton, and the excretion, egestion, and mortality of RNS, among other biological processes, followed established formulations (Table 4). All pelagic state variables were advected and diffused by the hydrodynamic fields simulated in GETM. Chl-a was converted from the modeled phytoplankton concentration using the chl-a:N ratio, which varies under different light and nutrient conditions in coastal waters [41,42]. For lack of observational data in the study area, we adopted a constant ratio of 1.58 g chl-a mol N⁻¹ in the ecological model, a value that has been applied in models of the surrounding East China Sea [43,44].
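As a minimal illustration of the conversion described above, the sketch below applies the constant ratio to a phytoplankton nitrogen concentration. It assumes the modeled phytoplankton is expressed in mmol N m⁻³ (the unit is not stated here), in which case 1.58 g chl-a mol N⁻¹ equals 1.58 mg chl-a per mmol N and the result is in mg m⁻³. The function name is hypothetical.

```python
# Constant chl-a:N ratio from the text: 1.58 g chl-a per mol N,
# which is numerically the same as 1.58 mg chl-a per mmol N.
CHL_TO_N = 1.58

def phyto_n_to_chl(phyto_mmol_n_m3):
    """Convert phytoplankton nitrogen (assumed mmol N m^-3) to chl-a (mg m^-3)."""
    return CHL_TO_N * phyto_mmol_n_m3
```

Under this assumption, a phytoplankton biomass of 10 mmol N m⁻³ corresponds to 15.8 mg chl-a m⁻³.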

To investigate the potential of ML emulation of mechanistic chl-a modeling, outputs from the GETM–FABM model, including SST, DIN, and chl-a during 2022–2024, were interpolated to serve as inputs for Transformer to emulate the spatiotemporal evolution of the mechanistic model outputs, which provides a balance between model complexity and computational efficiency given the high-resolution outputs of the mechanistic model. In the three-year period, the first two years (2022–2023) and last year (2024) were used for training and validation of the Transformer, a model which was configured with an input sequence of 14 days, and output sequence of 1, 3, and 7 days, respectively (Table 2). The selection of the 2022–2024 period was constrained by the availability of buoy data and environmental observations required for validating the mechanistic model. Extending the simulations to earlier periods without sufficient validation by observational data may result in larger uncertainties in the mechanistic model.
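The input and output sequences described above can be built with a sliding window. The sketch assumes the interpolated GETM-FABM outputs are arranged as a daily array of shape (time, features); the function name and the array layout are illustrative assumptions rather than the configuration used in this study.

```python
import numpy as np

def make_windows(series, n_in=14, n_out=7):
    """Split a (time, features) array into paired input and target sequences.

    Each sample uses n_in consecutive days as input and the following
    n_out days as the forecast target.
    """
    n_samples = series.shape[0] - n_in - n_out + 1
    if n_samples <= 0:
        raise ValueError("series is shorter than n_in + n_out")
    x = np.stack([series[i:i + n_in] for i in range(n_samples)])
    y = np.stack([series[i + n_in:i + n_in + n_out] for i in range(n_samples)])
    return x, y
```

Calling the function with `n_out` set to 1, 3, or 7 gives the sample pairs for the three forecast lead times.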

3. Results

3.1. Imputation of the Missing Satellite Chlorophyll-a Data

To assess the effect of data imputation, the valid MODIS chl-a was randomly masked to generate missing ratios ranging from 0.1 to 0.9. Each masking experiment was repeated five times, and the median root mean square errors (RMSEs) are shown in Figure 4. As the missing ratio increases, the correlation coefficient (r) between the imputed and originally excluded chl-a decreases from 0.98 to 0.89, while the RMSE increases from 0.27 mg m⁻³ to 0.63 mg m⁻³ (Figure 4). Under the average missing conditions of the study region (64%, Figure 2a), the STIMP model is expected to fill the data gaps with an r of 0.95–0.96 and an RMSE of 0.37–0.42 mg m⁻³. In the nearshore regions where the missing ratio exceeds 80%, the imputed data deviate substantially from the observations and tend to underestimate the remote sensing data (Figure 2b). It should be noted that missing satellite data usually cluster at certain locations and in certain periods, which makes imputation more challenging than in the training dataset with random masking. Overall, the STIMP model exhibits reasonable performance in reconstructing the spatiotemporal chl-a field in the study region, although the imputation errors would increase substantially in nearshore regions with extremely high (>80%) missing ratios.
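The masking experiment can be sketched as below. The example hides a chosen fraction of the valid cells and scores any imputed field against the held-out values; the imputation model itself is outside the sketch, and the function names are hypothetical.

```python
import numpy as np

def random_mask(field, ratio, rng):
    """Hide a fraction `ratio` of the valid (finite) cells.

    Returns the masked copy and the flat indices of the hidden cells.
    """
    valid = np.flatnonzero(np.isfinite(field))
    n_hide = int(round(ratio * valid.size))
    hidden = rng.choice(valid, size=n_hide, replace=False)
    masked = field.copy()
    masked.flat[hidden] = np.nan
    return masked, hidden

def score(imputed, truth, hidden):
    """Correlation coefficient and RMSE over the held-out cells only."""
    a = imputed.flat[hidden]
    b = truth.flat[hidden]
    r = np.corrcoef(a, b)[0, 1]
    rmse = np.sqrt(np.mean((a - b) ** 2))
    return r, rmse
```

A typical use would loop over ratios from 0.1 to 0.9 with `rng = np.random.default_rng(seed)`, impute each masked field, and take the median score over the repeated runs. Because the hidden cells are drawn uniformly at random, the sketch shares the limitation noted above: real cloud gaps are clustered in space and time.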

3.2. Forecasting Skills of Machine Learning Models

With only chl-a and no other environmental variables as model input, all ML models displayed distinct skills between the nearshore and offshore regions of the study area (Figure 5). All models show much lower deviations from the validation MODIS chl-a data in the offshore areas, with median RMSE values below 0.6 mg m⁻³. Among these ML models, the forecast deviations are close to each other in the offshore waters, whereas in the inshore waters the STIMP model yields the smallest deviations and the MTGNN model the largest. Taking the best-performing STIMP model as an example, most nearshore regions show no significant or even negative correlations with the validation MODIS chl-a data, and the seasonal chl-a pattern is not well resolved (Figure 6 and Figure 7). In contrast, offshore regions exhibit significant correlations and lower RMSEs between the forecasts and observations. In offshore waters, the STIMP model captures the seasonal pattern but does not depict the interannual variation well (Figure 7b).

Similar to the spatiotemporal forecasts, all statistical models based on time series show better performance in the offshore region than nearshore. The seasonal climatology model performs with the highest r of 0.58 and lowest RMSE of 0.47 mg m⁻³ offshore, similar to that of the STIMP model (Figure 8b). In contrast, all models do not capture the chl-a variability well nearshore, in which there are insignificant correlations and large RMSEs (Table 5).

When including the environmental variables (SST, SSS, Yangtze River discharge, PAR, local precipitation, winds) in the STIMP model forecasts, the model skills both inshore and offshore are not substantially improved except that RMSE reduces slightly by 0.1 mg m⁻³ in most cases, and the offshore chl-a is much better projected than the inshore in each of the scenarios (Table 6). According to the mean absolute SHAP values averaged spatially and temporally, historical chl-a was the most important predictor in the forecasting model, while the Yangtze River discharge exhibits the highest relative importance among all environmental variables (Figure 9). SST and SSS contribute >10% to the forecasting results, whereas precipitation shows minimal contributions.

3.3. The Hydrodynamic–Biogeochemical Model Output

In addition to ML models, a hydrodynamic–biogeochemical model was set up for the study area to understand the mechanisms of chl-a spatiotemporal variability. To evaluate the reliability of the model in terms of hydrodynamic and biogeochemical simulations, SST, SSS, elevation, tidal currents, and DIN were validated against independent observational datasets. Detailed validation results are provided in Text S1 of the Supplementary Material. Forced with realistic terrestrial nutrient inputs and meteorological conditions, the model showed reasonable skill in capturing the spring bloom recorded by the nearshore buoy in 2023 (Figure 10, r = 0.88, RMSE = 2.16 mg m⁻³), although it overestimated the chl-a level before the peak bloom and underestimated it afterwards. In particular, during the spring bloom, the model reproduced a rapid increase in chl-a exceeding 15 mg m⁻³ and the subsequent die-off due to nutrient depletion.

Compared with the MODIS chl-a data, the hydrodynamic–biogeochemical GETM-FABM model output offers higher temporal resolution (daily) and a zero missing rate and hence was used as the Transformer training and validation dataset in place of the satellite data. The Transformer model trained with GETM-FABM outputs exhibits much improved performance over the study region, especially in the nearshore regions. However, the model generated large RMSEs relative to the GETM-FABM output, primarily in the Min River estuary, where chl-a varies substantially in response to the riverine outflow. As the forecast lead time increases from 1 day to 7 days, the forecasted chl-a remains significantly correlated with the GETM-FABM output, although the forecasting skill decreases with lead time, as shown by the overall reduced r and increased RMSE (Figure 11). At a lead time of 7 days, the spring bloom magnitude is largely underestimated (Figure 12). Correspondingly, the RMSE increased by 17% for the 3-day forecast and by 73% for the 7-day forecast relative to the 1-day forecast. The spring bloom predicted by the Transformer model is always delayed by a few days, which indicates that the bloom forecast relies highly on the chl-a increase in the input data and that the Transformer model is incapable of learning bloom triggers solely from the environmental conditions.

4. Discussion

4.1. The Overall Forecasting Performance of Machine Learning Models

Multiple ML models exhibit similar skills in forecasting the remote sensing chl-a in the northern Taiwan Strait, with the STIMP model slightly better than the others in the study area. All ML models perform better offshore than in inshore waters, primarily as a result of the relatively low data missing rate offshore. In nearshore regions, the models even exhibit negative but statistically insignificant correlation coefficients with MODIS chl-a observations, indicating an inability to capture the seasonal variability of chl-a. In addition to the influence of the high data missing rate and the scale mismatch between MODIS data and algal blooms, the negative correlations may also be attributed to the complicated drivers of chl-a variability and the interactions of environmental factors. Chl-a in the offshore waters displays a distinct winter-high and summer-low seasonal pattern (Figure 2c), whereas the inshore remote sensing chl-a lacks clear seasonal variability. Compared with the buoy data, the temporal resolution of remote sensing chl-a (8 days) is too low to delineate the initiation, development, and die-off of the nearly 2-week-long phytoplankton bloom (Figure 13). Thus, the satellite chl-a data show a high standard deviation during April and May, the season of the spring bloom, while the chl-a magnitude is hardly higher than in other seasons. The forecast based on training with the MODIS chl-a does not resolve the spring blooms well, similar to the satellite data (Figure 13). Hence, because of the low temporal resolution for depicting phytoplankton blooms and the relatively high percentage of missing data in inshore waters, training with the MODIS chl-a data limits the predictive skills of ML models, despite the data imputation efforts.

ML models in our study do not resolve the interannual variations well in either inshore or offshore waters, which likely results from the lack of mechanistic interpretation of environmental influences on phytoplankton growth. The complex hydrodynamic conditions of the Taiwan Strait give rise to multiple, tightly coupled drivers governing chl-a variability. During harmful algal bloom periods (April–June), factors such as increasing SST, nutrient input from river discharge, and upwelling induced by southerly winds have been identified as major contributors to high chl-a along the western coast of the strait [30,45]. Outside bloom periods, depleted nutrients during summer lead to reduced chl-a, while in autumn and winter, despite the accumulation of nutrients along the western coast under the influence of the northeast monsoon, increased suspended matter content may limit phytoplankton growth [46,47]. In northern Taiwan Island, seasonal chl-a variability has been linked to changes in ocean currents and tidal dynamics [48].

When adding the environmental variables (nutrients, light, and temperature, etc.) that play a combined role in phytoplankton proliferation, ML model skills do not improve evidently. It implies that the present ML configuration does not appear to capture the dominant mechanisms in regulating the timing and magnitude of the bloom but only the variation patterns of chl-a. The quality of the environmental dataset may prevent the ML models from well accounting for environmental influences on the chl-a variations. Temporally, changing high-frequency environmental variables, such as the hourly winds and precipitation and daily SST and SSS data (Table 1), into the 8-day resolution, to agree with that of chl-a may potentially cause the loss of the short-term (e.g., 3–5 days) environmental changes driving fast phytoplankton responses. The Yangtze River discharge, which is quantified as the most influential variable in the forecast (Figure 9), has much lower temporal resolution (monthly) than that of the chl-a data, which likely limits the forecasting skills. SSS, the second most influential environmental factor, is available after 2015 and mismatches with chl-a data temporally. Spatially, the resolution of all environmental datasets except PAR are much coarser than that of chl-a, which may not adequately resolve fine-scale variability, particularly in nearshore regions. For issues of land contamination and land fraction masking, SSS suffers from a high missing ratio in shallow nearshore waters, limiting the predictive skills when including it in the ML models. Thus, the SHAP analysis further shows that although environmental variables contribute to the forecasting results, the ML model still relies primarily on historical chl-a information. Therefore, despite adding the environmental variables, the limitation of the chl-a training dataset (the MODIS product) still dominates the forecasting performance. 
The strong coupling among multiple drivers and their region-specific influences makes it difficult for ML models to decipher the underlying nonlinear processes. In addition to the environmental controls, it is difficult for ML models to capture dynamic physical processes such as ocean circulation, which constrains their performance in regions influenced by complex coastal currents [49]. In contrast, the mechanistic model provides a fine-scale (1 km, daily) dataset for ML model training and testing, which likely contributes to the improved skill in capturing environmental controls on chl-a variability and in modeling nearshore chl-a. These results further imply that the quality of the training datasets of both chl-a and environmental variables appears to prevent the ML models from achieving higher predictive skill, particularly in nearshore regions with highly dynamic chl-a variability, complicated environmental controls, and a high missing ratio.
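
The temporal aggregation issue raised above (hourly or daily drivers averaged to the 8-day resolution of chl-a) can be illustrated with a minimal sketch. The daily series, the 4-day pulse, and the helper `block_mean` are hypothetical and are not the preprocessing used in this study; the sketch only shows how a short-lived event is damped once it is averaged into 8-day bins.

```python
from statistics import mean


def block_mean(series, window=8):
    """Average consecutive, non-overlapping blocks of `window` samples.

    A trailing partial block is averaged over the samples it contains.
    """
    return [mean(series[i:i + window]) for i in range(0, len(series), window)]


# Hypothetical daily driver (arbitrary units): flat background of 1.0
# with a 4-day pulse of 5.0 on days 10-13 (zero-based indices).
daily = [1.0] * 32
for day in range(10, 14):
    daily[day] = 5.0

eight_day = block_mean(daily, window=8)   # [1.0, 3.0, 1.0, 1.0]

daily_peak = max(daily)        # 5.0 at daily resolution
binned_peak = max(eight_day)   # 3.0 after 8-day averaging
```

In this toy case the anomaly above background drops from 4.0 to 2.0, and its timing within the bin is no longer recoverable, which is the kind of information loss that could limit the response of an ML model to fast environmental changes.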

In longer-term forecasts, ML forecasts tend to decay toward the climatological state, and even the simple seasonal climatology achieves predictive skill similar to that of the STIMP model. The lack of interannual variations in ML-predicted chl-a reveals that the ML models in this study, owing to either the data quality or the algorithms, are incapable of linking chl-a to environmental drivers or sustaining physically driven variability over extended timescales [50].
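
A baseline check of this kind can be sketched as follows. The definitions are assumptions made for illustration: the seasonal climatology is taken as the mean of each within-year time step over the training years, and persistence as a repeat of the last training year. The numbers are synthetic (four steps per year instead of the 46 eight-day steps used in this study), and the function names are hypothetical.

```python
import math


def seasonal_climatology(years):
    """Mean of each within-year time step across the training years."""
    n_steps = len(years[0])
    return [sum(year[s] for year in years) / len(years) for s in range(n_steps)]


def rmse(pred, obs):
    """Root mean square error between two equally long series."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


# Synthetic chl-a series: the peak at step 1 varies from year to year.
train_years = [
    [1.0, 3.0, 2.0, 1.0],
    [1.0, 5.0, 2.0, 1.0],
    [1.0, 7.0, 2.0, 1.0],
]
obs = [1.0, 4.0, 2.0, 1.0]          # hypothetical validation year

clim = seasonal_climatology(train_years)   # [1.0, 5.0, 2.0, 1.0]
persistence = train_years[-1]              # repeat the last training year

rmse_clim = rmse(clim, obs)                # 0.5
rmse_persist = rmse(persistence, obs)      # 1.5
```

Any ML forecast whose error is not clearly below `rmse_clim` adds little beyond the seasonal cycle, which is the situation described above for the longer lead times.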

4.2. Towards Improved Forecast Skills in Coastal Chlorophyll-a by Machine Learning Models

To address the limitations of ML approaches in coastal chl-a spatiotemporal forecasting, we offer several recommendations based on the findings of this study. Before forecasting, the spatiotemporal scales and governing mechanisms of the target chl-a variability should be analyzed so that the input features, model architecture, and spatiotemporal model resolution can be chosen accordingly. Inadequate scale matching may substantially limit the forecasting applicability of ML models in coastal waters [18]. From the data perspective, long-term in situ observations and buoy measurements are desirable for improving forecasts at fixed locations and have demonstrated strong performance in both short- and long-term chl-a forecasting at single or multiple stations [51,52,53]. At regional scales, remote sensing products offer two-dimensional surface datasets sufficient to support model training and validation. In this context, ML-based imputation methods can effectively reduce data sparsity and further enhance forecasting applicability by increasing data continuity. However, for applications requiring high temporal responsiveness and operational forecasting applicability, such as early warning of harmful algal blooms, special attention must be paid to the temporal resolution and accuracy of imputed satellite-derived chl-a data.

Beyond data considerations, we suggest that improving forecasting applicability in coastal waters requires the integration of statistical and mechanistic modeling approaches. The hydrodynamic–biogeochemical model in our study obeys mass balance and represents the continuous spatiotemporal evolution of variables, which favors the interpretation of chl-a dynamics under changing environmental conditions. Knowledge of the phytoplankton dynamics of the target area is necessary before conducting any statistical forecasts. In this study, the hydrodynamic–biogeochemical model, which incorporates the realistic environmental variables that influence phytoplankton dynamics, exhibits decent skill and offers better training data than the MODIS data because of its low missing-data ratio and high spatiotemporal resolution. However, the development and operation of such mechanistic models are time-consuming compared to ML models. In our configuration, simulating one year with the hydrodynamic–biogeochemical model requires approximately 6 h on 100 CPU (Central Processing Unit) cores, excluding the time for model spin-up, development, or parameter calibration. In contrast, the ML models, when deployed locally, complete training and forecasting for the same task within 30–40 min (excluding data preprocessing, imputation, and hyperparameter calibration), demonstrating a substantial reduction in computational cost. Such computational efficiency is critical for operational forecasting systems that require rapid updates. In our study, we have combined the two techniques by training and validating the ML models with the mechanistic model output, which showed improved forecast performance with the advantages of both approaches and provided substantially higher computational efficiency for operational forecasting applications.
Such hybrid strategies have been shown to enhance forecasting applicability across a wide range of spatial scales, from global interannual chl-a forecasts to small-scale lake algal bloom forecasting [54,55]. It should be noted that the Transformer model in our study only emulates the mechanistic model, so the skill of the latter sets the upper limit on the skill of the former. Hence, the ML model is not yet able to reduce biases of the mechanistic model against observations, which would require longer datasets of both observations and mechanistic model output.
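
The cost figures quoted in this section can be put side by side with a short calculation. Only the quoted numbers are used (about 6 h on 100 cores per simulated year, and 30–40 min for ML training plus forecasting); the variable names are illustrative. Because the hardware used for the ML runs is not stated here, the comparison below is of wall-clock time only, and it inherits the exclusions noted in the text (spin-up and calibration on one side; preprocessing, imputation, and hyperparameter calibration on the other).

```python
# Figures quoted in the text.
mech_hours_per_sim_year = 6.0        # wall-clock hours for one simulated year
mech_cores = 100                     # CPU cores used by the mechanistic model
ml_minutes_range = (30.0, 40.0)      # ML training + forecast, wall-clock minutes

# Resource use of the mechanistic run in core-hours.
mech_core_hours = mech_hours_per_sim_year * mech_cores      # 600.0

# Wall-clock ratio (mechanistic / ML), from the slowest to the fastest ML run.
mech_minutes = mech_hours_per_sim_year * 60.0               # 360.0
speedup_range = tuple(mech_minutes / m for m in reversed(ml_minutes_range))
# speedup_range == (9.0, 12.0)
```

On these numbers, the ML workflow is roughly 9 to 12 times faster in wall-clock terms; a comparison in core-hours would require the resources used for the ML runs, which are not given in this section.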

Despite being computationally efficient, the statistical ML model relies largely on the quality of the training data, i.e., the mechanistic model performance. For instance, the lag in bloom forecasts in our study indicates the inability of statistical models to predict abrupt changes in phytoplankton biomass and their dependence on the patterns and trends of the input data. In our application, the mechanistic model does not perform as well in 2024 as in 2023 (Figure S7 vs. Figure 10) and still needs a longer period of observational data for validation and improvement. Meanwhile, ML techniques may improve the forecasting skill of hydrodynamic–biogeochemical models, particularly through data assimilation and the optimization of model forcing fields [56].

Beyond chl-a forecasting, ML approaches have been widely applied to forecasting the evolution of marine ecosystems [57,58]. Integrating hydrodynamic–biogeochemical models with data-driven ML approaches offers a promising pathway toward the development of ecological forecasting systems in coastal waters. Such a forecast system combines the process consistency and physical interpretability of mechanistic models with the pattern-recognition capability and computational efficiency of ML models, thereby enhancing forecasting applicability across multiple ecological variables and temporal scales. Such forecasting systems are particularly valuable for operational applications, as they enable simultaneous forecasts of multiple ecosystem indicators in a timely manner and support ecosystem management under rapidly changing coastal environments. However, the substantial decline in forecasting skill with increasing lead time suggests that the present framework is more suitable for short-term operational forecasting and early-warning applications, while weekly forecasts remain limited by the accumulation of uncertainties.

As every model is based on certain simplifications and assumptions, model results should be interpreted with caution in operational applications. If possible, management decisions should rely on an ensemble of model outputs in the forecast framework. Multi-model results should be jointly evaluated and validated against observational data. Under different application contexts, the complementary strengths of individual models should be fully exploited. ML models are generally effective at capturing dominant statistical patterns and climatological behavior, making them well suited for baseline assessment and anomaly detection. In contrast, mechanistic models offer greater interpretability and provide process-level insights, particularly during extreme events or under changing environmental regimes. Within an integrated, multi-model ecological forecasting framework, such complementarity enables managers to balance predictive performance and computational efficiency with interpretability in the face of marine disasters such as harmful algal blooms and hypoxia. Our study focuses on the forecast of chl-a concentrations, which do not always translate into the occurrence of harmful algal blooms. At a minimum, a comprehensive investigation of local phytoplankton species, environmental patterns, and driving mechanisms of harmful algal blooms is required to build a knowledge-based operational system for early warnings of harmful algal blooms.
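
One simple way to summarize an ensemble of model outputs is the per-time-step mean and spread across members, as in the sketch below. The member names and values are hypothetical, the function `ensemble_summary` is illustrative, and the equal weighting of members is an assumption rather than a recommendation of this study.

```python
from statistics import mean, pstdev


def ensemble_summary(members):
    """Per-time-step mean and spread (population std) across members.

    `members` maps a model name to its forecast series; all series
    must have the same length.
    """
    series = list(members.values())
    n_steps = len(series[0])
    if any(len(s) != n_steps for s in series):
        raise ValueError("all members must have the same length")
    means = [mean([s[t] for s in series]) for t in range(n_steps)]
    spreads = [pstdev([s[t] for s in series]) for t in range(n_steps)]
    return means, spreads


# Hypothetical chl-a forecasts (mg m^-3) from three sources.
forecasts = {
    "ml": [1.0, 2.0, 3.0],
    "mechanistic": [1.0, 4.0, 3.0],
    "seasonal_climatology": [1.0, 3.0, 3.0],
}

means, spreads = ensemble_summary(forecasts)
# means == [1.0, 3.0, 3.0]; the spread is nonzero only at the second step,
# where the members disagree.
```

A large spread at a given step flags a forecast on which the members disagree, and is therefore a natural candidate for closer inspection against observations before it informs a management decision.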

5. Conclusions

By applying multiple machine learning (ML) models to forecast the spatiotemporal variability of chl-a in the northern Taiwan Strait and comparing the results with mechanistic hydrodynamic–biogeochemical model outputs, this study assesses the performance of ML approaches across different environments and temporal scales. When trained and validated with the MODIS remote sensing chl-a data, ML models demonstrate decent skill in capturing the seasonal variations in offshore waters but show limited predictability under complex environmental conditions in inshore waters. The model skill is affected by the relatively coarse temporal resolution and irregular gaps of the satellite observations, particularly in nearshore regions with highly dynamic chl-a variability. In the study area, ML models do not establish a robust relationship between chl-a and environmental variables, as the inclusion of multiple environmental variables does not lead to a substantial improvement in forecasting performance. As a result, ML models do not resolve the interannual chl-a variations inshore or offshore and show skill similar to that of the simple seasonal climatology. In contrast, a hydrodynamic–biogeochemical model incorporates continuous physical and biogeochemical processes and reproduces short-term phytoplankton dynamics such as spring blooms at higher temporal resolution. This study underlines the need for broader and longer chl-a observations in coastal waters for model validation and improvement in the study region.

These findings highlight the critical role of data quality in applying ML models to chl-a forecasts. When using remote sensing products, the spatiotemporal resolution, data accuracy, and missing-data ratios should be assessed. When the satellite data are replaced with outputs from the mechanistic hydrodynamic–biogeochemical model, model performance is enhanced owing to the improved data resolution and accuracy. The combination of mechanistic and statistical models is recommended for further chl-a forecasting applications, both to resolve the environmental controls of phytoplankton growth and to improve computational efficiency.
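
As an example of the data-quality assessment recommended above, the missing-data ratio of a gridded product can be computed per grid cell before any model is trained. The sketch assumes that missing values are encoded as NaN (real satellite products often use a fill value that would have to be converted first); the tiny grid, its values, and the function name `missing_ratio` are hypothetical.

```python
import math


def missing_ratio(stack):
    """Fraction of time steps with missing data (NaN) at each grid cell.

    `stack` is a list of 2-D grids indexed as [time][row][col], all of
    the same shape.
    """
    n_time = len(stack)
    n_rows, n_cols = len(stack[0]), len(stack[0][0])
    return [
        [sum(math.isnan(grid[r][c]) for grid in stack) / n_time
         for c in range(n_cols)]
        for r in range(n_rows)
    ]


nan = float("nan")

# Hypothetical 1 x 2 grid over four time steps:
# column 0 mimics a frequently masked nearshore cell, column 1 an offshore cell.
stack = [
    [[nan, 0.4]],
    [[nan, 0.5]],
    [[2.1, nan]],
    [[nan, 0.6]],
]

ratio = missing_ratio(stack)   # [[0.75, 0.25]]
```

Mapping this ratio over the study domain shows where imputed values dominate the training data, which is where forecast skill should be interpreted most cautiously.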

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/rs18121904/s1. The Supplementary Materials provide additional information on mechanistic model calibration and validation. The complete captions of the Supplementary Materials are as follows: Text S1: Mechanistic model calibration and validation. Figure S1: The bathymetry and model domain of (a) the China Seas and (b) Taiwan Strait and (c) a zoom-in view of Pingtan coastal seas. Locations of major rivers are marked with green dots in (a) and (b). Two stations of current velocity measurements, Nan’ao and Changle, are marked with blue triangles in (b). Observational stations around the Pingtan Island are shown in (c). Figure S2: The modeled (red) and observational (blue) water elevation at the tide gauge south of Pingtan Island in August 2017. The location of the tide gauge is shown in Figure S1c. Figure S3: The modeled (red) and observational (blue) eastward (u) and northward (v) current velocity at (a) the B1 buoy south of Pingtan Island on 22 December 2021, (b) Nan’ao station in the southwestern Taiwan Strait on 3 October 2018, and (c) Changle station north of Pingtan Island during 4–5 September 2020. Figure S4: The modeled (red) and observed (blue) sea surface temperature and salinity at the B1 buoy south of Pingtan Island from January to June 2023 and from January to June 2024. Figure S5: Taylor diagram integrating comparisons between model results and observational data, including water level, current velocity, salinity, and temperature. Figure S6: The modeled (red) and observed (blue) dissolved inorganic nitrogen (DIN) concentrations from 1 November 2022 to 30 June 2023 in three areas of the Pingtan coastal waters. Figure S7: Chlorophyll-a (chl-a) concentrations simulated by the hydrodynamic–biogeochemical model and observed at the buoy from January to June 2024.
Figure S8: Mean modeled and MODIS-derived chlorophyll-a (chl-a) concentrations from January–March and April–June 2023.

Author Contributions

Conceptualization, L.J.; methodology, L.J. and Y.W.; validation, Y.W.; formal analysis, Y.W.; investigation, Y.W., L.J. and C.C.; resources, H.L., C.C. and D.J.; data curation, Y.W.; writing—original draft preparation, Y.W.; writing—review and editing, L.J.; visualization, Y.W.; supervision, L.J.; project administration, L.J., Y.W. and C.C.; funding acquisition, L.J. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Fund of Fujian Key Laboratory of Island Monitoring and Ecological Development (Island Research Center, MNR) (2023ZD01), the Fujian Provincial Natural Science Foundation Project (General Program 2023J011384), the National Natural Science Foundation of China (42106027), the Natural Science Foundation of Jiangsu Province (BK20252044), the Jiangsu Provincial Innovation Research Program on Carbon Peaking and Carbon Neutrality (BT2024012, BT2025034), and the Key Laboratory of Coastal Salt Marsh Ecosystems and Resources, Ministry of Natural Resources (KLCSMERMNR202501).

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Chen, C.; Liang, J.; Yang, G.; Sun, W. Spatio-temporal distribution of harmful algal blooms and their correlations with marine hydrological elements in offshore areas, China. Ocean Coast. Manag. 2023, 238, 106554.
2. Richardson, A.J.; Schoeman, D.S. Climate impact on plankton ecosystems in the Northeast Atlantic. Science 2004, 305, 1609–1612.
3. Zhang, K.; Zhao, X.; Xue, J.; Mo, D.; Zhang, D.; Xiao, Z.; Yang, W.; Wu, Y.; Chen, Y. The temporal and spatial variation of chlorophyll a concentration in the China Seas and its impact on marine fisheries. Front. Mar. Sci. 2023, 10, 1212992.
4. Zohdi, E.; Abbaspour, M. Harmful algal blooms (red tide): A review of causes, impacts and approaches to monitoring and prediction. Int. J. Environ. Sci. Technol. 2019, 16, 1789–1806.
5. Boyce, D.G.; Lewis, M.R.; Worm, B. Global phytoplankton decline over the past century. Nature 2010, 466, 591–596.
6. Boyce, D.G.; Dowd, M.; Lewis, M.R.; Worm, B. Estimating global chlorophyll changes over the past century. Prog. Oceanogr. 2014, 122, 163–173.
7. Dai, Y.; Yang, S.; Zhao, D.; Hu, C.; Xu, W.; Anderson, D.M.; Li, Y.; Song, X.; Boyce, D.G.; Gibson, L.; et al. Coastal phytoplankton blooms expand and intensify in the 21st century. Nature 2023, 615, 280–284.
8. Gobler, C.J. Climate change and harmful algal blooms: Insights and perspective. Harmful Algae 2020, 91, 101731.
9. Jiang, L.; Xia, M.; Ludsin, S.A.; Rutherford, E.S.; Mason, D.M.; Jarrin, J.M.; Pangle, K.L. Biophysical modeling assessment of the drivers for plankton dynamics in dreissenid-colonized western Lake Erie. Ecol. Model. 2015, 308, 18–33.
10. Jiang, L.; Gerkema, T.; Kromkamp, J.C.; van der Wal, D.; Carrasco De La Cruz, P.M.; Soetaert, K. Drivers of the spatial phytoplankton gradient in estuarine–coastal systems: Generic implications of a case study in a Dutch tidal bay. Biogeosciences 2020, 17, 4135–4152.
11. Jiang, L.; Blommaert, L.; Jansen, H.M.; Broch, O.J.; Timmermans, K.R.; Soetaert, K. Carrying capacity of Saccharina latissima cultivation in a Dutch coastal bay: A modelling assessment. ICES J. Mar. Sci. 2022, 79, 709–721.
12. Paerl, H.W.; Hall, N.S.; Peierls, B.L.; Rossignol, K.L.; Joyner, A.R. Hydrologic variability and its control of phytoplankton community structure and function in two shallow, coastal, lagoonal ecosystems: The Neuse and New River Estuaries, North Carolina, USA. Estuaries Coasts 2014, 37, 31–45.
13. Jiang, L.; Xia, M. Wind effects on the spring phytoplankton dynamics in the middle reach of the Chesapeake Bay. Ecol. Model. 2017, 363, 68–80.
14. Jiang, L.; Xia, M. Modeling investigation of the nutrient and phytoplankton variability in the Chesapeake Bay outflow plume. Prog. Oceanogr. 2018, 162, 290–302.
15. Jarníková, T.; Olson, E.M.; Allen, S.E.; Lanson, D.; Suchy, K.D. A clustering approach to determine biophysical provinces and physical drivers of productivity dynamics in a complex coastal sea. Ocean Sci. Discuss. 2021, 2021, 1–36.
16. Wang, R.; Li, X.; Song, J.; Wang, Z.; Zhong, G.; Yuan, H.; Duan, L. Surface seawater Chlorophyll-a variability in the South China Sea: Influence of pCO₂ and co-varying environmental factors. Environ. Res. 2025, 279, 121808.
17. Ma, C.; Zhao, J.; Zhang, G. Decoding the drivers of variability in chlorophyll-a concentrations in the Pearl River Estuary: Intra-annual and inter-annual analyses of environmental influences. Environ. Res. 2025, 268, 120783.
18. Blauw, A.N.; Benincà, E.; Laane, R.W.P.M.; Greenwood, N.; Huisman, J. Predictability and environmental drivers of chlorophyll fluctuations vary across different time scales and regions of the North Sea. Prog. Oceanogr. 2018, 161, 1–18.
19. Gao, L.; Li, D. A review of hydrological/water-quality models. Front. Agric. Sci. Eng. 2014, 1, 267–276.
20. Chen, S.; Jiang, L.; Cheng, X.; Liao, G.; Gerkema, T. A physical perspective of recurrent water quality degradation: A case study in the Jiangsu coastal waters, China. J. Geophys. Res. Ocean. 2023, 128, E2022JC019607.
21. Fang, Z.; Feng, T.; Meng, Y.; Zhao, S.; Yang, G.; Wang, Y.; Wang, L.; Shao, S.; Sun, W. Impacts of coastal nutrient increases on the marine ecosystem in the East China Sea during 1982–2012: A coupled hydrodynamic-ecological modeling study. J. Geophys. Res. Ocean. 2025, 130, E2024JC021553.
22. Macías, D.; Stips, A.; Garcia-Gorriz, E. The relevance of deep chlorophyll maximum in the open Mediterranean Sea evaluated through 3D hydrodynamic–biogeochemical coupled simulations. Ecol. Model. 2014, 281, 26–37.
23. Oschlies, A.; Garçon, V. Eddy-induced enhancement of primary production in a model of the North Atlantic Ocean. Nature 1998, 394, 266–269.
24. Van Oostende, N.; Dussin, R.; Stock, C.A.; Barton, A.D.; Curchitser, E.; Dunne, J.P.; Ward, B.B. Simulating the ocean’s chlorophyll dynamic range from coastal upwelling to oligotrophy. Prog. Oceanogr. 2018, 168, 232–247.
25. He, X.; Shi, S.; Geng, X.; Xu, L.; Zhang, X. Spatial-temporal attention network for multistep-ahead forecasting of chlorophyll. Appl. Intell. 2021, 51, 4381–4393.
26. Shamshirband, S.; Jafari Nodoushan, E.; Adolf, J.E.; Abdul Manaf, A.; Mosavi, A.; Chau, K. Ensemble models with uncertainty analysis for multi-day ahead forecasting of chlorophyll a concentration in coastal waters. Eng. Appl. Comput. Fluid Mech. 2019, 13, 91–101.
27. Zhang, F.; Kung, H.; Zhang, F.; Yang, C.; Gan, J. AI-powered spatiotemporal imputation and prediction of chlorophyll-a concentration in coastal ecosystems. Nat. Commun. 2025, 16, 7656.
28. Jan, S.; Tseng, Y.H.; Dietrich, D.E. Sources of water in the Taiwan Strait. J. Oceanogr. 2010, 66, 211–221.
29. Pan, A.J.; Wan, X.F.; Guo, X.G.; Jing, C.S. Responses of the Zhe-Min coastal current adjacent to Pingtan Island to the wintertime monsoon relaxation in 2006 and its mechanism. Sci. China Earth Sci. 2013, 56, 386–396.
30. Hong, H.; Chai, F.; Zhang, C.; Huang, B.; Jiang, Y.; Hu, J. An overview of physical and biogeochemical processes and ecosystem dynamics in the Taiwan Strait. Cont. Shelf Res. 2011, 31, S3–S12.
31. Tseng, H.C.; You, W.L.; Huang, W.; Chung, C.C.; Tsai, A.Y.; Chen, T.Y.; Lan, K.W.; Gong, G.C. Seasonal variations of marine environment and primary production in the Taiwan Strait. Front. Mar. Sci. 2020, 7, 38.
32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010.
33. Zhang, Y.; Yan, J. Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
34. Ekambaram, V.; Jati, A.; Nguyen, N.; Sinthong, P.; Kalagnanam, J. Tsmixer: Lightweight mlp-mixer model for multivariate time series forecasting. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery: New York, NY, USA, 2023; pp. 459–469.
35. Wang, Y.; Wu, H.; Zhang, J.; Gao, Z.; Wang, J.; Yu, P.S.; Long, M. Predrnn: A recurrent neural network for spatiotemporal predictive learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 2208–2225.
36. Wu, Z.; Pan, S.; Long, G.; Jiang, J.; Chang, X.; Zhang, C. Connecting the dots: Multivariate time series forecasting with graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; Association for Computing Machinery: New York, NY, USA, 2020; pp. 753–763.
37. Jiang, L.; Lu, X.; Xu, W.; Yao, P.; Cheng, X. Uncertainties associated with simulating regional sea surface height and tides: A case study of the East China seas. Front. Mar. Sci. 2022, 9, 827547.
38. Huang, R.; Jiang, L.; Cheng, X.; Burchard, H. Bifurcated upshelf extension of the Yangtze River plume. J. Geophys. Res. Ocean. 2025, 130, E2025JC022937.
39. Burchard, H.; Bolding, K. GETM: A general Estuarine Transport Model; Scientific documentation; European Commission, Joint Research Centre, Institute for Environment and Sustainability: Ispra, Italy, 2002.
40. Bruggeman, J.; Bolding, K. A general framework for aquatic biogeochemical models. Environ. Model. Softw. 2014, 61, 249–265.
41. Jakobsen, H.H.; Markager, S. Carbon-to-chlorophyll ratio for phytoplankton in temperate coastal waters: Seasonal patterns and relationship to nutrients. Limnol. Oceanogr. 2016, 61, 1853–1868.
42. Chen, S.; Jiang, L.; Yan, Y.; Grégoire, M. The nutrient budget of a highly eutrophic coastal system with inefficient nutrient retention: The radial sand ridges, southwestern Yellow Sea. J. Geophys. Res. Ocean. 2026, 131, E2025JC023286.
43. Cloern, J.E.; Grenz, C.; Vidergar-Lucas, L. An empirical model of the phytoplankton chlorophyll: Carbon ratio-the conversion factor between productivity and growth rate. Limnol. Oceanogr. 1995, 40, 1313–1321.
44. Zhou, F.; Chai, F.; Huang, D.; Xue, H.; Chen, J.; Xiu, P.; Xuan, J.; Li, J.; Zheng, D.; Ni, X.; et al. Investigation of hypoxia off the Changjiang Estuary using a coupled model of ROMS-CoSiNE. Prog. Oceanogr. 2017, 159, 237–254.
45. Tsai, S.F.; Wu, L.Y.; Chou, W.C.; Chiang, K.P. The dynamics of a dominant dinoflagellate, Noctiluca scintillans, in the subtropical coastal waters of the Matsu archipelago. Mar. Pollut. Bull. 2018, 127, 553–558.
46. Gong, G.C.; Shiah, F.K.; Liu, K.K.; Wen, Y.H.; Liang, M.H. Spatial and temporal variation of chlorophyll a, primary productivity and chemical hydrography in the southern East China Sea. Cont. Shelf Res. 2000, 20, 411–436.
47. Naik, H.; Chen, C.T.A. Biogeochemical cycling in the Taiwan Strait. Estuar. Coast. Shelf Sci. 2008, 78, 603–612.
48. Hsu, P.C.; Lu, C.Y.; Hsu, T.W.; Ho, C.R. Diurnal to seasonal variations in ocean chlorophyll and ocean currents in the north of Taiwan observed by Geostationary Ocean Color Imager and coastal radar. Remote Sens. 2020, 12, 2853.
49. Li, H.; Li, X.; Song, D.; Nie, J.; Liang, S. Prediction on daily spatial distribution of chlorophyll-a in coastal seas using a synthetic method of remote sensing, machine learning and numerical modeling. Sci. Total Environ. 2024, 910, 168642.
50. Grande, D.; Buizza, R.; Storto, A. Machine learning in ocean data assimilation: Advances, gaps and the road to operations. Ocean. Model. 2026, 200, 102678.
51. Du, Z.; Qin, M.; Zhang, F.; Liu, R. Multistep-ahead forecasting of chlorophyll a using a wavelet nonlinear autoregressive network. Knowl.-Based Syst. 2018, 160, 61–70.
52. Rajaee, T.; Boroumand, A. Forecasting of chlorophyll-a concentrations in South San Francisco Bay using five different models. Appl. Ocean Res. 2015, 53, 208–217.
53. Ding, W.X.; Zhang, C.Y.; Shang, S.P.; Li, X.D. Optimization of deep learning model for coastal chlorophyll a dynamic forecast. Ecol. Model. 2022, 467, 109913.
54. Chen, C.; Chen, Q.; Yao, S.; He, M.; Zhang, J.; Li, G.; Lin, Y. Combining physical-based model and machine learning to forecast chlorophyll-a concentration in freshwater lakes. Sci. Total Environ. 2024, 907, 168097.
55. Park, J.S.; Park, J.Y.; Ham, Y.G.; Kim, J.H.; Jeon, W.J. A Deep Learning Framework for Chlorophyll Prediction in Large Marine Ecosystems: Benchmarking with a Dynamic Model and Implications for Fish Catch Forecasts. EGUsphere 2025, 2025, 1–19.
56. Higgs, I.; Bannister, R.; Skákala, J.; Carrassi, A.; Ciavatta, S. Hybrid machine learning data assimilation for marine biogeochemistry. Biogeosciences 2026, 23, 315–344.
57. Zhang, P.; Liu, X.; Dai, H.; Shi, C.; Xie, R.; Song, G.; Tang, L. A multi-model ensemble approach for reservoir dissolved oxygen forecasting based on feature screening and machine learning. Ecol. Indic. 2024, 166, 112413.
58. Cruz, R.C.; Reis Costa, P.; Vinga, S.; Krippahl, L.; Lopes, M.B. A review of recent machine learning advances for forecasting harmful algal blooms and shellfish contamination. J. Mar. Sci. Eng. 2021, 9, 283.

Figure 1. The bathymetry and model domain of (a) the China Seas and (b) the Taiwan Strait. The domains of the machine learning and hydrodynamic–biogeochemical models are shown in (b) with dashed and solid boxes, respectively. The major rivers are marked with red squares in both panels. A buoy is marked with a green diamond. Two boxes in the nearshore and offshore regions are indicated in (b) for further analysis of the ML output chl-a.

Figure 2. (a) The missing ratio of the MODIS (Moderate Resolution Imaging Spectroradiometer) chl-a data and climatological MODIS chl-a in (b) nearshore and (c) offshore regions of the northern Taiwan Strait. Dots and shades represent the averages and standard deviations, respectively. The locations of these regions are shown in Figure 1b.

Figure 3. A schematic representation of the NPZD (Nutrient–Phytoplankton–Zooplankton–Detritus) model, illustrating the state variables (boxes) and the major biogeochemical processes (arrows).

Figure 4. Comparison of the imputed data with the original (excluded) remote sensing chlorophyll-a (chl-a) data measured by the Moderate Resolution Imaging Spectroradiometer (MODIS) when 10% to 90% of the data are randomly excluded. Data imputation is conducted with the spatiotemporal imputation and prediction (STIMP) machine learning model. The solid black line in each panel marks the 1:1 relationship between the two datasets. r and RMSE denote the correlation coefficient and root mean square error, respectively. Each panel shows the experiment with the median RMSE among five repeated random masking experiments.

Figure 5. The root mean square errors (RMSEs) between the machine learning model ensemble mean outputs and validation dataset of the MODIS (Moderate Resolution Imaging Spectroradiometer) chlorophyll-a (chl-a) in (a) nearshore and (b) offshore regions of the northern Taiwan Strait during 2016–2024. For each box plot, the red horizontal lines indicate the median values; the blue boxes represent the interquartile range (25th–75th percentiles); the blue dashed whiskers extend to the most extreme non-outlier values within 1.5 times the interquartile range; and the red circles denote outliers. See Figure 1b for the boxes representing nearshore and offshore regions.

Figure 6. The (a) correlation coefficients (r) and (b) root mean square errors (RMSEs) between the ensemble mean chlorophyll-a (chl-a) predicted by the STIMP (spatiotemporal imputation and prediction) model and the validation data, MODIS (Moderate Resolution Imaging Spectroradiometer) observations in the northern Taiwan Strait during 2016–2024. Red dots indicate significant correlations (p < 0.05).

Figure 7. The chlorophyll-a (chl-a) concentrations predicted by the STIMP (spatiotemporal imputation and prediction) model and the validation data, MODIS (Moderate Resolution Imaging Spectroradiometer) observations in the (a) nearshore and (b) offshore waters of the northern Taiwan Strait during 2016–2024. The boxes where the inshore and offshore data were sampled are shown in Figure 1b, and the blue shades in each panel are spatial standard deviations of the modeled data.

Figure 8. The chlorophyll-a (chl-a) concentrations derived from the time-series statistical models (persistence, climatology, seasonal climatology, linear regression, random forest, and autoregressive models), STIMP, and MODIS (Moderate Resolution Imaging Spectroradiometer) observations in the (a) nearshore and (b) offshore waters of the northern Taiwan Strait during 2016–2024. The boxes where the inshore and offshore data are sampled are shown in Figure 1b, and the blue shades in each panel are spatial standard deviations of the STIMP modeled data. In panel (a), the triangle indicates that the autoregressive model produced numerically unstable (divergent) predictions exceeding the plotting range.

Figure 9. The relative importance of environmental variables for chlorophyll-a (chl-a) forecasting based on STIMP (the spatiotemporal imputation and prediction) model derived from SHAP (SHapley Additive exPlanations) values averaged across all samples, forecast lead times, and spatial locations.

Figure 10. (a) The chlorophyll-a (chl-a) concentration simulated by a hydrodynamic–biogeochemical model and observed at the buoy from January to June 2023. The surface distribution of (b,c) dissolved inorganic nitrogen (DIN) and (d,e) chl-a concentration (b,d) before and (c,e) during the spring bloom. The buoy location is shown in Panel (e).

Figure 11. The (a,c,e) correlation coefficients (r) and (b,d,f) root mean square errors (RMSEs) between ensemble mean chlorophyll-a (chl-a) predicted by the Transformer model and the validation data, GETM-FABM output in the northern Taiwan Strait with forecast time of (a,b) 1, (c,d) 3, and (e,f) 7 days in 2024. Red dots indicate significant correlations (p < 0.05). The red triangle indicates the location of the Min River estuary.

Figure 12. Chlorophyll-a (chl-a) predicted by the Transformer model and the validation data (GETM-FABM output) at the buoy at forecast lead times of (a) 1, (b) 3, and (c) 7 days from January to June 2024. The red line and shading represent the average and standard deviation of the ensemble modeled chl-a, respectively. The red asterisks indicate the times when each forecast starts. See Figure 1b for the location of the buoy.

Figure 13. Chlorophyll-a (chl-a) concentrations measured at the buoy station, observed by MODIS (Moderate Resolution Imaging Spectroradiometer), and forecasted by the STIMP (spatiotemporal imputation and prediction) model based on MODIS data in (a) 2023 and (b) 2024. The blue line and shading represent the average and standard deviation of the ensemble modeled chl-a, respectively. See Figure 1b for the location of the buoy.

Table 1. The spatial and temporal resolutions and data length of environmental variables and satellite chlorophyll-a used for training machine learning models in this study.

Environmental Variables	Spatial Resolution	Temporal Resolution	Data Length
SST, OISST	0.25°	Daily	2003–2024
SSS, SMAP	40–70 km	Daily	2015–2024
Yangtze River Discharge, observation	N/A	Monthly	2003–2024
PAR, MODIS	4 km	8-day	2003–2024
Precipitation over Min River basin, ERA5	Basin-averaged	Hourly	2003–2024
u10, ERA5	0.25°	Hourly	2003–2024
v10, ERA5	0.25°	Hourly	2003–2024
Chl-a, MODIS	4 km	8-day	2003–2024

Table 2. Tasks and settings of machine learning models used in this study.

Tasks	Models	Input	Output	Training Dataset	Validation Dataset
Missing data imputation	STIMP	MODIS chl-a during 2003–2024	Reconstructed chl-a during 2003–2024 with all data gaps filled	MODIS chl-a during 2003–2015	MODIS chl-a during 2016–2024
Time-series forecasts	Linear regression, random forest, and autoregressive models	Gap-filled (reconstructed) chl-a time series of the previous year (46 time steps, 8-day resolution)	Forecasted chl-a time series for the year following the input year (46 time steps, 8-day resolution)	Chl-a time series during 2003–2015 *	Chl-a time series during 2016–2024 *
Spatiotemporal forecasts based on satellite data	STIMP, Transformer, Crossformer, TSMixer, PredRNN, MTGNN	Gap-filled (reconstructed) chl-a and environmental variables (winds, SST, PAR, precipitation, SSS, river discharge) from the previous year (46 time steps, 8-day resolution)	Forecasted chl-a for the year following the input year (46 time steps, 8-day resolution)	Chl-a during 2003–2015 * and environmental variables (winds, SST, PAR, precipitation, SSS, river discharge)	Chl-a during 2016–2024 *
Spatiotemporal forecasts based on mechanistic model output	Transformer	GETM-FABM output chl-a, SST, and nutrients from the previous 14 days (daily resolution)	Forecasted chl-a with lead times of 1, 3, and 7 days (daily resolution)	GETM-FABM output chl-a, SST, and nutrients during 2022–2023	GETM-FABM output chl-a in 2024

* The imputed data are input predictors, while the non-imputed original data are ground truth labels.
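The yearly input/output pairing in Table 2, together with the footnote on imputed predictors and non-imputed labels, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names `make_yearly_pairs` and `masked_mse` are hypothetical, gaps in the original series are assumed to be marked with `None`, and scoring the loss only where an original label exists is an assumption about how the footnote would be applied.

```python
STEPS_PER_YEAR = 46  # number of 8-day composites per year, as in Table 2


def make_yearly_pairs(imputed, raw, steps=STEPS_PER_YEAR):
    """Pair each year of gap-filled input with the following year's raw labels.

    imputed: list of floats, gap-filled chl-a (predictors)
    raw:     list of floats or None, original chl-a with gaps (labels)
    Returns a list of (x, y, mask) tuples; mask flags time steps with a label.
    """
    if len(imputed) != len(raw):
        raise ValueError("imputed and raw series must have equal length")
    n_years = len(imputed) // steps
    pairs = []
    for k in range(n_years - 1):
        x = imputed[k * steps:(k + 1) * steps]    # input year
        y = raw[(k + 1) * steps:(k + 2) * steps]  # following year
        mask = [v is not None for v in y]
        pairs.append((x, y, mask))
    return pairs


def masked_mse(pred, target, mask):
    """Mean squared error over time steps that have an observed label."""
    errs = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return sum(errs) / len(errs) if errs else float("nan")
```

With three years of data the function yields two pairs (year 1 to year 2, year 2 to year 3); the mask keeps imputed values out of the evaluation.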

Table 3. Unified training configurations for all machine learning (ML) models in this study.

Category	Configuration
Learning rate	10⁻⁴
Batch size	8
Hidden size	8
Training epochs	120
Loss function	MSE
Data normalization	Z-score normalization
Optimizer	Adam
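As an illustration of how the settings in Table 3 might be carried in code, the sketch below stores them in a small configuration object and implements the two purely numerical entries, Z-score normalization and the MSE loss, in plain Python. The class and function names are hypothetical, and fitting the normalization statistics on the training data only is an assumption rather than something the table states.

```python
from dataclasses import dataclass
from statistics import fmean, pstdev


@dataclass(frozen=True)
class TrainConfig:
    """Values copied from Table 3; field names are illustrative."""
    learning_rate: float = 1e-4
    batch_size: int = 8
    hidden_size: int = 8
    epochs: int = 120
    loss: str = "mse"
    normalization: str = "zscore"
    optimizer: str = "adam"


def zscore_fit(values):
    """Return (mean, std) of the training values; std falls back to 1 if zero."""
    mu = fmean(values)
    sigma = pstdev(values)
    return mu, (sigma if sigma > 0 else 1.0)


def zscore_apply(values, mu, sigma):
    """Normalize values with previously fitted statistics."""
    return [(v - mu) / sigma for v in values]


def zscore_invert(values, mu, sigma):
    """Map normalized values back to physical units (e.g., mg m^-3)."""
    return [v * sigma + mu for v in values]


def mse(pred, target):
    """Mean squared error between two equally long sequences."""
    return fmean([(p - t) ** 2 for p, t in zip(pred, target)])
```

Forecasts produced in normalized space would be passed through `zscore_invert` before computing metrics in physical units.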

Table 4. Governing equations of the NPZD biogeochemical model used in this study, as formulated in FABM (Framework for Aquatic Biogeochemical Models).

Equations	Interpretation
$T_{fac\_Phy} = \begin{cases} e^{k1_{Phy} \cdot (T - T_{Phy})^{2}}, & T < T_{Phy} \\ e^{k2_{Phy} \cdot (T - T_{Phy})^{2}}, & T \geq T_{Phy} \end{cases}$	$T_{fac\_Phy}$, dimensionless, the temperature factor for phytoplankton (PHY) uptake; $T$, °C, the in situ temperature.
$T_{fac\_Zoo} = \begin{cases} e^{k1_{Zoo} \cdot (T - T_{Zoo})^{2}}, & T < T_{Zoo} \\ e^{k2_{Zoo} \cdot (T - T_{Zoo})^{2}}, & T \geq T_{Zoo} \end{cases}$	$T_{fac\_Zoo}$, dimensionless, the temperature factor for red Noctiluca scintillans (RNS) and other zooplankton (ZOO).
$T_{fac} = q_{10}^{(T - T_{ref}) / T}$	$T_{fac}$, dimensionless, the temperature factor for other biogeochemical processes.
$DinUpt = maxupt \cdot \frac{DIN}{DIN + ksDIN} \cdot T_{fac\_Phy} \cdot \frac{PAR}{PAR + ksPAR} \cdot PHY$	$DinUpt$, mmol m⁻³ day⁻¹, the pelagic dissolved inorganic nitrogen (DIN) uptake rate; $PAR$, µmol photons m⁻² s⁻¹, in situ photosynthetically active radiation.
$RnsGrz = maxGrzRns \cdot T_{fac\_Zoo} \cdot \frac{PHY}{PHY + ksGrzRns} \cdot RNS$	$RnsGrz$, mmol m⁻³ day⁻¹, the grazing rate for RNS.
$RnsGro = (1 - pFaeces) \cdot RnsGrz$	$RnsGro$, mmol m⁻³ day⁻¹, the growth rate for RNS.
$RnsExc = RnsExcRate \cdot T_{fac\_Zoo} \cdot RNS$	$RnsExc$, mmol m⁻³ day⁻¹, the nutrient excretion rate for RNS.
$RnsMor = RnsMorRate \cdot T_{fac\_Zoo} \cdot RNS^{2}$	$RnsMor$, mmol m⁻³ day⁻¹, the mortality rate for RNS.
$ZooGrz = maxGrzZoo \cdot T_{fac\_Zoo} \cdot \frac{PHY}{PHY + ksGrzZoo} \cdot ZOO$	$ZooGrz$, mmol m⁻³ day⁻¹, the grazing rate for ZOO.
$ZooGro = (1 - pFaeces) \cdot ZooGrz$	$ZooGro$, mmol m⁻³ day⁻¹, the growth rate for ZOO.
$ZooExc = ZooExcRate \cdot T_{fac\_Zoo} \cdot ZOO$	$ZooExc$, mmol m⁻³ day⁻¹, the nutrient excretion rate for ZOO.
$ZooMor = ZooMorRate \cdot T_{fac\_Zoo} \cdot ZOO^{2}$	$ZooMor$, mmol m⁻³ day⁻¹, the mortality rate for ZOO.
$Min = minRate \cdot T_{fac} \cdot DET$	$Min$, mmol m⁻³ day⁻¹, the remineralization rate for detritus (DET).
$BotMin = BotminRate \cdot T_{fac} \cdot BDET$	$BotMin$, mmol m⁻² day⁻¹, the remineralization rate for bottom detritus (BDET).
$SinDet = SinRateDet \cdot DET$	$SinDet$, mmol m⁻² day⁻¹, the sinking rate for DET.
$SinPhy = SinRatePhy \cdot PHY$	$SinPhy$, mmol m⁻² day⁻¹, the sinking rate for PHY.
$\frac{dDIN}{dt} = Min + ZooExc - DinUpt + \frac{BotMin}{z}$	$dDIN/dt$, mmol m⁻³ day⁻¹, the DIN temporal derivative; $z$, m, vertical layer depth.
$\frac{dPHY}{dt} = DinUpt - RnsGrz - ZooGrz - \frac{SinPhy}{z}$	$dPHY/dt$, mmol m⁻³ day⁻¹, the PHY temporal derivative.
$\frac{dRNS}{dt} = RnsGro - RnsExc - RnsMor$	$dRNS/dt$, mmol m⁻³ day⁻¹, the RNS temporal derivative.
$\frac{dZOO}{dt} = ZooGro - ZooExc - ZooMor$	$dZOO/dt$, mmol m⁻³ day⁻¹, the ZOO temporal derivative.
$\frac{dDET}{dt} = RnsMor + ZooMor + pFaeces \cdot ZooGrz - Min - \frac{SinDet}{z}$	$dDET/dt$, mmol m⁻³ day⁻¹, the DET temporal derivative.
$\frac{dBDET}{dt} = SinDet + SinPhy - BotMin$	$dBDET/dt$, mmol m⁻² day⁻¹, the BDET temporal derivative.
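To make the structure of Table 4 concrete, the following sketch evaluates the source and sink terms for a single well-mixed layer and advances the state with a forward Euler step. It is an illustration only and not the GETM-FABM implementation. All parameter values in `PARAMS` are hypothetical placeholders chosen so that the code runs. The mortality sources of DET are taken here as RnsMor + ZooMor (an assumption), clipping the state at zero is an added safeguard, and transport by the hydrodynamic model is omitted.

```python
import math

# Hypothetical parameter values for illustration; not the paper's calibration.
PARAMS = {
    "k1_phy": -0.01, "k2_phy": -0.02, "T_phy": 20.0,
    "k1_zoo": -0.01, "k2_zoo": -0.02, "T_zoo": 22.0,
    "q10": 2.0, "T_ref": 20.0,
    "maxupt": 1.0, "ksDIN": 1.0, "ksPAR": 50.0,
    "maxGrzRns": 0.5, "ksGrzRns": 1.0, "RnsExcRate": 0.05, "RnsMorRate": 0.05,
    "maxGrzZoo": 0.5, "ksGrzZoo": 1.0, "ZooExcRate": 0.05, "ZooMorRate": 0.05,
    "pFaeces": 0.3, "minRate": 0.05, "BotminRate": 0.02,
    "SinRateDet": 1.0, "SinRatePhy": 0.1,
}


def temp_factor(T, k1, k2, T_opt):
    """Piecewise exponential temperature factor (first two rows of Table 4)."""
    k = k1 if T < T_opt else k2
    return math.exp(k * (T - T_opt) ** 2)


def npzd_rhs(state, T, PAR, z, p=PARAMS):
    """Return d/dt of (DIN, PHY, RNS, ZOO, DET, BDET) in units per day."""
    DIN, PHY, RNS, ZOO, DET, BDET = state
    f_phy = temp_factor(T, p["k1_phy"], p["k2_phy"], p["T_phy"])
    f_zoo = temp_factor(T, p["k1_zoo"], p["k2_zoo"], p["T_zoo"])
    # Exponent kept as printed in Table 4; it is undefined at T = 0.
    f_gen = p["q10"] ** ((T - p["T_ref"]) / T)

    din_upt = (p["maxupt"] * DIN / (DIN + p["ksDIN"]) * f_phy
               * PAR / (PAR + p["ksPAR"]) * PHY)
    rns_grz = p["maxGrzRns"] * f_zoo * PHY / (PHY + p["ksGrzRns"]) * RNS
    zoo_grz = p["maxGrzZoo"] * f_zoo * PHY / (PHY + p["ksGrzZoo"]) * ZOO
    rns_gro = (1 - p["pFaeces"]) * rns_grz
    zoo_gro = (1 - p["pFaeces"]) * zoo_grz
    rns_exc = p["RnsExcRate"] * f_zoo * RNS
    zoo_exc = p["ZooExcRate"] * f_zoo * ZOO
    rns_mor = p["RnsMorRate"] * f_zoo * RNS ** 2
    zoo_mor = p["ZooMorRate"] * f_zoo * ZOO ** 2
    remin = p["minRate"] * f_gen * DET
    bot_min = p["BotminRate"] * f_gen * BDET
    sin_det = p["SinRateDet"] * DET
    sin_phy = p["SinRatePhy"] * PHY

    d_din = remin + zoo_exc - din_upt + bot_min / z
    d_phy = din_upt - rns_grz - zoo_grz - sin_phy / z
    d_rns = rns_gro - rns_exc - rns_mor
    d_zoo = zoo_gro - zoo_exc - zoo_mor
    # DET mortality sources taken as RnsMor + ZooMor (assumption).
    d_det = rns_mor + zoo_mor + p["pFaeces"] * zoo_grz - remin - sin_det / z
    d_bdet = sin_det + sin_phy - bot_min
    return (d_din, d_phy, d_rns, d_zoo, d_det, d_bdet)


def euler_step(state, dt, T, PAR, z):
    """One forward Euler step; negative values are clipped to zero."""
    derivs = npzd_rhs(state, T, PAR, z)
    return tuple(max(s + dt * d, 0.0) for s, d in zip(state, derivs))
```

A call such as `euler_step((5.0, 1.0, 0.1, 0.1, 0.5, 0.5), 0.1, T=20.0, PAR=100.0, z=2.0)` advances the six state variables by 0.1 day; a production model would use the integration scheme supplied by the host framework rather than forward Euler.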

Table 5. Spatially averaged correlation coefficients (r) and root mean square errors (RMSEs) between chlorophyll-a (chl-a) derived from the time-series statistical models (persistence, climatology, seasonal climatology, linear regression, random forest, and autoregressive models) and MODIS (Moderate Resolution Imaging Spectroradiometer) observations. Asterisks indicate significant correlations (p < 0.05). N/A indicates a failure to converge during training, resulting in invalid evaluation metrics.

Baseline Model	Nearshore		Offshore
Baseline Model	r	RMSE (mg m⁻³)	r	RMSE (mg m⁻³)
Persistence	−0.046	1.14	0.48 *	0.55
Climatology	−0.005	0.98	−0.08	0.57
Seasonal climatology	−0.039	0.99	0.58 *	0.47
Linear regression	−0.059	1.36	0.36 *	0.60
Random forest	−0.031	2.03	0.61 *	0.61
Autoregressive	N/A	N/A	0.33 *	0.54
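The three simplest baselines in Table 5 and the two skill metrics can be sketched in a few lines. The definitions below are assumptions made for illustration, since this section does not spell them out: persistence repeats the most recent training year, climatology is the overall training mean, and seasonal climatology is the mean at each step of the year across training years. Function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation; returns NaN if either series has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def rmse(pred, obs):
    """Root mean square error between forecast and observation."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def persistence(train_years):
    """Forecast next year as a copy of the last training year (assumed)."""
    return list(train_years[-1])


def climatology(train_years):
    """Forecast next year as the constant overall training mean (assumed)."""
    flat = [v for year in train_years for v in year]
    mean = sum(flat) / len(flat)
    return [mean] * len(train_years[0])


def seasonal_climatology(train_years):
    """Forecast each step as the across-year mean at that step (assumed)."""
    n = len(train_years)
    steps = len(train_years[0])
    return [sum(year[i] for year in train_years) / n for i in range(steps)]
```

Under these definitions a constant climatology forecast at a single pixel has zero variance, so `pearson_r` returns NaN there; how such cases enter a spatial average is a separate design choice not covered by this sketch.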

Table 6. Spatially averaged correlation coefficients (r) and root mean square errors (RMSEs) between the ensemble mean chlorophyll-a (chl-a) derived from the STIMP (spatiotemporal imputation and prediction) model and MODIS (Moderate Resolution Imaging Spectroradiometer) observations when chl-a and different additional environmental factors are included as inputs. Asterisks indicate significant correlations (p < 0.05).

Environmental Variables	Nearshore		Offshore
Environmental Variables	r	RMSE (mg m⁻³)	r	RMSE (mg m⁻³)
None	−0.017	1.07	0.59 *	0.47
SST	−0.062	0.95	0.58 *	0.48
SSS	−0.029	0.95	0.54 *	0.49
Yangtze River Discharge	−0.060	0.96	0.57 *	0.48
PAR	0.0040	0.95	0.58 *	0.47
Precipitation over Min River basin	−0.021	0.95	0.57 *	0.47
u10	−0.038	0.95	0.56 *	0.48
v10	−0.023	0.95	0.56 *	0.48
All above	−0.033	1.08	0.57 *	0.48

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wu, Y.; Jiang, L.; Lin, H.; Chen, C.; Jiang, D. Machine Learning Forecasts of Coastal Chlorophyll-a Based on Satellite and Model Data: A Case Assessment in the Northern Taiwan Strait. Remote Sens. 2026, 18, 1904. https://doi.org/10.3390/rs18121904

