1. Introduction
Freshwater and coastal lake ecosystems are globally vulnerable to accelerating pressures from climate change, land use transformations, and hydrological instability, making accurate and timely water quality monitoring increasingly essential [1,2,3,4]. Traditional in situ measurements remain the foundation of ecological assessment, yet their limited spatial and temporal coverage creates significant gaps in understanding ecosystem-wide dynamics, particularly in shallow coastal lakes where conditions can change rapidly [5].
Parallel advances in geospatial technologies, satellite remote sensing, and machine learning (ML) have created opportunities to bridge these gaps by integrating multi-source environmental information into coherent predictive frameworks [6]. Although many recent studies have explored remote sensing-based retrieval of single physicochemical parameters [7,8], inconsistencies in temporal matching, sparse in situ measurement networks, and spatial heterogeneity frequently constrain predictive robustness and limit operational use.
Vrana Lake in Dalmatia, Croatia, a shallow coastal freshwater system hydrologically connected to the Adriatic Sea, represents an ecologically sensitive environment where seawater intrusion, seasonal fluctuations, and nutrient inputs interact to shape water quality [
9]. Previous research has characterized these dynamics [
10,
11,
12], yet monitoring efforts remained spatially limited and did not fully capture system-wide variability [
9]. To overcome this challenge, GIS-based multicriteria decision analysis (MCDA) has proven effective for synthesizing environmental factors into spatial models of water quality, enabling the identification of critical zones vulnerable to eutrophication and pollution pressures [
10]. Similarly, satellite missions such as Landsat 8–9, Sentinel-2, and PlanetScope offer high-frequency, multispectral imagery that can be correlated with in situ data to map key water quality parameters across the entire lake surface [
13].
These methods provide valuable complementary perspectives, but their full potential lies in their integration with advanced data-driven techniques. To ensure reliable temporal matching, water quality index (WQI) values were compared with satellite imagery acquired within a 10-day window of each field campaign. Previous studies have shown that a 1-day window is ideal and that it can be extended to up to 10 days if conditions do not change significantly [14,15]. Further research has suggested that the acceptable window also depends on satellite resolution: higher spatial, spectral, and radiometric resolution increases the reliability of extending the time window for pairing satellite and ground-based data [16]. A 10-day temporal tolerance was selected in this study to minimize the impact of cloud cover and to maximize the availability of usable satellite scenes while maintaining ecological relevance, provided that no major weather changes occurred between the measurement day and the satellite overpass. In practice, most satellite–in situ matchups occurred within 0–3 days of the field measurements.
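The temporal matching rule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `match_scenes`, its parameters, and all dates are hypothetical, and only the 10-day tolerance is taken from the text.

```python
from datetime import date

def match_scenes(campaign_day, scene_days, max_offset_days=10):
    """Return (scene_day, signed_offset_days) for the closest scene
    within the tolerance window, or None if no scene qualifies."""
    candidates = [
        (abs((scene - campaign_day).days), scene)
        for scene in scene_days
        if abs((scene - campaign_day).days) <= max_offset_days
    ]
    if not candidates:
        return None
    _, best = min(candidates)  # smallest absolute offset wins
    return best, (best - campaign_day).days

# Hypothetical dates for illustration only (not taken from Table 2).
campaign = date(2020, 5, 10)
scenes = [date(2020, 4, 26), date(2020, 5, 9), date(2020, 5, 23)]
matched = match_scenes(campaign, scenes)
print(matched)
```

A negative offset means the scene was acquired before the field campaign. Scenes outside the window are ignored rather than matched to the nearest date.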
Most remote sensing and ML studies addressing water quality rely on a direct comparison between raw in situ measurements and satellite imagery, with models trained to predict individual physicochemical parameters [
17]. While effective for local analyses, this approach is often constrained by sparse monitoring networks and high spatial variability within inland waters, resulting in limited predictive robustness. In contrast, the present study emphasizes the use of an integrated WQI, generated through GIS-MCDA by Batina and Šiljeg (2025) [
10], as the primary reference for model training (
Figure A1). This index aggregates multiple parameters into a single, spatially continuous representation of water quality, thereby providing a more robust basis for predictive modelling. As a secondary comparison, ML models were also trained on raw in situ data, allowing the study to evaluate differences between the conventional approach and the proposed WQI-driven framework.
ML offers a powerful toolset for bridging the gap between point-based field data and spatially continuous remote sensing observations [
17]. By leveraging statistical learning algorithms, ML can identify complex nonlinear relationships between spectral reflectance and water quality parameters [
18], as well as classify ecological conditions into distinct quality classes. In the context of Vrana Lake, ML enables the combination of in situ measurements, raster-based GIS models, and satellite imagery into a unified monitoring framework, enhancing monitoring accuracy and supporting long-term ecological assessment in sensitive protected areas.
This study introduces SIGMaL, a unified framework for lake water quality monitoring that integrates satellite imagery, in situ measurements, GIS-MCDA, and ML into a single analytical pipeline. SIGMaL combines four complementary components: (i) a year-long series of monthly in situ measurements of key physicochemical parameters; (ii) raster-based WQI derived from GIS-MCDA, providing spatially continuous water quality classes; (iii) multi-sensor satellite observations from Sentinel-2, Landsat 8–9, and PlanetScope (acquired and evaluated independently); and (iv) ML models, including convolutional neural networks (CNNs) for WQI classification, trained separately for each sensor to enable a fair cross-sensor comparison.
To overcome the spatial limitations of the 20 in situ monitoring stations, SIGMaL uses MCDA-derived raster densification, expanding the dataset to 318 samples that better represent lake water quality variability. Satellite reflectance data from each sensor were paired with these samples and used to train ML models under identical modelling settings, enabling systematic comparison of spectral, spatial, and temporal performance across sensors. Designed as a modular, scalable, and reproducible workflow, SIGMaL enhances spatial and temporal coverage beyond conventional point-based monitoring and provides a robust basis for evaluating the suitability of different satellite platforms for coastal shallow lake environments. The framework supports improved ecological assessment and offers a transferable methodology for data-limited freshwater and coastal shallow lake systems.
In contrast to most lake studies in which ML is trained directly on raw in situ measurements [
13,
19,
20,
21], the GIS–MCDA WQI is adopted in this study as the primary modelling target. Two parallel strategies are employed: (A) CNN-based classification of WQI classes (principal track), and (B) regression on raw in situ parameters (comparison track). We hypothesize that (1) the WQI derived from GIS–MCDA provides a more stable modelling target than individual in situ parameters, and (2) the SIGMaL framework can accurately classify spatial water quality patterns across a complex coastal lake. The main findings indicate that the SIGMaL framework integrating in situ data, GIS–MCDA WQI, satellite imagery, and machine learning provides a more stable and spatially comprehensive approach to lake water quality monitoring than traditional parameter-based models, with Sentinel-2 offering the strongest overall performance and WQI-based CNNs consistently outperforming raw-parameter regression across all sensors.
2. Materials and Methods
2.1. Study Area
Vrana Lake, located in Dalmatia near the eastern Adriatic coast, is the largest natural freshwater lake in Croatia, covering an area of about 30 km² [22]. The lake extends between 43°51′–43°57′N and 15°30′–15°39′E (WGS84) (
Figure 1). It is characterized by shallow water that undergoes strong seasonal fluctuations, with higher water levels in winter and spring, and lower levels in summer and autumn [
23]. Due to its ecological importance and species richness, the lake and its surroundings are protected within the Vrana Lake Nature Park.
The lake’s hydrological regime is influenced by multiple factors, including precipitation, tributary inflows, groundwater exchange, evaporation, and its artificial connection to the Adriatic Sea through the Prosika canal [
9]. During periods of low water levels, seawater intrusion increases salinity, whereas freshwater inputs from surrounding karst fields and springs reduce salinity during wetter periods [
24]. These dynamics, combined with wind-driven mixing, strongly affect the water quality and ecosystem health of the lake.
2.2. Data Collection
Field surveys were conducted on a monthly basis from July 2023 to June 2024. Measurements were performed in the morning hours (08:00–13:00 local time) to minimize diurnal variability in water temperature (WT), dissolved oxygen (DO), and chlorophyll-a concentrations. Measurement days were chosen to coincide with stable meteorological conditions, avoiding precipitation and strong winds that could compromise data comparability [
9]. Each campaign was carried out by a team aboard a small research vessel using a YSI EXO2 multiparameter probe (YSI Inc., Yellow Springs, OH, USA). The probe was calibrated before every campaign following manufacturer guidelines (explained in
Section 2.2.2). Due to adverse weather conditions, the November 2023 survey was postponed and conducted on 4 December, while the regular December survey took place on 19 December [
9]. All other campaigns were carried out as scheduled.
2.2.1. In Situ Measurements
Batina et al. (2025) [
9] established a network of 20 fixed monitoring stations across the lake to ensure sufficient spatial coverage and representation of hydrological and ecological variability (
Figure 1). Over the 12-month monitoring period, a maximum of 20 stations was measured in each of the 12 campaigns, resulting in 230 valid station measurements per parameter, as not all sites could be measured every month due to weather or equipment constraints.
Although several physicochemical and biological parameters were measured [
9], this study includes DO, WT, turbidity, and electrical conductivity (EC), resulting in 230 observations for each of these four parameters over the one-year research period. The selection of these parameters is supported by a year-long multiparameter analysis and correlation study conducted in Vrana Lake [
9], where they were identified as dominant drivers of lake water quality dynamics. Their selection was further supported by expert input from the Ruđer Bošković Institute and the Public Institution Vrana Lake Nature Park. Moreover, these parameters constitute the core physicochemical inputs of the GIS–MCDA-based WQI model, whose robustness was validated through sensitivity analysis and Monte Carlo simulations by Batina and Šiljeg (2025) [
10] (
Figure A1).
2.2.2. Multiparameter Probe
The YSI EXO2 multiparameter probe is designed to measure a wide range of physicochemical indicators, including WT, DO, EC, salinity, turbidity, and chlorophyll-a, with manufacturer-specified accuracies that vary by parameter (e.g., ±0.01 °C for WT, ±0.5% or 0.001 dS/m for EC, and ±1% or 0.1 mg/L for DO) [
25]. Such precision makes the instrument highly suitable for ecological and hydrological monitoring; however, its reliability depends heavily on proper handling and routine calibration. Regular maintenance is essential to mitigate external influences such as sensor fouling, sediment deposition, or biological growth, which can compromise data quality. Calibration should be carried out using standard solutions of known conductivity and oxygen reference standards, and optical sensors should be cleaned systematically, to ensure that field measurements remain both accurate and reproducible [
25].
The calibration procedure of the YSI EXO2 multiparameter probe was applied in accordance with the instructions provided in the official EXO User Manual [
25] (
Figure 2). Prior to each calibration step, the EXO calibration cup and sensors were rinsed two to three times with the appropriate standard for the parameter being adjusted; the rinse solutions were discarded and replaced with fresh calibration standard. When calibration standards were not used immediately, the sensors and cup were rinsed with deionized water and dried with a lint-free cloth before refilling. The calibration cup was filled to the recommended level to ensure that all sensors were fully submerged, while precautions were taken to avoid cross-contamination. Clean, dry probes were mounted on the sonde, and a calibration-dedicated guard was installed and tightened; a separate guard was reserved for field deployments to maintain accuracy and cleanliness. The sequence followed the prescribed order from the manual: verification of the temperature sensor against a certified reference thermometer, calibration of EC first, then pH and ORP, followed by turbidity, and finally the remaining sensors, such as optical DO and depth. This order reflects sensor interdependencies and is designed to minimize error propagation, ensuring that the EXO2 provides reliable and reproducible field measurements [
25].
2.3. Satellite Data Acquisition and Preprocessing
This study used atmospherically corrected Level-2 (surface reflectance) imagery from Sentinel-2 MultiSpectral Instrument (MSI; European Space Agency, Paris, France), Landsat 8–9 Operational Land Imager (OLI)/Thermal Infrared Sensor (TIRS; National Aeronautics and Space Administration and U.S. Geological Survey, Washington, DC, USA), and PlanetScope SuperDove satellites (Planet Labs PBC, San Francisco, CA, USA). Sentinel-2 images were obtained through the Copernicus Browser (European Space Agency, Paris, France), Landsat 8–9 through the USGS Earth Explorer (U.S. Geological Survey, Reston, VA, USA), and PlanetScope from Planet Explorer (Planet Labs PBC, San Francisco, CA, USA) [
26]. Because the goal of the study was not to compare atmospheric correction algorithms, pre-processed Level-2 data were adopted to ensure consistency across sensors and to focus computational effort on the integration of remote sensing, GIS–MCDA, and ML.
Although employing pre-processed imagery simplifies the workflow, the authors are aware of potential limitations, especially over inland waters where atmospheric conditions, adjacency effects, aerosol variability, and water surface reflections complicate correction accuracy [
27,
28]. Previous research has shown the potential of advanced remote sensing and ML approaches to enhance water quality monitoring when atmospheric corrections are adequately addressed [
29]. Furthermore, Pan et al. (2022) [
27] evaluated ten atmospheric correction algorithms over lakes and highlighted that adjacency effects near land and inconsistent aerosol modelling can reduce the fidelity of water reflectance retrievals. Similarly, Zhu and Xia (2023) [
28] discuss that while atmospheric correction is generally beneficial for remote sensing inversion tasks, in large-scale statistical inference studies small residual atmospheric errors may have limited impact on performance when models rely on strong statistical correlations rather than pixel-level physical retrievals. Because the focus of this study is on the integrated WQI rather than individual water quality parameters, this approach was considered acceptable: it maintains consistency across Sentinel-2, Landsat 8–9, and PlanetScope datasets, and ensures that computational complexity is focused on the ML and MCDA integration stages rather than on refining atmospheric correction.
Moderate-resolution sensors such as Sentinel-2 (10–60 m, 5-day revisit) and Landsat 8–9 (30–100 m, 8-day aggregate revisit) have been widely used in water quality monitoring [
13,
30,
31]. Recently, PlanetScope has emerged as a valuable alternative for small or narrow waterbodies due to its daily revisit and 3 m spatial resolution, despite its limited spectral depth relative to Sentinel-2 and Landsat (
Table 1). The Landsat 8–9 collection includes OLI optical and TIRS thermal bands, with 30 m multispectral, 100 m thermal, and 15 m panchromatic resolution. Sentinel-2 MSI provides 13 spectral bands (10–60 m) in the visible and near-infrared (NIR) range, including three Red-Edge bands critical for aquatic applications. PlanetScope Level-3B products consist of 8 spectral bands at 3 m resolution.
Field measurement dates were aligned with predicted Sentinel-2 and Landsat 8–9 overpasses, while PlanetScope was excluded from planning due to its daily revisit capability. Because Vrana Lake is shallow and highly exposed to wind, currents and waves can mix the entire water column, especially during strong Bora and Jugo events in winter and Maestral winds in summer, ensuring relatively uniform temperature and nutrient conditions [
10]. Favourable meteorological conditions were therefore essential; satellite scenes had to be cloud-free and precipitation-free, and fieldwork had to be conducted under safe wind conditions for the vessel crew.
Table 2 summarizes the dates of in situ measurements alongside the closest available cloud-free satellite acquisitions. As shown, satellite imagery did not always coincide with field measurements, and scenes from different sensors were often available on different days. Columns Max prior (days) and Max after (days) in
Table 2 indicate the maximum temporal offset between each in situ campaign and corresponding satellite scenes, illustrating, for example, that July measurements coincided with a Landsat 9 overpass and were preceded by Sentinel-2 and PlanetScope acquisitions by one day.
Quantitative evidence of rapid temporal variability in Vrana Lake is provided by Batina et al. (2025) [
9], who reported pronounced seasonal and intra-annual fluctuations in turbidity, EC, WT, and DO across monthly campaigns at 20 stations. That study further demonstrated that the lake behaves as a well-mixed shallow system with minimal vertical stratification but strong horizontal and temporal variability driven by meteorological forcing and seawater intrusion. Although maximum temporal offsets between satellite overpasses and in situ measurements reached up to 11 days, the majority of satellite–in situ matchups occurred within 0–3 days, with a mean offset of approximately 0.3 days, supporting the ecological relevance of the satellite-based analysis under typical conditions.
Although the monitoring network consisted of 20 fixed stations, the number of stations measured during each monthly campaign varied because adverse weather conditions or occasional equipment malfunction prevented safe access to all sites. The total number of valid measurements per month is visible in
Table 2.
2.4. Dataset Development
To strengthen the dataset for statistical analysis and ML applications, the original network of 20 in situ stations was densified to 318 points (
Figure 3), corresponding to 318 pixels derived from the final water quality raster presented in Batina and Šiljeg (2025) [
10]. Importantly, this procedure does not represent statistical resampling of point-based field observations, nor an attempt to create independent in situ measurements. Instead, the 318 samples serve as spatial reference points derived from a GIS–MCDA-based WQI surface. The rationale was that directly comparing a maximum of 20 in situ measurements per parameter with satellite imagery each month would not provide enough samples; a larger number of reference samples was required to ensure effective model training and testing.
The raster used for densification was generated by an MCDA approach, which aggregated multiple weighted criteria into a final water quality map using the Weighted Linear Combination (WLC) method [10]. The raster cells had a resolution of 300 × 300 m, each representing spatially explicit information on water quality across the lake. This rasterization provided a consistent framework for extracting 318 evenly distributed pixel values, which served as additional reference points. By integrating these raster-derived points, the analysis was able to capture greater spatial variability and provide a larger sample size to support robust ML model development. These raster-derived samples should therefore be interpreted as spatial representations of relative water quality patterns, not as statistically independent observations. Accordingly, the ML models are trained to recognize relative spatial WQI patterns, emphasizing spatial differentiation across the lake.
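As a rough illustration of the WLC aggregation mentioned above, each output cell is the weighted sum of the normalized criterion values at that cell, with weights summing to 1. The sketch below assumes criteria already normalized to [0, 1]. The function name, the 2 × 2 rasters, and the weights are hypothetical and do not reproduce the values used by Batina and Šiljeg (2025).

```python
def weighted_linear_combination(criteria, weights):
    """Combine normalized criterion rasters (same shape) cell by cell."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    names = list(weights)
    n_rows = len(criteria[names[0]])
    n_cols = len(criteria[names[0]][0])
    return [
        [sum(weights[name] * criteria[name][r][c] for name in names)
         for c in range(n_cols)]
        for r in range(n_rows)
    ]

# Hypothetical normalized rasters and weights, for illustration only.
criteria = {
    "DO": [[0.8, 0.6], [0.7, 0.9]],
    "WT": [[0.5, 0.5], [0.6, 0.4]],
    "turbidity": [[0.9, 0.3], [0.6, 0.7]],
    "EC": [[0.7, 0.8], [0.5, 0.6]],
}
weights = {"DO": 0.4, "WT": 0.2, "turbidity": 0.2, "EC": 0.2}
wqi = weighted_linear_combination(criteria, weights)
print(wqi)
```

Because the output is a convex combination of inputs in [0, 1], the resulting index stays in the same range, which is consistent with the raster values reported in this section.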
A water quality raster of Vrana Lake was classified into seven discrete classes by Batina and Šiljeg (2025) [
10], representing different levels of water quality across the lake surface. These classes provided a spatially explicit framework for distinguishing areas of higher and lower water quality, reflecting the heterogeneity of environmental conditions within the lake. The underlying raster values ranged from 0.596 (Class 7) to 0.737 (Class 1), indicating overall good water quality, but with detectable spatial variation that served as the basis for ranking and differentiating classes across the system.
In this study, the seven raster-derived classes were used as reference categories for ML, serving to identify which parts of the lake belong to each water quality class based on satellite imagery. To enable model training and testing, 318 raster cells (pixels) were extracted from the classified surface and used as measurement points, ensuring sufficient spatial coverage and dataset size for robust model development. The pixels were distributed across the classes as follows: Class 1—55 pixels, Class 2—64 pixels, Class 3—49 pixels, Class 4—48 pixels, Class 5—46 pixels, Class 6—29 pixels, and Class 7—27 pixels. This quantitative distribution ensured that all water quality categories were represented, allowing the ML models to be trained on the full spectrum of observed lake conditions and to capture subtle differences in relative water quality across the system.
The seven WQI classes are not intended to represent fine-scale or instantaneous water quality variability. Instead, they reflect integrated, lake-scale ecological conditions derived from annual averages and GIS-MCDA synthesis. Accordingly, the WQI serves as a spatial reference framework for identifying persistent patterns and relative gradients rather than micro-scale heterogeneity.
2.5. ML Framework
To model lake water quality, two modelling tracks were implemented: (A) CNN-based WQI classification (principal track) and (B) regression of individual in situ parameters (comparison track). Models were trained and evaluated separately for each satellite sensor (Sentinel-2, Landsat 8–9, PlanetScope) to allow a sensor-specific assessment of predictive capability under a consistent experimental design.
2.5.1. Regressors for Water Quality Parameters Modelling
In the task of water quality parameters modelling, a diverse set of regression algorithms was considered to balance linear baselines and nonlinear learners:
Linear Regression [32] is a fundamental supervised learning method that establishes a linear relationship between variables by fitting a line to the observed data. The primary objective is to estimate model parameters that minimize the Sum of Squared Errors (SSE) between predicted and actual values. Implementing the algorithm involves key steps such as data preprocessing, feature selection, model fitting using the least squares method, and subsequent evaluation and diagnosis.
Ridge Regression [
33], also known as L2-regularized regression, is an advanced form of linear regression designed for situations where the dataset has many features relative to the number of data points or when features are highly correlated (multicollinearity). Its primary function is to prevent overfitting and improve the model’s robustness. It achieves this by adding an L2 penalty term to the standard linear regression cost function.
Random Forest [34] is an ML technique that belongs to the ensemble family of algorithms, meaning it uses multiple models to get a better overall result. Its fundamental goal is to build a “forest” of many simple decision trees and combine their individual predictions to produce an outcome that is more accurate and less prone to errors than any single tree. Random Forest achieves this stability by purposefully introducing randomness; it trains each tree on a slightly different random subset of the data and features.
Gradient Boosting [
35] is an ensemble method in ML that builds its predictive model as a series of sequential steps. It works by creating new, simple decision trees that are designed to fix the prediction errors of the trees that came before them. This process uses a gradient descent approach to gradually improve accuracy by minimizing a chosen measure of error. The technique is flexible and effective across various applications but does require careful setting of its parameters to achieve good performance.
The eXtreme Gradient Boosting (XGBoost) [
36] is a highly efficient and scalable implementation of the gradient boosting framework, often favoured for its speed and performance in structured data competitions. It introduces several enhancements, such as regularization (L1 and L2) to prevent overfitting and parallel processing of the tree construction. Due to its advanced optimization and handling of missing values, it has become a leading choice for complex regression tasks.
The Support Vector Machine (SVM) [37] is a classification algorithm that works by finding the boundary that best separates two classes of data. The main idea is to maximize the margin, which is the empty space between the separating line (hyperplane) and the closest data points from each class. These closest points are called support vectors because they are the only ones that “support” or define the final position of the boundary. By maximizing this gap, the SVM creates a robust model that generalizes well and makes more reliable predictions on new, unseen data. For complex data that cannot be separated by a straight line, SVM uses the Kernel Trick. This mathematical technique allows the algorithm to effectively transform the data into a higher dimension where a straight separation is possible, enabling it to fit non-linear patterns. In regression settings, the same margin principle is used in the form of support vector regression (SVR), which fits a function that tolerates deviations within a fixed margin around the targets.
The Random Sample Consensus (RANSAC) algorithm [
38] is an iterative algorithm that estimates model parameters by fitting candidate solutions to randomly selected data subsets and retaining the solution supported by the largest consensus set. Although robust to outliers, RANSAC is known to be computationally expensive and sensitive to noise and the correct selection of the true dimension.
K-Nearest Neighbours (KNN) Regression is a simple nonparametric method highly valued for its effectiveness with complex data structures [39]. It works by predicting the value for a new data point based on the average (or a weighted average) of the k closest data points in the training set. While easy to implement, standard KNN regression is susceptible to overfitting and discontinuity in the fit. Extensions of KNN have been proposed to enhance its accuracy and robustness in big data applications by integrating techniques such as kernel smoothing and bootstrap sampling.
Poisson Regression [40] is a generalized linear model for count data that uses a log link function, which guarantees non-negative predictions for any combination of input factors. Poisson regression was included to provide a statistical baseline for comparison across modelling approaches.
This extended set of models allowed for a comprehensive benchmarking of both classical statistical approaches and modern ensemble learners, ensuring that the analysis captured linear, nonlinear, and instance-based perspectives on the relationship between satellite reflectance and water quality parameters.
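Of the regressors listed above, KNN is the simplest to illustrate in a few lines. The sketch below is a generic plain-Python version, assuming Euclidean distance and optional inverse-distance weighting. The function name and the toy data are hypothetical and are not the implementation or data used in the study.

```python
import math

def knn_predict(train_x, train_y, query, k=3, weighted=False):
    """Predict a value as the (optionally distance-weighted) mean
    of the targets of the k nearest training points."""
    nearest = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )[:k]
    if not weighted:
        return sum(y for _, y in nearest) / len(nearest)
    # Inverse-distance weights; the small constant avoids division by zero.
    w = [1.0 / (d + 1e-12) for d, _ in nearest]
    return sum(wi * y for wi, (_, y) in zip(w, nearest)) / sum(w)

# Toy one-feature data, for illustration only.
train_x = [(0.0,), (1.0,), (2.0,), (10.0,)]
train_y = [0.0, 1.0, 2.0, 10.0]
plain = knn_predict(train_x, train_y, (1.0,), k=3)
weighted = knn_predict(train_x, train_y, (1.0,), k=3, weighted=True)
print(plain, weighted)
```

The distant point at 10.0 does not influence the prediction at all, which illustrates why the method responds to local neighbourhood structure rather than to a global trend.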
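In situ parameter modelling was carried out in Python 3.12.4 using raw in situ measurements as regression targets (computer code is available, as stated in the section Data Availability Statement). Performance was quantified using the mean absolute error (MAE, Equation (1)), root mean square error (RMSE, Equation (2)), and the coefficient of determination (R², Equation (3)) [41]:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right| \tag{1}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2} \tag{2}$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \tag{3}$$
where $\hat{y}_i$ is the estimated value, $y_i$ is the observed value, $\bar{y}$ is the mean of observed values, and $n$ is the number of samples.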
The R² quantifies the proportion of variance in the dependent variable that is explained by a regression model. Its values range from 1 (perfect prediction) to negative infinity. While values close to 1 indicate strong predictive performance, negative R² values occur when a model performs worse than a baseline predictor that simply returns the mean of the observed data [42]. Because R² is dependent on the variance of the underlying dataset, it is not directly comparable across datasets with different distributions. The score is undefined when the true target has zero variance; in such cases, implementations typically assign 1.0 for perfect predictions or 0.0 when predictions deviate from the constant target.
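The behaviour of these metrics, including negative R² and the zero-variance convention described above, can be checked with a short plain-Python sketch. The function names and toy values are illustrative assumptions and do not come from the study's code or data.

```python
import math

def mae(y_true, y_pred):
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(
        sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

def r2(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0.0:
        # Zero-variance target: follow the convention described in the text.
        return 1.0 if ss_res == 0.0 else 0.0
    return 1.0 - ss_res / ss_tot

observed = [1.0, 2.0, 3.0, 4.0]
close_fit = [1.1, 1.9, 3.2, 3.8]      # small errors
reversed_fit = [4.0, 3.0, 2.0, 1.0]   # worse than predicting the mean
mean_only = [2.5, 2.5, 2.5, 2.5]      # baseline predictor

print(mae(observed, close_fit), rmse(observed, close_fit), r2(observed, close_fit))
print(r2(observed, reversed_fit), r2(observed, mean_only))
```

The reversed predictions give a strongly negative R², while the mean-only baseline gives exactly zero, matching the interpretation given above.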
2.5.2. CNNs for WQI Assessment
To predict the WQI from satellite observations, the dataset of 318 raster-derived samples was randomly partitioned in Python into two subsets, 80% for model training and 20% for independent testing, with stratification to preserve the distribution of WQI classes (computer code is available, as stated in the section Data Availability Statement).
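A class-preserving 80/20 split of this kind can be sketched with the standard library alone. The class counts below are those reported in Section 2.4. The function name, the seed, and the per-class rounding rule are assumptions, so the exact subset sizes used in the study may differ.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices so that each class contributes roughly
    test_fraction of its members to the test subset."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)  # assumed rounding rule
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Pixels per WQI class as reported in Section 2.4.
class_counts = {1: 55, 2: 64, 3: 49, 4: 48, 5: 46, 6: 29, 7: 27}
labels = [c for c, n in class_counts.items() for _ in range(n)]
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting within each class keeps the smaller classes (6 and 7) represented in the test subset, which a purely random split of 318 samples would not guarantee.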
The candidate models for WQI prediction were developed based on a one-dimensional (1D) CNN architecture. CNNs are a class of ML algorithms that combine convolutional layers with fully connected dense layers. The convolutional layers excel at extracting features from raw signals or imagery without requiring prior preprocessing, while the dense layers serve primarily for classification tasks. Given that WQI measurements were collected over a one-year period, each sampling point was represented by multiple satellite image snapshots spanning that time frame. Specifically, a single WQI value was predicted from a matrix consisting of 12 temporal snapshots for each spectral band, where the dimension is bands × 12 months, reflecting consistent temporal coverage used throughout the study.
The neural network architecture consisted of two convolutional layers, followed by normalization, pooling, and dense layers. The network first learns spectral features by applying 1D convolutions along the spectral dimension, capturing spectral characteristics and their temporal variations. Subsequently, 1D convolutions are applied along the temporal dimension, enabling the model to identify significant feature dynamics over time for each spectral feature separately. This two-fold convolution strategy makes it possible to evaluate how both the spectral and the temporal information in the satellite time series contribute to the prediction accuracy of the WQI, in line with common practice in deep learning for environmental and remote sensing applications.
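The effect of the two convolution stages on the bands × 12 input matrix can be traced with a small shape-level sketch. This is not the trained network: the kernels are fixed rather than learned, the kernel size of 3 and the absence of padding are assumptions, and normalization, pooling, and dense layers are omitted. The 8-band input mirrors the PlanetScope band count given in Section 2.3.

```python
def conv1d_valid(seq, kernel):
    """Sliding dot product without padding (the cross-correlation
    that CNN layers conventionally call convolution)."""
    k = len(kernel)
    return [
        sum(seq[i + j] * kernel[j] for j in range(k))
        for i in range(len(seq) - k + 1)
    ]

def conv_along_bands(x, kernel):
    """x is a list of bands, each a list of monthly values.
    Filters the spectrum of every month independently."""
    n_months = len(x[0])
    per_month = [
        conv1d_valid([band[m] for band in x], kernel) for m in range(n_months)
    ]
    return [list(row) for row in zip(*per_month)]  # back to features x months

def conv_along_time(x, kernel):
    """Filters the time series of every spectral feature independently."""
    return [conv1d_valid(row, kernel) for row in x]

n_bands, n_months = 8, 12
x = [[float(b + m) for m in range(n_months)] for b in range(n_bands)]  # toy input
spectral = conv_along_bands(x, [0.25, 0.5, 0.25])   # hypothetical kernel
temporal = conv_along_time(spectral, [-1.0, 0.0, 1.0])  # hypothetical kernel
print(len(spectral), len(spectral[0]), len(temporal), len(temporal[0]))
```

With these assumed settings, the spectral stage reduces 8 bands to 6 spectral features over 12 months, and the temporal stage reduces the 12 months to 10 time steps per feature.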
For the WQI-based modelling, predicted classes were compared against reference labels derived from the classified MCDA raster, and two accuracy metrics were used to quantitatively assess the performance of the classification models. Overall accuracy was evaluated using confusion matrices, where correct predictions correspond to the main diagonal and all misclassifications are treated equally, regardless of how close the predicted class is to the correct one [43]. The area under the receiver operating characteristic (ROC) curve (AUC) [44] was used as the principal performance metric because, unlike accuracy, it evaluates the model based on the predicted probabilities of class membership, capturing how well the model ranks and separates water quality categories across all decision thresholds.
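The difference between the two metrics can be made concrete with a small sketch. The text does not state which multiclass AUC variant was used, so the example assumes a macro-averaged one-vs-rest scheme. The function names, confusion matrix, probabilities, and three-class setup are hypothetical.

```python
def overall_accuracy(confusion):
    """Share of samples on the main diagonal of a confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total

def binary_auc(scores, is_positive):
    """Probability that a random positive is scored above a random
    negative (ties count as one half)."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum(
        1.0 if sp > sn else 0.5 if sp == sn else 0.0
        for sp in pos for sn in neg
    )
    return wins / (len(pos) * len(neg))

def macro_ovr_auc(probabilities, labels, classes):
    """Average of one-vs-rest AUCs over all classes (assumed variant)."""
    aucs = [
        binary_auc([p[k] for p in probabilities], [y == c for y in labels])
        for k, c in enumerate(classes)
    ]
    return sum(aucs) / len(aucs)

# Hypothetical three-class example, for illustration only.
confusion = [[5, 1, 0], [1, 4, 1], [0, 2, 6]]
labels = [0, 0, 1, 1, 2, 2]
probabilities = [
    [0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.3, 0.6, 0.1],
    [0.2, 0.5, 0.3], [0.1, 0.3, 0.6], [0.2, 0.2, 0.6],
]
acc = overall_accuracy(confusion)
auc = macro_ovr_auc(probabilities, labels, classes=[0, 1, 2])
print(acc, auc)
```

In this toy case the second sample would be misclassified by a hard argmax decision, yet the AUC remains high because the predicted probabilities still rank most samples correctly within each class.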
Generalization performance was assessed on the held-out 20% test subset. Testing was conducted independently for Sentinel-2, Landsat 8–9, and PlanetScope to ensure a fair, sensor-specific comparison under identical evaluation criteria.
2.6. Workflow Overview
The overall SIGMaL workflow (
Figure 4) integrates four main components: (i) in situ water quality monitoring and probe calibration, (ii) GIS–MCDA and raster-based derivation of the WQI, (iii) satellite image acquisition and preprocessing, and (iv) ML model training, testing, and spatial prediction.
This stepwise design ensured consistency across heterogeneous data sources (in situ measurements, GIS–MCDA models, and satellite imagery) and facilitated reproducible ML experiments. The 80/20 split of raster-derived samples into training and testing subsets provided the foundation for robust model evaluation, while comparative benchmarking across algorithms and sensors allowed systematic identification of the most effective predictive approach.
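A minimal sketch of an 80/20 hold-out split is given below. The text does not state whether the split was purely random, stratified by WQI class, or spatially blocked, so the seeded random shuffle, the function name `split_80_20`, and the placeholder samples are assumptions.

```python
import random


def split_80_20(samples, seed=0, test_fraction=0.2):
    """Randomly assign about `test_fraction` of the samples to the test subset."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)          # fixed seed for reproducibility
    n_test = round(len(samples) * test_fraction)
    test_idx = set(indices[:n_test])
    train = [s for i, s in enumerate(samples) if i not in test_idx]
    test = [s for i, s in enumerate(samples) if i in test_idx]
    return train, test


# Hypothetical raster-derived samples, represented here only by an index.
samples = list(range(100))
train, test = split_80_20(samples)
print(len(train), len(test))  # 80 20
```

Fixing the seed keeps the same partition across sensors, which supports a like-for-like comparison of models.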
3. Results
3.1. Regression of In Situ Parameters
3.1.1. Sentinel-2 Results
Across the Sentinel-2 dataset, ensemble methods outperformed linear and kernel-based approaches, with clear advantages in modelling nonlinear spectral–water quality relationships (Table 3). Gradient Boosting delivered the strongest overall performance for most variables, achieving the highest accuracy for WT (R2 = 0.816, MAE = 15.218, RMSE = 25.675) and competitive fits for turbidity (R2 = 0.765) and DO (R2 = 0.682). For EC, Random Forest and Gradient Boosting were nearly indistinguishable: Random Forest achieved the lowest MAE (0.178 versus 0.185), whereas Gradient Boosting had a marginally higher coefficient of determination (R2 = 0.652 versus 0.650) and a marginally lower RMSE (0.246 versus 0.247). Both models thus showed strong robustness to spectral heterogeneity and stable predictive performance.
The KNN Regressor showed particularly strong behaviour for turbidity, achieving the highest R2 across all models (0.806) and low MAE and RMSE values, suggesting that local neighbourhood patterns in reflectance strongly benefit turbidity estimation. Linear Regression produced moderate fits across all variables, while Ridge and Poisson regression models consistently resulted in near-zero or negative R2 values, confirming their limited suitability for modelling nonlinear satellite reflectance–water quality relationships. SVM and RANSAC exhibited unstable performance, especially for turbidity and DO, with negative or low R2, reflecting sensitivity to noise and high-variance spectral conditions.
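For reference, the three regression metrics reported in this section can be computed as in the sketch below, which uses their standard definitions. The function name and the toy values are illustrative assumptions. The second call shows how R2 becomes negative when predictions are worse than simply predicting the mean of the observations.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R2, MAE and RMSE for paired observed and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)           # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "rmse": math.sqrt(ss_res / n),
    }


# Toy values (hypothetical, not study data).
observed = [2.0, 4.0, 6.0, 8.0]
good = regression_metrics(observed, [2.5, 3.5, 6.5, 7.5])   # close to the observations
poor = regression_metrics(observed, [8.0, 6.0, 4.0, 2.0])   # reversed trend
print(good)
print(poor)
```

Because MAE and RMSE carry the units of the target variable, they are comparable across models for one parameter but not across parameters.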
3.1.2. Landsat 8–9 Results
For Landsat 8–9, ensemble tree-based models again provided superior performance relative to linear and kernel methods, with clearer separation among algorithms for individual water quality parameters (Table 4). Gradient Boosting achieved the best overall performance for EC, delivering the highest R2 (0.728) and, overall, the lowest MAE and RMSE among the tested learners. For turbidity, Random Forest achieved the best overall performance (R2 = 0.591, MAE = 6.245), outperforming Gradient Boosting and XGBoost.
WT prediction showed exceptionally strong accuracy across the board, with both Random Forest and Gradient Boosting achieving R2 = 0.996 and very low error values (<4 °C RMSE). This indicates that Landsat’s thermal bands (B10 and B11) provide highly stable temperature information for the study area. For DO, Random Forest outperformed all other models by a large margin (R2 = 0.921, RMSE = 0.368), with Gradient Boosting performing similarly but slightly weaker. Linear Regression provided moderate fits, while Ridge Regression suffered degraded performance for all variables except WT. Kernel-based SVM regression, Poisson regression, and RANSAC performed poorly, often yielding negative or near-zero R2 values, highlighting their sensitivity to nonlinear and noisy spectral–ecological relationships.
3.1.3. PlanetScope Results
PlanetScope produced more variable model performance due to its limited spectral range and sensitivity to atmospheric and adjacency effects (Table 5). Nevertheless, several algorithms achieved strong predictive capability. The KNN Regressor was the strongest overall performer, delivering the highest R2 values for EC (0.713), turbidity (0.661), and DO (0.613), indicating that PlanetScope’s fine spatial resolution (3 m) enables effective exploitation of local spectral neighbourhoods despite the restricted spectral configuration.
For WT, Linear Regression surprisingly outperformed all nonlinear models (R2 = 0.685), suggesting that under stable atmospheric conditions the reflectance–temperature relationship behaves more linearly than for other variables. Ensemble tree-based models such as Random Forest and Gradient Boosting produced moderate and consistent results across most parameters (R2 between 0.48 and 0.60), confirming their robustness to noise but also highlighting the constraints imposed by PlanetScope’s narrow spectral range. Ridge, Poisson, SVM, and RANSAC frequently yielded low or negative R2, particularly for turbidity, where adjacency contamination and radiometric instability were most pronounced.
3.2. WQI CNN Models
When using the integrated WQI derived from the GIS–MCDA raster, the problem was reformulated as a supervised classification task. The seven WQI classes served as categorical labels for model training and testing.
Classical classification algorithms are generally unsuitable for complex inputs such as time series of spectral vectors because they treat each feature independently and cannot capture the inherent spectral and temporal dependencies in the data. This results in suboptimal performance since these dependencies carry crucial information about the underlying processes. Moreover, the high dimensionality of such inputs significantly complicates the optimization process, especially when the number of training samples is limited. The large feature space can lead to overfitting and poor generalization, making convergence during training unlikely. In contrast, methods like CNNs are better suited to this type of data because they can learn local spectral features and temporal patterns through convolutional operations, preserving dependencies and reducing dimensional complexity via shared weights and pooling layers.
The following WQI results are produced by the CNN models; “spectral” refers to single-date band stacks, and “temporal” refers to band-wise concatenation of monthly windows.
Model evaluation was based on confusion matrices (Figure 5, Figure 7 and Figure 9) and learning curves (Figure 6, Figure 8 and Figure 10), with overall accuracy and AUC used as the principal metrics of predictive accuracy (Table 6).
The confusion matrices illustrate the classification performance of the WQI prediction models across the three satellite datasets (Sentinel-2, Landsat 8–9, and PlanetScope). Each confusion-matrix figure contains four panels representing the training and test subsets for both spectral and temporal feature configurations.
The performance of the ML models applied to WQI classification across the Sentinel-2, Landsat 8–9, and PlanetScope datasets, using both spectral and temporal features, is listed in Table 6. Overall, the models based on spectral inputs outperformed those relying on temporal composites in most configurations, indicating that spectral variability provides a more stable and discriminative basis for estimating integrated water quality conditions.
3.2.1. Sentinel-2 CNN Performance
In the spectral configuration, both the training and test confusion matrices (
Figure 5) show a dominant diagonal, indicating generally correct class assignments. However, the model frequently confused neighbouring classes, mostly within the mid-range WQI categories (Classes 3–5). This pattern is consistent with the moderate test accuracy of 0.53 and the high AUC of 1.00 reported in
Table 6. Misclassifications rarely extend far from the diagonal, suggesting that the model captured the overall ordinal structure of the WQI but struggled to resolve subtle class boundaries.
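One way to quantify how far misclassifications stray from the diagonal is to compare exact accuracy with a within-one-class accuracy. This second measure is not reported in the study; it is introduced here only as an illustrative assumption, together with the function name and the toy confusion matrix.

```python
def accuracy_within(cm, tolerance):
    """Share of samples whose predicted class is at most `tolerance` classes from the true one."""
    total = sum(sum(row) for row in cm)
    close = sum(cm[i][j]
                for i in range(len(cm))
                for j in range(len(cm))
                if abs(i - j) <= tolerance)
    return close / total


# Toy four-class confusion matrix (rows: true class, columns: predicted class).
cm = [[5, 2, 0, 0],
      [1, 4, 2, 0],
      [0, 2, 3, 1],
      [1, 0, 2, 5]]

exact = accuracy_within(cm, 0)      # conventional overall accuracy
adjacent = accuracy_within(cm, 1)   # also counts neighbouring-class errors as hits
print(round(exact, 3), round(adjacent, 3))
```

A large gap between the two values indicates that most errors fall in neighbouring classes, which is the situation described for the mid-range WQI categories.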
The temporal model exhibits even stronger diagonal structure. Training performance is notably higher (accuracy 0.85, AUC 1.00), and the test matrices show fewer off-diagonal entries than in the spectral case. Although test accuracy (0.53) matches that of the spectral model, the temporal model achieves higher test R2 (0.82) and more concentrated diagonal predictions, indicating better preservation of ordinal class relationships. This suggests that temporal aggregation stabilized spectral variability and enhanced class separability, reducing confusion among adjacent WQI classes.
The learning curves for the Sentinel-2 spectral and temporal models are shown in Figure 6. In both configurations, training accuracy increased steadily and reached values above 0.85, accompanied by a consistent decrease in training loss throughout the 500 epochs. Validation accuracy remained lower, fluctuating mostly between 0.45 and 0.60 in both cases, without a strong upward trend. Validation loss showed pronounced variability, including frequent spikes that increased in magnitude at later epochs. These patterns indicate that the model fitted the training data well while validation performance remained less consistent, a gap that is characteristic of overfitting.
3.2.2. Landsat 8–9 CNN Performance
For Landsat 8–9 (
Figure 7), the spectral model showed moderate classification capability, achieving high AUC values (0.98 train, 0.97 test) but a test accuracy of only 0.53 (Table 6). The test confusion matrix confirms this mismatch: although the model correctly follows the overall WQI gradient, misclassifications remain frequent across several classes, including errors beyond neighbouring categories. This indicates that, despite good probabilistic separation reflected in the AUC, the spectral model struggled to assign discrete class labels with high reliability. The likely cause is Landsat’s coarser spatial resolution and fewer narrow spectral bands compared with Sentinel-2, which limit its ability to resolve subtle differences between adjacent WQI classes.
Temporal modelling resulted in severe degradation of performance. Both train and test AUC values collapsed to 0.50, with test accuracy decreasing to 0.08 and R2 reaching −3.56, clearly indicating prediction collapse (
Table 6). The temporal confusion matrices corroborate that nearly all samples were assigned to a single WQI class, with almost no differentiation across the seven classes. This behaviour suggests that temporal stacking introduced noise rather than informative temporal structure. The likely cause is Landsat’s long revisit interval combined with inconsistent atmospheric and illumination conditions between acquisition dates, which reduced temporal coherence and led the CNN to overfit the training set while failing entirely to generalize to unseen data.
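The metric signature of such a collapse can be reproduced with a small sketch. It assumes that R2 is computed on the numeric class labels and that a collapsed model outputs identical scores for every sample; both points, along with the balanced toy labels and function names, are assumptions rather than details taken from the study.

```python
def r2(y_true, y_pred):
    """Coefficient of determination computed on numeric class labels."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rank_auc(scores, is_positive):
    """Probability that a positive sample outranks a negative one (ties count 0.5)."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical balanced test set with seven WQI classes.
true_classes = [1, 2, 3, 4, 5, 6, 7] * 4
collapsed = [7] * len(true_classes)            # every sample assigned to one class
flat_scores = [1.0 / 7] * len(true_classes)    # identical class-7 score for all samples

collapse_accuracy = sum(t == p for t, p in zip(true_classes, collapsed)) / len(true_classes)
collapse_r2 = r2(true_classes, collapsed)
collapse_auc = rank_auc(flat_scores, [t == 7 for t in true_classes])
print(collapse_accuracy, collapse_r2, collapse_auc)
```

A constant prediction therefore yields chance-level AUC, an accuracy equal to the share of the predicted class, and a negative R2 whenever that class differs from the mean label.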
The learning curves for the Landsat 8–9 temporal model are shown (
Figure 8). Training accuracy increased gradually to approximately 0.70, while training loss decreased smoothly over epochs. In contrast, validation accuracy remained low and highly variable, fluctuating mostly between 0.05 and 0.25 without a clear upward trend. Validation loss showed substantial instability, with frequent large spikes throughout training. These patterns indicate that, although the model fitted the training data, its performance on the validation set was inconsistent under the temporal configuration. Furthermore, the validation predictions tended to collapse into a single WQI class for extended periods during training, with the dominant predicted class shifting from epoch to epoch, reflecting unstable class separation under temporal inputs.
3.2.3. PlanetScope CNN Performance
For PlanetScope (
Figure 9), the spectral model showed strong class-ranking ability (test AUC of 0.97, R2 of 0.77) but a modest test accuracy of 0.42 (Table 6). The spectral confusion matrices display a clear diagonal trend, with most predictions falling in or next to the correct WQI class. Misclassifications occur primarily between adjacent classes (especially around Classes 2–3 and 4–5), which indicates that the model successfully captured the underlying ordinal gradient while occasionally struggling with fine boundary transitions. This behaviour aligns with the high spatial resolution of PlanetScope imagery (3 m), which enables discrimination of small-scale spatial patterns relevant to water quality.
The temporal model performed worse in ordinal terms, with a lower test AUC (0.94) and a much lower R2 (0.10), although its test accuracy (0.44) was comparable to that of the spectral model (Table 6). The temporal confusion matrix reveals considerable class mixing: several classes show dispersion into multiple neighbouring categories, and true classes 3–5 exhibit notable overlap. Although diagonal structure is still present, class separability is reduced compared with the spectral model. This degradation likely reflects PlanetScope’s limited spectral range combined with day-to-day variations in illumination and atmospheric conditions, which introduce noise into temporal features and reduce their predictive stability.
The learning curves for the PlanetScope spectral and temporal models are shown (
Figure 10). For both configurations, training accuracy increased steadily, reaching approximately 0.90, while training loss decreased smoothly across epochs. Validation accuracy remained notably lower, fluctuating mostly between 0.35 and 0.55 without a clear long-term upward trend. Validation loss exhibited substantial variability, with frequent spikes that persisted throughout training. Compared to the spectral configuration, the temporal model showed similar behaviour, with slightly larger oscillations in validation loss but comparable validation accuracy ranges. The curves indicate stable convergence on the training data but limited consistency in validation performance for both PlanetScope configurations.
3.3. CNN-Based Predictions of WQI
The CNN-based WQI predictions for the year following the in situ monitoring period, generated separately for Sentinel-2, Landsat 8–9, and PlanetScope under the spectral and temporal model configurations, are shown in Figure 11. Since no ground-truth data exist for this period, the maps represent forward predictions of spatial water quality patterns derived from the trained models.
Across all sensors, the maps in
Figure 11 display a broadly consistent spatial structure: higher-quality WQI classes (1–3) occur mainly in the western and central parts of Vrana Lake, while lower-quality classes (5–7) are more frequent along the eastern and southeastern margins. This gradient mirrors the dominant spatial trend captured during model training.
For Sentinel-2, the spectral model (top left) produces a smooth but spatially detailed gradient, with noticeable internal class transitions. In contrast, the temporal model (top right) yields a more uniform surface, with reduced fine-scale variability and more clustered class regions.
For Landsat 8–9, spectral predictions (middle left) exhibit stronger heterogeneity and a wider distribution of mid- to low-quality classes across the lake. The temporal model (middle right) produces highly homogenized outputs, with most pixels assigned to a narrow range of lower-quality classes, reflecting reduced class discrimination.
For PlanetScope, the spectral model (bottom left) generates the most spatially detailed output among all sensors, with well-defined class boundaries and visible local variation. The temporal model (bottom right) preserves the general lake-wide gradient but presents a smoother pattern with less within-lake differentiation.
4. Discussion
4.1. Cross-Sensor Comparison
Across all sensor–model configurations, Sentinel-2 demonstrated the strongest and most consistent performance for both regression of individual in situ parameters and CNN-based WQI classification. Its combination of dense visible and NIR spectral coverage and moderate spatial resolution allowed the models to capture both the optical complexity and spatial gradients of Vrana Lake. Ensemble regression models reached the highest R2 values for EC, turbidity and DO, while CNN classification achieved AUC = 1.00 and stable WQI class separation. These findings align with recent studies by Pizani et al. (2020) [
19] and Toming et al. (2016) [
20] showing that Sentinel-2 reliably estimates water quality indicators across rivers, lakes and reservoirs. They highlight the suitability of Sentinel-2 as the primary remote-sensing component within the SIGMaL framework.
PlanetScope performed very well in tasks requiring fine spatial discrimination. Its spectral CNN model produced the most detailed WQI boundary delineation among all sensors, which is consistent with previous work showing that PlanetScope’s 3 m resolution excels at mapping small-scale spatial heterogeneity despite its limited spectral range [
21,
45]. However, because it carries only a few broad multispectral bands, it is more susceptible to atmospheric variation and less robust when modelling temporally aggregated features. This behaviour is fully reflected in the SIGMaL experiments and matches patterns observed in an earlier comparative water-quality study by Di Francesco et al. (2025) [
46].
For Landsat 8–9 (OLI/TIRS), the results diverged between regression and classification tasks. In situ temperature regression achieved exceptionally high accuracy (R2 ≈ 0.996), consistent with many studies demonstrating that Landsat’s thermal bands provide highly reliable surface water temperature retrievals [19, 47]. In contrast, Landsat’s CNN classification performance was modest in the spectral configuration (test accuracy = 0.53) and collapsed almost entirely in the temporal configuration (accuracy ≈ 0.08; strongly negative R2). This behaviour reflects Landsat’s coarser 30 m spatial resolution, lower revisit frequency, and fewer narrow spectral bands. These characteristics limit its ability to resolve the subtle water-quality gradients needed for seven-class WQI discrimination within the SIGMaL workflow. Similar shortcomings of Landsat relative to Sentinel-2 in inland waters have been observed broadly in recent comparisons by Deng et al. (2024), Pizani et al. (2020), and Parida et al. (2025) [
17,
19,
21].
Across all sensors, spectral CNN models generally outperformed temporal models [
29]. Spectral snapshots preserve instantaneous optical conditions, whereas temporal composites blend scenes captured under different illumination, atmospheric states, and hydrodynamic conditions, reducing contrast and adding noise. This aligns with recent studies by Deng et al. (2024), Pizani et al. (2020), and Toming et al. (2016) [
17,
19,
20] emphasizing that, despite growing interest in temporal deep learning, snapshot-based spectral models remain more accurate for WQI estimation in optically complex inland waters.
An important explanatory factor in this study is the temporal offset between field surveys and satellite overpasses (
Table 2). Maximal offsets ranged from –11 to +7 days and were especially problematic for Landsat. Vrana Lake is shallow, and strong Bora, Jugo or Maestral winds can change temperature and nutrient distributions within hours. Temporal inputs thus often combined reflectance measurements that may no longer have corresponded to the in situ water state. This mismatch weakens temporal coherence and degrades CNN performance across all temporal SIGMaL configurations, particularly given Landsat’s already sparse revisit schedule.
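The offset between a field survey and the closest usable image can be computed as in the sketch below. The dates, the sign convention (negative when the image precedes the survey), the three-day tolerance, and the function names are all assumptions chosen for illustration; they are not taken from Table 2.

```python
from datetime import date


def nearest_overpass(survey_day, overpass_days):
    """Return the closest overpass and its signed offset in days.

    A negative offset means the image was acquired before the survey.
    """
    best = min(overpass_days, key=lambda d: abs((d - survey_day).days))
    return best, (best - survey_day).days


def within_tolerance(offset_days, max_abs_days):
    """Check whether an image is close enough in time to the field survey."""
    return abs(offset_days) <= max_abs_days


# Hypothetical survey date and cloud-free overpasses of a sensor with sparse revisits.
survey = date(2023, 6, 15)
overpasses = [date(2023, 6, 4), date(2023, 7, 6)]

image_day, offset = nearest_overpass(survey, overpasses)
usable = within_tolerance(offset, max_abs_days=3)
print(image_day, offset, usable)
```

With a strict tolerance of a few days, sparse revisit schedules leave many surveys without a matching image, which is the trade-off between sample size and temporal coherence discussed above.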
Within this framework, satellite observations are essential because they provide spatially exhaustive, synoptic measurements that allow integrated WQI patterns to be mapped consistently across the entire lake surface. Repeated sampling of the same 20 in situ stations, even when combined with spatial interpolation, cannot provide sensor-comparable, wall-to-wall coverage or capture spatial organization at the resolution and extent enabled by satellite imagery. Satellite data are therefore not used to increase the number of independent observations, but to enable spatial generalization and pattern recognition beyond the discrete sampling network.
4.2. WQI Outperforms Modelling Individual Parameters
A central methodological finding is that using the integrated WQI as the modelling target substantially improved predictive stability relative to direct regression of raw physicochemical parameters. Within the SIGMaL framework, CNN classification produced clearer ordinal structure, more stable confusion matrices, and better cross-sensor consistency than parameter-specific models. This confirms that WQI acts as a noise-reduced, integrated ecological signal, smoothing short-term fluctuations and reducing the influence of measurement noise or parameter-specific anomalies.
Recent studies similarly show that ML/WQI models provide greater robustness and interpretability than models predicting individual parameters. For example, Wong et al. (2022) [
48] demonstrated that WQI-based machine-learning models (particularly modified Random Forest) outperform raw parameter prediction by providing higher accuracy and more stable explanatory structure. Pang et al. (2025) [
49] showed that deep-learning approaches in remote sensing similarly benefit from using integrated indices such as WQI, which improve model robustness and cross-sensor transferability. The results of this study support these findings and show that WQI provides a superior modelling target within SIGMaL.
4.3. Spatial Predictions of WQI
CNN-based annual predictions for the post-monitoring period showed a consistent lake-wide west–east gradient across all sensors, with higher-quality classes (1–3) dominating the central and western areas and lower-quality classes (5–7) occurring more frequently along the eastern margins. This pattern matches field observations and known hydrodynamic processes in Vrana Lake, where nutrient inputs and restricted water exchange influence eastern basin conditions.
The Sentinel-2 and PlanetScope spectral models provided the clearest spatial structure. Sentinel-2 produced smooth, ecologically meaningful gradients, whereas PlanetScope highlighted small-scale shoreline and central-basin heterogeneities. Landsat 8–9 reproduced the general gradient but produced smoother, more spatially homogeneous maps consistent with its coarser spatial resolution. Temporal models, especially for Landsat and PlanetScope, yielded more uniform spatial fields and reduced internal variability, which is consistent with the confusion matrices and learning curves showing diminished class separability under temporal input conditions.
Visual differences among the WQI maps derived from Sentinel-2, Landsat 8–9, and PlanetScope do not contradict the relatively high quantitative performance metrics reported in
Table 6. The WQI prediction is formulated as an ordinal classification problem, where classes represent ordered categories derived from continuous GIS–MCDA scores rather than exact spatial boundaries. High AUC and R2 values therefore indicate consistent discrimination and correct ranking of relative water quality conditions, even when the spatial expression of class boundaries differs among sensors. These differences primarily reflect sensor-specific characteristics, including spatial resolution, spectral configuration, and revisit frequency, which influence the level of spatial detail and smoothness in the predicted maps. Consequently, the observed map discrepancies represent variations in spatial sensitivity rather than inconsistencies in model performance.
Finally, the one-year temporal offset between the in situ–based WQI modelling and satellite-based prediction may influence model accuracy due to potential domain shifts in key input parameters. Specifically, changes in the minimum and maximum values, distribution characteristics, or inter-parameter relationships driven by differing hydrological, meteorological, or anthropogenic conditions could affect model generalization. Consequently, the spatial predictions presented here are interpreted as a scenario-based extrapolation of lake water quality patterns rather than a strict temporal validation.
4.4. Methodological Limitations and Future Work
Using Level-2 surface reflectance products (rather than performing atmospheric corrections based on date and lake specifications) likely introduced residual atmospheric and adjacency artefacts. Such effects can be significant in shallow, optically complex lakes. Although ML models are often robust to moderate atmospheric errors, employing algorithms tailored for inland waters (e.g., ACOLITE, iCOR, C2RCC) could further improve physical consistency in future work.
As shown in
Table 2, temporal offsets of up to 11 days were unavoidable due to cloud cover, satellite revisit constraints, and safety considerations for fieldwork. Because Vrana Lake mixes rapidly under strong wind conditions, water quality can change significantly within these time windows. Thus, “temporal” stacks often aggregated reflectance signals that no longer matched in situ conditions, explaining the instability and class-collapse seen especially in the Landsat temporal CNN models.
WQI simplifies ecological interpretation but conceals short-term or parameter-specific extremes (e.g., chlorophyll-a spikes). Future SIGMaL implementations should pair WQI-based classification with selective regression of critical parameters.
While promising, the results presented here reflect conditions in a single, moderately productive lake. The SIGMaL framework is designed to support spatial pattern recognition, comparative assessment, and monitoring prioritization, particularly in data-limited coastal lakes. It is not intended to replace in situ measurements or to provide fully quantitative water quality estimates that are directly transferable to other waterbodies without local calibration. Applying SIGMaL to other lakes will therefore require recalibration of the WQI, additional in situ sampling, and potentially model retraining to accommodate different optical environments.
Within this context, SIGMaL evaluates the ability of satellite sensors to reproduce the relative spatial organization of water quality across the lake, rather than fine-scale or instantaneous variability at individual locations. Nonetheless, the cross-sensor evaluation presented here provides a strong basis for generalizing the approach to other shallow coastal lakes.
5. Conclusions
This study demonstrates that integrating in situ monitoring, GIS–MCDA, satellite remote sensing, and ML within the proposed SIGMaL framework provides a robust and scalable approach for assessing water quality in shallow and dynamic freshwater ecosystems such as Vrana Lake. Across all modelling approaches, WQI-based prediction consistently outperformed regression of individual physicochemical parameters, confirming that integrated ecological indices offer a more stable and noise-resistant modelling target for remote sensing applications.
Among the evaluated satellite systems, Sentinel-2 emerged as the most suitable sensor for integrated WQI mapping, combining the highest and most consistent classification performance (AUC ≈ 1.00, R2 ≈ 0.84) with its rich visible and NIR spectral configuration. PlanetScope excelled in capturing fine-scale spatial variability (R2 ≈ 0.77) due to its high spatial resolution. Landsat 8–9 performed best for WT retrieval but showed reduced capability for multi-class WQI discrimination, particularly in temporal CNN models, largely due to revisit limitations and temporal mismatches with field campaigns. Accordingly, Sentinel-2 is recommended as the primary sensor for operational WQI-based monitoring within the SIGMaL framework, with PlanetScope serving as a complementary data source for high-resolution spatial analyses and Landsat 8–9 supporting temperature-focused or long-term monitoring applications.
Temporal modelling was generally less effective than spectral modelling across all sensors, partly due to inconsistent overpass timing and rapid hydrodynamic changes in the lake, which weakened temporal coherence. Despite these challenges, CNN-based WQI predictions successfully reproduced the known west–east water quality gradient of Vrana Lake, demonstrating the ecological relevance of the integrated modelling framework.
The results of this study highlight that the SIGMaL framework offers a scalable, transferable, and operationally practical approach for water quality monitoring in coastal shallow lakes. Future work should expand the framework to multiple lakes, incorporate more advanced atmospheric correction, and explore hybrid approaches that pair WQI classification with parameter-specific retrievals.