1. Introduction
Eutrophication of inland lakes remains one of the most widespread environmental challenges worldwide, driven primarily by excessive phosphorus and nitrogen inputs that stimulate algal blooms, reduce water transparency, and degrade aquatic ecosystems [1,2,3]. Accurate and spatially continuous estimation of total phosphorus is essential for understanding nutrient dynamics and supporting effective lake management.
However, obtaining lake-wide distributions of Total Phosphorus (PPUT; µg/L) through conventional in-situ sampling remains challenging due to the sparse spatial and temporal coverage of monitoring stations [4,5]. Furthermore, nearshore phosphorus concentrations are difficult to retrieve accurately from medium- or low-resolution satellite imagery, where pixel mixing and land-water adjacency effects severely distort spectral signals [6,7,8].
Remote sensing provides an effective means to complement in-situ observations by enabling repetitive, synoptic coverage of large water bodies [9,10]. Many studies have used satellite optical reflectance and derived indices to estimate water-quality parameters such as chlorophyll-a, turbidity, suspended sediments and, to some extent, phosphorus or nitrogen [11,12,13]. However, phosphorus is a non-optically active constituent whose spectral expression is indirect, primarily mediated by its relationship with other optically detectable variables such as phytoplankton and turbidity [14,15,16]. This indirect linkage complicates retrieval: spectral signals are weak, and model relationships are often site-specific [17,18].
Another constraint is spatial resolution: many water-quality retrieval studies rely on sensors such as Sentinel-2 MSI (≈10 m resolution) or Landsat 8 OLI (≈30 m resolution). These resolutions may fail to resolve narrow bays, tributary plumes, or near-shore mixing zones, and mixed-pixel effects can degrade retrieval accuracy in heterogeneous lake zones [19,20]. In this context, satellite data combining high spatial resolution with frequent revisits are highly desirable for fine-scale inland lake nutrient mapping.
The PlanetScope satellite constellation (Dove Classic, Dove-R, SuperDove) offers distinct advantages: a spatial resolution of ~3–5 m, near-daily revisit frequency, and continuity across multi-generation sensors, enabling high-resolution and frequent monitoring of lake surfaces and fine-scale features such as shorelines and tributary plumes [21,22,23]. Comparative work has also shown that PlanetScope imagery outperforms coarser sensors (Sentinel-2, Landsat-8) in retrieving specific water-quality parameters in optically complex inland waters [24,25]. Consequently, PlanetScope is especially well suited for whole-lake PPUT retrieval and for capturing near-shore nutrient heterogeneity, offering improved spatial detail and repeat coverage.
On the modelling front, machine-learning (ML) and deep-learning (DL) approaches have seen increased adoption in water-quality retrieval thanks to their ability to model nonlinear relationships between spectral, environmental, and in-situ data [21,26,27]. Algorithms such as Random Forest (RF), Light Gradient Boosting Machine (LightGBM), Support Vector Regression (SVR), and histogram-based gradient boosting (HGBR) have been successfully applied to estimate parameters such as Chl-a, TSS, and even Total Nitrogen (TN)/Total Phosphorus (TP) in inland waters [28,29,30].
More recently, deep learning models including Multilayer Perceptron (MLP), one-dimensional Convolutional Neural Networks (1D-CNN), and Transformer-based architectures have shown additional potential for handling high-dimensional inputs, temporal sequences, and spatial patterns [31,32,33,34].
However, the performance of such models often degrades when key auxiliary in-situ variables, such as Secchi depth (SSD), which reflects optical clarity and vertical attenuation, are unavailable. These auxiliary variables are typically measured at only a limited number of monitoring stations and thus cannot support continuous spatial mapping across entire lake surfaces. Developing models that maintain high predictive accuracy without relying on in-situ auxiliary features therefore remains a significant challenge for whole-lake nutrient retrieval [35,36,37].
To address this limitation, knowledge distillation (KD) offers a promising solution. In KD, a high-capacity “teacher” model trained on full-feature information (including auxiliary in-situ data) transfers knowledge to a “student” model that uses only readily available features (e.g., remote sensing + meteorology) and can achieve comparable accuracy [38]. KD has been applied in remote-sensing domains for image classification, segmentation, and multi-task learning [39], although its application to inland water-quality retrieval remains scarce.
Recent advances in machine learning have substantially improved the retrieval of optically inactive parameters such as TP from multispectral imagery. Xiong et al. [5] developed machine-learning algorithms for TP in eutrophic Lake Taihu using Landsat-8, demonstrating that random forest and other non-linear models can outperform conventional band-ratio approaches, with typical R² values around 0.70 for TP estimation. Similarly, Cui et al. [40] and Qin et al. [41] used Sentinel-2 imagery combined with tree-based models such as random forest and XGBoost to retrieve TP and other nutrients in large Chinese lakes, reporting R² in the range of 0.65–0.80 and highlighting the importance of integrating spectral indices and hydrometeorological variables.
While these studies confirm the feasibility of satellite-based TP retrieval at 10–30 m spatial resolution, they provide limited insight into nearshore gradients and tributary plumes, and they do not explicitly address how to leverage auxiliary clarity measurements such as SSD within a transferable, SSD-free prediction framework. The present study extends this line of work by exploiting 3 to 5 m PlanetScope imagery and a knowledge-distillation strategy to transfer information from SSD-aware models to operational SSD-free students.
Although some studies have used machine learning or deep learning for TP retrieval in inland waters [42,43,44,45,46], very few have leveraged high spatial–temporal resolution PlanetScope imagery for full-lake PPUT mapping, combined it with a systematic comparison of multiple algorithms under varying feature sets (remote sensing only; remote sensing and meteorology; remote sensing, meteorology, and SSD), and then applied a KD framework to enable SSD-free full-lake prediction. Therefore, this research aims to undertake the following:
Utilize PlanetScope multi-generation imagery (Dove Classic, Dove-R, SuperDove) to achieve high-resolution, high-frequency retrieval of PPUT across the entire lake surface (including near-shore and tributary zones).
Systematically compare multiple machine learning algorithms (HistGBM, CatBoost, RandomForest, ExtraTrees, and GradientBoosting) under 8 distinct feature settings: combinations of remote sensing, INDICES, DOY, meteorological variables and SSD.
Quantify performance degradation when SSD is removed and implement a KD-based framework to transfer knowledge from the full-feature teacher model to a reduced-feature student model, thereby enabling full-lake mapping without SSD.
Apply the optimal student model to generate whole-lake PPUT distribution maps for spring (March–April) and summer (July–August) periods and analyze spatial and seasonal variability of phosphorus in the study lake.
By integrating PlanetScope’s high spatial–temporal resolution, advanced AI modelling, and knowledge distillation strategies, this work seeks to advance the feasibility of fine-scale, lake-wide nutrient retrieval and to support near-real-time eutrophication monitoring and management.
2. Methods
This section describes the general methodological framework developed for high-resolution retrieval of PPUT from PlanetScope imagery, and its application to Lake Simcoe as a representative case study. We first outline the overall framework, including data sources, preprocessing, feature construction, model configurations, and evaluation design. We then detail how this framework is applied to Lake Simcoe—covering the study area characteristics, available monitoring networks, PlanetScope acquisition strategy, and the specific experimental regimes used to train teacher and student models for lake-wide PPUT mapping.
2.1. Problem Formulation
We aimed to retrieve lake-wide PPUT from multi-source predictors. Let $\mathbf{x}$ denote the fused feature vector per sample, composed of the following:
(i) Satellite reflectance-derived features (RS);
(ii) The day of the year (DOY) when the sample was collected;
(iii) Meteorological descriptors (MET);
(iv) PPUT-related remote sensing indices (IDX);
(v) Auxiliary in-situ variables (e.g., SSD).
We compared eight regimes, including two groups and four training stages (Table 1). We then developed a distillation model with the full feature set (RS + IDX + DOY + MET + SSD) as the input to its teacher model and the SSD-free subset (RS + IDX + DOY + MET) as the input to its student model. The goal is to deploy an SSD-free model for full-lake mapping with accuracy comparable to that of the SSD-aware models.
2.2. Data Modalities
The following features were incorporated into the model to capture diverse environmental and biological information relevant to this study:
Remote sensing (RS): Surface reflectance bands (blue, green, red, and NIR) from PlanetScope imagery were used to characterize spatial variability in lake optical properties. From these bands, we derived remote-sensing indices that capture absorption/scattering contrasts relevant to PPUT retrieval, including normalized-difference indices (e.g., NDWI = (Green − NIR)/(Green + NIR); NDVI = (NIR − Red)/(NIR + Red); GNDVI = (NIR − Green)/(NIR + Green)), band-ratio indices (e.g., NIR/Red, Red/Green), and visible-band greenness/algal-enhancement indices (e.g., VARI = (Green − Red)/(Green + Red − Blue); Excess Green, ExG = 2·Green − Red − Blue; ExGR), along with intensity/contrast measures (e.g., sum/mean of RGB, redness = Red/(Blue + Green), greenness = Green/(Red + Blue), and NIR-to-visible contrast).
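The index definitions above translate directly into array arithmetic. The sketch below is a minimal illustration, assuming PlanetScope SR bands have already been scaled to reflectance and loaded as NumPy arrays of equal shape; the dictionary keys and the small `eps` guard are our own choices, not names from the study's code.

```python
import numpy as np


def spectral_indices(blue, green, red, nir, eps=1e-6):
    """Compute a subset of the PlanetScope-derived indices listed in Section 2.2.

    Inputs are surface-reflectance arrays (any shape, identical for all bands).
    `eps` is an assumed guard against division by zero over dark water pixels.
    """
    b, g, r, n = (np.asarray(a, dtype=float) for a in (blue, green, red, nir))
    return {
        # normalized-difference indices
        "NDWI": (g - n) / (g + n + eps),
        "NDVI": (n - r) / (n + r + eps),
        "GNDVI": (n - g) / (n + g + eps),
        # band ratios
        "NIR_Red": n / (r + eps),
        "Red_Green": r / (g + eps),
        # visible-band greenness / algal-enhancement indices
        # (VARI's denominator can approach zero or turn negative over dark water)
        "VARI": (g - r) / (g + r - b + eps),
        "ExG": 2.0 * g - r - b,
        # intensity / contrast terms
        "RGB_mean": (r + g + b) / 3.0,
        "redness": r / (b + g + eps),
        "greenness": g / (r + b + eps),
    }
```

Because every index is element-wise, the same function serves both station match-ups (scalars) and full-lake rasters (2-D arrays).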
Meteorology (MET): Daily near-surface air temperature (average, minimum, and maximum), precipitation, wind (speed and gust), air pressure, snowfall, and sunshine duration, together with short-term 3- and 7-day windows (means, or accumulations for precipitation).
Auxiliary in-situ variable (SSD): SSD serves as an auxiliary variable for the teacher model in KD, reflecting water optical clarity and vertical attenuation. Only the teacher receives SSD; the student is trained to achieve comparable performance without it.
Day of the year (DOY): A feature derived from sampling dates to capture seasonal patterns such as light availability and biological productivity.
Collectively, these predictors describe optical conditions (RS), short-term environmental forcing (MET), seasonal timing (DOY), and an optional in-situ clarity proxy (SSD, available only to the teacher).
2.3. Pre-Processing and Feature Engineering
The pre-processing and feature engineering pipeline encompasses three overarching stages to systematically prepare input data for subsequent modeling:
Step 1: Data Validation and Preparatory Assessment: The 4-band data provided by PlanetScope have already undergone surface reflectance (SR) processing and been normalized to Sentinel-2 bands for consistent radiometry. This eliminates the need for additional SR correction or cross-sensor harmonization and ensures that the foundational spectral data meet radiometric consistency requirements for downstream analysis.
Step 2: Multi-Dimensional Feature Construction: A comprehensive spectral index library is constructed to capture vegetation, water, and other surface properties. This library includes representative indices adaptable across sensors, such as the Normalized Difference Water Index (NDWI), Normalized Difference Vegetation Index (NDVI), Enhanced Vegetation Index 2 (EVI2), Soil-Adjusted Vegetation Index (SAVI), Optimized SAVI (OSAVI), Excess Green (ExG), Excess Green minus Excess Red (ExGR), and Visible Atmospherically Resistant Index (VARI). Additionally, ratios and normalized differences between individual spectral bands (NIR, Red, Green, Blue) are included, along with intensity and contrast terms to enhance texture-related information. To incorporate seasonality, Day-of-Year (DOY) is used as a direct temporal feature, complemented by a cyclic encoding via sine (sin) and cosine (cos) transformations so that the end and the start of the year are treated as adjacent. Meteorological data are integrated using aggregated windows, including same-day observations and 3-day and 7-day moving averages (with precipitation summed over these periods), to account for short-term climatic influences on surface conditions.
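A minimal sketch of the seasonal and short-term meteorological features described in Step 2, assuming a daily, gap-free, date-indexed pandas DataFrame for a single location with hypothetical columns `tavg`, `wspd`, and `prcp` (these happen to match Meteostat's daily field names). The 365.25-day period and the trailing (backward-looking) window alignment are assumptions; the text does not specify them.

```python
import numpy as np
import pandas as pd


def add_seasonal_and_met_windows(met: pd.DataFrame) -> pd.DataFrame:
    """Add DOY, its sine/cosine encoding, and 3-/7-day meteorological windows.

    `met` is assumed to be a daily, gap-free, DatetimeIndex-ed frame for one
    location with hypothetical columns 'tavg', 'wspd', and 'prcp'.
    """
    out = met.sort_index().copy()
    doy = out.index.dayofyear.to_numpy()
    out["DOY"] = doy
    # Cyclic encoding: day 365 and day 1 end up close together in (sin, cos) space.
    out["DOY_sin"] = np.sin(2.0 * np.pi * doy / 365.25)
    out["DOY_cos"] = np.cos(2.0 * np.pi * doy / 365.25)
    for w in (3, 7):
        # Trailing windows that end on the sampling day (alignment is an assumption).
        out[f"tavg_mean_{w}d"] = out["tavg"].rolling(w, min_periods=w).mean()
        out[f"wspd_mean_{w}d"] = out["wspd"].rolling(w, min_periods=w).mean()
        # Precipitation is accumulated rather than averaged.
        out[f"prcp_sum_{w}d"] = out["prcp"].rolling(w, min_periods=w).sum()
    return out
```

Count-based rolling windows assume one row per day; with missing days, a time-based window such as `rolling("3D")` would be the safer choice.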
Step 3: Feature Refinement and Quality Control: Standardization (e.g., z-score normalization) is applied to features intended for linear model components to ensure consistent scaling, while outlier handling (e.g., clipping or Winsorization) is performed as needed to enhance the robustness of the dataset against extreme values. This final stage ensures features are numerically stable and resilient to noise, optimizing their suitability for model training.
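Step 3 leaves the exact clipping rule open. As one possible reading, the sketch below winsorizes each feature at assumed 1st/99th training-set percentiles and then z-scores it; the class name is hypothetical. Fitting the bounds and moments on the training split only, and reusing them unchanged at inference, keeps test stations from influencing the preprocessing.

```python
import numpy as np


class WinsorizeStandardize:
    """Clip each feature to training-set quantiles, then z-score it.

    The default 1st/99th percentile bounds are assumptions, not study values.
    """

    def __init__(self, lower_q=0.01, upper_q=0.99):
        self.lower_q = lower_q
        self.upper_q = upper_q

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.lo_ = np.nanquantile(X, self.lower_q, axis=0)
        self.hi_ = np.nanquantile(X, self.upper_q, axis=0)
        Xc = np.clip(X, self.lo_, self.hi_)
        self.mean_ = np.nanmean(Xc, axis=0)
        self.std_ = np.nanstd(Xc, axis=0)
        # Constant features would divide by zero; leave them centred at 0 instead.
        self.std_[self.std_ == 0] = 1.0
        return self

    def transform(self, X):
        Xc = np.clip(np.asarray(X, dtype=float), self.lo_, self.hi_)
        return (Xc - self.mean_) / self.std_
```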
The entire pre-processing and feature engineering pipeline is designed as an end-to-end workflow that integrates data validation, multi-dimensional feature extraction, and quality-driven refinement. By building on preprocessed spectral data, constructing a diverse set of spectral, temporal, and meteorological features, and applying consistent normalization and outlier mitigation, the pipeline yields an input dataset that is both comprehensive and robust. This systematic approach is intended to strengthen the predictive power of subsequent models and their transferability to other lakes and environmental conditions, supporting more accurate and reliable lake water-quality monitoring.
2.4. Model Zoo and Training Regimes
This paper benchmarked a representative set of tree-based ensemble regressors to evaluate predictive performance under the eight input regimes defined in Section 2.1 and to quantify the incremental value of MET and SSD. The models assessed include HistGradientBoosting (HistGBM), CatBoost, RandomForest, ExtraTrees, and GradientBoosting, covering both boosting- and bagging-based paradigms commonly used for nonlinear regression with heterogeneous predictors. To ensure a fair comparison, all models were trained using the same data split and preprocessing and evaluated using the same metrics (R², RMSE, and MAE as primary metrics; MAPE reported only in the stratified analyses). Hyperparameters of the five models were tuned with Optuna to maximize validation R²: for each feature group and model, a Bayesian optimization search was conducted over the model's hyperparameter space (e.g., learning rate, tree depth, subsampling ratio) using station-grouped 5-fold cross-validation within the training set (Section 2.6).
Our emphasis is on performance sensitivity to input availability (e.g., with vs. without SSD/MET), rather than on detailed algorithmic exposition or model-specific optimizations. Additional model descriptions, equations, and implementation settings are provided in Appendix A. In addition to serving as benchmarks, the ensemble learners described above also provide candidates for the backbone model used in the subsequent knowledge distillation (KD) framework (Section 2.5). Rather than fixing a priori which algorithm to distill, we first compare the five models across the eight feature-group regimes and then adopt the ensemble with the best overall performance as the common backbone for both the teacher and the student. This design ensures that KD is built on the strongest available base learner, while keeping the architectural choice data-driven and empirically justified. Details of the backbone selection are reported in Section 4.3.
2.5. Knowledge Distillation
Building on the model zoo described in Section 2.4, we implement the teacher–student knowledge-distillation (KD) framework using a two-stage regression architecture. In the first stage, a tree-based ensemble regressor is used as the backbone model; specifically, we choose the ensemble learner that exhibits the best overall predictive performance in the benchmark comparison (Section 4.3). This backbone captures nonlinear interactions among spectral bands, spectral indices, meteorological descriptors, and (for the teacher only) SSD.
In the second stage, we append a linear calibration head to the backbone predictions. Concretely, we concatenate the backbone output with a small subset of physically interpretable predictors (e.g., selected spectral indices and short-term meteorological aggregates) and fit a Ridge regression (RidgeCV) model with ℓ2 regularization. This two-stage design decouples high-capacity nonlinear feature learning from final prediction calibration, improving numerical stability, reducing variance at the distribution tails, and providing an interpretable linear combination of a few key drivers. The teacher and the student share the same two-stage architecture; they differ only in their input feature sets and in the loss function used to train the student, as described below. The soft target is constructed as:

$$\tilde{y}_i = (1-\alpha)\,y_i + \alpha\,\hat{y}_i^{T}$$

where $\alpha \in [0, 1]$ balances the contribution of the ground-truth labels $y_i$ and the teacher predictions $\hat{y}_i^{T}$. For feature selection, permutation-based importance scores are computed for all candidate features; student models are then trained using the top-K features (with K swept) to enhance compactness and generalization. To optimize α, we perform a grid search over α ∈ {0.0, 0.1, …, 0.9} and validate via station-grouped 5-fold cross-validation to prevent spatial leakage.
2.6. Validation and Metrics
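Model performance was primarily evaluated using three standard regression metrics:
(1) Coefficient of determination (R²), measuring the proportion of variance in the observations explained by the model:
$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$
(2) Root Mean Square Error (RMSE), quantifying the average magnitude of prediction error:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$
(3) Mean Absolute Error (MAE), measuring the average absolute deviation:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert$$
In addition, the mean absolute percentage error (MAPE) was reported only in the stratified analyses to facilitate relative error comparisons across segments with different concentration ranges:
$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\frac{\lvert y_i - \hat{y}_i\rvert}{\lvert y_i\rvert + \varepsilon}$$
where $y_i$ and $\hat{y}_i$ are the observed and predicted PPUT of sample $i$, $n$ is the number of samples, $\bar{y}$ is the mean of $y_i$, and $\varepsilon$ is a small constant added to avoid instability when $y_i$ is close to zero.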
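The primary metrics and MAPE can be computed in a few lines of NumPy. The sketch below uses the common definitions of R², RMSE, and MAE and, for MAPE, adds a small constant `eps` to |y| in the denominator; the default value of `eps` is an assumption.

```python
import numpy as np


def regression_metrics(y_true, y_pred, eps=1e-6):
    """R², RMSE, MAE, and MAPE (in percent) for observed vs. predicted PPUT."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(y_pred, dtype=float)
    resid = y - p
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return {
        "R2": float(1.0 - ss_res / ss_tot),
        "RMSE": float(np.sqrt(np.mean(resid ** 2))),
        "MAE": float(np.mean(np.abs(resid))),
        # eps keeps near-zero observations from dominating the percentage error
        "MAPE": float(100.0 * np.mean(np.abs(resid) / (np.abs(y) + eps))),
    }
```

For example, observations [10, 20, 30] against predictions [12, 18, 30] give R² = 0.96, MAE = 4/3, and a MAPE of about 10%.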
Our complete modelling dataset consists of 544 water-quality samples from 11 long-term monitoring stations across Lake Simcoe. Each sample pairs an in-situ PPUT measurement with a temporally matched PlanetScope scene and the corresponding meteorological descriptors. To prevent spatial leakage and to evaluate the models on genuinely unseen locations, we adopted a station-based grouped splitting strategy implemented via scikit-learn’s GroupShuffleSplit. The grouping variable was the station identifier; all samples from a given station were assigned exclusively to either the training or the test set. Using a nominal 80/20 split (test_size = 0.2, random_state = 42), this procedure produced a training set of 399 samples (73.3%) from 8 stations and a test set of 145 samples (26.7%) from 3 distinct stations, with no station appearing in both sets. The slight deviation from a perfect 80/20 ratio reflects the constraint that splits are performed at the station level rather than at the individual-sample level.
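The station-level split can be reproduced with scikit-learn's `GroupShuffleSplit`, as sketched below; `station_ids` is a hypothetical array with one station label per sample. Because `test_size` is applied to the number of groups, 11 stations with `test_size=0.2` yield ⌈0.2 × 11⌉ = 3 test stations and 8 training stations, matching the split reported above, while the sample counts on each side depend on how many samples each station contributes.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit


def station_holdout_split(station_ids, test_size=0.2, random_state=42):
    """Return train/test row indices with every station on exactly one side."""
    station_ids = np.asarray(station_ids)
    gss = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=random_state)
    # The splitter only needs the number of rows from X, so a placeholder suffices.
    placeholder_X = np.zeros((len(station_ids), 1))
    train_idx, test_idx = next(gss.split(placeholder_X, groups=station_ids))
    return train_idx, test_idx
```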
Within the training set, we further employed station-grouped 5-fold cross-validation (GroupKFold, n_splits = 5) for all hyperparameter tuning and feature-selection experiments, including (i) the K-sweep over the number of retained features (K = 10, 15, …, 60) and (ii) the grid search over the distillation weight α ∈ {0.0, 0.1, …, 0.9}. This nested design separates (i) the outer train–test split, which measures generalization to new spatial locations (unseen stations), from (ii) the inner station-grouped CV, which selects model hyperparameters without leaking information across stations.
Such a station-grouped strategy is critical in inland water-quality modelling, because repeated samples from the same station share location-specific optical and hydrological conditions and are therefore strongly autocorrelated; random sample-wise splits would artificially inflate performance estimates and fail to reflect the true difficulty of transferring models to new monitoring locations.
2.7. Lake-Wide Prediction
Based on PlanetScope surface reflectance (SR) inputs, the lake-wide PPUT retrieval workflow followed a systematic, reproducible sequence that mirrors the data-processing logic implemented in our training pipeline, including mosaicking, masking, and model inference.
First, for each target prediction date, cloud-filtered SR mosaics were generated using the matched UDM2 quality mask. This step removed pixels flagged as clouds, cloud shadows, haze, or bright artifacts, following the same masking approach applied during model training. The individual tiles of a given date were then seamlessly merged into a spatial mosaic to ensure consistent coverage across the entire Lake Simcoe region.
Second, lake boundary polygons were obtained from the Ontario Ministry of Natural Resources and Forestry's Land Information Ontario (LIO) geospatial data portal [47]. The “Waterbody” dataset provides authoritative GIS lake outlines for Ontario and was used as the reference geometry for delineating Lake Simcoe. These polygons served multiple purposes: (i) clipping and isolating the lake surface from the PlanetScope mosaics, (ii) defining valid prediction regions, and (iii) serving as the base layer for cartographic visualization.
Third, the SSD-free student model, trained with KD, was applied to infer PPUT values for every valid water pixel within the lake mask. This inference step follows the same preprocessing pipeline as the training code, including feature extraction (all spectral indices, DOY encodings, meteorological features, if applicable) and the standardized input structure used during student training.
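Pixel-wise inference can be sketched as below, assuming the SR mosaic, the UDM2 raster, and a rasterized lake polygon are already co-registered NumPy arrays (reading them, e.g. with rasterio, is omitted). `feature_fn` is a hypothetical stand-in for the training-time feature pipeline, and the band order is assumed. Band 1 of UDM2 is the clear-pixel flag, so only pixels with value 1 that also fall inside the lake mask are predicted; all others are left as NaN.

```python
import numpy as np


def predict_lake_map(sr, udm2, lake_mask, model, feature_fn):
    """Pixel-wise PPUT inference over clear lake pixels.

    sr:         (4, H, W) surface reflectance; band order blue, green, red, NIR (assumed)
    udm2:       (8, H, W) UDM2 array; band 1 (array index 0) is the clear-pixel flag
    lake_mask:  (H, W) rasterized lake polygon (e.g., from the LIO Waterbody layer)
    feature_fn: hypothetical callable mapping (n_pixels, 4) reflectance to the
                student's feature matrix with the same transforms used in training
    """
    valid = (udm2[0] == 1) & lake_mask.astype(bool) & np.all(np.isfinite(sr), axis=0)
    out = np.full(lake_mask.shape, np.nan, dtype=float)
    if valid.any():
        pixels = sr[:, valid].T  # (n_valid, 4), one row per clear water pixel
        out[valid] = model.predict(feature_fn(pixels))
    return out
```

Predicting only on the flattened valid pixels avoids running the model over land and cloud, which matters at 3–5 m resolution where a full-lake mosaic contains tens of millions of pixels.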
Fourth, a single representative mid-summer PlanetScope scene (26 July 2024) was selected to characterize low external loading and thermally stratified conditions. Pixel-level predictions from this scene were used to generate a high-resolution lake-wide PPUT map for spatial pattern analysis.
Finally, the resulting maps were analyzed to extract key spatial patterns. This included characterizing contrasts between nearshore and offshore zones, evaluating the spatial extent and strength of tributary-driven phosphorus plumes (e.g., the Holland River and Black River), and comparing seasonal changes in the distribution of lake-wide PPUT.
These analyses provide contextual understanding of how hydrologic conditions and nutrient delivery pathways shape the spatio-temporal structure of phosphorus concentrations in Lake Simcoe.
Overall, the methodological framework presented in this section establishes a general, dataset-agnostic pipeline for retrieving total phosphorus (PPUT) from multi-source predictors. The workflow integrates standardized preprocessing of multispectral surface reflectance, construction of an extensive feature library, including spectral indices, meteorological descriptors, and seasonal variables, and the implementation of a comprehensive model comparison across both machine-learning and deep-learning families.
The framework further incorporates a knowledge-distillation strategy to enable high-accuracy nutrient prediction even when auxiliary in-situ variables (e.g., SSD) are unavailable across the full spatial domain. Through unified feature engineering, grouped cross-validation, α-grid optimization, and Top-K feature selection, the proposed approach ensures robust generalization and consistent deployment across varied hydrological or optical environments.
3. Case Study: Lake Simcoe, Ontario, Canada
The general methodological framework described in Section 2 provides a flexible, sensor-independent pipeline for PPUT retrieval that leverages multi-source predictors, feature engineering, multi-model benchmarking, and knowledge distillation-enhanced learning.
To demonstrate the practical applicability and effectiveness of this framework in a real-world inland lake environment, we conducted a case study on Lake Simcoe, Ontario, Canada. This section details the study area characteristics, the construction of the multi-source dataset (PlanetScope SR, in-situ water quality, meteorology), the implementation of the teacher-student training paradigm, and the deployment of the final SSD-free student model for lake-wide PPUT mapping under different hydrological seasons.
3.1. Study Area
As Figure 1 shows, Lake Simcoe (44.3–44.6° N, 79.2–79.6° W) is the largest inland lake in southern Ontario outside the Great Lakes system, covering ~722 km² with a mean depth of ~15 m and a maximum depth of 41 m. The watershed includes more than 30 tributaries, the most important of which are the Holland River, Black River, Beaver River, and Talbot River. The lake drains northward through Lake Couchiching and the Severn River, part of the Trent-Severn Waterway, into Georgian Bay (Lake Huron). Lake Simcoe supports a rapidly growing watershed population and provides multiple ecosystem services and socio-economic benefits. The lake supplies drinking water to several municipalities, assimilates municipal and agricultural wastewater, and supports a provincially significant year-round sport fishery, boating, and recreational tourism industry [48]. The watershed includes a mosaic of urban, agricultural, and forested land uses and is home to diverse communities, including rapidly expanding suburban populations and Indigenous communities. Recent demographic growth and land-use intensification have increased pressure on nutrient management and shoreline ecosystems. In this context, high-resolution, spatially explicit PPUT mapping can directly inform source control, shoreline restoration, and adaptive management actions to protect both ecosystem health and the well-being of local beneficiaries.
At the same time, Lake Simcoe has experienced sustained phosphorus loading and eutrophication pressures over recent decades, prompting intensive monitoring and the implementation of the Lake Simcoe Protection Plan. The combination of long-term in-situ water-quality and SSD datasets, diverse watershed land uses, strong management relevance, and ongoing policy initiatives makes Lake Simcoe an ideal testbed for evaluating high-resolution, AI-based phosphorus retrieval frameworks and for demonstrating how such methods can support evidence-based lake and watershed management.
3.2. Data Sources
The implementation of the proposed hybrid AI framework for Lake Simcoe required integrating three primary data sources: PlanetScope multi-generation satellite imagery, long-term in-situ water-quality observations, and meteorological data from the nearest weather stations. These data layers were assembled into a harmonized, case-specific dataset that captures the optical, hydrological, and atmospheric conditions of Lake Simcoe.
Figure 1.
Lake Simcoe study area showing the LSRCA in-situ water-quality sampling stations (blue dots) and Environment Canada meteorological stations (red dots) used to construct daily and short-term (3–7 day) aggregated meteorological predictors for each satellite–in-situ match-up.
PlanetScope imagery (Dove Classic, Dove-R, and SuperDove) covering the period 2018–2023 was obtained for all acquisition dates intersecting the lake. Surface Reflectance (SR) products were used to ensure radiometric consistency, and cloud-covered areas were removed using the accompanying UDM2 quality mask. For dates with multiple overlapping scenes, mosaics were produced to achieve complete spatial coverage. Only acquisitions with less than 10% cloud cover were retained for lake-wide prediction and for constructing training samples. Each in-situ measurement was temporally matched to the nearest PlanetScope acquisition within a ±1-day window to minimize the temporal mismatch between satellite and in-situ observations.
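The ±1-day match-up can be expressed with `pandas.merge_asof`, as in the sketch below. Column names (`date`, `station`) are hypothetical, and it assumes scene values have already been extracted at each station; `direction="nearest"` with a one-day tolerance keeps the closest acquisition on either side of the sampling date and drops samples with no acquisition in that window.

```python
import pandas as pd


def match_insitu_to_scenes(insitu: pd.DataFrame, scenes: pd.DataFrame,
                           tolerance: str = "1D") -> pd.DataFrame:
    """Pair each in-situ sample with the nearest-in-time scene at the same station.

    Both frames are assumed to have a datetime64 'date' column and a 'station'
    column (hypothetical names); other columns in `scenes` (e.g., extracted SR
    values) are carried over. Samples without a scene within ±tolerance are dropped.
    """
    left = insitu.sort_values("date")
    # Keep a copy of the acquisition date, since merge_asof keeps only the left key.
    right = scenes.sort_values("date").assign(scene_date=lambda d: d["date"])
    matched = pd.merge_asof(
        left, right, on="date", by="station",
        direction="nearest", tolerance=pd.Timedelta(tolerance),
    )
    return matched.dropna(subset=["scene_date"]).reset_index(drop=True)
```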
In-situ water-quality data were provided by the Lake Simcoe Region Conservation Authority (LSRCA) [49], covering 15 monitoring stations distributed across embayments, nearshore zones, and offshore basins. The target variable used for model development was PPUT, while SSD served as an auxiliary variable incorporated exclusively into the teacher model. Records underwent quality control to remove invalid or duplicate samples and were spatially harmonized using the official station coordinates.
Meteorological variables were collected using the Meteostat API [50] and linked to each in-situ record by selecting the five nearest available stations. The meteorological fields included daily air temperature (tavg, tmin, tmax), precipitation, wind speed and gust, atmospheric pressure, snowfall, and sunshine duration. To capture short-term hydrometeorological variability associated with watershed loading and water-column mixing, 3-day and 7-day aggregated metrics were derived. These variables were subsequently merged with SR-based features and in-situ observations to create the final multi-modal training dataset.
3.3. Case-Specific Data Preparation
Data preparation for the Lake Simcoe case followed the general procedures outlined in Section 2, adapted to the lake's specific optical and hydrological characteristics. All PlanetScope SR scenes were clipped using the Lake Simcoe boundary polygon obtained from the Land Information Ontario (LIO) Waterbody dataset. This ensured that subsequent modeling and mapping were restricted strictly to lake pixels, thereby preventing land reflectance contamination in littoral zones, which are abundant along the lake's complex shoreline.
Temporal alignment between PlanetScope SR, meteorological observations, and in-situ sampling was a central requirement. Because Lake Simcoe undergoes rapid changes during spring melt, strict temporal-matching criteria (±1 day) were applied to minimize discrepancies caused by rapidly changing optical and hydrologic conditions. Spatially, samples located near river mouths—especially those of the Holland River and Black River—were evaluated to ensure that their surrounding SR values were not affected by adjacency effects or mixed land-water pixels. Where necessary, edge pixels were removed during preprocessing to maintain data integrity.
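The text does not state how edge pixels were removed. One simple option, sketched below under that assumption, is morphological erosion of the boolean lake mask with SciPy so that water pixels within a few pixels of land are discarded; the function name and the default two-pixel buffer are illustrative only.

```python
import numpy as np
from scipy import ndimage


def erode_shoreline(lake_mask, n_pixels=2):
    """Drop water pixels within `n_pixels` (8-connected) of land.

    Pixels outside the array are treated as land (border_value=0), so a lake
    cut off by the image edge also loses its edge ring.
    """
    mask = np.asarray(lake_mask, dtype=bool)
    return ndimage.binary_erosion(mask, structure=np.ones((3, 3), dtype=bool),
                                  iterations=n_pixels, border_value=0)
```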
The final prepared dataset retained only high-quality, temporally synchronized, and spatially validated pixel-station pairs, serving as the foundation for model-training and distillation stages.
3.4. Application of the Modeling Framework to Lake Simcoe
The methodological framework described in Section 2 was operationalized for Lake Simcoe by training machine-learning and deep-learning models using the prepared multi-modal dataset (Figure 2). While the underlying algorithms, feature categories, and evaluation metrics follow the general design, their application in this case study reflects the unique spectral and hydrological characteristics of Lake Simcoe.
A comprehensive multi-model comparison was conducted using remote sensing only, remote sensing combined with meteorology, and the full RS, INDICES, DOY, MET, and SSD feature set. As expected, the inclusion of SSD substantially improved model accuracy, highlighting the importance of optical clarity indicators for phosphorus prediction. Because SSD measurements are not spatially continuous across the lake, a KD scheme was employed to retain this benefit in lake-wide prediction. A high-capacity teacher model using RS + INDICES + DOY + MET + SSD inputs was first trained, and its soft predictions were then used to guide the SSD-free student model. This approach enabled the student model to inherit domain knowledge about water clarity even when operating without SSD.
To adapt the framework to Lake Simcoe, feature importance rankings were generated specifically for this dataset. A Top-K feature selection analysis showed that a compact set of forty features gave the best balance between model complexity and predictive performance. With these Lake Simcoe-specific configurations, the final student model achieved an R² of 0.83 on held-out stations, which supports the robustness of the KD-enhanced workflow for practical deployment.
3.5. Lake-Wide PPUT Mapping for Lake Simcoe
Following model development, the SSD-free student model was applied to full-lake PlanetScope SR mosaics to produce spatially continuous PPUT maps at 3–5 m resolution. All mosaics were screened with UDM2 masks and clipped to the Lake Simcoe polygon. Only pixels flagged as clear in the UDM2 clear band (band 1 = 1) were retained. This criterion implicitly excludes pixels classified as cloud, cloud shadow, haze, or snow/ice in the other UDM2 classes. For each valid water pixel, the complete set of required input features was reconstructed with the same transformations used during model training, which ensures methodological consistency. These features include spectral indices, seasonal attributes, and any available meteorological descriptors. A single representative mid-summer PlanetScope scene (26 July 2024) was selected to characterize conditions of low external loading and thermal stratification.
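The pixel screening described above can be sketched with NumPy. The sketch makes two assumptions. First, the UDM2 layers have already been read into an array (for example, with a raster library) with the clear band first. Second, the Lake Simcoe polygon has been rasterized to a boolean water mask on the same grid. The helper names are hypothetical.

```python
import numpy as np


def clear_water_mask(udm2, water_mask):
    """Boolean mask of pixels retained for PPUT prediction (hypothetical helper).

    udm2: integer array (bands, rows, cols) holding the UDM2 layers, with the
          'clear' band first (band 1, index 0); a value of 1 means clear.
    water_mask: boolean array (rows, cols), True inside the lake polygon.
    """
    clear = udm2[0] == 1  # cloud, shadow, haze and snow/ice pixels are not flagged clear
    return clear & water_mask


def apply_mask(reflectance, keep):
    """Return a float copy of (bands, rows, cols) reflectance with dropped pixels set to NaN."""
    out = reflectance.astype(float)  # astype copies, so the input is left unchanged
    out[:, ~keep] = np.nan
    return out
```

Setting dropped pixels to NaN rather than zero keeps them out of later statistics computed with NaN-aware functions.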
In summary, this case study applied the proposed workflow to Lake Simcoe by constructing the multi-source dataset, training the teacher–student KD models, and generating lake-wide PPUT predictions from PlanetScope mosaics for contrasting seasonal conditions. Section 4 presents the quantitative evaluation across feature regimes and the resulting spatial patterns in the lake-wide maps.
4. Results
4.1. Influence of Feature Groups and SSD Availability on Model Performance
The multi-model evaluation across eight feature-group regimes was designed to examine systematically how different categories of predictors influence the stability, accuracy, and generalizability of PPUT retrieval. Phosphorus concentrations are governed by a mixture of hydrological, biogeochemical, and optical processes. The effectiveness of machine-learning models therefore depends not only on algorithmic design but also on the observability of physically meaningful variables.
SSD represents a vertically integrated clarity metric that encodes water-column light attenuation and particulate loads, both of which are tightly linked to phosphorus dynamics in inland lakes.
Table 2 compares model-prediction error statistics (R², RMSE, MAE) across all models and feature groups. It reveals a clear structural separation between SSD-aware and SSD-free regimes.
In raw numbers, mean R² increased from 0.6741 to 0.9364 when SSD was added. The broader implication is that SSD introduces a form of physical regularization into the prediction problem. SSD encapsulates information about water-column scattering and absorption, processes strongly influenced by suspended sediments, algal biomass, and particulate phosphorus. Including it therefore constrains the model and effectively reduces its solution space. For completeness, we also evaluated SSD-only models as a station-scale reference. With SSD as the sole predictor, these models achieved a test R² of approximately 0.80–0.85 (best ≈ 0.846), with RMSE ≈ 13.7–15.8 and MAE ≈ 7.44–7.81. SSD alone provides a strong predictive signal for PPUT at monitoring stations. It cannot be used for lake-wide mapping, however, because SSD is not available wall-to-wall. We therefore report SSD-only results solely as an information-content reference, not as a competing lake-wide baseline.
SSD-aware performance was highly consistent across five structurally distinct models: bagging ensembles (RandomForest, ExtraTrees), gradient boosting (GradientBoosting, HistGradientBoosting), and CatBoost. This consistency indicates that the improvement does not depend on the algorithm but arises from the biophysical relevance of SSD itself. In SSD-free scenarios, accuracy declined sharply: mean R² was about 28% lower than in SSD-aware scenarios, which is equivalent to a gain of about 39% when SSD is added. This decline further highlights how difficult it is to infer vertical water clarity from surface-only features.
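The relative size of the SSD effect can be computed from the mean R² values for the two regimes (0.9364 with SSD, 0.6741 without). Against the SSD-aware baseline, it is a drop of about 28%. Against the SSD-free baseline, it is a gain of about 39%.

```latex
\frac{0.9364 - 0.6741}{0.9364} \approx 0.280 \quad (\text{relative drop without SSD}),
\qquad
\frac{0.9364 - 0.6741}{0.6741} \approx 0.389 \quad (\text{relative gain with SSD}).
```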
The boxplots in Figure 3 deepen this interpretation. They show that SSD influences not only mean model accuracy but also the distribution of prediction errors. The substantial reductions in RMSE (from 0.49 to 0.22 µg/L) and MAE (from 0.33 to 0.16 µg/L) suggest that SSD reduces both bias and variance in the predictive outputs. This stabilizing effect is consistent with SSD’s ecological role. Phosphorus concentrations co-vary strongly with suspended particulate loads and algal biomass, so SSD acts as a proxy for the processes that modulate phosphorus availability.
Even minimal SSD-aware feature sets (e.g., SSD + Bands) outperform feature-rich SSD-free models. This indicates that, within the feature sets tested here, no combination of spectral indices or meteorological variables can fully substitute for the depth-integrated information that SSD provides. SSD constrains predictions across heterogeneous optical conditions, such as nearshore versus offshore waters. It thereby prevents error inflation in areas where surface reflectance alone is ambiguous.
The R² heatmap (Figure 4) shows that SSD’s influence is robust across model architectures with differing inductive biases. All models achieved R² ≥ 0.93 under SSD-aware regimes, which indicates that SSD simplifies the prediction task to the point where algorithmic differences become nearly irrelevant. This convergence suggests that SSD effectively linearizes or simplifies the multidimensional mapping between observable features and PPUT. The problem thus becomes well-defined even for models that otherwise underperform in high-variance conditions (e.g., RandomForest).
In SSD-free settings, the wide performance spread (R² of 0.55–0.75) reflects the inherent ambiguity of estimating phosphorus without clarity information. Here, models must rely on indirect proxies, such as reflectance-based indices or meteorological drivers, whose relationships to PPUT vary seasonally and spatially. The higher sensitivity to model architecture in this regime shows how much structural information SSD supplies that the models would otherwise have to infer.
The analyses in Table 2, Figure 3, and Figure 4 indicate that SSD is the single most influential variable for PPUT prediction. This is not merely because it improves accuracy but because it encodes the fundamental optical and particulate processes that shape phosphorus dynamics. SSD turns PPUT retrieval from an underdetermined surface-reflectance inversion into a physically grounded, well-constrained prediction problem. Two results support this: the strong cross-model consistency under SSD-aware regimes and the pronounced instability of SSD-free models. Together they indicate that high-fidelity phosphorus estimation in inland lakes requires depth-integrated clarity information, either measured directly or approximated through techniques such as KD. These insights motivate the hybrid AI framework developed in this study and lay the foundation for operational phosphorus monitoring at high spatial and temporal resolution.
4.2. Performance Gains from Feature Accumulation
To evaluate the contribution of additional feature classes, we examined prediction accuracy across eight progressively more complex feature groups. For SSD-aware groups, adding Day-of-Year (DOY), remote sensing indices, and meteorological features produced incremental improvements. Moving from SSD + Bands to SSD + Bands + DOY increased R² by approximately 0.01. Further inclusion of spectral indices and meteorological variables yielded diminishing yet consistent gains. This pattern suggests that SSD already encapsulates a large proportion of the optically relevant environmental variability, while auxiliary features provide refinement rather than fundamental explanatory power.
SSD-free models benefited more clearly from additional features. Raw spectral bands alone gave modest performance (mean R² ≈ 0.59). Adding temporal descriptors (Bands + DOY) raised R² to 0.70, and adding spectral indices (Bands + DOY + Indices) raised it further to 0.75 (Figure 5). Meteorological variables yielded only marginal improvements (<0.03 in R²). These patterns suggest that derived indices capture water-color changes driven by suspended solids, CDOM, and biomass more effectively than raw bands, and these constituents act as partial proxies for phosphorus dynamics.
4.3. Cross-Model Comparison and Algorithmic Behavior
As shown in Table 2, the comparative behaviour of the five single models (CatBoost, HistGradientBoosting, GradientBoosting, ExtraTrees, and RandomForest) revealed systematic and interpretable differences in predictive skill across all feature-group regimes. These differences align with the models’ algorithmic properties and the underlying optical complexity of Lake Simcoe.
In the absence of SSD, model performance diverged substantially. CatBoost, HistGradientBoosting, and GradientBoosting consistently outperformed ExtraTrees and RandomForest across the Bands, Bands + DOY, and Bands + DOY + Indices regimes, with R² higher by approximately 0.05–0.10 and RMSE lower by 0.03–0.06.
This advantage reflects the ability of boosting-based methods to adaptively refine decision boundaries and capture second-order interactions among spectral indices, meteorological descriptors, and temporal variables—interactions that are important when the signal-to-noise ratio of surface reflectance is moderate and auxiliary depth-based clarity information is absent.
RandomForest, by comparison, had the lowest accuracy of the five models in all SSD-free settings. This behaviour is consistent with the known limitations of bagging ensembles in high-dimensional regression, where subtle gradients rather than coarse separations drive predictive performance. RandomForest averages over many independently grown, decorrelated trees and has limited capacity to model fine-grained feature interactions. Together, these properties likely led to underfitting in regimes that rely heavily on spectral proxies for water clarity, phytoplankton concentration, and suspended solids.
The performance gap between model families shrank markedly once SSD was introduced. In all SSD-aware regimes (SSD + Bands, SSD + Bands + DOY, SSD + Bands + DOY + Indices, and SSD Full), differences in R² across the five models fell to within 0.01–0.02, and all models achieved R² > 0.93. This convergence reflects SSD’s dominant explanatory value as a vertically integrated optical clarity metric.
The inclusion of SSD appears to linearize the remaining feature–response relationships, reduce the dependence on higher-order interactions, and stabilize predictions across models with different inductive biases. Under these conditions, even RandomForest achieved R² values comparable to CatBoost and HistGradientBoosting, which shows that SSD acts as a strong stabilizing variable that diminishes algorithmic differences.
HistGradientBoosting and GradientBoosting were the most consistently stable high performers across the full range of feature groups. HistGradientBoosting offered near-optimal accuracy in both SSD-free and SSD-aware settings while keeping computational cost relatively low. CatBoost showed similar strength, especially in SSD-aware regimes, possibly benefiting from its native handling of categorical interactions and ordered boosting. ExtraTrees showed slightly higher variance across feature groups. This effect is likely attributable to its randomized split-selection strategy, which introduces additional noise when feature importance is highly uneven.
The single-model comparison across the five ensemble learners (Figure 3, Table 2) shows that HistGradientBoosting (HistGBM) offers the best overall trade-off between accuracy and robustness. Under the SSD-inclusive feature regime, HistGBM attains the highest test R² and the lowest error among all candidates. In SSD-free regimes, it remains among the top performers and varies relatively little across feature groups.
RandomForest and ExtraTrees tend to plateau at slightly lower R². CatBoost offers at most marginal gains in some regimes, at substantially higher computational cost. Based on this consistent behaviour, we selected HistGBM as the backbone regressor in the teacher–student KD framework described in Section 2.5. In all KD experiments, the teacher and the student therefore share the same HistGBM-based backbone and RidgeCV calibration head. They differ only in their input feature sets (with vs. without SSD) and their training objectives.
4.4. Feature Importance and Physical Interpretation
Table 3 lists the influential predictors together with their hypothesized physical relevance to PPUT dynamics in Lake Simcoe. These predictors are short-term meteorological variables (e.g., tmax_3d, pres_3d, wspd_7d), NIR reflectance (band4_mean), and water-color indices (e.g., ExG, ExGR, SR_NR, ND_NB).
A category-level aggregation (Figure 6) further showed that meteorological variables contributed the largest cumulative ΔR² (0.081). They were followed by remote-sensing indices (0.039), spectral bands (0.007), and temporal descriptors (0.000). This hierarchy is more than a numerical ranking; it appears to reflect the underlying physical mechanisms governing phosphorus dynamics in Lake Simcoe.
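A category-level ΔR² of this kind can be approximated by summing per-feature permutation importances within each category, as sketched below. This is an assumption about the aggregation: the study's exact procedure may differ, for example by permuting whole categories jointly. The helper name and the category mapping are hypothetical.

```python
import numpy as np
from sklearn.inspection import permutation_importance


def category_delta_r2(model, X_test, y_test, feature_names, category_of,
                      n_repeats=10, seed=0):
    """Sum per-feature permutation drops in R² into feature categories (hypothetical helper).

    category_of maps each feature name (e.g. 'tmax_3d') to a category label
    (e.g. 'MET'). Each importance is the mean decrease in R² when that single
    column is shuffled.
    """
    result = permutation_importance(model, X_test, y_test, scoring="r2",
                                    n_repeats=n_repeats, random_state=seed)
    totals = {}
    for name, delta in zip(feature_names, result.importances_mean):
        cat = category_of[name]
        totals[cat] = totals.get(cat, 0.0) + float(delta)
    return totals
```

Summing per-feature drops tends to understate categories whose members are strongly correlated, because permuting one member leaves its correlated partners intact. Permuting all columns of a category together is a common alternative.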
Meteorological drivers, particularly wind, temperature, and precipitation, directly regulate hydrodynamic mixing, sediment resuspension, tributary inflows, and catchment runoff. All of these act as primary pathways for phosphorus mobilization and redistribution. Short-term meteorological variability therefore introduces real biogeochemical forcing that cannot be inferred from surface reflectance alone. This plausibly explains why this category produces the largest marginal gain in model performance.
Remote-sensing indices formed the second most influential category because they capture higher-order spectral relationships—such as the balance between scattering and absorption—that are sensitive to turbidity, algal biomass, and CDOM. These processes, while closely linked to phosphorus availability, often manifest in the reflectance domain through nonlinear transformations rather than through individual raw bands. Thus, indices such as ExG, ND-based ratios, and greenness amplify subtle optical cues associated with P-related ecological states, enabling the model to resolve fine-scale gradients even when depth-dependent clarity information is absent.
In contrast, raw spectral bands provided only a modest ΔR² (0.007). This is consistent with the fact that unprocessed reflectance signals blend multiple optical constituents. They are also more susceptible to noise, illumination variability, and confounding factors such as glint or adjacency effects. Temporal descriptors contributed effectively zero incremental explanatory power, likely because seasonality is already implicitly encoded in the spectral and meteorological data. Once these categories are included, DOY adds little new information.
Together, these findings clarify why the model derives the greatest benefit from meteorological and index-based predictors: one category captures the physical drivers of phosphorus transport, while the other captures the optical manifestations of its ecological consequences. The synergy between these two categories underpins the hybrid AI framework’s strong performance.
4.5. KD and Optimal Feature Subset
The K-sweep analysis (Figure 7) reveals a typical diminishing-returns profile for the SSD-free student model. Accuracy rises steeply as the number of retained features grows from K = 10 to about K = 20. It then enters a broad performance plateau between K ≈ 32 and K = 40. Within this plateau, individual K values can oscillate because of sampling variability in the station-grouped cross-validation.
For example, K = 32 and K = 40 achieve similar validation accuracy (K = 32: R² = 0.8306, RMSE = 9.853 µg/L; K = 40: R² = 0.8318, RMSE = 9.818 µg/L). K = 33, by contrast, shows a local dip (R² = 0.73, RMSE = 12.49 µg/L) associated with a less favourable feature subset. The global optimum occurs at K = 40, which attains the highest cross-validated R² (0.8318) and the lowest RMSE (9.82 µg/L) of all tested K values. For K > 40, R² systematically decreases and RMSE increases (e.g., K = 50 yields R² ≈ 0.71–0.75 and RMSE ≈ 12.9–17.6 µg/L). This deterioration is consistent with overfitting and redundancy when too many features are retained.
We therefore selected K = 40 as the final configuration. It lies at the upper end of the stable plateau and provides a sparse yet expressive 40-feature subset for the SSD-free student model.
Figure 8 supports the feasibility of deploying SSD-free models for operational, lake-wide nutrient monitoring with reasonable accuracy. For the K = 40 feature subset, we ran a grid search over the distillation weight α ∈ {0.0, 0.1, …, 0.9} using station-grouped 5-fold cross-validation (Group K-Fold) within the training set. α = 0.2 yielded the highest mean cross-validated R² (0.676) and the lowest RMSE of all tested values, so it was adopted for all reported student-model results.
To evaluate model performance further across PPUT concentration ranges, we conducted a stratified analysis by dividing the dataset into concentration bins (Table 4). R² depends on the variance of the dependent variable and can give misleading values for restricted-range subsets. We therefore used RMSE, MAE, and MAPE (mean absolute percentage error) to assess accuracy within each concentration range.
In the low-concentration range (<25 µg/L, comprising 75.6% of all samples), the model achieved RMSE values of 2.81 µg/L (training) and 6.29 µg/L (testing), with corresponding MAPE values of 14.0% and 35.6%, respectively. These modest error magnitudes indicate acceptable predictive performance in the concentration range most frequently observed in Lake Simcoe. The high-concentration range (≥50 µg/L) showed RMSE values of 15.38 µg/L (training) and 14.08 µg/L (testing), with MAPE values of 11.5% and 14.3%, demonstrating consistent relative accuracy despite the larger absolute errors inherent to higher concentration values. The middle range (25–50 µg/L) exhibited higher variability in error metrics, with testing RMSE of 24.19 µg/L and MAPE of 52.2%, likely due to the limited sample size (23 in training, 7 in testing) and greater uncertainty in this transitional concentration zone.
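The per-bin metrics can be computed with a few lines of NumPy. The sketch assumes that bins are defined on the measured concentration and uses the 25 and 50 µg/L edges from Table 4. The function name is hypothetical, and MAPE assumes strictly positive measured values.

```python
import numpy as np


def stratified_errors(y_true, y_pred, edges=(25.0, 50.0)):
    """RMSE, MAE and MAPE (%) within concentration bins of the measured PPUT (µg/L).

    Hypothetical helper. With the default edges the bins are (-inf, 25),
    [25, 50) and [50, inf); empty bins are skipped.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    bounds = [-np.inf, *edges, np.inf]
    out = {}
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        sel = (y_true >= lo) & (y_true < hi)
        if not sel.any():
            continue
        err = y_pred[sel] - y_true[sel]
        out[(lo, hi)] = {
            "n": int(sel.sum()),
            "RMSE": float(np.sqrt(np.mean(err ** 2))),
            "MAE": float(np.mean(np.abs(err))),
            "MAPE": float(100.0 * np.mean(np.abs(err) / y_true[sel])),
        }
    return out
```

Because MAPE divides by the measured value, equal absolute errors weigh more heavily in the low-concentration bin. This is why the absolute and relative metrics can rank the bins differently.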
4.6. Lake-Wide High-Resolution PPUT Prediction
Figure 9 presents a representative mid-summer snapshot of lake-wide PPUT predictions for 26 July 2024, derived from the SSD-free student model. Land and cloud-covered pixels are masked out during preprocessing (Section 3.5). The map shows spatially coherent fields at 3 m resolution, with a scene-wide mean concentration of 7.57 µg/L and predicted values ranging from 5.00 to 68.25 µg/L. The pelagic surface is dominated by low to moderate concentrations that form a broad mesotrophic background. Distinct high-value patches emerge along the shoreline, in narrow channels, and at tributary confluences.
The statistical distribution confirms that most of the lake surface remains in a relatively low-concentration regime. The median and upper quartile are p50 = 7.15 µg/L and p75 = 8.22 µg/L, and the 90th and 95th percentiles reach 8.97 and 9.49 µg/L, respectively. Across all water pixels (n ≈ 7.89 × 10⁷), 97.40% fall within the 5–10 µg/L bin, and only 1.88% and 0.19% lie in the 10–20 µg/L and 20–30 µg/L ranges. Pixels above 30 µg/L account for just 0.53% of the lake area. These pixels likely correspond to ecologically critical hotspots where particulate phosphorus loading and/or resuspension are strongly enhanced.
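Summary statistics of this kind can be computed directly from the predicted raster once masked pixels are set to NaN. The sketch below is a minimal NumPy illustration with a hypothetical function name. It uses half-open bins [lo, hi) and ignores pixels below the lowest edge.

```python
import numpy as np


def summarize_pput_map(pput, bin_edges=(5, 10, 20, 30)):
    """Percentiles and per-bin area percentages for a PPUT raster in µg/L.

    Hypothetical helper. Masked pixels (land, cloud) must be NaN and are
    ignored. The last bin is open-ended ([30, inf) with the default edges).
    """
    v = pput[np.isfinite(pput)]
    pct = {p: float(np.percentile(v, p)) for p in (50, 75, 90, 95)}
    edges = [*bin_edges, np.inf]
    frac = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        frac[(lo, hi)] = float(np.mean((v >= lo) & (v < hi)) * 100.0)
    return {"n": int(v.size), "mean": float(v.mean()),
            "percentiles": pct, "bin_percent": frac}
```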
At the basin scale, quadrant-wise averages indicate relatively modest cross-lake contrasts: the northeast and northwest quadrants show similar means (7.42 and 7.41 µg/L), whereas the southeast and southwest quadrants are slightly elevated (7.78 and 7.70 µg/L), consistent with the influence of significant inflows and shallow embayments in the southern basins. Overall, the 26 July snapshot demonstrates that the distilled student model is capable of generating physically plausible, spatially detailed PPUT fields: the lake interior is characterized mainly by low to moderate concentrations, while high-PPUT waters are confined to structurally and hydrologically meaningful nearshore and tributary-influenced zones.
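The quadrant-wise averages depend on how the quadrants are delimited, which the text does not specify. The sketch below assumes a north-up raster split at the centre of the valid-pixel bounding box. Both the split rule and the function name are hypothetical, and a different definition (for example, the lake centroid) would shift the boundaries.

```python
import numpy as np


def quadrant_means(pput):
    """Mean PPUT per quadrant of a north-up raster (hypothetical helper).

    NaN pixels (land, cloud) are ignored. Quadrants are split at the centre
    of the bounding box of valid pixels; row 0 is assumed to be the north edge.
    """
    rows, cols = np.where(np.isfinite(pput))
    r_mid = (rows.min() + rows.max() + 1) / 2.0
    c_mid = (cols.min() + cols.max() + 1) / 2.0
    rr, cc = np.indices(pput.shape)
    north = rr < r_mid
    west = cc < c_mid
    masks = {"NW": north & west, "NE": north & ~west,
             "SW": ~north & west, "SE": ~north & ~west}
    # nanmean ignores masked pixels inside each quadrant
    return {q: float(np.nanmean(np.where(m, pput, np.nan))) for q, m in masks.items()}
```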
5. Discussion
The experimental results obtained in this study make it possible not only to benchmark predictive performance but also to understand the physical, optical, and algorithmic mechanisms that govern phosphorus retrieval from high-resolution satellite data. Rather than treating the models as black boxes, we interpreted their behavior in the context of three factors: Lake Simcoe’s bio-optical regime, the contrasting roles of surface reflectance and SSD, and the influence of short-term meteorological forcing. We jointly analyzed multi-model feature groups, permutation-based importances, K-sweep dimensionality patterns, and teacher–student knowledge-distillation behaviour. This analysis links quantitative metrics such as R², RMSE, and MAE to underlying processes such as light attenuation, sediment resuspension, watershed inputs, and stratification dynamics.
The following subsections synthesize these lines of evidence, with a focus on (i) explaining SSD’s dominant predictive role, (ii) assessing the extent to which SSD-free monitoring is feasible through distillation, (iii) clarifying how feature engineering and model class shape performance, and (iv) discussing the implications, limitations, and broader applicability of the proposed PlanetScope-based AI framework.
5.1. Mechanisms Underlying SSD’s Dominant Predictive Role
The experimental findings indicate that SSD carries unique optical and biogeochemical information that can seldom be recovered from multispectral satellite data alone. SSD integrates the effects of light attenuation through the water column. In inland waters, this attenuation is controlled primarily by suspended particulate matter, chromophoric dissolved organic matter (CDOM), and phytoplankton biomass. Each of these constituents is directly or indirectly linked to phosphorus loading. Particulate phosphorus is adsorbed onto mineral particles and bound in organic detritus, which together dominate scattering. Dissolved phosphorus, in turn, often co-varies with CDOM originating from terrestrial runoff. SSD therefore provides a vertically integrated proxy for both particulate and dissolved phosphorus pools. PlanetScope’s spectral bands, by contrast, capture only surface reflectance, and the effective water-leaving signal in such inland waters typically comes from the top 3 m [51].
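The depth argument can be made concrete with two standard approximations. Both are stated here as assumptions, not as values calibrated for Lake Simcoe. The first is the Gordon and McCluney result that about 90% of the water-leaving signal originates above a depth z₉₀ ≈ 1/K_d, where K_d is the diffuse attenuation coefficient. The second is the empirical Poole–Atkins relation between K_d and the Secchi depth z_SD.

```latex
z_{90} \approx \frac{1}{K_d},
\qquad
K_d \approx \frac{c}{z_{SD}}
\quad\Longrightarrow\quad
z_{90} \approx \frac{z_{SD}}{c},
\qquad c \approx 1.7 .
```

With c ≈ 1.7, the layer sampled by reflectance is roughly 0.6 z_SD deep. SSD therefore integrates clarity information from below the layer that dominates the satellite signal. The coefficient c is empirical and varies among water bodies, so these relations indicate orders of magnitude only.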
This depth integration is crucial in Lake Simcoe, where internal loading, sediment resuspension, and near-bed processes contribute substantially to the total phosphorus budget. Satellite reflectance cannot fully resolve these processes because the upper layer of the water column dominates the water-leaving signal. SSD, in contrast, effectively “compresses” the combined effects of particle backscattering, CDOM absorption, and pigment absorption into a single observable variable. The considerable R² improvement when SSD is added (on the order of +0.22–0.26, depending on model family and feature group) therefore reflects observational physics rather than algorithmic artifacts. Raw spectral bands and indices cannot compensate for SSD. This suggests that no combination of band ratios, temporal descriptors, or short-term meteorology can fully substitute for vertically integrated clarity metrics in optically complex lakes.
From an information-content perspective, the residual performance gap between SSD-aware and SSD-free configurations can be read as the portion of the variance attributable to vertical structure that surface reflectance cannot observe. Once SSD is included, switching among ensemble methods yields only marginal differences. This indicates that the error remaining in SSD-free configurations is dominated not by model inadequacy but by the inherent observational limits of 4-band surface reflectance. In this sense, SSD acts as a rate-limiting information source for PPUT prediction, transforming an underdetermined inversion problem into a well-constrained regression task.
5.2. Implications for Scalable, SSD-Free Monitoring
The superior performance of the SSD-aware teacher model underscores the importance of strategic in-situ SSD sampling for model calibration. At the same time, the strongly performing student model (R² ≈ 0.83, RMSE = 9.82 µg/L) shows that KD provides a feasible bridge between accuracy and scalability. The student is trained to reproduce both hard PPUT labels and soft teacher outputs. This allows the SSD-driven structure learned by the teacher to be embedded in a reduced feature space that consists solely of remote-sensing and meteorological variables.
The retention of approximately 88% of the teacher’s explanatory power indicates that the student model internalizes key relationships between clarity, optical signatures, and hydrometeorological forcing, effectively mimicking SSD’s predictive role without requiring SSD as an explicit input.
This behavior can be viewed through the lens of an information bottleneck. The teacher model uses SSD to learn a high-dimensional manifold linking depth-integrated optical conditions to PPUT, and the student approximates this manifold with a lower-dimensional set of observable variables. The chosen distillation configuration, with a relatively modest teacher weight in the loss, encourages the student to balance fidelity to the measured PPUT against alignment with the teacher’s smoother decision boundaries. As a result, the student appears to benefit from the teacher’s physics-informed structure while avoiding overfitting to noise in the limited in-situ dataset.
The student model’s higher error (e.g., an MAE about 1.5 µg/L higher than the teacher’s) is non-negligible for fine-grained regulatory assessment. It nonetheless remains acceptable for regional surveillance, long-term trend analysis, and early-warning applications. In the dominant low-concentration range, the student’s MAPE stays below 36%, and at high concentrations its relative accuracy is stable (14.3% MAPE on the test set). Together, these results indicate that the model provides operationally useful predictions across the lake’s typical concentration spectrum. Errors are larger, however, in the sparsely sampled 25–50 µg/L range. The student’s accuracy also appears sufficient to distinguish broad trophic classes, whose thresholds are typically separated by >20 µg/L. The distillation framework thus suggests a practical division of labour. Teacher models anchored by SSD are reserved for high-stakes calibration and validation. SSD-free student models, in contrast, support routine, spatially extensive monitoring across larger lake networks where dense in-situ sampling is economically or logistically infeasible.
5.3. Feature Engineering, Physical Interpretability, and Dimensionality Trade-Off
The superior performance of derived spectral indices over raw bands supports the integration of domain knowledge in feature engineering. Ratio- and normalized-difference-based indices mitigate illumination, atmospheric, and viewing-geometry effects, and they more directly express the balance between scattering and absorption that underpins water-column optical behavior. Indices such as ExGR, ND-based ratios, and SR_NR emphasize differences between green, red, and NIR reflectance that are sensitive to phytoplankton, suspended sediments, and CDOM. Their prominence in the permutation-based importance rankings indicates that the model relies on these transformations to extract physically meaningful signals that raw bands only weakly encode.
Feature-level rankings show that spectral transformations (e.g., Excess Green, NIR/Red ratios, and normalized-difference ratios) consistently appear as a second tier of influential predictors after the dominant meteorological drivers. This pattern is consistent with indices amplifying the nonlinear balance between scattering and absorption across the visible–NIR region, thereby isolating optical signatures of suspended particles, phytoplankton biomass, and CDOM that are only weakly expressed in individual raw bands.
The feature-importance hierarchy indicates a complementary division of information content: meteorological variables encode the short-term forcing that redistributes phosphorus (runoff pulses, mixing, and resuspension), while reflectance-based indices capture the optical manifestation of these processes in the surface layer. This coupling helps explain the strong performance of the hybrid predictor set and supports the physical interpretability of the learned relationships.
At the category level, meteorological variables emerged as the strongest contributors to ΔR², ahead of remote-sensing indices, while spectral bands and temporal descriptors played secondary roles. This ordering is physically consistent. Short-term temperature, wind, and pressure patterns govern stratification, resuspension, and runoff, which in turn mobilize and redistribute phosphorus. Spectral indices then capture the optical manifestations of these processes through changes in turbidity, pigment concentration, and water colour. Stand-alone bands and DOY add relatively little. This implies that once physically meaningful meteorology and indices are present, additional raw reflectance or calendar information contributes little unique explanatory power.
The K-sweep analysis provides a complementary perspective on feature-space complexity. Accuracy increased rapidly as K rose from 10 to about 20, reflecting the addition of genuinely informative features that capture independent aspects of phosphorus dynamics. Between K = 20 and K = 40, performance gains became more modest, indicating a regime of diminishing returns where newly added predictors were increasingly correlated with variants of existing ones.
The optimal configuration at K = 40 is the point at which the model retains enough feature diversity to approximate the teacher’s manifold while avoiding over-parameterization. For K > 40, performance degrades, which is consistent with overfitting in a setting with a modest sample size and many highly collinear features. These patterns support the use of carefully curated, physically interpretable feature subsets in operational scenarios, rather than indiscriminately including all available predictors.
5.4. Cross-Model Implications and Underlying Machine Learning Mechanisms
The multi-model comparisons have several broader implications for the use of machine learning in inland water-quality retrieval [52]. First, the stark performance improvement produced by SSD across all algorithms confirms that vertically integrated optical information fundamentally constrains the phosphorus estimation problem. In SSD-free regimes, models must infer water-column clarity and nutrient status from surface reflectance combined with meteorological context. This is an inherently high-variance task, particularly in morphometrically complex systems such as Lake Simcoe. In this context, boosting-based methods outperform bagging, plausibly because they iteratively refine decision boundaries to capture weak, nonlinear interactions between spectral indices and short-term atmospheric forcing. Bagging methods average over many decorrelated trees and therefore tend to underfit subtle structures that are essential for distinguishing overlapping optical states with different phosphorus signatures.
Second, all models converge in SSD-aware regimes, with R² values tightly clustered above 0.93. This indicates that when a strong physical predictor is available, model architecture becomes much less important. Once SSD is present, even comparatively simple ensembles approach the performance of more sophisticated boosting methods, and additional architectural complexity yields negligible gains. This pattern suggests that the primary bottleneck in PPUT retrieval lies in the observability of key state variables rather than in the representational capacity of modern machine-learning models. It also suggests that improving auxiliary data streams, such as routine SSD sampling or other clarity proxies, may deliver greater benefits than further algorithmic tuning.
Third, the success of the distilled student model demonstrates how teacher–student frameworks can encode biophysical relationships, such as depth-dependent attenuation and particulate scattering, into a compressed feature space. The teacher effectively learns a high-dimensional representation of optical–biogeochemical structure underpinned by SSD, while the student approximates this representation using only remote-sensing and meteorological variables. This process is analogous to manifold learning: the student is constrained to follow the teacher’s learned decision surface rather than fitting spurious correlations in the SSD-free feature space.
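For tree-based regressors, which cannot consume soft labels in the way neural networks do, a common way to realize this transfer is to train the student on a blended target. The sketch below assumes the form y_kd = α·y + (1 − α)·ŷ_teacher with a constant weight α that multiplies the observed PPUT. This weighting convention and the function name `kd_targets` are illustrative assumptions, and the exact formulation used in this study may differ.

```python
from typing import List, Sequence


def kd_targets(y_obs: Sequence[float], y_teacher: Sequence[float],
               alpha: float) -> List[float]:
    """Blend observed PPUT with teacher predictions (hypothetical KD target).

    alpha = 1.0 -> train on observations only; alpha = 0.0 -> imitate the teacher only.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(y_obs) != len(y_teacher):
        raise ValueError("y_obs and y_teacher must have the same length")
    return [alpha * yo + (1.0 - alpha) * yt for yo, yt in zip(y_obs, y_teacher)]
```

The student would then be fitted to these targets using only SSD-free inputs. Generating the teacher predictions out-of-fold would keep the student from imitating in-sample teacher fits.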
Finally, the alignment between feature-importance patterns and established limnological processes (e.g., wind-driven resuspension, storm-driven inflows, and algal growth, all of which influence phosphorus distribution) suggests that the models capture relationships consistent with known mechanisms rather than arbitrary statistical associations. This strengthens confidence in the scientific interpretability and robustness of the proposed AI framework.
5.5. Comparison with Previous TP Retrieval Studies and Rationale for Backbone Selection
Previous work has made substantial progress in satellite-based retrieval of total phosphorus, but most studies face one or more constraints related to spatial resolution, feature completeness, or model deployability. Xiong et al. developed a remote-sensing algorithm for TP in eutrophic lakes using MODIS FAI-type indices combined with both conventional and machine-learning models, achieving R² ≈ 0.60 for a Taihu-specific model and R² ≈ 0.64 (RMSE ≈ 0.06 mg·L⁻¹) for a generalized multi-lake algorithm [5, 53]. This work is valuable because it systematically compares semi-analytical and data-driven approaches and explicitly addresses generalization across lakes. However, the coarse 250 m MODIS resolution limits its ability to resolve nearshore gradients and tributary plumes, and the feature set is largely restricted to surface reflectance and FAI-type indices, without vertically integrated clarity metrics or short-term meteorological drivers.
Qiao et al. used Landsat-8 imagery for the Miyun Reservoir and conducted a comprehensive comparison of twelve machine-learning algorithms, showing that Extra Trees (ETRs) yielded the best TP retrieval performance, with R² > 0.85 and very low MAE on a single-reservoir dataset [53]. Their study is a strong benchmark for algorithmic comparison at medium spatial resolution (30 m), but the feature space is dominated by spectral bands and simple indices. The model is site-specific and operates at a scale at which many littoral processes remain subpixel, and there is no explicit treatment of vertically integrated auxiliary indicators such as SSD or of temporal meteorological context.
Several recent studies have adopted gradient-boosting ensembles and Sentinel-class imagery to improve TP retrievals at regional scales. Wang et al. used Sentinel-3/OLCI images across the Yangtze–Huaihe lake region and found that an XGBoost-based model outperformed empirical approaches but still achieved only moderate accuracy (R² ≈ 0.53, RMSE ≈ 0.08 mg·L⁻¹) when generalized across many lakes [54]. Cui et al. optimized an XGBoost model for TP retrieval in Taihu Lake using Sentinel-2 and a carefully selected feature combination, achieving R² ≈ 0.72 with improved stability relative to using all variables [40]. Lin et al. further advanced this line of work by developing an interpretable LightGBM-based model that reconstructs long-term TP dynamics (2005–2024) in Lake Taihu, emphasizing explainability and driver attribution, yet still within a single large lake and at 10–30 m resolution [55]. These studies collectively demonstrate that boosted tree ensembles (XGBoost, LightGBM, and related methods) are consistently among the top performers for TP retrieval and that adding carefully engineered indices improves robustness. However, they generally operate at medium resolution, rarely integrate in-situ clarity proxies such as SSD into the retrieval model, and do not address how to deploy a “high-information” model when such auxiliary inputs are absent.
Other recent work has begun to incorporate meteorological variables into TP or nutrient modelling. For example, Li et al. proposed a multimodal framework that combined satellite data and meteorological forcings to estimate several water-quality parameters, including TP, obtaining R² ≈ 0.50 for TP at the regional scale [56], while Qin et al. showed, using Sentinel-2 and machine-learning methods, that air temperature is a key driver of TP and TN variability in northeastern lakes [57]. These studies highlight the importance of meteorological context but typically treat meteorological features as simple covariates; they neither incorporate vertically integrated optical measures (SSD) nor explore how such rich feature spaces can be transferred to operational settings where some inputs are missing.
Against this backdrop, the present study differs from and extends prior work in three main ways. First, it exploits multi-generation PlanetScope imagery (3–5 m) to explicitly resolve nearshore and tributary structures that are unresolved by MODIS, OLCI, or even Sentinel-2 in many lakes. This enables detection and mapping of narrow plume-like phosphorus hotspots and embayment gradients that previous medium-resolution studies could only infer indirectly or at coarse scales.
Second, the feature space combines a vertically integrated clarity metric (SSD), short-term meteorological descriptors (3–7 day aggregates of temperature, wind speed, precipitation, and pressure), and physically informed spectral indices. This design directly addresses two major gaps identified in the TP retrieval literature: the lack of water-column-integrated optical information and the limited incorporation of short-term hydrometeorological drivers.
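The short-term meteorological descriptors can be derived as trailing-window aggregates of daily records ending on the image acquisition date. The sketch below assumes a gap-free daily series keyed by date, windows that include the acquisition day, means for temperature, wind speed, and pressure, and a sum for precipitation. These aggregation choices, the variable keys, and the function name are assumptions for illustration.

```python
from datetime import date, timedelta
from typing import Dict, Iterable


def trailing_aggregates(
    daily: Dict[date, Dict[str, float]],
    acq_date: date,
    windows: Iterable[int] = (3, 7),
) -> Dict[str, float]:
    """Compute n-day trailing aggregates ending on (and including) acq_date."""
    out: Dict[str, float] = {}
    for n in windows:
        days = [acq_date - timedelta(days=i) for i in range(n)]
        missing = [d for d in days if d not in daily]
        if missing:
            raise KeyError(f"{len(missing)} missing day(s) in the {n}-day window")
        rows = [daily[d] for d in days]
        for var in ("temp", "wind", "pressure"):
            out[f"{var}_mean_{n}d"] = sum(r[var] for r in rows) / n
        # Precipitation is accumulated rather than averaged over the window.
        out[f"precip_sum_{n}d"] = sum(r["precip"] for r in rows)
    return out
```

Raising on missing days, rather than silently averaging fewer records, keeps windows comparable across acquisition dates; a gap-filling step could be added upstream if station records are incomplete.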
Third, the use of a teacher–student knowledge-distillation framework allows the high-accuracy, SSD-informed teacher to be “compressed” into an SSD-free student that can operate at any pixel where only remote-sensing and meteorological variables are available, thereby resolving the common operational dilemma of sparse in-situ data.
The choice of HistGradientBoosting as the backbone model for both teacher and student is also grounded in and consistent with the broader literature. Gradient-boosting ensembles (including XGBoost, LightGBM, and related variants) have repeatedly emerged as top performers in TP and nutrient retrieval tasks across lakes and regions, often outperforming random forests and support-vector regressors because they can capture complex, non-linear interactions while remaining relatively robust on tabular feature sets [57, 58, 59].
In this study, an initial benchmark across five ensemble learners (HistGBM, CatBoost, RandomForest, ExtraTrees, and GradientBoosting) and eight feature regimes showed that HistGradientBoosting systematically offered the best or near-best trade-off between accuracy, stability across feature groups, and computational efficiency. On this basis, the hybrid two-stage model used for knowledge distillation employs a HistGradientBoostingRegressor as the base learner, augmented by a standardized Ridge regression “linear head” trained on out-of-fold residuals and teacher predictions. This configuration benefits from the strong non-linear fitting capacity of boosting while allowing a lightweight linear correction layer to absorb residual structure and KD signals, providing a good balance between accuracy, stability, and interpretability for both the SSD-informed teacher and the SSD-free student.
By explicitly positioning the proposed framework relative to these earlier studies, the contribution of this work can be summarized as follows: it brings high-resolution (3–5 m) TP mapping into an optically complex lake, leverages both vertically integrated clarity and meteorological forcing in a unified feature space, and introduces a knowledge-distillation mechanism that preserves most of the teacher’s accuracy in an operationally scalable, SSD-free student. In doing so, it addresses several key limitations of previous TP retrieval studies—limited spatial resolution, incomplete driver sets, and non-deployable model architectures—while remaining consistent with the demonstrated strengths of gradient-boosting ensembles for water-quality retrieval.
5.6. Limitations and Future Research
Despite the encouraging performance of both teacher and student models, several limitations warrant careful consideration. The first constraint concerns optical water type dependence. The current models are calibrated primarily on mesotrophic to eutrophic conditions in Lake Simcoe, where PPUT ranges from several to tens of µg/L. In oligotrophic lakes with very low phosphorus and correspondingly weak optical signals, multispectral sensors may lack the sensitivity necessary to resolve phosphorus-related variability, and the learned relationships may not transfer without recalibration or the addition of hyperspectral data.
Second, temporal resolution and sampling-time mismatches impose constraints. Although PlanetScope provides near-daily nominal revisit, the effective temporal coverage is reduced by cloud contamination, data gaps, and mismatches with in-situ sampling times. Short-lived events such as storm-induced turbidity spikes, rapid tributary pulses, or internal waves can therefore be underrepresented or entirely missed in the satellite record. Integrating PlanetScope with other missions such as Sentinel-2 and Landsat-8/9 could mitigate some of these limitations by providing a denser multi-sensor time series, but would introduce additional challenges in cross-sensor harmonization.
Third, the framework is subject to vertical ambiguity. The models predict surface or near-surface PPUT-like conditions and are largely insensitive to deep-water phosphorus accumulation below the optical penetration depth. In strongly stratified periods, deep hypolimnetic phosphorus enrichment may therefore remain undetected, requiring complementary profiling or moored sensor data for comprehensive assessment.
Fourth, spatial heterogeneity in nearshore zones remains a challenge. The use of pixel aggregation and limited spatial predictors can smooth sharp gradients at the land-water interface, where human activities, shoreline morphology, and local hydrodynamics can create strong PPUT contrasts over short distances. Incorporating explicit spatial descriptors—such as distance to shore, bathymetric depth, or proximity to tributary mouths—may help improve model performance in these transition regions.
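As an illustration of one such descriptor, distance to shore can be derived from a binary water mask with a multi-source breadth-first search seeded at all land pixels. The sketch below returns distances in pixel steps under 4-connectivity (a Manhattan-type approximation); multiplying by the pixel size (about 3 m for SuperDove) gives an approximate metric distance. This descriptor is a suggestion for future work, not one used in the present models, and the function name is hypothetical. A Euclidean distance transform would be a more accurate alternative.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


def distance_to_shore(water: List[List[bool]]) -> List[List[Optional[int]]]:
    """4-connected pixel distance from each water pixel to the nearest land pixel.

    Land pixels get distance 0. Water pixels that no land pixel can reach
    (e.g. a mask containing no land) are left as None. Image edges are not
    treated as shoreline.
    """
    rows, cols = len(water), len(water[0])
    dist: List[List[Optional[int]]] = [[None] * cols for _ in range(rows)]
    q: Deque[Tuple[int, int]] = deque()
    for i in range(rows):
        for j in range(cols):
            if not water[i][j]:  # land pixel: a BFS source
                dist[i][j] = 0
                q.append((i, j))
    while q:
        i, j = q.popleft()
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < rows and 0 <= nj < cols and dist[ni][nj] is None:
                dist[ni][nj] = dist[i][j] + 1
                q.append((ni, nj))
    return dist
```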
Fifth, the model comparisons indicate that feature observability is the primary determinant of performance and that algorithmic differences matter most when SSD is unavailable. For lake-wide inference over large PlanetScope mosaics, HistGradientBoosting provides a pragmatic balance of predictive accuracy, computational efficiency, and robustness, whereas the SSD-aware regime serves mainly as a high-fidelity benchmark for calibration and validation.
Finally, the knowledge-distillation framework introduces its own form of uncertainty propagation. Because student models inherit part of the teacher’s structure, any systematic biases or unresolved uncertainties in the teacher will be partially transferred. Furthermore, distillation tends to smooth extreme values, which may modestly reduce sensitivity to rare but ecologically significant high-phosphorus events. Future work could explore confidence-aware distillation, Bayesian teacher-student architectures, and explicit uncertainty quantification to better characterize and propagate predictive uncertainty.
Also, the distillation weight α was kept constant throughout training, consistent with standard KD formulations. Future work could explore adaptive or schedule-based α schemes (e.g., annealing α across epochs or modulating it as a function of student confidence), which may further enhance the balance between teacher guidance and data-driven learning.
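The difference between the constant weight used here and possible annealed alternatives can be sketched as simple schedules. The linear and cosine forms, the `stage`/`n_stages` parameterization, and the default values are assumptions for illustration; the value of α used in this study is not restated here.

```python
import math


def alpha_constant(stage: int, n_stages: int, alpha: float = 0.5) -> float:
    """Constant KD weight (the default value is illustrative)."""
    return alpha


def alpha_linear(stage: int, n_stages: int, start: float = 0.2, end: float = 0.8) -> float:
    """Linearly anneal alpha from `start` to `end` over training stages."""
    if n_stages <= 1:
        return end
    frac = min(max(stage / (n_stages - 1), 0.0), 1.0)
    return start + (end - start) * frac


def alpha_cosine(stage: int, n_stages: int, start: float = 0.2, end: float = 0.8) -> float:
    """Cosine-annealed alpha: changes slowly near both ends of training."""
    if n_stages <= 1:
        return end
    frac = min(max(stage / (n_stages - 1), 0.0), 1.0)
    return end + (start - end) * 0.5 * (1.0 + math.cos(math.pi * frac))
```

Whether increasing α shifts the student toward the observations or toward the teacher depends on which term α multiplies, which is a convention that would need to match the KD loss actually used.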
5.7. Broader Implications
The integration of PlanetScope’s high-resolution multispectral imagery with machine learning and KD techniques has substantial implications for the future of inland water-quality monitoring. By demonstrating that lake-wide phosphorus distributions can be reconstructed at 3–5 m spatial resolution without reliance on dense auxiliary in-situ measurements such as SSD, the proposed framework addresses one of the most persistent gaps in operational monitoring programs: limited spatial coverage. The ability to generate stable, high-frequency phosphorus estimates from commercial satellite constellations enables a transition from episodic, labour-intensive sampling campaigns to a more proactive, spatially continuous monitoring paradigm.
In practice, high-resolution PPUT predictions improve situational awareness for lake managers. Fine-scale maps enhance the detection of nearshore nutrient hotspots, tributary plume dynamics, and localized resuspension events, features often missed by medium-resolution satellite products or sparse station networks. The temporal density of PlanetScope acquisitions further supports timely interventions aimed at mitigating eutrophication episodes before they fully develop. Such capabilities are particularly valuable for targeting field campaigns toward critical periods and locations, optimizing limited monitoring resources, and refining basin-wide nutrient budgets that account for spatial heterogeneity in loading and retention.
The lake-wide map for 26 July 2024 further illustrates how such SSD-free predictions can be used in practice to support spatially targeted management. Although nearly 97% of surface pixels fall within a relatively low 5–10 µg/L range, high-concentration waters form compact yet coherent clusters aligned with tributary mouths, narrow channels, and sheltered embayments. This pattern indicates that a small fraction of the lake area exerts a disproportionate influence on phosphorus loading and ecological risk. From a management perspective, these hotspot polygons provide natural candidates for intensified in-situ sampling, source apportionment studies, and land-use interventions in the corresponding sub-catchments.
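Pixel-share statistics of this kind can be computed directly from the predicted PPUT raster. The sketch below assumes the raster has been flattened into a list in which NaN marks masked (non-water or cloud) pixels. The class edges follow the ranges quoted for the 26 July 2024 case and are treated as half-open intervals [low, high). The function and dictionary names are hypothetical.

```python
import math
from typing import Dict, Iterable, Tuple


def class_fractions(
    pput: Iterable[float],
    bins: Dict[str, Tuple[float, float]],
) -> Dict[str, float]:
    """Fraction of valid (non-NaN) pixels falling in each [low, high) class."""
    valid = [v for v in pput if not math.isnan(v)]
    if not valid:
        raise ValueError("no valid pixels")
    n = len(valid)
    return {name: sum(lo <= v < hi for v in valid) / n for name, (lo, hi) in bins.items()}


# Class edges in ug/L; the upper class includes exactly 20 because intervals are [low, high).
BINS = {
    "5-10 ug/L": (5.0, 10.0),
    ">=20 ug/L": (20.0, math.inf),
}
```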
Conversely, the broad, low-variability background in the open lake suggests that additional routine sampling in the pelagic interior would yield diminishing returns compared with targeted efforts in nearshore and tributary corridors. In this way, high-resolution PPUT maps produced by the distilled model do not simply visualize nutrient status; they operationalize a tiered monitoring strategy in which scarce field and remediation resources can be preferentially allocated to the small but dynamically important zones where phosphorus actually accumulates and propagates.
Beyond Lake Simcoe, the findings suggest a scalable implementation pathway. A tiered operational concept can be envisioned in which high-accuracy, SSD-aware teacher models are deployed on a limited set of benchmark lakes for calibration and regulatory assessment, whereas distilled SSD-free student models are applied across larger regional lake networks for routine surveillance. In data-sparse regions or for historical reconstruction, further simplified models using a reduced set of spectral indices and temporal descriptors could still provide qualitative assessments of nutrient status at minimal cost. Collectively, these advances highlight the broader potential of combining commercial high-resolution satellite archives with AI-based knowledge-transfer methods to support scalable, accurate, and operational inland water-quality assessment, while maintaining a scientifically interpretable link to the underlying hydrological and biogeochemical processes.
Taken together, the analyses in Section 5 show that the success of PPUT retrieval in Lake Simcoe is less determined by the choice of machine-learning architecture than by which aspects of the lake system are made observable to the model. SSD emerges as a pivotal depth-integrated constraint that renders the inversion problem well-posed, while meteorological drivers and carefully designed spectral indices encode, respectively, the forcing and the optical expression of phosphorus dynamics. KD then provides a principled way to transfer this structure into an SSD-free student model, enabling scalable mapping across space and time without fully sacrificing accuracy. At the same time, the identified limitations (vertical ambiguity, optical water-type dependence, temporal mismatch, and uncertainty propagation) highlight the need to view such models as components of an integrated observing system rather than as stand-alone decision tools.
Thus, combining high-resolution PlanetScope imagery, physics-informed feature design, and teacher-student learning can bridge long-standing monitoring gaps in inland waters, while preserving a clear mechanistic link between model outputs and the hydrological and biogeochemical processes they are intended to represent.
6. Conclusions
This study demonstrates that integrating multi-generation PlanetScope imagery with a hybrid machine-learning and knowledge-distillation framework provides a robust pathway for high-resolution, lake-wide retrieval of total phosphorus (PPUT) in an optically complex inland lake.
By systematically evaluating five ensemble learners across eight feature-group regimes, we showed that the dominant constraint on PPUT retrieval is not model architecture but feature observability, most notably the availability of SSD as a vertically integrated indicator of water clarity. Across all models, including SSD increased the mean R² from approximately 0.67 to 0.94 and reduced error metrics by more than half, confirming that water-column transparency strongly governs the learnable structure of the PPUT prediction problem.
Building on this insight, we developed a two-stage teacher–student framework in which a physically informed teacher model is trained with SSD and transfers its representation to an SSD-free student via knowledge distillation. The distilled student retained about 88% of the teacher’s accuracy (R² = 0.83, RMSE = 9.82 µg/L, MAE = 5.41 µg/L) while relying only on PlanetScope reflectance, derived spectral indices, and short-term meteorological descriptors. The K-sweep analysis further revealed that an intermediate subset of 40 features provides an optimal balance between predictive skill and parsimony, indicating that a compact combination of optical and meteorological drivers can encode the essential dynamics of phosphorus transport, resuspension, and biological uptake.
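The retention figure can be read as a ratio of coefficients of determination. Assuming the teacher’s R² is close to the SSD-aware mean of about 0.94 reported above (the teacher’s exact value is not restated in this section), the arithmetic is:

```latex
\[
\text{retention} \;=\; \frac{R^2_{\text{student}}}{R^2_{\text{teacher}}}
\;\approx\; \frac{0.83}{0.94} \;\approx\; 0.88
\]
```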
Application of the SSD-free student model to PlanetScope SuperDove imagery from 2020 to 2025 produced metre-scale PPUT maps that resolve spatial patterns far beyond the capability of traditional monitoring and medium-resolution satellites. A representative case study for 26 July 2024 showed that the vast majority of lake-surface pixels (>97%) fall within a low-concentration band of 5–10 µg/L, whereas rare (<1%) but spatially coherent hotspots exceeding 20 µg/L occur near tributary mouths, sheltered embayments, and narrow channels. These hotspots form contiguous clusters that coincide with known river-inflow corridors and shallow, wind-exposed shorelines, highlighting the role of nearshore processes and watershed inputs in structuring phosphorus heterogeneity across Lake Simcoe.
In general, the findings indicate that lake-wide nutrient monitoring based on labour-intensive field campaigns can be complemented and enhanced by high-resolution satellite observations and distilled AI models. The proposed framework is expected to transfer to other lakes where SSD or similar auxiliary measurements are collected at a limited number of stations but cannot be mapped wall-to-wall. By enabling near-real-time estimation of PPUT at 3 m resolution, the approach provides a foundation for adaptive monitoring programs, early-warning tools for eutrophication, and improved watershed-scale nutrient budgeting. Overall, the combination of commercial high-resolution imagery, physics-informed feature engineering, and knowledge-transfer algorithms constitutes a robust and generalizable strategy for inland water-quality retrieval.