Estimation of Total Phosphorus Concentration in Lakes in the Yangtze-Huaihe Region Based on Sentinel-3/OLCI Images

Total phosphorus (TP) concentration is a crucial parameter for assessing eutrophication in lakes. The Yangtze-Huaihe region has one of the highest densities of freshwater lakes in China, so monitoring TP concentrations there is important for the sustainable utilisation of the country's water resources. In this study, a TP concentration estimation model suitable for large lake groups was developed by combining in situ measurements and remote sensing data with machine learning algorithms. Compared with traditional empirical models, the model developed in this study achieved higher fitting accuracy (R² = 0.53, RMSE = 0.08 mg/L, MAPE = 34.20%). The model was then applied to lakes in the Yangtze-Huaihe region from 2017 to 2022. The multi-year average TP concentration was 0.18 mg/L. Spatial distribution analyses showed that TP concentrations were higher in small lakes. In terms of temporal changes, the interannual decreases in TP concentrations were 0.02 mg/L, 0.01 mg/L, and 0.01 mg/L for small, medium, and large lakes, respectively. We also found that large lakes typically exhibited a “high in spring and summer, low in autumn and winter” pattern until 2020, but transitioned to a “high in summer and autumn, low in spring and winter” pattern after 2020 due to the removal of enclosure fishing nets, which had been having a significant impact on the lake ecosystem. Other lakes in the area consistently showed a pattern of “high in spring and summer, low in autumn and winter” during the six-year period. These findings may provide useful references and suggestions for the environmental protection and management of lakes in China.


Introduction
The ongoing increase in lake eutrophication, driven by global warming, industrialisation, and intensive agriculture, has emerged as one of the most significant issues in aquatic environments. Approximately 40% of the world's lakes experience serious problems, such as ecosystem and biodiversity deterioration, because of severe eutrophication [1]. Consequently, long-term and extensive monitoring of lake nutrient content is crucial for regulating lake eutrophication.
Currently, various techniques are used to retrieve water quality parameters, including analytical [2], semi-empirical [3], and empirical methods [4]. With the rapid development of artificial intelligence technology, however, machine learning algorithms have demonstrated clear advantages over these conventional approaches. By leveraging large datasets, these algorithms can establish complex relationships between water quality parameters and multiple variables, enabling the estimation of crucial parameters, such as the Secchi disk depth (SDD) [5], chlorophyll-a (Chl-a) [6], total suspended matter (TSM) [7], and chromophoric dissolved organic matter (CDOM) [8], and they have been successfully applied to the long-term monitoring of multiple lakes on a large spatial scale [9][10][11]. Nonetheless, in addition to these optically active parameters, non-optical parameters such as total nitrogen (TN), total phosphorus (TP), ammonia nitrogen (NH3-N), and the permanganate index (CODMn) are also closely related to lake eutrophication and play a crucial role in assessing the safety of the water environment [12,13]; however, these indices are difficult to estimate in turbid inland water bodies.
Among these parameters, TP is one of the primary nutrients contributing to lake eutrophication [14]. Extensive research has used remote sensing techniques to construct models for estimating TP concentrations in water bodies. For instance, Baban et al. [15] employed Landsat 5 TM satellite data to establish a regression relationship between remote sensing and in situ measurements for TP concentration estimation. Wang et al. [16] utilized MODIS images to develop a regression model to estimate TP concentration in Hulun Lake. Additionally, researchers have explored indirect TP estimation by leveraging the relationships between TP and optical water quality parameters, such as TSM, Chl-a, and CDOM [17][18][19]. In recent years, machine learning algorithms have gained prominence in TP concentration estimation, surpassing traditional methods. For instance, García Nieto et al. [20] employed Support Vector Machines (SVM) to construct a TP estimation model for Englishmen Lake in Spain and achieved a coefficient of determination (R²) of 0.90. Lee et al. [21] applied four different machine learning algorithms to estimate TP concentrations in Euiam Lake, Korea, yielding R² values above 0.70. Xiong et al. [22] applied the eXtreme Gradient Boosting (XGBoost) algorithm to TP concentration estimation in Taihu Lake, achieving an R² of 0.60 and an RMSE of 0.07 mg/L. Among the various machine learning techniques, XGBoost, an optimized gradient boosting algorithm, has demonstrated superior performance, self-learning capabilities, and predictive power [23]. Cui et al. [24] identified XGBoost as the optimal choice for estimating Chl-a concentration in Nansi Lake using hyperspectral data. Similarly, Hu et al. [25] compared six machine learning methods and concluded that XGBoost was the most effective model for predicting TP concentration in Taihu Lake.
The traditional direct inversion method is simple and performs well, but it is not suitable for lakes with complex relationships, while indirect inversion degrades model accuracy; as a result, these two inversion methods are currently applied only to single lakes. In contrast, machine learning is better suited to application on a large regional scale, which is well proven not only for parameters with optical characteristics, such as chlorophyll-a [6] and Secchi disk depth [26], but also for non-optical parameters, such as dissolved CO2 [27,28], where it shows good robustness. Although some scholars believe that machine learning has the potential for application in the remote sensing estimation of TP for lakes at a large regional scale [22], a complete proof and application have not been given.
In summary, there is a lack of machine learning models that are widely applicable to the estimation of TP concentrations in lake groups. The Yangtze-Huaihe region is recognised as one of the areas in China with the highest density of freshwater lakes and represents a prominent region with respect to lake eutrophication issues [29]. Monitoring TP concentrations in the Yangtze-Huaihe region is crucial for effective water management within the region. Therefore, this study aimed to use in situ measurements and synchronous satellite observations to establish a machine learning-based algorithm for estimating TP concentrations in the Yangtze-Huaihe lake group. This study further aims to analyse the spatiotemporal variations in TP and investigate the underlying driving mechanisms, thus providing scientific references for the monitoring and management of lake pollution and eutrophication in the Yangtze-Huaihe region. The objectives of this study were as follows: (1) to develop a machine learning model for estimating TP concentrations in the lakes of the Yangtze-Huaihe region and (2) to analyse the spatiotemporal distribution patterns of TP concentrations in the lakes of the Yangtze-Huaihe region and explore the associated driving mechanisms.

Study Area
The study area encompasses the Yangtze-Huaihe region, which lies between the lower reaches of the Yangtze River and the Huai River. It is located at approximately 28°21′-33°40′N, 115°42′-121°1′E and is characterized by a subtropical monsoon climate with mild seasons and significant interannual variations in rainfall. The geographical features of the region are predominantly formed by the sedimentation of the Huai River and the Yangtze River, with higher elevations surrounding the lower-lying central area, where the altitude is less than 10 m. The Yangtze-Huaihe region is one of the areas in China with the highest density of lakes, which is closely associated with the scarcity of water resources in the country [30]. Moreover, the region has a high population density, well-developed freshwater fisheries, and a substantial amount of wastewater discharge. Intensive development activities in the lakes, such as land reclamation and enclosed aquaculture, have directly or indirectly contributed to increased nutrient loading [31,32], water quality deterioration, frequent occurrences of cyanobacterial blooms, and severe eutrophication, which pose significant challenges to the sustainable development of the Yangtze-Huaihe region. Therefore, for this study, we selected shallow lakes within the Yangtze-Huaihe region with an area larger than 20 km² as representative lakes for long-term analysis of TP concentrations, as seen in Figure 1 (lake areas are listed in Supplementary Material Table S1).

Data

In Situ Data
The research team conducted field surveys in six medium to large-sized lakes in the Yangtze-Huaihe region between 2017 and 2022. The sampling points were evenly distributed across the lakes, and specific information is presented in Table 1. Brown 500 mL bottles were used to collect mixed surface water samples, which were then immediately transported to the laboratory with ice packs in light-protected coolers. Filtration experiments were conducted on the same day, followed by refrigeration for chemical analysis. Global positioning system (GPS) devices were used to record the coordinates of the sampling points, and the measurement time and weather conditions were also recorded. TP concentrations were determined using the acid digestion molybdenum-antimony anti-colorimetric method.
The imagery used in this study was obtained from Sentinel-3. The Ocean and Land Color Instrument (OLCI) is a multispectral radiometer carried by Sentinel-3A (launched in 2016) and Sentinel-3B (launched in 2018). It has 21 spectral channels in the range of 400-1020 nm, including 16 water-colour bands, and a high signal-to-noise ratio, with a spatial resolution of about 300 m. The data were downloaded from the European Space Agency website (https://scihub.copernicus.eu/dhus/#/home, accessed on 28 June 2022).
Currently, there are various methods for atmospheric correction over water bodies. Among them, the dark spectrum fitting (DSF) atmospheric correction algorithm of Acolite has been proven suitable for Sentinel-3 OLCI imagery in turbid water bodies [33]. Therefore, in this study, the Acolite atmospheric correction algorithm was employed. Subsequently, algal blooms, clouds, and cloud shadows were masked, and terrain correction was conducted using SNAP 8.0.0 software. The spectral reflectance at the sampling points was extracted based on their latitude and longitude coordinates using ENVI 5.3. These data were used for the construction and validation of the TP concentration inversion model.

Auxiliary Data
Meteorological data were incorporated to investigate the influence of natural factors on TP concentration in the Yangtze-Huaihe region. Rainfall data were taken from the daily global precipitation dataset provided by NASA (GPM_3IMERGHH_06, https://disc.gsfc.nasa.gov, accessed on 15 February 2023). The spatial resolution of this dataset is 0.1° × 0.1°. Temperature data were obtained from the ERA5-Land dataset (https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-land-monthly-means?tab=overview, accessed on 15 February 2023) and are expressed in Kelvin (K). In this study, the temperature values were converted from Kelvin to Celsius (°C) by subtracting 273.15 from the original data.

Model Development

Modeling Set Construction
The variable set comprises OLCI image bands and remote sensing indices.
(2) Remote Sensing Indices: Hu et al. [34] developed the Floating Algae Index (FAI) for monitoring algal blooms in water bodies, which are closely related to nutrient concentrations [22]. Because OLCI has no Short-Wave Infrared (SWIR) band, a Near-Infrared (NIR) band is used as a substitute, giving the Adjusted FAI (AFAI) [35]. Additionally, the Cyanobacteria and Macrophytes Index (CMI) is employed to differentiate cyanobacterial bloom areas from aquatic vegetation areas, while the Turbid Water Index (TWI) is used to assess turbid water bodies [36]. In the formulas for these indices, λ_BLUE, λ_GREEN, λ_RED, λ_NIR1, λ_NIR2, and λ_SWIR correspond to the center wavelengths of 430 nm, 560 nm, 665 nm, 754 nm, 865 nm, and 1016 nm in OLCI, respectively.
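To illustrate how a baseline-subtraction index of this family is computed, the following minimal sketch assumes that AFAI keeps the linear-baseline form of FAI [34], with the NIR2 band standing in for SWIR. The function name, argument names, and sample reflectances are hypothetical, and the expression is not necessarily the exact one used in this study.

```python
# Center wavelengths (nm) quoted in the text for the OLCI bands.
LAMBDA_RED = 665.0
LAMBDA_NIR1 = 754.0
LAMBDA_NIR2 = 865.0


def baseline_index(r_red, r_nir1, r_nir2):
    """Height of the NIR1 reflectance above a RED-to-NIR2 linear baseline.

    Assumption: AFAI has the same form as FAI, with NIR2 replacing SWIR.
    """
    # Fraction of the way from RED to NIR2 at which NIR1 sits.
    weight = (LAMBDA_NIR1 - LAMBDA_RED) / (LAMBDA_NIR2 - LAMBDA_RED)
    # Reflectance expected at NIR1 if the spectrum were a straight line.
    baseline = r_red + (r_nir2 - r_red) * weight
    return r_nir1 - baseline


# Hypothetical reflectances: a NIR1 peak (bloom-like) and a flat spectrum.
bloom_like = baseline_index(r_red=0.02, r_nir1=0.08, r_nir2=0.03)
flat = baseline_index(r_red=0.02, r_nir1=0.02, r_nir2=0.02)
```

Under this assumed form, a flat spectrum gives an index of zero, and a reflectance peak at NIR1 gives a positive value.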
Based on the above, five approaches were constructed from the original bands and from their reciprocal, exponential, square, and square-root transformations (Table 2). In addition, the AFAI, CMI, and TWI indices were included as variables in every approach. In total, each approach consists of 20 variables, and an XGBoost model was built for each approach. Each approach's input variables are marked with an ID (please refer to Supplementary Material Table S2 for details).
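The construction of the five variable sets can be sketched as follows. This is a minimal illustration rather than the authors' code: the band names, reflectance values, and index values are hypothetical, and it assumes that the three indices are appended without transformation.

```python
import math

# The five transformations named in the text, applied band by band.
# The reciprocal and square root assume strictly positive reflectance.
TRANSFORMS = {
    "original": lambda r: r,
    "reciprocal": lambda r: 1.0 / r,
    "exponential": lambda r: math.exp(r),
    "square": lambda r: r * r,
    "square_root": lambda r: math.sqrt(r),
}


def build_scheme(bands, indices, transform_name):
    """Return one variable set: transformed bands plus untransformed indices."""
    transform = TRANSFORMS[transform_name]
    variables = {name: transform(value) for name, value in bands.items()}
    variables.update(indices)  # AFAI, CMI, TWI are appended as-is (assumption)
    return variables


# Hypothetical reflectances for three bands and hypothetical index values.
bands = {"b4": 0.04, "b6": 0.05, "b8": 0.03}
indices = {"AFAI": 0.002, "CMI": -0.001, "TWI": 0.01}

schemes = {name: build_scheme(bands, indices, name) for name in TRANSFORMS}
```

Each entry of `schemes` then plays the role of one approach, with the same indices in every set and only the band transformation changing.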

Algorithm
The XGBoost algorithm was employed to establish a remote sensing algorithm for estimating TP concentration in the Yangtze-Huaihe region. XGBoost, proposed by Chen et al. [37] in 2016, is an optimized implementation of gradient-boosted decision trees. The core idea of this algorithm is to use gradient information to progressively improve the predictive capability of the model by constructing decision trees. In each iteration, the XGBoost algorithm calculates residuals based on the model's predictions from the previous round and then fits a new decision tree to these residuals. Through continuous iterations, new decision trees are added, and their prediction results accumulate. In addition, it introduces regularization terms to control model complexity and limit overfitting, and it is engineered for fast training. In this study, the XGBoost algorithm was implemented using the Scipy library in Python 3.9. To determine the structure of the XGBoost model, multiple hyperparameters need to be adjusted, including the learning rate, maximum tree depth, subsample ratio, and regularization. After tuning, the final parameters were set to a learning rate of 0.02, a maximum tree depth of 7 layers, a subsample ratio of 0.8, and a regularization value of 0.01, with other parameters set to default values.
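The residual-fitting loop described above can be illustrated with a deliberately simplified sketch. It is plain gradient boosting for squared loss with one-split trees (stumps), not XGBoost itself: it omits second-order gradients, regularization, and subsampling. The toy data, function names, and the learning rate used here are hypothetical.

```python
def fit_stump(x, residuals):
    """Best single-split regression stump for a 1-D feature (squared error)."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_rounds=200, learning_rate=0.1):
    """Each round fits a stump to the current residuals and adds it, shrunk."""
    base = sum(y) / len(y)
    stumps = []
    predictions = [base] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(xi) for p, xi in zip(predictions, x)]

    def predict(xi):
        return base + learning_rate * sum(s(xi) for s in stumps)

    return predict


# Hypothetical toy data: one predictor and a TP-like response (mg/L).
x = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06]
y = [0.10, 0.12, 0.11, 0.20, 0.22, 0.21]
model = boost(x, y)
mse = sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)
```

A smaller learning rate, such as the 0.02 reported above, shrinks each tree's contribution further and therefore generally needs more boosting rounds to reach the same training error.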
Before constructing the algorithm, the importance index of the input variables, referred to as IncMSE (increase in mean squared error), needs to be calculated. A higher IncMSE value indicates greater importance of the variable. The accuracy of IncMSE increases with a larger number of random iterations, but at the cost of longer computation time. In this study, 300 iterations were selected. For each approach, the variables were sorted in descending order of their IncMSE scores and then input sequentially into the algorithm. The approach with the higher accuracy was selected based on model evaluation metrics obtained through cross-validation.
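One common way to obtain an increase-in-MSE score is permutation importance: shuffle one variable at a time and measure how much the error grows. The sketch below follows that idea under two assumptions that the text does not confirm: that IncMSE is permutation-based here, and that the 300 iterations correspond to repeated shuffles. The toy model and data are hypothetical.

```python
import random


def mse(predict, rows, targets):
    """Mean squared error of a predict function over rows of features."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(targets)


def inc_mse(predict, rows, targets, n_repeats=300, seed=0):
    """Mean increase in MSE after shuffling each column, one column at a time."""
    rng = random.Random(seed)
    baseline = mse(predict, rows, targets)
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        total = 0.0
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild the rows with only column j shuffled.
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            total += mse(predict, shuffled, targets) - baseline
        scores.append(total / n_repeats)
    return scores


# Hypothetical fitted model: depends on feature 0 and ignores feature 1.
def toy_model(row):
    return 0.1 + 2.0 * row[0]


rows = [[0.01, 0.5], [0.02, 0.1], [0.03, 0.9], [0.04, 0.3], [0.05, 0.7]]
targets = [toy_model(r) for r in rows]
scores = inc_mse(toy_model, rows, targets)

# Variables sorted in descending order of importance, as in the text.
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

The `ranking` list gives the order in which variables would be fed to the model, most important first.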
Model Evaluation

K-Fold Cross-Validation
K-fold cross-validation is a common method for assessing the performance and reliability of machine learning models [38]. It supports parameter tuning and prevents information leakage from influencing the model's hyperparameters. The entire dataset is divided into K equally sized subsets, and in each iteration, one subset is used as the validation set while the remaining K-1 subsets are used for training. This process is repeated K times, so that every subset is used once for validation. The average of the evaluation metrics obtained from the K iterations is taken as the final evaluation metric.
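The splitting and averaging procedure can be sketched in a few lines. The helper names, the mean-only stand-in model, and the sample values are hypothetical; in practice the samples would usually be shuffled before being split into folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists for k contiguous folds."""
    # Distribute any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


def cross_validate(fit, score, x, y, k=5):
    """Average a score over k folds; `fit` must return a predict function."""
    scores = []
    for train, validation in k_fold_indices(len(x), k):
        predict = fit([x[i] for i in train], [y[i] for i in train])
        predicted = [predict(x[i]) for i in validation]
        scores.append(score([y[i] for i in validation], predicted))
    return sum(scores) / len(scores)


# Hypothetical stand-ins: a mean-only "model" and RMSE as the score.
def fit_mean(x_train, y_train):
    mean = sum(y_train) / len(y_train)
    return lambda _x: mean


def rmse(observed, predicted):
    return (sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)) ** 0.5


x = list(range(10))
y = [0.10, 0.12, 0.15, 0.11, 0.20, 0.22, 0.18, 0.14, 0.16, 0.19]
folds = list(k_fold_indices(len(x), 5))
mean_rmse = cross_validate(fit_mean, rmse, x, y, k=5)
```

Every sample appears in exactly one validation fold, so the averaged score reflects predictions on data the model did not see during fitting.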

Algorithm Accuracy Evaluation
The model accuracy is evaluated using the following metrics: coefficient of determination (R²), root mean squared error (RMSE), and mean absolute percentage error (MAPE). These metrics provide a comprehensive assessment of the model's performance.
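For reference, the three metrics can be computed as in the sketch below, which uses their common textbook definitions. These are assumed rather than taken from this study; in particular, R² is computed here as one minus the ratio of residual to total sum of squares, whereas some studies report the squared correlation coefficient instead. The sample values are hypothetical.

```python
import math


def r_squared(observed, predicted):
    """1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean squared error, in the units of the observations."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def mape(observed, predicted):
    """Mean absolute percentage error, in percent; observations must be non-zero."""
    return 100.0 * sum(abs((o - p) / o) for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical TP values in mg/L.
observed = [0.10, 0.20, 0.30, 0.40]
predicted = [0.12, 0.18, 0.33, 0.36]

metrics = {
    "R2": r_squared(observed, predicted),
    "RMSE": rmse(observed, predicted),
    "MAPE": mape(observed, predicted),
}
```

RMSE keeps the units of the measurements (mg/L), whereas MAPE is relative and therefore weights errors at low concentrations more heavily.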

Spatial-Temporal Distribution Analysis
A total of 882 cloud-free or partially cloud-free OLCI images of the Yangtze-Huaihe region were selected from 2017 to 2022 (Table 3). The machine learning algorithm was applied to these images to estimate the TP concentration in each lake. First, mean TP concentrations were calculated for each lake for every month of every year. Second, the average TP concentrations were determined for each lake by season (spring: March to May; summer: June to August; autumn: September to November; winter: December to February). Additionally, the mean TP concentration was calculated for each 12′ latitude and longitude grid cell, and the results were visualized. The lakes were classified into four categories based on their surface area: large lakes (>500 km²), medium lakes (100-500 km²), medium-small lakes (50-100 km²), and small lakes (20-50 km²). The spatial-temporal variations of TP concentration in the Yangtze-Huaihe region were analyzed accordingly.
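The seasonal grouping and size classification can be sketched as follows. The record layout, lake name, and values are hypothetical. The handling of areas that fall exactly on a class boundary is an assumption, because the stated ranges share their end points, and the sketch does not decide which year a December value belongs to.

```python
from collections import defaultdict

# Month-to-season mapping as defined in the text.
SEASONS = {3: "spring", 4: "spring", 5: "spring",
           6: "summer", 7: "summer", 8: "summer",
           9: "autumn", 10: "autumn", 11: "autumn",
           12: "winter", 1: "winter", 2: "winter"}


def size_class(area_km2):
    """Lake size class from surface area (boundary handling is an assumption)."""
    if area_km2 > 500:
        return "large"
    if area_km2 >= 100:
        return "medium"
    if area_km2 >= 50:
        return "medium-small"
    if area_km2 >= 20:
        return "small"
    raise ValueError("lakes under 20 km2 are outside the study selection")


def seasonal_means(records):
    """Average TP per (lake, season) from (lake, month, tp) records."""
    groups = defaultdict(list)
    for lake, month, tp in records:
        groups[(lake, SEASONS[month])].append(tp)
    return {key: sum(values) / len(values) for key, values in groups.items()}


# Hypothetical monthly TP estimates (mg/L) for one hypothetical lake.
records = [("Lake A", 3, 0.20), ("Lake A", 4, 0.22),
           ("Lake A", 12, 0.15), ("Lake A", 1, 0.13)]
means = seasonal_means(records)
```

Grouping by `(lake, season)` keeps the per-lake detail, and the result can be averaged again by `size_class` to compare lake categories.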

Model Accuracy Evaluation
The results of the variable importance index calculations for the five approaches are provided in Supplementary Material Figure S1. Based on the variable importance index, the variables were input into the algorithm in descending order of importance. Because the training sets of the machine learning algorithms all exhibited high fitting accuracy, only the accuracy of the validation set was considered. The fitting results were similar for the five approaches, and AFAI, b8, b4, b6, and b11 were selected in all of them, indicating that these five variables are important factors in estimating TP concentrations.
Based on the results of 5-fold cross-validation and the scatter plot distributions shown in Figure 2, Approach 1 demonstrated higher accuracy and smaller errors (R² = 0.53, RMSE = 0.08 mg/L, MAPE = 34.20%), and its scatter plot distribution was more concentrated. Therefore, AFAI, b8, b4, b6, b11, b17, b14, b2, and CMI were selected as input variables to construct the model for TP concentration estimation.

Spatial Distribution of TP Concentration in the Yangtze-Huaihe Region
The seasonal variations in TP concentration in the lakes of the Yangtze-Huaihe region during the period 2017-2022 are depicted in Figure 3. The average TP concentrations were higher during spring (0.21 mg/L) and summer (0.23 mg/L) and lower during autumn (0.19 mg/L) and winter (0.15 mg/L); in all four seasons they exceeded the threshold for algal bloom occurrence (0.08 mg/L) [39].
In terms of spatial distribution, smaller lakes near Taihu Lake, Hongze Lake, and Poyang Lake exhibited comparatively higher TP concentrations. The average TP concentration near the lake shore was significantly higher than that in the centre of the lake. With respect to longitude, the fluctuation trends remained consistent throughout the four seasons. Within the longitude ranges of 117°0′-117°48′E, 118°36′-119°0′E, and 120°0′-120°36′E, which include the western part of Poyang Lake, the smaller lakes between Poyang Lake and Chaohu Lake, and Taihu Lake, the average TP concentrations were relatively low.
With respect to latitude, the southern part of Poyang Lake at 29°24′N exhibited higher TP concentrations during spring and summer. Conversely, the northern part of Poyang Lake between 29°36′N and 30°0′N showed lower TP concentrations. Additionally, the northern part of Hongze Lake at 33°48′N consistently displayed higher TP concentrations in all four seasons.

Temporal Variation of TP Concentration in Lakes of Different Sizes
Figure 4 shows the temporal variation in TP concentrations in lakes of different sizes across the Yangtze-Huaihe region from 2017 to 2022. From 2017 to 2019, the large lakes exhibited a distinct pattern, with higher TP concentrations during spring and summer and lower concentrations during autumn and winter. From 2020 to 2022, however, the TP concentrations in large lakes were higher during summer and autumn and lower during spring and winter. Other lake types displayed a consistent pattern over the six-year period, with higher TP concentrations during spring and summer and lower concentrations during autumn and winter.

Large lakes showed relatively weak and stable seasonal variations, with average TP concentrations ranging from 0.13 to 0.23 mg/L. The highest TP concentration of 0.23 mg/L was recorded during the summer of 2020. Medium and medium-small lakes exhibited larger variation amplitudes than large lakes, and their average TP concentrations during spring and summer were generally higher than 0.20 mg/L. For medium-sized lakes, the highest TP concentration of 0.18 mg/L was observed during autumn 2020, and during winter, TP concentrations remained below 0.15 mg/L. Medium-small lakes showed average TP concentrations ranging from 0.18 to 0.21 mg/L during autumn and between 0.15 and 0.17 mg/L during winter. Small lakes displayed the most significant seasonal variation and had higher average TP concentrations, with concentrations above 0.20 mg/L during spring, summer, and autumn, and between 0.15 and 0.17 mg/L in winter.
Across lakes of different sizes, TP concentrations peaked during the summer of 2020, and TP concentrations during autumn 2020 were also higher than those in the same season of other years.

Comparison with Other Algorithms
To compare the accuracy of the XGBoost model developed in this study, which is based on both field measurements and OLCI images, with existing approaches, we selected two empirical models [40,41] that have shown good performance in turbid inland waters, as shown in Table 4. While the existing algorithms achieved high accuracy in specific lakes, with R² values above 0.80, their effectiveness in large-sized regions was not satisfactory, with R² values below 0.10, RMSE of 0.13 mg/L, and MAPE exceeding 50% (Figure 5). Therefore, these models are unsuitable for TP concentration retrieval in large-sized regions.
However, machine learning also has certain limitations when it comes to estimating TP concentration. Insufficient sample size in the training dataset can affect the accuracy and generalization ability of the model [42]; additionally, different types of lakes can also influence the model's accuracy [43]. Therefore, cross-validation is employed to divide the dataset into different training and validation sets to minimize the impact of sample size on model performance and enhance its generalization ability [44,45]. Li et al. [46] utilized 67 samples to construct a machine learning model for estimating the concentration of TP in water bodies, while Ding et al. [47] constructed an XGBoost model for estimating TP concentration in Nanyi Lake using 78 samples, both achieving good results. This suggests that, by selecting suitable algorithms and employing appropriate processing methods, machine learning can still yield good validation results even with a limited number of samples. Although the R² value of the XGBoost model in this study exceeded 0.50, its estimates differed among lake types. Specifically, TP concentration tended to be overestimated in Gaoyou, Hongze, and Gehu Lake when TP concentrations were low, while high TP concentrations in Taihu Lake were noticeably underestimated. A comparison of the water quality of these lakes showed that TSM concentrations were higher in Gaoyou, Hongze, and Gehu Lake, while Chl-a concentrations were higher in Taihu Lake (Figure 6). This indicates that the model in this study tended to overestimate in lakes dominated by non-algal particles, such as Gaoyou, Hongze, and Gehu Lake [48], and to underestimate in lakes dominated by phytoplankton, such as Taihu Lake [43]. Currently, few researchers use machine learning to estimate the properties of lakes with different optical characteristics. The number of sampling sites used in this study is limited, so this model has some limitations.
Based on the above findings, no machine learning model has been specifically developed for TP concentration estimation in lake groups. To enhance the accuracy of the model, future research should focus on increasing the number of sampling points and classifying water bodies based on their optical properties.

Drivers Analysis
From the perspective of climate change (Figure 7), meteorological factors such as temperature and precipitation directly or indirectly influence nutrient concentrations [49,50].As the temperature increases, the release rate of nutrients from lake sediments accelerates, leading to an increase in TP concentration in the water.Furthermore, despite the dilution effect of rainfall on lake TP concentration, extreme rainfall events can generate significant external phosphorus loading [51].During summer, floods are more frequent in the middle and lower reaches of the Yangtze River, while spring experiences higher rainfall and more rainy days compared to autumn [52], indicating that TP concentrations in lakes may be higher during spring and summer.In addition, it should be noted that the TP concentrations in different types of lakes were relatively high during the summer and autumn of 2020.This is attributed to the flooding that occurred in the Yangtze-Huaihe region in July, which resulted in the influx of excessive phosphorus loads into the lakes, directly causing an overall increase in TP concentrations and the accumulation of phosphorus in the sediments, thereby impacting subsequent water quality conditions [53].
Agricultural pollution can also contribute to elevated TP concentrations in lakes. In the Yangtze-Huaihe region, the TP concentrations in lakes exhibited a trend of being higher in spring and summer and lower in autumn and winter. This pattern may be related to the double-cropping rice planting system in which fertilisers are applied to the soil during both the spring and summer seasons [54], resulting in a significant inflow of nutrients into lakes [55]. For large lakes, the changing trend in TP concentrations after 2020, characterised by higher levels in summer and autumn and lower levels in spring and winter, is mainly attributed to the time required for the ecological water quality recovery process after the removal of enclosure aquaculture in 2019 [56]. The ecosystem gradually improved after 2020, and a large number of planktonic organisms absorbed phosphorus and organic matter from the water during reproductive growth, leading to a significant decrease in TP concentrations during spring in large lakes. However, although medium- and small-sized lakes have responded positively to the effects of enclosure removal, the presence of large agricultural areas and dense human populations results in a significant discharge of domestic and industrial wastewater [57,58]. The decreasing trend in TP concentrations was not evident in medium and small lakes, highlighting the importance of pollution control and management in the inflow areas of these lakes.
In conclusion, both natural factors and human activities have distinct effects on seasonal and interannual variations in TP concentrations in lakes in the Yangtze-Huaihe region.
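
The seasonal patterns described in this section ("high in spring and summer" versus "high in summer and autumn") can be made concrete with a small sketch that averages monthly TP values by season and reports the two highest seasons. The season boundaries (meteorological seasons, with spring as March to May), the helper names, and the monthly values below are assumptions for illustration only; they are not the data or the procedure of this study.

```python
from statistics import mean

# Assumed meteorological seasons: spring = Mar-May, summer = Jun-Aug,
# autumn = Sep-Nov, winter = Dec-Feb.
SEASON_OF_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def seasonal_means(monthly_tp):
    """Average monthly TP values (dict: month number -> mg/L) by season."""
    by_season = {}
    for month, value in monthly_tp.items():
        by_season.setdefault(SEASON_OF_MONTH[month], []).append(value)
    return {season: mean(values) for season, values in by_season.items()}


def high_seasons(monthly_tp, n=2):
    """Return the n seasons with the highest mean TP concentration."""
    means = seasonal_means(monthly_tp)
    ranked = sorted(means, key=means.get, reverse=True)
    return set(ranked[:n])


# Illustrative monthly TP values in mg/L (hypothetical, not from this study).
earlier_year = {1: 0.10, 2: 0.10, 3: 0.20, 4: 0.22, 5: 0.21, 6: 0.24,
                7: 0.25, 8: 0.23, 9: 0.15, 10: 0.14, 11: 0.13, 12: 0.11}
later_year = {1: 0.09, 2: 0.09, 3: 0.11, 4: 0.12, 5: 0.12, 6: 0.22,
              7: 0.24, 8: 0.23, 9: 0.20, 10: 0.19, 11: 0.18, 12: 0.10}

print(high_seasons(earlier_year))
print(high_seasons(later_year))
```

Ranking seasonal means in this way reduces each year to a simple label, which makes a shift in the seasonal pattern between years easy to detect.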

Comparison of TP Concentration Estimation in Typical Lakes
Based on the results of this study, the lakes were categorized into four classes according to their area, and one representative lake was selected from each category for comparison with previously published results (Figure 8 and Table 5). (1) Xiong et al. [59] reported total phosphorus (TP) concentrations in Lake Taihu ranging from 0.14 to 0.17 mg/L in 2017-2019, with an error of only 5% relative to the average TP concentration estimated in this study for the same period. Shang et al. [60] estimated TP concentrations in Taihu Lake from 2017 to 2020 and found a trend similar to ours, although with lower values. TP concentrations are highest in summer, but the Landsat-8 images used by Shang et al. have a coarser temporal resolution and may not have covered periods of high TP concentration, leading to a lower annual average estimate. (2) Qian et al. [61] conducted monthly surveys of TP concentration in Gehu Lake, and their results had a correlation of 0.65 with those of this study. The reasons for the local variations are as follows: (a) there is a discrepancy between the monthly monitoring values and the average values within the monthly interval of the images; (b) Qian et al. used interpolation to calculate TP concentrations in Gehu Lake, and their results may have been influenced by the uneven distribution of sampling data. In contrast, this study employed remote sensing imagery, which provides broader and continuous spatial coverage. However, owing to the limited spatial resolution, this study may not capture detailed information for specific localized areas, which leads to local discrepancies compared to the results of Qian et al. (3) The seasonal variation trend for TP concentration in Dianshan Lake observed by Xiong et al. [62] is similar to the trend in this study. The discrepancies in the results are due to: (a) the distribution of sampling points: Xiong et al. analyzed seasonal variations using data from 13 monitoring points in Dianshan Lake, most of which were located near the lakeshore. Rainfall intensity and external phosphorus load are high in spring and summer [52], possibly leading to overestimation in those seasons, whereas in autumn and winter, when the external phosphorus load is lower, TP is likely to be underestimated; (b) the time series of the studies: the time series of this study (2017-2022) differed from that of Xiong et al. (1996-2015). The relevant research indicates that local farmers around Dianshan Lake have extensively converted farmland into orchards in pursuit of higher economic returns. These orchards have a higher application rate of fertilizers [63], resulting in an exacerbation of pollution in Dianshan Lake. Therefore, the differences in water quality between the two study periods have led to variations in the estimated values of TP concentration. (4) Gong et al. [64] conducted a water quality survey in Tuohu Lake in 2019, with a TP concentration of 0.35 mg/L in spring, consistent with this study. The inconsistency in autumn and winter is due to Gong et al. conducting only four surveys in select months in 2018. Field surveys can be affected by sudden water quality events, and uneven distribution of remote sensing imagery within seasons can also result in inconsistent statistical results. In addition, Gong et al. established a total of eight sampling points in Taihu Lake, with five of them located near the lake shore. During autumn and winter, the TP concentration is generally lower due to the reduction in external phosphorus loading [65].
In summary, the TP concentration estimated in this study is generally consistent with the trends observed in previous studies, confirming its scientific validity and reliability. These findings offer invaluable guidance for the monitoring of TP concentration in lakes within the Yangtze-Huaihe region.
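
The two agreement measures quoted in this comparison, a relative error between period-mean TP concentrations and a correlation between two monthly series, can be computed as in the following sketch. The text does not state which correlation coefficient was used; Pearson's r is assumed here, and the function names are hypothetical.

```python
from math import sqrt


def relative_error(estimate, reference):
    """Absolute difference between two values as a fraction of the reference."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return abs(estimate - reference) / abs(reference)


def pearson_r(a, b):
    """Pearson correlation coefficient between two equally long series."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_a * var_b)
```

For example, `relative_error(0.21, 0.20)` evaluates to approximately 0.05, that is, a 5% difference between an estimated and a reference mean (the values are illustrative).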

Conclusions
In this study, a TP concentration estimation model applicable to large regions was developed using synchronous satellite observations and the XGBoost algorithm, with a focus on the lake group in the Yangtze-Huaihe region. The model utilised nine input variables: AFAI, b8, b4, b6, b11, b17, b14, b2, and CMI. Through cross-validation, the model achieved satisfactory fitting accuracy (R² = 0.53, RMSE = 0.08 mg/L, MAPE = 34.20%), outperforming traditional empirical models. The model was applied to OLCI images of lakes in the Yangtze-Huaihe region from 2017 to 2022. The results indicated an interannual decrease in TP concentrations for small, medium, and large lakes, with reductions of 0.02 mg/L, 0.01 mg/L, and 0.01 mg/L, respectively. Furthermore, regarding seasonal variations, the TP concentration in large lakes exhibited two distinct phases: "high in spring/summer and low in autumn/winter" from 2017 to 2019, and "high in summer/autumn and low in spring/winter" during 2020-2022. Lakes of other sizes showed a consistent pattern of "high in spring/summer and low in autumn/winter". Additionally, the smaller lakes had higher TP concentrations. The removal of enclosed nets was found to be beneficial for improving water quality in large lakes. However, for medium and small lakes, the discharge of domestic and industrial wastewater in the vicinity resulted in insignificant changes in the TP concentrations. Therefore, in addition to the ecological environment of large lakes, it is essential to focus on the governance and improvement of water quality in medium and small lakes.
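
For readers who wish to reproduce the three accuracy metrics quoted above, the following sketch gives their usual definitions. This section does not spell out the formulas, so the definitions are assumptions: R² is taken as the coefficient of determination (it could alternatively be the squared correlation coefficient), and MAPE is expressed in percent. The function names are hypothetical.

```python
from math import sqrt


def rmse(observed, predicted):
    """Root-mean-square error, in the units of the data (e.g. mg/L)."""
    n = len(observed)
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mape(observed, predicted):
    """Mean absolute percentage error, in percent; observations must be non-zero."""
    n = len(observed)
    return 100.0 * sum(abs(o - p) / abs(o) for o, p in zip(observed, predicted)) / n


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

With cross-validation, these functions would be applied to the measured TP values and the out-of-fold predictions rather than to the training fit.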
This study demonstrates the potential application of machine learning algorithms for estimating TP concentrations in lakes of the Yangtze-Huaihe region using remote sensing satellite data and in situ measurements. In future research, the establishment of a virtual constellation combining satellites, such as Landsat-TM/ETM/OLI and MODIS, can be considered, in order to further expand the research scale in terms of time and space, enabling comprehensive monitoring and analysis of the global nutrient status of lakes.

Table 1. Summary of Sampling Point Information.

Table 2. Approach of input variables.

Table 3. Statistics of OLCI images in the Yangtze-Huaihe region from 2017 to 2022 (unit: scenes).

Table 5. Comparison of TP Concentrations in Typical Lakes.
