1. Introduction
Water quality assessment is a critical aspect of environmental monitoring, directly influencing both ecological integrity and human health [1]. It involves evaluating a range of physical, chemical, and biological parameters [2]. Key physical indicators such as temperature and pH significantly influence chemical reactions and biological processes in aquatic environments. For example, elevated temperatures can lower dissolved oxygen (DO) levels and alter metal speciation [3], while pH variations affect the solubility and toxicity of heavy metals.
Beyond these factors, chemical and biological contaminants such as heavy metals, pathogens, and organic pollutants pose serious threats to freshwater systems [4,5]. Effective monitoring of these pollutants is essential for maintaining water quality and ecosystem services. From a remote sensing standpoint, water quality parameters are categorized into optically active (e.g., turbidity and chlorophyll) and optically inactive (e.g., biological oxygen demand (BOD) and total dissolved solids (TDS)) groups [6]. This distinction guides the methods used for large-scale water quality assessments of lakes and rivers.
Traditional field-based monitoring methods, although precise, are labor-intensive, costly, and spatially limited [7]. Remote sensing has emerged as a valuable complement, offering wide-area coverage and high temporal frequency [8,9]. Multispectral satellite imagery enables the detection of optically active parameters based on changes in surface reflectance [10]. However, it falls short in estimating optically inactive parameters, limiting its standalone applicability. With increasing urbanization, lakes in urban and peri-urban regions face escalating pollution pressures from both domestic and industrial runoff [11]. Urbanization contributes heavy metals, microplastics, and untreated sewage, while agricultural activities near urban zones add fertilizers and pesticides to the mix [12]. These pressures contribute to eutrophication, oxygen depletion, and degradation of aquatic ecosystems, reinforcing the need for advanced monitoring strategies.
Urban lakes are vital for biodiversity, ecological balance, and the provision of ecosystem services such as groundwater recharge, microclimate regulation, and recreation. However, many of these water bodies are degrading due to rapid urbanization and inadequate waste management. Therefore, maintaining long-term environmental sustainability and aligning with international goals such as the UN Sustainable Development Goals (SDGs), especially SDGs 6 and 11, depends heavily on monitoring and managing urban lake water quality. By assessing water quality in select urban lakes using remote sensing methods and identifying patterns to guide targeted conservation and restoration, this study contributes directly to these sustainability efforts.
Recent advancements have leveraged machine learning (ML) to enhance water quality prediction using satellite, meteorological, and land use data [13,14,15,16,17,18]. These models can capture nonlinear relationships between environmental variables and water quality parameters (WQPs). Yet, challenges remain in generalizing across geographic regions and integrating diverse data types, especially when working with limited field observations. This study addresses these gaps by proposing a multi-source, ML-based framework to estimate three key WQPs: turbidity, total dissolved solids (TDS), and biological oxygen demand (BOD) in urban and peri-urban lakes. Field observations from three lakes in West Bengal (Rabindra Sarovar, Mirikh Lake, and Hanuman Ghat Lake) were integrated with Landsat-8 satellite imagery, meteorological variables (e.g., temperature, rainfall, and wind speed), and land use data.
To evaluate the influence of input data types on model performance, three scenarios were developed. Scenario 1 used only remote sensing spectral indices. Scenario 2 combined spectral indices with meteorological variables. Scenario 3 integrated spectral indices, meteorological variables, and land use features. By comparing model performance across these scenarios, the study aims to assess the relative importance of each data type and examine how environmental context affects model accuracy.
2. Study Area
This study focuses on three selected lakes located in different regions of West Bengal, India, each representing distinct environmental settings and anthropogenic influences. These lakes, Rabindra Sarovar, Mirikh Lake, and Hanuman Sagar, were chosen based on the availability of water quality data and their contrasting geographical and land use characteristics. The spatial distribution of these lakes is illustrated in
Figure 1.
Rabindra Sarovar, situated in the heart of Kolkata, is one of the most prominent urban lakes in West Bengal. Constructed by the Calcutta Improvement Trust in the early 20th century to support city expansion, the lake holds national significance. It spans approximately 73 acres of water surface, with a maximum length of 1770 m and a width of 286 m, surrounded by about 119 acres of open green space. Due to its location in a densely populated urban area, the lake is particularly vulnerable to pollution from stormwater runoff, domestic sewage, and recreational activities.
For Rabindra Sarovar (Kolkata), a comparative physicochemical study (2012) measured pH, conductivity, turbidity, and dissolved oxygen, reporting DO levels around 7.3 mg/L and relatively low TDS compared to the nearby Santragachhi Jheel. More recent surveys by the WBPCB and subsequent reports highlight severe eutrophication, algal blooms, silt accumulation, and elevated heavy metal sedimentation (notably zinc and magnesium), along with decreasing depth and increasing shallow zones between 2022 and 2025 [19].
Mirikh Lake is located in the hilly Sikkim-Darjeeling Himalayan region, specifically in the Kurseong Subdivision of Darjeeling district. Positioned at an altitude of 1767 m, the lake covers an area of approximately 1.12 km² (110 hectares). It is primarily fed by rainfall and small perennial streams. The lake falls under the jurisdiction of the nine wards of Mirikh Municipality. The surrounding region, characterized by steep slopes and lush vegetation, offers a stark contrast to the urban environment of Rabindra Sarovar, providing a unique setting to study natural influences on water quality.
Mirik Lake has been extensively studied, with assessments of bacteriological parameters such as total bacterial count, total and faecal coliforms, and faecal streptococci across seasons, revealing levels well above safe limits for recreational use [
20].
Hanuman Sagar (commonly referred to as Hanuman Ghat Lake) is located in Tarekeshwar, in the Hooghly district of West Bengal. It is a small suburban lake with a surface area of approximately 0.018 km², surrounded by agricultural fields and semi-urban settlements. Agricultural runoff, which carries fertilizers and pesticides, is a significant contributor to water quality degradation in this lake. Its small size and proximity to farmland make it an ideal case for analyzing the impact of agricultural activities on water bodies.
The selection of these three lakes, spanning urban, hilly, and suburban-agricultural landscapes, provides a comprehensive understanding of how different environmental settings and land use practices influence water quality dynamics. The land cover around each lake is distinct: predominantly forest at Mirikh Lake, agricultural areas at Hanuman Ghat, and urban centers at Rabindra Sarovar.
3. Materials and Methods
3.1. Methodology
The overall methodology of the study is shown in
Figure 2. To create the training dataset, the method integrates remote sensing indices with meteorological and land use data.
3.2. Remote Sensing Data
3.2.1. Water Quality Indices
Several remote sensing-based inputs were used for water quality analysis, primarily Landsat-8 OLI (Operational Land Imager) imagery. The mission was chosen for this study because of its smaller data size and more consistent time series. Spectral indices were derived using simple mathematical operations (addition, subtraction, multiplication, and division) between reflectance values of different bands. Surface reflectance bands used included SR_B1, SR_B2, SR_B3, SR_B4, SR_B5, SR_B6, SR_B7, and SR_B10 (here “SR” denotes surface reflectance). These bands were obtained from Landsat-8 Level 2, Collection 2, Tier 2 surface reflectance products using the Google Earth Engine (GEE) platform. Pairing the bands yielded 49 unique band combinations (Table 1). Applying the four operations mentioned above to each combination generated a total of 196 indices, which were used in the modeling process. In addition to these band-combination indices, the traditional water quality indices listed in Table 2 include the Normalized Difference Turbidity Index (NDTI), Normalized Suspended Matter Index (NSMI), Normalized Difference Chlorophyll Index (NDCI), and Green Blue Normalized Difference Vegetation Index (GBNDVI) (Table 2).
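The index construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that the 49 combinations are the ordered pairs of the seven OLI bands SR_B1 to SR_B7 (self-pairs included), which is one reading that reproduces the counts of 49 and 196, and it uses the standard NDTI form (red − green)/(red + green). The function names and the zero-division handling are hypothetical.

```python
import itertools
import numpy as np

# Assumption: seven OLI surface reflectance bands; 7 x 7 ordered pairs = 49.
BANDS = ["SR_B1", "SR_B2", "SR_B3", "SR_B4", "SR_B5", "SR_B6", "SR_B7"]


def safe_divide(a, b):
    """Element-wise a / b that yields NaN wherever b is zero."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    out = np.full(np.broadcast(a, b).shape, np.nan)
    np.divide(a, b, out=out, where=(b != 0))
    return out


# The four arithmetic operations applied to every band pair.
OPERATIONS = {
    "add": np.add,
    "sub": np.subtract,
    "mul": np.multiply,
    "div": safe_divide,
}


def band_pair_indices(reflectance):
    """Return {index_name: values} for every ordered band pair and operation.

    `reflectance` maps each band name in BANDS to an array of reflectance
    values sampled at the same locations.
    """
    indices = {}
    for b1, b2 in itertools.product(BANDS, repeat=2):
        for op_name, op in OPERATIONS.items():
            indices[f"{b1}_{op_name}_{b2}"] = op(reflectance[b1], reflectance[b2])
    return indices


def ndti(red, green):
    """Normalized Difference Turbidity Index: (red - green) / (red + green)."""
    red = np.asarray(red, dtype=float)
    green = np.asarray(green, dtype=float)
    return safe_divide(red - green, red + green)
```

For Landsat-8 OLI, the red and green inputs to `ndti` would normally be SR_B4 and SR_B3, respectively.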
3.2.2. Meteorological Data
Meteorological and land use data were essential in complementing the remotely sensed data for this study. Key meteorological variables, including precipitation, temperature, and wind speed, were collected to provide additional context for understanding the environmental conditions affecting water quality. Precipitation data were sourced from the Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS), with a spatial resolution of 5 km. Wind speed data were obtained from NASA’s POWER (Prediction of Worldwide Energy Resource) project, while temperature information was derived from MODIS (Moderate Resolution Imaging Spectroradiometer) data. Details are shown in
Table 3.
The field observations extracted from the West Bengal Pollution Control Board portal were filtered, and the sampling date recorded with each observation was matched to the Landsat-8 acquisition dates. Many field observations had no satellite observation on the same day. In these cases, the satellite pass nearest to the ground observation was used, provided it fell within 2 days before or after the field sampling date. This matching approach significantly increased the number of usable data points, helping improve model performance.
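The date-matching rule can be illustrated with a short sketch. The function name and the tie-breaking behaviour (the first of two equally distant passes wins) are assumptions made for illustration; only the 2-day window comes from the text.

```python
from datetime import date


def nearest_pass(field_date, pass_dates, max_gap_days=2):
    """Return the satellite pass date closest to field_date.

    Returns None when no pass falls within max_gap_days before or after
    the field sampling date. On ties, the first pass in pass_dates wins.
    """
    best, best_gap = None, None
    for d in pass_dates:
        gap = abs((d - field_date).days)
        if gap <= max_gap_days and (best_gap is None or gap < best_gap):
            best, best_gap = d, gap
    return best


# Hypothetical pass dates 16 days apart, as for a single Landsat-8 path/row.
passes = [date(2020, 1, 1), date(2020, 1, 17)]
matched = nearest_pass(date(2020, 1, 3), passes)      # within 2 days of 1 Jan
unmatched = nearest_pass(date(2020, 1, 10), passes)   # no pass within 2 days
```

Field observations for which the function returns None would be dropped from the training set.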
3.2.3. LULC Data (Land Use and Landcover)
For land use data, information on land cover types, such as urban, agricultural, and forest areas, was sourced from ESRI’s land use/land cover dataset at a resolution of 10 m. These data were further analyzed to perform buffer zone analysis around the study sites, focusing on areas within a 1000 m radius from the center of the lakes. The details of all of the datasets are shown in
Table 3.
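As a rough illustration of the buffer analysis, the sketch below computes the share of each land cover class within a circular buffer around a lake center. It assumes the land cover raster has already been read into a 2-D array of integer class codes on a projected grid with square 10 m pixels; the function name and the class codes are hypothetical, and the way the study summarized the buffer (for example, as class fractions) is an assumption.

```python
import numpy as np


def class_fractions_in_buffer(lulc, center_rc, radius_m=1000.0, pixel_m=10.0):
    """Fraction of each land cover class within radius_m of a center pixel.

    lulc      : 2-D integer array of class codes (one value per pixel)
    center_rc : (row, col) of the lake center in pixel coordinates
    """
    rows, cols = np.indices(lulc.shape)
    # Distance of every pixel center from the lake center, in metres.
    dist_m = np.hypot(rows - center_rc[0], cols - center_rc[1]) * pixel_m
    inside = lulc[dist_m <= radius_m]
    codes, counts = np.unique(inside, return_counts=True)
    return {int(c): n / inside.size for c, n in zip(codes, counts)}
```

The resulting fractions (one value per class and lake) could then be appended to each sample as land use features.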
Together, these meteorological and land use variables, as well as remotely sensed data, were integrated to create a comprehensive dataset for modeling and analyzing water quality parameters. Combining these data sources allowed for a better understanding of the factors influencing water quality in the study areas.
3.3. Field Measurements
Field-based water quality data were obtained from the official portal of the West Bengal Pollution Control Board (WBPCB) (
http://emis.wbpcb.gov.in accessed on 2 February 2025). This platform provides long-term observational records of 28 water quality parameters from various monitoring stations across the state. The current study selected three key parameters: biological oxygen demand (BOD), total dissolved solids (TDS), and turbidity.
These parameters were selected based on their availability across all three study lakes, relevance to freshwater ecosystem health, and potential for indirect estimation using remote sensing data. This is possible because these parameters often correlate well with optically active components, such as suspended solids or organic matter, which influence surface reflectance in specific spectral bands. In contrast, parameters such as bacterial load, nitrogen, and phosphorus are not detectable using remote sensing techniques due to their lack of a discernible spectral signature and variability in natural water bodies.
The temporal field measurements for the three selected water quality parameters, along with their corresponding dates of acquisition, were extracted and later used to match satellite data from the same time windows. A summary of the statistical characteristics of the collected water quality parameters across the three lakes is presented in
Table 4.
3.4. Modeling, Prediction, and Evaluation
3.4.1. Principal Component Analysis (PCA)
The integration of spectral indices, original spectral bands, land cover classifications, and meteorological variables led to a high-dimensional dataset with considerable inter-variable correlations. To address multicollinearity, reduce redundancy, and improve computational efficiency, principal component analysis (PCA) was employed. PCA is a statistical technique that transforms a set of correlated variables into a new set of uncorrelated variables known as principal components (PCs). Each component is a linear combination of the original input variables, and the components are ordered according to the amount of variance they explain within the dataset.
Before applying PCA, all input variables were standardized to have zero mean and unit variance. The PCA transformation was applied to the standardized feature matrix, and only the top $k$ principal components, which captured the majority of the total variance, were selected for further analysis. This step ensured that the most informative patterns in the data were retained while discarding noise and less relevant dimensions. The transformed dataset in the reduced $k$-dimensional space is given by
$$Z = X W_k$$
where $X$ is the standardized feature matrix, $W_k$ is the matrix of the top $k$ eigenvectors, and $Z$ is the transformed data in the new $k$-dimensional space.
This reduced feature set was then used as input to the machine learning models to enhance generalization, minimize overfitting, and improve model training efficiency.
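A minimal NumPy sketch of this step is given below: standardize the features, take the eigenvectors of the covariance matrix, keep the smallest number of components whose cumulative explained variance reaches a target, and project the data onto them. The 80% target and the function name are assumptions for illustration; the study may have used a library implementation and a different selection rule.

```python
import numpy as np


def pca_reduce(X, variance_target=0.80):
    """Project X onto the top-k principal components.

    k is the smallest number of components whose cumulative explained
    variance ratio reaches variance_target. Expects at least two features.
    Returns (Z, W_k, explained_variance_ratio).
    """
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0                      # constant columns become all zeros
    Xs = (X - X.mean(axis=0)) / std          # zero mean, unit variance

    cov = np.cov(Xs, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric matrices
    order = np.argsort(eigvals)[::-1]        # sort by descending variance
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]

    ratio = eigvals / eigvals.sum()
    cumulative = np.cumsum(ratio)
    k = min(int(np.searchsorted(cumulative, variance_target)) + 1, X.shape[1])

    W_k = eigvecs[:, :k]                     # top-k eigenvectors as columns
    Z = Xs @ W_k                             # data in the k-dimensional space
    return Z, W_k, ratio
```

The columns of the returned `Z` are mutually uncorrelated, which is the property that removes multicollinearity from the model inputs.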
3.4.2. Machine Learning Models
This study employed several machine learning models for predictive analysis, including Random Forest, XGBoost (eXtreme Gradient Boosting), and Decision Tree regression.
Random Forest: Random Forest is an ensemble method that combines multiple decision trees to make predictions [25]. Each tree in the forest is built based on a random subset of data and features. The random vector sampled independently for each tree ensures diversity in the trees, and the collective output of all trees is used for final predictions. The final prediction $\hat{y}$ is typically the majority vote (for classification) or the average (for regression) of the outputs $h_t(x)$ from all $T$ trees. For regression,
$$\hat{y} = \frac{1}{T}\sum_{t=1}^{T} h_t(x)$$
As more trees are added, the model’s performance improves, and its generalization error decreases, stabilizing as the number of trees increases. The strength of individual trees and their correlations impact the model’s overall accuracy.
XGBoost: Extreme Gradient Boosting (XGBoost) is a highly efficient, scalable algorithm used to optimize the performance of Gradient Boosting Machines (GBMs) [26]. It accelerates the training process and improves prediction accuracy by leveraging parallel and distributed computing. Known for its excellent predictive performance, XGBoost has been successfully applied across various fields. It remains a top choice for many machine learning tasks due to its versatility and speed.
XGBoost builds trees sequentially, where each new tree $f_k(x)$ corrects the errors of the previous ones. The overall prediction is the sum of the outputs of all $K$ trees:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$
Each tree $f_k$ is trained to minimize a regularized objective function that balances model accuracy and complexity:
$$\mathcal{L} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
where $l$ is a loss function (such as mean squared error) and $\Omega$ is a regularization term.
Decision Tree: A decision tree is a straightforward, interpretable algorithm that uses a tree-like structure to model decisions based on feature attributes [27]. In both regression and classification tasks, decision trees split data into subsets using decision nodes governed by conditional rules. Each path from the root to the leaf represents a sequence of decisions, with the leaf node providing the final prediction. Training a decision tree involves learning these decision rules from the input data.
The decision at each node is based on selecting the feature $j$ and threshold $s$ that best split the data to minimize a criterion like Gini impurity (for classification) or variance (for regression). For example, Gini impurity at a node is given by the following:
$$G = 1 - \sum_{k} p_k^2$$
where $p_k$ is the proportion of samples belonging to class $k$ at that node.
These models were employed to analyze and predict water quality parameters, with each having its unique strengths in handling complex datasets. Before modeling, principal component analysis was used to reduce the dimensionality of the inputs and the redundancy between the variables.
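The two split criteria mentioned above can be made concrete with a small sketch. It computes Gini impurity for a set of class labels and, for regression, searches a single feature for the threshold that minimizes the size-weighted variance of the two child nodes. Function names are hypothetical, and a full tree would repeat this search over all features and recurse on each child.

```python
import numpy as np


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 for the class labels at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))


def best_variance_split(x, y):
    """Best threshold on one feature for a regression split.

    Candidate thresholds are midpoints between consecutive distinct values
    of x. The score is the size-weighted variance of the two children.
    Returns (threshold, score); threshold is None if x has one distinct value.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    best_s, best_score = None, np.inf
    values = np.unique(x)
    for lo, hi in zip(values[:-1], values[1:]):
        s = (lo + hi) / 2.0
        left, right = y[x <= s], y[x > s]
        score = (left.size * left.var() + right.size * right.var()) / y.size
        if score < best_score:
            best_s, best_score = s, score
    return best_s, best_score


# Two well-separated groups: the best threshold falls between 3 and 10.
threshold, score = best_variance_split([1, 2, 3, 10, 11, 12], [1, 1, 1, 5, 5, 5])
```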
3.4.3. Model Evaluation
For model validation, two common regression metrics, R-squared (R²) and Root Mean Square Error (RMSE), were used to assess the performance of the machine learning models. R-squared measures the proportion of variance in the observed data that can be explained by the model. Values typically range from 0 to 1, with higher values indicating better model performance; on test data, values below 0 can occur when a model predicts worse than the mean of the observations:
$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$
where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, $\bar{y}$ is the mean of actual values, and $n$ is the number of observations.
The RMSE measures the average magnitude of prediction errors, indicating how closely predictions align with the actual values. It is defined as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$
Lower RMSE values indicate better model accuracy, reflecting smaller deviations between predicted and actual values.
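Both metrics are straightforward to compute directly from their definitions. The sketch below is illustrative only; the function names are hypothetical.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def rmse(y_true, y_pred):
    """Root mean square error between observations and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


observed = [1.0, 2.0, 3.0, 4.0]
shifted = [2.0, 3.0, 4.0, 5.0]    # every prediction is off by one unit
reversed_ = [4.0, 3.0, 2.0, 1.0]  # worse than predicting the mean
```

Here `rmse(observed, shifted)` is 1 and `r_squared(observed, shifted)` is 0.2, while `r_squared(observed, reversed_)` is negative, showing that R² is not bounded below by 0 on held-out data.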
A typical train–test split technique was used to evaluate the model, randomly dividing the dataset into training (80%) and testing (20%) subsets. To determine the models’ capacity for generalization, they were first trained on the training data and then tested on the unseen test data. Additionally, cross-validation was employed during model training to adjust hyperparameters and lower the chance of overfitting.
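The evaluation protocol can be sketched as index bookkeeping: a random 80/20 split, followed by k-fold cross-validation restricted to the training portion so that the test set is never seen during hyperparameter tuning. The number of folds (5) and the random seed are assumptions, since the text does not state them, and the function names are hypothetical.

```python
import numpy as np


def train_test_indices(n_samples, test_fraction=0.2, seed=0):
    """Randomly split sample indices into (train, test) arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return order[n_test:], order[:n_test]


def k_fold_indices(train_idx, k=5):
    """Yield (fit, validation) index arrays over the training set only."""
    folds = np.array_split(train_idx, k)
    for i in range(k):
        fit = np.concatenate([folds[j] for j in range(k) if j != i])
        yield fit, folds[i]


train_idx, test_idx = train_test_indices(100)
cv_folds = list(k_fold_indices(train_idx, k=5))
```

Each candidate hyperparameter setting would be scored by its mean validation error over `cv_folds`, and only the selected model would be evaluated once on `test_idx`.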
Three different input feature configurations were examined in terms of performance metrics: (i) using spectral indices alone, (ii) using spectral indices in conjunction with meteorological data, and (iii) using spectral indices, meteorological data, and land use data.
3.4.4. Spatiotemporal Analysis
For spatiotemporal analysis, satellite images from 2024 were used. The input variables were extracted, including band indices values, land cover information, and meteorological data, for the three lakes. Random points, according to the coverage of satellite image pixels, were generated to extract these data from the lake’s surface. These were used as input to the trained model to estimate the turbidity, TDS, and BOD. The estimation was performed every month. Each month, one cloud-free image was selected for use as input data. The Inverse Distance Weighted (IDW) interpolation method was used to create the spatial maps of the predicted water quality parameters using the given points. For each lake, the number of points varied due to the varying size of the lakes. For Hanuman Ghat Lake, 10 points, for Mirikh Lake, 118 points, and for Rabindra Sarovar Lake, 194 points were used for interpolation. During analysis, if there were no cloud-free images available in a month, that month was not considered.
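The interpolation step can be illustrated with a basic IDW implementation. This is a sketch under assumptions: the distance exponent (power = 2) and the use of all known points for every query location are common defaults, not settings reported in the text, and the function name is hypothetical.

```python
import numpy as np


def idw_interpolate(xy_known, values, xy_query, power=2.0):
    """Inverse Distance Weighted interpolation.

    xy_known : (n, 2) coordinates of points with predicted values
    values   : (n,) predicted water quality values at those points
    xy_query : (m, 2) coordinates where a value is wanted (e.g., grid cells)
    """
    xy_known = np.asarray(xy_known, dtype=float)
    values = np.asarray(values, dtype=float)
    xy_query = np.asarray(xy_query, dtype=float)

    # Pairwise distances with shape (m, n).
    diff = xy_query[:, None, :] - xy_known[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))

    result = np.empty(len(xy_query))
    for i, d in enumerate(dist):
        exact = d == 0
        if exact.any():
            # Query coincides with a known point: use its value directly.
            result[i] = values[exact].mean()
        else:
            w = 1.0 / d ** power
            result[i] = np.sum(w * values) / np.sum(w)
    return result


# Two known points; the midpoint receives the plain average of their values.
estimates = idw_interpolate([(0, 0), (2, 0)], [10.0, 20.0], [(1, 0), (0, 0), (0.5, 0)])
```

With only 10 points, as for Hanuman Ghat Lake, the interpolated surface is necessarily smooth and should be read as indicative rather than as resolving fine spatial detail.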
4. Results
4.1. Summary of the Data
4.1.1. Rainfall
Analysis of meteorological factors over the three lakes between 2014 and 2023 reveals pronounced spatial and temporal patterns. These patterns offer crucial background for understanding the observed dynamics of water quality. All three sites show strong seasonality in rainfall, with the highest amounts in the monsoon months (June–September) (
Figure 3). The high post-monsoon rainfall at Hanuman Ghat Lake, which is primarily an agricultural area, probably contributes to the high turbidity and TDS levels during this time due to runoff-induced sediment and pesticide movement [
28]. On the other hand, Mirikh Lake, situated in a mountainous area, exhibits brief but strong rainfall peaks that coincide with transient increases in turbidity, indicating rapid water influx and increased erosion potential during storm events [
29].
4.1.2. Temperature
Due to Rabindra Sarovar’s low-lying urban location, temperature profiles exhibit regular seasonal trends, with higher values in the pre-monsoon and summer months (March–May) (
Figure 4). By accelerating microbial metabolism and reducing dissolved oxygen availability, elevated temperatures are known to exacerbate BOD levels [
30]. This finding is consistent with Rabindra Sarovar’s consistently high BOD, particularly during the warmer months. However, because of its higher elevation, Mirikh Lake consistently experiences lower average temperatures, which supports its typically low BOD values and suggests less pressure from biological degradation.
The temperature plots have some data gaps in the monsoon months. These surface temperature estimates were derived from MODIS data, which contained missing values; these null values were removed during data cleanup, which caused the gaps.
4.1.3. Wind Speed
Although the trends in wind speed were more erratic, somewhat higher average speeds were seen near Mirikh Lake, which might improve natural aeration and facilitate the dispersion of pollutants (
Figure 5).
On the other hand, lower wind speeds over Rabindra Sarovar can worsen the effects of urban runoff on water quality by causing pollutant stagnation and decreased mixing. The spatial concentration of turbidity and TDS near lake borders suggests that Hanuman Ghat Lake’s rainfall events and low wind conditions likely contribute to the settling of particles close to the inflow zones [
31].
Overall, by connecting seasonal rainfall and temperature dynamics with corresponding variations in turbidity, TDS, and BOD, the meteorological patterns help support the machine learning-based water quality predictions. These connections demonstrate the importance of incorporating climate data into forecasting models to accurately depict the environmental factors that affect lake water quality.
4.2. Principal Component Analysis
Principal component analysis (PCA) was applied across all three modeling scenarios to address multicollinearity, reduce redundancy among input variables, and improve the efficiency of machine learning training. Each scenario involved different combinations of variables, ranging from only spectral indices in Scenario 1 to a comprehensive set that included spectral indices, meteorological parameters, and land use data in Scenario 3 (
Figure 6). Since Scenario 3 yielded the best model performance, its PCA results are presented herein. The scree plot shown below (
Figure 6) illustrates the cumulative variance explained by the first ten principal components. Notably, the first four components alone account for approximately 80% of the total variance in the dataset, indicating that a reduced set of principal components can effectively represent the majority of information from the original high-dimensional feature space.
These top components were used as inputs to the machine learning models to reduce the risk of overfitting and improve computational efficiency. Although PCA internally generates loadings that define how original variables contribute to each component, this study focuses only on the explained variance, as visualized in the scree plot, and does not explicitly analyze or report variable-level loadings.
4.3. Model Performance
R² and Root Mean Square Error (RMSE) were used as accuracy measures to assess how well the machine learning models predicted three important water quality parameters: turbidity, total dissolved solids (TDS), and biological oxygen demand (BOD). The findings shown in Table 5, Table 6 and Table 7 indicate that both the type of input data supplied and the algorithm employed significantly affected the model’s performance.
When all input features (spectral indices, meteorological variables, and land use data) were included (Table 5), XGBoost produced the most accurate turbidity predictions among all of the tested models. The model demonstrated strong generalization and predictive ability, as evidenced by an R² of 0.73 and an RMSE of 16.09 in this configuration. When only spectral indices were used, XGBoost’s testing R² fell to 0.21 despite a strong validation score, indicating overfitting. Accordingly, turbidity, a measure greatly impacted by environmental factors, cannot be accurately predicted from reflectance data alone and greatly benefits from contextual information, such as rainfall and land use patterns.
The Decision Tree regression model performed exceptionally well in predicting TDS when the full set of input features was used. After incorporating land use and meteorological data, the model yielded a testing R² of 0.81 and an RMSE of 81.09. This suggests that Decision Trees were able to capture the non-linear correlations between TDS and the elements that influence it, particularly in contexts where seasonal variations and agricultural runoff have an impact. For BOD estimation, Ridge regression yielded remarkable results, achieving a near-perfect testing R² of 0.99 with a negligible RMSE. In contrast to TDS and turbidity, BOD exhibited a more consistent and linear trend, which Ridge regression could accurately represent with fewer input features. This result also implies that, in comparison with the other parameters, BOD variations throughout the study lakes might be more stable and less susceptible to transient environmental changes.
The significance of combining multiple data sources to enhance model reliability and reduce overfitting is one of the main conclusions drawn from the comparison analysis. The intricacy of water quality dynamics cannot be fully captured by spectral indices alone, despite them offering crucial surface-level information. Model robustness across all three water quality indicators was substantially enhanced by incorporating meteorological characteristics, such as temperature, wind speed, rainfall, and spatial data from land use maps.
4.3.1. Spatial and Temporal Variability Analysis of WQPs
After training the model using data from all three lakes combined for the period ranging from 2014 to 2023, the trained model was used to estimate the water quality for the three lakes in 2024. Cloud-free images from the pre- and post-monsoon periods were used as input data for the estimation, and the predicted values were used to interpolate the results in the maps shown in
Figure 7,
Figure 8 and
Figure 9.
Using the trained models for each water quality parameter, cloud-free data were extracted at points over the three lakes and then used to predict the water quality parameters. One image was selected per month to capture the temporal variations in the water quality of the lakes. Cloud-free images were available for seven months for Hanuman Ghat, five months for Mirikh Lake, and four months for Rabindra Sarovar. During the pre- and post-monsoon periods, significant changes occur in the water quality parameters (Figure 7, Figure 8 and Figure 9).
Important information about the geographical distribution and seasonal variability of water quality parameters, including turbidity, TDS, and BOD, across the three research lakes can be found in the interpolated maps produced for 2024 using the trained machine learning models. These spatial patterns illustrate the influence of nearby land use and seasonal variations, reflecting the physical and environmental characteristics of the lakes.
In Hanuman Ghat Lake, there are noticeable hotspots at the southern and southeastern borders, and turbidity levels remain high during the post-monsoon months (November–December). This trend is consistent with monsoon-driven surface runoff from surrounding agriculture, which likely brings fertilizers and eroded soil into the lake, increasing the amounts of suspended particles (Figure 7).
Notably, there are regular localized maxima in the lake’s turbidity and biological oxygen demand (BOD) in the lower right corner, which is home to a well-known local temple. This is probably because the lake receives organic and solid waste from ritual offerings, washing operations, and heavy foot traffic. Similarly, there are higher BOD and TDS levels around the lake’s bottom edge, which is dotted with eateries and food vendors, especially in January, May, and December. The organic and chemical contamination observed in this area is likely caused, in part, by the improper disposal of wastewater and food scraps from these businesses.
There are sporadic hotspots in total dissolved solids (TDS), particularly in January, May, and December, located near the southern and western shorelines. Urban drainage, wastewater effluent from the lake’s commercial and religious operations, and agricultural runoff could all contribute to these quantities. Due to deeper water and improved mixing, which dilute pollutant concentrations away from point-source inputs, the central portion of the lake remains relatively cleaner, consistently exhibiting lower values for all metrics.
Mirikh Lake has noticeable temporal variations, particularly in turbidity, which peaks in February, March, and December (
Figure 8). The lake is situated in a hilly area with a comparatively lower anthropogenic load. During periods of heavy rainfall or land disturbance, this pattern is consistent with natural erosion and sediment intake from the steep surrounding terrain. Throughout the year, turbidity hotspots shift locations, although they typically appear near the lake’s edge, most often where small rivulets or inflow channels release silt. With very little variation in localized locations over the months, particularly in March and December, TDS concentrations stay quite constant and low. This is in line with a system that is primarily supplied by rainfall and has little exposure to dissolved chemical inputs.
The distribution of BOD levels is more intricate. Elevated BOD levels along the lake’s northern and southeast arms are observed in February and March; these readings may be related to areas where tourists congregate or to the inflow of organic waste from streams. Nonetheless, the lake’s ecological condition is generally rather good, as seen by the low levels of human pollution and the robust buffering vegetation that surrounds the majority of its shoreline.
Rabindra Sarovar’s predicted maps (
Figure 9) frequently display high BOD levels, especially close to the lake’s edge, which is heavily encircled by urban infrastructure. The maps show slight seasonal fluctuation, suggesting that organic contamination is persistent and probably caused by continuous flows from recreational and residential sources. Although the turbidity and TDS levels are lower than at Hanuman Ghat, they exhibit localized peaks close to stormwater outlets and park visitor trails, suggesting that urban runoff and localized sediment disturbance may be the reasons. In contrast with Hanuman Ghat, urban surface pollution and potentially inadequate water circulation are the leading causes of the geographic variance here, rather than agricultural runoff.
The seasonal and spatial fluctuations in the predicted water quality metrics can be related to both natural processes and human activities. Increased surface runoff is the leading cause of elevated turbidity and TDS readings during the post-monsoon months, particularly in Hanuman Ghat and Mirikh Lake. Monsoonal rainfall moves sediments, fertilizers, and agricultural chemicals from surrounding fields and urban areas into water bodies, increasing the concentrations of suspended and dissolved pollutants [
18]. During pre-monsoon months, on the other hand, water levels are usually lower and surface runoff and sediment influx are reduced, which improves the water’s clarity and quality [
1].
Localized sources of contamination are reflected in the spatial heterogeneity across the lakes, where pollutant concentrations are higher at the edges than in the central zones. In urban lakes such as Rabindra Sarovar, stormwater drains, household sewage, and recreational areas discharge directly into the lake periphery, resulting in persistently elevated BOD levels, particularly close to entry points [
17,
32]. These fringe zones are more vulnerable to human influences and may lack adequate protection from buffer vegetation. In contrast, the lakes’ central regions often have lower pollutant concentrations because their deeper water and better circulation dilute and disperse pollutants [
11].
The post-monsoon increase in turbidity and TDS in Hanuman Ghat Lake indicates the impact of the nearby agricultural area. In line with findings from other rural water bodies impacted by agricultural operations, runoff from neighboring farms probably contains phosphates, nitrates, and sediments [
11]. Mirikh Lake, despite its comparatively natural setting, receives sediment input during periods of heavy rainfall, especially near stream inflow zones, reflecting the role of natural erosion and slope-driven transport in highland lakes [
32].
The turbidity, TDS, and BOD water quality metrics for January are superimposed on the relevant land use/land cover (LULC) maps of each lake in
Figure 10. Surrounded by dense urban territory, Rabindra Sarovar consistently exhibits high BOD and turbidity levels near its edge, indicating ongoing organic and particulate pollution from surface runoff and untreated sewage. Higher turbidity and TDS are found in Hanuman Ghat Lake, which is surrounded by agricultural areas. This is consistent with runoff from agrochemicals, fertilizers, and sediments. On the other hand, Mirikh Lake, which is mostly encircled by forest, shows noticeably lower levels of pollutants in all three categories, indicating less human impact. These spatial patterns underscore the significance of nearby land use in shaping water quality dynamics and emphasize the need to integrate land and water management techniques in environmental monitoring and urban planning.
4.3.2. Implications for Urban Water Management
This analysis used Landsat 8 imagery with a spatial resolution of 30 m per pixel. This coarse resolution may not be appropriate for smaller water bodies in urban areas, for which higher-resolution multispectral data are needed. These urban water bodies face severe water quality issues due to human activities in their vicinity. Urban runoff carries many pollutants, and agricultural runoff, loaded with phosphate and nitrate compounds, severely degrades water quality. Of the three lakes, Hanuman Ghat Lake is the one located near farming areas.
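To make the resolution constraint concrete, the following minimal sketch estimates how many whole pixels fit inside a water surface at a given pixel size. The 30 m and 3 m pixel sizes come from the text (Landsat 8 and PlanetScope); the 5-hectare pond area and the function name `approx_pixel_count` are hypothetical and do not describe any of the study lakes.

```python
import math


def approx_pixel_count(lake_area_m2: float, pixel_size_m: float) -> int:
    """Return the number of whole square pixels that fit in a given surface area.

    This is an upper bound on usable water pixels: pixels along the shoreline
    mix land and water signals and are often discarded in practice.
    """
    if lake_area_m2 < 0 or pixel_size_m <= 0:
        raise ValueError("area must be non-negative and pixel size positive")
    return math.floor(lake_area_m2 / (pixel_size_m ** 2))


# Hypothetical small urban pond of 5 hectares (50,000 m^2); illustration only.
pond_area_m2 = 50_000.0

landsat_pixels = approx_pixel_count(pond_area_m2, 30.0)     # 30 m pixels
planetscope_pixels = approx_pixel_count(pond_area_m2, 3.0)  # 3 m pixels

print("30 m pixels:", landsat_pixels)      # 55
print("3 m pixels:", planetscope_pixels)   # 5555
```

Under these assumptions, a 5-hectare pond is covered by only about 55 Landsat 8 pixels before any mixed shoreline pixels are removed, which illustrates why narrow inlets and small urban water bodies are difficult to map at 30 m.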
5. Discussion
This study demonstrates the integration of remote sensing data, meteorological factors, and land use information to predict important water quality parameters (WQPs) in various urban and peri-urban lake systems. It provides new insights into the temporal and spatial dynamics of biological oxygen demand (BOD), total dissolved solids (TDS), and turbidity in three distinct lakes: Hanuman Ghat Lake, Mirikh Lake, and Rabindra Sarovar.
One of the key findings is that machine learning models perform significantly better when supported by multi-source data [
2]. Models trained using only spectral indices showed strong validation performance but failed to generalize well during testing, likely due to overfitting and insufficient environmental context. Conversely, when land use and meteorological data were incorporated, the models demonstrated enhanced robustness and higher predictive accuracy. This supports the hypothesis that environmental conditions and anthropogenic land use practices significantly influence water quality and must be considered in modeling efforts.
Each lake displayed a distinct pollution profile that mirrored the patterns of land use in its surroundings. Embedded in a dense urban environment, Rabindra Sarovar consistently displayed elevated BOD values, indicating persistent organic pollution most likely caused by untreated household waste and restricted water flow. Hanuman Ghat Lake, situated in a semi-urban agricultural landscape, exhibited significant seasonal variability, particularly in turbidity and TDS, which peaked during the post-monsoon period due to increased agricultural runoff. Mirikh Lake, situated in a comparatively pristine hilly area, occasionally experienced increases in turbidity, which may be attributed to natural sediment transport and rainfall disturbances rather than human activity.
Compared to similar studies, such as Bormudoi et al. (2022) [
13], this study extends the application of remote sensing-based water quality modeling. Bormudoi et al. [
13] used regression and artificial neural network (ANN) models to estimate turbidity and TDS in Deepor Beel Lake using Landsat-8 data, achieving R² values of 0.83 for turbidity and 0.87 for TDS with ANN, although the performance of linear regression models was slightly lower.
In contrast, the present study demonstrates that integrating spectral indices with land use and meteorological variables can improve predictive performance over spectral-only models across multiple lakes. For example, this study achieved an R² of 0.81 for TDS (using a Decision Tree) and 0.73 for turbidity (using XGBoost) in diverse lake environments. Notably, this work also models BOD, traditionally considered difficult to estimate via remote sensing alone, using Ridge Regression, achieving an R² of 0.99.
By integrating machine learning with multi-source data, the present study advances previous remote sensing-based assessments of water quality. For example, the authors of [
14] employed deep learning models to evaluate lake water quality in China and reported R² values of approximately 0.85 for parameters such as turbidity and chlorophyll-a, although their approach was limited by single-source input data and lower interpretability. Similarly, the authors of [
15] used machine learning techniques to analyze spatiotemporal variations in Hulun Lake, achieving turbidity R² values up to 0.79 with Sentinel-2 imagery, but without incorporating meteorological or land use variables. In comparison, our study achieved R² values of 0.73 for turbidity and 0.81 for TDS, broadly comparable to those of previous studies, while additionally estimating BOD with high accuracy (R² = 0.99) using Ridge Regression. The successful prediction of BOD, which is generally considered optically inactive, highlights the benefit of integrating meteorological and land-cover data to enhance model performance.
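Because the comparison above rests entirely on R² values, a minimal sketch of how the coefficient of determination is typically computed may be useful. It assumes the standard definition, R² = 1 − SS_res / SS_tot; the function name `r_squared` and the sample values are hypothetical and are not data from this study.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: squared error of the model predictions.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: squared deviation of observations from their mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical in situ measurements and model predictions (illustration only).
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]

score = r_squared(observed, predicted)
print(round(score, 3))  # 0.95

# A model that always predicts the observed mean scores exactly 0.
baseline = r_squared(observed, [5.0, 5.0, 5.0, 5.0])
print(baseline)  # 0.0
```

Note that R² is computed relative to the variance of the observations in each dataset, so values obtained on different lakes, sensors, or sample sizes are only loosely comparable, and estimates based on very few samples can be unstable.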
This study introduces two methodological advancements. First, it improves model robustness by combining contextual meteorological and land use information with remote sensing indices. Second, it addresses multicollinearity and enhances computational efficiency through dimensionality reduction using principal component analysis (PCA). Although the models performed well for each lake, generalizability may be affected by variations in the number of ground-truth observations (e.g., Rabindra Sarovar with five samples versus Hanuman Ghat with sixteen). Ensuring more frequent and balanced sampling in future work will further strengthen the temporal stability and spatial transferability of the models.
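The way PCA collapses correlated predictors into fewer components can be illustrated with a small, dependency-free sketch. This is not the study's implementation: it extracts only the first principal component by power iteration on the covariance matrix, and the two feature columns are hypothetical stand-ins for a pair of strongly correlated spectral indices.

```python
import math


def center(rows):
    """Subtract the column mean from every value."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix (divides by n - 1) of centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def mat_vec(matrix, vector):
    """Multiply a square matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def first_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of cov via power iteration.

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    v = [1.0 / (j + 1) for j in range(len(cov))]
    for _ in range(iterations):
        w = mat_vec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("power iteration collapsed to the zero vector")
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the variance it explains.
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
    return v, eigenvalue


# Hypothetical values of two strongly correlated features (illustration only).
features = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]

cov = covariance(center(features))
component, variance = first_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = variance / total_variance

print("direction:", [round(c, 3) for c in component])
print("explained variance ratio:", round(explained, 4))
```

In this toy case a single component captures more than 99% of the total variance, so the two collinear inputs could be replaced by one derived feature. A production workflow would more likely use an established library routine that returns all components at once.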
This study enhances predictive performance by incorporating contextual data (land cover and weather) into the modeling framework. It expands on earlier remote sensing-based WQP assessments, as evidenced by comparing the results with prior literature. This aligns with previous studies that highlight how machine learning can enhance conventional water quality monitoring by bridging temporal and spatial gaps, particularly in areas with inadequate field infrastructure.
6. Conclusions
This work presents a comprehensive method for assessing the water quality of urban and peri-urban lakes by integrating advanced machine learning models with data from remote sensing, weather, and land use. By focusing on three key parameters, namely turbidity, total dissolved solids (TDS), and biological oxygen demand (BOD), the study demonstrates that combining multi-source data, rather than relying solely on spectral indices, significantly enhances predictive accuracy.
A spatial and temporal analysis of water quality across three distinct lake environments, hilly (Mirikh Lake), urban (Rabindra Sarovar Lake), and agricultural–suburban (Hanuman Ghat Lake), shows that land use patterns and seasonal variations significantly affect water quality dynamics. By detecting both chronic and episodic pollution events, the models provided important insights into seasonal variations and localized pollution sources.
The results underscore the importance of incorporating environmental context into evaluations based on remote sensing. Additionally, they highlight how machine learning can help fill data gaps and facilitate real-time water quality monitoring. The noted constraints indicate that higher-resolution imagery and larger parameter sets could further improve predictive performance, even though medium-resolution satellite data were helpful.
The substantial variation in the number of ground-truth observations among the studied lakes (Hanuman Ghat: 16, Mirik: 14, and Rabindra Sarovar: only 5) may influence the model’s ability to generalize across different water bodies. Lakes with fewer samples, such as Rabindra Sarovar, may yield less reliable predictions due to limited training data, particularly in complex urban settings. Although the multi-lake training approach helps enhance overall model robustness, future work should aim for more balanced and comprehensive sampling to improve lake-specific prediction accuracy and further strengthen model generalizability.
The study’s conclusions have significant implications for the sustainable management of urban lakes. By monitoring water quality in lakes such as Rabindra Sarovar, Mirik Lake, and Hanuman Ghat Lake using remote sensing and in situ validation, the study demonstrates a scalable method for identifying pollution trends and early signs of eutrophication. These insights support evidence-based policymaking for community-level waste management, nutrient load control, and periodic dredging. Moreover, in the context of rapid urbanization and climate change, integrating such monitoring into urban planning aligns with SDG 6 (Clean Water and Sanitation) and SDG 11 (Sustainable Cities and Communities) and helps protect urban water bodies as essential recreational and ecological resources.
Predictive modeling frameworks, such as the one created in this study, are essential for facilitating real-time surveillance of pollution sources, in line with Sustainable Development Goal 6.3, which emphasizes improving ambient water quality. Spatial resolution remains a significant drawback: Landsat-8’s 30 m resolution is sufficient for large bodies of water but insufficient for capturing narrow inlets and fine-scale urban features. To better address shoreline–urban connections, future research should utilize higher-resolution datasets, such as Sentinel-2 MSI (10 m) or PlanetScope (3 m). Furthermore, by providing decision-makers with interpretable results, explainable AI (XAI) techniques, combined with IoT-based in situ sensors for continuous ground truthing, could enhance model transparency and encourage regulatory adoption.
To summarize, this study presents a reproducible and scalable approach to monitoring water quality in urban lakes in India. It encourages the creation of data-driven water management plans, which are essential for preserving freshwater supplies in rapidly urbanizing areas. For a more comprehensive assessment of water quality, future research should extend this approach to broader and more diverse geographic regions and include additional indicators such as nutrients, heavy metals, and microbial loads.