1. Introduction
Water resources provide ecosystem services of high natural and economic value for the population in general. Consequently, more than 40% of human settlements are located near coastal regions and on the shores of lotic and lentic resources [1,2]. Unfortunately, this makes these bodies of water more susceptible to pollution and overexploitation. For this reason, water quality (WQ) monitoring has become the most suitable strategy to evaluate sustainability in water management practices [3,4]. In recent years, continuous monitoring campaigns for WQ parameters have increased in various Latin American countries [5]. These initiatives aim to understand and proactively address WQ degradation by analyzing the data collected during these campaigns.
Conventional monitoring determines WQ parameters by collecting samples in the field and subsequently analyzing them in the laboratory. This makes it a highly precise technique, but its complexity increases in large bodies of water. The work is laborious and time-consuming, which raises costs beyond what governments in many poor or developing countries can afford [6]. In addition, the number of sampling points may be limited by restricted access in areas with irregular topography, and the accuracy and precision of the data may be compromised by in situ sampling errors or laboratory analysis errors. Hence, conventional methods cannot easily identify temporal and spatial variations of WQ parameters. As a result, they cannot represent the complete state of the water surface, which hinders the monitoring and management of the quality of water bodies [1,5,6,7].
On the other hand, advances in space science, cloud computing, and machine learning (ML) contribute to the development of new techniques for natural resource management. For instance, satellites carry optical and thermal sensors that measure reflected and emitted electromagnetic radiation. This information is used to evaluate WQ with high spectral and spatial resolution [2,8]. Remote sensing has been used to monitor WQ since the 1970s, and since then studies have developed methodological approaches to exploit the advantages offered by satellites [9]. The satellite radiometers used up to now are designed for observing the ocean and the land surface; therefore, they are not specifically suited to observing continental waters [10,11]. However, the fine spatial resolution of terrestrial sensors enables acceptable results to be obtained when monitoring WQ parameters [4,12].
The literature review underscores the increasing use of Landsat-8 and Sentinel-2 sensors, highlighting their advantages in terms of fine spatial and temporal resolution [13,14]. Landsat-8, for instance, provides data at 16-day intervals, roughly equivalent to 22 images per year depending on the location, while Sentinel-2 captures images every 5 days, resulting in approximately 73 images per year. These attributes have played a pivotal role in yielding highly promising results in the remote detection of WQ parameters within continental water bodies [15,16,17,18].
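As a rough check of the image counts quoted above, the nominal revisit interval can be converted into an upper bound on acquisitions per year. This is only a sketch: the function name is hypothetical, and the number of usable scenes at a given site differs from these figures because of orbit overlap and cloud cover.

```python
DAYS_PER_YEAR = 365


def max_images_per_year(revisit_days: int) -> int:
    """Number of complete revisit cycles that fit in one year."""
    return DAYS_PER_YEAR // revisit_days


landsat8 = max_images_per_year(16)   # 16-day revisit interval
sentinel2 = max_images_per_year(5)   # 5-day revisit interval
print(landsat8, sentinel2)           # prints: 22 73
```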
Nevertheless, a significant challenge for these studies has been limited access to sufficient training data for ML models [2,19]. The acquisition of satellite images is constrained by adverse weather, such as persistent cloud cover or precipitation, which hinders the capture of surface water reflectance values [20]. The need for an adequate quantity of training data therefore remains a substantial challenge in this research domain. To address this data limitation, several studies employ k-fold cross-validation to maximize the use of limited data and build robust models [18]. Encouragingly, the relationship between the light reflected by specific parameters and their field-measured concentrations has proven to be an effective basis for predictive models with promising results [21].
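The idea behind k-fold cross-validation can be sketched in a few lines of standard-library Python. The example below is illustrative only: the data are synthetic (a hypothetical band ratio and a noisy, linearly related concentration), the helper names are invented for this sketch, and a simple least-squares line stands in for the ML models discussed in the literature.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(x, y):
    """Ordinary least squares for y = a * x + b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, my - a * mx


def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def cross_validate(x, y, k=5, seed=0):
    """Return one r2 per fold; every sample is held out exactly once."""
    scores = []
    for test_idx in k_fold_indices(len(x), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(x)) if i not in held_out]
        a, b = fit_line([x[i] for i in train_idx], [y[i] for i in train_idx])
        pred = [a * x[i] + b for i in test_idx]
        scores.append(r2_score([y[i] for i in test_idx], pred))
    return scores


# Synthetic stand-in data (NOT from any study): reflectance band ratio
# versus a field-measured concentration with Gaussian noise.
rng = random.Random(42)
band_ratio = [rng.uniform(0.5, 2.0) for _ in range(40)]
chl_a = [30.0 * r + 5.0 + rng.gauss(0.0, 3.0) for r in band_ratio]

scores = cross_validate(band_ratio, chl_a, k=5)
print([round(s, 2) for s in scores])
```

Because every sample serves as test data in exactly one fold, all 40 matchups contribute to both training and evaluation, which is the property that makes the technique attractive when satellite and in situ matchups are scarce.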
Furthermore, to overcome the shortage of training data for ML models, several studies incorporate in situ data from monitoring activities available through the open-access portals of national water and environmental agencies [19]. For instance, Papenfus et al. [22] employed data from the United States Environmental Protection Agency’s Water Quality Portal to facilitate remote sensing of chlorophyll-a (Chl-a) in lakes and reservoirs within the United States. Similarly, other data sources include the European Environment Agency (EEA) Waterbase portal in Europe, the Global Freshwater Quality Database (GEMStat) at a global scale, and Canada’s Open Government Portal [19].
In Latin America, studies such as that by Rodríguez López et al. [23] used in situ data from Dirección General de Aguas de Chile to estimate Chl-a concentration using Landsat-8 and obtained r² values ranging from 0.64 to 0.93 when testing various neural networks. In Brazil, Bettencourt et al. [24] estimated turbidity and Chl-a through in situ data from the National Agency of Waters (ANA-Hidroweb). In Argentina, Germán et al. [25] estimated Chl-a levels using Sentinel-2 satellite data in conjunction with in situ measurements obtained from a monitoring program conducted by the Ministry of Water, Environment, and Public Services of the province of Córdoba. Their findings revealed an r² of 0.77. In Mexico, Otto et al. [26] used data from the RNMCA together with Landsat radiometric data as input variables to develop empirical models and estimate turbidity in Lake Chapala; the authors obtained an r² of 0.7. Similarly, Torres Vera [27], based on data from the RNMCA, developed an ML model to estimate total suspended solids (TSS) in Lake Chapala using Landsat images; the r² obtained was 0.81. Another significant work was conducted by Arias Rodríguez et al. [5], where the authors evaluated an extreme learning machine (ELM), a support vector regression, and a linear regression to estimate Chl-a, turbidity, TSS, and Secchi disk depth in the lakes of the Mexican territory (Chapala, Cuitzeo, Patzcuaro, Yuriria, and Catemaco). They integrated in situ measurements of the RNMCA with data from Landsat-8, Sentinel-3, and Sentinel-2, and reported that the atmospherically corrected Sentinel-3 data and ELM models performed better, particularly for turbidity (r² = 0.7). This illustrates the remarkable evolution in the application of remote sensing technologies for WQ monitoring in Latin American countries while emphasizing the innovative strategies employed to address the challenges in this research field.
Despite the vast scientific literature dedicated to WQ monitoring through remote sensing techniques, the global environment remains a complex and evolving system, as emphasized by Sagan et al. [8]. In light of this, relying solely on existing research is inadequate and sometimes unfeasible. It is therefore essential to continue researching and monitoring water bodies that have not yet been comprehensively analyzed.
This situation is exemplified by Lakes Cajititlán and Zapotlán, which are distinguished by their geography, hydrology, surrounding land use, and environmental conditions, making them particularly pertinent to this study. Each water body constitutes a distinct system, and solutions effective in one may not be directly applicable to the other. The diversity and distinctiveness of these ecosystems underscore the necessity of data collection and the generation of specific contextual information for future water research and management projects [14]. Furthermore, both lakes hold ecological and touristic significance within the country, and their waters are used for agricultural irrigation and recreational purposes [28,29].
From the perspective of developing countries like Mexico, budget constraints often limit the resources allocated to water management [30]. Access to advanced equipment, such as hyperspectral sensors or drones equipped with high-resolution multispectral cameras, can be a significant challenge due to financial restrictions [31,32]. In this context, the study’s primary objective is to introduce a cost-effective solution for monitoring WQ in Lakes Cajititlán and Zapotlán. The aim is to help fill a critical gap in remote sensing studies, where these particular bodies of water have remained largely unexplored. To achieve this goal, a wide range of ML algorithms was systematically evaluated and the best-performing ones were identified to ensure the effectiveness and robustness of the method. By addressing this research gap, this work advances the understanding and management of WQ and establishes a precedent for future studies in similar ecological contexts.
By using the data made available by the National Water Commission (CONAGUA) to deliver a pragmatic management tool for these bodies of water, the study provides a resource that serves the interests of both the scientific community and the local population. Today, society plays a pivotal role in decision-making processes related to water resource management [33]. There is a growing demand for robust tools that streamline the acquisition of pertinent information on WQ in these natural resources. Such information is indispensable for anticipating environmental problems, including water pollution, and proactively mitigating them [34].
This study correlated radiometric data from Landsat-8 and Sentinel-2 with WQ parameters characterized by an active spectral signal. While existing methods from the literature were employed, the novelty of this research lies in the examination of water bodies that had not previously been monitored by remote sensing. A further advantage is the availability of RNMCA data from 2009 to the present, which increases the volume of input data for training ML algorithms.
The effectiveness of eight state-of-the-art ML algorithms spanning six categories was evaluated, a broader range than in previous studies using RNMCA data. The scope of hyperparameter adjustment was expanded through grid search to enhance model performance. The first category comprises ensemble methods. The Gradient Boosting Regressor was considered for its capacity to combine the predictive power of multiple decision trees, which makes it particularly adept at capturing the intricate and interrelated dynamics of WQ parameters [35]. The Random Forest Regressor, which also relies on an ensemble of decision trees, was used to deliver precise predictions [36]. In addition, the SLA was used to enhance predictive performance. The SLA stacks the outputs of individual estimators and uses a regressor to compute the final prediction, drawing on the collective strength of each constituent estimator [37]. The second category comprises neural networks, represented by the multilayer perceptron (MLP). The MLP models non-linear dependencies between WQ input and output variables more effectively than conventional linear regression models [21,38]. The third category covers regularization techniques, with particular emphasis on Ridge regression, which introduces a penalty term that mitigates the risk of overfitting in linear regression models [36]. The fourth category, instance-based methods, is represented by the K-Neighbors Regressor. This algorithm relies on the similarity between data points to make predictions, which makes it well suited to estimating WQ parameters [35]. In the fifth category, decision trees were explored for their inherent interpretability and effectiveness [21]. Finally, the sixth category covers other algorithms, such as Support Vector Machines, which are capable of modeling intricate and non-linear relationships [36].
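The stacking idea described above (base estimators whose outputs feed a final regressor) can be sketched as follows. This is an illustration under stated assumptions, not the study's configuration: it assumes scikit-learn, uses only three base estimators with arbitrarily chosen settings, and runs on synthetic data standing in for band reflectances and a WQ parameter.

```python
import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT study data): 4 "bands", 80 samples.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 0.3, size=(80, 4))
y = 40.0 * X[:, 0] - 25.0 * X[:, 1] + 60.0 * X[:, 2] + rng.normal(0.0, 0.5, 80)

# Base estimators produce out-of-fold predictions (cv=5); the final
# Ridge regressor learns how to combine them into one prediction.
stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge(alpha=0.01)),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
        ("tree", DecisionTreeRegressor(max_depth=3, random_state=0)),
    ],
    final_estimator=Ridge(alpha=1.0),
    cv=5,
)
stack.fit(X[:60], y[:60])      # first 60 samples for training
pred = stack.predict(X[60:])   # remaining 20 samples held out
print(pred.shape)              # prints: (20,)
```

Training the final regressor on out-of-fold predictions keeps it from simply rewarding whichever base estimator overfits the training data most.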
The investigation of these diverse ML algorithms serves the primary objective of the study: to identify the most effective approach for addressing the intricate and nonlinear characteristics of WQ parameters [21]. Additionally, adjusting hyperparameters over broad ranges played a pivotal role in enhancing the performance of the predictive models. This comprehensive analysis helps ensure that the results are robust and able to meet the complex challenges posed by WQ parameter estimation.
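A grid search of the kind mentioned above can be sketched as an exhaustive evaluation of hyperparameter combinations under k-fold cross-validation. The example assumes scikit-learn; the grid values, the synthetic data, and the choice of a gradient boosting model are illustrative assumptions and do not reproduce the ranges used in the study.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (NOT study data) with a mild non-linearity.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 0.3, size=(60, 4))
y = 40.0 * X[:, 0] + 200.0 * X[:, 1] * X[:, 2] + rng.normal(0.0, 0.5, 60)

# Hypothetical grid: 2 x 2 x 2 = 8 candidate configurations.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="r2",  # mean r2 across folds ranks the candidates
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```

Each candidate is scored on held-out folds only, so the selected configuration reflects generalization rather than fit to the training samples. The cost grows multiplicatively with every added hyperparameter, which is the practical limit on how broad the ranges can be.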
5. Conclusions
Working with historical data from the RNMCA, which has conducted long-term monitoring campaigns from 2009 to the present, facilitated the development of a useful database for the study. Despite limitations stemming from atmospheric factors and occasional mismatches with satellite data, this dataset proved helpful for developing ML models. To address these challenges, extensive hyperparameter tuning was performed and k-fold cross-validation, a technique widely used in data-limited scenarios, was applied. As a result, the ML models exhibited varying predictive capacities, with the MLP and SLA algorithms demonstrating superior performance and yielding valuable insights into the spatial distribution of Chl-a, TSS, and turbidity in Lakes Cajititlán and Zapotlán.
Underscoring the significance of remote sensing, the study showed that a qualitative analysis of the spectral signatures in satellite images was enough to identify heightened reflectance in the green and near-infrared wavelengths, a telltale sign of the greenish coloration characteristic of eutrophic waters in Lake Cajititlán.
Within this context, the study developed the models with the highest predictive capabilities and tuned them to the characteristics of each lake. The findings cannot be generalized because the two lakes differ in water composition, topography, and other environmental factors. This research has therefore generated valuable data for the analysis of these lakes, which had not previously been studied through remote-sensing-based WQ monitoring. Consequently, it lays the groundwork for future research in a rapidly growing field that has attracted the attention of both researchers and water management authorities.
The outcomes of the study also underscore the potential of integrating remote sensing techniques into the monitoring campaigns conducted by the RNMCA. This expansion would make it possible to include in monitoring efforts more water bodies that were previously excluded owing to a range of limitations. Furthermore, it sets a promising precedent for advancing environmental monitoring practices, ultimately facilitating more informed and timely decision making in water resource management.