Prediction of Phytoplankton Biomass in Small Rivers of Central Spain by Data Mining Method of Partial Least-Squares Regression

Abstract: The Water Framework Directive (WFD, EC, 2000) states that the "good" ecological status of natural water bodies must be based on their chemical, hydromorphological and biological features, especially under drastic conditions of floods or droughts. Phytoplankton is considered a good environmental bioindicator (WFD), and climate change has a strong impact on phytoplankton communities and water quality. The development of robust techniques to predict and control phytoplankton growth is still in progress. The aim of this study is to analyze the impact of the different stressors associated with the change in phytoplanktonic communities in small rivers in the center of the Iberian Peninsula (Southwestern Europe). A statistical study was carried out to identify the essential limiting variables for phytoplankton growth and its seasonal variation under climate change. In this study, a new method based on the partial least-squares (PLS) regression technique has been used to predict the concentration of phytoplankton and cyanophytes from 22 variables usually monitored in rivers. The predictive models have shown good agreement between training and test data sets across rivers and seasons (dry and wet). Phytoplankton in dry periods showed the greatest similarities, these dry periods being the most important factor in phytoplankton proliferation.


Introduction
The Water Framework Directive (WFD, EC, 2000) states that the "good" ecological status of natural water bodies must be based on their chemical, hydromorphological and biological characteristics, compared to the reference conditions [1]. To comply with the protection of surface waters established in the Water Framework Directive, it is necessary to monitor the ecological and chemical status of water quality, especially under drastic conditions of floods or droughts, because of the greater epidemiological risk that occurs during these periods.
Phytoplankton is considered a good environmental bioindicator since it presents temporal patterns related to environmental changes and, in addition, the processes that act on this community operate on a reduced time scale, so phytoplankton is an important ecological tool for obtaining answers in the short term [2,3]. Furthermore, spatio-temporal variability in the structure of phytoplankton communities plays an important role in the structure and function of aquatic ecosystems [4]. Multiple factors affect the phytoplankton population. Among these are the main nutrients (nitrogen, carbon and phosphorus) [5], the environmental conditions, the hydrodynamics and hydromorphology of rivers [6,7] and the biotic conditions (competition, predators, etc.) [8].
With regard to environmental and climatic conditions, phytoplankton depends on light intensity and temperature, since they affect the speed of photosynthetic processes [9,10], and on the water level, since a low flow rate and a drop in the water level of rivers produce an increase in phytoplankton [11]. Other studies have also shown that increasing organic carbon and nutrient inputs from landfills can lead to changes in the competitive dynamics between bacteria and phytoplankton, reducing phytoplankton biomass and increasing bacterial abundance [5]. In this sense, climate change affects ecosystems on a planetary scale [12] and is especially important in some regions of the world. Several predictive models have shown that the Mediterranean climate region is particularly sensitive to global warming due to the progressive establishment of a drier and warmer climate [13,14]. The effects of drought on the hydrology of the Mediterranean basins have been studied [15-19], since the effects of hydrological drought, in terms of frequency and intensity, are expected to become more severe due to climate change.
Climate change has a special effect on unregulated rivers that are temporary or intermittent. Temporary rivers are ecologically unique, supporting important ecosystem processes and functions and being highly relevant to the conservation and protection of biodiversity. At the same time, they suffer a large number of anthropogenic impacts, including alterations of their flow regime, changes in their bends and channels, nutrient excesses and invasive species [20]. Predictions on climate change have indicated that the Mediterranean region will suffer severe deficits in the flow of its rivers, increasing the vulnerability of temporary rivers and of those that are now perennial, which will become temporary [21,22]. Appropriate management of regulated rivers, maintaining their level and flow, can improve the quality of the water, especially when they contain phytoplankton species that can harm the human population, such as cyanobacteria [23].
The objective of this study is to analyze the impact of the different stressors associated with the change in phytoplanktonic communities in small rivers in the center of the Iberian Peninsula (Southwestern Europe) with the multivariate method of Partial Least Squares (PLS). PLS regression is a recent statistical technique that generalizes and combines features from principal component analysis and multiple regression [24,25] and that can be used to analyze data on environmental effects on biodiversity [26,27] and the large-scale influence of climate [28,29]. In the present study, statistical models suitable for predicting the concentration of phytoplankton and cyanophytes from 22 variables usually monitored in rivers have been established. Furthermore, the relationship between phytoplankton and cyanobacteria concentration and the other environmental and morphological variables at the different sampling points and in the different seasonal periods has been established. A better knowledge of the limiting factors in the growth of phytoplankton will allow watershed managers to improve the quality of the discharge sites and prevent risks to the population.

Study Area
The study area for the determination of surface water quality is located in the province of Salamanca (Western Spain). This province covers an area of 12,340 km² and forms the South-Western part of the River Duero basin, which is the most important aquifer system of the Iberian Peninsula. The climate of the region is continental, with considerable seasonal fluctuations in temperature (the difference in mean temperature between the hottest and coldest days is almost 20 °C) and low humidity. Precipitation is low (mean annual rainfall 380 mm), highly irregular and usually absent in July and August; hence, during the dry season the hydric balance is clearly negative. Salamanca province has three river basins (Figure 1): two belonging to the Duero river (Tormes and Águeda river basins) and one belonging to the Tajo river (Alagón river basin). The Tormes river basin is not considered here because it has previously been studied in depth by the authors [18].

Sampling and Analysis
The 22 parameters were measured at 33 sampling points (Figure 1: red points). The points were selected to evaluate the evolution of water quality in the Águeda and Huebra rivers (Águeda river basin) and the Alagón river (Alagón river basin) upstream and downstream of municipal wastewater discharges (Figure 1: black points), in order to consider the influence of these discharges on water quality. The study was carried out in 2015 and 2017, and two seasonal periods were investigated within these years: May to September is considered summer (summer 2015 and 2017) and November to March is considered winter (winter 2017). The first study period corresponds to the 2014-15 hydrological year, which was considered a wet hydrological year. The second period, corresponding to 2017 (hydrological years 2016-17 and 2017-18), registered much lower rainfall than normal and was considered a very dry period. It included an extreme drought lasting from mid-July 2016 until mid-October 2017.
The parameters analyzed in the water samples were: total solids, ammonia, nitrite, nitrate, total phosphorus, sulfate, chloride, fluoride, calcium, magnesium, chemical oxygen demand, biochemical oxygen demand, total organic carbon, colour, and total and fecal coliforms. These parameters were determined using official or recommended methods of analysis [29,30]. The in situ measurements were: pH, temperature, conductivity, turbidity, and dissolved oxygen. Algal class analysis (Cyanophyta, Cryptophyta, Chlorophyta, Bacillariophyta and Dinophyta) was carried out with a submersible spectrofluorometer (bbe FluoroProbe) [31].

PLS Regression Method
The prediction models were set up using the PLS option of the SIMFIT open-source statistical package [32]. PLS regression is particularly useful for predicting a set of dependent variables (responses) from a large set of independent variables (predictors). Prediction models are obtained by extracting from the predictor and response variables a new set of orthogonal factors, called latent variables, which capture the best predictive power. PLS regression searches for a set of components that decompose predictors and responses simultaneously, with the constraint that these components explain as much as possible of the covariance between predictors and responses.
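As an illustration of how latent variables are extracted, the following is a minimal sketch of a NIPALS-style PLS regression for a single response (PLS1), written in plain Python. It is not the SIMFIT implementation used in this study; the function names (`pls1_fit`, `pls1_predict`) and the toy data are assumptions introduced only for the example.

```python
# Minimal PLS1 sketch (one response variable), NIPALS-style, standard library only.
# Illustrative only: not the SIMFIT routine; names and data are hypothetical.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def column(M, j):
    return [row[j] for row in M]

def pls1_fit(X, y, n_components):
    """Fit PLS1; returns (x_means, y_mean, weights, loadings, y_loadings)."""
    n, p = len(X), len(X[0])
    x_means = [sum(column(X, j)) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[x - m for x, m in zip(row, x_means)] for row in X]  # centered predictors
    yc = [v - y_mean for v in y]                               # centered response
    W, P, Q = [], [], []
    for _ in range(n_components):
        # Weight vector: direction in X with maximal covariance with y.
        w = [dot(column(Xc, j), yc) for j in range(p)]
        norm = dot(w, w) ** 0.5
        if norm < 1e-12:          # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [dot(row, w) for row in Xc]        # scores of the latent variable
        tt = dot(t, t)
        p_load = [dot(column(Xc, j), t) / tt for j in range(p)]  # X loadings
        q = dot(yc, t) / tt                                      # y loading
        # Deflation: remove what this latent variable explains.
        Xc = [[x - ti * pj for x, pj in zip(row, p_load)]
              for row, ti in zip(Xc, t)]
        yc = [v - ti * q for v, ti in zip(yc, t)]
        W.append(w)
        P.append(p_load)
        Q.append(q)
    return x_means, y_mean, W, P, Q

def pls1_predict(model, x):
    """Predict the response for one new sample x (list of predictor values)."""
    x_means, y_mean, W, P, Q = model
    r = [xi - m for xi, m in zip(x, x_means)]
    y_hat = y_mean
    for w, p_load, q in zip(W, P, Q):
        t = dot(r, w)                     # score of the new sample
        y_hat += q * t
        r = [ri - t * pj for ri, pj in zip(r, p_load)]
    return y_hat

# Toy data (made up): two predictors, response is an exact linear function.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 6.0]]
y = [2 * a - b + 3 for a, b in X]
model = pls1_fit(X, y, n_components=2)
print(pls1_predict(model, [6.0, 1.0]))
```

With as many components as (linearly independent) predictors, PLS1 reproduces ordinary least squares; using fewer components, as in the seven-factor models with 22 predictors described here, trades some fit for stability when predictors are correlated.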

Results
Two river basins at two different seasonal periods (dry and wet) have been studied. As an example, the development of the predictive model for the dry winter period in the Águeda River is presented.
The PLS technique considers two matrices of variables. The matrix of predictor variables (X) is composed, for each of the rivers in each of the stations studied, of the values of the 22 variables measured. The matrix of response variables (Y) contains the two variables to be predicted: phytoplankton, measured as chlorophyll-a, and cyanobacteria, measured as phycocyanin pigment. Figure 2 shows the cumulative variance of the latent factors for both the X and Y variables in the Águeda river (dry winter seasonal period). As can be seen, a plateau is reached where the gain in captured variability is very small. Since this level of captured variability is considered acceptable, seven factors can be taken as sufficient for the model for calibration purposes (97% of the variability captured in X and 92.94% in Y for phytoplankton, and 97% in X and 89% in Y for cyanobacteria).
To quantify the importance of each of the X variables in the prediction model, the VIP ("Variable Influence on Projection" [33]) scores were used. The VIP scores of the 22 X variables, for the prediction model built with seven factors, are shown in Figure 3. Important predictors of phytoplankton and cyanobacteria concentration were identified as the variables with VIP scores higher than one. The best predictors for both responses are temperature, ammonium and fecal coliforms.
PLS methodology consists of two distinct parts: calibration with a training data set and validation with a test data set. The experimental data for the Águeda river example were divided into ¾ for the training (calibration) set and ¼ for the test (validation) set. The calibration (Figure 4) and validation (Table 1) of the model are presented below.
In the training procedure, seven PLS factors were selected as optimum. The agreement between the measured and predicted values for the model is shown in Figure 4, where a good correlation with the training data can be seen. Nevertheless, good agreement with the training data set is not the best measure of the quality of the model. Therefore, a test set with new experimental data was used to validate the model. The prediction rates for the sampling points in the Águeda river example are presented in Table 1. As shown in Table 1, the prediction error percentage was lower for phytoplankton (15%) than for cyanobacteria (40%), which indicates a good fit of the PLS prediction model for phytoplankton.
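The exact VIP expression evaluated by the software is not given in the text. A commonly used definition, shown here as an assumption about what was computed, is the following, where p is the number of predictors (22 here), A the number of factors (seven here), w_a the weight vector of factor a, t_a its score vector and q_a its loading on the response:

```latex
\mathrm{VIP}_j
  = \sqrt{\frac{p \sum_{a=1}^{A} SS_a \left( \dfrac{w_{ja}}{\lVert w_a \rVert} \right)^{2}}
               {\sum_{a=1}^{A} SS_a}},
\qquad
SS_a = q_a^{2}\, t_a^{\mathsf{T}} t_a
```

Under this definition the squared VIP scores sum to p, so their mean is one, which is the usual motivation for treating variables with VIP scores above one as the influential predictors.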
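The calibration and validation scheme can be sketched as follows. The text does not define how the prediction error percentage was calculated, so the mean absolute percentage error used here is an assumption, as are the function names and the ordered (unshuffled) split.

```python
# Sketch of a 3/4 - 1/4 training/test split and a percentage prediction error.
# Assumptions: ordered split without shuffling; error = mean absolute percentage error.

def split_train_test(samples, train_fraction=0.75):
    """Return (training, test) parts of an ordered list of samples."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]

def mean_percent_error(measured, predicted):
    """Mean absolute percentage error, skipping samples with a measured value of zero."""
    pairs = [(m, p) for m, p in zip(measured, predicted) if m != 0]
    return 100.0 * sum(abs(p - m) / abs(m) for m, p in pairs) / len(pairs)

# Hypothetical usage with made-up sample identifiers and concentrations.
train, test = split_train_test(list(range(8)))
error = mean_percent_error([10.0, 20.0], [11.0, 18.0])
print(train, test, error)
```

Only the test part is used to report the error, because agreement on the training part can be made arbitrarily good by adding factors.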

Discussion
To allow comparisons between the rivers, all the studies were carried out using the same PLS statistical procedure with seven factors for the different rivers in the different seasonal periods analyzed. The results of the comparison are shown in the conclusions.

Conclusions
A new methodology based on the multivariate regression technique PLS has been proposed in this work. It allows the concentration of phytoplankton and cyanophytes to be predicted from 22 variables usually monitored in rivers. The predictive models generated have shown a good fit to the training data series. In turn, these models have performed well in predicting phytoplankton and cyanobacterial concentrations from new validation data series, although prediction error rates have been lower for phytoplankton (10-25%) than for cyanobacteria (40-60%).
The predictive models are formulated as equations of the multiple linear regression type, where the coefficients indicate the contribution of each of the variables to the model. The coefficients determined have varied from one river to another and between seasons, as expected. However, a certain similarity of the coefficients for dry summer periods (droughts) has been observed. The features of these transition periods are the most important in the prediction, since they provide favourable conditions for the proliferation of the phytoplankton community.
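In general form, a model of this type can be written as below. The symbols are introduced here only for illustration: x_j are the 22 monitored variables, b_j the coefficients derived from the seven-factor PLS fit, and ŷ the predicted concentration of phytoplankton or cyanobacteria.

```latex
\hat{y} = b_0 + \sum_{j=1}^{22} b_j \, x_j
```

Comparing the magnitudes of b_j between rivers or seasons is meaningful only if the predictors are expressed on a common scale, for example after standardization.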