Unsupervised Machine Learning and Data Mining Procedures Reveal Short Term, Climate Driven Patterns Linking Physico-Chemical Features and Zooplankton Diversity in Small Ponds

: Machine Learning (ML) is an increasingly accessible discipline in computer science that develops dynamic algorithms capable of data-driven decisions and whose use in ecology is growing. Fuzzy sets are suitable descriptors of ecological communities as compared to other standard algorithms and allow the description of decisions that include elements of uncertainty and vagueness. However, fuzzy sets are scarcely applied in ecology. In this work, an unsupervised machine learning algorithm, fuzzy c-means and association rules mining were applied to assess the factors inﬂuencing the assemblage composition and distribution patterns of 12 zooplankton taxa in 24 shallow ponds in northern Italy. The fuzzy c-means algorithm was implemented to classify the ponds in terms of taxa they support, and to identify the inﬂuence of chemical and physical environmental features on the assemblage patterns. Data retrieved during 2014 and 2015 were compared, taking into account that 2014 late spring and summer air temperatures were much lower than historical records, whereas 2015 mean monthly air temperatures were much warmer than historical averages. In both years, fuzzy c-means show a strong clustering of ponds in two groups, contrasting sites characterized by different physico-chemical and biological features. Climatic anomalies, affecting the temperature regime, together with the main water supply to shallow ponds (e.g., surface runoff vs. groundwater) represent disturbance factors producing large interannual differences in the chemistry, biology and short-term dynamic of small aquatic ecosystems. Unsupervised machine learning algorithms and fuzzy sets may help in catching such apparently erratic differences.


Introduction
Data in ecology often present high stochasticity, correlated features and a large number of predictors compared to the sample size of the dataset. In community analysis, useful techniques to explore environmental and biological datasets include multivariate analyses and classical clustering algorithms. The rise of machine learning algorithms in ecology in recent decades has become accessible thanks to the advance in computation power, large amounts of data and software availability [1]. These algorithms are well suited to deal with complex and large ecological datasets and with nonlinearity [2]. Some machine learning algorithms are useful with datasets composed by a higher number of features as compared to the number of observations [3]. Generally, machine learning algorithms are divided into two groups: supervised and unsupervised [4][5][6][7]. In supervised learning, the algorithms learn from labelled data during a training phase and extract features to solve classification [8][9][10] or regression problems [11] when many classes or a response variable are involved in model prediction. In unsupervised learning, the algorithms identify patterns in data without considering target variables to identify clusters and structures. zooplankton taxa in 24 ponds located in an agricultural landscape in the core of Po river Basin (Northern Italy) were studied in relation to various habitats and environmental variables. Data recorded in 2014 and in 2015 were compared as, in the study area, mean temperatures in this 2-year period were very different. In 2014, the winter was much warmer while late spring and summer were much colder than the average recorded in the past and in 2015 [49]. In most of 2015, mean monthly temperatures were much warmer than the average recorded during the past. It was hypothesized that in small and shallow aquatic ecosystems, water temperature, chemistry and the build-up of short-living biological communities (e.g., planktonic organisms) follow completely different trajectories depending on climatic anomalies affecting the amount and source of the water supply. Shallow ponds are weakly buffered against perturbations due to their small water volume and limited thermal and dilution capacity. Colder temperatures, associated with groundwater inputs, may delay algal blooms. Stagnation may produce anoxia and accumulation of solutes from sediments whereas diffuse inputs from their watersheds, especially in agricultural areas, may increase nutrient concentrations favoring algal growth. Overall, these sometimes contrasting effects prevent a clear understanding of shallow pond diversity and functioning (e.g., the two-way interactions between physico-chemical features and biological communities) [31].
Fuzzy c-means algorithms were applied as an analytical tool to classify the 24 farm ponds in terms of the 12 zooplankton taxa they supported, and to specify the influence of environmental variables related to land-use and to pond characteristics on the assemblage patterns. Data recorded in 2014 and 2015 were compared taking into account that interannual temperature variations might explain apparently erratic community-wide responses. Besides this main objective, the present work represents a methodological contribution to environmental sciences research, and in particular an application of machine learning in a case study that is generally analyzed by multivariate statistical analysis.

Data Collection
In this study we focused on the occurrence of the main zooplankton taxa in 24 pools and ponds that were randomly selected in a 200 km 2 area located in the Cremona province (central part of northern Italy) [50] (Figure 1).
Analyzed temporary pools and ponds, locally named bodri, have originated by flooding events of the Po River: erosive processes dug cone-shaped holes with depths up to 6-10 m and size varying between 1529 and 7070 m 2 . Bodri generally display pronounced water level fluctuations, regulated by the Po river hydrometric level, precipitations, runoff, vertical migration of the aquifer, also due to irrigation, and summer evaporation. They represent spots of naturality within heavily exploited agricultural contexts and are vulnerable to diffuse pollution due to their small size. Many of the studied water bodies originated before 1723 [51] (for details see Table S1). At present, most bodri are eutrophic, undergo rapid infilling and display pronounced seasonal and daily variation of physico-chemical features. During surveys, they were characterised by the dominant form of primary producers (i.e., phytoplankton, submersed, floating leaves or emerged plants), for the level of saturation of dissolved gas of biological interest (i.e., O 2 , CO 2 , N 2 and CH 4 ), for dissolved nutrients (the inorganic forms of N, P and Si) and for sedimentary features (i.e., organic matter content).  (Table S1), located along the left hydrographyc bank of the Po River, in the Cremona province (North Italy).
Each pond was sampled twice: the first time between May and June 2014 and the second time between June and July 2015. Qualitative zooplankton samples were collected by 105 μm-mesh size plankton nets. Two to sixteen litres of water were filtered for each sample according to the estimated water volume and depth. All samples were preserved in 95% ethanol. All organisms present in the sample were sorted under a stereomicroscope and cladocerans were identified to genus level whereas copepods were distinguished in Calanoida and Cyclopoida. For each pond, 2 litres of water were sampled with a PE bottle. Nine chemical and three physical environmental descriptors were determined for each pond (Tables S2 and S3). Water temperature (wT), dissolved oxygen concentration, pH and electrical conductivity (EC) were measured in situ with a multiparameter probe (YSI model 566 MPS). In the laboratory, the water collected was filtered with Whatman GF/F filters (0.45 μm) and stored in glass vials (Labco Exetainer ® , Lampeter, Wales, UK) for the determination of dissolved inorganic carbon (DIC) [52] and soluble reactive phosphorus (SRP) [53], and in PE vials for the determination of dissolved reactive silica (SiO2) [54], ammonium (NH4 + ) [54] and nitrate (NO3 -) [55] (Tables S2 and S3). Chlorophyll-a (Chla) concentration was determined spectrophotometrically after filtration of 100-500 mL of water (0.45 μm Whatman GF/F filters) and extraction of pigments with 90% acetone [54]. Besides physico-chemical and biological parameters, the ponds perimeter and main depth were also considered in the study as proxies of size [56] (Tables S2 and S3).

Environmental Features Selection
To avoid multicollinearity and to reduce redundant information from the set of environmental features, a score called variance inflation factor (VIF) was computed [57,58]. For a given predictor (p), the variance inflation factor measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model. The smallest possible value of VIF is one (absence of multicollinearity). A VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity [58]. In this work, as a conservative rule of thumb, a threshold equal to 4 was set. VIF was computed for all the environmental features with a stepwise procedure. The environmental features with VIF values > 4 were then excluded and the procedure was repeated until no environmental features with VIF greater than threshold remained.  (Table S1), located along the left hydrographyc bank of the Po River, in the Cremona province (North Italy).
Each pond was sampled twice: the first time between May and June 2014 and the second time between June and July 2015. Qualitative zooplankton samples were collected by 105 µm-mesh size plankton nets. Two to sixteen litres of water were filtered for each sample according to the estimated water volume and depth. All samples were preserved in 95% ethanol. All organisms present in the sample were sorted under a stereomicroscope and cladocerans were identified to genus level whereas copepods were distinguished in Calanoida and Cyclopoida. For each pond, 2 litres of water were sampled with a PE bottle. Nine chemical and three physical environmental descriptors were determined for each pond (Tables S2 and S3). Water temperature (wT), dissolved oxygen concentration, pH and electrical conductivity (EC) were measured in situ with a multiparameter probe (YSI model 566 MPS). In the laboratory, the water collected was filtered with Whatman GF/F filters (0.45 µm) and stored in glass vials (Labco Exetainer ® , Lampeter, Wales, UK) for the determination of dissolved inorganic carbon (DIC) [52] and soluble reactive phosphorus (SRP) [53], and in PE vials for the determination of dissolved reactive silica (SiO 2 ) [54], ammonium (NH 4 + ) [54] and nitrate (NO 3 -) [55] (Tables S2 and S3). Chlorophyll-a (Chla) concentration was determined spectrophotometrically after filtration of 100-500 mL of water (0.45 µm Whatman GF/F filters) and extraction of pigments with 90% acetone [54]. Besides physico-chemical and biological parameters, the ponds perimeter and main depth were also considered in the study as proxies of size [56] (Tables S2 and S3).

Environmental Features Selection
To avoid multicollinearity and to reduce redundant information from the set of environmental features, a score called variance inflation factor (VIF) was computed [57,58]. For a given predictor (p), the variance inflation factor measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model. The smallest possible value of VIF is one (absence of multicollinearity). A VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity [58]. In this work, as a conservative rule of thumb, a threshold equal to 4 was set. VIF was computed for all the environmental features with a stepwise procedure. The environmental features with VIF values > 4 were then excluded and the procedure was repeated until no environmental features with VIF greater than threshold remained.

Fuzzy Clustering
Fuzzy clustering aims at defining a membership value of an object that can be split between different clusters. The most common clustering method, the fuzzy c-means [59], bases the clustering procedure on the minimization of an objective function as reported in Equation (1): where d ij is the distance between the ith observation and jth centroid, p is the number of observations, c is the number of clusters (2 ≤ c ≤ n), µ ij is the membership degree of the ith observation to the jth cluster and satisfying the following conditions: The exponent fuzzifier m defines the degree of fuzziness of the partition, when m approximates the value of 1, it operates as the k-means algorithm. Meanwhile, when m increases in value, the degree of fuzziness increases, and the fuzzy c-means leads to a solution where the memberships of each observation approximate 1/c [59]. The evaluation of the quality of the cluster procedure is made with a particular function that will be maximized or minimized according to the number of clusters c [60]. These procedures allow one to know how well the algorithm fitted the data structure (cluster validity problem). The most common measures for this task are the partition coefficient (PC) and the partition entropy (PE). In this study, the fuzzy c-means on the environmental dataset was used. The environmental features were standardized and a search grid procedure was used: the fuzzy c-means was run multiple times. For each run, a combination of the parameter c (number of cluster) and the fuzzifier exponent m, were set. The best partition was selected according to the maximum value of PC, or, in alternative, to the minimum value of PE. To improve the clustering procedure for each run, the algorithm was randomly initialized 50 times. The Principal Component Analysis (PCA) was performed as an operation of dimensionality reduction, in order to improve the visualization of the ponds with the membership values estimated from fuzzy c-means. The prototypes or the centroids of the estimated clusters, that are the values of the environmental features that characterized each cluster, were compared and the Trophic State Index (TSI) based on the values of prototype of the Chla was computed according to Carlson [61] using the Equation (2): To quantify the habitat heterogeneity of the environmental features, the Euclidean distances matrix between each observation (pond) and the median of each cluster were standardized and computed. Habitat heterogeneity was estimated by the average distance from the clusters' medians [62]. The Permutational analysis of multivariate dispersions (PERMDISP) test for the analysis of multivariate homogeneity of groups was used [63]. PERMDISP compares within-group variance among clusters using the mean distance from individual observations to their cluster median. Bosco Bodini pond was excluded as it was dry in 2015 whereas two chemical parameters (NH4 + and DIC) were excluded due to multicollinearity in 2014. Square root corrections were applied for groups of unequal size [64] and to test differences in habitat heterogeneity among clusters, a permutation procedure n = 999) was used. Average Euclidean distances from clusters' medians were visualized in a reduced space with Principal Coordinate Analysis (PCoA). The analysis was carried out with the R package vegan [65].

Richness and Beta Diversity
For each year and for each cluster the taxa richness and the community structure were computed. The richness of the number of taxa observed was compared between clusters of the same year and by the Mann-Whitney U Test. Differences in Richness were tested also between years with the Wilcoxon Signed-Rank Test for paired samples. The alpha diversity (α) or the mean number of taxa were computed between years. The Sorensen index (βSOR) for presence/absence data was used as a measure of beta diversity for multiple sites [66]. The beta diversity was partitioned into two components: nestedness (βSNE) and turnover (βSIM). The overall beta diversity (βSOR) and its components were computed considering different years and different clusters within years. In order to compare βSOR between and within years, a resampling procedure was applied. An equal number of sites sampled (n = 5) and a total number of samples (n = 500) were set. This procedure allowed us to estimate the distributions of βSOR and the relative components, nestedness and turnover, for multiple sites with an equal number of ponds. The estimated distributions were compared with the Kolgomorov-Smirnov test. In order to highlight differences in zooplankton community diversity across time, pairwise measures of βSOR of each pond in two different years (2014 and 2015) were compared [67]. The βSOR analysis was carried out with the R package betapart [68].

Community Structure and Association Rules
Community structure was described by characteristic taxa and association rules mined from frequent pattern tree growth (FP-growth) and visualized by the frequent pattern tree (FP-tree). For each cluster, the characteristics of taxa were computed using the indices of presence (Pi) [13,69]. For each taxa, Pi was expressed as Pi = Pic/Nstc, where Pic is the i taxon belonging to a particular cluster, and Nstc is the number of ponds in a particular cluster. A taxon was identified as characteristic if its indices of presence were higher than the threshold Pi, set at 0.6 [13,69].
Considering the whole dataset of presence/absence data, association rules were extracted using frequent pattern growth algorithm (FP-growth), in order to highlight and evaluate correlations among co-occurrences of different taxa. An association rule is an implication X→Y that describes the existence of a relationship between X and Y species or group of species [22,70]. FP-growth is based on a divide-and-conquer approach, the algorithm identifies small patterns by decomposing the mining problem into a set of smaller ones represented by conditional databases, extracted on a compressed data representation, the FP-tree. This approach reduces the search space and the computational effort [23].
To select an association rule from the set of all possible rules, constraints of various quantitative measures of interestingness and significance were applied, using objective measures [71,72]. Interestingness measures the strength of the relationship between X and Y. As the first step of association rule mining, the threshold values of a support-confidence framework were used [73]. Support measures the probability of observing a particular group of species X in the dataset, while the confidence is the conditional probability of observing the species Y given the presence of the species X. A threshold value of minimum support equal to 0.1 and minimum confidence equal to 0.80 were set. The second step relied on an interestingness measure called lift. Lift quantifies the statistical dependence of two or more taxa in a particular rule; it is a positive real number, with a value equal to 1 under statistical independence [24]. Association rules were sorted in descending order of lift and association rules with lift value lower or equal to 1 were not considered [74]. The zooplankton community structure of each cluster was visualized in a compact way using the frequent pattern tree (FP-tree) [23,25]. Association rules were mined by Weka software version 3.8.4 [75] and visualized with the R packages Arules and ArulesViz [76,77].

Fuzzy c-Means
In both years, the number of clusters was c = 2, corresponding to a value of the fuzzifier m = 1.5 ( Figure S2). The evaluation of the quality of the clustering was made considering the best partition of the maximum value of partition coefficient (PC) and the minimum value of partition entropy (PE). In 2014 and 2015, the maximum value of PC, 0.68 and 0.63, respectively, and the minimum value of PE, 0.49 and 0.56, respectively, were obtained for c = 2. In 2014, the highest membership associated with Cluster 1 was observed for Pavarini, Pescaroli West and Santa Maria Maddalena, and with Cluster 2 for Bosco Piazza, San Giorgio and Cascina Tavernelle (Table 2). In 2015, the highest memberships associated to Cluster 1 and 2 were observed for Motta, Bicocca and Pastore 1, and for Pastore 4, Bosco Braca and Pescaroli West, respectively. In 2014, the prototypes showed that ponds in Cluster 1 were characterized by higher values of wT, pH, Chla, SiO 2 , depth and perimeter and lower values of EC, SRP andNO 3 than ponds in Cluster 2 (Figures 2 and 3). In 2015, the ponds grouped in Cluster 1 were characterized by lower values of pH, Chla and higher values of EC, NH 4 + , DIC, SRP, NO 3 -, reactive silica (SiO 2 ), depth and perimeter than ponds in Cluster 2. The wT of Cluster 1 was similar to that of Cluster 2 (Table 3). In 2015, both clusters estimated by fuzzy c-means showed higher values of the prototypes relative to wT, Chla and SiO 2 , compared to the prototypes of clusters estimated in 2014. In 2014, the difference of trophic status between clusters was higher than in 2015. In 2014, the TSI was 40.30 for Cluster 1 and 29.56 for Cluster 2, while in 2015, it was 42.70 for Cluster 1 and 45.23 for Cluster 2. Ponds that, in both years, remained grouped in the same cluster were Pastore 3, Vecchio, Bazzi and Motta, grouped in Cluster 1, and Temporanea, Bosco Valloni, San Giorgio, Forche, Martignana, Bosco Piazza and Cascina Tavernelle, grouped in Cluster 2. Habitat heterogeneity, estimated by the average distance from clusters' median, was 2.10 (Cluster 1) and 2.30 (Cluster 2) in 2014, while it raised to 3.10 (Cluster 1) and 2.54 (Cluster 2) in 2015 ( Figure S3). The permutation test showed that habitat heterogeneity was not significantly different among clusters (F3 = 1.3103, p-value = 0.273).

Richness and Beta Diversity
In 2014, the taxa richness was significantly lower in Cluster 1 than in Cluster 2 (W = 35.5 and p-value = 0.036), whereas in 2015 there was no significant difference between clusters (W = 69 and p-value = 0.5651) (Figure 3). Richness was higher in 2014 than in 2015 (V = 152.5, p-value = 0.02) (Figure 3), with values of alpha diversity (α) equal to 3.61 in

Richness and Beta Diversity
In 2014, the taxa richness was significantly lower in Cluster 1 than in Cluster 2 (W = 35.5 and p-value = 0.036), whereas in 2015 there was no significant difference between clusters (W = 69 and p-value = 0.5651) (Figure 3). Richness was higher in 2014 than in 2015 (V = 152.5, p-value = 0.02) (Figure 3), with values of alpha diversity (α) equal to 3.61 in 2014 and to 2.56 in 2015. The distributions of beta diversity index (βSOR) and the relative components nestedness (βSNE) and turnover (βSIM), estimated by resampling, were statistically different considering clusters and years (Table S4 and Figure 4). In 2014, the overall beta diversity (βSOR) and the turnover (βSIM) were higher than in 2015, but the nestedness (βSNE) was lower in 2014 than in 2015 (Table 4 and Figure 4). In 2014, beta diversity (βSOR) and turnover (βSIM) were higher in Cluster 1 than in Cluster 2, but nestedness (βSNE) was lower in Cluster 1 than in Cluster 2. In 2015, beta diversity (βSOR) and turnover (βSIM) were higher in Cluster 2 than in Cluster 1, but the nestedness (βSNE) component was higher in Cluster 1 than in Cluster 2 (Table 4 and Figure 4). The pairwise comparison of beta diversity (βSOR) between ponds in different years was the maximum (βSOR = 1) for Vecchio, Santa Maria Maddalena, Rita, Pescaroli East, Pavarini, Motta, Martignana and Bosco Braca, while the minimum value was recorded for Bosco Valloni (βSOR = 0.33) (Figure 4).

Community Structure and Association Rules
In 2014, the characteristic taxa were Calanoida (Pi = 0.69) for Cluster 1 and Simochephalus  (Figure 5). Considering the whole dataset, nine association rules were found ( Figure 6 and Table S5). One association rule showed higher values of lift (3.92) and indicated the co-occurrence of Calanoida, Cyclopoida and Pleuroxus with Simocephalus. This association was found in ponds of Cluster 2 in 2014. Three association rules with lift values equal to 1.52 indicated the co-occurrence of Daphnia, Simocephalus, Chydorus and Pleuroxus with Calanoida (Table S5).  (Table  S5).     (Table  S5).   were characterized by the presence of Calanoida and Cyclopoida, whereas the community structure of Cluster 2 was characterized by the co-occurrence of Simocephalus, Calanoida, Pleuroxus and Cyclopoida (Figure 7 and Figure S4) as expressed by the first association rule with the highest value of lift. In 2015, the community structure of Cluster 1 was characterized by the presence of Calanoida, Daphnia and Chydorus while Cluster 2 was characterized by the presence of Calanoida, Daphnia and Scapholeberis (Figure 7 and Figure S4). These associations were found in Cluster 2 of both years 2014 and 2015 and in Cluster 1 of 2015. The association rule with the lowest lift value (lift = 1.26) indicated the co-occurrence of Chydorus and Cyclopoida with Calanoida and was found in both Clusters 1 and 2, in 2014. The f requent pattern tree (FPt) revealed different patterns in the community structure between clusters and years. In 2014, most of the ponds of Cluster 1 were characterized by the presence of Calanoida and Cyclopoida, whereas the community structure of Cluster 2 was characterized by the co-occurrence of Simocephalus, Calanoida, Pleuroxus and Cyclopoida (Figure 7 and Figure S4) as expressed by the first association rule with the highest value of lift. In 2015, the community structure of Cluster 1 was characterized by the presence of Calanoida, Daphnia and Chydorus while Cluster 2 was characterized by the presence of Calanoida, Daphnia and Scapholeberis (Figure 7 and Figure S4). Each node represents a specific taxon and its absolute frequency (number of ponds where the taxon was found). The branches join the co-occurrence of taxa. Only the taxa with frequency > 20% were reported (see also Figure  S4).

Discussion
Results from this study suggest that shallow ponds may undergo completely different trajectories in the same geographical area and may display pronounced differences in terms of their water's physico-chemical and biological parameters. This is not surprising due to their overall small size and to the small ratio between their water volume and sediment surface. The small water volume has limited buffer capacity against climatic anomalies, water ingression from the aquifer or from the watershed, resulting in local, sharp changes of physico-chemical parameters and, as a consequence, of biological communities [78][79][80]. Each node represents a specific taxon and its absolute frequency (number of ponds where the taxon was found). The branches join the co-occurrence of taxa. Only the taxa with frequency > 20% were reported (see also Figure S4).

Discussion
Results from this study suggest that shallow ponds may undergo completely different trajectories in the same geographical area and may display pronounced differences in terms of their water's physico-chemical and biological parameters. This is not surprising due to their overall small size and to the small ratio between their water volume and sediment surface. The small water volume has limited buffer capacity against climatic anomalies, water ingression from the aquifer or from the watershed, resulting in local, sharp changes of physico-chemical parameters and, as a consequence, of biological communities [78][79][80].
Phytoplankton communities have, in turn, the potential to control inorganic nutrients and regulate dissolved oxygen, inorganic carbon concentrations and water pH. Assimilative processes are contrasted by nutrient regeneration from sediments that, together with the low ratio between water volume and sediment surface, amplify the effects produced by microbial dissimulative pathways (e.g., oxygen shortage) [79]. However, the most interesting result of this study does not deal with different solutes or chlorophyll concentrations in the analyzed shallow water ecosystems. Ponds are intrinsically heterogeneous, they can be net autotrophic or net heterotrophic and these extremes correlate with high chlorophyll, oxygen and low nutrient concentrations or the opposite, respectively [78]. What is novel here is that the functioning of ponds, exemplified by snapshots showing phytoplankton, nutrient concentrations and zooplankton community composition, may diverge from year to year due to some sort of continuous disturbance or to the absence of a stable steady state. Such instability is favored by the vulnerability of ponds to a large set of pressures and may contribute to the paradox of their diversity, which is a central topic in recent freshwater research and has important implications for ecosystem restoration, in particular in heavily impacted agricultural areas [78,79].
The main findings of this study were extracted from the dataset via the application of an unsupervised machine learning and data mining algorithm, fuzzy c-means and frequent pattern growth. Such an approach allowed us to assess the factors that influenced assemblage composition and the apparently erratic distribution patterns of zooplankton taxa in 24 ponds and in two consecutive years. Data in ecology are characterized by high uncertainty, bias and a hierarchical level of complexity. Machine learning tools were used according to the level of complexity of ecological systems in order to understand environmental and biological dynamics [81], shift in species assemblages along time [12] and to plan conservation actions for ecological communities threatened by anthropic pressure and climate change [82]. Classical clustering methods generally define sharp boundaries between groups and each object belongs only to a particular cluster. These classical procedures do not consider the continuous realm of ecological features. The assumptions based on Boolean rules might lead to misclassifications and might fail to detect outliers. A set of ecological objects, in our case the freshwater ponds, can be partitioned using fuzzy logic, where a probabilistic approach helps to capture the continuous nature of ecological data.
Brownscombe et al. [3] applied supervised machine learning techniques combined with unsupervised fuzzy c-means to a wide range of informational sources in order to identify potential spawning aggregation sites for a marine fish species. The flexibility and the probabilistic output of the fuzzy logic, with respect to the classical partition procedure based on crispy clustering, was used to describe species association in marine ecosystems that consist of communities or cohesive units of not-random taxa groups and assemblages of taxa that are randomly associated [13]. Fuzzy clustering algorithms were used also for the identification and partition of similar regions and hydrologically homogeneous watersheds [83].
In this study, for both years, the fuzzy c-means algorithm allowed us to identify two different clusters. In 2014, ponds in Cluster 1 were characterized by high concentrations of chlorophyll-a, high pH and water temperature while ponds in Cluster 2 were characterized by high concentrations of chemical species, with silica as the only exception. Cluster 1 was more autotrophic and showed a higher TSI than Cluster 2. In 2015, temperature was comparable between clusters and the situation of 2014 was reversed, with Cluster 1 more heterotrophic and characterized by a higher concentration of chemical species (SRP, DIC, NH 4 + , NO 3 − ), higher conductivity, depth and perimeter and lower values of pH than Cluster 2. In different years, the chemistry of water changed in many ponds, and large differences were recorded in Pastore 3, Vecchio, Bazzi and Motta, where the concentrations of all chemical parameters increased and a reduction of pH was recorded. In contrast, Temporanea, Bosco Valloni, San Giorgio, Forche, Martignana, Bosco Piazza and Cascina Tavernelle showed a reduction in the concentration of nutrients and of electrical conductivity. In 2015, all ponds showed higher concentrations of chlorophyll-a than in 2014, confirmed by the increase of the trophic state index. This result was probably due to the effect of higher temperatures in 2015 than in 2014. Differences between cluster communities might be related to different factors, among which the amount and the quality of water inputs from runoff, from the aquifer or the thermal regime in winter and late spring. A general increase in the concentration of reactive silica was observed in 2015 as compared to 2014, likely due to regeneration from sediments uncoupled to uptake. Fuzzy c-means identified a pond (Rita) with the lowest value of cluster membership in both years and the highest value of nitrate, likely due to diffuse inputs. The zooplankton community structure (species richness and species composition) is potentially affected by both water chemistry and site morphology, and by anthropogenic pressures in lakes and watersheds [33,84,85]. At the geographical level, the species pool is driven by dispersal constraint, whereas the habitat species pool is due to environmental constrains [86]. The high taxonomic diversity in zooplankton communities is only partially expressed in individual freshwater habitats and the differences in zooplankton community structures among systems is largely associated with specific environmental conditions [87][88][89]. Abiotic factors (e.g., pH, temperature, light intensity) can also influence the zooplankton community structure of fishless aquatic habitats directly by sorting species based on differential physiological tolerances, or indirectly by interacting with biotic conditions such as primary production and invertebrate predation [90][91][92]. The observed taxa are also determined by internal dynamics due to biotic factors, such as predation, intraspecific and interspecific competition. A particular pattern of taxa was the result of internal and external process that defined the community structure.
In this work, a data mining algorithm was used to evaluate the co-occurrence of taxa on presence-absence data in a system of freshwater ponds. This method allowed us to quantify possible correlations among taxa in frequent patterns extracted from data and to highlight differences in the community structure between consecutive years. In 2014, the taxa richness was higher in Cluster 2 than in Cluster 1 and the community structure was different. Cluster 2 was dominated by the presence of Simocephalus in many ponds, while in Cluster 1 the taxa with a higher presence were Cyclopoida and Calanoida. Furthermore, Simocephalus was not present in ponds characterized by higher pH in both years. In 2014, Pastore 4 showed the lowest value of membership for Cluster 1: all environmental features were similar to prototypes of Cluster 1 except for chloropyll-a and reactive silica, that were more similar to Cluster 2. However, the community structure in Pastore 4 was composed of Cyclopoida, Calanoida and Daphnia, which was the most frequent pattern in Cluster 1. In 2014, the community structure was characterized by not-common taxa association composed by Pleuroxus, Alona, Moina and Macrothrix, while, in 2015, the presence of these taxa was not recorded. The smallest membership associated to a particular cluster was found for ponds belonging to the group with the lowest nutrient concentrations. In both years, a low beta diversity was observed for clusters with higher concentrations of chemical species, high conductivity and pH.
In general, nestedness of species assemblages occurs when the biota of sites with smaller numbers of species are subsets of the biota at richer sites [93,94], reflecting a non-random process of species loss as a consequence of any factor that promotes the orderly disaggregation of assemblages [95]. Turnover implies the replacement of some species by others, as a consequence of environmental sorting or spatial and historical constraints [96]. The environmental features working as a driver at the local scale might have shaped the community assemblage, decreasing the species replacement between the ponds as shown by a lower turnover. As reported in Gianuca et al. [97] a higher heterogeneity usually produces turnover patterns but, in our study, from PERMDISP analysis, a difference in heterogeneity among clusters was not recorded. In this work, the ponds characterized by lower nutrient concentrations showed higher turnover. In clusters with a higher value of trophic status, the component of nestedness increased, that is poorest ponds in taxa richness were subsets of the richest ponds. A nestedness pattern may highlight an internal cluster gradient of environmental features that might drive the community assemblage. Margalef [98] gave rise to the widespread concept that the lower the level of lake eutrophication, the more complex the structure of aquatic animal communities. A general reduction in taxa richness was observed in 2015 compared to 2014, suggesting a tendency to a higher trophic status of both clusters. The higher temperatures of the water in 2015 may have favored the increase of phytoplankton production. In 2015, a general reduction in the complexity of the community structure was observed compared to 2014. This condition was highlighted by the FP-tree. In 2015, the community structure was characterized by Calanoida and Daphnia. Calanoid copepods generally appear to be best adapted to oligotrophic conditions, whilst cyclopoid copepods and cladocerans are relatively more abundant in eutrophic waters [99,100]. In our study, the presence of Cladocera decreased between years, but Cyclopoida disappeared with the highest trophic status observed in 2015. Variation in community structure alters ecosystem functioning and biodiversity metrics can indicate how communities influence ecosystems [101][102][103][104]. Seasonal and interannual increases in Daphnia abundance have been associated with P limitation due to higher requirements in Daphnia than in other taxa [105]. Moreover, filtrator Daphnia species fed on the smallest food particles with a low selectivity, while many species of cyclopoids show a raptorial feeding type and high selectivity preferences of much larger food items [106]. Chydorids are more successful in very productive habitats, feeding by scraping algal particles from periphyton. On the contrary, Bosmina shows the lowest clearance rate, declines with increasing food concentrations and does not co-occur with Daphnia [106]. However, information about food quality and quantity as well as adaptive life history strategy needs to understand the mechanistic role of association rules in ecosystem functioning. In perspective, information on taxa's functional traits might be included in analysis by unsupervised machine learning.
Results from physico-chemical and biological (e.g., chlorophyll-a and zooplankton communities) analysis of shallow ponds reveals large variability of all single (e.g., nitrate, pH) and aggregated (e.g., trophic status, biodiversity indexes) parameters over short temporal scales. Under such conditions, traditional statistical approaches may fail to extract significant patterns or aggregations, and consider them as erratic. Fuzzy logic allowed us to group ponds in clusters that differed in two consecutive years, likely due to small differences in external stressors, affecting the unstable equilibrium between autotrophic and heterotrophic processes and their dominance. The latter, in turn, affect nutrient concentrations and the intensity of algal blooms, producing cascading consequences on zooplankton diversity and community composition. Year-to-year slight variations in water temperatures, different timing, absence of diffuse nutrient input via runoff and variable interactions with the aquifer may drive timing of blooms but also the intensity of heterotrophic microbial activities in sediments of shallow ponds. Such variations result in dynamic rearrangement of ponds in clusters, and might end up in excess nutrients sustaining algal growth or in nutrient limitation, stimulating zooplankton richness.

Conclusions
This study represents an example of a promising application of machine learning techniques applied to a heterogeneous dataset. The fuzzy-set theory provided a mathematical approach that was able to cope with imprecision and allowed us to cluster ponds without defining discrete boundaries. Such an approach allowed us to assess the factors that influenced assemblage composition and the apparently erratic distribution patterns of zooplankton taxa. In our case, the freshwater ponds were partitioned with a probabilistic approach that allowed us to capture the continuous nature of ecological data.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10 .3390/w13091217/s1. Table S1. Reported the latitude and longitude in WGS84 coordinate reference system of the 24 ponds under study, the number of years since origin, Table S2: Showed the physical and chemical environmental features measured for each pond, the unity of measure, the symbols adopted in the study and the laboratory assay or the method of estimation, Table S3.  Figure S1. The panels showed the correlation matrix of the environmental features for the years 2014 and 2015. The diagonal showed the labels of the eleven environmental features: water temperature (wT), pH, Oxygen (O 2 ), conductivity (EC), dissolved inorganic carbon (DIC), ammonia (NH 4 + ), soluble reactive phosphorus (SRP), nitrate (NO 3 -), chlorophyll-a (Chla), silica (SiO 2 ), depth and perimeter. In the upper diagonal part was present the Pearson correlation coefficient between pairs of variables proportional in size to the magnitude of the value, the stars showed the significance level of the correlation test (* p < 0.05, ** p < 0.001, *** p < 0.0001). The lower diagonal part showed the scatterplot between pairs of variables. The red line reported the lowess smoother. Figure S2. Values of partition coefficient (PC) and partition entropy (PC) for the years 2014 (left panel) and 2015 (right panel). For each plot, the grey scale colors, the point's shape and the geometry of the lines were relative to the different number of clusters (c) in the range 2-6. Figure S3. Showed the ordination plot of the average Euclidean distance of the scaled environmental features of the water chemistry, from the median of each cluster found by fuzzy c-means. Each ponds were reported on the first two principal coordinate axis and the symbols were relative to the ponds of Cluster 1 ( ) and Cluster 2 (∆) in 2014, Cluster 1 (+) and Cluster 2 (x) in 2015. The ellipses represent 1 standard deviation of the Euclidean distances from the median of the clusters. The PERMIDISP analysis after permutation test, revealed not significant difference between groups in habitat heterogeneity. Table S4. Showed the results of Kolgomorov-Smirnov test between distributions of beta diversity indices after the resampling procedure. The stars represents the p-value (* p < 0.05, ** p < 0.01, *** p < 0.001 and **** p < 0.0001). Table S5. Showed the association rules from presence/absence data mined with frequent pattern growth algorithm, in 2014 and 2015.
The association rules highlight the frequency and the correlations between taxa co-occurrences. For each rule were reported the quantitative measures of interestingness: support, confidence and lift. Figure S4. Frequent pattern trees (FPt) for the community structure in Cluster 1 (a) and 2 (b) in 2014 and in Cluster 1 (c) and 2 (d) in 2015. Each node represents a specific taxon and its absolute frequency (number of ponds where the taxon was found). The branches join the co-occurrence of taxa.

Conflicts of Interest:
The authors declare no conflict of interest.