Model-Based Analysis of the Potential of Macroinvertebrates as Indicators for Microbial Pathogens in Rivers

The quality of water prior to its use for drinking, farming or recreational purposes must comply with several physicochemical and microbiological standards to safeguard society and the environment. In order to satisfy these standards, expensive analyses and highly trained laboratory personnel are required. Macroinvertebrates, in contrast, have been used as ecological indicators to review the health of aquatic ecosystems. In this research, the relationship between microbial pathogens and macrobenthic invertebrate taxa was examined in the Machangara River, located in the southern Andes of Ecuador, where 33 sites were chosen according to their land use to collect physicochemical, microbiological and biological parameters. Decision tree models (DTMs) were used to generate rules that link the presence and abundance of some benthic families to microbial pathogen standards. These DTMs provide an indirect, approximate, and quick way of checking the fulfillment of Ecuadorian regulations for water use related to microbial pathogens. The models, built and optimized with the WEKA package, were evaluated based on both statistical and ecological criteria to make them as clear and simple as possible. As a result, two different and reliable models were obtained, which could be used as proxy indicators in a preliminary assessment of pollution by microbial pathogens in rivers. The DTMs can be easily applied by staff with minimal training in the identification of the sensitive taxa selected by the models. The presence of selected macroinvertebrate taxa, in conjunction with the decision trees, can be used as a screening tool to evaluate sites that require additional follow-up analyses to confirm whether microbial water quality standards are met.


Introduction
The most frequent health risk related to the ingestion of water is associated with microbial contamination by human or animal feces, which is a source of pathogenic bacteria, viruses, protozoa and helminths [1,2]. Pathogens are introduced into rivers via point and non-point sources, and their autochthonous growth is stimulated by nutrients brought from these sources [3]. The health risk increases when untreated wastewater from urban sewage systems (point source) is directly discharged into water bodies, potentially causing large outbreaks of waterborne diseases [4]. In addition, water from rivers and lakes has off-stream uses such as drinking water or irrigation, and instream uses such as recreational activities with primary contact (e.g., swimming). Therefore, water quality control must always be of paramount importance [5].
The indicators often used to verify microbial contamination of water in developed countries are total coliforms, and fecal coliforms and/or Escherichia coli [6,7]. Likewise, in many tropical countries, the assessment of running water quality is predominantly made by using physicochemical methods. However, most of the methods for determining physicochemical and microbiological parameters require expensive laboratory analyses. In the majority of developing countries, limited technical and financial resources therefore do not allow for the establishment of rigorous national monitoring programs of water bodies. For those reasons, the development of cost-effective water monitoring programs is essential [8], and must include techniques for measuring microbial water quality.
The biological methods for monitoring river water health have evolved over more than a century. For example, benthic macroinvertebrates are used to assess the water quality over time, because they respond to both physicochemical changes and hydro-morphological variations within streams and rivers [9,10]. Physicochemical and microbiological parameters provide limited water quality information at a specific point in time [9,11]. In contrast, biological samples can also predict average values of chemical parameters when their cumulative effects have been more pronounced in the biota over a period of time preceding the biological sampling [11]. As such, the use of bioindicators in water quality assessment for streams has been integrated into the European Water Framework Directive [12]. In developing countries, biological river assessment was introduced and subsequently developed only recently [9], based mainly on adaptation of the English Biological Monitoring Working Party (BMWP) [13][14][15].
Fecal coliform (FC) concentration has been modeled using both deterministic and stochastic methods. The deterministic models focused on understanding the die-off variation of fecal coliforms in relation to temperature, and changes under kinetics conditions (i.e., transportation) such as the velocity along the rivers [16]. Alternatively, stochastic models have been used to obtain the relationship between fecal coliform and physicochemical [17] or microbiological [18] variables, or timing variation during a rainfall [19]. Negative correlation between FC concentrations and macroinvertebrate diversity (Shannon-Wiener diversity index) was observed in ponds [18].
On the other hand, the assessment of habitats and the determination of the relation between the presence of an organism and environmental variables have been done through the modeling of running waters based on ecological, physicochemical and microbiological parameters. These modeling techniques have allowed for the handling of the non-linear behavior of the ecosystem, obtaining models with a high reliability [20][21][22]. In this way, the FC concentration has been identified as one of the explanatory variables describing the presence or absence of some taxa of macroinvertebrates [22][23][24]. Machine learning techniques such as classification trees (CTs) combine reliable classification predictions with transparency, and have been proven to be effective for assessing running waters [25,26]. The CTs are suitable modeling techniques because they focus on the presence/absence or abundance of macroinvertebrate taxa (family or species) in relation to a specific impact or disturbance in the streams [11,20,26,27,28]. Consequently, considering the described correlations between fecal coliform presence and macroinvertebrate diversity [18,22,23,24], compliance with regulatory standards can be simulated based on the prevailing macroinvertebrate community structure by training classification trees on combined observations of fecal coliforms and macroinvertebrates, with the community structure thereby acting as a proxy indicator for fecal coliform contamination.
In our research, with the environmental and biological variables collected in the Machangara River in Ecuador between February and March of 2012, three decision tree models (DTMs) were developed as indicator tools to check the compliance to three of the Ecuadorian microbial water quality standards associated with fecal coliforms. The construction of the DTMs was based on the presence and abundance of macroinvertebrates in the Machangara River basin. The models were built based on statistical adjustments and ecological criteria. For model optimization, statistical techniques were used, such as the elimination of false positives (FP) achieved by applying weights as well as the minimum confusion entropy from the models. Two of the three final obtained DTMs were validated with datasets collected in July of 2015 and March of 2016.

Study Area
This study focuses on the basin of the Machangara River, an Andean mountain river that originates as a first-order river and ends as a fourth-order river upon its discharge into the Cuenca River. The Machangara River is about 37 km in length [29] and, at the end of its path, crosses the city of Cuenca, located in the southern Province of Azuay in Ecuador (Figure 1). Cuenca is the third largest city in the country with an estimated 2015 population of about 370,000 inhabitants [30]. The Cuenca River basin is part of the Hydrographic Demarcation Santiago, one of the Amazon affluents.
The Machangara River basin covers about 325 km², of which 252 km² is forest protected by the Ecuadorian government. The basin is regulated all year by two hydroelectric power plants, with their respective dams, Labrado and Chanlud, situated in the upper area of the catchment and upstream from Cuenca (Figure 2a). Water is extracted from the catchment basin for use primarily as a supply of drinking water, agricultural irrigation, and to a lesser extent for industrial use. The altitude of the basin varies from 2440 to 4420 m above sea level (m a.s.l.) and its mean altitude is 3557 m a.s.l. The average annual rainfall in the basin varies from 877 mm in the lower part to 1363 mm in the upper areas. The average annual temperature ranges from 16.3 °C in the lowlands to 9.0 °C in the more elevated areas of the Machangara basin [31,32]. Two seasons, each distributed over two periods, occur during the year: the rainy season, from the middle of February until the beginning of July and from the second half of September until the first two weeks of November, and the dry season, which covers the rest of the year. The monthly average discharge of the Machangara River from 1964 to 2010 at its outlet into the Cuenca River was 8.4 m³·s⁻¹, the average minimum monthly discharge was 5.3 m³·s⁻¹ in August and the average maximum monthly discharge was 14.6 m³·s⁻¹ in May [33].
Despite the combined sewage system in Cuenca, poor water quality was recorded along the parts where the river flows through the city. This is mainly due to some sewage networks and industrial pollution points that discharge at different locations along the river and its tributaries and affect the water quality of these streams [29,34]. In addition, discharges from combined sewer overflow (CSO) events, when wet-weather flows exceed the sewage treatment plant capacity, and surface water outfalls (SWO) cause the degradation of physicochemical and biological quality [35][36][37][38]. Similarly, agricultural and livestock runoff transports pollutants into the rivers [39]. The poor water quality in the river running along the city could have been influenced by pollutants such as organics expressed as BOD5, organic nitrogen, phosphates and fecal coliforms [29].

Data Collection
The dataset used in this research was collected and measured once during the rainy season, between February and March of 2012 (Table A1 in Appendix A). Four variables were measured in situ: water temperature, conductivity, dissolved oxygen (DO) and pH, with an ORION 5Star 1219001 (Thermo Scientific, Waltham, MA, USA) multi-parameter probe. Flow velocity was measured using the float method described by the U.S. Environmental Protection Agency [40]. The rest of the parameters, and the methods used for their determination in the Sanitation laboratory at the Water Supply and Wastewater Management Municipal Company ETAPA-EP in Ecuador, are shown in Table A1 (Appendix A). Benthic macroinvertebrate samples were collected from the rivers and their tributaries by using the kick-sweep method. This method is applied by shuffling the feet while walking backwards against the current and holding a standard net (inlet area 575 cm², mesh size 500 µm, depth 27.5 cm) for six minutes in a stretch of approximately 10-20 m, so that the material dislodged immediately upstream is collected in the net. One kick net sample was collected at each site, which included all different habitats present such as bed substrate, litter, macrophytes and parts of terrestrial vegetation immersed in the water. Additionally, macroinvertebrates were manually picked from stones and leaves [41][42][43]. Macroinvertebrates were then sorted in the field and preserved in 70% ethyl alcohol [43]. All macroinvertebrates collected were identified in the laboratory to family level with the help of a stereoscope, with magnifications that varied from 0.8× to 5×, and specific reference materials [44][45][46]. At each sampling location, the Biological Monitoring Working Party index adapted to Colombia (BMWP-Col) [15,45,47], which takes into account the sensitivity score to organic pollution of the taxa found, was calculated (Figure 2b).
The sensitivity score ranges from one (for very tolerant taxa) to 10 (for the most sensitive families). BMWP-Col is calculated as the sum of the sensitivity scores of each taxon captured at each site. BMWP-Col scores can be divided into five water quality categories: bad (≤15), deficient (16-35), moderate (36-60), good (61-99) and very good (>99) [45,47].
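The scoring and categorization just described can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: the function names are hypothetical, the family names and scores in the example are placeholders rather than real BMWP-Col values, and the moderate band (36 to 60) is inferred from the neighboring categories.

```python
def bmwp_col_score(family_scores):
    """Sum the sensitivity scores (1-10) of the families found at one site.

    family_scores maps a family name to its sensitivity score; each family
    counts once, regardless of how many specimens were captured.
    """
    return sum(family_scores.values())


def bmwp_col_category(score):
    """Map a BMWP-Col score to one of the five water quality categories."""
    if score <= 15:
        return "bad"
    if score <= 35:
        return "deficient"
    if score <= 60:  # assumed band, inferred from the adjacent categories
        return "moderate"
    if score <= 99:
        return "good"
    return "very good"


# Placeholder families and scores, for illustration only.
site = {"family_A": 10, "family_B": 7, "family_C": 2}
total = bmwp_col_score(site)
print(total, bmwp_col_category(total))
```

Because the index is a simple sum, a site's category depends only on which families are present, not on their abundance.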

Ecuadorian Water Regulation in Relation to Water Use
The Ecuadorian government has regulations regarding water quality in relation to water use [48]. The norm sets limit values for different parameters in relation to particular water uses, giving three thresholds to regulate the concentration of fecal coliforms with regard to water use (Table 1). The most stringent microbial water quality standard for fecal coliforms is applied for recreational water use with primary contact (Table 1). The least stringent microbial water quality standard for fecal coliforms is for raw (untreated) water used for drinking water before receiving non-conventional treatment (Table 1). Non-conventional treatment methods include slow sand filtration and multi-stage filtration, which are recommended for small towns that need flows of less than 8 L/s and have a population <5000 people, or for towns that need flows up to 21 L/s and have a population <12,000 people [49]. The intermediate microbial water quality standard for fecal coliforms is for agriculture (Table 1).

Model Development
Decision tree models (DTMs) were developed to predict the fulfillment of the fecal coliform regulations according to the water uses (Table 1), expressed as three discrete levels. In this research, the attributes or independent variables of the DTMs were the presence/absence or abundance of macroinvertebrate taxa that were observed in at least three sampled points (Table 2). The discrete dependent variables were the fulfillment of the three microbial water quality standards for fecal coliforms, which were measured as most probable number per 100 mL (MPN·100 mL⁻¹). Decision trees are hierarchical structures, where internal nodes contain a test on the independent input variables. Each branch of an internal node corresponds to an outcome of the test, and the prediction for the value of the dependent variable is stored in a leaf. Decision trees explain variation in dependent variables by splitting independent variables at certain thresholds in the nodes of the tree. Furthermore, each division or level can produce more nodes with branches that follow a new ordering instruction [50][51][52][53]. Decision trees have been applied in numerous ecological studies such as macroinvertebrate habitat suitability analysis [20,27], because the DTM combines reliable classification with a transparent set of rules [52]. Furthermore, classification trees are robust techniques that can deal with small datasets [54] of fewer than 50 data points [55], as is the case in this study, in which the dataset is composed of the results of 33 different sites. In addition, with a small dataset the accuracy of classification tree models is higher than that of other techniques such as logistic regression models [56]. All observations were included to construct the models, because classification trees are not sensitive to outliers [57].
In this study, the machine learning software Waikato Environment for Knowledge Analysis (Weka) [58] and its J4.8 decision tree classifier, which is a Java re-implementation of C4.5 [59], were used for inducing classification trees and creating a prediction model. The model training and validation were performed with three-, five- and ten-fold cross validation (k-fold cross validation, k-fcv), in which the records are randomly split into k equally sized subsets. In each run, k−1 subsets are used as the training set and the remaining subset is used as the test set. This process is repeated k times so that each subset is used as the test set exactly once [28]. The expansion of the tree is limited by the pruning process, which requires a minimum number of instances in every leaf to allow branching. With the aim of improving this process, two pruning confidence factors (PCF) were employed: 0.25, which is the default value, and 0.1. With a small dataset, lower cross validation values can result in more robust models, but with a relatively low performance [60]. Tables 3 and A2 in Appendix A show the settings for the eight models obtained with three-, five- and ten-fold cross validation and with 66% of the data as a training set, as well as with a PCF of 0.1 and 0.25, before optimization.
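The k-fold procedure described above can be illustrated with a short, standard-library Python sketch. It is not Weka's implementation: the function name and the fixed random seed are assumptions made for the example, and no stratification by class is attempted. With 33 records the folds cannot all be exactly equal, so their sizes differ by at most one.

```python
import random


def k_fold_indices(n_records, k, seed=0):
    """Return k (train, test) index pairs for k-fold cross validation.

    Records are shuffled once and dealt into k folds; each fold serves as
    the test set exactly once while the other k-1 folds form the training set.
    """
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most 1
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


# The study used 33 sites with k = 3, 5 and 10.
for k in (3, 5, 10):
    sizes = [len(test) for _, test in k_fold_indices(33, k)]
    print(k, sizes)
```

With k = 10 and 33 records, each test fold holds only three or four sites, which is one reason small datasets can give unstable per-fold results.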

Modeling and Analysis
First, the accuracy of the DTMs was evaluated with two measurements obtained from the confusion matrix. This matrix identifies true positive (TP), false positive (FP), false negative (FN) and true negative (TN) cases predicted by each run.
The first fitted measure was the percentage of correctly classified instances (CCI), which is calculated as the sum of the diagonal (i.e., TP + TN) divided by the sum of all values (i.e., TP + FP + TN + FN) [64] and expressed as a percentage. The CCI ranges from 0 to 100%, where a value of 100% indicates the greatest accuracy of the model [65]. The second fitted measure was Cohen's Kappa statistic, a derived statistic that measures the proportion of correct predictions (TP and TN) made by a model after accounting for chance predictions [27,66]. This coefficient is calculated as:
$$\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad p_o = \frac{TP + TN}{N}, \qquad p_e = \frac{(TP + FN)(TP + FP) + (FN + TN)(FP + TN)}{N^2} \qquad (1)$$
where $N = TP + FP + TN + FN$, $p_o$ is the observed proportion of correct predictions and $p_e$ is the proportion expected by chance. The interpretation of the model fit with respect to different Kappa statistic values follows the scale given in [67]. Models are considered good when the Kappa statistic is higher than 0.4 and the CCI is at least 70% [28].
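As a worked illustration of the two measures, the following Python sketch computes the CCI and Cohen's Kappa from the four cells of a 2 × 2 confusion matrix and applies the reliability rule quoted above (Kappa > 0.4 and CCI of at least 70%). The function names and the example counts are hypothetical and are not taken from the study's tables.

```python
def cci(tp, fp, fn, tn):
    """Correctly classified instances, as a percentage of all instances."""
    return 100.0 * (tp + tn) / (tp + fp + fn + tn)


def cohen_kappa(tp, fp, fn, tn):
    """Cohen's Kappa for a 2 x 2 confusion matrix."""
    n = tp + fp + fn + tn
    p_o = (tp + tn) / n  # observed agreement
    # Agreement expected by chance, from the row and column totals.
    p_e = ((tp + fn) * (tp + fp) + (fn + tn) * (fp + tn)) / n ** 2
    if p_e == 1.0:
        raise ValueError("Kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1.0 - p_e)


def is_reliable(tp, fp, fn, tn):
    """Reliability rule from the text: CCI >= 70% and Kappa > 0.4."""
    return cci(tp, fp, fn, tn) >= 70.0 and cohen_kappa(tp, fp, fn, tn) > 0.4


# Hypothetical confusion matrix for 33 sites.
print(round(cci(20, 2, 3, 8), 1), round(cohen_kappa(20, 2, 3, 8), 3))
```

For these example counts, 28 of 33 sites are classified correctly, and Kappa discounts the share of that agreement that the class proportions alone would produce.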
When the cross validation results, which are calculated from the confusion matrix, are only slightly different, it is difficult to determine at first which measurement is better for evaluating a decision tree model (DTM). Furthermore, the accuracy of the DTM (i.e., CCI) is obtained from the diagonal alone, regardless of how the off-diagonal elements take their values [68]. The misclassification information (i.e., FP + FN) of confusion matrices can be analyzed using the overall confusion entropy of a confusion matrix (CEN), which evaluates the confusion level of the class distribution of misclassified samples. According to Wei, Yuan, Hu and Wang [68], higher accuracy of the models is likely to correspond to lower confusion entropy. Likewise, the CEN is more precise than the correctly classified instances (CCI), and can replace this latter coefficient to evaluate classifiers in classification applications. In addition to the CCI, the least confusion entropy was considered as a decision value to choose the best model for each analyzed regulation. For this calculation, the following expression was adapted to a 2 × 2 confusion matrix from the equations given by Wei, Yuan, Hu and Wang [68]:
$$CEN = \sum_{j=1}^{2} P_j \, CEN_j \qquad (2)$$
where $P_j$ is called the confusion probability of class j and $CEN_j$ is defined as the confusion entropy of class j. These values were calculated with Equations (3) and (4):
$$P_j = \frac{2C_{jj} + FP + FN}{2\,(TP + FP + TN + FN)} \qquad (3)$$
and
$$CEN_j = -\left(P_{FP} \log_2 P_{FP} + P_{FN} \log_2 P_{FN}\right) \qquad (4)$$
In Equation (3), $C_{jj}$ is the number of correctly classified samples of class j (TP for the positive class and TN for the negative class). In Equation (4), $P_{FP}$ and $P_{FN}$ are the misclassification probabilities of classifying the samples of class i to class j subject to class j, with $0 \cdot \log_2 0$ taken as 0, and are defined in Equation (5):
$$P_{FP} = \frac{FP}{2C_{jj} + FP + FN}, \qquad P_{FN} = \frac{FN}{2C_{jj} + FP + FN} \qquad (5)$$
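To make the confusion entropy concrete, here is a minimal Python sketch for the two-class case. It follows the general definition of Wei et al. [68] as understood here, specialized to a 2 × 2 matrix; the function names are hypothetical, and the authors' exact adapted formulas may differ in notation. The term 0·log2(0) is treated as 0.

```python
import math


def _p_log2_p(p):
    """Return p * log2(p), with the usual convention that 0 * log2(0) = 0."""
    return 0.0 if p == 0 else p * math.log2(p)


def confusion_entropy_2x2(tp, fp, fn, tn):
    """Overall confusion entropy (CEN) of a 2 x 2 confusion matrix."""
    total = tp + fp + fn + tn
    cen = 0.0
    # One pass per class: 'correct' is TP for the positive class and
    # TN for the negative class.
    for correct in (tp, tn):
        denom = 2 * correct + fp + fn
        if denom == 0:
            continue  # class never appears as actual or predicted
        p_j = denom / (2 * total)  # confusion probability of the class
        p_fp = fp / denom          # misclassification probabilities
        p_fn = fn / denom          # subject to this class
        cen_j = -(_p_log2_p(p_fp) + _p_log2_p(p_fn))
        cen += p_j * cen_j
    return cen


# A perfect classifier has zero confusion entropy; evenly split errors
# with no correct predictions give the maximum of 1.
print(confusion_entropy_2x2(10, 0, 0, 10), confusion_entropy_2x2(0, 5, 5, 0))
```

Unlike the CCI, the CEN can differ between two matrices with the same number of correct predictions, because it depends on how the errors are distributed between FP and FN.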
In order to check the stability of the DTMs, and knowing that the dataset is relatively small, the dataset was randomly and manually divided into three subsets, stratified based on fulfillment or non-fulfillment of the regulation under analysis. Two of these subsets were used to train the model, and the third subset was used to test the model. This process was repeated three times so that each subset was used once as the test set. Furthermore, the groups of two subsets used for the learning process were run in J4.8 with pruning confidence factors of 0.25 and 0.10. Additionally, when a false positive (FP) was detected in the confusion matrix, the cost-sensitive classifier (CSC) tool was employed to give new weights to the FP. The stability of the DTMs was assessed from the variation (standard deviation [28]) of the CCI and Kappa statistics [54] that were obtained from the test subsets.
For the models to be useful beyond their statistical fit, their optimization must also be assessed from an ecological point of view, because rules that fit well statistically can still be erroneous from an ecological angle. For this reason, before choosing a model, an ecological examination has to be considered [69], in which the rules obtained from the DTM are compared and tested against what is generally accepted in ecology [70]. For example, an acceptable knowledge rule is: "The ecosystem has a higher ecological status when the concentration of nutrients is low", whereas an erroneous knowledge rule is: "The quality of the ecosystem is very high with a low oxygen concentration" [69]. In this research, two criteria were included for the ecological evaluation. First, a DTM was discarded when a taxon resulting from the model had a tolerance score (TS) lower than four, which ensured that the microbial water quality assessment was not done in a highly polluted place. The second criterion was to ensure that at least one of the taxa resulting from the DTM was always present. In some cases, the branches (rules) of a DTM can indicate that the presence of any taxon is not necessary for compliance with the fecal coliform regulation. This is an aspect that could give erroneous results in the application of the DTM.
Finally, the models selected after the optimization process were assessed with two new datasets, taken in both the dry season (July of 2015) and the rainy season (March of 2016).
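The stability check can be expressed as the standard deviation of a performance measure across the three test subsets, relative to its mean. The sketch below is only an illustration: the function name is hypothetical, the CCI values are invented, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not say which form was used.

```python
import statistics


def relative_sd_percent(values):
    """Standard deviation of the values as a percentage of their mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1); an assumption here
    return 100.0 * sd / mean


# Invented CCI values (%) from three test subsets of one model class.
cci_per_subset = [60.0, 80.0, 100.0]
print(relative_sd_percent(cci_per_subset))
```

A low percentage indicates that the measure changes little when a different third of the data is held out, which is the sense in which a model class is called stable.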

Current Water Quality Status
Fecal coliform concentrations were greater in urban and suburban sites than in sites from other land uses (Figure 3), while a summary of the variation of the physicochemical parameters collected during the sampling campaign can be reviewed in Table A1. Similarly, the fecal coliform results, in relation to the three microbial water quality standards described in Section 2.3 and to land use (Figure 2a, Table 2), show that the nine points sampled in the south-east section of the basin, which are located in the urban and suburban areas of Cuenca, do not meet the official microbial water quality standards (Figure 4a-c). All other locations (24 points) meet the regulation standards regarding agriculture (<1000 MPN·100 mL⁻¹) and raw water (<2000 MPN·100 mL⁻¹). It is important to note that these 24 sites met both regulations at the same time. In addition to the nine points previously indicated, five sites do not meet the recreational regulation (<200 MPN·100 mL⁻¹). These five sites are located close to livestock zones: three are near the center of the Machangara basin (points: 24, 27 and 45), and two other locations are in the northeast area of this catchment (points: 13 and 15) (Figure 4a). In total, 36 taxa of macroinvertebrates were captured (Table 2), which were the basis for the calculation of the BMWP-Col. When analyzing the results of the BMWP-Col of the 33 points indicated in Figure 2b in relation to the fecal coliform regulations, the outcomes show 14 points with different biological water quality (i.e., two good, eight moderate, two deficient, and two bad) that do not meet the recreational fecal regulation. In addition, nine points with diverse BMWP-Col (i.e., five moderate, two deficient and two bad) do not meet the values of the agriculture and raw water fecal regulations.
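The three thresholds quoted above can be applied to a measured fecal coliform concentration as in the following Python sketch. The dictionary and function names are hypothetical, the limits are treated as strict upper bounds as written in the text, and the example concentrations are invented.

```python
# Fecal coliform limits in MPN per 100 mL, as quoted in the text.
FC_LIMITS = {
    "recreational": 200,
    "agriculture": 1000,
    "raw_water": 2000,
}


def fc_compliance(fc_mpn_per_100ml):
    """Return, for each water use, whether the concentration is below its limit."""
    return {use: fc_mpn_per_100ml < limit for use, limit in FC_LIMITS.items()}


# Invented concentrations for illustration.
for fc in (150, 500, 1500, 2500):
    print(fc, fc_compliance(fc))
```

Because the limits are nested, any site that meets the agriculture limit also meets the raw water limit, which is why a rule trained on the agriculture standard can serve as a conservative check for raw water.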

Model Development
For the construction of the models, 23 taxa that were observed in at least three different points were used (Table 2). In total, eight models were constructed during the model development stage, of which four resulted from the absence-presence dataset while the other four were developed with the abundance dataset (Tables 3 and A3 in Appendix A). Based on the correctly classified instances (CCI) and Kappa statistics, a reliable decision tree model (DTM) (i.e., CCI > 70% and Kappa > 0.4), obtained from models 2a1 and 2a2 (Table 3), was developed with the abundance dataset, allowing a preliminary assessment of the fulfillment of the agriculture fecal coliform guidelines (Figure 5b). No reliable model was obtained to assess the fulfillment of the recreational fecal coliform regulation. Similarly, no reliable DTMs (Table 3) were obtained from the presence-absence dataset to check the fulfillment of any of the fecal coliform guidelines. With the dataset used, it was not possible to obtain a specific model to verify the raw water regulation, although the model obtained for the agriculture fecal regulation, which has a more stringent threshold, could be adopted to check the raw water regulation. Likewise, the two best models resulted in the same DTM (Models 2a1 and 2a2, Table 3), whose description is given in Section 3.3 following an ecological examination.

Model Optimization
The decision tree models (DTMs) were optimized by adding new weights to false positives in training instances, with the aim of minimizing the false positive (FP) errors. This is possible with a cost-sensitive classifier (CSC) tool combined with the J4.8 algorithm in the WEKA package. It was not possible to obtain a specific model for the raw water fecal regulation, but the DTMs obtained for the agriculture regulation could be applied to check the raw water fecal regulation, because the threshold of the agriculture regulation is more stringent than that of the raw water fecal coliform regulation. In this stage, 40 models were developed (Table A4 in Appendix A), from which eight DTMs were reliable (Table 4), with correctly classified instances (CCI) higher than 70% and Kappa statistics higher than 0.4 (Table 4). These eight DTMs were initially pre-selected from a statistical point of view (Table 4). Two groups of models for evaluation of the recreational fecal coliform regulation had similar trees with different abundance requirements (models from 1a5 to 1a7 and from 1a9 to 1a12); the group containing the model with the least confusion entropy of a confusion matrix (CEN) was chosen (models from 1a5 to 1a7). For the agriculture fecal regulation, the DTM resulting from models 2a3 to 2a6 was the same as the one obtained in models 2a1 and 2a2 in the previous section, "Model development". The DTMs obtained from models 2a3 to 2a6 and from 2a7 to 2a11 had the same families with the same abundance requirements, and differed only in the sequence of their leaves. In this case, the group containing the model with the least CEN was selected (models from 2a3 to 2a6), resulting in six total DTMs after statistical evaluation (1a4, 1a5 to 1a7, 1a8, 2ap3, 2ap4 and 2ap5, and 2a3 to 2a6, Table 4).
All models that were obtained after the optimization process with their results of the correctly classified instances (CCI), Kappa statistics, the number of leaves obtained in each model through k-fold (i.e., three, five and 10) cross validation and the overall confusion entropy of a confusion matrix (CEN), are shown in Table A3.
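The effect of weighting false positives more heavily than false negatives can be illustrated with a simple cost comparison. This sketch is not Weka's cost-sensitive classifier, which reweights training instances; it only shows how an FP weight changes which of two candidate confusion matrices is preferred. The function name and all counts are hypothetical.

```python
def misclassification_cost(fp, fn, fp_weight=1.0, fn_weight=1.0):
    """Total cost of the errors in a confusion matrix under given weights."""
    return fp_weight * fp + fn_weight * fn


# Two hypothetical candidate trees with different error profiles.
tree_a = {"fp": 3, "fn": 1}
tree_b = {"fp": 1, "fn": 4}

for weight in (1, 4):
    cost_a = misclassification_cost(tree_a["fp"], tree_a["fn"], fp_weight=weight)
    cost_b = misclassification_cost(tree_b["fp"], tree_b["fn"], fp_weight=weight)
    print(weight, cost_a, cost_b)
```

With equal weights the first candidate has the lower cost, but once a false positive counts four times as much, the candidate with fewer false positives is preferred. In this setting a false positive means a site predicted to meet the regulation that in fact does not, which is the costlier error for a screening tool.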
These pre-selected decision tree models (DTMs) were verified from an ecological point of view. Three groups of models were discarded: the first with the model 1a4, the second with the model 2ap3, and the third with the models 2ap4 and 2ap5 (Tables 4 and A4 in Appendix A). Chironomidae, which is a taxon with very low pollution sensitivity, is present in the leaves of the first discarded DTM, 1a4 (Table 4). This 1a4 model was constructed with the abundance dataset to assess the fulfillment of the recreational fecal coliform regulation, while the second (model 2ap3) and third (models 2ap4 and 2ap5) DTMs were developed with the absence-presence dataset to evaluate the fulfillment of the agriculture fecal coliform regulation. In those DTMs, the rules are determined by the presence and absence of the Perlidae and Baetidae taxa. According to these rules, the absence of both sensitive taxa meets the agriculture fecal coliform regulations (Tables S2 and S4, Supplementary Materials). However, this situation can also occur in polluted sites.
The three remaining DTMs (1a5 to 1a7, 1a8 and 2a3 to 2a6, Table 4) were evaluated with the validation datasets, from which two models were confirmed (1a5 to 1a7 and 2a3 to 2a6, Table 4), whereas the DTM obtained from model 1a8 (Table 4), constructed for verification of recreational water use with primary contact guidance, could be neither validated nor discarded. Validation was not possible because the validation data did not meet the abundance requirement given by the second branch of this DTM (Table S3 in Supplementary Materials), even though the first branch of the model met the FC regulation and was validated.
Finally, two decision tree models (DTMs) were selected (from models 1a5 to 1a7 and from models 2a3 to 2a6, Table 4), in which the abundance of each taxon refers to the number of specimens collected in five square meters (5 m²). The first DTM is applicable as a preliminary tool for verification of recreational water use with primary contact guidance, which is referred to in this work as the 'recreational fecal regulation'. This first DTM (from models 1a5 to 1a7, Table 4) has as a condition the presence of Baetidae (Ephemeroptera) with an abundance less than or equal to three and the presence of Scirtidae (Coleoptera) with an abundance less than or equal to three (Figure 5a). The second DTM (from models 2a3 to 2a6, Table 4) is used as a proxy indicator to evaluate the fulfillment of the agriculture fecal standards that regulate agriculture and livestock water uses. This second DTM (from models 2a3 to 2a6, Table 4, Figure 5b) was the same as the one obtained before the optimization step (models 2a1 and 2a2, Table 3). The model requires the presence of Perlidae (Plecoptera); if this taxon is not present, Baetidae (Ephemeroptera) must have an abundance of at least one but less than or equal to four. If its abundance is higher, the regulation is not fulfilled. The rules generated by the leaves of the chosen DTMs were also checked against the fulfillment of the recreational and agriculture fecal coliform regulations (Tables S1 and S2, Supplementary Materials), as well as against the validation datasets (Tables S3 and S4, Supplementary Materials), verifying that all points that met the requirements of the DTMs satisfied the analyzed fecal coliform standards. The stability of the models of the same class (e.g., 3-fcv and 0.10 as PCF) was determined by the variation among the correctly classified instances (CCI) and Kappa statistics obtained from the three-fold cross validation.
The results shown in Supplementary Materials (Tables S5 and S6) demonstrate that, on average, the standard deviation represents 20% of the mean of the CCI and 61% of the mean of the Cohen's Kappa statistics for the models of the recreational fecal regulation. For the agriculture regulation models, the standard deviation is, on average, 14% of the mean of the CCI and 73% of the mean of the Kappa statistics. This revealed that the CCI deviation was acceptable, while the variation range of the Kappa statistics was high.
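The rule of the second DTM (agriculture fecal regulation) described above is simple enough to write as a function. The sketch below is one reading of the prose description, not a transcription of Figure 5b: the function name is hypothetical, abundances are assumed to be specimen counts per 5 m², and the case in which both taxa are absent is treated as non-fulfillment, in line with the ecological criterion that at least one taxon of the DTM must be present.

```python
def meets_agriculture_fc_rule(perlidae, baetidae):
    """Screen a site against the agriculture fecal coliform rule.

    perlidae, baetidae: abundance counts (assumed specimens per 5 m^2).
    Returns True when the rule predicts that the regulation is fulfilled.
    """
    if perlidae > 0:
        return True  # Perlidae present
    # Perlidae absent: Baetidae must be present with an abundance of 1 to 4.
    return 1 <= baetidae <= 4


# Invented field counts for illustration.
for counts in [(2, 0), (0, 3), (0, 7), (0, 0)]:
    print(counts, meets_agriculture_fc_rule(*counts))
```

A negative result from such a screen does not prove non-compliance; as the text notes, the trees are intended to flag sites that need follow-up microbiological analysis.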

Model Relevance and Optimization from a Statistical Point of View
Classification trees successfully modeled the abundance of some macroinvertebrates taxa as a proxy indicator of the fulfillment of two Ecuadorian fecal coliform regulations for water use. One decision tree model (DTM) was obtained in the development stage, and one after the optimization phase. Furthermore, both DTMs were also confirmed with the validation datasets. In both cases, the models had a maximum of three variables that were hierarchically structured as levels of knowledge, allowing their rules to be easily applied [20]. However, the inclusion of a large number of variables would result in a complex DTM with many rules that would hamper its application [71]. Additionally, this technique is non-parametric and non-linear. Consequently, the independent and dependent variables are not assumed to have a linear relationship [57].
It was not possible to obtain a specific DTM to check the raw water coliform regulations. This was because the same locations satisfied both agriculture and raw water regulations. However, the DTM obtained to verify the agriculture regulation could be used to check the raw water fecal coliform regulation, as the threshold of the agriculture regulation is more stringent. With a new dataset that includes sites meeting only the raw water coliform regulation, a specific model for checking the fulfillment of this standard could be constructed. Before the optimization phase, no models were obtained with the presence-absence dataset, and a DTM was found only with the abundance dataset. Most likely, this happened because the presence-absence dataset was binary (i.e., 0 and 1), while with the abundance dataset, the classification tree technique probably had more attribute values with which to construct the rules of classification. In this regard, Maimon and Rokach [63] noted that with the use of binary data the manipulation of categorical data is simplified and its normalization is eliminated, which makes it more difficult for binary data to be clustered. From a statistical point of view, after the optimization process in which the false positive errors were more costly than the false negatives [58], two DTMs (model 2ap3 and from models 2ap3 and 2ap4; Table 4) were obtained from the presence-absence information and six models resulted from the abundance datasets (1a4, 1a5 to 1a7, 1a8, 1a9 to 1a12, 2a3 to 2a6 and 2a7 to 2a11; Table 4). For the recreational fecal coliform regulation, it was not possible to construct a reliable model with the presence-absence dataset. With the abundance dataset, the same decision tree model for the agriculture fecal coliform regulation was obtained before and after the optimization process, up to the point at which the false positives (FPs) were weighted four times with the help of a cost-sensitive classifier (CSC). 
When the FP was weighted from five to 12, the rules generated by the trees changed their order, resulting in a new, reliable DTM (models 2a7 to 2a11; Tables A2-A4) with the same final outcomes as the previous DTM (models 2a3 to 2a6; Tables A2-A4). The maximum correctly classified instances (CCI) and Kappa statistics and the lowest confusion entropy of a confusion matrix (CEN) were obtained when the FP was weighted twice, whereas weights higher than 12 produced unreliable decision tree models. The DTMs resulting from the abundance dataset for the recreational fecal regulations (models 1a4, 1a5 to 1a7, 1a8, and 1a9 to 1a12; Tables A2 and A3) were shown to be reliable when the FP was weighted from two to nine with the CSC, reaching the maximum CCI and Kappa statistics and the lowest CEN when the weight was seven (model 1a8; Tables A2 and A3). In this regard, Maimon and Rokach [63] showed that selecting the optimum weight for false positives requires a sensitivity analysis of the effect of its value on the accuracy of the resulting model.
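To make the effect of weighting false positives more concrete, the sketch below shows one common way a cost matrix can be applied: predicting the class with the minimum expected cost. This section does not state whether the models here were built this way or by reweighting the training instances, so the code is only an illustration of the general idea. The function names, the labels, and the convention that a false positive means predicting "meets" for a site that does not meet the regulation are assumptions.

```python
def predict_min_expected_cost(p_meets: float, fp_cost: float,
                              fn_cost: float = 1.0) -> str:
    """Return the label with the lowest expected misclassification cost.

    p_meets: estimated probability that a site meets the regulation.
    fp_cost: cost of predicting 'meets' for a site that does not (assumed
             definition of a false positive).
    fn_cost: cost of predicting 'does not meet' for a site that does.
    """
    if not 0.0 <= p_meets <= 1.0:
        raise ValueError("p_meets must be a probability")
    cost_if_predict_meets = (1.0 - p_meets) * fp_cost
    cost_if_predict_fails = p_meets * fn_cost
    # Ties go to the conservative label.
    if cost_if_predict_meets < cost_if_predict_fails:
        return "meets"
    return "does not meet"


def decision_threshold(fp_cost: float, fn_cost: float = 1.0) -> float:
    """Probability above which 'meets' becomes the cheaper prediction."""
    return fp_cost / (fp_cost + fn_cost)
```

Under these assumptions, a leaf in which 75% of the training sites meet the regulation is labeled "meets" with equal costs, but not once false positives cost four times as much, because the required probability rises from 0.5 to 0.8. This is one way in which a higher FP weight can change the rules a tree produces.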
During the optimization process, two groups of reliable decision tree models (DTMs) (models 1a5 to 1a7 and 1a9 to 1a12; Table 4), constructed to evaluate the recreational fecal regulation, showed the same trees with the same taxa, but with different abundance requirements. When the abundance was higher, the DTM was more reliable. This was likely because WEKA tried to increase the model accuracy when the false positives (FPs) were reweighted with the cost-sensitive classifier (CSC); a higher abundance of Scirtidae was then required because the dataset was relatively small.
With regard to the stability of the models, classification trees built with relatively small datasets tend to be unstable [28], a pattern that was also found in the selected DTMs. In the analysis of the variation of the correctly classified instances (CCI) and Kappa statistics, the first parameter (i.e., CCI) appeared more stable than the Kappa statistics. This typically happens when a dataset is relatively small and therefore has limited extractable information, since Kappa statistics values represent the information content of the dataset [26].
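The stability comparison above rests on two quantities computed per fold, CCI and Cohen's Kappa, and on the standard deviation expressed as a fraction of the mean. The following is a minimal sketch of that calculation for a two-class problem. The confusion-matrix counts in `ILLUSTRATIVE_FOLDS` are invented for illustration (three folds of 11 sites each) and are not taken from Tables S5 and S6; the use of the sample standard deviation is also an assumption.

```python
from statistics import mean, stdev


def cci_and_kappa(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Return (CCI as a fraction, Cohen's Kappa) for a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    if n == 0:
        raise ValueError("empty confusion matrix")
    observed = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one class occurs")
    return observed, (observed - expected) / (1.0 - expected)


def relative_sd(values: list[float]) -> float:
    """Sample standard deviation as a fraction of the mean."""
    return stdev(values) / mean(values)


# Hypothetical (tp, fp, fn, tn) counts for three folds; not data from the study.
ILLUSTRATIVE_FOLDS = [(5, 1, 1, 4), (4, 2, 1, 4), (6, 0, 1, 4)]

scores = [cci_and_kappa(*fold) for fold in ILLUSTRATIVE_FOLDS]
cci_values = [cci for cci, _ in scores]
kappa_values = [kappa for _, kappa in scores]

print(f"relative SD of CCI:   {relative_sd(cci_values):.0%}")
print(f"relative SD of Kappa: {relative_sd(kappa_values):.0%}")
```

With folds this small, moving a single site between cells of the matrix shifts Kappa proportionally more than CCI, because Kappa rescales the observed agreement by the agreement expected by chance. In this invented example the relative standard deviation of Kappa is larger than that of CCI.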

Model Relevance and Optimization from an Ecological Point of View
With regard to the organic pollution tolerance of taxa, the BMWP-Col index gives a sensitivity score ranging from one, for the most tolerant families, to 10, for the least tolerant macroinvertebrates. The tolerance values of the taxa in the final decision tree models (DTMs) shown in Figure 5 were 10 for Perlidae (Plecoptera), six for Scirtidae (Coleoptera), and five for Baetidae (Ephemeroptera) [72]. In the Ecuadorian Andes, Scirtidae was found in clean and slightly polluted rivers, while Baetidae was found in clean places as well as in some polluted sites, but not in very polluted points [73]. Likewise, Perlidae was present in pristine conditions and unpolluted places in the Andes of Ecuador [73,74]. With regard to the relationship between fecal coliforms and biological water quality in the Cuenca River basin, it was found that fecal coliforms were the explanatory variable for the presence of Physidae [23], which has a low tolerance score of three [72], in places where the biological water quality varied from poor to moderate. In the same way, it was established that one of the explanatory variables for the presence of Perlidae was fecal coliforms [22]. In addition, Acosta and Hampel [24] found, in the Cuenca River basin, that fecal coliforms were the only variable that had relative importance in the distribution of the macroinvertebrate communities in the rivers of the moorland. The authors also pointed out that fecal coliforms influenced the structure of the benthic communities in rivers with urban influence. The two final DTMs chosen in this research (models 1a5 to 1a7 and 2a3 to 2a6; Table 4 and Figure 5) show that the three sensitive macroinvertebrate taxa (i.e., Perlidae, Scirtidae and Baetidae) may also be sensitive to fecal pollution.
The decision tree model (DTM) resulting from model 1a4 (Tables 4 and A4), which was constructed for the recreational regulation analysis with the abundance dataset, did not pass the ecological examination because Chironomidae, whose tolerance score is two, was present in one of its leaves (rules). This situation could have arisen because Chironomidae was identified to family level and not to a lower taxonomic level. In some instances, a family such as Chironomidae includes subfamilies and species with large differences in tolerance to pollutants [11]. Similarly, two DTMs obtained from model 2ap3 and from models 2ap4 and 2ap5, which were constructed with the presence-absence dataset, were discarded. In both DTMs, the agriculture regulation was fulfilled without the presence of Perlidae and Baetidae. However, neither of these taxa was registered in polluted and very polluted places [73,74], so both DTMs could give erroneous outcomes. Both models could nevertheless be modified, retaining only the part of the decision trees that could give reliable results. In this regard, in data mining models such as decision trees, a single model can be modified into multiple models, and the resulting models can operate in a large variety of conditions [63].
The percentage of occurrence of the analyzed taxa in the sampled sites in the Machangara River basin was as follows: Perlidae 36%, Scirtidae 36% and Baetidae 82%. The absence of these taxa in other areas may be due to specific reasons. For example, the habitat in some places may be unstable [75], or it may not be suitable for a specific taxon [76,77]. In suitable environments, a taxon could be temporarily absent, for example, due to migration or seasonal variation [73]. Likewise, the abundance, a fundamental parameter in the chosen decision tree models (DTMs), may also fluctuate seasonally. Jacobsen [73] found that the density of macroinvertebrates is much higher in the dry season than in the rainy season in the Ecuadorian highland streams. However, a higher abundance of macroinvertebrates would not influence the final results of the chosen DTMs, as the maximum threshold of the required abundance is four. In some areas that were close to livestock, the concentration of fecal coliforms in the river was low. Perhaps the riparian habitats were able to take up pollution transported by run-off from livestock areas [78], or the run-off volumes were small and transported only minimal amounts of pollutants into the streams and rivers. Alternatively, the river sites located below livestock areas may have experienced an unstable habitat from recurring shifts in pollutant concentration, especially during the rainy season.

A Possible Screening Tool for Microbial Pollution
The intake of microbiologically contaminated water is a great concern from a human health perspective [6]. The main sources of organic pollution to surface water are wastewater, storm water outfalls, and livestock and wildlife feces [5,79]. Additionally, the presence of pathogens in the water shows a good correlation with the presence of fecal contamination [80]. As a result, fecal bacteria or thermotolerant bacteria have been used as the main indicators of fecal pollution and of the possible presence of disease-causing organisms [6,7,81].
The procedure to sample and to identify the presence or absence and abundance of selected macroinvertebrate families in a river takes less than one hour for individuals who have been trained in identification and sampling protocols. This activity can be applied in the field by a person with minimal training. Since the models allow personnel to focus on a few taxa of key indicator importance, not all taxa need to be identified, and the focus can be placed on searching for particular groups. Conversely, standard methods to measure fecal indicator bacteria for recreational, irrigation or drinking water uses require at least 24 h to obtain results. For the detection of E. coli or thermotolerant coliforms, several methods have been recommended by the International Organization for Standardization (ISO), including procedures such as most probable number (MPN) [5]. Furthermore, this detection of water pollution needs to be performed at least daily [82]. In contrast, the models (DTMs) introduced in this work could be used as inexpensive (proxy) bioindicators for fecal contamination that do not require laboratory support or highly qualified personnel. As a result of this research, the application of the decision tree models (DTMs) offers a simpler and faster proxy method to assess fecal pollution in rivers.
It is important to note that the two decision tree models (DTMs) introduced and chosen for application in this research can be improved both by collecting more data from the same sites in different seasons and by collecting more data from new sites in the Machangara River basin in the dry and rainy seasons. In this way, the variation in taxa between the two seasons [73] can be included. These new data can be used to update the current DTMs. The models introduced in this work should also be tested in different river basins before being applied in other locations, due to the variation of environmental conditions such as weather, vegetation, and soil use. For this reason, it is recommended that samples be taken from different locations in relation to land use. According to Forio, et al. [83], testing these models in a wider range of situations over time will permit researchers to define the range of applications for which the model predictions are suitable. Additionally, after their first application, the results must be confirmed in a laboratory using traditional analysis.

Conclusions
Decision tree models (DTMs) were developed as preliminary assessment tools to check compliance with two Ecuadorian microbial water quality standards associated with fecal coliforms. These DTMs were based on the presence and abundance of Perlidae, Scirtidae and Baetidae in the Machangara River basin located in the southern Andes Mountains of Ecuador. The two best-performing models were adopted and can be applied by personnel with minimal training in the identification of the aforementioned taxa. The use of the cost-sensitive classifier (CSC) in the Waikato Environment for Knowledge Analysis (WEKA) package to eliminate false positives (FP) in the confusion matrix improved the reliability of the resulting models. The models introduced in this work still need to be tested over time to ensure their stability (and reliability) before being applied to areas with sources of fecal pollution. It needs to be stressed that these tools will not eliminate microbial tests, but they can serve as a rapid screening process and, moreover, allow the detection of key indicator invertebrate taxa related to water quality.
Supplementary Materials: The following are available online at http://www.mdpi.com/2073-4441/10/4/375/s1, Figure S1: Sampled sites location 2012, 2015 and 2016; Table S1: Verification of the fulfilment of the recreational with primary contact Ecuadorian water use regulations associated with fecal coliforms according the decision tree models (DTMs); Table S2: Verification of the fulfilment of the agriculture and livestock water use regulations associated with fecal coliforms according the decision tree models (DTMs), before the optimization process; Table S3: Verification of the fulfilment of the recreational with primary contact Ecuadorian water use regulations associated with fecal coliforms according the decision tree models (DTMs), with new dataset taken in July of 2015 (dry season) and March of 2016 (rainy season); Table S4: Verification of the fulfilment of the agriculture and livestock Ecuadorian water use regulations associated with fecal coliforms according the decision tree models (DTMs), with new dataset taken in July of 2015 and March of 2016; Table S5: Calculation of the variation of correctly classified instances (CCI) and Kappa statistics in the recreational fecal regulation models, in which three cross validations were manually applied; Table S6: Calculation of the variation of correctly classified instances (CCI) and Kappa statistics in the agriculture fecal regulation models, in which three cross validations were manually applied.
Acknowledgments: This research was executed in the context of the VLIR-UOS IUC Programme-University of Cuenca and the VLIR Ecuador Biodiversity Network project. The authors would also like to extend their gratitude to the Council of the Machangara River basin for allowing the use of the field information collected from the Construction of Integrated Management Plan of the Machangara Basin project.
Author Contributions: Ruben Jerves-Cobo was involved in sampling preparation, supported the sampling, analyzed the data and wrote the article. Xavier Iñiguez-Vela and Gonzalo Cordova-Vela prepared and performed the sampling campaign. Catalina Diaz-Granda helped to prepare and to support the sampling campaign. Wout Van Echelpoel, Felipe Cisneros, Ingmar Nopens and Peter L.M. Goethals were involved in data analysis and writing the article.

Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations
The following abbreviations were used in this manuscript: