Application of Bipartite Networks to the Study of Water Quality

Water is a basic natural resource for life and the sustainable development of society. Methods to assess water quality in freshwater ecosystems based on environmental quality bioindicators have proven to be low cost, reliable, and can be adapted to ecosystems with well-defined structures. The objective of this paper is to propose an interdisciplinary approach for the assessment of water quality in freshwater ecosystems through bioindicators. From the presence/absence of bioindicator organisms and their sensitivity/tolerance to environmental stress, we constructed a bipartite network, G. In this direction, we propose a new method that combines two research approaches, Graph Theory and Random Matrix Theory (RMT). Through the topological properties of the graph G, we introduce a topological index, called J P ( G ) , to evaluate the water quality, and we study its properties and relationships with known indices, such as Biological Monitoring Working Party ( B M W P ) and Shannon diversity ( H ′ ). Furthermore, we perform a scaling analysis of random bipartite networks with already specialized parameters for our case study. We validate our proposal for its application in the reservoir of Guájaro, Colombia. The results obtained allow us to infer that the proposed techniques are useful for the study of water quality, since they detect significant changes in the ecosystem.


Introduction
One of the Sustainable Development Goals of the 2030 Agenda for Sustainable Development is to ensure the availability and sustainable management of water and sanitation for all and to preserve water quality by reducing pollution [1,2]. One of the less expensive and, at the same time, most reliable tools to assess water quality of aquatic ecosystems is the use of indicator organisms (bioindicators) of environmental stress [3,4]. Furthermore, by knowing the responses of these communities to certain environmental stressors, one can determine the cause of environmental deterioration and propose management and restoration strategies [5][6][7].
In (1), β i is the tolerance value to pollution, δ(β i ) is the number of macroinvertebrate families with the same tolerance value, and α ij are the abundances of macroinvertebrate families, the relationships (iRj) between aquatic macroinvertebrates identified at the order level, and their tolerance value.
On the other hand, the results shown in [41], derived from the analysis and characterization of random bipartite networks, allow their application to the study of the disturbances of ecological systems represented by bipartite networks. It has been demonstrated [41] that there exist universal properties of random bipartite networks that could help us understand the dynamical behavior of the systems they represent, regardless of specific details, such as the number of vertices and connections.
The objective of this work is to propose an interdisciplinary approach for water quality assessment in freshwater ecosystems through bioindicators. For this purpose, bipartite networks provide the main element for the integration of Graph Theory and Random Matrix Theory, the two approaches used in our proposal. In the first part of this work (Section 2), we describe the construction and interpretation of the bipartite network G, representing a lentic system. Then, we formalize its analysis by means of the topological index JP(G), which we always contrast with the widely used BMWP(G) index, stressing the differences and advantages of the JP(G) over BMWP(G). Moreover, we validate the use of the JP(G) index by its application to the water quality assessment of a real-world case: The Guájaro Reservoir, Colombia. In the second part of this work (Section 3), we perform a scaling analysis of random bipartite networks with parameters already specialized on the Guájaro Reservoir. We then show that by properly defining a universal curve for the average Shannon entropy of the eigenvectors of the adjacency matrices of the bipartite network G, it is possible to define water quality classes equivalent to those obtained from the topological study of Section 2.

Topological Analysis to Assess Water Quality
The presence or absence of macroinvertebrate families and their tolerance or sensitivity to pollution are fundamental elements in the construction of biotic indices for the assessment of water quality. In [40], this phenomenon is represented geometrically by a bipartite network G, which is defined as follows. Definition 1. Let G = G(V, E) be a bipartite network with weights α ij and vertices V = {β i , A j }, where β i , with i = 1, 2, . . . , N 1 , are the tolerance values to pollution of the N 2 taxa of macroinvertebrate A j , with j = 1, 2, . . . , N 2 ; in this case, the taxonomic identification is at the order level. E is the set of edges (relations) β i ∼ A j . The vertex degree of β i , denoted by δ(β i ), is the number of macroinvertebrate families characterized by the tolerance value to pollution β i .
The JP(G) index of Equation (1) is rewritten as The basic definitions of Graph Theory and how to compute the JP(G) index are shown in Appendix A. The construction of the network G is shown in Figure 1.
. . , N 1 , are the tolerance values to pollution of the N 2 taxa of macroinvertebrate A j , with j = 1, 2, . . . , N 2 . In G, each edge represents a different macroinvertebrate family whose thickness depends on the relative abundance. Note that the vertices of the set A j need the taxonomic identification to be at the order level; thus, one should be careful on their labeling because different macroinvertebrate families could have the same order with different tolerance values to pollution.
The structure of aquatic macroinvertebrate communities is defined by the number of species (richness) and their diversity. The richness of such communities can often be assessed at higher taxonomic levels; for example, gender, family, and order [42]. Furthermore, the richness measures reflect the diversity of the aquatic assembly. On the other hand, the relative abundance is the proportion of a taxon regarding all the taxa contained in an ecosystem. The relative abundance determines how rare, common, or dominant a taxon is. From now on, unless otherwise indicated, the elements of the network G are defined as follows.
The absence or presence of edges in network G, i.e., the absence or presence of some macroinvertebrate families, is directly related to the water quality. Thus, from the construction of the bipartite network G, we have that the fewer the number of connected components, the higher the macroinvertebrate family diversity and, thus, the better the water quality, and vice versa. The previous observation can be formalized as follows: If G contains r connected components, then there is an inverse relation between the water quality and r. As examples, in Figure 2, we show graphs with r = 1, 3, and 8 connected components (from left to right); they describe a completely clean water system, a moderately polluted water system, and a heavily polluted water system, respectively. That is, the water quality decreases with the decrease of network connectivity α.
The range of variation of ecological quality indices is a fundamental part of ecosystem assessment, as it allows the ecological status of the ecosystem to be classified [43]. In particular, the range of mobility of the biotic indices used for water quality assessment allows the definition of quality classes, the meaning of index ranges, and colors to make cartographic representations [12]. In this direction, Proposition A1 and (4) allow us to obtain the minimum and maximum values that the JP(G) index can reach in different contexts. The next result also appears in [40]. If N 1 is the maximal tolerance value, then Note that the extreme values found for the JP(G) index are explicitly related to the maximal tolerance value to pollution, N 1 and the rare and dominant taxa. By applying the previous results to the specific problem of assessing water quality, we have that:

•
If the macroinvertebrate families' abundance in the system is uniform (α ij = k), then JP(G) = The topological index JP(G) is a function of the maximal abundance of one or more families, that is, it is in close relation to the load capacity of the system and the present dominant families; therefore, it provides a more objective measure than the BMWP(G) index (where the presence or absence of a single family can significantly modify the water quality evaluation) and states the functional relation between the uniformity and diversity of macroinvertebrate families.
Notice that In addition, with a similar argument, we can get that BMWP(G) · log 2 (ϕ) ∆ i ≤ JP(G). Therefore, we can establish theoretical/qualitative relationships between BMWP(G) and JP(G) indices, and these can be formalized as follows: Given a network G, we have that On the other hand, if N 1 is the maximal tolerance value, then Relationship (5) allows us to infer that, if all tolerance values have the same number of macroinvertebrate families, then the value of the BMWP(G) index is constant (unchanged). In addition, if M is the number of macroinvertebrate families and N 1 is the maximal tolerance value, then The practical implication of the Relation (6) for the study of water quality is that it allows us to know the range of the number of families of macroinvertebrates present in the experiments in relation to their tolerance values associated with the BMWP(G) index. For proof of (5) and (6), see Appendix A.
Here, we define the ecological quality ratio (EQR), denoted by σ, as the quotient between the observed value and the expected value of the JP(G) index at a reference site. As a consequence of the above results, the maximum value of the ecological status classes defined by the JP(G) index will determine the EQR. If N 1 is the maximal tolerance value, then Note that the value of σ is expressed as a numerical value between 0 and 1, which implies that, if σ approaches zero, the ecological status of the ecosystem is low, whereas if σ approaches one, then the ecological status of the ecosystem is high. Furthermore, note that the JP(G) index is normalized, so that the parameter σ allows the stratification and evaluation of the stress level of the system as a function of the maximal tolerance value and dominant families, both measurable parameters of the system.

Water Quality Classification
Now we apply the BMWP(G) and JP(G) indices to a lentic system. To that end, we use the data from the sampling and tolerance values for the macroinvertebrate families identified in the Guájaro Reservoir, Colombia (for more information on the study area, see Figure 1 in [44]). Table 1 is an adaptation of the data reported in Tables 3 and 6 in [44] for the Guájaro Reservoir: Macroinvertebrate families grouped at order level A j , tolerance values β i , and the abundance of each macroinvertebrate family α ij -elements used to construct the bipartite network G. The bipartite network G in Figure 3 is the geometric representation of the data in Table 1; the thickness of the edges represents the abundances of each taxon, normalized with the function log 2 (α ij + 1), and the macroinvertebrate family richness. We use Gephi 9.2 [45] to construct the graphs.  In general, the studies that apply the BMWP index to evaluate water quality classify this quality in five classes (see [46,47] and its references). The most common approach to comparing the efficiency and scope of application of biotic indices is to compare them through their correlation with physiochemical parameters [48,49]. Here, we consider two important parameters as references: Dissolved oxygen (DO) and temperature (T), because there is an inverse correlation between these parameters. Changes in species composition and diversity of benthic macroinvertebrates [50,51], the BMWP index [52], and water quality [53][54][55] are directly correlated with the concentration of DO. In particular, it has been shown that the most tolerant macroinvertebrate families to pollution are most abundant under hypoxic conditions [56], and that sensitive families to pollution are strongly associated with high concentrations of DO [48]. It has been verified that values obtained for given biotic indices and the quantity of DO have a direct correspondence (see, e.g., [57,58]).
To classify the water quality in our study, the class width is determined by the quotient between the expected value of the JP(G) index at a reference site and the number of classes. The expected value of the JP(G) index at a reference site is obtained when the average abundances are uniformly distributed and there exist all possible relations (edges) in network G (see e.g., Figure 9). Table 2 shows the mobility ranges between classes for the BMWP(G) and JP(G) indices together with DO for the Guájaro Reservoir. It is worth highlighting that we have an additional parameter, σ, to assess the ecological status (or stress) of the Guájaro Reservoir (see Table 2).

Comparison between Indices to Study Water Quality
On the other hand, to observe a relationship between tolerance values and abundances, each taxon was grouped according to its contamination tolerance value, and its abundances were added; the result was normalized with the logarithm function (see Table A1).
The Shannon diversity index is one of the most important indices that frequently accompanies other indices in the assessment of water quality [24,59,60]. Its comes from information theory [61] and is applied in natural sciences to measure species diversity in biological communities. It is defined as where ρ i is the relative abundance of the species i and M is the total number of species in the community. With this prescription, the larger the value of H , the larger the species diversity in a given community (and vice versa) [62][63][64]. In our application, 0 ≤ ρ i ≤ 50 and M = 55; i.e., the number of macroinvertebrate families in the Guájaro Reservoir. Finally, to support the conclusions drawn from Figure 7, we constructed 100 random samples of the macroinvertebrate families present in the Guájaro Reservoir and computed the corresponding BMWP(G), JP(G), and H indices. Moreover, we computed the Pearson coefficient between these indices.
In the following section, we complete our proposal for the assessment of water quality by introducing a technique that can be used either in addition to or completely independently of the index JP(G) presented above.

Spectral Analysis of the Bipartite Network G Associated with the Guájaro Reservoir
Here, we introduce a second technique to assess water quality, which is based on the spectral properties of the bipartite network G. In particular, we will apply below the approach reported in [41], where Random Matrix Theory (RMT) was used to study the spectral properties of bipartite graphs.
Recall that network G is characterized by the sizes of the disjoint subsets, N 1 and N 2 , and the connectivity α ∈ [0, 1]. Let n = N 1 + N 2 and m = min(N 1 , N 2 ); i.e., n is the size of G and m the size of the smaller subset of G. We define the elements of the adjacency matrix A of G as if there is an edge between vertices i and j, 0 otherwise.
Since we want to build an RMT ensemble, we choose ij as statistically independent random variables drawn from a normal distribution with zero mean and variance one. In addition, ij = ji , since G is assumed to be undirected.
As shown in [41], a bipartite network produces block adjacency matrices when the vertices are labeled according to the subset they belong to. In Figure 4a,b, we show two examples of block adjacency matrices A of bipartite graphs G characterized by n = 38 and m = 8. Two values of α are used in Figure 4: 0.25 and 0.75, so that the matrix of Figure 4a is more sparse than that of Figure 4b. Note that in Figure 4, we are already using the parameter values for the network G that we are interested in: n = 38 and m = 8 or N 1 = 8 and N 2 = 30, as defined in the previous section.
In [41], by the scaling analysis of the average Shannon entropy S of the eigenvectors of the adjacency matrices of bipartite graphs G, it was shown that the spectral and eigenvector properties of G are scaled by the parameter ξ. There, ξ was defined as with where α * characterizes the localization-to-delocalization transition of the average Shannon entropy for a fixed ratio m/n. So, in the following, we will verify the scaling of S with ξ by finding the values of C and δ in Equation (10) for our particular application, i.e., for bipartite graphs with n = 38 and m = 8, and, afterwards, we will use the universal scaled curve S vs. ξ as a calibration curve to qualify water quality. The Shannon entropy for the normalized eigenvector Ψ k is given as Indeed, in Figure 4c,d, we present S k for the eigenvectors of the adjacency matrices of panels (a,b), respectively. Note that the block structure of the adjacency matrices makes the corresponding Shannon entropies to be grouped into two sets characterized by two different average values. The groups are separated in Figure 4 by vertical dashed lines: One group corresponds to k ∈ [1,8] ∪ [31,38], and the other group to k ∈ [9,30]. However, to compute S here, we average the entropies of all of the eigenvectors of the matrix A.
Next, we use exact numerical diagonalization to compute the eigenvectors Ψ k of the adjacency matrices of large ensembles of random bipartite graphs G characterized by the parameter set (n, m, α). Then, in Figure 5a, we show curves of S vs. α for four values of the network size n for the fixed ratio m/n = 8/38, the ratio of interest. Note that when α → 0, the vertices of network G are isolated and the corresponding matrix A is a diagonal random matrix, better known in RMT as the Poisson limit; in this case, the eigenvectors of A have a single component different from zero so S = 0. In the opposite limit, α → 1, the bipartite network G is complete, and S gets a maximum value S max that depends on n. Thus, to properly compare the average Shannon entropy corresponding to graphs of different sizes, we normalize it to S max . Moreover, the curves of Figure 5a show a transition from S ≈ 0 to S ≈ S max as we increase α from α ≈ 0 to α = 1-a signature of the delocalization of the eigenvectors of A.   It is important to stress that with the statistical study of Figure 5a, we intend to explore all possible parameter combinations of the system under study-in our case, the Guájaro Reservoir characterized by m = 8 and n = 38 that corresponds to the right-most curve in Figure 5a, but also of systems with equivalent characteristics; see the other curves in Figure 5a, all characterized by the ratio m/n = 8/38.
Then, as in [41], we define the localization-to-delocalization transition point α * as the value of the connectivity for which S /S max ≈ 0.5. In Figure 5b, we present α * vs. n in a log-log plot where we can clearly see a power-law behavior. Therefore, the fitting of Equation (10) gives C = 4.79 and δ = −0.915. Now, we are ready to verify the scaling of the Shannon entropy with the parameter ξ, so, in Figure 6a, we plot the curves of Figure 5a, but now as a function of ξ.
We can clearly see in Figure 6a that all curves S /S max vs. ξ fall on top of the other as anticipated, since ξ is the scaling parameter of the network G.  Table 3.
Indeed, we will use the universal curve of Figure 6a as a calibration curve to qualify water quality as follows. First, to ease the analysis in Figure 6b, we again plot the curve of Figure 6a, but interpolated, so it is now smoother. Now recall that the localized (delocalized) eigenvector regime, S ≈ 0 ( S ≈ S max ), corresponds to low (high) connectivity and, accordingly, corresponds to low (high) diversity of macroinvertebrate families in our problem of water quality evaluation. Therefore, for the network G, S ≈ 0 characterizes low water quality, while S ≈ S max corresponds to excellent water quality. So, the eigenvector localization-to-delocalization transition depicted by the curve S /S max vs. ξ of Figure 6b corresponds to the low-to-excellent transition in the water quality.
Therefore, we define the same five classes of water quality used in the previous section (low, regular, good, very good, and excellent) by prescribing ranges of Shannon entropy values. This prescription is quite arbitrary, so the user should choose the most appropriate one depending on the particular application. In particular, here we would like to assign the classes of low and excellent water quality to narrower ranges of S than for the rest of the classes. That is, we label the water quality as low (excellent) if the Shannon entropy of the corresponding network G falls in the lower (higher) 5% window of the full range of S . The rest of the labels are evenly distributed in between; see Figure 6b. In Table 3, we report the Shannon entropy ranges corresponding to the five classes of water quality.
Evidently, the ranges of the Shannon entropy defining the water quality classes correspond to well-defined ranges of the parameters ξ and, accordingly, α. The ranges of ξ and α are also reported in Table 3. Table 3. Water quality classes according to the values of S , ξ, and α.

Class Water Quality
S ξ α

Result Analysis
According to (4), if we take the average abundances (259 individuals) reported in [44], we assume that all macroinvertebrate families are present, N 1 = 8, and log 2 (259) ≈ 8, then JP(G) = 288 (expected value at the reference site) and BMWP(G) = 259; thus, both indices indicate that the water quality is excellent. In this particular case, by dividing 288 by five, we get that the class width is 57.6; see Table 2. In consequence, if there is no dominant taxon, then the BMWP(G) and JP(G) indices are closely related. The data reported in [44] for the Guájaro Reservoir allow the inference that BMWP(G) = 207 and the average score by station is 166, JP(G) = 142.34, and, on average, DO = 5.4 mg/L, which implies that, according to the BMWP(G) index, the water quality is very good, while the DO and JP(G) index values indicate that the water quality is just good; see Table 2. The fact that the water quality differs between the BMWP(G) and JP(G) indices is because the BMWP(G) index does not take into account the dominant families, which, in this case, are the most tolerant to pollution (75.26%); see Figure 3.
In Equation (7), if JP(G) = 142.34, N 1 = 8, and log 2 (259) ≈ 8, then we have that σ = 0.49. This implies that the stress level of the ecosystem is regular; see Table 2. Note that, with the application of the JP(G) index and the EQR, we obtain similar results for water quality and ecological status.
When calculating Pearson's correlation coefficient between the data in Table A1, we obtain that R = −0.778 (p = 0.02); this shows that there is a negative linear correlation between tolerance values and abundances. In this direction, when we simulate samples with all the macroinvertebrate families reported in [44], but with random abundances, we observe that while the JP(G) index is sensitive to population fluctuations, the value of the BMWP(G) index does not change. For example, in Figure 7, we can observe significant variations of the JP(G) index when there are direct or inverse correlations between abundances and tolerance values. It is important to note that when the quality of the water is low, the tendency is to find few resistant families whose abundances are high. Conversely, when water quality is good, diversity (families' richness and abundance) should be high, including sensitive families present in lower abundances, although resistant families can be found in different types of water quality. The calculation of Pearson's correlation coefficient between the BMWP(G), JP(G), and H indices allows us to infer that the BMWP(G) and H indices are highly correlated (R 2 = 0.76), while the correlations between the JP(G) and BMWP(G) indices and JP(G) and H are low (R 2 = 0.07 and R 2 = 0.13, respectively); see Figure 8. This implies that the index JP(G) shows significant differences as compared to the BMWP(G) and H indices. The scaling study of eigenvector properties of bipartite networks shown above allows us to relate the parameters S , ξ, and α to classes of water quality (see Table 3). To exemplify the application of the previous results on the particular case, we consider here (i.e., the Guájaro Reservoir) two cases: The hypothetical situation where all macroinvertebrate families reported in [44] are present (see Figure 9) and the actual bipartite network already shown in Figure 3. The network in Figure 9 has connectivity α = 0.229, which corresponds to a very good water quality according to Table 3, as expected, since all macroinvertebrate families are present in the lentic system. In contrast, the actual network representing the Guájaro Reservoir, shown in Figure 3, has a connectivity of α = 0.191, implying that the water quality is good; see Table 3. Note that this qualification is in complete agreement with that obtained with the application of the JP(G) index in the previous section.

Conclusions
The geometric representation of a phenomenon associated with an ecological system allowed the study of a topological index to assess water quality. This index, the JP(G) index, includes variables absent in other indices reported in the literature; for example, the qualitative and quantitative characteristics of the BMWP and H indices, dominant taxon, and macroinvertebrate family richness. In addition, the analytical properties of the network G, shown in the form of propositions, allowed us to make more objective inferences on water quality assessments because it was demonstrated that the topological index JP(G) is sensitive to population dynamics, i.e., the index proposed here is capable of detecting changes in water quality from the structure of the macroinvertebrate community, changes that are not perceptible by the BMWP and H indices independently. Moreover, the ecological quality ratio σ, defined through the index JP(G), describes the stress or ecological status of the ecosystem more reliably. In addition, in order to improve the definition of water quality classes, the JP(G) index and the well-known BMWP(G) index could also be related and compared with other indices with similar dynamics or that consider other environmental variables; for example, the Family Biotic Index or Simpson's diversity index. This may be the subject of a future work.
The study of the spectral properties of the bipartite networks allowed us to assess water quality through the network parameters. This statistical method allows the integration of systems with a greater number of randomly related nodes, that is, it allows the integration of a greater number of macroinvertebrate families, and not only the 38 of the real-world case we consider here. This provides us with a technique that promotes the use of complex systems where non-observable relationships are present, as long as the phenomena of interest are represented by bipartite graphs.
The approach presented is associated with manual data collection. Data collection can be affected by the accessibility of sampling sites and temporality, i.e., sampling sites must be spatially well distributed, taking into account the rainy and dry seasons because, in each season, the presence of macroinvertebrates will vary and some sampling sites may not be accessible if the shape of the terrain changes. Note that the precision in the assessment of water quality may be strongly affected by the type and time of sampling, as well as the available data. For example, in our study, the data available for the Guájaro Reservoir only take into account eight levels of tolerance to pollution and 55 macroinvertebrate families; other studies report up to 10 levels of tolerance and more than 100 macroinvertebrate families, which may guarantee a better water quality assessment.
In summary, the interdisciplinary approach to water quality assessment through bioindicator organisms proposed here allowed us to associate analytical, topological, and spectral properties of bipartite graphs with water quality classes. Moreover, we stress that our approach combining Graph Theory and Random Matrix Theory techniques could be adapted and applied to other phenomena related to complex bipartite networks.

Definition A3.
A bipartite network is a network whose vertices can be grouped into two subsets, X and Y, such that each edge joins a vertex in X with another vertex in Y. Particularly, a bipartite network is complete if every vertex in X is adjacent to every vertex in Y.
Proposition A1. If M is the number of macroinvertebrate families and N 1 is the maximal tolerance value, then Proof. Since log 2 (ϕ) ≤ log 2 (α ij ) ≤ log 2 (ν) and JP(G) = The other inequality can be obtained in a similar way.
The previous result allows the establishment of the mobility ranges of the JP(G) index. In addition, it allows us to know the number of macroinvertebrate families present in a given sample: The following result appears in [40].

Proposition A2.
If N 1 is the maximal tolerance value, then Proposition A3. If M is the number of macroinvertebrate families and N 1 is the maximal tolerance value, then M ≤ BMWP(G) ≤ N 1 M.