Information and Phylogenetic Systematic Analysis

Information in phylogenetic systematic analysis has been conceptualized, defined, quantified, and used differently by different authors. In this paper, we start with the Shannon Uncertainty Measure and derive from it an information measure I, which we apply to cladograms containing only consistent character states. We formulate a general expression for I, utilizing a standard format for taxon-character matrices, and investigate the effect that adding data to an existing taxon-character matrix has on I. We show that I may increase when character vectors that encode autapomorphic or synapomorphic character states are added. However, as added character vectors accumulate, I tends to a limit, which generally is less than the maximum I. We show computationally and analytically that $\lim_{c \to \infty} I = \log_2 t$, in which t enumerates taxa and c enumerates characters. For any particular t, upper and lower bounds in I exist. We use our observations to suggest several interpretations about the relationship between information and phylogenetic systematic analysis that have eluded previous, precise recognition.


Statement of Purpose
Phylogenetic systematic analysis involves using data (i.e., characters) to classify groups (i.e., taxa). Over the past half-century, the evolutionary systematic community has concentrated effort on practice, and several methods and many algorithms for producing branching diagrams from which hierarchical classifications may be obtained have been developed. One theoretical aspect to such practical activities that hitherto has received limited attention and therefore warrants thorough investigation involves the information content in such classifications, especially as this relates to their predictive qualities.
Our general purpose herein is to initiate such thorough investigation. Copious (e.g., genomic) data are becoming available from next-generation-sequencing methods, and classifications are proving to be useful in identifying human pathogens that were unknown heretofore (e.g., SARS, avian flu, novel Ebola strains); the ability to quantify information content could provide a complementary metric in systematic analyses. The challenge is shifting from acquiring to handling data: deciding how to utilize them most efficiently to answer questions about evolution and classification. Information content will constitute a paramount factor in making such decisions. Quantifying information content is the impetus for the current study.
Our specific purpose herein is to provide a first approach for quantifying information obtained in conducting phylogenetic systematic analyses. The bifurcating branching diagrams (i.e., cladograms) that are constructed on the basis of nonintersecting, nested character vectors are referred to in this paper as "consistent cladograms". Analyzing consistent cladograms facilitates quantifying the information that is obtained by systematists in constructing classifications (in the strict, technical sense according to Shannon [1] and Shannon and Weaver [2]). We introduce "apomorphic density format" as a normal form for taxon-character matrices from which consistent cladograms are constructed. Using equations from information theory, we derive information measure I. Quantifying information in this manner is important because any classification method involves dividing an ensemble into groups and obtaining an objective means to evaluate or compare different divisions for a particular ensemble as information accumulates. This provides a means for researchers to measure economy in data collection. We present numerical examples by defining "basal matrices" and using them as taxon-character matrix templates into which we add character vectors. This procedure simulates what occurs to I as systematists acquire additional information, encode that information as data in taxon-character matrices, and conduct cladistic analyses. These analyses thus allow us to study changes in I as a function of changes in character number. We observe that I has a limit as character number increases but that this limiting value, counterintuitively, is less than the maximum value. On the basis of these demonstrations, we devise more-precise interpretations for previous conceptions about information associated with conducting phylogenetic systematic analyses.
Achieving our general and specific purposes will remind researchers about the important role that information content can and should play in contemporary research. The resources available to conduct large-scale phylogenetic systematic analyses usually entail that information obtained is presented only in summary form (e.g., as cladograms), with data and analytical details often relegated to supplementary material, available as appended small-scale format text and in electronic repositories. The approach described subsequently herein enables researchers to quantify the information content in taxon-character matrices as they relate to the consistent cladograms derived from those matrices. Researchers specifically can evaluate characters, independently or in combination, on the basis of the information that they provide, relating that information to their state distributions on and, therefore, support for classifications. Such relations bear on evolutionary scenarios when the bifurcating branching diagrams and character states are interpreted respectively as phylogenetic trees (i.e., inferred evolutionary relationships) and trait origins (or transformations). Synthesizing additional theoretical considerations like this in phylogenetic systematic analyses ultimately will strengthen justification for the term "bioinformatics".

Consistent Cladograms
In cladistic analyses, cladograms are constructed from taxon-character matrices. Cladograms for which character states that are contained in character vectors nest hierarchically in a nonintersecting manner constitute "consistent cladograms" (i.e., consistent cladograms contain no homoplasious, or inconsistent, character states; in this paper, only characters that are encoded as binary character states are considered). We herein consider exclusively such cladograms because, for each, information measure I can be calculated solely from the taxon-character matrix, whereas calculating I for inconsistent cladograms requires, for each, accessing the taxon-character matrix as well as the corresponding cladogram itself.

Information Theory and Cladistic Analysis
The information that is obtained by constructing a consistent cladogram from a $t \times c$ taxon-character matrix containing only 0 (plesiomorphic) or 1 (apomorphic) character states is quantified by information measure I and calculated as a difference between information measures,

$$I = I_{\text{minimum}} - I_{\text{observed}}. \tag{1}$$

Both terms on the right side in Equation (1) derive from the Shannon Uncertainty Measure [1,2]. The total information capacity for a $t \times c$ taxon-character matrix, $I_{\text{minimum}}$, may be calculated in a surprisal analysis framework (e.g., [38]) as

$$I_{\text{minimum}} = \log_2 (tc). \tag{2}$$

Referring to [36], the observed information state for a cladogram, $I_{\text{observed}}$, is calculated as

$$I_{\text{observed}} = -\sum_{i=1}^{c} \frac{b_i}{d} \log_2 \frac{b_i}{d}, \tag{3}$$

in which $b_i$ enumerates the 1s that are contained in the ith column vector and d enumerates the 1s that are contained in the $t \times c$ taxon-character matrix [25,26,31,36]. Astute readers will notice that information I is presented here in a potential-based, negative manner, as a difference between initial state $I_{\text{minimum}}$ (i.e., complete ignorance) and final state $I_{\text{observed}}$ (i.e., current knowledge), to accommodate convention.
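Equations (1)-(3) translate directly into a few lines of code. The following is a minimal sketch rather than the implementation used for the analyses reported here; the function name `information_measure`, the list-of-rows encoding, and the small example matrix are illustrative assumptions.

```python
import math

def information_measure(matrix):
    """Return I = I_minimum - I_observed, in bits (Equations (1)-(3)).

    `matrix` is a list of t rows (taxa); each row lists c binary character
    states. The function name and input encoding are illustrative assumptions.
    """
    t = len(matrix)
    c = len(matrix[0])
    # b[j]: number of 1s in the j-th character vector (column)
    b = [sum(row[j] for row in matrix) for j in range(c)]
    d = sum(b)  # total number of 1s in the matrix
    i_minimum = math.log2(t * c)
    # Columns without 1s contribute nothing (0 log 0 is taken as 0)
    i_observed = -sum((bj / d) * math.log2(bj / d) for bj in b if bj > 0)
    return i_minimum - i_observed

# Hypothetical 4-taxon, 3-character matrix with column sums 4, 3, and 2
example = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
    [1, 1, 1],
]
print(f"I = {information_measure(example):.4f} bits")
```

Only the column sums and the matrix dimensions enter the calculation, which is why row order and column order are irrelevant to I.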

Apomorphic Density Format
Because neither Equation (2) nor (3) takes into account positions for 1s within character vectors, or order for character vectors within taxon-character matrices, neither factor affects the information measure I for consistent cladograms. Therefore, to facilitate calculating I by visual inspection, $t \times c$ taxon-character matrices from which consistent cladograms would be obtained can be modified in two ways. First, the 1s that are contained in each character vector can be considered as "sinking" toward the bottom and "piling" atop one another. Second, character vectors can be rearranged so that those containing the most 1s are shifted left and those containing the fewest 1s are shifted right. We define taxon-character matrices in this configuration to be in "apomorphic density format." Reformatting the matrices in this manner disconnects the relationship between taxa and character states, but, from the foregoing observations, imparts no effect on I (i.e., the same I would be obtained for consistent cladograms using taxon-character matrices in either original or apomorphic density formats). For instance, the following three matrices yield equal I for the consistent cladogram that is depicted in Figure 1, although calculating I by visual inspection is facilitated greatly using the rightmost one:

$$\underbrace{\begin{pmatrix} 1 & 0 & 0 & 0 & 1 \\ 1 & 0 & 1 & 1 & 0 \\ 1 & 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 1 \end{pmatrix}}_{\text{original data matrix}} \quad \underbrace{\begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 \end{pmatrix}}_{\text{`sinking' 1s to bottom of column vectors}} \quad \underbrace{\begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 & 1 \end{pmatrix}}_{\text{swapping column vectors}} \tag{4}$$

Figure 1. Consistent cladogram that would be obtained from cladistic analysis involving leftmost matrix (4); in that matrix, row i contains character states for taxon i and columns contain character vectors.
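The two modifications described above ("sinking" the 1s and shifting dense character vectors left) can be sketched as follows. This is an illustrative helper under our own naming (`apomorphic_density_format` is hypothetical), applied to a 4 x 5 example matrix whose character vectors contain 4, 1, 1, 2, and 2 ones.

```python
def apomorphic_density_format(matrix):
    """Return a binary matrix rearranged into apomorphic density format.

    Only the number of 1s per character vector is retained: the 1s of every
    column are piled at the bottom, and columns are ordered from most to
    fewest 1s. The function name is an illustrative assumption.
    """
    t = len(matrix)
    c = len(matrix[0])
    counts = [sum(row[j] for row in matrix) for j in range(c)]
    counts.sort(reverse=True)  # densest character vectors shift left
    # Row r (counted from the top) holds a 1 wherever the column contains
    # at least t - r ones, so that the 1s occupy the bottom of each column.
    return [[1 if k >= t - r else 0 for k in counts] for r in range(t)]

original = [
    [1, 0, 0, 0, 1],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
]
for row in apomorphic_density_format(original):
    print(*row)
```

Because the multiset of column sums is unchanged, any quantity that depends only on t, c, and the column sums (such as I) is unaffected by the rearrangement.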

A General Expression Quantifying I
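A general expression quantifying information measure I in association with constructing consistent cladograms can be derived from studying their associated $t \times c$ taxon-character matrices in apomorphic density format. Let $a_i$ be nonnegative integers that enumerate character vectors containing i 1s (character vectors containing a single 1 encode autapomorphic character states; character vectors containing t 1s encode character states that are synapomorphic for, and so define, the "ingroup"; therefore, we adopt notation in which the index variable i runs from 1 through t; no information is available from character vectors containing only 0s; we assume throughout this analysis that proper outgroup analysis has been conducted and outgroup character states comprise only 0s). For instance, the matrices in Equation (4) yield $a_1 = 2$, $a_2 = 2$, and $a_4 = 1$. Taxon-character matrices generally will yield the integers $a_1, a_2, \ldots, a_t$. This convenient prescription can be visualized using the general $t \times c$ taxon-character matrix, which is presented in apomorphic density format,

$$\begin{pmatrix}
1 \cdots 1 & 0 \cdots 0 & \cdots & 0 \cdots 0 & 0 \cdots 0 \\
1 \cdots 1 & 1 \cdots 1 & \cdots & 0 \cdots 0 & 0 \cdots 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
1 \cdots 1 & 1 \cdots 1 & \cdots & 1 \cdots 1 & 0 \cdots 0 \\
\underbrace{1 \cdots 1}_{a_t} & \underbrace{1 \cdots 1}_{a_{t-1}} & \cdots & \underbrace{1 \cdots 1}_{a_2} & \underbrace{1 \cdots 1}_{a_1}
\end{pmatrix}, \tag{5}$$

in which $c = \sum_{i=1}^{t} a_i$. Let d enumerate 1s over all c character vectors. Then

$$d = \sum_{i=1}^{t} i\,a_i. \tag{6}$$

From Equation (2) and matrix (5),

$$I_{\text{minimum}} = \log_2 \left( t \sum_{i=1}^{t} a_i \right). \tag{7}$$

From Equations (3) and (6) and matrix (5),

$$I_{\text{observed}} = -\sum_{i=1}^{t} a_i \frac{i}{d} \log_2 \frac{i}{d} = \log_2 d - \frac{1}{d} \sum_{i=1}^{t} i\,a_i \log_2 i \tag{8}$$

(a derivation for Equation (8) is provided in Appendix). Using Equations (1), (7), and (8), the information that is obtained by constructing any consistent cladogram from its corresponding $t \times c$ taxon-character matrix is

$$I = \log_2 \left( t \sum_{i=1}^{t} a_i \right) - \log_2 d + \frac{1}{d} \sum_{i=1}^{t} i\,a_i \log_2 i. \tag{9}$$

The units with which I is reported are bits, which quantify the information acquired in making binary decisions (i.e., choosing between two equiprobable events).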
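As a worked check on the general expression, the counts $a_1 = 2$, $a_2 = 2$, and $a_4 = 1$ quoted above for four taxa can be evaluated numerically. The sketch below assumes the count-based form I = log2(t Σ a_i) - log2 d + (1/d) Σ i a_i log2 i, with d = Σ i a_i; the function name and the dictionary encoding of the $a_i$ are our own illustrative choices.

```python
import math

def information_from_counts(t, a):
    """Return I in bits from counts a[i] of character vectors containing i 1s.

    `a` maps i (1..t) to a_i; absent keys mean a_i = 0. The name and the
    dictionary encoding are illustrative assumptions.
    """
    c = sum(a.values())                    # number of character vectors
    d = sum(i * n for i, n in a.items())   # total number of 1s
    i_minimum = math.log2(t * c)
    i_observed = math.log2(d) - sum(i * n * math.log2(i) for i, n in a.items()) / d
    return i_minimum - i_observed

# Four taxa with a_1 = 2, a_2 = 2, a_4 = 1: c = 5 and d = 10
print(f"I = {information_from_counts(4, {1: 2, 2: 2, 4: 1}):.4f} bits")
```

For this example, log2(20) - log2(10) = 1 and (2·2·1 + 4·1·2)/10 = 1.2, so the three terms sum to 2.2 bits.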

Consistent Basal Matrices and Conventions
We define a "consistent basal matrix" to be a $t \times c$ taxon-character matrix containing the fewest column vectors that are required to construct a consistent cladogram. For instance, two consistent basal matrices for four taxa are possible, (10.1) and (10.2); their character vectors contain four, three, and two 1s (the pectinate case) and four, two, and two 1s (the symmetric case), respectively. Consider the three consistent basal matrices for five taxa. Each consistent basal matrix can be assigned numbers for taxa; total 1s; second-greatest 1s within a character vector; and third-greatest 1s within a character vector. Then, in this representation, Equation (11.4) becomes (5 taxa; 14, 4, 3); Equation (11.5) becomes (5 taxa; 13, 4, 2); and Equation (11.6) becomes (5 taxa; 12, 3, 2). This representation system will prove to be convenient because it corresponds to the order in which plots for I "descend" relative to each other after the maximum has been reached (an observation that is discussed in subsequent sections).
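The representation (taxa; total 1s, second-greatest 1s, third-greatest 1s) is easy to compute from the number of 1s in each character vector. In the sketch below, the function name is hypothetical and the column sums for the three 5-taxon basal matrices are assumptions inferred from the representations quoted above, not copied from the matrices themselves.

```python
def representation(column_sums):
    """Return (taxa, total 1s, second-greatest 1s, third-greatest 1s).

    `column_sums` lists the number of 1s in each character vector of a
    consistent basal matrix; the vector with the most 1s is assumed to
    contain t 1s (it defines the ingroup).
    """
    counts = sorted(column_sums, reverse=True)
    return (counts[0], sum(counts), counts[1], counts[2])

# Column sums assumed for the three 5-taxon basal matrices (inferred from
# the representations quoted in the text).
five_taxon = [[5, 4, 3, 2], [5, 3, 2, 2], [5, 4, 2, 2]]

# Sorting the tuples in descending order compares total 1s first, then the
# second-greatest and third-greatest counts.
ranked = sorted((representation(m) for m in five_taxon), reverse=True)
for taxa, total, second, third in ranked:
    print(f"({taxa} taxa; {total}, {second}, {third})")
```

Descending tuple order reproduces the ranking rule stated in the text: total 1s decide first, and ties are broken by the second-greatest and then the third-greatest counts.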

Adding Character Vectors to Consistent Basal Matrices
Adding autapomorphic or synapomorphic character states to a consistent cladogram corresponds to adding appropriate character vectors to the associated consistent basal matrix. Additional character vectors may encode autapomorphic character states, containing a single 1 ("autapomorphic character vectors"), or synapomorphic character states, containing multiple 1s ("synapomorphic character vectors").
For four taxa, the four autapomorphic character vectors (12.1), (12.2), (12.3), and (12.4), each containing a single 1 in a different row, are equivalent in their contribution to information measure I. However, if any among these autapomorphic character vectors were added to a taxon-character matrix and that matrix subsequently were transformed into apomorphic density format, then the character vector(s) would be arranged in the form that is assumed by character vector (12.4), in which the 1 occupies the bottom row; all additional autapomorphic character vectors would assume this form, to facilitate calculating I by visual inspection. Similarly, the six synapomorphic character vectors (13.1), (13.2), (13.3), (13.4), (13.5), and (13.6), each containing two 1s in a different pair of rows, are equivalent in their contribution to I; but, as with the autapomorphic character vectors, to facilitate calculating I by visual inspection, all synapomorphic character vectors containing two 1s would be arranged in the same form, the one that is assumed by character vector (13.6), in which the 1s occupy the bottom two rows.

Achieving Maximum I as Autapomorphic Character Vectors are Added to Consistent Basal Matrices
Information as quantified by information measure I has an upper bound; it can be increased by adding to taxon-character matrices autapomorphic character vectors, but it decreases as added character vectors accumulate.
The maximum I that can be obtained by constructing a pectinate 4-taxon cladogram is approximately 2.2467 bits; this is achieved when six autapomorphic character vectors are added to consistent basal matrix (10.1). Thus, the following $4 \times 9$ taxon-character matrix, which is presented in apomorphic density format, will yield the maximum I when a cladogram is constructed from it.

$$\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1
\end{pmatrix}$$

For a symmetric 4-taxon cladogram, the maximum I that can be obtained is approximately 2.2226 bits; this is achieved when five autapomorphic characters are added to consistent basal matrix (10.2), a situation which also is presented in apomorphic density format:

$$\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1
\end{pmatrix}$$

As character vectors are added, I generally is greater for the pectinate than for the symmetric consistent basal matrix case (Figure 3). As with the case involving four taxa, the I that can be obtained by adding to 5-taxon consistent basal matrices character vectors encoding autapomorphic character states is bounded (Figure 4).

The order in which plots containing t taxa that are presented herein descend after their maxima I have been reached is determined by, first, the total 1s that are contained in the consistent basal matrix; second, for matrices containing equivalent total 1s, the second-greatest 1s within a character vector; and third, the third-greatest 1s within a character vector. Therefore, if two matrices representing the same taxa also contain the same total 1s, then the second-greatest 1s that are contained in a single character vector should be compared to rank-order their representative plots. For instance, between the two representations (a taxa; b, c, d) and (a taxa; b', c', d'), if b > b', then (a taxa; b, c, d) will descend with a greater I after the maximum has been achieved. If b = b', then c and c' can be compared in a similar manner.
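The maxima quoted above for four taxa can be located by brute force: add k autapomorphic character vectors to a basal matrix, compute I for each k, and keep the best. The sketch below is illustrative only; the helper names are hypothetical, and the basal column sums (4, 3, 2 for the pectinate case; 4, 2, 2 for the symmetric case) are stated as assumptions.

```python
import math

def info(t, counts):
    """I in bits from the number of 1s in each character vector (all > 0)."""
    c, d = len(counts), sum(counts)
    return math.log2(t * c) + sum((b / d) * math.log2(b / d) for b in counts)

def best_autapomorphy_count(t, basal_counts, max_added=200):
    """Return (k, I) for the number k of added single-1 vectors maximizing I."""
    values = [info(t, basal_counts + [1] * k) for k in range(max_added + 1)]
    k_best = max(range(len(values)), key=values.__getitem__)
    return k_best, values[k_best]

# Assumed column sums for the two 4-taxon consistent basal matrices
for label, basal in (("pectinate", [4, 3, 2]), ("symmetric", [4, 2, 2])):
    k, value = best_autapomorphy_count(4, basal)
    print(f"{label}: maximum I = {value:.4f} bits with {k} added vectors")
```

The search range `max_added=200` is an arbitrary cutoff; because I falls toward log2 t as vectors accumulate, the maximum for small t occurs after only a handful of additions.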
For t = 4 and 5 taxa, the plot for the pectinate consistent basal matrix case always descends with the greatest I and always achieves the absolute maximum I. For greater t, although the plot for the pectinate consistent basal matrix case eventually descends with the greatest I, the consistent basal matrix corresponding to the plot that reaches the absolute maximum varies (Figures 5-7). For six or seven taxa, only one other consistent basal matrix can yield a maximum I that exceeds that for the pectinate cladogram (Figures 5 and 6). For eight taxa, two matrices yield greater maxima (Figure 7).
Generally, as t is increased, more autapomorphic character vectors must be added for I to descend in "order" (i.e., according to the foregoing prescriptions). For instance, almost 50 autapomorphic character vectors must be added to distinguish graphically the top three plots for eight taxa (Figure 7), whereas fewer than 30 character vectors are required for seven taxa (Figure 6).
On the basis of these observations, two predictions concerning increases to I as autapomorphic character vectors are added to consistent basal matrices containing t > 5 taxa can be proffered: (1) the absolute maximum I will be obtained with a $t \times c$ taxon-character matrix that would yield most-parsimoniously a nonpectinate cladogram; (2) as t is increased, more and more taxon-character matrices that would yield most-parsimoniously nonpectinate cladograms will be

Achieving Maximum I as Synapomorphic Character Vectors are Added to Consistent Basal Matrices
The information measure I also can be increased by adding to taxon-character matrices synapomorphic character vectors; but, ultimately, adding to taxon-character matrices containing t taxa character vectors containing x 1s, where $2 \le x \le t$, will decrease the I that is obtained by constructing a cladogram, just as adding autapomorphic character vectors will. Greater values of x effect greater decreases (e.g., representation (8 taxa; 24, 4, 4) in Figures 8 and 9).

Figure 8. Plots relating information measure I and added character vectors c containing two 1s (i.e., synapomorphic character states grouping two taxa) for four consistent basal matrices for eight taxa. Three plots descend qualitatively in a similar pattern to those followed when character vectors encoding autapomorphic character states are added (Figures 5-7). For the symmetric representation (8 taxa; 24, 4, 4), I changes markedly.

Minima for I
Numerical modeling and computer simulation reveal that augmenting a consistent basal matrix containing t taxa by adding n character vectors containing x ones ($1 \le x \le t$) ultimately diminishes I, which approaches $\log_2 t$ bits as n approaches infinity. More formally, we present the theorem: if $a_x$ enumerates character vectors containing x ones that are added to a consistent basal matrix, then

$$\lim_{a_x \to \infty} I = \log_2 t. \tag{16}$$

The asymptotic nature for these limits can be understood intuitively. If a column vector containing x 1s were added repeatedly to a consistent basal matrix containing t taxa, then, as the repetition number n were to approach infinity, the initial consistent basal matrix would become insignificant. The resulting matrix essentially would become

$$\begin{pmatrix}
0 & \cdots & 0 \\
\vdots & & \vdots \\
0 & \cdots & 0 \\
1 & \cdots & 1 \\
\vdots & & \vdots \\
1 & \cdots & 1
\end{pmatrix} \tag{17}$$

in which each column vector would contain x ones (i.e., as $a_x \to \infty$, the matrix essentially would comprise n character vectors, each containing x ones). The total 1s would equal approximately xn, and I could be estimated using Equations (1)-(3):

$$I \approx \log_2 (tn) + \sum_{i=1}^{n} \frac{x}{xn} \log_2 \frac{x}{xn} = \log_2 (tn) - \log_2 n = \log_2 t.$$

We have observed that I approaches this limit in numerical modeling and computer simulation investigations in which autapomorphic or synapomorphic character vectors were added (the interested reader may imagine accumulating autapomorphic character states for one taxon or synapomorphic character states for a group; a formal demonstration for limit Equation (16) is provided in Appendix). We remark that I increases in a complicated manner if a variety of character vector types are added to a consistent basal matrix.
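The approach to the limit can be illustrated numerically. The sketch below is our own illustration, not the simulation code used for the figures; the helper name `info` and the basal column sums (4, 3, 2, assumed for a pectinate 4-taxon basal matrix) are assumptions.

```python
import math

def info(t, counts):
    """I in bits from the number of 1s in each character vector (all > 0)."""
    c, d = len(counts), sum(counts)
    return math.log2(t * c) + sum((b / d) * math.log2(b / d) for b in counts)

t = 4
basal = [4, 3, 2]  # assumed column sums of a pectinate 4-taxon basal matrix
for x in (1, 2, 4):
    for n in (10, 1_000, 100_000):
        # Add n character vectors, each containing x ones
        print(f"x={x} n={n:>6} I={info(t, basal + [x] * n):.5f}")
print(f"log2(t) = {math.log2(t):.5f}")
```

For each x, the printed values settle toward log2(4) = 2 bits as n grows, in keeping with the heuristic estimate above.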

Information and Phylogenetic Systematic Analyses
Information in phylogenetic systematic analyses has been used as a quantitative metric, means for considering evolution in thermodynamic terms, criterion for choosing among competing systematic classifications, guiding variable in constructing supertrees, parameter for measuring diversity, and basis for comparing data types.We briefly review these applications (in this section) and, on the basis of the foregoing results, render these applications amenable to more-precise interpretation (in the subsequent section).

Information as a Quantitative Metric
Schuh and Farris [21] defined the "levels sum" measure to quantify information; this metric may be calculated by constructing a matrix with terminal taxa along both edges, entering as matrix elements counts for pairwise informative components, and summing all elements (Figure 10). Rohlf and Sokal [19] and Rohlf et al. [20] considered resolution as a proxy for information and provided a means for quantification on the basis of clades (Figure 10). Nelson [22] decomposed cladogram information into "component" and "term" partitions: component information may be quantified by enumerating clades above the most-basal node; term information may be quantified by summing cumulatively nested clades (Figure 10). Mickevich and Platnick [23] considered 14 measures: component information, term information, total (component + term) information, "proportion of the maximal possible total information", "number of three-taxon statements", "number of character state distributions prohibited by cladograms with non-terminal polychotomies", levels sum, "number of fully resolved cladograms allowed by a Nelson consensus", "number of resolutions prohibited by a Nelson consensus", "proportion of total number of resolutions that are prohibited by a Nelson consensus", "number of resolutions allowed by an Adams consensus", "number of resolutions prohibited by an Adams consensus" and the "general information index" (Figure 10); they stipulated that cladograms containing polychotomies should constitute minimally informative classifications and found that pectinate cladograms constitute maximally informative classifications. Page [24] evaluated critically the points that had been made by Mickevich and Platnick, arguing that they failed to provide compelling reasons for "not equating the information of a tree topology with its degree of resolution"; Page also provided a formula for calculating WISS (Weighted Invariant Step Strategy) total steps ("total information" in [23]); observed that WISS ignores prohibited taxonomic statements; and showed that the conclusions that were drawn by Mickevich and Platnick [23] (i.e., that pectinate cladograms are most-informative) failed to account for prohibited taxonomic statements and derived from a particular perspective about taxonomic structure (i.e., structure = "clusters", or monophyletic taxa).
Figure 10. Information measures for a pectinate cladogram containing five taxa. The levels sum measure [21] is calculated by enumerating pairwise informative components, here 10 (four with the most-basal taxon, three with the next most-basal taxon, two with the second-least basal taxon, and one between the two least-basal taxa). Resolution [19,20] may be calculated by enumerating nodes (shown as dots) above the ingroup node and dividing by n - 2, here 3/3 = 1. Component information [22] may be quantified by enumerating clades (shown as dots) above the most-basal node, here 3. Term information [22] may be quantified by summing cumulatively nested clades (above the most-basal node), here 6. Total information = term information + component information [22], here 9. Other measures include the proportion of the maximal possible total information [23], here 9/9 = 1; number of three-taxon statements [23], here 10; number of character state distributions prohibited by cladograms with non-terminal polychotomies [23], here 22; number of fully resolved cladograms allowed by a Nelson consensus [23], here 1; number of resolutions prohibited by a Nelson consensus [23], here 104; proportion of total number of resolutions that are prohibited by an Adams consensus [23], here 1; number of resolutions allowed by an Adams consensus [23], here 1; and the general information index (proportion of the maximal possible total information × proportion of total number of resolutions that are prohibited by an Adams consensus), here 1.

binary trees that fit an unordered character without cost and used it to show that molecular data are more informative than are morphological data but less so than other information measures might indicate.

Interpretation
Common to the aforementioned uses for information in phylogenetic systematic analysis is a quantification for the "knowledge state" characterizing the taxonomist. Equation (9) is suited ideally for generalizing this quantification, as it is formulated on the basis of information theory and involves subtracting from the total information potentially available ($I_{\text{minimum}}$) the information that currently is available ($I_{\text{observed}}$). This generalization property will assist systematists in comprehending information, however measured (i.e., whether on the basis of cladogram properties, as described in the subsection "Information as a Quantitative Metric", or according to information theory proper, as described in the other three subsections in the preceding section).

Information as a Quantitative Metric
For instance, using the analyses that are presented herein, we now can reconcile the discrepancy between conclusions drawn by Mickevich and Platnick [23] (that pectinate cladograms constitute maximally informative classifications) and conclusions reached by Page [24] (that, in drawing their conclusion, Mickevich and Platnick had ignored prohibited taxonomic statements and considered taxonomic structure exclusively as clusters). Mickevich and Platnick had restricted their analysis predominantly to t = 5 taxa, as they felt that this constituted "the smallest number of taxa for which there is a diversity of cladogram topologies to be interesting." As shown herein, for t = 5 taxa (and c > 5 characters), the pectinate cladogram is the maximally informative consistent cladogram; whereas, for t > 5, nonpectinate cladograms may yield more information. Because Mickevich and Platnick had confined their attention to the specific case t = 5 taxa, the generalities that could be posited from their analyses were limited, a constraint that Page observed intuitively and demonstrated.

Information as a Means for Considering Evolution in Thermodynamic Terms
Concerning Wiley and Brooks' [25,26] proposal that evolution is a natural process in which the entropy that is produced is minimized, we observe that their formulation concerns organizational changes over time that involve changes to t, rather than organization within a particular phylogenetic tree (for which t would be fixed). Testing hypotheses that derive from their theory would be conducted most-effectively using paleontological data (i.e., extinct taxa for which dated fossil material over epochs were available), as changes in information over time could be analyzed (i.e., with increasing t and fixed c).

Information as a Criterion for Choosing among Competing Systematic Classifications
Concerning the D criterion that was proposed by Brooks et al. [31], the results that are presented herein indicate that D yields a maximum but would approach the asymptote $\log_2 t$ (which is less than the maximum) as more data were accumulated for any group under analysis. This behavior ought to be considered in invoking the "greatest-D criterion" in any particular phylogenetic systematic analysis (i.e., for any particular t). Brooks et al. recognized that greater D sometimes could be obtained from less-than-most-parsimonious cladograms. In this respect, the results that are presented herein explain the authors' caution: "the D measure is to be used to choose among trees after the set of trees has been reduced to those of equally-shortest length".

Information as a Guiding Variable in Constructing Supertrees
Concerning the methods proposed by Purvis [32], Ronquist [33], and Wilkinson et al. [35] for constructing supertrees, different coding schemes (i.e., matrices containing different character vectors) indeed will yield different information, and Equation (9) provides an independent means for assessing whether information will increase upon recoding.

Information as a Parameter for Measuring Diversity
Concerning Stone's [36] using information as a means for comparing diversity among higher taxa, the results that are contained herein show that information may be implemented heuristically in this manner; in fact, if sister groups were to be compared, then accumulating autapomorphic character states for one taxon in each sister group or synapomorphic character states for each sister group would enable researchers to correlate information content with diversity directly (trivially, via limit Equation (16)). For situations in which different character state types were amassed for different taxa, information content and diversity would be related but in a more-complicated manner.

Information as a Basis for Comparing Data Types
Finally, concerning the proposal by Cotton and Wilkinson [37] to use information as a basis for comparing data types, the results that are contained herein entail that the information that may be obtained from any data type is bounded if those data would yield most-parsimoniously a consistent cladogram; if t > 5, then only data that would yield most-parsimoniously a nonpectinate consistent cladogram would yield maximum information, especially for large t.

Prospectus
Information is being applied increasingly in phylogenetic-systematic-related analyses, as data from genome projects amass and analyses involving entire genomes become more common. For instance, information has been used in cladogram construction [42-44], for drug discovery [45], as a guide in pharmacological data mining [46], with assessing Markov processes [47], and to calculate probabilities for chemical words in DNA sequences [48]. Possessing a more-precise interpretation for how information in phylogenetic systematic analysis has been used could assist in formalizing links between biology and information (thereby justifying the term "bioinformatics") and, ultimately, in achieving a synthesis between bioinformatics and evolution.
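
Demonstrating the Limit for Equation (16)
Consider a taxon-character matrix A comprising $a_i$ columns containing i ones. Let $p_i = a_i / c$ denote the proportion of columns containing i ones and let $\varphi(\xi) = \xi \log_2 \xi$, which is convex in the variable $\xi > 0$. Then

$$I = \log_2 t + \frac{\sum_{i=1}^{t} p_i\,\varphi(i) - \varphi\left(\sum_{i=1}^{t} i\,p_i\right)}{\sum_{i=1}^{t} i\,p_i} \ge \log_2 t.$$

This makes sense intuitively, as one is acquiring different information about a positive proportion among the taxa (those described through the $a_y$s, rather than those described by the $a_x$s).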

Figure 2. Cladogram topologies for the two possible consistent basal matrices for four taxa: the pectinate cladogram at the left would be obtained from a cladistic analysis involving consistent basal matrix (10.1); the symmetric cladogram would be obtained from a cladistic analysis involving consistent basal matrix (10.2).

Figure 3. Plots relating information measure I and added autapomorphic characters c for consistent basal matrices (10.1) (pectinate) and (10.2) (symmetric). An upper bound to I exists in both cases.

Figure 4. Plots relating information measure I and added autapomorphic characters c for consistent basal matrices (11.1)-(11.3). An upper bound to I exists in all three cases.

Figure 7. Plots relating information measure I and added autapomorphic characters c for some consistent basal matrices for eight taxa. The representation corresponding to the pectinate cladogram is (8 taxa; 35, 7, 6), whereas the representation corresponding to the symmetric cladogram is (8 taxa; 24, 4, 4).
in Equation (1). The quantity I may be expressed as [...]

Theorem. Let t be fixed and $c \to \infty$, in a way such that, for some x, $a_x \to \infty$ while, for all other $j \ne x$, $a_j$ [...] One may let $c \to \infty$, letting some or all of the $a_j$ diverge to infinity, as well, but in a manner such [...] converge to a δ-measure concentrated at the point i = x. In this case, [...] converges to zero, which is the result in question. We remark that, if two (or more) $a_y$s diverge to infinity in a manner such that the ratio [...] the inequality being implied by Jensen's inequality for convex functions.
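The role that Jensen's inequality plays in the argument can be checked numerically. The sketch below rests on our own rearrangement, which is an assumption and not necessarily the form used in the original derivation: with $p_i = a_i / c$ and the convex function φ(z) = z log2 z, the measure can be written as I = log2 t + (Σ p_i φ(i) - φ(Σ i p_i)) / Σ i p_i, so that the excess over log2 t is a nonnegative Jensen gap. All function names are hypothetical.

```python
import math

def info_direct(t, a):
    """I in bits computed from counts a[i] of character vectors with i 1s."""
    c = sum(a.values())
    d = sum(i * n for i, n in a.items())
    return math.log2(t * c) + sum(n * (i / d) * math.log2(i / d) for i, n in a.items())

def info_jensen(t, a):
    """The same quantity written as log2(t) plus a Jensen gap (our rearrangement)."""
    c = sum(a.values())
    p = {i: n / c for i, n in a.items()}   # proportion of columns with i ones
    phi = lambda z: z * math.log2(z)       # convex for z > 0
    mean = sum(i * p_i for i, p_i in p.items())
    gap = sum(p_i * phi(i) for i, p_i in p.items()) - phi(mean)  # >= 0 by Jensen
    return math.log2(t) + gap / mean

# Example counts for four taxa: one vector each with 4, 3, and 2 ones, six with one 1
a = {4: 1, 3: 1, 2: 1, 1: 6}
print(f"direct: {info_direct(4, a):.6f}  Jensen form: {info_jensen(4, a):.6f}")
```

When the proportions concentrate on a single value of i, the gap vanishes, which is one way to see why I tends to log2 t.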