This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

In this paper, new techniques that allow conditional entropy to estimate the combinatorics of symbols are applied to animal communication studies in order to estimate a communication's repertoire size. Using conditional entropy estimates at multiple orders, the paper estimates the total repertoire sizes for animal communication across bottlenose dolphins, humpback whales and several species of birds for N-gram lengths of one to three. In addition to discussing the impact of this method on studies of animal communication complexity, the reliability of these estimates is compared to that of other methods through simulation. While entropy does undercount the total repertoire size due to rare N-grams, it gives a more accurate picture of the most frequently used portion of the repertoire than a raw repertoire count alone.

The complexity of animal communication is a topic frequently discussed, but difficult to resolve. While it is beyond dispute that many species communicate, even the basic purposes of these communications, whether to convey information or merely to influence the behavior of others to increase the sender's own fitness, are hotly debated [

The complexity of animal language has been studied using many methods, including various techniques to estimate repertoire size, such as curve-fitting [

After formulating information theory in 1948, Shannon was not long in turning its powers to shedding light on human language [. The first-order entropy is defined over the probabilities p_i of the individual symbols as H(X) = −Σ_i p_i log_2 p_i; higher orders similarly involve the joint probabilities p_{i,j} of symbol pairs and the conditional probabilities p(j|i) of symbol j following symbol i.
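As a concrete illustration (a minimal sketch, not code from the paper), the first-order entropy can be estimated directly from the empirical symbol frequencies of a sequence:

```python
from collections import Counter
from math import log2

def first_order_entropy(sequence):
    """Empirical first-order Shannon entropy of a symbol sequence, in bits."""
    counts = Counter(sequence)
    n = len(sequence)
    # Plug-in estimate: -sum of p_i * log2(p_i) over observed symbols.
    return -sum((c / n) * log2(c / n) for c in counts.values())

# A uniform 4-symbol sequence has entropy log2(4) = 2 bits.
print(first_order_entropy("abcd" * 100))  # -> 2.0
```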

Amongst the simplest methods for computing conditional entropies is from joint entropies. The joint entropy of a pair of symbols is defined over the joint probabilities p_{i,j} as H(X,Y) = −Σ_{i,j} p_{i,j} log_2 p_{i,j}.

The conditional entropy of order N can then be computed as the difference of successive joint entropies: H_N = H(X_1, …, X_N) − H(X_1, …, X_{N−1}).
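Concretely, both quantities can be estimated from the overlapping N-grams of a sequence. The following sketch (not from the paper) uses the plug-in estimator for the joint entropy and takes differences for the conditional entropy:

```python
from collections import Counter
from math import log2

def joint_entropy(seq, n):
    """Plug-in estimate of the joint entropy of overlapping n-grams, in bits."""
    grams = Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    total = sum(grams.values())
    return -sum((c / total) * log2(c / total) for c in grams.values())

def conditional_entropy(seq, n):
    """N-th-order conditional entropy: H_N = H(X1..XN) - H(X1..X(N-1))."""
    if n == 1:
        return joint_entropy(seq, 1)
    return joint_entropy(seq, n) - joint_entropy(seq, n - 1)

# A strictly alternating sequence: H_1 = 1 bit, but H_2 is near zero,
# since each symbol fully determines the next.
seq = "ab" * 200
print(conditional_entropy(seq, 1))  # 1.0
print(conditional_entropy(seq, 2))  # close to 0
```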

For the English alphabet of 27 letters (26 letters plus the space character), Shannon calculated the first-order entropy at 4.14 bits, the second-order conditional entropy at 3.56 bits and the third-order conditional entropy at 3.30 bits. The zero-th order entropy of 4.75 bits was based on log_2 27, the entropy of 27 equiprobable symbols.

Soon after human language, animal communications of varying types were studied using entropy. One of the first citations explicitly analyzing animal communication by means of information theory was that of J.B.S. Haldane and H. Spurway [

Further studies along this line include the analysis of the chickadee (

These studies primarily focus on measuring information, through entropy in bits, at the first order and sometimes at higher orders as well. For multiple orders, information graphs, plots of the bits of the conditional entropy by order, are sometimes used [

An information graph is the plot of the higher order conditional entropies by order. Some of the first uses and analyses of information graphs in the context of Markov sequences are given in [
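A sketch of how such an information graph could be assembled in code (hypothetical, not from the paper; following the convention used later of taking H_0 as log_2 of the alphabet size):

```python
from collections import Counter
from math import log2

def information_graph(seq, max_order, alphabet_size):
    """Return [H_0, H_1, ..., H_max_order], the conditional entropies by order."""
    def joint(n):
        # Plug-in joint entropy of overlapping n-grams, in bits.
        grams = Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
        total = sum(grams.values())
        return -sum((c / total) * log2(c / total) for c in grams.values())

    graph = [log2(alphabet_size)]  # order 0: equiprobable symbols
    prev = 0.0
    for n in range(1, max_order + 1):
        jn = joint(n)
        graph.append(jn - prev)  # H_n = joint_n - joint_(n-1)
        prev = jn
    return graph

# An alternating sequence: H_0 = H_1 = 1 bit, while H_2 is near zero.
print(information_graph("ab" * 200, 2, 2))
```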

With these caveats, the information graphs will still be shown as an illustration of the results of the studies on each animal communication and should be used with caution to establish the complexity of sequences.

In general, the larger the order of dependence, the more “complex” the communication is deemed. For example, many bird call sequences seem to show first-order dependence, though this is uncertain, since a sample size of several multiples of the number of symbols squared is needed to confirm it (

While information graphs are relatively easy to construct given the right data, there is a large issue of estimating entropy. Namely, entropy estimators can have large biases that depend on the sample size, which typically underestimate the true value of the entropy [

Because of the often large numbers of possible variables, entropy estimators can be very sensitive to sample size and introduce bias into measurements. This was first investigated in [

When dealing with actual data, it can be relatively straightforward to estimate

One of the lesser known, but extremely useful, facets of information theory is the way entropy can be used for combinatorics. In particular, the number of combinations of a symbol set can be more accurately estimated using the first-order entropy than under an assumption of random likelihood. For example, if an alphabet has S symbols, an assumption of equal likelihood gives S^N possible N-grams, whereas the first-order entropy H_1 yields the smaller typical-set estimate of 2^(N·H_1) N-grams.
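Using Shannon's English-letter figures from above as an example, the two estimates differ substantially (a sketch; the numbers below are simply S^N versus 2^(N·H_1)):

```python
S = 27     # 26 letters plus the space character
H1 = 4.14  # Shannon's first-order entropy estimate, in bits
N = 3

naive = S ** N           # equal-likelihood count of possible 3-grams
typical = 2 ** (N * H1)  # first-order-entropy (typical-set) estimate

print(naive)           # 19683
print(round(typical))  # 5480
```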

Since H_1 ≤ log_2 S, with equality only when all symbols are equally likely, the entropy-based estimate 2^(N·H_1) is never larger than S^N.

In order to improve on the estimate of the number of distinct 2-grams, M_2, the conditional entropy can be used in place of the first-order entropy: M_2 = 2^(H(X|Y)). However, a factor of two is necessary in the exponent for the equation to reduce to the base case of Shannon and Weaver if the two symbols are independent, giving M_2 = 2^(2·H(X|Y)).

In the above, H_2 = H(X|Y) is the conditional entropy in bits for the digram sequence; more generally, H_N denotes the N-th-order conditional entropy.

Since conditional entropy must monotonically decrease with each higher order, these estimates shrink as the order grows. For distinct two-letter words, we can use M_2 = 2^(2·H_2). For distinct three-letter words, we can use M_3 = 2^(3·H_3), and so on for higher orders.
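Applying these formulas to the bottlenose dolphin conditional entropies reported later in the data table (H_2 = 1.15 bits, H_3 = 0.56 bits) gives, as a sketch:

```python
def ngram_repertoire(h_n, n):
    """Entropy-based estimate of distinct n-grams: M_n = 2**(n * H_n)."""
    return 2 ** (n * h_n)

# Bottlenose dolphin conditional entropies from the data table, in bits.
H2, H3 = 1.15, 0.56
print(ngram_repertoire(H2, 2))  # about 4.9, i.e., roughly 5 distinct 2-grams
print(ngram_repertoire(H3, 3))  # about 3.2 distinct 3-grams
```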

In addition to estimating the size of the repertoire, combinatorics can be used to estimate upper bounds for the entropy bias when details about the dataset are unavailable. This is primarily done by estimating an upper bound on the number of possible N-grams, 2^(N·H_1), from the first-order entropy.

In addition, one can estimate a lower bound for the number of N-grams as 2^(N·H_N). With these two values of M_N, minimum and maximum estimates of the entropy bias can be calculated.
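One way these bounds could translate into bias estimates, assuming the classic Miller correction (bias ≈ (M − 1) / (2·n·ln 2) bits for M possible outcomes and n samples; this specific correction is my assumption, not stated in the excerpt):

```python
from math import log

def entropy_bias(m_outcomes, n_samples):
    """Miller-type first-order bias of a plug-in entropy estimate, in bits.
    m_outcomes: number of possible outcomes; n_samples: number of samples."""
    return (m_outcomes - 1) / (2 * n_samples * log(2))

# Dolphin 2-gram example from the data table: H_1 = 1.92, H_2 = 1.15 bits,
# with 346 sampled 2-grams.
m_min = 2 ** (2 * 1.15)  # lower bound on distinct 2-grams
m_max = 2 ** (2 * 1.92)  # upper bound on distinct 2-grams
print(round(entropy_bias(m_min, 346), 2))  # 0.01
print(round(entropy_bias(m_max, 346), 2))  # 0.03
```

These two values match the minimum and maximum order-2 biases reported for dolphins in the bias table, which suggests a correction of this form underlies those numbers.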

In the next section, we will investigate the complexity of several species, including bottlenose dolphins, humpback whales and several species of birds and investigate the size of their N-gram repertoires.

In this paper, we will use entropy combinatorial techniques to estimate the N-gram repertoires of six species: bottlenose dolphins

In [

One of the defining features of humpback whales,

Dobson and Lemon [

In [

In [

Here, we use the data from these papers to reproduce graphically the information graphs for the communications of each species (

First, we will represent the minimum bias-corrected conditional entropies as information graphs, beginning at order 0, log_2 of the number of symbols.

As can be seen, achieving sample sizes of multiples of the symbol count squared, S^2, is difficult, especially with the large song type repertoire of birds.

In analyzing the data from the species and estimating repertoires, it is essential to define sample sizes and correct for bias. This is done below for the conditional entropies H_2 and H_3.

In

In

From these tables, especially

Clearly, we have a more accurate idea of the total repertoire for those animals whose repertoire size differs very little between the maximum and minimum bias assumptions. These are dolphins, humpback whales and European starlings. The other bird species have a large number of song types. This huge symbol set causes a large swing between the minimum and maximum bias estimates. In these cases, the minimum bias estimate is more representative, since the number of possible N-grams that first-order entropy would imply is enormous with such a large symbol set. In the end, the best way to accurately measure the repertoire sizes, particularly for dolphins and humpback whales, is to make a much larger measurement of sequences with

As stated in the introduction, apart from the information theory perspective, repertoire size has often been investigated using sampling methods, such as curve-fitting and capture-recapture. These methods can be used if song bout data is available to predict repertoire size, their accuracy increasing with the number of samples. In order to compare the method developed in this paper with actual data and these two methods, a program was created that synthesized an arbitrary signal with a predefined entropy of the first, second and third order.

Using this program, the number of N-grams was compared with the estimates from the entropy method for dolphins and humpback whales. For dolphins and whales, respectively, 20,000-symbol and 2000-symbol sequences with matching conditional entropies were created, and the numbers of N-grams of lengths one to three were counted. Since the samples were so large, neither curve-fitting nor capture-recapture had an issue finding the total repertoire size, because the exponential distribution of the total number of symbols (see
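The synthesis program itself is not reproduced here, but a minimal sketch of the idea is to generate a first-order Markov sequence and count its distinct N-grams (the transition matrix below is a toy example, not the fitted dolphin or whale model):

```python
import random
from collections import Counter

def simulate_markov(transitions, start, length, seed=0):
    """Generate a first-order Markov symbol sequence from a transition table."""
    rng = random.Random(seed)
    seq = [start]
    for _ in range(length - 1):
        symbols, weights = zip(*transitions[seq[-1]].items())
        seq.append(rng.choices(symbols, weights=weights)[0])
    return seq

def distinct_ngrams(seq, n):
    """Count the distinct overlapping n-grams observed in the sequence."""
    return len(set(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)))

# Toy two-symbol chain; the repertoire can then be counted per N-gram length.
T = {"a": {"a": 0.1, "b": 0.9}, "b": {"a": 0.8, "b": 0.2}}
seq = simulate_markov(T, "a", 2000)
print(distinct_ngrams(seq, 2))  # at most 4 for a two-symbol alphabet
```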

For the humpback whales, the total number of simulated 2-grams exactly matched the prediction of a repertoire size of 18. This would seem to confirm the validity of the method. The dolphin story was more complex. With dolphins, the total number of simulated N-grams exceeded the values estimated by the entropy estimations in all cases; however, the details tell a more complex story. While the repertoire is large in terms of N-grams, the frequency is very concentrated amongst the top N-grams. The top 5 2-grams and 3-grams are 78% and 63% of all 2-grams (total: 46) and 3-grams (total: 89), respectively. Many of the 2-grams and 3-grams occurred only once in the 20,000 symbol sequence. While the bias in the dolphins is greater due to the relatively small sample size compared to the number of symbols, the repertoire exceeded even the maximum bias estimates for both 2-grams and 3-grams.

Therefore, we can conclude one major strength, but also one limitation, of the use of conditional entropy to measure the N-gram repertoire. For small repertoires, like that of the whales, it can accurately estimate the number of small combinations, such as 2-grams. For more complex repertoires, it seems to accurately measure the size of the most frequently used N-grams in the repertoire, giving a reasonable estimate of the most functionally used N-grams. As a limitation, however, conditional entropies can seriously undercount rare N-grams, since their relatively small probabilities contribute only weakly to the entropy calculations.

If the goal is to capture the entire repertoire, ignoring the weighted heterogeneity of the symbols, and samples are available, both curve-fitting and capture-recapture create a more detailed picture, since they can pick up rare occurrences; however, they do not give the same information about the relatively skewed nature of the symbol distribution that the entropy method can provide.

Animal communication analyses through information theory have been useful, and while they cannot answer all questions regarding the intent or possible meaning of such communications, they have shown beyond a doubt that animal communication can have a complex structure that goes beyond random sounds or even the structure of a first-order Markov process.

However, entropy-based analyses alone hold only descriptive power. A logical next step from observing and measuring communication complexity should be determining how to use that complexity to search for communication structures that can help explain animal behavior. The methods outlined in this paper assist in this effort by giving researchers a baseline from which to investigate 2-gram or 3-gram call sequences further. In particular, the size of the most frequent, and possibly functional, repertoire is clearly enumerated using information theory methods. Similar to work by Gentner on starlings [

While the information theory methods are weaker than count-based methods, such as curve-fitting and capture-recapture, at finding the exact repertoire size, they offer an improved understanding of the relationships that shape the syntax of the communication. The basic order of communication, the clustering of “vocabulary” and other detailed features cannot be understood just by comparing repertoire sizes over time and across species. The importance of understanding syntax in this manner has been frequently raised, such as in [

It has long been known that auditory recognition abilities exist in a wide group of species, from 2-gram alarm calls in putty-nosed monkeys (

Just as word length analyses in human language use syllables as the base unit [

I would like to thank Laurance Doyle for help in gathering and understanding data from past papers. I would also like to thank the anonymous referees for much helpful feedback.

The author has no conflicts of interest.

Information graphs of communications by (

Information graph of written English letters based on [

Information graphs of animal communication conditional entropies for the species analyzed in this paper.

Exponential distribution of repertoire growth over time for bottlenose dolphin 3-grams and humpback whale 2-grams. Based on simulated sequences of 20,000 symbols with the repertoire measured in bouts of 100 symbols for dolphins and a sequence of 2000 symbols with bouts of 10 symbols for humpback whales.

The basic data on the information theory of animal communication from the species analyzed. S is the number of symbols, N the total sample size, N_2 the estimate (where available) of the number of 2-grams measured, N_3 the estimate (where available) of the number of 3-grams measured and H_1, H_2 and H_3 are the first-, second- and third-order conditional entropies, respectively.

| Species | S | N | N_2 | N_3 | H_1 | H_2 | H_3 |
|---|---|---|---|---|---|---|---|
| [ | 27 | 493 | 346 | 346 | 1.92 | 1.15 | 0.56 |
| [ | 6 | 202 | 195 | N/A | 2.15 | 2 | N/A |
| [ | 170 | 10,000 | 10,000 | 10,000 | 7.05 | 1 | 0.29 |
| [ | 105 | 4,811 | 4,691 | 4,691 | 6.03 | 1.47 | 0.81 |
| [ | 35 | 777 | 777 | 777 | 4.64 | 3.33 | 1.09 |
| [ | 44 | 2,700 | 2,700 | 2,700 | 4.03 | 2.74 | 1.95 |

The biases, minimum and maximum, calculated for the joint entropies of orders 1–3 according to the paper data. Values with asterisks indicate where the maximum bias assumption correction would have exceeded the previous order entropy and, therefore, the maximum bias is limited to the difference between the bias-corrected previous order entropy and the original entropy estimate.

| Species Name | Bias Minimum H_1 | Bias Minimum H_2 | Bias Minimum H_3 | Bias Maximum H_1 | Bias Maximum H_2 | Bias Maximum H_3 |
|---|---|---|---|---|---|---|
|  | 0.04 | 0.01 | 0.01 | 0.04 | 0.03 | 0.2 |
|  | 0.02 | 0.06 | N/A | 0.02 | 0.07 | N/A |
|  | 0.01 | 0 | 0 | 0.01 | 1.26 | 1.96* |
|  | 0.02 | 0.0 | 0.0 | 0.02 | 0.66 | 1.3 |
|  | 0.03 | 0.09 | 0.01 | 0.03 | 0.57 | 2.78* |
|  | 0.01 | 0.01 | 0.02 | 0.01 | 0.07 | 0.85* |

The corrected conditional entropies, minimum and maximum, calculated for the conditional entropies of orders 1–3 according to the paper data and values in

| Species Name | Bias Minimum H_1 | Bias Minimum H_2 | Bias Minimum H_3 | Bias Maximum H_1 | Bias Maximum H_2 | Bias Maximum H_3 |
|---|---|---|---|---|---|---|
|  | 1.96 | 1.12 | 0.56 | 1.96 | 1.14 | 0.73 |
|  | 2.17 | 2.04 | N/A | 2.17 | 2.05 | N/A |
|  | 7.06 | 0.99 | 0.29 | 7.06 | 2.25 | 2.25* |
|  | 6.05 | 1.46 | 0.81 | 6.05 | 2.11 | 2.09 |
|  | 4.67 | 3.39 | 1.00 | 4.67 | 3.87 | 3.87* |
|  | 4.04 | 2.74 | 1.96 | 4.04 | 2.8 | 2.8* |

Estimates of total repertoire sizes for 1-gram, 2-gram and 3-gram, minimum and maximum, for each species based on the bias-corrected conditional entropies.

| Species Name | Bias Minimum 1-gram | 2-gram | 3-gram | Total | Bias Maximum 1-gram | 2-gram | 3-gram | Total |
|---|---|---|---|---|---|---|---|---|
|  | 27 | 5 | 4 | 36 | 27 | 5 | 5 | 37 |
|  | 6 | 17 | N/A | 23 | 6 | 18 | N/A | 24 |
|  | 170 | 4 | 2 | 176 | 170 | 23 | 108* | 301 |
|  | 105 | 8 | 6 | 119 | 105 | 19 | 78 | 202 |
|  | 35 | 110 | 8 | 153 | 35 | 214 | 3,126* | 3,375 |
|  | 44 | 45 | 59 | 148 | 44 | 49 | 338* | 431 |