Entropy Estimators for Markovian Sequences: A Comparative Analysis

Entropy estimation is a fundamental problem in information theory that has applications in various fields, including physics, biology, and computer science. Estimating the entropy of discrete sequences can be challenging due to limited data and the lack of unbiased estimators. Most existing entropy estimators are designed for sequences of independent events and their performances vary depending on the system being studied and the available data size. In this work, we compare different entropy estimators and their performance when applied to Markovian sequences. Specifically, we analyze both binary Markovian sequences and Markovian systems in the undersampled regime. We calculate the bias, standard deviation, and mean squared error for some of the most widely employed estimators. We discuss the limitations of entropy estimation as a function of the transition probabilities of the Markov processes and the sample size. Overall, this paper provides a comprehensive comparison of entropy estimators and their performance in estimating entropy for systems with memory, which can be useful for researchers and practitioners in various fields.


INTRODUCTION
The entropy associated with a random variable is a measure of its uncertainty or diversity, taking large values for a highly unpredictable random variable (i.e., all outcomes equally probable) and low values for a highly predictable one (i.e., one or few outcomes much more probable than the others).As such, the concept has found multiple applications in a variety of fields including but not limited to nonlinear dynamics, statistical physics, information theory, biology, neuroscience, cryptography, and linguistics [1][2][3][4][5][6][7][8][9][10][11][12][13].
Due to its mathematical simplicity and clear interpretation, Shannon's definition is the most widely used measure of entropy [14].For a discrete random variable X with L distinct possible outcomes x 1 , . . ., x L , the Shannon entropy reads where p(x i ) denotes the probability that the random variable X takes the value x i .It often occurs in practice that the probability distribution of the variable X is unknown, either due to mathematical difficulties or to the lack of deep knowledge of the details of the underlying experiment described by the random variable X.In those situations, it is not possible to compute the entropy using Equation (1) directly.In general, our information is restricted to a finite set of ordered data resulting from the observation of the outcomes obtained by repeating a large number of times, N , the experiment.Hence, the goal is to estimate H from the ordered sequence S = X 1 , . . ., X N , where each X j ∈ {x i } L i=1 with j = 1, . . ., N .A numerical procedure that provides an approximation to the true value of H based on the sequence S is called an entropy estimator.As the sequence S is random, it is clear that an entropy estimator is itself a random variable, taking different values for different realizations of the sequence of N outcomes.It would be highly desirable to have an unbiased entropy estimator, i.e., an estimator whose average value coincides with the true result H for all values of the sequence length N .However, it can be proven that such an estimator does not exist [15] and that, apart from the unavoidable statistical errors due to the finite number N of data of the sample (and which typically scale as N −1/2 ), all estimators present systematic errors which are in general difficult to evaluate properly.Therefore, a large effort has been devoted to the development of entropy estimators that, although necessarily biased, provide a good value for H with small statistical and systematic errors [16].
The problem of finding a good estimator with small errors becomes more serious when the number of data N is relatively small.Indeed, when the sizes of available data are much larger than the possible outcomes (N ≫ L), it is not difficult to estimate H accurately, and all of the most popular estimators are naturally satisfactory in this regime.The task becomes much harder as the numbers L and N come closer to each other.It is particularly difficult in the undersampled regime (N ≲ L) [17], where some, or potentially many, possible outcomes may not be observed in the sequence.It is in this regime where the difference in accuracy among the available estimators is more significant.
We emphasize that the discussed difficulties already appear for independent identically distributed (i.i.d.) random variables.Precisely, the previous literature has largely dealt with entropy estimators proposed for sequences of i.i.d.random variables [16,[18][19][20][21].However, it is not clear that real data arising from experimental observation can be described with i.i.d.random variables due to the ubiquitous presence of data correlations.The minimal correlations in discrete sequences are of a Markovian nature.Then, how do the main entropy estimators behave for Markovian sequences?arXiv:2310.07547v2[cond-mat.stat-mech]17 Jan 2024 The purpose of this work is to make a detailed comparison of some of the most widely used entropy estimators in systems whose future is conditionally independent of the past (Markovian).In Markovian sequences, correlations stem from the fundamental principle that the probability of a data value appearing at a specific time depends on the value observed in the preceding time step.Markov chains have been used to model systems in a large variety of fields such as statistical physics [22], molecular biology [23], weather forecast [24], and linguistics [25], just to mention a few.Below, we analyze the strengths and weaknesses of estimators tested in a correlated series of numerically generated data.We compare the performances for the estimators that have shown to give good results for independent sequences [16].For definiteness, we below consider Markovian sequences of binary data.Furthermore, the calculation of relevant quantities in information theory, such as entropy rate and predictability gain [26], requires estimating the block entropy of a sequence, obtained from the estimation of the entropy associated not to a single result, but to a block of consecutive results.As we will argue in the following sections, the construction of overlapping blocks induces correlations amongst them, even if the original sequence is not correlated.The calculation of the block entropy is also a tool that can be used to estimate the memory of a given sequence [27], which is of utmost importance when dealing with strongly correlated systems [28][29][30][31][32][33].
The rest of the paper is organized as follows.In Section II, we make a brief overview of the ten entropy estimators being considered in this study, nine of which are already known in the literature and an additional estimator built from results presented in ref. [34], which is further developed in this work.In Section III, we present the results of our comparative analysis of these estimators in two Markovian cases: (A) binary sequences; and (B) in an undersampled regime.Section IV contains the conclusions and an outlook.Finally, in Appendix A we provide a new interpretation in terms of geometric distributions of an estimator which is widely used as the starting point to construct others, and in Appendix B we prove the equivalence between a dynamics of block sequences and a Markovian random variable.

II. MATERIALS AND METHODS
In the following, we will use the notation â to refer to a numerical estimator of the quantity a.The bias of â is defined as where ⟨â⟩ represents the expected value of â.The estimator â is said to be unbiased if B[â] = 0.The dispersion of â is given by the standard deviation Ideally, â should be as close to the true value a as possible.Therefore, it is desirable that â has both low bias and low standard deviation.With this in mind, it is natural to consider the mean squared error of an estimator, given by to assess its quality.Hence, when comparing estimators of the same variable, the one with the lowest mean squared error is preferable.
Given an estimator Ĥ of the entropy, its k-th moment can be computed as where the sum runs over all possible sequences S = X 1 , . . ., X N of length N and Ĥ(S) is the value that the estimator takes on in this sequence.The probability P (S) of observing the sequence S depends on whether S is correlated or not.For example, if S is an independent sequence, P (S) can be calculated as For correlated sequences, Equation ( 6) no longer holds.Consider a Markovian system, in which the probability of the next event only depends on the current state.In other words, the transition probabilities satisfy with s the position in the series.A homogeneous Markov chain is one in which the transition probabilities are independent of the time step s.Therefore, a homogeneous Markov chain is completely specified given the L × L matrix of transition probabilities p(x j |x ℓ ) = P (X s = x j |X s−1 = x ℓ ), j, ℓ = 1, . . ., L. In this case, the probability of observing the sequence S can be calculated as where we have applied Equation (7) successively.The calculation of P (S) can be generalized to an morder Markov chain defined by the transition probabilities: that depend on the m previous results of the random variable.
It is clear that the moments of the estimator Ĥ, and consequently its performance given by its mean squared error, depend on the correlations of the system being analyzed.
Most of the entropy estimators considered in this work only depend on the number of times each outcome occurs in the sequence.In this case, the calculation of the moments of the estimator can be simplified for independent and Markovian systems considering the corresponding multinomial distributions [35].
Several entropy estimators were developed with the explicit assumption that the sequences being analyzed are uncorrelated [36,37].The main assumption is that the probability of the number of times n i that the outcome x i occurs in a sequence of length N follows a binomial distribution, This approach is not valid when dealing with general Markovian sequences because Equation (10) no longer holds.Instead, the Markovian binomial distribution [38] should be used, or more generally, the Markovian multinomial distribution [35].Even for entropy estimators that were not developed directly using Equation (10), their performance is usually only analyzed for independent sequences [16].Hence, the need to compare and evaluate the different estimators in Markov chains.

A. Maximum Likelihood Estimator
The maximum likelihood estimator (MLE) (also known as plug-in estimator) simply consists of replacing the exact probabilities in Equation ( 1) for the estimated frequencies, where ni is the number of times that the outcome x i is observed in the given sequence.It is well known that Equation ( 11) is an unbiased estimator of p(x i ), but the MLE estimator, given by is negatively biased [15], i.e., ⟨ ĤMLE ⟩ − H < 0.

B. Miller-Madow Estimator
The idea behind the Miller-Madow estimator (MM) [48] is to correct the bias of ĤMLE up to the first order in 1/N , resulting in where N 0 is the number of different elements present in the sequence.Corrections of higher order are not considered because they include the unknown probabilities p(x i ) [49].

C. Nemenman-Shafee-Bialek Estimator
A large family of entropy estimators are derived by estimating the probabilities using a Bayesian framework [40,44,[50][51][52][53].The Nemenman-Shafee-Bialek estimator (NSB) [54][55][56] provides a novel Bayesian approach that, unlike traditional methods, does not rely on strong prior assumptions on the probability distribution.Instead, this method uses a mixture of Dirichlet priors, designed to produce an approximately uniform distribution of the expected entropy value.This ensures that the entropy estimate is not exceedingly biased by prior assumptions.
The Python implementation developed in ref. [57] was used in this paper for the calculations of the NSB estimator.

D. Chao-Shen Estimator
The Chao-Shen estimator (CS) [18] takes into account two corrections to Equation (12) to reduce its bias: first, a Horvitz-Thompson adjustment [58] to account for missing elements in a finite sequence; second, a correction to the estimated probabilities, pCS (x i ) = ĈCS p(x i ), leading to where N 1 is the number of elements that appear only once in the sequence.The Chao-Shen entropy estimator is then

E. Grassberger Estimator
Assuming that all p(x i ) ≪ 1, the probability distribution of each n i can be approximated by a Poisson distribution.Following this idea, Grassberger (G) derived the estimator presented in ref. [36] by first considering Rényi entropies of order q [59]: Taking into account that the Shannon case can be recovered by taking the limit q → 1, the author proposed a low bias estimator for the quantity p q , for an arbitrary q.This approach led to the estimator given by with and the different values of G ni computed using the recurrence relation where γ = 0.57721 . . . is Euler's constant.

F. Bonachela-Hinrichsen-Muñoz Estimator
The idea behind the Bonachela-Hinrichsen-Muñoz estimator (BHM) [37] is to make use of Equation ( 10) to find a balanced estimator of the entropy that, on average, minimizes the mean squared error.The resulting estimator is given by

G. Shrinkage Estimator
The estimator proposed by Hausser and Strimmer [20] (HS) is a shrinkage-type estimator [60], in which the probabilities are estimated as an average of two models: where the weight α is chosen so that the resulting estimator pHS has lower mean squared error than p and is calculated by [61] Hence, the shrinkage estimator is H. Chao-Wang-Jost Estimator The Chao-Wang-Jost estimator (CWJ) [62] uses the series expansion of the logarithm function, as well as a correction to account for the missing elements in the sequence.This estimator is given by where ψ(z) is the digamma function and A is given by with N 1 and N 2 the number of elements that appear once and twice, respectively, in the sequence.
In the supplementary material of ref. [62], it is proven that the first sum in Equation ( 25) is the same as the leading terms of the estimators developed in refs.[41,42].In Appendix A, we show that each term in this sum is also equivalent to an estimator that takes into account the number of observations made prior to the occurrence of the element x i .

I. Correlation Coverage-Adjusted Estimator
The correlation coverage-adjusted estimator (CC) [27] uses the same ideas that support Equation (15) but considers a different correction to the probabilities, pCC (x i ) = ĈCC p(x i ), where now ĈCC is calculated sequentially taking into account previously observed data, where N ′ ≡ N/2 and the function I(Z) yields 1 if the event Z is true and 0 otherwise.By construction, this probability estimator considers possible correlations in the sequence.
Then, the CC estimator is given by

J. Corrected Miller-Madow Estimator
In ref. [34] it is shown that the bias of the MLE estimator can be approximated based on a Taylor expansion as where Notice that the first term in Equation ( 29) is simply the Miller-Madow correction shown in Section II B, whereas the second term involves the unknown conditional probabilities with a lag l that tends to infinity.These quantities can be hard to estimate directly from observations, especially if dealing with short sequences.However, the calculation of K(l) can be simplified.Assuming that the sequence is independent, it can easily be seen that K(l) = 0 for all l and one recovers the Miller-Madow correction.
Considering that the sequence is Markovian, then K(l) can be written in a simpler way by first noticing that where T is the L × L transition probability matrix given by (T) ij = p(x j |x i ).Hence, where Tr(T l ) is the trace of the matrix T l and λ i are the eigenvalues of T. The last equality of Equation ( 31) is a well-known result in linear algebra.Given that T is a stochastic matrix, then all eigenvalues fulfil that |λ| ≤ 1, and at least one eigenvalue is equal to 1.We will assume that only λ 1 = 1 and we will discuss later on the case where more than one eigenvalue is equal to 1.We can write Equation (29) as Using the well-known result for the sum of the geometric series, then, Notice that the convergence of the series of Equation (32) requires that none of the eigenvalues λ 2 , . . ., λ L has an absolute value equal to 1.Given a finite sequence, we need to estimate the transition matrix T as with nik the number of times the block (x i , x k ) is observed in the sequence.We can then calculate the eigenvalues λ1 , . . ., λL of the matrix T, which is also stochastic, and hence, one of its eigenvalues, λ1 , is equal to 1.
Therefore, the proposed corrected Miller-Madow estimator (CMM) is The correction to the MM estimator should only be used when the absolute value of all eigenvalues but λ1 of the stochastic matrix T are not equal to 1. Otherwise, it is recommended to avoid that correction and simply use ĤMM as the estimator.

III. RESULTS
We now proceed to compare the performance of the different estimators defined in the previous Section II.Let us note first that, given a particular sequence, all entropy estimators, with the exception of the CC and CMM estimators, will yield exactly the same value if we permute arbitrarily all numbers in the sequence.The reason behind this difference is that although the CC estimator takes into account the order in which the different elements appear in the sequence, and the CMM estimator considers the transition probabilities of the outcomes, all other estimators are based solely on the knowledge of the number of times that each possible outcome appears, and this number is invariant under permutations.
Certain estimators, such as CS or CC, can be calculated without any prior knowledge of the possible number of outcomes, L. This feature is particularly advantageous in fields like ecology, where the number of species in a given area may not be accurately known.Conversely, estimators like HS and NSB require an accurate estimate of L for their computation.
As mentioned before, when analyzing an estimator, there are two important statistics to consider: the bias and the standard deviation.Ideally, we would like an estimator with zero bias and low standard deviation.For the entropy, we have already argued that such an unbiased estimator does not exist.Hence, in this case, the "best" estimator (if it exists) would be the one that has the best balance between bias and standard deviation, i.e., the one with the lowest mean squared error given by Equation (4).
In this section, we will analyze and compare these three statistics-bias, standard deviation, and mean squared error-for the ten entropy estimators reviewed in Section II in two main Markovian cases: (A) binary sequences; and (B) in an undersampled regime.
It is possible to compute the Shannon entropy of this random variable using the general definition given by Equation (1).
with the stationary values [5]: The average value and standard deviation of the different entropy estimators were computed using Equation ( 5) for k = 1, 2 by generating all 2 N possible sequences S and computing the probability of each one using Equation (8), where p(X 1 ) are the stationary values given by Equation (37).We have followed this approach to compute the estimator bias B = ⟨ Ĥ⟩ − H and its standard deviation σ = ⟨ Ĥ2 ⟩ − ⟨ Ĥ⟩ 2 .As an example, we plot the absolute value of the bias for sequences of length N = 4 in the colour map of Figure 1, for the ten entropy estimators presented in Section II, as a function of the transition probabilities p 00 and p 11 .
In Figure 1, we can see that, for all ten estimators, the bias is larger in the region around the values p 00 ≃ p 11 ≃ 1.The reason is that, in this region, the stationary probabilities of 0 and 1 are very similar, but given these particular values of the transition probabilities, a short sequence will most likely feature only one of these values, which makes it very hard to correctly estimate the entropy in those cases.Apart for this common characteristic, the performance of the estimators when considering only the bias is quite diverse, all of them having different regions where the bias is lowest (darker areas in the panels).
In order to quantitatively compare the performance of the different estimators, we have aggregated all values in the (p 00 , p 11 ) plane.We define the aggregated bias of an estimator, where the sum runs over all values of the transition probabilities used to produce Figure 1, ∆p = 0.02 is the step value used for the grid of the figure, and B(p 00 , p 11 ) is the bias for the particular values of the transition probabilities.The aggregated bias given by Equation (38) depends only on the sequence length N .We conduct the previous analysis for different values of N .The resulting plot of the aggregate bias B of the entropy estimator as a function of the sequence length is shown in Figure 2. In this figure, we can see that the CC estimator gives the best performance for small values of N , except for N = 2, where the CWJ estimator has the lowest aggregated bias.However, from N = 7 it is the CMM estimator which outperforms the rest.The poor performance of this estimator for low values of N is due to the fact that this estimator, in contrast to the others, requires estimating the transition probabilities, as well as the stationary probabilities, and therefore more data are needed.As expected, all the estimators yield an aggregated bias that vanishes as N increases.
In the colour map of Figure 3, we perform a similar analysis for the standard deviation σ.In the figure, we find that all ten estimators show a similar structure in the sense that the regions of lowest and highest σ are alike.The smallest deviation is mostly located near the left bottom corner of the colour maps and the largest deviation occurs around the regions (0.65 ≲ p 00 ≲ 0.9, 0 ≲ p 11 ≲ 1) and (0 ≲ p 00 ≲ 1, 0.65 ≲ p 11 ≲ 0.9) (green areas in the figures).Of course, the values of σ inside these regions vary for each estimator but they all share this similar feature.In this case, by just looking at the colour maps, it is easy to see that BHM (panel f) and NSB (panel c) estimators are the ones with the lowest standard deviation.
The aggregated standard deviation σ, defined in a similar way to the aggregated bias, σ = (∆p) 2 p00,p11 σ(p 00 , p 11 ), (39) is plotted in Figure 4 as a function of the sequence size N .
In agreement with the previous visual test, the BHM and NSB estimators clearly outperform the rest, even though their advantage is less significant as N increases.
Finally, for every particular N , we compute the mean squared error of the entropy estimators, Equation ( 4), as a function of p 00 and p 11 .Its aggregated value MSE = (∆p) 2 p00,p11 MSE(p 00 , p 11 ), (40) is plotted as a function of N in Figure 5.Even though the CC and CMM estimators outperform the others when considering only the bias, their large dispersion dominates the mean squared error.Overall, it can be seen that the BHM and NSB estimators surpass the rest when both the bias and standard deviation are considered although, again, their advantage becomes less significant as N increases.

B. Undersampled Regime: Block Entropy
Consider a sequence S = X 1 , . . ., X N , where each X i = 0, 1 is a binary variable, with probabilities P (X i = 1) = p, P (X i = 0) = 1−p.We group the sequence in blocks of size n, such that the jth-block is B j = (X j , . . ., X j+n−1 ).We denote by {b i } i=1,...,2 n the set of all possible blocks.The total number of (overlapping) blocks that can be constructed out of a series of N elements is N n = N − n + 1, whereas the total number of possible blocks is L = 2 n .Hence, depending on the values of n and N , the sequence formed by the N n blocks, S n = B 1 , . . ., B Nn , will be in an undersampled regime whenever N n ≪ 2 n .
The block entropy H n is defined by where p(b i ) is the probability of observing the block b i .
The important thing to notice here is that, even if the different outcomes X 1 , . . ., X N of the binary variable X are independent, the block sequence B 1 , . . ., B Nn obeys a Markov process for n ≥ 2. This Markovian property can be easily established by noticing that the block B j = (X j , . . ., X j+n−1 ) can only be followed by the block B j+1 = (X j+1 , . . ., X j+n−1 , 1) with probability p or by the block B j+1 = (X j+1 , . . ., X j+n−1 , 0) with probability 1 − p.Therefore, the probability of B j+1 depends only on the value of block B j .In Appendix B we show that the dynamics of block sequences in the case that X i are i.i.d. is equivalent to that of a new stochastic variable Z that can take any of L = 2 n possible outcomes, z i = 0, 1, . . ., 2 n − 1, with the following transition probabilities for each state z: These types of Markovian systems have been related to Linguistics and Zipf's law [25].
The previous result can be generalized.If the original sequence X 1 , . . ., X N is Markovian of order m ≥ 1, then the dynamics of the block sequences B 1 , . . ., B Nn are also Markovian of order 1, for n ≥ m.
It is well known [5] that the block entropy, when the original sequence S is constructed out of i.i.d.binary variables, obeys where H 1 can be calculated using Equation (36) with p(1) = p and p(0) = 1 − p.Therefore, the entropy rate is constant.We want to compare now the performance of the different estimators defined before when computing the block entropy.In this case, we cannot use an expression equivalent to Equation (5), summing over all sequences S n , since the number of possible sequences is (2 n ) Nn , and it is not possible to enumerate all the sequences even for relatively small values of n and N n .As an example, we employ in our numerical study N n = 20 and n = 6, for which the total number of possible sequences is 2 120 .Therefore, we use the sample mean µ M [ Ĥn ] and the sample variance s 2 M [ Ĥn ] as unbiased estimators to the expected value ⟨ Ĥn ⟩ and the variance σ 2 [ Ĥn ], respectively.After generating a sample of M independent sequences S i n , i = 1, . . ., M , and computing the estimator Ĥn (S i n ) for each of the sequences, those statistics are computed as Ĥn (S i n ), Using Equations ( 43) and ( 44) we can calculate the bias , and the mean squared error s 2 M [ Ĥn ] + B 2 n .In the following, we set M = 10 4 for our simulations.
In Figure 6, we show plots of B n and s M [ Ĥn ] as a function of p ranging from 0.02 to 0.5 with step ∆p = 0.02, for N n = 20.We find that the CC estimator performs remarkably well in terms of bias and we highlight its robustness.Unlike the other estimators, which display significant variations in their bias as p changes, the CC estimator remains approximately constant at a low value.However, the CC estimator presents a high standard deviation, whereas the MLE and MM exhibit the lowest standard deviation.For the majority of estimators considered, we observe that the ones with higher bias are the ones with lower deviation.An exception is the HS estimator.
To analyze the changes in the overall performances of the estimators with different values of N , we calculated the aggregated bias as Similarly, we calculated the aggregated standard deviation as and the aggregated mean squared error as The resulting plots are shown in Figures 7-9, respectively.It was expected that the total bias of the estimators would decrease by increasing N , and in Figure 7 it can be seen that this is indeed the case for all estimators except for the BHM estimator.Surprisingly, the bias of this estimator follows a typical pattern of decreasing as the sample size increases, just like the other estimators.However, it takes an unexpected turn starting at N = 20, as it begins to increase once more.A possible reason for this behaviour is that the BHM estimator is designed to minimize the MSE.
Similarly to the results obtained for the binary Markovian case, the CC estimator demonstrates in Figure 7 excellent performance when solely evaluating bias.Even though its performance for a data size of N = 5 is not outstanding, it begins to outperform all but the CS, CWJ, and HS estimators starting at N = 10, and from that point onward, the CC estimator consistently ranks among the top-performing estimators, together with the NSB and CWJ estimators.
By comparing Figures 7 and 8, it can be seen that there is a certain balance: an estimator with a higher bias usually has a lower deviation when compared to others.This is clearly the case for the MLE and MM estimators, as they are the two with the worst performances in terms of bias, but they have the lowest aggregated standard deviation for most of the data sizes considered.
In this interplay between bias and standard deviation observed for most of the entropy estimators considered here, the NSB estimator is the one that presents the best performance when considering both statistics.From Figure 9, it is clear that this estimator shows the lowest aggregated mean squared error, although just from N = 20 the difference with other estimators like the CC or the G becomes vanishingly small.
It can be seen in Figures 7-9 that the performance of the CMM estimator is very similar to MM's performance, especially for large values of N .This suggests that for Markovian systems defined by the transition probabilities given by Equation ( 42), the correction introduced in Equation ( 35) is not significant, particularly in the limit of large N .

IV. DISCUSSION
We have made a detailed comparison of nine of the most widely used entropy estimators when applied to Markovian sequences.We have also included in this analysis a new proposed estimator, motivated by the results presented in ref. [34].One crucial difference in the way these estimators are constructed is that only the correlation coverage-adjusted estimator [27] and the corrected Miller-Madow estimator take into account the order in which the elements appear in the sequence.To calculate the CC estimator, it is necessary to know the entire history of the sequence , and the computation of the CMM estimator requires the calculation of the transition probabilities.On the contrary, for all other estimators, it is sufficient to know the number of times that each element is present in the sequence, independently of the position in which they appear.Remarkably, this novel approach to the issue of entropy estimation allows us to reduce the bias, even in undersampled regimes.Unfortunately, both of these estimators present large dispersion, which reduces their overall quality.
We have found that, when dealing with Markovian sequences, on average, the Nemenman-Shafee-Bialek estimator [54][55][56] outperforms the rest when taking into account both the bias and the standard deviation for both analyzed cases, namely, binary sequences and an undersampled regime.Ref. [16] presented a similar analysis but for uniformly distributed sequences of bytes and bites, and concluded that the estimator with the lowest mean squared error was the Shrinkage estimator [20].Hence, when choosing a reliable estimator, it is not only important to consider the amount of data available, but also whether correlations might be present in the sequence.
Further analyses should consider Markovian sequences of higher order [63,64].Another interesting topic would be systems described with continuous variables [65,66], where the presence of noise is particularly important.Finally, we stress that there are alternative entropies not considered here [67], for which the existence of accurate estimators is still an open question.Finally, an exciting possibility would be a comparative study of estimators valid for more than one random variable or probability distributions, leading, respectively, to mutual informa-tion [68,69] and relative entropy [47,70,71].
V. ACKNOWLEDGMENTS

FIG. 2 .FIG. 4 .
FIG. 2. Aggregated bias of the entropy estimators for Markovian binary sequences as a function of the sequence size N .

FIG. 5 .
FIG.5.Aggregated mean squared error of the entropy estimators for Markovian binary sequences as a function of the sequence size N .

FIG. 6 .
FIG.6.Bias (top) and standard deviation (bottom) of the entropy estimators, when applied to Markovian sequences of length N = 20 and L = 26 , generated from the transition probabilities given by Equation(42), as functions of p, which vary from 0.02 to 0.5 with step ∆p = 0.02.By construction, the plot is symmetric around p = 0.5.

FIG. 7 .
FIG. 7. Aggregated bias of the entropy estimators for Markovian sequences in the undersampled regime with L = 26 , generated from the transition probabilities given by Equation(42), as a function of the sequence size N .

FIG. 8 .FIG. 9 .
FIG.8.Aggregated standard deviation of the entropy estimators for Markovian sequences in the undersampled regime with L = 26 , generated from the transition probabilities given by Equation(42), as a function of the sequence size N .