This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

This article analyzes the role of entropy in Bayesian statistics, focusing on its use as a tool for detection, recognition and validation of eigen-solutions. “Objects as eigen-solutions” is a key metaphor of the cognitive constructivism epistemological framework developed by the philosopher Heinz von Foerster. Special attention is given to some objections to the concepts of probability, statistics and randomization posed by George Spencer-Brown, a figure of great influence in the field of radical constructivism.

In several previously published articles, I defend the use of Bayesian statistics in the epistemological framework of cognitive constructivism. In particular, I show how the FBST—The Full Bayesian Significance Test for precise hypotheses—can be used as a tool for detection, recognition and validation of eigen-solutions, see [

In Statistics, especially in the design of statistical experiments, randomization plays a role at the very core of objective-subjective complementarity, a concept of great significance in the epistemological framework of cognitive constructivism as well as in the theory of Bayesian statistics. The pivotal role of randomization in a well designed statistical experiment is that of a decoupling operation used to sever illegitimate functional links, thus avoiding spurious associations, breaking false influences, separating confounding variables,

The use of randomization in Statistics is an original idea of Charles Saunders Peirce and Joseph Jastrow, see [

I shall herein analyze some objections to the concepts of probability, statistics and randomization posed by George Spencer-Brown, a figure of great influence in the field of radical constructivism. Abstinence from statistical analysis and related quantitative methods may, at first glance, look like an idyllic fantasy island where many beautiful dreams come true. However, in my personal opinion, this position threatens to exile the cognitive constructivism epistemological framework to a limbo of powerless theories. In this article, entropy is presented as a cornerstone concept for the precise analysis and a key idea for the correct understanding of several important topics in probability and statistics. This understanding should help to clear the way for establishing Bayesian statistics as a preferred tool for scientific inference in mainstream cognitive constructivism.

In what follows, Section 2 corresponds to the first part of this article's title and elaborates upon “the case of Spencer-Brown

In [

Making (or arbitrating) distinctions is, according to Spencer-Brown, the basic (if not the only) operation of human knowledge, an idea that has either influenced or been directly explored by several authors in the radical constructivist movement. The following quotations, from [

Retroactive reclassification of observations is one of the scientist's most important tools, and we shall meet it again when we consider statistical arguments. (p. 23)

We have found so far that the concept of probability used in statistical science is meaningless in its own terms; but we have found also that, however meaningful it might have been, its meaningfulness would nevertheless have remained fruitless because of the impossibility of gaining information from experimental results, however significant. This final paradox, in some ways the most beautiful, I shall call the Experimental Paradox (p. 66).

The essence of randomness has been taken to be absence of pattern. But what has not hitherto been faced is that the absence of one pattern logically demands the presence of another. It is a mathematical contradiction to say that a series has no pattern; the most we can say is that it has no pattern that anyone is likely to look for. The concept of randomness bears meaning only in relation to the observer: If two observers habitually look for different kinds of pattern they are bound to disagree upon the series which they call random (p. 105).

Several authors concur, at least in part, with my opinion about Spencer-Brown's technical analysis of probability and statistics, see [

I also disapprove of some of Spencer-Brown's proposed methodologies to detect “relevant” event sequences, that is, his criteria to “mark distinct patterns” in empirical observations. My objections have a lot in common with the standard caveats against

Still, it is an error to argue in front of your data. You find yourself insensibly twisting them around to fit your theories.

Finally, I am also suspicious or skeptical about the intention behind some applications of Spencer-Brown's research program, including the use of extrasensory empathic perception for coded message communication, exercises on object manipulation using paranormal powers,

[On telepathy:] Taking the psychical research data (that is, the residuum when fraud and incompetence are excluded), I tried to show that these now threw more doubt upon existing pre-suppositions in the theory of probability than in the theory of communication.

[On psychokinesis:] If such an ‘agency’ could thus ‘upset’ a process of randomizing, then all our conclusions drawn through the statistical tests of significance would be equally affected, including the conclusions about the ‘psychokinesis’ experiments themselves. (How are the target numbers for the die throws to be randomly chosen? By more die throws?) To speak of an ‘agency’ which can ‘upset’ any process of randomization in an uncontrollable manner is logically equivalent to speaking of an inadequacy in the theoretical model for empirical randomness, which, like the luminiferous ether of an earlier controversy, becomes, with the obsolescence of the calculus in which it occurs, a superfluous term.

Spencer-Brown's conclusions in [

Curiously, Charles Saunders Peirce and his student Joseph Jastrow, who introduced the idea of randomization in statistical trials, also struggled with some of the very same dilemmas faced by Spencer-Brown, namely, the eventual detection of distinct patterns or seemingly ordered (sub)strings in a long random sequence. Peirce and Jastrow did not have at their disposal the heavy mathematical artillery I have quoted in the previous paragraphs. Nevertheless, as experienced explorers who are not easily lured, when traveling in desert sands, by the mirage of a misplaced oasis, these intrepid pioneers were able to avoid the conceptual pitfalls that led Spencer-Brown so far astray. For more details see [

As stated in the introduction, the cognitive constructivist framework can be supported by the FBST, a non-decision theoretic formalism drawn from Bayesian statistics, see [

The focus of the present section is the properties of “natural” and “artificial” random sequences. The implementation of probabilistic algorithms requires good random number generators (RNGs). These algorithms include: numerical integration methods such as Monte Carlo or Markov Chain Monte Carlo (MCMC); evolutionary computing and stochastic optimization methods such as genetic programming and simulated annealing; and also, of course, the efficient implementation of randomization methods.

The most basic random number generator replicates i.i.d. (independent and identically distributed) random variables uniformly distributed in the unit interval, [0, 1]. From this basic uniform generator one gets a uniform generator in the d-dimensional unit box, [0, 1]^d.

Historically, the technology of random number generators was developed in the context of Monte Carlo methods. The nature of Monte Carlo algorithms makes them very sensitive to correlations, auto-correlations and other statistical properties of the random number generator used in their implementation. Hence, in this context, the statistical properties of “natural” and “artificial” random sequences came under close scrutiny. For the aforementioned historical and technological reasons, Monte Carlo methods are frequently used as a benchmark for testing the properties of these generators. Hence, although Monte Carlo methods proper lie outside the scope of this article, we shall keep them as a standard application benchmark in our discussions.

The clever ideas and also the caveats of engineering good random number generators are at the core of many paradoxes found by Spencer-Brown. The objective of this section is to explain the basic ideas behind these generators and, in so doing, avoid the conceptual traps and pitfalls that took Spencer-Brown's analyses so far off course.

The concept of randomness is usually applied to a variable or a process (to be generated or observed) involving some uncertainty. The following definition is presented at p. 10 of [

A random event is an event which has a chance of happening, and probability is a numerical measure of that chance.

Monte Carlo, and several other probabilistic algorithms, require a random number generator. With the last definition in mind, engineering devices based on sophisticated physical processes have been built in the hope of offering a source of “true” random numbers. However, these special devices were cumbersome, expensive, neither portable nor easily available, and often unreliable. Moreover, practitioners soon realized that simple deterministic sequences could successfully be used to emulate a random generator, as stated in the following quotes (our emphasis) at p. 26 of [

For electronic digital computers it is most convenient to calculate a sequence of numbers one at a time as required, by a completely specified rule which is, however, so devised that no

A sequence of _{i}

Many deterministic random emulators used today are Linear Congruential Pseudo-Random Generators (LCPRGs), defined by a recurrence of the form x_{i+1} = (a x_i + c) mod m. The starting value, x_0, is called the seed. Given the same seed, the LCPRG will reproduce the same sequence, a very convenient feature for tracing, debugging and verifying application programs.
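The mechanics of an LCPRG can be sketched in a few lines of Python. The multiplier, increment and modulus below are the classic “Numerical Recipes” constants, chosen here only for illustration; they are not parameters discussed in the text.

```python
def lcprg(seed, n, a=1664525, c=1013904223, m=2**32):
    """Linear Congruential Pseudo-Random Generator.

    Returns n floats in [0, 1) from the recurrence
    x_{i+1} = (a * x_i + c) mod m.
    """
    x = seed
    out = []
    for _ in range(n):
        x = (a * x + c) % m
        out.append(x / m)
    return out

# The same seed reproduces the same sequence -- convenient for
# tracing, debugging and verifying application programs.
assert lcprg(seed=42, n=5) == lcprg(seed=42, n=5)
```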

However, LCPRGs are not a universal solution. For example, it is trivial to devise some statistics whose behaviour will be far from random, see [

Given n iid random variables, x_i, each with variance σ², the variance of their mean, x̄ = (1/n) Σ_i x_i, is σ²/n.

Hence, mean values of iid random variables converge to their expected values at a rate of O(1/√n).
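This O(1/√n) behaviour is easy to observe empirically. The following sketch (assuming, for illustration, the problem of estimating the mean of a Uniform(0, 1) variable, whose true value is 0.5) averages the absolute error over several seeded replications:

```python
import random

def mc_mean_error(n, seed):
    """Absolute error of the Monte Carlo estimate of E[U], U ~ Uniform(0, 1)."""
    rng = random.Random(seed)
    estimate = sum(rng.random() for _ in range(n)) / n
    return abs(estimate - 0.5)

def avg_error(n, reps=50):
    """Average absolute error over several seeded replications."""
    return sum(mc_mean_error(n, seed) for seed in range(reps)) / reps

# Multiplying n by 100 should divide the typical error by about 10,
# consistent with the O(1/sqrt(n)) convergence rate.
assert avg_error(10000) < avg_error(100)
```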

Quasi-random sequences are deterministic sequences built not to emulate random sequences, as pseudo-random sequences do, but to achieve faster convergence rates. Comparing a quasi-random and a (pseudo-)random point set in the unit square, [0, 1]², by visual inspection we see that the points of the quasi-random sequence are more “homogeneously scattered”, that is, they do not “clump together”, as the points of the (pseudo-)random sequence often do.

Let us consider axis-parallel rectangles in the unit box,

The discrepancy of the sequence x_{1:n} measures the worst-case difference between the fraction of the points x^1, … x^n falling inside such a rectangle, R, and the rectangle's volume, D(x_{1:n}) = sup_R | #{i : x^i ∈ R} / n − vol(R) |.

It is possible to prove that the discrepancy of the Halton-Hammersley sequence, defined next, is of order O(log(n)^d n^{−1}), see chapter 2 of [

Halton-Hammersley sets: Given distinct prime bases p_1, …, p_{d−1}, the n points x^1, x^2, … x^n of the Hammersley set in the d-dimensional unit box are x^i = (i/n, φ_{p_1}(i), …, φ_{p_{d−1}}(i)).

That is, the (k+1)-th coordinate of x^i is the radical inverse φ_{p_k}(i), obtained by writing i in base p_k and reflecting its digits about the radix point.

The Halton-Hammersley set is a generalization of the van der Corput set, built in the bidimensional unit square,

| i (decimal) | φ_2(i) (decimal) | i (binary) | φ_2(i) (binary) |
|---|---|---|---|
| 1 | 0.5 | 1 | 0.1 |
| 2 | 0.25 | 10 | 0.01 |
| 3 | 0.75 | 11 | 0.11 |
| 4 | 0.125 | 100 | 0.001 |
| 5 | 0.625 | 101 | 0.101 |
| 6 | 0.375 | 110 | 0.011 |
| 7 | 0.875 | 111 | 0.111 |
| 8 | 0.0625 | 1000 | 0.0001 |
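The radical-inverse construction behind this table can be sketched as follows; the function names are ours, but the values produced for base 2 reproduce the table above exactly:

```python
def radical_inverse(i, base=2):
    """phi_base(i): write i in the given base and reflect its digits
    about the radix point; phi_2 generates the van der Corput set."""
    inv, denom = 0.0, 1.0
    while i > 0:
        i, digit = divmod(i, base)
        denom *= base
        inv += digit / denom
    return inv

# Reproduce the table above: phi_2(1), ..., phi_2(8).
vals = [radical_inverse(i, 2) for i in range(1, 9)]
assert vals == [0.5, 0.25, 0.75, 0.125, 0.625, 0.375, 0.875, 0.0625]

def halton_point_2d(i):
    """i-th point of a 2-D Halton sequence, using prime bases 2 and 3."""
    return (radical_inverse(i, 2), radical_inverse(i, 3))
```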

Quasi-random sequences, also known as low-discrepancy sequences, can substitute for pseudo-random sequences in some applications of Monte Carlo methods, achieving higher accuracy with less computational effort, see [

First, quasi-Monte Carlo methods are valid for integration problems, but may not be directly applicable to simulations, due to the correlations between the points of a quasi-random sequence. … A second limitation: the improved accuracy of quasi-Monte Carlo methods is generally lost for problems of high dimension or problems in which the integrand is not smooth.

When asked to look at patterns like those in

One major source of confusion is the fact that randomness involves two distinct ideas: a random generating process, and a patternless product or outcome.

These two aspects of randomness are closely related. We ordinarily expect outcomes generated by a random process to be patternless. Most of them are. Conversely, a sequence whose order is random supports the hypothesis that it was generated by a random mechanism, whereas sequences whose order is not random cast doubt on the random nature of the generating process.

Spencer-Brown was intrigued by the apparent incompatibility of the notions of primary and secondary randomness. The apparent collision of these two notions generates several interesting paradoxes, leading Spencer-Brown to question the applicability of the concept of randomness in particular and of probability and statistical analysis in general, see [

The relation between the joint and conditional entropy for a pair of random variables (see Section 4), H(X, Y) = H(X) + H(Y | X), motivates the definition of first, second and higher order entropies, defined over the distribution of words of size k in a sequence.

It is possible to use these entropy measures to assess the disorder or lack of pattern in a given finite sequence, using the empirical probability distributions of single letters, pairs, triplets, and so on.
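As an illustration (our own sketch, not the author's implementation), the empirical word entropies can be computed as follows; note how a periodic binary string has maximal first-order entropy but only two distinct pairs, so its second-order entropy stays near one bit while the maximum for pairs is two bits:

```python
from collections import Counter
from math import log2

def empirical_entropy(s, k=1):
    """Empirical entropy (in bits per word) of the overlapping
    words of length k in the string s."""
    words = [s[i:i + k] for i in range(len(s) - k + 1)]
    counts = Counter(words)
    n = len(words)
    return -sum((c / n) * log2(c / n) for c in counts.values())

periodic = "01" * 50
assert empirical_entropy(periodic, 1) == 1.0   # 0s and 1s equally frequent
assert empirical_entropy(periodic, 2) < 1.1    # only the pairs '01' and '10'
```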

In the article [

This effect is known as the

When people invent superfluous explanations because they perceive patterns in random phenomena, they commit what is known in statistical parlance as Type I error. The other way of going awry, known as Type II error, occurs when one dismisses stimuli showing some regularity as random. The numerous randomization studies in which participants generated too many alternations and viewed this output as random, as well as the judgments of over-alternating sets as maximally random in the perception studies, were all instances of Type II error in research results.

It is known that other gamblers exhibit the opposite behavior, preferring to bet on

In analogy to Piaget's operations, which are conceived as internalized actions, perceived randomness might emerge from hypothetical action, that is, from a thought experiment in which one describes, predicts, or abbreviates the sequence. The harder the task in such a thought experiment, the more random the sequence is judged to be.

The same hierarchical decomposition scheme used for higher order conditional entropy measures can be adapted to measure the disorder or patternlessness of a sequence, relative to a given subject's model of “computer” or generation mechanism. In the case of a discrete string, this generation model could be, for example, a deterministic or probabilistic Turing machine, a fixed or variable length Markov chain,

Entropy is the cornerstone concept of the preceding section, used as a central idea in the understanding of order and disorder in stochastic processes. Entropy is the key that allowed us to unlock the mysteries and solve the paradoxes of subjective randomness, making it possible to reconcile the notions of unpredictability of stochastic processes and patternlessness of randomly generated sequences. Similar entropy-based arguments reappear, in more abstract, subtle or intricate forms, in the analysis of technical aspects of Bayesian statistics like, for example, the use of prior and posterior distributions and the interpretation of their informational content. This section gives a short review covering the definition of entropy, its main properties, and some of its most important uses in mathematical statistics.

The origins of the entropy concept lay in the fields of Thermodynamics and Statistical Physics, but its applications have extended far and wide to many other phenomena, physical or not. The entropy of a probability distribution,

This section introduces the notion of convexity, a concept at the heart of the definition of entropy and generalized directed divergences. Convexity arguments are also needed to prove, in the following sections, important properties of entropy and its generalizations. In this section we use the following notations:

A region C ⊆ R^n is convex if, given any two points x^1, x^2 ∈ C and weights 0 ≤ λ_1, λ_2 ≤ 1, λ_1 + λ_2 = 1, the convex combination of these two points remains in the region, λ_1 x^1 + λ_2 x^2 ∈ C.

Finite Convex Combination: A region C ⊆ R^n is convex if and only if every finite convex combination of points x^1, x^2, … x^m ∈ C, that is, Σ_j λ_j x^j with λ_j ≥ 0 and Σ_j λ_j = 1, remains in C.

By induction on the number of points,

The epigraph of a function f : R^n → R is the region lying on or above its graph, epi(f) = {(x, y) ∈ R^{n+1} | y ≥ f(x)}.

A function f is convex if its epigraph, epi(f), is a convex region.

A differentiable function, f, is convex if and only if its graph lies above its tangents, that is, f(y) ≥ f(x) + f′(x)(y − x).

Consider x^0 = λ_1 x^1 + λ_2 x^2, and the tangent inequality at x^0. At x^1 and x^2 we have, respectively, that f(x^1) ≥ f(x^0) + f′(x^0) λ_2 (x^1 − x^2) and f(x^2) ≥ f(x^0) + f′(x^0) λ_1 (x^2 − x^1). Multiplying the first inequality by λ_1, the second by λ_2, and adding them, we obtain the desired result.

Jensen Inequality: If f is a convex function and X a random variable, then E[f(X)] ≥ f(E[X]).

For discrete distributions the Jensen inequality is a special case of the finite convex combination theorem. Standard arguments from analysis allow us to extend the result to continuous distributions.
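A quick numeric illustration of the Jensen inequality, a sketch with the convex function f(x) = x² and a Uniform(−1, 1) variable chosen arbitrarily:

```python
import random

# Numeric check of the Jensen inequality E[f(X)] >= f(E[X])
# for the convex function f(x) = x**2 and X ~ Uniform(-1, 1).
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(10000)]
mean = sum(xs) / len(xs)
lhs = sum(x * x for x in xs) / len(xs)  # E[f(X)], close to 1/3
rhs = mean * mean                       # f(E[X]), close to 0
assert lhs >= rhs
```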

If

If the system has n possible states, with probabilities p_1, … p_n, p_i ≥ 0, Σ_i p_i = 1, a measure of entropy, H_n(p_1, … p_n), should satisfy the following requirements:

The entropy is unchanged if an impossible state is added to the system, H_{n+1}(p_1, … p_n, 0) = H_n(p_1, … p_n).

The system's entropy is minimal and null when the system is fully determined, that is, H_n(p) = 0 if p_i = 1 for some state i.

The system's entropy is maximal when all states are equally probable, H_n(p_1, … p_n) ≤ H_n(1/n, … 1/n).

A system's maximal entropy increases with the number of states, H_n(1/n, … 1/n) < H_{n+1}(1/(n+1), … 1/(n+1)).

Entropy is an extensive quantity: for two independent systems, the entropy of the joint system is the sum of the entropies of its components, H(p ⊗ q) = H(p) + H(q).

The Boltzmann-Gibbs-Shannon measure of entropy, H(p) = −Σ_i p_i log(p_i), satisfies all these requirements.

For the Boltzmann-Gibbs-Shannon entropy we can extend requirement 8, and compute the composite Negentropy even without independence:

If we add this last identity as item number 9 in the list of requirements, we have a characterization of Boltzmann-Gibbs-Shannon entropy, see [

Like many important concepts, this measure of entropy was discovered and re-discovered several times in different contexts, and sometimes the uniqueness and identity of the concept was not immediately recognized. A well known anecdote recounts the answer given by von Neumann when Shannon asked him what to call a “newly” discovered concept in Information Theory. As reported by Shannon at p. 180 of [

“My greatest concern was what to call it. I thought of calling it information, but the word was overly used, so I decided to call it uncertainty. When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage.”

In order to check that requirement (6) is satisfied, we can use (with

Shannon Inequality: If p and q are two probability distributions over the same n states, then the relative information I(p, q) = Σ_i p_i log(p_i / q_i) is non-negative, and null if and only if p = q.

By Jensen inequality, if φ is a convex function, then for any weights q_i ≥ 0 summing to one, Σ_i q_i φ(t_i) ≥ φ(Σ_i q_i t_i).

Taking φ(t) = t log(t) in the Jensen inequality, with weights q_i and arguments t_i = p_i / q_i, yields Σ_i q_i (p_i / q_i) log(p_i / q_i) = I(p, q) ≥ φ(1) = 0, which is Shannon's inequality.

Shannon's inequality motivates the use of the Relative Information as a measure of (non symmetric) “distance” between distributions. In Statistics this measure is known as the Kullback-Leibler distance. The denominations Directed Divergence or Cross Information are used in Engineering. The proof of Shannon inequality motivates the following generalization of divergence:

Csiszar's φ-divergence:

Given a convex function φ, with φ(1) = 0, the φ-divergence of p relative to q is defined as d_φ(p, q) = Σ_i q_i φ(p_i / q_i).

For example, taking φ(t) = (t − 1)² and φ(t) = |t − 1|, we can define the quadratic and the absolute divergence as d_2(p, q) = Σ_i (p_i − q_i)² / q_i and d_1(p, q) = Σ_i |p_i − q_i|.
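A minimal numeric sketch of the relative information (Kullback-Leibler divergence), illustrating Shannon's inequality and the asymmetry of this directed divergence; the distributions are chosen arbitrarily for illustration:

```python
from math import log

def kl_divergence(p, q):
    """Relative information (Kullback-Leibler divergence) I(p, q)."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p, q = [0.5, 0.5], [0.9, 0.1]
assert kl_divergence(p, q) >= 0                    # Shannon's inequality
assert kl_divergence(p, p) == 0                    # null iff p = q
assert kl_divergence(p, q) != kl_divergence(q, p)  # a directed divergence
```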

This section analyzes solution techniques for some problems formulated as entropy maximization. The results obtained in this section are needed to obtain some fundamental principles of Bayesian statistics, presented in the following sections. This section also presents the Bregman algorithm for solving constrained maxent problems on finite distributions. The analysis of small problems (far from asymptotic conditions) poses many interesting questions in the study of subjective randomness, an area so far neglected in the literature.

Given a prior distribution, q, we seek, among the distributions p_{1:n} satisfying a set of linear constraints, the one that minimizes the relative information I(p, q).

The Lagrangian function of this optimization problem, and its derivatives are:

Equating these derivatives to zero, we obtain the first order optimality conditions (VOCs):

We can further replace the unknown probabilities, p_i,

The last form of the VOCs motivates the use of iterative algorithms of Gauss-Seidel type, solving the problem by cyclic iteration. In this type of algorithm, one cyclically “fits” one equation of the system, for the current value of the other variables. For a detailed analysis of this type of algorithm, see [

Initialization: Take ^{t}^{m}

Iteration step: for

From our discussion of entropy optimization under linear constraints, it should be clear that the maximum relative entropy distribution for a system under constraints on the expectation of functions taking values on the system's states, E_p[a_k(x)] = b_k, k = 0 … m (including the normalization constraint, a_0 = b_0 = 1), has the exponential form p_i ∝ q_i exp(Σ_k θ_k a_k(i)).

Notice that we took _{0} = −(_{0} − 1),_{k}_{k}

Several distributions commonly used in Statistics can be interpreted as MaxEnt densities (relative to the uniform distribution, if not otherwise stated) given some constraints over the expected value of state functions. For example:

The Normal distribution: the MaxEnt distribution on R^n under constraints on the first and second moments, E[x] = μ and Cov(x) = Σ.

The Wishart distribution: the MaxEnt distribution over symmetric positive definite matrices, S, under constraints on the expectations E[S] and E[log det(S)].

The Dirichlet distribution: the MaxEnt distribution on the simplex under constraints on the expectations E[log(θ_k)].
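A classic small illustration of this exponential form is Jaynes' die problem: find the MaxEnt distribution on the faces 1, …, 6 with a constrained mean. The sketch below is our own; the single multiplier θ is found by bisection, and a fair mean of 3.5 recovers the uniform distribution:

```python
from math import exp

def maxent_die(target_mean, lo=-10.0, hi=10.0, iters=200):
    """MaxEnt distribution on die faces 1..6 with a constrained mean,
    relative to the uniform prior: p_i proportional to exp(theta * i),
    with the multiplier theta found by bisection."""
    faces = range(1, 7)

    def mean_for(theta):
        w = [exp(theta * i) for i in faces]
        return sum(i * wi for i, wi in zip(faces, w)) / sum(w)

    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    theta = (lo + hi) / 2.0
    w = [exp(theta * i) for i in faces]
    z = sum(w)
    return [wi / z for wi in w]

# A loaded mean of 4.5 tilts the probabilities toward the higher faces.
p = maxent_die(4.5)
assert abs(sum(p) - 1.0) < 1e-9
assert abs(sum(i * pi for i, pi in zip(range(1, 7), p)) - 4.5) < 1e-6
assert p == sorted(p)
```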

Richard Jeffrey considered the problem of updating an old probability distribution,

His solution to this problem, known as the

In this section the Fisher Information Matrix is defined and used to obtain the geometrically invariant Jeffreys' prior distributions. These distributions also have interesting asymptotic properties concerning the representation of vague or no information. The properties of Fisher's metric discussed in this section are also needed to establish further asymptotic results in the next section.

The Fisher Information Matrix, J(θ), is defined as minus the expected value of the Hessian of the log-likelihood, J(θ) = −E[∂² log p(x | θ) / ∂θ ∂θ′].

The Fisher information matrix can also be written as the covariance matrix of the gradient of the same log-likelihood (the score function), J(θ) = E[(∂ log p(x | θ) / ∂θ)(∂ log p(x | θ) / ∂θ)′].

Harold Jeffreys used the Fisher metric to define a class of prior distributions proportional to the square root of the determinant of the information matrix, π(θ) ∝ det(J(θ))^{1/2}.
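As a concrete sketch (our illustration, not from the text): for a single Bernoulli trial the Fisher information is I(θ) = 1/(θ(1 − θ)), so Jeffreys' prior is proportional to θ^{−1/2}(1 − θ)^{−1/2}, the Beta(1/2, 1/2) density, whose normalizing constant is 1/π. The code checks this normalization numerically:

```python
from math import sqrt, pi

def fisher_info_bernoulli(theta):
    """Fisher information of one Bernoulli trial: 1 / (theta (1 - theta))."""
    return 1.0 / (theta * (1.0 - theta))

def jeffreys_density(theta):
    """Jeffreys' prior for the Bernoulli model, proportional to
    sqrt(I(theta)); the constant 1/pi makes it the Beta(1/2, 1/2) density."""
    return sqrt(fisher_info_bernoulli(theta)) / pi

# Midpoint-rule check that the density integrates to 1 over (0, 1);
# the integrable singularities at 0 and 1 make the rule converge slowly.
n = 400_000
mass = sum(jeffreys_density((k + 0.5) / n) / n for k in range(n))
assert abs(mass - 1.0) < 2e-3
```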

Lemma: Jeffreys' priors are geometric objects in the sense of being invariant by a continuous and differentiable change of coordinates in the parameter space,

Example: For the multinomial distribution,

In general, Jeffreys' priors are not minimally informative in any sense. However, in pp. 41–54 of [

Although Jeffreys' priors in general do not maximize the information gain,

Comparing the several versions of noninformative priors in the multinomial example, one can say that Jeffreys' prior “discounts” half an observation of each kind, while the maxent prior discounts one full observation, and the flat prior discounts none. Similarly, slightly different versions of uninformative priors for the multivariate normal distribution are shown in [

Perhaps the most embarrassing feature of noninformative priors, however, is simply that there are often so many of them.

One response to this criticism, to which Berger explicitly subscribes in p. 90 of [

It is rare for the choice of a noninformative prior to markedly affect the answer… so that any reasonable noninformative prior can be used. Indeed, if the choice of noninformative prior does have a pronounced effect on the answer, then one is probably in a situation where it is crucial to involve subjective prior information.

The robustness of the inference procedures to variations on the form of the uninformative prior can be tested using sensitivity analysis, as discussed in Section 4.7 of [

Posterior convergence constitutes the principal mechanism enabling information acquisition or learning in Bayesian statistics. Arguments based on relative information,

Posterior Consistency for Discrete Parameters:

Consider a model where the parameter space is discrete, Θ = {θ^1, θ^2, …}, and the data consists of n observations, x^1, … x^n.

Further, assume that this model has a unique parameter vector, θ^0, giving the best approximation for the “true” predictive distribution

Then,

Consider the logarithmic coefficient

The first term is a constant, and the second term is a sum whose terms all have negative expected value (relative to the true distribution) whenever θ^k ≠ θ^0, since θ^0 is the unique argument minimizing the relative information between the predictive distribution indexed by θ^k and the true distribution.

We can extend this result to continuous parameter spaces, assuming several regularity conditions, like continuity, differentiability, and having the argument θ^0 as an interior point of Θ with the appropriate topology. In such a context, we can state that, given a pre-established small neighborhood around θ^0, like a ball of radius ε centered on θ^0, this neighborhood concentrates, as the sample size grows, almost all of the posterior mass.
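The discrete-parameter consistency result can be watched in a small simulation, a sketch under illustrative assumptions: Bernoulli data with true parameter 0.7, three candidate parameter values, and a uniform prior:

```python
import random
from math import log, exp

def log_likelihood(theta, xs):
    """Bernoulli log-likelihood of the binary data xs."""
    return sum(log(theta) if x else log(1.0 - theta) for x in xs)

def posterior(thetas, xs):
    """Posterior over a discrete parameter set, uniform prior,
    computed stably via the log-sum-exp trick."""
    lls = [log_likelihood(t, xs) for t in thetas]
    m = max(lls)
    ws = [exp(ll - m) for ll in lls]
    z = sum(ws)
    return [w / z for w in ws]

rng = random.Random(0)
thetas = [0.3, 0.5, 0.7]
xs = [rng.random() < 0.7 for _ in range(2000)]

post = posterior(thetas, xs)
# The posterior concentrates on the parameter closest, in the sense of
# relative information, to the data-generating distribution.
assert post[2] > 0.99
```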

The next results show the convergence in distribution of the posterior to a Normal distribution. For that, we need the Fisher information matrix identity from the last section.

Posterior Normal Approximation:

The posterior distribution converges to a Normal distribution with mean θ^0 and precision n J(θ^0).

We only have to write the second order log-posterior Taylor expansion centered at

The term of order zero is a constant. The linear term is null, for the gradient of the log-posterior vanishes at its maximum.

The Hessian is negative definite, by the regularity conditions and the uniqueness of θ^0. We also see that the Hessian grows (on average) linearly with the sample size, n.
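The normal approximation can be checked on a conjugate example (our illustration, not from the text): for Bernoulli data under a flat prior the posterior is a Beta distribution, which the Normal approximation, with mean at the MLE and precision n I(θ̂), matches closely for moderate n:

```python
from math import sqrt, pi, exp, log, lgamma

def beta_pdf(x, a, b):
    """Density of the Beta(a, b) distribution."""
    logc = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(logc + (a - 1.0) * log(x) + (b - 1.0) * log(1.0 - x))

def normal_pdf(x, mu, var):
    """Density of the Normal(mu, var) distribution."""
    return exp(-(x - mu) ** 2 / (2.0 * var)) / sqrt(2.0 * pi * var)

# s successes in n Bernoulli trials under a flat prior give a
# Beta(s + 1, n - s + 1) posterior.  The asymptotic approximation is
# Normal with mean at the MLE and variance 1 / (n * I(mle)), where
# I(theta) = 1 / (theta * (1 - theta)) is the Fisher information.
n, s = 1600, 640
mle = s / n                    # 0.4
var = mle * (1.0 - mle) / n    # = 1 / (n * I(mle))

for x in (0.38, 0.40, 0.42):
    exact = beta_pdf(x, s + 1, n - s + 1)
    approx = normal_pdf(x, mle, var)
    assert abs(exact - approx) / exact < 0.05
```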

The objections raised by Spencer-Brown against probability and statistics, analyzed in Sections 1 and 2, are somewhat simplistic and stereotypical, possibly explaining why they had little influence outside a close circle of admirers, most of them related to the radical constructivism movement. However, arguments very similar to those used to demystify Spencer-Brown's misconceptions and elucidate his misunderstandings reappear in more subtle or abstract forms in the analysis of far more technical matters like, for example, the use and interpretation of prior and posterior distributions in Bayesian statistics.

In this article, entropy is presented as a cornerstone concept for the precise analysis and a key idea for the correct understanding of several important topics in probability and statistics. This understanding should help to clear the way for establishing Bayesian statistics as a preferred tool for scientific inference in mainstream cognitive constructivism.

(Pseudo-)random and quasi-random point sets on the unit box.


The author is grateful for the support of the Department of Applied Mathematics of the Institute of Mathematics and Statistics of the University of São Paulo, FAPESP—Fundação de Amparo à Pesquisa do Estado de São Paulo, and CNPq—Conselho Nacional de Desenvolvimento Científico e Tecnológico (grant PQ-306318-2008-3). The author is also grateful for the helpful discussions with several of his professional colleagues, including Carlos Alberto de Braganca Pereira, Fernando Bonassi, Luis Esteves, Marcelo de Souza Lauretto, Rafael Bassi Stern, Sergio Wechsler and Wagner Borges.