
Entropy 2017, 19(3), 107; https://doi.org/10.3390/e19030107

Article
Physical Intelligence and Thermodynamic Computing
Applied Physics Laboratory, Johns Hopkins University, Laurel, MD 20723, USA
Academic Editor: Dawn E. Holmes
Received: 22 January 2017 / Accepted: 6 March 2017 / Published: 9 March 2017

Abstract:
This paper proposes that intelligent processes can be completely explained by thermodynamic principles. They can equally be described by information-theoretic principles that, from the standpoint of the required optimizations, are functionally equivalent. The underlying theory arises from two axioms regarding distinguishability and causality. Their consequence is a theory of computation that applies to the only two kinds of physical processes possible: those that reconstruct the past and those that control the future. Dissipative physical processes fall into the first class, whereas intelligent ones comprise the second. The first kind of process is exothermic and the second is endothermic. Similarly, the first dumps entropy and energy to its environment, whereas the second reduces entropy while requiring energy to operate. It is shown that high intelligence efficiency and high energy efficiency are synonymous. The theory suggests the usefulness of developing a new computing paradigm, called Thermodynamic Computing, to engineer intelligent processes. The described engineering formalism for the design of thermodynamic computers is a hybrid combination of information theory and thermodynamics. Elements of the formalism are introduced through the reverse-engineering of a cortical neuron. The cortical neuron provides perhaps the simplest and most insightful example possible of a thermodynamic computer. It can be seen as a basic building block for constructing more intelligent thermodynamic circuits.
Keywords:
Carnot cycle; causality; distinguishability; entropy; intelligent processes; questions

1. Introduction

Work on a fundamental theory of computation that may underlie intelligent behavior has been pursued over many years. This paper introduces that theory and highlights aspects of an emerging engineering formalism.
The term computation enjoys a wide range of meanings. What can perhaps be agreed upon is that computation is the conversion of an input into an output. This paper provides a precise philosophical and mathematical formulation of what comprises computation as carried out by physical systems and processes. This in turn leads to an explanation of intelligent processes that is entirely physics-based.
This theory of computation is based on two fundamental axioms. These pertain to distinguishability and causality. These comprise the two most basic of all physical phenomena. The theoretical implication of distinguishability is a generalized form of Boolean algebra that includes conventional logic and by duality, an algebra of logical questions. Logical questions have significant engineering utility in that they provide a powerful way of symbolically capturing and manipulating intelligent process concepts. The consequence of causality is that there are only two kinds of computational dynamics possible—one that includes intelligent processes and another that includes dissipative physical processes for which communication systems serve as an example.
Learning and adaptation within intelligent processes can be understood from two mathematically equivalent perspectives. First, a system can be seen as optimizing its computational efficiency. Alternatively, through a simple transformation of its objective function, it can be seen as maximizing its energy or Carnot efficiency. From a computational perspective, the system is solving a min-max optimization problem that balances the rate at which it acquires information against the rate at which it makes decisions. From a thermodynamic perspective, it is attempting to solve a min-max Helmholtz free energy problem that optimizes its energy efficiency.
Current research focuses on the development of a formal engineering framework for what is called a thermodynamic computer or TC. Theory provides insights into how to engineer TCs. A principal pathfinder application has been the reverse-engineering of the cortical neuron [1].
Section 2 develops the underlying computational theory. It is the most difficult discussion both conceptually and philosophically. It requires the reader to think quite differently about what comprises intelligence and autonomy. Section 3 focuses on mathematics and develops the statistical dynamics and physics of intelligent processes and TCs. Much of this is derived from developments in information theory. It shows how basic thermodynamic principles can underlie intelligent processes. Section 4 provides a TC-based explanation of how cortical neurons work and adapt. Section 5 concludes by outlining critical topics not covered in this paper.

2. Physical Intelligence and Computation

This paper proposes a physical basis for intelligent processes, provides examples, and suggests that a formal framework for engineering intelligent systems is possible. The theory requires consideration of a new form of computation that can provide a common basis for understanding intelligent and dissipative physical processes as a whole. It is based on quantifying causality and what it means, from the standpoint of computation, for a system to subjectively distinguish its environment. This section contains the following subsections:
  • Distinguishability
  • Causality
  • Type I Computation: Communication systems
  • Type II Computation: Intelligent systems

2.1. Distinguishability

At a minimum, computation requires two things: logical tokens of computation (i.e., things that are operated on and that change) and rules that dictate the dynamics of how tokens evolve. The proposed axioms meet these requirements: the first defines the tokens of computation and the second defines their possible dynamics.
Any intelligent process is attempting to solve a problem. Whatever the problem, it corresponds to what is called its computational issue. An intelligent process wants to solve its problem by formally resolving its issue through its processing.
A practical example is that of finding one’s car keys for the drive to work in the morning. The issue is: “Where are my keys?” They are lost! Resolving this issue requires searching every room in the house to hopefully resolve the problem by finding them. A smart search strategy would be to first search the most likely rooms where the keys could be (e.g., entryway, bedroom, kitchen). This will, on average, reduce search time and minimize the energy required to solve this problem.
As will be shown, one can quantify intelligence as an exact number specifiable in units of bits per decision, by way of analogy to the bandwidth of digital communication systems [2]. It is the rate, given in units of bits per second, at which a system can resolve its computational issue. The faster a system can solve its problem, the higher its intelligence.
Consider the following two axioms:
  • A1: The ability of a system to distinguish its environment is the most basic operation any system can perform.
  • A2: Computational dynamics must abide by causality.
A1 is discussed first. It captures how a system can distinguish its environment and what subjectively comprises information and control.
Consider a system that can observe X and make decisions Y. Uppercase italicized letters denote logical questions. By asking X, the system can acquire information from and about its environment. By making decisions Y, it can control it.
Consider a simple single-cell organism, a protozoan. Some protozoa have the ability to sense light and to “see”. They can also propel themselves in water using their cilia [3]. Consider such a species living in a pond and that it likes to eat algae and other material found near the pond surface. Water near the surface is preferentially illuminated by the sun and tends to provide a richer source of nutrients that the protozoa like to consume.
Imagine protozoa in the pond being perturbed by internal currents that randomly reorient them over time. There will be times when a protozoan is oriented such that its forward optical sensor points in the direction of the brighter pond surface. It may or may not sense light, as dictated by its detection threshold and its depth in the pond. The organism’s ability to sense light corresponds to the subjective question X. Possible answers to X must correspond to a binary internal state of the organism placed into one state or the other by its environment.
Upon a positive detection, the protozoan may or may not activate its cilia to propel itself forward and toward the surface of the pond where it can successfully feed. Call this decision Y. It again is an internal state set of the organism with it representing one of two possible binary decisions. However, in this case, it is the organism, not the environment, that decides this state.
The only energy-efficient survival strategy for the protozoan is for it to move toward light when it sees it. However, there are three other possible strategies that can be delineated: (1) never activate its cilia when light is detected; (2) activate its cilia when it does not sense light; or (3) never activate its cilia under any circumstances. None of these support survival. Therefore, in some regards, one can say that this simple creature behaves intelligently in how it converts information into action. It could not flourish otherwise.
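The protozoan’s strategy choice can be made quantitative with a small sketch (illustrative only, not from the paper; the light probability is an assumed value): the mutual information I(X; Y) between sensing and acting is maximal for a responsive policy and zero for any constant one, so only responsive policies carry actionable information.

```python
from math import log2

def entropy(ps):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * log2(p) for p in ps if p > 0)

def mutual_information(p_x, policy):
    """I(X;Y) for a deterministic policy y = policy(x) over binary X.

    For a deterministic policy H(Y|X) = 0, so I(X;Y) = H(Y)."""
    p_y = {}
    for x, px in enumerate(p_x):
        y = policy(x)
        p_y[y] = p_y.get(y, 0.0) + px
    return entropy(p_y.values())

p_light = [0.7, 0.3]          # assumed p(no light), p(light)
responsive = lambda x: x      # activate cilia iff light is sensed
inverted   = lambda x: 1 - x  # activate cilia iff no light is sensed
constant   = lambda x: 0      # never activate the cilia

print(mutual_information(p_light, responsive))  # ≈ 0.881 bits = H(X)
print(mutual_information(p_light, inverted))    # ≈ 0.881 bits
print(mutual_information(p_light, constant))    # 0.0 bits
```

Note that mutual information alone does not distinguish the responsive policy from the inverted one; survival additionally requires that the mapping have the right sign, which is the point of the strategy enumeration above.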
Now associate the physical protozoan system with a computational issue, denoted A, to survive. To sense light, A asks the question X ≡ “Do I sense light?” The induced response within A is that this question is either answered “yes” or not answered at all (i.e., no response from its threshold detector). To move, A can be thought of as answering a question Y ≡ “Should I activate my cilia?” by either doing so or not through the activation of an internal state.
In general, a subjective inquiry X allows a system A to actively acquire information from its environment. Answers are determined by the specific question posed and by the environment. However, the ability to pose X is defined totally within the system A itself. Most fundamentally, it is how A distinguishes its environment from an information perspective. Similarly, answering Y allows A to make subjective decisions as to how it wants to behave in its environment. It is the way A subjectively distinguishes the possible ways it can control its environment through its subjective decisions. Note that philosophically a question represents a way that a system establishes boundaries of its subjective reality. The world is as we apportion our subjective experience into discrete experiential quanta. The philosopher Clarence Irving Lewis [4] first used the term “qualia” to capture this idea.
The notion of distinguishability underlies the subjective acquisition of information by an intelligent system and its use to make decisions. If one could formalize the notion of questions in this manner, it would allow for a powerful computing paradigm by providing a formal way of considering what it means to ask and answer questions within the subjective frame of a system.
Cox [5], as inspired by Cohen [6], showed that Boolean algebra as currently known is incomplete. In fact, exactly half is missing. Cox formulated a joint Boolean algebra of assertions and questions that captures the elementary dynamics of the physical exchange of information and control of a system with its environment.
Conventional Boolean algebra is an algebra of propositions. “This ball is red!” Conversely, Cox [5] proposes two mutually defining quantities he calls assertions and questions. “An assertion is defined by the questions it answers. A question is defined by the assertions that answer it.” These definitions result in two complete and complementary Boolean algebras, each with its respective logical measures of probability and entropy. Assertions, questions, probability, and entropy all have physical manifestations and are intrinsic to all dynamical system processes including intelligent ones.
That is, logical propositions in fact assert answers or states of nature like “The ball is red!” Every logical expression in conventional logic has a dual question, e.g., “Is the ball red?” Action potentials can be thought of as computational tokens in brains that answer questions posed by the dendrites of neurons or answered by their output axons. Likewise, photons are assertions that answer the question posed by an optical detector.
As a trivial example, consider a card-guessing game where one player desires to know the card held by another. Lower-case italicized letters denote logical assertions such as a and b. The first player resolves this issue by asking the second player questions like S ≡ “What is the suit of the card?” or C ≡ “What is the color of the card?” These questions are defined by the assertions that answer them, where S ≡ {h, c, d, s} and C ≡ {r, b} to make explicit the fact that questions are defined by the assertions that answer them. Of course h ≡ “Heart!”, c ≡ “Club!”, d ≡ “Diamond!”, and s ≡ “Spade!”, whereas r ≡ “Red!” and b ≡ “Black!”.
The common question is denoted by the symbol “∨”, with S ∨ C = C an example of its usage. This is the disjunction of two questions, and it asks what both questions ask. The information common to the suit S and the color C is the color of the card C. Intelligent systems are interested in X ∨ Y, the information in X common to the decisions Y to be made. In the case of the protozoan system, X ∨ Y is all the organism is interested in because this drives its survival. Disjunction is a powerful and simple construct for quantifying the concept of actionable information.
The joint question S ∧ C = S provides the information given by asking both questions S and C. This is the conjunction of two questions, and in this case it is the suit-of-the-card question S = S ∧ C.
Regarding disjunction [5], the question S ∨ C ≡ {h, c, d, s, r, b}. Moreover, any assertion that implies an assertion defining a question also defines that question. In this case, h → r, d → r, c → b, and s → b, and so S ∨ C ≡ {r, b} ≡ C. Regarding conjunction [5], S ∧ C ≡ {h∧r, h∧b, c∧r, c∧b, d∧r, d∧b, s∧r, s∧b}. Assertions like h∧b, in which suit and color are inconsistent, are absurd and can be eliminated, leaving S ∧ C ≡ {h∧r, c∧b, d∧r, s∧b}. But, for instance, h∧r → h, and so S ∧ C ≡ {h, c, d, s} ≡ S.
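These reductions can be checked mechanically. The following sketch (an illustration, not from the paper) models each assertion as a set of elementary outcomes (cards from an assumed toy deck) and a question as the set of assertions that answer it, per Cox’s definition; the `reduce` helper applies the implication rule to eliminate redundant and absurd assertions.

```python
# Assertions are sets of elementary outcomes; a question is the
# collection of assertions that answer it (Cox's definition).
HEARTS, DIAMONDS = {"h1", "h2"}, {"d1", "d2"}   # toy 8-card deck
CLUBS, SPADES    = {"c1", "c2"}, {"s1", "s2"}

h, d, c, s = map(frozenset, (HEARTS, DIAMONDS, CLUBS, SPADES))
r = h | d   # "Red!"   is asserted by any heart or diamond
b = c | s   # "Black!" is asserted by any club or spade

S = {h, c, d, s}   # "What is the suit of the card?"
C = {r, b}         # "What is the color of the card?"

def reduce(assertions):
    """Drop assertions that strictly imply (are proper subsets of) another."""
    return {a for a in assertions
            if not any(a < other for other in assertions)}

def disjunction(A, B):
    """A ∨ B: the common question, asking what both questions ask."""
    return reduce(A | B)

def conjunction(A, B):
    """A ∧ B: the joint question, pairing assertions and dropping absurd ones."""
    return reduce({a & b for a in A for b in B if a & b})

print(disjunction(S, C) == C)   # True: S ∨ C = C, the color question
print(conjunction(S, C) == S)   # True: S ∧ C = S, the suit question
```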
In the case of the card-guessing game, a special logical relationship exists between C and S called strict logical implication [7], whereby “S implies C”, or S → C. Either condition S ∨ C = C or S ∧ C = S is necessary and sufficient for strict implication to hold. The implication of one question by another means that if S is answered, then so is C. In the case of the protozoan system, it desires the condition Y → X so that it properly activates its cilia when it sees light.
Assertions can also strictly imply one another. An assertion a implying another assertion b is denoted by a → b. This means that if b answers a question, then so does a. The assertion a may contain information extraneous to answering the reference question. For example, s ≡ “He is my son!” implies b ≡ “He is a boy!” regarding the gender question G ≡ “Is your child a girl or boy?” Upon hearing “He is my son!”, the person posing question G will strip away the extraneous information about “son” and only hear “boy” given the inquiry G ≡ {b, g}.
In probability theory, the probability of c given a is denoted by p(c|a). Formally, probability is the measure or degree [8] to which one assertion a implies another c, or a → c. The notation p(c|a) = (a → c) denotes this. By way of symmetry, one can ask if there is a similar measure for the degree of implication of one question by another, and indeed there is. Cox [5] introduces the notation b(B|A) for the bearing of one question B on answering another A, represented by b(B|A) = (B → A). Formally, bearing b is just entropy [5]. The protozoan, for instance, wants to maximize (X ∨ Y → A) = b(X ∨ Y|A), where A is formally its computational issue to survive. In the card-guessing problem, b(S ∨ C|A) = b(C|A) and b(S ∧ C|A) = b(S|A). The first provides one bit of information, and the second provides two bits.
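The bit counts quoted above follow directly from bearing-as-entropy. A minimal sketch (illustrative, assuming a uniformly shuffled 52-card deck):

```python
from math import log2

def bearing(question_probs):
    """b(Q|A): the entropy, in bits, over the answers to question Q."""
    return -sum(p * log2(p) for p in question_probs if p > 0)

# Uniform 52-card deck: color splits 26/26, suit splits 13 per suit.
b_color = bearing([26/52, 26/52])   # b(S ∨ C | A) = b(C|A)
b_suit  = bearing([13/52] * 4)      # b(S ∧ C | A) = b(S|A)
print(b_color, b_suit)              # 1.0 bit and 2.0 bits
```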
The complement of a question, ~B, asks what B does not ask relative to resolving an issue A. Therefore, ~B ∨ B asks everything. Similarly for assertions, ~b asserts what b does not; this is the active information perspective assumed. In this case, ~b ∨ b tells everything and is always asserted.
It is easy to map back and forth between logical expressions involving bearing and equivalent information-theoretic expressions. This mapping is summarized in Table 1.
The remainder of this paper emphasizes the use of the notations I(X; Y), H(X), and H(Y) instead of b(XY|A), b(X|A), and b(Y|A), respectively. However, keep in mind that any particular intelligent process has a reference frame A to which the corresponding information-theoretic expressions refer. For example, H(X) is sometimes called “self-information”. Conversely, b(X|A) measures the amount of information to be expected by posing the question X to resolve an issue A. In the card-guessing game, C and S are posed relative to the broader issue A ≡ {c1, c2, …, c52} of which card ci in a deck of 52 cards is being held by the other player.
Venn diagrams [9] provide a convenient way of graphically capturing the strict logical implication of assertions and questions. If a → c, then a smaller circle labeled a will reside somewhere inside a larger circle labeled c, as depicted in Figure 1d. If B → C, then a larger circle labeled B contains a smaller circle labeled C, again as depicted in Figure 1d (bottom right). This depicts that the information request of B is greater than that of C and includes it. The latter diagrams are formally called I-diagrams [10], whereas those for assertions are conventional Venn diagrams.
Figure 1 illustrates dual Venn and I-diagrams for several cases of interest. The transformation between each column is accomplished using the reflection operator [11], under which, for instance, the assertion a maps to the question A it defines, the question A maps back to the assertion a, and a ∨ ~b maps to A ∧ ~B. The conversion rules [5] are that assertions and questions are interchanged, as are conjunction “∧” and disjunction “∨”. The complementation operator is preserved.
The I-diagram in Figure 1d is of particular interest. Suppose that an intelligent process asks X = C and decides Y = B. As will be shown, the condition Y → X means that the “control authority” of the system exceeds its informational uncertainty. This implies b(X ∨ Y|A) = b(X|A), meaning that the available information is adequate to provide for perfect control.
As an example, consider the shell game played at least as far back as ancient Greece [12]. In this game, one player using sleight of hand quickly moves three nut shells around one another. One of these shells contains a pea. The objective of the second player is to guess which shell contains the pea. In this game, suppose that the second player has the information X ≡ {x1, x2, x3}. If this player has good visual acuity, he may be able to follow and track the moving shells. In general, this is not the case. The second player then makes the decision Y ≡ {y1, y2, y3}. Because of inadequate information, the second player cannot make perfect decisions. If, however, the second player had the option of making additional choices following wrong ones, he would be guaranteed the ability to win the game given Y → X. This increase in control authority is required to overcome informational uncertainty.
Consider a physical example of a question and an assertion consisting of a photon detector with quantum efficiency η. If a photon hits the detector, it is detected with probability η. The detector corresponds to a question D ≡ “Is a photon present?”, and photons correspond to the assertions d that answer it. If one could instead physically build a “photon absent” detector, it would be denoted by ¬D and have the property that it responds only if no photon is detected.
Photon detectors, photons, dendrites, axons, and action potentials are all examples of binary questions and assertions. They can be graphically described in a diagram called a dyadic, as shown in Figure 2 for Q ≡ {a, ¬a}. The complementation operator transforms between assertion states, whereas the reflection operator transforms between corresponding assertions and questions.
Cohen [6] described the importance of having a formal means of logically manipulating questions by way of analogy with conventional logic. Clearly, he had some anticipation of Cox’s work [5] as reflected by his statement in Reference [6]: “In order to be able to ask the question ‘What is a question?’, one must already know the answer”.

2.2. Causality

So, what does it mean that computation must abide by causality? Causality has many interpretations. The one assumed here is both physical and computational. Answering the question “What is causality?” seems equally elusive to answering “What does it mean to distinguish?”.
Fortunately, an answer is found in an obscure, but profound, statement that Shannon made in what is perhaps his most famous and important paper [13] in 1959:
“You can know the past, but not control it. You can control the future, but have no knowledge of it.”
Shannon spoke of the importance of this observation and that he would write more about it. However, for unknown reasons, he never did. Information theorists have unsuccessfully sought to understand exactly what Shannon meant by those statements [14].
One is compelled to believe that Shannon saw important implications in what he said. Exactly what he thought these were will remain forever unknown. However, one is certainly granted the freedom to speculate.
Assume that the implication of Shannon’s causality is that two kinds of computation are possible. Processes can exist that reconstruct the past, or other processes can exist that control the future. As will be shown, communication systems [2] provide good examples of the first kind of system. Conversely, it will be shown that intelligent systems attempt to control the future. It is interesting that the latter have a much simpler explanation than the first, with both being complementary and dual to one another in ostensibly every way.
Before proceeding, note that both kinds of systems, whether concerned with the past or the future, perform their computations in the present. Not doing so would violate causality. Also, both kinds of processes possess physical origins to which system entropy or uncertainty is referenced; otherwise, as described by Gibbs [15], there is no thermodynamic need for an absolute origin. For instance, the Carnot cycle of the neuron will be seen to shift its position 1 bit to the left or right on an entropy plane as the number of its dendritic inputs changes by 1.
For a communication system, the origin physically and functionally corresponds to the receiver. The origin of an intelligent system corresponds physically to the location where acquired information is converted into decisions. Communications systems are discussed first.

2.3. Communication Systems

Consider a process that reconstructs the past. A communication system does this because its only purpose is to reconstruct information transmitted through a channel. It tries to reconstruct at the present a symbol transmitted to it in the past by a source with a time delay specific to the channel being used. The system must be synchronized such that the receiver will know when the source transmits a symbol and when it should expect to receive it. The symbol itself is not directly received; instead, a modulated signal is inserted into the channel by the source. This signal, while propagating within the channel, is subject to information loss relative to the receiver caused by possible attenuation, channel distortion effects, and noise.
Suppose that the communication system has an input alphabet given by x1, x2, …, xn. The receiver asks the question X ≡ “What symbol was transmitted by the source?” Formally, X ≡ {x1, x2, …, xn}. As the receiver awaits reception, it possesses uncertainty H(X) regarding what was sent. The entropy H(X) is dictated by the alphabet of the communication system and the relative probabilities the receiver assigns to each symbol it deems possible to have been sent. The key item of note in this perspective is that the uncertainty H(X) is defined relative to the receiver because the source knows perfectly well what it is transmitting, whereas the receiver does not.
The source, having selected a symbol to send, will modulate and expend energy to transmit it through the communication channel to a receiver. While propagating through the channel, the energy of the signal will be passively dissipated to the environment. At the same time, the signal will be affected by channel noise and distortion effects. These effects combine to reduce the ability of the receiver to perfectly reconstruct codes sent to it in the past.
Depending on the physical proximity of the source and the receiver as measured through the channel, the receiver will measure or otherwise acquire the signal observed at its end of the channel. The receiver necessarily must include an algorithm that will demodulate this signal and then compute a single number that guides its selection of a symbol from its alphabet of hypothetical symbols it thinks could be sent to it by the source in the past. This serves to reduce its entropy H(X).
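As a concrete sketch of such a “single number” (an illustrative example, not from the paper; it assumes binary antipodal signaling in Gaussian noise), the receiver can compute a log-likelihood ratio, store it, and select a symbol from it alone:

```python
import random

# Toy receiver: BPSK symbols (+1/-1) in additive Gaussian noise.
# The single stored number is the log-likelihood ratio (LLR), which
# by itself determines the receiver's choice of symbol.
def llr(y, sigma=0.8):
    """Log-likelihood ratio log p(y|x=+1)/p(y|x=-1) for unit-energy BPSK."""
    return 2.0 * y / sigma**2

def decide(y):
    """Select the symbol implied by the stored LLR."""
    return +1 if llr(y) >= 0 else -1

random.seed(0)
sigma = 0.8
sent = [random.choice([-1, +1]) for _ in range(10_000)]
received = [x + random.gauss(0.0, sigma) for x in sent]
errors = sum(decide(y) != x for x, y in zip(sent, received))
print(errors / len(sent))   # symbol error rate: nonzero because of noise
```

The residual error rate illustrates the information loss in the channel: the stored number reduces H(X) but, at this noise level, cannot drive it to zero.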
This number must causally exist prior to the receiver’s choice. Hence, this number must be stored in some type of memory and be causally available when this decision is made. Once this specification is made by the receiver, the utility of the acquired and stored information is consumed. Once exhausted, the entropy of the receiver is restored to its original value H(X). This restoration occurs at the receiver.
Once this occurs, two other actions take place. First, the receiver formulates a new inquiry as to its expectation of the next received code. This formulation can occur instantaneously once a current received code is selected and serves to restore H(X). Secondly, the memory of the receiver must be ready to record the next measurement to be used in selection of the successive symbol. All of these computational elements are captured by a Carnot cycle depicted on a temperature-entropy plane, as shown in Figure 3.
As shown in Figure 1, if the channel capacity C equals the source entropy H(X), the receiver can perfectly reconstruct the past. Otherwise, there will be transmission errors.
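The reliability condition can be illustrated with a binary symmetric channel, whose capacity is 1 minus the binary entropy of the crossover probability (a textbook example, not from the paper; the source entropy value is an assumed figure):

```python
from math import log2

def h2(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_capacity(p):
    """Capacity of a binary symmetric channel with crossover probability p."""
    return 1.0 - h2(p)

H_X = 0.5   # assumed source entropy, bits per symbol
for p in (0.0, 0.05, 0.11, 0.5):
    C = bsc_capacity(p)
    verdict = "reliable" if C >= H_X else "errors inevitable"
    print(p, round(C, 3), verdict)
```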
There are many implications of adopting a process-oriented perspective of information theory, although not all are covered in this paper. However, assuming the perspective that communication processes can be treated thermodynamically has engineering utility as to what should be explored further.

2.4. Intelligent Systems

Intelligent processes attempt to control their future. They operate in a complementary manner to the communication process previously described. However, their operation is in many ways more intuitive.
Communication processes are exothermic in that energy and information are lost during signal propagation through the channel. The area of the Carnot cycle in Figure 3 given by ΔTΔH equals the energy dissipated [16]. To overcome this, symbols have to be successively reconstructed through re-amplification and then retransmission. This requires energy. As an example, one need only think of undersea fiber-optic channels with repeaters that periodically re-amplify and retransmit codes. This process requires an energy source to ensure continued error-free transmission of signals across oceans.
Conversely, intelligent processes require energy to work and compute. They are endothermic. Hence, they must operate in a thermodynamically inverse manner to communication processes and therefore can be thought to operate as depicted in Figure 4.
Figure 4 also describes the computational phases comprising a single process cycle within an intelligent process. First, the system poses a question X to its environment. Information is acquired at a constant “high” temperature. This means that the information in the acquisition channel is subject to measurement noise just like that in a communication channel.
The information in X common to the decision Y is logically given by XY. This is formally the actionable information the system possesses relevant to making its decision Y. It takes the form of a single numeric quantity derived from measured data. This mirrors the subjective measure computed by the receiver of a communication system. Once computed, and also as in a communication system, it is stored in memory and the system transitions to a lower temperature. Physically, this means that the actionable information is a fixed numeric quantity stored within the system where it is no longer subject to noise.
The process of storing information is analogous to a physical system undergoing a phase transition because of a reduction in temperature. That, in turn, implies a reduction in the number of thermodynamic microstates that the system can reach. This, for instance, occurs when a liquid transitions into a solid.
Once actionable information is available to the intelligent system, it can then use the information to make decisions. Doing so exhausts this information, and system entropy increases to its original value. This, in turn, will drive the system to acquire new information in the next process cycle.
In its final phase, the intelligent system must physically prepare itself for the next cycle by ensuring it can record new actionable information. Consistent with Landauer’s Principle [17] and the observations made by Bennett [18], this requires energy. More specifically, energy is required by the system to reset its memory, thereby allowing it to record new information. Landauer and Bennett state that this energy is used for memory erasure; however, that is not the perspective taken here. Instead, this energy is required for the system to transition back to its initial higher-temperature state, where it can then record new information. Information erasure is a consequence.
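Landauer’s bound can be computed directly. The sketch below (illustrative; the operating temperature is an assumed value) gives the minimum energy to reset one bit:

```python
from math import log

K_B = 1.380649e-23   # Boltzmann constant, J/K (exact in SI since 2019)

def landauer_limit(T_kelvin, bits=1.0):
    """Minimum energy in joules to reset (erase) `bits` of information: k*T*ln(2) per bit."""
    return bits * K_B * T_kelvin * log(2)

E = landauer_limit(300.0)   # one bit near room temperature
print(E)                    # ≈ 2.87e-21 J
```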

3. Thermodynamic Computing

Given that intelligent processes are physical, one should be able to provide a more detailed account of their underlying physics. What emerges is an engineering formalism that is a hybrid combination of thermodynamics and information theory. Moreover, this formalism guides the development of optimal and more complex intelligent processes. As will be shown, the required optimization can be solved either as a min-max Helmholtz free energy problem or as a min-max information optimization problem, and these are equivalent. This has the engineering consequence that optimized computation and adaptation equate to optimized energy efficiency.
In either case, energy optimization or information optimization, the objective functions are convex in one of the optimization spaces and concave in the other, thus guaranteeing unique solutions at saddle points. For this reason, this section digresses into the area of convex optimization. This is believed critical to understanding how intelligent processes operate and optimize their adaptability.
Geometric Programming (GP) provides an efficient means of nonlinear optimization. It is thought to be critical to the practical development of operational algorithms and computer architectures for implementing optimal intelligent processes and communication systems. This is done through the context of double matching [19], as originally developed for optimizing communication systems. The relevance of GP to information theory is presented by Chiang [20], who describes its application to solving the basic source and channel coding problems of information theory. The duality and convexity properties of GP are well known. In essence, GP is a generalization of Linear Programming (LP) extended to almost arbitrarily complex convex nonlinear optimization problems, such as optimal source and channel coding and the joint optimization achieved through double matching. There is wide availability of efficient GP problem solvers, including interior-point algorithms [21,22,23,24,25,26,27,28]. Much of the following is a re-casting of Reference [20] in the context of optimizing intelligent processes.
Minimizing a convex function subject to inequality constraints and affine equality constraints means solving a problem of the form:
minimize f0(x)
subject to fi(x) ≤ 0, i = 1, 2, …, m
      Ax = c
      x ∈ R^n
where A ∈ R^(l×n) and c ∈ R^l are constant parameters of the problem. The objective function to be minimized is f0, and both it and the m constraint functions fi are assumed convex; the equality constraints are affine.
From convexity analysis [24], it is well known that for convex optimization problems any local minimum is a global minimum. Lagrange duality means that such problems can be approached through either the primal problem or the Lagrange dual problem. The solution to the dual problem provides a lower bound on the solution of the primal (minimization) problem [24]. The optimal values of the primal and dual problems are not necessarily the same; the difference is called the duality gap. Under Slater's Condition [20,24], as is the case for entropic measures like mutual information, the duality gap is zero. Slater's Condition essentially states that a strictly feasible solution exists given the posed inequality constraints.
GP is formulated as follows. First, there are two forms of GP optimization: standard and convex. In standard form, a constrained optimization is performed on a function called a posynomial. The convex form is obtained from the first through a logarithmic change of variable.
Posynomials are defined in terms of a sum of monomials. A monomial is a function f(x) such that f : R + + n R 1 , where R + + n denotes the strictly positive quadrant of n-dimensional Euclidean space. A monomial is defined by:
f(x) = d · x1^a(1) · x2^a(2) ⋯ xn^a(n)
where d ≥ 0 is a multiplicative constant and the exponents a(j) ∈ R, j = 1, 2, …, n, are likewise constants. A posynomial is then defined by a sum of monomials:
f(x) = Σ_{k=1}^K dk · x1^ak(1) · x2^ak(2) ⋯ xn^ak(n)
where dk ≥ 0, k = 1, 2, …, K and ak(j) ∈ R, j = 1, 2, …, n. The important aspects of posynomials are their positivity and their convexity in convex form.
The GP problem in standard form corresponds to minimizing a posynomial subject to posynomial upper-bound inequality constraints and monomial equality constraints. Formally:
minimize f0(x)
subject to fi(x) ≤ 1, i = 1, 2, …, m
      hl(x) = 1, l = 1, 2, …, M
where fi are posynomials given by:
fi(x) = Σ_{k=1}^{Ki} dik · x1^aik(1) · x2^aik(2) ⋯ xn^aik(n)
and hl(x), l = 1, 2, …, M are monomials hl(x) = dl · x1^al(1) · x2^al(2) ⋯ xn^al(n), and the minimization is over x. Every GP problem in standard form can be represented by two data structures [20], denoted by the matrix A and the vector d, together with the mapping from these constants to their respective constraints and objective functions. A is a matrix of exponential constants, and d is a vector of multiplicative constants.
Optimization of GP in standard form is not a convex optimization problem. However, it can be made so through a logarithmic change of its variables and multiplicative constants. In particular, using the transformations yi = log xi, bik = log dik, and bl = log dl, a GP problem in standard form is transformed into the following GP convex optimization problem:
minimize Σ_{k=1}^{K0} exp[a0k^T y + b0k]
subject to Σ_{k=1}^{Ki} exp[aik^T y + bik] ≤ 1, i = 1, 2, …, m
      al^T y + bl = 0, l = 1, 2, …, M
where a i k = [ a i k ( 1 ) , a i k ( 2 ) , , a i k ( n ) ] T and the minimization is over y.
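As a concrete illustration of this change of variables, the following sketch (a hypothetical example, not from the paper) solves a small standard-form GP, minimize 1/(x1·x2) subject to x1 + x2 ≤ 1, by minimizing its convex form with SciPy and then mapping back to the positive orthant; the optimum is x1 = x2 = 1/2.

```python
import numpy as np
from scipy.optimize import minimize

# Standard-form GP: minimize 1/(x1*x2) subject to x1 + x2 <= 1, x > 0.
# Convex form via y_i = log x_i:
#   minimize exp(-y1 - y2)  subject to  log(exp(y1) + exp(y2)) <= 0
res = minimize(
    lambda y: np.exp(-y[0] - y[1]),
    x0=[-2.0, -2.0],
    method="SLSQP",
    constraints=[{"type": "ineq",  # SLSQP convention: fun(y) >= 0
                  "fun": lambda y: -np.log(np.exp(y[0]) + np.exp(y[1]))}],
)
x_opt = np.exp(res.x)  # map back: x = exp(y), here x_opt ~ [0.5, 0.5]
```

The logarithmic substitution turns both the objective and the posynomial constraint into convex functions, so any general-purpose convex solver suffices.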
Before proceeding, an important convexity relationship needs to be summarized. It is critical to understanding the fundamental tie between information theory and statistical mechanics [20]. An important lemma in GP is that the log-sum-exp function as defined by Equation (7) is convex in x:
f(x) = log Σ_{i=1}^n e^{xi}
This function has the functional form of the logarithm of the partition function Z of a Boltzmann distribution, which, as will be shown, is of importance. Proof of the convexity of Equation (7) can be seen through a simple inequality.
The log-sum inequality [29] is given by:
Σ_{i=1}^n ai log(ai/bi) ≥ (Σ_{i=1}^n ai) log(Σ_{i=1}^n ai / Σ_{i=1}^n bi)
where each of the elements of a and b is greater than or equal to zero. For a function f mapping R^n to R, there exists [30] a conjugate function f* that is also a mapping from R^n to R, defined by:
f*(y) = sup_x (y^T x − f(x))
The function f* is guaranteed to be a convex function because it is the point-wise supremum of a family of affine functions of y [31].
If b ^ i = log b i and i = 1 n a i = 1 in the inequality in Equation (8), then it follows that:
log(Σ_{i=1}^n e^{b̂i}) ≥ a^T b̂ + Σ_{i=1}^n ai log(1/ai)
with equality holding if and only if ai has the form of the Boltzmann distribution, whereby:
ai = e^{b̂i} / Σ_j e^{b̂j}
Because intelligent processes are characterized by Boltzmann distributions, as will be shown, this takes on a critical importance in understanding intelligent processes. To summarize, Equation (10) means that the log-sum-exp function is the conjugate function of entropy. The log-sum-exp function is convex because all conjugate functions are required to be so.
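A quick numerical check of Equation (10) is straightforward (a sketch with arbitrary test values): the bound is tight exactly when a is the Boltzmann distribution over b̂, and strict for any other distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
b_hat = rng.normal(size=5)
lse = np.log(np.sum(np.exp(b_hat)))              # log-sum-exp of b_hat

# Boltzmann distribution over b_hat achieves equality in Eq. (10)
a_boltz = np.exp(b_hat) / np.sum(np.exp(b_hat))
rhs_boltz = a_boltz @ b_hat + np.sum(a_boltz * np.log(1.0 / a_boltz))

# any other distribution (here uniform) falls strictly below the bound
a_unif = np.full(5, 0.2)
rhs_unif = a_unif @ b_hat + np.sum(a_unif * np.log(1.0 / a_unif))
```

This is the numerical face of the statement that the log-sum-exp function is the conjugate of entropy.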
The practical importance of GP in convex form to optimizing intelligent processes arises from posing and solving the Lagrange dual problem, which is linearly constrained, and its objective function takes the form of a generalized entropy function. Using conventional Lagrange dual transformations as shown by Boyd [24], the GP convex optimization problem with m posynomial constraints becomes:
maximize b0^T ν0 − Σ_{j=1}^{K0} ν0j log ν0j + Σ_{i=1}^m [ bi^T νi − Σ_{j=1}^{Ki} νij log(νij / 1^T νi) ]
subject to νi ≥ 0, i = 1, 2, …, m
      1^T ν0 = 1
      Σ_{i=0}^m Ai^T νi = 0
with the optimization performed over the vectors νi. Here, 1 is a column vector of all 1's. The length of νi is Ki, the number of monomial terms in the ith posynomial, i = 1, 2, …, m. The matrix A0 contains the exponential constants in the objective function, with each row corresponding to a constituent monomial. Ai, i = 1, 2, …, m, are the matrices of exponential constants in the m posynomial constraints. The multiplicative constants in the objective posynomial correspond to the vector b0. Likewise, the vectors bi correspond to the multiplicative constants of the monomials defining the constraint posynomials.
There are several items of note in this formulation of the optimization problem. The dual variable ν 0 is a probability distribution. The other dual vectors ν i are not necessarily so. The objective function consists of the sum of linear terms and entropy-like functions. The exponential constants of the monomials define a single linear equality constraint. Solutions for the GP problem in convex form are globally optimal, as are their solutions found through its Lagrange dual.
As previously discussed, information theory deals with two basic problems: source and channel coding. These, respectively, define the information-theoretic limits of data compression and of the transmission rate through a channel. In the first problem, an average distortion or loss criterion is specified that the communication system must not exceed, and the rate cannot be lowered below the corresponding limit. In the second, an average transmission cost is typically imposed below which the system must operate while maximizing its transmission rate. Collectively, these two problems are known as Shannon's dual problems [13].
Within intelligent processes, the rate distortion problem becomes that of minimizing the rate at which information is acquired and distilled (compressed) into actionable information. The second problem becomes that of maximizing the rate at which actionable information can be used to make decisions and perform useful control. The duality of the problems of communication processes and intelligent processes can be simply understood from Bayes' Theorem. For a communication process with source x and receiver y, the joint density is factored as p(x, y) = p(x)p(y|x). This form is needed to address maximizing the source entropy H(X) and minimizing I(X; Y). Conversely, an intelligent process with acquired information X and control decision Y emphasizes the factorization p(x, y) = p(x|y)p(y). This form addresses maximizing the decision rate H(Y) while minimizing the information acquisition rate I(X; Y) required to achieve the maximized output. All of these optimization problems have a GP formulation. The practical implication is that communication and intelligent processes can be efficiently optimized.
Consider the problem of maximizing the control authority H(Y) of an intelligent system subject to control costs for a discrete memoryless intelligent system like a cortical neuron. It is memoryless in the sense that the system response is only a function of the current input and that past inputs do not impact this decision except as through much longer time-scale self-adaptation. The term control authority describes the rate at which an intelligent process can make decisions and effect control. In the case of an animal, for example, it would represent the creature’s physical agility or the accessible phase states of a physical system [32].
An intelligent process acquires information by asking X = {x1, x2, …, xm}. The system then uses some portion of this information, the actionable information, to make decisions by answering Y = {y1, y2, …, yq}. Maximizing control authority formally corresponds to finding:
C(S) = max_{p(yi)} I(X; Y)
where the optimization is performed over the output (decision) distribution {p(yi)}.
Now define r R q × 1 with:
rj = Σ_{i=1}^m p(xi|yj) log[1/p(xi|yj)] = H(X|yj)
then:
I(X; Y) = −Σ_{i=1}^m p(xi) log p(xi) − py^T r
where py is a column vector of the probabilities p(y1), p(y2), …, p(yq). Likewise, define the column vector px of probabilities p(xi). When expressed in this form, it is clear that intelligent process capacity optimization is the Lagrange dual of a convex GP problem. Including an average decision cost constraint of the form Σ_{j=1}^q sj p(yj) ≤ S, the formal Lagrange dual formulation of the capacity maximization problem is given by:
maximize −py^T r − Σ_{i=1}^m p(xi) log p(xi)
subject to P^T py = px
      py^T s ≤ S
      1^T py = 1
      p(yi) ≥ 0
where the optimization is carried out over py and px for fixed P. The matrix P is the q × m matrix of conditional probabilities Pji = p(xi|yj). This problem can equivalently be expressed in convex form as:
minimize log Σ_{j=1}^q exp(αj + γS)
subject to Pα + γs ≥ r
      γ ≥ 0
where the optimization is carried out over α and γ, and where P, s, and S are constants.
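The same capacity optimization can also be carried out with the classical Blahut–Arimoto iteration, which alternates between the two Bayes factorizations discussed above. The sketch below is an assumed implementation (using the convention P[j, i] = p(xi|yj) and omitting the cost constraint); it recovers the known capacity 1 − H(ε) of a binary symmetric channel.

```python
import numpy as np

def blahut_arimoto(P, tol=1e-12, max_iter=10000):
    """Maximize I(X; Y) over the decision distribution p(y).
    P[j, i] = p(x_i | y_j), assumed strictly positive here."""
    q = P.shape[0]
    p_y = np.full(q, 1.0 / q)
    for _ in range(max_iter):
        q_x = p_y @ P                                  # marginal p(x_i)
        # c_j = exp( D( p(x|y_j) || p(x) ) ), the standard BA multiplier
        c = np.exp(np.sum(P * np.log(P / q_x), axis=1))
        new_p = p_y * c / (p_y @ c)
        if np.max(np.abs(new_p - p_y)) < tol:
            p_y = new_p
            break
        p_y = new_p
    q_x = p_y @ P
    I = np.sum(p_y[:, None] * P * np.log2(P / q_x))    # bits per decision
    return I, p_y

eps = 0.1
P = np.array([[1 - eps, eps], [eps, 1 - eps]])         # binary symmetric channel
C, p_opt = blahut_arimoto(P)
# C should be close to 1 - H2(0.1) ~ 0.531 bits per use
```

The iteration is a coordinate-ascent analog of the alternating min-max optimization discussed later for Helmholtz free energy.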
The dual problem to optimizing the control capacity of an intelligent system is that of minimizing its control rate subject to a maximum control error. This problem is the information-theoretic analog of minimizing the transmission rate through a communications channel subject to a reconstruction distortion constraint D. Formally, to minimize the control rate, the objective is to:
R(D) = min_{p(xi|yj)} I(X; Y)
subject to the constraint:
Σ_{j=1}^q Σ_{i=1}^m p(yj) p(xi|yj) dij ≤ D
where dij is the cost of making decision yj given that the neuron observes xi. Regarding the cortical neural model (described further later), its input X consists of 2^n possible codes, where n ≈ 10,000. During learning, the neuron partitions this space of codes into two sets, X0 and X1. As the temperature T goes to zero, the neuron deterministically decides not to fire if it observes a code in X0 and, conversely, to fire if an observed code belongs to X1. Note that X0 ∩ X1 = ∅, the empty set, and X0 ∪ X1 = X. However, with the neural computational temperature taking the form of an equivalent noise [33], the neuron will make errors and thus wrongly decide to fire when it should not and not to fire when it should. Define (1) the control distortion measure dij = 0 if i = j and dij = 1 otherwise, for observances from Xi, i ∈ {0, 1}; and (2) the binary firing decisions yj, j ∈ {0, 1}. The D in Equation (19) is then the control error rate. For the cortical neuron, the possible transmission rates range between 0 and 1 bits per decision. Moreover, this rate is the intelligence rate of the process and is regulated by the temperature of the system during the time it makes a decision. Lastly, D corresponds to the Carnot or operational energy efficiency of the device.
Returning to the control rate minimization of Equations (18) and (19), this problem [20] is stated as:
minimize Σ_{i=1}^m Σ_{j=1}^q p(yj) p(xi|yj) log[ p(xi|yj) / Σ_{k=1}^q p(yk) p(xi|yk) ]
subject to Σ_{i=1}^m Σ_{j=1}^q p(yj) p(xi|yj) dij ≤ D
      Σ_{i=1}^m p(xi|yj) = 1 for j = 1, 2, …, q
      p(xi|yj) ≥ 0
with the optimization carried out with respect to the conditional probabilities p(xi|yj) that define the matrix P. The problem in Equation (20) can be expressed as:
minimize −py^T r − Σ_{i=1}^m p(xi) log p(xi)
subject to P^T py = px
      Σ_{i=1}^m Σ_{j=1}^q p(yj) p(xi|yj) dij ≤ D
      Σ_{i=1}^m p(xi|yj) = 1 for j = 1, 2, …, q
      p(xi|yj) ≥ 0
with the optimization performed with respect to P with fixed py. Keeping in mind the definition for r, this can be expressed as:
minimize −py^T H(X|y) − Σ_{i=1}^m p(xi) log p(xi)
subject to P^T py = px
      Σ_{i=1}^m Σ_{j=1}^q p(yj) p(xi|yj) dij ≤ D
      Σ_{i=1}^m p(xi|yj) = 1 for j = 1, 2, …, q
      p(xi|yj) ≥ 0
where H(X|y) is a column vector of length q of the conditional entropies H(X|yi), i = 1, 2, …, q. The convex GP formulation [20] for minimizing R(D) is given by:
maximize py^T α − γD
subject to log Σ_{j=1}^q exp(log pj + αj − γ dij) ≤ 0, i = 1, 2, …, m
      γ ≥ 0
with the optimization carried out over α and γ. The fact that this problem is an energy minimization problem was first recognized by Berger [34].
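The control-rate minimization likewise has a classical iterative solution, the Blahut algorithm for rate distortion. The following sketch is an assumed implementation (the slope parameter beta_s trades rate against distortion); it reproduces the binary-source result R(D) = 1 − H(D) under the error-rate distortion of the cortical-neuron example.

```python
import numpy as np

def blahut_rd(p_y, d, beta_s, tol=1e-12, max_iter=10000):
    """One point on R(D): minimize I over p(x|y) at slope beta_s.
    p_y: source distribution; d[j, i]: distortion of pair (y_j, x_i)."""
    m = d.shape[1]
    p_x = np.full(m, 1.0 / m)
    for _ in range(max_iter):
        W = p_x * np.exp(-beta_s * d)           # unnormalized p(x_i | y_j)
        W /= W.sum(axis=1, keepdims=True)
        new_px = p_y @ W                        # reproduction marginal
        if np.max(np.abs(new_px - p_x)) < tol:
            p_x = new_px
            break
        p_x = new_px
    W = p_x * np.exp(-beta_s * d)
    W /= W.sum(axis=1, keepdims=True)
    D = np.sum(p_y[:, None] * W * d)            # achieved distortion
    R = np.sum(p_y[:, None] * W * np.log2(W / p_x))   # rate in bits
    return R, D

p_y = np.array([0.5, 0.5])                      # uniform binary source
d = np.array([[0.0, 1.0], [1.0, 0.0]])          # error-rate distortion
R, D = blahut_rd(p_y, d, beta_s=2.0)
# for this source, R should equal 1 - H2(D)
```

Sweeping beta_s traces out the full R(D) curve, the analog of sweeping the neuron's operating temperature.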
Control authority maximization and control rate minimization both contain the log-sum-exp function, either as the function to be optimized (Equation (17)) or as a constraint (Equation (23)) for GP in convex form. The log-sum-exp function is the log of the partition function for Boltzmann distributions in statistical physics. The implication is that the rate distortion and channel coding problems in information theory and the control authority and control rate problems associated with intelligent processes can all be interpreted from this perspective.
Consider a physical system containing n states. Each state i has a corresponding energy ei and probability pi. Define the n-vector e of energies ei and the n-vector p of probabilities pi. The average energy of the system is given by U(p, e) = p^T e. Its entropy is determined by H(p) = −Σ pi log pi. The Helmholtz free energy [20] of the system is given by:
F(p, e) = U(p, e) − T H(p)
Solving for the minimum Helmholtz free energy of the system over p requires solving:
minimize p^T e + T Σ_{i=1}^n pi log pi
subject to 1^T p = 1, p ≥ 0
whereas its corresponding GP problem in convex form requires maximizing the partition function of the system. Note that if one identifies energy ei with −ri in Equation (21), one has exact correspondence of maximizing Helmholtz free energy and the control authority C of an intelligent process. Similarly, Helmholtz free energies constrain the control rate minimization problem of Equation (23).
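A numerical sanity check on this minimization (a sketch with arbitrary toy energies): the Boltzmann distribution attains the minimum free energy −T log Z, and no other distribution does better.

```python
import numpy as np

rng = np.random.default_rng(1)
e = rng.uniform(0.0, 1.0, size=6)           # state energies
T = 0.5

def F(p):                                    # Helmholtz free energy U - T*H
    return p @ e + T * np.sum(p * np.log(p))

Z = np.sum(np.exp(-e / T))                   # partition function
p_boltz = np.exp(-e / T) / Z                 # Boltzmann distribution
F_min = -T * np.log(Z)                       # minimum over all distributions

# every random distribution has F(p) >= -T log Z, so this gap stays <= 0
worst_gap = max(F_min - F(rng.dirichlet(np.ones(6))) for _ in range(1000))
```

The identity min_p F(p, e) = −T log Z is exactly the log-sum-exp/entropy conjugacy of Equation (10) in thermodynamic dress.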
Because F(p, e) is convex in p and concave in e, a saddle point exists, and the optimizations over p and e, which correspond respectively to {p(yj)} and {p(xi|yj)}, can be interchanged. This means that:
max_p min_e F(p, e) = min_e max_p F(p, e)
This joint optimization problem thermodynamically corresponds to maximizing Helmholtz free energy for the worst energy assignment. From an information-theoretic perspective, it is minimizing channel throughput and matching it to the source entropy rate. This corresponds to double-matching [19]. For an intelligent process, Equation (26) likewise indicates the minimization of the acquisition rate together with the maximization of control authority, with the matching of the two objectives achieved dynamically through learning and adaptation. Thus, learning serves to make the intelligent process simultaneously pursue optimal information processing and optimal energy efficiency. This is most clearly demonstrated in the operation of ostensibly all biological systems and especially so in brains.
As will be discussed, the proposed optimal adaptation process does not pursue the optimization in Equation (26) simultaneously. Rather, one can alternate repeatedly between minimizing F over e and maximizing it over p. Convergence to the global optimum is guaranteed by the convexity and concavity of F with respect to p and e, respectively.

4. Reverse-Engineering the Cortical Neuron

This section presents seven topics:
  • Overview
  • Logical structure of a cortical neuron
  • Boltzmann distribution and neural architecture
  • Double-matching and spatiotemporal adaptation
  • Neural operation
  • A training example
  • Thermodynamic computing perspective

4.1. Overview

Cortical neurons as found within the human brain have on the order of 10,000 dendritic inputs and a single output. Inputs and outputs of each neuron consist of either no signal or an action potential generated at some time. Action potentials are the embodiment of the notion of an elementary assertion. They either exist or do not at any point in time.
Each input dendrite to a neuron allows it to ask a question Xi ≡ {xi, ¬xi}: "Is there an observed pre-synaptic potential or not?" Like the photon detector, the dendrite need only physically ask one of the elementary questions, Xi = xi or ¬Xi = ¬xi, and, like the photon detector, it adopts the first detection strategy, asserting the positive when a presynaptic potential arrives.
The solitary output of each cortical neuron corresponds to an answer to a question Y ≡ "Should I fire or not?" This question is formally defined by Y ≡ {y, ¬y}. Physically, it need only answer either assertion to convey control to its environment, and of course, biological neurons choose to answer the assertion by generating an action potential Y = y, or not.
After establishing the basic relationship between biological signals and the logical formalism of questions and assertions, one can proceed to understand the logical architecture of the cortical neuron and then how it processes information and operates as a highly efficient TC.

4.2. Logical Structure of a Cortical Neuron

Logical questions provide a simple and elegant way of capturing the computational nature of the cortical neuron from the "first person" perspective in terms of what it can "see" and what it can "do".
Consider first the information that the neuron can acquire from its neural environment, and in doing so, first look at a single dendrite. Each dendrite asks a binary question Xi ≡ {xi, ¬xi}, with xi corresponding to the assertion of a presynaptic action potential or to the lack of its detection. A typical human cortical neuron has on the order of n = 10,000 dendrites, a number likely limited by physiological constraints. Collectively, the dendritic field allows the neuron to ask X = X1 ∧ X2 ∧ … ∧ Xn, a question defined by N = 2^n assertions corresponding to all of the codes the neuron can possibly observe. N is a huge number corresponding to all possible input neural microstates; hence the thermodynamic nature of a single cortical neuron.
The neuron is capable of answering a single binary question, Y ≡ {y, ¬y}. This decision is made at the axon hillock [35] of the cortical neuron. Two possible output decision states, combined with 2^n informational states, mean that there are a total of 2^(n+1) assertions that define the possible microstates of the neuron. These assertions define the joint question X ∧ Y. It is important to realize that these states are only defined internally and subjectively by the neuron.
The neuron uses the actionable information defined by XY to make its decisions Y. It follows that:
X ∧ Y = (X1 ∧ X2 ∧ … ∧ Xn) ∧ Y = (X1 ∧ Y) ∧ (X2 ∧ Y) ∧ … ∧ (Xn ∧ Y)
Upon finding the reflections of the expressions in Equation (27), one obtains:
X ∧ Y = (x1 ∧ y) ∨ (x2 ∧ y) ∨ … ∨ (xn ∧ y)
The significance of this expression is as follows: Conjunction “∧” has the physical significance of “coincidence detector” as the meaning of XY. Conversely, disjunction has the physical significance of superposition. Thus, the expression in Equation (28) means that the neuron is observing the coincidences between each dendritic assertion and its output assertion y. Furthermore, it finds the superposition of all observed coincidence events.
Lastly, suppose that the neuron can observe its own output decision y by asking Y. The composite question posed by the cortical neuron then becomes (XY) ∧ Y. One can note that:
(X ∧ Y) ∧ Y = [(x1 ∧ y) ∨ (x2 ∧ y) ∨ … ∨ (xn ∧ y)] ∨ y
which captures the overall logical architecture of the cortical neuron.

4.3. Boltzmann Distribution and Neural Architecture

The logical expression in Equation (29) suggests that the cortical neuron is able to observe the n coincidences xiy and its output y. Moreover, suppose that it can observe the moments E{xiy} = <xiy>, i = 1, 2, …, n, and E{y} = <y>. Then, as in thermodynamics, one can apply the maximum entropy principle [36] to obtain the single cortical neuron Boltzmann distribution given by:
p(x, y) = exp[(Σ_{i=1}^n λi xi y − μy)/T] / Z = exp[−e(x, y)/T] / Z
where Z is the partition function given by:
Z = Σ_{y∈Y} Σ_{x∈X} exp[(Σ_{i=1}^n λi xi y − μy)/T]
As will be shown, the n-vector of Lagrange multipliers λ corresponds to the dendritic efficacies or gains, whereas the Lagrange multiplier μ corresponds to the neural decision threshold.
Almost every marginal and conditional probability distribution of Equation (30) conveys insight into the probabilistic nature of the cortical neuron. Two distributions of particular interest are its a posteriori log-odds given by log p(y = 1|x)/p(y = 0|x) and the conditional probability p(y = 1|x).
The log odds [36] is given by:
log[p(y = 1|x)/p(y = 0|x)] = λ^T x − μ = log[p(x|y = 1)/p(x|y = 0)] + log[p(y = 1)/p(y = 0)] ≡ ν
The conditional probability p(y = 1|x) is given by:
p(y = 1|x) = 1 / (1 + exp[−ν/T])
The implications of Equations (32) and (33) are significant. The soma is ostensibly a spatiotemporal integrator with a time constant τ on the order of milliseconds. Equation (32) means that by integrating weighted inputs and comparing this integral with the decision threshold μ, the soma implements Bayes' Theorem and can guide its decisions in this way. Equation (33) implies that it can use this numeric quantity directly to make the probabilistic decision to fire (y = 1) or not (y = 0), as performed at the axon hillock.
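In code, the decision rule of Equations (32) and (33) is a weighted sum, a threshold, and a logistic draw. The sketch below is a minimal toy version; the function name and parameter values are illustrative, not from the paper.

```python
import numpy as np

def neuron_decide(x, lam, mu, T, rng):
    """Probabilistic firing decision of the model cortical neuron."""
    nu = lam @ x - mu                         # log a posteriori odds, Eq. (32)
    p_fire = 1.0 / (1.0 + np.exp(-nu / T))    # logistic decision rule, Eq. (33)
    return int(rng.random() < p_fire), p_fire

rng = np.random.default_rng(0)
lam = np.array([0.4, -0.2, 0.3])              # toy dendritic gains
x = np.array([1, 0, 1])                       # observed dendritic code
y, p_fire = neuron_decide(x, lam, mu=0.1, T=0.1, rng=rng)
# nu = 0.6 > 0, so at this low temperature the neuron almost surely fires
```

Raising T flattens the logistic, making decisions noisier; lowering T toward zero makes the rule a deterministic threshold.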
The conditional probabilities p(xi|y = 0) and p(xi = 1|y = 1) are also interesting. From Equation (30) it can be seen that:
p ( x i | y = 0 ) = 1 2
and furthermore that p(x|y = 0) = 2^(−n). This means that H(X|y = 0) = n bits, i.e., the neuron has maximal uncertainty about its input x when it does not fire. The probability p(xi = 1|y = 1) is given by:
p(xi = 1|y = 1) = 1 / (1 + exp(−λi))
In Section 4.4, the requirement that the L2-norm |λ|2 ≤ 1 is enforced. Because n ≈ 10,000, all typical connection strengths |λi| are very small and close to zero with regard to Equation (35). Therefore, all p(xi|y = 1) can be expected to deviate only modestly from 1/2. Consequently, H(Xi|y = 1) will be slightly less than 1, with the difference 1 − H(Xi|y = 1) being the small quantity of information contained in any single dendritic input regarding the output decision. The sum of these n differences across the dendritic field X is the total actionable information I(X; Y) used in making its decisions Y.
As will be described, the requirement of simultaneity of the presentation of stimuli to the soma, as required by Equation (28), is not arbitrary but rather an achieved condition. The optimal Hebbian [37] adaptation rules for the n-vector λ and decision threshold μ will be seen to extend to the adaptation of a dendritic delay-equalization n-vector τ, defined by learned delay times τi between the input of an assertion into the ith dendrite and its integrated effect λixi within the soma. Thus, given dendritic gains λ, decision threshold μ, and delay-equalization vector τ, there are a total of 2n + 1 adaptable parameters in this cortical model.

4.4. Double-Matching and Spatiotemporal Adaptation

The proposed optimization strategy of the neuron is to maximize its output decision rate while minimizing the rate at which it acquires information from its neural environment and matching the two rates. This is the double-matching problem of information theory [19], which for the cortical neuron becomes:
min_λ I(X; Y)
subject to |λ|2 = α
      E{d(x, y)} ≤ D
and:
max_μ I(X; Y)
subject to E{c(y)} ≤ C
Equation (36) includes a constraint on the size of the L2 norm of the dendritic n-vector. In general, α ≤ 1; therefore |λi| << 1 given n ≈ 10,000. This minimization problem can be solved, whereupon the distortion constraint can be introduced and achieved by regulating the temperature range [Tl, Tu] over which the neuron operates in its Carnot cycle. This in effect regulates the Carnot efficiency of the neuron to D. Temperature regulation will be seen to correspond to noise regulation within the soma as controlled by the probability of error within each synaptic channel. This is accomplished via each dendritic channel corresponding to a binary erasure channel [38], whereby presynaptic potentials are not detected post-synaptically. This phenomenon is known as Quantum Synaptic Failure or QSF [38].
Regarding the output optimization in Equation (37), no explicit decision cost is imposed. The presence of this constraint would serve to limit the maximum decision rate H(Y) in bits per second to which the cortical neuron can operate. Within a biological neuron, this constraint must be present. One can imagine that if constructed, artificial cortical neurons could operate optically, electrically, or via other means that would allow them to operate at higher speeds than biological neurons.
Because the optimization problems in Equations (36) and (37) are convex, one merely needs to find the partial derivatives, with respect to λi in the first case and to μ in the second, and set them to zero. The math is simplified through the use of what Fry refers to as the Gibbs Mutual Information Theorem [36], more properly referred to in this paper as the Boltzmann Mutual Information Theorem. This theorem was independently derived by Hinton and Sejnowski [39].
As shown by Fry [36], if:
p(x, y) = exp[θ^T f(x, y)] / Z
then:
∂I/∂θi = E{ [fi(x, y) − E{fi(x, y)}] log[p(y|x)/p(y)] }
Regarding the dendritic strengths, the optimal λ in [36] is given by the largest eigenvector of the covariance matrix:
R = E{x x^T | y = 1} − E{x | y = 1} E{x | y = 1}^T
where:
R λ = α λ
Solving for the optimal decision threshold μ gives:
μ = λ T E { x | y = 1 }
The optimal λ can be found using a modified Hebbian form of E. Oja's equation [40]; thus, conditioned on y = 1, the following equation is used to determine λ:
λ(t + Δt) = λ(t) + π ν(t) [x(t) − α^(−2) ν(t) λ(t)]
where, per Equation (32), ν(t) = ν[x(t)] and π is a time constant of adaptation. Thus, a biologically plausible mechanism exists for performing the optimization in Equation (36). Optimizing Equation (37) is simple in that all that is needed is a time-averaging of the induced somatic potential in Equation (42) conditioned on y = 1.
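The fixed point of this update can be checked numerically. The sketch below is a toy two-dimensional version with assumed learning-rate and sample-count choices, using zero-mean inputs and ignoring the y-gating for brevity; it drives λ toward the principal eigenvector of the input correlation structure, up to sign, while holding |λ| near α.

```python
import numpy as np

rng = np.random.default_rng(2)
C = np.array([[1.0, 0.8],
              [0.8, 1.0]])                  # input covariance (toy)
L = np.linalg.cholesky(C)
alpha, step = 1.0, 0.01                     # norm target and adaptation rate

lam = rng.normal(size=2)
lam /= np.linalg.norm(lam)
for _ in range(20000):
    x = L @ rng.normal(size=2)              # zero-mean correlated input
    nu = lam @ x                            # induced somatic potential
    lam += step * nu * (x - nu * lam / alpha**2)   # Oja-style update

top = np.linalg.eigh(C)[1][:, -1]           # principal eigenvector of C
align = abs(lam @ top) / np.linalg.norm(lam)
# align should approach 1 as lam converges (up to sign)
```

The normalization term −ν²λ/α² is what keeps |λ| bounded without an explicit projection step.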
Presynaptic action potentials impinging on a dendrite can be expected to have variable inter-arrival times relative to the time at which the neuron makes a decision. These potentials, although perhaps conveying actionable information to the decision being made, may arrive too early or too late to impact it. One can postulate that there exists a third Hebbian adaptation mechanism governing the time delay τi between the detection of an action potential, its propagation along the dendrite, and the superposition of its effect on the somatic potential. Indeed, this is the case.
As shown by Fry [41], the following intuitive adaptation rule serves to optimally equalize the dendritic delays contained in the n-vector τ:
dτi/dt = β λi y(t) dxi(t − τi)/dt
The first point to note about this rule is that it is proportional to the connection strength λi, meaning that stronger connections expedite delay equalization. Second, like the other two rules, it is Hebbian, with its execution dependent on y(t) = 1 at reference decision time t. Lastly, depending on the sign of λi, the derivative dxi/dt can be either negative or positive; the signs of the derivative and the synaptic efficacy cancel, which guarantees the convergence of the delay τi and yields maximal synchrony of the arrival of dendritic effects at the soma when y = 1.
Figure 5 graphically depicts the delay-equalization process. The implication is that neural pathways communicate spatiotemporal population codes.

4.5. Neural Operation

Neural decisions are probabilistic and made according to Equation (33), which is a function of temperature T. This function is the logistic function of statistical regression. Let β = 1/T be the inverse temperature. Then, as shown in [42], p(y = 1|ν + η) = erf[ν/(2^(1/2) σn)] closely approximates p(y = 1|ν) if η is zero-mean Gaussian noise with variance σn² = 4e^(2β)/2π. An overlay of these two functions as a function of ν is provided in Figure 6.
The implication is that by adding white Gaussian noise to the induced somatic potential, one can implement a probabilistic decision rule consistent with Equation (33). Probabilistic decisioning with errors is a requirement for adaptation to occur. As described in Section 4.4, QSF provides a simple means of generating Gaussian noise. The ideal potential induced in the soma is given by:
ν = Σ_{i=1}^n λi xi
Through QSF, some fraction of the input potentials will randomly be eliminated such that the noisy somatic potential becomes:
ζ = Σ_{i=1}^n λi xi − Σ_{j∈Q} λj xj = ν + η
where Q is the random set of failed synaptic junction events. Because connection strengths are randomly positive and negative and the dendrites are assumed mutually independent, the sum over failed junctions has an approximately Gaussian distribution. The central limit theorem applies because QSF occurs 50% to 75% of the time [38]. The QSF rate essentially determines the operating temperature of the neuron as well as its achievable Carnot efficiency, which appears to range between 25% and 50%. Such efficiencies, if achievable, are significant, and those offered by cortical neurons may contribute substantially to the overall efficiency of brains.
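The Gaussian character of QSF noise is easy to verify numerically. The sketch below uses assumed toy values (n = 2000 dendrites, a 60% failure rate): it drops each active input independently and examines the resulting somatic potential ζ, whose distribution comes out close to Gaussian, per the central limit theorem.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
lam = rng.normal(scale=1/np.sqrt(n), size=n)    # |lambda|_2 near 1
x = rng.integers(0, 2, size=n)                  # fixed dendritic input code
fail = 0.6                                      # QSF failure rate (assumed)

# repeat QSF many times on the same input code
zeta = np.array([np.sum(lam * x * (rng.random(n) > fail))
                 for _ in range(5000)])         # noisy somatic potentials

mean_err = zeta.mean() - (1 - fail) * np.sum(lam * x)
z = (zeta - zeta.mean()) / zeta.std()
skew = np.mean(z**3)                            # near 0 for a Gaussian
```

The mean of ζ tracks (1 − fail)·ν, and the fluctuation about it is the equivalent thermal noise η discussed above.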

4.6. A Training Example

Suppose one has 20 codes, each a 20-bit word. These codes are delineated in Figure 7. Some codes are replicated, whereas others are not.
Figure 8 captures code similarities by tabulating the Hamming distances between the selected codes. For instance, one can see that the first code has four instantiations, given that it has a Hamming distance of zero from the codes indexed 1 (itself), 5, 9, and 13.
Figure 9a illustrates which codes elicit action potentials (y = 1) with probability greater than one half. In particular, 9 of the 20 codes fall into this category. Ideally, should the training set allow, the set of input codes is split equally between those that induce an action potential and those that do not. This maximizes the output entropy H(Y) of the cortical neuron by letting it approach 1 bit per decision.
Figure 9b plots the dendritic gains λi vs. index i. These values are equally likely to be positive or negative. Furthermore, repeated learning trials yield the same values; however, the sign of the vector λ varies randomly between trials. In opposing cases, the neuron fired on complementary code sets.
Neural adaptation rules can be thought of geometrically. The quantity λᵀx − μ is the log a posteriori odds as per Equation (32), with μ = log[p(y = 0)/p(y = 1)] and λᵀx = log[p(x|y = 1)/p(x|y = 0)]. As depicted in Figure 10, the optimally adapted λ and μ define a hyper-plane: the vector λ sets the orientation of the hyper-plane, whereas μ sets its offset from the origin.
Optimal hyper-planes slice through the n-dimensional space so as to ensure the maximal dispersion of observed codes about each of their sides. The offset μ is set so that this dispersion has equal probabilities of inducing an action potential or not.
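Under these definitions, the firing decision is a logistic function of the log-odds λᵀx − μ, i.e., of the input code's signed position relative to the hyper-plane. A minimal sketch, with hypothetical gains and threshold:

```python
import numpy as np

rng = np.random.default_rng(2)

n = 20
lam = rng.normal(0.0, 1.0, n)   # hypothetical adapted dendritic gains
mu = 0.3                        # hypothetical decision threshold (hyper-plane offset)

def fire_probability(x: np.ndarray, beta: float = 1.0) -> float:
    """P(y = 1 | x): logistic function of the log a posteriori odds lam.x - mu."""
    log_odds = lam @ x - mu
    return float(1.0 / (1.0 + np.exp(-beta * log_odds)))

x = rng.integers(0, 2, n)
p = fire_probability(x)
assert 0.0 < p < 1.0

# the all-zero code lies on the origin side of the hyper-plane, so it fires
# with probability below one half whenever mu > 0
assert fire_probability(np.zeros(n, dtype=int)) < 0.5
```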

4.7. Thermodynamic Computing Perspective

If one has a closed-form expression for the partition function of a system, then one knows essentially everything about its thermodynamics. This simple cortical model is no different.
The partition function for the cortical neuron was given in Equation (31) and is repeated here using inverse-temperature β:
$$Z = \sum_{y \in Y} \sum_{x \in X} \exp\!\left[\beta\left(\sum_{i=1}^{n} \lambda_i x_i\, y - \mu y\right)\right]$$
Summing over y gives:
$$Z = 2^n + e^{-\beta\mu} \sum_{x \in X} \exp\!\left(\beta\,\lambda^{T} x\right)$$
Note that the optimal decision threshold is:
$$\mu = E\left\{\lambda^{T} x \mid y = 1\right\}$$
and that using the known form for the probabilities gives:
$$\mu = \sum_{i=1}^{n} \lambda_i\, p(x_i = 1 \mid y = 1) = \sum_{i=1}^{n} \frac{\lambda_i}{1 + e^{-\lambda_i}}$$
After Taylor-expanding each probability $p(x_i = 1 \mid y = 1) = 1/(1 + e^{-\lambda_i})$ about λi = 0, keeping the first two terms, and finally noting that:
$$\sum_{x \in X} \exp\!\left(\beta\,\lambda^{T} x\right) = \prod_{i=1}^{n} \left(1 + e^{\beta\lambda_i}\right)$$
yields the cortical partition function:
$$Z(n, \beta) = 2^n + 2^n\, e^{-\beta \|\lambda\|^2 / 4} \prod_{i=1}^{n} \cosh\!\left(\frac{\beta \lambda_i}{2}\right) = 2^n + Z_1$$
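As a consistency check, for small n the closed form can be compared against a brute-force sum over all 2^(n+1) states (x, y); the cosh expression is exact up to the two-term Taylor expansion of μ, so it should agree closely for small gains λ. The values below are illustrative assumptions:

```python
import itertools

import numpy as np

rng = np.random.default_rng(3)

n, beta = 6, 1.0
lam = rng.normal(0.0, 0.1, n)                    # small gains keep the Taylor step accurate
mu = float(np.sum(lam / (1.0 + np.exp(-lam))))   # optimal threshold E{lam.x | y = 1}

# brute force: sum exp[beta(lam.x y - mu y)] over all 2^(n+1) states (x, y)
Z_brute = sum(
    np.exp(beta * (lam @ np.array(xs) * y - mu * y))
    for xs in itertools.product([0, 1], repeat=n)
    for y in (0, 1)
)

# exact form after summing over y and factorizing the sum over x
Z_exact = 2**n + np.exp(-beta * mu) * np.prod(1.0 + np.exp(beta * lam))

# cortical partition function with the two-term Taylor expansion of mu
Z1 = 2**n * np.exp(-beta * np.sum(lam**2) / 4) * np.prod(np.cosh(beta * lam / 2))
Z_approx = 2**n + Z1

assert np.isclose(Z_brute, Z_exact)
assert abs(Z_approx - Z_exact) / Z_exact < 1e-3
```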
Z is plotted as a function of β and n in Figure 11. The function Z1 carries the dependency on β, while both terms depend on n. As the number of cortical inputs n increases, the neuron can operate at successively lower temperatures, meaning that its energy efficiency increases with the number of input connections n. Physiological constraints on this number are therefore important, and optimal computational efficiency asymptotes at observed biological dendritic field sizes of n ≈ 10,000. The deep red region shown in Figure 11 is where Z can take on the value 2^(n+1). This is the maximum value it can have and is consistent with the fact that the system seeks to maximize its Helmholtz free energy in the sense of Equations (23) and (26).
As suggested by Figure 11, the intelligent Carnot process executed by the cortical neuron cycles between two temperatures, T = 1 and T = 0.1, or as otherwise dictated by the lower temperature limit of operation.
All of this suggests that the cortical neuron operates according to the Carnot process depicted in Figure 12. In phase 1, the entropy of the neuron decreases owing to the measurement of a new dendritic field potential x. During phase 2, the actionable information, or potential, is deposited within the soma, less the effects of the Gaussian noise introduced through QSF. Phase 2 is also associated with a decrease in neural temperature, meaning that the induced noisy potential has been recorded and stored within the soma, where it is no longer subject to the effects of noise. During phase 3 of the intelligent Carnot process, the neuron decides whether or not to fire and, in either case, expends and dissipates the collected information. As actionable information is expended in decisioning, system entropy increases. Lastly, during the fourth and final phase, the neuron reestablishes the sodium and potassium ion concentration biases across its cell membrane. This renews the cortical neuron's ability to acquire new information and make new decisions as to whether to fire or not.
During the temperature decreases and increases of the process cycle shown in Figure 12, the neural system undergoes phase-state changes whereby Z is reduced from 2^(n+1) to 2^n and then restored to 2^(n+1). During the first transition, actionable information in the form of the sensed somatic potential is stored, allowing the neuron to make one decision or the other. During the latter transition, the neuron's ability to make this determination is restored. This is the endothermic phase of cortical operation, in which energy resources in the form of adenosine triphosphate power the sodium-potassium ion pumps that restore the ion concentrations across the cortical cell membrane.
Z operationally transitions between 2^n and 2^(n+1). Until a neural decision is made, the neuron has 2^(n+1) accessible states, including the two possible output decision states. After firing, when the neuron enters its refractory state, the output state is no longer accessible and Z transitions to 2^n. Physically, this corresponds to the neuron undergoing phase transitions whereby the decision state y is or is not accessible.

5. Discussion

This paper introduced a physics-based explanation of intelligent processes and the idea of thermodynamic computing. There are many additional considerations that go well beyond those addressed here. For example, because energy is an extensive property of physical systems (extensive in the narrow sense that the property scales with the size of the system, as opposed to the small-system thermodynamics described in [43]), the energy function of a network equals the sum of the energies of its constituent nodes. This has significance for neural systems.
Another critical area is the computational issue itself and how it can represent more complex problems and solution methodologies beyond those considered here. For example, an issue A can be logically manipulated just like any other question. To illustrate, if A = A1 ∧ A2 ∧ … ∧ Am, then m parallel intelligent processes can be launched. One can consider how to partition a complex task into several simpler ones.
This paper only discusses discrete memoryless intelligent processes. More complex processes can be constructed from this building block. For example, a complex task can be composed from simpler tasks, as is also illustrated by A = A1 ∧ A2 ∧ … ∧ Am. As described in [1], biological systems as intelligent processes are differentiated from other types by their hierarchical nature.
Lastly, one can consider varying intelligent process types. The cortical neuron is an example of a system that acquires information from its environment and then uses it to make decisions. One can simply have a question-asking process similar to that in a card-guessing game that is entirely information oriented. Many games are like this. Conversely, one can have a process that is purely self-constructing or control oriented, such as in constructing an organism from its DNA sequence.

6. Conclusions

This paper argues that basic physical and information-theoretic principles may underlie intelligent processes.
First, it was described how one can develop a basic theory of computation describing how information and control are exchanged by a system with its environment. This theory rests on computational interpretations of distinguishability and causality. Distinguishability leads to the consideration of a joint logic of questions and assertions as developed by Cox [5]. Causality leads to the conclusion that only two kinds of computational processes are possible: those that reconstruct the past and those that control the future. Intelligent systems fall into the latter class.
Next, the engineering mathematics of the min-max optimization required to realize double-matching by an intelligent process was summarized. Double-matching is thought to underlie the optimization strategy of intelligent systems: while such a system optimizes its computational capacity, it also optimizes its energy efficiency.
Finally, this paper demonstrated the process of reverse-engineering a cortical neuron, first using the notion of questions developed earlier and then applying the requirement of double-matching as its adaptation strategy. This led to physiologically accurate predictions of known neural functionalities, including signaling by action potentials, Hebbian adaptation, the dendritic field, axonal decisioning, the refractory period, and many other known properties of cortical neurons.
Should the proposed model of intelligent processing be correct, then there are many implications. This paper has not even begun to consider many of the next steps to be taken and questions to be asked. While these questions are both scientific and technical, they are also philosophical.
Regarding philosophical implications, consider the question “If a tree falls in the forest and nobody is there to hear it, does it make a sound?” The theory here says that a question that is not answered by any real assertion is a vain question. Therefore, this philosophical question has an answer, and that answer is no.

Acknowledgments

The author wishes to thank Todd Hylton of Brain Corporation, who created and managed the Physical Intelligence program at DARPA and who proposed the Thermodynamic Computing program there. Thanks also to Alex Nugent, who likewise has contributed to the development of TCs through his creative and insightful nature.

Conflicts of Interest

The author declares no conflict of interest.

Abbreviations

The following abbreviations are used in this paper:
GP	Geometric Programming
LP	Linear Programming
QSF	Quantal Synaptic Failure
TC	Thermodynamic Computer

References

  1. Fry, R.L. Computation by Biological Systems. Biomed. Sci. Today 2015. Available online: https://pdfs.semanticscholar.org/c645/9f5948618a3773278514e67c05e02a35e9cf.pdf (accessed on 6 March 2017). [Google Scholar]
  2. Shannon, C.E. The Mathematical Theory of Communication; University of Illinois Press: Urbana, IL, USA, 1949. [Google Scholar]
  3. Online Encyclopedia of Science. Available online: http://www.daviddarling.info/encyclopedia/P/protozoan.html (accessed on 30 December 2016).
  4. Lewis, C.I. Mind and The World Order: An Outline of a Theory of Knowledge; Charles Scribner’s Sons: New York, NY, USA, 1929; Reprinted in Paperback by Dover Publications, Inc.: New York, NY, USA, 1956.
  5. Cox, R.T. Of inference and inquiry, an essay in inductive logic. In Proceedings of the Maximum Entropy Formalism—First Maximized Entropy Workshop, Boston, MA, USA, 1979; pp. 119–168.
  6. Cohen, F.S. What is a question? Monist 1929, 39, 350–364. [Google Scholar] [CrossRef]
  7. Lewis, C. A Survey of Symbolic Logic; University of California Press: Berkeley, CA, USA, 1918. [Google Scholar]
  8. Cox, R.T. Algebra of Probable Inference; Johns Hopkins University Press: Baltimore, MD, USA, 1961. [Google Scholar]
  9. Venn, J. On the Diagrammatic and Mechanical Representation of Propositions and Reasonings. Philos. Mag. J. Sci. 1880, 10, 1–18. [Google Scholar] [CrossRef]
  10. Yeung, R.W. A new outlook on Shannon’s information measures. IEEE Trans. Inf. Theory 1991, 37, 466–474. [Google Scholar] [CrossRef]
  11. Spencer-Brown, G. Laws of Form; Crown Publishing Group: Danvers, MA, USA, 1972. [Google Scholar]
  12. The Shell Game. Available online: https://en.wikipedia.org/wiki/Shell_game (accessed on 6 March 2017).
  13. Shannon, C.E. Source Coding with a Fidelity Criterion. Proc. IRE 1959. [Google Scholar]
  14. Cover, T.M. Untitled Contribution. IEEE Inf. Theory Soc. Newslett. 1998, 18–19. Available online: http://www-isl.stanford.edu/~cover/papers/paper112.pdf (accessed on 6 March 2017). [Google Scholar]
  15. Gibbs, W.J. A Method of Geometrical Representation of the Thermodynamic Properties of Substances by Means of Surfaces. In The Scientific Papers of J. Willard Gibbs; Longmans, Green, and Co.: New York, NY, USA, 1906; Volume 1, pp. 1–32. [Google Scholar]
  16. Tribus, M. Thermostatics and Thermodynamics: An Introduction Energy, Information and States of Matter with Engineering Application; D. Van Nostrand, Inc.: Princeton, NJ, USA, 1961. [Google Scholar]
  17. Landauer, R. Irreversibility and heat generation in the computing process. IBM J. Res. Dev. 1961, 5, 183–191. [Google Scholar] [CrossRef]
  18. Bennett, C.H. The thermodynamics of computation—A review. Int. J. Theor. Phys. 1982, 21, 905–940. [Google Scholar] [CrossRef]
  19. Gastpar, M.; Rimoldi, B.; Vetterli, M. To Code, or Not to Code: Lossy Source-Channel Coding Communication Revisited. IEEE Trans. Inf. Theory 2003, 49, 1147–1158. [Google Scholar] [CrossRef]
  20. Chiang, M. Geometric Programming for Communication Systems. Found. Trends Commun. Inf. Theory 2005, 2, 1–154. [Google Scholar] [CrossRef]
  21. Ben-Tal, A.; Nemirovski, A. Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications; SIAM: Philadelphia, PA, USA, 2001. [Google Scholar]
  22. Bertsekas, D.P. Nonlinear Programming, 2nd ed.; Athena Scientific: Belmont, MA, USA, 1999. [Google Scholar]
  23. Bertsekas, D.P.; Nedic, E.; Ozdaglar, A. Convex Analysis and Optimization; Athena Scientific: Belmont, MA, USA, 2003. [Google Scholar]
  24. Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
  25. Nesterov, Y.; Nemirovsky, A. Interior Point Polynomial Algorithms in Convex Programming; SIAM: Philadelphia, PA, USA, 1994. [Google Scholar]
  26. Nocedal, J.; Wright, S.J. Numerical Optimization; Springer: New York, NY, USA, 1999. [Google Scholar]
  27. Rockafellar, R.T. Lagrange Multipliers and Optimality. SIAM Rev. 1993, 35, 183–283. [Google Scholar] [CrossRef]
  28. Rockafellar, R.T. Convex Analysis; Princeton University Press: Princeton, NJ, USA, 1970. [Google Scholar]
  29. Csiszar, I.; Korner, J. Information Theory: Coding Theorems for Discrete Memoryless Systems; Academic Press: Salt Lake, UT, USA, 1981. [Google Scholar]
  30. Phelps, R.R. Convex Functions, Monotone Operators and Differentiability, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 1991; p. 42. [Google Scholar]
  31. Avriel, M. (Ed.) Advances in Geometric Programming; Plenum Press: New York, NY, USA, 1980.
  32. Wissner-Gross, A.D.; Freer, C.E. Causal Entropic Forces. Phys. Rev. Lett. 2013, 110. Available online: http://math.mit.edu/~freer/papers/PhysRevLett_110-168702.pdf (accessed on 6 March 2017). [Google Scholar]
  33. Fry, R.L. Neural Statics and Dynamics. Neurocomputing 2005, 65, 455–462. [Google Scholar] [CrossRef]
  34. Berger, T. Rate Distortion Theory: A Mathematical Basis for Data Compression; Prentice Hall: Upper Saddle River, NJ, USA, 1971. [Google Scholar]
  35. The Nerve Cell. Available online: https://www.britannica.com/science/nervous-system/The-nerve-cell#ref606326 (accessed on 6 March 2017).
  36. Fry, R.L. Observer-Participant Models of Neural Processing. IEEE Trans. Neural Netw. 1995, 6, 918–928. [Google Scholar] [CrossRef] [PubMed]
  37. Hebb, D.O. The Organization of Behavior; Wiley and Sons: New York, NY, USA, 1949. [Google Scholar]
  38. Levy, W.B.; Baxter, R.A. Energy-Efficient Neuronal Computation via Quantal Synaptic Failures. J. Neurosci. 2002, 22, 4746–4755. [Google Scholar] [PubMed]
  39. Hinton, G.E.; Sejnowski, T.J. Optimal Perceptual Inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, June 1983; pp. 448–453. Available online: https://papers.cnl.salk.edu/PDFs/Optimal%20Perceptual%20Inference%201983-646.pdf (accessed on 6 March 2017).
  40. Oja, E. A simplified neuron model as a principal component analyzer. J. Math. Biol. 1985, 15, 267–273. [Google Scholar] [CrossRef]
  41. Fry, R.L. Computation by neural and cortical systems. In Proceedings of the CNS Workshop on Methods of Information Theory and Neural Processing, BMC Neuroscience, Portland, OR, USA, 23–24 July 2008.
  42. Fry, R.L. Double Matching: The Problem that Neurons Solve. Neurocomputing 2005, 69, 1086–1090. [Google Scholar] [CrossRef]
  43. Hill, T.L. Thermodynamics of Small Systems, Parts I & II; Dover: New York, NY, USA, 2013. [Google Scholar]
Figure 1. Comparison of dual Venn diagrams (left) and I-diagrams (right).
Figure 2. Dyadic diagram used for characterizing a binary question.
Figure 3. Information transmission process depicted as a Carnot cycle on the temperature-entropy plane.
Figure 4. Carnot process associated with an intelligence process as depicted on the temperature-entropy plane.
Figure 5. Depiction of delay equalization process relative to neural decisions y(t) made at reference time t.
Figure 6. Overlay of the logistic function and error functions for T = β = 1 showing they are indistinguishable.
Figure 7. Sample training codes used against the specified neural optimization objectives.
Figure 8. Hamming distances between selected training codes.
Figure 9. (a) Codes that elicit neural firing and (b) the depiction of dendritic field connection strengths.
Figure 10. Hyper-plane representation for double-matching optimization.
Figure 11. Colorized map of Z(n, β) showing where phase transitions occur. The dark red region is required for cortical operation.
Figure 12. Depiction of Carnot process underlying the operation of the cortical neuron.
Table 1. Relationships between bearing and information-theoretic measures.
Bearing | Information Theory | Name
b(X|A) | H(X) | Entropy
b(X ∨ Y|A) | I(X; Y) | Mutual Information
b(X ∧ Y|A) | H(X, Y) | Joint Entropy
b(X ∨ ~Y|A) | H(X|Y) | Conditional Entropy