Information-Theoretic Neuro-Correlates Boost Evolution of Cognitive Systems

Schossau, Jory; Adami, Christoph; Hintze, Arend

doi:10.3390/e18010006

Open AccessArticle

Information-Theoretic Neuro-Correlates Boost Evolution of Cognitive Systems

by

Jory Schossau

^1,2,*,

Christoph Adami

^2,3 and

Arend Hintze

^1,3,4

¹

Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, USA

²

BEACON Center for the Study of Evolution in Action, Michigan State University, East Lansing, MI 48824, USA

³

Department of Microbiology & Molecular Genetics, Michigan State University, East Lansing, MI 48824, USA

⁴

Department of Integrative Biology, Michigan State University, East Lansing, MI 48824, USA

^*

Author to whom correspondence should be addressed.

Entropy 2016, 18(1), 6; https://doi.org/10.3390/e18010006

Submission received: 1 July 2015 / Revised: 9 December 2015 / Accepted: 22 December 2015 / Published: 25 December 2015

(This article belongs to the Special Issue Information Theoretic Incentives for Cognitive Systems)

Download

Browse Figures

Versions Notes

Abstract

:

Genetic Algorithms (GA) are a powerful set of tools for search and optimization that mimic the process of natural selection, and have been used successfully in a wide variety of problems, including evolving neural networks to solve cognitive tasks. Despite their success, GAs sometimes fail to locate the highest peaks of the fitness landscape, in particular if the landscape is rugged and contains multiple peaks. Reaching distant and higher peaks is difficult because valleys need to be crossed, in a process that (at least temporarily) runs against the fitness maximization objective. Here we propose and test a number of information-theoretic (as well as network-based) measures that can be used in conjunction with a fitness maximization objective (so-called “neuro-correlates”) to evolve neural controllers for two widely different tasks: a behavioral task that requires information integration, and a cognitive task that requires memory and logic. We find that judiciously chosen neuro-correlates can significantly aid GAs to find the highest peaks.

Keywords:

genetic algorithm; neuro-correlate; markov brain; evolution; information theory

1. Introduction

The last 50 years of research in Artificial Intelligence have taught us many things, but perhaps the most obvious lesson is that designing complex cognitive systems is extremely hard. Notwithstanding the success of chess-playing algorithms and self-driving cars, designing a brain that rivals the performance of even the smallest vertebrate has proven elusive. While the computational algorithms that are being deployed today on the aforementioned problems, (as well as on image classification via convolutional nets) are impressive, many researchers are convinced that none of these algorithms are cognitive in the sense that they display situational understanding. For example, the celebrated convolutional nets can easily be fooled [1] with trivial imagery, suggesting that they implement a sophisticated look-up table after all, with very little understanding.

The failure of the design approach has been acknowledged by several groups of researchers that have chosen an entirely different approach, namely to use the power of evolution to create machine intelligence. This field of “neuro-evolution” [2,3] is much less developed than the standard design approach, but it has made great strides in the last decade. It also has the advantage (compared to the design approach) that the approach is known to have resulted in human-level intelligence at least once. In the field of neuro-evolution, a Genetic Algorithm (GA) [4] is used to evolve a program that, when executed, builds a “neuro-controller”. This neuro-controller constitutes the brain of a simulated entity, which is called an agent or animat.

Each program is evaluated via the performance of the agent, and programs that gave rise to successful brains are then replicated, and given proportionally more off-spring programs than unsuccessful programs. Because mutations are introduced in the replication phase, new types of programs are introduced every generation, trying out variations of the programs—and therefore variations of the brains. This algorithm, closely modeled on the Darwinian process that has given rise to all the biocomplexity on the planet today, has proven to be a powerful tool that can create neuro-controllers for a diverse set of tasks.

Using evolution to create brains is no panacea, though. The use of Genetic Algorithms to optimize performance (fitness) of behavior controllers is often hindered by the structure of complex fitness landscapes, which are typically rugged and contain multiple peaks. The GA, via its fitness maximization objective, will discover local peaks but may get stuck at sub-optimal peaks because crossing valleys is specifically not an objective. For a population to overcome the valleys in such a rugged landscape, programs must (at least temporarily) acquire deleterious mutations that are at odds with a simple reward system for optimization. This difficulty is typically overcome by increasing diversity in the population [5], by splitting the population into islands [6,7], by using alternative objectives such as in novelty search [8], or by changing (and thus optimizing) the fitness function itself. Of these solutions, “Diversity” and “Novelty” can be computationally intensive, while fitness function optimization is very specific to every problem, and thus not a general solution.

Here we propose an alternative approach to fitness optimization in the evolution of cognitive controllers, which takes advantage of the insight that functioning brains have a certain number of characteristics that are a reflection of their network structure, as well as their information-processing capacity. If we were able to reward these features at the same time as rewarding the performance of the given task, it may be possible to evade the valleys of the landscape, and move on neutral ridges towards higher peaks. The idea of using multiple objectives in Genetic Algorithms is not at all new [9], and it has been used previously in neuro-evolution [10].

We present evidence from simulations of evolving virtual agents that establishes that it is possible to improve the performance of a GA, increase the rate of adaptation, and improve the performance of the final evolved solution, all by incorporating neuro-correlates of the evolving agent’s brains into the fitness calculation. These neuro-correlates are metrics inspired by the interface between Network Science and Cognitive Science that attempt to measure “how well a brain is working”, independently of the achieved fitness. These measures typically do not assess an agent’s performance of a task because it is often difficult to relate task performance to cognitive ability. Ideally, these neuro-correlates either quantify the mode and manner that information is being processed, or in which manner the nodes of the network are connected. It is important that these neuro-correlates are agnostic of performance, as otherwise their reward would not open a new dimension in optimization.

The evaluation of an agent in any GA typically involves measuring the agent’s performance for a given task or environment. We show that multiplying the performance of an agent with the value of its neuro-correlate will improve the speed of evolution and increase the ratio of evolved agents that perform perfectly (this is a simple way of performing multi-objective optimization, see for example [11]). Any observed improvement can be traced back to an increase in the number of potentially beneficial mutations that may persist or sweep the population. If a mutation increases cognitive ability but does not yet have an effect on the agent’s performance, then it is evolutionarily neutral and can be lost by drift. However, if the neuro-correlate shows an increase that is neutral with respect to performance, but improves cognition in some other form (and therefore increases a neuro-correlate), then such a mutation is no longer neutral and is instead selected. In future generations, such an improvement might become beneficial for task performance. Therefore, using neuro-correlates in conjunction with performance allows these otherwise neutral mutations to stay in the population for longer or even promote them to fixation. Subsequent mutations then have a chance to take advantage of these changes that otherwise would have been lost to drift.

2. Background

We evolve agents to solve two very different tasks: a temporal-spatial integration task (active categorical perception, see [12,13,14,15]) using embodied agents, and a purely mathematical (disembodied) task that requires complex cognitive processing: the generation of random numbers using deterministic rules. The temporal-spatial integration task requires agents to observe and categorize blocks of different sizes falling toward them, by catching small blocks while avoiding large blocks. This task cannot be solved by a purely reactive machine (see [15]) because agents must use memory in order to recognize the block’s trajectory and predict where it will land. The task creates a fitness landscape known to be deceptive, as prior results have shown that only a small fraction (about 10%) of populations result in an optimal solution. The sub-optimal agents usually get stuck on local peaks that deliver about 80% of maximal fitness [15].

In the second set of experiments we investigate a task where agents are rewarded for generating long sequences of random numbers (without access to a random number generator or any other stochastic source). Agents are given an oscillating bit as an indicator of time, and assigned fitness based on the length of the dictionary generated from a Lempel–Ziv compression [16] of the agent’s output. Like the previous task, this task cannot be solved by a purely reactive machine, although would be trivially solved if the agents could access stochastic processes. However, because the agents use only deterministic processes we expect this task to require a great amount of temporal integration and memory in order for them to achieve a good amount of randomness. Indeed, generating random numbers is a known task to test cognitive ability and disability, in particular in the realm of Autism Spectrum Disorders and dementia [17,18,19].

The standard substrate for neuro-evolution are Artificial Neural Networks (ANNs, see e.g., [20]), but we use here a different substrate (“Markov networks” or “Markov Brains”) that has proven to be adept at behavioral decision-making tasks [21,22,23,24,25,26,27,28]. In contrast to ANNs in which neurons are continuous-valued and non-firing, neurons in Markov brains (MBs) are digital with only two states: quiescent or firing. Markov neurons can interact with any other neuron via arbitrary logical functions (as opposed to the ubiquitous transfer- and activation-function found in ANNs). We use MBs because we have experienced that they are computationally more powerful and more evolvable, while having a much smaller computational footprint than ANNs. ANNs on the other hand have a wide range of applications, and our results might generalize to those applications as well.

The logic functions that connect neurons, along with the topology of the network, are encoded directly by a string of bytes. The logic gates act on Markov variables (our equivalent of a neuron), and the output is written into other Markov variables. In a sense, the MB is defined by the edges between nodes, as it is the edges which carry all the computational power. Each gate is specified by a gene, and the beginning of each gene on the chromosome is specified by a particular combination of bytes–in our case, the “start codon” (42,213). The bytes that follow determine the identity of the neurons it reads from, and the identifier of the neuron(s) it writes to. The subsequent bytes encode the logic of the gate, which can be done by simply encoding the truth table. While other MB implementations allow for stochastic logic gates, here we confine ourselves to deterministic gates, which have a much more concise encoding (see Refs. [15,22,24] for a more detailed description of MB encoding and function).

There are alternative neural substrates that we could have studied here, including NEAT or hyperNEAT [29], genetic programming, or subsumption architecture machines [30], etc. These are all viable substrates for exploring the benefits of neuro-correlate-aided evolution. In this contribution we focus on testing the general validity of the neuro-correlate augmented evolution approach. We do expect the results to depend on the underlying neural substrates, their evolvability, and how well each neuro-correlate can be assessed. In addition, our proposed method is easy to implement for other systems: a neuro-correlate must be measured and the resulting value multiplied by the associated performance. This should allow for a rapid testing of this method in other systems.

Despite evidence that an indirect encoding might be more advantageous [31,32,33], the direct encoding has been very successful in evolving controllers for virtual agents to solve a wide range of tasks [15,21,22,24,34,35]. In addition, these controllers have been instrumental in establishing some of the neuro-correlates used in the present study, which increases our confidence that these measures perform as described. Next we describe the eight different neuro-correlates used to assess a controller’s topology and performance.

2.1. Network-Theoretic Neuro-Correlates

2.1.1. Minimum-Description-Length

The simplest neuro-correlate is, colloquially speaking, the largest possible brain size. It is difficult to define such a concept mathematically, but we can imagine that if we had a description of the brain in terms of the program that builds it, then the shortest such program would be the most concise description of the brain in a Minimum Description Length (MDL) formalism, and larger MDLs could encode larger brains. The size of the genome that codes for our Markov brains could serve as a proxy for the brain MDL, but it is almost never the smallest description of the brain simply because the genome can add more “empty tape” instead of running out of space to encode more logic gates, for example using a gene duplication. Using the genome size (as proxy for MDL) as a neuro-correlate makes sense because it explicitly rewards genome expansion, rather than waiting for a fortuitous genome duplication to add the extra space. The genome size is directly proportional to the potential number of logic gates and thus the number of connections the agent phenotype might have, since the genome encodes the logic gates directly. Of course, under such a selective pressure genome length is almost guaranteed to be very different from the compression limit (the smallest program encoding the brain), but we can think of evolution as creating a selective pressure to compress the brain description as much as possible.

In our implementation of Markov Brains the genome size varies between 2000 and

20, 000

loci (each locus on the genome is a byte, so it can take values between 0 and 255) and can be affected by insertion- and deletion-mutations. We do not use the number of encoded logic gates to directly assess brain size for two reasons: First, each gate can have a different number of inputs and outputs, which influences the complexity of the gate. Second, gene duplications can create exact copies of a gene that codes for a gate, which changes the number of gates without (presumably) affecting the brain’s function.

Because the connectivity of standard ANNs is fixed [20], an MDL-like neuro-correlate does not exist there, but the Vapnik–Chervonenkis (VC)-dimension [36] that bounds the learning capacity of a network could be a suitable alternative. Within more plastic systems ANNs that allow for encoding of connections such as NEAT [29], the number of edges between neurons could be a proxy for potential brain size.

2.1.2. Topological Complexity

Brains are networks of neurons, and our MBs are networks of logic gates, both of which can be represented by graphs. Assessing the complexity of graphs is not a straight-forward task (see, e.g., [37]), but for the purpose of brain function some graph properties are obvious neuro-correlate candidates, and easy to measure. We first measure the graph diameter (GD) as the highest value in the distance matrix of the network—also known as the longest of all shortest paths between all node pairs. The intuition behind using GD as a neuro-correlate is that information traveling along neurological pathways in brains with a large diameter has more time to interact with other neurons, in particular in other parts of the brain. If information takes longer to pass from sensors to actuators, it remains within the brain longer and therefore extends the agent’s short-term memory.

2.1.3. Connectivity and Sparseness

We measure the standard graph theoretic “Gamma Index” (GI) or “connectivity” of the network as well as its converse, the network “sparseness”. The Gamma Index is the ratio of extant connections to all possible connections. For this measure multiple connections between nodes are treated like a single connection, otherwise there would be an infinite number of possible connections between all nodes of the network.

Current understanding of brain optimization and organization suggests that connections between neurons are costly [38,39] and that this provides a strong selection pressure during evolution. Specifically, it has been shown that minimizing connections between neurons in a brain produces more modular and adaptable brains [10,40]. As we will see, this is not necessarily the case but depends on the task to be solved. Also, intuitively one might think that more connections are better, and thus optimizing for density might be as beneficial as optimizing for sparseness under the right circumstances. To incorporate this phenomenon, we use Sparseness and Gamma Index separately, as they reflect different aspects of brain connectivity.

2.1.4. Representations

R, a measure for the amount of information that a brain represents within internal states, is a new information-theoretical measure of cognitive function that correlates with fitness [15,34], but is in principle separate from an agent’s performance on a task. R measures how much an agent knows about its environment above and beyond its current sensory information. Representations can be thought of as internal models of the world that the agent can use to make decisions in conjunction with sensory signals, or even in the absence of sensory signals. Because of this, R is often identified with “what a brain knows about its world with eyes closed”. We can define R as the information the mental states (described by a random variable M) of an agent have about the environment (described by random variable E), given its sensors (variable S) [15]

\begin{matrix} R = I (E : M | S) = I (E : M) - I (E : M : S) = I - I (S : E) - I (S : M) . \end{matrix}

(1)

Here,

I

is “multi-information”: the amount of information variables share [41], which for three variables becomes

\begin{matrix} I = H (E) + H (M) + H (S) - H (E, M, S), \end{matrix}

(2)

where

H (X)

is the Shannon entropy of the random variable X. In Equation (1),

I (E : M | S)

is the shared information between variables E and M given the sensor state S,

I (E : M)

is the (unconditional) shared information between environment and brain internal states not including sensors, and

I (E : M : S)

is the shared information between all three variables (a quantity that may become negative). The measure R thus quantifies the correlation of mental states with world states, while subtracting the correlations coming from the sensors. This notion is not well-defined when using the correlation function, but it is when using information theory. R increases in evolving agents when their tasks require knowledge or internal models about the environment [15]. Agents without representations, or which are purely reactive agents, remain below optimal performance. Measuring these representations is simple as long as sensor, brain, and environmental states are directly observable (as they are here) causing only a small computational overhead.

It is important to note that R is not necessarily correlated with fitness. For instance, an agent might have representations about an incoming threat, but may not respond. Else, an agent may make decisions based solely on sensorial input, obtaining high fitness without representations. Therefore, representations do not necessarily make a prediction about an agent’s performance, even though they are usually correlated. In addition, R is not strictly speaking a neuro-correlate since it cannot be measured intrinsically (that is, using only variables internal to the brain). It is crucial that the correlate used does not allow predictions about performance, because otherwise the correlate in itself would be a proxy for fitness, and therefore optimizing a combination would not introduce a new element. R satisfies this condition so we include this measure.

2.1.5. Information Integration

Agents solving a cognitively complex task must integrate information from different sensory inputs to come to a single decision. Sensory inputs must be compared to one another in the light of past experiences. The information-theoretic measure Φ is one way of quantifying a brain’s information-integration capability [21,22,42,43,44]. Unfortunately, computing Φ is computationally extremely intensive, so much so that it is cumbersome on modern high-performance computers to calculate Φ exactly for brains of 16 neurons (for every agent at every update of an evolving population), and essentially infeasible for brains with more than 24 neurons. For example, because the computational complexity of Φ scales super-exponentially, calculating Φ for the brain of the lowly nematode C. elegans with 302 neurons requires evaluating

\sim 4.8 \times 10^{457}

partitions of the network, an absurd task.

Here we use the much more practical

Φ_{atomic}

which is a very good approximation of Φ at a much reduced computational cost [22] (

Φ_{atomic}

has also been defined as “synergistic information” [45,46]). Specifically,

Φ_{atomic}

is given by the integrated information, but calculated for one specific (the “atomic”) partition of the network, while the standard Φ optimizes over partitions. The atomic partition is the one where each node is its own part, that is, the atomic partition segments the network into all nodes individually.

To define

Φ_{atomic}

, we first define the information that is processed (in time) by the entire system. Let us define the network’s state using the joint random variable

X = X^{(1)} X^{(2)} \dots X^{(n)}

, where

X^{(i)}

represents the elements (nodes) of the system, and X changes as time (t) progresses. Each variable

X_{t}

is defined by a probability distribution

p (x_{t})

to find variable

X_{t}

in state

x_{t}

. Each node i progresses in time

X_{0}^{(i)} \to X_{1}^{(i)}

and each

X_{t}^{(i)}

is described by probability distribution

p (x_{j}^{i})

.

The information processed by the system from time step t to

t + 1

is then given by

\begin{matrix} I (X_{t} : X_{t + 1}) = \sum_{x_{t}, x_{t + 1}} p (x_{t}, x_{t + 1}) log \frac{p (x_{t}, x_{t + 1})}{p (x_{t}) p (x_{t + 1})} . \end{matrix}

(3)

The measure

Φ_{atomic}

then quantifies how much of the information processed by the system cannot be explained by the sum of the information processed by each individual computational unit. Thus, in a sense

Φ_{atomic}

quantifies how much processing is “more than the sum of its parts”, where the parts are defined by the individual neurons:

\begin{matrix} Φ_{atomic} = I (X_{t} : X_{t + 1}) - \sum_{i = 0}^{n} I (X_{t}^{(i)} : X_{t + 1}^{(i)}) + I . \end{matrix}

(4)

Here,

I (X_{t}^{(i)} : X_{t + 1}^{(i)})

is the information processed by the ith neuron, and

I

(called “integration” or “multi-information” in other work) measures the nonindependence between the network variables [41,47,48,49,50] and is defined as

\begin{matrix} I = \sum_{i = 0}^{n} H (X_{t}^{(i)}) - H (X_{t}), \end{matrix}

(5)

and

H (X_{t}) = - \sum_{x_{t}} p (x_{t}) log p (x_{t})

is the Shannon entropy of variable

X_{t}

. With both system information unexplained by part sums over time, and spatial integration measures of nonindependence,

Φ_{atomic}

represents both temporal and spatial network synergies. Calculation of

Φ_{atomic}

then simplifies to

\begin{matrix} Φ_{atomic} = \sum_{i = 0}^{n} H (X_{t}^{(i)} | X_{t + 1}^{(i)}) - H (X_{t} | X_{t + 1}), \end{matrix}

(6)

where

H (X_{t} | X_{t + 1})

is the conditional entropy of variable

X_{t}

given

X_{t + 1}

(and similarly for the conditional entropy of the each neuron

X^{i}

). These conditional entropies quantify our uncertainty about the “cause repertoire”, that is, the ensemble of states that could have given rise to the state

X_{t + 1}

[21].

As with previous neuro-correlates, the act of integrating information does not imply that there will be an action based upon such integrations. However, selecting agents with a higher

Φ_{atomic}

over others with the same performance guarantees the preservation of potentially beneficial mutations. In addition, we know that

Φ_{atomic}

is a limiting factor in the evolution of cognitive abilities: to perform a given task the agent requires a minimal amount of

Φ_{atomic}

[51], and a better performance necessitates a higher minimal amount of

Φ_{atomic}

.

2.1.6. Predictive Information

Predictive information (

P I

) can be measured in several ways [22,52,53]. It is the one-step mutual information that a system has between time t and

t + 1

. MB animats have sensors and actuators, and

P I

can be measured as a one-step mutual information of the sensors and future sensors, or the actuators and future actuators, or the sensors and future actuators, or the actuators and future sensors. Here we measure

P I

of the sensors and future sensors (

P I_{ss}

), and sensors and future actuators (

P I_{sa}

)

\begin{matrix} P I_{ss} & = I (S_{t} : S_{t + 1}), \\ P I_{sa} & = I (S_{t} : A_{t + 1}), \end{matrix}

(7)

where

S_{t}

represents the sensor values at time t and

A_{t}

represents the actuator values at time t.

An organism solving a physical task will move through the environment such that this information is increased—we typically do not look around randomly, but in a predictable manner. It has been shown that increasing

P I_{ss}

can be advantageous for creating meaningful behavior on agents [53]. Alternatively, predictive information can be understood as the information the sensors have about the actuators after the brain processed the sensor information. For a purely reactive agent, increasing this

P I_{sa}

(for predictive information from sensor to motor) would be advantageous, because the actions of the agent become more appropriate given the experienced sensor patterns. At the same time, if agents need to become less reactive but more dependent on their internal states

P I_{sa}

should decrease after adaptation (as shown in [22]).

2.2. Complex Environments

We investigate the effect that rewarding neuro-correlates have on adaptation in two different environments. The first is a temporal-spatial integration task where an agent must catch or avoid blocks that fall towards it (see Figure 1). The task is an adaptation of the “active categorical perception task” studied earlier [12,13,14], and requires a comparison between past and present sensor inputs to make inferences about future optimal behavior. While the task is seemingly simple, the sensor modalities (embodiment) of the agent are limited in such a way that this becomes a complex problem to be solved by an agent [15].

The second environment we use to optimize the agents is the generation of (pseudo) random numbers, and does not require embodiment of the brain. This task does not require any input, but the agent must produce a sequence of zeros and ones with high entropy. This task is also used to assess cognitive abilities in humans: it is known that autism [54], schizophrenia [55], as well as different forms or Alzheimer’s disease can be diagnosed by analyzing a sequence of symbols generated by a human subject who was asked to produce a sequence that is as unpredictable as possible [17,54,56,57]. This complex task involves memory [18], processing [19], and the ability to sequentially process information [17]—components that are also involved in the algorithmic generation of pseudo random numbers. It is unclear if an evolved Markov Brain random number generator resembles either a computer algorithm or the cognitive abilities found in humans. Nevertheless, this task clearly qualifies as a complex problem requiring many components to work together, while at the same time it is not another example of an embodied agent. The randomness of the produced sequence is measured by its compressability using the Lempel–Ziv–Welch (LZW) algorithm [16].

Figure 1. (A) In the simulation, large or small blocks fall diagonally toward the bottom row of a

20 \times 20

discrete world, with the agent on the bottom row. For the purpose of illustrating the task, a large brick is falling to the left, while a small brick is falling to the right. In simulations, only one block is falling at the time, and any one brick can fall either only to the left or only to the right. The agent is rewarded for catching small blocks and punished for catching large blocks; (B) A depiction of the agent’s neurons (bottom left: triangles depict sensors, circles illustrate brain (internal) neurons, trapezoids denote actuators) and the sequence of activity patterns on the agent’s 4-bit retina (right), as a large brick falls to the right. Reproduced from [15], with permission.

Figure 1. (A) In the simulation, large or small blocks fall diagonally toward the bottom row of a

20 \times 20

discrete world, with the agent on the bottom row. For the purpose of illustrating the task, a large brick is falling to the left, while a small brick is falling to the right. In simulations, only one block is falling at the time, and any one brick can fall either only to the left or only to the right. The agent is rewarded for catching small blocks and punished for catching large blocks; (B) A depiction of the agent’s neurons (bottom left: triangles depict sensors, circles illustrate brain (internal) neurons, trapezoids denote actuators) and the sequence of activity patterns on the agent’s 4-bit retina (right), as a large brick falls to the right. Reproduced from [15], with permission.

One can think of many other complex environments for which this method of GA augmentation might be suitable, such as: navigation tasks [22], classification and perception tasks [24], or tasks that require an understanding of group behavior [35,58]. As long as neuro-correlates are measurable and doing so does not impose too high a computational overhead, this augmentation should be applicable. However, different environments could benefit differently from the neuro-correlates used—for example a one-layered perceptron in a classification task might not require internal representations. In such cases the representation measure R might become useless. Alternatively

Φ_{atomic}

might become meaningless in a task that does not require the integration of information.

3. Methods

We performed evolutionary experiments to test how neuro-correlates affect the performance of a GA (Our source code is available at https://gitlab.msu.edu/jory/entropy-2015-neuro-correlates). In each block-catching experiment the performance of an agent was assessed by measuring the number of blocks correctly caught and correctly avoided in 80 trials. In the random-number-generation (RNG) experiment, agents were provided with one bit that at each time step changed from 0 to 1 and back for 500 updates (that is, an oscillator). The output of the network over those 500 updates was collected and compressed. The number of symbols written into this compressed string was used as a proxy for the maximum entropy of that string. Highly regular or constant strings result in very short sequences after compression, while strings with higher entropy cannot be compressed that easily and result in longer strings. This environment has no particular state the world can be in, thus measuring R is meaningless in this context.

Once performance is assessed, fitness can be calculated, upon which the GA’s selection mechanism can operate. It is standard in this domain to use an exponential of performance (

1 . 1^{performance}

) as a fitness measure to encourage selection of more complex behavior (similar to [59]) implying that each unit increase in performance increases fitness by 10%. Using fitness alone is our control: the non-augmented case. We explored the use of eight neuro-correlate measures for augmenting performance of the GA: minimum-description-length (MDL), diameter of the brain network (GD), amount of information integration (

Φ_{atomic}

), amount of representation about the environment (R), Gamma Index (connectedness), sparseness, and two variations of predictive information: sensors t to sensors

t + 1

, and sensors t to actuators

t + 1

(

P I_{ss}

and

P I_{sa}

). Augmenting performance by a neuro-correlate is performed by multiplying the normalized neuro-correlate with the exponential performance of the agent. Each evolutionary experiment is repeated 128 times and agents are allowed to adapt over

10, 000

generations (all evolutionary parameters are identical to [15], except duplication and deletion are identical at

0.05

). The population begins and ends at size 100 and the mutation rate is

0.005

. At the end of each experiment the line of descent (LOD) is reconstructed [59] and the results on the LOD are averaged over all replicates. The general form for fitness calculation used in this work is

ω = 1 . 1^{α} \prod_{i = 1}^{N} [\frac{κ_{i}}{κ_{i max}} + 1]

(8)

where N is the set of neuro-correlates to use. It is interesting that using

P I

as a linear additive function together with performance [60] as a way to augment selection has met with moderate success in the context of lifetime-learning in embodied artificial intelligence. Here we use a non-linear multiplicative function instead, which emphasizes the effect of both beneficial and deleterious mutations. The experiments described here use

N = 1

, but the general form allows any number of neuro-correlates to augment performance together in a form of multi-objective optimization. In (8), α is the measure of performance, κ the measure of the neuro-correlate, and

κ_{i max}

the maximum theoretical value for the neuro-correlate used.

Alternatively, the distribution of fitness or neuro-correlates at the end of evolutionary adaptation is measured. The violin plots we use aggregate the replicate experiments and visualize the distribution. The final population contains genetic variation not yet culled by selection. To reduce such variation, we take from each experiment the organism on the line of descent three generations prior to the final generation.

4. Results and Discussion

We find that three of the eight proposed neuro-correlates improve adaptation when used to augment the GA in the block-catching task. The agent populations not only adapt faster, but also evolve to a higher average performance after

10, 000

generations (see Figure 2a). These neuro-correlates are

Φ_{atomic}

and Graph Diameter (

p < 0.05

) and Gamma Index (

p < 0.0001

) whereby p values were calculated using the Mann-Whitney U test.

P I_{ss}

(

p < 0.01

) and Sparseness (

p < 0.05

) produced significantly worse results when used as augmenting neuro-correlates.

Because

Φ_{atomic}

, Graph Diameter, and Gamma Index all have a positive effect on evolution, it seems additional selection pressure for neural architecture and material drastically helps these populations of agents evolve to solve their task. It is possible that Sparseness has a negative effect because Markov Brains start with relatively small neural architectures and must evolutionarily grow to match the problem.

Figure 2. (a) Average fitness over

10, 000

generations (over 128 replicates) for the block-catching task; (b) Average fitness over

10, 000

generations (over 128 replicates) for generating random numbers. •: the control using only performance in the Genetic Algorithm (GA). The effect of augmenting the fitness function (Equation (8)) with

Φ_{atomic}

(▽), with R (▲), with graph diameter (☐) on task performance, with minimum-description-length ( Entropy 18 00006 i001

), with Gamma Index (★), with sparseness (+), with

P I_{ss}

(◇), and with

P I_{sa}

(

) on task performance.

Figure 2. (a) Average fitness over

10, 000

generations (over 128 replicates) for the block-catching task; (b) Average fitness over

10, 000

generations (over 128 replicates) for generating random numbers. •: the control using only performance in the Genetic Algorithm (GA). The effect of augmenting the fitness function (Equation (8)) with

Φ_{atomic}

(▽), with R (▲), with graph diameter (☐) on task performance, with minimum-description-length ( Entropy 18 00006 i001

), with Gamma Index (★), with sparseness (+), with

P I_{ss}

(◇), and with

P I_{sa}

(

) on task performance.

The results are very similar for the RNG task (see Figure 2b). Significance of the effect is roughly the same:

Φ_{atomic}

, Graph Diameter, and Gamma Index significantly affect evolution positively when used as fitness-agumenting neuro-correlates (

p < 0.01

). R cannot be used in the context of the RNG, and is therefore not shown.

Predictive information is maximized in cases where reactive behavior is rewarded. In tasks that require memory, maximizing predictive information can be detrimental (and is not the best predictor of fitness, see [22]). It is possible that a predictive information with a larger time delay, or a conditional predictive information such as

I (S_{t} : A_{t} | A_{t - 1})

[61] could produce better results. We plan on exploring those in the future.

Multi-parameter optimization (MPO) can often solve many problems related to multidimensional complex fitness landscapes but can suffer from a number of problems, all well-described in the multi-parameter optimization literature (for an overview see [11,62]). In most of these problem cases, the parameters to be optimized work against each other in the form of trade-offs (one parameter can only be optimized at a cost of another). We observe this effect with Gamma Index and sparseness depending on the task to evolve, while all other neuro-correlates work synergistically with performance. See Figure A1, Figure A2 and Figure A3 for evolutionary history interactions between neuro-correlates. We find that some neuro-correlates affect performance or each other antagonistically. While in our experiments this trade-off reduces the final performance of the evolved agents, it could be overcome using MPO. An objective which is antagonistic using our fitness function could be beneficial in MPO, which should be explored in the future.

4.1. Augmented Selection and Its Effect on Other Neuro-Correlates

Using a neuro-correlate to augment a Genetic Algorithm can shorten the runtime requirements and may improve the overall performance of the evolved population depending on the neuro-correlate and objective. One might ask how augmenting selection using one neuro-correlate affects the other correlates not under direct selection. Intuitively one would expect that selection for a particular neuro-correlate, in conjunction with performance, should increase not only performance (as discussed above) but also the measure itself. Similarly, since neither of the Predictive Information measures augment selection, we do not expect an increase in Predictive Information when using them in conjunction with performance. However, we find

P I_{sa}

to increase when selecting for it together with performance in the RNG environment.

We find no other effect of one neuro-correlate driving the evolution of another. The most prominent example of this effect is with Gamma Index driving the evolution of nearly all other neuro-correlates and vice versa (see Figure 3). This further supports the idea that using neuro-correlates does not necessarily create negative synergy as discussed in Results and Discussions concerning multiple parameter optimization.

Figure 3. Comparison of different optimization methods at the end of evolution of the block-catching task. The top row shows the performance distribution (gray) and average performance given the different selection regimes (MDL abbreviates minimum-description-length, which is the potential brain size based on the length of the encoding genome, GI abbreviates Gamma Index, GD graph diameter, and

P I_{ss}

and

P I_{sa}

are predictive information sensor t to sensor

t + 1

, and sensor t to actuator

t + 1

, respectively). Subsequent rows show the effect of the different selection regimes (x axis) on the respective neuro-correlates.

Figure 3. Comparison of different optimization methods at the end of evolution of the block-catching task. The top row shows the performance distribution (gray) and average performance given the different selection regimes (MDL abbreviates minimum-description-length, which is the potential brain size based on the length of the encoding genome, GI abbreviates Gamma Index, GD graph diameter, and

P I_{ss}

and

P I_{sa}

are predictive information sensor t to sensor

t + 1

, and sensor t to actuator

t + 1

, respectively). Subsequent rows show the effect of the different selection regimes (x axis) on the respective neuro-correlates.

In the RNG environment R cannot be used because there is no environment about which the agent could build representations. For other neuro-correlates we observe the same trend in the RNG task as in the block-catching task (see Figure 4). Selecting for a particular neuro-correlate increases its measure over evolutionary time, more so than the increase found without explicit selection. The exception to this is Gamma Index and sparseness and

Φ_{atomic}

. All other neuro-correlates seem to have no additional effect on each other.

Figure 4. Comparison of different optimization methods on each other at the end of evolution of the random number generation task. The top row shows the performance distribution (gray) and average performance given the different selection regimes (MDL abbreviates minimum-description-length, which is the potential brain size based on the length of the encoding genome, GI abbreviates Gamma Index, GD graph diameter, and

P I_{ss}

and

P I_{sa}

are predictive information sensor t to sensor

t + 1

, and sensor t to actuator

t + 1

, respectively). Subsequent rows show the effect of the different selection regimes (x axis) on the respective neuro-correlates.

Figure 4. Comparison of different optimization methods on each other at the end of evolution of the random number generation task. The top row shows the performance distribution (gray) and average performance given the different selection regimes (MDL abbreviates minimum-description-length, which is the potential brain size based on the length of the encoding genome, GI abbreviates Gamma Index, GD graph diameter, and

P I_{ss}

and

P I_{sa}

are predictive information sensor t to sensor

t + 1

, and sensor t to actuator

t + 1

, respectively). Subsequent rows show the effect of the different selection regimes (x axis) on the respective neuro-correlates.

4.2. Neuro-Correlate Interactions

We showed that selection can be augmented by using specific neuro-correlates, while others do not help. Is that because selecting for the neuro-correlates themselves already provides an advantage for performance? One might think that for example maximizing knowledge about the world (R) requires the agent to move in such a fashion that performance enhances automatically. To test this we repeated the above described experiment, but instead of augmenting performance with a neuro-correlate, this time selection was performed on the neuro-correlates alone.

We find that none of the neuro-correlates affect performance substantially in the context of the RNG task (see the top rows in Figure 5) except for

Φ_{atomic}

. The effect of

Φ_{atomic}

on the generation of random numbers is measurable, but very small. We assume that in order to generate non-zero

Φ_{atomic}

information must flow through the system. Because there is no control function, the “information” is literally random (that is, entropy), which is what the RNG environment seems to be selecting.

Figure 5. Effect on different neuro-correlates and performance when selecting only on a single neuro-correlate. The left panel shows evolution in the block-catching environment, and the panel on the right shows the random-number-generation (RNG) task.

As expected, selecting for a single neuro-correlate increases its value in both environments (see the diagonal for both environments in Figure 5). However, we also find that many neuro-correlates affect each other positively and negatively, and the effect is similar in both environments. Some of these interactions are very intuitive. All measures that benefit from more connections, for example, cause the minimum-description-length to increase, whereas sparseness causes the minimum-description-length to shrink. Similarly,

Φ_{atomic}

and R have some positive effect on each other, and we conjecture that having the ability to form representation—even though they might not be used to improve performance—still requires the ability to integrate information.

We find that

P I_{ss}

positively affects

P I_{sa}

in the block-catching environment and has no effect in the RNG environment, while

P I_{sa}

has a positive effect on

P I_{ss}

in both environments (compare the bottom right of each Figure 5). To our knowledge the relation between the two Predictive Information measures has not been studied, and we are unable to provide any additional insight into this phenomenon.

5. Conclusions

We have tested whether eight different network- and information-theoretic neuro-correlates can be used to improve the performance of a simple Genetic Algorithm. We found that

Φ_{atomic}

, Graph Diameter, and density (Gamma Index) each statistically significantly improve the performance of a GA in two environments tested. Sparseness does not improve performance as much as density does, suggesting sparseness is not generally beneficial for GAs in this domain. Thus, sparseness should only be used if its application has been shown to be beneficial for the problem domain in question. The two forms of predictive information measures (

P I_{ss}

and

P I_{sa}

) had a negative effect in both environments on finding optimal performers (see Figure 6), and thus it appears that

P I_{sa}

should not be used to augment selection in this domain. Gamma Index was significantly the most reliable fitness-augmenting neuro-correlate in the block-catching task for producing perfectly performing agents (see Figure 6).

Figure 6. Absolute number of perfectly performing block-catching agents after

10, 000

generations evolved under the five different experimental conditions shown. * indicates significance using the Chi-Squared test relative to Task or no neuro-correlate augmentation (

p < 0.05

).

Figure 6. Absolute number of perfectly performing block-catching agents after

10, 000

generations evolved under the five different experimental conditions shown. * indicates significance using the Chi-Squared test relative to Task or no neuro-correlate augmentation (

p < 0.05

).

Because the value of each neuro-correlate is simply multiplied by performance, the computational overhead is bound by the complexity of each measure. Typically, R and

Φ_{atomic}

measures are computationally intensive and must be repeated for every agent in the population. This is a significant overhead, especially for increases in agent lifetime or brain size. This study shows Graph Diameter and Gamma Index measures to be computationally inexpensive and thus preferable, with preference between the two for Graph Diameter. While the GA used here benefited from augmenting selection with neural correlates, we studied only two environments, and it is likely other environments might respond differently. Another possible extension of this work is to investigate other neuro-correlates, or even more topologically based neuro-correlates, or perhaps other algorithms such as novelty search [8].

Acknowledgments

This work was supported in part by the National Science Foundation’s BEACON Center for the Study of Evolution in Action under Cooperative Agreement DBI-0939454. We wish to acknowledge the support of the Michigan State University High Performance Computing Center and the Institute for Cyber Enabled Research (iCER), and the valuable contributions of three referees. We also thank Wesley Elsberry for suggesting LZW compression as a fitness criterion for the RNG task.

Author Contributions

Arend Hintze wrote the core computational evolution code. Jory Schossau wrote the analysis code, and performed all computational analysis. All authors conceived the experiments, evaluated the results, and wrote the manuscript. All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Figure A1. Selection for both task and a cognitive trait and the effect on each cognitive trait. Task (Block Catching Task) was selected for, in adition to each y axis trait. the x axis is the trait shown over evolutionary time for the y axis treatment. The populations were allowed to evolve for

10, 000