Quantifying Emergent Behavior of Autonomous Robots

Quantifying behaviors of robots that were generated autonomously from task-independent objective functions is an important prerequisite for objective comparisons of algorithms and of the movements of animals. The temporal sequence of such a behavior can be considered as a time series, and hence complexity measures developed for time series are natural candidates for its quantification. The predictive information and the excess entropy are such complexity measures. They measure the amount of information the past contains about the future and thus quantify the nonrandom structure in the temporal sequence. However, when using these measures for systems with continuous states, one has to deal with the fact that their values depend on the resolution with which the system's states are observed. For deterministic systems both measures diverge with increasing resolution. We therefore propose a new decomposition of the excess entropy into resolution-dependent and resolution-independent parts and discuss how they depend on the dimensionality of the dynamics, on correlations, and on the noise level. For the practical estimation we propose estimates based on the correlation integral instead of the direct estimation of the mutual information with the nearest-neighbor-based algorithm of Kraskov et al. (2004), because the latter allows less control over the scale dependencies. Using our algorithm we are able to show how autonomous learning generates behavior of increasing complexity with increasing learning duration.


Introduction
Whenever we create behavior in autonomous robots we strive for a suitable measure to quantify success or learning progress, or to compare algorithms with each other. When a specific task is given, the task typically provides a natural measure of success; for instance, walking may be measured by velocity and perturbation stability. In cases where behavior is learned via optimization of a global objective function, the same function can also be used as a quantification, so creating and quantifying behavior often go hand in hand. This also applies in principle to behavior from task-independent objectives, which have recently become more and more successful in generating emergent autonomous behavior in robots [1,14,24,32,35,39,48]. However, there are several cases of emergent behavior where this strategy fails: if the behavior arises from a) optimizing a local function [7,28], b) optimizing a crude approximation of a computationally expensive objective function [28], c) local interaction rules without an explicit optimization function [8,9,37], or d) a biological system (e.g. freely moving animals), where we do not know the underlying optimization function. Thus, independent of the origin of the behavior, it would be useful to have a quantitative description of its structure. This would allow us to compare algorithms objectively with each other and to compare technical with biological autonomous systems. This paper presents a measure of behavioral complexity that is suitable for the analysis of this kind of emergent behavior.
We base our measure on the predictive information (PI) [2] of a set of observables, such as joint angles. The PI is the mutual information between past and future of a time series and captures how much information (in bits) the past provides for predicting the future. It is closely linked to the excess entropy [5,41] or effective measure complexity [18], and it is a natural measure for the complexity of a time series because it provides a lower bound for the information necessary for optimal prediction. Thus it is a promising choice as a measure. Given these favorable properties, the PI was also proposed as an intrinsic drive for behavior generation in autonomous robots [1,28,48]. On an intuitive level, maximizing the PI of the sensor process leads to a high variety of sensory input (large marginal entropy) while keeping a temporal structure (high predictability, corresponding to a small entropy rate).
Unfortunately, there are conceptual and practical challenges in using the PI for generating and measuring autonomous behavior. Conceptually, for systems with continuous values the PI is a whole family of functions depending on the resolution of the measurements; practically, the PI is difficult to estimate. For a fixed partition (single resolution) in a low-dimensional case it is possible to estimate the one-step PI and adapt a controller to maximize it [48]. In high-dimensional systems it cannot be used directly in an online algorithm. An alternative approach [28] uses a dynamical-systems model, linearization, and a time-local version of the PI to obtain an explicit gradient for locally maximizing the PI in continuous and high-dimensional state spaces (no global optimization). A simplified version of the resulting neural controller [8] was used for generating the examples presented in this paper.
The aim of the present paper is to provide a measure of behavioral complexity that can be applied to emergent autonomous behavior, animal behavior, and potentially other time series. In order to do this, it turns out that the predictive information as a complexity measure has to be refined by taking its scale dependency into account. Because the systems of interest are high-dimensional, we investigate different methods to estimate the information-theoretic quantities in a resolution-dependent way and present results for different robotic experiments. The emergent behaviors in these experiments are generated by the above-mentioned learning rules. On the one hand we want to increase our understanding of the learning process guided by the time-local version of the PI, and on the other hand we demonstrate how the resolution dependence allows us to quantify behavior on different length scales.
We argue that many natural behaviors are low-dimensional, at least on a coarse scale. For instance, in walking all joint angles can typically be related to a single phase, so in the extreme case the behavior is one-dimensional [45]. However, as we show below in Sec. 2.1, this contradicts the global maximization of the PI, which would maximize the dimension. We discuss this contradiction and how it was circumvented in the practical applications. Supplementary material for the paper is located at playfulmachines.com/QuantBeh2015 and the source code can be found in the repository github.com/georgmartius/behavior-quant.
The plan of the paper is as follows. In Sec. 2.1 we introduce the information-theoretic quantities and in Sec. 2.3 the two estimation methods used in the paper. In Sec. 2.4 we describe the algorithm used to calculate the quantification measures. In Sec. 2.5 we demonstrate their behavior using the example of the Lorenz system, including the effects of noise. Sec. 3 starts by explaining the robots and the control framework used for the experiments, which are described in Sec. 3.2. The results of the quantification of the experiments are presented in Sec. 3.3 and discussed in Sec. 4.

Entropy, dimension and excess entropy
Our starting point is a (in general vector-valued) stationary time series {y_0, y_1, y_2, …, y_N}. It is used to reconstruct the phase space of the underlying dynamical system by delay embedding: x_t = (y_t, y_{t−τ}, y_{t−2τ}, …, y_{t−(m−1)τ}). If the original time series was d-dimensional, the reconstructed phase space is of dimension d × m. In the following we will only consider the case d = 1, i.e. scalar time series; the generalization to vector-valued time series is straightforward. Let us assume that we are able to reconstruct approximately the joint probability distribution p(x_t, x_{t−1}, …) in the reconstructed phase space. Then we can characterize its structure using measures from information theory, which represent a general and powerful tool for characterizing the structure of joint probability distributions [19,33]. The uncertainty about the outcome of a single measurement of the state, i.e. about x_t, is given by its entropy. For a discrete-valued random variable X with values x ∈ X and a probability distribution p(x) it is defined as

H(X) = − Σ_{x∈X} p(x) log p(x) .

An alternative interpretation of the entropy is the average number of bits required to encode a new measurement. In our case, however, the y_t are considered as continuous-valued observables (that are measured with a finite resolution). For a continuous random variable with a probability density ρ(x) one can also define an entropy, the differential entropy

h(X) = − ∫ ρ(x) log ρ(x) dx .    (1)

However, it behaves differently from its discrete counterpart: it can become negative, and it is even minus infinity if the probability measure of X is not absolutely continuous w.r.t. the Lebesgue measure, as is the case, for instance, for the invariant measure of a deterministic system with an attractor dimension smaller than the phase space dimension.
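The delay-embedding construction above is easy to state in code. The following is a minimal sketch for scalar time series, not the implementation used in the paper; the function name `delay_embed` is our own:

```python
import numpy as np

def delay_embed(y, m, tau):
    """Delay vectors x_t = (y_t, y_{t-tau}, ..., y_{t-(m-1)tau})
    from a scalar time series y; returns an array of shape (N', m)."""
    y = np.asarray(y)
    n = len(y) - (m - 1) * tau                  # number of complete delay vectors
    cols = [y[(m - 1 - j) * tau : (m - 1 - j) * tau + n] for j in range(m)]
    return np.column_stack(cols)

X = delay_embed(np.arange(10), m=3, tau=2)      # rows are (y_t, y_{t-2}, y_{t-4})
```

Each row of the result is one reconstructed state x_t; for a d-dimensional series one would stack the embeddings of the components.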
Therefore, when using information-theoretic quantities for characterizing dynamical systems, researchers often prefer the entropy of discrete-valued random variables. In order to apply it to dynamical systems with continuous variables, usually either partitions of the phase space or entropy-like quantities based on coverings are employed. These methods do not explicitly reconstruct the underlying invariant measure, but exploit the neighbor statistics directly. Alternatively, one could use direct reconstructions based on kernel density estimators [34] or methods based on maximizing relative entropy [17,42] to obtain parametric estimates. These reconstructions, however, will always lead to probability densities and are not suitable for reconstructing the fractal measures which appear as invariant measures of deterministic systems.
In this paper we use estimators based on coverings, i.e. correlation entropies [44] and nearest-neighbor-based methods [26], considered in Sec. 2.3 below. For the moment let us consider a partition of the phase space into hypercubes with side length ε; for a more general definition of an ε-partition see [16]. In principle one might consider scaling the different dimensions of y_t differently, but for the moment we assume that the time series was measured using an appropriate rescaling. The entropy of the state vector x_t observed with an ε-partition will be denoted in the following as H(X; ε), with ε parameterizing the resolution.
How does the entropy change if we change ε? The uncertainty about the potential outcome of a measurement will increase if the resolution of the measurement is increased, because of the larger number of potential outcomes. If x is an m-dimensional random variable distributed according to a probability density function ρ(x), we have asymptotically for ε → 0

H(X; ε) ≈ h(X) − m log ε    (2)

(see Cover and Thomas [4, ch. 8, p. 248, theorem 8.3.1] or [16]).
This is what we would expect for a stochastic system. However, if we observe a deterministic system, the behavior depends on how the embedding dimension relates to the attractor dimension. If the embedding dimension is smaller than the attractor dimension, the deterministic character will not be resolved and Eq. (2) still applies. However, if the embedding dimension is sufficiently high (m > D [11]), then instead of a density function ρ(x) we have to deal with a D-dimensional measure μ(x) and the entropy behaves as

H(X; ε) ≈ const − D log ε .    (3)
If a behavior such as in Eqs. (2) or (3) is observed for a range of ε values, we will call this range a stochastic or a deterministic scaling range, respectively.
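The stochastic scaling behavior of Eq. (2) can be checked numerically with a naive ε-partition. The following sketch (our own illustration, not part of the paper's pipeline) bins samples from a smooth density into hypercubes of side ε and shows that halving ε adds about m log 2 nats to the entropy:

```python
import numpy as np

def binned_entropy(x, eps):
    """Shannon entropy H(X; eps) in nats of samples x (shape N x m),
    observed on a partition into hypercubes of side length eps."""
    cells = np.floor(np.asarray(x) / eps).astype(np.int64)
    _, counts = np.unique(cells, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
x = rng.random((100_000, 2))        # uniform density on the unit square: h(X) = 0
H1 = binned_entropy(x, 0.1)         # ~ -2 log(0.1) ~ 4.6 nats
H2 = binned_entropy(x, 0.05)        # halving eps adds ~ m log 2 ~ 1.39 nats
```

For a deterministic attractor of dimension D the same experiment would instead show the slope D of Eq. (3).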
Conditional entropy and mutual information. Let us consider two discrete-valued random variables X and Y with values x ∈ X and y ∈ Y, respectively. The uncertainty of a measurement of X is quantified by H(X). Now we might ask: what is the average remaining uncertainty about X if we have already seen Y? This is quantified by the conditional entropy

H(X|Y) = H(X, Y) − H(Y) .

The reduction of the uncertainty about X from knowing Y is the information that Y provides about X and is called the mutual information between X and Y:

I(X; Y) = H(X) − H(X|Y) = H(X) + H(Y) − H(X, Y) .

Having defined the ε-dependent state entropy H(X; ε), we can now ask how much information the present state contains about the state of the system at the next time step. The answer is given by the mutual information between X_t and X_{t+1},

I(X_t; X_{t+1}; ε) = H(X_t; ε) + H(X_{t+1}; ε) − H(X_t, X_{t+1}; ε) .

Using Eq. (2) one sees that for stochastic systems the mutual information remains finite in the limit ε → 0 and can be expressed by the differential entropies:

I(X_t; X_{t+1}) = h(X_t) + h(X_{t+1}) − h(X_t, X_{t+1}) .

Note that this mutual information is invariant with respect to coordinate transformations of the system state, i.e. if z_j = f(x_j) with f continuous and invertible, then I(Z_t; Z_{t+1}) = I(X_t; X_{t+1}). In the case of a deterministic system, however, the mutual information diverges as ε → 0. This is reasonable behavior, because in principle the actual state contains an arbitrarily large amount of information about the future. In practice, however, the state is known only with a finite resolution determined by the measurement device or the noise level.
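The contrast between the finite mutual information of a stochastic system and the divergence for a deterministic one can be illustrated with a fixed ε-partition. In this sketch (illustrative only; `one_step_mi` and the circle-rotation example are our own, not from the paper) the deterministic dynamics shows a one-step mutual information that keeps growing as ε shrinks, while i.i.d. noise stays near zero at any resolution:

```python
import numpy as np

def binned_H(samples, eps):
    """Shannon entropy (nats) of eps-binned samples (1-D or joint columns)."""
    cells = np.floor(np.asarray(samples, dtype=float) / eps).astype(np.int64)
    if cells.ndim == 1:
        cells = cells[:, None]
    _, counts = np.unique(cells, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def one_step_mi(y, eps):
    """I(X_t; X_{t+1}; eps) = H(X_t) + H(X_{t+1}) - H(X_t, X_{t+1})."""
    joint = np.column_stack([y[:-1], y[1:]])
    return binned_H(y[:-1], eps) + binned_H(y[1:], eps) - binned_H(joint, eps)

# deterministic dynamics: the irrational rotation x_{t+1} = (x_t + phi) mod 1
phi = 0.5 * (np.sqrt(5.0) - 1.0)
y_det = np.mod(np.arange(100_000) * phi, 1.0)
mi_coarse = one_step_mi(y_det, 0.1)
mi_fine = one_step_mi(y_det, 0.05)      # grows as eps shrinks (deterministic divergence)
# i.i.d. noise: the mutual information stays near zero at any resolution
y_iid = np.random.default_rng(0).random(100_000)
mi_iid = one_step_mi(y_iid, 0.1)
```

The rotation is of course a trivial deterministic system; it is chosen only because its invariant measure is uniform, which makes the binning harmless.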
Predictive information, excess entropy and entropy rate. The unpredictability of a time series can be characterized by the conditional entropy of the next state given the previous states. In the following we will use an abbreviated notation for the involved block entropies,

H_m(X_t; ε) := H(X_t, X_{t−1}, …, X_{t−m+1}; ε) ,    (11)

and for the conditional entropies, h_m(X_t; ε) := H(X_t | X_{t−1}, …, X_{t−m}; ε). The entropy rate (see Cover and Thomas [4, ch. 4.2]) is this conditional entropy when conditioning on the infinite past:

h(ε) = lim_{m→∞} h_m(ε) .    (12)

In the following we assume stationarity, i.e. the joint probabilities, and therefore also the entropies, have no explicit time dependence. Moreover, if it is clear from the context which stochastic process is considered, we will write H_m(ε) and h_m(ε) instead of H_m(X_t; ε) and h_m(X_t; ε), respectively, and it holds

h_m(ε) = H_{m+1}(ε) − H_m(ε) .    (13)

For deterministic systems the entropy rate converges in the limit ε → 0 to the Kolmogorov-Sinai (KS-)entropy [3,43], which is a dynamical invariant of the system in the sense that it is independent of the specific state-space reconstruction. Moreover, already for finite m ≥ D, h_m(ε) will not depend on ε for sufficiently small ε because of (3).
To quantify the amount of predictability in a state sequence one might consider subtracting the unpredictable part from the total entropy of the state sequence. Doing so, one ends up with a well-known complexity measure for time series, the excess entropy [5,41] or effective measure complexity [18],

E(ε) = lim_{m→∞} E_m(ε)    (14)

with

E_m(ε) = H_m(ε) − m h_{m−1}(ε) .    (15)

The excess entropy provides a lower bound for the amount of information necessary for an optimal prediction. For deterministic systems, however, it will diverge, because H_m(ε) will behave according to (3) while h_{m−1}(ε) becomes ε-independent for sufficiently large m and small ε, respectively:

E(ε) ≈ const − D log ε    (16)

with D being the attractor dimension. The predictive information [2] is the mutual information between the semi-infinite past and future of the time series,

PI(ε) = lim_{m→∞} PI_m(ε)    (17)

with

PI_m(ε) := I(X_{t+m}, …, X_{t+1} ; X_t, …, X_{t−m+1} ; ε) .    (18)

If the limits (14) and (17) exist, the predictive information PI(ε) is equal to the excess entropy E(ε). For the finite-time variants, in general PI_m(ε) ≠ E_m(ε). However, if the system is Markov of order p, the conditional probabilities only depend on the previous p time steps, p(X_t | X_{t−1}, …, X_{t−∞}) = p(X_t | X_{t−1}, …, X_{t−p}), hence h_m = h_{m−1} for m > p and therefore E = E_{p+1} = PI_{p+1}.

Decomposing the excess entropy for continuous states
In the literature [2,5,18] both the excess entropy and the predictive information were studied only for a given partition, usually a generating one. Thus, when using the excess entropy as a complexity measure for continuous-valued time series, one has to deal with the fact that its value will be different for different partitions, even for different generating ones.
In (16) we have seen that the excess entropy for deterministic systems becomes infinite in the limit ε → 0. The same applies to the predictive information (18). Moreover, we have seen that the increase of these complexity measures with decreasing ε is controlled by the attractor dimension of the system. Does this mean that in the case of deterministic systems the excess entropy as a complexity measure reduces to the attractor dimension? Not completely. The constant in (16) reflects not only the scale of the signal, but also statistical dependencies or memory effects in the signal, in the sense that it will be larger if the conditional entropies converge more slowly towards the KS-entropy.
How can we separate the different contributions? We start by rewriting Eq. (15) as a sum. Using the conditional entropies (13) we get

E(ε) = Σ_{m=0}^∞ (h_m(ε) − h(ε)) .

Using the differences between successive conditional entropies,

δh_m(ε) := h_{m−1}(ε) − h_m(ε) ,    (19)

the excess entropy can be rewritten as

E(ε) = Σ_{m=1}^∞ m δh_m(ε) .    (20)

Note that the difference δh_m(ε) is the conditional mutual information

δh_m(ε) = I(X_t ; X_{t−m} | X_{t−1}, …, X_{t−m+1}) .

It measures dependencies over m time steps that are not captured by dependencies over m − 1 time steps; in other words, how much the uncertainty about X_t can be reduced if, in addition to the (m−1)-step past, also the m-th step is taken into account. For a Markov process of order m, the δh_k vanish for k > m. In this case the sum in Eq. (20) contains only a finite number of terms. On the other hand, truncating the sum at finite m can be interpreted as approximating the excess entropy by the excess entropy of an approximating Markov process. What can be said about the scale dependency of the δh_m(ε)? From Eqs. (2) and (3) it follows that h_m(ε) ∝ −log(ε) for m ≤ D and h_m(ε) ≈ const for m ≥ D. Using this and Eq. (19) we have to distinguish four cases for deterministic systems, where ⌊D⌋ denotes the integer part and δD = D − ⌊D⌋ the fractal part of the attractor dimension:

ŝ_m := −d δh_m(ε)/d log ε = 0 for m ≤ ⌊D⌋ − 1;  1 − δD for m = ⌊D⌋;  δD for m = ⌊D⌋ + 1;  0 for m ≥ ⌊D⌋ + 2 .    (21)
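The quantities δh_m can be estimated directly from block entropies. The sketch below (our own illustration, for a discrete sequence where the partition is already given) verifies that for a first-order Markov chain δh_1 equals the one-step mutual information while δh_m vanishes for m > 1:

```python
import numpy as np

def block_entropy(seq, m):
    """H_m: Shannon entropy (nats) of length-m blocks of a discrete sequence."""
    blocks = np.lib.stride_tricks.sliding_window_view(seq, m)
    _, counts = np.unique(blocks, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def delta_h(seq, m):
    """delta h_m = h_{m-1} - h_m with h_k = H_{k+1} - H_k and H_0 = 0."""
    H = lambda k: 0.0 if k == 0 else block_entropy(seq, k)
    h = lambda k: H(k + 1) - H(k)
    return h(m - 1) - h(m)

# first-order Markov chain: delta h_1 = I(X_t; X_{t+1}) > 0, delta h_m ~ 0 for m > 1
rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1], [0.2, 0.8]])          # transition probabilities
seq = np.zeros(200_000, dtype=np.int64)
for t in range(1, len(seq)):
    seq[t] = rng.random() < P[seq[t - 1], 1]
d1, d2, d3 = delta_h(seq, 1), delta_h(seq, 2), delta_h(seq, 3)
```

Truncating the sum (20) after d1 would here recover the full excess entropy, in line with the Markov-approximation interpretation above.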
Thus the sum in Eq. (20) can be decomposed into three parts,

E(ε) = E_state + E_ε(ε) + E_memory ,    (22)

with the ε-dependence showing up only in the middle term (MT),

E_ε(ε) = Σ_{m=m_l}^{m_u} m δh_m(ε) − c ,    (23)

with ε_D denoting the length scale where the deterministic scaling range starts. Therefore we have decomposed the excess entropy into three terms: two ideally ε-independent terms and one ε-dependent term. The first term,

E_state = Σ_{m=1}^{m_l−1} m δh_m + c ,    (24)

will be called "state complexity" in the following, because it is related to the information encoded in the state of the system. The constant c = Σ_{m=m_l}^{m_u} m δh_m(ε_D) was added here in order to ensure that the ε-dependent term vanishes at ε_D, the beginning of the deterministic scaling range. The second ε-independent term,

E_memory = Σ_{m=m_u+1}^∞ m δh_m ,    (25)

will be called "memory complexity", because it is related to the dependencies between the states at different time steps. What we call "state" in this context is related to the minimal embedding dimension needed to see the deterministic character of the dynamics, which is ⌊D⌋ + 1 [11]. In order to get a one-to-one reconstruction of the attractor, a higher embedding dimension might be necessary [36]. Both ε-independent terms together will be called the "core complexity", E_core = E_state + E_memory. So far we have only addressed the case of a deterministic scaling range. In the case of a noisy chaotic system we have to distinguish two ε-regions: the deterministic scaling range described above and the noisy scaling range with h_m(ε) ≈ h^c_m − d log ε, with h^c_m determined by the noise level. In the stochastic scaling range all δh_m become ε-independent and the decomposition (22) seems to become unnecessary. This is not completely the case. Let us assume that the crossover between the two regions happens at a length scale ε*.
Moreover, let us assume that for sufficiently large m we have h_m(ε) ≈ h_KS in the deterministic scaling range. Equating this with the noisy scaling behavior h_m(ε) ≈ h^c_m − d log ε allows us to express the crossover scale ε* in terms of the KS-entropy and the noise-level-related constant h^c_m:

log ε* = (h^c_m − h_KS)/d .    (28)

Moreover, the excess entropy in the deterministic scaling range behaves as E(ε) ≈ const − D log ε, cf. (16). Evaluating this expression at the crossover length scale ε* gives the value of the excess entropy in the stochastic scaling range as

E(ε ≤ ε*) ≈ const + (D/d)(h_KS − h^c_m) .

In particular, this expression shows that decreasing the noise level, i.e. decreasing h^c_m, will increase the asymptotic value of the excess entropy of a noisy system. Thus, an increased excess entropy or predictive information at a fixed length scale or partition can be achieved in many ways:

1. by increasing the dimension D of the dynamics,
2. by decreasing the noise level ε*,
3. by increasing the amplitude ε_D,
4. by increasing the state complexity,
5. by increasing the correlations measured by the "memory" complexity, i.e. by increasing the predictability,
6. by decreasing the entropy rate h_KS, i.e. by decreasing the unpredictability.

Naturally, the effect of the noise level will be observed in the stochastic scaling range only. In practice there might be more than one deterministic and stochastic scaling range, or even no clear scaling range at all. How we deal with these cases will be described below when we introduce our algorithm.

Methods for estimating the information theoretic measures
Reliably estimating entropies and mutual information is very difficult in high-dimensional spaces due to the increasing bias of entropy estimates. Therefore we employ two different approaches. On the one hand we use the algorithm for the estimation of the mutual information proposed by Kraskov et al. [26], based on nearest-neighbor statistics, which reduces the bias by employing partitions of different sizes in spaces of different dimensions. On the other hand we calculate a proxy for the excess entropy using correlation entropies [44] of order q = 2. These are related to the Rényi entropies of second order, and the correlation sum provides an unbiased estimator. Both methods do not require binning, but they differ substantially in what they compute.
Estimation via local densities from nearest neighbor statistics (KSG). The most common approach to estimate information quantities of continuous processes, such as the mutual information, is to calculate the differential entropies (1) directly from nearest-neighbor statistics. The key idea is to use nearest-neighbor distances [12,25,46] as proxies for the local probability density. This method corresponds, in a way, to an adaptive bin size for each data point. For the mutual information I(X, Y) (required e.g. to calculate the PI (18)), however, it is not recommended to naively calculate it from the individual entropies of X, Y and their joint (X, Y), because they may have very dissimilar scales, such that the adaptive binning leads to spurious results. For that reason a new method was proposed by Kraskov et al. [26], which we call KSG, that only uses the nearest-neighbor statistics in the joint space. The length scale on which the mutual information is estimated by this algorithm depends on the available data: in the limit of an infinite amount of data the nearest-neighbor distances, and with them the length scale of the estimate, shrink to zero. In order to evaluate the quantity at a certain length scale (similar to ε above), and assuming the same underlying space for X and Y, noise of strength η is added to the data,

ỹ_t = y_t + u_t ,  u_t ∼ U(η) ,

where U(η) is the uniform distribution on the interval [0, η]. The idea of adding noise is to make the processes X and Y independent within neighborhood sizes below the length scale of the noise. In this way only the structures above the added noise level contribute to the mutual information. Note that for small η the actual scale (the k-neighborhood size) may be larger due to the sparsity of the available data.
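For concreteness, a brute-force version of the KSG estimator (algorithm 1 of [26]) with the noise trick can be sketched as follows. This is an O(N²) illustration for small data sets, not the implementation used in the paper; `ksg_mi` is our own name:

```python
import numpy as np

_EULER = 0.5772156649015329

def _psi_table(n_max):
    """digamma at the integers: tab[n] = psi(n) = -EULER + H_{n-1}, n = 1..n_max."""
    tab = np.empty(n_max + 1)
    tab[0] = np.nan
    tab[1:] = -_EULER + np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, n_max))))
    return tab

def ksg_mi(x, y, k=4, eta=0.0, rng=None):
    """KSG estimate of I(X;Y) in nats: psi(k) + psi(N) - <psi(nx+1) + psi(ny+1)>.
    Uniform noise of strength eta sets the length scale below which
    structure is discarded."""
    x = np.asarray(x, float).copy()
    y = np.asarray(y, float).copy()
    if eta > 0.0:
        rng = np.random.default_rng(0) if rng is None else rng
        x += rng.uniform(0.0, eta, size=x.shape)
        y += rng.uniform(0.0, eta, size=y.shape)
    N = len(x)
    dx = np.abs(x[:, None] - x[None, :])
    dy = np.abs(y[:, None] - y[None, :])
    dj = np.maximum(dx, dy)                     # max-norm distance in the joint space
    np.fill_diagonal(dj, np.inf)
    eps = np.sort(dj, axis=1)[:, k - 1]         # distance to the k-th joint neighbor
    nx = np.sum(dx < eps[:, None], axis=1) - 1  # marginal neighbors (minus the point itself)
    ny = np.sum(dy < eps[:, None], axis=1) - 1
    psi = _psi_table(N)
    return psi[k] + psi[N] - np.mean(psi[nx + 1] + psi[ny + 1])

rng = np.random.default_rng(1)
x = rng.normal(size=1500)
y = 0.9 * x + np.sqrt(1 - 0.9 ** 2) * rng.normal(size=1500)
mi = ksg_mi(x, y)                 # close to the Gaussian value -log(1 - 0.81)/2
mi_destroyed = ksg_mi(x, y, eta=20.0)   # structure below the noise scale is gone
```

Production implementations use k-d trees instead of full distance matrices; the statistics computed are the same.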
Estimation via correlation sum. The correlation sum is one of the standard tools in nonlinear time series analysis [23, Chapter 6], [20]. Normally it is used to estimate the attractor dimension, but it can also provide approximate estimates of entropies and derived quantities such as the excess entropy. The correlation entropies for a random variable X with measure μ(x) are defined as [44]

H^(q)(ε) = (1/(1−q)) log ∫ μ(B_ε(x))^{q−1} dμ(x) ,

where B_ε(x) is the ε-ball around x. The integral in this formula is also known as the "correlation integral". For N data points x_i ∈ R^n, i = 1, …, N, it can be estimated for q = 2 using the correlation sum, which is the averaged relative number of pairs in an ε-neighborhood [23, Chapter 6], [20]:

C^(2)(ε) = 2/(N(N−1)) Σ_{i<j} Θ(ε − ||x_i − x_j||) ,

where Θ denotes the Heaviside function. Then the correlation entropy is

H^(2)(ε) = − log C^(2)(ε) .

For sufficiently small ε it behaves as

H^(2)(ε) ≈ const − D^(2) log ε ,

with D^(2) being the correlation dimension of the system [21]. A scale-dependent correlation dimension D^(2)(ε) can be defined as the difference quotient of H^(2) between adjacent scales. For a temporal sequence of states (or state vectors) we can now define block entropies H^(2)_m(ε) = − log C^(2)_m(ε) by using the m×d-dimensional delay vectors x_i. Now we can also define the quantities corresponding to the conditional entropies and to the excess entropy using the correlation entropies:

h^(2)_m(ε) = H^(2)_{m+1}(ε) − H^(2)_m(ε) ,  E^(2)_m(ε) = H^(2)_m(ε) − m h^(2)_{m−1}(ε) .

We expect the same qualitative behavior regarding the ε-dependence of these quantities as for those based on Shannon entropies, see Eqs. (11,15). Quantitatively there might be differences, in particular for large ε and strongly non-uniform measures.
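The correlation-sum quantities are straightforward to sketch. The code below is our own illustrative brute-force version (the paper uses TISEAN for the actual computations); it recovers the dimension of a plane-filling and of a line-bound point set:

```python
import numpy as np

def correlation_sum(X, eps):
    """C(eps): fraction of pairs of state vectors closer than eps (max-norm)."""
    X = np.asarray(X, float)
    d = np.max(np.abs(X[:, None, :] - X[None, :, :]), axis=-1)
    iu = np.triu_indices(len(X), k=1)           # each pair counted once
    return np.mean(d[iu] < eps)

def H2(X, eps):
    """Correlation (order-2 Renyi) entropy H^(2)(eps) = -log C(eps)."""
    return -np.log(correlation_sum(X, eps))

def D2(X, eps, factor=2.0):
    """Scale-dependent correlation dimension as a difference quotient."""
    return (H2(X, eps / factor) - H2(X, eps)) / np.log(factor)

rng = np.random.default_rng(0)
pts2 = rng.random((1500, 2))            # fills the unit square: dimension ~ 2
u = rng.random(1500)
pts1 = np.column_stack([u, u])          # lies on a line: dimension ~ 1
d2_plane, d2_line = D2(pts2, 0.1), D2(pts1, 0.1)
```

The slight underestimation for the square comes from edge effects of the finite domain, one of the large-ε deviations mentioned above.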
A comparison of the two methods with analytical results is given in Appendix A, where we find good agreement, although the KSG method seems to underestimate the mutual information for larger embeddings (higher-dimensional state spaces). The correlation-integral method uses a ball of diameter 2ε, whereas the KSG method measures the size of the hypercube enclosing the k neighbors after the data was subjected to additive noise in the interval [0, η]. Thus comparable results are obtained with η ≈ 2ε.

Algorithm for excess entropy decomposition
We now describe the algorithm used to compute the decomposition of the excess entropy proposed in Sec. 2.2. The algorithm is composed of several steps: preprocessing, determining the middle term (MT) (23), determining the constant in the MT, and calculating the decomposition and the quality measures.
Preprocessing: Ideally the δh curves are composed of straight lines in a log-linear representation, i.e. of the form o − s log(ε). We will refer to s as the slope (it is actually the inverted slope). Thus we perform a fitting procedure that attempts to find segments of this form; details are provided in Appendix B. The data is then substituted by the fits in the intervals where the fits are valid. As the δh become very noisy at very small scales, we extrapolate below the fit with the smallest scale. In addition we calculate the slope estimate ŝ(m, ε) = −d δh_m(ε)/d log ε at each point, either from the fits (s, where available) or from finite differences of the data (using 5-point averaging).
Determining the MT: In theory only two δh_m should have a non-zero slope at each scale ε, see Eq. (21). In practice, however, we often have more such terms, so we need to find for each ε the maximal range (m_l, m_u) with ŝ(i, ε) > s_min for all i ∈ [m_l, m_u], i.e. the slope is larger than the threshold s_min. However, this is only valid for deterministic scaling ranges; in stochastic ranges all δh should have zero slope. We therefore introduce a measure of stochasticity, defined as

κ(m, ε) = 1 − Σ_{k=m_l}^{m_u} ŝ(k, ε) ,

which is 0 for purely deterministic ranges and 1 for stochastic ones. The separation between state and memory complexity is then inherited from the next larger deterministic range: if κ(m, ε) ≥ κ_max, we use the (m_l, m_u) found at ε*, where ε* is the smallest scale larger than ε with κ(m, ε*) < κ_max. Note that the ε* defined algorithmically here is not necessarily equal to the ε* defined above in (28) for an ideal-typical noisy deterministic system.
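The slope and stochasticity computations of this step can be sketched as follows (a simplified illustration of the quantities, assuming the δh_m(ε) curves are already given on a logarithmic grid of scales; the actual implementation is in the repository cited above):

```python
import numpy as np

def slopes(delta_h, eps):
    """s_hat(m, eps) = -d(delta h_m)/d(log eps), by finite differences;
    delta_h has shape (n_m, n_eps), sampled at the scales eps."""
    return -np.gradient(delta_h, np.log(eps), axis=1)

def stochasticity(delta_h, eps, ml, mu):
    """kappa(eps) = 1 - sum_{k=ml}^{mu} s_hat(k, eps):
    ~0 in a deterministic scaling range, ~1 in a stochastic one."""
    return 1.0 - slopes(delta_h, eps)[ml:mu + 1].sum(axis=0)

eps = np.logspace(-3, 0, 50)
# idealized deterministic range: two delta h terms share the total slope 1
dh_det = np.vstack([0.5 - 0.7 * np.log(eps),
                    0.2 - 0.3 * np.log(eps),
                    0.1 * np.ones_like(eps)])
kappa_det = stochasticity(dh_det, eps, 0, 2)                  # ~ 0 everywhere
kappa_sto = stochasticity(0.3 * np.ones((3, 50)), eps, 0, 2)  # ~ 1 everywhere
```

Real δh curves are noisy, which is why the algorithm prefers slopes from the segment fits over raw finite differences.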

Determining the constant in the MT:
In order to obtain the scale-invariant constant c of the MT, see Eq. (23), we would have to determine a certain length scale ε_D. Since this cannot be done robustly in practice (in particular because it may not be the same ε_D for each m), we resort to a different approach: the constant parts of the δh_{m_l}, …, δh_{m_u} terms in the MT can be determined from plateaus at larger scales. Thus we define c^MT_m(ε) = min_{e∈(ε, ε*]} δh_m(e), where ε* is the smallest scale larger than ε at which we have a near-zero slope, i.e. ŝ(m, ε*) < s_min. If there is no such ε*, then c^MT_m = 0.
Decomposition and quality measures: The decomposition of the excess entropy follows Eqs. (24) and (25), with m_l and m_u used for splitting the terms. In addition we compute different quality measures that indicate the reliability of the results, see Table 1.

Illustrative example
To get an understanding of the quantities, let us first apply the methods to the Lorenz attractor, as a deterministic chaotic system, and to its noisy version as a stochastic system.
Deterministic system: Lorenz attractor. The Lorenz attractor is obtained as the solution of the differential equations

ẋ = s(−x + y)    (38)
ẏ = rx − y − xz    (39)
ż = xy − bz    (40)

integrated with the standard parameters s = 10, r = 28, b = 8/3. Fig. 1 displays the trajectory in phase space (x, y, z) and in the delay embedding of x: (x(t), x(t − τ∆), …, x(t − mτ∆)) with m = 3 and time delay τ = 10 (for sampling time ∆ = 0.01). The equations were integrated with fourth-order Runge-Kutta with step size 5 · 10⁻⁴. An alternative time delay may be obtained from the first minimum of the mutual information, a standard choice [15,30], which would give τ = 18, but τ = 10 yields clearer results for presentation. The result of the correlation sum method using TISEAN [22] is depicted in Fig. 2 for 10⁶ data points. From the literature we know that the chaotic attractor has a fractal dimension of 2.06 [21], which we can verify with the correlation dimension D^(2) ≈ 1.98 ± 0.11.

Figure 3: Excess entropy decomposition for the Lorenz attractor into state complexity (blue shading), memory complexity (red shading), and ε-dependent part E_ε (beige shading) (all in nats). Columns as in Fig. 2. A set of quality measures and additional information is displayed at the top using the right axis. m_l and m_u refer to the terms in E_ε (37). Scaling ranges that are identified as stochastic are shaded in gray (stochastic indicator > κ_max = 0.5). In manually chosen ranges, marked with vertical black lines, we evaluate the mean and standard deviation of the memory and state complexity. Parameters: s_min = 0.
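The data generation described above can be reproduced in a few lines. This sketch uses the equations, parameters, integrator, and sampling stated in the text, but a much shorter trajectory; it is not the exact script used for the figures:

```python
import numpy as np

def lorenz_rhs(v, s=10.0, r=28.0, b=8.0 / 3.0):
    """Right-hand side of the Lorenz equations (38-40)."""
    x, y, z = v
    return np.array([s * (-x + y), r * x - y - x * z, x * y - b * z])

def lorenz_x(v0, dt=5e-4, sample_dt=0.01, n_samples=2000):
    """Integrate with fourth-order Runge-Kutta (step dt) and return
    the observable x(t) sampled every sample_dt."""
    steps = int(round(sample_dt / dt))
    v = np.asarray(v0, float)
    out = np.empty(n_samples)
    for i in range(n_samples):
        for _ in range(steps):
            k1 = lorenz_rhs(v)
            k2 = lorenz_rhs(v + 0.5 * dt * k1)
            k3 = lorenz_rhs(v + 0.5 * dt * k2)
            k4 = lorenz_rhs(v + dt * k3)
            v = v + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        out[i] = v[0]
    return out

x_traj = lorenz_x([1.0, 1.0, 1.0])   # 20 time units; visits both lobes of the attractor
```

The resulting scalar series x(t) is then delay-embedded with τ = 10, ∆ = 0.01 as described above.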
Stochastic system: noisy Lorenz attractor. In order to illustrate the fundamental differences between deterministic and stochastic systems when analyzed with information-theoretic quantities, we now consider the Lorenz system with dynamic noise (additive noise on the state (x, y, z) in Eqs. (38-40) before each integration step), as provided by the TISEAN package [22]. The dimension of a stochastic system is infinite, i.e. for embedding dimension m the correlation integral yields the full embedding dimension, as shown in Fig. 2(b),(c) for small ε.
On large scales the dimensionality of the underlying deterministic system can still be revealed to some precision, either from D^(2) or from the slope of E^(2) or PI. Thus we can identify a deterministic and a stochastic scaling range with qualitatively different scaling behavior of the quantities. In contrast to the deterministic case, the excess entropy converges in the stochastic scaling range and no further contributions from m > 5 are observed, as shown in Fig. 2(h),(i). By measuring the ε-dependence (slope) we can actually differentiate between deterministic and stochastic systems and scaling ranges, which is performed by the algorithm (Sec. 2.4) and presented in Fig. 3. The predictive information, Fig. 2(k)-(l), again yields a lower slope, meaning a lower dimension estimate, but is otherwise consistent. However, it does not allow us to safely distinguish deterministic and stochastic systems, because it always saturates due to the finite amount of data.
Let us now consider the decomposition of the excess entropy as discussed in Sec. 2.2 and Sec. 2.4, to see whether the method is consistent. Fig. 3 shows the scale-dependent decomposition together with the determined stochastic and deterministic scaling ranges. The resulting values for the deterministic scaling range are given in Table 2 and reveal that E_core decreases with increasing noise level, similarly to the constant. This is a consistency check, because the intrinsic scales (amplitudes) of the deterministic and the noisy Lorenz system are almost identical. We also checked the decomposition for a scaled version of the data, where we found the same values for the state, memory, and core complexity. We also see that the state complexity E_state vanishes for larger noise, because the estimated dimension drops significantly below 2, in which case there are no summands in the state complexity and the constant c^MT is also 0.
To summarize, in deterministic systems with a stochastic component, we can identify different ε-ranges (scaling ranges) within which stereotypical behaviors of the entropy quantities are found. This allows us to determine the dimensionality of the deterministic attractor and decompose the excess entropy in order to obtain two ε-independent complexity measures.

Results
In this section we will apply our methods to the behavior of robotic systems. We use a previously published control scheme that can generate a large variety of smooth and coherent behaviors from a local learning rule [6,8]. As it is a simplified version of the time-local predictive information maximization controller, we aim to test the hypothesis that the controller actually increases the complexity of behavior during learning.

Application to robotics: controller and learning
Let us now briefly introduce the control framework and the robots used to generate the behavior we analyze in the following. The robots are simulated in a physically realistic rigid-body simulation [29]. Each robot has a number of sensors producing a real-valued stream s ∈ R^n and a number of actuators, also controlled by real values a ∈ R^o. The sensors measure the actual joint angles and the body accelerations.
In all examples the actuators are position-controlled servo motors with the particular property of exerting no power around the set point. This makes small perturbations perceivable in the joint-position sensor values. Time is discretized with an update frequency of 50 Hz. The control is done in a reactive closed-loop fashion by a one-layer feed-forward neural network, a = tanh(Cs + h), where C is a weight matrix, h is a weight vector, and tanh is applied component-wise. For fixed weights this controller is very simple; nevertheless, it can generate non-trivial behavioral modes due to the complexity of the body-environment interaction. For instance, with a scaled rotation matrix C an oscillatory behavior can be controlled whose frequency adapts to how fast the physical system follows. To obtain a variety of behavioral modes, the parameters (C, h) have to be changed, which can be done by optimizing certain objective functions. Several methods have been proposed that use generic (robot-independent) internal drives. Homeokinesis [7,10], for instance, balances dynamical instability with predictability. More recently the maximization of the time-local predictive information of the sensor process was used as the driving force [28]. On an intuitive level, maximizing the PI (18) of the sensor process leads to a high variety of sensory input due to the entropy term H(S) while keeping a temporal structure by minimizing the future uncertainty reflected in the conditional entropy H(S'|S). Indeed we have found that it produces smooth and coordinated behavior in different robotic systems, including hexapod robots and humanoids. On the other hand, maximizing PI also leads to behavior of high complexity with maximal dimension, as discussed in Sec. 2.2. A simplified version of the PI-maximizing algorithm, called ULR, published in [6,8], leads to more coherent and well-defined behavioral modes of seemingly low dimensionality, which we use in this paper.
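To make the controller concrete, here is a minimal closed-loop sketch in Python. It only illustrates the equation a = tanh(Cs + h): the body-environment interaction is replaced by an identity map (the sensors simply reproduce the actuation), which is a drastic simplification, and the rotation angle is a hypothetical choice.

```python
import numpy as np

def controller(s, C, h):
    """One-layer reactive controller: a = tanh(C s + h), component-wise."""
    return np.tanh(C @ s + h)

# Example: a scaled 2D rotation matrix as C produces an oscillatory mode
# whose amplitude is bounded by the tanh nonlinearity. The closed loop
# with the body is replaced here by an identity "world" for illustration.
phi = 0.1                       # rotation angle per step (hypothetical)
C = 1.2 * np.array([[np.cos(phi), -np.sin(phi)],
                    [np.sin(phi),  np.cos(phi)]])
h = np.zeros(2)

s = np.array([0.1, 0.0])        # initial sensor values
trajectory = []
for _ in range(500):            # 10 s at the 50 Hz update rate
    a = controller(s, C, h)
    s = a                       # idealized: sensors reproduce the actuation
    trajectory.append(s.copy())
```

With the gain 1.2 > 1 the origin is unstable and the tanh saturation bounds the amplitude, so the loop settles into a self-sustained oscillation; in the real closed loop the frequency would instead be shaped by how fast the body follows.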
In order to apply the method to the behavior of robots, or possibly to that of animals, the state space reconstruction has to be done from certain measurements. Here we use a single sensor value, but the formalism can be extended to multiple sensor values as well. It must be guaranteed that sufficient physical coupling exists between the measured quantity and the rest of the body.

Experiments
Let us now present the emergent behaviors obtained from the control algorithm ULR. We use two robots, a snake-like robot and an insect-like hexapod robot, as illustrated in Fig 4. The behaviors of the snake are published here for the first time, whereas the behaviors of the hexapod have been published before in [8]. However, this paper focuses primarily on the quantification of behavior and not on its generation.

The Snake has identical cylindrical segments which are connected by actuated 2-degrees-of-freedom (DoF) universal joints. In this way two neighboring segments can be inclined relative to each other, but without relative torsion along the segment axes. The sensor values are the actual positions of the joints. If the robot is controlled with the ULR controller, highly coordinated behavioral modes such as crawling or side-rolling emerge very quickly, see Fig 5. An external perturbation can bring the robot into another behavior. Without perturbation the robot typically transitions into one behavior and remains in it. We recorded 9·10^5 samples each from a typical side-rolling and a crawling behavior.

The Hexapod is an insect-like robot with 18 DoF, see Fig 4(a). The sensor values are the joint angles and the vertical velocity of the trunk. Note that the actual phase space dimension is 18 + 18 + 4 = 40, as we have joint angles, joint velocities, and height and orientation of the trunk. Controlled with the ULR controller, the robot develops a jumping behavior, see Fig 6, within 8 min of simulated time [8]. The behavior of the robot changes due to the update dynamics of the controller. For the analysis a stationary process is required, so we stop the update at some time and obtain a static controller (fixed C, h); the robot's behavior then stays essentially the same. We analyze three behaviors, where the learning was stopped 2 min, 4 min and 8 min after the start, and recorded 10^6 samples each. After "2 min" the robot moves periodically up and down.
The "4 min" behavior shows already signs of jumping and the final behavior after "8 min" is a whole body periodic jumping as displayed in Fig 6.

Quantifying the Behavior
We start with analyzing the data from the Snake experiments followed by the Hexapod experiments.

Snake
In Fig 7 the results of the quantification of the two behaviors (Fig 5) are presented. We used the delay embedding of the sensor x_7, one of the joint-angle sensors of the middle joint. We find that the side-rolling behavior is roughly 1-dimensional on the behaviorally relevant scales (0.05 < ε < 0.5), see Fig 7(c). On smaller scales the dimension rises with each embedding, although the embedding dimension is not reached on the observable length scales and therefore the excess entropy does not saturate. On the larger scales we find the expected scaling behavior (16) with D ≈ 1.
The crawling behavior is also 1-dimensional on the coarse length scales (0.1 < ε < 0.5), see Fig 7(d). The smaller scales, however, bear a surprise: the dimension rises to a value of ≈ 2.5. The excess entropy also shows a slope of D ≈ 1.0 on the coarse length scales and a slope of D ≈ 2.5 on the smaller scales, see Fig 7(f). Calculating the predictive information using the KSG method yields a similar result on the large scales, as shown in Fig 7(g),(h). For smaller scales, however, the predictive information always saturates, as seen before in Fig 2(j). In the crawling case we again get an underestimated dimension of D ≈ 2.

Figure 7: Quantification of two Snake behaviors. First column: side rolling; second column: crawling (see Fig 5).
The decomposition of the excess entropy for the above experiments is given in Table 4. Interestingly, E_core attributes a significantly higher complexity to the crawling behavior, whereas the plain excess entropy constant would suggest that side-rolling is more complex. Since we remove the effect of different scales, our new quantities are more trustworthy.
Note that for the crawling behavior in the small scaling range (ε < 0.01) we can also perform the decomposition and obtain E_state = 0.84 ± 0.17 and E_mem = 3.23 ± 0.46; however, these values are mainly useful for comparison, which we only have on the large scale. Note also that E_state and E_mem are not really constant in the plots, because the deterministic scaling range is not well pronounced in this case (cf. the quality measure "no fits" in Fig 7(f)).

Hexapod

For the third behavior we cannot identify a clear plateau in the dimension plot. The excess entropy allows us to attribute a dimensionality and core complexity to the behavior on large scales, as summarized in Table 5. Whereas the scaling behavior of the excess entropy suggests a similar dimensionality, the core complexity suggests that the 8 min behavior is more complex.

The predictive information calculated with the KSG method substantially underestimates the dimension and is only able to report structure on the coarsest scale, see Fig 8(j)-(l); compare for instance the slope 1.8 of PI in (k) with the slope 3.36 of E^(2) in (h). Note that usually only the value of the mutual information without added noise (or with only a very small amount) is reported (the left-most value in our graphs). In our case this would attribute the highest complexity to the 2 min behavior and the lowest to the 8 min behavior, see also Table 5. This is because the intrinsic noise level increases from 2 min to 8 min, such that the plateau level is much lower.

Discussion
In recent years research on autonomous robots has been more and more successful in developing algorithms for generating behavior from generic task-independent objective functions [1,14,24,28,32,35,39,48]. This is valuable, on the one hand, as a step towards understanding and creating autonomy in artificial beings, and on the other hand, as a route to eventually more effective learning algorithms for mastering real-life problems. This is because the task-agnostic drives aim at efficient exploration of the agent's possibilities by generically useful behavior [7,28,31,32,38] and at generating an operational state of the system [13] from which specific tasks can quickly be accomplished.
Very often emerging behaviors are not the result of optimizing a global function but, for instance, self-organize from local interaction rules. We propose to use information theoretic quantities for quantifying these behaviors. There are already studies using information theoretic measures to characterize emergent behavior of autonomous robots [27,40,47] - in particular with the aim to use them as task-independent objective functions [1,24,28,35,39,48]. Usually these information theoretic quantities are estimated for a fixed partition, i. e. a specific discretization, or by using differential entropies, assuming that the data were sampled from a probability density. In this paper we show that it is both necessary and fruitful to explicitly take into account the scale dependence of the information measures. We quantify and compare behaviors emerging from self-organizing control of high-dimensional robots by estimating the excess entropy and the predictive information of their sensor time series. The estimation is done on the one hand using entropy estimates based on the correlation integral, and on the other hand by directly estimating the predictive information as a mutual information with an estimator (KSG) based on differential entropies. The latter implicitly assumes stochastic data, i. e. the existence of a smooth probability density for the temporal joint distribution, such that the predictive information converges towards a finite value. However, this is not the case for deterministic systems. Due to the adaptivity of the algorithm, the predictive information is then calculated at a certain length scale which depends on the available amount of data. Thus, a naive application of the KSG estimator can lead to meaningless results, because it does not allow one to control for the length scale. To avoid this problem and to control the length scale, we applied additive noise.
Nevertheless, our results show that the complexity estimates based on the KSG estimator are much less reliable than those based on the correlation integral. While for simple stochastic systems such as autoregressive models both estimators deliver consistent results for small embedding dimensions, the predictive information is underestimated due to finite sample effects for larger embedding dimensions (cf. Fig 9). For more complex systems such as the Lorenz attractor or the behavioral data from robots the KSG algorithm underestimates the dimensionality, see e. g. Fig 8(j)-(l), does not allow to extract length scales properly, and blurs the distinction between stochastic and deterministic scaling ranges.
An important result of the presented study is that the complexity of the behavior can be different at different length scales: an example is the crawling behavior of the Snake, which is 1-dimensional on the coarse scale and 2.6-dimensional on the small scales, see Fig 7 (similarly for the Hexapod). In order to quantify these effects we analyzed for the first time the resolution dependence of the well-known complexity measure for time series known as excess entropy [5,41], effective measure complexity [18], or predictive information [2]. We show that the scale-dependent excess entropy reflects different properties of the system: scale, dimension and correlations. For deterministic systems the attractor dimension controls how the complexity increases with increasing resolution, while the offset depends on the overall scale of the behavior, e. g. the amplitude of an oscillation, and on the structure of the temporal correlations. In order to disentangle the latter two effects we introduced a new decomposition of the excess entropy, with the core complexity quantifying the structural complexity beyond the dimension. In order to estimate it, one has to remove the effect of the overall scale of the behavior from the excess entropy. We devised an algorithm based on an automatic identification of scaling ranges in the conditional mutual information δh_m(ε).
For noisy systems the excess entropy remains finite. The actual value, however, depends on the noise level, which is consistent with the intuition: the larger the noise, the lower the complexity, if all other factors remain constant. While we already have a good understanding of these quantities for systems with clearly identifiable scaling regions, as in the noisy Lorenz system, real systems very often do not show such ideal types of behavior. Therefore we had to use heuristic procedures, which provide plausible results for the data under study. Our quantification of the behaviors of the Snake found that crawling has a higher complexity than side-rolling, even though the original excess entropy would have suggested the opposite. For the Hexapod as a learning system (aiming at maximizing time-local PI) we were able to quantify the learning progress. The decomposition of the excess entropy revealed the strategy used to increase the complexity: (1) The dimensionality of the behavior increased on the large scales. This does not necessarily mean that the attractor dimension in the mathematical sense increased, because the latter is related to behavior on infinitely small scales, which might be irrelevant for the character of the behavior. (2) The core complexity increased on the large scales, indicating also more complex temporal correlations in the learned behavior. We also discuss further strategies for increasing the complexity of a time series that could be employed in designing future learning algorithms, most notably the decrease of the entropy rate and the increase of the state complexity.
Our study contributes a well-grounded and practically applicable quantification measure for the behavior of autonomous systems, as well as for time series in general. In addition, a set of quality indicators is provided for self-assessment in order to avoid spurious results.
Future work is to develop a better theoretical understanding of the scale dependence of the excess entropy for realistic behaviors that are neither low-dimensional deterministic nor linearly stochastic, such as the behavior observed in the Hexapod after 8 minutes of learning, Fig 8. We also expect further insights from applying the same kind of analysis to movement data from animals and humans.
For larger delay values the expressions become very large, such that it becomes intractable to compute the MI for τ > 10 (requires r^30), because numerical errors accumulate even after algebraic simplification. The supplementary material (playfulmachines.com/QuantBeh2015) contains a Mathematica file for calculating the MIs up to τ = 10. The predictive information (18) can also be written as PI_k = 2H_k − H_{2k}. For the AR2 process we have h_k = h_2 for k ≥ 2 due to Markov order 2. Thus, PI_1 = h_0 − h_1 = 2H_1 − H_2 and PI_2 = PI_k = h_0 + h_1 − 2h_2 = 3H_2 − 2H_3 for k ≥ 2. For the excess entropy Eq (15) we find E_1 = 0, E_2 = h_0 − h_1 = PI_1, and E_3 = E_k = h_0 + h_1 − 2h_2 = PI_2 for k ≥ 3. Fig 9 shows the results for the AR2 model obtained by fitting the parameters to the Lorenz system (a_1 = 1.991843, a_2 = −0.994793). Indeed, the conditional entropies h_m = h_2 become identical for m ≥ 2, see (c),(d), and the excess entropies converge for m ≥ 3 to their theoretical values, see (e),(f). In the case of the KSG algorithm the values are correct for m = 1, 2, but for higher embedding dimensions we get a significant underestimation, see (g),(h). This effect is even stronger for smaller neighborhood sizes k.
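These identities can be checked numerically for the Gaussian AR2 process, since all block entropies follow from the covariance matrix, H_m = ½ log((2πe)^m det Σ_m). The sketch below (our own helper code) builds Σ_m from the Yule-Walker autocorrelations of the fitted coefficients and evaluates the quantities above.

```python
import numpy as np

# AR(2) coefficients fitted to the Lorenz data (values from the text)
a1, a2 = 1.991843, -0.994793

# Autocorrelations rho_k from the Yule-Walker equations
rho = [1.0, a1 / (1 - a2)]
for k in range(2, 4):
    rho.append(a1 * rho[k - 1] + a2 * rho[k - 2])

def block_entropy(m):
    """Differential block entropy H_m = 0.5*log((2*pi*e)^m * det(Sigma_m)).
    We use the correlation matrix; rescaling the overall variance shifts
    each h_k equally and leaves the PI values unchanged."""
    S = np.array([[rho[abs(i - j)] for j in range(m)] for i in range(m)])
    sign, logdet = np.linalg.slogdet(2 * np.pi * np.e * S)
    return 0.5 * logdet

H = {m: block_entropy(m) for m in (1, 2, 3, 4)}
h = {k: H[k + 1] - H[k] for k in (1, 2, 3)}   # conditional entropies h_k
h[0] = H[1]

PI1 = 2 * H[1] - H[2]          # = h_0 - h_1
PI2 = 3 * H[2] - 2 * H[3]      # = h_0 + h_1 - 2*h_2  (Markov order 2)
```

In particular, h_3 = h_2 holds exactly: given two lags, the innovation is the only remaining uncertainty, so conditioning on a third lag does not reduce the conditional entropy further.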

B.1 Fitting
In deterministic scaling ranges the δh_m(ε) follow the form o − s log(ε) (possibly with s = 0). In a perfectly stochastic scaling range we have δh_m(ε) ≈ const for all m. However, the behavior of a system may show several deterministic and stochastic scaling ranges, so we have to determine segments where a certain typical behavior is found. The following algorithm assumes that for each m the δh_m(ε) contain segments of the form f(ε) ≈ o − s log(ε) and some parts that do not fit. Because for different m and different systems the curves may have different amounts of noise, we first determine a quality threshold for the fits.
Internally the data are represented by n points, which we denote (ε_i, δh_i) with i = 1, . . . , n. We denote the range where a fit was computed by r = (i_l, i_u) with i_l < i_u, and write |r| = i_u − i_l for the length of the segment in number of points. The quality measure q(r) is the sum of the squared fit residuals. First, all fits over segments of a certain number of data points (here 10) are computed together with their quality measures: Q = {q(r_j) | |r_j| = 10}. There will be segments with low residual errors and segments with very high residuals, namely those that cover the regions between ideal-typical behaviors. The heuristic for the quality threshold is the 25% percentile of Q plus a small portion of the standard deviation: q_max = P(Q, 0.25) + 0.1 StdDev(Q). The reasoning behind the percentile is to pick a threshold among the good fits, assuming only that at least 25% of the curve consists of fitting segments. If, however, there is an extended range of good fits, then a high percentage (say 90%) of the short segments have very similar low q values; in this case a too low threshold would be selected, which cuts away good fitting regions. The tenth of the standard deviation is added to lift the threshold above this potential plateau, see Fig 10. For all "good" segments Q_good = {q(r_j) ∈ Q | q(r_j) < q_max} the longest extension of the fitting range is determined that keeps the quality of the fit below the threshold, i. e. for each r_j = (i_l, i_u) with q(r_j) ∈ Q_good we set i_u to the largest k > i_u such that q((i_l, k)) < q_max. Thus we get many overlapping regions of good fits. For each pair of regions that overlap by more than 30% (w.r.t. the shorter one), we discard the shorter one. The remaining regions are our final fits, as displayed in Fig 10(a).
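The heuristic can be sketched as follows (a simplified reimplementation for illustration, not the original code; parameter names are ours):

```python
import numpy as np

def fit_quality(log_eps, dh, lo, hi):
    """Least-squares fit of dh ~ o - s*log(eps) on points lo..hi (inclusive);
    returns the sum of squared residuals."""
    x, y = log_eps[lo:hi + 1], dh[lo:hi + 1]
    A = np.column_stack([np.ones_like(x), -x])
    (o, s), res, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ np.array([o, s])
    return float(r @ r)

def seg_len(s):
    return s[1] - s[0] + 1

def overlap_len(s, t):
    return max(0, min(s[1], t[1]) - max(s[0], t[0]) + 1)

def find_scaling_ranges(eps, dh, win=10, perc=25, std_frac=0.1, overlap=0.3):
    """Heuristic segment detection as described in the text (a sketch)."""
    log_eps = np.log(eps)
    n = len(eps)
    # quality of all fixed-length windows
    qs = [fit_quality(log_eps, dh, i, i + win - 1) for i in range(n - win + 1)]
    q_max = np.percentile(qs, perc) + std_frac * np.std(qs)
    # grow each good window to the right while the fit stays below threshold
    segs = []
    for i, q in enumerate(qs):
        if q >= q_max:
            continue
        hi = i + win - 1
        while hi + 1 < n and fit_quality(log_eps, dh, i, hi + 1) < q_max:
            hi += 1
        segs.append((i, hi))
    # for pairs overlapping more than 30% of the shorter one, drop the shorter
    segs.sort(key=lambda s: s[1] - s[0], reverse=True)
    kept = []
    for s in segs:
        if all(overlap_len(s, t) <= overlap * min(seg_len(s), seg_len(t))
               for t in kept):
            kept.append(s)
    return kept, q_max
```

Note that the greedy rightward extension is a simplification of the arg-max extension in the text; for clean data with well-separated scaling ranges the two coincide.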
The components c_MT for calculating the constant of the middle term (37) are also displayed in Fig 10(a). For m = 3, the value of c_MT is 0.23 for 0.1 < ε < 0.8 because of the shallow region at [0.8, 0.9]. This leads to the value of 0.68 (= m · c_MT) for the state complexity in Fig 3(b). For ε < 0.1 the constant c_MT