A Note on Representational Understanding

In this paper, we explore a new approach in which understanding is interpreted as a set representation. We prove that understanding/representation, finding the appropriate coordination of data, is equivalent to finding the minimum of the representational entropy. For the control of the search for the correct representation, we propose a loss function as a combination of the representational entropy, type one and type two errors. Computational complexity estimates are presented for the process of understanding and using the representation found.


Introduction
Intelligence in general is the ability to respond quickly and adequately to challenges of the external world. Animals need to be able to recognise enemies and predators quickly. This can be considered as a one-sided classification, one class versus all the others. In [1] the authors argue that responding extremely quickly requires a different solution than classical classification or learning. There, they propose a new paradigm, fundamentally different from classification, to cover the cognitive process of understanding. The present paper shows the theoretical feasibility of the comprehension process and proposes practical development of the process of understanding.
The authors in [1] argue, following the strategy of representation learning [2,3], that understanding a topic is equivalent to finding the right representation of the data. According to this approach, understanding does not involve data compression, but merely the rearrangement of known facts/characteristics that fit best to the observed phenomena.
There is a subtle philosophical difference between understanding and learning, although they may seem technically similar and are part of our cognitive work. Classic AI/ML tasks such as classification, clustering, encoding, etc. go hand in hand with understanding, "model building". Before we are able build a proper model, we try to classify objects based on their characteristics, features, or try to cluster them. If we find a strikingly good arrangement, it can lead to a model idea. After some trials we find a model that fits the data well (after adjusting a few parameters) and meets our expectations of the model. We stress here that model building always involves some preliminary assumptions (which can be good or bad). Once you have a suitable model in hand, the task of classification or clustering is easy. As an example, let us consider a classification problem. If we have to classify a new item, we have to compare its features with those of the possible classes. If we have a model, a few relations between some features should be checked to identify the right class. This is the bird tweet phenomenon. Without a model, it is difficult (expensive in terms of algorithmic steps) to find the solution, while the model plays the role of the bird and tells us the correct answer. Imagine that we have sample points from several polynomials of order one, two and three. If we are given a new set of sample points from an unknown polynomial, even if we know the three classes, we need to compare several class elements with the new one until we find a class element that is convincingly close, similar, to the new one. Contrary, if we know the laws of the three classes, a quick model fitting yields the answer. We will see that this quick solution comes at a price. While supervised or unsupervised learning algorithms have relatively low complexity, finding a model that well characterizes the classes, the relationship between the features, is a very difficult task with much higher complexity. The difference is typically exponential, or even hyper-exponential, depending on the task.
Let us take a closer look at the learning and model building processes. We talk about training when we have a feature set of data and we want to fine-tune a combination of these features (pick out a few key ones) to optimise some behaviour (usually classification or clustering, in some cases forecasting or filling in missing data). The task of understanding, on the other hand, is to find the best data model that reveals the relationships (functions, laws) between features and best describes a problem that is set prior the whole investigation. We call this the context of model building. Training always requires an explicit loss function that decides whether the trained system behaves correctly or not. This may be rather implicit, as in the case of auto-encoders or reinforcement learning. In the process of understanding, we consider only the input data set and try to figure out the feature functions that separate our data set from the rest of the world.
More formally, in the task of understanding the representative elements of the system (sub-universe) we want to understand, are presented one by one to the model-building apparatus. It has separating and descriptive coordinates (paper [1] called them relevant and irrelevant coordinates, respectively, based on observed features). These are either kept as they are or modified to better fit the new observed element. The separation coordinates are expected to be constant 1 for the elements of the investigated sub-universe. Meanwhile, the descriptive coordinates distinguish the elements of our subset piece-by-piece. After observing a large enough number of samples, we expect to have a good understanding, and later our understanding can also be the basis for a quick classification.
Again, we emphasize that we think that there is no understanding/model building without context, prior assumptions, questions and prior elimination of almost all the aspects of the universe, except some that are the particular subject of the investigation, the problem to be solved. In problem setting, we must specify the set of objects Ω and the function space that maps the "measurements" of the elements of the space to the new "coordinates". In our abstract model, the sub-space of binary functions which has log 2 (Ω) variables should be specified. Overall, the context is given by Ω and the chosen function space.
One simplified example could be the predefined scale of the study. We may search for a model of sand in a desert. If we have bird's eye view photographs of the desert (10 6 mm scale), we can try to shift and overlap the dunes in the images and find that there is an almost periodic pattern that describes well the sand surface. The model can be formalised using trigonometric series. Time lapsed video recordings of the same scale can reveal wave-like behaviour and explain the periodicity (or falsify the static view). At a 10 1 -10 2 mm scale, surface tension, friction, avalanche effects may contribute to the description of the observation. The formalism should be based on polynomials of a few order, chaotic systems, or stochastic dynamical systems. On a scale of 10 −3 -10 −0 mm, fragmentation, particle collision processes can be captured and the formalism can be based on branching processes or dynamical systems with external potentials (forces) from statistical physics.
In general, it is not easy to decide whether a coordination is appropriate in the above sense or not. Are we satisfied with the coordination (representation) or do we need further investigation, sample elements? To facilitate this, it was suggested in [4] to associate an entropy with each representation, which is minimal if the coordinates are chosen properly. In [4], some properties of the proposed entropy function were investigated and it was demonstrated in simple examples that it indeed performs the desired task.
In the present paper, we take a closer look at the practical understanding process. First, we recall the formal model of understanding. We then show that representation entropy is minimal if and only if the representation is canonical, i.e., a representation that separates the subset and describes the elements in terms of "independent" coordinates. We propose a loss function for representations and provide theoretical and practical calculation of the type one and type two errors of representations. Finally, an estimate of the cost/complexity of the understanding procedure is given.

The Representational Entropy
We assume that Ω is a finite set, {Ω, F , P} is a probability space, and P is uniform over Ω :P(ω) = 1 |Ω| for ω ∈ Ω. For simplicity, we use sets with cardinality 2 k with k > 0 integers. |Ω| = 2 N . Let The coordinates are the binary functions: For continuous variables see the discussion in [1] .

Definition 2.
A representation of Ω 1 ⊂ Ω is a coordination in which the set Ω 1 and its complement Ω 2 = Ω\Ω 1 are distinguished by N − I 1 coordinates (binary functions), called separation coordinates (SCs), if all of them are 1 on Ω 1 , and at least one of them is zero on Ω 2 . The other I 1 coordinates called Descriptive Coordinates (DCs). Given that permutation of coordinates is irrelevant we may assume that the separation coordinates have the lower index:

Definition 3.
A representation of Ω 1 is optimal (canonical) if all the DCs as random variables are independent over Ω 1 (with respect of the uniform distribution over Ω).
Let us denote the descriptive coordinates by y = {y i } I 1 i=1 , and the function they provide by y : Ω 1 → {0, 1} I 1 as well.
It is shown in [1] that the optimal (canonical) coordination always exists and coincides with the binary labeling of the elements of Ω 1 by descriptive coordinates, that is, by labeling them with the N 1 -length binary vectors. One should observe immediately that y : Ω 1 → {0, 1} N 1 is a bijection for the canonical representation.

Definition 4.
The representation entropy is defined as follows: let p i (σ) = P Ω 1 (x i (ω i ) = σ), σ ∈ {0, 1}: For convenience, let H(x) = S Shannon (x) denote the classical Shannon entropy: where p k = P(ξ = x k ). Using that notation we have the following observation. Proof. At first, let us note that the entropy H(y|Ω 1 ) of the DCs is constant, equal to N 1 , which justifies the search for a minimum. The statement of the Lemma is just reformulation of the well-known fact that and equality holds if and only if all the z i -s are independent. We give a short proof of this statement since it is in several textbooks (e.g., [5] ) but not all the proofs are clear or complete.
If the family of random variables z = {z i } n i=1 is independent then we have the equality in (1) by the chain rule. Let us assume that the family of random variables z is not independent. In that case, there are I 1 and I 2 ⊂ {1, 2, . . . , n} such that I 1 ∩ I 2 = ∅, I 1 ∪ I 2 ⊂ {1, 2, . . . , n}, I 1 , I 2 = ∅ and x = {z i } i∈I 1 , y = z j j∈I 2 such that the variables x and y are not independent and For convenience let us reorder z i − s so that I 1 = {1, ..k}, I 2 = {k + 1, . . . , n}. Using that notation we have that which shows the reverse implication.

Search
Let us assume that we have a given coordination x. Without loss of generality, we can assume that the first N − I 1 coordinates are going to represent Ω 1 with value 1-s. The type one error is: and the type two error is: The loss function is then defined as where λ, µ ≥ 0 regularizing meta parameters. One should take into consideration when choosing the meta-parameters λ, µ that The theoretical values of the error probabilities are cumbersome, but interesting on their own. Their detailed calculation is given in Appendix A. The empirical estimates are easy. Similarly to the ML procedure, we split the sample into two. We use the first to create a coordination of the learning set, assuming that we have large enough sample then with the second part we calculate the estimate of the type one error. If we have access to non Ω 1 elements, then we can calculate the estimate of the probability of type two error as well.

Costs
The representation cost can also be a factor in the design of the coordination. There are different costs.

1.
Creation of the coordinates that include the presentation of ω-s, building of the x i coordinates and, at least once, the calculation of the ready coordinates.

2.
Storage of the coordinate functions.

3.
Calculation of the representation entropy, type one and type two errors.

4.
Cost of decision if ω ∈ Ω 1 and full coordination of ω.
Let Ω 2 ⊂ Ω 1 be the set of presented items. The creation of a complete coordination of Ω 2 needs the presentation of |Ω 2 | = 2 N 2 elements. The canonical, complete representation contains r 2 = N − N 2 functions x i | Ω 2 ≡ 1 and further N 2 functions which maps to the lexicographical ordered binary vectors of length N 2 . All the functions should be different. A very conservative estimate of the construction cost is the typical length l(N) of a binary function of N inputs: (c.f. [6]) Of course if the function class is restricted, much lower cost is possible, but without the explicit knowledge of that we can not incorporate into our estimate. We need N such functions (where the choice of the value of N is based on expert guess, suggestion, external parameter of the process, can not be specified by the procedure). The cost of learning C L is estimated by where c ≥ 1 given that each constructed function should be calculated at least once for each ω ∈ Ω 2 . The storage and recall of the representation needs at least The representation entropy for a complete representation is fixed ( 1/M 2 ), no calculation needed. Type one error can be estimated with the observed relative frequency which is based on the investigation of M 3 number of Ω 1 elements. That needs C test = number of items × number of functions × cost of a calculation of a function ∼ c M 3 (N − N 2 )2 N / log 2 N steps. If we have access to, or can create, Ω\Ω 1 elements for test purposes, a similar estimate can be given for the cost of the estimate of type two error. Finally, from the above consideration, it is clear that for an ω the decision if ω ∈ Ω 1 needs C decision ∼ (N − N 2 )2 N / log 2 N steps and the full representation needs C f ull ∼ N2 N / log 2 N steps.

Summary and Discussion
The goal of understanding is to build a data model by which we can make a distinction between the "essential" an "unimportant" items.
Following [1], we identify the set of essential data through their characteristic features. This means that we calculate some quantities that have a given value for our distinguished set. We should build our data representation based on the characteristic and irrelevant features, which are called separation and descriptive coordinates in this paper. Once we know all the separation coordinates, we can identify our singled out set with 100% certainty.
In practice, however, it is very tedious to find this coordination and verify if it is optimal. It is much easier if we have a real valued function (loss function) that is minimal for the correct data representation. In this work, we have proven with mathematical rigour that the representation entropy defined in Definition 4 has exactly this property.
It is worth to emphasize that this representation quality measure is of statistical nature. While the generally used loss functions can be computed for each input, the use of representation entropy requires the knowledge of the bitwise probability distribution of the outputs, and thus it is available only after having seen enough examples from the given set. Once we know the bitwise probabilities, we can compute the representation entropy with the psuedo-code given by Algorithm 1.

Algorithm 1 Representation Entropy from bitwise probability distributions
p are the bitwise probabilities of 1.
end for 6: return S 7: end procedure In this paper, we worked in an abstract world with complete information, meaning that we know all possible inputs, and we can approach all elements of our specific set. In this case, we can find the best data representation using the psuedo-code of Algorithm 2, based on evolution algorithm.

8:
if S(p) < S best then 9: P best ← P 10: end if 11: end for 12: return P best • R 0 This algorithm works because the minimal representation entropy corresponds to the best coordination of the data.
In fact, in the abstract world we can also construct the best representation (c.f. [1]) as follows. The original R 0 representation associates an integer number to all elements, interpreting its binary coordinates as digits of a binary number. Here the Ω 1 set usually shows up randomly. However, we can find a permutation P that sends the element of Ω 1 to the top of the list. Then P • R 0 associates the numbers 0, 1, . . . 2 N 1 − 1 with the elements of Ω 1 , i.e., only its first N 1 bits can be different from one. In this representation, therefore, the first N 1 bits are uniformly distributed, the higher bits are all one, thus the representation entropy is N 1 = S Shannon . The type one and type two errors are all zero. Therefore P best = P.
This solution, however, is not feasible in practice, since we do not have complete information. Then we shall use the given pseudo-code, with two modifications. The first is that we can not sample the whole {0, 1} N → {0, 1} N bijections, since the cardinality of this space is 2 N !, which is usually too big to be parametrized. In practice, therefore, we shall be content with a relatively small subset of all the representations, which we should choose very carefully. For an appropriate choice of the function class, we can take into account a number of arguments, we shall examine the symmetry of the system, we may consider simple function classes (such as linear or low order polynomials), and we shall rely on the field knowledge, accumulated common wisdom in the given area.
The other reason is that usually we do not know the complete Ω and Ω 1 sets, we only see some samples of it. In fact we do not even know the cardinality of Ω 1 in general.
To mitigate the problem of the limited number of observations in Ω 1 , we may apply methods that potentially improve the convergence. In this paper we have suggested to incorporate into the cost function the type one and type two errors. That may be useful from this point of view.
Even with the best strategy, if we do not have complete information, the fixed values of the separation coordinates determine an Ω = Ω 1 subset. Then a number of questions arise, such as what is the relation of Ω and Ω 1 , how to choose the set of available coordinations, how many separation and descriptive coordinates are worth to keep, and so on. These questions are the subject of our future studies.  Let R i (i = 1, 2) denote the index set of SCs for Ω i -s and assume that l = |R 1 ∩ R 2 |. First, we calculate P l the probability that is based on the Ω 2 data that we have a representation which has l common SCs with R 1 .
where m = N − r 1 − r 2 + l = N 2 − r 1 + l. The type one error is for an ω ∈ Ω 1 is that we decide oppositely based on R 2 .
we have The type two error takes place if x| N 1 ≡ 1 but ω / ∈ Ω 1 . We now count the possible cases when the selective bits of Ω 2 and of Ω 1 overlap at most in N − N 1 − 1 places and produce false positive result. Let again m = N − r 1 − r 2 + l, then Given that the true size of Ω and Ω 1 is typically unknown, the theoretical values of the type one and two errors can not be calculated. On the other hand, the empirical values can be obtained based on a proper sample.