A Fundamental Scale of Descriptions for Analyzing Information Content of Communication Systems

The complexity of a system description is a function of the entropy of its symbolic description. Prior to computing the entropy of the system description, an observation scale has to be assumed. In natural language texts, typical scales are binary, characters, and words. However, considering languages as structures built around a certain preconceived set of symbols, like words or characters, is only a presumption. This study introduces the notion of the Description Fundamental Scale as a set of symbols which serves to analyze the essence of a language structure. The concept of Fundamental Scale is tested using English and MIDI music texts by means of an algorithm developed to search for a set of symbols which minimizes the observed entropy of the system, and therefore best expresses the fundamental scale of the language employed. Test results show that it is possible to find the Fundamental Scale of some languages. The concept of Fundamental Scale, and the method for its determination, emerges as an interesting tool to facilitate the study of languages and complex systems.


Introduction
The understanding of systems and their complexity requires accounting for their entropy. The dependence of information on the scale of observation has become a topic of discussion since it reveals much of the nature and structure of systems. Bar-Yam [1] and Bar-Yam et al. [2] have proposed the concept of complexity profile as a useful tool to study systems at different scales. Among others, Lopez-Ruiz et al. [3], and criterion, as is the minimization of the entropy. This study develops a series of algorithms to recognize the set of symbols that, according to their frequency, leads to a minimum entropy description. The method developed in this study mimics the evolution process of a simplified communication system. The proposed algorithm is tested with a short example of English text and with two full descriptions: the first is an English text and the second a musical instrument digital interface (MIDI) music file. This representation of the components may convey a description of a system and its structural essence.

A Quantitative Description of a Communication System
A version of Shannon's entropy formula, generalized for communication systems comprised of symbols, is used to compute the quantity of information in a descriptive text. To determine the symbols that make up the sequential text, a group of algorithms was developed. These algorithms are capable of recognizing the set of symbols which form the language used in the textual description. The number of symbols represents not only the diversity of the language but also the fundamental scale used for the system description.

Quantity of Information for a D-nary Communication System
We refer to language as the set of symbols used to construct a written message. The number of different symbols in a language will be referred to as the diversity $D$.
To compute the entropy $h$ of a language, that is, the entropy of the set of $D$ different symbols, each used with a probability $p_i$ to form a written message, we use Shannon's entropy expression, normalized to produce values between zero and one:

$$h = -\sum_{i=1}^{D} p_i \log_D p_i \qquad (1)$$

Note that the base of the logarithm is equal to the language's diversity $D$, whereas the classical Shannon expression uses 2 as the base of the logarithm, which is also the diversity of the binary language that he studied. Researchers such as Zipf [12], Kirby [13], Kontoyiannis [6], Gelbukh and Sidorov [14], Montemurro and Zanette [7], Savoy [15], Febres, Jaffe and Gershenson [9] and Febres and Jaffe [16], among others, have studied the relationship between the structure of some human and artificial languages, and the symbol probability distribution corresponding to written expressions of each type of language.
All these studies assume that symbols are characters or words. In the present study we leave freedom to group adjacent characters into symbols in order to minimize the entropy $h$ as expressed in Equation (1). In the following sections we explain this optimization problem and our approach to finding a solution reasonably close to the set of symbols that produces an absolute minimum entropy.
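As an illustration of Equation (1), the normalized entropy can be computed directly from symbol frequencies. The following is a minimal sketch, not the implementation used in this study; the function name `normalized_entropy` and the toy message are assumptions made for the example, and the case of a single symbol is taken to have zero entropy by convention.

```python
import math
from collections import Counter


def normalized_entropy(frequencies):
    """Shannon entropy with the logarithm taken in base D (the diversity)."""
    counts = [f for f in frequencies if f > 0]
    diversity = len(counts)
    if diversity < 2:
        return 0.0  # assumed convention: a single symbol leaves no uncertainty
    total = sum(counts)
    return -sum((f / total) * math.log(f / total, diversity) for f in counts)


# Hypothetical toy message observed at the character scale.
message = "abracadabra"
h_characters = normalized_entropy(Counter(message).values())
print(round(h_characters, 3))
```

Because the logarithm base equals the diversity, a uniform distribution over any number of symbols yields an entropy of one, which is what makes values comparable across languages of different diversity.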

Scale and Resolution
We propose a quantitative concept of scale: the scale of a system equals the diversity of the language used for its description. Thus, for example, if a picture is made with all available colors in an 8-bit color map of pixels, then the diversity of the color language of the picture would equal $2^8 = 256$, and the scale of the picture description, considering each color as a symbol, would also be $2^8$. Another example would be a binary language, a scale-2 communication system made up of only two symbols. Notice we have used the term "communication system" to refer to the media used to code information. Interestingly, the system's description scale is determined, in the first place, by the observer, and to a much smaller degree by the system itself. The presumably high complexity of a system, functioning with the actions and reactions of a large number of tiny pieces, simply dissipates if (a) the observer or the describer fails to see the details, (b) the observer or describer is not interested in the details and prefers to focus on the macroscopic interactions that regulate the whole system's behavior, or (c) the system does not have sufficient different components, which play the role of symbols here, to refer to each type of piece. It is clear that any observed system scale implies the use of a certain number of symbols. It is also clear that the number of different symbols used in a description is linked with our intuitive idea of scale. There being no other known quantitative meaning of the word scale, we suggest its use as a descriptor of languages by specifying the number of symbols forming them.
Resolution specifies the maximum accuracy of observation and defines the smallest observable piece of information. In the computer coded files we used to interpret descriptions, we consider the character as the smallest observable and non-divisible piece of information.
Let $\ell$ denote the physical space that a symbol or a character occupies, and let the sub-index signal the object being referred to. Thus, considering a written message $M$, constructed using $D$ different symbols $Y_i$ as $M = \{Y_1, Y_2, \ldots, Y_D\}$, we would say the message occupies the space $\ell_M$ and each symbol $Y_i$ occupies the space $\ell_{Y_i}$. We define the length of all characters to be equal to one. Therefore $\ell_{C_j} \equiv 1$ for any character $C_j$. Finally, if the number of characters in a message is $N$, each symbol $Y_i$ appears $f_i$ times within the message, and the symbol diversity is $D$, we can write the following constraints over the number of characters, symbols and the space they occupy:

$$N = \ell_M = \sum_{i=1}^{D} f_i \, \ell_{Y_i} \qquad (2)$$
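The bookkeeping behind Equation (2) can be checked mechanically: a valid set of symbols, weighted by frequency and length, must fill exactly the space of the message. The sketch below assumes that a segmentation is given as a list of symbol instances in reading order; the function name and the example segmentation are hypothetical.

```python
from collections import Counter


def conserves_space(message, symbols):
    """True if the symbol instances rebuild the message and fill its length."""
    frequencies = Counter(symbols)
    # Sum over different symbols of (frequency x length), as in Equation (2).
    occupied = sum(f * len(symbol) for symbol, f in frequencies.items())
    return "".join(symbols) == message and occupied == len(message)


message = "the cat and the hat"
segmentation = ["the", " ", "c", "at", " ", "and", " ", "the", " ", "h", "at"]
print(conserves_space(message, segmentation))
```

Dropping or duplicating any symbol instance breaks the equality, which is the property the later stages of the algorithm have to preserve.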

Looking for a Proper Language Scale
We see the scale of a language as the finite set of symbols that "best" serves to represent a written message. The qualification "best" refers to the capacity of the set of symbols to convey the message with precision in the most effective way.
Take for example the western natural languages. Among their alphabets, there are only minor differences; too few differences to explain how far from each other those languages are. As Newman [17] observes, some letters may be the basic units of a language, but there are other units formed by groups of letters.
Chomsky's syntactic structures [18], later called context-free grammar (CFG) [19], offer another representation of natural language structure. The CFG describes rules for the proper connections among words according to their specific function within the text. Thus, CFG is a grammar generator useful to study the structure of sentences. Chomsky himself treats a language as an infinite or finite set of sentences. CFG works at a much larger scale than the one we are looking for in this study.
Regarding natural languages, it is common to think that a word is the group of characters between a leading and a trailing blank space. At some time a meaning was assigned to that word, and thereafter the word's meaning, as well as its writing, evolves and adopts a shape that works well for us, the users of that language. Zipf's principle of least effort [14] and Flesch's reading ease score [20] certainly give indications about the mechanisms guiding words, as written symbols, to reduce the number of characters needed to represent them.
From a quantitative linguistics perspective, this widely accepted method for recognizing words offers limited applicability. Punctuation signs, for example, have a very precise meaning and use. The frequency of their appearance in any western natural language competes with that of the most common words in English and Spanish [21]. However, punctuation signs are very seldom preceded by a blank space and are normally written with just a single character, which promotes the false idea that they function like letters from the alphabet; they do not. They have meaning just as common words do. Another situation revealing the inconvenience of this natural but too rigid conception of words is the English contraction formed with the apostrophe. It is difficult to count the number of words in the expression "they're". How many words are there, one or two? See Febres et al. [21] for a detailed explanation of English and Spanish word recognition and treatment for quantification purposes.
Intuitively, the symbols forming a description written in some language should be those driving the whole message to low entropy when computed as a function of the symbol frequencies. In this situation the message is fixed, and so are the text and the quantity of information it conveys. There appears, then, to be a conflict: while the information is constant because the message is invariant, any change to the set of symbols considered as basic units alters the computed message entropy, as if the information had changed; it has not. To solve this paradox, we return to the question asked at the beginning of this section about the meaning of "best" in the context of this discussion. From the point of view of the message emitter, the term "best" considers the efficiency of transmitting an idea. This is what Shannon's work was intended for: to determine the amount of information, estimated as entropy, needed to transmit an idea. From the reader's point of view the economy of the problem works differently. The reader's problem is to interpret the message received so as to maximize the information extracted. In other words, the reader focuses on the symbols which turn the script into an organized, and therefore easier to interpret, message. If the reader is a human and there are words in the message, the focused symbols are most likely words because those are the symbols that add meaning for this kind of reader. But if there existed the possibility of selecting another set of symbols which makes the message look even more organized, the reader would rather use this set of symbols because it would require less effort to read.
In conclusion, what the reader considers "best" is the set of symbols that maximizes the organization of the message, while for the sender "best" means the set of symbols needed to minimize the disorder of the message and thus the quantity of information processed. These statements are expressed as objective functions in Equations (3), where the best set of symbols is named $B$, the message is $M$, the message entropy is $h_M$ and the message organization is $(1 - h_M)$:

Sender's objective: $\min_{B} \; h_M$
Receiver's objective: $\max_{B} \; (1 - h_M)$ (3)

Following this reasoning, "best" means the same for both sides of the communication process. This may have important implications when considering languages as living organisms or colonies of organisms. Both parts of the communication process push the language to evolve in the same direction: augmenting self-organization and reducing the entropy of the messages. Both come together. Self-organization can be seen as one of the evolving directions of languages. Thus, self-organization is an indirect way to measure how deeply evolved a language is and what its capacity is to convey complex ideas or sensations. Finally, an objective function to search for the most effective set of symbols (the set with minimal entropy) to describe a language has been found. It will be used to recognize the set of symbols that best describes a language used to write a description.
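The equivalence of the two objectives can be illustrated by scoring a few candidate sets of symbols for the same message. This is only a sketch under stated assumptions: the three candidate segmentations are hand-made for the example, and the helper names are hypothetical. Each candidate rebuilds the message exactly, so only the choice of symbols differs.

```python
import math
from collections import Counter


def entropy(symbols):
    """Normalized entropy (logarithm base D) of a list of symbol instances."""
    counts = Counter(symbols)
    diversity, total = len(counts), len(symbols)
    if diversity < 2:
        return 0.0
    return -sum((f / total) * math.log(f / total, diversity)
                for f in counts.values())


message = "to be or not to be"
candidates = {
    "characters": list(message),
    "words": ["to", " ", "be", " ", "or", " ", "not", " ", "to", " ", "be"],
    "mixed": ["to be", " ", "or", " ", "not", " ", "to be"],
}
entropies = {name: entropy(symbols) for name, symbols in candidates.items()}

# The sender minimizes entropy; the receiver maximizes organization (1 - h).
sender_choice = min(entropies, key=entropies.get)
receiver_choice = max(entropies, key=lambda name: 1 - entropies[name])
print(sender_choice, receiver_choice)
```

Both sides select the same candidate, since maximizing $(1 - h)$ and minimizing $h$ rank the candidates identically.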

Language Recognition
Consider a description consisting of a message $M$ built up with a sequence of $N$ characters or elementary symbols. The message $M$ can be treated as an ordered set of characters $C_j$ as:

$$M = \{C_1, C_2, \ldots, C_N\}$$

The symbol probability distribution $P(Y)$ can be obtained by dividing the frequency distribution $F$ by the total number of symbols in the message:

$$P(Y_i) = \frac{f_i}{\sum_{j=1}^{D} f_j}$$

Language $B$, used to convey the message $M$, can now be specified as the set of $D$ different symbols $Y_i$ and the probability density function $P(Y)$ which establishes the relative frequencies of appearance of the symbols $Y_i$. Each symbol $Y_i$ is constructed with a sequence of contiguous characters as indicated in Equation (6). The set of symbols that describes the message with the least entropy comes from the solution of the following optimization problem:

$$\min_{B} \; h = -\sum_{i=1}^{D} P(Y_i) \log_D P(Y_i) \qquad (8)$$

subject to the restrictions that every symbol be a sequence of contiguous characters and that the symbols, with their frequencies and lengths, occupy exactly the space of the message as stated in Equation (2). The resulting language will be the best in the sense that it is the set of symbols that offers a maximum organization of the message. The symbol lengths will range from a minimum to a maximum, defining a distribution of symbol lengths characteristic of this scale of observation, which is referred to as the Fundamental Scale.

The Algorithm
The optimization problem (8) is highly nonlinear and its restrictions are coupled. A strategy for finding a solution has been devised. It is a computerized process composed of text-string processing, entropy calculations, text-symbol ordering and genetic algorithms. Given a description consisting of a text of $N$ characters, the purpose of the algorithm is to build a set of symbols $B$ whose entropy is close to a minimum. The process forms symbols by joining as many as $V$ adjacent characters in the text. A loop in which $V$ is kept constant controls the size of the symbols being incorporated into language $B$. The process ends when the maximum symbol length of $V_{max}$ characters has been considered to form symbols. We add a subindex to the language, writing $B_V$, to indicate the symbol size $V$ considered at each stage of its construction. We have defined several sections of the algorithm and named them according to their similarity with a system where each symbol appears and ends up being part of a language only if it survives the competition it must sustain against other symbols. A pseudo-code of the fundamental scale algorithm is included in Appendix A.

Base Language Construction
In the first stage, the message $M$ is separated into single characters. The resulting set of characters, along with their frequency distribution, constitutes the first attempt to obtain a good language and will be denoted as $B_1$. The sub-index indicates the maximum length that any symbol can achieve.

Prospective Symbol Detection
The prospective symbol detection consists of scanning the text looking for strings of exactly $V$ characters. All $V$-long strings are considered as prospective symbols to join the previously constructed language $B_{V-1}$, made of strings of up to $V - 1$ characters. The idea is to find all possible different $V$-long strings present in the message $M$ which, after complying with some entropy reduction criteria, would complement language $B_{V-1}$ to form language $B_V$.
To cover all possibilities of character sequences forming symbols of length equal to V, several passes are done over the text. The difference from one pass to another is the character where the initial symbol starts, which will be called the phase of the pass. Figure 2 illustrates how the strategy covers all possibilities of symbol instances for any symbol size specification V.

Figure 2.
Examples of reading a text to recognize prospective symbols with a sliding window of SymbolSize = 4 and reading Phase = 0, 1, and 3. Phase = 2 not shown. The message: "xMTrkbhÿXbÿYÿQñÖZbQñêrÿQÞgzÿQËQbØQËlÿQÿñQñpMTrkÿQ€ÿÿQ".
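The multi-pass reading illustrated in Figure 2 can be sketched as follows. This is a minimal illustration assuming that each pass cuts the text into consecutive, non-overlapping windows of the given size starting at the phase offset; the function name and the short sample string are hypothetical.

```python
from collections import Counter


def prospective_symbols(text, size):
    """Count every `size`-long string found by reading `text` at each phase."""
    found = Counter()
    for phase in range(size):
        # One pass: consecutive windows of `size` characters starting at `phase`.
        for start in range(phase, len(text) - size + 1, size):
            found[text[start:start + size]] += 1
    return found


candidates = prospective_symbols("MTrkabMTrkcd", size=4)
print(candidates.most_common(3))
```

Taken together, the passes at phases 0 to size - 1 visit every starting position once, so no string of the requested length is missed.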

Symbol Birth Process
Prospective symbols detected in the previous stage whose likelihood of being entropy-reducing symbols is presumed too low are discarded and never inserted into the language. Interpreting the entropy of Equation (1) as the summation of the uncertainty contributions of each symbol, we can intuit that minimum total uncertainty, that is, minimum entropy, occurs when each symbol's uncertainty contribution is about the same. Thus, the contribution of any prospective symbol must be close to the average uncertainty per symbol in order to have some opportunity to actually reduce the entropy after its insertion. Writing $u_i = -p_i \log_D p_i$ for the contribution of symbol $Y_i$, the average contribution $\bar{u}$ of the uncertainty per symbol can be estimated as:

$$\bar{u} = \frac{h}{D} \qquad (9)$$

This leads us to look for symbols complying with the condition shown in Equation (10), and to save processing time whenever a prospective symbol is not within a band of width $2\lambda$ around the average uncertainty value:

$$\frac{h}{D} - \lambda \;\le\; u_i \;\le\; \frac{h}{D} + \lambda \qquad (10)$$

Parameter $\lambda$ can be adjusted to avoid improperly rejecting entropy-reducing symbols, or to operate on the safe side at the expense of processing time.
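A minimal sketch of this screening step is given below. It rests on two assumptions that the text does not settle: the band is taken to be additive (the contribution of the prospective symbol may deviate from the average by at most λ), and the probability of the prospective symbol is estimated as its count divided by a total symbol count. The function name and parameters are hypothetical.

```python
import math


def passes_birth_filter(candidate_count, total_symbols, entropy, diversity, lam):
    """True if the candidate's uncertainty lies within lam of the average."""
    p = candidate_count / total_symbols          # assumed probability estimate
    contribution = -p * math.log(p, diversity)   # the candidate's term in Equation (1)
    average = entropy / diversity                # average contribution per symbol
    return abs(contribution - average) <= lam    # assumed additive band


# Hypothetical language: 4 equiprobable symbols, hence entropy 1 and average 0.25.
print(passes_birth_filter(25, 100, entropy=1.0, diversity=4, lam=0.05))
print(passes_birth_filter(1, 1000, entropy=1.0, diversity=4, lam=0.05))
```

A wider band admits more candidates to the costlier insertion and survival steps; a narrower one saves time at the risk of discarding a useful symbol.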

Conservation of Symbolic Quantity
The inclusion of prospective symbols into the arrays of symbols representing the language $B$ is performed so as to avoid any overlap between the newly inserted symbols and the symbols already existing in the language. Therefore, every time a prospective symbol is inserted into the stack of symbols, the instances of former symbols occupying the space of the new symbol must be released. Sometimes the freed string is only a fraction of a previously existing symbol. Thus, the insertion of a symbol may break up other symbols, generating empty spaces into which recovered symbols must be reinserted in order to keep the original text intact.
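The release and recovery described above can be sketched as a re-tokenization of the message. The sketch assumes a simple recovery rule that is not taken from the study: fragments of broken symbols are reinserted as single characters. Instances of the new symbol are located left to right without overlapping each other; the function name is hypothetical.

```python
def insert_symbol(tokens, symbol):
    """Insert `symbol` into a segmentation, breaking up the symbols it overlaps."""
    if not symbol:
        raise ValueError("symbol must be a non-empty string")
    text = "".join(tokens)
    claimed = [False] * len(text)   # characters taken by the new symbol
    starts = set()                  # positions where an instance begins
    pos = text.find(symbol)
    while pos != -1:
        starts.add(pos)
        for k in range(pos, pos + len(symbol)):
            claimed[k] = True
        pos = text.find(symbol, pos + len(symbol))
    result, offset = [], 0
    for token in tokens:
        end = offset + len(token)
        if not any(claimed[offset:end]):
            result.append(token)             # former symbol left untouched
        else:
            for k in range(offset, end):
                if k in starts:
                    result.append(symbol)    # the newly inserted symbol
                elif not claimed[k]:
                    result.append(text[k])   # recovered single-character fragment
        offset = end
    return result


before = ["th", "e ", "ca", "t"]
after = insert_symbol(before, "e c")
print(after)
```

Whatever the recovery rule, the invariant to preserve is that joining the resulting symbols reproduces the original text.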

Symbol Survival Process
A final calculation is performed to confirm the entropy reduction achieved after the insertion of a symbol into the language being formed. Symbols that do not produce an entropy reduction are rejected, and the language $B$ is reverted to its condition prior to the last insertion.
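The accept-or-revert decision can be sketched as a comparison of entropies before and after a trial insertion. The example below assumes that both the current and the trial segmentations are available as lists of symbol instances; the names and the toy segmentations are hypothetical.

```python
import math
from collections import Counter


def entropy(symbols):
    """Normalized entropy (logarithm base D) of a list of symbol instances."""
    counts = Counter(symbols)
    diversity, total = len(counts), len(symbols)
    if diversity < 2:
        return 0.0
    return -sum((f / total) * math.log(f / total, diversity)
                for f in counts.values())


def survive(current, trial):
    """Keep the trial segmentation only if it lowers the entropy; else revert."""
    return trial if entropy(trial) < entropy(current) else current


characters = list("ababab cd")
with_ab = ["ab", "ab", "ab", " ", "c", "d"]           # frequent symbol inserted
with_aba = ["aba", "b", "a", "b", " ", "c", "d"]      # rare symbol inserted

print(survive(characters, with_ab) is with_ab)
print(survive(characters, with_aba) is characters)
```

In this toy case the frequent string "ab" survives, while the single instance of "aba" raises the entropy and is rejected.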

Controlling Computational Complexity
The computational complexity of this algorithm is far beyond polynomial. A rough estimation sets the number of steps required above the factorial of the diversity of the language treated. Thus, segmenting the message into shorter pieces allows the algorithm to find a feasible solution and to keep processing times affordable for large texts. This strategy is in fact a sort of parallel processing which reduces the computational cost of the algorithm enough to make it an applicable tool. A complex system software platform has been developed along with this study to deal with the complexities of this algorithm, and with the structure needed to keep a record of every symbol of each description within a corpus of very many texts. This experimental software is named Monet, and a brief description of it can be found in [21].
The noise introduced when cutting the original description into pieces is limited. At most two symbols may be fractured for each segment, which is very low compared to the number of symbols making up each segment. The algorithm calculates the entropy of each description chunk. But, as Grabchak et al. [22] explain, the estimation of the description's entropy must consider the bias introduced when short text samples are evaluated. Taking advantage of the extensive list of symbols and frequencies available and organized by means of the software Monet, we instead calculated the description entropy using the joint sets of symbols of all the description partitions, thereby re-forming the whole description. As a result, no bias has to be corrected.
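The joint calculation can be sketched as a merge of the per-segment symbol counts before the entropy is computed. This is an illustration only; it does not reproduce Monet, and the function names and the two tiny segments are hypothetical.

```python
import math
from collections import Counter


def entropy_from_counts(counts):
    """Normalized entropy (logarithm base D) from a symbol -> frequency map."""
    diversity, total = len(counts), sum(counts.values())
    if diversity < 2:
        return 0.0
    return -sum((f / total) * math.log(f / total, diversity)
                for f in counts.values())


def joint_entropy(segment_symbol_lists):
    """Merge the symbol counts of all segments, then compute one entropy."""
    joint = Counter()
    for symbols in segment_symbol_lists:
        joint.update(symbols)
    return entropy_from_counts(joint), joint


segments = [["a", "b"], ["a", "c"]]
h_joint, joint_counts = joint_entropy(segments)
h_per_segment = [entropy_from_counts(Counter(s)) for s in segments]
print(round(h_joint, 3), h_per_segment)
```

Each short segment taken alone looks maximally disordered here, whereas the merged counts give a lower value, which shows why per-segment entropies of short samples are not simply averaged.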

Tests and Results
In order to compare the differences obtained when observing a written message at the scales of characters, words and the fundamental scale, we designed an Example Text. Table 1 shows the symbols obtained after the analysis of the Example Text at the three observation scales used in this study. The entropies calculated at the scales of characters and words were 0.81 and 0.90 respectively, the entropy at the fundamental scale was 0.76; an important reduction of the information required to describe the same message.
These results also agree with our intuition. Clearly, the selection of a certain character string as a fundamental symbol is favored by the frequency of appearance of that string. As a result, the "space character" (represented as ø in the table) is recognized as the most frequent fundamental symbol. It is indeed an important structural piece in any English text, since it defines the beginning and the end of natural words. The length of the string of characters also favors the survival of the symbol in its competition with other prospective symbols. The string "describ", for example, appears twice in the Example Text and the algorithm recognized it as a symbol. On the other hand, the 11-char long string "An adverb" also appears two times, but the algorithm found it more effective in reducing the overall entropy to break that phrase apart and increase the appearances of other symbols. A similar case is that of the word "adverb", which appears in nine instances (not including those written with the first letter capitalized) in the Example Text. But the entropy minimization found a more important entropy reduction by splitting the word "adverb" into shorter and more frequent symbols such as "dv" (10 times), or characters such as "e" (70 times), "a" (40 times), "r" (33 times), and "b" (12 times).
In another experiment, we contrasted two different types of communication systems by performing tests over full real messages. The first test is based on a text description written in English and the second is based on the text file associated with music coded using the MIDI format. The English text is a speech by Bertrand Russell given in 1950 during the Nobel Prize ceremony. The MIDI music is a version of the 4th movement of Beethoven's ninth symphony. The sizes of these descriptions are near the limit of applicability of the algorithm. English descriptions of 1300 words or less can be processed in short times of less than a minute. Larger English texts have to be segmented using the criteria for controlling computational complexity mentioned in Section 3.6 to reach reasonable working times. Bertrand Russell's speech was split into seven pieces. For MIDI music files, the processing times increase sharply for music pieces lasting about 3 min or more. The version of the 4th movement of Beethoven's ninth symphony used is a 25 min long piece. It was necessary to process it by splitting it into 20 segments.
To reveal the differences between descriptions when observed at different scales, symbol frequency distributions were produced. For the English text, characters, words and the fundamental scale were applied. For the MIDI music text, distributions at the character and fundamental scales were constructed. Words do not exist as a scale for music. The corresponding detailed set of fundamental symbols can be seen in Appendix B. The frequency distributions were ordered by the frequency rank of the symbols; the profiles obtained are thus Zipf's profiles. Table 2 shows the length $N$, the diversity $D$ and the entropy $h$ obtained for these two descriptions analyzed at several scales, and Figure 3 shows the corresponding Zipf's profiles for Bertrand Russell's English speech and the 4th movement of Beethoven's 9th Symphony. The profiles of both descriptions are presented at the scales at which they were analyzed: the character scale and the fundamental scale for both English and music, and the word scale only for English.
In Figures 3a and 3b, the character scale exhibits the smallest diversity range. Taking only the characters as allowable symbols leaves out any possibility of combination to form more elaborate symbols, and excludes any possibility of representing how the describing information of a system arranges itself to create what could be loosely called the "language genotype". Allowing the composition of symbols as the conjunction of several successive characters dramatically increases the diversity of symbols.
The selection of the symbols to build an observation scale under the criterion of minimizing the entropy of the resulting frequency distribution bounds the final symbolic diversity of the scale, while capturing a variety of symbols that represents the way characters are organized to form the language structure. The fundamental scale appears as the most effective scale, since with it the original message can be represented with the most compressed information, expressed as the lowest entropy measured for all scales in both communication systems evaluated.
Any scale of observation has a correspondence with the size of the symbols focused on at that scale. When that size is the same for all symbols, the scale can be regarded as a regular scale and specified by indicating its size. If, on the contrary, the scale does not correspond to a constant symbol size, then a symbol frequency distribution based on the sizes is a valid depiction of the scale. That is the case for the scale of words for English texts and for the fundamental scale in our two examples. Figures 4 and 5 show those distributions and are useful for interpreting the fundamental scales of both examples.
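A Zipf's profile is simply the list of symbol frequencies ordered by rank. The sketch below builds one from a list of symbol instances; the function name is hypothetical, and breaking ties alphabetically is an assumption made only to keep the output deterministic.

```python
from collections import Counter


def zipf_profile(symbols):
    """Return (rank, symbol, frequency) triples, most frequent symbol first."""
    counts = Counter(symbols)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, symbol, f) for rank, (symbol, f) in enumerate(ordered, start=1)]


profile = zipf_profile(list("aabbbc"))
print(profile)
```

Plotting frequency against rank on logarithmic axes gives the kind of profile discussed in this section.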
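For a scale with symbols of varying size, the distribution of symbol lengths can be tabulated as sketched below. Whether lengths are counted over symbol instances or over different symbols is not settled by the text; this example assumes instances, and the function name is hypothetical.

```python
from collections import Counter


def length_distribution(symbols):
    """Number of symbol instances observed for each symbol length."""
    return dict(sorted(Counter(len(s) for s in symbols).items()))


print(length_distribution(["the", " ", "c", "at", " ", "the"]))
```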

Discussion
The results clearly show that the computed entropy content of a communication system varies in important ways depending on the scale of analysis. Looking at a language at the scale of characters provides a different picture than examining it at the level of words, or at the fundamental scale described here. Thus, in order to compare different communication systems, we need to use a similar scale applicable to each communication system. We showed that the fundamental scale presented here is applicable to very different communication systems, such as music, computer programs, and natural languages. This allows us to perform comparative studies of the entropy of these systems and thus to draw inferences about the relative complexity of different communication systems.
In both examples analyzed, the profiles at the scale of characters and at the fundamental scale run close to each other from the most frequent symbols down to the symbols with a rank placed near the middle of the logarithmic scale. For points with lower ranking, the fundamental-scale profile extends its tail toward the region of low symbol frequencies. The closeness of the fundamental-scale and character-scale profiles in the high frequency region indicates that the character-scale language $B_1$ is a subset of the fundamental-scale language. The language at the fundamental scale, having a greater symbolic diversity and therefore more degrees of freedom, finds a way to generate a symbol frequency distribution with a lower entropy as compared to the minimal entropy distribution when the description is viewed at the scale of words. Focusing on the fundamental-scale profiles, the symbols located in the lower rank region, the tail of the profile, tend to be longer symbols formed by more than one character. These multi-character symbols, which cannot exist at the character scale, are formed at the expense of instances of single-character symbols typically located in the profile's head. This explains the nearly constant gap between the two profiles in the profiles' heads.
The English description, observed at the scale of words, produces a symbol profile incapable of showing short symbols (fragments of a word) which would represent important aspects of a spoken language, such as syllables and other typical fundamental language sounds. At the opposite extreme, observing at the character scale forbids considering strings of characters as symbols; thus meaningful words or structures cannot appear at this scale, and important information about the structure of the described system is missed.
The fundamental scale, on the other hand, appears as an intermediate scale capable of capturing the essence of the most elementary structure of a language, such as its alphabet, as well as larger structures which represent the result of language evolution on its way to forming more specialized and complex symbols. The same applies to the MIDI representation of music. There is no word scale for music, but clearly the character scale does not capture the richness that is undoubtedly present in this type of language.
Another difference between the fundamental scale and other scales is the sensitivity to the order of the symbols as they appear in the text. At the scale of words or the scale of characters, the symbol frequency profile does not vary with the symbol order. The profiles depend only on the number of appearances of each symbol, word or character, depending on the scale in question. The profile built at the fundamental scale does change as the symbol order is altered, not because of the symbol order itself, but because the symbol set recognized as fundamental changes when the order of words or characters is modified. As a consequence, the character and word scales do not have any sense of grammar. The fundamental scale and its corresponding profile, on the other hand, are affected by the order in which words are organized, or disorganized, and are therefore sensitive to the rules of grammar. Other communication systems may not have words, but they must have some rules or the equivalent of a grammar. Assuming rigid rules such as a fixed symbol size or symbol delimiters seems to be a barrier when studying the structure of system descriptions.
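The contrast in order sensitivity can be illustrated with a small experiment. Counts of fixed-length character strings are used here only as a stand-in for the candidate symbols the fundamental scale method searches over; they are not the method itself. The sample sentence and the function name are hypothetical.

```python
from collections import Counter


def ngram_counts(text, n):
    """Counts of all n-character strings occurring in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


original = "the cat sat on the mat"
reordered = "".join(sorted(original))  # same characters, order destroyed

# The character-scale profile ignores order ...
print(Counter(original) == Counter(reordered))
# ... whereas the multi-character strings available as symbols do not.
print(ngram_counts(original, 3) == ngram_counts(reordered, 3))
```

Reordering leaves every character frequency intact but changes which multi-character strings exist at all, and therefore which symbols a minimum-entropy search could select.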
In the search for symbols, the fundamental scale method accounts for frequent sequences of strings which result from grammar rules. The string "ing", for example, appears at the end of words representing verbs or actions. Moreover, it is normally followed by a space character (" "). As the sequence appears with noticeable frequency, the fundamental scale method recognizes the char sequence "ing " (ending with a space) as an entropy-reducing token and therefore as an important descriptive piece of English as a language. The observation of a description at its fundamental scale is, therefore, sensitive to the order in which char-strings appear within the description. The fundamental scale method detects the internal grammar which has been ignored when analyzing Zipf's profiles at the scale of words in many previous studies.
Although the concept of fundamental scale is applicable to descriptions built over multidimensional spaces, the fundamental scale method and the algorithm developed are devised for 1-dimensional descriptions. The symbol search process implemented scans the description along the writing dimension of the text file being analyzed. This means that the fundamental symbols constituting 2D descriptions like pictures, photographs or plain data tables cannot be discovered with the algorithm as developed. To extend the fundamental scale algorithm to descriptions of more than one dimension, the restriction (8c) must be modified or complemented to incorporate the sense of an indivisible information unit, as the character has been in the development of this study, and the allowed symbol boundary shape in the description space considered. This adjustment is a difficult task to accomplish because establishing criteria for the shapes of the boundaries becomes a hard-to-solve topology problem, especially in higher dimensional spaces.
There are other limitations for the analysis of descriptions of one dimension. Some punctuation signs, which belong more to the writing system than to the language itself, work in pairs. Parentheses, quotation marks, exclamation marks and question marks are some of the written punctuation signs which work in couples. Intuition indicates that each one of them is a half-symbol belonging to one symbol. In these cases, not considering each half as part of the same symbol most likely increases the entropy associated with the set of symbols discovered, thus becoming a deviation from the ideal application of the method. Nevertheless, for English, Spanish and human natural languages in general, the characters which work in couples appear infrequently as compared to the rest of the characters. Thus the distortion of the minimal entropy introduced by this effect is small.
Practical use of the algorithm is feasible up to some description lengths. The actual limit depends on the nature of the language used in the description. For syllabic human natural languages the algorithm can be directly applied to texts of 40,000 characters or less. Longer texts, however, can be analyzed by partitioning. Thus the application limit for texts expressed in human natural languages covers most needs. For the analysis of music, the use of the algorithm is limited to the MIDI format, and even so it results in large processing times on the powerful computers available today. The problem of scanning all possible sets of symbols in a sequence of characters grows as a combinatorial number. The problem rapidly becomes too complex in the computational sense, and its practical application is only feasible for representations of music in reduced sets of digitized symbols like the MIDI coding. Using more comprehensive formats like .MP3, a compression technology capable of reducing the size of a music file while keeping reasonably good sound quality, would be enough to place the solution of the problem beyond our possibilities of performing experiments with large sets of musical pieces. Yet the fundamental scale method provides new possibilities for discovering the most representative dimension of small-sized textual descriptions, allowing us to advance in our understanding of languages.
The Fundamental Scale, as a concept and as a method to find a quantitative approximation to the description of communication systems, promises to be fruitful in further research. Tackling the barriers of the algorithm by finding ways to reduce the number of loops and by sharpening the criteria used may extend the space of practical use of the notion of a description's fundamental scale.