2.1. Representations: Small Objects and Large Objects
Some creative objects are inherently discrete in their forms, while others are fundamentally continuous. The combinatorial category includes text documents (e.g., stories, poetry, play scripts, and text descriptions of other media) and symbolic music (e.g., scores), while the continuous set includes live performances, photography, movies, and paintings.
Producers of these various art forms probably do not spend much time thinking about whether they are creating a fundamentally digital object or not, but in a digital era, the difference is significant for a number of reasons. For example, there is no such thing as a higher-resolution version of a novel; a proper encoding of the complete text is a full representation of the creative artifact. Given that novels in English tend to be fewer than 100,000 words, this means that a novel can be represented in ASCII encoding in a few megabits, and transmitted in less than a second. Adding more storage to the representation does not make for a better reading experience. Obviously, one could include historical versions of the text, annotations, criticism, contextual information and other augmentations of that sort. However, the core text can be stored at a scale of over a thousand novels per gigabyte; even the entire text of the King James Bible requires only a few megabytes.
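As a sanity check on these figures, here is a quick back-of-envelope computation; the 100,000-word length is taken from the text above, while the six characters per word (space included) is an assumed average for illustration:

```python
# Back-of-envelope storage estimate for a novel stored as plain ASCII.
# Assumption (for illustration): ~6 characters per word, space included.
words = 100_000
chars = words * 6                  # ~600,000 ASCII characters
bits = chars * 8                   # ASCII fits in 8 bits per character
megabits = bits / 1_000_000
novels_per_gigabyte = 1_000_000_000 // chars

print(f"{megabits} megabits per novel")             # 4.8 megabits
print(f"{novels_per_gigabyte} novels per gigabyte")  # 1666 novels
```

Both results agree with the claims above: a few megabits per novel, and over a thousand novels per gigabyte.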
By contrast, advances in digital photography technology for production, storage, and compression give viewers increasingly accurate representations of the image that the artist sought to generate. The 320 × 200-pixel, sixteen-color images of a 1980s Commodore 64 computer could be stored in 32 kilobytes, but offered poor detail, no ability to properly indicate shadow or dimensionality, and (obviously) low resolution and color fidelity. A 2019 Mac display has a resolution of 5120 × 2880 pixels in over a billion colors, requiring roughly 2000 times as much video memory and an astonishing amount of computation to show a user images. However, it is also able to represent artwork in vastly greater detail; further, super-high-resolution images of famous artworks can require gigabytes of space, and curators and art historians can use them as primary research tools [26], given that they might well zoom into the image with more detail than is visible to the naked eye.
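The frame-buffer arithmetic behind the "roughly 2000 times" figure can be checked directly; the 30 bits per pixel for "over a billion colors" (10 bits per channel) is an assumption for illustration:

```python
# Frame-buffer sizes for the two displays discussed above.
# Assumption: 16 colors = 4 bits/pixel; "over a billion colors" = 30 bits/pixel.
c64_bytes = 320 * 200 * 4 // 8       # Commodore 64 bitmap
mac_bytes = 5120 * 2880 * 30 // 8    # 5K display frame buffer

print(c64_bytes)                     # 32000 bytes, i.e., ~32 KB
print(mac_bytes)                     # 55296000 bytes, i.e., ~55 MB
print(mac_bytes // c64_bytes)        # 1728: "roughly 2000 times" as much
```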
This striking contrast in file size may seem like a curiosity, particularly now that we are far enough removed from the Commodore 64 era that image (and audio and video) quality issues are rarely major concerns, and most of us do not even store audio files locally anymore in the streaming music era.
Yet the problem of “smallness” creates real challenges for computational creativity: with many fast algorithms, we can easily generate hundreds of thousands of poems, short stories, song lyrics, or musical segments in a second, which creates the need to evaluate their quality in real time as well. This is not easy: authors often identify a collection of desiderata that characterize good examples of the genre they are creating, score the generated pieces against efficiently computable analogues of each desideratum, and highlight the most successful pieces. Ensuring novelty can be a real challenge in this context.
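The generate-then-score loop described above can be sketched as follows. The generator, the word list, and the two desiderata (a novelty proxy and a variety proxy) are all hypothetical stand-ins for illustration, not the measures any particular system uses:

```python
import random

# Toy generate-and-rank loop: produce many candidates quickly, score each
# against cheap proxies for the desiderata, and keep the best one.
random.seed(0)
WORDS = ["moon", "frost", "river", "stone", "wind", "lantern", "crow"]

def generate_line() -> str:
    """Generate a candidate 'line' of three words (stand-in generator)."""
    return " ".join(random.choices(WORDS, k=3))

def novelty(line: str, seen: set) -> float:
    """Proxy desideratum: fraction of words not already in the corpus."""
    return sum(w not in seen for w in line.split()) / 3

def variety(line: str) -> float:
    """Proxy desideratum: fraction of distinct words within the line."""
    return len(set(line.split())) / 3

seen = {"moon", "river"}                       # a tiny 'inspiring set'
candidates = [generate_line() for _ in range(100_000)]
best = max(candidates, key=lambda l: novelty(l, seen) + variety(l))
print(best)    # a line of three distinct words, none previously seen
```

Even this toy version shows the scale issue: generating 100,000 candidates is trivial, so the scoring functions must be cheap enough to run on all of them.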
2.2. Algorithmic Information Theory Basics
We explore a variety of desiderata for creative objects in a recent paper [24], using the surprisingly apropos vantage point of algorithmic information theory. Algorithmic information theory seeks to identify the inherent computation displayed by a combinatorial object. It gives measure to the universal probability of an object (using an encoding given by Turing machines) and to the complexity of the object (the size of the smallest Turing machine whose output is that object). In that work, we also adapt other, more advanced concepts of algorithmic information theory, such as logical depth [27] and sophistication [28], to the more general concepts of computational creativity. This approach gives rise to a collection of largely new definitions for core computational creativity goals, such as novelty, value, typicality, surprisingness, and others.
To be more specific, we define here some of the basic concepts of algorithmic information theory that apply to our recent work. The Kolmogorov complexity $K(x)$ of a string x is the length of the shortest input p to a universal Turing machine U, where on input p, U computes x and halts. Often, we ignore the details of this universal Turing machine U, but in our case, we will spend a fair amount of time limiting the possible inputs to U and restricting the possible behavior of the machine U, or of the machines it simulates. The conditional Kolmogorov complexity $K(y \mid x)$ of y given the string x is the length of the shortest input to U for which U outputs y, given that U is also provided with a read-only tape on which the string x is pre-loaded. If $K(y \mid x) \ll K(y)$, then the string x provides a great deal of information in planning to later compute y, while if $K(y \mid x) \approx K(y)$, then x and y are largely unrelated.
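Although $K$ itself is uncomputable, compressed length is a standard practical upper-bound proxy for it, and a concatenation trick gives a proxy for the conditional version. The use of zlib here is an illustrative assumption, not part of the formal definitions:

```python
import zlib

# Compressed length as a crude upper-bound proxy for Kolmogorov complexity.
# This is a standard heuristic; it is machine- and compressor-dependent.
def c(s: bytes) -> int:
    return len(zlib.compress(s, 9))

x = b"the quick brown fox jumps over the lazy dog " * 50
y = x + b"and one extra closing sentence."   # y shares almost all of x

# Proxy for the conditional complexity K(y | x): the extra cost of y
# once x is already available, approximated by c(x + y) - c(x).
cond = c(x + y) - c(x)
print(c(y), cond)   # cond is much smaller than c(y): x "explains" y
```

This mirrors the definition above: when the proxy for $K(y \mid x)$ is far below the proxy for $K(y)$, x carries most of the information needed to produce y.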
One way of understanding $K(x)$ is to see it as coming from a two-part representation of the object x: a program p that describes the regularity of x and how to compute it, combined with a number of bits of random information y needed to describe the remaining details in x. In this formulation, if we use U to simulate the running of the program p on input y, then U again halts with output x. The usefulness of this idea is that it allows us to explore other inputs to the program p; naturally, this approach is most effective if p does not output x on all inputs. In this formulation, a model for x is a program p which, on some input data y, outputs x, and the pair $(p, y)$ forms a two-part code encoding of x.
A clean way of describing the relationships among U, p, and y is to see U as running with two (read-only) input tapes that are initialized with p and y, and a work tape; then, we say that U simulates $p(y) = x$ by halting with exactly the string x on the work tape. To make this domain much more tractable, we require that U computes only on a prefix-free set; that is, if $p = p_1 p_2$ for some non-empty $p_2$ and U halts on $(p_1, y)$ for some string y, then U goes to an error state on the input $(p, y)$ for any y; similarly, if $y = y_1 y_2$ for some non-empty $y_2$, and U halts on $(p, y_1)$ for some encoded Turing machine p, then U goes to an error state for the input $(p, y)$ for any p. The prefix-free property can seem a bit unnatural, but in the domain of computer programs, it is equivalent to having an END statement at the end of the program p, and for inputs, it can represent a special end-of-file character. An alternative easy way to impose the prefix-free property is to represent a binary object z of length n by the string $1^n 0 z$; in this formulation, all strings z have a unique $(2n+1)$-bit mapping available to them. In this frame, we can consider p to be a partial function where $p(y) = x$ if U halts on input $(p, y)$ with output x, and $p(y)$ is undefined otherwise.
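The simple self-delimiting code just described is easy to implement and check. The following sketch encodes a string z of length n as n ones, a zero, then z itself, so codewords can be concatenated and unambiguously split apart:

```python
# The simple prefix-free code described above: z of length n is sent as
# "1"*n + "0" + z, a (2n + 1)-bit codeword. No codeword is a prefix of
# another, so a stream of codewords can be decoded left to right.
def encode(z: str) -> str:
    return "1" * len(z) + "0" + z

def decode(stream: str) -> tuple[str, str]:
    n = stream.index("0")                 # the count of leading 1s is |z|
    z = stream[n + 1 : 2 * n + 1]
    return z, stream[2 * n + 1:]          # decoded string, remaining stream

msg = encode("101") + encode("0")         # two codewords, concatenated
z1, rest = decode(msg)
z2, _ = decode(rest)
print(z1, z2)                             # 101 0
```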
In this framework, strings x that have short data-to-model codes are inherently more likely as outputs of U than those with only longer encodings. This is best described by the algorithmic probability of x: given a universal machine U with the prefix-free property on the inputs it accepts, the algorithmic probability of x is $m(x) = \sum_{(p,y)\,:\,U(p,y)=x} 2^{-|p|-|y|}$. (Note that because of the prefix-free property, $\sum_x m(x) \le 1$, by the Kraft inequality [29]; we do not typically scale the measure to sum to 1, since we are more often concerned with the difference in algorithmic probability of two different strings.) We can further define a program-specific algorithmic probability $m_p(x) = \sum_{y\,:\,U(p,y)=x} 2^{-|y|}$, which again satisfies $\sum_x m_p(x) \le 1$ because of the prefix-free property of U’s second input tape. This probability is the likelihood of x being the outcome of p, a program of interest. Since the smallest $(p, y)$ pair that computes x satisfies $|p| + |y| = K(x)$, this pair contributes $2^{-K(x)}$ to the overall algorithmic probability of x, which is the largest single contribution. Let the language of the model p be the set $L(p)$ of all strings x that p outputs on some valid input y.
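The program-specific probability can be estimated for a toy program by brute-force enumeration of short inputs. The program p below is a hypothetical stand-in chosen so that its prefix-free input set is easy to check:

```python
from itertools import product

# Toy estimate of the program-specific algorithmic probability m_p(x):
# enumerate all short binary inputs y and add 2^(-|y|) whenever p(y) = x.
# The program p is an illustrative stand-in: it accepts only the
# prefix-free inputs "1"*k + "0" and outputs the string "ab" * k.
def p(y: str):
    k = y.index("0")
    if y != "1" * k + "0":
        return None               # not in p's prefix-free set of valid inputs
    return "ab" * k

def m_p(x: str, max_len: int = 12) -> float:
    total = 0.0
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            y = "".join(bits)
            if "0" in y and p(y) == x:
                total += 2.0 ** -n    # each producing input contributes 2^-|y|
    return total

print(m_p("abab"))   # 0.125: only the 3-bit input "110" produces "abab"
```

Because p's valid inputs form a prefix-free set, these contributions across all outputs sum to at most 1, exactly as the Kraft inequality requires.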
We can further restrict the class of valid models p that we permit, and we will do this extensively in later sections of this paper. For example, we can require that p always accepts (except when rejecting inputs that are not from its prefix-free set of valid inputs), that p computes an injective function, that $L(p)$ is a prefix-free set, that $|p| + |y| \le K(x) + c$ for some constant c, and we can even make restrictions on the size or resource use (space or time) of p. We can require that $L(p) \supseteq Q$ for some set of valid outputs Q, to guarantee that all possible objects of a type are output. Some of these restrictions correspond to substantial demands on the space of valid models and dramatically expand the computational requirements for a valid model; it is possible that under these restrictions, the shortest model for x might be quite a bit longer than $K(x)$, or indeed x might no longer be a computable string at all with the restrictions. It is also possible that testing whether a program p satisfies the restrictions is itself uncomputable: for example, the requirement that p always accept is untestable, due to the undecidability of the halting problem. For any property P that defines a valid class of models p, $K_P(x)$ is the smallest length of a representation $(p, y)$ such that $U(p, y) = x$ and p satisfies P; if no such model p exists, then $K_P(x) = \infty$. One trick that is worth noting is that while the inputs to U (the representations of p and its input y) may be required to be prefix-free, we can consider p and y to be extracted from that prefix-free representation, so by the time the universal Turing machine simulates p on y, the input to p can be assumed to be of length $|y|$.
Some Pathological Programs
There are two programs that form special cases that we most likely do not want to have as optimal models for the string x. The first program is the “print” program, which requires that its inputs all be of the form $1^{|x|} 0 x$ for some string x, and on input $1^{|x|} 0 x$ prints the string x on the work tape and halts. This program is of constant size (the portion that verifies that the input is valid can be handled by the UTM U), and for any valid input, it assigns algorithmic probability $2^{-(2|x|+1)}$ to the string x in runtime $O(|x|)$. (In this case, the algorithmic probability integrates to exactly 1, and so no scaling is needed.) However, we can certainly agree that “print” performs no interesting computation, and ideally it is not the best model for any object we wish to view as creative.
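A quick numerical check of the claim that the "print" program's probability mass integrates to exactly 1, assuming the self-delimiting $1^n 0 x$ input encoding sketched earlier, under which each string x of length n receives mass $2^{-(2n+1)}$:

```python
# There are 2^n strings of length n, each receiving mass 2^-(2n+1), so
# the total mass is sum over n of 2^n * 2^-(2n+1) = sum of 2^-(n+1) = 1.
def total_mass(max_len: int) -> float:
    return sum((2 ** n) * 2.0 ** -(2 * n + 1) for n in range(max_len + 1))

print(total_mass(50))   # 1 - 2**-51: within double precision of exactly 1
```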
The second pathological program is even simpler: it is the “print x” program, which stores the string x in full detail in its structure and, on every input y, simply prints x on the output tape and halts. This program is of length $|x| + c$ for some constant c, and requires runtime $O(|x|)$ (regardless of the length of its input, which it ignores); its language contains only x, which it outputs with probability 1.
Note that the Turing machines we consider in this article are deterministic and non-probabilistic. As such, actions such as “print out a random string of length n” cannot be encoded for this model.
2.3. Desiderata for Creativity
We now define several of the key concepts in the evaluation of creative agents and explain how these concepts integrate with advanced ideas from algorithmic information theory, as defined in our recent paper [24]. A somewhat unexpected concern comes from the question of model classes: by restricting to models that represent only finite sets, we can potentially over-fit on training data and fail to obtain a good representation of the novelty of a new object.
The definition of typicality that we give in our recent paper is that, given an “inspiring set” S of objects in our class, and a good model M whose language is a superset of S, the typicality of a new object x is the negative of the randomness deficiency of x with respect to M, or $-(\log |L(M)| - K(x \mid M))$. In this formulation, an object is “typical” for the model M if M generates it with parameterizations that cannot be better represented by changing the model itself: the most typical examples of M are those with a typicality of zero.
Of course, as alluded to in the previous section, we must do a good job of defining the class of models we are considering. In our previous paper, we restrict the class of valid models to those for which the model complexity, along with the data-to-model code of each member s of the inspiring set S, is not much larger than $K(s)$; to be specific, $|M| + K(s \mid M) \le K(s) + c$ for some constant c and every $s \in S$, and the typicality of x with respect to the model M is as defined in the previous paragraph.
Similarly, we define the value of an object x in two related ways. The first approach uses logical depth to model the inherent computation of the object itself: if all of the short programs for x require substantial computation to generate x, then the best explanation for the creation of x is one that involves substantial effort, and hence, x is of value. To be specific, let $\mathrm{depth}_b(x) = \min\{\mathrm{time}(p, \epsilon) : U(p, \epsilon) = x \text{ and } |p| \le K(x) + b\}$ for some constant b, where $\epsilon$ is the empty string. Obviously, $\mathrm{depth}_b(x)$ is uncomputable, since if we knew what $K(x)$ and $\mathrm{depth}_b(x)$ were, we could just run all programs of length at most $K(x) + b$ for that runtime until we find one that halts with output x. Again, in this frame, if x is logically deep, it is valuable, since it attests to the likely effort that the creator engaged in while building the object x. In the algorithmic probability sense, a logically deep string is a string for which the most likely explanations all require substantial computational effort; while U generates x with some long, quick programs, the bulk of the probability mass is on the shorter, slower programs.
The second approach to the value of an object is its sophistication. The sophistication of an object x is the length of the shortest program for which x is a typical output. Formally, let $\mathcal{T}$ be the set of all machines that compute total functions. Then, given a constant c, $\mathrm{soph}_c(x) = \min\{|p| : p \in \mathcal{T}, \text{ and there exists } y \text{ with } U(p, y) = x \text{ and } |p| + |y| \le K(x) + c\}$. In this formulation, a high-sophistication string is one that can be compressed, but for which the data-to-model code requires a fairly complicated model p; the model represents the complexity of the creator itself, and the object is satisfying because it would only come from a simple creator as a surprising outcome. To consider our two pathological examples: if $K(x) \ll |x|$, then neither the “print” program nor the “print x” program is a good model for x, since the former requires a too-long argument and the latter a too-long program. However, if $K(x) \approx |x|$, which is to say, if x is a random string, then the “print” program is a good model for x. By contrast, if $x = 1^n$ or is some other highly repetitive string, and $K(x) \approx \log n$ or is some other small value, then the string x is still a typical output of a simple program that reads in a binary representation of n and prints out that many 1s. So, despite being highly compressible, x is still unsophisticated. Said in a different way, with a small value of c, the tolerance constant in the definition of sophistication, the minimal total program for x is a trivial program that reads in a short encoding of the length of x and then outputs a string of 1s of that length.
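The $1^n$ example can be seen concretely with a real compressor; here zlib's compressed length stands in as a crude, compressor-dependent proxy for $K(x)$:

```python
import zlib

# The string 1^n is enormously compressible (so its K(x) proxy is tiny),
# yet the structure a compressor exploits is trivial run-length repetition,
# matching the claim above that such a string is unsophisticated.
n = 100_000
x = b"1" * n
compressed_len = len(zlib.compress(x, 9))
print(len(x), compressed_len)   # 100000 bytes shrink to a few hundred at most
```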
It is an important theorem that sophisticated objects are all of high logical depth: that is, if it requires a complex program to compress a non-random string x with a short input, then the execution of that program will be lengthy. See Antunes et al. [30] for more details.
Unfortunately for us, this overall approach does not work immediately with small objects. Short objects all have trivial short descriptions; including enough background information to describe the class that an object belongs to makes the overall program much, much larger than any small object. As such, the logical depth of any small object is itself small, and distinguishing among these objects is impossible in this manner. Similarly, for any finite inspiring set S of small objects, it will be hard to beat the trivial model M that just lists off those elements, and then simply prints off the new object, with no real computation involved in any of the explanations. The sophistication of a small object x is going to be small, because the “print” program has x as a typical output, and yet, it is a small program. Compressibility, and the effort found in compression, is not sufficient to explain small objects.
2.4. The Problem of Small Objects, Formally
Now, we can at last present the problem of small objects. We know that small objects x, where $|x|$ is small (and our understanding of x is complete; that is, there exists no string $x'$, with $|x'| > |x|$, that better represents the object), have the property that $K(x) \le |x| + c$ for a small constant c. If $|x| \le k$ for some constant k, how can we identify whether x is of high value, or of high novelty or typicality with respect to an inspiring set S, relative to some other small object y?
The reason why this problem is serious comes from the low value of $K(x)$. Imagine that x is a haiku. To build a program p to generate a good-quality English-language haiku, we would need to represent English grammar and vocabulary, syllable counts, and so on, all as part of the program p. Thus, p will presumably require megabytes of code and data; after doing so, we might be able to identify x by an input y that is a bit shorter than x, and might even be able to express some aesthetic judgment in prioritizing “good” haiku.
However, by our definition of typicality, the program p is an invalid model for x: since $|p| \gg K(x) + c$, we have $|M| + K(x \mid M) > K(x) + c$ for $M = p$, and the program p is not allowed. Similarly, x cannot be logically deep, since the “print x” program is short enough to be a valid answer to the question “how was x generated?”, and its runtime is approximately n; the “print” program run with the input x also generates x in approximately n time, and the pair (“print”, x) has a length that is not much longer than n. It is possible in some rare cases that a small object might have a very short program that compresses it even more, but it would be very surprising if the best very short program allowed the expansion of x only in very long runtimes indicative of much logical depth. Finally, x cannot be sophisticated, because parameterizing the “print” program to output x will be a valid choice.
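The practical face of this problem shows up immediately with compression-based proxies for $K$: on short strings, compressed length is dominated by fixed overhead and barely separates a (subjectively) good haiku from a shuffled version of the same words. The example texts are illustrative:

```python
import zlib

# For small objects, a compression proxy for K(x) cannot distinguish
# value: a classic haiku and a scramble of the very same words compress
# to nearly identical sizes.
haiku = b"an old silent pond / a frog jumps into the pond / splash silence again"
scrambled = b"pond a silent into / again the frog old jumps / splash an pond silence"

c1 = len(zlib.compress(haiku, 9))
c2 = len(zlib.compress(scrambled, 9))
print(c1, c2)   # nearly identical: the proxy sees no difference in value
```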
How can we use algorithmic information theory to differentiate the value and typicality of small objects?