Escaping Printable Encoded Streams to Embed Out-of-Band Data

In this paper, we propose to exploit the unused configurations of a printable encoding such as Base41, Base45 or Base85 to create a side channel that can store extra data such as error detection or correction codes, integrity verification and authentication information or application defined data. After introducing the encoding of binary octet strings in printable form, we present some case studies that show possible applications of the unused configurations.


Introduction
Printable string encoding of binary data is a well-known and widespread technique used by systems designed to manage only printable characters: in other words, printable string encoding is an encapsulation method that lets systems able to treat only printable strings process any possible bit string in a transparent way.
Typical examples of this are old mail servers that are not able to directly accept binary data and the QR-code representation of the European Union Digital COVID Certificate.
To allow the storage and transmission of binary data with printable strings, many encodings have been proposed: in Section 3, some of these works will be recalled, emphasizing the use of the legal strings and computing the space of unused configurations that allow the embedding of additional information.
In general, the approach is to define an alphabet made of symbols that are used to compose strings, each one associated with a binary configuration to be encoded. For example, Base41 [1,2] uses three symbols from an alphabet of 41 printable characters to encode a pair of octets: in this case the possible printable strings are $41^3 = 68{,}921$, while the configurations of a pair of octets are $2^{16} = 65{,}536$, leaving $41^3 - 2^{16} = 3385$ free printable strings that, in general, are considered "illegal" but may be employed to store additional data such as a Cyclic Redundancy Check (CRC) value, a cryptographic hash, a Message Authentication Code (MAC), or a digital signature.
The main contributions of this paper are as follows:
• Showing the redundancy present in some printable encodings;
• Developing a methodology to embed extra data in a printable encoded stream;
• Showing the effectiveness of embedding these data for security purposes;
• Showing some other applications that leverage unused printable strings.
The following Section 2 will establish a uniform nomenclature and notation to be used throughout the paper, after which Section 3 will present some printable encoding methods and applications. Section 4 will give details on some of the encodings that allow the embedding of extra (i.e., payload) data and will present some methods for performing this operation. Section 5 will present a numerical analysis for some embodiments of the case studies introduced in Section 4, and Section 6 will discuss some conclusions on the proposed methodology and its applications.

Nomenclature and Notation
In this section, we briefly recall some nomenclature and notation to have a uniform and clear definition and representation of the entities involved in this paper.
A symbol or character is a graphical representation of an abstract or real entity or concept.
An alphabet is an ordered, finite size collection of distinct symbols.
A set is a collection of distinct items, or elements, that in the present context will be symbols or characters.
A sequence or string will refer to an ordered collection of symbols from an alphabet. In particular, a sequence of symbols from an alphabet having cardinality 2 is called a binary string. A string of characters is written within double quotes, e.g., "ABC" represents the string of the first three symbols of the Latin alphabet (capital letters).
To refer to instances of the previous entities, variables, or their properties, we denote them with the following rules: Throughout this document, BaseYY will denote an encoding method based on an alphabet of YY symbols: for example, Base41 refers to an encoding method based on 41 symbols.

Related Works
In the first part of this section, we will recall some printable encodings, giving emphasis to those that have unused configurations that can be exploited to represent extra data. Then, given that the present proposal concerns the use of printable sequences to stuff extra data, the second part will discuss some works that embed payload information into textual data.
One of the most widely used printable encodings is Base64 [3], which represents three binary octets with four base 64 symbols and handles an input whose length is not a multiple of three by means of the special symbol "=". The same document [3] presents a base 32 encoding that represents five octets with a sequence of eight symbols taken from an alphabet of 33 (32 plus "=", which is used for padding when the input sequence has a length that is not a multiple of five). Furthermore, Ref. [3] discusses base 16, which is essentially the well-known hexadecimal representation.
The use of base 41 is the core of [1,2,4]. The proposed encodings employ two different alphabets of 41 symbols; moreover, Refs. [1,2] also discussed bit strings of arbitrary length, i.e., not necessarily having a length that is a multiple of eight. In particular, Refs. [1,2] encoded a pair of octets, a single octet, and a string of length from 0 to 7 bits with three Base41 symbols and, as previously cited, leave 3385 free (unused) printable Base41 strings.
Base 45 was used in [5] for encoding data to be represented by QR codes. The proposed method expresses a pair of binary octets with three symbols from an alphabet of 45. A special coding is reserved for a single octet in case the original stream has an odd number of octets. This code uses only $2^{16} = 65{,}536$ triplets of the $45^3 = 91{,}125$ available: this redundancy was used in [6,7] to reversibly embed data in a Base45 encoded stream.
Base 85 was used in two works: Ref. [8] used an alphabet of 85 printable symbols to obtain an efficient and compact representation of IPv6 addresses, while [9] used the alphabet made of 85 ASCII characters, from code 33 to code 117, to define an encoding called Ascii85 that represents a quadruple of octets with five Base85 symbols. Given the mapping performed by [9], there are $85^5 - 2^{32} = 142{,}085{,}829$ unused configurations for this encoding.
Ninety-one printable characters are employed in two Base91 encodings [10,11]. The software available at [10] encodes blocks of 13 bits with pairs of symbols from a Base91 alphabet: given that the $91^2 = 8281$ Base91 pairs exceed by 89 the configurations of 13 bits ($2^{13} = 8192$), if the block has a value not greater than 88, then one more bit is encoded; in this way, all the $91^2$ Base91 pairs are used for encoding a bit stream in printable form. On the other hand, Ref. [11] encoded groups of 13 bits with two Base91 characters but used 12 pairs of the exceeding 89 to indicate the length of the last group of bits (in case it has a length different from 13): in this encoding, 89 − 12 = 77 pairs are left unused.
With the proposal of this paper, any of the printable encodings that leave unused configurations may be utilized to embed out-of-band data.
In the field of data hiding in textual data many works have been developed to store a payload into a text written with a word processor: in general, non-printable and empty characters or various kinds of white spaces are used to encode binary information.
For example, Ref. [12] developed UniSpaCh, which works on Microsoft® Word documents, inserting different Unicode spacing characters between words, sentences, and paragraphs: this method adds (non-visible) characters to the file, increasing its size as our method does. The change tracking feature of Microsoft® Word documents was exploited in [13] to hide a secret message for steganographic communication.
In [14], the two non-printing characters zero-width joiner (ZWJ) and zero-width non-joiner (ZWNJ) were employed to store information in a text: binary information can be embedded if ZWJ and ZWNJ are used to represent the two states, but the paper also proposes an encoding that exploits longer sequences of ZWJ and ZWNJ to save characters from the Latin alphabet.
Ref. [15] merged the approach in [12] with the use of a zero-width character (ZWC): different combinations of Unicode spaces are used to embed bit pairs between words, sentences, lines, and paragraphs, and the payload is increased considering also the possibility to store ZWCs between words and sentences.
Modifications to the colors of printed characters were employed in [16] to embed a message in a document that will be printed and successively scanned to extract the hidden information: the paper discusses text color modulation (TCM), defining a model for the process of printing and successive scanning (PS model) and defines embedding and detection methods that save the information in the channels red and blue with respect to the value of the green channel.

Printable Encodings and Case Studies of Payload Data Embedding
Consider the set B of all binary strings of length n bits; thus, card(B) = $2^n$. Furthermore, having an alphabet A of t printable symbols, compute the value v such that:

$$t^{v-1} < 2^n \le t^v \qquad (1)$$

and define the set S of all sequences of v symbols from the alphabet A: obviously, card(S) = $t^v$. Using $2^n$ different sequences from S, it is possible to encode all the bit strings in B using only symbols from A. It follows that there will be a subset E of S (E ⊆ S), whose elements are in one-to-one correspondence with the binary strings of B, that is, there is a bijection between E and B, E ⇔ B.
Table 1 reports the characterizing values for some printable encodings.
The set U = S − E, which contains the unused sequences of S, will have card(U) = $t^v - 2^n$. From Table 1, it may be observed that this set U is non-empty for Base41 [1], Base45 [5], Base85 [9], and Base91 [11]. Here, we propose a general framework for exploiting the unused sequences in several contexts, allowing applications to choose the most appropriate setting for their own purposes. Therefore, every application must define the meaning assigned to every unused sequence and how to process it. Suppose that binary sequences of n bits are encoded with v symbols belonging to an alphabet A (v is determined as in Equation (1)). If U ≠ ∅ (see, for example, the encodings with a non-zero value in the last column of Table 1), an application selects a set of sequences Z ⊆ U and assigns a meaning to every sequence S ∈ Z. The semantics of each sequence must be known to both the encoder and decoder and agreed upon to have a correct transmission and extraction of the encoded data.
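The quantities defined above can be checked numerically. The following is a minimal sketch, not taken from the cited encodings: the function name `encoding_parameters` is hypothetical, and v is computed as the smallest number of symbols for which $t^v \ge 2^n$.

```python
def encoding_parameters(t, n):
    """Return (v, unused, max_bits) for an alphabet of t symbols and n-bit blocks.

    v        : smallest number of symbols with t**v >= 2**n
    unused   : card(U) = t**v - 2**n
    max_bits : largest l with 2**l <= unused, or None when U is empty
    """
    v = 1
    while t ** v < 2 ** n:
        v += 1
    unused = t ** v - 2 ** n
    max_bits = unused.bit_length() - 1 if unused else None
    return v, unused, max_bits


if __name__ == "__main__":
    # (name, alphabet size t, block length n in bits) as described in the text
    for name, t, n in [("Base64", 64, 24), ("Base41", 41, 16),
                       ("Base45", 45, 16), ("Base85", 85, 32)]:
        print(name, encoding_parameters(t, n))
```

Base91 [11] is left out of the loop on purpose: the formula gives 89 spare pairs, but that encoding already reserves 12 of them for length information, which the function does not model.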
As will be shown later on, a sequence S ∈ Z may represent:
a. A string of bits encoding the whole or part of a Cyclic Redundancy Check (CRC) code;
b. A prefix indicating that a fixed number of following sequences encode a CRC, a Message Authentication Code, or a digital signature;
c. One or more bits to be transmitted separately from the data encoded by the sequences belonging to E;
d. A separator to split portions of the data stream encoded by the sequences in E;
e. An identifier specifying the characteristics of a portion of following sequences;
f. A context defining the meaning of the following sequences S ∈ Z.
For instance, an application that uses Base41 printable encodings can decide that the sequence "zxx" is a prefix indicating that the next two sequences represent a 32 bit CRC. Note that different applications can assign different meanings to the same sequence from Z.
The next subsections will present some possible embodiments using the previously introduced representations.

Error Detection and Correction Information Embedding
The stream of printable encoded data may be stuffed with sequences belonging to U that encode a Cyclic Redundancy Check (CRC) [17] of a portion of data that has to be controlled for errors.
It is possible to encode a CRC of length $l \le \log_2(t^v - 2^n)$ bits using a subset C of $2^l$ sequences in U, associating every CRC binary string of length l to one sequence in C (Figure 1a). In this case, the proposed framework is instantiated with Z = C.
The maximum values of l for the encodings in Table 1 are 11 for Base41, 14 for Base45, and 27 for Base85. Longer CRC codes may be stuffed by simply concatenating more unused sequences (Figure 1b), in which case Z = C again, or by using a single unused sequence $S_c$, so that $Z = \{S_c\}$, as a preamble for a fixed number of legal sequences belonging to E, each carrying n bits of the CRC (Figure 1c) (see [18] for a comprehensive list of CRC polynomials).

Example 2. Using the same Base41 encoding [1], an implementation of Figure 1b is to employ 2048 of the 3385 unused sequences available and concatenate three of them to stuff CRCs of length $3l = 3 \log_2 2048 = 33$ bits computed on the previous bit string for error detection.

Example 3. An implementation of Figure 1c with Base45 [5] is to employ one of the 25,589 unused sequences available (see Table 1) to specify that the following two sequences belonging to E (each one encoding 16 bits) will encode a 2 × 16 = 32 bits CRC.
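The scheme of Figure 1a can be sketched for Base45 as follows. This is an illustration under stated assumptions rather than the method of [5]: the alphabet and digit order are those of RFC 9285 (assumed to match [5]), the mapping of a check value x to the unused triplet of value $2^{16} + x$ is a hypothetical choice, and the low 14 bits of CRC-32 stand in for a real 14-bit CRC.

```python
import zlib

# RFC 9285 Base45 alphabet and digit order, assumed here to match [5].
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:"
LEGAL = 1 << 16              # triplet values 0..65535 encode a pair of octets (set E)
CHECK_BITS = 14              # largest l with 2**l <= 45**3 - 2**16
MASK = (1 << CHECK_BITS) - 1


def to_triplet(value):
    """Write an integer below 45**3 as three symbols, least significant first."""
    low, rest = value % 45, value // 45
    return ALPHABET[low] + ALPHABET[rest % 45] + ALPHABET[rest // 45]


def from_triplet(text):
    """Inverse of to_triplet for a string of exactly three symbols."""
    c, d, e = (ALPHABET.index(ch) for ch in text)
    return c + 45 * d + 45 * 45 * e


def check_value(data):
    # Stand-in check: the low 14 bits of CRC-32, not a true 14-bit CRC.
    return zlib.crc32(data) & MASK


def encode_block(data):
    """Encode an even-length octet string and append one unused triplet with the check."""
    if len(data) % 2:
        raise ValueError("this sketch handles only an even number of octets")
    triplets = [to_triplet(data[i] << 8 | data[i + 1]) for i in range(0, len(data), 2)]
    triplets.append(to_triplet(LEGAL + check_value(data)))  # hypothetical mapping into U
    return "".join(triplets)


def decode_block(text):
    """Return (data, ok); ok is True when an embedded check was found and matches."""
    data, check = bytearray(), None
    for i in range(0, len(text), 3):
        value = from_triplet(text[i:i + 3])
        if value < LEGAL:
            data += value.to_bytes(2, "big")    # legal sequence: two data octets
        else:
            check = value - LEGAL               # unused sequence: embedded check
    return bytes(data), check == check_value(bytes(data))
```

A decoder that is unaware of the convention would reject the last triplet as illegal, whereas an aware decoder separates it from the data simply by comparing its value with $2^{16}$.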

Integrity Information, Message Authentication Code, and Digital Signature Embedding
The printable encoded data may be stuffed and/or terminated with security information such as a cryptographic hash, a Message Authentication Code (MAC), or a signature covering the whole or a portion of the encoded data. Due to the bit length of these binary strings, it is more efficient to employ three unused sequences $S_h$, $S_m$, $S_{ds}$ from U to specify the type of security information (hash, MAC, and signature, respectively) encoded in the following sequences and then use a fixed number of sequences in E to store the hash, the MAC, or the signature (Figure 2). In this case, $Z = \{S_h, S_m, S_{ds}\}$.

Example 4. As shown in Figure 2, a single unused sequence $S_{h1}$ of the Base41 encoding [1] may be employed to specify that the following eight sequences belonging to E (each one encoding 16 bits) will store an 8 × 16 = 128 bits hash, such as MD5 [19]. Furthermore, another unused sequence $S_{h2}$ of the Base41 encoding can be utilized to indicate that the following sixteen sequences belonging to E (each one representing 16 bits) will encode a 16 × 16 = 256 bits hash such as SHA3-256 [20]. In this case, $Z = \{S_{h1}, S_{h2}\}$.
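The prefix-plus-legal-sequences layout of Figure 2 can be sketched as below. The example uses Base45 with the RFC 9285 alphabet as a stand-in for the Base41 encoding of Example 4, since only the former alphabet is assumed known here; the value chosen for the marker `S_H2` and the function names are hypothetical.

```python
import hashlib

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:"  # RFC 9285 order, assumed
S_H2 = 45 ** 3 - 1    # hypothetical unused triplet announcing a SHA3-256 hash
HASH_TRIPLETS = 16    # 16 legal triplets x 16 bits = 256 bits


def to_triplet(value):
    """Write an integer below 45**3 as three symbols, least significant first."""
    low, rest = value % 45, value // 45
    return ALPHABET[low] + ALPHABET[rest % 45] + ALPHABET[rest // 45]


def from_triplet(text):
    c, d, e = (ALPHABET.index(ch) for ch in text)
    return c + 45 * d + 45 * 45 * e


def encode_pairs(data):
    """Encode an even-length octet string with legal triplets only."""
    return "".join(to_triplet(data[i] << 8 | data[i + 1]) for i in range(0, len(data), 2))


def protect(data):
    """Return the encoded data followed by the marker and the encoded digest."""
    digest = hashlib.sha3_256(data).digest()
    return encode_pairs(data) + to_triplet(S_H2) + encode_pairs(digest)


def verify(text):
    """Split the stream at the marker and return (data, digest_matches)."""
    values = [from_triplet(text[i:i + 3]) for i in range(0, len(text), 3)]
    pos = values.index(S_H2)                       # raises ValueError if no marker
    data = b"".join(v.to_bytes(2, "big") for v in values[:pos])
    digest = b"".join(v.to_bytes(2, "big")
                      for v in values[pos + 1:pos + 1 + HASH_TRIPLETS])
    return data, digest == hashlib.sha3_256(data).digest()
```

Because the digest travels in legal sequences, the only size cost beyond the encoded hash itself is the single marker sequence.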

Secondary Data Channel
It is possible to create a second data channel that carries information, such as a watermark, using the sequences in the previously defined set C (Z = C): every sequence represents l bits of information and may be interleaved anywhere in the encoded data stream, being recognizable and distinguishable from data transformed in printable form (Figure 3).

Example 5. Suppose that extra data must be stored in a Base85 [9] encoded stream. Exploiting the 142,085,829 unused sequences (see Table 1), it is possible to encode $l = \lfloor \log_2 142{,}085{,}829 \rfloor = 27$ bits with an unused sequence of five characters. These can be inserted anywhere in the normal flow of Base85 sequences, creating a secondary channel that, for example, can carry RGB colors (expressed with 8 bits per channel for a total of 24 bits).
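A secondary channel of this kind can be sketched for Ascii85 as follows. The sketch assumes the plain five-symbol groups described for [9] (codes 33 to 117, most significant digit first) and ignores any shortcut or delimiter conventions; mapping an RGB value x to the unused group of value $2^{32} + x$ and interleaving one color after each data group are hypothetical choices.

```python
LEGAL = 1 << 32   # group values 0..2**32-1 encode four octets (set E)


def to_group(value):
    """Write an integer below 85**5 as five symbols with codes 33..117."""
    digits = []
    for _ in range(5):
        value, d = divmod(value, 85)
        digits.append(chr(33 + d))
    return "".join(reversed(digits))   # most significant digit first


def from_group(text):
    value = 0
    for ch in text:
        value = value * 85 + (ord(ch) - 33)
    return value


def color_group(rgb):
    """Map an (r, g, b) tuple of 8-bit values to an unused group (hypothetical mapping)."""
    r, g, b = rgb
    return to_group(LEGAL + (r << 16 | g << 8 | b))


def encode(data, colors):
    """Encode data (length a multiple of 4) and interleave one color after each group."""
    if len(data) % 4:
        raise ValueError("this sketch handles only a multiple of four octets")
    pending = list(colors)
    out = []
    for i in range(0, len(data), 4):
        out.append(to_group(int.from_bytes(data[i:i + 4], "big")))
        if pending:
            out.append(color_group(pending.pop(0)))
    out.extend(color_group(c) for c in pending)   # leftover colors go at the end
    return "".join(out)


def decode(text):
    """Separate the primary data from the colors carried by unused groups."""
    data, colors = bytearray(), []
    for i in range(0, len(text), 5):
        value = from_group(text[i:i + 5])
        if value < LEGAL:
            data += value.to_bytes(4, "big")
        else:
            rgb = value - LEGAL
            colors.append((rgb >> 16, rgb >> 8 & 0xFF, rgb & 0xFF))
    return bytes(data), colors
```

The position of the color groups is irrelevant to the decoder, which classifies every group only by comparing its value with $2^{32}$.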

Parameter Separation
A printable encoding may also be employed to encode parameters passed to a function in a context where binary data cannot be directly transmitted, for example, in the query string of a Web address. To separate the various encoded parameters, it is possible to use a single sequence $S_d$ belonging to the previously defined set U and another sequence $S_t$ from the same set to indicate the end of the parameters (Figure 4). The framework is instantiated with $Z = \{S_d, S_t\}$.
Another possibility is to identify the data types of the various parameters employing sequences from the set U (Figure 5): for example, it is possible to use a sequence $S_i \in U$ to identify an integer, another sequence $S_f \in U$ to specify a float, then $S_o \in U$ to specify an octet string, $S_p \in U$ to express a binary pointer, and two sequences $S_{rs}, S_{re} \in U$ to indicate the beginning and the end of a record made of fields in turn identified with these delimiters (with a possible recursive structure). The parameter list can be terminated with the sequence $S_t$ from the same set U. In this case, the framework is instantiated with $Z = \{S_i, S_f, S_o, S_p, S_{rs}, S_{re}, S_t\}$.
In addition, the encodings proposed in Sections 4.1 and 4.2 may be used as an additional data protection feature for the parameters, taking care to choose $S_i$, $S_f$, $S_o$, $S_p$, $S_{rs}$, $S_{re}$, $S_d$, and $S_t$ among the sequences in U that encode neither a CRC (Figure 1) nor a type of hash, MAC, or digital signature (Figure 2). The proposed framework has $Z = \{S_i, S_f, S_o, S_p, S_{rs}, S_{re}, S_t, S_c, S_h, S_m, S_{ds}\}$.
Example 6. Assume having a program running on a Web server that needs a (variable) set of parameters in binary form. In this case, the various data can be encoded with Base41 [1] and sent as a query string to the program, separating the various parameters with a single sequence from U and terminating the parameter list with another sequence in U. At the receiving side, the program can split the data using the separator sequence and recover the original binary values by decoding the Base41 strings.
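The separator and terminator of Figure 4 can be sketched as below. Base45 with the RFC 9285 alphabet is used as a stand-in for Base41, whose alphabet is not reproduced here; several Base45 symbols would need escaping in a real query string. The values chosen for `S_D` and `S_T` and the function names are hypothetical, and every parameter is assumed to have an even number of octets.

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:"  # RFC 9285 order, assumed
S_D = 45 ** 3 - 2   # hypothetical unused triplet separating two parameters
S_T = 45 ** 3 - 1   # hypothetical unused triplet terminating the parameter list


def to_triplet(value):
    """Write an integer below 45**3 as three symbols, least significant first."""
    low, rest = value % 45, value // 45
    return ALPHABET[low] + ALPHABET[rest % 45] + ALPHABET[rest // 45]


def from_triplet(text):
    c, d, e = (ALPHABET.index(ch) for ch in text)
    return c + 45 * d + 45 * 45 * e


def encode_pairs(data):
    """Encode an even-length octet string with legal triplets only."""
    return "".join(to_triplet(data[i] << 8 | data[i + 1]) for i in range(0, len(data), 2))


def pack(params):
    """Join the encoded parameters with S_D and close the list with S_T."""
    return to_triplet(S_D).join(encode_pairs(p) for p in params) + to_triplet(S_T)


def unpack(text):
    """Recover the list of binary parameters from a packed stream."""
    params, current = [], bytearray()
    # Walk triplet by triplet so that a separator is never matched across a boundary.
    for i in range(0, len(text), 3):
        value = from_triplet(text[i:i + 3])
        if value in (S_D, S_T):
            params.append(bytes(current))
            current = bytearray()
            if value == S_T:
                break
        else:
            current += value.to_bytes(2, "big")
    return params
```

For instance, `pack([(41).to_bytes(2, "big"), (65535).to_bytes(2, "big"), b"BASE"])` yields four data triplets, two separators, and one terminator, and `unpack` returns the three original octet strings.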

Discussion and Results
In this section, we perform some numerical computations on some possible practical applications of the proposed method to printable encoded streams.

CRC Embedding
In the first run of tests, we considered adding an 11-bit CRC to blocks of data encoded in printable form with Base41 [1]. The method adds three octets to the Base41 encoding of the block; thus, if the block has a size of n octets (8n bits), then the Base41 encoding inflates it to 1.5n octets, and adding the CRC leads to 1.5n + 3 octets, with an overload of $\frac{3}{1.5n + 3} \times 100\%$. On the other hand, an 11-bit CRC on a block of 8n bits represents an overload of $\frac{11}{8n + 11} \times 100\%$. Analogous formulas can be derived for a 14-bit CRC embedded with Base45 unused sequences.
We performed the computation of the overload for blocks of sizes 128, 256, 512, and 1024 bits (or 16, 32, 64, and 128 octets, respectively). Table 2 shows the resulting overloads for CRCs embedded into Base41 and Base45 encodings as proposed, comparing them with the classical overload incurred when appending a binary CRC (of 11 and 14 bits, respectively). From these data, it may be seen that the increase in overload is quite limited and feasible for application-level error detection and data protection from unintentional modifications.
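The two overload formulas can be evaluated directly for the block sizes considered. The following sketch only restates the formulas given in the text; the function names are hypothetical, and the printed figures are computed here rather than copied from Table 2.

```python
def embedded_overload(n_octets):
    """Overload (%) of one escape triplet appended to a Base41/Base45 encoded block."""
    return 3 / (1.5 * n_octets + 3) * 100


def classical_overload(n_octets, crc_bits):
    """Overload (%) of a crc_bits-long CRC appended to a binary block of n_octets."""
    return crc_bits / (8 * n_octets + crc_bits) * 100


if __name__ == "__main__":
    for n in (16, 32, 64, 128):
        print(f"{n:4d} octets: embedded {embedded_overload(n):5.2f}%  "
              f"classical 11-bit {classical_overload(n, 11):5.2f}%  "
              f"classical 14-bit {classical_overload(n, 14):5.2f}%")
```

The embedded overload is the same for Base41 and Base45 because both encodings spend one three-symbol sequence on the CRC; only the number of CRC bits carried by that sequence differs.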

Hash Embedding
Let us now examine a Base41 or a Base45 encoding: three printable characters encode two octets (apart from a single octet encoded when the stream length is not even).An MD5 hash [19] has a length of 16 octets, and thus, 3 + 3 × 16/2 = 27 octets may encode a file MD5 hash.Furthermore, a SHA-1 hash [21] has a length of 20 octets, and thus, 3 + 3 × 20/2 = 33 octets may encode a file SHA-1 hash.
Considering Base41 [1], we may assign the unused sequence $S_{MD5}$ = "zzM" to indicate that the following 24 characters encode an MD5 hash and the unused sequence $S_{SHA-1}$ = "zzS" to indicate that the following 30 characters encode a SHA-1 hash (in this embodiment, the framework is instantiated with $Z = \{S_{MD5}, S_{SHA-1}\}$). The impact on the size of the resulting encoding is, in both cases, only three octets due to the escape sequence (in this case, "zzM" or "zzS").

Extra Data Attachment
As a practical instance of Example 5, let us consider the use of Base85 to represent the pixels of an RGB color image to be appended to an Ascii85 encoded stream. Building Z with $2^{24} = 16{,}777{,}216$ of the 142,085,829 unused sequences, so that card(Z) = 16,777,216, it is possible to encode the pixels of the image in printable form: if the image dimensions are 320 × 240 = 76,800 pixels, then the size of the uncompressed image (inflated with a ratio of 5:3) will be 76,800 × 5 = 384,000 characters or octets.

Client-Server Parameter Passing
Let us consider passing a variable number of parameters from a Web client to a server. As a concrete example, suppose that we convey a 16 bit integer valued 41, a 16 bit integer valued 65,535 and a character string valued "BASE". Having built Z with the unused sequences "xBA", "xBB", "xBC", "xBD", "xBF", "xBG", "xBH", and "xBJ" to represent $S_i$, $S_f$, $S_o$, $S_p$, $S_{rs}$, $S_{re}$, $S_d$, and $S_t$, respectively, to perform an encoding that follows the proposal shown in Figure 5, the resulting printable stream will be:
It is obvious that the resulting Base41 string can be immediately and unambiguously decoded by a procedure aware of the Base41 encoding symbols assignment and expecting the corresponding parameters.
One disadvantage is that the insertion of extra data in the encoding increases the size of the processed stream, which may limit the applicability of the method on low-capacity links or small-capacity devices.

Figure 1. Simple schemes to show possible encodings of CRC codes (gray arrows show the sequences covered by the CRC). (a) CRC encoded in a single unused sequence. (b) CRC encoded in multiple unused sequences. (c) CRC encoded in an unused sequence and multiple legal sequences.

Example 1. Considering the Base41 encoding [1], an implementation of Figure 1a is to employ 2048 of the 3385 unused sequences available to stuff CRCs of length $l = \log_2 2048 = 11$ bits computed on the previous bit string for error detection.

Figure 2. Scheme to show possible encodings of security information for data protection (gray arrows show the sequences covered by the hash, MAC, or digital signature).

Figure 3. Secondary channel information interleaved in printable encoded data.

Figure 4. Possible encoding of parameters with separators $S_d$ and $S_t$.

Figure 5. Identifying types of parameters with separators.

Table 1. Some printable encodings with their main parameters and number of unused sequences.