A Survey of Printable Encodings
Abstract
1. Introduction
2. Notation and Nomenclature
2.1. Notation
- scalar variables with italicized lowercase letters, like ; to ease reading; some variables may be named with longer strings of letters, e.g., to indicate a length;
- program variables with lowercase and uppercase letters, both italics or not, e.g., , v, Z;
- a range of values comprised between a and b with ;
- sets and alphabets with boldface uppercase letters, like A, B, and T;
- numerals S in base b (in the present work b will always be written in base 10) with , with where A is an alphabet (see Section 2.2) of b symbols, ;
- an element, that with others belongs to a set or an alphabet, with a lowercase letter and possibly a subscript, like ;
- Unicode code points identified by a number are written in hexadecimal notation (with alphabet ) with four or six characters prefixed by the string “U+”;
- unless differently stated, all numbers are written in the decimal base;
- the floor and ceiling functions are indicated with $\lfloor x \rfloor$ and $\lceil x \rceil$, respectively.
2.2. Nomenclature
- 8-bit cleanness property of a system that is able to store, transmit, and process 8-bit data without requiring data formatted in 7-bit units or relying on a possible use of the 8th bit of an octet for its own processing; alternatively, it is the quality of a system that is capable of processing octets without assigning specific meanings to some binary configurations; thus, a not-8-bit-clean system may interpret in a misleading way or modify the Most Significant Bit (MSB);
- alphabet an ordered set of n symbols, each one associated, in order, to an integer number from 0 to $n-1$;
- ASCII armor a technique used by OpenPGP (see the OpenPGP Message Format [3]) to encode any kind of data (i.e., binary) in a form that is not modified by intermediate not-8-bit clean systems: it uses Radix-64, i.e., a printable encoding based on Base64 [4], to represent with printable ASCII characters the data and the relative checksum encapsulating them into header and trailer lines;
- base b positional numeral system a numeral representation system using b symbols from an alphabet ( represents the natural number i) where the numeral expresses the number , that is, every symbol is weighted according to its position in the numeral; the number b is called the base of the positional numeral system;
- big endian referred to the transmission or storage order of octets, meaning that a data object composed of many octets is sent starting from its Most Significant Octet (MSO), and in its memory area, it is stored by putting the MSO towards lower memory addresses and the Least Significant Octet (LSO) at higher memory addresses;
- bit (binary digit) the fundamental unit of information that can assume one of two values with the same probability; mathematically, it is the information (or entropy) of a source that produces as an outcome one of two equiprobable events (i.e., each one having probability $1/2$);
- Byte Order Mark the byte order mark (BOM) is a special block of octets (i.e., bytes) prefixed to an octet sequence and useful in the decoding of data. In particular, the BOM is the Unicode character ZERO WIDTH NO-BREAK SPACE (ZWNBSP) with code point U+FEFF (the term byte order mark comes from the name of this character, BYTE ORDER MARK, in Unicode 1.0; if this character is present in the following part of a data stream, it should be interpreted as a word joiner, i.e., the character string must not be separated at that point, but since Unicode 3.2, it is recommended to use the character WORD JOINER U+2060 for this purpose). When this character is the first of a sequence, the decoder can establish the endianness of the data encoding; moreover, the presence of this character at the beginning allows the decoder, with a high probability, to be confident that the data stream is Unicode encoded and to determine the type of encoding. In fact, UTF-8 does not allow the octets $\mathrm{FE}_{16}$ and $\mathrm{FF}_{16}$ to be present as first data, excluding this kind of encoding. In the case of UTF-16 and UTF-32, the BOM allows us to determine if the data has been saved in big-endian or little-endian mode: given that the octet sequence $\mathrm{FFFE}_{16}$ does not represent any character in Unicode, its presence as the first character signals the decoder that the data has been recorded in little-endian mode, allowing it to correctly interpret the data stream and to save it with the local machine's endianness. To sum up, the BOM for UTF-16 encoded in big-endian mode is the octet sequence $\mathrm{FEFF}_{16}$, and in little-endian mode it is $\mathrm{FFFE}_{16}$; for UTF-32, the BOM for big-endian systems is $\mathrm{0000FEFF}_{16}$, and for little-endian systems it is $\mathrm{FFFE0000}_{16}$. Other UTF encodings have a header, called BOM, used to specify the type of encoding:
- UTF-1 has the octet sequence $\mathrm{F7644C}_{16}$;
- UTF-7 has the sequence $\mathrm{2B2F76}_{16}$ followed by another octet whose value depends on the next symbol;
- UTF-8 has a sequence whose usage is not recommended by Unicode, corresponding to $\mathrm{EFBBBF}_{16}$.
- C0 control codes compose a set of 32 characters found in ASCII and other encodings used to represent printing, character set switch, communication, alert, power, and formatting commands. The values of the C0 control codes range from $\mathrm{00}_{16}$ to $\mathrm{1F}_{16}$. In some contexts, the space character (ASCII value $\mathrm{20}_{16}$) and the delete (DEL) character (ASCII value $\mathrm{7F}_{16}$) are also considered control codes but are not part of the C0 set;
- C1 control codes compose the “twin” set of the C0 control codes when considering an extended ASCII character set (i.e., over 8 bits): the C1 set is made up of the characters in the range $\mathrm{80}_{16}$ to $\mathrm{9F}_{16}$, that is, the C0 characters with the eighth (most significant) bit set;
- character an elementary piece of information that can be associated with a symbol and that, with others, composes an alphabet;
- code page a table specifying the association between a graphic character, like ‘A’, or a control character, like newline, and a number, thus defining an encoding; this name was introduced by IBM, which numbered many possible code pages, and many other vendors and software producers aligned or produced their own numbering;
- code point the address (label) of an element in a multi-dimensional matrix containing heterogeneous entities; in the context of the present paper, an integer number that uniquely identifies a (part of a) character in an encoding system like ASCII or Unicode;
- endianness refers to the octet ordering and can be little-endian or big-endian;
- G0, G1, G2, G3 working sets of graphic characters that can be loaded and accessed using particular special sequences of control characters;
- GL, GR the primary code area for graphic characters in 7-bit environments is called GL (Graphic Left), while in 8-bit environments, the additional code area is called GR (Graphic Right);
- glyph the graphical representation of a symbol; a symbol may be represented with many glyphs, e.g., a, a, a, a;
- grapheme an elementary object of a writing system that in the computer science field has the same meaning as character;
- little endian referred to the transmission or storage order of octets, meaning that a data object composed of many octets is sent starting from its LSO and in its memory area is stored by putting the LSO towards lower memory addresses and the MSO at higher memory addresses;
- nibble a sequence of four bits, i.e., half an octet;
- number the measure of a quantity, or of an amount, defining a concept perceived by an entity;
- numeral the expression of a number, that is, a symbol, a signal, or a sequence of symbols or signals instantiating the concept of a number. For example, the sequence 12 from the decimal numeral system using Arabic digits expresses the concept of the number of months in the year of the Gregorian calendar, but the same number can be expressed with the Roman numerals as XII or in the English language as twelve;
- octet a sequence of eight bits;
- percent encoding (also called URL-encoding) is a method to encode octets of arbitrary value using only ASCII characters that are not reserved for representing URIs; more details on this encoding are in the dedicated section;
- sequence an ordered list of homogeneous (i.e., having the same type) objects;
- shellcode a portion of code that has the purpose of letting a hacker or a cracker gain control over a machine, in some cases launching a command shell (thus the name);
- string a sequence of characters;
- symbol the representation of a concept; in the context of this paper, it generally identifies a computer character;
- URL-encoding see percent encoding;
- URL-safe URLs can be written according to the specifications [5,6]; even if printable, some characters are reserved for coding or due to possible misinterpretation by other protocols or systems and, as such, must not be used in URLs. A character is considered URL-safe if it is printable and is not reserved; a character string is URL-safe if all its characters are URL-safe;
- wide character a data type having a size of one or more octets aimed at containing a character. The necessity of a char data type larger than 8 bits became clear when character sets having more than 256 symbols were defined. For example, UTF-8 and UTF-16 encodings of UCS require up to four octets to represent a character. In general, language implementations define a wide character as two octets, for example, if UTF-16 is used (but surrogate pairs require two of them), or four octets to contain all the UTF-32 code points. Note that, since it is compiler-specific, the size of a wide character can also be a single octet.
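As an illustration of the Byte Order Mark entry above, the following minimal Python sketch inspects the first octets of a stream and reports which BOM, if any, is present. The function name `sniff_bom` and the returned encoding labels are hypothetical choices made for this example; the BOM constants come from the standard `codecs` module.

```python
import codecs

# Candidate BOMs. The UTF-32 little-endian BOM starts with the same two
# octets as the UTF-16 little-endian BOM, so the 4-octet marks are tested
# before the 2-octet ones.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# Usage: strip the BOM and decode the remaining octets accordingly.
raw = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")
name, size = sniff_bom(raw)
text = raw[size:].decode(name)
```

The detection is only a heuristic: a UTF-16 little-endian stream whose first character after the BOM is U+0000 is indistinguishable from a UTF-32 little-endian BOM, and the sketch resolves that case in favor of UTF-32.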
3. Character Encodings
- ASCII: the American Standard Code for Information Interchange [7,8] is a method for encoding a set of characters. Standard ASCII uses 7 bits to portray code points, which are associated with printable and non-printable symbols. The 95 code points having values from $\mathrm{20}_{16}$ to $\mathrm{7E}_{16}$ designate printable characters comprising the letters of the Latin (and English) alphabet, both uppercase and lowercase, the ten decimal digits, the space, and 32 punctuation, mathematical, and special characters, like ampersand and tilde. The remaining 33 code points (code points from $\mathrm{00}_{16}$ to $\mathrm{1F}_{16}$, called C0 control codes, and $\mathrm{7F}_{16}$) are associated with non-printable control characters corresponding to commands for printers, disks, modems, or other peripherals.
- EBCDIC: the Extended Binary Coded Decimal Interchange Code is an 8-bit character code developed by IBM in the sixties and, as its name says, extends the 6-bit code BCDIC (Binary Coded Decimal Interchange Code) used for punched cards having two groups of rows (named zone and number) [9]. Due to the mechanical constraints of BCDIC, the EBCDIC encoding inherits some character representations, and for this reason, in the invariant part of the code alphabet, letters, both uppercase and lowercase, are not assigned to consecutive binary configurations; for example, the letter R is represented with $\mathrm{D9}_{16}$, and S is encoded with $\mathrm{E2}_{16}$. Control codes are represented with codes from $\mathrm{00}_{16}$ to $\mathrm{3F}_{16}$ plus $\mathrm{FF}_{16}$. Space, special characters, digits, and letters occupy a part from $\mathrm{40}_{16}$ to $\mathrm{FE}_{16}$, but many codes are left free and assigned by each code page for each world language. Moreover, code pages for languages not using the Latin alphabet may also reassign the codes for the Latin letters.
- ISO/IEC 646: a 7-bit encoding strictly related to ASCII is ISO/IEC 646 [10]. ISO/IEC 646 consists of multiple 7-bit standard character sets sharing a common Basic Character Set composed of the ten decimal digits, the space, some basic punctuation and mathematical characters, and the uppercase and lowercase letters of the ISO Basic Latin alphabet (which coincides with the English alphabet); the first 32 codes are the same control codes of ASCII, as well as code $\mathrm{7F}_{16}$, which is DEL (the code associated with DEL is a legacy of punched cards: given that the presence of a hole represented a 1-valued bit, the solution adopted to delete, i.e., mark as unusable, a previously punched 7-bit character was to overwrite it with seven 1 bits, transforming any character into DEL; zeroing a bit was not practical because it required shutting a hole in the card). Twelve (12) codes are free to be used by national variants to represent their language characters (for example, è, à, etc. of the Italian language). Code $\mathrm{23}_{16}$ is allowed to be £ or #, and code $\mathrm{24}_{16}$ must be $ or the character for international unspecified currency ¤; nonetheless, some national variants change these two characters.
- ISO/IEC 8859 [11]: this family of encoding standards was developed with the objective of enriching the ASCII standard with symbols present in alphabets based on the Latin alphabet but that also contain new characters or modifications of the ones already present (e.g., accented characters or diacritics). These standards use an 8-bit encoding; given that the added configurations are not able to cover all the alphabets, many parts have been developed inside ISO/IEC 8859: for example, ISO/IEC 8859-4 is called Latin-4 North European and covers Estonian, Latvian, Lithuanian, Greenlandic, and Sami alphabets.
- ISO/IEC 10646 is a family of standards defining the Universal Coded Character Set (UCS) that evolved in time and converged with Unicode in 1991 for the representation of characters (see next list point about Unicode); nonetheless, Unicode adds other attributes and procedures tied to the use of the defined characters (e.g., writing direction). The characters (i.e., code points) represented by ISO/IEC 10646 are those encoded by UTF-16, namely those in the range from U+0000 to U+10FFFF: these code points are represented with 2 or 4 octets (see UTF-16 in the following). Also, ISO/IEC 10646 allows an encoding over 4 octets called UTF-32, simply expressing the code point over 32 bits.
- Unicode [12] is a standard that defines a unique number, called a code point, for every character in the world, regardless of the language, script, or system. There are currently almost code points in Unicode, covering alphabets, symbols, emojis, and more. Note that Unicode does not define an encoding method but a unique correspondence between code point and character. Different Unicode encodings (Unicode Transformation Formats) have been defined, as described in the following.
- UTF-7 [13] (Unicode Transformation Format 7-bit) is an obsolete character encoding scheme that was created to represent and transmit Unicode characters through systems that only handle 7-bit ASCII data. The UTF-7 encoding scheme uses base64 to encode non-ASCII characters into ASCII characters. The Unicode Consortium never approved UTF-7 as an official standard. It has security problems that made software developers stop using it. HTML 5 does not allow it. In UTF-7, escaping is used to encode non-ASCII characters, using the character “+” to indicate the start of an escaped sequence, followed by a base64 encoding of the non-ASCII character, and terminated by a “-” character or the end of the string. For example, the character é (U+00E9) is encoded as +AOk- in UTF-7: the “+” at the beginning signals the start of the escape, AOk is the base64 encoding (without padding) of the two octets $\mathrm{00E9}_{16}$ that represent é in big-endian UTF-16, and the “-” at the end indicates the end of the escape.
- UTF-8 [14] is a way of representing any Unicode character using one to four octets. It was invented by Ken Thompson and Rob Pike in 1992 as a coding method that is simple, efficient, and backward compatible with ASCII. UTF-8 was soon adopted by the Internet Engineering Task Force (IETF) and the Unicode Consortium as a standard for Unicode encoding. Features of UTF-8 are as follows:
- It can encode any Unicode character using one to four octets, depending on its value. The first 128 characters, which correspond to ASCII, are encoded using one octet. The higher the value of the character, the more octets it requires.
- It is self-synchronizing, meaning that it is possible to find the start of a character by looking at the prefix bits of each octet. The first octet of a multi-octet sequence has a certain number of prefix bits (110, 1110, or 11110) that indicate the number of octets in the sequence. The continuation octets have the prefix bits 10. This makes it easy to parse and manipulate UTF-8 strings.
- It is error-resistant, meaning that it can detect and recover from invalid or corrupted sequences. If an octet does not match the expected pattern, it can be skipped or replaced with a replacement character. This prevents the propagation of errors and the loss of data.
- It is compact, meaning that it uses less space than other Unicode encodings for most texts. This is especially true for texts that contain mostly ASCII characters, such as English or HTML.
- UTF-16 [15] is a standard encoding format that is capable of representing all the Unicode code points, from U+000000 to U+10FFFF, using 2 or 4 octets. The code points in the Basic Multilingual Plane (BMP) have an associated value in the range from $\mathrm{0000}_{16}$ to $\mathrm{FFFF}_{16}$, i.e., from 0 to $2^{16}-1$, and are encoded by UTF-16 with their representation in 2 octets, except the range of values from $\mathrm{D800}_{16}$ to $\mathrm{DFFF}_{16}$, called surrogate range: this range of values is reserved (i.e., these values do not represent any symbol) to also encode the symbols not belonging to the BMP, corresponding to the code points from U+010000 to U+10FFFF. Given a code point having a value v greater than $\mathrm{FFFF}_{16}$, the exceeding part from $\mathrm{10000}_{16}$ is computed, i.e., $v - \mathrm{10000}_{16}$: the resulting value can be expressed with 20 bits; this quantity is split into two parts of 10 bits: the most significant part is added (more efficiently, OR-ed) to $\mathrm{D800}_{16} = 110110\,\mathit{0000000000}_{2}$ (where the digits in italics are the less significant 10 bits), producing the high surrogate, and the less significant part is OR-ed to $\mathrm{DC00}_{16} = 110111\,\mathit{0000000000}_{2}$, resulting in the low surrogate. In this case, a code point is encoded with four octets. The result is that all the Unicode code points can be unequivocally encoded with two or four octets, and the first pair of octets allows one to immediately distinguish the presence of a surrogate pair; moreover, the most significant six bits of each element of a surrogate pair distinguish a high surrogate from a low surrogate, allowing a correct reconstruction of the original code point. When not already specified by the encoding, the endianness may be detected by prepending the BOM (Byte Order Mark) U+FEFF (representing the Unicode zero-width no-break space) that, if read as $\mathrm{FFFE}_{16}$, reveals a little-endian encoding, whilst if left unaltered, specifies a big-endian encoding.
- UTF-32: UTF-32 [16] is an encoding format for Unicode that uses 4 octets to represent all the Unicode code points (from U+0000 to U+10FFFF): this leaves, for every code point, the 11 Most Significant Bits zeroed. Due to the use of 4 octets for every symbol, it is less space efficient than UTF-8 and UTF-16 but has the advantages of a fixed-length encoding: in a stream of UTF-32 encoded data, the i-th code point starts at the $4i$-th octet (counting both code points and octets from zero). Note that, given that a complex character may be represented using more than one code point, the previous direct access formula may be applied to a sequence of code points but not to the characters they represent (requiring a linear reading of all the code points to arrive at the one(s) of the character under examination). The endianness is detected by the encoding of the BOM U+FEFF.
- UTF-EBCDIC (originally called EBCDIC-Friendly UCS Transformation Format, EF-UTF) is a transformation format specified in [17] that may represent all the Unicode points up to plane 16 (from U+0000 to U+10FFFF) with 1 to 5 octets; moreover, UTF-8-Mod, the encoding used by UTF-EBCDIC, may represent all the UCS-4 code points, namely up to U+7FFFFFFF, encoding them with a maximum of 7 octets. The UTF-EBCDIC encoding goes through two reversible steps: a Unicode code point is first converted into what is called an I8-sequence (of octets) obtained from an adapted UTF-8 encoding, and then the derived octets are singularly remapped with a reversible transformation. The objective of this transformation format is to map the code points from U+0000 to U+009F to a single octet of the same value and then to remap these octet values to match the value of the corresponding character in the EBCDIC encoding. Moreover, the values from $\mathrm{00}_{16}$ to $\mathrm{9F}_{16}$ never appear in any octet that is part of a sequence encoding code points greater than U+009F.
- UTF-1 (“ISO IR 178: UCS Transformation Format One (UTF-1)”, [18,19]) is an encoding method for Unicode and ISO/IEC 10646 characters. It is capable of representing the UCS characters from U+0000 to U+7FFFFFFF (even if Unicode is upper limited to U+10FFFF). Every character is encoded with one, two, three, or five octets, depending on its value. The objective of this encoding is to generate octet sequences that do not contain the octets in the sets of the C0 or C1 control codes, along with space ($\mathrm{20}_{16}$) and DEL ($\mathrm{7F}_{16}$), obviously apart from the octet representing the control code itself. This leaves $256 - 32 - 32 - 2 = 190$ usable octet representations, leading to an alphabet for a Base190 encoding that is saved in a variable-length octet sequence as previously mentioned: this alphabet is selected by means of a function and its inverse ($T$ and $U$ in [18,19]) that filter the allowed subset of octet configurations (namely, all but the C0 or C1 control codes, the space, and DEL).
- UTF-5 was proposed in the year 2000 with an Internet Draft [20], now expired, as a transformation format for ISO/IEC 10646 and Unicode. The objective of UTF-5 was to develop a format alternative to UTF-7, UTF-8, and UTF-16 for systems, applications, and protocols unable to process 7-bit or 8-bit strings. Examples of proposed applications are the representation of names used in the Domain Name System (DNS) and addresses in the Simple Mail Transfer Protocol (SMTP). In [20], a sequence of 5 bits is called a quintet, whose value is represented with the 32 characters of the alphabet
0123456789ABCDEFGHIJKLMNOPQRSTUV
(as noted in [20], each character can be encoded in binary with any code, typically ASCII). Considering a UCS-4 Unicode 32-bit representation of a symbol, UTF-5 encodes the nibbles (four-bit aggregations) from left to right, starting from the first non-zero-valued nibble (the UCS value 0 is considered as a single nibble). The sequence of nibbles is represented as a sequence of quintets, one nibble in the four rightmost bits of each quintet: the first (most significant) quintet will have its leftmost (most significant) bit (MSB) set (i.e., valued 1), and the other quintets will have the MSB reset (valued 0). For example, the UCS-4 value $\mathrm{00000000}_{16}$ will be represented in UTF-5 as G, while the UCS-4 symbol $\mathrm{000ABCDE}_{16}$ will be encoded as QBCDE.
- UTF-6 was proposed in the (now expired) Internet Draft [21]. This encoding augments UTF-5 (hence the name UTF-6), adding compression by leveraging the intrinsic redundancy of UTF-16 encoded host names for the DNS. The compressed data distinguishes a redundant octet using the Y character and a redundant nibble with the Z character; in addition to these two characters, the alphabet is composed of the same 32 (lowercase) symbols of UTF-5, namely
0123456789abcdefghijklmnopqrstuv
The resulting string is prefixed by the sequence of characters wp--.
- UTF-9 and UTF-18 are introduced in [22] as an April Fools’ Day RFC from IETF whose application is possible even with reduced time/space efficiency for present architectures having 8-bit bytes (note that in the past the term byte referred to sequences of bits with different lengths; thus, we prefer the term octet when dealing with 8-bit bytes). RFC 4042 describes a Unicode transformation format aimed at architectures using 9-bit addressable units, called nonets. The code points from U+0000 to U+00FF are encoded in one nonet with the MSB unset. The code points from U+0100 to U+10FFFF are encoded with two (for code points in the BMP) or three nonets depending on the required space to save the meaningful bits in the eight bits of each nonet, setting the MSB to indicate continuation on the next nonet and leaving the last nonet MSB unset (note that in [22] the code points outside the BMP are identified in the range U+1000–U+10FFFF instead of U+10000–U+10FFFF, presumably due to a typo). With this technique, also the remaining UCS-4 code points (from U+110000 to U+7FFFFFFF) are UTF-9 encoded with three or four nonets. UTF-18 encodes the Unicode planes 0, 1, 2, and 14 only. The first three plane code points map directly to the 18 bits of UTF-18, whilst plane 14 code points are stored in UTF-18 after subtracting (it seems that this value is reported as in [22], apparently due to a typo) from the code point value. In [23], a comparison between Unicode encodings requiring a zeroed most significant bit in each octet and those that are 8-bit clean is reported.
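The UTF-5 rule described in the list above (one nibble per quintet, with only the leading quintet having its MSB set) can be sketched in Python as follows. This is an illustrative reading of the textual description, not the reference code of [20]; the function names are hypothetical.

```python
UTF5_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def utf5_encode_code_point(cp: int) -> str:
    """Encode one UCS-4 value as a UTF-5 string of quintet characters."""
    if not 0 <= cp <= 0xFFFFFFFF:
        raise ValueError("value does not fit in 32 bits")
    # Collect the nibbles from least to most significant; the value 0
    # still yields one (zero-valued) nibble.
    nibbles = []
    while True:
        nibbles.append(cp & 0xF)
        cp >>= 4
        if cp == 0:
            break
    nibbles.reverse()
    # Only the first quintet carries the marker bit (value 16).
    quintets = [nibbles[0] | 0x10] + nibbles[1:]
    return "".join(UTF5_ALPHABET[q] for q in quintets)


def utf5_decode(text: str) -> list:
    """Decode a UTF-5 string into a list of UCS-4 values."""
    values = []
    current = None
    for ch in text:
        q = UTF5_ALPHABET.index(ch)  # raises ValueError on foreign symbols
        if q & 0x10:
            # A set MSB starts a new code point.
            if current is not None:
                values.append(current)
            current = q & 0xF
        else:
            if current is None:
                raise ValueError("stream must start with a leading quintet")
            current = (current << 4) | q
    if current is not None:
        values.append(current)
    return values
```

Because the marker bit identifies the first quintet of every code point, concatenated encodings need no separator: `utf5_decode` splits them again without additional framing.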
- UCS-2: this definition is now superseded and should not be used anymore; originally, it referred to a character representation over 2 octets for symbols now located in the Basic Multilingual Plane [24].
- UCS-4: the ISO/IEC 10646 standard originally defined the Universal Character Set to represent characters in 4 octets in the range from $\mathrm{00000000}_{16}$ to $\mathrm{7FFFFFFF}_{16}$ but subsequently limited the range to the values from $\mathrm{000000}_{16}$ to $\mathrm{10FFFF}_{16}$, and currently UCS-4 and UTF-32 indicate the same set [24].
- ISO/IEC 2022: the ISO/IEC 2022 standard [25] is a framework developed for encoding character sets in a way that enables switching between multiple character sets within a single data stream. Originally published in 1986 and known as “Information technology – Character code structure and extension techniques”, ISO/IEC 2022 supports the use of multiple national and international character encodings, particularly in contexts where 7-bit or 8-bit communication channels are used. At its core, ISO/IEC 2022 provides mechanisms for code extension by defining escape sequences (starting with the ESC octet, $\mathrm{1B}_{16}$) that designate or invoke various character sets. It defines four working sets of graphic characters, referred to as G0, G1, G2, and G3 [26], where different graphic character sets can be loaded and then accessed. In particular, special sequences of control characters (escape sequences) are used to designate which specific character set is loaded into each of the G0, G1, G2, or G3 slots, while other control characters (shift functions) are then used to invoke or load one of these designated G-sets to be the currently used set for interpreting subsequent octet values. The standard also defines code areas within the 7-bit or 8-bit code spaces where these graphic character sets are invoked (made active for interpreting subsequent octets): for 7-bit environments, the primary code area for graphic characters is called GL (Graphic Left), while in 8-bit environments, an additional code area called GR (Graphic Right) is used. This enables compatibility with legacy systems and multilingual text processing, especially in East Asian languages, which require large character sets.
ISO/IEC 2022 underpins several regional and application-specific standards, such as ISO-2022-JP (for Japanese [27]), ISO-2022-KR (for Korean [28]), and ISO-2022-CN (for Chinese [29]), which are still used in specific contexts like e-mail transmission (per MIME specifications) despite being largely supplanted by Unicode in modern applications. While flexible, ISO/IEC 2022 has been criticized for its complexity, especially in parsing and rendering text streams. Its stateful encoding model, requiring the interpretation of escape sequences to know which character set is active in which code area, makes it error-prone and challenging to implement compared with stateless encodings like UTF-8. Nonetheless, its historical significance and influence on character encoding architectures remain substantial.
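Returning to the UTF-16 item of the list above, the surrogate computation for code points outside the BMP can be sketched in Python as follows. The function names are hypothetical, and the sketch handles single code points only (no BOM and no octet serialization).

```python
def to_surrogates(cp: int):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    v = cp - 0x10000              # 20-bit excess over 0x10000
    high = 0xD800 | (v >> 10)     # most significant 10 bits
    low = 0xDC00 | (v & 0x3FF)    # least significant 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Rebuild the code point from a (high, low) surrogate pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + (((high & 0x3FF) << 10) | (low & 0x3FF))


# Usage: U+1F600 is outside the BMP and needs a surrogate pair.
high, low = to_surrogates(0x1F600)
as_octets = high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

The range checks in `from_surrogates` test exactly the six most significant bits mentioned in the text: 110110 for a high surrogate and 110111 for a low surrogate.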
4. The Printable Encoding Model
- Representation level: This level provides the alphabet symbols used to encode the binary values and, possibly, other symbols (e.g., CR, LF, =) that are used to format the resulting sequence of characters;
- Coding level: This level uses the symbols provided by the Representation level to encode and decode blocks of binary data sequences of possibly various lengths; it may provide the upper level with means to signal the end of the encoded sequence;
- Stream level: This level composes and formats the encoded blocks and reversibly extracts them from and to the Coding level;
- Application level: The encoded stream is encapsulated into a larger object (like an e-mail or a file), possibly enveloped into a structure that determines its format.
5. Binary to Printable Encodings
5.1. Base Printable Encodings
- The Base64, Base32, and Base16 defined in [4] are amongst the most widely known representations of binary data in printable form. Base64 may use different alphabets, but the canonical one defined in [4] is the following:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=
This alphabet is composed of 65 symbols: the first 64 are used for representing the octet strings, as will be explained, and the equal sign “=” is used for padding. This alphabet does not care about using human-distinguishable glyphs because the encoding is principally reserved for data transmission between computer systems. The encoding (called base64) is easily performed considering the binary expression of octets: a sequence of 3 octets is divided into 4 sextets (i.e., groups of 6 bits), and the value of each sextet is used as an index in the alphabet to get a character. In case the input octet length is not a multiple of 3, then two cases may happen:
- An ending single octet: The input is padded with 4 zero-valued bits, the two sextets are encoded, and two equal signs are appended (“==”);
- Two ending octets: The input is padded with 2 zero-valued bits, the three sextets are encoded, and one equal sign is appended (“=”).
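The sextet splitting and the two padding cases just listed can be sketched in Python as follows. The function `b64_encode` is a hypothetical, didactic re-implementation; the standard `base64` module is used only to cross-check the result.

```python
import base64

B64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def b64_encode(data: bytes) -> str:
    """Encode octets in base64, three octets (four sextets) at a time."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        n_bits = 8 * len(chunk)
        # Zero bits needed to reach a multiple of 6:
        # 0 for three octets, 2 for two octets, 4 for a single octet.
        pad_bits = (-n_bits) % 6
        value = int.from_bytes(chunk, "big") << pad_bits
        n_sextets = (n_bits + pad_bits) // 6
        for k in reversed(range(n_sextets)):
            out.append(B64_ALPHABET[(value >> (6 * k)) & 0x3F])
        # Complete the group of four characters with equal signs.
        out.append("=" * (4 - n_sextets))
    return "".join(out)


# Cross-check against the standard library on the three possible tails.
for sample in (b"M", b"Ma", b"Man"):
    assert b64_encode(sample) == base64.b64encode(sample).decode("ascii")
```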
At the application level, it may be required to write the output stream with Carriage Return/Line Feed to limit the length of the lines of Base64 encoded data (see, for example, MIME [31] and PEM [32]).
A variant of the previous alphabet, called base64url ([4] sec. 5), was proposed for URL and filename safe encodings. In particular, the new alphabet substitutes the last two symbols with “-” (minus symbol) and “_” (underline, or underscore symbol). The new alphabet is
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_=
and if padding is used in URIs, then the equal sign should be percent encoded.
Base32 performs an encoding (called base32) analogous to Base64 but using an alphabet of 33 symbols, namely the 26 letters of the Latin alphabet, the decimal digits from 2 to 7, and the padding equal sign “=”. Extensionally, the alphabet is
ABCDEFGHIJKLMNOPQRSTUVWXYZ234567=
Input sequences of 5 octets are re-interpreted as sequences of 8 quintets (i.e., groups of five bits), and the value of each quintet is used as an index for a character in the Base32 alphabet vector. When the input octet length is not a multiple of 5, then four cases requiring padding may happen:
- An ending single octet: the input is padded with 2 zero-valued bits, the two quintets are encoded, and six equal signs are appended (“======”);
- Two ending octets: the input is padded with 4 zero-valued bits, the four quintets are encoded, and four equal signs are appended (“====”);
- Three ending octets: the input is padded with 1 zero-valued bit, the five quintets are encoded, and three equal signs are appended (“===”);
- Four ending octets: the input is padded with 3 zero-valued bits, the seven quintets are encoded, and one equal sign is appended (“=”).
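The four Base32 padding cases follow from simple arithmetic on the number of trailing octets, as the following Python sketch shows. The helper `base32_tail` is hypothetical; the standard `base64.b32encode` function is used to confirm the number of equal signs.

```python
import base64


def base32_tail(n_octets: int):
    """Return (zero_bits, quintets, equal_signs) for a final group of 1..4 octets."""
    if not 1 <= n_octets <= 4:
        raise ValueError("a partial Base32 group holds 1 to 4 octets")
    bits = 8 * n_octets
    zero_bits = (-bits) % 5            # bits added to reach a multiple of 5
    quintets = (bits + zero_bits) // 5
    return zero_bits, quintets, 8 - quintets


for n in range(1, 5):
    zero_bits, quintets, equal_signs = base32_tail(n)
    encoded = base64.b32encode(b"\xff" * n).decode("ascii")
    # The library output must end with the computed number of "=" signs.
    assert encoded.count("=") == equal_signs
    print(n, zero_bits, quintets, equal_signs, encoded)
```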
In ([4] sec. 7), a Base32 encoding with an Extended Hex Alphabet called base32hex is proposed. The alphabet used in this case is
0123456789ABCDEFGHIJKLMNOPQRSTUV=
The encoding method is the same as the base32 one, but the resulting printable strings keep the sort order of the input binary data from which they were computed.
Base16 encoding, called base16 or hex, is based on the classical sixteen symbols of the alphabet
0123456789ABCDEF
and represents each octet with two characters using each nibble value as an index in the alphabet vector.
- Base36: The Base36 is made available in many programming languages and is based on an alphabet composed of the ten decimal digits and the 26 letters (in general, case insensitive) of the Latin alphabet; that is,
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ
In particular, JavaScript has the methods Number.prototype.toString() [33] and Number.parseInt() [34], Python has the functions numpy.base_repr() [35] and int() [36], and PHP uses the function base_convert() [37] to support conversions between bases from 2 to 36. Also, spreadsheets have the function BASE(), which converts a number into a base from 2 to 36, and the function DECIMAL(), which interprets a string in a base from 2 to 36, converting it into a number.
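As a small illustration of Base36 in Python, the built-in `int()` already parses Base36 numerals, while the opposite direction can be written with repeated division. The helper `to_base36` below is a hypothetical sketch using only the standard library.

```python
DIGITS36 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def to_base36(n: int) -> str:
    """Return the Base36 numeral of a non-negative integer."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    if n == 0:
        return "0"
    symbols = []
    while n:
        n, remainder = divmod(n, 36)   # peel off the least significant digit
        symbols.append(DIGITS36[remainder])
    return "".join(reversed(symbols))


# int() accepts both uppercase and lowercase letters as Base36 digits.
assert int(to_base36(123456789), 36) == 123456789
assert int("z", 36) == int("Z", 36) == 35
```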
- Base41: base 41 was chosen in [38] (a related web page with sample code is available at [39]) because 41 is the minimum number of symbols that must be used over three symbol positions to represent all binary configurations of 16 bits (i.e., a pair of octets).
The main objectives of [38] are the use of an alphabet containing only uppercase and lowercase Latin letters, avoiding special characters to have the widest possible applicability (for example, the printable strings produced by the conversion are URL-safe); moreover, letters that can lead to human visual misinterpretation (e.g., uppercase I and lowercase l) are not present in the proposed 41-letter alphabet:
ABCDFGHJKLMNQRSTUVXZabcdefhikmnopqrstuvxz
The method of [38] allows the printable encoding of octet and bit strings of any length.
In 2014, [40] proposed a method for binary-to-text encoding using an alphabet of 41 characters for encoding pairs of octets with three symbols. The intensional definition of the alphabet is the set of ASCII characters from code 41 to code 81, that is, extensionally,
)*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQ
The work in [40] allows the conversion of octet strings of even length only.
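The minimality claim behind the choice of 41 symbols is plain arithmetic and can be checked with a few lines of Python (a verification sketch, not code taken from [38] or [40]):

```python
BITS = 16      # a pair of octets
SYMBOLS = 3    # symbol positions available for each pair

# Search for the smallest base whose three-symbol numerals cover all 16-bit values.
base = 2
while base ** SYMBOLS < 2 ** BITS:
    base += 1

print(base)                          # 41
print(40 ** SYMBOLS, 2 ** BITS)      # 64000 65536: base 40 is not enough
print(base ** SYMBOLS - 2 ** BITS)   # 3385 three-symbol strings remain unused
```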
- Base45: an encoding based on 45 symbols is proposed in [41]. The number of alphabet symbols is chosen according to the QR code alphanumeric encoding, which is based on the following 45 symbols (note the ‘space’ character between Z and $):
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:
Using this alphabet, octet strings to be stored in a QR code are represented in printable form, and then pairs of Base45 characters are saved in strings of 11 bits (6 bits only to represent a possible single final character).
Pairs of octets are encoded with three Base45 symbols; if a single octet is present at the end of an odd-length octet string, then it is converted to Base45 with two symbols. Decoding performs the Base45 to binary conversion, and care must be taken to signal an error in case of decoded values greater than 65535 when starting from three characters or greater than 255 when decoding two characters.
Due to the alphabet employed, this encoding may produce strings that are not URL-safe, requiring, in some cases, an additional percent encoding.
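A minimal Python sketch of the Base45 scheme described above is given below. It follows the symbol order of RFC 9285 [41], in which the least significant Base45 digit is written first; the function names b45encode and b45decode are illustrative only.

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:"

def b45encode(data: bytes) -> str:
    out = []
    for i in range(0, len(data) - 1, 2):      # whole pairs of octets
        n = data[i] * 256 + data[i + 1]
        n, c = divmod(n, 45)
        e, d = divmod(n, 45)                  # n = c + 45*d + 45*45*e
        out += [ALPHABET[c], ALPHABET[d], ALPHABET[e]]
    if len(data) % 2:                         # single ending octet
        d, c = divmod(data[-1], 45)
        out += [ALPHABET[c], ALPHABET[d]]
    return "".join(out)

def b45decode(text: str) -> bytes:
    vals = []
    for ch in text:
        idx = ALPHABET.find(ch)
        if idx < 0:
            raise ValueError(f"character {ch!r} is not in the Base45 alphabet")
        vals.append(idx)
    if len(vals) % 3 == 1:
        raise ValueError("invalid Base45 length")
    out = bytearray()
    for i in range(0, len(vals) - 2, 3):      # three symbols -> two octets
        n = vals[i] + 45 * vals[i + 1] + 2025 * vals[i + 2]
        if n > 0xFFFF:
            raise ValueError("three-symbol group exceeds 65535")
        out += n.to_bytes(2, "big")
    if len(vals) % 3 == 2:                    # two symbols -> one octet
        n = vals[-2] + 45 * vals[-1]
        if n > 0xFF:
            raise ValueError("two-symbol group exceeds 255")
        out.append(n)
    return bytes(out)

print(b45encode(b"AB"))        # BB8
print(b45decode("QED8WEX0"))   # b'ietf!'
```

The two range checks in the decoder correspond to the error conditions mentioned in the text: for instance, the string GGW would decode to 65536 and is therefore rejected.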
- Base56: Python functions for encoding and decoding in a base with 56 symbols are available at [42]. The alphabet used is
23456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnpqrstuvwxyz
This set of characters avoids the visual ambiguity among the digits 0 and 1, lowercase L, uppercase I, and uppercase and lowercase O, none of which belongs to the alphabet. The encoding function performs the classical conversion between bases (see Appendix A.1).
- Base58: in [43], the binary representation is oriented towards the human interpretation of the encoded data, avoiding mistakes due to confusing zero (0) with capital O, or capital I with lowercase L (l); moreover, the alphabet avoids characters having special meanings for operating systems, like the slash (/) in file pathnames. Note that [43] is an expired Internet Engineering Task Force Internet-Draft; nonetheless, the encoding is used for Bitcoin addresses [44].
In the Bitcoin context, Base58Check [44] is used to represent Bitcoin addresses or any other octet sequence. The object to be converted is prefixed by an octet indicating its type, then SHA-256 is applied twice to the obtained sequence, producing a cryptographic hash: the first 4 octets of this hash are appended as a checksum to the sequence, and the result is converted into Base58. The aim of this checksum is to improve security by detecting typing errors or maliciously modified addresses/keys, making even small differences from the intended address/key evident.
The alphabet’s extensional definition is
123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz
The intensional specification of the same alphabet is: the decimal digits apart from 0, the uppercase Latin letters except I and O, and the lowercase Latin letters apart from l (lowercase L).
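The Base58Check steps described above (type octet, double hash, 4-octet checksum, Base58 conversion) can be sketched in Python as follows. The sketch assumes SHA-256 as the hash function and, following common Bitcoin practice that is not spelled out in the description above, writes each leading zero octet as the symbol “1”; the function names are illustrative.

```python
import hashlib

ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58_encode(data: bytes) -> str:
    n = int.from_bytes(data, "big")
    out = []
    while n:
        n, r = divmod(n, 58)
        out.append(ALPHABET[r])
    # Assumption (Bitcoin convention): every leading zero octet becomes "1",
    # the first symbol of the alphabet, so that the octet count is preserved.
    zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * zeros + "".join(reversed(out))

def base58check_encode(type_octet: int, payload: bytes) -> str:
    body = bytes([type_octet]) + payload
    # SHA-256 applied twice; the first 4 octets of the result are the checksum.
    checksum = hashlib.sha256(hashlib.sha256(body).digest()).digest()[:4]
    return base58_encode(body + checksum)

print(base58check_encode(0x00, bytes(20)))
```

Changing a single octet of the payload changes the checksum, so a mistyped string is very unlikely to decode to a sequence with a matching checksum.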
- Base62: the paper [47] uses a Base62 encoding as the final step of a compression algorithm. The alphabet used is
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789
This alphabet’s intensional definition is the Latin alphabet’s uppercase and lowercase letters and the ten decimal digits. The bit stream is divided into blocks of 6 bits, and the decimal value of each block is used as an index in the alphabet to define its encoding character. If the 6 bits have a value greater than 59 (thus, having values 60, 61, 62, or 63, i.e., binary 111100 to 111111), then only the first 5 bits, having binary value 11110 or 11111, are encoded with the characters 8 or 9, respectively, assigning the leftover bit to the following block of 6 bits. The last block of bits is zero-padded to have a correct length for the encoding with one of the 62 characters; the decoder will discard the excess zero bits to have a length compatible with an octet string.
In [48], an alphabet of 62 characters, namely
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
is used to define a transformation format, called UTF-62, for ISO 10646 UCS, a collection of standards that define an international set of characters. Code values representing characters can be expressed with 16 bits or 31 bits. In the case of 16-bit (UCS-2) code values, the number is converted to Base62 (employing the previously mentioned alphabet) using three symbols, where the most significant does not exceed V; on the other hand, 31-bit code values (UCS-4) require six Base62 symbols. Moreover, to distinguish them from UCS-2 encoded values, the first (most significant) symbol is shifted (increased) by 32 positions (due to the maximum UCS-4 value, no overflow may happen with this shift), leading to a most significant symbol not less than W.
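The 6-bit/5-bit rule of the Base62 encoding of [47] can be sketched in Python as below. This is only one reading of the description given above: the function names, the most-significant-bit-first order, and the handling of the final block are assumptions, and the actual implementation in [47] may differ in these details.

```python
ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz"
            "0123456789")

def b62_encode(data: bytes) -> str:
    bits = "".join(f"{octet:08b}" for octet in data)
    out, i = [], 0
    while i < len(bits):
        v = int(bits[i:i + 6].ljust(6, "0"), 2)   # the last block is zero-padded
        if v < 60:
            out.append(ALPHABET[v])               # a full 6-bit block
            i += 6
        else:
            # Values 60..63: only the 5-bit prefix (11110 or 11111) is encoded,
            # with '8' or '9'; the sixth bit is left to the following block.
            out.append(ALPHABET[60 + ((v >> 1) & 1)])
            i += 5
    return "".join(out)

def b62_decode(text: str) -> bytes:
    bits = []
    for ch in text:
        idx = ALPHABET.index(ch)
        bits.append(f"{idx:06b}" if idx < 60 else f"{idx - 30:05b}")
    s = "".join(bits)
    s = s[: len(s) - len(s) % 8]                  # discard the excess zero bits
    return bytes(int(s[j:j + 8], 2) for j in range(0, len(s), 8))

print(b62_encode(b"\xff\xff"))   # 999g
```

In the example, the sixteen 1-valued bits produce three 5-bit symbols (“9”) and a final zero-padded block 100000, i.e., index 32 (“g”).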
- Base85: an encoding using 85 printable characters was first introduced by P. E. Rutter for the utility btoa: this program originally used an alphabet composed of the ASCII characters from code 32 (space) to code 116 (character “t”), but to avoid problems with programs that skipped white spaces (ASCII code 32), the alphabet was shifted by one position, from code 33 (special character “!”) to code 117 (character “u”). The btoa utility added a header (made by the string xbtoa Begin) and a trailer (composed of the string xbtoa End, the original data length in decimal and in Base16, and three checksums). At the coding level, four 0-valued binary octets are encoded as the single character “z” instead of the five Base85 symbols “!!!!!”, and in a subsequent release of the btoa utility, four input octets valued 32 (space) are likewise encoded as the single character “y”.
Adobe Systems Incorporated developed functions documented in [49] that perform encoding and decoding in Base85, starting from chunks of four binary octets, interpreting them as a number to be converted into Base85 and returning five printable characters for each chunk (adding 33 to each value returned from the Base85 conversion). Thus, the alphabet used is
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu
That is, the ASCII characters from code 33 (special character “!”) to code 117 (character “u”).
At the coding level, in the case of four 0-valued binary octets, the character “z” substitutes the five Base85 symbols “!!!!!”; moreover, at the stream level, if the octet string has a length that is not a multiple of 4, then it is completed with 0-valued octets, converted into Base85, and a number of (least significant) symbols equal to the number of added octets is discarded from the result: this allows the decoder to restore the correct length of the original octet string.
At the application level, the Base85 encoded sequence is terminated with the two-character string “~>”.
RFC 1924 [50] proposes a compact Base85 encoding of IPv6 addresses. These addresses are 128 bits long, and 85 is the minimum base to write them in 20 characters. The extensional expression of the alphabet used is
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~
Also, the patch system of the Git version control system employs a Base85 encoding to store diff binary data.
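The coding-level and stream-level rules of the Adobe-style Base85 described above (five symbols per four octets, the “z” shortcut, and the truncation of a padded final chunk) can be sketched in Python. The function name ascii85_encode is illustrative; the result is compared with the standard-library function base64.a85encode, which implements the same conventions without the “~>” terminator unless adobe=True is passed.

```python
import base64

def ascii85_encode(data: bytes) -> str:
    out = []
    for i in range(0, len(data), 4):
        chunk = data[i:i + 4]
        pad = 4 - len(chunk)                      # 0-valued octets added to the last chunk
        n = int.from_bytes(chunk + bytes(pad), "big")
        if n == 0 and pad == 0:
            out.append("z")                       # four 0-valued octets -> "z"
            continue
        digits = []
        for _ in range(5):
            n, r = divmod(n, 85)
            digits.append(chr(r + 33))            # shift into the range "!".."u"
        # Most significant symbol first; drop one symbol per padding octet.
        out.append("".join(reversed(digits))[: 5 - pad])
    return "".join(out)

sample = b"hello"
print(ascii85_encode(sample))
print(ascii85_encode(sample) == base64.a85encode(sample).decode("ascii"))  # True
```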
- Base91: coding efficiency was the motivation in [51] for the development of a binary-to-text code based on the following alphabet of 91 ASCII symbols:
!"#$%&'()*+,/0123456789:;<>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
That is, the 95 printable characters purged of the space character, the dot, the dash, and the equal sign. The binary data to be encoded is divided into chunks of 13 bits that can be represented with two Base91 characters, leaving $91^2 - 2^{13} = 89$ unused Base91 pairs. Among these 89 configurations, twelve (from 8192 to 8203) are used to specify how many filling bits (if any) were added to the last chunk to make it 13 bits long: this allows the method to encode bit strings of arbitrary length (this is managed at the stream level). A Base91 code has a data expansion factor of $16/13 \approx 1.231$, lower than those of Base64 ($4/3 \approx 1.333$) and Base85 ($5/4 = 1.25$).
Different software using a 91-character alphabet was developed and made available at [52]:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789!#$%&()*+,./:;<=>?@[]^_`{|}~"
In this case, the encoding uses all the two-letter configurations to represent 13-bit chunks; in addition, the 13-bit chunks with decimal values from 0 to 88 are also associated with the 89 unused configurations, allowing one more bit to be encoded (for a total of 14 bits), with the net result of not wasting any configuration but requiring a higher level of encoding to establish the right length of the original bit string at the decoding stage (thus, the stream level is empty and does not control the data length, whose management is left to the application level).
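The counts used in the Base91 discussion follow from simple arithmetic, which the Python fragment below reproduces. The expansion factor is computed here as output bits (8 per character) over input bits, which is an assumption about how the factor is defined.

```python
from fractions import Fraction

pairs = 91 ** 2           # two-symbol configurations: 8281
chunks = 2 ** 13          # values of a 13-bit chunk: 8192
unused = pairs - chunks   # configurations not needed for the chunks
print(unused)             # 89

# Output bits (8 per character) divided by input bits.
expansion = {
    "Base64": Fraction(4 * 8, 3 * 8),   # 4 characters for 3 octets
    "Base85": Fraction(5 * 8, 4 * 8),   # 5 characters for 4 octets
    "Base91": Fraction(2 * 8, 13),      # 2 characters for 13 bits
}
for name, factor in expansion.items():
    print(name, factor, round(float(factor), 3))
```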
- Base94: in [53,54], a space efficiency analysis of the bases from 2 to 94 is presented, computing an optimal ratio between the number of input octets and output characters. The maximum base, 94, is obtained by restricting the possible output alphabet symbols to the printable ASCII characters, i.e., without the C0 control codes, and also excluding the space (ASCII code 32) and DEL (ASCII code 127) characters (leading to 128 − 32 − 2 = 94 symbols).
- Base122: the use of 122 symbols to encode binary data for HTML pages is proposed in [55]. The idea is to store binary data into the free bits of one- and two-octet UTF-8 encoded characters. Recalling UTF-8, the code points up to U+007F are represented with the single-octet binary string 0BBBBBBB (where each of the seven bits B can assume the value 0 or 1), and the remaining code points up to U+07FF are expressed with the two-octet string 110BBBBB 10BBBBBB (eleven free bits).
Given that web browsers cannot deal transparently with all single-octet UTF-8 representations, among the 128 configurations, six are unused, namely NUL, Line Feed, Carriage Return, backslash, ampersand, and double quotes. This leaves 122 possible single-octet UTF-8 configurations and requires the use of two UTF-8 octets for encoding the remaining six configurations. The eleven free bits of these octets are employed in the following way:
- Three bits encode one of the six configurations that were not possible to store in a single octet;
- One bit is forced to 1 to avoid the possible illegal UTF-8 encoding of a code point less than U+0080 over two UTF-8 octets;
- Seven bits store the following input binary data (note that in the second octet, 10BBBBBB of the UTF-8 encoding, any binary sequence is acceptable).
When bit strings of any length are encoded, the stream level can manage the termination in many ways: one proposal is to always pad with a bit string beginning with 1, followed by as many zeros as required to complete the UTF-8 characters.
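The use of the eleven free bits in a two-octet Base122 unit can be illustrated with the Python sketch below. The bit layout 110sss1n 10nnnnnn (three index bits, the forced 1, then the seven data bits) and the order of the six excluded characters are assumptions made for this example, and the function name two_octet_unit is hypothetical; [55] should be consulted for the exact layout.

```python
# Excluded single-octet configurations, in the order listed in the text
# (the index assigned to each one is an assumption of this sketch).
ILLEGAL = [0x00, 0x0A, 0x0D, 0x5C, 0x26, 0x22]  # NUL, LF, CR, backslash, ampersand, double quotes

def two_octet_unit(illegal_index: int, next7: int) -> bytes:
    """Pack 3 index bits, a forced 1 bit, and 7 data bits into 110sss1n 10nnnnnn."""
    if not (0 <= illegal_index < len(ILLEGAL) and 0 <= next7 < 128):
        raise ValueError("index or data bits out of range")
    first = 0b11000010 | (illegal_index << 2) | (next7 >> 6)
    second = 0b10000000 | (next7 & 0b00111111)
    return bytes([first, second])

unit = two_octet_unit(3, 0b1010101)
char = unit.decode("utf-8")          # always a valid two-octet UTF-8 character
print(unit.hex(), hex(ord(char)))    # the code point is never below U+0080
```

Because the forced bit is the fourth most significant of the eleven free bits, every resulting code point is at least U+0080, so no overlong two-octet encoding can be produced.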
5.2. Special Printable Encodings for Data Representation
5.2.1. Bootstring and Punycode
Algorithm 1 Algorithm for converting a number to base b with generalized variable-length integers using symbols and thresholds .
5.2.2. Quoted Printable
5.2.3. Percent Encoding
- :/?#[]@!$&’()*+,;=
- ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~
5.2.4. yEnc
Algorithm 2 Algorithm yEncode for an octet o.
5.2.5. Bech32
- qpzry9x8gf2tvdw0s3jn54khce6mua7l
5.3. Computational Complexity
6. Printable Encoding Applications
6.1. btoa() and atob()
6.2. uuencode and uudecode
6.3. xxencode and xxdecode
6.4. BinHex
6.5. Multipurpose Internet Mail Extensions
- !"#\$@[]^‘{|}~
6.6. BOO
6.7. QR Codes
6.8. Library Functions
- int isprint(int ch); int iswprint(wint_t wch);
- int isprint(int ch); int iswprint(std::wint_t ch);
6.9. Printable and Alphanumeric Code
6.10. Data Hiding
6.11. Security Considerations
7. Perspectives and Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
AIR | Asymptotic Inflation Ratio |
API | Application Programming Interface |
ASCII | American Standard Code for Information Interchange |
BaseY | representation system that uses the base Y positional numeral system for encoding binary data, e.g., Base64 |
BCDIC | Binary Coded Decimal Interchange Code |
BCH | Bose–Chaudhuri–Hocquenghem code |
BMP | Basic Multilingual Plane |
BOM | byte order mark |
CR | Carriage Return |
DNS | Domain Name System |
EBCDIC | Extended Binary Coded Decimal Interchange Code |
HTML | HyperText Markup Language |
HTTP | HyperText Transfer Protocol |
IDN | Internationalized Domain Name |
IDNA | Internationalizing Domain Names in Applications |
IDS | Intrusion Detection System |
LF | Line Feed |
LLM | Large Language Model |
LSO | Least Significant Octet |
MAC | Media Access Control |
MIME | Multipurpose Internet Mail Extensions |
MSB | Most Significant Bit |
MSO | Most Significant Octet |
PEM | Privacy Enhanced Mail |
QR code | Quick Response code |
RFC | Request for Comments |
S/MIME | Secure/Multipurpose Internet Mail Extensions |
SHA-256 | Secure Hash Algorithm 256 |
SMTP | Simple Mail Transfer Protocol |
UCS | Universal Coded Character Set |
URI | Uniform Resource Identifier |
URL | Uniform Resource Locator |
UTF | Unicode Transformation Format, or UCS Transformation Format |
ZWNBSP | Zero Width No-break Space |
Appendix A. Support Material
Appendix A.1. Converting a Number to Numeral in a Defined Base and Vice-Versa
Algorithm A1 Algorithm for converting a number to base b with symbols .
Algorithm A2 Algorithm for converting a base b numeral to the corresponding number n.
Appendix A.2. Symbols in a Numeral Required to Represent a Number in a Base
Appendix A.3. Asymptotic Inflation Ratio
References
- NAAPO—The North American Astrophysical Observatory. Big Ear Memorial Website. Available online: http://www.bigear.org/ (accessed on 5 June 2025).
- Ehman, J. Explanation of the Code “6EQUJ5” On the Wow! Computer Printout. Available online: http://www.bigear.org/6equj5.htm (accessed on 5 June 2025).
- Finney, H.; Donnerhacke, L.; Callas, J.; Thayer, R.L.; Shaw, D. OpenPGP Message Format. RFC 4880. 2007. Available online: https://www.rfc-editor.org/rfc/rfc4880.html (accessed on 5 August 2025).
- Josefsson, S. The Base16, Base32, and Base64 Data Encodings. RFC 4648. 2006. Available online: https://www.rfc-editor.org/rfc/rfc4648.html (accessed on 5 August 2025).
- Berners-Lee, T.; Masinter, L.M.; McCahill, M.P. Uniform Resource Locators (URL). RFC 1738. 1994. Available online: https://www.rfc-editor.org/rfc/rfc1738.html (accessed on 5 August 2025).
- Berners-Lee, T.; Fielding, R.T.; Masinter, L.M. Uniform Resource Identifier (URI): Generic Syntax. RFC 3986. 2005. Available online: https://www.rfc-editor.org/rfc/rfc3986.html (accessed on 5 August 2025).
- Cerf, V. ASCII Format for Network Interchange. RFC 20. 1969. Available online: https://www.rfc-editor.org/rfc/rfc20.html (accessed on 5 August 2025).
- INCITS 4-1986[R2022]; Information Systems—Coded Character Sets—7-Bit Standard Code for Information Interchange (7-Bit ASCII). ANSI: Washington, DC, USA, 2022.
- Mackenzie, C.E. Coded Character Sets, History and Development; The Systems Programming Series; Addison-Wesley Publishing Company, Inc.: Indianapolis, IN, USA, 1980.
- ISO/IEC 646:1991; Information Technology—ISO 7-Bit Coded Character Set for Information Interchange. International Organization for Standardization: Geneva, Switzerland, 1991; p. 15.
- ISO/IEC JTC 1/SC 2 Working Group. Standards by ISO/IEC JTC 1/SC 2 Coded Character Sets. Available online: https://www.iso.org/committee/45050/x/catalogue/p/1/u/0/w/0/d/0 (accessed on 5 June 2025).
- The Unicode Consortium. UNICODE. Available online: https://home.unicode.org/ (accessed on 5 June 2025).
- Goldsmith, D.; Davis, M. UTF-7 A Mail-Safe Transformation Format of Unicode. RFC 2152. 1997. Available online: https://www.rfc-editor.org/rfc/rfc2152.html (accessed on 5 August 2025).
- Yergeau, F. UTF-8, a Transformation Format of ISO 10646. RFC 3629. 2003. Available online: https://www.rfc-editor.org/rfc/rfc3629.html (accessed on 5 August 2025).
- Hoffman, P.E.; Yergeau, F. UTF-16, an Encoding of ISO 10646. RFC 2781. 2000. Available online: https://www.rfc-editor.org/rfc/rfc2781.html (accessed on 5 August 2025).
- Davis, M. Unicode Standard Annex #19 UTF-32. Available online: https://www.unicode.org/reports/tr19/tr19-9.html (accessed on 5 June 2025).
- Umamaheswaran, V.S. UTF-EBCDIC, Unicode Technical Report #16; Technical Report; Unicode, Inc.: San Francisco, CA, USA, 2002.
- ISO/IEC JTC 1/SC2/WG2; UCS Transformation Format One (UTF-1). International Organization for Standardization: Geneva, Switzerland, 1993; ISO/IEC 10646, First edition 1993, Registration number 178.
- ISO/IEC JTC 1/SC2/WG2; UCS Transformation Format One (UTF-1). International Organization for Standardization: Geneva, Switzerland, 2008. Available online: https://web.archive.org/web/20150318032101/http://kikaku.itscj.ipsj.or.jp/ISO-IR/178.pdf (accessed on 5 June 2025).
- Seng, J.; Duerst, M.; Tan, T.W. UTF-5, a transformation format of Unicode and ISO 10646. Internet-Draft draft-jseng-utf5-01, Internet Engineering Task Force. 2000. Available online: https://datatracker.ietf.org/doc/html/draft-jseng-utf5-01.txt (accessed on 20 July 2025).
- Welter, M.; Spolarich, B. UTF-6—Yet Another ASCII-Compatible Encoding for IDN. Internet-Draft draft-ietf-idn-utf6-00, Internet Engineering Task Force. 2000. Available online: https://datatracker.ietf.org/doc/html/draft-ietf-idn-utf6-00.txt (accessed on 20 July 2025).
- Crispin, M. UTF-9 and UTF-18 Efficient Transformation Formats of Unicode. RFC 4042. 2005. Available online: https://www.rfc-editor.org/rfc/rfc4042.html (accessed on 5 August 2025).
- Wikipedia. Comparison of Unicode Encodings. Available online: https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings (accessed on 5 June 2025).
- Allen, J.D.; Anderson, D.; Becker, J.; Cook, R.; Davis, M.; Edberg, P.; Everson, M.; Freytag, A.; Iancu, L.; Ishida, R.; et al. The Unicode Standard Version 7.0—Core Specification; Unicode, Inc.: San Francisco, CA, USA, 2014.
- ISO/IEC 2022:1994; Information Technology—Character Code Structure and Extension Techniques. International Organization for Standardization: Geneva, Switzerland, 1994. Available online: https://www.iso.org/standard/22747.html (accessed on 5 June 2025).
- CEN/TC 304 Project Team. Annex A, 8-Bit Character Sets. Available online: https://www.open-std.org/cen/tc304/guidecharactersets/guideannexa.html#_Toc443292242 (accessed on 5 June 2025).
- Murai, J.; Crispin, M.; van der Poel, E.M. Japanese Character Encoding for Internet Messages. RFC 1468. 1993. Available online: https://www.rfc-editor.org/rfc/rfc1468.html (accessed on 5 August 2025).
- Choi, U.; Chon, K.; Park, H. Korean Character Encoding for Internet Messages. RFC 1557. 1993. Available online: https://www.rfc-editor.org/rfc/rfc1557.html (accessed on 5 August 2025).
- Zhu, H.; Hu, D.; Wang, Z.; Kao, T.; Chang, W.; Crispin, M. Chinese Character Encoding for Internet Messages. RFC 1922. 1996. Available online: https://www.rfc-editor.org/rfc/rfc1922.html (accessed on 5 August 2025).
- Wikipedia. Binary-to-Text Encoding. Available online: https://en.wikipedia.org/wiki/Binary-to-text_encoding (accessed on 5 June 2025).
- Freed, N.; Borenstein, D.N.S. Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies. RFC 2045. 1996. Available online: https://www.rfc-editor.org/rfc/rfc2045.html (accessed on 5 August 2025).
- Linn, J. Privacy Enhancement for Internet Electronic Mail: Part I: Message Encryption and Authentication Procedures. RFC 1421. 1993. Available online: https://www.rfc-editor.org/rfc/rfc1421.html (accessed on 5 August 2025).
- Mozilla Corporation. JavaScript Reference, Number Constructor, Number.prototype.toString() Method. Available online: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Number/toString (accessed on 5 June 2025).
- Mozilla Corporation. JavaScript Reference, Number Constructor, Number.parseInt() Method. Available online: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Number/parseInt (accessed on 5 June 2025).
- NumPy Developers. Python numpy.base_repr. Available online: https://numpy.org/doc/stable/reference/generated/numpy.base_repr.html (accessed on 5 June 2025).
- Python Software Foundation. Python Built-in Functions int(). Available online: https://docs.python.org/3/library/functions.html#int (accessed on 5 June 2025).
- The PHP Documentation Group. PHP Math Functions, base_convert() Function. Available online: https://www.php.net/manual/en/function.base-convert.php (accessed on 5 June 2025).
- Botta, M.; Cavagnino, D. Base41: A proposal for printable encoding of bit strings. Eng. Rep. 2023, 5, e12606.
- Botta, M.; Cavagnino, D. Base41: A Method for Bit String Encoding in Printable Form. 2023. Available online: https://watermarking.di.unito.it/base41/index.html (accessed on 5 June 2025).
- Veljkovic, S. Base41. 2014. Available online: https://github.com/sveljko/base41 (accessed on 5 June 2025).
- Fältström, P.; Ljunggren, F.; van Gulik, D.W. The Base45 Data Encoding. RFC 9285. 2022. Available online: https://www.rfc-editor.org/rfc/rfc9285.html (accessed on 5 August 2025).
- Kunzmann, N. base56. 2024. Available online: https://github.com/foss-fund/base56 (accessed on 5 June 2025).
- Nakamoto, S.; Sporny, M. The Base58 Encoding Scheme. Internet-Draft draft-msporny-base58-03, Internet Engineering Task Force. 2021. Available online: https://datatracker.ietf.org/doc/draft-msporny-base58/03/ (accessed on 20 July 2025).
- Antonopoulos, A.M. Mastering Bitcoin: Unlocking Digital Cryptocurrencies; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2014. [Google Scholar]
- Piasecki, P. Why Is Ripple’s base58 Alphabet So Weird? 2013. Available online: https://bitcoin.stackexchange.com/questions/14124/why-is-ripples-base58-alphabet-so-weird (accessed on 5 June 2025).
- Elliott-McCrea, K. Manufacturing flic.kr Style Photo URLs. 2009. Available online: https://www.flickr.com/groups/api/discuss/72157616713786392/ (accessed on 5 June 2025).
- He, K.; Xu, X.; Yue, Q. A Secure, Lossless, and Compressed Base62 Encoding. In Proceedings of the 2008 11th IEEE Singapore International Conference on Communication Systems, Guangzhou, China, 19–21 November 2008; pp. 761–765.
- Wu, P.C. A base62 transformation format of ISO 10646 for multilingual identifiers. Softw. Pract. Exp. 2001, 31, 1125–1130.
- Adobe Systems Incorporated. PostScript Language Reference, 3rd ed.; Addison-Wesley Longman Publishing Co., Inc.: Boston, MA, USA, 1999.
- Elz, R. A Compact Representation of IPv6 Addresses. RFC 1924. 1996. Available online: https://www.rfc-editor.org/rfc/rfc1924.html (accessed on 5 August 2025).
- He, D.; Sun, Y.; Jia, Z.; Yu, X.; Guo, W.; He, W.; Qi, C.; Lu, X. A Proposal of Substitute for Base85/64–Base91. In Proceedings of the SUMMER 8th International Conference on Computing, Communications and Control Technologies, CCCT, Orlando, FL, USA, 29 June–2 July 2010.
- Henke, J. basE91 Encoding. 2006. Available online: https://base91.sourceforge.net/ (accessed on 5 June 2025).
- vorakl. Convert Binary Data to a Text with the Lowest Overhead. 2020. Available online: https://vorakl.com/articles/base94/ (accessed on 5 June 2025).
- vorakl. The Zoo of Binary-to-Text Encoding Schemes. 2020. Available online: https://vorakl.com/articles/stream-encoding/ (accessed on 5 June 2025).
- Albertson, K. Base-122 Encoding. 2016. Available online: https://blog.kevinalbs.com/base122 (accessed on 5 June 2025).
- Microsoft Corporation. BASE Function. Available online: https://support.microsoft.com/en-us/office/base-function-2ef61411-aee9-4f29-a811-1c42456c6342 (accessed on 5 June 2025).
- Microsoft Corporation. DECIMAL Function. Available online: https://support.microsoft.com/en-us/office/decimal-function-ee554665-6176-46ef-82de-0a283658da2e (accessed on 5 June 2025).
- Apache OpenOffice Wiki. BASE Function. Available online: https://wiki.openoffice.org/wiki/Documentation/How_Tos/Calc:_BASE_function (accessed on 5 June 2025).
- Apache OpenOffice Wiki. DECIMAL Function. Available online: https://wiki.openoffice.org/wiki/Documentation/How_Tos/Calc:_DECIMAL_function (accessed on 5 June 2025).
- The Document Foundation. BASE Function. Available online: https://help.libreoffice.org/latest/lo/text/scalc/01/func_base.html (accessed on 5 June 2025).
- The Document Foundation. DECIMAL Function. Available online: https://help.libreoffice.org/latest/lo/text/scalc/01/func_decimal.html (accessed on 5 June 2025).
- Cavagnino, D.; Werbrouck, A.E. Efficient Algorithms for Integer Division by Constants Using Multiplication. Comput. J. 2008, 51, 470–480.
- Warren, H.S. Hacker’s Delight, 2nd ed.; Addison-Wesley Professional: Boston, MA, USA, 2012.
- Costello, A.M. Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA). RFC 3492. 2003. Available online: https://www.rfc-editor.org/rfc/rfc3492.html (accessed on 5 August 2025).
- Fältström, P.; Hoffman, P.E. Internationalizing Domain Names in Applications (IDNA). RFC 3490. 2003. Available online: https://www.rfc-editor.org/rfc/rfc3490.html (accessed on 5 August 2025).
- Dürst, M.J.; Suignard, M. Internationalized Resource Identifiers (IRIs). RFC 3987. 2005. Available online: https://www.rfc-editor.org/rfc/rfc3987.html (accessed on 5 August 2025).
- Helbing, J. yEncode—A Quick and Dirty Encoding for Binaries. 2022. Available online: http://www.yenc.org/yenc-draft.1.3.txt (accessed on 5 June 2025).
- MDN Web Docs. Function btoa. Available online: https://developer.mozilla.org/en-US/docs/Web/API/btoa (accessed on 2 July 2025).
- MDN Web Docs. Function atob. Available online: https://developer.mozilla.org/en-US/docs/Web/API/atob (accessed on 2 July 2025).
- The FreeBSD Project. FreeBSD Manual Pages, btoa. Available online: https://man.freebsd.org/cgi/man.cgi?query=btoa&apropos=0&sektion=0&manpath=FreeBSD+14.0-RELEASE+and+Ports (accessed on 2 July 2025).
- Horton, M. UUENCODE(1C) UNIX Programmer’s Manual. Available online: https://www.tuhs.org/cgi-bin/utree.pl?file=4BSD/usr/man/cat1/uuencode.1c (accessed on 5 June 2025).
- IEEE and The Open Group. UUENCODE and UUDECODE—The Open Group Base Specifications Issue 7. Available online: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/uuencode.html and https://pubs.opengroup.org/onlinepubs/9699919799/utilities/uudecode.html (accessed on 5 June 2025).
- IEEE Std 1003.1-2017; IEEE Standard for Information Technology—Portable Operating System Interface (POSIX(TM)) Base Specifications, Issue 7; Revision of IEEE Std 1003.1-2008. IEEE Computer Society and The Open Group: Washington, DC, USA, 2018; pp. 1–3951.
- Mann, T. Prehistory of BinHex. Available online: https://www.tim-mann.org/binhex.html (accessed on 5 June 2025).
- Lempereur, Y. Post on Prehistory of BinHex. Available online: https://www.tim-mann.org/trs80/yves.txt (accessed on 5 June 2025).
- Crocker, D.; Fair, E.E.; Fältström, P. MIME Content Type for BinHex Encoded Files. RFC 1741. 1994. Available online: https://www.rfc-editor.org/rfc/rfc1741.html (accessed on 5 August 2025).
- Moore, K. MIME (Multipurpose Internet Mail Extensions) Part Three: Message Header Extensions for Non-ASCII Text. RFC 2047. 1996. Available online: https://www.rfc-editor.org/rfc/rfc2047.html (accessed on 5 August 2025).
- Schaad, J.; Ramsdell, B.C.; Turner, S. Secure/Multipurpose Internet Mail Extensions (S/MIME) Version 4.0 Message Specification. RFC 8551. 2019. Available online: https://www.rfc-editor.org/rfc/rfc8551.html (accessed on 5 August 2025).
- Kermit Project Software Archive. BOO Files. Available online: https://www.kermitproject.org/archive.html#boofile (accessed on 5 June 2025).
- Columbia University Computer Center. The Kermit Project. 1981. Available online: https://web.archive.org/web/20231215030314/https://www.columbia.edu/kermit/ (accessed on 5 June 2025).
- cppreference.com. Function isprint(). Available online: https://en.cppreference.com/w/c/string/byte/isprint (accessed on 5 June 2025).
- cppreference.com. Function iswprint(). Available online: https://en.cppreference.com/w/c/string/wide/iswprint (accessed on 5 June 2025).
- ISO/IEC. ISO International Standard ISO/IEC 9899:2024(en): Information Technology—Programming Languages—C (Standard C23). 2024. Available online: https://www.iso.org/standard/82075.html (accessed on 5 August 2025).
- ISO/IEC. ISO International Standard ISO/IEC 30112:2020(en): Information Technology—Specification Methods for Cultural Conventions. 2020. Available online: https://www.iso.org/standard/71987.html (accessed on 5 August 2025).
- cppreference.com. Function std::isprint(). Available online: https://en.cppreference.com/w/cpp/string/byte/isprint (accessed on 5 June 2025).
- cppreference.com. Function std::iswprint(). Available online: https://en.cppreference.com/w/cpp/string/wide/iswprint (accessed on 5 June 2025).
- rix. Writing ia32 alphanumeric shellcodes. Phrack 2001, 57. Available online: https://phrack.org/issues/57/15.html#article (accessed on 5 August 2025).
- Botta, M.; Cavagnino, D. A Framework for Reversible Data Embedding into Base45 and Other Non-Base64 Encoded Strings. Appl. Sci. 2022, 12, 241.
- Botta, M.; Cavagnino, D. Improving data embedding capacity into Base45 encoded strings. Eng. Rep. 2023, 5, e12622.
- Botta, M.; Cavagnino, D.; Druetto, A. Hide45: A Method for Optimal Payload Data Hiding in Base45 Encoded Strings. Appl. Sci. 2023, 13, 9993.
- Botta, M.; Cavagnino, D. Escaping Printable Encoded Streams to Embed Out-of-Band Data. Appl. Sci. 2023, 13, 6926.
- Hines, K.; Lopez, G.; Hall, M.; Zarfati, F.; Zunger, Y.; Kiciman, E. Defending Against Indirect Prompt Injection Attacks with Spotlighting. arXiv 2024, arXiv:2403.14720.
- Zhang, R.; Sullivan, D.; Jackson, K.; Xie, P.; Chen, M. Defense against Prompt Injection Attacks via Mixture of Encodings. arXiv 2025, arXiv:2504.07467.
Base | Alphabet (Ranges Are Considered in ASCII Code, See Figure 1) | Other Chars | Asymptotic Inflation Ratio | Encoded Block Size [Bits] | Highest Level Involved | Reference and Publication Year or Availability | Applications | Observations |
---|---|---|---|---|---|---|---|---|
64 | A..Za..z0..9+/ | = | | 24 | Stream | [4], 2006 | uuencode, MIME, PEM, btoa, XML, JSON | |
64 url | A..Za..z0..9-_ | = | | 24 | Stream | ([4] sec. 5), 2006 | URL, filenames | |
32 | A..Z2..7 | = | | 40 | Stream | ([4] sec. 6), 2006 | | |
32 hex | 0..9A..V | = | | 40 | Stream | ([4] sec. 7), 2006 | | |
16 | 0..9A..F | | 2 | 8 | Coding | ([4] sec. 8), 2006 | Low level device addresses, percent encoding, Unicode code points | |
36 | 0..9A..Z | | | any | Coding | [33,34,35,36,37,56,57,58,59,60,61], available today on respective websites | JavaScript, Python, PHP, spreadsheets | |
41 | A..DFGHJ..N Q..VXZ a..fhikm..vxz | | | 16 | Stream | [38], 2023 | | Can encode octet strings and bit strings |
41 | )..Q | | | 16 | Coding | [40], 2014 | | Conversion of octet strings of even length only; real length must be handled at Application level |
45 | 0..9A..Z $ %*+-./: | | | 16 | Stream | [41], 2022 | QR codes | |
56 | 2..9A..HJ..NP..Z a..kmnp..z | | | any (depends on machine and compiler limits) | Coding | [42], 2024 | | Compatible with other packages using a different alphabet |
58 | 1..9A..HJ..NP..Z a..km..z | | | any | Coding | [43], 2021 [44], 2014 | Bitcoin | |
62 | A..Za..z0..9 | | to | 5 or 6 | Coding | [47], 2008 | Compression algorithm | |
62 | 0..9A..Za..z | | or | 16 or 31 | Coding | [48], 2001 | UTF-62 for ISO 10646 UCS | |
85 | !..uyz | ~> | | 32 | Application | [49], 1999 (btoa) | | Ascii85 |
85 | 0..9A..Za..z !#$%&()*+-; <=>?@^_`{\|}~ | | | 128 | Application | [50], 1996 | IPv6 addresses representation | |
91 | !"#$%&'()*+,/0..9 :;<>?@A..Z[\]^_` a..z{\|} | | | 13 | Stream | [51], 2010 | | Can encode bit strings |
91 | A..Za..z0..9 !#$%&()*+,./:; <=>?@[]^_`{\|}~" | | | 13 or 14 | Coding | [52], 2006 | | |
94 | !..~ | | | 72 | Coding | [53,54], 2020 | | Suggested encoding sizes according to rounded values |
122 | One or two octets UTF-8 characters | | | 7 or 14 | Stream | [55], 2016 | Encoding binary objects into HTML web pages | |
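
As an illustration of how the "Asymptotic Inflation Ratio" and "Encoded Block Size" columns relate, the following minimal Python sketch may help. It is not the authors' code and rests on two assumptions: each encoded symbol occupies one 8-bit character, and a block of k bits is mapped to the smallest number of base-b symbols able to represent it. The function names are hypothetical.

```python
from fractions import Fraction
import math


def symbols_per_block(block_bits: int, base: int) -> int:
    """Smallest n such that base**n >= 2**block_bits (exact integer arithmetic)."""
    n, capacity = 0, 1
    while capacity < (1 << block_bits):
        capacity *= base
        n += 1
    return n


def inflation_ratio(block_bits: int, base: int) -> Fraction:
    """Encoded bits over input bits, assuming one 8-bit character per symbol."""
    return Fraction(8 * symbols_per_block(block_bits, base), block_bits)


def inflation_lower_bound(base: int) -> float:
    """Limit of the ratio for arbitrarily large blocks: 8 / log2(base)."""
    return 8 / math.log2(base)


# (base, block size in bits) pairs taken from rows of the table above
for base, bits in [(64, 24), (32, 40), (16, 8), (45, 16), (85, 32)]:
    print(base, bits, inflation_ratio(bits, base))
```

Under these assumptions the Base16 row (8-bit blocks) yields a ratio of 2, which agrees with the value reported in the table. For the rows whose block size is "any", `inflation_lower_bound` gives the value that the ratio approaches as the encoded number grows.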
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Botta, M.; Cavagnino, D.; Druetto, A.; Lucenteforte, M.; Marra, A. A Survey of Printable Encodings. Algorithms 2025, 18, 504. https://doi.org/10.3390/a18080504