PSON: A Serialization Format for IoT Sensor Networks

In many Internet of Things (IoT) environments, the lifetime of a sensor is linked to its power supply. Sensor devices capture external information and transmit it. They also receive messages with control commands, which means that one of the largest computational overheads of sensor devices is spent on data serialization and deserialization tasks, as well as data transmission. The simpler the serialization/deserialization and the smaller the size of the information to be transmitted, the longer the lifetime of the sensor device and, consequently, the longer the service life. This paper presents a new serialization format (PSON) for these environments, which simplifies the serialization/deserialization tasks and minimizes the messages to be sent/received. The paper presents evaluation results with the most popular serialization formats, demonstrating the improvement obtained with the new PSON format.


Introduction
The next generation of telecommunications networks (fifth generation or 5G) aims to redefine the rules of the game in connectivity in many respects. The expected capabilities of the new generation are overwhelming: data rates of up to 10 Gbps (10-100 times better than the current 4G and 4.5G networks), latencies of 1 millisecond, network availability of 99.999%, 100% coverage, a 90% reduction in network power consumption, and increased capacity for simultaneously connected users [1]. Beyond improved speed or latency, 5G is expected to unleash a massive IoT (Internet of Things) ecosystem [2] in which networks can meet the communication needs of billions of connected devices with the right capabilities [3].
The 5G specifications and use cases go far beyond mobile communication, the service experienced by the end user. Specifically, the new 5G standard defines different operational scopes or categories, which are eMBB (enhanced Mobile Broadband), URLLC (Ultra-Reliable Low-Latency Communications), and mMTC (massive Machine Type Communications) [4].
The aim of eMBB is to substantially improve the bandwidth of mobile communications with moderate latency and thus provide a solution for emerging applications related to virtual reality, augmented reality, UltraHD quality applications, 360° video streaming, etc., which will be further enhanced in the future, with a boost expected in 2021.
On the other hand, URLLC [5] refers to ultra-reliable communications with very low communication latency. URLLC will support a range of advanced services for latency-sensitive connected devices to enable applications across a wide spectrum, such as factory automation, autonomous driving, industrial internet, remote surgery, and smart grids, among others.
In addition, within the proposed 5G specification, mMTC services [4], which are massive machine-to-machine communications, are also defined. These are specifications focused on providing a cost-effective and robust connection to billions of devices without overloading the network. It will serve devices whose typical use case is to send small amounts of information on a regular basis, enabling optimal use of the power of IoT devices, as the vast majority of IoT devices are battery-powered.
Within the Internet of Things, 5G's mMTC technology is being postulated as an unprecedented connectivity solution for the development of connected products and services [6,7]. The specification sets requirements around networking devices with up to 10 years of battery life, coverage penetration of 164 dB with a capacity of 160 bits per second, coverage density to support up to one million connected devices per square kilometer, communication latencies of less than 10 s for 20 bytes of data, and a crucial element for massive scaling: very inexpensive hardware [4]. In addition, this specification states that 5G networks and mMTC should support more features and applications over time, such as positioning, mobility, or multicast communication capability, among others. The mMTC will rely on two network standards, NB-IoT (Narrowband-IoT) and LTE-M, which are 3GPP specifications particular to IoT [8].
NB-IoT and LTE-M are part of LPWANs (Low-Power Wide-Area Networks), such as Sigfox, ZigBee, and LoRa [9], which are technologies that offer ranges of several kilometers and low power consumption. They mitigate the shortcomings of WPANs (Wireless Personal Area Networks), such as WiFi and Bluetooth, which, although still widespread [10], have limitations with respect to their range (only a few tens of meters) and their power consumption [11], which is why they cannot be extensively used in IoT contexts. The main advantage of the LPWAN approach in 5G compared to other technologies is that they are within the licensed spectrum (so they are immune to interference) and use the infrastructure of telephony networks, so there is no need to deploy their own infrastructure. Moreover, because it is a standardized technology, LPWAN is supported by a global ecosystem that allows interoperability between different market players and production scales of these solutions, which will reduce the cost of the technology once it is consolidated [12].
As discussed above, the massive growth of IoT solutions will require the use of devices with limited capabilities [13,14]. Although technological advancement is ongoing, due to the limited resources available in a typical sensor device, and in order to achieve lower power consumption at the nodes, it is important to reduce the amount of data exchanged. Various initiatives have been developed to improve energy consumption in sensor networks. Some of them aim to optimize energy by designing routing protocols [15,16]. Other work has focused on the development of Wireless Power Transfer (WPT) technology [17-21] or Energy Harvesting (EH) [22]. However, the focus of this research work is on the efficient transfer of data from devices to the server side. In this way, the processing power, available memory, and battery life of IoT devices, which are mostly limited in these aspects [23,24], will be optimized.
For the transmission of information in both directions, that is, between the device and the server, serialization formats are used. Serialization is the process of translating data structures or object states into a format that can be transmitted and reconstructed later. Therefore, serialization is the conversion of an object into a sequence of bytes, whereas deserialization is the reconstruction of an object from a sequence of bytes. Serialization/deserialization processes are critical for devices with limited on-board energy, such as those in an IoT network. The smaller the size of the serialized object and the shorter the execution time involved, the more efficient the format. Any reduction in processor time for transaction serialization/deserialization contributes to an increase in the deployed lifetime of an IoT device. There are a number of different serialization formats, as is discussed further below. In IoT environments, where many devices are expected to be connected to the server, the importance of selecting a serialization format is vital in order to reduce overheads (measured as memory and bandwidth usage) [25].
Although the selection of the message protocol [25,26] is also relevant in the communication system, the focus of this paper is the presentation of a new serialization format called PSON [27]. The main goal of PSON is to define a serialization format that is efficient in terms of the total serialization time and the bandwidth required to transmit arbitrary data payloads. The main problem with other encoding technologies is that they were not specifically designed with the IoT ecosystem in mind, neither for servers, which need to scale well while decoding massive amounts of IoT sensor data, nor for sensors, which need to last a long time when powered by batteries. As a result, some existing methods are quite efficient at reducing the payload size, but they increase the total serialization/deserialization complexity in terms of processing power, especially for a small microcontroller. PSON therefore focuses on providing a balance between serialization time and generated payload size. PSON is used in the Thinger.io Cloud Platform [28]. Thinger.io is an open-source platform with capabilities for the collection, management, and analysis of a huge amount of heterogeneous sensor data. The use of PSON provides optimization in terms of execution time, channel utilization, and power consumption compared to the most common methods used in IoT environments. The aim of this paper is to describe this new serialization format and assess its performance compared to the most widely used formats.
The remainder of this article is organized as follows: Section 2 introduces the main aspects and the information sources of the data serialization formats that are analyzed and compared in the presented research. Section 3 provides an in-depth description of the newly developed data serialization format, PSON. Section 4 describes the design of the research carried out to compare the selected data serialization formats and specifically addresses the attributes used to perform the comparison, the hardware used and its relation with IoT, the libraries used, and the tests and payloads used. Section 5 presents the research results obtained for each attribute analyzed and each hardware platform used. Finally, Section 6 summarizes the conclusions obtained in the research and describes possible future work that takes this research as a starting point, some of which has already begun.

Data Serialization Formats
This paper presents a comparison between data serialization formats; in this section, we enumerate and describe the main characteristics of the data serialization formats included in the comparison. The data serialization formats included in this study are those with widespread use:
• JSON, JavaScript Object Notation. The European Computer Manufacturers Association, ECMA, published the ECMA-404 standard, "The JSON data interchange syntax", whose latest update was the 2nd edition, published in 2017 [29]. This document presents the most recent version of the standardized JSON language. As defined in the document, "JSON is a lightweight, text-based, language-independent data interchange format. It was derived from the ECMAScript programming language, but is programming language independent. JSON defines a small set of structuring rules for the portable representation of structured data". JSON is based on a subset of the JavaScript Programming Language Standard ECMA-262 3rd Edition published in December 1999, and it is a very stable data-interchange language that has had few modifications since it was first presented in 2001 on the JSON organization website [30]. This stability is complemented by the fact that it uses conventions similar to the C-family of languages, such as C, C++, C#, Java, JavaScript, Perl, and Python, making JSON one of the most widely used data serialization formats.
• BSON, Binary JSON. First developed by MongoDB [31] as a binary structure that encodes type and length information, BSON is currently maintained as an open binary-encoded serialization of JSON-like documents in [32], whose latest published specification version is 1.1. This document describes the three characteristics for which BSON was designed: "Lightweight, Keeping spatial overhead to a minimum; Traversable, was designed to be traversed easily; Efficient, Encoding data to BSON and decoding from BSON can be performed very quickly in most languages due to the use of C data types."
• Protocol Buffers, developed by Google as a mechanism for serializing structured data. Two versions have been published, proto2 and proto3, the specifications of which can be found in [33]. The most recent version of Protocol Buffers, proto3, supports generated code in Java, Python, Objective-C, Dart, Go, Ruby, and C#. The main objective of Protocol Buffers is to be a small, fast, and simple mechanism for data serialization that is language-neutral, platform-neutral, and extensible.
• XML, Extensible Markup Language. Developed by six different XML groups [34] within the Information and Knowledge Domain of the W3C [35], each dedicated to a different aspect, Extensible Markup Language (XML) is a text format derived from SGML (ISO 8879). It was designed to meet the challenges of large-scale electronic publishing, but it is also extensively used in the exchange of a wide variety of data on the Web and elsewhere. Its first publication was in 1997, and since then, many different specifications, which can be found in [36], have been published.
• Apache Thrift [39]. Apache Thrift allows reliable performance communication and data serialization across a variety of programming languages and use cases. The project team aimed for Thrift to embody several characteristics: Simplicity, with simple and approachable code, free of unnecessary dependencies; Transparency, conforming to the most common idioms in all languages; Consistency, with niche, language-specific features in extensions, not the core library; and Performance, striving for performance first, elegance second.
• Apache Avro. Avro joined the Apache Software Foundation as a Hadoop subproject in 2009. Since then, it has been very intensively maintained, and more than thirty releases have been published, with the latest one being 1.10.2 in 2021. All versions can be found in [40]. This is a data serialization system that relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, which also allows its use with dynamic scripting languages since the data, together with their schema, are fully self-describing. The developer team indicates that Avro is intended to provide rich data structures; a compact, fast, binary data format; a container file to store persistent data; a remote procedure call (RPC); and simple integration with dynamic languages based on the fact that code generation is not required to read or write data files, nor do RPC protocols need to be implemented. For this reason, code generation is an optional optimization step and is only worth implementing for statically typed languages.

PSON: Thinger.io Data Serialization Format
PSON is an object serialization specification similar to JSON but specifically created for microcontrollers. It improves on JSON in encoding/decoding complexity and generates a more compact representation over the wire. It also extends JSON by allowing any arbitrary binary information to be encoded, which is not permitted by the JSON standard. Thus, PSON handles different data types, each of which is referred to in the following as a wire type.
To represent this kind of heterogeneous information, PSON messages are encoded as a series of header-value pairs. The header indicates the type of data, and the value represents the actual value. Therefore, a decoder needs to read the header to retrieve the actual data type and determine how to decode the upcoming value. A header is a fixed single byte composed of two fields: the wire type and the header payload. The wire type is encoded in the first 3 MSB (most significant bits), while the header payload is kept in the remaining 5 LSB (least significant bits). Figure 1 represents this structure.
The wire-type field of the header has a clear role in describing the value type, i.e., a number, a float, a string, an object, etc. The header payload, on the other hand, is 5-bit general-purpose storage that is used to optimize the serialization size. These 5 bits allow up to 32 different values to be specified, which PSON uses for different purposes depending on the wire type. Thus, a header contains the wire type and, under some circumstances, the actual value or the size of the upcoming object/array, resulting in an efficient encoding representation. This is especially useful in the embedded ecosystem, where payloads tend to be small due to network and battery constraints. If an integer, length, size, or number of elements does not fit in the 5-bit storage (i.e., it is greater than 30), then it is flagged with a 0x1F (31) value in the header payload, and in this case, the actual value is represented by a varint number following the header.
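The header layout described above can be illustrated with a minimal Python sketch. The function names and the `VARINT_FLAG` constant are hypothetical and are not taken from the PSON library; the only wire-type value used in the usage note, 0b100 for strings, is the one stated later in the text.

```python
from typing import Tuple

# Hypothetical name: header payload value meaning "a varint follows the header".
VARINT_FLAG = 0x1F


def pack_header(wire_type: int, payload: int) -> int:
    """Combine a 3-bit wire type (MSB side) and a 5-bit payload (LSB side)."""
    if not 0 <= wire_type <= 0b111:
        raise ValueError("wire type must fit in 3 bits")
    if not 0 <= payload <= 0b11111:
        raise ValueError("header payload must fit in 5 bits")
    return (wire_type << 5) | payload


def unpack_header(header: int) -> Tuple[int, int]:
    """Split a header byte back into (wire type, header payload)."""
    return header >> 5, header & 0x1F


def header_for(wire_type: int, value: int) -> Tuple[int, bool]:
    """Return (header byte, needs_varint) for a non-negative value or length.

    Values 0..30 are stored inline in the header payload; larger values are
    flagged with 31 and must be written as a varint after the header.
    """
    if value < 0:
        raise ValueError("this sketch only covers non-negative values")
    if value <= 30:
        return pack_header(wire_type, value), False
    return pack_header(wire_type, VARINT_FLAG), True
```

With the string wire type, `header_for(0b100, 7)` yields the single byte 0x87 with no varint, whereas a length of 150 yields 0x9F and signals that a varint must follow.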
Varints, also known as Little Endian Base 128 (LEB128) [41], allow small numbers to be stored in a single byte while also allowing the encoding of arbitrarily long numbers. Each byte in LEB128, except for the last byte, has the MSB set, and this indicates that there are further bytes to come. The lower 7 bits of each byte are used to store the two's complement representation of the number in groups of 7 bits, starting with the least significant group.
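As an illustration of the varint scheme, the following Python sketch encodes and decodes unsigned LEB128 numbers. It is a generic LEB128 example rather than code from the PSON library, the function names are hypothetical, and it deliberately covers only non-negative integers.

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if value < 0:
        raise ValueError("this sketch only covers non-negative integers")
    out = bytearray()
    while True:
        group = value & 0x7F          # lowest 7 bits come first
        value >>= 7
        if value:
            out.append(group | 0x80)  # MSB set: more bytes follow
        else:
            out.append(group)         # MSB clear: last byte
            return bytes(out)


def decode_varint(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode an unsigned LEB128 number; return (value, next offset)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


print(encode_varint(300).hex())  # prints "ac02"
```

Numbers up to 127 occupy a single byte, and each additional 7 bits of magnitude costs one more byte, which is why small lengths and counts remain cheap on the wire.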
The conventions used for encoding all different types are summarized in Table 1, along with their binary representations. Figure 2 presents an example of the complete encoding of a JSON document to the PSON format. In this example, a map is encoded with two keys. A map wire-type is encoded by convention with a header starting with 0b110xxxxx, where the x positions are the 5 bits of the header payload, which contains the number of elements in the map. Thus, the header value is encoded as 0b11000010, which is 0xC2, as shown in the figure. The map header is followed by key-value pairs, each composed of a key encoded as a string and a value that can be any other value, such as an integer, boolean, another string, a null, etc. In the present example, the first key starts with a header encoded as 0x87, meaning that it is a string with a length of 7 bytes, as 0b100xxxxx represents a string wire-type. Thus, the following 7 bytes contain the actual string, which is "compact". Next, the value associated with this key is encoded as 0x61. In Table 1, a discrete value is encoded as 0b011xxxxx, and a header payload of 1 is used for true values, so the final value is 0b01100001, which is the above-mentioned 0x61. This process is repeated for the next key-value pair following the same encoding rules for wire-types and header payloads.
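The bytes of this example can be reproduced with a short Python sketch. It uses only the three wire types and the true-value payload stated in the text; the constant and function names are hypothetical, UTF-8 is assumed for the key bytes, and the second key-value pair is left out because the text does not spell out its contents.

```python
# Wire types as stated in the text (3 most significant bits of the header).
WIRE_DISCRETE = 0b011
WIRE_STRING = 0b100
WIRE_MAP = 0b110
TRUE_PAYLOAD = 1  # header payload used for a true value


def header(wire_type: int, payload: int) -> bytes:
    """Build the single header byte: 3-bit wire type, 5-bit payload."""
    return bytes([(wire_type << 5) | payload])


def encode_short_string(text: str) -> bytes:
    """Encode a string whose length fits in the header payload (0..30)."""
    raw = text.encode("utf-8")  # assumption: keys are written as UTF-8 bytes
    if len(raw) > 30:
        raise ValueError("longer strings need a varint length (not shown)")
    return header(WIRE_STRING, len(raw)) + raw


# Map header announcing two elements, then the first key-value pair only.
encoded = (
    header(WIRE_MAP, 2)
    + encode_short_string("compact")
    + header(WIRE_DISCRETE, TRUE_PAYLOAD)
)
print(encoded.hex())  # prints "c287636f6d7061637461"
```

The ten bytes cover the map header (0xC2), the key header (0x87), the seven characters of "compact", and the true value (0x61). The pair "compact": true takes 14 characters in compact JSON, against 9 bytes here.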

Evaluation Methodology
The evaluation methodology used in this research had two main parts: the design of the research and the realization of the tests to compare data serialization formats. In this section, the design of the research carried out to compare data serialization formats is described. Four main aspects were defined in order to obtain a useful comparison of the formats enumerated above:
• Attributes;
• Hardware;
• Libraries;
• Tests and Payloads.

Attributes
The attributes or characteristics measured to compare data serialization formats are the following:

1. Serialization/Deserialization speed. The values were measured per 1000 iterations and are expressed in microseconds.
2. Binary size increase with the use of the library. This attribute is very important with memory-limited devices such as Arduino UNO and is necessary for its application in IoT devices.
3. Encoding sizes. This attribute is very important when on a limited bandwidth network and, in consequence, for the scope of this study.

Hardware
To perform these tests, Arduino UNO (BCMI LABS LLC, Scarmagno, Italy) and ESP32-WROVER-B (Espressif Systems, Shanghai, China) modules were used. Due to their different characteristics, these devices represent different use scenarios. Arduino UNO represents a device with low memory capacity and low CPU power, while ESP32-WROVER-B represents a device with more memory and more CPU power. The tests were performed on both devices to show the results in both cases.

Hardware Characteristics
Three key characteristics were involved in the tests:
Regarding the libraries used, the following formats were evaluated for encoding size only:
• Avro: Apache Avro. This library was not used on the microcontrollers because there are currently no implementations of it for them. The tests were implemented in Java and executed on a computer in order to obtain the serialized object and measure its size. For this reason, the tests with this library only contain data on sizes. The associated code can be found in the theoretical-tests folder.
• Thrift: Apache Thrift. This library was not used on the microcontrollers because there are currently no implementations of it for them. The tests were implemented in Java and executed on a computer in order to obtain the serialized object and measure its size. For this reason, the tests with this library only contain data on sizes. The associated code can be found in the theoretical-tests folder.
• YAML. This library was not used on the microcontrollers because there are currently no implementations of it for them. The tests were implemented in Java and executed on a computer in order to obtain the serialized object and measure its size. For this reason, the tests with this library only contain data on sizes. The associated code can be found in the theoretical-tests folder.
Table 2 describes the details of all the libraries used.

Tests and Payloads
Different tests and payloads were developed to measure each defined attribute. For each attribute, the payloads are the following:

1. Serialization/Deserialization speed. The tests used to measure this attribute were performed by using 10 different payloads and checking the time needed to serialize and deserialize them.
2. Binary size increase with the use of the library. The tests used to measure this attribute were performed using a reference code (code-without-library folder) to measure the binary size generated when not using any library. Then, the code from the binarysize-tests folder was loaded on each microcontroller, and the binary size increment was calculated.
3. Encoding sizes. The tests used to measure this attribute were performed by using 10 different payloads and checking the generated serialized object size.
The 10 payloads used to measure the results for the encoding size and speed tests are named Test#. All of the tests performed are labeled with their name and can be checked here to confirm which payload was used for any test (shown as their JSON representation):

Test02
{ "sensor":"This is a very long string. This is a very long string. This is a very long string. This is a very long string. This is a very long string. ", "time":1351824120, "data":[ 48.75, 2.3 ] }

Research Results
This section summarizes the research results with new findings for each of the attributes studied.

Serialization/Deserialization Speed
These are the values measured for each test; the test definitions above can be referenced to confirm which payload corresponds to each row. As previously mentioned, the values were measured per 1000 iterations and are expressed in microseconds. The results are presented for ESP32-WROVER-B and Arduino UNO in two separate tables.
These tests show the performance of each library and protocol. High values for the serialization and deserialization time indicate that more CPU cycles were used for the data processing, which leads to more power consumption. Power consumption is very important in scenarios in which devices are powered by batteries. In addition, decreased use of the CPU by the serialization and deserialization process allows the device to use it for its actual goal, i.e., reading the sensors and processing their data. In these tests, the lower the values, the better.
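The measurement procedure (total time for 1000 iterations, reported in microseconds) can be sketched as follows. This is only an illustration of the method on a desktop Python interpreter using the standard json module and the Test02 payload; it is not the benchmark code run on the microcontrollers, and the function name is hypothetical.

```python
import json
import time

# Test02 payload as listed in the Tests and Payloads section.
PAYLOAD = {
    "sensor": "This is a very long string. " * 5,
    "time": 1351824120,
    "data": [48.75, 2.3],
}


def microseconds_per_run(func, arg, iterations: int = 1000) -> float:
    """Return the total time, in microseconds, of `iterations` calls."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        func(arg)
    return (time.perf_counter_ns() - start) / 1000.0


encoded = json.dumps(PAYLOAD, separators=(",", ":"))
serialize_us = microseconds_per_run(json.dumps, PAYLOAD)
deserialize_us = microseconds_per_run(json.loads, encoded)

print(f"serialize:   {serialize_us:.0f} us per 1000 iterations")
print(f"deserialize: {deserialize_us:.0f} us per 1000 iterations")
print(f"encoded size: {len(encoded.encode('utf-8'))} bytes")
```

Timing a batch of iterations rather than a single call averages out timer resolution and scheduling noise, which matters most for short payloads.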
The results for ESP32 are reported in Figure 3 and Table 3.
The results for Arduino are depicted in Figure 4 and Table 4.

Binary Size Tests
This section presents the results of the binary size tests for each format. These values are important when working with microcontrollers. Due to their limited memory, it is important to keep the utility code as small as possible. If the serialization/deserialization library consumes a lot of the total memory of the device, there may be insufficient memory to load the code that is necessary to perform the actual microcontroller task. In these tests, the lower the values, the better. The binary sizes without a library are:
• Arduino: 592 bytes;
• ESP32: 260,710 bytes.
The calculated percentages for the sizes are presented in Figures 5 and 6. The "Increase" columns in the figures represent the actual percentage of program size increase resulting from the addition of the serialization and deserialization library (BSON is not represented in Figure 5 because its values are very different from the rest and would distort the whole graphic). This value is calculated by subtracting the binary size without a library from the total value obtained in the test.

Encoding Sizes
These are the results obtained after encoding each Test# payload with each library. These values are important because microcontrollers are commonly used in low-bandwidth or mesh networks. In these scenarios, sending a message through the network has a high cost in terms of network capacity. Furthermore, sending or receiving a larger message uses more of the network interface, so the device consumes more power. As described in the serialization/deserialization speed section, this is very important when devices are powered by batteries. In these tests, the lower the value, the better. Figure 7 shows the size values for each serialization protocol, with all results expressed in bytes. It includes the results for Avro, Thrift, and YAML, although they were not used on the microcontrollers: Avro, Thrift, and YAML messages were serialized using Java on a computer, and their serialized sizes were measured for each test.
The above evaluation results for encoding time, encoding size, and binary size increase indicate that PSON is quite an efficient mechanism for embedded systems. In terms of encoding time, PSON is one of the fastest formats for completing the encoding/decoding tests on ESP32 (Figure 3), followed by Protocol Buffers or MessagePack. Compared to MessagePack (another schema-less format, similar to PSON), PSON is much faster on average on ESP32: 33% faster encoding and 55% faster decoding. On Arduino UNO, as shown in Figure 4, PSON encodes 40% faster on average, but its deserialization is around 47% slower. Thus, PSON is quite efficient when encoding information on microcontrollers, which is the normal scenario in IoT, where devices send information periodically to the cloud.
From the encoding size results in Figure 7, the most efficient means of encoding information in these tests was obtained with Protocol Buffers or Apache Thrift, but this is only suitable for use cases in which the message structure is known beforehand by both the microcontroller and the server decoding the messages. However, this is not the standard scenario in the IoT ecosystem, as it will require creating custom decoding functions in the cloud for every message sent by the device, which typically implies the compilation of the format definition, the use of the generated source files, etc. This is not practical or sustainable in the long term when dealing with multiple device types or changes in the protocol, which may result in versioning complexity. Moreover, it complicates the cloud interoperability required in IoT with third-party services or applications, which usually work over the well-known REST API schema, using JSON as the standard encoding format.
Among the other schema-less encoding formats that can be directly converted to JSON, PSON is one of the most efficient methods and is comparable to MessagePack, obtaining quite similar results for encoding size. On average, PSON and MessagePack generate payloads that are 15% smaller than raw JSON, and depending on the payload, the reduction can reach 30-40%, e.g., in Test01 and Test06.
Finally, the increases in binary size due to serialization/deserialization in Figures 5 and 6 illustrate that the ArduinoJSON library, which provides serialization/deserialization for JSON and MessagePack, is quite an optimized library in this respect, leading with a smaller program footprint on both ESP32 and Arduino UNO. The PSON library is quite similar on ESP32, with a 20.26% increase above the baseline binary versus the 20.15% required for MessagePack or the 20.21% required for JSON. For Arduino UNO, PSON is less optimized and results in a 10% greater footprint on average compared with ArduinoJSON.
In sum, PSON is a new encoding format that competes with MessagePack in terms of encoding size but improves the serialization time by using a simpler encoding approach. It outperforms other schema-less encoding systems such as JSON, XML, and BSON in both serialization size and time. The results also indicate that the PSON library competes with other specific state-of-the-art Arduino libraries on modern microcontrollers such as ESP32, but it can be improved on more modest architectures such as the AVR used in Arduino.

Conclusions and Future Work
We are witnessing the emergence of a new generation of IoT devices capable of being part of massively scalable and cost-effective IoT applications using LPWAN and the latest NB-IoT and LTE-M communication technologies. The deployment of massive IoT applications requires a huge volume of low-cost, low-power sensor devices. Therefore, these sensor devices will have relatively low performance.
In an IoT environment, clients and servers exchange data. Part of this data may be in the transport protocol, which is used in the exchange of messages. Within messages, structured information (integers of different sizes and formats, arrays, strings, etc.) can be exchanged. Therefore, the use of serialization formats is necessary to represent this structured data in a linear set of bytes, which can be equivalently deserialized. The process of serialization and deserialization is critical in massive IoT environments, as it consumes processing time and has an impact on message size, and consequently, it is directly related to the energy consumption of sensor devices and their lifetime. This paper presents a new serialization format used by the Thinger.io platform called PSON. PSON optimizes execution time, channel utilization, and power consumption compared to the most common methods used in IoT environments.
In order to evaluate the efficiency of PSON, tests were carried out to compare it with the most widely used serialization formats using different payloads. The evaluation results demonstrate the excellent performance of PSON in terms of serialization, deserialization, and average encoding sizes. Specifically, the serialization and deserialization times for 1000 iterations were 11,179 µs and 11,657 µs for ESP32 and 125,671 µs and 324,406 µs for Arduino UNO; the average encoding binary size was 66.3 bytes. PSON also presented good results for library binary size overhead.
Future work will involve extending this research to other IoT services, such as the performance impact in mesh and low-bandwidth networks and the energy savings for microcontrollers. Moreover, the intention is to register the encoding format with IANA (Internet Assigned Numbers Authority) so that PSON can become a new standard in the future for optimizing JSON payloads over the Internet. There is also an opportunity to improve the PSON libraries by reducing the compiled size on microcontrollers, thus increasing the efficiency on constrained devices. In addition, there is a plan to create libraries in different languages, such as Python, Node.js, and Java, so that the encoding format can be more interoperable with different programming languages and custom back-ends. Finally, the improvement of the libraries is planned with the use of zero-copy techniques [42] to avoid unnecessary memory copying and thus improve deserialization time.