Lossless Compression of Sensor Signals Using an Untrained Multi-Channel Recurrent Neural Predictor

Abstract: The use of sensor applications has been steadily increasing, leading to an urgent need for efficient data compression techniques to facilitate the storage, transmission, and processing of digital signals generated by sensors. Unlike other sequential data such as text sequences, sensor signals have more complex statistical characteristics. Specifically, in every signal point, each bit, which corresponds to a specific precision scale, follows its own conditional distribution depending on its history and even on other bits. Therefore, applying existing general-purpose data compressors usually leads to a relatively low compression ratio, since these compressors do not fully exploit such internal features. What is worse, partitioning a bit stream into groups with a preset size will sometimes break the integrity of each signal point. In this paper, we present a lossless data compressor dedicated to compressing sensor signals, built upon a novel recurrent neural architecture named the multi-channel recurrent unit (MCRU). Each channel in the proposed MCRU models a specific precision range of each signal point without breaking data integrity. During compressing and decompressing, the mirrored network is trained on the observed data; thus, no pre-training is needed. The superiority of our approach over other compressors is demonstrated experimentally on various types of sensor signals.


Introduction
As digitalization advances continuously, sensor technology is undergoing tremendous development and has been widely used in applications such as wearable medical devices [1], climate change tracking [2], and infrastructure monitoring [3]. At the same time, with the improvement of the resolution and sampling rate of analog-to-digital converters (ADCs), the volume of sensor signals increases rapidly, leading to great pressure on data storage. One way to alleviate such pressure is to reduce the redundancy existing in massive data through data compression. Data compression techniques can be categorized into two classes: lossless compression and lossy compression. Lossy compression of sensor signals discards part of the secondary information, and the resulting distortion is limited within an acceptable range. For example, An et al. proposed a data compression method based on two-dimensional discrete cosine transform (DCT), which can effectively reduce the amount of data since most natural signals are concentrated in the low frequency parts of DCT [4]. Zhang et al. proposed a compression method based on wavelet transform and obtained a high signal-to-noise ratio [5]. These methods can produce a high compression ratio at the cost of dropping part of the signal information. However, in many scenarios, such as sensor debugging [6], sensor signals must be stored losslessly. In such cases, lossless data compression is used. For example, Biagetti et al. explained the importance of lossless compression algorithms in electromyography (EMG) sensors and analyzed the energy consumption and performance of existing lossless compressors [7].
Classical general-purpose lossless compression algorithms mainly include entropy coding and dictionary coding. The former is based on Shannon information theory [8] and includes Huffman coding [9], arithmetic coding [10], and asymmetric numeral systems (ANS) [11]. The latter is based on the LZ algorithm and its variants, which compress data by replacing repeated data with a reference to the earlier position of that data in the uncompressed stream [12,13]. To better utilize the context information underlying sequential data, context-based compression methods were proposed. The general idea of these methods is to combine a context-based predictor and a coding algorithm. Among context-based algorithms, PAQ uses a large number of models conditioned on different contexts (e.g., n-grams, sparse contexts, analog contexts, etc.) to estimate the probability distribution of the next symbol [14]. Deep learning models are natural context extractors, which can be used as effective probabilistic predictors in context-based compressors. Byron proposed CMIX [15], which uses a large number of context models including a long short-term memory (LSTM) [16]. Goyal et al. proposed DZip [17], which uses a pre-trained neural network as a predictor whose parameters are stored in the compressed file after compression. Different from DZip, tensorflow-compress [18] trains its deep learning predictor during compressing and decompressing; thus, it does not need to store the model parameters and can run with a large batch size to gain a substantial speed improvement.
The compressors mentioned above are general-purpose methods which perform data compression by reducing the redundant information between data points. For sensor signals, this inter-point relationship may be continuity and periodicity. However, sensor signals also have intra-point features. Specifically, each bit in every signal point has its own underlying conditional distribution, depending on its own history and on the values of other bits. For example, the values of lower-order bits change more frequently than those of higher-order bits. Therefore, applying general-purpose compressors directly to sensor signals often leads to a relatively low compression ratio, since they do not fully exploit the intra-point characteristics mentioned above. What is more, most general-purpose compressors read a fixed number of bits at a time. In this way, a bit stream of a sensor signal will be partitioned into groups before being fed into the compressor, which may break the data integrity of sensor signals. There are also compressors that are specially designed for digital signals. Dai et al. proposed a lossless compression method for periodic signals based on an adaptive dictionary model which can predict the current data value according to the history [19]. Huang et al. proposed a novel ECG signal prediction model that uses an autoregressive integrated moving average (ARIMA) model and a discrete wavelet transform (DWT) [20]. Nonetheless, these compressors are all designed for specific signals.
In this paper, we present a novel recurrent neural network (RNN) architecture that is specially designed for modeling sensor signals as the probability predictor of a context-based lossless compressor. The proposed multi-channel recurrent unit (MCRU) reads one signal point at a time, and the internal bits are re-grouped into multiple channels, each of which is assigned a sub-recurrent unit. In this way, the intra-point features can be extracted without breaking the integrity of each signal point. Furthermore, we adopt a similar strategy to tensorflow-compress and train the network during compressing and decompressing; thus, no pre-training is needed. The effectiveness of the proposed approach is demonstrated experimentally on different types of sensor signals.

Context-Based Lossless Compression for Sensor Signals
In this section, we first introduce the basic logic of sequence prediction for digital signals. Then, the general framework of context-based lossless compression algorithms, on which the proposed method of this paper is based, is presented. For mathematical notation, we denote vectors and matrices by bold lower- and upper-case letters, respectively, e.g., $x$ and $W$. Functions are denoted by upper-case letters in calligraphic font, e.g., $\mathcal{F}$.

Digital Signals and Sequence Predictor
A digital signal $\{s_i\}$ of length $L$ sampled by an $R_a$-bit ADC is stored in hardware as a sequence of bits:

$$b^{s_1}_{1}, b^{s_1}_{2}, \cdots, b^{s_1}_{R_a}, b^{s_2}_{1}, \cdots, b^{s_L}_{R_a}, \qquad (1)$$

where $R_a$ is the ADC resolution, and $b^{s_i}_{j}$ denotes the $j$-th bit in the $i$-th sampled value. In this paper, we assume that the order of bits in each sampled value is ascending, i.e., $b^{s_i}_{1}$ and $b^{s_i}_{R_a}$ represent the lowest- and highest-order bits in $s_i$, respectively.
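To make the layout concrete, the ascending bit order described above can be sketched in a few lines of Python (an illustrative sketch, not the paper's implementation; the function names are ours):

```python
# Unpack an R_a-bit ADC sample into its ascending-order bits
# b_1 (lowest-order) ... b_{R_a} (highest-order), as assumed in the text.
def sample_to_bits(s, r_a):
    """Return the bits of sample s with the LSB first."""
    return [(s >> j) & 1 for j in range(r_a)]

# Concatenate the bit sequences of all samples into the stored bit stream.
def signal_to_bitstream(signal, r_a):
    return [b for s in signal for b in sample_to_bits(s, r_a)]
```

For example, with $R_a = 4$, the sample 5 (binary 0101) is stored as the bit sequence 1, 0, 1, 0.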
On the other hand, a sequence predictor $\mathcal{Q}$ predicts future values based on previously observed values by modeling relationships between data points in a data series $\{x_t\}$. At time step $t$, $\mathcal{Q}$ regards the value to be predicted as a random variable $\xi_t$ and gives a probability distribution over all its possible values $\{v_1, \cdots, v_N\}$:

$$\mathcal{Q}(\xi_t = v_n \mid x_1, \cdots, x_{t-1}), \quad n = 1, \cdots, N, \qquad (2)$$

where $x_1, \cdots, x_{t-1}$ are previous observations. Given a digital signal $\{s_i\}$, a sequence predictor usually predicts $R_p$ bits at a time by re-grouping the bit sequence in Equation (1) into $\{x_t\}$. In this way, each $x_t$ has a total of $2^{R_p}$ possible values, i.e., $N = 2^{R_p}$ in Equation (2). In the following discussion, we refer to $R_p$ as the predictor resolution.
In this paper, we use bits per character (BPC, [21]) to measure the performance of a sequence predictor. Note that for digital-signal predictors, the BPC on a re-grouped sequence $\{x_t\}$ of length $T$ is calculated by normalizing the total code length to 8-bit characters:

$$\mathrm{BPC} = -\frac{8}{T R_p} \sum_{t=1}^{T} \log_2 \mathcal{Q}(\xi_t = x_t \mid x_1, \cdots, x_{t-1}). \qquad (3)$$
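The metric can be computed as follows (a minimal sketch; we assume the total code length is normalized to 8-bit characters, i.e., $T \cdot R_p / 8$ characters, so that predictors with different $R_p$ remain comparable):

```python
import math

# Minimal BPC sketch, assuming normalization to 8-bit characters.
# true_probs[t] is the probability the predictor assigned to the symbol
# that actually occurred at step t.
def bpc(true_probs, r_p):
    total_bits = -sum(math.log2(p) for p in true_probs)
    num_chars = len(true_probs) * r_p / 8  # characters = total bits / 8
    return total_bits / num_chars
```

Under this convention, a uniform predictor over 8-bit symbols ($p = 1/256$, $R_p = 8$) yields the maximal BPC of 8.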

Context-Based Encoding and Decoding
The proposed approach in this paper is built upon a context-based lossless compression framework which consists of a predictor and a coding module. Specifically, we use entropy coding as the coding module, and a sequence predictor is used to give the probability prediction, as in Equation (2). Here we summarize the general flow of such a context-based encoding and decoding scheme, as shown in Figure 1. At each encoding/decoding step $t$, the predictor first reads the $K$ latest data points $\{x_{t-K}, \cdots, x_{t-1}\}$ from the raw data and gives a prediction $p_t$. This prediction is then sent to the entropy coding module to encode the raw data point $x_t$, or to decode the compressed data and recover $x_t$. For some context-based compressors, the predictor will then be updated based on the true value $x_t$ of the prediction made. In this paper, all mentioned context-based compressors, including the proposed approach, use arithmetic coding [10] as the entropy coding module, unless otherwise specified.
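The per-step loop can be sketched as follows (a schematic sketch only; the `predictor` and `coder` interfaces are our assumptions, not the actual implementation):

```python
# Schematic encode loop of a context-based compressor (Figure 1).
# `predictor` maps the last K symbols to a probability distribution;
# `coder` is an entropy coder (e.g., an arithmetic coder).
def encode(data, predictor, coder, k):
    history = [0] * k  # padded context for the first steps
    for x_t in data:
        p_t = predictor.predict(history[-k:])  # distribution over symbols
        coder.encode_symbol(x_t, p_t)          # entropy-code x_t under p_t
        predictor.update(history[-k:], x_t)    # train on the true symbol
        history.append(x_t)
    return coder.finish()
```

Decoding mirrors this loop: the same predictor produces the same $p_t$, which the arithmetic decoder uses to recover $x_t$; the recovered symbol then updates the predictor identically.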
Among the context-based approaches, the ones based on the LZ77 algorithm [12], which uses a dictionary to maintain the context of data, are perhaps the most widely used. For example, the famous compression tool Gzip, to which our proposed method is compared in Section 4, is based on LZ77 and Huffman coding [9]. Another popular LZ77-based compressor, namely LZMA, uses arithmetic coding as the entropy coding module. These compressors work without prior knowledge of the data to be compressed. In addition, they are not able to model the underlying joint distributions of data points, leading to a relatively low compression ratio (see Table 1). To address this issue, CMIX [15] combines a large number of context-based predictors, achieving a state-of-the-art compression ratio on several compression benchmarks [22,23] at the cost of slow compression speed.
Figure 1. Working flow of a context-based lossless compressor with a probability predictor and an entropy-coding module. The red dashed line indicates the predictor update.

Deep learning models, especially recurrent neural networks (RNNs), have been proven to be good at modeling context information hidden in sequential data. Naturally, they can be used as probability predictors in context-based compressors. DZip [17] uses a single deep learning model as the probability predictor. At each time step $t$, it calculates the conditional probability distribution based on the previously observed $K$ symbols. However, DZip requires pre-training the neural network, and the trained model is stored as a part of the compressed data in order to achieve a better compression rate. The aforementioned CMIX compressor also contains a deep learning predictor based on long short-term memory (LSTM, [16]). This sub-module is named lstm-compress and can only work with a batch size of 1. Another deep learning-based lossless compressor, called tensorflow-compress [18], is a generalized version of lstm-compress, which can work with different recurrent units and an arbitrary batch size. Compared to the static DZip method, lstm-compress and tensorflow-compress do not require pre-training of their deep learning predictors; thus, they do not need to include the neural networks in the compressed data, provided that both the compressor and the decompressor are initialized with the same random predictors. During encoding and decoding, the context information is extracted and stored in the network state, and the RNN predictor is updated (indicated by the red dashed line in Figure 1) via the standard back-propagation through time (BPTT, [24]) algorithm. In this paper, we refer to this kind of context-based technique as a dynamic deep recurrent compressor (dynamic DRC).
Although the aforementioned compression techniques are promising, they cannot fully exploit the special characteristics of sensor signals. For example, the predictor resolution $R_p$, defined in Section 2.1, is usually fixed to 8 in these general compressors. In the case that the ADC resolution $R_a \neq R_p$, re-grouping the bit sequence will lose the original data properties. Even though the predictor resolution can be arbitrary for some RNN-based compressors, setting $R_p$ to $R_a$ will result in a formidable model size when the signal sequence is generated by a high-resolution ADC, since the output alphabet grows to $2^{R_a}$ symbols. To address these problems, we propose a new dynamic DRC with a novel recurrent predictor that is specially designed for compressing sensor signals. The network details are presented in the following section.

Multi-Channel Recurrent Predictor
In this section, we present the proposed multi-channel recurrent predictor by first introducing the background of RNNs.

Recurrent Neural Networks
An RNN can encode a sequence of arbitrary length $\{x_t\}$ into a fixed-sized state vector $s_t$ by reading one data point at a time. In most cases, at each time step $t$, the state vector $s_t$ is first updated and then used to calculate the model output:

$$s_t = \mathcal{T}(s_{t-1}, x_t), \qquad y_t = \mathcal{G}(s_t), \qquad (4)$$

where $\mathcal{T}$ is the state transition function, $\mathcal{G}$ is the output function, and $x_t$ and $y_t$ represent the input and output vectors of the RNN at time step $t$, respectively. One of the most fundamental building blocks of RNNs is the perceptron operator [25], defined as:

$$\mathcal{P}(u_1, \cdots, u_M) = \phi\Big(\sum_{i=1}^{M} W_i u_i + b\Big), \qquad (5)$$

where $W_i$ is the weight matrix relating to input $u_i$, $b$ is the bias vector, and $\phi$ represents the non-linear activation function. The simplest RNN architecture uses a single perceptron operator as the state transition function.
To alleviate the well-known vanishing and exploding gradient issues of RNNs, gate units were introduced [16], and the state transition function is constructed as a highway operator [25] $\mathcal{H}$, which can be generally defined as:

$$s_t = \mathcal{H}(s_{t-1}, x_t) = g^{\alpha}_t \odot \hat{s}_t + g^{\beta}_t \odot s_{t-1}, \qquad (6)$$

where $\odot$ represents element-wise multiplication, $\hat{s}_t$ is a candidate state vector, and $g^{\alpha}_t$ and $g^{\beta}_t$ are gate vectors whose entries lie between 0 and 1. For example, the candidate state and gate vectors in the popular gated recurrent unit (GRU, [26]) are calculated as:

$$r_t = \sigma(W_r x_t + U_r s_{t-1} + b_r),$$
$$\hat{s}_t = \tanh(W_{\hat{s}} x_t + U_{\hat{s}}(r_t \odot s_{t-1}) + b_{\hat{s}}),$$
$$g^{\alpha}_t = \sigma(W_{\alpha} x_t + U_{\alpha} s_{t-1} + b_{\alpha}), \qquad g^{\beta}_t = 1 - g^{\alpha}_t, \qquad (7)$$

where $\sigma$ and $\tanh$ represent the sigmoid and hyperbolic tangent activation functions, respectively, and $r_t$ is the reset gate. In this paper, we denote such a highway operator as $\mathcal{H}_{\mathrm{GRU}}$.
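A single GRU-style highway update can be sketched with NumPy as follows (an illustrative sketch; the per-gate weight layout `W`/`U`/`b` is an assumed parameterization, not taken from the paper):

```python
import numpy as np

# One GRU-style highway step: s_t = g ⊙ ŝ_t + (1 - g) ⊙ s_{t-1}.
# W, U, b are dicts holding the reset ("r"), update ("z"), and
# candidate ("h") parameters; hidden size n, input size m.
def gru_step(s_prev, x, W, U, b):
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    g_r = sigmoid(W["r"] @ x + U["r"] @ s_prev + b["r"])   # reset gate
    g_z = sigmoid(W["z"] @ x + U["z"] @ s_prev + b["z"])   # update gate
    s_hat = np.tanh(W["h"] @ x + U["h"] @ (g_r * s_prev) + b["h"])
    return g_z * s_hat + (1.0 - g_z) * s_prev              # highway mix
```

Here the update gate plays the role of $g^{\alpha}_t$, and $g^{\beta}_t = 1 - g^{\alpha}_t$ as in the GRU.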

Multi-Channel Recurrent Unit
In this sub-section, we present a novel multi-channel recurrent unit (MCRU), which is specially designed as a deep-learning predictor for sensor signals. The predictor resolution $R_p$ is set to the ADC resolution $R_a$ in order to keep the integrity of each value sampled by the ADC. Specifically, every $R_a$ bits are re-grouped into $C$ channels, each of which corresponds to $R_c = R_a / C$ bits and represents a specific precision scale of the sampled value. Each channel is assigned an independent $\mathcal{H}_{\mathrm{GRU}}$; thus, the context information of the corresponding precision scale is retained individually.
At each time step $t$, $C$ predictions, namely $p^1_t, \cdots, p^C_t$, are made in order from less-significant-bit (LSB) channels to more-significant-bit (MSB) channels, as illustrated in Figure 2. Specifically, for the $j$-th channel, the corresponding state vector is updated first:

$$s^1_t = \mathcal{H}^1_{\mathrm{GRU}}(s^1_{t-1}, x^1_t), \qquad (10)$$
$$s^j_t = \mathcal{H}^j_{\mathrm{GRU}}(s^j_{t-1}, x^j_t, x^{j-1}_{t+1}), \quad j > 1. \qquad (11)$$

Here, $x^j_t$ represents the $j$-th re-grouped input with one-hot encoding. Note that an LSB channel has a lower value of $j$. After the state vector has been updated, a prediction of $x^j_{t+1}$ is made based on the new state vector $s^j_t$:

$$p^j_t = \mathcal{P}^j(s^j_t). \qquad (12)$$

Note that the prediction $p^j_t$ of each channel is calculated independently by the corresponding perceptron operator $\mathcal{P}^j$. After the prediction has been made, it will either be sent to the encoder with the raw data to produce a compressed data point, or sent to the decoder with a compressed data point to retrieve the original data. In either case, the true value $x^j_{t+1}$ corresponding to the prediction $p^j_t$ will be sent to the state transition function of the next channel (if one exists), as in Equation (11).
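One MCRU time step can be sketched as below (a schematic sketch; the cell and prediction-head interfaces, and the exact hand-over of the realized value between channels, are our reading of the description above, not the paper's code):

```python
# Schematic MCRU time step. Channels run from LSB (j = 0) to MSB
# (j = C-1); each has its own recurrent cell and prediction head, and
# the true value realized for channel j is handed to channel j+1's
# state update, mirroring the channel-chaining described in the text.
def mcru_step(cells, heads, states, inputs, get_true_value):
    """cells[j](state, x, extra) -> new state; heads[j](state) -> p_j;
    get_true_value(j, p_j) -> realized symbol (from the en/decoder)."""
    extra = None  # no hand-over for the first (LSB) channel
    predictions = []
    for j in range(len(cells)):
        states[j] = cells[j](states[j], inputs[j], extra)  # update state
        p_j = heads[j](states[j])                          # predict next
        extra = get_true_value(j, p_j)  # realized symbol for channel j
        predictions.append(p_j)
    return states, predictions
```

During encoding, `get_true_value` simply reads the raw data; during decoding, it returns the symbol the arithmetic decoder recovers from `p_j`, so both sides drive the channels identically.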
In this paper, we refer to the context-based compressor that uses the proposed MCRU as a dynamic predictor as a multi-channel deep recurrent compressor (MCDRC), which uses arithmetic coding as the coding module by default. ...

Experiments
In this section we demonstrate the effectiveness of our proposed MCDRC by comparing it with other compressors on different types of sensor data.

Datasets
We select four public sensor datasets with different signal characteristics: BLOND [27] provides continuous energy measurements of a typical office environment at high sampling rates. This dataset contains the voltage and current measurements of 16 common electrical appliances over 50 days. In this paper, we use the voltage measurements of the first 5 electrical appliances within 2 min starting at 10 a.m. on 30 June 2017.
ESC [28] is a labeled collection of 2000 environmental audio recordings (in 50 semantic classes). The recordings are sampled at 44,100 Hz with 16 bits per sample without compression. Since each individual audio file is too small, we merge the recordings of each class into one single sensor signal without the file headers. We use the first five classes of recordings for this experiment.
MPACT [29] contains radio frequency (RF) signals from different brands and models of drone remote controls (RC). The RF signals transmitted by the drone RCs to communicate with the drone are recorded by a passive RF surveillance system. There are 17 drone RCs from eight different manufacturers, and each RF signal contains 5000 k samples (spanning a period of 0.25 ms). We use the raw data of the first RF signal file of drones from five different manufacturers for the experiment.
MPACT-14bits contains the same data as MPACT, except that the lowest 2 bits of each sampled value are removed. Thus, the ADC resolution of this dataset decreases to 14 bits. This dataset is constructed to verify all compressors' performance under a different ADC resolution $R_a$.
The detailed information of these sensor datasets is listed in Table 2.

Experiment Setup
We benchmarked the performance of the proposed MCDRC on four different datasets of sensor signals and compared it with existing general-purpose compressors, namely, Gzip, BSC [30], PAQ [14], CMIX, and tensorflow-compress with different recurrent units. All compressors were set to give priority to the compression ratio. DZip is not considered for comparison since it requires pre-training and saving model weights. Table 3 lists the versions and methods used by these compressors.
In the implementation of all dynamic DRCs, we carefully avoided the use of non-deterministic operations [31]. This ensures that the predictors are deterministic, which means that with a fixed random seed and the same input data, the compressors will yield exactly the same results. In this way, the compressed data can be decompressed successfully. Each dynamic DRC was tested 10 times with different random seeds.
All dynamic DRCs including tensorflow-compress and the proposed approach were evaluated on an NVIDIA 2080TI GPU with 12GB GRAM. Other compressors, such as Gzip and CMIX, were evaluated on an Intel I9-10900K CPU (3.70GHz) with 32 GB memory and 20 cores.

Results of Different Recurrent Units
In this sub-section, the performance of the proposed MCRU as a probability predictor for sensor signals is evaluated and compared to classic recurrent units, including LSTM and GRU. The two classic recurrent units were implemented based on the open-source code of tensorflow-compress. As a general-purpose compressor, tensorflow-compress has its predictor resolution fixed to 8, and we stacked 2 recurrent layers for its overall RNN architecture in this experiment. For a fair comparison, we set the channel number to 2 in the proposed MCRU. Other hyper-parameter settings can be found in Table 4. We use BPC, defined in Equation (3), as the performance metric. The results are reported in Table 5. We observed that all three recurrent units have stable performance over all signals. Specifically, the standard deviation over 10 runs with different random seeds for each model-signal pair is within 0.005; thus, only the median BPC is reported. As we can see from Table 5, the proposed MCRU outperforms the other two recurrent units consistently on different signals, except BLOND-5. This advantage is more obvious on signals that are more difficult to predict correctly. For example, compared to LSTM, MCRU has an average BPC advantage of 0.319 and 0.158 on the ESC and MPACT signals, respectively. Furthermore, it can be clearly observed from Table 5 that tensorflow-compress suffers from a severe performance decline on signals sampled from ADCs with a resolution that is not an integer multiple of 8. In such cases, re-grouping the bit sequence destroys the integrity of each signal point. On the other hand, MCRU is more robust, since its predictor resolution is exactly the same as the ADC resolution.

Results of Different Compressors
In this sub-section, we compare the proposed MCRU-based dynamic DRC with other compressors with regard to the compression ratio (CR), which is defined as

$$\mathrm{CR} = \frac{\text{uncompressed data size}}{\text{compressed data size}}.$$
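As a small worked example of this definition (illustrative only; the relation $\mathrm{CR} \approx 8/\mathrm{BPC}$ assumes an ideal coder on 8-bit characters and ignores coder overhead):

```python
# Compression ratio from raw sizes, and the ideal CR implied by a
# predictor's BPC under the 8-bit-character convention (assumption:
# negligible entropy-coder overhead).
def compression_ratio(uncompressed_bytes, compressed_bytes):
    return uncompressed_bytes / compressed_bytes

def cr_from_bpc(bpc):
    return 8.0 / bpc
```

For instance, shrinking a 1000-byte signal to 250 bytes gives CR = 4, which is also the ideal CR of a predictor achieving a BPC of 2.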
The hyper-parameters of MCRU are the same as in Table 4. The results are listed in Table 1. Again, for our proposed approach, the median CR over 10 runs is reported.
As we have mentioned in Section 2.2, dictionary-based entropy coding methods, such as Gzip, are not good at removing redundancy in sensor signals, leading to low CRs. The block-sorting-based BSC algorithm, on the other hand, demonstrates a great performance improvement (more than 2 times) over Gzip on signals with a relatively simple pattern, such as the BLOND signals. However, this gap shrinks dramatically on signals with a more complex pattern, such as the ESC and MPACT signals. As for the compressors based on a mixed-context strategy, namely PAQ and its successor CMIX, they outperform BSC consistently on all types of signals. Comparing these two methods, the major difference is that CMIX additionally utilizes an LSTM to extract context information, yielding a minimum gain in CR of approximately 0.2 over all signals, which makes CMIX a state-of-the-art general compressor.
The proposed MCDRC does not show advantages over CMIX on the BLOND signals, which contain a simple pattern. Nonetheless, on the other signals, which contain more complex features, MCDRC outperforms the state-of-the-art CMIX consistently. Note that the CR advantage of MCDRC over CMIX nearly doubles on the 14-bit signals compared to their 16-bit counterparts, which once again reveals the superiority of the flexible predictor-resolution setting of MCDRC.

Compression Speed
In this sub-section, we compare the compression speed of MCDRC to CMIX. In our experiments, CMIX takes 13.6 min/MB for compression on average, even with its highly optimized implementation. As for MCDRC, the compression speed is related to the batch size. The compression performance of MCDRC on the MPACT signals with different batch sizes (64, 128, 256, 512, and 1024) is reported in Figure 3. For each batch-size configuration, the experiment is repeated 10 times, and the median results are reported. Other hyper-parameters are the same as in Table 4. As we can see from Figure 3, when the batch size increases, the compression ratio gradually decreases, but less time is needed for compression. Therefore, there is a trade-off between the compression ratio and the compression speed. When we use a batch size of 256, the proposed method consistently outperforms CMIX regarding the compression ratio and is five times faster. Note that this result also holds for decompression, since the two processes are symmetrical.

Conclusions
In this work, we present a context-based lossless compression technique using a novel recurrent neural network architecture, namely the MCRU, which is specially designed for compressing sensor signals. Experiments show that the proposed MCRU outperforms classic recurrent units in BPC on sensor datasets with more complex patterns. Furthermore, MCRU is more robust to sensor datasets with different ADC resolutions due to its flexible predictor-resolution setting. Based on MCRU, we propose MCDRC, whose compression ratios on several datasets exceed those of the current state-of-the-art compressor CMIX. Regarding the running time of compression and decompression, although our work gives priority to the compression ratio, MCDRC achieves an obvious speed advantage over CMIX with a large batch size (256). The proposed MCDRC is five times faster than CMIX and can be further improved.
Many further ideas may be explored based on our work. For example, the highway operator H can be optimized to achieve stronger memory capability for the sensor signals. On the other hand, the channel number of MCRU can be set adaptively according to signal characteristics rather than pre-set to obtain better compression performance.
Although the proposed method has achieved state-of-the-art performance regarding the compression ratio on signal data, it is more suitable for offline compression due to its relatively slow compression speed. Therefore, another direction of future work can be implementing the proposed approach on hardware for real-time compression with low energy consumption.