# Finding Patterns in Signals Using Lossy Text Compression

## Abstract


## 1. Introduction

#### 1.1. Motivation: Autonomous Toy Robots

**Compression or learning?** As explained above, the motivation for this study is to learn a communication protocol from a given recording of sampled signals, i.e., to reverse-engineer the protocol. This problem is closely related to the problem of compressing signals, since an efficient compression algorithm for a specific protocol is expected to exploit the repeated format of that protocol. For example, machine learning is used to extract a small (compressed) predictive model from large sample data; such models are used, e.g., in video compression protocols to compress real-time video, where the decoder predicts the next frame via the model and the encoder sends only the differences (the fitting errors of the model). Similarly, the results in this paper can be used to learn a protocol, to compress a message efficiently based on a given protocol, or for noise removal. The underlying optimization problem is very similar, as explained in the next sections.

#### 1.2. Run Length Encoding (RLE)

#### 1.3. Our Contribution

- Defining recursive run-length-encoding (RRLE) which extends the classic RLE and is natural for strings with repeated patterns such as in communication protocols.
- An algorithm that computes the optimal (smallest) RRLE compression of any given string S in time polynomial in the length $n=\left|S\right|$ of S. See Theorem 4.
- An algorithm that recovers an unknown string S from its noisy version $\tilde{S}$ in polynomial time. The only assumption is that S has a corresponding RRLE tree of $O\left(1\right)$ levels. See Definition 1 for details. The running time is polynomial in $\left|S\right|$, and the result is optimal for a given trade-off cost function (compression rate versus replaced characters). See Theorem 6.
- A $(1+\epsilon)$-approximation for the above algorithms, which takes $O\left(n\right)$ time for every constant $\epsilon>0$, and can be run on streaming signals and in parallel, using existing coresets for signals. See Section 4.
- Preliminary experimental results on synthetic and real data that support the guarantees of the theoretical results. See Section 6.
- An open-source and home-made system for automatic reverse-engineering of remote controllers. The system was used to hack dozens of radio (PWM) and IR remote controllers in our lab. We demonstrated it on three micro air-vehicles that were bought from Amazon for less than $30 each. See [4] for more results, code, and discussions.

#### 1.4. Related Work

**Reverse Engineering.** There are many systems and applications that suggest imitating remote controllers. For example, a “universal remote controller” can record IR signals and send them again by pressing the corresponding button on this remote controller. However, such a simple controller with a small number of states cannot replace the remote controller of, e.g., a common quadcopter with seven channels, whose protocol can be used to generate unbounded types of signals.

**String Algorithms.** The notion of runs and periodicity of strings is at the core of many stringology questions [5,6]. It constitutes a fundamental area of string combinatorics due to its important applications in text algorithms, data compression, biological sequence analysis, music analysis, etc. The notion of runs was introduced by Iliopoulos, Moore, and Smyth [5], who showed that Fibonacci words contain only a linear number of runs relative to their length. Kolpakov and Kucherov [6] (see also [7], Chapter 8) proved that this property holds for any string and designed an algorithm to compute all runs in a string, which extends previous algorithms [8,9]. Other methods are presented in [7,10,11,12,13,14,15].

**Roadmap:** In Section 2, we provide the basic stringology notation needed for the algorithms, and full definitions of the problems we solve. In Section 3, we present our reverse engineering algorithms, which are the algorithms for exact and lossy text compression. In Section 4, we explain how these algorithms can be applied in our system. Then, in Section 5, we give an example of a protocol and present our reverse engineering system in detail. In Section 6, we report our experimental results. Finally, in Section 7, we conclude this paper and discuss some interesting directions for future work.

## 2. Problem Statement

#### 2.1. Basic Notations

#### 2.2. Recursive Run-Length Encoding (RRLE)

**Definition 1** (Recursive run-length encoding (RRLE))**.** An RRLE $s=({t}_{1},{s}_{1},\cdots ,{t}_{k},{s}_{k})$ is defined recursively as a $2k$-tuple, where $k\ge 1$ is an integer, ${t}_{i}\ge 1$ is an integer, and ${s}_{i}$ is either a string or an RRLE, for every $i\in [1..k]$. The size $\left|s\right|$ of s is recursively defined as $k+{\sum}_{i=1}^{k}\left|{s}_{i}\right|$. Similarly, $S(s)={u}_{1}^{{t}_{1}}\cdots {u}_{k}^{{t}_{k}}$, where ${u}_{i}={s}_{i}$ if ${s}_{i}$ is a string, and ${u}_{i}=S({s}_{i})$ otherwise, for every $i\in [1..k]$. If $S(s)=Q$, then s is an RRLE of the string Q. We define $\mathrm{rcost}(Q)$ to be the size of the smallest RRLE of Q, i.e., $\mathrm{rcost}(Q)={\min}_{\{s\mid S(s)=Q\}}\left|s\right|$. Such an RRLE is called an optimal RRLE of Q and is denoted by ${s}^{*}(Q)$.

**Problem 2.**

#### 2.3. Lossy Compression

**Similarity cost**. Such a function $\mathrm{scost}(\cdot,\cdot)$ maps every pair of strings P and Q of the same length to a score (real number) $\mathrm{scost}(P,Q)$ that measures how different the strings are; i.e., how well Q approximates P. In this paper, $\mathrm{scost}(P,Q)$ is the number of indices $i\in \left\{1,\cdots ,n\right\}$ that have different corresponding letters in P and Q, also known as the Hamming distance [26] between P and Q. For example, if $P=ababcb$ and $Q=ababab$, then $\mathrm{scost}(P,Q)=\mathrm{scost}(ababcb,ababab)=1$, since only the 5th letter differs: “c” in P and “a” in Q.
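The Hamming-distance similarity cost can be sketched in a few lines of Python (our illustration):

```python
def scost(P, Q):
    """Hamming distance: number of positions where P and Q differ (|P| == |Q|)."""
    assert len(P) == len(Q)
    return sum(p != q for p, q in zip(P, Q))

print(scost("ababcb", "ababab"))  # 1: only the 5th letter differs
```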

**An overall cost function.** This function $\mathrm{cost}(\cdot,\cdot)$ assigns an overall score to a pair of strings that measures the trade-off between the similarity cost $\mathrm{scost}(P,Q)$ and the compression cost $\mathrm{ccost}\left(Q\right)$. For simplicity, we use the natural goal of minimizing the sum of the similarity cost and the compression cost, $\mathrm{cost}(P,Q)=\mathrm{scost}(P,Q)+\mathrm{ccost}(Q)$.

**Problem 3.**

## 3. Algorithms for Exact and Lossy RRLE Compression

In the **exact** version of the problem, the input is a string Q that represents the signal, and the output is $\mathrm{rcost}\left(Q\right)$, the size of the smallest RRLE s of Q; see Definition 1. Hence, the similarity cost is $\mathrm{scost}(Q,S(s))=0$, and the overall cost is $\mathrm{cost}(Q,S(s))=\mathrm{rcost}\left(Q\right)$.

In the **lossy** version, we aim to “clean” the noise from the input signal Q and extract the hidden repeated patterns by finding a similar string P that minimizes $\mathrm{cost}(Q,P)$; see Problem 3. The motivation for both problems is that the input signal is assumed to have periodic patterns (exactly or approximately). By finding these periods, we can either compress the signal efficiently or reverse-engineer the hidden protocol that generated it, as in our experimental results. From the partition of the input string into periodic substrings, we can deduce the format of the protocol, including constant bits and the substring that is responsible for each button; see Section 5.

#### 3.1. Warm Up: Exact RRLE Compression

Algorithm 1: Exact(Q); see Theorem 4.

**Overview of Algorithm 1:** Given a string Q of length n, the algorithm computes a matrix $D[1..n][1..n]$ such that $D\left[i\right]\left[j\right]=\mathrm{rcost}\left(Q\right[i..j\left]\right)$. In particular, $D\left[1\right]\left[n\right]=\mathrm{rcost}\left(Q\right)$. The matrix D is computed for substrings of increasing length; i.e., we first handle all substrings of length one, then all substrings of length two, and so on until the full string of length n is evaluated. We initialize the main diagonal to $D\left[i\right]\left[i\right]=2$ for $1\le i\le n$ (one counter plus one letter). Then we compute $\mathrm{rcost}\left(Q\right[i..j\left]\right)$ for all $1\le i<j\le n$ using a recursive definition of $D\left[i\right]\left[j\right]=\mathrm{rcost}\left(Q\right[i..j\left]\right)$. This is the minimum of the following three values: (i) $\mathrm{rcost}\left(Q\right[i..i+r-1\left]\right)$ if $Q[i..j]$ is r-periodic, where r is as small as possible; (ii) the cost of leaving $Q[i..j]$ as a whole, which takes 1 counter and $j-i+1$ letters; and (iii) the smallest $\mathrm{rcost}$ that can be obtained by partitioning $Q[i..j]$ into two consecutive substrings $Q[i..k-1]$ and $Q[k..j]$, over every integer $k\in [i+1..j]$.
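A compact Python sketch of this dynamic program (our code, following the overview; the paper's pseudocode is Algorithm 1). The smallest period of each substring is obtained from the KMP failure function, and case (i) charges only the cost of the period, exactly as stated above:

```python
def smallest_period(s):
    """Smallest r such that s equals its length-r prefix repeated, computed
    from the KMP failure function; returns len(s) if s has no smaller period."""
    n = len(s)
    fail = [0] * n
    k = 0
    for i in range(1, n):
        while k and s[i] != s[k]:
            k = fail[k - 1]
        if s[i] == s[k]:
            k += 1
        fail[i] = k
    r = n - fail[n - 1]
    return r if n % r == 0 else n

def rcost(Q):
    """rcost(Q) by dynamic programming over substrings of increasing length."""
    n = len(Q)
    D = [[0] * n for _ in range(n)]
    for m in range(1, n + 1):              # substring length
        for i in range(n - m + 1):         # start index; last index j = i + m - 1
            j = i + m - 1
            best = 1 + m                   # (ii) keep Q[i..j] whole: 1 counter + m letters
            r = smallest_period(Q[i:j + 1])
            if r < m:                      # (i) Q[i..j] is r-periodic: cost of the period
                best = min(best, D[i][i + r - 1])
            for k in range(i + 1, j + 1):  # (iii) best split into Q[i..k-1], Q[k..j]
                best = min(best, D[i][k - 1] + D[k][j])
            D[i][j] = best
    return D[0][n - 1]

print(rcost("a" * 9))   # 2: the RRLE (9, "a")
print(rcost("ababab"))  # 3: the RRLE (3, "ab")
```

The three nested loops (length, start index, split point) with a linear-time period computation per cell give the $O(n^3)$ bound of Theorem 4.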

**Theorem 4.**

**Proof.**

**Time Complexity:** The algorithm runs $O\left({n}^{2}\right)$ iterations over the pair of “for” loops. In each such iteration, it computes the smallest period, if any, of a string of length $O\left(n\right)$, which takes linear time using the preprocessing of the Knuth–Morris–Pratt (KMP) algorithm [27]. Then, the corresponding entry in D is computed using the $O\left(n\right)$ precomputed values. Hence, the total time complexity of the algorithm is $O\left({n}^{3}\right)$. □

#### 3.2. Lossy RRLE Compression

- The minimum cost of modifying Q to be r-periodic, over every possible period length r. Formally, this is the minimal $\mathrm{cost}(Q,{q}^{n/r})+1$, over every string $q\in {\Sigma}^{r}$ and factor r of n.
- The minimum $\mathrm{rcost}$ over all possible partitioning options of Q.
- The cost of representing Q as is, with no compression.
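For the Hamming similarity cost, the first option above does not require enumerating all $q\in\Sigma^r$: the closest r-periodic string takes, at each of the r period positions, the majority letter of the corresponding column of Q. A small sketch of this observation (our illustration, not Algorithm 2 itself):

```python
from collections import Counter

def best_periodic(Q, r):
    """Hamming-closest r-periodic string to Q (len(Q) divisible by r):
    position j of the period takes the majority letter of column j."""
    n = len(Q)
    assert n % r == 0
    period, changes = [], 0
    for j in range(r):
        col = Q[j::r]                      # letters at positions j, j+r, j+2r, ...
        letter, freq = Counter(col).most_common(1)[0]
        period.append(letter)
        changes += len(col) - freq         # letters that must be replaced
    return "".join(period) * (n // r), changes

P, c = best_periodic("ababcb", 2)
print(P, c)  # ababab 1
```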

**Definition 5.** Let $Q\in {\Sigma}^{n}$ be a string over an alphabet $\Sigma =\left\{{\Sigma}_{1},\cdots ,{\Sigma}_{|\Sigma |}\right\}$. Let $r\ge 1$ be a factor of n. For every $i\in [|\Sigma |]$ and $j\in \left[r\right]$, let ${Q}_{i,j}\in {\Sigma}^{n}$ denote the string obtained from Q by replacing its letters in the entries $k\in \left\{j,j+r,j+2r,\dots \right\}$ with ${\Sigma}_{i}$; i.e., for every $k\in \left[n\right]$ we have

$${Q}_{i,j}\left[k\right]=\begin{cases}{\Sigma}_{i} & \text{if }k\in \left\{j,j+r,j+2r,\dots \right\},\\ Q\left[k\right] & \text{otherwise.}\end{cases}$$

Algorithm 2: $\mathrm{LOSSY}(M,\ell )$; see Theorem 6.

**Overview of Algorithm 2:** The input to the algorithm is an n-Parikh matrix M of a string Q of length n over $\Sigma $, and an integer ℓ that denotes the maximum RRLE tree level of the compression, as explained below. Note that both Q and n can be extracted from this Parikh matrix. We use dynamic programming to compute the matrix $D[1..n][1..n]$, in which $D\left[i\right]\left[j\right]=\mathrm{cost}(Q[i..j],{P}^{\prime})$ is the (optimal) compression cost of the substring $Q[i..j]$, where ${P}^{\prime}$ is the corresponding recovered string. The loops are over the length m of the substring (from 1 to n), and then over the starting index i. The last index is denoted by $j=i+m-1$.
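For intuition only (the paper's formal definition of the n-Parikh matrix is given in its preliminaries, which we do not reproduce here), one plausible minimal encoding is an indicator matrix with one row per letter and one column per position, from which Q and n are recoverable as the overview notes:

```python
# Hypothetical one-hot encoding of a string as a |Sigma| x n matrix
# (an assumption for illustration; the paper's definition may differ).

def parikh_matrix(Q, alphabet):
    """M[i][j] = 1 iff Q[j] == alphabet[i]."""
    return [[1 if q == a else 0 for q in Q] for a in alphabet]

def string_from_parikh(M, alphabet):
    """Recover Q by taking the maximal row in each column."""
    n = len(M[0])
    return "".join(alphabet[max(range(len(alphabet)), key=lambda i: M[i][j])]
                   for j in range(n))

M = parikh_matrix("abba", "ab")
print(string_from_parikh(M, "ab"))  # abba
```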

- Computing ${M}^{r}$ for every possible r takes $O(|\Sigma |{n}^{2})$ time, plus the time for computing D for ${M}^{r}$.
- Computing the second value in the equation takes $O\left(n\right)$ time.
- Computing the third value in the equation takes $O(|\Sigma |n)$ time.

**Theorem 6.**

**Proof.**

## 4. Linear-Time, Streaming, and Parallel Computation

## 5. The Reverse Engineering System

#### 5.1. Example of a Protocol: The SYMA G107 Helicopter

**Example Protocol:** The SYMA G107 helicopter supports communication over three channels that represent the current state (level) of each button in the remote controller: throttle, pitch, and yaw of the helicopter. As in most of our RCs, the communication protocol is defined by a multi-layer language. For the special case of the SYMA G107, the protocol is as follows.

**Level I: A/B (switches).** The IR signal is essentially a stream of binary numbers that corresponds to the IR light (on or off), which can change every 13 microseconds. Light on for 13 microseconds represents in our notation the letter “A”; otherwise, the letter is “B”. The letters “A” and “B” are called switches.

**Level II: 0/1/H/F (letters).** The letter “0” is represented by the sequence of switches “ABABABABABBBBBBBBBBB”; that is, five pairs of “AB” followed by ten “B” switches. We encode this as a sequence of pairs “0” = $(5,AB,10,B)$, known as run-length encoding (RLE); see Section 1.2. Similarly, we define the letters “1” = $(5,AB,23,B)$, “H” = $(75,AB,72,B)$ (called the header), and “F” = $(10,AB,47,B)$ (called the footer).
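Expanding these letter RLEs back into switch streams can be sketched as follows (our code; the counter values are the ones listed above):

```python
# Level II letters of the SYMA G107 protocol as RLEs over the switches A/B.
LETTERS = {
    "0": [(5, "AB"), (10, "B")],
    "1": [(5, "AB"), (23, "B")],
    "H": [(75, "AB"), (72, "B")],  # header
    "F": [(10, "AB"), (47, "B")],  # footer
}

def switches(letter):
    """Expand a Level II letter into its Level I switch stream."""
    return "".join(s * t for t, s in LETTERS[letter])

print(switches("0"))  # five "AB" pairs followed by ten "B" switches
```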

**Level III: word.** A word in the SYMA G107 protocol is defined by the following sequence of letters:

#### 5.2. The System

#### 5.3. After Learning the Protocol

- Recording analog signals. In the case of IR signals, we use an IR decoder (sensor) that receives the signals from the remote controller. The IR decoder gets its power from a micro-computer (Arduino) and is connected to a logic analyzer.
- Converting analog signals to binary signals. The logic analyzer converts the analog voltage signal into a digital binary signal that has value “A” or “B” in each time unit.
- Transmitting the binary stream to a laptop via a USB port.
- Running the reverse engineering algorithm to learn the protocol.
- Producing commands to the robot using the micro-computer, which is connected to a transmitter or a few IR (Infra-Red) LEDs.

## 6. Reverse Engineering Experiments

**The input data** was a set $\mathcal{M}$ of $51=17\cdot 3$ strings $\left\{{M}_{i,k}\mid i\in [0..16],k\in [1..3]\right\}$ over ${\Sigma}^{\prime}$. Each string M in this set was constructed as the sum “$M={M}^{*}+N$” of a fixed string ${M}^{*}$ and additional random noise N, defined as follows. Let $V=S(4,\text{“12”},4,\text{“3”},3,\text{“4”})=121212123333444$, and let ${M}^{*}={V}^{3}$ be a string over the alphabet $\Sigma $. For $\sigma >0$, let ${N}_{\sigma}$ denote a string over ${\Sigma}^{\prime}$ of length $\left|N\right|=|{M}^{*}|$, where $N\left[j\right]\sim \mathcal{N}(0,{\sigma}^{2})$ is a random variable from a Gaussian distribution with zero mean and variance ${\sigma}^{2}$, for every $j\in [1..|N\left|\right]$. To obtain a finite alphabet and Parikh matrix, we scale and round each letter $N\left[j\right]$ to its nearest integer. We then define ${\sigma}_{i}=0.05i$ and ${M}_{i}={M}^{*}+{N}_{{\sigma}_{i}}$; i.e., ${M}_{i}\left[j\right]={M}^{*}\left[j\right]+{N}_{0.05i}\left[j\right]$, for every $i\in [0..16]$ and $j\in [1..|N\left|\right]$. We repeated this construction of ${M}_{i}$ three times to obtain the strings ${M}_{i,1},{M}_{i,2},{M}_{i,3}$, with different random noise ${N}_{{\sigma}_{i}}$ drawn from the same distribution. The result is the input set $\mathcal{M}$ of strings above over the real alphabet ${\Sigma}^{\prime}$.
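The data construction above can be sketched as follows (our code; the function name and the fixed seed are ours, and we fold the scaling and rounding into a single `round` per letter):

```python
import random

# Synthetic-data construction: the clean string M* = V^3 with
# V = S(4,"12", 4,"3", 3,"4"), plus rounded Gaussian noise per letter.
V = "12" * 4 + "3" * 4 + "4" * 3  # "121212123333444"
M_star = V * 3

def noisy_copy(sigma, rng=random.Random(0)):
    """M_i[j] = M*[j] + N[j], with N[j] ~ N(0, sigma^2) rounded to an integer."""
    return [int(ch) + round(rng.gauss(0, sigma)) for ch in M_star]

M_i = noisy_copy(0.05 * 3)       # noise level i = 3, i.e., sigma = 0.15
print(len(M_i) == len(M_star))   # True
```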

**The experiment** is a list of $255=17\times 3\times 5$ calls ${O}_{i,j,k}=\mathrm{LOSSY}({M}_{i,k},j)$ to Algorithm 2 with the string ${M}_{i,k}$ and level j, over every variance level $i\in [0..16]$, repetition (try) $k\in [1..3]$, and $j\in [1..5]$. The cost error for such a call is the weighted number of mismatched letters (i.e., dissimilarity or distance) between the recovered output string ${O}_{i,j,k}$ and the original de-noised string ${M}^{*}$.

**The results** are shown in Figure 3. The x-axis represents the integer level $i={\sigma}_{i}/0.05$ of the noise variance, as defined above. The color of each of the five curves corresponds to a different RRLE level $j\in [1..5]$ that was used in Algorithm 2. The height y of the jth curve at the ith variance level is the mean $(\mathrm{error}(i,j,1)+\mathrm{error}(i,j,2)+\mathrm{error}(i,j,3))/3$ over the three errors, together with its variance.

**Conclusion.** Figure 3 shows that the algorithm becomes more robust to noise as the number of levels in the tree increases.

#### 6.1. Experiments on Toy Robots

**The ground truth.** After using our system, we verified the following protocol of the IR-UFO, which has a structure that is similar to that of the SYMA G107, as defined in Section 5.1, with the following changes.

**Level I:**The IR light changes every 13 microseconds.

**Level II:** “0” $=(175,A,200,B)$, “1” $=(350,A,375,B)$, and $H=(2,(360,A,240,B))$.

**Level III:** A stream of binary words, each consisting of 30 bits followed by a run of 4000 “A” bits. The format of a word is

#### 6.2. De-Noising Level I

**The experiment.** An IR signal was generated from the remote controller (RC). The “throttle” stick on the RC was pushed to its maximum level (100%), which produces repetitions of $W=S(\mathrm{package}(127,50,50,0))$; see (1). Using the recording part of our system (see Figure 2), we recorded a signal of 6 seconds to obtain the Level I string

**Recovery Error.** We partitioned the recorded string V into separate packages ${P}_{1},{P}_{2},\cdots $. This was easy since each package was separated from the following package by a continuous sequence of thousands of “A” letters. Hence, we simply removed from V any consecutive sequence of at least 500 “A” letters. The string between each resulting gap was defined to be a package, so we obtained the list of m packages ${W}_{1},{W}_{2},\cdots ,{W}_{m}$. We denote by ${V}_{j}={W}_{1}{W}_{2}\cdots {W}_{j}$ the string that consists of the first j packages, for every $j\in [1..m]$.
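This partitioning step amounts to splitting on long runs of “A”, which can be sketched with a regular expression (our illustration; the 500-letter threshold is the one stated above):

```python
import re

def split_packages(V, gap=500):
    """Split the recorded string into packages at every run of >= gap 'A's."""
    parts = re.split("A{%d,}" % gap, V)
    return [p for p in parts if p]  # drop empty strings at the boundaries

# toy example with gap=3 instead of 500:
print(split_packages("BBABAAAABBBAAAAB", gap=3))  # ['BBAB', 'BBB', 'B']
```

Note that short runs of “A” inside a package (shorter than the threshold) are kept intact, as in the first package of the toy example.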

**The error plots** are shown in Figure 3. The green curve shows the average Hamming distance $\mathrm{error}\left(j\right)$ on the y-axis, over different numbers j of words (x-axis), together with its variance.

**Recovery using Algorithm 2.** We ran Algorithm 2 m times via the call $\mathrm{LOSSY}({M}_{j},\mathrm{levels})$, where ${M}_{j}$ is the Parikh matrix of ${V}_{j}$ and $\mathrm{levels}=2$, and ${R}_{j}$ is the output RRLE, for each $j\in [1..m]$. Without noise, ${V}_{j}={W}^{j}$, and the output is ${R}_{j}=(j,{s}^{*}(W))$, which corresponds to the string ${W}^{j}={V}_{j}=S(j,{s}^{*}(W))$. In practice, ${V}_{j}$, unlike ${W}^{j}$, is not periodic, but ${R}_{j}$ (the recovered output signal) is expected to be j-periodic. The average error of the recovered string $S\left({R}_{j}\right)$ is then

**Results of Algorithm 2** are shown in Figure 3. The blue curve shows the average Hamming distance $\mathrm{ourerror}\left(j\right)$ on the y-axis, over different numbers j of words (x-axis). Since ${R}_{j}$ is always periodic, the variance of the error between the recovered packages is zero.

**Conclusions:** In Figure 3 we see, as expected from the analysis, that the recovery error decreases as Algorithm 2 is given more packages to learn from. In contrast, the error of the “memory-less” thresholding approach does not decrease over time.

#### 6.3. De-Noising Level II

#### 6.4. De-Noising Level III

**In the first experiment** we repeated the experiment from the previous section, but in the package P we used different values of $throttle$. That is, the “throttle” stick on the RC was pushed to random levels, from 1% to 100%, which ideally produces repetitions of the semantic package $L=S(\mathrm{package}(throttle,50,50,0))$, where the value of $throttle$ is a random integer in $[0..127]$. The checksum field in L also changed in each package. Let $M={L}_{1}{L}_{2}\cdots $ denote the concatenation of these semantic packages.

**The results** are shown in Table 1. The first 4 lines of Table 1 correspond to 4 semantic packages $M={L}_{1}{L}_{2}{L}_{3}{L}_{4}$, where the throttle is changed and the pitch/roll sticks remain unchanged. The fourth column of the ith row shows ${L}_{i}$ for $i\in [1..4]$. The “throttle” bits were marked manually (by us) in bold. The first row of the fifth column contains the output of Algorithm 2 on M, where the throttle field, as well as the checksum, were indeed identified by wildcards.
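The “common intersection” shown in the Output column of Table 1 can be read as a per-position intersection of the input packages: positions on which all packages agree keep their bit, and the rest become wildcards. A sketch of this reading (our illustration, not Algorithm 2 itself), applied to the four variable-roll rows of Table 1:

```python
def wildcard_intersection(packages):
    """Keep a bit where all packages agree; emit '?' elsewhere."""
    n = len(packages[0])
    return "".join(
        packages[0][j] if all(p[j] == packages[0][j] for p in packages) else "?"
        for j in range(n)
    )

rows = [
    "01100100111110000001101010010110",  # roll -100%
    "01100100000110000001101010011000",  # roll  -50%
    "01100100110110000001101010010100",  # roll   50%
    "01100100001010000001101010011001",  # roll  100%
]
print(wildcard_intersection(rows))  # 01100100????1000000110101001????
```

The result matches the Output cell of that group in Table 1: the roll field and the checksum become wildcards, while the constant bits survive.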

**In the second experiment** the goal was to see how robust Algorithm 2 is to noise that may occur in previous levels. To this end, we added synthetic noise to the input (column 4 from the left in Table 1) for the string M with the variable “throttle” field above. For each $x\in [1..250]$ we changed x bits in M to obtain ${M}_{x}$ and ran Algorithm 2 on ${M}_{x}$ as described in the previous experiment. We then define $y\left(x\right)$ to be the number of wrong letters in the output, compared to the desired string (rightmost column of the first row in Table 1), averaged over 100 experiments. Figure 3 shows the average error y in the output (y-axis) and its variance, for the values of x (x-axis).

**Conclusions:** Figure 3 shows that roughly $90\%$ of the noisy bits in the input were recovered. When the input string is completely noisy, the output string consists only of wildcards, which is correct for a few bits but wrong for the other (approximately 20) bits.

## 7. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## References

1. Nasser, S.; Barry, A.; Doniec, M.; Peled, G.; Rosman, G.; Rus, D.; Volkov, M.; Feldman, D. Fleye on the car: Big data meets the internet of things. In Proceedings of the 14th International Conference on Information Processing in Sensor Networks, Washington, DC, USA, 14–16 April 2015; pp. 382–383.
2. D’Ausilio, A. Arduino: A low-cost multipurpose lab equipment. Behav. Res. Methods **2012**, 44, 305–313.
3. Pi, R. Raspberry pi. Raspberry Pi **2012**, 1, 1.
4. Robotics and Big Data Laboratory, University of Haifa. Available online: https://sites.hevra.haifa.ac.il/rbd/about/ (accessed on 9 December 2019).
5. Iliopoulos, C.S.; Moore, D.; Smyth, W.F. A Characterization of the Squares in a Fibonacci String. Theor. Comput. Sci. **1997**, 172, 281–291.
6. Kolpakov, R.M.; Kucherov, G. Finding Maximal Repetitions in a Word in Linear Time. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science, New York, NY, USA, 17–19 October 1999; pp. 596–604.
7. Crochemore, M.; Hancart, C.; Lecroq, T. Algorithms on Strings; Cambridge University Press: Cambridge, UK, 2007; 392p.
8. Main, M.G. Detecting Leftmost Maximal Periodicities. Discret. Appl. Math. **1989**, 25, 145–153.
9. Crochemore, M. Recherche linéaire d’un carré dans un mot. CR Acad. Sci. Paris Sér. I Math. **1983**, 296, 781–784.
10. Crochemore, M. An Optimal Algorithm for Computing the Repetitions in a Word. Inf. Process. Lett. **1981**, 12, 244–250.
11. Apostolico, A.; Preparata, F.P. Optimal Off-Line Detection of Repetitions in a String. Theor. Comput. Sci. **1983**, 22, 297–315.
12. Main, M.G.; Lorentz, R.J. An O(n log n) Algorithm for Finding All Repetitions in a String. J. Algorithms **1984**, 5, 422–432.
13. Kosaraju, S.R. Computation of Squares in a String. In Annual Symposium on Combinatorial Pattern Matching (CPM); Springer: Berlin, Germany, 1994; pp. 146–150.
14. Gusfield, D.; Stoye, J. Linear Time Algorithms for Finding and Representing All the Tandem Repeats in a String. J. Comput. Syst. Sci. **2004**, 69, 525–546.
15. Crochemore, M.; Ilie, L. Computing Longest Previous Factors in Linear Time and Applications. Inf. Process. Lett. **2008**, 106, 75–80.
16. Amit, M.; Crochemore, M.; Landau, G.M. Locating All Maximal Approximate Runs in a String. In Annual Symposium on Combinatorial Pattern Matching (CPM); Springer: Berlin, Germany, 2013; pp. 13–27.
17. Landau, G.M.; Schmidt, J.P.; Sokol, D. An Algorithm for Approximate Tandem Repeats. J. Comput. Biol. **2001**, 8, 1–18.
18. Sim, J.S.; Iliopoulos, C.S.; Park, K.; Smyth, W.F. Approximate Periods of Strings. Lect. Notes Comput. Sci. **1999**, 1645, 123–133.
19. Kolpakov, R.M.; Kucherov, G. Finding Approximate Repetitions under Hamming Distance. Theor. Comput. Sci. **2003**, 1, 135–156.
20. Amir, A.; Eisenberg, E.; Levy, A. Approximate Periodicity. In Proceedings of the 21st International Symposium on Algorithms and Computation (ISAAC), Jeju Island, Korea, 15–17 December 2010; Volume 6506, pp. 25–36.
21. Nishimoto, T.; Inenaga, S.; Bannai, H.; Takeda, M. Fully dynamic data structure for LCE queries in compressed space. arXiv **2016**, arXiv:1605.01488.
22. Bille, P.; Gagie, T.; Gørtz, I.L.; Prezza, N. A separation between run-length SLPs and LZ77. arXiv **2017**, arXiv:1711.07270.
23. Babai, L.; Szemerédi, E. On the complexity of matrix group problems I. In Proceedings of the 25th Annual Symposium on Foundations of Computer Science, West Palm Beach, FL, USA, 24–26 October 1984; pp. 229–240.
24. Storer, J.A.; Szymanski, T.G. Data compression via textual substitution. J. ACM **1982**, 29, 928–951.
25. Lempel, A.; Ziv, J. On the complexity of finite sequences. IEEE Trans. Inf. Theory **1976**, 22, 75–81.
26. Hamming, R.W. Error detecting and error correcting codes. Bell Syst. Tech. J. **1950**, 29, 147–160.
27. Knuth, D.E.; Morris, J.H., Jr.; Pratt, V.R. Fast pattern matching in strings. SIAM J. Comput. **1977**, 6, 323–350.
28. Parikh, R.J. On Context-Free Languages. J. ACM **1966**, 13, 570–581.
29. Feldman, D.; Sung, C.; Sugaya, A.; Rus, D. iDiary: From GPS signals to a text-searchable diary. ACM Trans. Sens. Netw. **2015**, 11, 60.
30. Rosman, G.; Volkov, M.; Feldman, D.; Fisher, J.W., III; Rus, D. Coresets for k-segmentation of streaming data. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc.: Montreal, QC, Canada, 2014; pp. 559–567.
31. Stack Overflow. How to Solve an Xor’d System of Linear Equations. 2016. Available online: https://stackoverflow.com/questions/11558694/how-to-solve-an-xord-system-of-linear-equations (accessed on 15 September 2016).

**Figure 1.**The recursive run-length encoding (RRLE) tree of the string $\left(abcccabcccgghhekmlfffcccdcccd\right)$, which is compressed from 29 letters to 22 (12 letters and 10 counters).

**Figure 2.** The recording part of our system: Moving a stick of the RC generates an IR (Infra-Red) signal. This signal is received by an IR sensor, which in turn transmits it to the logic analyzer (by Saleae LTD). The logic analyzer translates this analog voltage signal into a binary signal; its sampling frequency is 5 MHz. The binary signal is then transmitted to the USB port of the laptop and recorded to the hard drive using our software. In the above setting, the Arduino micro-controller is used only as a power supply for the sensor. In a different setup, the expensive logic analyzer was replaced by the Arduino; in this case, the thresholds were computed by software on the laptop.

**Figure 3.** Results of Algorithm 2. (**a**) Distance in the y-axis, over different numbers j of words (x-axis). (**b**) Average error in the output (y-axis) and its variance, for the values in the x-axis. (**c**) The average Hamming distance compared to the original string.

**Table 1.** Example experimental results on a toy helicopter. The three leftmost columns tell which RC button was pressed. The input is the recorded communication bits from the RC for a long repeated signal. The output in the rightmost column is the output of our algorithm. It is the common “intersection” of the input signals. Wildcards represent unstable characters due to the different throttle, roll, or pitch values (in bold). From the output, we can deduce the format of the message, including constant bits and the substring that is responsible for each button. See Section 6.4 for details.

| Throttle | Roll | Pitch | Input to Algorithm 2 (Semantic Package) | Output (RRLE) |
|---|---|---|---|---|
| 25% | 0% | 0% | 01100100100010000001101010011111 | |
| 50% | 0% | 0% | 01001111100010000001101010011001 | |
| 75% | 0% | 0% | 00100011100010000001101010011010 | 0???????10001000000110101001???? |
| 100% | 0% | 0% | 00100011100010000001101010011010 | |
| 100% | −100% | 0% | 01100100111110000001101010010110 | |
| 100% | −50% | 0% | 01100100000110000001101010011000 | |
| 100% | 50% | 0% | 01100100110110000001101010010100 | 01100100????1000000110101001???? |
| 100% | 100% | 0% | 01100100001010000001101010011001 | |
| 100% | 0% | −100% | 01100100100000010001101010010111 | |
| 100% | 0% | 100% | 01100100100011110001101010010110 | 011001001000????000110101001???? |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Rozenberg, L.; Lotan, S.; Feldman, D. Finding Patterns in Signals Using Lossy Text Compression. *Algorithms* **2019**, *12*, 267.
https://doi.org/10.3390/a12120267
