Finding Patterns in Signals Using Lossy Text Compression

Whether the source is an autonomous car, a robotic vacuum cleaner, or a quadcopter, signals from sensors tend to contain hidden patterns that repeat themselves. For example, typical GPS traces from a smartphone contain periodic trajectories such as “home, work, home, work, · · · ”. Our goal in this study was to automatically reverse engineer such signals, identify their periodicity, and then use it to compress and de-noise these signals. To do so, we present a novel method of using algorithms from the field of pattern matching and text compression to represent the “language” in such signals. Common text compression algorithms are less tailored to handle such strings. Moreover, they are lossless, and cannot be used to recover noisy signals. To this end, we define the recursive run-length encoding (RRLE) method, which is a generalization of the well-known run-length encoding (RLE) method. Then, we suggest lossy and lossless algorithms to compress and de-noise such signals. Unlike previous results, running time and optimality guarantees are proved for each algorithm. Experimental results on synthetic and real data sets are provided. We demonstrate our system by showing how it can be used to turn commercial micro air-vehicles into autonomous robots, by reverse engineering their unpublished communication protocols and using a laptop or on-board micro-computer to control them. Our open source code may be useful both for the community of millions of toy-robot users, and for researchers that may extend it to further protocols.


Motivation: Autonomous Toy Robots
While this paper deals with a natural open problem in string compression and representation ("stringology"), its origin was in our robotics lab. Traditional labs have relatively expensive, potentially dangerous robots, such as heavy quadcopters, crawlers, and humanoids that cost thousands of dollars. However, in recent years it has become easy to order, from Amazon or eBay, low-cost "toy" robots that cost a few dozen dollars. More recently, we have seen dozens of types of robots in toy stores and malls, including helicopters, quadcopters, cars, small humanoids, and even combinations such as quadcopters with wheels. Due to their price, size, and plastic material, such robots can be used safely indoors (e.g., at home, school, or university), are more resistant to crashes, and it is easy to fix or replace their parts.
However, these toy robots are usually not autonomous due to two main problems: (i) they have no "eyes": no sensors, such as GPS, that would allow them to know their location and position; and (ii) they are controlled via a remote controller (RC) that is supposed to be operated by a human. These commercial remote controllers usually have no published communication protocol. While a few of them might be found on the internet, they change frequently from model to model.
Unfortunately, most commercial, low-cost (<$50) toy robots (cars, quadcopters, and humanoids) do not have published protocols. Moreover, their protocols frequently change over time without notice. In fact, many times we ordered a few copies of exactly the same toy robot from Amazon.com and each one of them had a different protocol. This was also the case with the toy helicopters in our experimental results section.
Our goal in this study was to take these toy robots and make them autonomous. To this end, we had to solve the two above-mentioned problems. Problem (i) was already handled by developing low-cost tracking systems based on web-cameras or on-board analog cameras [1]. In this paper we handle Problem (ii): how to automatically reverse engineer the communication protocol of the robot.
Once this protocol is known, we can imitate the remote control by producing the commands using a mini-computer, such as an Arduino [2] or Raspberry Pi [3], that is connected to a transmitter or a few IR (Infra-Red) LEDs. Instead of a human with a remote control, an algorithm can then send hundreds of commands per second to the robot, resulting in a much more stable and autonomous robot. Such robotic, low-cost systems that are based on this paper can be found, e.g., in [1].
Compression or learning? As explained above, the motivation for this study was to learn a communication protocol based on a given recording of sampled signals, that is, to reverse engineer the protocol. However, this problem is in principle very related to the problem of compressing signals, because an efficient compression algorithm for a specific protocol is expected to exploit the repeated format of this protocol. For example, machine learning is used to extract a small (compressed) predictive model from large sample data; such models are used, e.g., in video compression protocols to compress real-time video, where the decoder is expected to predict the next frame via the model, and only the differences (the fitting errors of the model) are sent by the encoder. Similarly, the results in this paper can be used to learn a protocol, to compress a message efficiently based on a given protocol, or for noise removal. The theoretical optimization problem is very similar, as explained in the next sections.

Run Length Encoding (RLE)
Given a string S, which represents a signal, our goal is to compress S such that the optimal compression will allow us to resolve the protocol behind that signal.The compression scheme we present in this paper is called recursive run-length encoding (RRLE), and is a natural generalization of run-length-encoding (RLE), but is more suited for semi-periodic strings that are produced by sensors on robots.
RLE is a very simple form of lossless string compression in which runs of letters (that is, sequences in which the same letter occurs in many consecutive elements of the string) are stored as pairs of one count and one single letter, rather than as the original run. For example, the string S = (aaaaaaabbbbaaaa) has three runs and can be represented as the vector S = (7, a, 4, b, 4, a), which means that the string S consists of seven a's, followed by four b's, followed by four a's. This way the string S can be represented using six letters/integers instead of 15. RLE is most useful on a string that contains many such runs.
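As a minimal sketch (not the paper's code; `rle` is a helper name of ours), the classic RLE above takes a few lines of Python:

```python
def rle(S):
    """Classic run-length encoding: a list of (count, letter)
    pairs, one per maximal run of equal consecutive letters."""
    out = []
    for ch in S:
        if out and out[-1][1] == ch:
            out[-1][0] += 1          # extend the current run
        else:
            out.append([1, ch])      # start a new run
    return [(c, ch) for c, ch in out]
```

For the example above, `rle("aaaaaaabbbbaaaa")` yields `[(7, 'a'), (4, 'b'), (4, 'a')]`, i.e., six integers/letters instead of 15.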
In this paper we also use the term run (or period) to denote a periodic string, that is, a string that can be divided into a number of identical adjacent non-overlapping substrings. For example, if S = ababab, the RLE will be (3, ab), which means that the string (ab) repeats itself three times in the string S. Similarly, the RLE of S = ababcb is (2, ab, 1, cb). In RRLE we recursively define each run (period) so that it may be further compressed using RLE in order to get an even better compression, as in Figure 1. For a formal definition of RRLE, see Section 2.3.
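To make the notion of a period concrete, here is a naive sketch (our own, not the paper's algorithm, which uses KMP preprocessing for a linear-time version) that finds the smallest r dividing |S| such that S is r-periodic:

```python
def smallest_period(S):
    """Smallest r such that r divides len(S) and S is the
    string S[:r] repeated len(S)//r times."""
    n = len(S)
    for r in range(1, n + 1):
        if n % r == 0 and S[:r] * (n // r) == S:
            return r
    return n  # unreachable for non-empty S; r = n always works
```

For example, `smallest_period("ababab")` is 2, while `smallest_period("ababcb")` is 6 (the string is not periodic in any smaller factor).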

Our Contribution
The contributions that are presented in this paper are as follows.
1. Defining recursive run-length encoding (RRLE), which extends the classic RLE and is natural for strings with repeated patterns, such as in communication protocols.
2. An algorithm that computes the optimal (smallest) RRLE compression of any given string S in time polynomial in the length n = |S| of S. See Theorem 4.
3. An algorithm that recovers an unknown string S from its noisy version in polynomial time. The only assumption is that S has a corresponding RRLE tree of O(1) levels; see Definition 1 for details. The running time is polynomial in |S|, and the result is optimal for a given trade-off cost function (compression rate versus replaced characters). See Theorem 6.
4. A (1 + ε)-approximation for the above algorithms that takes O(n) time for every constant ε > 0, and can be run on streaming signals and in parallel, using existing core-sets for signals. See Section 4.
5. Preliminary experimental results on synthetic and real data that support the guarantees of the theoretical results. See Section 6.
6. An open-source and home-made system for automatic reverse engineering of remote controllers. The system was used to hack dozens of radio (PWM) and IR remote controllers in our lab. We demonstrated it on three micro air-vehicles that were bought from Amazon for less than $30 each. See [4] for more results, code, and discussions.

Related Work
Reverse Engineering. There are many systems and applications that suggest imitating remote controllers. For example, a "universal remote controller" can record IR signals and send them again by pressing the corresponding button of this remote controller. However, such a simple controller, with a small number of states, cannot replace the remote controller of, e.g., a common quadcopter with seven channels, whose protocol can be used to generate unbounded types of signals.
String Algorithms. The notion of runs and periodicity of strings is at the core of many stringology questions [5,6]. It constitutes a fundamental area of string combinatorics due to important applications in text algorithms, data compression, biological sequence analysis, music analysis, etc. The notion of runs was introduced by Iliopoulos, Moore, and Smyth [5], who showed that Fibonacci words contain only a linear number of runs with respect to their length. Kolpakov and Kucherov [6] (see also [7], Chapter 8) proved that this property holds for any string and designed an algorithm to compute all runs in a string, which extends previous algorithms [8,9]. Other methods are presented in [7,10-15].
All of the above-mentioned works have focused on exact runs, i.e., runs that repeat exactly the same period. Other works focused on approximate runs: for example, strings that become an exact run (a concatenation of identical non-empty substrings) after the modification of at most k letters. This problem was widely researched [16-20]. However, none of these previous works have focused on recursive runs, as defined in this paper.
The RRLE compression presented in this paper is a novel definition of recursive approximate runs. Informally, our problem is an optimization problem that looks for approximate runs whose period may itself be a run. In addition to run-length encoding, RRLE is also closely related to the run-length straight-line program (RLSLP) compression scheme, which is an extension of straight-line programs (SLPs) [21,22].
There are many other compression schemes for strings, such as SLPs [23], macro schemes [24], and LZ77 [25], that might be even more useful than the RRLE suggested in this paper. However, the main goal of this study was not simply to compress strings, but to actually extract patterns from noisy strings. This was the motivation for the approximation algorithm that is our main result in Section 3.
In the case of communication protocols, the idea of using recursive trees of patterns is important and more relevant than pointers, which are better suited to text documents. In addition, RRLE is not based on the longest repeated factor (substring), but rather on finding repeated substrings that are well compressed by themselves.
This version of finding recursive approximate runs is challenging, since most known techniques for finding exact or approximate runs cannot be used here without introducing exponential running time.To the best of our knowledge, there are no known efficient (polynomial time) algorithms for the RRLE problem.
Roadmap: In Section 2, we provide the basic stringology notation needed for the algorithms, and full definitions of the problems we are solving. In Section 3, we present our reverse engineering algorithms, which are the algorithms for exact and lossy text compression. In Section 4 we explain how we can apply these algorithms to our system. Then, in Section 5, we give an example of a protocol and present our reverse engineering system in detail. Section 6 presents our experimental results. Finally, in Section 7 we conclude this paper and discuss some interesting directions for future work.

Problem Statement
In this section we define the RRLE problem and the required notation for the rest of the paper.

Basic Notations
Let Σ denote a set called an alphabet, where each item in Σ is called a letter. A string is a vector P ∈ Σ^n, where n ≥ 1 denotes its length. For simplicity we remove the commas and parentheses, e.g., we write aba for (a, b, a). For every pair of integers i, j with 1 ≤ i ≤ j ≤ n, the string P[i..j] is called a substring of P. The empty string is also considered a substring of P. If i = 1, then P[i..j] is a prefix of P, and if j = n, then P[i..j] is a suffix of P. The concatenation of two strings P of length n and Q of length m is denoted by PQ and has length n + m. An integer r ∈ [1..n] is a factor of n if (n mod r) = 0; i.e., there is an integer x ≥ 1 such that rx = n. The string P is r-periodic (or periodic in r) if P[i] = P[i + r] for every i ∈ [1..n − r].

Recursive Run-Length Encoding (RRLE)
In this subsection we suggest a novel generalization of the classic run-length encoding compression, called recursive run-length-encoding (RRLE) which is more suitable to our applications.

Definition 1 (Recursive run-length encoding (RRLE)). An RRLE is a tuple s = (t_1, s_1, · · · , t_k, s_k), where k ≥ 1 is an integer, and, for every i ∈ [1..k], t_i ≥ 1 is an integer and s_i is either a string or an RRLE. The string that corresponds to s is S(s) = u_1^{t_1} u_2^{t_2} · · · u_k^{t_k}, where u_i = s_i if s_i is a string, and u_i = S(s_i) otherwise, for every i ∈ [1..k]. If S(s) = Q, then s is an RRLE of the string Q. We define rcost(Q) to be the size of the smallest RRLE of Q, rcost(Q) = min_{s : S(s) = Q} |s|. Such an RRLE is called an optimal RRLE of Q and is denoted by s*(Q).
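For intuition, the decompression map S(·) of Definition 1 can be sketched recursively; here we represent an RRLE as a Python tuple (t_1, s_1, ..., t_k, s_k), an encoding choice of ours for illustration:

```python
def expand(s):
    """S(s): expand an RRLE tuple (t1, s1, ..., tk, sk), where
    each si is either a plain string or a nested RRLE tuple."""
    out = []
    for t, part in zip(s[::2], s[1::2]):   # (counter, part) pairs
        u = part if isinstance(part, str) else expand(part)
        out.append(u * t)
    return "".join(out)
```

For example, `expand((3, "ab"))` is `"ababab"`, and a nested RRLE such as `expand((2, (3, "ab")))` expands the inner run first, giving `"abababababab"`.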
A less trivial example is the string of Figure 1. While the last expression seems longer than the first one, it can actually be represented efficiently using an RRLE tree, which is a tree where each edge corresponds to a counter (number of repetitions) and each of its leaves corresponds to a string; see Figure 1.
A natural problem statement that follows Definition 1 is: how to compute the optimal compression of a given string.
Problem 2. Given a string Q, compute the optimal RRLE s * (Q) of Q.That is, s * (Q) is the tuple that minimizes |s| over every tuple s which is a compression of Q; i.e., S(s) = Q and |s| = rcost(Q).Here, S(s) is the string that corresponds to s as in Definition 1.

Lossy Compression
The previous subsection discussed exact (non-lossy) compression.However, given a string P, our goal is intuitively to compute a string Q which is a "lossy compression" of P in the sense that: (a) Q is similar (not necessarily identical) to P, and (b) Q takes less space in memory than P.
Of course, we can trivially define Q = P so that the similarity of P and Q will be maximized, but then there will be no compression or memory saving at all. On the other hand, we can define Q = 1^n, which minimizes the compression cost of Q (since it is just n occurrences of the digit 1), but then the similarity cost to P will be very high. In other words, there is a trade-off between these two costs or goals. For a proper lossy compression problem we thus need to define, in addition to the compression cost of the previous section, similarity and overall cost functions, as follows.
Similarity cost. Such a function scost(·, ·) maps every pair of strings P and Q of the same length to a score (real number) scost(P, Q) that measures how different the strings are, i.e., how good an approximation Q is to P. In this paper, scost(P, Q) will be the number of indices i ∈ {1, · · · , n} that have a different corresponding letter in P and Q, also known as the Hamming distance [26] between P and Q. For example, if P = ababcb and Q = ababab, then scost(P, Q) = scost(ababcb, ababab) = 1, since only the 5th letter is different: "c" for P and "a" for Q.
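The Hamming similarity cost is straightforward to compute; a minimal sketch:

```python
def scost(P, Q):
    """Hamming distance between two equal-length strings:
    the number of indices where they differ."""
    assert len(P) == len(Q)
    return sum(p != q for p, q in zip(P, Q))
```

Matching the example above, `scost("ababcb", "ababab")` is 1.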
An overall cost function. This function cost(·, ·) assigns an overall score to a pair of strings that measures the trade-off between the similarity cost scost(P, Q) and the compression cost rcost(Q). For simplicity, we use the natural goal of minimizing the sum of the similarity cost and the compression cost, cost(P, Q) = scost(P, Q) + rcost(Q). For example, if P = ababcb and Q = ababab, then the overall cost is cost(P, Q) = scost(ababcb, ababab) + rcost(ababab) = 1 + |(3, ab)| = 1 + 3 = 4. The second problem statement is then: given a string, how can one compute a lossy compression that is both small and decompresses to a string similar to the given one.

Problem 3. Given a string P, compute a string Q that minimizes cost(P, Q) = scost(P, Q) + rcost(Q) over every string Q ∈ Σ^n. Here rcost(Q) is the (optimal) RRLE compression cost of Q as in Definition 1, and scost(P, Q) is the similarity (Hamming) cost.

Algorithms for Exact and Lossy RRLE Compression
In this section, we define and provide algorithms for the exact and lossy RRLE compression Problems 2 and 3. In the exact version of the problem, the input is a string Q that represents the signal, and the output is rcost(Q), which is the size of the smallest RRLE s of Q; see Definition 1. Hence, the similarity cost is scost(Q, S(s)) = 0, and the overall cost is cost(Q, S(s)) = rcost(Q).
In the lossy version we aim to "clean" the noise from the input signal Q and extract the hidden repeated patterns by finding a similar string P that minimizes cost(Q, P); see Problem 3. The motivation for both problems is that the input signal is assumed to have periodic patterns (exact or approximate). By finding these periods we can either compress the signal efficiently, or reverse engineer the hidden protocol that generated it, as in our experimental results. From the partition of the input string into periodic substrings, we can deduce the format of the protocol, including constant bits and the substring that is responsible for each button; see Section 5.

Warm Up: Exact RRLE Compression
We now describe Algorithm 1 for computing the smallest RRLE of an input string Q and prove its correctness. For simplicity, the algorithm only computes the length of the smallest RRLE, but the RRLE itself can be easily extracted by following the chosen indices during the recursive calls. This solves Problem 2.
The algorithm fills a matrix D such that D[i][j] = rcost(Q[i..j]); in particular, D[1][n] = rcost(Q). The matrix D is computed for substrings of increasing length; i.e., we first compute all substrings of length one, then all substrings of length two, and so on, until the full string of length n is evaluated. We initialize the matrix on the main diagonal with D[i][i] = 2 (one counter and one letter). For longer substrings, D[i][j] is the minimum of the following three values: (i) the cost of compressing Q[i..j] as a run, if Q[i..j] is r-periodic, where r is as small as possible; (ii) the cost of leaving Q[i..j] as a whole, which takes 1 counter and j − i + 1 letters; and (iii) the smallest rcost that can be obtained by partitioning Q[i..j] into two substrings. We now prove that the output of Algorithm 1 is indeed the optimal compression rcost(Q) of its input Q, which solves Problem 2.
Theorem 4. Let Q be a string of length n, and let D[1][n] be the output of a call to Exact(Q); see Algorithm 1. Then D[1][n] = rcost(Q) is the size of the smallest RRLE of Q, and it can be computed in O(n³) time.
Proof. We prove a more general claim: that the theorem holds for any substring Q[i..j]. For substrings of length ℓ = 1, we have D[i][i] = 2, for storing the letter Q[i] and its length counter 1. For ℓ ≥ 2, we inductively assume that the theorem holds for any substring of length smaller than ℓ. The rest of the proof corresponds to the three possible evaluations of D[i][j]. In the run case, the optimal RRLE begins with a pair (t_1, s_1) such that S(s_1)^{t_1} = Q[i..j] and Q[i..j] is r-periodic for some r < j − i + 1 (otherwise we could obtain a better compression rate for t_1 > 1 and a shorter string s_1). In this case, the claim follows by the inductive assumption and the definition of the size |s| of s. Time Complexity: the algorithm runs O(n²) iterations over the pair of "for" loops. For each such iteration, it computes the smallest period, if any, of a string of length O(n), which takes linear time using the preprocessing of the Knuth-Morris-Pratt (KMP) algorithm [27]. Then, the corresponding entry in D is computed using the O(n) precomputed values. Hence, the total time complexity of the algorithm is O(n³).
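The dynamic program above can be sketched as follows. This is a simplified variant of ours (O(n⁴): each candidate period is checked naively instead of via KMP preprocessing); the cost accounting, one unit per counter and one per stored letter, follows the paper's |s| convention, so that, e.g., (7, a) has size 2 and (3, ab) has size 3:

```python
from functools import lru_cache

def rcost(Q: str) -> int:
    """Size (counters + letters) of the smallest RRLE of Q."""
    n = len(Q)

    @lru_cache(maxsize=None)
    def D(i, j):  # rcost of the substring Q[i..j], inclusive
        length = j - i + 1
        if length == 1:
            return 2                     # one counter + one letter
        best = 1 + length                # (ii) keep Q[i..j] whole: (1, Q[i..j])
        for r in range(1, length):       # (i) compress as a run of period r
            if length % r:
                continue
            if Q[i:i + r] * (length // r) == Q[i:j + 1]:
                # store the period either raw (r letters) or as an RRLE
                best = min(best, 1 + min(r, D(i, i + r - 1)))
        for m in range(i, j):            # (iii) best split into two parts
            best = min(best, D(i, m) + D(m + 1, j))
        return best

    return D(0, n - 1) if n else 0
```

For example, `rcost("aaaaaaabbbbaaaa")` is 6, matching the RLE (7, a, 4, b, 4, a), and `rcost("ababab")` is 3, matching (3, ab).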

Lossy RRLE Compression
In this section we solve Problem 3, i.e., compute a good lossy compression of an input string Q. Formally, given such a string Q of length n, the goal of our algorithm is to compute the minimum of cost(Q, P) over every string P ∈ Σ^n; see Section 2.3 for the definition of cost. Of course, one can simply compute cost(Q, P), using rcost(P) = Exact(P), for all possible strings P, and output the one whose cost(Q, P) is minimized. However, the time complexity of such a solution is O(|Σ|^n · n³).
To reduce the time complexity, we propose a dynamic programming algorithm that generalizes Algorithm 1 as follows. In Algorithm 1, if a substring Q[i..j] is not periodic, we check two possible evaluations of D[i][j]: partitioning Q[i..j] or leaving it as is. Here, even if Q[i..j] is not periodic, we may change it to be periodic by finding a periodic string Q′ of length j − i + 1, and "paying" the similarity cost scost between Q[i..j] and Q′ for this change.
Hence, the final cost of Q[i..j] is defined recursively as the minimum of the following three values: 1. The minimum cost of modifying Q[i..j] to be r-periodic, over every possible period length r. Formally, this is the minimal cost(Q[i..j], q^{n/r}) + 1 over every string q ∈ Σ^r and every factor r of n = j − i + 1. 2. The cost of leaving Q[i..j] as a whole, which takes 1 counter and j − i + 1 letters. 3. The smallest cost that can be obtained by partitioning Q[i..j] into two substrings.
To efficiently implement the above algorithm, we define the r-Parikh matrix of a given string and a factor r of its length, which we use throughout the algorithm. Intuitively, we define the string Q_{i,1} to be the same as the input string Q, except that we change every r-th letter of Q to Σ_i; hence, we change at most n/r letters. More generally, in Q_{i,j} we do the same, where j denotes the offset, i.e., the first letter we change (the beginning of the count). The r-Parikh matrix of Q contains the corresponding mismatching cost (Hamming distance) scost(Q, Q_{i,j}) in its (i, j) entry. Examples follow the definition.

Definition 5 (Parikh Matrix [28]). Let Q ∈ Σ^n be a string over an alphabet Σ = {Σ_1, · · · , Σ_{|Σ|}}. Let r ≥ 1 be a factor of n. For every i ∈ [|Σ|] and j ∈ [r], let Q_{i,j} ∈ Σ^n denote the string whose letters in the entries k ∈ {j, j + r, j + 2r, . . .} are replaced by Σ_i. The r-Parikh matrix M^r of Q is the |Σ| × r matrix whose (i, j) entry is M^r[i][j] = scost(Q, Q_{i,j}).

For example, let Q = (ababac) be a string over Σ = {a, b, c}. If r = 1, then j ∈ [r] = {1}, and therefore j = 1. That is, the period of changing a letter is 1, and thus all the letters are modified. If r = 2, then j ∈ {1, 2}. Starting the count with the first letter, j = 1, means in the above example that we change the letters in indices k ∈ {1, 3, 5}. We obtain Q_{1,1} = (ababac), Q_{2,1} = (bbbbbc), and Q_{3,1} = (cbcbcc). Counting the corresponding mismatches compared to Q = (ababac), we get M[1][1] = 0, M[2][1] = 3, and M[3][1] = 3. In a similar way, for j = 2, we obtain Q_{1,2} = (aaaaaa), Q_{2,2} = (ababab), and Q_{3,2} = (acacac).

For the r-Parikh matrix M^r of a string Q, we denote by M^r_min(j) = min_{i ∈ [|Σ|]} M^r[i][j] the smallest entry in the j-th column of M^r. Suppose that its row is i*, i.e., M^r_min(j) = M^r[i*][j]. Therefore, if we wish to fix Q to be r-periodic with an offset j, by paying the smallest Hamming distance scost, then we should change the corresponding letters Q[j], Q[j + r], · · · to the letter Σ_{i*}. This is also the motivation for using this matrix in Algorithm 2.
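A minimal sketch of Definition 5 (our own helper, using 0-indexed offsets j ∈ [0, r) instead of the paper's 1-indexed notation):

```python
def parikh_matrix(Q, alphabet, r):
    """M[i][j] = Hamming cost of replacing the letters of Q at
    positions j, j + r, j + 2r, ... with alphabet[i] (0-indexed)."""
    n = len(Q)
    assert n % r == 0, "r must be a factor of n"
    return [[sum(1 for k in range(j, n, r) if Q[k] != sigma)
             for j in range(r)]
            for sigma in alphabet]
```

On the running example, `parikh_matrix("ababac", "abc", 2)` returns `[[0, 3], [3, 1], [3, 2]]`, matching M[1][1] = 0, M[2][1] = 3, M[3][1] = 3 above (with the index shift).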
factor, where ε ∈ (0, 1) is a constant. More generally, the core-set construction time has a roughly quadratic dependency on k and 1/ε; see [29,30] for details. Unlike many solutions in machine or PAC learning, in this and most core-sets there are no special assumptions on the size of the input or its distribution (i.e., worst-case input is assumed).
To apply Algorithms 1 and 2 to this core-set, we consider the input string to be a signal over integers that represent the letters. We also assume that the optimal RRLE has at most k leaves (more generally, has length at most k), so that every relevant RRLE candidate will be approximated by the core-set C. This assumption is natural, e.g., in our system, since the number k of patterns in the protocol is significantly smaller than the length n of the highly sampled signal.

Example of a Protocol: The SYMA G107 Helicopter
The SYMA G107 helicopter supports a communication protocol of three channels that represent the current state (level) of each button on the remote controller: throttle, pitch, and yaw of the helicopter. As in most of our RCs, the communication protocol is defined by a multi-layer language. For the special case of the SYMA G107, the protocol is as follows.
Level I: A/B (switches). The IR signal is essentially a stream of binary numbers that corresponds to the IR light (on or off), which can be changed every 13 microseconds. Light on for 13 microseconds represents, in our notation, the letter "A"; otherwise, the letter is "B". The letters "A" and "B" are called switches.
Level III: word. A word in the SYMA G107 protocol is defined by the following sequence of letters: word = (1, H, 1, "0", 1, yaw, 1, "0", 1, pitch, 1, "0", 1, serial, 1, throttle, 1, "0", 1, trim, 1, F), where H and F were defined in Level II above, and "serial" is the letter "0" or "1" (Helicopter 0 or 1) that allows the support of two helicopters by a single RC. Each of the terms yaw/pitch/throttle/trim is an integer between 0 and 255, represented by a binary word that consists of 8 bits, where each bit is represented by the letter "0" or "1" above. In real time, a continuous stream of such words is sent from the transmitter (RC) to the receiver of the helicopter, which decodes these words.
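As a rough illustration of the word layout above (a sketch of ours, not the paper's code: the MSB-first bit order and the `encode_word` helper are assumptions, and the H/F markers are kept as symbolic letters):

```python
def encode_word(yaw, pitch, throttle, trim, serial=0):
    """Sketch of a SYMA-G107-style Level III word as a string of
    symbolic letters: H, '0'/'1' bits, and F, per the format above."""
    assert all(0 <= v <= 255 for v in (yaw, pitch, throttle, trim))
    assert serial in (0, 1)
    b = lambda v: format(v, "08b")   # 8-bit field, MSB first (assumption)
    return ("H" + "0" + b(yaw) + "0" + b(pitch) + "0" + str(serial)
            + b(throttle) + "0" + b(trim) + "F")
```

Under these assumptions every word has a fixed length of 39 letters, with the throttle field occupying a fixed 8-letter window, which is exactly the kind of periodic structure the RRLE algorithms exploit.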

The System
Given a remote controller of a toy robot, such as the one of SYMA G107 described above, our goal is to learn its protocol.That is, to reveal from the long recorded stream of analog signals, what is the exact sequence of switches that define each letter in the protocol.Once this is known, we can imitate the remote controller using a mini-computer.This is where the stringology steps in.The algorithm we present below allows us to identify, in this long sequence of "AB" switches, the exact letters of a protocol.

After Learning the Protocol
After learning the desired protocol using our reverse engineering algorithm, we send the signals that are generated by the controller algorithm using a low-cost set of IR LEDs.The amplifier receives the binary commands from the Arduino code and turns them into on/off commands to the LEDs array.The algorithm that controls the robot runs on a laptop or a mini-computer and generates words according to the learned protocol.These commands are sent to the Arduino through the USB port.
The whole system works as follows:
1. Recording analog signals. In the case of IR signals, we use an IR decoder (sensor) that receives the signals from the remote controller. The IR decoder gets its power from a micro-computer (Arduino), and is connected to a logic analyzer.
2. Converting analog signals to binary signals. The logic analyzer converts the analog voltage signal into a digital binary signal that has the value "A" or "B" in each time unit.
3. Transmitting the binary stream to a laptop via a USB port.
4. Running the reverse engineering algorithm to learn the protocol.
5. Producing commands to the robot using the mini-computer that is connected to a transmitter or a few IR (Infra-Red) LEDs.
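Step 2 can be sketched as a simple thresholding pass (our own illustration; the fixed cutoff `threshold` is an assumed parameter, whereas in the Arduino-based setup the thresholds were computed in software on the laptop):

```python
def to_switches(samples, threshold=0.5):
    """Convert analog voltage samples to the 'A'/'B' switch stream:
    'A' when the sample is at or above the threshold, else 'B'."""
    return "".join("A" if s >= threshold else "B" for s in samples)
```

For example, `to_switches([0.9, 0.1, 0.8])` yields `"ABA"`.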
See Figure 2 for steps 1-3. Note that since the logic analyzer is relatively expensive compared to the other parts of our system, we can use the Arduino board not only as a power provider, but also as a converter from the IR signal to the USB port. An Arduino code that implements this conversion is provided as part of the open source of our system. The signal is received by an IR sensor, which in turn transmits it to the logic analyzer (by Selae LTD). The logic analyzer translates this analog voltage signal into a binary signal; its sampling frequency is 5 MHz. The binary signal is then transmitted to the USB port of the laptop and recorded to the hard drive using our software. In the above setting, the micro-processor Arduino is used only as a power supply for the sensor. In a different setup, the expensive logic analyzer was replaced by the Arduino; in this case, the thresholds were computed by software on the laptop.

Reverse Engineering Experiments
We ran experiments on both synthetic and real data to test Algorithm 2. We first ran the following experiment on synthetic data to measure the robustness of the recovery of Algorithm 2, i.e., to get some sense of the signal-to-noise ratio (SNR). Intuitively, we assume that a user, Alice, sends a periodic string over a known alphabet Σ = [1..4] to another user, Bob, through a noisy channel. Bob then tries to recover the original string from the received noisy string over the alphabet of real numbers Σ′ = R.
The input data was a set M of 48 strings. Each string M in this set was constructed as the sum M = M* + N of a fixed string M* and additional random noise N, defined as follows. Let V = S(4, "12", 4, "3", 3, "4") = 121212123333444, and let M* = V³ be a string over the alphabet Σ. For σ > 0, let N_σ denote a string of length |N_σ| = |M*|, where N_σ[j] ∼ N(0, σ²) is a random variable from a Gaussian distribution with zero mean and variance σ² for every j ∈ [1..|N_σ|]. To obtain a finite alphabet and a Parikh matrix, we scale and round each letter to its nearest integer. We then define σ_i = 0.05i and M_i = M* + N_{σ_i} for every i ∈ [0..16]. We repeat this construction of M_i three times to obtain the strings M_{i,1}, M_{i,2}, M_{i,3}, with different random noise N_{σ_i} from the same distribution. The result is the input set M of strings above, over the real alphabet Σ′.
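The synthetic data construction above can be sketched as follows (our own illustration; `noisy_copy` is a hypothetical helper, and rounding to the nearest integer stands in for the paper's scale-and-round step):

```python
import random

# V = S(4, "12", 4, "3", 3, "4") = "121212123333444", and M* = V^3
V = "12" * 4 + "3" * 4 + "4" * 3
M_star = V * 3

def noisy_copy(M_star, sigma):
    """M = M* + N with N[j] ~ N(0, sigma^2), each letter rounded
    to its nearest integer (sketch of the construction)."""
    return [int(c) + int(round(random.gauss(0.0, sigma))) for c in M_star]
```

With sigma = 0 this reproduces M* exactly; increasing sigma in steps of 0.05 yields the 17 noise levels of the experiment.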
The experiment is a list of 255 = 17 × 3 × 5 calls O_{i,j,k} = Lossy(M_{i,k}, j) to Algorithm 2 with the string M_{i,k} and the number of levels j, over every variance level i ∈ [1..17], repetition (try) k ∈ [1..3], and j ∈ [1..5]. The cost error for this call, error(i, j, k), is the weighted number of mismatched letters (i.e., dissimilarity or distance) between the output recovered string O_{i,j,k} and the original de-noised string M*. The results are shown in Figure 3. The x-axis represents the integer level i = σ_i/0.05 of the noise variance as defined above. The color of each of the five curves corresponds to a different RRLE level j ∈ [1..5] that was used in Algorithm 2. The height y of the j-th curve at the i-th variance level is the mean error over the 3 repetitions, (error(i, j, 1) + error(i, j, 2) + error(i, j, 3))/3, together with its variance.
Conclusion. Figure 3 shows that the algorithm is more robust to noise as the number of levels in the tree increases.

In the real-data experiment, we compare each recovered word W_i to the ground-truth word W by their Hamming distance hamm(W_i, W), i.e., the number of corresponding bits that are not the same. The average Hamming distance of the first j ≤ m packages is then error(j) := (1/j) ∑_{i=1}^{j} hamm(W, W_i).
In practice, this error is proportional to the distance of the IR receiver sensor and the RC in Figure 2.
The error plots are shown in Figure 3. The green curve shows the average Hamming distance error(j) on the y-axis, over different numbers j of words (x-axis), together with its variance.
Recovery using Algorithm 2. We run Algorithm 2 m times using a call to Lossy(M_j, levels), where M_j is the Parikh matrix of V_j and levels = 2, and R_j is the output RRLE for each j ∈ [1..m]. Without noise, V_j = W_j, and the output is R_j = (j, s*(W*)), which corresponds to the string W_j = V_j = S(j, s*(W*)). In practice, V_j, unlike W_j, is not periodic, but R_j (the recovered output signal) is expected to be j-periodic. The average error of the recovered string S(R_j) is then ourerror(j) := hamm(W_j, S(R_j)).
Results of Algorithm 2 are shown in Figure 3. The blue curve shows the average Hamming distance ourerror(j) on the y-axis, over different numbers j of words (x-axis). Since R_j is always periodic, the variance error between the recovered packages is zero.
Conclusions: In Figure 3 we see, as expected from the analysis, that the recovery error decreases as Algorithm 2 is given more packages to learn from. In contrast, the error of the "memory-less" thresholding approach does not decrease over time.
The results were very similar to the results in Figure 3. Due to this, and the fact that Algorithm 2 is not needed at this level, they are omitted and can be found in [4].

De-Noising Level III
In Level III, we are given the semantic package of "0" and "1" and need to split it into semantic words as in (1).
In the first experiment we repeated the experiment from the previous section, where in the package P we used different values of throttle. That is, the "throttle" stick on the RC was pushed to random levels, from 1% to 100%, which ideally produces repetitions of the semantic package L = S(package(throttle, 50, 50, 0)), where the value of throttle is a random integer in [0..127]. The checksum field in L was also changed in each package. Let M = L_1 L_2 · · · denote the concatenation of these semantic packages.
Our goal was to identify the bits in M that correspond to the throttle field. To that end, we ran Algorithm 2 with the Parikh matrix of the string M, augmented with an additional row corresponding to a wildcard letter "?". Each entry in this row has a (cost) value of 1/2. The intent is that the wildcard will be used only on the variable "throttle" bits; the other bits are expected to be almost periodic, so using a wildcard on them would be too expensive.
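A minimal sketch of this cost trade-off (our simplification of the extra Parikh-matrix row; the function name is ours): at each bit position, a concrete letter costs one unit per disagreeing package, while "?" costs 1/2 per package, so only positions whose value varies across packages, such as the throttle bits, fall to the wildcard.

```python
from collections import Counter

def consensus_with_wildcards(packages):
    """Column-wise consensus over equal-length repeated packages.
    A letter costs 1 per package that disagrees with it; '?' costs
    1/2 per package. The cheaper option wins at each position."""
    m = len(packages)
    out = []
    for column in zip(*packages):
        letter, freq = Counter(column).most_common(1)[0]
        letter_cost = m - freq       # packages disagreeing with the letter
        wildcard_cost = 0.5 * m      # '?' costs 1/2 for every package
        out.append(letter if letter_cost < wildcard_cost else "?")
    return "".join(out)
```

On binary strings a letter wins whenever a clear majority agrees on it, and a roughly 50/50 split (a variable field) produces "?", mirroring the behavior described above.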
Ideally, the algorithm outputs the input string M with the "throttle" bits replaced by wildcards. We repeated this experiment with "throttle" replaced by "roll" and "pitch". The results are shown in Table 1. The first 4 lines of Table 1 correspond to 4 semantic packages M = L_1 L_2 L_3 L_4, where the throttle changes and the pitch/roll sticks remain unchanged. The fourth column of the ith row shows L_i for i ∈ [1..4]. The "throttle" bits were marked manually (by us) in bold. The first row of the fifth column contains the output of Algorithm 2 on M, where the throttle field, as well as the checksum, were indeed identified by wildcards.

Table 1.
Example experimental results on a toy helicopter. The three leftmost columns tell which RC button was pressed. The input is the recorded communication bits from the RC for a long repeated signal. The output in the rightmost column is the output of our algorithm: the common "intersection" of the input signals. Wildcards represent unstable characters due to the different throttle, roll, or pitch values (in bold). From the output we can deduce the format of the message, including the constant bits and the sub-string that is responsible for each button. See Section 6.4 for details. To recover the complete protocol of the IR-UFO, we scanned the two sticks over 20 positions, where each position was recorded for roughly a second. Our system plotted the desired Level III words as defined in (1) after a couple of hours. We expect that using core-sets the running time can be reduced to minutes; see Section 4.

In the second experiment the goal was to see how robust Algorithm 2 is to noise that may occur in previous levels. To this end, we added synthetic noise to the input (column 4 from the left in Table 1) for the string M with the variable "throttle" field above. For each x ∈ [1..250] we changed x bits in M to obtain M_x, and ran Algorithm 2 with M_x as described in the previous experiment. We then define y(x) to be the number of wrong letters in the output, compared to the desired string (rightmost column of the first row in Table 1), averaged over 100 experiments. Figure 3 shows the average error y (y-axis) and its variance for the values of x (x-axis).
Conclusions: Figure 3 shows that roughly 90% of the noisy bits in the input were recovered. When the input string is completely noisy, the output string consists only of wildcards, which is correct for a few bits but wrong for the other (approximately 20) bits.
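The synthetic noise used in this experiment can be sketched as follows (a minimal version; the function name is ours):

```python
import random

def add_noise(s: str, x: int, rng: random.Random) -> str:
    """Flip x distinct bit positions of a binary string,
    as in the robustness experiment (M -> M_x)."""
    chars = list(s)
    for p in rng.sample(range(len(s)), x):
        chars[p] = "1" if chars[p] == "0" else "0"
    return "".join(chars)
```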

Conclusions
Novel algorithms for lossy text compression with provable guarantees on running time and optimality were provided. We demonstrated them by providing an open-source, home-made system for automatic reverse-engineering of toy robots, with experimental results on synthetic data and real communication signals. Clearly, there are many other applications of our algorithms, such as compressing and recovering XML/HTTP or other protocols that are made of repeated similar blocks, or finding similar scenes in video/GPS streams.

Figure 1 .
Figure 1. The recursive run-length encoding (RRLE) tree of the string (abcccabcccgghhekmlfffcccdcccd), which is compressed from 29 letters to 22 (12 letters and 10 counters).
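For comparison, a single (non-recursive) level of RLE can be sketched as below. On Figure 1's string it yields 17 (letter, count) pairs, while the recursive encoding reaches 12 letters and 10 counters by further compressing repeated blocks such as (abccc)^2:

```python
from itertools import groupby

def rle(s: str):
    """One level of run-length encoding:
    a list of (letter, run-length) pairs."""
    return [(ch, len(list(g))) for ch, g in groupby(s)]
```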

Figure 2 .
Figure 2. The recording part of our system: moving a stick of the RC generates an IR (infra-red) signal. This signal is received by an IR sensor, which in turn transmits it to the logic analyzer (by Saleae LTD). The logic analyzer translates this analog voltage signal into a binary signal; its sampling frequency is 5 MHz. The binary signal is then transmitted to the USB port of the laptop and recorded to the hard drive using our software. In the above setting, the Arduino micro-processor is used only as a power supply for the sensor. In a different setup, the expensive logic analyzer was replaced by the Arduino; in this case the thresholds were computed by software on the laptop.
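The thresholding step in the Arduino setup can be sketched as follows (the 1.5 V threshold is our illustrative choice, not the system's actual value):

```python
def to_binary(samples, threshold=1.5):
    """Threshold raw voltage samples into the binary signal that
    the logic analyzer would otherwise produce."""
    return "".join("1" if v >= threshold else "0" for v in samples)
```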

Figure 3 .
Figure 3. Results of Algorithm 2. (a) Average Hamming distance (y-axis) over different numbers j of words (x-axis). (b) Average error in the output (y-axis) and its variance, for the values on the x-axis. (c) The average Hamming distance compared to the original string.