In this section, we define and provide algorithms for the exact and lossy RRLE compression Problems 2 and 3. In the exact version of the problem, the input is a string Q that represents the signal, and the output is , which is the size of the smallest RRLE s of Q; see Definition 1. Hence, the similarity cost is , and the overall cost is .
3.2. Lossy RRLE Compression
In this section we solve Problem 3; i.e., we compute a good lossy compression of an input string Q. Formally, given such a string Q of length n, the goal of our algorithm is to compute the minimum cost over every candidate string; see Section 2.3 for the definition of this cost. Of course, one can simply compute the cost of all possible strings and output the one whose cost is minimized. However, the time complexity of such a solution is exponential in n, since the number of candidate strings is exponential in n.
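To make the blow-up of the exhaustive approach concrete, the following minimal sketch enumerates every candidate string of length n over the alphabet. The names `brute_force_best` and `toy_cost` are hypothetical, and `toy_cost` (number of runs plus Hamming distance) is only a stand-in chosen for illustration; it is not the cost defined in Section 2.3.

```python
from itertools import product


def toy_cost(p: str, q: str) -> int:
    """Stand-in cost (an assumption): number of runs in p plus Hamming distance to q."""
    runs = sum(1 for i in range(len(p)) if i == 0 or p[i] != p[i - 1])
    hamming = sum(1 for a, b in zip(p, q) if a != b)
    return runs + hamming


def brute_force_best(q: str, alphabet: str, cost) -> tuple[int, str]:
    """Try all len(alphabet) ** len(q) candidate strings and keep the cheapest."""
    best = None
    for letters in product(alphabet, repeat=len(q)):
        p = "".join(letters)
        c = cost(p, q)
        if best is None or c < best[0]:
            best = (c, p)
    return best


# 2 ** 5 = 32 candidates are examined for this small input.
print(brute_force_best("aabaa", "ab", toy_cost))  # (2, 'aaaaa')
```

The loop body runs once per candidate, so the running time grows as the alphabet size raised to the power n, which is what motivates the dynamic programming approach.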
In order to reduce the time complexity, we propose a dynamic programming algorithm, which generalizes Algorithm 1 as follows. In Algorithm 1, if a substring is not periodic, we check two possible evaluations of its cost: partitioning it or leaving it as is. Here, even if the substring is not periodic, we may change it to be periodic by finding a periodic string of the same length, and "paying" the similarity cost between the two strings for this change.
Hence, the final cost of Q is defined recursively as the minimum of the following three values:
The minimum cost of modifying Q to be r-periodic, over every possible period length r. Formally, this is the minimal +1, over every string and factor r of n.
The minimum over all possible partitioning options of Q.
The cost of representing Q as is, with no compression.
To efficiently implement the above algorithm, we define the r-Parikh Matrix of a given string and a factor r of its length, which we use throughout the algorithm. Intuitively, for a given letter, consider the string that is the same as the input string Q except that every rth letter of Q is changed to that letter. Hence, at most n/r letters change. More generally, we do the same with an offset j, which denotes the first letter we change (the beginning of the count). The r-Parikh Matrix of Q contains the corresponding mismatch cost (Hamming distance) in the entry indexed by the letter and the offset j. Examples will follow the definition.
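(Parikh Matrix [28]). Let Q be a string of length n over an alphabet Σ, and let r be a factor of n. For every letter of Σ and every offset j between 1 and r, consider the string obtained from Q by replacing the letters in entries j, j + r, j + 2r, … with that letter, while keeping all other entries unchanged. The r-Parikh Matrix of Q is the matrix with one row per letter of Σ and one column per offset j, such that the entry in a given row and column is the Hamming distance between Q and the corresponding modified string.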
For example, let be a string over . If , then , and therefore . That is, the period of changing a letter is 1 and thus all the letters will be modified. Indeed, for every we have . Hence, consists of n copies of the letter . We obtain , , and . There are 3 corresponding mismatches of compared to , in indices . Hence, . Similarly, and . The 1-Parikh matrix of Q is, thus, .
If , then . That is, we start the count with the first letter , which means in the above example that we change the letters in indices . We obtain , , and . Counting the corresponding mismatches compared to , we get that , , and . In a similar way, for , we obtain , , and . Hence, , , and . The 2-Parikh Matrix of Q is then .
Finally, if then if the jth letter of Q equals , and otherwise. Hence, the 6-Parikh Matrix of is .
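The definition can be sketched in a few lines of Python. The function name `parikh_matrix`, the dictionary-of-rows layout, the 0-based offsets, and the example string "abacab" are all assumptions made for illustration; they are not taken from the paper.

```python
def parikh_matrix(q: str, alphabet: str, r: int) -> dict[str, list[int]]:
    """Mismatch counts for overwriting every r-th letter of q, per letter and offset.

    Row `letter`, column `j` (0-based offset) holds the Hamming distance between
    q and the string obtained by writing `letter` at positions j, j + r, j + 2r, ...
    """
    n = len(q)
    if r <= 0 or n % r != 0:
        raise ValueError("r must be a positive factor of len(q)")
    return {
        letter: [sum(1 for i in range(j, n, r) if q[i] != letter) for j in range(r)]
        for letter in alphabet
    }


q = "abacab"  # hypothetical example string of length 6
print(parikh_matrix(q, "abc", 1))  # {'a': [3], 'b': [4], 'c': [5]}
print(parikh_matrix(q, "abc", 2))  # {'a': [0, 3], 'b': [3, 1], 'c': [3, 2]}
```

For r equal to the length of the string, each entry is 0 when the letter at that position already matches and 1 otherwise, which is the form of the matrix given as input to Algorithm 2.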
For the r-Parikh matrix of a string Q, consider the smallest entry in its jth column, and the letter that indexes the row of this entry. If we wish to fix Q to be r-periodic with an offset j by paying the smallest Hamming distance, then we should change the corresponding letters to that letter. This is also the motivation for using this matrix in Algorithm 2.
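As a rough illustration of the column-minimum idea, the sketch below picks, for each offset, the letter with the fewest mismatches and sums those minima. The helper name `best_periodic_fix`, the 0-based offsets, and the tie-breaking rule (first letter in the alphabet) are assumptions. It accounts only for the Hamming distance, not for the further compression of the period that the full algorithm performs.

```python
def best_periodic_fix(q: str, alphabet: str, r: int) -> tuple[str, int]:
    """Return the period of length r closest to q in Hamming distance, and that distance."""
    n = len(q)
    if r <= 0 or n % r != 0:
        raise ValueError("r must be a positive factor of len(q)")
    period = []
    total = 0
    for j in range(r):
        # Column j of the r-Parikh matrix: mismatch count for each candidate letter.
        column = {a: sum(1 for i in range(j, n, r) if q[i] != a) for a in alphabet}
        # Smallest entry of the column; ties go to the first letter in `alphabet`.
        best_letter = min(alphabet, key=column.__getitem__)
        period.append(best_letter)
        total += column[best_letter]
    return "".join(period), total


period, distance = best_periodic_fix("abacab", "abc", 2)
print(period, distance)  # ab 1
```

Here the hypothetical string "abacab" becomes 2-periodic ("ababab") by changing a single letter, which matches the sum of the column minima of its 2-Parikh matrix.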
|Algorithm 2:; see Theorem 6.|
Overview of Algorithm 2: The input to the algorithm is an n-Parikh matrix M of a string Q of size n over Σ, and an integer ℓ that denotes the maximum RRLE tree level of the compression, as explained below. Note that both Q and n can be extracted from this Parikh matrix. We use dynamic programming to compute the matrix D, in which each entry is the (optimal) compression cost of the corresponding substring of Q. The loops are over the length m of the substring (from 1 to n), and then over the starting index i. The last index of the substring is then i + m − 1.
We first compute the optimal cost of modifying a substring of length 1 in the ith index. This is by definition, so for . Next, we compute for every using recursive exhaustive search over the following three options.
The first option is to modify the substring in order to get an r-periodic substring for some factor r. This costs 1 for the number of repetitions plus the total compression cost of the first r letters of Q.
The second option is to partition the substring into a pair of substrings: the left and right side of the original string. The overall cost is then the sum of these two costs.
The last option is simply to keep the substring as is. This takes m letters, which is the size of the substring.
Looking at the resulting RLE tree, the first case means we can compress the string by adding to the node a new single child, which represents a period of length r. The edge to this new child is marked with the number of repetitions. The second case means adding k new children, which represent the partition of the string. In this case, all k new edges are marked with 1. The third case means the string itself is a leaf.
For computing the first value, we need to compute the r-Parikh matrix for every possible period length r, and then recursively fill it. However, we bound this recursion by a constant number ℓ, which is the maximal number of levels in the RLE tree: each call to Algorithm 2 keeps its level in the recursion, and once we reach level ℓ we compute only the third value. The algorithm also stops when the input Parikh matrix is of size , and returns .
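The three options and the level bound can be sketched as follows. This is not the paper's pseudocode: the names `compress_cost` and `n_parikh` are hypothetical, and the cost model is an assumption made only for illustration (one unit per stored letter, one unit per repetition count, plus the unweighted Hamming distance). The exact cost is the one defined in Section 2.3.

```python
def n_parikh(q: str, alphabet: str) -> dict[str, list[int]]:
    """n-Parikh matrix of q: entry is 0 if the letter matches q at that position, else 1."""
    return {a: [0 if c == a else 1 for c in q] for a in alphabet}


def compress_cost(weights: dict[str, list[int]], levels: int) -> int:
    """Best cost under the assumed model; `levels` bounds the nesting of periods."""
    letters = list(weights)
    n = len(weights[letters[0]])  # assumes a non-empty string
    # Cheapest mismatch cost of writing a single letter at each position.
    leaf = [min(weights[a][t] for a in letters) for t in range(n)]
    # D[i][j]: best cost found for positions i..j (inclusive, 0-based).
    D = [[0] * n for _ in range(n)]
    for i in range(n):
        D[i][i] = 1 + leaf[i]
    for m in range(2, n + 1):          # substring length
        for i in range(n - m + 1):     # starting index
            j = i + m - 1              # last index
            # Option 3: keep the substring as m explicit letters.
            best = m + sum(leaf[i:j + 1])
            # Option 2: split into a left part and a right part.
            for k in range(i, j):
                best = min(best, D[i][k] + D[k + 1][j])
            # Option 1: fold into a period of length r, while levels remain.
            if levels > 0:
                for r in range(1, m):
                    if m % r == 0:
                        folded = {
                            a: [sum(weights[a][i + t + s] for s in range(0, m, r))
                                for t in range(r)]
                            for a in letters
                        }
                        best = min(best, 1 + compress_cost(folded, levels - 1))
            D[i][j] = best
    return D[0][n - 1]


M = n_parikh("abacab", "abc")  # hypothetical example string
print(compress_cost(M, 0))  # 6: no periods allowed, the string is stored as is
print(compress_cost(M, 1))  # 4: period "ab" (2 letters) + 1 repetition count + 1 mismatch
```

The folded matrix passed to the recursive call is the r-Parikh matrix of the substring, obtained by summing the columns that share the same offset modulo r. The sketch recomputes it for every substring and factor and makes no attempt to match the running time analysed in the paper.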
The time complexity of computing , is times (for each ) the following:
Computing for every possible r takes + the time for computing D for .
Computing the second value in the equation takes .
Computing the third value in the equation takes .
Each call takes -time, and since we bound the recursive calls to ℓ, the total time complexity is .
The pseudocode of the algorithm is presented in Algorithm 2. For simplicity, the algorithm outputs only the minimum cost; however, it can be easily modified to output the string P as well.
Let Q be a string of length n over Σ, and let x be the output of a call to Algorithm 2 with inputs M and ℓ, where M is the n-Parikh matrix of Q. Then x is the minimum cost over every string whose recursive depth is ℓ.
To prove the correctness of the algorithm, recall that there are two options for a string Q: it is either periodic or not. If Q is not periodic, we can partition it into smaller consecutive substrings and compress them, or we can leave it as is. These options are covered by the algorithm in the second and third values, respectively, of the equation for computing D.
If Q is periodic, or modified to be periodic, the algorithm checks all possible period lengths r. For each such period length r it computes the r-Parikh matrix . The only thing we need to prove is that represents all possible substrings , and the corresponding value of . If this is true, it means that the algorithm considers all possible solution strings.
Let us look at the string Q of size n, and its n-Parikh matrix . By definition, the cell equals 0, if , and 1, otherwise. Hence, computing gives us the minimum over all strings .
The last thing left to prove is that computing the minimum is sufficient in order to get in the case of periodicity. If Q is periodic in r, then, by definition, . Hence, minimizing for a specific r is sufficient to compute the minimum for this r. Since the algorithm computes this value for every possible , it will find the correct solution string P. □