This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

We present a survey of results concerning Lempel–Ziv data compression on parallel and distributed systems, starting from the theoretical approach to parallel time complexity and concluding with the practical goal of designing distributed algorithms with low communication cost. Storer's extension for image compression is also discussed.

Lempel–Ziv compression [

Lempel–Ziv compression is a dictionary-based technique. In fact, the factors of the string are substituted by pointers to copies stored in a dictionary, which are called targets.

Given an alphabet A and a string S = s_1 s_2 ⋯ s_n over A, the LZ1 factorization of S is S = f_1 f_2 ⋯ f_k, where f_i is either a single character or the longest substring occurring in the previously scanned part of the string, for 1 ≤ i ≤ k. The factor f_i is encoded by the pointer q_i = (d_i, ℓ_i), where d_i is the distance from the position of f_i to the position of its copy and ℓ_i is the length of f_i.
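To make the parsing concrete, here is a naive sketch of greedy LZ1/LZSS encoding and decoding (our illustration, not a scheme from the literature surveyed; the quadratic window search, the `window` and `max_len` bounds and the length-2 literal threshold are simplifications for readability):

```python
def lz1_factorize(s, window=4096, max_len=255):
    """Greedy LZ1/LZSS parsing: at each position emit the longest match
    with a substring of the sliding window as a (distance, length)
    pointer, or a single literal character when no match of length
    at least 2 exists."""
    factors, i = [], 0
    while i < len(s):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):   # naive window search
            l = 0
            while l < max_len and i + l < len(s) and s[j + l] == s[i + l]:
                l += 1
            if l > best_len:
                best_len, best_dist = l, i - j
        if best_len >= 2:
            factors.append((best_dist, best_len))
            i += best_len
        else:
            factors.append(s[i])
            i += 1
    return factors

def lz1_decode(factors):
    """Invert the parsing; overlapping copies are resolved one character
    at a time, which is what makes self-referential pointers legal."""
    out = []
    for f in factors:
        if isinstance(f, tuple):
            dist, length = f
            for _ in range(length):
                out.append(out[-dist])
        else:
            out.append(f)
    return "".join(out)
```

Note that a pointer may overlap the position it is emitted at (d_i < ℓ_i), which is why the decoder copies character by character.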

The LZ2 factorization of a string S is S = f_1 f_2 ⋯ f_k, where f_i is the shortest substring which is different from all the previous factors f_1, …, f_{i−1}, for 1 ≤ i ≤ k. It follows that each factor f_i is equal to f_j c, the concatenation of a previous factor f_j and a character c.
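A minimal LZW-style encoder/decoder pair over an explicit initial alphabet illustrates the dictionary growth (a textbook sketch, ours; the decoder handles the usual special case of a pointer to the entry currently being defined):

```python
def lzw_factorize(s, alphabet="ab"):
    """LZ2/LZW: the dictionary initially holds the alphabet; after each
    factor w, the string w + next_char is added as a new entry."""
    dic = {c: i for i, c in enumerate(alphabet)}
    out, w = [], ""
    for c in s:
        if w + c in dic:
            w += c                  # greedily extend the current factor
        else:
            out.append(dic[w])
            dic[w + c] = len(dic)   # new entry: previous factor + next char
            w = c
    if w:
        out.append(dic[w])
    return out

def lzw_decode(codes, alphabet="ab"):
    """Rebuild the dictionary while decoding the pointer sequence."""
    dic = {i: c for i, c in enumerate(alphabet)}
    prev = dic[codes[0]]
    out = [prev]
    for k in codes[1:]:
        # special case: the pointer refers to the entry being defined
        entry = dic[k] if k in dic else prev + prev[0]
        dic[len(dic)] = prev + entry[0]
        out.append(entry)
        prev = entry
    return "".join(out)
```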

The pointer encoding the factor f_i is, therefore, the index of the corresponding dictionary element.

The factorization processes described in the previous section are such that the number of different factors (that is, the dictionary size) grows with the string length. In practical implementations, instead, the dictionary size is bounded by a constant and the pointers have equal size. While for LZSS (or LZ1) compression this is simply obtained by sliding a fixed-length window and by bounding the match length, for LZW (or LZ2) compression dictionary elements are removed by means of a deletion heuristic. The deletion heuristics we describe in this section are FREEZE, RESTART and LRU [

Let S = f_1 f_2 ⋯ f_k be the sequence of factors computed so far. With the FREEZE heuristic, once the dictionary is full it is frozen and the rest of the input is parsed with this static dictionary. With the RESTART heuristic, the dictionary is cleared out and learned again from scratch, usually when the compression ratio starts to degrade. With the LRU heuristic, every time a new factor f_i is added to a full dictionary, the least recently used element f_h is discarded, provided it is not a proper prefix of another dictionary element.
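The LRU bookkeeping for a bounded dictionary can be sketched as follows (the class name `LRUDictionary` and its interface are ours; for brevity this sketch omits the restriction that entries which are proper prefixes of other entries, including the alphabet itself, should not be evicted):

```python
from collections import OrderedDict

class LRUDictionary:
    """Bounded LZW-style dictionary with least-recently-used eviction;
    `capacity` plays the role of the constant dictionary size."""

    def __init__(self, alphabet, capacity):
        self.capacity = capacity
        self.entries = OrderedDict((c, i) for i, c in enumerate(alphabet))
        self.next_code = len(alphabet)

    def lookup(self, phrase):
        """Return the code of `phrase`, marking it as recently used."""
        code = self.entries.get(phrase)
        if code is not None:
            self.entries.move_to_end(phrase)
        return code

    def insert(self, phrase):
        """Add a new entry, evicting the least recently used one if full."""
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict the LRU element
        self.entries[phrase] = self.next_code
        self.next_code += 1
```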

As mentioned at the beginning of this section, LZSS (or LZ1) bounded size dictionary compression is obtained by sliding a fixed length window and by bounding the match length. A real time implementation of compression with finite window is possible using a suffix tree data structure [

Greedy factorization is optimal for compression with finite windows, since the dictionary is suffix (that is, every suffix of a dictionary element belongs to the dictionary). With LZW compression, after we fill up the dictionary using the FREEZE or RESTART heuristic, the greedy factorization computed with such a dictionary is not optimal, since the dictionary is not suffix. However, there is an optimal semi-greedy factorization which is computed by the procedure of
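The one-step lookahead behind semi-greedy parsing can be sketched as follows (an illustrative quadratic implementation over an explicit set of dictionary strings, ours rather than the original procedure): among all dictionary matches at the current position, choose the length l maximizing l + L(i + l), where L is the greedy match length.

```python
def longest_match(s, i, dic, max_len=16):
    """Length of the longest dictionary element starting at position i;
    single characters always match, an exhausted position matches nothing."""
    if i >= len(s):
        return 0
    best = 1
    for l in range(2, min(max_len, len(s) - i) + 1):
        if s[i:i + l] in dic:
            best = l
    return best

def semi_greedy_factorize(s, dic, max_len=16):
    """Semi-greedy parsing: among the matches at position i, pick the
    length l that lets the next greedy match reach furthest."""
    factors, i = [], 0
    while i < len(s):
        L_i = longest_match(s, i, dic, max_len)
        best_l = L_i
        best_reach = L_i + longest_match(s, i + L_i, dic, max_len)
        for l in range(1, L_i):                # shorter candidate matches
            if l == 1 or s[i:i + l] in dic:
                reach = l + longest_match(s, i + l, dic, max_len)
                if reach > best_reach:
                    best_l, best_reach = l, reach
        factors.append(s[i:i + best_l])
        i += best_l
    return factors
```

On the non-suffix dictionary {a, b, aa, abb} and input "aabb", greedy parsing emits the three factors "aa", "b", "b", while the semi-greedy choice of a shorter first match yields the two factors "a", "abb".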

LZSS (or LZ1) compression can be efficiently parallelized on an EREW PRAM [

We present compression algorithms for sliding dictionaries on an exclusive-read, exclusive-write (EREW) shared memory machine requiring O(

NC is the class of problems solvable with a polynomial number of processors in polylogarithmic time on a parallel random access machine, and it is conjectured to be a proper subset of P, the class of problems solvable in sequential polynomial time. LZ2 and LZW compression with an unbounded dictionary have been proved to be P-complete [ hence, they are unlikely to be efficiently parallelizable. With a bounded dictionary of constant size c maintained with the LRU deletion heuristic, compression can be performed in parallel with 2^{O(c log c)} work. This hardness result is not so relevant for the space complexity analysis since Ω(log^k n) size dictionaries are employed in practice: LZ2 compression with the LRU heuristic and a dictionary of size O(log^k n) is hard for SC^k, the class of problems solvable in sequential polynomial time and O(log^k n) space, and belongs to SC^{k+1}.

As mentioned at the beginning of this section, the only practical LZW compression algorithm for a shared memory parallel system is the one using the FREEZE deletion heuristic. After the dictionary is built and frozen, a parallel procedure similar to the one for LZ1 compression is run. To compute a greedy factorization in parallel, we find the greedy match with the frozen dictionary in each position of the input string.
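A sequential simulation may clarify the structure of this scheme (our sketch: the PRAM would run the per-position loop with one processor each and extract the factorization path by pointer doubling, which we replace with a direct walk):

```python
def parallel_greedy_parse(s, dic, max_len=16):
    """Greedy parsing with a frozen dictionary, structured the way the
    PRAM algorithm is: phase 1 is independent per position, phase 2
    follows the successor links from position 0."""
    n = len(s)
    nxt = [0] * (n + 1)
    nxt[n] = n
    for i in range(n):                 # phase 1: one processor per position
        best = 1                       # single characters always match
        for l in range(2, min(max_len, n - i) + 1):
            if s[i:i + l] in dic:
                best = l
        nxt[i] = i + best              # where the greedy match at i ends
    # phase 2: on a PRAM the path 0 -> nxt[0] -> nxt[nxt[0]] -> ... is
    # extracted in O(log n) rounds of pointer doubling; we walk it here.
    bounds, i = [0], 0
    while i < n:
        i = nxt[i]
        bounds.append(i)
    return [s[a:b] for a, b in zip(bounds, bounds[1:])]
```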

The design of parallel decoders is based on the fact that the Euler tour technique can also be used to find the trees of a forest in O(log n) time with O(n/log n) processors. Let q_1, …, q_m be the sequence of pointers, with q_i = (d_i, ℓ_i), and let f_1, …, f_m be the corresponding factors, so that f_i = s_{e_{i−1}+1} ⋯ s_{e_i}, where e_i is the sum of the lengths of the first i factors and e_0 = 0; the positions e_i are computed by parallel prefix. If the target of the pointer q_i falls inside the factor f_j, with j < i, we link q_i to q_j; these links define a forest and the characters of each factor are recovered by processing the trees of the forest with the Euler tour technique.
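The position computation can be sketched as follows (literals are treated as length-1 factors; the Euler tour step over the pointer forest is replaced by a left-to-right copy for brevity, so only the prefix-sum phase mirrors the parallel algorithm):

```python
from itertools import accumulate

def parallel_positions(pointers):
    """Each pointer q_i = (d_i, l_i), or literal of length 1, covers a
    substring of the output; a prefix sum over the lengths gives every
    factor's start and end (computable in O(log n) time on a PRAM)."""
    lengths = [p[1] if isinstance(p, tuple) else 1 for p in pointers]
    ends = list(accumulate(lengths))
    starts = [e - l for e, l in zip(ends, lengths)]
    return starts, ends

def decode(pointers):
    """Resolve the copies; on a PRAM the dependencies among pointers form
    a forest processed with the Euler tour technique, here left to right."""
    out = []
    for p in pointers:
        if isinstance(p, tuple):
            d, l = p
            for _ in range(l):
                out.append(out[-d])
        else:
            out.append(p)
    return "".join(out)
```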

With LZW compression using the FREEZE deletion heuristic the parallel decoder is trivial. We wish to point out that the decoding problem is interesting independently of the computational efficiency of the encoder. In fact, in the case of compressed files stored in a ROM only the computational efficiency of decompression is relevant. With the RESTART deletion heuristic, a special mark occurs in the sequence of pointers each time the dictionary is cleared out, so that the decoder does not have to monitor the compression ratio. The positions of the special marks are detected by parallel prefix. Each subsequence q_1 ⋯ q_m of pointers between two consecutive marks can then be decoded independently: the length of the factor f_i coded by the pointer q_i is computed from the dictionary built on the subsequence, and the starting and ending positions of each factor are obtained by parallel prefix. Since the target of the pointer q_i is a previous factor extended by one character, the links among pointers define a forest and, as in the sliding dictionary case, the characters of the factors are recovered by processing the trees of this forest with the Euler tour technique.
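The mark-splitting step itself is straightforward; a sketch (the mark representation is a placeholder of ours, and in the distributed setting each resulting subsequence would go to one processor):

```python
RESTART_MARK = object()          # stands for the special mark in the stream

def split_at_marks(pointers):
    """Cut the pointer sequence at the restart marks; each subsequence
    was produced with a freshly cleared dictionary, so the subsequences
    can be decoded independently."""
    blocks, current = [], []
    for p in pointers:
        if p is RESTART_MARK:
            blocks.append(current)
            current = []
        else:
            current.append(p)
    blocks.append(current)
    return blocks
```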

Distributed systems have two types of complexity: the interprocessor communication and the input–output mechanism. While the input/output issue is inherent to any parallel algorithm and has standard solutions, the communication cost of the computational phase, after the distribution of the data among the processors and before the output of the final result, is obviously algorithm-dependent. So, we need to limit the interprocessor communication and rely on more local computation in order to design a practical algorithm. The simplest model for this phase is, of course, a simple array of processors with no interconnections and, therefore, no communication cost. On such a model we describe, for every integer

We simply apply sliding window compression in parallel to blocks of length

We obtain an approximation scheme which is suitable for a small scale system; due to its adaptiveness, though, it also works on a large scale parallel system when the file size is large. From a practical point of view, we can apply something like the gzip procedure to a small number of input data blocks, achieving a satisfying degree of compression effectiveness and obtaining the expected speed-up on a real parallel machine. Making the order of magnitude of the block length greater than that of the window length largely beats the worst case bound on realistic data. The window length is usually a few tens of kilobytes: the compression tools of the Zip family, such as the Unix command “gzip”, use a window size of at least 32 K. It follows that the block length in our parallel implementation should be about 300 K and the file size, in megabytes, should be about one third of the number of processors. An approach using a tree architecture slightly improves compression effectiveness [
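In today's terms the scheme amounts to compressing fixed-size blocks independently with no communication between workers; a sketch using zlib (our illustration: threads suffice here because zlib releases the GIL, and the 300 K block size is the figure suggested above):

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK = 300 * 1024    # ~300 KB: an order of magnitude above the 32 KB window

def compress_blocks(data, workers=4):
    """Split the input into blocks and compress them independently,
    one worker per block, with no communication between workers."""
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.compress, blocks))

def decompress_blocks(parts):
    """Decoding is embarrassingly parallel as well; done serially here."""
    return b"".join(zlib.decompress(p) for p in parts)
```

The price of independence is that no match can cross a block boundary, which is exactly the compression loss the block-length choice above keeps negligible.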

As mentioned at the beginning of this section, with the RESTART deletion heuristic the dictionary is periodically cleared out and learned again, so the compression process is naturally split into independent blocks. This heuristic was originally used with a dictionary of size 2^{12} and is employed by the Unix command “compress” with a dictionary of size 2^{16}. Therefore, in order to have a satisfying compression effectiveness, the distributed algorithm might work with blocks of length

for each block, every processor but the one associated with the last sub-block computes the boundary match between its sub-block and the next one which ends furthest to the right;

each processor computes the optimal factorization from the beginning of the boundary match on the left boundary of its sub-block to the beginning of the boundary match on the right boundary.

Stopping the factorization of each sub-block at the beginning of the right boundary match might cause the making of one surplus factor per sub-block, which determines the multiplicative approximation factor (

To decode the compressed files on a distributed system, it is enough to use a special mark occurring in the sequence of pointers each time the coding of a block ends. The input phase distributes the subsequences of pointers coding each block among the processors. If the file is encoded by an LZ2 compressor implemented on a large scale tree architecture, a second special mark indicates for each block the end of the coding of a sub-block, and the coding of each block is stored at the same level of the tree. The first sub-block of each block is decoded by one processor to learn the corresponding dictionary. Then, the subsequences of pointers coding the sub-blocks are broadcast to the leaf processors with the corresponding dictionary.

The best lossless image compressors are enabled by arithmetic encoders based on the model-driven method [

The square matching compression procedure scans an m × m image and computes in each position p_{i,j} the largest square match, that is, either the largest monochromatic square or the largest square copy of an already scanned portion of the image, with upper-left corner in p_{i,j}.
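One ingredient of such a scheme, finding the side of the largest monochromatic square anchored at each pixel, has a simple dynamic-programming formulation (our illustration, not the waved-front procedure of the original algorithm):

```python
def monochrome_squares(img):
    """side[i][j] = side of the largest monochromatic square whose
    upper-left corner is pixel (i, j). Scanning from the bottom-right
    corner, a square of side s > 1 at (i, j) exists exactly when the
    four corner pixels agree and the three neighbouring squares do."""
    m, n = len(img), len(img[0])
    side = [[0] * n for _ in range(m)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if i == m - 1 or j == n - 1:
                side[i][j] = 1          # last row/column: side 1 only
            elif (img[i][j] == img[i + 1][j] == img[i][j + 1]
                  == img[i + 1][j + 1]):
                side[i][j] = 1 + min(side[i + 1][j],
                                     side[i][j + 1],
                                     side[i + 1][j + 1])
            else:
                side[i][j] = 1
    return side
```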

For example, in

Compression and decompression parallel implementations on a shared memory machine are described. The image is partitioned into square areas and, in parallel for each area, one processor applies the sequential rectangular block matching algorithm, so that each area is compressed independently. Then, the monochromatic areas A_{i,j} are merged: at the first step, the areas A_{2i−1,j} and A_{2i,j} are merged vertically provided they are monochromatic and have the same color; the same is done horizontally for A_{i,2j−1} and A_{i,2j}. At the k-th step, the sequences of areas A_{(i−1)2^{k−1}+1,j}, A_{(i−1)2^{k−1}+2,j}, ⋯, A_{i2^{k−1},j} are merged vertically with A_{i2^{k−1}+1,j}, A_{i2^{k−1}+2,j}, ⋯, A_{(i+1)2^{k−1},j}, if they are monochromatic with the same color; the same is done horizontally for A_{i,(j−1)2^{k−1}+1}, A_{i,(j−1)2^{k−1}+2}, ⋯, A_{i,j2^{k−1}} with A_{i,j2^{k−1}+1}, A_{i,j2^{k−1}+2}, ⋯, A_{i,(j+1)2^{k−1}}. After O(log

An implementation on a small scale system with no interprocessor communication was realized, since we were able to partition an image into up to a hundred areas and to apply the square block matching heuristic independently to each area with no relevant loss of compression effectiveness. In [ a large scale implementation on a full binary tree architecture is described: the image is partitioned into 2^{2h} square areas A_{i,j} of size m/2^h × m/2^h, which are stored into the leaves of the tree, and the leaves are numbered from 1 to 2^{2h} from left to right. It follows that the tree has height 2

In parallel for each area, each leaf processor applies the sequential parsing algorithm, so that each area is compressed independently; for each monochromatic area A_R, the corresponding processor stores its color c_R and its width w_R. The merging of monochromatic areas proceeds as on the shared memory machine, except that the information is exchanged through the tree. At the first step, the processors storing the areas A_{2i−1,j} and A_{2i,j} are siblings; such processors merge A_{2i−1,j} and A_{2i,j}, provided they are monochromatic and have the same color, by broadcasting the information through their parent. It also follows from such procedure that the same can be done horizontally for A_{i,2j−1} and A_{i,2j} by broadcasting the information through the processors at the next level. At the k-th step, the sequences of areas A_{(i−1)2^{k−1}+1,j}, A_{(i−1)2^{k−1}+2,j}, ⋯, A_{i2^{k−1},j}, with w_1 ≤ j ≤ w_2, are merged vertically: the processor storing the area A_{(i−1)2^{k−1}+1,w_1} broadcasts to the processors storing the areas A_{i2^{k−1}+1,j}, A_{i2^{k−1}+2,j}, ⋯, A_{(i+1)2^{k−1},j}, to merge with the above areas for w_1 ≤ j ≤ w_2, if they are monochromatic with the same color. The same is done horizontally for the sequences A_{i,(j−1)2^{k−1}+1}, A_{i,(j−1)2^{k−1}+2}, ⋯, A_{i,j2^{k−1}}, with ℓ_1 ≤ i ≤ ℓ_2: the processor storing the area A_{ℓ_1,(j−1)2^{k−1}+1} broadcasts to the processors storing the areas A_{i,j2^{k−1}+1}, A_{i,j2^{k−1}+2}, ⋯, A_{i,(j+1)2^{k−1}}, to merge with the above areas for ℓ_1 ≤ i ≤ ℓ_2, if they are monochromatic with the same color. The broadcasting involves processors up to level 2

In this paper, we presented a survey on parallel and distributed algorithms for Lempel–Ziv data compression. This field has developed in the last twenty years from a theoretical approach concerning parallel time complexity with no memory constraints to the practical goal of designing distributed algorithms with bounded memory and low communication cost. An extension to image compression has also been realized. As future work, we would like to implement distributed algorithms for Lempel–Ziv text compression and decompression on a parallel machine. As far as image compression is concerned, we wish to experiment with more processors on a graphical processing unit.

The semi-greedy factorization procedure.

The making of the surplus factors.

The making of a surplus factor.

Computing the largest rectangle match in (

The largest match in (0,0) and (6,6) is computed at step 5.

Storing the image into the leaves of the tree.
