The compression process, especially for large databases, is characterized by a trade-off between time and space of which the user must be aware and about which they must make a decision. Various compression algorithms allow the user to optimize either the degree of compression or the time needed for the process.
Individual compression algorithms often also allow a more fine-grained choice through a compression level, where a low level favors the minimum required time, while a high level favors the maximum compression of the supplied document.
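For instance, the following minimal sketch (using Python's standard zlib module on a hypothetical input file) shows how raising the level typically shrinks the output at the cost of a longer running time:

```python
import time
import zlib

data = open("dump.sql", "rb").read()          # hypothetical database dump

for level in (1, 6, 9):                       # minimum effort, default, maximum effort
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    print(f"level={level}  size={len(compressed)} bytes  time={elapsed:.3f} s")
```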
With the advent of cloud computing and the pay-per-use, service-based model that is now widespread, it is easy to see the need to set a specific time budget for an activity such as data compression, for example the creation of hourly or event-driven dumps. A concrete case is the FaaS (“Function as a Service”) model, whose cost is directly proportional to the number of calls performed and, in particular, to the duration of each call.
2.1. Past Related Work
Data compression is nowadays a very active field of research: every reduction in the size of digital data allows us to transmit them faster, and therefore has a significant economic impact.
In lossless compression algorithms, the decoded output of the compression system is identical, bit for bit, to the original data. In contrast, lossy compression algorithms produce an “acceptable” approximation (depending on the application) of the original input.
Textual data, including text and HTML pages, are typically not stored in compressed form since they need to be searchable. In contrast, raster data such as audio, images, and video are generally stored in compressed formats and are often created in compressed form by the devices that generate them. Lossy compression is used exclusively for raster data.
Zohar and Cassuto in [1] studied for the first time the problem of optimizing the lossless compression of one-dimensional data when there is a time limit within which the compression process must be completed. They experimentally demonstrated that the optimization is possible.
Carpentieri in [2] resumed the work of Zohar and Cassuto and extended it to the lossless compression of two-dimensional data (images).
In the paper by Liao, Moffat, Petri, and Wirth [3], a comprehensive model for the total retention cost (TRC) of a data archiving system is established, integrating cloud computing provider charging rates to quantify costs across various compression strategies. This analysis serves as a foundation for developing innovative, cost-efficient alternatives that surpass the effectiveness of existing methods.
Wiseman and Schwan [4] investigate the application of compression techniques to enhance middleware-based information exchange in interactive and collaborative distributed systems. In these environments, achieving high compression ratios must be balanced with compression speeds that align with sustainable network transfer rates. Their approach dynamically monitors network and processor resources, evaluates compression efficiency, and autonomously selects the most suitable compression methods to optimize performance.
In this paper, we study data compression with a time limit in both the one-dimensional and the two-dimensional case and in the case of both lossless and lossy compression.
Lossless compression algorithms are often based on the text substitution model introduced by Lempel, Ziv, and Storer in the 1970s and later used for text and image compression (see for example [5,6,7]), or on Huffman or arithmetic coding (see for example [8,9]).
The approach we present for the optimization of compression given a time limit is totally independent of the algorithms used in the experiments. For simplicity, we chose to use in the experiments some of the most popular compression tools. Obviously, this choice has no impact on the optimization process other than producing a different set of measured data.
In this paper, regarding the lossless compression of one-dimensional data, we have used in our experiments gzip (see [10]), xz (based on LZMA, which derives from the seminal work of Lempel and Ziv [4,5]), bzip2 (based on the Burrows–Wheeler transform, see [11]), and arithmetic coding (see [9]).
Lossless and lossy image compression algorithms often use the “modeling + coding” approach in which a prediction of the current pixel is built consistently by encoder and decoder depending on a chosen context of already coded samples, and then a prediction error, i.e., the difference between the real value of the current pixel and the prediction made, is sent from the encoder to the decoder.
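As a rough illustration of this scheme, the sketch below (a simplified example using NumPy, with a plain left-neighbor predictor chosen only for brevity; real codecs such as JPEG-LS use richer contexts) computes the prediction errors that the encoder would hand to its entropy coder, and shows that the decoder can rebuild the image from those errors alone:

```python
import numpy as np

def prediction_errors(image: np.ndarray) -> np.ndarray:
    """Encoder side: predict each pixel from its left neighbor and return the
    prediction errors that would be handed to the entropy coder."""
    img = image.astype(np.int16)            # avoid uint8 wrap-around
    prediction = np.empty_like(img)
    prediction[:, 0] = 0                    # no left neighbor in the first column
    prediction[:, 1:] = img[:, :-1]         # left-neighbor predictor
    return img - prediction

def reconstruct(errors: np.ndarray) -> np.ndarray:
    """Decoder side: rebuild the pixels by accumulating the errors along each
    row, applying exactly the same predictor as the encoder."""
    return np.cumsum(errors, axis=1).astype(np.uint8)

img = np.random.randint(0, 256, size=(4, 8), dtype=np.uint8)
assert np.array_equal(reconstruct(prediction_errors(img)), img)
```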
As for lossless image compression, we used PNG (see [12]), TIFF (based on the LZW algorithm, see [13]), JPEG-LS (see [14]), JPEG 2000 (see [15]), BMP (see [16]), and FELICS (see [17]) in the testing phase.
Saha in [18] presents a review of lossy image compression algorithms. For lossy compression of images, in our tests we used JPEG (see [19]) and WEBP (see [20]).
When lossy image coding is used, it is important to balance the compression obtained and the quality of the decompressed image. Here, to evaluate the quality of the decompressed image we used the SSIM metric (see [21]).
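As an example of how such a quality check could be carried out, the following sketch assumes scikit-image and Pillow are available; the file names are placeholders:

```python
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity

# Placeholder file names: the original image and its lossy-compressed version.
original = np.asarray(Image.open("original.png").convert("L"))
decoded = np.asarray(Image.open("decoded.png").convert("L"))

# SSIM is 1.0 for identical images and decreases as the distortion grows.
score = structural_similarity(original, decoded, data_range=255)
print(f"SSIM = {score:.4f}")
```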
2.2. A Data Compression Algorithm with a Time Limit
Generally speaking, the framework proposed in [1] is a specialization of a more general approach: it does not depend on the basic mechanism of the compression algorithms involved, but only on the configuration of their effort levels.
It is not necessary to specify the compression algorithms to be used, since the approach is agnostic with respect to them. The simplified idea is to use a set of configurations of a specific algorithm, each on a partition of the data to be compressed. The percentage of data to be compressed with each algorithm is chosen to maximize the use of the available time, provided as input budget, in order to maximize the degree of compressibility obtained.
In the case of single documents, this set of configurations can specify the use of a single algorithm or a pair of algorithms.
For multiple documents, instead, combinations of algorithms are used, where each algorithm is executed on a specific document. Each element of these combinations can be a single algorithm or a pair of algorithms, but it can be shown experimentally that it is always sufficient to use a pair of algorithms for at most one document, and a single algorithm for each of the remaining documents. Therefore, in this paper we do not consider the possibility of compressing a single document in parts, with more than one data compression algorithm, because doing so would bring only a small gain that does not justify the increase in decoding complexity.
Assuming a function that compresses a given set of data, a time-optimization activity would allow us to reduce the economic costs of our system without sacrificing more of the reduction in data size than necessary.
When we normally apply data compression, all the data we want to compress are input into a single compression tool which will try to reduce the size of the input data (while keeping the same information content) in a certain time t, and the tool will return as its output the compressed data.
The focus is frequently placed on the algorithm’s compression efficiency, while the time required for compression is often overlooked, provided it remains within a reasonable limit.
This approach may not be convenient in situations where we want to specify a time t’, possibly smaller than t, within which the compression process must be completed: that is, when we are trying to optimize compression performance while respecting a specified time limit.
Recall that the convex surface, or convex hull, of a set of points S is the intersection of all convex sets that contain S, and that its lower polygon chain contains, for each value of the first coordinate x (in our case, the time), the points of the convex surface that minimize the second coordinate y (in our case, the size of the compressed document).
If we consider the optimization of the compression of a single document, with the notions of a convex surface and lower polygon chain it is possible to obtain the set of optimal mixes of algorithms for any time budget.
The basic idea is to obtain the lower polygon chain of the best algorithms for each time budget, representing them on a two-dimensional plane by choosing as coordinates the time required (x) and the size resulting from the execution of the algorithm (y).
By best algorithms, we mean the input algorithms sorted by the time required, filtered by taking only those that lead to an improvement in terms of compression compared to the previous algorithm.
This filtering activity removes two classes of algorithms: those that, for the same time budget, would produce a larger output than another mix, and, when two algorithms require the same amount of time, the one with the larger resulting size.
By building the convex surface of the remaining algorithms, we can obtain the lower polygon chain composed of the algorithms involved in each possible optimal mix. This step is necessary because there may still be algorithms that are better than the previous one but that involve a non-optimal mix.
By definition of a convex surface, there cannot be points below it, while points above the lower polygonal chain do not correspond to optimal algorithms because of their larger resulting size. For each possible time budget, we then have two options: use a single specific algorithm, or partition the document into two parts, each compressed by one of the two members of the optimal algorithm pair for that time budget.
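A compact way to obtain the lower polygonal chain of the (time, size) points is the lower half of the classic monotone-chain convex-hull algorithm, preceded by the filtering step described above. The sketch below is a generic illustration under these assumptions, not the implementation of [1]; the sample measurements are invented:

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def useful_candidates(points):
    """Keep only setups that compress strictly better than every faster setup."""
    best = float("inf")
    kept = []
    for t, b in sorted(set(points)):          # ascending time, ties broken by size
        if b < best:
            kept.append((t, b))
            best = b
    return kept

def lower_chain(points):
    """Lower polygonal chain of the (time, size) points, from fastest to slowest.

    Its vertices are the only candidates for an optimal mix; every point
    above the chain is dominated for every time budget.
    """
    chain = []
    for p in useful_candidates(points):
        while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
            chain.pop()
        chain.append(p)
    return chain

# Invented (time in s, size in MB) measurements for five setups.
setups = [(0.4, 42.0), (0.9, 30.5), (1.3, 33.0), (2.8, 24.0), (6.0, 23.5)]
print(lower_chain(setups))   # (1.3, 33.0) is dominated and is filtered out
```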
If we consider the optimization of the compression of multiple documents, an important feature of the resulting polygon chain is the slope of the segments connecting two algorithms. This slope captures the benefit obtained by switching from one algorithm to another: a steeper slope corresponds to a greater benefit.
The optimization of the compression of multiple documents starts from the lower polygon chains built through the process seen previously for each document. The idea is to join these chains to obtain an overall chain representing the entire set of documents.
The resulting lower polygon chain will be made up of points representing combinations of algorithms to be used, one for each document involved in the process. Each point of this lower polygon chain will be chosen in order to maximize the benefit for that specific time budget. This maximization is obtained by changing, with respect to the previous combination, only one algorithm. Once the overall lower polygon chain is obtained, the mixing process will be similar to the one seen previously. The optimal algorithms will correspond to the extremes of the segment where the time budget falls.
Following the work of Zohar and Cassuto in [1], let us suppose we want to compress a single file f or a large data set D by using a compressor. In real life, we will have many compressors and setups to choose from, but here, for simplicity, let us consider the situation in which we have two possible compressors available, or two possible configurations of a single compressor, called setup1 and setup2, respectively; it is then easy to generalize the following discussion to multiple compressors and multiple setups.
Suppose setup1 takes less time to compress than setup2, but that setup2 compresses more than setup1. Now, suppose that the execution must finish within a certain time t’ (let us call this value time-budget).
We define t1 as the time taken by setup1 and t2 as the time taken by setup2. Three situations are possible:
1. t’ < t1: it is not possible to compress with either setup, since the time budget is less than the time taken by the fastest compressor (setup1).
2. t’ > t2: it is possible to compress with both setup1 and setup2. We choose to compress with setup2, since its compression is more effective in terms of output size.
3. t1 < t’ < t2: it is not possible to compress with setup2 because the time budget is not sufficient. We therefore decide to use setup1.
If we find ourselves in situation 3, the system manages to compress f (or D) through setup1; however, the time value Δ(t) = t’ − t1 in which the system remains unused is not negligible since the chosen setup finished its execution before the set time t’. Compression optimization tries to reduce, if not to eliminate, the Δ(t) value, considering not one, but a mix of setups.
In our previous example, if we found ourselves in a situation in which the time budget is t1 < t’ < t2, we could think of adopting a “mixing” strategy in which a part of the file f (or of the data set D) to be compressed goes as input to setup1, while the remaining part goes to setup2.
This strategy, compared to the classic application of a single compression tool, could lead to the use of the entire time budget initially chosen and to a reduction in the output size. The proposed algorithm searches for an optimal mix, i.e., an optimal setup configuration (among the many considered) that can be used to compress a file f, given a time budget. As we will see, the search is not a trivial process: it adopts a technique that first builds a function identifying the potential setups that could be part of the mix and subsequently, among all the candidates, chooses two or more of them.
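The mixing mechanism itself is straightforward; the sketch below (an illustration only, not the code used in our experiments) splits an input buffer between a fast setup (zlib at level 1) and a stronger one (LZMA), with the split fraction fixed arbitrarily at 0.5 just to show the idea; how the optimal fraction is chosen is described next:

```python
import lzma
import time
import zlib

def mix_compress(data: bytes, fraction: float):
    """Compress the first `fraction` of `data` with a fast setup (zlib, level 1)
    and the rest with a slower, stronger one (LZMA); the two outputs are kept
    separate so that each part can be decompressed with the matching tool."""
    cut = int(len(data) * fraction)
    start = time.perf_counter()
    fast_part = zlib.compress(data[:cut], 1)
    strong_part = lzma.compress(data[cut:])
    elapsed = time.perf_counter() - start
    return len(fast_part) + len(strong_part), elapsed

if __name__ == "__main__":
    data = open("dataset.bin", "rb").read()     # hypothetical input data
    size, t = mix_compress(data, 0.5)           # arbitrary 50/50 split
    print(f"mixed output: {size} bytes in {t:.3f} s")
```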
The algorithm described next allows us to obtain all the useful-setups for a certain file f to be compressed. Subsequently, it is possible to obtain the optimal mix once the time budget t’ has been set. The inputs of the algorithm are pairs (bi, ti), where each pair represents a setup: ti is the time taken by the setup to compress the file f, while bi is the resulting size, as in [1].
The algorithm for finding the optimal mix consists of four main steps, listed below:
1. Determination of the pairs (bi, ti) for each setup; these pairs are obtained by running each compression tool individually on the file f or by simply estimating its performance.
2. Sorting of the setups in ascending order of ti, with bi used as a secondary key in descending order.
3. Removal of the worst setups and construction of the convex hull of the remaining points:
- i. Between two setups that take the same time to execute, the one that gives the larger output size is discarded.
- ii. All setups that give an output size that is too large compared to others that take less time are discarded.
4. Acquisition of the optimal mix given a time budget t’.
Step 3 builds what in computational geometry is called the convex hull, that is, given a set of points, the smallest convex set that contains them all. As proved in [1], the setups located at the vertices of the lower part of the convex hull are the only useful-setups.
Useful-mixes always consist of two setups connected by an arc on the bottom edge of the polygon. Given m useful setups arranged in ascending order of running time and a time budget t’, the goal is to identify the optimal combination of setups, a and b, along with the fractions of the file, ra and rb, that each setup will handle.
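Assuming the useful setups (the vertices of the lower chain) are already available and sorted by time, a sketch of the final selection step could look as follows; the measurements are invented:

```python
def optimal_pair(useful, budget):
    """Given the useful setups as (time, size) vertices of the lower chain,
    sorted by increasing time, decide what to run for a time budget `budget`.

    Returns a single setup when no mixing is needed, or the adjacent pair
    (faster, slower) whose segment of the chain contains the budget.
    """
    times = [t for t, _ in useful]
    if budget < times[0]:
        raise ValueError("budget below the fastest useful setup")
    if budget >= times[-1]:
        return useful[-1]                       # enough time for the strongest setup
    for faster, slower in zip(useful, useful[1:]):
        if faster[0] <= budget < slower[0]:
            return faster, slower

# Invented useful setups (time in s, size in MB), already on the lower chain.
useful = [(0.4, 42.0), (0.9, 30.5), (2.8, 24.0), (6.0, 23.5)]
print(optimal_pair(useful, 1.5))   # -> ((0.9, 30.5), (2.8, 24.0))
```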
In the non-trivial case where t1 < t’ < t2, the chosen combination consists of the two adjacent setups, a and b, such that tb < t’ < ta. After finding the optimal mix, the percentage of data D that must be compressed with each of the chosen compression tools is calculated as in [1], assuming that the compression time of each setup scales linearly with the amount of data it processes. Given two setups sa and sb with times ta and tb, respectively, and choosing t’ as the time budget such that tb < t’ < ta, the fraction ra of the file f to compress with sa will be:

ra = (t’ − tb) / (ta − tb)

The fraction rb of D compressed by the sb setup will be:

rb = (ta − t’) / (ta − tb) = 1 − ra
In a multi-document context, each document typically possesses its own unique convex hull. Consequently, optimizing compression requires addressing multiple convex hulls simultaneously. Since the effectiveness of a particular tool or configuration depends on the document’s specific information characteristics, the set of optimal configurations will vary across documents.
A critical part of the solution involves an algorithm that consolidates the individual convex hulls into a unified convex hull. This unified structure allows the system to determine the best configuration for a given computational time constraint with ease, as explained in [1].
To illustrate how the merged sequence facilitates finding the optimal configuration for any compute-time limit, as outlined in [1], the process begins by calculating the compute time required when the least resource-intensive configurations are applied across all documents. The algorithm then iterates through successive configuration vectors in the merged sequence, recalculating the compute time at each step. This progression continues until it encounters the last configuration within the allowed compute-time limit. As the system approaches the budget, the time constraint will eventually lie between two adjacent configurations in the sequence.
At this point, much like the single-document scenario, the solution involves blending two adjacent configurations. Since only one document changes setup between adjacent configurations, in the final allocation most documents use a single configuration, with at most one document split between two configurations.
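A possible way to realize the merging step (a sketch under the assumption that each document's useful setups are already available as its lower chain; this is not the exact procedure of [1]) is to start from the cheapest setup of every document and repeatedly apply, across all documents, the single upgrade with the greatest benefit, i.e., the largest size reduction per unit of extra time:

```python
import heapq

def merged_sequence(chains):
    """Greedily merge per-document lower chains into one global sequence.

    `chains` maps a document id to its list of (time, size) useful setups,
    sorted by increasing time.  Each emitted element is
    (total_time, total_size, choice), where `choice` records the setup index
    currently selected for every document.
    """
    choice = {doc: 0 for doc in chains}                    # cheapest setup everywhere
    total_time = sum(chain[0][0] for chain in chains.values())
    total_size = sum(chain[0][1] for chain in chains.values())
    sequence = [(total_time, total_size, dict(choice))]

    # Candidate upgrades keyed by benefit: size saved per extra second of time.
    heap = []
    for doc, chain in chains.items():
        if len(chain) > 1:
            dt = chain[1][0] - chain[0][0]
            db = chain[0][1] - chain[1][1]
            heapq.heappush(heap, (-db / dt, doc))

    while heap:
        _, doc = heapq.heappop(heap)                       # best remaining upgrade
        chain, i = chains[doc], choice[doc]
        total_time += chain[i + 1][0] - chain[i][0]
        total_size -= chain[i][1] - chain[i + 1][1]
        choice[doc] = i + 1
        sequence.append((total_time, total_size, dict(choice)))
        if i + 2 < len(chain):                             # queue this document's next step
            dt = chain[i + 2][0] - chain[i + 1][0]
            db = chain[i + 1][1] - chain[i + 2][1]
            heapq.heappush(heap, (-db / dt, doc))
    return sequence

# Invented lower chains for two documents (time in s, size in MB).
chains = {
    "doc1": [(0.4, 42.0), (0.9, 30.5), (2.8, 24.0)],
    "doc2": [(0.2, 10.0), (1.0, 7.0)],
}
for total_time, total_size, choice in merged_sequence(chains):
    print(f"t={total_time:.1f}s  size={total_size:.1f}MB  {choice}")
```

The time budget is then located between two adjacent elements of the resulting sequence and, as in the single-document case, only the one document that changes setup between those two elements needs to be split between two configurations.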