The compression process, especially for large databases, is characterized by a trade-off between time and space of which the user must be aware and about which they must make a decision. Various compression algorithms allow the user to optimize either the degree of compression or the time needed for the process.
Individual compression algorithms often also allow a more fine-grained choice through a compression level, where a low level favors the minimum required time, while a high level favors the maximum compression of the supplied document.
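For instance, the following minimal sketch (using Python's standard zlib module on a hypothetical input file) shows how raising the level typically shrinks the output at the cost of a longer running time:

```python
import time
import zlib

data = open("dump.sql", "rb").read()          # hypothetical database dump

for level in (1, 6, 9):                       # minimum effort, default, maximum effort
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    print(f"level={level}  size={len(compressed)} bytes  time={elapsed:.3f} s")
```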
With the advent of cloud computing and the pay-per-use, service-based model that is now widespread, it is easy to see the need to set a specific time budget for an activity such as data compression, for example the creation of hourly or event-driven dumps. A concrete case is the FaaS (“Function as a Service”) model, whose cost is directly proportional to the number of calls performed and, in particular, to the duration of each call.
2.1. Past Related Work
Data compression is nowadays a very active field of research: every reduction in the size of digital data allows us to transmit them faster, and therefore has a significant economic impact.
In lossless compression algorithms, the decoded output of the compression system is identical, bit for bit, to the original data. In contrast, lossy compression algorithms produce an “acceptable” approximation (depending on the application) of the original input.
Textual data, including text and HTML pages, are typically not stored in compressed form since they need to be searchable. In contrast, raster data such as audio, images, and video are generally stored in compressed formats and are often created in compressed form by the devices that generate them. Lossy compression is used exclusively for raster data.
Zohar and Cassuto in [1] studied for the first time the problem of optimizing the lossless compression of one-dimensional data when there is a time limit within which the compression process must be completed. They experimentally demonstrated that the optimization is possible.
Carpentieri in [2] resumed the work of Zohar and Cassuto and extended it to the lossless compression of two-dimensional data (images).
In the paper by Liao, Moffat, Petri, and Wirth [3], a comprehensive model for the total retention cost (TRC) of a data archiving system is established, integrating cloud computing provider charging rates to quantify costs across various compression strategies. This analysis serves as a foundation for developing innovative, cost-efficient alternatives that surpass the effectiveness of existing methods.
Wiseman and Schwan [4] investigate the application of compression techniques to enhance middleware-based information exchange in interactive and collaborative distributed systems. In these environments, achieving high compression ratios must be balanced with compression speeds that align with sustainable network transfer rates. Their approach dynamically monitors network and processor resources, evaluates compression efficiency, and autonomously selects the most suitable compression methods to optimize performance.
In this paper, we study data compression with a time limit in both the one-dimensional and the two-dimensional case and in the case of both lossless and lossy compression.
Lossless compression algorithms are often based on the text substitution model introduced by Lempel, Ziv, and Storer in the 1970s and later used for text and image compression (see for example [5,6,7]), or on Huffman or arithmetic coding (see for example [8,9]).
The approach we present for the optimization of compression given a time limit is totally independent of the algorithms used in the experiments. For simplicity, we chose to use in the experiments some of the most popular compression tools. Obviously, this choice has no impact on the optimization process other than producing a different set of measured data.
In this paper, regarding the lossless compression of one-dimensional data, we have used in our experiments gzip (see [10]), xz (based on LZMA, which derives from the seminal work of Lempel and Ziv [4,5]), bzip2 (based on the Burrows–Wheeler transform, see [11]), and arithmetic coding (see [9]).
Lossless and lossy image compression algorithms often use the “modeling + coding” approach in which a prediction of the current pixel is built consistently by encoder and decoder depending on a chosen context of already coded samples, and then a prediction error, i.e., the difference between the real value of the current pixel and the prediction made, is sent from the encoder to the decoder.
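As a rough illustration of this scheme, the sketch below (a simplified example using NumPy, with a plain left-neighbor predictor chosen only for brevity; real codecs such as JPEG-LS use richer contexts) computes the prediction errors that the encoder would hand to its entropy coder, and shows that the decoder can rebuild the image from those errors alone:

```python
import numpy as np

def prediction_errors(image: np.ndarray) -> np.ndarray:
    """Encoder side: predict each pixel from its left neighbor and return the
    prediction errors that would be handed to the entropy coder."""
    img = image.astype(np.int16)            # avoid uint8 wrap-around
    prediction = np.empty_like(img)
    prediction[:, 0] = 0                    # no left neighbor in the first column
    prediction[:, 1:] = img[:, :-1]         # left-neighbor predictor
    return img - prediction

def reconstruct(errors: np.ndarray) -> np.ndarray:
    """Decoder side: rebuild the pixels by accumulating the errors along each
    row, applying exactly the same predictor as the encoder."""
    return np.cumsum(errors, axis=1).astype(np.uint8)

img = np.random.randint(0, 256, size=(4, 8), dtype=np.uint8)
assert np.array_equal(reconstruct(prediction_errors(img)), img)
```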
As for lossless image compression, we used PNG (see [12]), TIFF (based on the LZW algorithm, see [13]), JPEG-LS (see [14]), JPEG 2000 (see [15]), BMP (see [16]), and FELICS (see [17]) in the testing phase.
Saha in [18] presents a review of lossy image compression algorithms. For lossy compression of images, in our tests we used JPEG (see [19]) and WEBP (see [20]).
When lossy image coding is used, it is important to balance the compression obtained and the quality of the decompressed image. Here, to evaluate the quality of the decompressed image we used the SSIM metric (see [21]).
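As an example of how such a quality check could be carried out, the following sketch assumes scikit-image and Pillow are available; the file names are placeholders:

```python
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity

# Placeholder file names: the original image and its lossy-compressed version.
original = np.asarray(Image.open("original.png").convert("L"))
decoded = np.asarray(Image.open("decoded.png").convert("L"))

# SSIM is 1.0 for identical images and decreases as the distortion grows.
score = structural_similarity(original, decoded, data_range=255)
print(f"SSIM = {score:.4f}")
```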
2.2. A Data Compression Algorithm with a Time Limit
Generally speaking, the framework proposed in [1] is a specialization of a more general approach: it does not depend on the basic mechanism of the compression algorithms involved, but only on the configuration of their effort levels.
It is not necessary to specify the compression algorithms to be used, since the approach is agnostic with respect to them. The simplified idea is to use a set of configurations of a specific algorithm, each on a partition of the data to be compressed. The percentage of data to be compressed with each algorithm is chosen to maximize the use of the available time, provided as input budget, in order to maximize the degree of compressibility obtained.
In the case of single documents, this set of configurations can specify the use of a single algorithm or a pair of algorithms.
For multiple documents, instead, combinations of algorithms are used, where each algorithm is executed on a specific document. Each element of these combinations can be a single algorithm or a pair of algorithms, but it can be shown experimentally that it is always sufficient to use a pair of algorithms for at most one document, and a single algorithm for each of the remaining documents. Therefore, in this paper we do not consider the possibility of compressing a single document in parts, with more than one data compression algorithm, because doing so would bring only a small gain that does not justify the increase in decoding complexity.
Assuming a function that compresses a given set of data, a time-optimization activity would allow us to reduce the economic costs of our system without sacrificing more of the reduction in data size than necessary.
When we normally apply data compression, all the data we want to compress are input into a single compression tool which will try to reduce the size of the input data (while keeping the same information content) in a certain time t, and the tool will return as its output the compressed data.
The focus is frequently placed on the algorithm’s compression efficiency, while the time required for compression is often overlooked, provided it remains within a reasonable limit.
This approach may not be convenient in situations where we want to specify a time t’, possibly smaller than t, within which the compression process must be completed: that is, when we are trying to optimize compression performance while respecting a specified time limit.
Recall that the convex surface, or convex hull, of a set of points S is the intersection of all convex sets that contain S, and that its lower polygon chain contains, for each value of the first coordinate x (in our case, the time), the points of the convex surface that minimize the second coordinate y (in our case, the size of the compressed document).
If we consider the optimization of the compression of a single document, with the notions of a convex surface and lower polygon chain it is possible to obtain the set of optimal mixes of algorithms for any time budget.
The basic idea is to obtain the lower polygon chain of the best algorithms for each time budget, representing them on a two-dimensional plane by choosing as coordinates the time required (x) and the size resulting from the execution of the algorithm (y).
By best algorithms, we mean the input algorithms sorted by the time required, filtered by taking only those that lead to an improvement in terms of compression compared to the previous algorithm.
This filtering activity removes two classes of algorithms: those that, for the same time budget, would produce a larger output than another mix, and, when two algorithms require the same amount of time, the one with the larger resulting size.
By building the convex surface of the remaining algorithms, we can obtain the lower polygon chain composed of the algorithms involved in each possible optimal mix. This step is necessary because there may still be algorithms that are better than the previous one but that involve a non-optimal mix.
By definition of a convex surface, there cannot be points below it, while points above the lower polygonal chain do not correspond to optimal algorithms because of their larger resulting size. For each possible time budget, we then have two options: use a single specific algorithm, or partition the document into two parts, each compressed by one of the two members of the optimal algorithm pair for that time budget.
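A compact way to obtain the lower polygonal chain of the (time, size) points is the lower half of the classic monotone-chain convex-hull algorithm, preceded by the filtering step described above. The sketch below is a generic illustration under these assumptions, not the implementation of [1]; the sample measurements are invented:

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def useful_candidates(points):
    """Keep only setups that compress strictly better than every faster setup."""
    best = float("inf")
    kept = []
    for t, b in sorted(set(points)):          # ascending time, ties broken by size
        if b < best:
            kept.append((t, b))
            best = b
    return kept

def lower_chain(points):
    """Lower polygonal chain of the (time, size) points, from fastest to slowest.

    Its vertices are the only candidates for an optimal mix; every point
    above the chain is dominated for every time budget.
    """
    chain = []
    for p in useful_candidates(points):
        while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
            chain.pop()
        chain.append(p)
    return chain

# Invented (time in s, size in MB) measurements for five setups.
setups = [(0.4, 42.0), (0.9, 30.5), (1.3, 33.0), (2.8, 24.0), (6.0, 23.5)]
print(lower_chain(setups))   # (1.3, 33.0) is dominated and is filtered out
```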
If we consider the optimization of the compression of multiple documents, an important feature of the resulting polygon chain is the slope of the segments connecting two algorithms. This slope captures the benefit obtained by switching from one algorithm to another: a steeper slope corresponds to a greater benefit.
The optimization of the compression of multiple documents starts from the lower polygon chains built through the process seen previously for each document. The idea is to join these chains to obtain an overall chain representing the entire set of documents.
The resulting lower polygon chain will be made up of points representing combinations of algorithms to be used, one for each document involved in the process. Each point of this lower polygon chain will be chosen in order to maximize the benefit for that specific time budget. This maximization is obtained by changing, with respect to the previous combination, only one algorithm. Once the overall lower polygon chain is obtained, the mixing process will be similar to the one seen previously. The optimal algorithms will correspond to the extremes of the segment where the time budget falls.
Following the work of Zohar and Cassuto in [1], let us suppose we want to compress a single file f or a large data set D by using a compressor. In real life, we will have many compressors and setups to choose from, but here, for simplicity, let us consider the situation in which we have two possible compressors available, or two possible configurations of a single compressor, called setup1 and setup2, respectively; it is then easy to generalize the following discussion to multiple compressors and multiple setups.
Suppose setup1 takes less time to compress than setup2, but that setup2 compresses more than setup1. Now, suppose that the execution must finish within a certain time t’ (let us call this value time-budget).
We define t1 as the time taken by setup1 and t2 as the time taken by setup2. Three situations are possible:
1. t’ < t1: it is not possible to compress with either setup, since the time budget is less than the time taken by the fastest compressor (setup1).
2. t’ > t2: it is possible to compress with both setup1 and setup2. We choose to compress with setup2, since its compression is more effective in terms of output size.
3. t1 < t’ < t2: it is not possible to compress with setup2 because the time budget is not sufficient. We therefore decide to use setup1.
If we find ourselves in situation 3, the system manages to compress f (or D) through setup1; however, the time value Δ(t) = t’ − t1 in which the system remains unused is not negligible since the chosen setup finished its execution before the set time t’. Compression optimization tries to reduce, if not to eliminate, the Δ(t) value, considering not one, but a mix of setups.
In our previous example, if we found ourselves in a situation in which the time budget is t1 < t’ < t2, we could think of adopting a “mixing” strategy in which a part of the file f (or of the data set D) to be compressed goes as input to setup1, while the remaining part goes to setup2.
This strategy, compared to the classic application of a single compression tool, could lead to the use of the entire time budget initially chosen and to a reduction in the output size. The proposed algorithm searches for an optimal mix, i.e., an optimal setup configuration (among the many considered) that can be used to compress a file f, given a time budget. As we will see, the search is not a trivial process: it adopts a technique that first builds a function identifying the potential setups that could be part of the mix and subsequently, among all the candidates, chooses two or more of them.
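The mixing mechanism itself is straightforward; the sketch below (an illustration only, not the code used in our experiments) splits an input buffer between a fast setup (zlib at level 1) and a stronger one (LZMA), with the split fraction fixed arbitrarily at 0.5 just to show the idea; how the optimal fraction is chosen is described next:

```python
import lzma
import time
import zlib

def mix_compress(data: bytes, fraction: float):
    """Compress the first `fraction` of `data` with a fast setup (zlib, level 1)
    and the rest with a slower, stronger one (LZMA); the two outputs are kept
    separate so that each part can be decompressed with the matching tool."""
    cut = int(len(data) * fraction)
    start = time.perf_counter()
    fast_part = zlib.compress(data[:cut], 1)
    strong_part = lzma.compress(data[cut:])
    elapsed = time.perf_counter() - start
    return len(fast_part) + len(strong_part), elapsed

if __name__ == "__main__":
    data = open("dataset.bin", "rb").read()     # hypothetical input data
    size, t = mix_compress(data, 0.5)           # arbitrary 50/50 split
    print(f"mixed output: {size} bytes in {t:.3f} s")
```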
The algorithm described next allows us to obtain all the useful-setups for a certain file f to be compressed. Subsequently, it is possible to obtain the optimal mix once the time budget t’ has been set. The inputs of the algorithm are pairs (bi, ti), where each pair represents a setup: ti is the time taken by the setup to compress the file f, while bi is the resulting size, as in [1].
The algorithm for finding the optimal mix consists of four main steps, listed below:
1. Determination of the pairs (bi, ti) for each setup; these pairs are obtained by running each compression tool individually on the file f or by simply estimating its performance.
2. Sorting of the setups in ascending order of ti, with bi used as a secondary key in descending order.
3. Removal of the worst setups and construction of the convex hull of the remaining points:
- i. Between two setups that take the same time to execute, the one that gives the larger output size is discarded.
- ii. All setups that give an output size that is too large compared to others that take less time are discarded.
4. Acquisition of the optimal mix given a time budget t’.
Step 3 builds what in computational geometry is called the convex hull, that is, given a set of points, the smallest convex set that contains them all. As proved in [1], the setups located at the vertices of the lower part of the convex hull are the only useful-setups.
Useful-mixes always consist of two setups connected by an arc on the bottom edge of the polygon. Given m useful setups arranged in ascending order of running time and a time budget t’, the goal is to identify the optimal combination of setups, a and b, along with the fractions of the file, ra and rb, that each setup will handle.
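Assuming the useful setups (the vertices of the lower chain) are already available and sorted by time, a sketch of the final selection step could look as follows; the measurements are invented:

```python
def optimal_pair(useful, budget):
    """Given the useful setups as (time, size) vertices of the lower chain,
    sorted by increasing time, decide what to run for a time budget `budget`.

    Returns a single setup when no mixing is needed, or the adjacent pair
    (faster, slower) whose segment of the chain contains the budget.
    """
    times = [t for t, _ in useful]
    if budget < times[0]:
        raise ValueError("budget below the fastest useful setup")
    if budget >= times[-1]:
        return useful[-1]                       # enough time for the strongest setup
    for faster, slower in zip(useful, useful[1:]):
        if faster[0] <= budget < slower[0]:
            return faster, slower

# Invented useful setups (time in s, size in MB), already on the lower chain.
useful = [(0.4, 42.0), (0.9, 30.5), (2.8, 24.0), (6.0, 23.5)]
print(optimal_pair(useful, 1.5))   # -> ((0.9, 30.5), (2.8, 24.0))
```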
In the non-trivial case where t1 < t’ < t2, the chosen combination consists of the two adjacent setups, a and b, such that tb < t’ < ta. After finding the optimal mix, the percentage of data D that must be compressed with each of the chosen compression tools is calculated as in [1], assuming that the compression time of each setup scales linearly with the amount of data it processes. Given two setups sa and sb with times ta and tb, respectively, and choosing t’ as the time budget such that tb < t’ < ta, the fraction ra of the file f to compress with sa will be:

ra = (t’ − tb) / (ta − tb)

The fraction rb of D compressed by the sb setup will be:

rb = (ta − t’) / (ta − tb) = 1 − ra
In a multi-document context, each document typically possesses its own unique convex hull. Consequently, optimizing compression requires addressing multiple convex hulls simultaneously. Since the effectiveness of a particular tool or configuration depends on the document’s specific information characteristics, the set of optimal configurations will vary across documents.
A critical part of the solution involves an algorithm that consolidates the individual convex hulls into a unified convex hull. This unified structure allows the system to determine the best configuration for a given computational time constraint with ease, as explained in [1].
To illustrate how the merged sequence facilitates finding the optimal configuration for any compute-time limit, as outlined in [1], the process begins by calculating the compute time required when the least resource-intensive configurations are applied across all documents. The algorithm then iterates through successive configuration vectors in the merged sequence, recalculating the compute time at each step. This progression continues until it encounters the last configuration within the allowed compute-time limit. As the system approaches the budget, the time constraint will eventually lie between two adjacent configurations in the sequence.
At this point, much like the single-document scenario, the solution involves blending two adjacent configurations. Since only one document changes setup between adjacent configurations, in the final allocation most documents use a single configuration, with at most one document split between two configurations.
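A possible way to realize the merging step (a sketch under the assumption that each document's useful setups are already available as its lower chain; this is not the exact procedure of [1]) is to start from the cheapest setup of every document and repeatedly apply, across all documents, the single upgrade with the greatest benefit, i.e., the largest size reduction per unit of extra time:

```python
import heapq

def merged_sequence(chains):
    """Greedily merge per-document lower chains into one global sequence.

    `chains` maps a document id to its list of (time, size) useful setups,
    sorted by increasing time.  Each emitted element is
    (total_time, total_size, choice), where `choice` records the setup index
    currently selected for every document.
    """
    choice = {doc: 0 for doc in chains}                    # cheapest setup everywhere
    total_time = sum(chain[0][0] for chain in chains.values())
    total_size = sum(chain[0][1] for chain in chains.values())
    sequence = [(total_time, total_size, dict(choice))]

    # Candidate upgrades keyed by benefit: size saved per extra second of time.
    heap = []
    for doc, chain in chains.items():
        if len(chain) > 1:
            dt = chain[1][0] - chain[0][0]
            db = chain[0][1] - chain[1][1]
            heapq.heappush(heap, (-db / dt, doc))

    while heap:
        _, doc = heapq.heappop(heap)                       # best remaining upgrade
        chain, i = chains[doc], choice[doc]
        total_time += chain[i + 1][0] - chain[i][0]
        total_size -= chain[i][1] - chain[i + 1][1]
        choice[doc] = i + 1
        sequence.append((total_time, total_size, dict(choice)))
        if i + 2 < len(chain):                             # queue this document's next step
            dt = chain[i + 2][0] - chain[i + 1][0]
            db = chain[i + 1][1] - chain[i + 2][1]
            heapq.heappush(heap, (-db / dt, doc))
    return sequence

# Invented lower chains for two documents (time in s, size in MB).
chains = {
    "doc1": [(0.4, 42.0), (0.9, 30.5), (2.8, 24.0)],
    "doc2": [(0.2, 10.0), (1.0, 7.0)],
}
for total_time, total_size, choice in merged_sequence(chains):
    print(f"t={total_time:.1f}s  size={total_size:.1f}MB  {choice}")
```

The time budget is then located between two adjacent elements of the resulting sequence and, as in the single-document case, only the one document that changes setup between those two elements needs to be split between two configurations.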