Article

Data Compression with a Time Limit

by
Bruno Carpentieri
Dipartimento di Informatica, Università di Salerno, 84084 Fisciano, SA, Italy
Algorithms 2025, 18(3), 135; https://doi.org/10.3390/a18030135
Submission received: 21 January 2025 / Revised: 18 February 2025 / Accepted: 21 February 2025 / Published: 3 March 2025
(This article belongs to the Special Issue 2024 and 2025 Selected Papers from Algorithms Editorial Board Members)

Abstract

In this paper, we explore a framework to identify an optimal choice of compression algorithms that enables the best allocation of computing resources in a large-scale data storage environment: our goal is to maximize the efficiency of data compression given a time limit that must be observed by the compression process. We tested this approach with lossless compression of one-dimensional data (text) and two-dimensional data (images) and the experimental results demonstrate its effectiveness. We also extended this technique to lossy compression and successfully applied it to the lossy compression of two-dimensional data.

1. Introduction

Digital technologies have revolutionized our daily lives in a short period. Thanks to mobile devices, modern personal computers, and high-speed mobile networks, we are always connected and can share photos or videos with a click.
These important changes, along with the new applications emerging from recent innovations, have been readily accepted by end-users. However, these users remain largely unaware of the underlying mechanisms and functions: for example, how the process of playing a video or sending an image over the internet works.
Whether we surf the internet, listen to music, watch a video, or scroll through images, few of us realize how fundamental data compression is to all of these activities.
Communication today is largely based on the exchange of digital information on data networks, such as the internet. In fact, one might say that these infrastructures are now fundamentally based on data compression, which enables easier and faster exchange of digital data and information. Moreover, compression is often combined with encryption to guarantee the privacy of communication.
Typically, the choice of compression algorithms to use and the technical details to rely on depend on the application that will use the digital file and/or the destination of the message sent.
The goal of data compression algorithms is to maximize the level of compression given the limitations of processing resources.
Deciding how to allocate processing resources for compression tasks is not a trivial problem. In this paper, we focus on a framework that begins with multimedia data compression and studies the optimal allocation of compression resources. Our goal is to maximize the efficiency of data compression given limited computational resources, particularly with respect to a time limit (see [1,2]).
The paper is structured as follows: Section 2 discusses past related work and the optimization algorithm. Section 3 presents our experimental results for the lossless compression of one-dimensional and two-dimensional data. Section 4 refines the technique used for lossless compression to perform efficient lossy compression of images and presents experimental results on lossy compression. Finally, Section 5 provides conclusions and outlines possible new research directions.

2. Data Compression with a Time Limit

The compression process, especially in the case of large databases, is characterized by a trade-off between time and space of which the user must be aware and on which they must make a decision. Various compression algorithms allow the user to optimize either the degree of compression or the time needed for the process.
Individual compression algorithms often also allow a more fine-grained choice by specifying a compression level, where a low level represents the desire for a minimum required time, while a high level represents the need for maximum compressibility of the supplied document.
With the advent of cloud computing and the service-based and pay-per-use model widely used nowadays, one can easily imagine the need to be able to set a specific time budget for a certain activity such as data compression. As specific activities, one can imagine the creation of hourly or event-related dumps. A concrete case specific to cloud computing can be imagined by considering the FaaS (“Function as a Service”) model.
The cost of this model is directly proportional to the number of calls performed and, in particular, to the duration of each.

2.1. Past Related Work

Data compression is nowadays a very active field of research: every reduction in the size of digital data allows us to transmit them faster, and therefore has a significant economic impact.
In lossless compression algorithms, the decoded output of the compression system is identical, bit for bit, to the original data. In contrast, lossy compression algorithms produce an “acceptable” approximation (depending on the application) of the original input.
Textual data, including text and HTML pages, are typically not stored in compressed form since they need to be searchable. In contrast, raster data such as audio, images, and video are generally stored in compressed formats and are often created in compressed form by the devices that generate them. Lossy compression is used exclusively for raster data.
Zohar and Cassuto in [1] studied for the first time the problem of optimizing the lossless compression of one-dimensional data when there is a time limit within which the compression process must be completed. They experimentally demonstrated that the optimization is possible.
Carpentieri in [2] resumed the work of Zohar and Cassuto and extended it to the lossless compression of two-dimensional data (images).
In the paper of Liao, Moffat, Petri, and Wirth [3], a comprehensive model for the total retention cost (TRC) of a data archiving system is established, integrating cloud computing provider charging rates to quantify costs across various compression strategies. This analysis serves as a foundation for developing innovative, cost-efficient alternatives that surpass the effectiveness of existing methods.
Wiseman and Schwan [4] investigate the application of compression techniques to enhance middleware-based information exchange in interactive and collaborative distributed systems. In these environments, achieving high compression ratios must be balanced with compression speeds that align with sustainable network transfer rates. Their approach dynamically monitors network and processor resources, evaluates compression efficiency, and autonomously selects the most suitable compression methods to optimize performance.
In this paper, we study data compression with a time limit in both the one-dimensional and the two-dimensional case and in the case of both lossless and lossy compression.
Lossless compression algorithms are often based on the text substitution model introduced by Lempel, Ziv, and Storer in the 1970s, later used for text and image compression (see, for example, [5,6,7]), or on Huffman or arithmetic coding (see, for example, [8,9]).
The approach we have presented for the optimization of compression given a time limit is totally independent of the algorithms used in the experiments. For simplicity, we have chosen to use in the experiments some of the most popular compression tools. Obviously, this choice has no impact on the optimization process, other than providing different input measurements to it.
In this paper, regarding the lossless compression of one-dimensional data, we have used in our experiments gzip (see [10]), xz (based on LZMA, which derives from the seminal work of Lempel and Ziv [5,6]), bzip2 (based on the Burrows-Wheeler transform, see [11]), and arithmetic coding (see [9]).
Lossless and lossy image compression algorithms often use the “modeling + coding” approach in which a prediction of the current pixel is built consistently by encoder and decoder depending on a chosen context of already coded samples, and then a prediction error, i.e., the difference between the real value of the current pixel and the prediction made, is sent from the encoder to the decoder.
As for lossless image compression, we used PNG (see [12]), TIFF (based on the LZW algorithm, see [13]), JPEG-LS (see [14]), JPEG2000 (see [15]), BMP (see [16]), and FELICS (see [17]) in the testing phase.
Saha in [18] presents a review of lossy image compression algorithms. For lossy compression of images, in our tests we used JPEG (see [19]) and WEBP (see [20]).
When lossy image coding is used, it is important to balance the compression obtained and the quality of the decompressed image. Here, to evaluate the quality of the decompressed image we used the SSIM metric (see [21]).

2.2. A Data Compression Algorithm with a Time Limit

Generally speaking, the framework proposed in [1] is a specialization of a more general approach: it does not deal with the internal mechanisms of the compression algorithms, but only with the configuration of their effort levels.
It is not necessary to specify the compression algorithms to be used, since the approach is agnostic with respect to them. The simplified idea is to use a set of configurations of a specific algorithm, each on a partition of the data to be compressed. The percentage of data to be compressed with each algorithm is chosen to maximize the use of the available time, provided as input budget, in order to maximize the degree of compressibility obtained.
In the case of single documents, this set of configurations can specify the use of a single algorithm or a pair of algorithms.
For multiple documents, instead, combinations of algorithms are used, where each algorithm is executed on a specific document. Each element of these combinations can be a single algorithm or a pair of algorithms, but it can be shown experimentally that it is always sufficient to use a pair of algorithms for at most one document and a single algorithm for each of the remaining documents. Therefore, in this paper we do not consider compressing a single document in parts with more than one data compression algorithm, because doing so would bring only a small gain that does not justify the added decoding complexity.
Given a function that compresses a set of data, a time optimization activity allows us to reduce the economic costs of our system without sacrificing the reduction in data size more than necessary.
When we normally apply data compression, all the data we want to compress are input into a single compression tool which will try to reduce the size of the input data (while keeping the same information content) in a certain time t, and the tool will return as its output the compressed data.
The focus is frequently placed on the algorithm’s compression efficiency, while the time required for compression is often overlooked, provided it remains within a reasonable limit.
This approach may not be convenient when we want to specify a time t’, possibly smaller than t, within which the compression process must be completed: that is, when we are trying to optimize compression performance while respecting a specified time limit.
Recall that a convex surface, or convex hull, of a set of points S is the intersection of all convex sets that contain S and that the lower polygon chain contains all points that minimize the second coordinate (y, which in our case will be the size of the compressed document) for each x (in our case, time) of the convex surface.
If we consider the optimization of the compression of a single document, with the notions of a convex surface and lower polygon chain it is possible to obtain the set of optimal mixes of algorithms for any time budget.
The basic idea is to obtain the lower polygon chain of the best algorithms for each time budget, representing them on a two-dimensional plane by choosing as coordinates the time required (x) and the resulting dimension from the execution of the algorithm (y).
By best algorithms, we mean the input algorithms sorted by the time required, filtered by taking only those that lead to an improvement in terms of compression compared to the previous algorithm.
This filtering activity removes two classes of algorithms.
The algorithms that would cause a worse resulting dimension for the same time budget compared to another mix are excluded. In case two algorithms require the same amount of time, the one with the smaller resulting dimension is preferred.
By building the convex surface of the remaining algorithms, we can obtain the lower polygon chain composed of the algorithms involved in each possible optimal mix. This step is necessary because there may still be algorithms that are better than the previous one but that involve a non-optimal mix.
By definition of a convex surface, there cannot be points below it, while points above the lower polygonal chain will not correspond to optimal algorithms due to a larger resulting dimension. For each possible time budget, we will then have two possible options. It will be possible to use a specific algorithm, or the document can be partitioned into two parts, where each will be compressed by one of the two members of the optimal algorithm pair for that time budget.
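To make the construction above concrete, here is a minimal sketch in Python (not the code used for the experiments; the function names are illustrative). It discards dominated setups, given one (time, size) pair per setup, and then builds the lower polygon chain of the convex hull, whose vertices are the candidate useful setups.

```python
# Minimal, illustrative sketch (not the author's code): from one
# (time_ms, size_bytes) pair per setup, discard dominated setups and build
# the lower polygon chain of the convex hull.

def _cross(o, a, b):
    """z-component of (a - o) x (b - o); non-positive means the middle point a
    lies on or above the segment from o to b, so it cannot be on the lower chain."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def useful_setups(setups):
    """setups: list of (time_ms, size_bytes) pairs, one per compressor setup."""
    # 1. sort by time; for equal times the smaller output comes first and survives
    pts = sorted(setups)
    # 2. keep only setups that improve the output size over every faster setup
    filtered, best = [], float("inf")
    for t, b in pts:
        if b < best:
            filtered.append((t, b))
            best = b
    # 3. lower part of the convex hull (Andrew's monotone chain)
    chain = []
    for p in filtered:
        while len(chain) >= 2 and _cross(chain[-2], chain[-1], p) <= 0:
            chain.pop()
        chain.append(p)
    return chain
```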
If we consider the optimization of the compression of multiple documents, an important feature of the resulting polygon chain is the slope of the segments connecting two algorithms. This slope captures the concept of benefit obtained by switching from one algorithm to another. A more significant slope will result in a greater benefit.
The optimization of the compression of multiple documents starts from the lower polygon chains built through the process seen previously for each document. The idea is to join these chains to obtain an overall chain representing the entire set of documents.
The resulting lower polygon chain will be made up of points representing combinations of algorithms to be used, one for each document involved in the process. Each point of this lower polygon chain will be chosen in order to maximize the benefit for that specific time budget. This maximization is obtained by changing, with respect to the previous combination, only one algorithm. Once the overall lower polygon chain is obtained, the mixing process will be similar to the one seen previously. The optimal algorithms will correspond to the extremes of the segment where the time budget falls.
Following the work of Zohar and Cassuto in [1], let us suppose we find ourselves in a situation where we want to compress a single file f or a large data set D by using a compressor. In real life, we will have many compressors and setups to choose from, but here, for simplicity let us consider the situation in which we have two possible compressors available, or two possible configurations of a single compressor called, respectively, setup1 and setup2: it could be easy then to generalize the following discussion to multiple compressors and multiple setups.
Suppose setup1 takes less time to compress than setup2, but that setup2 compresses more than setup1. Now, suppose that the execution must finish within a certain time t’ (let us call this value time-budget).
We define t1 as the time taken by setup1, t2 the time taken by setup2.
  1. t’ < t1: it is not possible to compress with either setup, since the time budget is less than the time taken by the fastest compressor (setup1).
  2. t’ > t2: it is possible to compress with both setup1 and setup2. We choose to compress with setup2, since its compression is more effective in terms of output size.
  3. t1 < t’ < t2: it is not possible to compress with setup2 because the time budget is not sufficient. We therefore decide to use setup1.
If we find ourselves in situation 3, the system manages to compress f (or D) through setup1; however, the time value Δ(t) = t’ − t1 in which the system remains unused is not negligible since the chosen setup finished its execution before the set time t’. Compression optimization tries to reduce, if not to eliminate, the Δ(t) value, considering not one, but a mix of setups.
In our previous example, if we found ourselves in a situation in which the time-budget is t1 < t’ < t2, we could think of adopting a “mixing” strategy in which a part of the file f (or of the data set D) to be compressed goes as input to setup1, while the remaining part to setup2.
This strategy, compared to the classic application of a single compression tool, could lead to the use of the entire time budget initially chosen and to a reduction in the output size. The proposed algorithm searches for an optimal-mix, i.e., an optimal setup configuration (among many considered) that can be used to compress a file f, given a time budget. As we will see, the search is not a trivial process, as it adopts a technique that builds a function that will consider potential setups that could be part of the mixing, and subsequently, among all the candidates, two or more are chosen.
The algorithm that is described next allows you to obtain all the useful-setups for a certain file f to be compressed. Subsequently, it will be possible to obtain the optimal mix once the time budget t’ has been set. The inputs of the algorithm are pairs (bi, ti), where each pair represents a setup: ti represents the time taken by the setup to compress the file f, while bi is the resulting size, as in [1].
The algorithm for finding the optimal mix consists of four main steps, listed below:
  • Determination of the pairs (bi, ti) for each setup; these pairs are obtained by running each compression tool individually on the file f or by simply estimating its performance.
  • Sorting the setups in an ascending manner by ti, with bi used as the second index in descending order.
  • Removal of the worst setups and construction of the convex hull of the remaining points:
    i. Between two setups that take the same time to execute, the one that gives the larger output size is discarded.
    ii. All setups that give an output size that is too large compared to others that take less time are discarded.
  • Acquisition of the optimal-mix given a time budget t’.
Phase 3 will build what in computational geometry is called the convex hull, that is, given a set of points, the determination of the smallest convex set that can contain them all. As proved in [1], the setups located at the vertices of the lower part of the convex hull are the only useful-setups.
Useful-mixes will always be two setups connected by an arc on the bottom edge of the polygon. Given m useful setups arranged in ascending order of running time and a time budget t′, the goal is to identify the optimal combination of setups, a and b, along with the fractions of files, ra and rb, that each setup will handle.
In the non-trivial case where t1 < t’ < t2, the chosen combination consists of the two adjacent setups, a and b, such that tb < t’ < ta. After finding the optimal mix, the percentage of data D that must be compressed with the chosen compression tools is calculated as in [1]. Assuming we have two setups sa and sb with times ta and tb, respectively, choosing t’ as the time budget such that tb < t’ < ta, the fraction of the file f to compress with sa will be:
ra = (t’ − tb)/(ta − tb)
The fraction of D compressed by the sb setup will be:
rb = 1 − ra.
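A minimal sketch of this mixing step follows (again in Python and purely illustrative, not the author's implementation): given the useful setups sorted by running time and a time budget t’, it locates the two adjacent setups that straddle the budget and returns the fractions ra and rb defined above. The numbers in the usage line are made up.

```python
# Minimal, illustrative sketch (not the author's code): given the useful
# setups in ascending order of running time, find the optimal mix for a
# time budget t_prime and the fraction of data assigned to each setup.

def optimal_mix(useful, t_prime):
    """useful: list of (time_ms, size_bytes) pairs on the lower chain,
    sorted by ascending time."""
    times = [t for t, _ in useful]
    if t_prime < times[0]:
        raise ValueError("budget below the fastest useful setup")
    if t_prime >= times[-1]:
        return [(useful[-1], 1.0)]        # enough time for the strongest setup alone
    # find the adjacent pair with t_b <= t_prime <= t_a
    for (t_b, b_b), (t_a, b_a) in zip(useful, useful[1:]):
        if t_b <= t_prime <= t_a:
            r_a = (t_prime - t_b) / (t_a - t_b)   # fraction for the slower setup s_a
            r_b = 1.0 - r_a                       # fraction for the faster setup s_b
            return [((t_a, b_a), r_a), ((t_b, b_b), r_b)]

# Hypothetical usage: two setups taking 10 s and 26 s and a 22 s budget;
# 75% of the data goes to the slower, better-compressing setup.
print(optimal_mix([(10_000, 8_000_000), (26_000, 6_800_000)], 22_000))
```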
In a multi-document context, each document typically possesses its own unique convex hull. Consequently, optimizing compression requires addressing multiple convex hulls simultaneously. Since the effectiveness of a particular tool or configuration depends on the document’s specific information characteristics, the set of optimal configurations will vary across documents.
A critical part of the solution involves an algorithm that consolidates the individual convex hulls into a unified convex hull. This unified structure allows the system to determine the best configuration for a given computational time constraint with ease, as explained in [1].
To illustrate how the merged sequence facilitates finding the optimal configuration for any compute-time limit, as outlined in [1], the process begins by calculating the compute time required when the least resource-intensive configurations are applied across all documents. The algorithm then iterates through successive configuration vectors in the merged sequence, recalculating the compute time at each step. This progression continues until the system encounters the last configuration within the allowed compute-time limit. As the system approaches the budget, the time constraint will eventually lie between two adjacent configurations in the sequence.
At this point, much like the single-document scenario, the solution involves blending two adjacent configurations. Since only one document transitions between setups in adjacent configurations, the final allocation will result in most documents sticking to a single configuration, with at most one document employing a mix of two configurations across its instances.
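The following sketch (illustrative Python, not the author's implementation) captures this merging step under the stated assumptions: every document starts at its fastest useful setup, the candidate upgrades are applied greedily in order of decreasing benefit (bytes saved per extra millisecond spent), and the upgrade that would exceed the budget is applied to only a fraction of a single document, as in the single-document case.

```python
# Minimal, illustrative sketch (not the author's code) of the multi-document
# merge.  Each document starts at its fastest useful setup; candidate
# "upgrades" (moves to the next setup on that document's lower chain) are
# applied in order of decreasing benefit, i.e., bytes saved per extra
# millisecond spent.

import heapq

def _slope(p, q):
    """Bytes saved per extra millisecond when moving from setup p to setup q."""
    return (p[1] - q[1]) / (q[0] - p[0])

def allocate(chains, budget_ms):
    """chains: one useful-setup chain [(time_ms, size_bytes), ...] per document,
    each sorted by ascending time."""
    choice = [0] * len(chains)                     # current setup index per document
    spent = sum(chain[0][0] for chain in chains)   # everything at the fastest setup
    if spent > budget_ms:
        raise ValueError("budget below the fastest configuration")
    upgrades = []                                  # max-heap on benefit (negated)
    for d, chain in enumerate(chains):
        if len(chain) > 1:
            heapq.heappush(upgrades, (-_slope(chain[0], chain[1]), d, 1))
    while upgrades:
        _, d, i = heapq.heappop(upgrades)
        extra = chains[d][i][0] - chains[d][i - 1][0]
        if spent + extra > budget_ms:
            # partial upgrade of document d: the fraction plays the role of r_a
            return choice, (d, i, (budget_ms - spent) / extra)
        spent += extra
        choice[d] = i
        if i + 1 < len(chains[d]):
            heapq.heappush(upgrades, (-_slope(chains[d][i], chains[d][i + 1]), d, i + 1))
    return choice, None                            # budget covers the best setups
```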

3. Lossless Compression

We tested this approach by using lossless compressors on one-dimensional data (text) and on two-dimensional data (images); the results obtained follow.

3.1. Lossless Compression of One-Dimensional Data with a Time Limit

We began testing on a set of 1000 html pages with similar contents, downloaded through the Stanford University web crawler Wibbi, obtaining a resulting document of size 34 MB.
We tested the following compression tools: gzip, xz, bzip2, and arithmetic coding, all in their default configurations.
For this experiment, we used a MacBook Air with 8 GB of RAM and a 1.7 GHz Intel Core i7 processor. Table 1 reports the results of running the compression tools on this test data set:
Gzip is the fastest of all the tools, probably because the implementation used comes from the highly optimized java.util.zip library. Arithmetic coding, although faster than all the other setups except Gzip, returns a significantly larger output size than the other compressors: about 100% more. The best setup in terms of output size is XZ, even though on this platform it takes the longest time: about 26 s. In this experiment, we decided to limit our optimal mix to only two algorithms and to compress each file with a single compressor. We tested two different time budgets: 22,000 milliseconds and 10,000 milliseconds.
For the time budget of 22,000 milliseconds, the optimal mix turns out to be composed of Bzip2 and XZ. A total of 632 html pages were compressed by using XZ and the remaining 368 by using Bzip2. The compressed file size for this test was 6,935,454 bytes and the total time to compress was 22,566 milliseconds, i.e., slightly more than the assigned time budget: an error of about 2.57%. This depends on the fact that we chose to compress each file with a single compressor. By removing this constraint, we could likely adhere precisely to the assigned time budget.
For the time budget of 10,000 milliseconds, the optimal mix turns out to be composed of Bzip2 and Gzip. A total of 557 html pages were compressed by using Gzip and the remaining 443 by using Bzip2. The compressed file size for this test was 7,401,784 bytes and the total time to compress was 9703 milliseconds, i.e., slightly less than the assigned time budget.

3.2. Lossless Compression of Two-Dimensional Data with a Time Limit

We evaluated the performance of the algorithm by conducting experiments on different types of inputs. The following four cases were initially analyzed:
Compression of a set of animated images with similar scenes.
Compression of a set of animated images with different scenes.
Compression of a set of non-animated images with similar scenes.
Compression of a set of non-animated images with different scenes.
The time budgets used are different for each experiment.
The tests on the lossless compression of images with a time limit were conducted on a personal computer with the following specifications:
CPU: Intel Core i7-2630QM CPU 2.00 GHz;
RAM: 8.00 GB;
Operating system: Windows 10 Pro;
Mass storage: 128 GB Samsung Pro SSD.
Table 2 lists all the compression tools used, the configuration adopted, and the download links for the compression libraries (last accessed on 19 November 2024):
The images were extracted from videos currently available at the following links (last accessed on 19 November 2024):
Each video was downloaded by connecting to the portal http://it.savefrom.net/ and then specifying the link and format (MP4 360p).
To perform the subdivision into frames, we used the GOM Player tool (available at the link https://www.gomlab.com/). The input images are in uncompressed TIFF format.

3.2.1. Compression of a Set of Animated Images with Similar Scenes

The dataset D to be compressed consisted of 1000 images, for a total size of 659 MB (691,744,000 bytes). Single runs of the compression tools yielded the following pairs (bi, ti).
Table 3 shows us that the fastest compressor was BMP while the slowest was JPEG2000. From the point of view of the compression obtained, the best compressor was JPEG-LS, while the worst was BMP.
The useful-setup is therefore composed of the compressors BMP, FELICS, PNG, TIFF, and JPEG-LS. With an assigned time budget of t1 = 39,000 ms, the optimal mix is given by the mix of the BMP and FELICS setups, and it lies closer to the BMP setup. This is shown in Figure 1:
In this case, the optimum lies exactly on the convex hull. A total of 794 images will be compressed using BMP, the other 206 with FELICS. We therefore compressed D as suggested by the algorithm: the compression time was 39,141 ms and the output size 563 MB (590,397,633 bytes).
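As a sanity check (our computation from the run times in Table 3, not reported in the original), the split follows directly from the fraction formula of Section 2.2: rFELICS = (39,000 − 37,975)/(42,952 − 37,975) = 1025/4977 ≈ 0.206, so about 206 of the 1000 images are assigned to FELICS and the remaining 794 to BMP, matching the reported allocation.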
For t2 = 89,000 ms, instead, the optimal mix is given by the mix between the TIFF and PNG configurations, and it is very close to the TIFF configuration. Figure 2 shows it:
Here too, the optimum intersects the convex hull. The number of images to compress with TIFF was 897, while for PNG it was 103.
The compression time was 89,546 ms., while the output size was 163 MB (170,978,001 bytes).

3.2.2. Compression of a Set of Animated Images with Different Scenes

The dataset D to be compressed consisted again of 1000 images, for a total size of 659 MB (691,744,000 bytes). Single runs of the compression tools yielded the following pairs (bi, ti):
Table 4 shows us that the fastest compressor was BMP while the slowest was JPEG2000. From the point of view of the compression obtained, the best compressor was JPEG2000, while the worst was BMP.
The useful-setup is therefore composed of the compressors BMP, FELICS, JPEG-LS, and JPEG2000. Two time budgets were set to test the algorithm. With an assigned time budget of t1 = 571,000 ms, the optimal mix is given by the mix of the JPEG-LS and JPEG2000 setups, and it is very close to the JPEG2000 setup. This is shown in Figure 3:
Here, the optimal mix is at the intersection point.
A total of 16 images must be compressed using JPEG-LS, the other 984 with JPEG2000.
The compression time was 566,798 ms, while the output size was 167 MB (175,356,694 bytes).
In this case, the time budget is not used completely: the compression ends about 5 s early. This difference is acceptable, given that a very large time budget was used and that the compression is performed one file at a time.
For t2 = 69,000 ms, instead, the optimal mix is given by the mix between the JPEG-LS and FELICS configurations, and it is in their center. Figure 4 shows it:
The results are consistent with all the previous tests and the optimal mix is at the intersection point.
The number of images we compress with JPEG-LS is 147, while for FELICS it is 853.
The compression time was 68,541 ms., while the output size was 232 MB (243,538,616 bytes).

3.2.3. Compression of a Set of Non-Animated Images with Similar Scenes

The data set D to be compressed was composed of 1000 images, for a total size of 659 MB (691,744,000 bytes).
Single runs of the compression tools yielded the following pairs (bi, ti).
Table 5 shows that the fastest tool was BMP, while the slowest was JPEG2000. The best tool for output size was JPEG2000, while the worst was BMP. The calculated useful-setup was made up of the tools BMP, FELICS, JPEG-LS, and JPEG2000.
With an assigned time budget of t1 = 58,000 ms, the optimal mix is given by the mix of the BMP and FELICS setups, and it lies roughly midway between them.
All this is shown in Figure 5:
Here too, the optimum intersects the convex hull. The number of images to compress with BMP was 459, while for FELICS it was 541.
The compression time was 58,600 ms, while the output size was 471 MB (493,429,735 bytes).
If we increase the time budget to t2 = 130,000 ms, the optimal mix is given by the mix between the JPEG-LS and FELICS configurations, and it is very close to the JPEG-LS configuration.
Figure 6 shows it:
A total of 959 images will be compressed using JPEG-LS, the other 41 with FELICS.
The compression time was 130,292 ms and the output size 245 MB (240,675,273 bytes). We had to use slightly more than the expected budget: about 292 ms more.

3.2.4. Compression of a Set of Non-Animated Images with Different Scenes

The dataset D to be compressed consisted again of 1000 images, for a total size of 306 MB (320,864,256 bytes). Single runs of the compression tools yielded the following pairs (bi, ti).
Table 6 shows us that the fastest compressor was BMP, while the slowest was JPEG2000.
From the point of view of the compression obtained, the best compressor was JPEG2000, while the worst was BMP. The useful-setup is therefore composed of the compressors BMP, FELICS, JPEG-LS, and JPEG2000.
Again, two time budgets were set to test the algorithm.
With an assigned time budget of t1 = 61,000 ms, the optimal mix is given by the mix of the BMP and FELICS setups, and it is very close to the FELICS setup. This is shown in Figure 7:
The optimum is on the convex hull. The number of images to compress with BMP was 224, while for FELICS it was 776. The compression time was 61,215 ms., while the output size was 366 MB (383,557,600 bytes).
If we increase the time budget to t2 = 132,000 ms, the optimal mix is given by the mix between the JPEG-LS and JPEG2000 configurations, and it is very close to the JPEG-LS configuration. Figure 8 shows it:
The optimum is on the convex hull. The number of images to compress with JPEG-LS was 997, while for JPEG2000 it was 3.
The compression time was 132,764 ms, while the output size was 216 MB (226,036,083 bytes).

4. Lossy Compression

As seen so far, the optimization algorithm allows us to obtain all the useful setups when we use lossless compressors. From these useful setups, it is then possible to obtain the so-called optimal mix once the time within which the compression must be completed has been set.
If instead of using lossless compressors we consider lossy compressors, we will have to consider a third parameter: quality. This is because lossy compression leads to a trade-off between loss of information and compression obtained and it is necessary to evaluate the result of the compression not only in terms of compression but also in terms of quality of the decompressed image. The amount of data loss is determined by the level of compression achievable, as explained by rate-distortion theory.
Lossy compression optimization becomes important, for example, in situations where compression must be performed every time data are to be transmitted because the quality is individually chosen by the remote user.
In general, for many lossy compression tools, such as JPEG and WebP, when we perform the compression, it is possible to set the quality, often in a range of 0–100, where low values mean lower quality (and greater compression). Therefore, the greater the reduction in the size of the digital image is, the greater the loss of information relating to the image will be (and, consequently, the quality will be reduced).
We have therefore chosen to add a third parameter to the optimization algorithm: the SSIM index.
The SSIM (Structural Similarity Index) is used to measure the similarity between two images. It is a perceptual metric that quantifies the degradation of image quality caused by processing, such as compression or loss of information in data transmission.
This metric evaluates the similarity between two images of the same scene: in our case, an original reference image and the same image after compression and decompression. The SSIM was designed to improve on traditional evaluation methods such as PSNR and mean squared error (MSE).
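The experiments reported here used the ImageMagick SSIM tool listed in Table 7. As an independent illustration only, the snippet below (Python with scikit-image; the file names are hypothetical) shows how such an SSIM score can be computed between an original frame and its compressed-then-decompressed version.

```python
# Illustrative only: the experiments used the ImageMagick SSIM tool listed in
# Table 7.  This snippet shows the same measurement with scikit-image; the
# file names are hypothetical.

from skimage.io import imread
from skimage.metrics import structural_similarity

original = imread("frame_0001.ppm")            # original, uncompressed frame
decoded = imread("frame_0001_decoded.png")     # compressed and then decompressed

# channel_axis=-1 treats the last axis as the RGB channels (scikit-image >= 0.19);
# data_range=255 assumes 8-bit samples.
score = structural_similarity(original, decoded, channel_axis=-1, data_range=255)
print(f"SSIM = {score:.6f}")                   # 1.0 would mean identical images
```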
The idea is to ask the user to set a limit on the SSIM parameter at the beginning of the optimization. Subsequently, we generate the pairs (bi, ti) by sequentially executing the compression algorithms on a data set D (in our case, the images) and use them as input for the useful-setup search algorithm, but we select only those pairs whose compression returns an SSIM value greater than or equal to the SSIM limit that was set.
In detail, the optimization algorithm for lossy image compression (which we called SSIM_Adaptive) receives as input the SSIM value that we obtain by compressing with the maximum settable quality. Starting from this SSIM, at each iteration the algorithm compresses with a different, lower quality, calculated so as to obtain an SSIM value close to the SSIM limit, and inserts each setup it tries into an array.
At the end, we obtain q, the minimum quality (0–100) to use during compression so that the resulting SSIM is the closest to, but not lower than, the quality limit that was set.
It is important to clarify that the SSIM value calculated during the execution of the SSIM_Adaptive algorithm is a weighted average of the SSIM values of all the images given as input.
Therefore, in addition to the setup with the quality q received as output, we also take all the setups that have a quality greater than q. By carrying out this pre-processing, we are sure that the optimal mix will choose only among algorithms whose quality meets or exceeds the limit entered by the user.
In this way, we can compress D, respecting the time budget and using only the algorithms with a quality greater than or equal to the fixed one.
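The sketch below (illustrative Python, not the author's SSIM_Adaptive code) shows the idea of this pre-filter with a simplified linear scan over quality levels rather than the adaptive search described above. The helper `compress_dataset` is hypothetical, and the early stop assumes that the mean SSIM decreases monotonically as the quality parameter decreases.

```python
# Minimal, illustrative sketch (not the author's SSIM_Adaptive code): a
# simplified linear scan over quality levels instead of the adaptive search
# described in the text.  `compress_dataset` is a hypothetical helper that
# compresses every image of the dataset at the given quality and returns
# (total_size_bytes, total_time_ms, mean_ssim); the early break assumes the
# mean SSIM decreases monotonically as the quality parameter decreases.

def ssim_filtered_setups(tools, dataset, ssim_limit, qualities=range(95, 0, -1)):
    surviving = []                                 # (time_ms, size_bytes) pairs
    for tool in tools:
        for q in qualities:                        # from high quality down to low
            size, time_ms, mean_ssim = compress_dataset(tool, dataset, quality=q)
            if mean_ssim < ssim_limit:
                break                              # lower qualities would violate the limit
            surviving.append((time_ms, size))
    return surviving                               # input for the time-budget optimizer
```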
We experimentally tested our approach on images acquired as in Section 3.2. The images were extracted from videos available at the following links (last accessed on 19 November 2024):
Each video was downloaded by connecting to the portal http://it.savefrom.net/ and then specifying the link and format (MP4 360p).
To perform the subdivision into frames, we used the GOM Player tool (available at the link https://www.gomlab.com/). The input images are in PPM format.
The tests on the lossy compression of images with a time limit were conducted on a personal computer with the following specifications:
CPU: 8 Core Intel Core i9 CPU 2.3 GHz;
RAM: 16.00 GB;
Operating system: macOS.
Table 7 lists all the compression tools used, the tool used to compute the SSIM, and the download links for the libraries:
We tested this lossy approach by using various configurations of time limits and quality limit. Here, we report a few.

4.1. Test 1

The dataset D to be compressed consisted of 1000 images, for a total size of 138.2 MB (138,243,000 bytes). Single runs of the compression tools with the indicated quality setups yielded the following results shown in Table 8:
From Figure 9, it is possible to observe that the fastest tool was WEBP10 while the slowest was JPEG_NO_OPTIMIZE15.
The best tool for output size was WEBP11, while the worst was JPEG_NO_OPTIMIZE3.
Figure 10 shows the useful setups.
Figure 11 shows the optimal mix. If the time limit is 3530 ms and the SSIM quality limit is set to 0.933483, the optimal mix is given by the mix of the WEBP10 and WEBP11 setups.
It is possible to observe that the optimum lies exactly on the convex hull.
The number of images to be compressed with WEBP10 is equal to 750 with quality 24, while for WEBP11 it is 250 with quality 23.
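(As a check from Table 8, not in the original: with t_WEBP10 = 3524 ms, t_WEBP11 = 3548 ms, and t’ = 3530 ms, the fraction for WEBP11 is (3530 − 3524)/(3548 − 3524) = 6/24 = 0.25, i.e., 250 of the 1000 images, consistent with the reported split.)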
Compressing D as suggested by the algorithm resulted in a compression time of 3683 ms, while the output size in bytes is equal to 9,433,432 bytes with a quality indicated by an SSIM index of 0.931679.
The time budget was used entirely. The very small margin of error, about 153 ms, is certainly negligible and can depend on many factors strictly related to the system (CPU usage, RAM, file reading/writing, etc.).

4.2. Test 2

The dataset D to be compressed consisted of 1000 images, for a total size of 692.2 MB (691,215,000 bytes). Single runs of the compression tools with the indicated quality setups yielded the following results shown in Table 9:
From Figure 12 it is possible to observe that the fastest tool was JPG1 with a quality setup of 95, while the slowest was JPEG_NO_OPTIMIZE14 with a quality setup of 20. The best tool for output size was WEBP11 with a quality setup of 3, while the worst was JPEG_NO_OPTIMIZE3 with a quality setup of 95.
Figure 13 shows the useful setups.
Figure 14 shows the optimal mix.
In the case where the time limit is 3300 ms and the SSIM quality limit is set to 0.874144, the optimal mix is given by the mix of the WEBP9 and WEBP10 setups.
The number of images to compress with WEBP9 was found to be 832 with a quality of 25, while for WEBP10 it was 168 with a quality of 14.
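(As a check from Table 9, not in the original: with t_WEBP9 = 3283 ms, t_WEBP10 = 3384 ms, and t’ = 3300 ms, the fraction for WEBP10 is (3300 − 3283)/(3384 − 3283) = 17/101 ≈ 0.168, i.e., 168 images, consistent with the split reported above.)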
Compressing D as suggested by the algorithm resulted in a compression time of 3768 ms.
The output size was 9,751,208 bytes with an SSIM quality index of 0.927159.
The time budget was practically used entirely.
Also, in this case there was a small margin of error of about 468 ms.

4.3. Test 3

The dataset D to be compressed consisted of 1000 images, for a total size of 692.2 MB (691,215,000 bytes). Single runs of the compression tools with the indicated quality setups yielded the following results shown in Table 10:
From Figure 15 it is possible to observe that the fastest tool was WEBP2 with a quality setup of 95, while the slowest was JPEG_NO_OPTIMIZE12 with a quality setup of 18.
The best tool for output size was WEBP8 with a quality setup of 1, while the worst was JPEG_NO_OPTIMIZE3 with a quality setup of 95.
Figure 16 shows the useful setups.
Figure 17 shows the optimal mix.
In the case where the time budget is t’ = 4800 ms and the SSIM quality limit is 0.873483, the optimal mix is given by the mix of the WEBP2 and WEBP8 setups.
The number of images to be compressed with WEBP2 is equal to 613 with a quality of 95, while for WEBP8 the images are 387 with a quality of 1.
Compressing D as suggested by the algorithm resulted in a compression time of 4868 ms, while the output size was 16,623,908 bytes and an SSIM of 0.927159.
The time budget was used practically entirely.
In this case, there was a very small margin of error of about 68 ms.

4.4. Test 4

The dataset D to be compressed consisted of 1000 images, for a total size of 692.2 MB (691,215,000 bytes). Single runs of the compression tools with the indicated quality setups yielded the following results shown in Table 11:
In this test, assuming that the desired quality limit is lower than or equal to an SSIM of 0.8955097, a single setup will be chosen for compression: the WEBP10 setup with a quality of 3.
This is because it is the best setup of all, both in terms of compression time and output size. The final output size will be 4,688,350 bytes, and the compression will take 3158 ms. Therefore, no mix of two setups can do better.

5. Conclusions

In this paper, we first explore a framework (closely aligned with the one proposed in [1]) for determining the optimal selection of compression algorithms in order to achieve the best allocation of computing resources in large-scale data storage environments. This approach is experimentally validated through the lossless compression of one-dimensional and two-dimensional data. We then extend this technique to lossy compression and successfully test it on the lossy compression of two-dimensional data.
The results of our experiments demonstrate that the proposed framework is highly efficient for compressing large quantities of images within a specified time budget. Across all input scenarios, it was observed that, compared to a traditional strategy, the optimized compression approach, which is based on identifying an optimal combination of algorithms, effectively eliminated idle periods in the system. By doing so, it maximized the use of the allocated time budget, leading to more efficient processing.
The algorithm performed reliably with all types of images (both animated and non-animated), regardless of whether the scenes were similar or varied. Additionally, the algorithm was evaluated in scenarios where the time budget was set very close to the runtime of one of the two optimal compression algorithms. Even in these cases, the algorithm performed correctly, distributing files among the compressors to ensure that the entire time budget was utilized effectively.
However, a key challenge remains in calculating the metrics of individual algorithms. If we assume that this activity can be performed once for a document and then the metrics can be reused for all future similar documents, then the proposed framework remains highly effective. On the other hand, if metrics have to be recalculated for each new document, it will be necessary to develop a method to estimate them as quickly as possible, which warrants further investigation.
Future research will focus on extending the algorithm’s experimentation to other types of digital data, such as medical or hyperspectral images (see [22,23]), or even to hybrid situations in which different types of digital data must be compressed together: sometimes using lossless compression, and at other times using lossy compression. The ability to handle a broad range of data types in an efficient and adaptive manner will be a valuable addition to this framework. In addition, as in [4], it could be interesting to study a mechanism that involves frequent sampling of the data stream to detect changes in its characteristics.

Funding

This work was partially supported by project SERICS (PE00000014) under the NRRP MUR program funded by the EU—NGEU.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

The author wishes to thank his students Giuseppe Cantisani, Vincenzo Ceci, Francesco Foglia, Alfonso Guarino, Christian Iodice, Pasquale Priscio, Andrea Sessa, and Antonio Sfera who developed, at different times, preliminary versions of the software that was used for the experimental tests in this paper.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Zohar, E.; Cassuto, Y. Data Compression Cost Optimization. In Proceedings of the Data Compression Conference (DCC 2015), Snowbird, UT, USA, 7–9 April 2015; pp. 393–402. [Google Scholar] [CrossRef]
  2. Carpentieri, B. Data Compression in Massive Data Storage Systems. In Proceedings of the International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA 2024), Victoria, Seychelles, 1–2 February 2024; pp. 343–348. [Google Scholar]
  3. Liao, K.; Moffat, A.; Petri, M.; Wirth, A. A Cost Model for Long-Term Compressed Data Retention. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM ’17), Hannover, Germany, 10–14 March 2017; pp. 241–249. [Google Scholar] [CrossRef]
  4. Wiseman, Y.; Schwan, K.; Widener, P. Efficient End to End Data Exchange Using Configurable Compression. In Proceedings of the 24th IEEE Conference on Distributed Computing Systems (ICDCS 2004), Tokyo, Japan, 24–26 March 2004; pp. 228–235. [Google Scholar]
  5. Ziv, J.; Lempel, A. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 1977, 23, 337–343. [Google Scholar] [CrossRef]
  6. Ziv, J.; Lempel, A. Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 1978, 24, 530–536. [Google Scholar] [CrossRef]
  7. Rizzo, F.; Storer, J.A.; Carpentieri, B. LZ-based image compression. Inf. Sci. 2001, 135, 107–122. [Google Scholar] [CrossRef]
  8. Moffat, A. Huffman encoding. ACM Comput. Surv. 2019, 52, 85. [Google Scholar]
  9. Witten, I.H.; Neal, R.M.; Cleary, J.G. Arithmetic coding for data compression. Commun. ACM 1987, 30, 520–540. [Google Scholar] [CrossRef]
  10. Deutsch, P. Rfc 1952: GZIP File Format Specification Version 4.3; RFC Editor: Marina del Rey, CA, USA, 1996. [Google Scholar] [CrossRef]
  11. Burrows, M.; Wheeler, D.J. A Block–Sorting Lossless Data Compression Algorithm; Research Report; Digital Systems Research Center: North Syracuse, NY, USA, 1994. [Google Scholar]
  12. Roelofs, G.; Koman, R. PNG: The Definitive Guide; O’Reilly Media: Sebastopol, CA, USA, 1999. [Google Scholar]
  13. Welch, T.A. A Technique for High-Performance Data Compression. Computer 1984, 17, 8–19. [Google Scholar] [CrossRef]
  14. Weinberger, M.J.; Seroussi, G.; Sapiro, G. The LOCO-I lossless image compression algorithm: Principles and standardization into JPEG-LS. IEEE Trans. Image Process. 2000, 9, 1309–1324. [Google Scholar] [CrossRef] [PubMed]
  15. Skodras, A.; Christopoulos, C.; Ebrahimi, T. The JPEG 2000 still image compression standard. IEEE Signal Process. Mag. 2001, 18, 36–58. [Google Scholar] [CrossRef]
  16. Sharma, K.; Gupta, K. Lossless data compression techniques and their performance. In Proceedings of the 2017 International Conference on Computing, Communication and Automation (ICCCA), Greater Noida, India, 5–6 May 2017; pp. 256–261. [Google Scholar] [CrossRef]
  17. Howard, P.G.; Vitter, J.S. Fast and efficient lossless image compression. In Proceedings of the DCC ′93: Data Compression Conference, Snowbird, UT, USA, 30 March–1 April 1993; pp. 351–360. [Google Scholar] [CrossRef]
  18. Saha, S. Image compression—From DCT to wavelets: A review. XRDS Crossroads ACM Mag. Stud. 2000, 6, 12–24. [Google Scholar] [CrossRef]
  19. Wallace, G. The JPEG still picture compression standard. IEEE Trans. Consum. Electron. 1992, 38, 18–34. [Google Scholar] [CrossRef]
  20. Ginesu, G.; Pintus, M.; Giusto, D.D. Objective assessment of the WebP image coding algorithm. Signal Process. Image Commun. 2012, 27, 867–874. [Google Scholar] [CrossRef]
  21. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  22. Pizzolante, R.; Carpentieri, B. Multiband and Lossless Compression of Hyperspectral Images. Algorithms 2016, 9, 16. [Google Scholar] [CrossRef]
  23. Pizzolante, R.; Carpentieri, B. Lossless, low-complexity, compression of three-dimensional volumetric medical images via linear prediction. In Proceedings of the 18th International Conference on Digital Signal Processing (DSP 2013), Fira, Greece, 1–3 July 2013; pp. 1–6. [Google Scholar] [CrossRef]
Figure 1. Optimal mix, t = 39,000 ms.
Figure 2. Optimal mix, t = 89,000 ms.
Figure 3. Optimal mix, t = 571,000 ms.
Figure 4. Optimal mix, t = 69,000 ms.
Figure 5. Optimal mix, t = 58,000 ms.
Figure 6. Optimal mix, t = 130,000 ms.
Figure 7. Optimal mix, t = 61,000 ms.
Figure 8. Optimal mix, t = 132,000 ms.
Figure 9. Setups.
Figure 10. Useful setups.
Figure 11. Optimal mix.
Figure 12. Setups.
Figure 13. Useful setups.
Figure 14. Optimal mix.
Figure 15. Setups.
Figure 16. Useful setups.
Figure 17. Optimal mix.
Table 1. Compression of 1000 html pages.
Algorithm | Compressed Size (Bytes) | Time to Compress (milliseconds)
Gzip | 7,700,660 | 1784
XZ | 6,793,836 | 25,974
Bzip2 | 7,216,597 | 16,514
Arithmetic Coding | 13,906,174 | 7928
Table 2. Compression tools.
Tool | Download | Configuration
PNG | https://www.idrsolutions.com/jdeli | Lossless
TIFF | https://www.idrsolutions.com/jdeli | Deflate
JPEG-LS | http://www.labs.hp.com/research/info_theory/loco/locodownold.html | Lossless 8 bits
JPEG2000 | http://www.dclunie.com/jj2000/JPEG%202000%20implementation%20in%20Java.html | Lossless
BMP | https://github.com/jai-imageio/jai-imageio-core | Compression with uncompressed pixel map (BI RGB)
FELICS | Java implementation | Lossless 8 bits
Table 3. Animated images with similar scenes.
Tool | Size of Output in Bytes | Compression Time in ms.
PNG | 183,362,330 | 70,806
TIFF | 169,626,226 | 91,082
JPEG-LS | 129,232,679 | 116,263
JPEG2000 | 145,893,942 | 564,370
BMP | 691,254,004 | 37,975
FELICS | 201,865,197 | 42,952
Table 4. Animated images with different scenes.
Tool | Size of Output in Bytes | Compression Time in ms.
PNG | 278,923,712 | 83,538
TIFF | 263,128,720 | 93,542
JPEG-LS | 188,777,508 | 127,575
JPEG2000 | 174,971,373 | 578,176
BMP | 691,254,300 | 36,992
FELICS | 252,695,812 | 58,873
Table 5. Non-animated images with similar scenes.
Tool | Size of Output in Bytes | Compression Time in ms.
PNG | 367,459,743 | 89,924
TIFF | 356,243,190 | 89,626
JPEG-LS | 237,640,546 | 132,333
JPEG2000 | 206,645,155 | 595,192
BMP | 691,254,001 | 36,900
FELICS | 316,963,000 | 75,920
Table 6. Non-animated images with different scenes.
Tool | Size of Output in Bytes | Compression Time in ms.
PNG | 326,568,532 | 97,109
TIFF | 311,831,649 | 100,836
JPEG-LS | 226,142,667 | 130,656
JPEG2000 | 182,808,175 | 569,792
BMP | 691,254,002 | 36,708
FELICS | 289,826,020 | 67,992
Table 7. Compression tools.
Tool | Download | Configuration
JPG | https://github.com/LuaDist/libjpeg | JPEG lossy optimized
WEBP | https://github.com/webmproject/libwebp | WebP lossy
JPEG NO OPTIMIZE | https://github.com/LuaDist/libjpeg | JPEG lossy not optimized
SSIM | https://imagemagick.org/index.php | SSIM
Table 8. Test 1.
Tool | Size of Output in Bytes | Time in ms. | Quality Parameter | SSIM
JPG1 | 60,837,185 | 3652 | 95 | 0.987475
WEBP2 | 43,938,844 | 4894 | 95 | 0.988174
JPEG-NO OPTIMIZE3 | 63,865,873 | 5745 | 95 | 0.987482
JPEG4 | 16,576,161 | 5636 | 40 | 0.942975
JPEG5 | 15,107,663 | 5632 | 34 | 0.940295
JPEG6 | 14,934,171 | 5754 | 33 | 0.939170
JPEG7 | 14,607,030 | 5693 | 32 | 0.937845
WEBP8 | 12,025,946 | 3791 | 40 | 0.949340
WEBP9 | 9,976,364 | 3560 | 27 | 0.937786
WEBP10 | 9,481,598 | 3524 | 24 | 0.934067
WEBP11 | 9,313,614 | 3548 | 23 | 0.932677
JPEG-NO OPTIMIZE12 | 17,732,502 | 5888 | 40 | 0.942941
JPEG-NO OPTIMIZE13 | 16,365,984 | 6046 | 34 | 0.940250
JPEG-NO OPTIMIZE14 | 16,206,014 | 6030 | 33 | 0.939123
JPEG-NO OPTIMIZE15 | 15,901,162 | 6172 | 32 | 0.937797
Table 9. Test 2.
Tool | Size of Output in Bytes | Time in ms. | Quality Parameter | SSIM
JPG1 | 66,567,945 | 3140 | 95 | 0.984874
WEBP2 | 45,131,526 | 4321 | 95 | 0.985433
JPEG-NO OPTIMIZE3 | 70,390,195 | 5305 | 95 | 0.984874
JPEG4 | 21,543,285 | 5680 | 48 | 0.944169
JPEG5 | 15,001,444 | 5613 | 25 | 0.907689
JPEG6 | 13,186,571 | 5683 | 20 | 0.889513
JPEG7 | 12,510,155 | 5667 | 18 | 0.882732
WEBP8 | 13,685,922 | 3482 | 48 | 0.95248
WEBP9 | 10,041,668 | 3283 | 25 | 0.93276
WEBP10 | 8,200,328 | 3384 | 14 | 0.91429
WEBP11 | 5,839,634 | 4032 | 3 | 0.874497
JPEG-NO OPTIMIZE12 | 22,619,623 | 5732 | 48 | 0.944170
JPEG-NO OPTIMIZE13 | 16,340,793 | 5938 | 25 | 0.907669
JPEG-NO OPTIMIZE14 | 14,648,964 | 6368 | 20 | 0.889482
JPEG-NO OPTIMIZE15 | 14,017,389 | 5984 | 18 | 0.882703
Table 10. Test 3.
Tool | Size of Output in Bytes | Time in ms. | Quality Parameter | SSIM
JPG1 | 43,432,750 | 5672 | 95 | 0.972077
WEBP2 | 25,647,408 | 4728 | 95 | 0.980147
JPEG-NO OPTIMIZE3 | 45,836,232 | 5911 | 95 | 0.972077
JPEG4 | 13,738,395 | 5795 | 48 | 0.912076
JPEG5 | 9,696,851 | 5493 | 25 | 0.8863093
JPEG6 | 8,613,749 | 5974 | 20 | 0.8786752
JPEG7 | 8,198,962 | 5741 | 18 | 0.8743648
WEBP8 | 3,603,432 | 4914 | 1 | 0.8735347
JPEG-NO OPTIMIZE9 | 14,969,285 | 6431 | 48 | 0.912082
JPEG-NO OPTIMIZE10 | 11,228,397 | 6629 | 25 | 0.886300
JPEG-NO OPTIMIZE11 | 10,266,529 | 6980 | 20 | 0.878630
JPEG-NO OPTIMIZE12 | 99,131,536 | 6999 | 18 | 0.874339
Table 11. Test 4.
Tool | Size of Output in Bytes | Time in ms. | Quality Parameter | SSIM
JPG1 | 55,165,877 | 5096 | 95 | 0.983708
WEBP2 | 34,071,556 | 4251 | 95 | 0.98584
JPEG-NO OPTIMIZE3 | 57,648,359 | 5201 | 95 | 0.983713
JPEG4 | 17,071,808 | 5658 | 51 | 0.9462306
JPEG5 | 12,384,938 | 5702 | 29 | 0.9226299
JPEG6 | 11,053,309 | 5536 | 24 | 0.9121278
JPEG7 | 10,516,324 | 5875 | 22 | 0.9027293
JPEG8 | 9,873,987 | 5781 | 20 | 0.90005430
WEBP9 | 4,908,124 | 3260 | 4 | 0.90047801
WEBP10 | 4,688,350 | 3158 | 3 | 0.8955097
JPEG-NO OPTIMIZE11 | 18,071,969 | 5913 | 51 | 0.946195
JPEG-NO OPTIMIZE12 | 13,702,370 | 5926 | 29 | 0.922539
JPEG-NO OPTIMIZE13 | 12,492,529 | 6028 | 24 | 0.912026
JPEG-NO OPTIMIZE14 | 12,018,149 | 6018 | 22 | 0.902617
JPEG-NO OPTIMIZE15 | 11,446,486 | 6135 | 20 | 0.899925