First Gradually, Then Suddenly: Understanding the Impact of Image Compression on Object Detection Using Deep Learning

Video surveillance systems process high volumes of image data. To enable long-term retention of recorded images, and because of the data transfer limitations in geographically distributed systems, lossy compression is commonly applied to images prior to processing, but this causes a deterioration in image quality due to the removal of potentially important image details. In this paper, we investigate the impact of image compression on the performance of object detection methods based on convolutional neural networks. We focus on Joint Photographic Experts Group (JPEG) compression and thoroughly analyze a range of performance metrics. Our experimental study, performed over a widely used object detection benchmark, assessed the robustness of nine popular object-detection deep models against varying compression characteristics. We show that our methodology can allow practitioners to establish an acceptable compression level for specific use cases; hence, it can play a key role in applications that process and store very large image data.


Introduction
Real-life applications of object detection, such as intelligent video surveillance systems, face a number of practical challenges. One critical challenge is how to handle a large volume of image data efficiently. Since city surveillance systems are spread geographically and include hundreds (or even thousands) of cameras, processing captured data requires compression, usually in a lossy manner. Such lossy compression ultimately degrades the quality of images because it discards a portion of the contained information. Therefore, adjusting the compression settings leads to a trade-off between image quality and storage/transfer requirements and constraints.
Object detection is a prominent task in computer vision. It has received much research attention because of its cornerstone role in many practical applications ranging from personal photography to security and surveillance. Additionally, 2D object detection can play a key role in many other areas, such as 3D sensing (e.g., when combined with LIDAR data [1]) or autonomous driving [2], in which detecting tiny objects such as traffic signs in real time is crucial [3]. Although there are numerous classical machine learning approaches for object detection, deep-learning object detectors are advantageous compared to other kinds of algorithms for the following reasons:
• They offer a high object-detection performance that outperforms classical approaches [4];
• They can be trained to detect new classes of objects without manually programming new algorithms or feature extractors [5]; and
• Hardware acceleration to address their substantial computational needs is readily available, thus allowing end-to-end training of large models.
Although there have been attempts to verify the impact of lossy image and video compression on the performance of deep convolutional architectures applied in various computer vision tasks (e.g., human pose estimation, semantic segmentation, object detection, action recognition, or monocular depth estimation [6,7]), the quality of images after lossy compression was considered primarily with human perception in mind [8][9][10]. We follow the former research pathway, and our main objective is to understand how image compression affects the performance of deep-learning models for object detection.

Related Work
Most research into object detection from image data does not concurrently take into account image quality and lossy compression, or their impact on object detection performance. However, there are some works on the performance of deep convolutional neural networks (CNNs) under different conditions, such as quality degradation resulting from input data compression. (Note that compressing CNNs and elaborating resource-frugal deep models is another interesting research area, in which the aim is to obtain compact models that occupy less memory and infer faster, ideally without degrading the abilities of the algorithm [11,12].)
Dodge and Karam [13] investigated the influence of image quality on the performance of image classification, which is similar to object detection but without the localization requirement. One of the methods of quality degradation was lossy image compression, using JPEG and JPEG2000. The study was based on the ImageNet 2012 1000-class dataset, specifically on 10,000 images drawn from the ILSVRC 2012 validation set [14]. Four deep architectures (two variants of AlexNet, alongside the larger VGG-16 network and GoogLeNet [15]) were exploited in the experimental study that covered five types of image distortions: (i) additive Gaussian noise, (ii) blur via convolution using a Gaussian kernel, (iii) contrast reduction via blending with a uniform gray image with a varying blending factor, (iv) JPEG compression at different quality levels (reflected by the Q parameter), and (v) JPEG2000 compression with a different target peak signal-to-noise ratio (PSNR). The authors measured Top-5 classification accuracy (the percentage of classifications where the correct class was among the five most confident predictions), as well as the strict (Top-1) accuracy. The experiments indicated a significant influence of blurring, together with a medium influence of noise and a high robustness against the contrast degradation. Both image compression methods were found to have an impact on the classification performance. For all considered models, the accuracy did not decrease significantly for JPEG quality levels from 20 to 100.
In [16], the authors described methods for generating additive, seemingly random, noise with low amplitude, causing models to classify modified images wrongly. Although such alterations are perceivable by humans, they do not affect object recognition capabilities. This showed that the image quality can be regarded differently for human perception and for (at least some) deep CNN models. This observation was explored further in [13] for the image classification task. In the work reported here, we tackle the problem of understanding the influence of lossy compression on deep-learning object detection, which remains an open question in the literature.

Contribution
We investigate the influence of JPEG compression on the performance of CNNs for object detection. This compression method is widely used in various real-life applications, such as digital photography or document archiving. Moreover, other lossy compression algorithms, including video encoders, are based on the same principles [17]. The insights learned from our study can therefore be generalized to many other compression techniques as well. Our contribution centers around the following points:
• We devise a fully reproducible computational study to thoroughly assess the influence of varying compression levels on the performance of a representative collection of models for object detection (including one-stage and two-stage detection pipelines), using a well-known validation dataset [18].
• We analyze the changes in performance using both an aggregated performance score and separate metrics of recall and precision, which allows us to observe differences in behavior between models in detail.
• We show how different architectures behave with respect to the confidence threshold, and we present examples of high and low sensitivity to this threshold, which needs to be taken into account while balancing precision and recall.
• We examine the influence of object size (quantified as the object's area in pixels) on the robustness of object detection.
Our findings have a practical application in systems that use deep CNNs for object detection and lossy compression for image transfer or storage. The insights concerning the nature of the trade-off between the compression level and detection performance can lead to better decisions about the underlying compression parameters and to a reduction in data storage requirements, while maintaining an acceptable performance of the deep learning-powered object detectors.

Paper Structure
The remainder of this paper is organized as follows. Section 2 describes the materials and methods used in our study. Here, we discuss the JPEG compression algorithm, together with the deep-learning-powered object detection and the metrics that are commonly used to quantify the performance of such techniques. We also elaborate on the benchmark dataset and the models exploited in our study. Section 3 presents and discusses the experimental results, and Section 4 concludes the paper.

JPEG Image Compression
A high-level JPEG image data compression flowchart is rendered in Figure 1. We discuss its pivotal steps in more detail in the following subsections. Figure 1. The JPEG compression pipeline. The lossy step is indicated as a rounded block.

Block Transform
The lossy compression used in the JPEG standard is based on the discrete cosine transform (DCT) [19]. The forward DCT (FDCT), also known as DCT-II, processes an 8 × 8 block of samples (f(x, y) ∈ [0, 255]), producing an 8 × 8 block of DCT coefficients F(u, v) using the following formula [20]:

F(u, v) = (1/4) C(u) C(v) Σ_{x=0..7} Σ_{y=0..7} f(x, y) cos[(2x + 1)uπ/16] cos[(2y + 1)vπ/16],

where C(u) and C(v) are the normalization constants defined as follows:

C(k) = 1/√2 for k = 0, and C(k) = 1 otherwise.

This transform can be reversed using the associated inverse DCT (IDCT), also known as DCT-III, which is defined as follows:

f(x, y) = (1/4) Σ_{u=0..7} Σ_{v=0..7} C(u) C(v) F(u, v) cos[(2x + 1)uπ/16] cos[(2y + 1)vπ/16].
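For illustration, the FDCT/IDCT pair above can be transcribed directly into NumPy (an unoptimized sketch following the definitions, not the fast factored DCT used by real encoders):

```python
import numpy as np


def fdct_8x8(block):
    """Forward 8x8 DCT (DCT-II) per the JPEG definition: f(x, y) -> F(u, v)."""
    C = lambda k: 1 / np.sqrt(2) if k == 0 else 1.0
    F = np.zeros((8, 8))
    for u in range(8):
        for v in range(8):
            s = sum(block[x, y]
                    * np.cos((2 * x + 1) * u * np.pi / 16)
                    * np.cos((2 * y + 1) * v * np.pi / 16)
                    for x in range(8) for y in range(8))
            F[u, v] = 0.25 * C(u) * C(v) * s
    return F


def idct_8x8(F):
    """Inverse 8x8 DCT (DCT-III): F(u, v) -> f(x, y)."""
    C = lambda k: 1 / np.sqrt(2) if k == 0 else 1.0
    f = np.zeros((8, 8))
    for x in range(8):
        for y in range(8):
            s = sum(C(u) * C(v) * F[u, v]
                    * np.cos((2 * x + 1) * u * np.pi / 16)
                    * np.cos((2 * y + 1) * v * np.pi / 16)
                    for u in range(8) for v in range(8))
            f[x, y] = 0.25 * s
    return f
```

Applying `idct_8x8(fdct_8x8(block))` reproduces the original block up to floating-point error; a constant block yields a single non-zero DC coefficient F(0, 0).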

Quantization
The output of the DCT is quantized: each DCT coefficient in the 8 × 8 block is divided by its corresponding quantization table (QT) entry Q(u, v) ∈ [1, 255], and the result is rounded to the nearest integer. This operation is only approximately reversed during the decompression process: before the inverse transform, the coefficients are multiplied by their QT entries. The larger the divisor in the QT, the fewer discrete quantized coefficient values can be generated. Finally, DCT coefficients whose magnitude is smaller than half of the corresponding QT entry become zero.
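The quantization step and its approximate inverse can be sketched as follows (note that coefficients below half the QT entry round to zero and are lost for good):

```python
import numpy as np


def quantize(F, qt):
    """Divide each DCT coefficient by its QT entry, rounding to the nearest integer."""
    return np.round(F / qt).astype(int)


def dequantize(Fq, qt):
    """Approximate inverse: multiply the quantized values back by the QT entries."""
    return Fq * qt
```

For example, with a QT entry of 16, a coefficient of 7 quantizes to 0 (irrecoverable), while 9 quantizes to 1 and dequantizes to 16, not 9; this rounding error is the lossy part of JPEG.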

Setting the JPEG Quality
The Q parameter, which is used to control image quality in JPEG compression, is an integer in the range [1, 100], with smaller values indicating lower quality and smaller output size, and higher values corresponding to better quality at the cost of size (hence, "weaker" compression); see Figure 2. Interestingly, the value of 100 does not correspond to lossless compression, but to the configuration that introduces the smallest possible information loss, which is achieved when there are only ones in the QT. For Q = 1, all divisors in the QTs are equal to 255. The QTs used in this research are available at https://github.com/tgandor/urban_oculus/tree/master/jpeg/quantization (accessed on 19 December 2021).
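A sketch of how Q is commonly mapped to QT entries, assuming the IJG (libjpeg) scaling rule applied to the standard base tables (the exact tables used by a given encoder may differ):

```python
def scale_qt(base_table, Q):
    """Scale a base quantization table by quality Q in [1, 100] (IJG rule).

    Q = 100 yields an all-ones table (minimal loss); low Q saturates
    entries at 255 (strongest quantization)."""
    assert 1 <= Q <= 100
    scale = 5000 // Q if Q < 50 else 200 - 2 * Q
    return [min(max((b * scale + 50) // 100, 1), 255) for b in base_table]
```

With this rule, Q = 50 reproduces the base table exactly, Q = 100 gives all ones, and Q = 1 drives every entry of the standard luminance table to 255, matching the behavior described above.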
The JPEG data compression ratio may vary since it depends on image content; e.g., a single-level 8 × 8 block needs to store one DCT coefficient, but high-frequency areas may have 64 non-zero coefficients even after quantization. Table 1 shows the compression ratio statistics for a set of the Q values computed on the COCO val2017 set. Image quality can be defined in multiple ways, one of which uses full-reference image quality metrics (FIQMs). They can be applied if the reference image is available, so that the similarity can be expressed numerically. Such metrics include the mean squared error (MSE) and the root mean square error (RMSE), which depend on the dynamic range of the image. Still, they are useful in various scenarios; e.g., they can be exploited as loss functions while training image autoencoders. The peak signal-to-noise ratio (PSNR) is measured in decibels and is computed as

PSNR = 10 log10(MAX² / MSE),

where MAX is the maximum possible pixel value (255 for 8-bit images). It is worth mentioning that PSNR is used in the JPEG2000 algorithm to control the compression quality, but it suffers from several limitations that were pointed out by Wang and Bovik [22]. A popular FIQM which overcomes these shortcomings is the structural similarity index metric (SSIM) [21]. In contrast to MSE, which is computed pixel-wise, SSIM uses a block-wise computation, which is averaged for the entire image. For each block, the metric compares the average pixel value, the standard deviation, and the covariance of two blocks, and normalizes the result to [0.0, 1.0]. Table 2 gathers the SSIM statistics obtained for the selected Q values over the COCO val2017 set.
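The PSNR formula above is straightforward to implement for 8-bit images (a minimal sketch; identical images yield an infinite PSNR by convention):

```python
import numpy as np


def psnr(reference, degraded, max_value=255.0):
    """Peak signal-to-noise ratio in decibels; higher means more similar."""
    mse = np.mean((reference.astype(float) - degraded.astype(float)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * np.log10(max_value ** 2 / mse)
```

For instance, two images differing by exactly one gray level at every pixel have MSE = 1 and therefore PSNR = 20 log10(255) ≈ 48.13 dB.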

Deep Learning in Object Detection
Deep learning has been blooming in the field of object detection [23], and a plethora of techniques benefiting from automated representation learning have been proposed for this task so far [24]. The following subsections discuss such approaches in more detail.

The Object Detection Pipeline
A high-level flowchart of the object detection pipeline that exploits deep learning is rendered in Figure 3. Such deep learning-powered models typically include a feature-extracting backbone followed by one or more detection heads; we describe these components in the following subsections.

The Backbone for Feature Extraction
The object detection pipeline commonly starts with feature extraction, which may be followed by feature selection [25]. A flowchart of the backbone that extracts features, together with an optional feature enhancement pathway, is shown in Figure 4. The enhancement is achieved by building a feature pyramid network (FPN) [26], which is a fusion of high- and low-level features [27]. Before the features are actually extracted, the input images are pre-processed, a step that often includes resizing (so that they can be fed into the input layer) and standardization. The backbones are usually taken directly from a well-established deep image classifier [28,29]. The most popular backbones encompass the ResNet [30], ResNeXt [31], DarkNet [24], MobileNet [32], and EfficientNet [33] architectures.

Single-Stage Detectors
Single-stage detector architectures (also referred to as the single-shot [24,34] and dense detectors [35]) perform prediction directly on the output features of the backbone network. Single-shot detectors process the input image (i.e., extract the features) only once [24], which was not the case in the earlier two-stage detectors. Additionally, every point of the final feature map (or feature maps for feature pyramids) can potentially detect a specified number of objects. Finally, localization and classification tasks can be handled by two separate sub-networks, as in RetinaNet [35], or a single fully convolutional network, as presented in [36].

Two-Stage Detectors
The two-stage detectors (also referred to as the sparse detectors) are built with two functional blocks (as shown in Figure 5):
• The region of interest (ROI) proposal mechanism, which generates the locations (boxes) in the image where an object of any class can be found; when implemented as a neural network, it is known as the region proposal network (RPN).
• The ROI heads, which evaluate the proposals, producing detections.
Figure 5. A two-stage (sparse) detector head: the deep features feed the RPN, whose proposals are evaluated by the ROI head to produce detections.
In the two-stage detection approach, the detection occurs only in a limited number of regions, which were produced by the RPN, and not across the entire image. Therefore, the most important quality metric related to the RPN is its recall.
An ROI head performs the second stage of sparse object detection. It takes a proposal from the RPN together with the deep features from the backbone. The features relevant to a given region are processed through an operation called ROI pooling and are fed to the networks that localize and classify the objects.

Performance Metrics for Object Detection
This section describes the performance metrics that apply to the task of object detection. Here, we show which detections count as true positives (TP) and false positives (FP), how the results are aggregated over a benchmark dataset, and which parameters (thresholds) can be specified for the metrics.

Assessing a Single Detection
A single detection returned by the model needs to be categorized as a TP or FP. To determine this, the intersection over union (IoU) is commonly used. This value is the result of dividing the area of the intersection of the ground-truth and predicted boxes (or zero if the boxes are disjoint) by the area of their union.
The threshold value for IoU, denoted as T IoU , is a parameter of the object detector evaluation: a detection is treated as a TP if there exists a ground truth (GT) box for the same class with an IoU ≥ T IoU ; otherwise, it is treated as a FP. The choice of the T IoU value strongly influences the quantitative results. Thus, only the values obtained with the same T IoU should ever be compared. Commonly, the metrics found in the literature specify the threshold used: too low a threshold can lead to an over-optimistic evaluation and an incorrect assignment to the GT boxes, whereas too large a threshold may cause many correct detections to be rejected, especially if the GT boxes are not accurate. The lowest widely used T IoU is 0.5, followed by 0.75 when the localization accuracy requirements are strict. Each detection has a confidence p; we apply the confidence threshold T c and process the detections with p ≥ T c .
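The IoU of two axis-aligned boxes can be computed as follows (a minimal sketch; boxes are assumed to be given as (x1, y1, x2, y2) corner coordinates):

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes; 0.0 if disjoint."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes score 1.0, disjoint boxes 0.0, and two 2 × 2 boxes overlapping in a 1 × 2 strip score 2/6 = 1/3, below the common T IoU = 0.5 cutoff.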

Detecting the Unlabeled Objects: Crowds
The object detection datasets may include some annotations designated as "crowd". These regions include many objects of the same class without individual object annotations. We record the number of such detections, and refer to them as the "extra" detections (EX).

Precision, Recall and the F1-Score
Once the number of TP and FP detections has been determined for specific values of T IoU and T c , we can calculate the precision (positive predictive value, PPV), which is the ratio of TP to the total number of detections:

PPV = TP / (TP + FP).

To calculate the recall metric (also called sensitivity or true positive rate, TPR), we additionally need the number of objects in the GT that were not detected, i.e., the false negatives (FN). This metric becomes

TPR = TP / (TP + FN).

The F1-score aggregates TPR and PPV into a single value in the range [0, 1] by using the harmonic mean:

F1 = 2 · PPV · TPR / (PPV + TPR).

The T c parameter can be used to tune the above performance metrics: increasing the threshold potentially increases precision at the cost of recall, and lowering it has the opposite effect.
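The three formulas above reduce to a few lines of code (a sketch with a conventional guard returning 0.0 when a denominator vanishes):

```python
def detection_scores(tp, fp, fn):
    """Precision (PPV), recall (TPR), and F1 from detection counts."""
    ppv = tp / (tp + fp) if tp + fp else 0.0  # precision
    tpr = tp / (tp + fn) if tp + fn else 0.0  # recall
    f1 = 2 * ppv * tpr / (ppv + tpr) if ppv + tpr else 0.0
    return ppv, tpr, f1
```

For example, 8 TP, 2 FP, and 2 FN give PPV = TPR = F1 = 0.8.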

The Precision-Recall Curve and Average Precision
All detections for all images in a benchmark dataset are first sorted by their confidence in descending order. When considering the top k elements, the TP k and FP k values can be used to compute the running precision PPV k = TP k /(TP k + FP k ) and recall TPR k = TP k /GT, where GT is the number of ground-truth objects in the dataset. TPR k is a non-decreasing series, but PPV k is not monotonic. To convert PPV k into a non-increasing curve, we used the interpolated precision PPV k = max i≥k PPV i . To efficiently compute the average precision (AveP), we sampled the interpolated precision at a fixed set of recall values R (Figure 6D); the AveP in this approximation becomes the arithmetic mean:

AveP = (1/|R|) Σ_{r∈R} PPV(r),

where PPV(r) is the interpolated precision at the smallest k for which TPR k ≥ r (or zero if this recall level is never reached). Finally, the set of the recall samples R is evenly spaced from 0 to 1, usually by 0.1 (as proposed in [37]) or by 0.01 (as exploited in [18]). The code for evaluating the AP metric is available at https://github.com/cocodataset/cocoapi/ (accessed on 19 December 2021).
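The procedure above can be sketched for a single class (a simplified illustration of the sampling scheme, not the full COCO evaluator, which additionally matches detections to GT boxes per image and per T IoU ):

```python
import numpy as np


def average_precision(confidences, is_tp, num_gt, recall_samples=None):
    """AP by sampling interpolated precision at fixed recall points."""
    if recall_samples is None:
        recall_samples = np.linspace(0.0, 1.0, 101)  # step 0.01, as in [18]
    order = np.argsort(-np.asarray(confidences))      # sort by confidence, descending
    hits = np.asarray(is_tp, dtype=float)[order]
    tp = np.cumsum(hits)
    fp = np.cumsum(1.0 - hits)
    recall = tp / num_gt
    precision = tp / (tp + fp)
    # Interpolation: make precision non-increasing, PPV_k = max_{i >= k} PPV_i.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    # Sample precision at each recall level; 0 where that recall is never reached.
    sampled = [precision[recall >= r][0] if np.any(recall >= r) else 0.0
               for r in recall_samples]
    return float(np.mean(sampled))
```

A detector whose every detection is a TP scores AP = 1.0; a single TP preceded by one higher-confidence FP halves the interpolated precision everywhere, giving AP = 0.5.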

The Performance Metrics Selected for This Study
After each step in the image degradation, the objects are detected using each investigated model, and the performance metrics are computed. The parameters were set as follows: T c = 0.5 (the confidence p cutoff), T IoU = 0.5 (for the metrics using a single IoU threshold, except mAP .75 , for which T IoU = 0.75), and T IoU ∈ {0.5, 0.55, . . . , 0.9, 0.95} for AP. The following object detection performance metrics were evaluated for all experiments:
• AP: the overall AP metric,
• AP s , AP m , AP l : AP separately for small (below 32² = 1024 pixels of area), medium (between 1024 and 96² = 9216 pixels), and large (above 9216 pixels) objects,
• mAP .5 , mAP .75 : the mean average precision for two different T IoU values.
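The size split above can be expressed as a small helper; the handling of the exact boundary values (1024 and 9216 pixels) follows the COCO convention and is an assumption where the text only says "between":

```python
def size_bucket(area):
    """COCO-style object-size bucket from the object's area in pixels."""
    if area < 32 ** 2:      # below 1024 px: contributes to AP_s
        return "small"
    if area < 96 ** 2:      # 1024 .. 9216 px: contributes to AP_m
        return "medium"
    return "large"          # above 9216 px: contributes to AP_l
```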

Qualitative Assessment of the Detection Performance
The 5000 images included in the val2017 dataset, multiplied by 100 quality settings and 9 models, give 4.5 million possible images with detections, which is infeasible to analyze manually. However, we analyzed a subset of the detections qualitatively and proposed names for the unwanted behaviors of the detectors. The errors encompass:
• omission of an object (FN), the most common error,
• wrong classifications (a bounding box around an object with the wrong category returned, sometimes alongside a correct detection of that very object),
• mistaken objects (detecting real objects with a correct bounding box, but of a category not present in the GT),
• detections of unrelated objects at random places in the image ("hallucinating"),
• loss of bounding box accuracy,
• selecting only part of an object or having multiple selections of the object (loss of continuity),
• one box covering multiple objects (cluster) or parts of different objects from the same category (chimera).

Reproducibility Information
All Python code and Jupyter notebooks associated with the study were published at https://github.com/tgandor/urban_oculus (accessed on 19 December 2021). The input data is available for download at https://cocodataset.org/ (accessed on 19 December 2021), and the raw output data (detections in the JSON format) was deposited in a public data repository at https://doi.org/10.7910/DVN/UPIKSF (accessed on 19 December 2021).

Benchmark Dataset
There is a plethora of datasets for object detection [23], such as PASCAL VOC, ImageNet, Open Images and MS COCO. We exploited the validation subset of the COCO Detection Challenge 2017, which is called val2017 for short. It consists of 5000 images of objects in natural environments. These are known as non-iconic images [18], in contrast to iconic images which are typically used for image classification. There were 80 object categories and 36,781 annotated objects in total. The number of objects in each category was uneven: the top 3 of them were people (11,004), cars (1932), and chairs (1791), while the least represented two classes contained only 11 and 9 object instances. The original JPEG quality of the images had the following distribution: Q = 96 (3540 images), Q = 90 (1414 images), and Q = 80 (46 images). Finally, there were 134 grayscale images.

The Investigated Deep Models
For object detection, we used nine pre-trained deep models taken from the Detectron2 [38] Model ZOO available at https://github.com/facebookresearch/detectron2 /blob/master/MODEL_ZOO.md (accessed on 19 December 2021). The models were given the following identifiers: R101, R101_C4, R101_DC5, R101_FPN, R50, R50_C4, R50_DC5, R50_FPN, X101 (Table 3). This choice of models was comprehensive, and covered both one-stage (RetinaNet) and two-stage (Faster R-CNN) detectors, as well as different variants of Faster R-CNN (with and without the feature pyramid). As the backbones, we used two ResNet depths (50 and 101 layers), and there was one backbone using ResNeXt-101 (X101). The non-FPN Faster R-CNNs had two kinds of backbones: the first (C4) used a standard ResNet, and the other (DC5) exploited dilated convolutions (DC). Finally, both RetinaNets included an FPN. All the models were trained on the train2017 dataset [18], which was the training subset of the COCO Detection Challenge 2017. It encompassed the same 80 categories as val2017, but there were many more images (117,266) and object annotations (849,949). The stochastic gradient descent optimizer with a 0.9 momentum value, 270,000 iterations, and 16 images per batch (36 epochs in total) was used to train the deep models.

Image Degradation
This step compresses all images from the benchmark set to a certain quality setting Q. For this task, the mogrify program from the ImageMagick suite (available at https://imagemagick.org/; accessed on 19 December 2021) was used throughout the computational experiments, invoked with its -quality option set to the target Q value. Since this operation is deterministic, there was no need to publish the degraded images for reproducibility. The Q parameter took all integer values from 1 to 100. Importantly, no spatial transformations were applied to the input images; hence, the object locations and classes remained unchanged throughout the experiment.

The Baseline Results
The inference was first executed for all the models on unchanged images (the baseline), with T c = 0.5. These baseline results are presented in Table 4. The best results are in bold, and the second-best are underlined; AP l , AP m , AP s -the AP metric for objects classified as large, medium and small; TPR, PPV-recall and precision at T IoU = 0.5; TP, FP-true positives and false positives at T IoU = 0.5; EX-"extra" detection of objects in the crowd regions.

The AP and Related Metrics
For the metrics based on the AveP (AP, mAP, and the per-size AP values), X101 was the dominant model, except for AP l , where the simple R101_C4 model achieved 1.5 percentage points more. However, the second places behind X101 in AP and mAP were specific to the metric. This meant that the benefits of FPN manifested themselves in a higher precision of the bounding boxes, which improved mAP for higher T IoU , and thus also AP. The RetinaNets fell behind the two-stage models in these metrics because of low TPR values (around 50%) compared to the 63-68% range of the Faster R-CNNs.

Counting the Objects (TP, FP, EX), Recall and Precision
In the baseline results from Table 4 for the detected object counts, the TPR and PPV of the detection were more nuanced than the AP-related ranking. The largest numbers of objects were found by the classic Faster R-CNNs (about 24.5k, or 67-68% TPR). This was closely followed by X101 with 66%, and the two other FPN-based models achieving 64-65%. The ranking was concluded by the RetinaNets, which achieved only 50-52%, detecting 18-18.5k objects. Interestingly, the precision ranking was exactly reversed. The RetinaNets returned only about 4.3k FP, which was less than 20% of their total detections (81% PPV). This precision was more than 10 percentage points higher than that of the next group (FPN-based models), which had precision values in the range of 67-70%. The non-FPN two-stage models were another 10 percentage points below that, with a PPV ranging from 57 to 59%. The EX metric, which counted the additional objects in the "crowd" regions, was similar to the TPR, but there were greater differences between the RetinaNets (about 800 detections), the FPN models (2.5-2.8k detections), and the non-FPN models (4.5-4.7k detections). Surprisingly, the ResNet-50 Faster R-CNNs produced even more EX detections than those based on ResNet-101.

Discussing the Impact of T c
For practical applications such as video surveillance, where a detected object triggers a resource-consuming intervention, an appropriate T c value needs to be determined in advance. Without the benefit of GT annotations, the risk of missing objects needs to be balanced against the cost of human attention dedicated to reviewing FPs, by means of setting the right T c value. Setting T c = 0.5 is a good simulation of such a situation, because it expresses the greater prior probability of a detection being correct than false. Having collected all detections with a confidence p ≥ 0.05, we examined the baseline models' behavior over a wide range of T c values. Figure 7 shows the precision, recall and F1-score of each model in our study as a function of T c . The T c value with the best F1-score is not necessarily the best threshold for any given detection task, but it informs us about the trade-off between the TPR and PPV. When the shape of F1 as a function of T c is steep, with a small region of values close to the maximum, the corresponding model is highly sensitive to the choice of threshold. A flatter shape, with a plateau in the neighborhood of the maximum, indicates that the choice of T c may be more arbitrary, and favoring either the TPR or PPV does not disproportionately affect the other metric. Looking at Figure 7, we can confirm that T c = 0.5 was an acceptable choice for all the models in this study.

Detection Results on Degraded Images
Examples of highly degraded images together with their detections, and-for TP-the GT boxes, are presented in Figures 8-10. For every GT object, it is possible to indicate the minimal compression quality at which it was detected by a selected (or by every) model. Surprisingly, there are objects that were detected by all investigated deep-learning detectors even at Q = 1. Conversely, there were cases where the detector made systematic errors (wrong classification or hallucinating the object), up to a certain quality, above which we could observe the correct behavior. Two examples of images demonstrating the "minimal Q" are gathered in Figure 11.

Performance Metrics as a Function of Q
Considering the performance metrics as functions of Q allowed us to analyze the rate of change by taking a discrete derivative, and to plot the metrics against the compression quality. In the following subsections, we discuss this in more detail.

The Precision, Recall and F1 Metrics
The metrics dependent on T c are presented in Figure 12. Here, we can appreciate the nearly constant value of precision across the range of Q. As a consequence, the general decline in performance, in this case measured with the F1 metric, was due to the worsening of recall. The shape of the curves for these metrics depended on the model family. The non-FPN Faster R-CNNs had TPR and PPV values closest to one another, with the TPR starting out higher than the PPV and becoming equal to it near Q = 20. The FPN Faster R-CNN models had higher precision, and their recall started near the precision value and declined slowly until the turning point. Finally, the RetinaNets had lower TPR values that declined at a comparatively high rate, but they had the highest precision, which was consequently maintained down to low quality values. The three metrics related to the area under the precision-recall curve behaved as shown in Figure 13. These metrics are averaged, and therefore the curves look smooth.
The general shape of AP was the same for all models, manifesting the "first gradually, then suddenly" characteristic (echoing the famous Hemingway line about how bankruptcy happens). The similarity was visible not only between models, but also between the curves themselves within each detector.

The AP Behavior for Different Sizes of Objects
The AP metric computed for small, medium, and large objects is shown in Figure 14.
There was a high similarity across all models, but in contrast to the AP at different T IoU , the AP at different sizes had noticeable differences in curve shape. Specifically, the large objects were robustly detected down to low compression quality, and the shape of the middle-sized objects' AP curve was similar but with a lower value. This can be related to the influence of the bounding box precision: big objects had a high IoU even with detections misaligned by a few pixels, and the GT annotations were also not perfect. In the case of small objects, the difficulty of detecting them was apparent in the plots. Not only did they start with AP s at approximately half of the medium objects' AP m , but they also declined at a greater rate. This was likely an effect of the lossy compression, which suppresses the high-frequency signal in the image that is essential for analyzing the fine detail of small objects. The AP s curve was visibly noisier than the other curves: the quantization, which produces a consistent average effect of size reduction and quality degradation over larger regions, manifested more randomness in its influence on the small regions, which spanned only a few 8 × 8 compressed blocks.

Analyzing the Derivative of AP with Respect to Q
To confirm the linear behavior of the performance degradation, we analyzed the derivative of the AP with respect to Q. The derivative produced a noisy curve, so we applied smoothing using a running average of five adjacent values; the derivatives are shown in Figure 15. There was a plateau from Q = 100 to Q = 40 ("first gradually"), and then the degradation accelerated ("then suddenly"). Using the derivative allowed us to pinpoint the Q value where the decline in detection performance started. Table 5 shows the same set of metrics as the baseline Table 4, but calculated for the dataset degraded by compression with Q = 25. This value is already below the "turning point", but images with this quality are usually good enough for processing by humans, despite visible artifacts. The results showed the precision staying close to the baseline value, along with a reduced number of TPs. The number of extra objects was also reduced, as the rate of detecting these objects was comparable to the TPR. Here, we observed that the X101 model was still the best in the AP, mAP .75 and AP s metrics (related to the precise localization, especially of smaller objects), but the absolute value of these metrics was low, and the difference relative to the non-FPN Faster R-CNNs was smaller. This suggested that the benefits of using the feature pyramid were adversely affected by using too aggressive a compression. The best results are in bold, whereas the second-best are underlined; AP l , AP m , AP s : the AP metric for objects classified as large, medium and small; TPR, PPV: recall and precision at T IoU = 0.5; TP, FP: true positives and false positives at T IoU = 0.5; EX: "extra" detection of objects in the crowd regions.
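The smoothing described above can be sketched as follows (a running average over five adjacent values of the discrete derivative; the exact edge handling used in the study may differ):

```python
import numpy as np


def smoothed_derivative(metric_values, window=5):
    """Discrete derivative of a metric sampled at consecutive Q values,
    smoothed with a running average over `window` adjacent points."""
    d = np.diff(np.asarray(metric_values, dtype=float))  # unit-step derivative
    kernel = np.ones(window) / window
    return np.convolve(d, kernel, mode="valid")
```

For a metric that degrades linearly with Q, the smoothed derivative is constant; a plateau near zero followed by increasingly negative values reproduces the "first gradually, then suddenly" pattern.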

Conclusions and Future Work
In this paper, we reported our study on the effect of JPEG compression on the performance of deep-learning object detectors based on CNNs. We used the COCO val2017 benchmark dataset and collected a wide range of performance metrics at different levels of compression, controlled by the parameter Q ranging from 1 (strongest compression, lowest quality) to 100 (weakest compression, nearly lossless). We also established a baseline of metric values over the original dataset. The baseline results were used to characterize the models under test, including their sensitivity to the confidence threshold value and their trade-off between precision and recall. We performed a qualitative assessment of the detection behavior and introduced a taxonomy of wrong detections: incorrect bounding boxes; wrongly classified, mistaken and hallucinated objects; clusters; and chimera detections.
The experiments showed that the one-stage detectors had a narrower range of admissible thresholds than the two-stage detectors, which were influenced by the threshold but offered a more beneficial trade-off for thresholds further from 0.5. We treated the metrics obtained over the degraded dataset as functions of the parameter Q. For precision and recall at T_c = 0.5, we observed radically different behaviors: the precision remained constant regardless of compression quality, whereas the recall followed a knee-shaped curve with a rapid decrease below Q = 30. The results were consistent for a wide range of T_c values, with possible shifts of the precision and recall curves but the same general shape. We studied the AP metric with its specific cases, the more specialized mAP metrics, and the AP of the large, medium and small objects. The AP, mAP.5 and mAP.75 curves were similar across all models (with possible differences in scaling, but not in general shape), and the mAP.75 curve was a good approximation of the more computationally expensive AP. The per-size AP metrics were consistent across the investigated models, with the AP_s values declining more steeply; therefore, the small objects were more affected by compression. We speculate that this is related to high-frequency information and fine detail, which are not preserved by lossy compression. Finally, we examined the first derivative of the AP curves and found that the AP decreased linearly from Q = 96 to Q = 40 (the AP value was approximately constant in the 100-96 range of Q values, but these compression levels are impractical because of the increase in data size). This defined the range of practical compression levels; the exact Q value for a specific use case depends on the recall that needs to be achieved and the nature of the objects. From a practical perspective, the experimental results helped us draw several important conclusions.
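For reference, the T_IoU = 0.5 matching criterion used throughout relies on the standard intersection-over-union measure. A minimal sketch follows; the corner-coordinate box format is an assumption for illustration (COCO annotations actually store x, y, width, height):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes, each given
    as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # overlap width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # overlap height
    inter = iw * ih
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union > 0 else 0.0

# At T_IoU = 0.5, a detection counts as a true positive when its IoU
# with an unmatched ground-truth box of the same class is >= 0.5.
```

This is why a few pixels of misalignment matter little for large boxes but can push a small box below the 0.5 threshold.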
Performance decreases with stronger compression, following a knee-shaped curve. This curve, as a function of the Q parameter of JPEG compression, is continuous, so it can be sampled sparsely to save processing time. We observed that some techniques, such as FPN, lose their benefits over simpler approaches below a certain compression quality. To summarize, JPEG compression is generally friendly to deep-learning-powered object detectors but, unlike previous findings about its influence on image classification, there was a measurable influence throughout the whole range of quality settings. This effect came from reduced recall, while the precision remained unchanged.
Our study provides a framework for evaluating the influence of image compression on the performance of object detection methods, which can be applied to assess emerging methods for this task. It also opens the door to further research along three main directions: broadening the scope of this research, finding ways to mitigate the effects of image compression on deep-learning-powered object detectors, and improving compression methods to make them more "friendly" to object detection. The following points summarize potential approaches to broadening the analysis of the effects of image compression: • Inclusion of more deep-learning-powered object detection models; • Expansion of the set of detection performance metrics (e.g., LRP [41], PDQ [42]); • Incorporation of image quality metrics (based on feature similarity, such as FSIM [43], or salience-aware artifact detection [44], among many others); • Consideration of related computer vision tasks, such as instance segmentation; • Investigation of other compression algorithms, both transformative (e.g., JPEG2000 [45]) and generative/predictive (e.g., WebP [46], PDE-based methods [47]).
Additionally, the methods that could help overcome the influence of compression encompass (i) pre-detection quality improvement, similar to super-resolution reconstruction [48,49]; (ii) inclusion of compression-degraded images in the training dataset (models could be trained separately over the data with and without quality degradation, or a single model could be built from a dataset of original and degraded images, with the latter treated as augmented samples); and (iii) building dedicated models for specific ranges of compression quality, to be used in an ensemble, or separate models for small and non-small objects, since our results showed that large and medium objects behaved similarly and were more robust to the compression effects. These issues constitute our current research efforts, which should ultimately lead to more robust deep-learning object detectors ready to be deployed in the wild.
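Approach (ii) above can be sketched as a simple training-time augmentation. The example below is a minimal illustration, not the paper's implementation; it assumes Pillow is available and re-encodes each image in memory at a randomly drawn JPEG quality:

```python
import io
import random

from PIL import Image  # Pillow is an assumed dependency

def jpeg_degrade(img, q_range=(20, 80), rng=random):
    """Return a copy of `img` re-encoded as JPEG at a random quality
    drawn from `q_range`, so the model sees compression artifacts
    during training."""
    q = rng.randint(*q_range)
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=q)
    buf.seek(0)
    return Image.open(buf).convert("RGB")
```

In a detection pipeline, such a transform would typically be applied with some probability per sample; the bounding-box annotations stay unchanged, since JPEG compression does not displace pixels.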

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.