Masked Feature Compression for Object Detection

Abstract: Deploying high-accuracy detection models on lightweight edge devices (e.g., drones) is challenging due to hardware constraints. To achieve satisfactory detection results, a common solution is to compress and transmit the images to a cloud server where powerful models can be used. However, the image compression process for transmission may lead to a reduction in detection accuracy. In this paper, we propose a feature compression method tailored for object detection tasks that can be easily integrated with existing learned image compression models. In the method, the encoding process consists of two steps. First, we use a feature extractor to obtain the low-level feature, and then use a mask generator to obtain an object mask that selects regions containing objects. Second, we use a neural network encoder to compress the masked feature. For decoding, a neural network decoder restores the compressed representation into a feature that can be fed directly into the object detection model. The experimental results demonstrate that our method surpasses existing compression techniques. Specifically, when compared to one of the leading methods, TCM2023, our approach achieves a 25.3% reduction in compressed file size and a 6.9% increase in mAP0.5.


Introduction
Since Shannon proposed information entropy in 1948 [1], image compression has been a popular research field. Before the advent of computer vision, all images were created for human perception. Consequently, the goal of image compression was to make the compressed image visually identical to the original. In recent years, with the continuous development of object detection techniques, images can also be "seen" by machines. Therefore, when the final recipient of an image is an object detection model, the goal of compression shifts to making the detection results on the compressed representation as close as possible to those on the original image. Traditional compression methods do not take the requirements of detection tasks into account; all parts of the image are uniformly compressed according to a preset compression rate. As a result, some information needed for detection is discarded when the compression rate is high, leading to a drastic decrease in accuracy. Thus, we need a compression method capable of preserving information crucial for detection tasks.
Handcrafted image codecs [2][3][4][5] reduce image size by transforming the image into the frequency domain through mathematical methods [6,7] and then removing some high-frequency signals to which human eyes are insensitive. Learning-based image compression methods [8][9][10], on the other hand, use neural networks to perform a nonlinear transform of the image. Such methods learn to preserve key information by training under a mean squared error (MSE) loss between the original and reconstructed image. Some of them [11][12][13][14][15][16][17] even outperform advanced handcrafted methods (such as the still image coding of VVC [18]) in terms of PSNR (peak signal-to-noise ratio) and MS-SSIM [19]. Both kinds of compression methods fundamentally involve discarding information that is less perceptible to human eyes, reducing the size of the image while preserving visual quality.
Currently, in many edge-cloud collaborative scenarios, it is challenging to deploy complex image processing models due to the limited computational capabilities of edge devices. One way to solve this problem is to directly compress the models [20], while another is to compress the images captured by edge devices and transmit them to a powerful cloud server for processing. This paper primarily discusses the latter approach, a scenario that has led to more research on ensuring the performance of downstream task models on compressed data. One kind of method [21][22][23] adjusts the backend model to work with compressed input; e.g., Chan et al. [22] improved the performance of downstream tasks on compressed data by conducting transfer learning on backend DNNs with compressed data acquired from vehicle sensors. Another type of method makes adjustments at the input side to minimize the impact of compression on downstream models. For example, for video compression, Huang et al. [24] proposed a Learned Semantic Representation (LSR) method to extract semantic information between temporally adjacent frames, which can be used both for signal reconstruction observable by humans and for visual analysis understandable by machines.
In this paper, we present a feature compression approach designed for integration with current neural network-based image compression frameworks. This method can be incorporated smoothly into these frameworks to enhance their efficiency in object detection tasks. Furthermore, our method considers the computational power imbalance between the edge and the cloud. During edge-side encoding, a lightweight, high-recall module is used to identify potential target areas in the extracted features. The features are then encoded and transmitted to the cloud, where they are decoded, and a powerful detection model is employed for the final detection. This ensures that compression is performed effectively at the edge while high-accuracy detection is carried out in the cloud, addressing both the need for high performance in object detection and the constraints of edge computing.
The idea behind our method is that for image compression aimed at detection, we can remove information that is less sensitive to the detection model. Each spatial position in the image impacts the detection results differently; regions with objects are more influential than others. Therefore, a key issue of compression for detection is how to retain the parts containing the objects effectively. At the edge device, a lightweight model can be used to determine the regions where objects are located. Based on this idea, we design a masked feature compression method. First, a feature extractor processes the image to extract the low-level features, followed by a mask generator that creates an object mask for choosing target regions. Additionally, we enhance the mask by adding information from the objects' vicinity through a "neighborhood convolution" process. By conducting complexity analysis and exploring feasibility, we find that compressing the features obtained from the feature extractor directly is more efficient than compressing the input image. The encoder then compresses the masked features into latent representations. At the decoding stage, a decoder reconstructs the features. Experimental results show that under identical compression rates, our method outperforms other compression methods in detection accuracy. At the same time, compared with some up-to-date neural network image compression methods, the proposed feature compression method has a faster encoding and decoding speed. The key contributions of this paper can be summarized as follows:

• We explore the feasibility of applying generated masks to low-level features and reduce the model's time complexity by directly compressing the features. The model's encoding speed surpasses that of current DNN image compression models.

• We design a lightweight mask generator that can generate an object mask in one forward pass, and perform compression on the masked feature to save bits while ensuring the accuracy of backend object detection tasks.

• The proposed framework can easily be integrated with existing neural network compression frameworks, enhancing the compression performance for object detection tasks.

The structure of the remainder of this paper is as follows. Section 2 reviews related work. Section 3 explores the feasibility of masked feature compression. Section 4 provides a detailed description of the proposed method. Section 5 presents and discusses the experimental results. Finally, Section 6 concludes the paper.

Variational Image Compression
Since Ballé proposed the first variational image compression model [25], such methods have become the mainstream of learning-based compression. The key to these approaches is entropy modeling and nonlinear transformation by neural networks. The basic structure of these methods is shown in Figure 1, where $x$ is the original image; $y = g_e(x)$ is the latent representation after encoding; and $z = h_e(y)$ is the hyper latent information [26], which is used as side information to estimate the entropy of $y$ more accurately and is also transmitted to the decoder side. The entropy model outputs the likelihood of each element $y_i$ in $y$. With the likelihoods, the arithmetic coders [27] can complete the entropy coding of $y$ without any information loss. Finally, the image decoder reconstructs the image as $\hat{x} = g_d(\hat{y})$.
The training loss of such models consists of two components: distortion loss and rate loss. The distortion loss controls the quality of the compressed image:

$$D = d(x, \hat{x}) \quad (1)$$

where $d(\cdot)$ is the distortion metric (MSE or MS-SSIM). The rate loss controls the file size of the compressed image:

$$R = \frac{1}{HW}\left(\sum_i -\log_2 p_y(y_i) + \sum_i -\log_2 p_z(z_i)\right) \quad (2)$$

where $p_y(y_i)$ and $p_z(z_i)$ are the likelihoods of $y_i$ and $z_i$, and $H$, $W$ are the height and width of the original image, respectively. The rate loss is an estimate of the bits per pixel (BPP) of the compressed image. By using $\lambda$ to control the rate-distortion trade-off, the final loss is as follows:

$$L = R + \lambda D \quad (3)$$
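The two loss terms described above can be illustrated numerically. The following is a minimal sketch, assuming the common form in which the rate is the summed negative log2-likelihood of the latent and hyper-latent elements divided by H × W, the distortion is MSE, and the total is rate plus λ times distortion. The function names and the toy inputs are illustrative assumptions, not the authors' code.

```python
import numpy as np

def rate_bpp(y_likelihoods, z_likelihoods, height, width):
    """Estimated bits per pixel: code length of y and z divided by H*W."""
    bits = -np.log2(y_likelihoods).sum() - np.log2(z_likelihoods).sum()
    return bits / (height * width)

def rd_loss(x, x_hat, y_likelihoods, z_likelihoods, lam):
    """Rate-distortion loss R + lambda * D, with MSE as the distortion metric."""
    height, width = x.shape[-2:]
    distortion = np.mean((x - x_hat) ** 2)
    rate = rate_bpp(y_likelihoods, z_likelihoods, height, width)
    return rate + lam * distortion, rate, distortion

# Toy input (hypothetical): a 3x4x4 "image", 8 latent elements with
# likelihood 0.5 (1 bit each) and 2 hyper-latent elements with
# likelihood 0.25 (2 bits each), i.e. 12 bits over 16 pixels.
x = np.zeros((3, 4, 4))
x_hat = np.full((3, 4, 4), 0.1)
loss, rate, dist = rd_loss(x, x_hat, np.full(8, 0.5), np.full(2, 0.25), lam=100)
```

In this toy case the rate is 12 bits over 16 pixels, so a larger λ shifts the balance toward lower distortion at the cost of more bits.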

Region Proposal Methods
As mentioned in Section 1, to achieve compression for detection, we need to preliminarily identify potential target-containing areas on the encoding side. Two-stage object detection methods [28][29][30] involve a process of generating a set of candidate target regions. This process generates tens of thousands of candidate boxes and reduces the number of overlapping boxes through non-maximum suppression. In the context of lightweight encoding, we prefer to generate candidate regions in one step. Since specific detection is not required on the encoding side, there is no need to consider the overlapping parts between objects. In fact, it is only necessary to generate a one-dimensional 0-1 mask, using 1 to indicate regions where objects may exist and 0 for the background, without the need for the additional differentiation of object types.
Image segmentation models [31][32][33] can generate semantic maps of images, dividing background and object areas. However, the full image segmentation algorithm is computationally intensive, and the encoding side's limited computational power restricts our ability to use complex models. On the encoding side, we do not require highly accurate masks but rather a rough identification of object areas, and the model's recall rate is deemed more important than its precision so as to keep as many targets as possible. Additionally, we aim for the mask generator to share a feature extraction process with the backend detection model to conserve computational resources. Therefore, we design a lightweight mask generator to facilitate rapid mask generation on the encoding side.

One-Stage Object Detection Models
One-stage object detection models can perform both detection and classification in one forward inference. The YOLO series [34][35][36][37] is probably the most famous among them. YOLO consists of three main components: backbone, neck, and head. The backbone is responsible for extracting features from the input image; the neck is used to further process the features extracted by the backbone and merge information from different layers; and the head performs object detection and classification, predicting the final results.
YOLOv5 [37] has been widely used in recent years. It achieves different model sizes (from light to heavy, named YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x) by adjusting the number of convolutional layers and channels. As shown in Table 1, there is a large gap in mAP between models of different sizes. Some edge devices are unable to deploy large-scale models due to their limited computing capabilities. In cases where precision is required, a common solution is to transmit images to more powerful devices and then use large models for detection.
Table 1. The mAP, parameters, and FLOPs of the officially provided YOLOv5 models of different sizes. Note: mAP is tested on the COCO dataset [38] with an IoU threshold of 0.5.

Feasibility Analysis of Masked Feature Compression
To utilize object-related information during the encoding process, we need to perform a preliminary detection on the input image to identify potential target areas before implementing specific compression. After that, two different pipelines can be adopted to integrate the acquired information with encoding. The first pipeline uses the original image as the input for the encoder, and the second one directly inputs the obtained features into the encoder for compression. The structures of the two pipelines are illustrated in Figure 2.
The feature extraction process reduces the resolution of the input image, and the resolution significantly affects the speed of the convolution operation. The computational complexity of the convolution is shown in Equation (4):

$$O\left(S^{2} \cdot K^{2} \cdot C_{in} \cdot C_{out}\right) \quad (4)$$

where $S$ is the output feature map's side length, $K$ is the kernel size, $C_{in}$ is the input channel number, and $C_{out}$ is the output channel number. Clearly, using the image as input for the encoder does not take advantage of the resolution reduction achieved by the feature extraction process. Thus, directly compressing the extracted features is a more efficient method for encoding.
Through the detection part shown in Figure 2, potential target areas are identified in advance. Subsequently, a 0-1 mask, where 1 corresponds to the targets and 0 corresponds to the background, can be used to filter the content to be compressed. The prerequisite for applying the mask to the extracted features is that the features and the image have a consistent spatial structure. Previous studies [39,40] have shown that low-level features are capable of preserving the spatial information of the original image. In this paper, we train the complete YOLOv5l [37] structure as the powerful detection model (Section 5.2), and use its first four layers as the feature extractor. We visualize the features extracted by the feature extractor; some results are shown in Figure 3. It can be observed that the features retain the spatial information of the original image, meaning the relative position of the objects in both the features and the original image is consistent. This allows the mask generated for the original image to be directly applied to the features.
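To make the dependence of convolution cost on resolution concrete, the sketch below evaluates the product of squared output side length, squared kernel size, input channels, and output channels for a single layer at two resolutions. The layer configuration (3 × 3 kernel, 64 channels in and out) is a hypothetical choice that isolates the effect of resolution; it is not taken from the paper's encoders.

```python
def conv_cost(out_side, kernel, c_in, c_out):
    """Multiply-accumulate count S^2 * K^2 * C_in * C_out of one convolution."""
    return out_side ** 2 * kernel ** 2 * c_in * c_out

# Hypothetical layer: 3x3 kernel, 64 input and 64 output channels,
# evaluated at full image resolution and at 1/8 of that resolution.
full_res = conv_cost(out_side=640, kernel=3, c_in=64, c_out=64)
feat_res = conv_cost(out_side=80, kernel=3, c_in=64, c_out=64)
ratio = full_res // feat_res
```

With everything else held fixed, reducing the side length by a factor of 8 reduces the cost by a factor of 64. Real encoders also differ in channel counts, so the overall ratio between an image encoder and a feature encoder is smaller than this single-layer figure.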

Proposed Method
Figure 4 shows the workflow of the proposed method. The encoding process is performed on the edge device. First, the feature is extracted from the image, and the mask generator outputs an object mask based on the feature. The neighborhood convolution then adds the parts near the objects to the mask. The encoder uses the mask to remove the background and compresses the feature into the latent representation. In the decoding process on the high-performance cloud server, a decoder recovers the feature from the latent representation. Finally, a powerful detector outputs the detection results from the feature.

The Feature Extractor, Mask Generator, and Powerful Detector
The feature extractor can be regarded as a module shared by the mask generator and the powerful detector. Figure 5 integrates it into the mask generator. After masking, the extracted features are compressed and transmitted. During the feature extraction stage, the input is downsampled three times in both width and height (by a factor of 2 each time), and the channel dimension is correspondingly increased three times. As a result, when the input size is (3, H, W), the final extracted feature size is (256, H/8, W/8).
Figure 5. The structure of the mask generator. The C3 block used here is based on the structure in CSPNet [41]; ×n denotes the number of times it is repeated. "Up" represents nearest-neighbor interpolation with a scaling factor of 2. In the mask, "1" is represented by white, while "0" is represented by black.
To generate a masked feature, an intuitive approach is to use a lightweight detection network (e.g., YOLOv5n) to perform pre-detection on the extracted feature and keep all values inside the detected objects while discarding the others. However, a major problem with this method is that it needs to iterate through each bounding box to obtain coordinates, which significantly slows down the encoding speed. To address this issue, we introduce the mask generator model, which directly outputs a mask as shown in Figure 5. The model has a similar structure to detection models; the difference is that the detection model's neck outputs features of three different sizes, intended for detecting large, medium, and small objects, respectively. In contrast, the mask generator only needs to output an object mask that has the same size as the feature from the feature extractor.
The labels for training the mask generator are obtained from the detection labels. Since the low-level feature's receptive field is small, the extracted feature still retains a spatial structure similar to the original image, and the object's position in the original image aligns with its position in the feature. Assuming the height and width of the feature to be compressed are $h$ and $w$, the mask label size should be $(h, w)$ too. As Figure 5 shows, we utilize the bounding box coordinates from the detection label, and set the mask values inside objects to 1 and the rest to 0. The training loss is the MSE loss between the mask label and the generated mask. Details are as follows:

$$L_{mask} = \frac{1}{hw}\left(\beta \sum_{(i,j) \in obj} \left(m_{ij} - \hat{m}_{ij}\right)^{2} + \sum_{(i,j) \in back} \left(m_{ij} - \hat{m}_{ij}\right)^{2}\right)$$

where $h$ and $w$ are the height and width of the mask, $\beta$ is the weight of the object region loss, $m$ is the mask label, $\hat{m}$ is the generated mask, $obj$ means the object, and $back$ means the background.
The powerful detector is employed in the decoding stage to perform the final detection. Since the decoding stage is deployed on high-computing cloud servers, high-accuracy large models can be used. In this paper, we adopt YOLOv5l as the powerful detector.
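The label construction and the weighted loss described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: boxes are assumed to be already scaled to integer feature-grid coordinates (x1, y1, x2, y2) with exclusive upper bounds, the object and background terms are assumed to be summed squared errors normalised by h × w, and the value β = 2.0 is hypothetical.

```python
import numpy as np

def boxes_to_mask(boxes, h, w):
    """Build an (h, w) 0-1 mask label from (x1, y1, x2, y2) boxes."""
    mask = np.zeros((h, w), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2] = 1.0  # cells inside a box are object cells
    return mask

def weighted_mask_mse(label, pred, beta):
    """MSE between label and prediction, with object cells weighted by beta."""
    obj = label == 1
    sq_err = (label - pred) ** 2
    return (beta * sq_err[obj].sum() + sq_err[~obj].sum()) / label.size

# One 2x2 box on a 4x4 grid: 4 object cells and 12 background cells.
label = boxes_to_mask([(1, 1, 3, 3)], h=4, w=4)
pred = np.full((4, 4), 0.5, dtype=np.float32)
loss = weighted_mask_mse(label, pred, beta=2.0)
```

Choosing β greater than 1 penalises missed object cells more than false positives in the background, which matches the stated preference for recall over precision on the encoding side.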

Neighborhood Convolution
The mask generator can identify object positions, yet object detection models require surrounding pixel information for more accurate predictions. Hence, selecting a suitable neighborhood range around objects is vital. Too large a range increases file size, whereas too small a range compromises accuracy.
We can calculate the distance of each pixel from the target and set a threshold to define the neighborhood. However, iterating over each pixel drastically reduces the coding speed. Given that current machine learning frameworks [42,43] have optimized convolution operations for efficiency, we propose the neighborhood convolution to expedite the neighborhood determination process. Details are illustrated in Figure 6. First, we define a convolutional kernel with all values set to 1. Then, we use this kernel to perform convolution on the generated mask (to keep the output's size unchanged, the padding should be (Kernel Size − 1)/2 and the stride should be 1). Values in the output represent the count of pixels belonging to objects within the kernel. Finally, we set all nonzero values in the output to 1 to obtain a new mask. By adjusting the size of the convolutional kernel, we can control the size of the areas to be retained in the final mask, thereby fine-tuning the file size of the compressed representation.
Prior to feature compression, it is crucial to understand how the mask affects detection outcomes. To this end, we combine neighborhood convolution with a series of exploratory experiments on the VisDrone dataset, following these specific steps:

1. Use the pretrained mask generator (training details in Section 5) to obtain a 0-1 mask, where the object areas are marked as 1 and the background areas as 0. Then, resize the mask through nearest-neighbor interpolation to match the size of the features.

2. Use neighborhood convolutions of different sizes to expand the object regions in the mask, and then multiply the mask with the features to obtain masked features.

3. Detect the masked features with YOLOv5l, which has been trained on the VisDrone dataset.

Results are shown in Table 2. Table 2 indicates that when the kernel size of the neighborhood convolution is 11, the difference in mAP0.5 between the masked and unmasked situations is only 0.6%. Therefore, during the feature compression process, we can choose to compress and transmit the features after masking, instead of transmitting the full features. Correspondingly, during decoding, we only need to restore the masked features instead of the complete features.
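The neighborhood convolution and the masking step can be sketched in a few lines. The version below is a minimal NumPy illustration that assumes an odd kernel size; a deep learning framework's convolution operator would normally be used instead, and the function name, the 7 × 7 toy mask, and the two-channel feature are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def neighborhood_convolution(mask, kernel_size):
    """Expand a 0-1 mask with an all-ones kernel (stride 1, padding (k-1)/2)."""
    assert kernel_size % 2 == 1, "an odd kernel keeps the output size unchanged"
    pad = (kernel_size - 1) // 2
    padded = np.pad(mask, pad, mode="constant", constant_values=0)
    windows = sliding_window_view(padded, (kernel_size, kernel_size))
    counts = windows.sum(axis=(-2, -1))      # object pixels under the kernel
    return (counts > 0).astype(mask.dtype)   # every nonzero count becomes 1

# A single object pixel grows into a 3x3 block with a 3x3 kernel.
mask = np.zeros((7, 7), dtype=np.int64)
mask[3, 3] = 1
expanded = neighborhood_convolution(mask, kernel_size=3)

# Masking a (channels, h, w) feature: the mask broadcasts over channels.
feature = np.ones((2, 7, 7))
masked_feature = feature * expanded
```

A larger kernel keeps a wider band of context around each object, at the cost of more nonzero feature values to compress.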

Feature Compression Model
The feature compression model is used to achieve the compression and decompression of the masked feature. It consists of three parts: encoder, decoder, and entropy model. Its goal is to make the decompressed feature consistent with the input. Current learning-based image compression networks [11,13,14,26] have good compression performance on images. Since both the feature compression model and the image compression model essentially aim to restore the input, we modify the mean-scale hyperprior compression model [14] to make it suitable for feature compression. Here, we do not use the context model, as it would slow down the decoding speed. The masking operation has already reduced the amount of information in the feature, so the compression performance is satisfactory even without the context model. Note that feature compression differs from image compression only in terms of input and output dimensions. Therefore, other learning-based image compression models can also be adapted for feature compression with a few modifications. In the ablation study in Section 5.4, we experiment with other compression models and demonstrate that the proposed framework can easily integrate with current compression networks, enhancing their compression performance.
The differences between our feature compression model and the original image compression models are shown in Table 3. The input of the feature encoder is the extracted feature of size (256, H/8, W/8). Therefore, the feature encoder needs to reduce the number of channels. We set the feature encoder's intermediate channel number to 224 and the output channels to 192. The reduction in width and height only occurs in the first convolution. The feature decoder implements the inverse process of the encoder. Since both image compression and feature compression have the same latent representation size, there is no need to make any changes to the entropy model.

Computational Complexity Analysis
Based on Equation (4), we can estimate the ratio of time complexities between the image encoder and the feature encoder in Table 3. Assuming the size of the image is (3, 640, 640), we estimate the complexity of the image encoder with Equation (4). The feature size after extraction is (256, 80, 80), and the complexity of the feature encoder is estimated in the same way. Note that these two values are used only for estimating the ratio of time complexities between the feature encoder and the image encoder, and do not represent the actual amount of computation. We conclude that the complexity of encoding images is nearly twice that of encoding features. From this, it is evident that compressing features directly can enhance the speed of the compression model. In Section 5.3, we compare the encoding speeds of different compression methods. The experimental results demonstrate that our feature compression method achieves a faster coding speed than image compression models with competitive compression performance. Although the total number of elements increases from 3 × H × W for the input image to 256 × H/8 × W/8 for the feature, after processing by the feature encoder, the dimension is reduced to (192, H/16, W/16), aligning with the situation when the input is an image.
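The element counts mentioned above can be checked with a short calculation for the 640 × 640 case used in this section. The helper name is illustrative.

```python
def num_elements(channels, height, width):
    """Total number of values in a (channels, height, width) tensor."""
    return channels * height * width

H = W = 640
image = num_elements(3, H, W)                  # input image, (3, 640, 640)
feature = num_elements(256, H // 8, W // 8)    # extracted feature, (256, 80, 80)
latent = num_elements(192, H // 16, W // 16)   # latent, (192, 40, 40)
```

The extracted feature holds 4HW values against 3HW for the image, so it is about one third larger, while the latent holds 0.75HW values, a quarter of the image's element count.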

Dataset
VisDrone: The VisDrone dataset [45] is a large-scale benchmark obtained from drone-mounted cameras. The dataset has ten distinct categories and contains a total of 343,205 labels. Drones, as a typical edge device, have limited computational capabilities due to payload requirements. Moreover, the images obtained through aerial photography mainly contain small objects. Accurately identifying these objects using low-accuracy lightweight models is challenging. Therefore, the proposed method is well suited for the usage scenarios of drones. We use the complete VisDrone training and validation datasets, with the training set containing 6471 images and the validation set containing 548 images.
COCO: COCO is a large-scale object detection, segmentation, and captioning dataset. The training set comprises 118,000 images, while the validation set contains 5000 images. It is widely utilized to evaluate object detection performance, and we use it to assess the performance of the proposed compression method on general object detection tasks. Previous work [12,13] has shown that training a convolution-based compression model does not need a large dataset. Therefore, when training on COCO, we randomly select 6500 images from it. Of them, 6000 are used as the training set and the rest are used as the validation set.

Training Settings
First, we train YOLOv5l [37] as the powerful detection model on the above two datasets. Following the official training procedure, the Adam optimizer [46] is adopted. It adaptively adjusts the learning rate, which improves training efficiency and model performance, and it converges quickly. The learning rate, batch size, image size, and number of epochs are 0.001, 8, 640, and 300, respectively. Then, we fix the weights of the feature extractor in Figure 5 and train the mask generator with the same hyperparameters.
During the training phase of the compression model, the weights of both the mask generator and the powerful detector are frozen. The training loss is the rate-distortion loss. We set the kernel size of the neighborhood convolution to 11 for VisDrone and 21 for COCO; the λ in the rate-distortion loss is chosen from {50, 100, 200, 300}. Choosing Adam as the optimizer, we set the initial learning rate to 1 × 10^−4 and the batch size to 8, and train each model under a specific λ for 300 epochs. The forward propagation process of the training is shown in Figure 7. The feature after masking is passed to the compression model, and we want the output to be the same as the input. The powerful detector used in the training of the compression model acts as an evaluator. It generates detection results based on the feature reconstructed by the decoder. The higher the mAP of the detection results, the better the compression model.

Experimental Results
Examples of masks generated by the mask generator are shown in Figure 8. The generated masks closely resemble the mask labels. Although a generated mask differs from its label in some places, the neighborhood convolution enlarges the object regions, so some missing object pixels are still included in the final mask.
Figure 9 presents the rate-accuracy curves of different compression methods on two datasets. The detection model used here is the pre-trained powerful detector. The learning-based compression models to be compared include Ballé2018 [26], Minnen2018-MeanScale [14], Cheng2020 [11], ELIC2022 [13], STF2022 [15], TCM2023 [16], and Lightweight LIC2024 [17]. We also plot the rate-accuracy curves of the handcrafted compression methods BPG [4] and WebP [5]. The mAP on the original images is used as the baseline. It can be seen that our method outperforms other methods across a wide range of BPP, and it performs particularly well at high compression ratios. At low compression levels (high BPP), most of the information from the original image is retained. However, when the compression level becomes higher (low BPP), the mAPs of the other methods drop sharply, while ours does not. This is because these methods ignore which information is helpful for object detection. When increasing the compression rate, they tend to uniformly discard information from the entire image. As shown in the workflow in Figure 4, by pre-identifying the potential target regions before encoding, we can preserve the target-related information with higher quality while omitting a large amount of unimportant background information. Therefore, at the same BPP, our method better retains the necessary object information for detection, resulting in a significant improvement in detection accuracy compared to methods that uniformly compress the entire image. Table 4 presents the detection metrics of various compression models at comparable BPP values. Our method demonstrates the highest detection accuracy with the lowest BPP, providing clear evidence of the performance advantages of the proposed algorithm.

Table 4 (COCO):

| Method | BPP | mAP0.5 | mAP0.5:0.95 |
|---|---|---|---|
| TCM2023 [16] | 0.274 | 0.602 | 0.420 |
| STF2022 [15] | 0.254 | 0.602 | 0.421 |
| Cheng2020 [11] | 0.22 | 0.597 | 0.408 |
| Minnen2018 [14] | 0.231 | 0.591 | 0.413 |
| BPG [4] | 0.306 | 0.580 | 0.398 |
| WebP [5] | 0.342 | 0.570 | 0.395 |

Figure 10 visualizes the images (or features) and detection results obtained by different compression models at similar compression rates. It can be seen that traditional image compression methods apply uniform compression across all parts of the image at high compression rates, leading to the blurring of small objects and making them difficult to detect. As shown in the last column of the first row, in areas sparse with objects, our prior identification of target regions allows for the erasure of non-target areas in the upper part of the patch, reducing the image's information content. This enables the higher-quality preservation of object areas to ensure accurate final detection results. For areas dense with objects, as shown in the last column of the third row, the patch is completely preserved to retain all targets.
Figure 10. Images (or features) and detection results obtained by different compression models [11,13,15]. The first column displays results obtained by applying the detector to the original image (baseline). The first and third rows display the input to the detection model under different compression methods, which, for conventional compression methods, is the decompressed image. Since our approach involves compressing features, we showcase the visualization of the channel with the most information. The second and fourth rows present the results of detection. Our method outputs detection results directly from the features, and for visualization, we draw the bounding boxes on the corresponding original image.
To evaluate the operational efficiency of our compression method, we measure encoding and decoding speeds on an Nvidia GeForce GTX 1080Ti. Considering that traditional handcrafted image compression is performed on CPUs, while neural network-based models are computed on GPUs, it is challenging to compare them under the same computational capacity. This paper focuses on improving the performance of neural network compression models for object detection tasks, so we only compare the compression speeds of neural network-based models here. As presented in Table 5, our method requires only 17.5 ms to encode an image, faster than most competing methods. Since the compression is applied to the features rather than the original high-resolution images, the encoding and decoding speeds are greatly improved. Additionally, thanks to the increased compression rate achieved by the mask scheme, we can eliminate the time-consuming context model, significantly reducing the computational load of the model. Note that while our encoding duration is 5 ms longer than that of Ballé2018 [26], this marginal increase is deemed acceptable in light of our method's superior compression performance.

Table 5:

| Method | Encoding time (ms) | Decoding time (ms) | | |
|---|---|---|---|---|
| [14] | 12.9 | 6.6 | 6.9 | 79.9 |
| Minnen2018 [14] | 20.9 | >10^3 | 12.0 | 168.9 |
| Ballé18 [26] | 12.5 | 6.3 | 4.9 | 77.1 |

Ablation Study
The Masked Feature Compression Method: To evaluate the improvement brought by our approach, we compare the rate-accuracy (R-A) curve of our method with that of the baseline model (Minnen2018-MeanScale), as shown in Figure 11. Our method significantly improves the mAP at equivalent BPP levels compared to the baseline. Furthermore, the slope of our method's R-A curve varies less than that of the baseline, indicating more consistent performance across compression rates. Together, the consistency and the higher mAP show that the proposed masked feature compression technique preserves critical detection information even under high compression.

The Mask Generator: To verify the efficacy of our mask generator, we run comparative experiments in which YOLOv5n performs an initial detection during the encoding phase and masks are created from the detected bounding boxes. This mask generation procedure is outlined in Algorithm 1, where bbxes denotes the bounding boxes, and (x_1, y_1) and (x_2, y_2) specify the coordinates of an individual bounding box. As illustrated in Figure 12, using the mask generator instead of pre-detection with a detection model does not compromise the final mAP. This suggests that the mask generator can efficiently identify the regions corresponding to object locations in a single forward pass. Moreover, because it avoids iterating over each bounding box, encoding is substantially faster: as Table 6 shows, the total encoding time drops from 38.3 ms per image to 17.5 ms per image.

Neighborhood Convolution: The kernel size determines how far the neighborhood convolution expands the regions around objects. Figure 13 compares the effect of neighborhood convolution on the R-A curve for different kernel sizes. At first, as the kernel size increases, the R-A curve shifts upward, i.e., the compression model performs better. However, with a kernel size of 13, performance is inferior to that achieved with a kernel size of 11. This is because an overly large kernel also retains a lot of information that is irrelevant to the object detection task, which increases the size of the compressed representation but has little effect on detection accuracy. We therefore conclude that, among the sizes tested, a kernel size of 11 best preserves the information needed for detection. In addition, as shown in Table 2, when neighborhood convolution is not used, the mAP is only 34.6% even without compression. These results demonstrate that neighborhood convolution contributes to the improvement in detection performance.
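The bounding-box masking baseline and the neighborhood expansion discussed in this ablation can be sketched as follows. This is a minimal illustration rather than the paper's Algorithm 1 or its trained mask generator: the function names `mask_from_boxes` and `expand_neighborhood` are hypothetical, boxes are assumed to be integer pixel coordinates with (x_2, y_2) exclusive, and the neighborhood convolution is assumed to use an all-ones kernel with zero padding followed by binarization.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2); x2 and y2 exclusive (assumption)
Mask = List[List[int]]


def mask_from_boxes(height: int, width: int, bbxes: Sequence[Box]) -> Mask:
    """Build a binary mask by iterating over every bounding box."""
    mask = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in bbxes:
        # Clamp the box to the image so out-of-range detections are safe.
        x1, y1 = max(0, x1), max(0, y1)
        x2, y2 = min(width, x2), min(height, y2)
        for y in range(y1, y2):
            for x in range(x1, x2):
                mask[y][x] = 1
    return mask


def expand_neighborhood(mask: Mask, kernel_size: int) -> Mask:
    """Grow object regions with an all-ones kernel (assumed), then binarize."""
    if kernel_size % 2 != 1:
        raise ValueError("kernel_size must be odd")
    radius = kernel_size // 2
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0
            # Zero-padded convolution: positions outside the mask add nothing.
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < height and 0 <= xx < width:
                        total += mask[yy][xx]
            out[y][x] = 1 if total > 0 else 0
    return out


if __name__ == "__main__":
    base = mask_from_boxes(8, 8, [(3, 3, 5, 5)])
    for row in expand_neighborhood(base, kernel_size=3):
        print("".join(str(v) for v in row))
```

The per-box loop in `mask_from_boxes` is the iteration that the mask generator avoids, and a larger `kernel_size` keeps more background around each object, which is the trade-off compared in Figure 13.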

Different Image Compression Base Model:
The feature compression model we propose is based on image compression models. Essentially, both feature compression and image compression aim to restore the input at the output end. The difference lies in the dimensional changes during compression.
The base model used in the preceding experiments is Minnen2018-MeanScale [14]. Its structure is relatively simple and it offers fast coding speed. However, other deep learning compression models can also adopt the proposed framework to enhance performance for object detection. We integrate the masked feature compression scheme into existing image compression models. Figure 14 shows the rate-accuracy curves obtained when using masked feature compression with different base models [11,14,26]. The training settings are the same as in Section 5. We compare these curves with those obtained when compressing directly with the base model. The compression performance for detection improves significantly for all three models after applying the masked feature compression scheme, which demonstrates the generality of our proposed method across DNN image compression models. Meanwhile, the performance of the feature compression model is positively correlated with the performance of its base model.

Conclusions and Future Work
Unlike traditional image compression methods, which discard high-frequency signals to which the human eye is insensitive, we propose a masked feature compression method. This method uses a mask generator to produce an object mask and removes most background information that is irrelevant to detection tasks during compression. Compared to existing methods, our approach achieves superior detection performance across various compression levels. Experimental results indicate that it offers strong real-time performance and low computational demands. Through an ablation study, we demonstrate that the proposed mask generator significantly speeds up the encoding process, and that the neighborhood convolution markedly improves compression performance.

Based on the above analysis, our method holds significant potential for application in cloud-edge collaborative scenarios. For lightweight edge devices such as drones and smart cameras, the masked feature compression scheme can be used to obtain compressed representations for transmission. Once these compressed representations are transmitted to the cloud (the decoding end) and decoded, they can be input into the detection model for automatic detection. Compared to traditional compression methods, our method enhances the accuracy of the detection model on the compressed representations.

The R-A curves on the VisDrone and COCO datasets show that the masked feature compression method proposed in this paper has a more pronounced advantage on the VisDrone dataset. This is because the VisDrone dataset consists of small objects and is characterized by target sparsity, so the mask can filter out a significant amount of background information. In contrast, the COCO dataset contains many large objects that occupy most of the image area, leaving limited background regions that can be discarded. As a result, the compression gain brought by the mask is limited.

In recent years, masked image modeling has been widely applied in unsupervised learning. The underlying idea is to mask certain regions of an image and reconstruct the entire image from the visible regions. Considering that not all parts of large objects contribute significantly to the detection task (e.g., when the detection target is a human, facial features are far more important than the color patterns on clothing), we can mask the regions that do not contain key features and reconstruct these masked regions at the decoding end from the visible parts. Through this method, we aim to make the mask compression scheme applicable to images containing a high proportion of large objects.

Figure 1. Variational image compression with a hyperprior. Image encoder, image decoder, hyperprior encoder, and hyperprior decoder are denoted by g_e, g_d, h_e, and h_d, respectively. All encoders and decoders are neural networks. AE means arithmetic encoder and AD means arithmetic decoder.

Figure 2. Two kinds of compression pipelines. The feature extractor block is typically a neural network backbone used to extract feature information from the input image. The image encoder and image decoder are neural networks that perform image encoding and decoding, respectively. Similarly, the feature encoder and feature decoder are neural network implementations for feature encoding and decoding. The "perform detection" block can be implemented using common detection models and is intended to preliminarily determine the approximate regions where the objects are located.

Figure 3. The visualization of extracted features.

Figure 6. The neighborhood convolution operation, where the yellow region denotes objects and the orange region is the final neighborhood. "*" denotes convolution, and the kernel size here is 3.

Figure 7. The forward propagation process of training.

Figure 8. Examples of masks generated by the mask generator.

Figure 10. Example patches comparing the detection outcomes obtained from different methods [11,13,15]. The first column displays results obtained by applying the detector to the original image (baseline). The first and third rows display the input to the detection model under different compression methods, which, for conventional compression methods, is the decompressed image. Since our approach compresses features, we show a visualization of the channel with the most information. The second and fourth rows present the detection results. Our method outputs detection results directly from the features, and for visualization we draw the bounding boxes on the corresponding original image.

Figure 11. The rate-accuracy curves before and after using our method on VisDrone [14].

Figure 12. Rate-accuracy curves before and after using the mask generator on VisDrone.

Figure 13. Neighborhood convolution with different kernel sizes on VisDrone.

Table 2. Detection results under different masking ranges. The values below mAP are the range of IoU thresholds.
The image encoder takes the input image of size (3, H, W) and outputs latent representations of size (192, H/16, W/16). It transforms the image format into a more compact latent representation. The feature encoder should also output latent representations of size (192, H/16, W/16) to achieve the same effect. After feature extraction, the feature size becomes (256, H/8, W/8)
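The shape relationship described above can be checked with a few lines of arithmetic. The sketch below assumes each stage reduces height and width by an integer stride and that both dimensions are divisible by it; the helper name `downsampled_shape` is hypothetical, and the single factor-2 reduction in the feature encoder is inferred from the stated sizes rather than taken from the paper's layer definitions.

```python
from typing import Tuple

Shape = Tuple[int, int, int]  # (channels, height, width)


def downsampled_shape(channels_out: int, height: int, width: int,
                      total_stride: int) -> Shape:
    """Shape after spatial downsampling by `total_stride` in each dimension."""
    if height % total_stride or width % total_stride:
        raise ValueError("height and width must be divisible by total_stride")
    return (channels_out, height // total_stride, width // total_stride)


# 640 x 640 is the input size reported for the speed comparison.
H, W = 640, 640

# Image path: (3, H, W) -> (192, H/16, W/16).
image_latent = downsampled_shape(192, H, W, total_stride=16)

# Feature path: extractor gives (256, H/8, W/8); the feature encoder then
# needs only a further factor of 2 (inferred) to reach the same latent size.
feature = downsampled_shape(256, H, W, total_stride=8)
feature_latent = downsampled_shape(192, feature[1], feature[2], total_stride=2)

print(image_latent, feature, feature_latent)
```

Because both paths end at the same latent size, the entropy model of the base image compression model can in principle be reused for the feature latent without changing its dimensions.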

Table 3. Differences between the image compression model and feature compression model.

Table 4. Comparison of detection performance across different compression methods at similar compression rates. A smaller BPP value indicates a higher compression rate, while higher values of mAP0.5 and mAP0.5:0.95 indicate better detection performance. The best metrics in the table are highlighted in bold.

Table 5. Running speed comparison of different learning-based compression methods. The time for entropy coding and file I/O is not included in the calculation, and the input size is 3 × 640 × 640.

Table 6. Comparison of encoding speed on one image before and after using the mask generator.