Article

Noise-Aware Hybrid Compression of Deep Models with Zero-Shot Denoising and Failure Prediction

Xi’an Institute of Space Radio Technology, Xi’an 710100, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(24), 12882; https://doi.org/10.3390/app152412882
Submission received: 31 October 2025 / Revised: 29 November 2025 / Accepted: 4 December 2025 / Published: 5 December 2025

Abstract

Deep learning-based image compression achieves remarkable average rate-distortion performance but is prone to failure on noisy, high-frequency, or high-entropy inputs. This work systematically investigates these failure cases and proposes a noise-aware hybrid compression framework to address them. A High-Frequency Vulnerability Index (HFVI) is proposed, integrating frequency energy, encoder Jacobian sensitivity, and texture entropy into a unified measure of degradation susceptibility. Guided by HFVI, the system incorporates a selective zero-shot denoising module (P2PA) and a lightweight hybrid codec selector that determines, for each image, whether P2PA is necessary and selects the more reliable codec (a learning-based model or JPEG2000) accordingly, without retraining any compression backbones. Experiments on a 200,000-image cross-domain benchmark spanning general datasets, synthetic noise (eight levels), and real-noise datasets demonstrate that the proposed pipeline improves PSNR by up to 1.28 dB, raises SSIM by 0.02, reduces LPIPS by roughly 0.05, and decreases the failure-case rate by 6.7% over the best baseline (Joint-IC). Additional intensity-profile and cross-validation analyses further validate the robustness and deployment readiness of the method, showing that the hybrid selector provides a practical path toward reliable, noise-adaptive deep image compression.

1. Introduction

Deep learning-based image compression models, built on variational autoencoders (VAEs) [1,2], convolutional autoencoders (CAEs) [3,4], and, more recently, transformer-based methods [5,6], have made remarkable progress in recent years, often surpassing traditional codecs such as JPEG [7], JPEG2000 [8], and WebP [9] at the same bitrate. However, despite impressive average results, some images containing strong noise or complex high-frequency textures may experience significant quality degradation after compression, in some cases falling below the performance of conventional codecs. Such failure cases have received limited attention in the literature as most evaluations are carried out on clean and domain-general benchmark datasets [10,11,12,13]. Notably, Brummer [14] highlighted that commonly used training datasets often ignore the possibility of sensor noise, quantization artifacts, or environmental interference, which are prevalent in practical scenarios such as satellite imaging, aerial surveillance, or low-light photography.
Prior studies have focused mainly on improving overall rate-distortion trade-offs by refining autoencoders [15], entropy models [16], or learned quantization strategies [17]. Robustness to noise and generalization to high-frequency inputs have rarely been analyzed in depth. Some works attempt joint training with denoising or domain adaptation [18,19], but these approaches require costly retraining and large amounts of paired data. A systematic investigation into when and why learned codecs fail, together with a lightweight solution that enhances robustness without retraining, is still missing.
This indicates that the superiority of neural compression is not universal, and in certain practical domains, blind deployment of deep models may result in suboptimal performance and inefficient bandwidth utilization. This issue is particularly critical in satellite image transmission, where noise can be introduced by cosmic radiation, sensor defects, or compression under unstable onboard conditions [20]. Furthermore, random and coherent noise affecting astronomical or remote sensing images severely impairs reconstruction quality [21]. If such noisy images are passed through fragile deep compression pipelines, the resulting reconstructions at the ground station may suffer severe quality loss, potentially undermining downstream tasks. Such studies emphasize that reliability and guaranteed reconstruction fidelity can outweigh average compression efficiency in real-world satellite applications. In contrast, traditional codecs, though less efficient on average, exhibit consistent and predictable performance across varying image conditions, making them preferable in worst-case scenarios. Our earlier study [22] was the first to report that traditional codecs like JPEG2000 can surpass neural codecs in some regions and provided a mechanistic explanation for this phenomenon. Building upon this insight, the present work further investigates the causes of such failures under noisy conditions and proposes an adaptive framework to mitigate them.
This work aims to systematically analyze the failure behavior of deep image compression models under noisy, high-entropy conditions and to propose an adaptive solution for robust reconstruction. The main contributions of this work are as follows:
  • We identify, characterize, and quantify the universal vulnerability of deep image compression models to high-noise, high-entropy, and high-frequency images, which is consistent across different model architectures.
  • The High-Frequency Vulnerability Index (HFVI) is proposed as a metric that integrates frequency energy, encoder Jacobian sensitivity, and texture entropy to enable compression failure prediction.
  • A noise-aware hybrid compression pipeline is designed that integrates zero-shot denoising with a lightweight selector, selectively improving compression quality without retraining.
  • Experiments on synthetic-noise and real-noise (NIND) datasets, together with cross-validation and intensity-profile studies, validate both the effectiveness and practicality of the approach.
Despite the excellent average performance of deep learning-based image compression methods, our findings identify a critical failure mode: noisy, high-entropy images lead to significant reconstruction degradation. Importantly, traditional codecs like JPEG2000 maintain greater robustness under such edge cases. Rather than criticizing deep models, we propose a hybrid strategy that enables reliable performance through lightweight preprocessing and adaptive failure prediction.

2. Related Work

2.1. Image Compression and Learning-Based Codecs

Classical image compression standards such as JPEG [7] and JPEG2000 [8] adopt hand-crafted transforms (DCT, wavelets) followed by quantization and entropy coding. More recent standards, such as BPG (HEVC intra) [23], AVIF (AV1 intra) [24] and VVC [25], leverage advanced prediction modes and context-adaptive binary arithmetic coding, achieving significantly improved rate-distortion performance.
Learning-based models have evolved from early convolutional autoencoders to VAE-style hierarchical latent compression, e.g., Ballé et al. [2], and Cheng et al. [26], who introduced a variational framework with scale hyperpriors to enable learned end-to-end rate-distortion optimization. Transformer-based codecs [5,6] further enhance modeling capacity through attention and long-range dependencies, enabling strong entropy estimation and scalable compression. JPEG AI [27] adopts a learned end-to-end architecture with high coding efficiency but remains sensitive to real noise. Recent dictionary-based [16], perceptual-based [28] and diffusion-based codecs [29] aim to improve reconstruction quality under large compression ratios.
Despite remarkable progress, these learning-based codec variants consistently fail on noisy or high-entropy images, exposing a systemic vulnerability that is the key issue addressed in this work.

2.2. Lossy Compression of Noisy Images

The lossy compression of noisy images was studied well before the deep learning era. Early work [30] focused on optimizing quantization for noisy signals, adjusting distortion penalties to avoid amplifying or preserving noise. Chang et al. proposed an adaptive wavelet-thresholding scheme that jointly targets denoising and compression by selecting subband-specific thresholds in a Bayesian framework [31]. These classical approaches typically assume known or parameterizable noise statistics and do not operate on learned latent representations.
Brummer et al. [32] showed that training compression models with mixed noisy/clean images (Natural Image Noise Dataset [14]) improves rate-distortion performance for both input types. Cheng et al. [18] proposed a joint image compression-denoising framework that unifies quantization and noise suppression into a single trainable pipeline, outperforming separate denoising-compression pipelines in efficiency. Cai et al. [19] developed an SNR-aware joint compression-denoising framework with an end-to-end trainable network for simultaneous task handling.
Joint models demonstrate that compression and denoising can be synergistic. However, these approaches typically require retraining with paired noisy-clean datasets and involve additional training cost. Moreover, their effectiveness is often limited to the noise distributions observed during training, leaving open the challenge of generalization to unseen noise or domain-specific degradations.
Compared to these techniques, our work focuses on deep codecs, where noise disrupts latent distributions and entropy models rather than merely degrading pixel-level distortion. Real sensor noise, especially ISO-varying or structured noise, is far more complex than the synthetic AWGN assumed by classical methods. Thus, our problem setting addresses the fundamental structural fragility of learned entropy models.

2.3. Image Denoising and Zero-Shot Denoising

Image denoising has advanced from wavelet shrinkage [33] and non-local means (NLM) [34] to deep CNN models like DnCNN [35] and UNet denoisers [36]. Chang [37] introduced a multi-scale differentiated denoising network with spatial–spectral co-operative attention, demonstrating that exploiting both spatial texture continuity and spectral correlation is crucial for effective noise suppression in satellite imagery. Such observations reinforce that noise handling is not optional but fundamental when dealing with real-world satellite image compression.
More recently, zero-shot denoising (ZSD) demonstrated that image-specific statistics can be exploited without clean ground truth. Methods like Pixel2Pixel [38], Neighbor2Neighbor [39], and Noise2Noise [40] enable zero-reference enhancement and adapt to unknown noise distributions.
In the context of compression, ZSD offers an appealing property: it avoids dataset-level training and instead adapts to each image individually. However, unconditional denoising may oversmooth textures or distort low-frequency content, which harms compression quality. Our hybrid design leverages ZSD selectively, only when it is predicted to improve the final rate-distortion-perception performance.

2.4. Selection of Datasets

Although deep learning-based codecs achieve strong average performance, their inconsistent behavior across image types remains insufficiently studied. In particular, failures on noisy or high-frequency images are often overlooked due to the dominance of clean, curated benchmarks. Existing joint denoising-compression approaches show promise but depend on extensive paired data and lack flexibility for unseen noise or new domains.
To assess robustness under diverse conditions, we construct a large evaluation set of over 200,000 images spanning natural photographs, multi-resolution aerial/satellite scenes (RGB images, not SAR or multispectral/hyperspectral), and noise-specific datasets with distinct spatial structures and degradations. This diversity enables complementary assessments across clean benchmarks, natural variability, realistic sensor noise, and domain-specific conditions. Dataset choices and motivations are summarized in Table 1.
Initial tests on standard datasets show that learned codecs outperform traditional methods by over 3 dB PSNR on average. However, around 0.1% of images are better reconstructed by JPEG2000, revealing a vulnerability of neural codecs under adverse conditions.
To further probe this weakness, we include real-world noisy datasets (NIND [14], SIDD [43,44]) and observe a notable increase in failure cases. These datasets contain noise arising from mismatched scene conditions (insufficient illumination, improper aperture settings, and sensor limitations), reflecting realistic acquisition artifacts. Due to the scarcity of real noisy data, evaluations are conducted on both real and synthetic noise to ensure generality and validate the proposed approach.

3. Methodology and Proposed Algorithm

3.1. Noise Synthesis

Real-world noisy datasets are often limited in scene diversity and noise variability due to the high cost of large-scale data collection [45]. In contrast, synthetic data can be generated efficiently and allows precise control over noise levels. To simulate realistic sensor noise, we adopt a signal-dependent noise synthesis strategy inspired by physical image acquisition models, similar to the strategy of [46]. Specifically, we assume the observed linear raw signal $\tilde{y}_p$ at pixel $p$ is corrupted by heteroscedastic Gaussian noise:
$$\tilde{y}_p \sim \mathcal{N}\!\left(y_p,\; \sigma_s y_p + \sigma_r^2\right)$$
where $y_p \in [0,1]$ denotes the ground-truth noise-free linear signal, and $\sigma_s$ and $\sigma_r$ are the shot and readout noise parameters, respectively, controlling the signal-dependent and signal-independent noise levels. Since most image datasets are provided in sRGB space, we first apply an inverse gamma correction $\Gamma^{-1}$ to map sRGB images $x$ to linear raw space:
$$y = \Gamma^{-1}(x) = \begin{cases} x/m, & x \le mb \\[4pt] \left(\dfrac{x+a}{1+a}\right)^{\gamma}, & \text{otherwise} \end{cases}$$
with constants $a = 0.055$, $b = 0.0031308$, $m = 12.92$, and $\gamma = 2.4$. After noise is added in the linear domain, the image is transformed back to sRGB via the forward gamma mapping $\Gamma(\cdot)$, yielding the noisy sRGB image $\tilde{x} = \Gamma(\tilde{y})$. This formulation allows us to synthesize multiple levels of realistic noise by varying $\sigma_s$ and $\sigma_r$, enabling controlled training and evaluation of denoising models under diverse noise conditions.
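For concreteness, the following minimal NumPy sketch implements the noise model and gamma mappings above; the function names and the clipping behavior are our illustrative choices, not part of a released implementation.

```python
import numpy as np

A, B, M, GAMMA = 0.055, 0.0031308, 12.92, 2.4  # sRGB constants defined above

def inverse_gamma(x):
    """Gamma^{-1}: map sRGB values in [0, 1] to linear raw space."""
    return np.where(x <= M * B, x / M, ((x + A) / (1 + A)) ** GAMMA)

def forward_gamma(y):
    """Gamma: map linear raw values back to sRGB space."""
    return np.where(y <= B, M * y, (1 + A) * np.maximum(y, 0) ** (1 / GAMMA) - A)

def synthesize_noise(x_srgb, sigma_s, sigma_r, rng=None):
    """Add heteroscedastic Gaussian noise in linear raw space."""
    rng = np.random.default_rng() if rng is None else rng
    y = inverse_gamma(x_srgb)                   # clean linear signal in [0, 1]
    std = np.sqrt(sigma_s * y + sigma_r ** 2)   # signal-dependent std. dev.
    y_noisy = np.clip(y + rng.standard_normal(y.shape) * std, 0.0, 1.0)
    return np.clip(forward_gamma(y_noisy), 0.0, 1.0)

# Example: Noise Level 4 from Section 4.4 (sigma_s = 0.01, sigma_r = 0.005)
# x_noisy = synthesize_noise(x_clean, sigma_s=0.01, sigma_r=0.005)
```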

3.2. Compression Failure Characterization

As noted above, deep models fail to preserve crucial details when encountering high-entropy, high-frequency, or noisy inputs. Let $x \in \mathbb{R}^{H \times W \times C}$ denote an input image, and let $E(\cdot)$ and $D(\cdot)$ denote the encoder and decoder of a learned compression model. The encoder produces a latent mean $\mu(x)$ and latent samples $\hat{z} \sim q(z \mid x)$, which are entropy-coded under a learned prior $p(z)$. For many practical codecs, the rate-distortion objective can be expressed as follows:
$$\mathcal{L}(x) = \mathbb{E}_{\hat{z} \sim q(z \mid x)}\!\left[-\log p(\hat{z})\right] + \lambda\, d\!\left(x, D(\hat{z})\right)$$
where $d(\cdot,\cdot)$ is a distortion metric and $\lambda$ balances rate and distortion. The present work focuses on images $x$ for which the learned pipeline exhibits anomalous degradation under noise or chaotic high-frequency content. To operationalize this notion, we quantify the gap between a deep codec and a robust traditional baseline (JPEG2000). For any image $x$, we define the following:
$$\Delta \mathrm{PSNR}(x) = \mathrm{PSNR}_{\mathrm{DL}}(x) - \mathrm{PSNR}_{\mathrm{JP2K}}(x)$$
A sample is labeled as a failure case when $\Delta \mathrm{PSNR}(x) \le -0.5$ dB. This threshold is not arbitrary; it is determined empirically through a threshold-sweeping analysis reported in Section 4.2, where AUC, Precision@Top25, and Recall@Top25 were evaluated across multiple candidate thresholds. The results show that 0.5 dB provides the best trade-off between capturing genuinely degraded cases and avoiding false positives in borderline or near-parity scenarios.
Although such failure cases constitute less than 1% of several large datasets [10,11,41,42], they correlate strongly with specific image properties, indicating that they are systematic rather than random outliers. To better illustrate the qualitative difference between deep and classical codecs, we selected three high-entropy, high-frequency images from the COCO dataset [41] and compressed them using three representative deep learning-based models and three traditional codecs at approximately equal bitrates.
The high-frequency residuals, visualized in Figure 1, are defined as the difference between a high-pass-filtered input and its corresponding reconstruction. Deep learning codecs (DCAE [16], Qres [15], Bmshj [2]) exhibit stronger suppression of high-frequency components, producing smoother textures and blurred edges. In contrast, JPEG2000 [8] and WebP [9] preserve more of the original high-frequency content, appearing noisier but structurally more faithful. These observations motivate the need for deeper analysis and targeted metrics to characterize and predict such failure scenarios.
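As a sketch of how such residual maps can be produced (one plausible reading of the definition above, comparing the high-pass components of the input and its reconstruction, with an illustrative Gaussian filter width):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def high_freq_residual(x, x_hat, sigma=1.5):
    """Difference between the high-pass components of the input x and its
    reconstruction x_hat; sigma is an illustrative filter width."""
    high_pass = lambda im: im - gaussian_filter(im, sigma)
    return high_pass(x) - high_pass(x_hat)
```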

3.3. High-Frequency Vulnerability Index (HFVI)

To further formalize and quantify the susceptibility of images to deep compression failure, we propose a composite vulnerability metric: the High-Frequency Vulnerability Index (HFVI). This index captures three key aspects contributing to failure: high-frequency content, encoder sensitivity to noise, and texture complexity. HFVI is defined as follows:
$$\mathrm{HFVI}(x) = \alpha \cdot E_{\mathrm{high}}(x) + \beta \cdot J(x) + \gamma \cdot H(x)$$
where $E_{\mathrm{high}}(x) = \lVert x - G_\sigma * x \rVert^2$ is the high-frequency energy of the image, with $G_\sigma$ a Gaussian blur filter; it can equivalently be computed from weighted DCT coefficients over non-overlapping $P \times P$ patches (default $P = 32$) or from high-pass-filtered residuals.
$J(x) = \left\lVert \frac{\partial \mu(x)}{\partial x} \right\rVert_F$ denotes the Jacobian norm [47] of the encoder's latent mean with respect to the input image, indicating how sensitive the latent representation is to input perturbations, where $\mu(x)$ is the encoder's latent mean output and $\lVert \cdot \rVert_F$ is the Frobenius norm. Direct computation of $J(x)$ for full-resolution images and large latent dimensionality is computationally expensive. Therefore, 50 patches $x_p$ are sampled from each image, automatic differentiation is used to compute $\lVert \nabla_{x_p} \mu(x_p) \rVert_F$, and the results are aggregated by their mean. This yields accurate local sensitivity at acceptable cost for offline evaluation.
$H(x) = -\sum_i p_i \log p_i$ is the texture entropy [48], computed as the mean local Shannon entropy over $P \times P$ patches (local histogram entropy or gradient entropy), which captures irregular structural complexity.
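A minimal sketch of the three components is given below, assuming a grayscale image as a float array in [0, 1] and, for $J(x)$, a differentiable PyTorch encoder returning the latent mean; the single-probe Hutchinson estimate is our illustrative surrogate for the exact Frobenius norm.

```python
import numpy as np
import torch
from scipy.ndimage import gaussian_filter

def high_freq_energy(x, sigma=1.5):
    """E_high(x): energy of the Gaussian high-pass residual."""
    return float(np.mean((x - gaussian_filter(x, sigma)) ** 2))

def texture_entropy(x, patch=32, bins=256):
    """H(x): mean local Shannon entropy over non-overlapping P x P patches."""
    ents = []
    for i in range(0, x.shape[0] - patch + 1, patch):
        for j in range(0, x.shape[1] - patch + 1, patch):
            hist, _ = np.histogram(x[i:i + patch, j:j + patch],
                                   bins=bins, range=(0.0, 1.0))
            p = hist[hist > 0] / hist.sum()
            ents.append(float(-np.sum(p * np.log2(p))))
    return float(np.mean(ents))

def jacobian_norm(encoder, x, n_patches=50, patch=64, seed=0):
    """J(x): mean Frobenius norm of d mu / d x over sampled patches.
    Each patch uses one random probe v, since E||J^T v||^2 = ||J||_F^2."""
    rng = np.random.default_rng(seed)
    h, w = x.shape[-2:]
    sq_norms = []
    for _ in range(n_patches):
        i, j = rng.integers(0, h - patch), rng.integers(0, w - patch)
        xp = x[..., i:i + patch, j:j + patch].clone().requires_grad_(True)
        mu = encoder(xp)                                # latent mean of patch
        v = torch.randn_like(mu)
        (g,) = torch.autograd.grad((mu * v).sum(), xp)  # g = J^T v
        sq_norms.append(float((g ** 2).sum()))
    return float(np.sqrt(np.mean(sq_norms)))
```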
The weights $\alpha$, $\beta$, $\gamma$ determine the relative influence of the three components and reflect how strongly each factor contributes to compression failure. Before learning the weights, $E_{\mathrm{high}}(x)$, $J(x)$, and $H(x)$ are normalized to zero mean and unit variance across the training set to prevent scale dominance. Then a logistic regression model is fitted:
$$P(\mathrm{failure} \mid x) = \sigma\!\left(\alpha E_{\mathrm{high}}(x) + \beta J(x) + \gamma H(x) + b\right)$$
Using the failure labels defined by $\Delta \mathrm{PSNR}(x) \le -0.5$ dB (validated in Section 4.2), the learned coefficients naturally become the calibrated values of $\alpha$, $\beta$, and $\gamma$. In this formulation, a high $\alpha$ indicates that frequency chaos is a dominant cause of failures, a high $\beta$ implies that latent sensitivity is the most decisive factor, and a high $\gamma$ corresponds to structural overload in texture-rich images.
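The calibration step is a standard logistic fit; a minimal scikit-learn sketch follows (the feature and label files are hypothetical placeholders).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

feats = np.load("hfvi_features.npy")    # (N, 3): [E_high, J, H] per image
labels = np.load("failure_labels.npy")  # 1 if Delta_PSNR <= -0.5 dB, else 0

scaler = StandardScaler().fit(feats)             # zero mean, unit variance
clf = LogisticRegression().fit(scaler.transform(feats), labels)

alpha, beta, gamma = clf.coef_[0]                # calibrated HFVI weights
hfvi_scores = scaler.transform(feats) @ clf.coef_[0]  # scalar HFVI per image
```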
HFVI provides a scalar index for each image, allowing direct comparison of its vulnerability to compression degradation. Moreover, HFVI helps to explain why a model fails: latent instability (high $J(x)$), structural overload (high $H(x)$), or frequency corruption (high $E_{\mathrm{high}}(x)$). In the next section, this index is used to guide selective preprocessing strategies, and its predictive power is validated using classification models.

3.4. Failure Mechanisms in Deep Image Compression

While deep image compression models, ranging from VAE-based frameworks to entropy-coded CNNs and even diffusion models, have achieved remarkable performance on standard benchmarks, their robustness degrades substantially under high-frequency noise or chaotic textures. This vulnerability is not unique to any single architecture but rather stems from a shared fragility in the latent representations used for compact encoding. End-to-end deep compression models typically transform the image into a latent representation, and the quality of this transform is critical to rate-distortion performance. Let $x$ denote an input image and $\mu(x)$ the mean of its latent representation produced by an encoder. Under corruption by noise $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, the observed image becomes
$$x' = x + \epsilon$$
The latent mean perturbation can be linearized as follows:
$$\mu(x') = \mu(x) + J(x)\,\epsilon + O\!\left(\lVert \epsilon \rVert^2\right)$$
where $J(x) = \partial \mu(x) / \partial x$ is the Jacobian of the encoder output with respect to the input. This first-order approximation shows that noise is amplified proportionally to the local Jacobian norm $\lVert \partial \mu(x) / \partial x \rVert_F$ introduced in Section 3.3. In structurally unstable regions or high-frequency inputs, $J(x)$ is large, and the perturbation in latent space becomes significant. The induced latent perturbation magnitude satisfies
$$\mathbb{E}\,\lVert \mu(x+\epsilon) - \mu(x) \rVert^2 \approx \mathbb{E}\,\lVert J(x)\,\epsilon \rVert^2 \le \lVert J(x) \rVert_F^2 \cdot \mathbb{E}\,\lVert \epsilon \rVert^2$$
Hence, large Jacobian norms amplify pixel noise into substantial latent-domain perturbations. High-frequency energy and texture entropy further compound these effects by increasing the intrinsic unpredictability of local patches, making compact latent summarization harder and increasing sensitivity to small perturbations. The deep model’s standard compression objective can be expressed as follows:
$$\mathcal{L}(x) = \underbrace{\mathbb{E}_{q(z \mid x)} \lVert x - \hat{x}(z) \rVert^2}_{\text{Distortion}} + \beta \cdot \underbrace{D_{\mathrm{KL}}\!\left(q(z \mid x) \,\Vert\, p(z)\right)}_{\text{Rate}}$$
The compression relies on regularizing the latent distribution q ( z | x ) toward a prior p ( z ) . Under noisy inputs, both the reconstruction term and the KL divergence increase:
$$\Delta \mathcal{L} = \Delta_{\text{Distortion}} + \lambda \cdot \Delta_{\mathrm{KL}}$$
For images with simultaneously high $E_{\mathrm{high}}(x)$, $J(x)$, and $H(x)$, this latent-space perturbation degrades compression quality and destabilizes entropy coding. A key mechanism of failure lies in the amplification of small perturbations through the encoder into the latent space, resulting in disproportionate degradation of structural fidelity and entropy model mismatch. This motivates the hybrid strategy proposed in the following sections.
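This amplification can also be probed empirically; the sketch below, assuming a differentiable PyTorch encoder that returns the latent mean, estimates the ratio of latent perturbation energy to injected noise energy.

```python
import torch

def latent_amplification(encoder, x, sigma=0.05, trials=32):
    """Monte-Carlo estimate of E||mu(x+eps) - mu(x)||^2 / E||eps||^2."""
    with torch.no_grad():
        mu0 = encoder(x)
        ratios = []
        for _ in range(trials):
            eps = sigma * torch.randn_like(x)
            d = encoder(x + eps) - mu0
            ratios.append(float((d ** 2).sum() / (eps ** 2).sum()))
    # In the linear regime, E||J eps||^2 = sigma^2 ||J||_F^2 while
    # E||eps||^2 = sigma^2 * n, so this ratio tracks ||J(x)||_F^2 / n
    # and rises sharply on high-frequency or noisy inputs.
    return sum(ratios) / len(ratios)
```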

3.5. Zero-Shot Denoising as Preprocessing

To enhance the robustness of image compression under noisy conditions, we adopt a zero-shot, pixel-wise denoising strategy inspired by Pixel2Pixel [38] that requires no training on clean-noisy image pairs. Instead of applying denoising uniformly, a content-aware selector (rooted in Pixel2Pixel's fully pixel-wise, content-driven operation) determines whether pre-compression denoising is necessary: the denoising of each pixel depends only on its local structural patterns, gradient responses, and non-local self-similarities rather than on global assumptions or external training data, which avoids unnecessary computation and preserves structural details. We refer to this adapted framework as P2PA (Pixel2Pixel-Aware).
As shown in Figure 2, given a noisy image $Y \in \mathbb{R}^{H \times W \times C}$, a spatial importance map $M(x)$ is computed to identify visually and compression-sensitive regions, avoiding unnecessary denoising in flat regions and focusing on structure-preserving areas that impact compression fidelity. It is defined as follows:
$$M(x) = \lambda_1 \cdot E_{\mathrm{high}}(x) + \lambda_2 \cdot G(x)$$
where $E_{\mathrm{high}}(x)$ is the high-frequency energy map obtained via Gaussian high-pass filtering, as introduced in Section 3.3, and $\lambda_1, \lambda_2$ are normalization or weighting factors. $G(x)$ is the gradient magnitude map computed using the Sobel operator:
$$G(x) = \sqrt{(\nabla_x x)^2 + (\nabla_y x)^2}$$
Pixels with high M ( x ) values correspond to regions where compression is most vulnerable to noise, so it naturally becomes the basis of the content-aware selector, determining the proportion of pixels to denoise and guiding the sampling of informative neighborhoods. Based on M ( x ) , we construct a selective denoising repository by mining non-local pixels with similar content and entropy responses. For each pixel x, we retrieve m candidate pixels:
$$\{y_1, y_2, y_3, \ldots, y_m\}$$
forming a pixel bank that captures the structural redundancy beneficial for reconstruction.
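A compact sketch of the importance map and the pixel-selection step follows; the mixing weights and the selection ratio are illustrative assumptions rather than the paper's tuned values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def importance_map(x, sigma=1.5, lam1=0.5, lam2=0.5):
    """M(x) = lam1 * E_high(x) + lam2 * G(x), each map min-max normalized."""
    e_high = (x - gaussian_filter(x, sigma)) ** 2        # high-pass energy map
    grad = np.hypot(sobel(x, axis=0), sobel(x, axis=1))  # Sobel magnitude G(x)
    norm = lambda m: (m - m.min()) / (m.max() - m.min() + 1e-8)
    return lam1 * norm(e_high) + lam2 * norm(grad)

def select_pixels(m_map, ratio=0.3):
    """Boolean mask of the top `ratio` fraction of compression-sensitive pixels."""
    return m_map >= np.quantile(m_map, 1.0 - ratio)
```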
As illustrated in Figure 3, denoising is then formulated as an image-specific optimization task rather than a dataset-driven supervised procedure. It estimates clean pixel values from noisy images by aggregating information from surrounding spatial and contextual regions, using a pixel-wise neural network that predicts each pixel independently:
$$\hat{x}_{i,j} = D_\theta\!\left(y_{\Omega_{i,j}}\right)$$
where Ω i , j denotes the receptive field (local context) around pixel ( i , j ) , and D θ is a lightweight neural network (typically multilayer perceptron-based) that is trained per image by minimizing the local reconstruction loss:
$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{x \in \Omega} \left\lVert D_\theta\!\left(p_1(Y)\right)(x) - p_2(Y)(x) \right\rVert^2$$
where $p_1(Y)$ and $p_2(Y)$ denote two independent noisy samplings drawn from the pixel bank, so the network learns to map one noisy observation toward another.
This zero-shot training strategy ensures that the denoising model adapts specifically to the noise characteristics of the input image, without requiring clean-noisy pairs. Moreover, sampling guided by M ( x ) ensures that structural and texture-rich regions receive focused enhancement, improving compression fidelity while avoiding oversmoothing in flat regions.
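To make the per-image optimization concrete, a simplified PyTorch sketch is shown below. The `pair_sample` callable stands in for the Pixel2Pixel pixel-bank sampler that yields two noisy views $p_1(Y)$, $p_2(Y)$ of the same pixels; the network width and step count are illustrative.

```python
import torch
import torch.nn as nn

class PixelMLP(nn.Module):
    """Lightweight pixel-wise denoiser: local context vector -> pixel estimate."""
    def __init__(self, ctx_size=9, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ctx_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, y_ctx):          # (N, ctx_size) -> (N, 1)
        return self.net(y_ctx)

def train_zero_shot(model, pair_sample, steps=2000, lr=1e-3):
    """Per-image optimization: predict one noisy view from the other."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        y1_ctx, y2_center = pair_sample()   # two noisy views of the same pixels
        loss = ((model(y1_ctx) - y2_center) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return model
```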

3.6. Noise-Aware Compression Selector

While zero-shot denoising improves performance in noisy regions, applying it indiscriminately adds computational overhead and can even degrade clean image quality. Therefore, a lightweight noise-aware compression selector is introduced to predict whether denoising is beneficial for a given image. Its operation is shown in Figure 4.
For each input image x, the selector module extracts a compact three-dimensional feature vector [ E h i g h ( x ) , J ( x ) , H ( x ) ] . These features are chosen based on the failure mechanism described in Section 3.3 and form the basis of the previously defined HFVI metric.
On a held-out training set, images are labeled as failure/non-failure by the criterion $\Delta \mathrm{PSNR} \le -0.5$ dB, and a binary classifier is trained on these labels under the decision rule below. During inference, the compact feature vector is computed for each image (HFVI components computed patch-wise and averaged).
The prediction module adopts an XGBoost-based decision classifier [49], chosen for its balance of interpretability, robustness to heterogeneous feature scales, and extremely low inference cost. The classifier consists of an ensemble of shallow decision trees (depth 3–6) trained over the 3-dimensional feature vector.
Training is performed using a standard binary logistic loss, where the classifier learns to associate the HFVI-derived feature vector with the probability of compression failure. To prevent overfitting, we apply 5-fold cross-validation and early stopping. The trained predictor outputs $S(f(x)) \in \{0, 1\}$, indicating whether denoising is required. At inference time, the selector evaluates each incoming image:
$$S(f(x)) = \begin{cases} 1, & \text{if denoising is required} \\ 0, & \text{otherwise} \end{cases}$$
If $S = 1$, P2PA denoising (Section 3.5) is applied to structurally important regions, followed by compression. If $S = 0$, the image bypasses the denoising stage and proceeds directly to the compression module, thereby avoiding unnecessary computation and preventing oversmoothing of clean images.
Because the selector operates entirely at the preprocessing and decision-making level, it does not interact with or modify the internal layers of the compression models. This allows the method to generalize across different codecs without retraining. The selector enhances robustness by safeguarding against failure cases while keeping clean-image performance unaffected, making it suitable for real-world deployments where noise levels vary unpredictably.
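A minimal sketch of the selector's training and inference with the XGBoost Python API is given below; the file names, tree count, and train/validation split are illustrative assumptions.

```python
import numpy as np
import xgboost as xgb

feats = np.load("hfvi_features.npy")    # (N, 3): [E_high, J, H]
labels = np.load("failure_labels.npy")  # failure labels per Section 3.2
split = int(0.8 * len(feats))           # illustrative hold-out split

clf = xgb.XGBClassifier(
    n_estimators=200, max_depth=4,      # shallow trees (depth 3-6)
    objective="binary:logistic",
    eval_metric="auc", early_stopping_rounds=20)
clf.fit(feats[:split], labels[:split],
        eval_set=[(feats[split:], labels[split:])])

def selector(feature_vec):
    """S(f(x)) = 1 -> apply P2PA denoising before compression."""
    return int(clf.predict(np.asarray(feature_vec).reshape(1, -1))[0])
```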

4. Experimental Results and Discussion

To comprehensively assess the performance of the compression algorithm, bits-per-pixel (BPP), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM) [50] are the commonly used evaluation metrics. All datasets used in this study consist of standard 8-bit three-channel RGB images, so the compression ratio (CR) can be directly derived from the reported bits-per-pixel.
PSNR is expressed in decibels (dB), and a higher PSNR value implies lower distortion. For an original image I and a compressed image I ^ with pixel value range [ 0 , 2 b 1 ] (where b is the bit depth), the formulas for MSE and PSNR are as follows:
$$\mathrm{MSE} = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} \left[ I(i,j) - \hat{I}(i,j) \right]^2$$
$$\mathrm{PSNR} = 10 \times \log_{10} \frac{(2^b - 1)^2}{\mathrm{MSE}}$$
SSIM evaluates image similarity from three dimensions: luminance, contrast, and structure. Its value ranges from 0 to 1, with values closer to 1 indicating better preservation of image structure. Let μ I and μ I ^ be the means of I and I ^ , σ I 2 and σ I ^ 2 be the variances, σ I I ^ be the covariance, and C 1 , C 2 be constants to avoid division by zero. The SSIM formula is expressed as follows:
$$\mathrm{SSIM} = \frac{(2\mu_I \mu_{\hat{I}} + C_1)(2\sigma_{I\hat{I}} + C_2)}{(\mu_I^2 + \mu_{\hat{I}}^2 + C_1)(\sigma_I^2 + \sigma_{\hat{I}}^2 + C_2)}$$
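For reference, the two fidelity metrics can be computed as follows (PSNR directly from the formulas above; SSIM via the scikit-image implementation of the same constants-stabilized form).

```python
import numpy as np
from skimage.metrics import structural_similarity

def psnr(img, ref, bit_depth=8):
    """PSNR in dB for integer images with the given bit depth."""
    mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
    return 10.0 * np.log10((2 ** bit_depth - 1) ** 2 / mse)

# For 8-bit RGB arrays:
# ssim = structural_similarity(ref, img, channel_axis=-1, data_range=255)
```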
In addition to traditional metrics, we introduce four perceptual and information-theoretic metrics to provide a more comprehensive evaluation of compression behavior under noise.
LPIPS [51] captures deep-feature perceptual differences using pretrained neural networks and is widely regarded as one of the most reliable indicators of human perceptual preference. Deep image structure and texture similarity (DISTS) [52] integrates structural and texture similarity to offer robustness in high-frequency and noise-dominated regions.
Complementarily, visual information fidelity (VIF) [53] evaluates the amount of visual information preserved relative to natural scene statistics, providing insight into how much signal content is retained after compression. Normalized Laplacian pyramid distance (NLPD) [54] measures multi-scale distortions and is particularly sensitive to noise amplification or suppression at different frequency bands.

4.1. Comparison of Traditional and Deep Compression Models on Clean and Noisy Data

To evaluate the performance degradation of image compression models under noise-corrupted conditions, we benchmark both traditional codecs (JPEG [7], WebP [9], JPEG2000 [8]) and deep learning-based models (Bmshj [2], Cheng [26], Qres [15], JPEG AI [27], FTIC [6], ICISP [28], DCAE [16]) on two test sets: a clean ground-truth dataset and a synthetically corrupted version with various noise.
Table 2 presents the compression results for each model on both datasets. All models are configured to achieve comparable bitrates on the clean images (approximately 1.8 BPP), while their bitrates on noisy images are measured under default entropy coding settings.
On clean images, all modern learned codecs (e.g., Qres [15], JPEG AI [27], FTIC [6]) outperform traditional codecs in PSNR and SSIM. FTIC achieves the highest performance due to its powerful transformer modeling. Under noisy conditions, deep codecs deteriorate significantly: LPIPS and DISTS increase sharply, indicating strong perceptual degradation despite high bitrates. JPEG AI exhibits an especially severe drop (PSNR falling to 15.77 dB and LPIPS > 0.36), showing that its learned priors are fragile to real noise. JPEG2000 remains the most robust traditional method, with a moderate increase in LPIPS and a minimal increase in DISTS. This confirms its ability to preserve high-frequency and noise-like structures. FTIC and ICISP, while strong on clean images, suffer noticeable degradation under noise due to transformer sensitivity (discussed in Section 4.4 below). Table 2 demonstrates that strong average performance on clean benchmarks does not guarantee robustness in the presence of noise.
Figure 5 presents the residual distributions and cumulative probability (CDF) curves of different codecs with respect to noisy and clean ground truths. For noisy ground truths, traditional codecs (especially JPEG2000) exhibit extremely narrow, sharply peaked residual distributions centered around zero, indicating preservation of noise patterns and faithful high-frequency reconstruction. In contrast, deep codecs display broader, heavier-tailed distributions, reflecting their inherent stochastic noise suppression and oversmoothing of textured or noisy regions. Consistently, the CDF curves confirm that over 90% of JPEG2000 and WebP residuals fall within ±5 gray levels for noisy data, compared to only 60–70% for deep models.
When referenced to clean ground truths, the trend reverses: the residual distributions of traditional codecs become slightly wider and asymmetric, while deep models maintain overall structural consistency. This observation indicates that deep models inherently act as implicit denoisers, producing smoother reconstructions that align more closely with noise-free content. Overall, when compared against clean data, all models retain approximately 65% of residuals within ±5 gray levels, yet the qualitative nature of these residuals (noise replication versus noise suppression) differs substantially between traditional and learned codecs.

4.2. Threshold Definition and Ablation of HFVI

Several candidate thresholds $t \in \{0.2, 0.3, 0.5, 1.0\}$ dB are analyzed to determine the value of the $\Delta \mathrm{PSNR}$ failure criterion. For each threshold, the area under the curve (AUC), Precision@Top25, and Recall@Top25 of the High-Frequency Vulnerability Index (HFVI) as a failure indicator are measured, with $\Delta \mathrm{PSNR}(x) \le -t$ adopted as the ground truth. As presented in Table 3, $t = 0.5$ dB achieves the highest AUC while retaining high precision and a sufficiently large number of failure samples, indicating the point at which failures become perceptually significant and statistically separable.
Importantly, smaller magnitudes (e.g., 0.2 dB) overestimate failure frequency and reduce predictor precision, while stricter thresholds (e.g., 1.0 dB) miss many visually significant degradations. The choice of 0.5 dB thus reflects the optimal operating point for building reliable training/validation sets for the failure predictor.
To quantify the individual contributions of the three HFVI components, high-frequency energy E, entropy-based texture complexity H, and Jacobian sensitivity J, we conducted a comprehensive ablation experiment across 400 representative images spanning clean, textured, noisy, and heavily noisy scenarios. The results are summarized in Table 4.
The high-frequency energy term E exhibits the strongest individual correlation with compression failure, achieving a Spearman coefficient of ρ = 0.58 . This confirms that noise-driven high-frequency amplification is the dominant factor causing deep compression degradation. The texture-entropy component H contributes moderate predictive power ( ρ = 0.34 ), while the Jacobian sensitivity J provides additional but weaker information ( ρ = 0.27 ).
When combined into partial HFVI variants (e.g., E + H , E + J ), predictive performance improves consistently, demonstrating complementary effects among the components. Notably, the full HFVI formulation surpasses all ablated variants, reaching the highest correlation ( ρ = 0.71 ), the best Kendall’s τ = 0.52 , and the best failure-prediction AUC of 0.47, which aligns with the inherent difficulty of distinguishing borderline failures around the 0.5 dB threshold. These results validate that all three components are necessary for robust fragility estimation and that the combined HFVI index provides the most reliable signal for predicting compression failure under high-entropy or noisy conditions.

4.3. Pixel-Level Intensity Profile Analysis

While quantitative comparisons in Section 4.1 demonstrate the overall rate-distortion and perceptual performance of different codecs, they do not reveal local structural behaviors. To further examine the spatial fidelity of edges and fine textures under noisy conditions, Figure 6 illustrates the intensity profile of various compression and hybrid denoising-compression strategies.
This analysis offers a pixel-level perspective on luminance variation preservation along structured regions (following [55]), providing an intuitive edge preservation and noise suppression measure beyond conventional PSNR/SSIM. The first profile shows clean-noisy image deviations, illustrating noise-induced local oscillations. Rows 2–4 depict direct compression of noisy images via JPEG2000 [8], Qres ( λ = 1024) [15], and DCAE ( β = 0.05) [16]: all maintain global signal structure, with JPEG2000 retaining high-frequency oscillations and learned codecs slightly attenuating sharp transitions (noise suppression with edge smoothing).
Rows 5 to 6 correspond to joint compression-denoising models DC [18] and Joint-IC [19], producing smoother profiles than direct compression and confirming integrated denoising capability. The final three rows present the proposed hybrid framework (zero-shot denoiser P2PA pre-applied to codecs): JPEG2000 (with P2PA), Qres 1024 (with P2PA), and DCAE 005 (with P2PA) curves closely align with the clean reference, showing minimal phase lag and reduced high-frequency noise.
Overall, the intensity-profile analysis reveals that traditional codecs (JPEG2000) tend to reproduce noise patterns along edges, whereas learned codecs inherently smooth noisy structures. The proposed P2PA guided variants achieve a desirable compromise, effectively suppressing stochastic noise while preserving edge continuity.

4.4. Compression Behavior Across Synthetic Noise and Real-World Scenarios

To simulate controlled degradation, we applied the noise model introduced in Section 3.1 to clean COCO images at eight discrete intensity levels (Noise Level 1 to 8), representing increasing deviations. Specifically, the eight levels are defined by shot noise parameters $\sigma_s \in \{0.001, 0.005, 0.008, 0.01, 0.02, 0.05, 0.1, 0.15\}$ paired with readout noise parameters $\sigma_r \in \{0.001, 0.002, 0.003, 0.005, 0.008, 0.01, 0.015, 0.02\}$. This results in a synthetic noise dataset with gradually worsening image quality.
Each noisy version was compressed using the following methods: pure compression models (JPEG2000 [8], Qres [15], JPEG AI [27], FTIC [6], ICISP [28], DCAE [16]), joint compression-denoising models (DC [18], Joint-IC [19]), and Pixel2Pixel-Aware preprocessing followed by compression (P2PA&DCAE, P2PA&Qres, P2PA&JPEG2000). Evaluation was conducted using two reference targets: the ground-truth clean image (to assess reconstruction fidelity) and the noisy input image (to assess frequency retention).
Figure 7a shows the PSNR, SSIM, LPIPS-VGG, and DISTS scores of each method across the eight noise levels, referenced to the original clean image. Under low noise levels (Levels 1–2), most methods, especially deep models such as Qres [15] and FTIC [6], achieved performance nearly identical to that on clean images. As noise intensifies, however, deep models degrade rapidly: PSNR/SSIM decrease substantially and LPIPS/DISTS increase sharply, indicating a shift from high-frequency distortions to perceptually significant structural artifacts. Transformer-based codecs (FTIC and ICISP) show even more aggressive degradation: attention maps become dominated by noise patches, leading to unstable token embeddings and poor entropy modeling, which is directly reflected in the large perceptual distortion curves.
We further evaluate the methods against the noisy input itself to examine their ability to retain high-frequency structures, whether clean or corrupted. Figure 7b shows JPEG2000 achieves exceptional robustness across all noise intensities. Its wavelet-domain energy compaction retains high-frequency information without collapsing under sensor noise, resulting in stable PSNR and the lowest LPIPS/DISTS growth. In contrast, deep learning codecs tend to suppress fine textures under noise, leading to significant deviation from the original input.
To assess performance under realistic conditions, we conduct experiments on the NIND [14] dataset, which includes image pairs captured at five increasing ISO levels, each introducing naturally varying ISO characteristics. These images serve as a realistic test to evaluate generalization under non-synthetic noise.
Figure 8 plots the four-score curves for all models across ISO levels. At low ISO (cleaner inputs), all methods perform similarly. JPEG AI exhibits a pronounced failure trend as noise increases, particularly on real noise datasets. Its learned entropy model is strongly biased towards clean image statistics and struggles to encode the unpredictable, signal-dependent noise present in NIND. As noise amplitude grows, the reconstructions of JPEG AI exhibit severe blocky and mosaic artifacts, causing the codec to break down entirely on high-ISO images. This degradation invalidates normal perceptual trends and results in a sharp drop in PSNR/SSIM while LPIPS/DISTS experience abnormal spikes, indicating a complete failure of the compression pipeline under real-noise conditions.
Joint compression-denoising models (DC, Joint-IC), trained on synthetic noise, show poor generalization to real noise, especially at ISO > 1600, where performance drops below 20 dB. This suggests a critical limitation: jointly trained models lack adaptability to unseen real-world noise distributions. In particular, JPEG2000 always achieves the best results when evaluated against the original noisy input image in Figure 7b and Figure 8b, reflecting its excellent ability to preserve structural and frequency information, including noisy components. Traditional wavelet-based coding thus remains significantly more stable under noise, while learned codecs, even high-capacity transformer models, lack robustness to high-entropy perturbations. These observations motivate the hybrid denoising-compression strategies explored in the following sections.
Figure 9 presents a visual and quantitative comparison of various compression pipelines on both synthetic and real-world noisy images. Two COCO samples with medium synthetic noise (Noise Level 4) and three real-world noisy slices from the NIND dataset (ISO 6400) are shown. CR denotes the compression ratio. For three-channel RGB images, the bits-per-pixel value is expanded to account for all three channels and eight bits per channel, and the compression ratio is defined as the inverse of this total bit allocation.
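Concretely, under this convention the ratio follows directly from the measured bitrate:
$$\mathrm{CR} = \frac{3 \times 8}{\mathrm{bpp}}, \qquad \text{e.g.,}\ \mathrm{bpp} = 1.2 \;\Rightarrow\; \mathrm{CR} = 20{:}1$$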
From left to right, each row includes the clean ground truth, the noisy input, and reconstructions from DC, Joint-IC, P2PA&JPEG2000, P2PA&Qres, and P2PA&DCAE. The lower panels zoom into key details and report PSNR, SSIM, LPIPS-Alex, and LPIPS-VGG against the clean ground truth. On COCO samples, P2PA&Qres achieves the best SSIM and perceptual quality (LPIPS-VGG), while P2PA&DCAE yields the best PSNR and LPIPS-Alex scores. On the NIND dataset, all three P2PA-based pipelines outperform joint compression-denoising models by 1.0 to 2.6 dB in PSNR, confirming their superior robustness to real-world noise.
Therefore, the joint denoising and compression strategy is insufficient for all images. Instead, a hybrid pipeline combining noise-level awareness, selective zero-shot denoising, and codec adaptation offers the best trade-off between structural fidelity and robustness, especially in real-world transmission and storage systems.

4.5. Experiments of Dynamic Denoising Decisions

The stark contrast between low- and high-noise performance across methods demonstrates the need for noise-aware compression strategies. Applying denoising indiscriminately is suboptimal; instead, systems should dynamically decide whether to apply denoising based on image-level features. Table 5 reports the reconstruction quality of various compression models (evaluated against clean ground truth), including joint denoising-compression models, pure compression codecs, always-denoise zero-shot pipelines without classifier guidance, and the proposed hybrid method. The average compression rate of these algorithms is about 1 bpp. The test set comprises half artificial-noise images (noise levels 1 to 8 in equal proportions) and half images from the real-noise dataset NIND [14] (with the different ISO levels in equal proportions).
Joint models (DC and Joint-IC) achieve moderate fidelity, but their denoising modules inevitably oversmooth fine textures, leading to limited PSNR/SSIM gains and relatively high perceptual distortion across LPIPS and NLPD. Pure compression presents a broader range of behaviors: classical JPEG2000 remains stable despite lower PSNR, while some learned models achieve higher pixel-wise fidelity but exhibit elevated perceptual distortion and multi-scale errors. JPEG AI suffers the most significant degradation on real noise, producing extremely low PSNR and high NLPD because its learned priors are mismatched to real-world noise characteristics. Transformer-based codecs (FTIC) also show pronounced robustness issues, with attention mechanisms amplifying noise and yielding inconsistent or oversuppressed reconstructions.
Always-denoise pipelines mitigate some failure modes but exhibit clear drawbacks: unconditional denoising removes noise-independent details and often increases LPIPS and NLPD relative to pure compression. In contrast, the proposed hybrid method selectively applies denoising only when beneficial, achieving a 1.28 dB gain in PSNR over the joint denoising-compression baseline, lower LPIPS scores, and the best overall performance across fidelity and perceptual metrics.
It is worth noting that these codecs in the experiments remain among the strongest compression models to date, each with distinct strengths depending on image content. The classifier proposed here focuses solely on denoising decision-making, rather than model selection. The choice of downstream codec should still be based on task-specific requirements, such as entropy constraints, hardware compatibility, or reconstruction fidelity priorities.

5. Conclusions

This work highlights a fundamental yet often overlooked limitation of modern learned image compression: deep codecs exhibit systematic vulnerability to high-entropy, high-frequency, and noisy inputs. While achieving excellent performance on clean images, their latent distributions and entropy models become unstable under noise, leading to reconstruction errors that traditional codecs largely avoid. To quantify and mitigate this issue, we introduced the High-Frequency Vulnerability Index (HFVI), which integrates frequency energy, structural entropy, and encoder sensitivity into a single interpretable measure of compression fragility. Building on this, we designed a noise-aware hybrid compression framework that selectively applies zero-shot denoising (P2PA) using a lightweight, HFVI-guided selector. This strategy requires no modification to existing codecs and consistently improves rate-distortion performance across synthetic and real-noise datasets.
Experimental evaluations confirm that selective preprocessing is key to robust compression, achieving notable gains (up to 11 dB PSNR) while avoiding oversmoothing on clean or low-risk inputs. These results reinforce a more general insight introduced in this paper: image vulnerability should be analyzed before compression, and adaptive decisions can substantially enhance the stability of learned codecs in practical deployment (satellite imaging, remote sensing, and edge transmission).
Several directions remain open. Future work may extend the analysis to multiplicative or signal-dependent noise, explore scenarios in which strong textures or extreme noise produce ambiguous decisions, and investigate learned or semantic-aware vulnerability estimators that complement HFVI. Moreover, transformer-based and diffusion-based codecs show distinct fragilities under noise. Understanding these model-specific behaviors may further improve hybrid compression design.
Overall, this study establishes a principled framework for diagnosing and alleviating noise-induced failures in deep image compression, paving the way toward more reliable compression systems.

Author Contributions

Conceptualization, L.Z. and Q.Z.; methodology, L.Z. and Q.Z.; software, L.Z.; validation, L.Z. and R.L.; formal analysis, L.Z. and L.H.; investigation, R.L.; resources, Q.Z.; data curation, L.H.; writing—original draft preparation, L.Z.; writing—review and editing, Q.Z.; visualization, L.Z.; supervision, J.L.; project administration, Y.Z.; funding acquisition, Q.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used and/or analyzed during the current study are publicly available and can be accessed without any restrictions. If needed, they can also be obtained by contacting the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhou, L.; Cai, C.; Gao, Y.; Su, S.; Wu, J. Variational autoencoder for low bit-rate image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2617–2620. [Google Scholar]
  2. Ballé, J.; Minnen, D.; Singh, S.; Hwang, S.J.; Johnston, N. Variational image compression with a scale hyperprior. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  3. Cheng, Z.; Sun, H.; Takeuchi, M.; Katto, J. Deep convolutional autoencoder-based lossy image compression. In Proceedings of the Picture Coding Symposium (PCS), San Francisco, CA, USA, 24–27 June 2018; pp. 253–257. [Google Scholar]
  4. Choi, Y.; El-Khamy, M.; Lee, J. Variable rate deep image compression with a conditional autoencoder. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3146–3154. [Google Scholar]
  5. Liu, J.; Sun, H.; Katto, J. Learned image compression with mixed transformer-CNN architectures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 14388–14397. [Google Scholar]
  6. Li, H.; Li, S.; Dai, W.; Li, C.; Zou, J.; Xiong, H. Frequency-aware transformer for learned image compression. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024. [Google Scholar]
  7. ISO/IEC 10918-1:1992; Information Technology-Digital Compression and Coding of Continuous-Tone Still Images: Requirements and Guidelines. ISO: Geneva, Switzerland, 1992.
  8. ISO/IEC 15444-1:2000; Information Technology-JPEG 2000 Image Coding System: Core Coding System. ISO: Geneva, Switzerland, 2000.
  9. Google Inc. WebP Compression Format. Available online: https://developers.google.com/speed/webp/ (accessed on 18 October 2025).
  10. Kodak True-Color Image Suite. Available online: http://r0k.us/graphics/kodak/ (accessed on 18 October 2025).
  11. Challenge on Learned Image Compression (CLIC). Available online: https://archive.compression.cc/2024/tasks/ (accessed on 18 October 2025).
  12. Kuznetsova, A.; Rom, H.; Alldrin, N.; Uijlings, J.; Krasin, I.; Pont-Tuset, J.; Kamali, S.; Popov, S.; Malloci, M.; Kolesnikov, A.; et al. The open images dataset V4: Unified image classification, object detection, and visual relationship detection at scale. Int. J. Comput. Vis. 2020, 128, 1956–1981. [Google Scholar] [CrossRef]
  13. Alshina, E.; Ascenso, J.; Ebrahimi, T. JPEG AI: The first international standard for image coding based on an end-to-end learning-based approach. IEEE MultiMedia 2024, 31, 60–69. [Google Scholar] [CrossRef]
  14. Brummer, B.; De Vleeschouwer, C. Natural image noise dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
  15. Duan, Z.; Lu, M.; Ma, Z.; Zhu, F. Lossy image compression with quantized hierarchical VAEs. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 198–207. [Google Scholar]
  16. Lu, J.; Zhang, L.; Zhou, X.; Li, M.; Li, W.; Gu, S. Learned image compression with dictionary-based entropy model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 13–15 June 2025; pp. 12850–12859. [Google Scholar]
  17. Ge, Z.; Ma, S.; Gao, W.; Pan, J.; Jia, C. NLIC: Non-uniform quantization-based learned image compression. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 9647–9663. [Google Scholar] [CrossRef]
  18. Cheng, K.L.; Xie, Y.; Chen, Q. Optimizing image compression via joint learning with denoising. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2022; Springer: Cham, Switzerland, 2022; pp. 56–73. [Google Scholar]
  19. Cai, S.; Liang, X.; Cao, S.; Yan, L.; Zhong, S.; Chen, L.; Ziu, X. Powerful lossy compression for noisy images. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6. [Google Scholar]
  20. Qian, S.E.; Bergeron, M.; Cunningham, I.; Gagnon, L.; Hollinger, A. Near lossless data compression onboard a hyperspectral satellite. IEEE Trans. Aerosp. Electron. Syst. 2006, 42, 851–866. [Google Scholar] [CrossRef]
  21. Zhang, J.; Zhang, H.; Shi, X.; Peng, X.; Geng, S. Compressed sensing for high-noise astronomical image recovery. J. Electron. Imaging 2019, 28, 053026. [Google Scholar] [CrossRef]
  22. Zhang, L.; Zhou, Q.; Liu, R.; Huyan, L.; Liu, J.; Zhang, Y. The blind spot of deep image compression: Why JPEG2000 still wins in high-entropy high-frequency regions. Chin. J. Stereol. Image Anal. 2025, 30, 187–197. [Google Scholar]
  23. Fabrice, B. Better Portable Graphics Image Format. Available online: https://bellard.org/bpg/ (accessed on 18 October 2025).
  24. Alliance for Open Media. AV1 Image File Format (AVIF). Available online: https://aomediacodec.github.io/av1-avif/ (accessed on 18 October 2025).
  25. Bross, B.; Chen, J.; Ohm, J.-R.; Sullivan, G.J.; Wang, Y.-K. Developments in international video coding standardization after AVC, with an overview of versatile video coding (VVC). Proc. IEEE 2021, 109, 1463–1493. [Google Scholar] [CrossRef]
  26. Cheng, Z.; Sun, H.; Takeuchi, M.; Katto, J. Learned image compression with discretized Gaussian mixture likelihoods and attention modules. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 7939–7948. [Google Scholar]
  27. International Telecommunication Union (ITU). Recommendation ITU-T T.840.1. JPEG AI Learning-Based Image Coding System (JPEG-AI). Available online: https://gitlab.com/wg1/jpeg-ai/jpeg-ai-reference-software.git (accessed on 18 October 2025).
  28. Wei, H.; Zhou, Y.; Jia, Y.; Ge, C.; Anwar, S.; Mian, A. A lightweight model for perceptual image compression via implicit priors. Neural Netw. 2025, 108279. [Google Scholar] [CrossRef]
  29. Zeng, F.; Tang, H.; Shao, Y.; Chen, S.; Shao, L.; Wang, Y. MambaIC: State space models for high-performance learned image compression. arXiv 2025, arXiv:2503.12461. [Google Scholar]
  30. Al-Shaykh, O.K.; Mersereau, R.M. Lossy compression of noisy images. IEEE Trans. Image Process. 1998, 7, 1641–1652. [Google Scholar] [CrossRef]
  31. Chang, S.G.; Yu, B.; Vetterli, M. Adaptive wavelet thresholding for image denoising and compression. IEEE Trans. Image Process. 2000, 9, 1532–1546. [Google Scholar] [CrossRef] [PubMed]
  32. Brummer, B.; De Vleeschouwer, C. On the importance of denoising when learning to compress images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 2440–2448. [Google Scholar]
  33. Donoho, D.L. De-noising by soft-thresholding. IEEE Trans. Inf. Theory 1995, 41, 613–627. [Google Scholar] [CrossRef]
  34. Buades, A.; Coll, B.; Morel, J.M. A review of image denoising algorithms, with a new one. Multiscale Model. Simul. 2005, 4, 490–530. [Google Scholar] [CrossRef]
  35. Zhang, K.; Zuo, W.; Chen, Y.; Meng, D.; Zhang, L. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Trans. Image Process. 2017, 26, 3142–3155. [Google Scholar] [CrossRef]
  36. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  37. Chang, X.; Wang, X.; Huang, X.; Yan, M.; Cheng, L. Multi-scale differentiated network with spatial-spectral co-operative attention for hyperspectral image denoising. Appl. Sci. 2025, 15, 8648. [Google Scholar] [CrossRef]
  38. Ma, Q.; Jiang, J.; Zhou, X.; Liang, P.; Liu, X. Pixel2pixel: A pixelwise approach for zero-shot single image denoising. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 4614–4629. [Google Scholar] [CrossRef]
  39. Huang, T.; Li, S.; Jia, X.; Lu, H.; Lu, J. Neighbor2neighbor: Self-supervised denoising from single noisy images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14781–14790. [Google Scholar]
  40. Mansour, Y.; Heckel, R. Zero-shot noise2noise: Efficient image denoising without any data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–23 June 2023; pp. 14018–14027. [Google Scholar]
  41. Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
  42. Machado, G.; Ferreira, E.; Nogueira, K.; Oliveira, H.; Brito, M.; Gama, P.H.T.; dos Santos, J.A. AiRound and CV-BrCT: Novel multiview datasets for scene classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 14, 488–503. [Google Scholar] [CrossRef]
  43. Abdelhamed, A.; Lin, S.; Brown, M.S. A high-quality denoising dataset for smartphone cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 1692–1700. [Google Scholar]
  44. Abdelhamed, A.; Afifi, M.; Timofte, R.; Brown, M.S.; Cao, Y.; Zhang, Z.; Zuo, W.; Zhang, X.; Liu, J.; Chen, W.; et al. NTIRE 2020 challenge on real image denoising: Dataset, methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 496–497. [Google Scholar]
  45. Ignatov, A.; Kobyshev, N.; Timofte, R. Noise-aware training for deep image compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  46. Mildenhall, B.; Barron, J.T.; Chen, J.; Sharlet, D.; Ng, R.; Carroll, R. Burst denoising with kernel prediction networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 2502–2510. [Google Scholar]
  47. Finlay, C.; Jacobsen, J.H.; Nurbekyan, L.; Oberman, A.M. How to train your neural ODE: The world of Jacobian and kinetic regularization. In Proceedings of the International Conference on Machine Learning (ICML), Vienna, Austria, 12–18 July 2020; pp. 3154–3164. [Google Scholar]
  48. Shamir, L.; Wolkow, C.A.; Goldberg, I.G. Quantitative measurement of aging using image texture entropy. Bioinformatics 2009, 25, 3060–3063. [Google Scholar] [CrossRef]
  49. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
  50. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  51. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar]
  52. Ding, K.; Ma, K.; Wang, S.; Simoncelli, E.P. Image quality assessment: Unifying structure and texture similarity. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2567–2581. [Google Scholar] [CrossRef]
  53. Sheikh, H.R.; Bovik, A.C. A visual information fidelity approach to video quality assessment. First Int. Workshop Video Process. Qual. Metrics Consum. Electron. 2005, 7, 2117–2128. [Google Scholar]
  54. Laparra, V.; Ballé, J.; Berardino, A.; Simoncelli, E.P. Perceptual image quality assessment using a normalized Laplacian pyramid. Electron. Imaging 2016, 28, 1–6. [Google Scholar] [CrossRef]
  55. Diwakar, M.; Singh, P. CT image denoising using multivariate model and method noise thresholding in non-subsampled shearlet domain. Biomed. Signal Process. Control 2020, 101754. [Google Scholar]
Figure 1. Visual comparison of high-frequency residuals of high-entropy images from the COCO dataset under DCAE, Qres, Bmshj, WebP, JPEG2000, and JPEG compression at similar bit rates.
Figure 2. Non-local similar-pixel search and selective denoising repository construction.
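To make the repository construction of Figure 2 concrete, the following minimal Python sketch performs a non-local search for the k most similar patches around a reference pixel. The patch size, search window, and k are illustrative assumptions, not the paper's exact settings, and the pixel is assumed to lie in the image interior.

```python
# Illustrative non-local similar-pixel search (cf. Figure 2): collect
# offsets of the k patches most similar to the reference patch inside
# a square search window. Hyperparameters are placeholders.
import numpy as np

def find_similar_pixels(img, y, x, patch=3, window=10, k=8):
    # img: 2-D float array; (y, x): interior reference pixel.
    h, w = img.shape
    r = patch // 2
    ref = img[y - r:y + r + 1, x - r:x + r + 1]
    candidates = []
    for dy in range(-window, window + 1):
        for dx in range(-window, window + 1):
            cy, cx = y + dy, x + dx
            if (dy, dx) == (0, 0):
                continue
            if r <= cy < h - r and r <= cx < w - r:
                cand = img[cy - r:cy + r + 1, cx - r:cx + r + 1]
                dist = np.sum((ref - cand) ** 2)  # patch L2 distance
                candidates.append((dist, dy, dx))
    candidates.sort(key=lambda t: t[0])
    return [(dy, dx) for _, dy, dx in candidates[:k]]
```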
Figure 3. Zero-shot pixel-wise denoising via per-image optimization using compression-aware sampling.
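The per-image optimization of Figure 3 can be illustrated with a simplified zero-shot scheme in the spirit of zero-shot Noise2Noise [40]. Note that the actual P2PA module instead builds pixel-wise training pairs from the non-local repository of Figure 2; the network size, iteration count, and learning rate below are placeholders.

```python
# Minimal per-image zero-shot denoising sketch (not the exact P2PA
# procedure): a tiny CNN is fitted on two half-resolution sub-images
# of the single noisy input, whose noise is roughly independent.
import torch
import torch.nn as nn
import torch.nn.functional as F

def pair_downsample(img):
    # Split (1, C, H, W) into two half-resolution sub-images using
    # complementary diagonal 2x2 averaging kernels.
    k1 = torch.tensor([[[[0.5, 0.0], [0.0, 0.5]]]], device=img.device)
    k2 = torch.tensor([[[[0.0, 0.5], [0.5, 0.0]]]], device=img.device)
    c = img.shape[1]
    k1, k2 = k1.repeat(c, 1, 1, 1), k2.repeat(c, 1, 1, 1)
    return (F.conv2d(img, k1, stride=2, groups=c),
            F.conv2d(img, k2, stride=2, groups=c))

def denoise_single_image(noisy, iters=1000, lr=1e-3):
    # noisy: (1, C, H, W) tensor in [0, 1]; the network predicts the
    # noise residual, so the denoised output is noisy - net(noisy).
    c = noisy.shape[1]
    net = nn.Sequential(
        nn.Conv2d(c, 48, 3, padding=1), nn.ReLU(),
        nn.Conv2d(48, 48, 3, padding=1), nn.ReLU(),
        nn.Conv2d(48, c, 1),
    ).to(noisy.device)
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(iters):
        a, b = pair_downsample(noisy)
        loss = 0.5 * (F.mse_loss(a - net(a), b) +
                      F.mse_loss(b - net(b), a))
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        return (noisy - net(noisy)).clamp(0, 1)
```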
Figure 4. HFVI-based decision flow of the noise-aware compression selector: denoising necessity and codec selection between deep models and JPEG2000.
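The decision flow of Figure 4 reduces to two per-image choices: whether to run P2PA, and which codec to trust. A minimal sketch is given below; the weights and thresholds are hypothetical placeholders, since the paper learns the decision from HFVI features (cf. the XGBoost reference [49]) rather than hard-coding it.

```python
# Illustrative HFVI-guided selector (cf. Figure 4). The weighted
# combination and the thresholds t_denoise / t_codec are assumptions
# standing in for the learned decision model.
from dataclasses import dataclass

@dataclass
class HFVIFeatures:
    energy: float    # E: high-frequency energy
    entropy: float   # H: texture entropy
    jacobian: float  # J: encoder Jacobian sensitivity

def hfvi_score(f: HFVIFeatures, w=(0.5, 0.25, 0.25)) -> float:
    # Hypothetical weighted combination of the three HFVI components.
    return w[0] * f.energy + w[1] * f.entropy + w[2] * f.jacobian

def select_pipeline(f: HFVIFeatures, t_denoise=0.6, t_codec=0.4):
    score = hfvi_score(f)
    use_denoiser = score >= t_denoise           # run P2PA only when needed
    codec = "jpeg2000" if score >= t_codec else "learned"
    return use_denoiser, codec

# Example: a high-entropy, high-frequency image is routed to P2PA
# denoising followed by JPEG2000.
print(select_pipeline(HFVIFeatures(0.9, 0.7, 0.5)))  # (True, 'jpeg2000')
```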
Figure 5. Residual distribution and cumulative distribution function (CDF) with reference to noisy ground truths and clean ground truths.
Figure 6. Intensity profiles: the noise-free intensity profile is shown in blue; compression codec outputs are shown in red.
Figure 7. Compression performance on synthetic noisy images across eight noise levels.
Figure 8. Compression performance across ISO levels on real-world noisy images from the NIND dataset.
Figure 9. Qualitative and quantitative comparison of compression results on synthetic and real noisy images. The average compression rate of these algorithms is about 1 bpp.
Table 1. Details, characteristics, and purpose of the selected datasets.

| Dataset | # Images | Resolution | Characteristics | Purpose |
|---|---|---|---|---|
| Kodak [10] | 24 | 768 × 512 | High-quality natural photographs | Classical benchmark for compression |
| CLIC 2024 [11] | 63 | 2–6 MP * | Diverse high-resolution natural images | Evaluation of high-resolution performance |
| COCO subset [41] | 10k+ | Variable | Diverse natural scenes, large scale | Evaluation of generalization |
| AiRound [42] | 11k+ | 500 × 500 | Remote sensing satellite and aerial images | Cross-domain validation for remote sensing |
| NIND [14] | 836 | Variable | Real noisy–clean image pairs at different ISO levels | Evaluation of robustness to sensor noise |
| SIDD [43,44] | 320 | 5328 × 3000 | Smartphone noise dataset with real captures | Benchmark for denoising-oriented evaluation |

* MP: megapixels; #: number of images.
Table 2. Compression performance of traditional and deep learning models on clean and noisy images.

| Alg. | Data | bpp | PSNR ↑ | SSIM ↑ | LPIPS-VGG ↓ | DISTS ↓ |
|---|---|---|---|---|---|---|
| JPEG [7] | Clean GT | 1.842 | 36.580 | 0.956 | 0.285 | 0.210 |
| | Noisy Image | 5.437 | 19.389 | 0.680 | 0.307 | 0.295 |
| JPEG2000 [8] | Clean GT | 2.132 | 37.963 | 0.963 | 0.214 | 0.118 |
| | Noisy Image | 5.557 | 29.617 | 0.951 | 0.264 | 0.112 |
| WebP [9] | Clean GT | 1.851 | 38.240 | 0.967 | 0.220 | 0.125 |
| | Noisy Image | 5.466 | 19.502 | 0.653 | 0.295 | 0.188 |
| Bmshj [2] | Clean GT | 1.681 | 34.342 | 0.964 | 0.224 | 0.138 |
| | Noisy Image | 5.143 | 18.868 | 0.609 | 0.294 | 0.169 |
| Cheng [26] | Clean GT | 1.782 | 38.060 | 0.969 | 0.223 | 0.130 |
| | Noisy Image | 1.934 | 17.281 | 0.470 | 0.255 | 0.214 |
| Qres [15] | Clean GT | 1.923 | 40.240 | 0.976 | 0.213 | 0.111 |
| | Noisy Image | 5.528 | 20.827 | 0.698 | 0.245 | 0.127 |
| JPEG AI [27] | Clean GT | 2.081 | 40.401 | 0.971 | 0.255 | 0.129 |
| | Noisy Image | 5.934 | 15.772 | 0.698 | 0.365 | 0.267 |
| FTIC [6] | Clean GT | 1.051 | 41.914 | 0.971 | 0.178 | 0.116 |
| | Noisy Image | 3.634 | 25.941 | 0.810 | 0.212 | 0.155 |
| ICISP [28] | Clean GT | 0.251 | 36.082 | 0.950 | 0.273 | 0.130 |
| | Noisy Image | 0.525 | 20.723 | 0.518 | 0.220 | 0.195 |
| DCAE [16] | Clean GT | 0.829 | 37.346 | 0.954 | 0.159 | 0.137 |
| | Noisy Image | 2.346 | 21.673 | 0.625 | 0.287 | 0.146 |

Bold marks the best result for each evaluation criterion; ↑ means “higher is better”; ↓ means “lower is better”.
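The rate and fidelity columns of Table 2 can be reproduced for any codec with standard tooling; a minimal sketch follows. The file paths are placeholders, and a call to the codec's own encoder/decoder is assumed to have produced the compressed bitstream and the decoded image (LPIPS and DISTS would come from their respective reference implementations, omitted here).

```python
# Sketch of the bpp / PSNR / SSIM evaluation behind Table 2.
import os
from skimage import io
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(reference_path, bitstream_path, decoded_path):
    ref = io.imread(reference_path)   # clean GT or noisy input
    dec = io.imread(decoded_path)     # codec output
    h, w = ref.shape[:2]
    bpp = 8.0 * os.path.getsize(bitstream_path) / (h * w)
    psnr = peak_signal_noise_ratio(ref, dec, data_range=255)
    # channel_axis=-1 assumes RGB images; omit it for grayscale.
    ssim = structural_similarity(ref, dec, channel_axis=-1, data_range=255)
    return bpp, psnr, ssim
```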
Table 3. Failure threshold vs. AUC, Precision@Top25, and Recall@Top25.

| ΔPSNR Threshold | AUC ↑ | Precision@25% ↑ | Recall@25% ↑ |
|---|---|---|---|
| −0.2 | 0.298 | 0.955 | 0.253 |
| −0.3 | 0.435 | 0.886 | 0.239 |
| −0.5 | 0.471 | 0.864 | 0.235 |
| −1.0 | 0.446 | 0.773 | 0.222 |

Bold marks the best result for each evaluation criterion; ↑ means “higher is better”.
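A sweep of this kind can be sketched as follows: at each threshold, a binary failure label is derived from the per-image ΔPSNR (defined as in the paper), then AUC and Precision/Recall over the top 25% of HFVI scores are computed. The arrays are placeholders; the sketch assumes both failure and non-failure cases are present at every threshold.

```python
# Sketch of the threshold sweep behind Table 3.
import numpy as np
from sklearn.metrics import roc_auc_score

def top25_precision_recall(scores, labels):
    k = max(1, len(scores) // 4)              # top 25% by HFVI score
    top = np.argsort(scores)[::-1][:k]
    tp = labels[top].sum()
    return tp / k, tp / max(1, labels.sum())  # precision, recall

def sweep(delta_psnr, hfvi_scores, thresholds=(-0.2, -0.3, -0.5, -1.0)):
    rows = []
    for t in thresholds:
        labels = (delta_psnr < t).astype(int)  # 1 = failure case
        auc = roc_auc_score(labels, hfvi_scores)
        p, r = top25_precision_recall(hfvi_scores, labels)
        rows.append((t, auc, p, r))
    return rows
```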
Table 4. Ablation study on the contribution of HFVI components.

| Feature/Variant | Spearman ρ vs. ΔPSNR | Kendall τ | AUC (Failure Detection) ↑ | MAE of ΔPSNR ↓ | Precision@25% ↑ | Recall@25% ↑ |
|---|---|---|---|---|---|---|
| E (High-frequency Energy) | −0.58 | −0.41 | 0.39 | 0.67 dB | 0.91 | 0.27 |
| H (Texture Entropy) | −0.34 | −0.23 | 0.31 | 0.81 dB | 0.72 | 0.21 |
| J (Jacobian Sensitivity) | −0.27 | −0.19 | 0.29 | 0.84 dB | 0.69 | 0.19 |
| E + H (2-component) | −0.63 | −0.45 | 0.42 | 0.61 dB | 0.88 | 0.26 |
| E + J (2-component) | −0.61 | −0.43 | 0.41 | 0.63 dB | 0.85 | 0.25 |
| J + H (2-component) | −0.46 | −0.32 | 0.35 | 0.76 dB | 0.78 | 0.22 |
| Full HFVI (E + H + J) | −0.71 | −0.52 | 0.47 | 0.55 dB | 0.86 | 0.32 |

Bold marks the best result for each evaluation criterion; ↑ means “higher is better”; ↓ means “lower is better”.
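The rank-correlation columns of Table 4 are standard statistics and can be computed directly with SciPy; in the snippet below, `feature` would hold one HFVI component (E, H, or J, or a combination) per image and `delta_psnr` the measured per-image PSNR drop.

```python
# Rank correlations between an HFVI component and the per-image
# PSNR degradation (cf. Table 4).
from scipy.stats import spearmanr, kendalltau

def correlations(feature, delta_psnr):
    rho, _ = spearmanr(feature, delta_psnr)
    tau, _ = kendalltau(feature, delta_psnr)
    return rho, tau
```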
Table 5. Compression performance (denoising/compression and with/without selector).

| Algorithm | Description | PSNR ↑ | SSIM ↑ | VIF ↓ | NLPD ↓ | LPIPS-Alex ↓ | LPIPS-VGG ↓ |
|---|---|---|---|---|---|---|---|
| DC [18] | Joint denoising & compression | 26.1360 | 0.7212 | 0.3874 | 1.0630 | 0.2446 | 0.3212 |
| Joint-IC [19] | Joint denoising & compression | 26.2241 | 0.7260 | 0.3906 | 1.0919 | 0.2407 | 0.3151 |
| JPEG2000 [8] | Compression only | 24.9971 | 0.6978 | 0.3976 | 2.4521 | 0.2319 | 0.2965 |
| JPEG AI [27] | Compression only | 16.0108 | 0.4151 | 0.3003 | 7.5903 | 0.6665 | 0.4939 |
| Qres [15] | Compression only | 26.3275 | 0.7371 | 0.4358 | 1.2123 | 0.2051 | 0.2913 |
| FTIC [6] | Compression only | 24.4286 | 0.6841 | 0.3679 | 1.2476 | 0.2695 | 0.2839 |
| ICISP [28] | Compression only | 21.9919 | 0.5384 | 0.2443 | 1.5891 | 0.2811 | 0.3387 |
| DCAE [16] | Compression only | 25.7127 | 0.7329 | 0.4122 | 1.0435 | 0.2275 | 0.2880 |
| P2PA & JPEG2000 | Always denoising & compression | 26.0370 | 0.7025 | 0.4104 | 1.3991 | 0.2784 | 0.3117 |
| P2PA & Qres | Always denoising & compression | 26.3802 | 0.7055 | 0.4156 | 1.1854 | 0.3488 | 0.3650 |
| P2PA & DCAE | Always denoising & compression | 26.1770 | 0.6976 | 0.4087 | 1.4052 | 0.3617 | 0.4086 |
| Proposed | Hybrid | 27.5016 | 0.7418 | 0.3526 | 0.9928 | 0.2211 | 0.2619 |

Bold marks the best result for each evaluation criterion; ↑ means “higher is better”; ↓ means “lower is better”.