Spectral Robustness Mixer: Cross-Scale Neck for Robust No-Reference Image Quality Assessment
Abstract
1. Introduction
- Key hypothesis: cross-scale fusion can amplify instability.
- Conditioning of fusion operators: attention- or kernel-based mixing may become ill-conditioned, such that small changes in token statistics lead to disproportionately large changes in fused representations.
- Variance drift across stages: shallow and deep features often have different activation scales; naive fusion can induce scale dominance or collapse, making the regressor hypersensitive to perturbations.
- Unconstrained dynamic gating: adaptive gates can become high-gain functions of unstable features, further magnifying input gradients and creating failure modes consistent with gradient masking pitfalls unless evaluated carefully [7].
- These considerations motivate an operator-centric design goal: explicitly stabilize the conditioning, variance, and gain of cross-scale mixing.
- Our approach: Spectral Robustness Mixer (SRM).
- Contributions.
- Plug-and-play robustness neck for NR-IQA: we introduce SRM, a lightweight fusion module that can be inserted between an NR-IQA backbone and regressor to improve measured adversarial robustness with modest overhead (in the tested settings).
- Stability- and gain-controlled fusion design: SRM combines Nyström low-rank mixing, ridge-conditioned landmark kernels, variance-aware fusion, and a bounded gain cap, targeting conditioning, variance drift, and unstable gating as robustness bottlenecks.
- Correlation-aware robustness evaluation: we define and evaluate correlation-aware adversarial objectives aligned with rank correlation (SROCC) using a correlation-aware pairwise inversion objective (ranking-inspired surrogate), and we apply expectation-over-transformation (EOT) and standard anti-gradient masking checks when stochasticity is present [7].
- Paper organization.
2. Related Work
2.1. Backbones for No-Reference IQA
2.2. Cross-Scale Fusion and Neck Modules
2.3. Adversarial Robustness of IQA Models
2.4. Perceptual Quality Assessment Beyond Still Images
2.5. Robustness-by-Design and Stable Feature Mixing
3. Methodology
3.1. Problem Setting and Notation
- Per-image regression objectives (image-wise). Given MOS y for an image x with predicted score f(x), an attacker may maximize a regression loss such as (f(x+δ) − y)² (MSE) or maximize a score-drift loss |f(x+δ) − f(x)|.
- Correlation-aware objectives (batch-level via pairwise inversions). To directly degrade ranking consistency (and thus SROCC), we use a lightweight pairwise inversion objective that increases the number of mis-ordered MOS pairs within each mini-batch, consistent with correlation-focused NR-IQA attack analyses [3]. We instantiate this objective as the smooth inversion loss in Equation (32) (Section 3.8.4), using comparable pairs (pairs whose normalized MOS values differ beyond a margin) and a temperature parameter.
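As a concrete illustration, the batch-level inversion objective can be sketched in a few lines of NumPy. This is a minimal version of one plausible smooth surrogate; the function name and the margin/temperature defaults are ours, not necessarily the paper's Equation (32):

```python
import numpy as np

def pairwise_inversion_loss(scores, mos, margin=0.05, tau=0.1):
    """Smooth surrogate that rewards mis-ordering comparable MOS pairs.

    scores: predicted quality scores, shape (B,)
    mos:    normalized ground-truth MOS, shape (B,)
    Pairs with |mos_i - mos_j| > margin are 'comparable'; for each such
    pair the sigmoid term approaches 1 when the predicted order
    contradicts the MOS order. An attacker maximizing this loss pushes
    the batch toward rank inversions, which lowers SROCC.
    """
    s = scores[:, None] - scores[None, :]   # predicted pairwise differences
    m = mos[:, None] - mos[None, :]         # ground-truth pairwise differences
    comparable = np.abs(m) > margin         # mask of usable pairs
    # sign(m) * s < 0 means the pair is inverted; the sigmoid smooths the step
    inv = 1.0 / (1.0 + np.exp(np.sign(m) * s / tau))
    n = comparable.sum()
    return float((inv * comparable).sum() / max(n, 1))
```

The loss is near 0 for correctly ordered predictions and near 1 when the predicted ordering is fully reversed.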
Notation:
- M: evaluation set size
- B: mini-batch size
- spatial resolution
- N: flattened token count
- d: head dimension
- r: Nyström landmark count
- condition number proxy of the unwhitened landmark kernel
- post-attention temperature/rescaler
- standard deviation of the post-attention output
- gain cap used in fusion gating
- temperature/sharpness parameter in the pairwise inversion loss (Equation (32))
3.2. Backbone Integration and Cross-Stage Attention
| Algorithm 1: SRM—Spectral Robustness Mixer (single forward pass, overview) |
| Algorithm 2: NystromCrossAttn (interface) |
Input: queries Q, keys K, values V, rank r. Output: low-rank cross-attention output.
1. Landmarks: select r key/value landmark indices from K (deterministic by default; see Section 3.4).
2. Nyström approximation: compute the low-rank cross-attention output using landmark kernels (Section 3.3).
3. Post-scaling: optionally rescale by the learned post-attention temperature when enabled (Section 3.5).
4. Return the output.
3.3. Nyström Cross-Attention with Ridge-Leverage Landmarks
- Cross-attention setup (rectangular case).
- Nyström approximation (cross-attention).
- Landmark selection via ridge leverage (deterministic by default).
- k-means++ refinement (ablation-only).
- Stable computation of (mixed precision).
- Optional stochastic key/value pooling (training-time only).
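The rectangular Nyström construction above can be sketched compactly in NumPy. Here uniform landmark selection stands in for ridge-leverage sampling, and the ridge weight `lam` and function names are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def nystrom_cross_attn(Q, K, V, r, lam=1e-3):
    """Low-rank cross-attention via r key landmarks and a ridge solve.

    Q: (Nq, d) queries; K, V: (Nk, d) keys/values; r: landmark count.
    Exact attention softmax(QK^T/sqrt(d)) V costs O(Nq*Nk); this
    factorization costs O((Nq+Nk)*r) plus an r x r linear solve.
    """
    d = Q.shape[1]
    idx = np.linspace(0, K.shape[0] - 1, r).astype(int)  # uniform landmarks (stand-in)
    Kl = K[idx]                                          # (r, d) landmark keys
    F = softmax(Q @ Kl.T / np.sqrt(d))                   # (Nq, r) query-landmark kernel
    W = softmax(Kl @ Kl.T / np.sqrt(d))                  # (r, r) landmark kernel
    B = softmax(Kl @ K.T / np.sqrt(d)) @ V               # (r, dv) landmark-key mixing
    # Ridge-conditioned solve replaces an explicit (pseudo-)inverse of W.
    X = np.linalg.solve(W + lam * np.eye(r), B)
    return F @ X                                         # (Nq, dv)
```

With r equal to the key count and a vanishing ridge, the factorization recovers exact cross-attention, which is a useful sanity check for the approximation budget.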
3.4. Spectral Conditioning
- What we measure: conditioning and solve residuals.
- Regularized conditioning (singular-value ratio): κ_λ(W) = σ_max(W + λI) / σ_min(W + λI). This controls the sensitivity of linear solves to perturbations in W and to finite-precision rounding.
- Solve residual (numerical accuracy): for each solve (W + λI)X = B we report the relative residual ρ = ‖(W + λI)X − B‖_F / max(‖B‖_F, ε), with ε > 0 for numerical safety.
- Diagnostics are computed in fp32, even when the forward pass uses mixed precision.
- Why “row-QR makes κ = 1” is not valid.
- Stabilization and solve pipeline (implemented).
- Optional fast inverse (Neumann truncation; ablation-only).
- Scale calibration (no “variance unity” claim).
- Cost.
- Key takeaways.
- SRM treats landmark stability as a numerical linear algebra problem: ridge regularize, optionally equilibrate, then solve in fp32.
- We replace informal well-conditionedness statements with measurable diagnostics: the regularized condition number κ_λ and the relative solve residual ρ.
- Neumann truncation is an ablation-only acceleration, enabled only when a convergence proxy and residual thresholds are satisfied.
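The two diagnostics (κ_λ and the relative solve residual) and the residual-gated Neumann ablation might look as follows in NumPy; function names, the ridge default, and the gate threshold are ours:

```python
import numpy as np

def conditioning_diagnostics(W, B, lam=1e-3, eps=1e-12):
    """Regularized condition number and relative residual of the ridge solve.

    Diagnostics are computed in fp32, mirroring the paper's convention of
    running them in full precision even under a mixed-precision forward.
    """
    Wl = W.astype(np.float32) + lam * np.eye(W.shape[0], dtype=np.float32)
    s = np.linalg.svd(Wl, compute_uv=False)          # singular values, descending
    kappa = float(s[0] / max(s[-1], eps))            # kappa_lambda proxy
    X = np.linalg.solve(Wl, B.astype(np.float32))
    rho = float(np.linalg.norm(Wl @ X - B) / max(np.linalg.norm(B), eps))
    return kappa, rho

def neumann_solve(A, B, m=3, res_tol=1e-2):
    """Truncated Neumann solve of A X = B, gated by a residual threshold.

    Uses X_m = alpha * sum_{k<=m} (I - alpha*A)^k B with alpha = 1/||A||_inf,
    and falls back to an exact solve when the truncated residual exceeds
    res_tol (the 'enable only when convergence checks pass' gate).
    """
    alpha = 1.0 / np.abs(A).sum(axis=1).max()        # crude spectral bound
    T = np.eye(A.shape[0]) - alpha * A
    X, term = alpha * B, alpha * B
    for _ in range(m):
        term = T @ term
        X = X + term
    res = float(np.linalg.norm(A @ X - B) / np.linalg.norm(B))
    if res > res_tol:                                # gate failed: exact fallback
        return np.linalg.solve(A, B), res
    return X, res
```

A well-conditioned landmark kernel gives κ_λ ≈ 1 and a tiny residual; a rank-deficient one gives a large κ_λ, and the Neumann gate rejects the truncated inverse.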
3.5. Variance Stabilization with a Learnable Rescaler
- Post-attention scale statistic (vector RMS).
- Reference scale and EMA tracking.
- Learnable rescaler and auxiliary loss.
- Safety rails (reproducible).
- DyT pre-normalization (bounded activations).
- Relation to normalization layers.
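A compact sketch of the stabilizer described above, assuming an EMA reference scale, a clamped scalar rescale, and DyT-style tanh squashing; the class name and the momentum/alpha/clamp defaults are illustrative, not the paper's exact Equations (19)-(23):

```python
import numpy as np

class ScaleStabilizer:
    """Sketch: EMA-tracked RMS, a safety-railed rescale, DyT squashing.

    rms(x) is the vector RMS of the post-attention output; an EMA of it
    serves as the reference scale, a clamped scalar rescales the current
    batch toward that reference, and tanh(alpha * x) bounds activations.
    """
    def __init__(self, momentum=0.99, alpha=0.5, clamp=(0.5, 2.0)):
        self.ema, self.momentum = None, momentum
        self.alpha, self.clamp = alpha, clamp

    def __call__(self, x):
        rms = float(np.sqrt(np.mean(x ** 2)) + 1e-8)   # scale statistic
        if self.ema is None:
            self.ema = rms                              # initialize reference
        self.ema = self.momentum * self.ema + (1 - self.momentum) * rms
        gamma = np.clip(self.ema / rms, *self.clamp)    # safety-railed rescale
        return np.tanh(self.alpha * gamma * x)          # bounded (DyT-style)
```

The clamp is the "safety rail": the rescale can never amplify or attenuate by more than the configured factor, and the tanh keeps fused activations bounded regardless of input scale.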
3.6. Entropy-Regularized Fusion Gates
- Inputs to the gate.
- Gate logits.
- Fusion weights.
- Entropy/balance regularization (discourages collapse).
- Gain-capped fusion (activation stability, not certification).
- Implementation (forward pass).
- Gate diagnostics (reported).
| Algorithm 3: GateFuse: entropy-regularized, gain-capped fusion |
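A minimal version of GateFuse consistent with the description in Section 3.6; the temperature, cap value, and diagnostic choices here are ours:

```python
import numpy as np

def gate_fuse(stages, logits, temp=1.0, cap=2.0):
    """Entropy-friendly, gain-capped fusion of S stage features.

    stages: list of S arrays sharing a shape; logits: (S,) gate logits.
    Softmax with temperature gives fusion weights; the fused output's RMS
    is capped at `cap` times the mean stage RMS. This is a magnitude
    bound, not a Lipschitz certificate, since the weights are
    input-dependent in the full model.
    """
    z = np.asarray(logits) / temp
    w = np.exp(z - z.max())
    w = w / w.sum()                                    # fusion weights
    fused = sum(wi * s for wi, s in zip(w, stages))
    ref = np.mean([np.sqrt(np.mean(s ** 2)) for s in stages]) + 1e-8
    rms = np.sqrt(np.mean(fused ** 2))
    scale = min(1.0, cap * ref / max(rms, 1e-8))       # gain cap
    entropy = float(-(w * np.log(w + 1e-12)).sum())    # reported diagnostic
    return fused * scale, w, entropy
```

Equal logits give uniform weights with entropy log S (no collapse); extreme logits drive entropy toward 0, which is exactly what the entropy/balance regularizer discourages.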
3.7. Complexity and Backbone Compatibility
3.8. Implementation Details
3.8.1. Backbones Under Study
3.8.2. Datasets, Splits, and Metrics
- TID2013: content-separated split by reference-image ID (no leakage of content). We use 15 reference images for training, 5 for validation, and 5 for testing (3000 images distributed accordingly across distortions). We repeat with three seeds (0/1/2) by re-sampling the reference-ID partition and report mean ± std.
- KonIQ-10k: we use the widely adopted train/val/test assignment distributed with common KonIQ loaders [40].
- Train: resize shorter side to 512, random crop, random horizontal flip (p = 0.5).
- Eval: resize shorter side to 512, center crop (deterministic).
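Since SROCC and PLCC are the headline metrics, a dependency-free NumPy version is easy to keep alongside the evaluation loop. Ties are broken by sort order in this sketch; library implementations such as scipy.stats.spearmanr average tied ranks instead:

```python
import numpy as np

def _ranks(x):
    """1-based ranks; ties broken by sort order (sufficient for continuous MOS)."""
    order = np.argsort(x)
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    return ranks

def plcc(pred, mos):
    """Pearson linear correlation between predictions and MOS."""
    p, m = pred - np.mean(pred), mos - np.mean(mos)
    return float((p @ m) / (np.linalg.norm(p) * np.linalg.norm(m)))

def srocc(pred, mos):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return plcc(_ranks(np.asarray(pred)), _ranks(np.asarray(mos)))
```

A monotone but nonlinear predictor scores a perfect SROCC while its PLCC stays below 1, which is why both metrics are reported.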
3.8.3. Training Protocol
3.8.4. Adversarial Evaluation
- FGSM-ℓ∞: single-step attack at the full ℓ∞ budget ε.
- PGD-ℓ∞: 40 steps; step size and restart count as listed in the attack-settings table.
- PGD-ℓ2: 40 steps; step size and restart count as listed in the attack-settings table.
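The PGD recipe with the score-drift objective can be sketched as below. Here `score_fn` and `grad_fn` stand in for the model's scalar score and its input gradient (analytic for the toy linear scorer in the test; autograd in practice), and the default budget and step size are illustrative:

```python
import numpy as np

def pgd_linf_drift(score_fn, grad_fn, x, eps=8/255, alpha=2/255, steps=40, seed=0):
    """PGD under an l_inf budget maximizing score drift |f(x+d) - f(x)|.

    Random start, then projected sign-gradient ascent on the drift
    objective, with per-step projection to the eps-ball and to valid
    pixel range [0, 1].
    """
    rng = np.random.default_rng(seed)
    f0 = score_fn(x)                                   # clean reference score
    adv = np.clip(x + rng.uniform(-eps, eps, x.shape), 0.0, 1.0)
    for _ in range(steps):
        d = score_fn(adv) - f0
        s = np.sign(d) if d != 0 else 1.0              # ascend |f(adv) - f0|
        adv = adv + alpha * np.sign(s * grad_fn(adv))  # signed gradient step
        adv = np.clip(np.clip(adv, x - eps, x + eps), 0.0, 1.0)
    return adv
```

Restarts correspond to re-running with different seeds and keeping the worst case; swapping the drift term for MSE or the pairwise inversion surrogate changes only the objective inside the loop.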
3.8.5. Hardware Footprint and Profiling
4. Results
4.1. Setup and Reporting Conventions
4.2. Clean-Set Performance
4.3. Robustness Under White-Box Attacks
4.4. Attack Transfer (Black-Box Sanity Check)
4.5. Diagnostics (Stability and Mechanism Evidence)
- Conditioning and solve accuracy: Table 1 reports quantiles of and solve residuals (Section 3.4).
- Scale stabilization: Table 2 reports normalized RMS magnitudes before/after ; Figure 3 visualizes training trajectories (Section 3.5).
- Gate behavior: Figure 4 reports entropy, routing mass, and gain-cap distributions (Section 3.6).
4.6. Ablation Study
- Landmark quality and conditioning.
- Scale stabilization.
- Gating and gain capping.
- Approximation budget and inverse routine.
- I1. Landmark selection and conditioning matter for robustness: uniform landmarks lead to the largest robust drop.
- I2. Scale stabilization and gating are complementary: the rescaler affects both clean and robust performance, while gate regularization and gain capping primarily affect robustness.
- I3. Approximation budget matters: reducing Nyström rank harms robustness more than changing the inverse routine.
5. Discussion, Limitations, and Future Work
5.1. Mechanism Evidence: What We Measure (Not What We “Guarantee”)
- Conditioning/solve stability (Nyström landmarks). We report quantiles of the regularized condition number and solve residuals for the landmark system (Table 1; Section 3.4). These diagnostics make the “stable landmark solve” claim falsifiable and help interpret robustness differences between landmark/conditioning ablations (Table 11).
- Post-attention scale stabilization. We report normalized post-attention RMS magnitudes before/after (Table 2) and training-time trajectories (Figure 3; Section 3.5). We treat (and DyT) as numerical stabilizers that reduce scale drift; robustness is established empirically under the attack suite.
- Gate behavior and activation spikes. We report entropy and routing-mass diagnostics for fusion weights, along with gain-cap statistics (Figure 4; Section 3.6). These measurements support that the fusion does not collapse to a single stage and that fused activations are bounded in magnitude. We do not interpret gain capping as a certified Lipschitz guarantee for the full network because gate weights are input-dependent (Equation (29)).
5.2. Limitations
- L1. Dataset coverage is still limited. We evaluate on two datasets (TID2013 and KonIQ-10k), which cover synthetic distortions and in-the-wild conditions, respectively. However, broader generality (e.g., LIVE/CSIQ/CLIVE/KoNViD for video) is not established in this work.
- L2. Attack coverage remains primarily first-order. Our evaluation focuses on strong first-order, multi-step attacks with restarts, sweeps, transfer checks, and EOT for stochasticity (Section 3.8.4; Figure 5) [7,25]. We do not claim robustness against all possible adaptive strategies (e.g., query-based black-box attacks at large budgets, or attacks optimizing non-differentiable dataset-level correlation directly).
- L3. Correlation-aware objectives are proxies. The pairwise inversion loss (Equation (32)) is a lightweight differentiable surrogate that induces ranking inversions within a batch and empirically reduces SROCC. It is not identical to directly optimizing Spearman correlation over the full dataset, and different surrogates could lead to different worst-case outcomes.
- L4. No certified robustness claims. Although SRM includes explicit magnitude caps and numerical stabilizers, we do not provide certified robustness bounds for the full end-to-end mapping from pixels to quality score. All robustness improvements are empirical under the specified threat model and attack ensemble.
5.3. Future Work
- Regression-/ranking-specific “AutoAttack-style” suites. AutoAttack is a strong parameter-free baseline for classification evaluation [9]. A natural extension for NR-IQA is a standardized ensemble that mixes regression and ranking-aligned objectives (including pairwise and listwise surrogates), with fixed hyperparameters, monotonicity checks, transfer tests, and EOT requirements when stochasticity is present.
- Stronger black-box and physically plausible perturbations. Beyond transfer-based checks, future work should evaluate SRM under query-based black-box attacks (score- and decision-based) and under perceptual/physical constraints (e.g., content-preserving, camera/codec distortions) to complement the norm-bounded threat model studied here.
- Certified deviation bounds for scalar regression. A concrete direction is to combine SRM’s bounded-gain components with certification techniques for regression, e.g., Interval Bound Propagation (IBP) [47] or randomized smoothing adapted to scalar outputs, to bound output deviation for IQA regressors under well-defined norms.
- Extensions to video and 3D quality pipelines. The cross-scale stabilization principles in SRM are potentially applicable to temporal and 3D settings where fusion and stability are also central: (i) video quality assessment (temporal fusion + long-range dependencies), and (ii) point-cloud quality assessment (PCQA) pipelines that must remain stable under acquisition noise and adversarial perturbations [26,27]. Recent benchmarks for video quality understanding in LMMs further motivate robust evaluation beyond still images [28].
- Alternative backbone families and operators. Future work could attach SRM-style fusion to non-attention sequence backbones such as selective state-space models (e.g., Mamba) [48] and diffusion-based perceptual decoders [49] and study whether the same conditioning/scale/gating diagnostics predict robustness improvements.
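For the certified-deviation direction, the basic IBP step [47] for a linear map is simple to state. The sketch below bounds the score deviation of a purely linear regressor under an ℓ∞ budget; it is illustrative only and is not a certificate for the full nonlinear network:

```python
import numpy as np

def ibp_linear(l, u, W, b):
    """Interval Bound Propagation through y = W x + b.

    Given elementwise bounds l <= x <= u, returns sound bounds on y:
    the interval center goes through W, the radius through |W|.
    """
    mu, r = (u + l) / 2.0, (u - l) / 2.0
    c = W @ mu + b
    d = np.abs(W) @ r
    return c - d, c + d

def certified_score_deviation(x, eps, W, b):
    """Worst-case |score change| of a linear regressor under an l_inf ball."""
    lo, hi = ibp_linear(x - eps, x + eps, W, b)
    y = W @ x + b
    return float(np.maximum(hi - y, y - lo).max())
```

For a linear scorer the bound is tight (eps times the l1 norm of the weights); chaining such steps through bounded-gain SRM components is the open question raised above.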
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021. [Google Scholar] [CrossRef]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021. [Google Scholar] [CrossRef]
- Yang, C.; Liu, Y.; Li, D.; Zhong, Y.; Jiang, T. Beyond Score Changes: Adversarial Attack on No-Reference Image Quality Assessment from Two Perspectives. arXiv 2024, arXiv:2404.13277. [Google Scholar] [CrossRef]
- Ran, Y.; Zhang, A.X.; Li, M.; Tang, W.; Wang, Y.G. Black-box Adversarial Attacks Against Image Quality Assessment Models. Expert Syst. Appl. 2024, 260, 125415. [Google Scholar] [CrossRef]
- Gushchin, A.; Abud, K.; Bychkov, G.; Shumitskaya, E.; Chistyakova, A.; Lavrushkin, S.; Rasheed, B.; Malyshev, K.; Vatolin, D.; Antsiferova, A. Guardians of image quality: Benchmarking defenses against adversarial attacks on image quality metrics. arXiv 2024, arXiv:2408.01541. [Google Scholar] [CrossRef]
- Rasheed, B.; Abdelhamid, M.; Khan, A.; Menezes, I.; Khatak, A.M. Exploring the impact of conceptual bottlenecks on adversarial robustness of deep neural networks. IEEE Access 2024, 12, 131323–131335. [Google Scholar] [CrossRef]
- Athalye, A.; Carlini, N.; Wagner, D. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
- Xiong, Y.; Zeng, Z.; Chakraborty, R.; Tan, M.; Fung, G.; Li, Y.; Singh, V. Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention. arXiv 2021, arXiv:2102.03902. [Google Scholar] [CrossRef]
- Croce, F.; Hein, M. Reliable Evaluation of Adversarial Robustness with an Ensemble of Diverse Parameter-free Attacks. In Proceedings of the 37th International Conference on Machine Learning (ICML), PMLR, Vienna, Austria, 13–18 July 2020; Volume 119, pp. 2206–2216. [Google Scholar]
- Su, S.; Yan, Q.; Zhu, Y.; Zhang, C.; Ge, X.; Sun, J.; Zhang, Y. Blindly Assess Image Quality in the Wild Guided by a Self-Adaptive Hyper Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 3664–3673. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 3–7 May 2021. [Google Scholar]
- Golestaneh, M.; Dadsetan, S.; Kitani, K.M. No-Reference Image Quality Assessment via Transformers, Relative Ranking, and Self-Consistency. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022. [Google Scholar]
- Ke, J.; Wang, Q.; Wang, Y.; Milanfar, P.; Yang, F. MUSIQ: Multi-scale Image Quality Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar] [CrossRef]
- Yang, S.; Wu, T.; Shi, S.; Lao, S.; Gong, Y.; Cao, M.; Wang, J.; Yang, Y. MANIQA: Multi-dimension Attention Network for No-Reference Image Quality Assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022. [Google Scholar] [CrossRef]
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. arXiv 2019, arXiv:1908.07919. [Google Scholar] [CrossRef]
- Wang, W.; Yao, L.; Chen, L.; Lin, B.; Cai, D.; He, X.; Liu, W. CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention. arXiv 2021, arXiv:2108.00154. [Google Scholar] [CrossRef]
- Korhonen, J.; You, J. Adversarial Attacks Against Blind Image Quality Assessment Models. In Proceedings of the 2nd Workshop on Quality of Experience in Visual Multimedia Applications (QoEVMA ’22), New York, NY, USA, 14 October 2022; pp. 3–11. [Google Scholar] [CrossRef]
- Antsiferova, A.; Abud, K.; Gushchin, A.; Shumitskaya, E.; Lavrushkin, S.; Vatolin, D.S. Comparing the Robustness of Modern No-Reference Image- and Video-Quality Metrics to Adversarial Attacks. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vancouver, BC, Canada, 26–27 February 2024; pp. 700–708. [Google Scholar]
- Zhu, H.; Zhao, Z.; Wang, S.; Li, X. SwinIQA: Learned Swin Distance for Compressed Image Quality Assessment. arXiv 2022, arXiv:2205.04264. [Google Scholar] [CrossRef]
- Meftah, M.; Fezza, S.A.; Hamidouche, W.; Déforges, O. Evaluating the Vulnerability of Deep Learning-based Image Quality Assessment Methods to Adversarial Attacks. In Proceedings of the 2023 11th European Workshop on Visual Information Processing (EUVIP), Gjovik, Norway, 11–14 September 2023. [Google Scholar] [CrossRef]
- Liu, A.; Zhu, L.; Xu, M.; Li, Q.; Zhang, Y. Robust No-reference Image Quality Assessment: A Comprehensive Benchmark and Insights. arXiv 2024, arXiv:2404.13277. [Google Scholar]
- Meleshin, I.; Chistyakova, A.; Antsiferova, A.; Vatolin, D. Robustness as Architecture: Designing IQA Models to Withstand Adversarial Perturbations. In Proceedings of the 33rd ACM International Conference on Multimedia, Dublin, Ireland, 27–31 October 2025. [Google Scholar]
- Shumitskaya, E.; Antsiferova, A.; Vatolin, D. Towards Adversarial Robustness Verification of No-Reference Image and Video-Quality Metrics. Comput. Vis. Image Underst. 2024, 240, 103913. [Google Scholar] [CrossRef]
- Athalye, A.; Engstrom, L.; Ilyas, A.; Kwok, K. Synthesizing Robust Adversarial Examples. arXiv 2017, arXiv:1707.07397. [Google Scholar]
- Wu, X.; He, Z.; Luo, T.; Jiang, G.; Zhou, W.; Zhu, L.; Lin, W. DA-Net: A Double Alignment Multimodal Learning Network for Point Cloud Quality Assessment. IEEE Trans. Image Process. 2025, 34, 8185–8200. [Google Scholar] [CrossRef] [PubMed]
- Li, L.; Zhang, X. A robust assessment method of point cloud quality for enhancing 3D robotic scanning. Robot.-Comput.-Integr. Manuf. 2025, 92, 102863. [Google Scholar] [CrossRef]
- Zhang, Z.; Jia, Z.; Wu, H.; Li, C.; Chen, Z.; Zhou, Y.; Sun, W.; Liu, X.; Min, X.; Lin, W.; et al. Q-Bench-Video: Benchmark the Video Quality Understanding of LMMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025. [Google Scholar]
- Brock, A.; De, S.; Smith, S.L.; Simonyan, K. High-Performance Large-Scale Image Recognition Without Normalization. arXiv 2021, arXiv:2102.06171. [Google Scholar]
- Tolstikhin, I.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Steiner, A.; Keysers, D.; Uszkoreit, J.; et al. MLP-Mixer: An all-MLP Architecture for Vision. arXiv 2021, arXiv:2105.01601. [Google Scholar]
- Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
- Musco, C.; Musco, C. Recursive Sampling for the Nyström Method. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Arthur, D.; Vassilvitskii, S. k-means++: The Advantages of Careful Seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), New Orleans, LA, USA, 7–9 January 2007. [Google Scholar]
- Zhu, J.; Chen, X.; He, K.; LeCun, Y.; Liu, Z. Transformers without Normalization. arXiv 2025, arXiv:2503.10622. [Google Scholar] [CrossRef]
- Zhang, B.; Sennrich, R. Root Mean Square Layer Normalization. arXiv 2019, arXiv:1910.07467. [Google Scholar] [CrossRef]
- Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In Proceedings of the International Conference on Learning Representations (ICLR) Workshop, Toulon, France, 24–26 April 2017. [Google Scholar]
- Fedus, W.; Zoph, B.; Shazeer, N. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv 2021, arXiv:2101.03961. [Google Scholar]
- Ponomarenko, N.; Jin, L.; Ieremeiev, O.; Lukin, V.; Egiazarian, K.; Astola, J.; Vozel, B.; Chehdi, K.; Carli, M.; Battisti, F.; et al. Image Database TID2013: Peculiarities, Results and Perspectives. Signal Process. Image Commun. 2015, 30, 57–77. [Google Scholar] [CrossRef]
- Hosu, V.; Lin, H.; Szirányi, T.; Saupe, D. KonIQ-10k: An Ecologically Valid Database for Deep Learning of Blind Image Quality Assessment. IEEE Trans. Image Process. 2020, 29, 4041–4056. [Google Scholar] [CrossRef]
- Video Quality Experts Group (VQEG). Final Report from the Video Quality Experts Group on the Validation of Objective Models of Video Quality Assessment, Phase I (FR-TV); Technical report; Video Quality Experts Group (VQEG): Boulder, CO, USA, 2000; Available online: https://www.vqeg.org/media/8212/frtv_phase1_final_report.doc (accessed on 5 January 2026).
- Blondel, M.; Teboul, O.; Berthet, Q.; Djolonga, J. Fast Differentiable Sorting and Ranking. In Proceedings of the 37th International Conference on Machine Learning (ICML), Online, 13–18 July 2020. [Google Scholar]
- Grover, A.; Wang, E.; Zweig, A.; Ermon, S. Stochastic Optimization of Sorting Networks via Continuous Relaxations. arXiv 2019, arXiv:1903.08850. [Google Scholar] [CrossRef]
- Prillo, S.; Eisenschlos, J. SoftSort: A Continuous Relaxation for the argsort Operator. In Proceedings of the 37th International Conference on Machine Learning (ICML), Online, 13–18 July 2020. [Google Scholar]
- Cuturi, M.; Teboul, O.; Vert, J.P. Differentiable Ranking and Sorting using Optimal Transport. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
- Carlini, N.; Athalye, A.; Papernot, N.; Brendel, W.; Rauber, J.; Tsipras, D.; Goodfellow, I.J.; Madry, A.; Kurakin, A. On Evaluating Adversarial Robustness. arXiv 2019, arXiv:1902.06705. [Google Scholar] [PubMed]
- Gowal, S.; Dvijotham, K.; Stanforth, R.; Bunel, R.; Qin, C.; Uesato, J.; Mann, T.; Kohli, P. On the Effectiveness of Interval Bound Propagation for Training Verifiably Robust Models. arXiv 2018, arXiv:1810.12715. [Google Scholar]
- Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
- Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual, 6–12 December 2020. [Google Scholar]
| Setting | κ_λ Median | κ_λ p90 | Residual Median | Residual p90 | Solve Share (% Time) |
|---|---|---|---|---|---|
| No equilibration + ridge (default) | |||||
| Equilibration + ridge | |||||
| Neumann-m = 3 (ablation; gated) |
| Model | RMS Median (Pre) | RMS p90 (Pre) | RMS Median (Post) | RMS p90 (Post) |
|---|---|---|---|---|
| SRM (DyT + rescaler) |||||
| Neck | Params (M) | MACs (GMAC) | Peak Alloc (GB) |
|---|---|---|---|
| None (backbone→head) | 0.00 | 0.00 | 0.00 |
| Simple concat+MLP fuse | 0.41 | 0.09 | 0.06 |
| SRM (Nyström + rescaler + GateFuse) | 0.78 | 0.82 | 0.24 |
| SRM w/o conditioning (no fp32 solve) | 0.78 | 0.81 | 0.20 |
| SRM w/o gain cap | 0.78 | 0.82 | 0.24 |
| Backbone | Channels | Strides | Pretrain | Notes |
|---|---|---|---|---|
| PVT-Small [1] | (64, 128, 320, 512) | (4, 8, 16, 32) | IN-1K | hierarchical tokens |
| Swin-T [2] | (96, 192, 384, 768) | (4, 8, 16, 32) | IN-1K | shifted windows |
| ConvNeXt-T [31] | (96, 192, 384, 768) | (4, 8, 16, 32) | IN-1K | ConvNet stages |
| MobileNetV3-L [32] | (24, 40, 112, 160) | (4, 8, 16, 32) | IN-1K | low-resource |
| Component | Setting |
|---|---|
| SRM links | schedule (one-way deep→shallow) |
| Nyström | rank , heads , head dim , aligned width |
| Conditioning | ridge (Equation (16)), solve in fp32; LU fast-path if |
| stabilization | EMA (Equation (19)), (Equation (21)), clamp |
| DyT | enabled by default (Equation (23)) |
| GateFuse | temperature , , , gain cap |
| Model | TID2013 (SROCC/PLCC) | KonIQ-10k (SROCC/PLCC) |
|---|---|---|
| PVT-S baseline | / | / |
| PVT-S + SRM | / | / |
| Swin-T baseline | / | / |
| Swin-T + SRM | / | / |
| ConvNeXt-T baseline | / | / |
| ConvNeXt-T + SRM | / | / |
| MobileNetV3-L baseline | / | / |
| MobileNetV3-L + SRM | / | / |
| Model | PGD MSE | PGD Drift | PGD pairInv | Transfer PGD |
|---|---|---|---|---|
| PVT-S baseline | ||||
| PVT-S + SRM |
| Model | TID (SROCC/PLCC) | TID (SROCC/PLCC) | KonIQ (SROCC/PLCC) | KonIQ (SROCC/PLCC) |
|---|---|---|---|---|
| PVT-S baseline | ||||
| PVT-S + SRM | ||||
| Swin-T baseline | ||||
| Swin-T + SRM | ||||
| ConvNeXt-T baseline | ||||
| ConvNeXt-T + SRM | ||||
| MobileNetV3-L baseline | ||||
| MobileNetV3-L + SRM |
| Attack | Objective | Norm | Steps | Step Size | Restarts / EOT K | |
|---|---|---|---|---|---|---|
| FGSM | MSE | 1 | – | |||
| PGD | MSE | 40 | ||||
| PGD | drift | 40 | ||||
| PGD | pairInv (Equation (32)) | 40 | ||||
| PGD | MSE | 40 | ||||
| PGD | drift | 40 | ||||
| PGD | pairInv (Equation (32)) | 40 | ||||
| EOT-PGD | any above | same | 40 | same |
| Transfer Setting | Target Robust SROCC/PLCC | Interpretation |
|---|---|---|
| Swin-T baseline → PVT-S baseline (craft on surrogate, evaluate on target) | cross-backbone transfer is non-trivial | |
| PVT-S baseline → Swin-T baseline | consistent with a standard transfer gap | |
| PVT-S baseline → PVT-S SRM | SRM remains attackable under transfer (anti-masking) | |
| PVT-S SRM → PVT-S baseline | baseline remains vulnerable |
| Variant | Clean | Robust | ||
|---|---|---|---|---|
| No neck (PVT-S baseline) | ||||
| Full SRM | — | — | ||
| SRM without spectral landmarks (uniform) | ||||
| SRM without regularized solve (λ = 0) |||||
| SRM without stabilization (no rescaler) |||||
| SRM without entropy gating (uniform fusion) | ||||
| SRM with fixed Nyström rank (r = 32) | ||||
| SRM without stage gating (no cross-scale routing) |
Rasheed, B.; Antsiferova, A.; Vatolin, D. Spectral Robustness Mixer: Cross-Scale Neck for Robust No-Reference Image Quality Assessment. Technologies 2026, 14, 145. https://doi.org/10.3390/technologies14030145