1. Introduction
Blind Image Quality Assessment (BIQA) plays a critical role in a wide range of applications, including device benchmarking and image-driven pattern analysis. Its influence spans multiple domains, from data compression, signal transmission, and media quality assurance to image enhancement and content understanding [
1,
2,
3,
4]. Unlike full- or reduced-reference methods [
5,
6,
7,
8], BIQA aims to estimate perceptual image quality solely based on the distorted image itself, without access to any corresponding reference image or related metadata. This no-reference characteristic makes BIQA especially well-suited for real-world deployments in industrial and consumer-level scenarios, where reference images are typically unavailable. BIQA is not only foundational for improving automated visual systems but also essential for ensuring quality consistency in ubiquitous imaging.
With the rapid advancement of imaging hardware and the widespread deployment of high-performance chips, ultra-high-definition (UHD) images have become increasingly accessible, not only through professional imaging equipment but also via consumer-grade mobile devices. This surge in image resolution presents significant challenges for UHD-BIQA. The resolution of realistic images in public datasets has increased from approximately 500 × 500 pixels in early databases, such as LIVE Challenge (CLIVE) [
9], to 3840 × 2160 pixels in recent benchmarks [
10]. Despite this progress in data availability, effectively handling UHD images remains an open problem due to high computational costs, increased visual complexity, and the scale sensitivity of existing models.
To address these challenges, existing approaches offer valuable insights into modeling perceptual quality under High-Resolution (HR) imaging conditions. These methods can be categorized based on their methodological foundations, architectural designs, and feature extraction strategies. (a) Machine learning-based methods quantify an input image using a compact feature vector, which serves as the input to a regression model for score prediction. Chen et al. compute maximum gradient variations and local entropy across multiple image channels, which are combined to assess image sharpness [
11]. Yu et al. treat the outputs of BIQA indicators as mid-level features, and various regression models are investigated for scoring perceptual quality [
12]. Wu et al. collect multi-stage semantic features from a pre-trained deep neural network (DNN) and refine them using a multi-level channel attention module to improve prediction accuracy [
13]. Chen and Yu utilize a pre-trained DNN as a fixed feature extractor and evaluate a range of regression models [
14]. Although these feature-based approaches [
11,
12,
13,
14,
15] are computationally efficient, their reliance on handcrafted features often limits their representational capacity, making it challenging to capture the rich visual structures and complex distortions of UHD images. (b) Patch-based deep learning methods sample numerous sub-regions from an input image and feed these patches into a neural network. The quality of the image is predicted through jointly optimizing the feature representations and network parameters. Yu et al. develop a shallow network, and random patches from each image are used to train the network by minimizing the prediction error [
16,
17]. Bianco et al. explore the features from pre-trained networks and fine-tuning strategies, and the final image quality score is obtained by average-pooling the predicted scores across multiple patches [
18]. Ma et al. propose a multi-task learning network that simultaneously predicts distortion types and image quality by using two sub-networks [
19]. Su et al. introduce a hyper-network architecture that divides the BIQA process into content understanding, perception rule learning, and self-adaptive score regression [
20]. While these deep learning methods [
16,
17,
18,
19,
20] are effective and data-efficient, they typically assign the same global quality score to all patches regardless of local variation. This simplification overlooks the spatial heterogeneity and region-specific distortions that are pronounced in UHD images.
To the best of our knowledge, only a few studies have attempted to address UHD-BIQA challenges. These efforts have explored various strategies for coping with the high computational and perceptual complexity of UHD content. Huang et al. integrate ResNet [
21], Vision Transformer (ViT) [
22], and recurrent neural network (RNN) [
23] to benchmark their curated HR image database [
24]. This hybrid architecture is designed to combine spatial feature extraction, global context modeling, and sequential processing. Sun et al. develop a multi-branch framework that extracts features corresponding to global aesthetic attributes, local technical distortions, and salient content. These diverse feature representations of perceptual quality are fused and regressed into final quality scores [
25]. Tan et al. explore Swin Transformer [
26] to process full-resolution images. Their approach mimics human visual perception by assigning adaptive weights to different sub-regions and incorporates fine-grained frequency-domain information to enhance prediction accuracy [
27]. However, these approaches [
24,
25,
27] rely on complex multi-branch architectures and Transformer-based pyramid perception mechanisms, which demand substantial computational resources and processing times. These limitations hinder their scalability and practicality in real-world UHD-BIQA applications.
This study presents a novel framework, SUper-resolved Pseudo References In Dual-branch Embedding (SURPRIDE), to address UHD-BIQA challenges. The core idea is to leverage super-resolution (SR) as a lightweight and deterministic transformation that generates a pseudo-reference from the distorted input. Although SR is not intended to restore ground-truth quality, it introduces structured distortions that provide informative visual contrasts. Furthermore, by pairing each distorted image with its corresponding SR version, we construct a dual-branch architecture that simultaneously learns intrinsic quality representations from the original input and comparative quality cues from the pseudo-reference. This design enables the model to better capture fine-grained differences that are especially critical in UHD images. The main contributions of this work are summarized as follows:
We propose the SURPRIDE framework, which leverages SR reconstruction as a self-supervised transformation to generate external quality representations.
We implement a dual-branch network that jointly models intrinsic quality from distorted images and comparative cues from generated pseudo-references.
We design a hybrid loss function incorporating cosine similarity between dual-branch output features, enabling the model to learn from the relational differences between input patch pairs.
We conduct extensive experiments on UHD, high-definition (HD), and standard-definition (SD) image datasets, demonstrating SURPRIDE’s competitive performance and strong generalization ability in BIQA tasks.
The rest of this paper is organized as follows.
Section 2 details the proposed framework, including the motivation behind the dual-branch architecture and descriptions of its core components: patch preparation, the dual-branch network, and the hybrid loss function.
Section 3 outlines the experimental setup, including datasets, baseline BIQA models for comparison, implementation details, and evaluation protocol.
Section 4 presents a comprehensive analysis of the experimental results across UHD, HD, and SD image databases, along with supporting ablation studies on the UHD-IQA database [
10] and the intuitive explanation of the proposed hybrid loss function from the perspective of feature similarity change.
Section 5 discusses key findings, practical implications, and current limitations of the proposed framework. Finally,
Section 6 summarizes the work and highlights potential directions for future UHD-BIQA research.
5. Discussion
The UHD-BIQA task remains challenging. Most existing approaches rely heavily on handcrafted features [
11,
12,
13,
14,
15] and patch-based deep learning [
16,
17,
18,
19,
20,
43]. A limited number of models have been specifically tailored for UHD images [
24,
25,
27], and these typically involve complex architectures and computationally intensive modules such as Transformers [
22,
26,
45,
59]. While effective, such designs are often impractical for real-world deployment due to their high resource requirements and limited scalability. To address these limitations, we propose a lightweight dual-branch framework, SURPRIDE, to represent image quality by learning from both the original input and a super-resolved pseudo-reference. Specifically, ConvNeXt [
47] is employed as the VFM for efficient feature extraction, and SwinFIR [
49] is used for fast SR reconstruction. Inspired by the visual differences observed between high- and low-quality images after down-sampling and SR reconstruction, a hybrid loss function is introduced that balances prediction accuracy with feature similarity. This design enables the network to effectively learn both absolute and comparative quality cues.
The SURPRIDE framework demonstrates top-tier BIQA performance on the UHD-IQA (
Table 2,
Figure 4) and HRIQ (
Table 12) databases, both of which contain authentic UHD images. The effectiveness of the proposed method can be attributed to several key factors. First, the use of the pre-trained ConvNeXt [
47] as the VFM backbone in both branches allows for the extraction of intrinsic characteristics from the original input and comparative embeddings from the SR-reconstructed patch (
Table 5). While the primary branch plays the dominant role in BIQA performance, the SR branch provides valuable complementary information, leading to further improvements (
Table 10). Motivated by the observed phenomenon that higher-quality images often exhibit more noticeable degradation after SR reconstruction (
Figure 1), the features from both branches are weighted and concatenated to form a more robust representation for quality prediction. Second, critical parameters, including input image size, patch size, SR method, and scaling factor, are systematically optimized through extensive ablation studies (
Table 6,
Table 7,
Table 8,
Table 9 and
Table 10). These settings enable the framework to achieve optimal performance, with additional applications across HD and SD databases. Third, the proposed hybrid loss function encourages the learning of comparative embedding by leveraging differences introduced through SR reconstruction. This loss formulation enhances the network’s ability to model perceptual quality more accurately (
Table 8 and
Table 11). In summary, SURPRIDE combines a deep learning architecture with strategically tuned components and loss design to deliver superior or competitive results on the UHD image databases, while remaining efficient and scalable for real-world deployment.
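The weighted concatenation of branch features and the hybrid objective described above can be illustrated with a minimal NumPy sketch. The exact formulation is not restated in this section, so the sketch rests on assumptions: the regression term is taken to be a mean squared error, the similarity term is taken to be one minus the cosine similarity between the two branch features, and the function names, the SR-branch weight `w_sr`, and the balancing factor `lam` are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    """Row-wise cosine similarity between two (batch, dim) feature arrays."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return num / den

def fuse_features(f_orig, f_sr, w_sr=0.5):
    """Scale the SR-branch features by w_sr (assumed weight) and concatenate."""
    return np.concatenate([f_orig, w_sr * f_sr], axis=1)

def hybrid_loss(pred, target, f_orig, f_sr, lam=0.1):
    """Assumed form: MSE regression term plus lam * (1 - cosine similarity)."""
    regression = np.mean((pred - target) ** 2)
    alignment = np.mean(1.0 - cosine_similarity(f_orig, f_sr))
    return regression + lam * alignment

# Toy features standing in for the outputs of the two branches.
rng = np.random.default_rng(0)
f_orig = rng.normal(size=(4, 8))
f_sr = f_orig + 0.1 * rng.normal(size=(4, 8))
fused = fuse_features(f_orig, f_sr)           # shape (4, 16)
pred = rng.uniform(size=4)
target = rng.uniform(size=4)
print(fused.shape, hybrid_loss(pred, target, f_orig, f_sr))
```

Under these assumptions, the second term vanishes when the two branches produce identical feature directions and grows as they diverge, so `lam` controls how strongly the comparative cue influences training relative to score regression.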
The proposed SURPRIDE framework demonstrates strong performance not only on the UHD image databases (
Table 2 and
Table 12), but also on the HD (
Table 13 and
Table 14) and SD (
Table 15 and
Table 16) image databases. Falling under the category of patch-based deep learning methods, SURPRIDE randomly samples patches from the original images and reconstructs corresponding SR patches [
49]. These pairs of patches are used to extract deep representative features, which are weighted and concatenated for robust quality representation and score prediction. Unlike earlier approaches that use small patches of size 16 × 16 or 32 × 32 [
16,
17,
58], SURPRIDE adopts a much larger patch size of 384 × 384 (
Table 6), which better captures contextual and structural information. Prior studies suggest that larger patch sizes contribute positively to performance in BIQA tasks [
18,
25,
37,
41,
43]. Notably, SURPRIDE avoids reliance on high-computation modules or complex attention mechanisms. Its use of ConvNeXt [
47] for feature extraction proves effective across a wide range of image resolutions. However, SURPRIDE performs slightly worse than certain specialized BIQA algorithms, including HR-BIQA [
24] on the HRIQ database (
Table 12), ATTIQA [
42] on the KonIQ-10k database (
Table 14), and Prompt-IQA [
44] on the CLIVE database (
Table 15). This gap may be attributed to the highly customized architectural designs of these models. For example, HR-BIQA [
24] incorporates semantic content embeddings, ATTIQA [
42] leverages vision–language-based quality-relevant feature extraction, and Prompt-IQA [
44] benefits from image–score pairing in combination with extensive data augmentation tailored to the target databases. In addition, the diverse and dataset-specific characteristics [
14] pose inherent challenges in BIQA, making it difficult for a single algorithm to achieve consistent superiority across all benchmarks. Therefore, further investigation is needed to understand the underlying causes of SURPRIDE’s performance limitations on these particular databases and to explore potential strategies for improving its adaptability.
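The patch-pair preparation discussed in this paragraph (random patch sampling followed by SR reconstruction of a pseudo-reference) can be sketched as follows. This is an illustration under stated assumptions rather than the actual pipeline: the patch is assumed to be down-sampled by an integer factor and then restored to its original size, average pooling stands in for the down-sampling operator, and nearest-neighbour upsampling is only a placeholder for a learned SR network such as SwinFIR. All function names and the default `scale=2` are hypothetical; the 384 × 384 patch size follows the text.

```python
import numpy as np

def sample_patch(image, patch_size, rng):
    """Randomly crop a square patch from an (H, W, C) image."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - patch_size + 1)
    left = rng.integers(0, w - patch_size + 1)
    return image[top:top + patch_size, left:left + patch_size]

def downsample(patch, scale):
    """Average-pool an (S, S, C) patch by an integer factor (assumed operator)."""
    p = patch.shape[0] // scale
    c = patch.shape[2]
    return patch.reshape(p, scale, p, scale, c).mean(axis=(1, 3))

def placeholder_sr(low_res, scale):
    """Nearest-neighbour upsampling; a placeholder for a learned SR model."""
    return low_res.repeat(scale, axis=0).repeat(scale, axis=1)

def make_patch_pair(image, patch_size=384, scale=2, rng=None):
    """Return (original patch, pseudo-reference patch) of identical shape."""
    if patch_size % scale != 0:
        raise ValueError("patch_size must be divisible by scale")
    if rng is None:
        rng = np.random.default_rng()
    patch = sample_patch(image, patch_size, rng)
    pseudo_ref = placeholder_sr(downsample(patch, scale), scale)
    return patch, pseudo_ref

# Toy UHD-sized image with values in [0, 1).
rng = np.random.default_rng(0)
image = rng.random((2160, 3840, 3))
patch, pseudo_ref = make_patch_pair(image, rng=rng)
print(patch.shape, pseudo_ref.shape)
```

With a real SR model in place of `placeholder_sr`, each pair would feed the primary branch and the SR branch, respectively.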
The success of SURPRIDE across databases with varying resolutions (
Table 4) can be attributed to several key factors. (a) HR inputs enable fine-grained distortion learning. Large patch sizes (384 × 384 or 224 × 224) retain sufficient local information while also capturing broader spatial context. These patches represent meaningful regions of the image, enabling the model to learn fine-grained distortion patterns that are often consistent within HR content [
41,
55]. (b) Training with multiple patches per image increases data diversity and supports better generalization [
39,
43,
55,
58]. By sampling image patches either randomly or strategically, the model is exposed to a diverse range of distortions and scene content, helping to compensate for the unavailability of full-resolution images during training. (c) Patch-level features tend to be scale-invariant, allowing models trained on HR patches to generalize well across different image sizes [
39,
43,
55]. Local distortions in UHD images often exhibit characteristics similar to those in HD or SD images. (d) Moreover, processing entire UHD images directly is computationally prohibitive in terms of memory and training time. Patch-based learning provides a practical alternative, enabling the reuse of deep networks [
21,
22,
26,
37,
42,
43,
47,
59] without compromising batch size or training stability. (e) The proposed SURPRIDE learns image quality representation without large-scale external datasets. Many SOTA works require external data, HR inputs, or large computational budgets to perform well on certain datasets [
34]. However, such resources are not always accessible in real-world applications. By generating pseudo-reference images directly from a single image, we provide a self-sufficient mechanism that preserves performance without relying on external ground truths. This ensures distributional consistency between the input image and the generated supervision and mitigates the domain gap typically introduced by training on external datasets with potentially mismatched distributions. In conclusion, patch-based learning strategies remain highly effective for UHD-BIQA, offering a favorable balance between computational efficiency and model accuracy.
Unsurprisingly, dual- and multi-branch network architectures have gained increasing prominence in advancing BIQA tasks [
24,
25,
27,
37,
38]. This trend reflects the growing demand for richer and more nuanced modeling of perceptual image quality that single-branch models often struggle to achieve effectively. First, these architectures enable the separation and specialization of complementary information [
37,
39,
55,
58]. Dual-branch networks typically extract different feature types, such as global semantics (or quality-aware encoding) in one branch and local distortions (or content-aware encoding) in the other. Multi-branch networks may explicitly model multiple perceptual dimensions, including aesthetic quality, data fidelity, object saliency, and content structure. This separation facilitates better feature disentanglement, allowing the network to handle the complex, multidimensional characteristics of human visual perception effectively [
37,
39]. Second, robustness and adaptability across distortion types are enhanced. Real-world images often exhibit mixed or unknown distortion types, and no single representation is sufficient to capture all relevant quality variations [
43,
51]. Branching allows each sub-network to specialize in detecting specific distortions or perceptual cues, leading to improved generalization and performance across diverse scenarios. Third, dual- and multi-branch architectures provide flexibility for adaptive fusion [
37,
38,
39,
55,
58]. By incorporating attention mechanisms, gating functions, or learned weighting schemes, these models can dynamically integrate information from multiple branches. This enables the network to emphasize the most relevant features at inference time, which is particularly important for UHD content or images with complex structures, where different regions may contribute unequally to perceived quality. Ultimately, the popularity of dual- and multi-branch networks is driven by their superior ability to align with human perceptual processes. Their modular and interpretable design supports the modeling of multi-scale, multi-type distortions in a way that more closely reflects how humans assess image quality. As a result, such architectures have consistently demonstrated strong performance on challenging, distortion-rich datasets in both synthetic and authentic environments.
To improve understanding of the proposed hybrid loss function, we employed both linear CKA (Figure 6) and t-SNE (Figure 7) to analyze the feature representations learned by the network. The hybrid loss was designed to balance prediction accuracy (via the quality regression loss) and feature similarity (via the perceptual alignment loss), thereby encouraging the model to capture not only absolute quality cues but also relative perceptual differences. This combination helps stabilize training and guides the network toward more discriminative and perceptually aligned representations. Visualization techniques offer an intuitive perspective on the effectiveness of the hybrid loss in terms of representational consistency [
60] and representation learning [
61,
62] by examining the behavior of dual-branch architectures and their learned feature spaces. Linear CKA provides a quantitative measure of similarity between feature spaces under different training objectives, revealing how the hybrid loss promotes stronger alignment between features from the original and reconstructed images [
52]. Meanwhile,
t-SNE maps high-dimensional features into a low-dimensional space, allowing us to visually observe cluster separation and structural consistency among images of different quality levels [
53]. Together, these analyses confirm that the hybrid loss promotes functional disentanglement across branches and enhances robustness and generalization in perceptual quality prediction without incurring redundant architectural complexity, ultimately contributing to improved BIQA performance.
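For readers unfamiliar with the similarity measure used in this analysis, linear CKA between two feature matrices can be computed as in the following minimal sketch. It uses the standard linear-kernel formulation on column-centered features; the function name and the toy inputs are illustrative and are not taken from the experiments reported here.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between (n_samples, d1) and (n_samples, d2) feature matrices."""
    # Center each feature dimension across samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)

# Toy features standing in for the primary-branch and SR-branch embeddings.
rng = np.random.default_rng(0)
f_orig = rng.normal(size=(100, 16))
f_sr = f_orig + 0.5 * rng.normal(size=(100, 16))
print(linear_cka(f_orig, f_sr))
```

The score lies in [0, 1], equals 1 for identical representations, and is invariant to isotropic scaling and orthogonal transformations of either feature space, which makes it suitable for comparing branches trained under different objectives.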
Several limitations remain in the current study. First, the design and exploration of loss functions are not comprehensive. Alternative loss functions, such as marginal cosine similarity loss [
63], may offer additional benefits for enhancing feature similarity learning and improving quality prediction accuracy. Second, the feature fusion strategy employed is relatively simplistic. More advanced fusion techniques, including feature distribution matching [
64] and cross-attention-based fusion [
65], could be explored to enrich the representational power of the dual-branch framework. Third, the choices of backbone and SR architectures are not exhaustively evaluated. Integrating other promising modules, such as MobileMamba [
66], arbitrary-scale SR [
67], uncertainty-aware models [
68], or sequential feature processing and score prediction [
69], may further boost performance. Fourth, the proposed framework lacks fine-grained optimization of hyper-parameters and architectural configurations. A more systematic exploration of design choices could lead to additional performance gains for UHD images. Fifth, the time cost of the SR branch remains a major bottleneck for real-time UHD-BIQA (
Table 4). Promising solutions can be drawn from lightweight SR networks, which leverage techniques such as knowledge distillation, algorithmic and procedural optimization, problem re-parameterization, dedicated DNN architecture design, and model quantization [
70]. These strategies not only reduce computational burden but also preserve quality-related feature embedding, thereby making real-time deployment more feasible. Finally, Transformer-based and other advanced large multi-modal models [
22,
26,
59] that exploit image content and additional text cues could be investigated in the UHD-BIQA field.