1. Introduction
Visual quality assessment is critically important in modern multimedia and communication systems, driven by the increasing demand for high-quality visual content and enhanced quality of experience (QoE) [
1]. Objective quality evaluation plays a central role in applications such as compression, restoration, transmission, content generation, and 3D reconstruction, where perceptual fidelity serves as a fundamental benchmark for algorithm design and system optimization.
In many practical scenarios, visual data originate from sensing devices such as mobile cameras, surveillance cameras, autonomous driving sensors, and RGB-D imaging systems. Imperfections in these sensing pipelines—including sensor noise, exposure instability, motion blur, and optical distortions—can significantly affect the perceptual quality of captured visual signals [
2,
3].
Traditionally, visual quality assessment has been divided into subjective and objective paradigms [
4,
5]. Subjective assessment collects human perceptual opinions under standardized viewing conditions and provides reliable ground-truth labels. However, its high cost and limited scalability restrict practical deployment. To enable automated and efficient evaluation, objective models have been developed to approximate perceptual judgments computationally. Early approaches relied on handcrafted perceptual priors, natural scene statistics (NSS), and statistical modeling [
6,
7,
8,
9,
10,
11,
12], while subsequent deep learning methods shifted toward data-driven representation learning, distribution learning, and supervised regression [
13,
14,
15,
16,
17,
18]. These conventional approaches have substantially improved prediction accuracy, yet they remain largely distortion-centric and scalar-regression-oriented.
Over time, the scope of visual quality assessment has expanded from images to videos and 3D content [
19,
20,
21,
22,
23,
24]. Image quality assessment (IQA) operates under fixed spatial observation, focusing on structural fidelity and perceptual naturalness within a single frame. Video quality assessment (VQA) introduces temporal integration, where motion consistency, temporal masking, and perceptual memory jointly shape quality judgments [
19,
20,
21,
25,
26]. 3D quality assessment (3DQA) further complicates the problem through viewpoint-conditioned and rendering-mediated perception: visual quality is not observed from a single static view, but depends on viewpoint selection, visibility, shading, and, in dynamic scenarios, temporal coherence [
22,
23,
24,
27]. As visual content becomes increasingly diverse and high-dimensional, distortion-centric modeling alone becomes insufficient for capturing complex semantic and contextual factors, particularly in AI-generated content (AIGC) [
28,
29,
30,
31,
32,
33,
34,
35].
The rise of large multimodal models (LMMs) has opened a new frontier for visual quality assessment [
36,
37,
38,
39,
40,
41]. In this paper, we use LMMs as an umbrella term encompassing both vision–language models (VLMs) and multimodal large language models (MLLMs).
VLMs, typically represented by models such as CLIP [
36] and Flamingo [
37], aim to bridge visual content and textual descriptions through cross-modal modeling. In quality assessment, they enable prompt-based evaluation by mapping images or videos to quality-related semantic concepts [
42,
43,
44,
45,
46,
47,
48]. However, such approaches largely remain within a similarity-based or regression-oriented paradigm. In contrast, MLLMs extend this capability by incorporating large language models (LLMs) as reasoning backbones, enabling instruction-following, contextual understanding, and multi-step inference. This allows quality assessment to move beyond static score prediction toward prompt-conditioned evaluation, comparative reasoning, explanation generation, and quality-aware critique [
49,
50,
51,
52,
53,
54,
55,
56,
57,
58,
59,
60].
This evolution signals a broader transformation in visual evaluation—from distortion-sensitive regression toward semantically grounded and reasoning-capable assessment. Rather than relying solely on predefined distortion measures, modern approaches increasingly integrate perceptual cues with high-level semantic interpretation and language-conditioned criteria. As illustrated in
Figure 1, visual quality assessment evolves along both modal complexity (from images to videos and 3D content) and methodological progression (from distortion-centric modeling to deep learning and further to LMM-driven reasoning). In particular, the paradigm shift is not only reflected in improved predictive performance, but also in a fundamental transition from scalar regression toward semantically grounded and instruction-driven evaluation [
49,
54,
57,
60].
In this survey, we provide a structured and comprehensive review of LMM-driven visual quality assessment across three major modalities: image, video, and 3D content. For each domain, we first introduce the task characteristics and dataset evolution, then summarize conventional modeling approaches, and finally examine recent advances enabled by LMMs. By organizing the literature along both modality and methodological progression, we aim to clarify the shared structural trends underlying visual evaluation and highlight emerging challenges in calibration, robustness, and scalability. To provide a more complete evaluation perspective, we further summarize commonly used evaluation metrics and benchmarking protocols in
Section 4.6.
3. Video Quality Assessment: From Deep Learning to LMMs
3.1. VQA Task Description
VQA, while conceptually rooted in IQA, is not merely image quality prediction applied frame by frame. Its defining challenge is that perceived quality is formed over time: motion, temporal persistence, and evolving distortions jointly shape the final judgment.
Given a video sequence V, captured by devices such as mobile cameras, surveillance systems, and vehicle-mounted sensors, VQA aims to learn a mapping:
$$f: V \mapsto q,$$
where
q approximates subjective quality labels obtained from laboratory studies or crowdsourcing protocols. A straightforward extension of IQA applies frame-level predictors to sampled frames and averages the outputs. However, such frame-wise strategies neglect inter-frame dependencies and cannot faithfully represent how human observers accumulate quality impressions over time. Effective VQA must therefore model not only spatial fidelity within individual frames but also the temporal organization of perceptual evidence across frames.
3.1.1. Perceptual Characteristics of Video Quality
Compared with static IQA, VQA is distinguished by a temporally formed perceptual target rather than a temporally extended input alone.
First, perceptual quality is accumulated over time rather than determined instantaneously. Observers do not judge a video as an unordered collection of frames; instead, local impressions are integrated into a global quality judgment through temporal pooling and perceptual memory. As a result, brief but salient degradations may disproportionately influence the final opinion, and quality prediction cannot be reduced to simple frame averaging [
20,
21].
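To make the contrast between frame averaging and temporally sensitive pooling concrete, the following Python sketch compares a plain mean with a worst-fraction pooling rule applied to per-frame scores. The function names, the 10% fraction, and the synthetic score sequence are illustrative assumptions; this is not the pooling rule of any method cited here.

```python
import numpy as np

def mean_pool(frame_scores):
    """Plain frame averaging: every frame contributes equally."""
    return float(np.mean(frame_scores))

def worst_fraction_pool(frame_scores, fraction=0.1):
    """Average only the lowest-scoring share of frames.

    `fraction` is a hypothetical setting chosen for illustration.
    """
    scores = np.sort(np.asarray(frame_scores, dtype=float))
    k = max(1, int(round(fraction * scores.size)))
    return float(scores[:k].mean())

# Synthetic example: 100 frames of good quality (4.0 on a 1-5 scale)
# with a brief, severe degradation lasting five frames.
scores = np.full(100, 4.0)
scores[40:45] = 1.0

print("mean pooling:          ", mean_pool(scores))
print("worst-fraction pooling:", worst_fraction_pool(scores))
```

In this synthetic case the plain mean stays near the undistorted level (3.85), while the worst-fraction pool falls to 2.5, which is closer in spirit to the observation that short but salient degradations can dominate the overall opinion.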
Second, artifact visibility is conditioned by motion. Motion may partially mask some spatial impairments, yet it can also amplify distortions that are weak in still images but highly noticeable in dynamic viewing, such as flicker, judder, ghosting, and temporal aliasing. Consequently, VQA must account for motion-conditioned sensitivity and temporal coherence, rather than treating distortion magnitude as a purely spatial quantity [
19,
25].
Third, authentic videos often exhibit mixed and non-stationary degradations introduced by capture, processing, compression, and transmission. Sensor noise, motion blur, shakiness, focus instability, exposure fluctuation, and bitrate variation may co-occur and evolve over time, interacting with scene dynamics in a content-dependent manner. This makes the quality signal temporally heterogeneous and often more ambiguous than the relatively stationary settings considered in conventional IQA [
20,
21,
26].
3.1.2. Structural Scope and Modeling Implications
The historical development of VQA reflects a shift from static fidelity estimation to temporally organized perceptual inference. Early VQA methods had already shown that motion-aware analysis is essential: quality must be evaluated not only in space but also along motion trajectories and through spatio-temporal statistics [
19,
25]. With the rise of user-generated content (UGC), the field further moved toward models that address authentic capture distortions, long-range temporal dependencies, and efficient temporal sampling, as represented by methods such as TLVQM, VSFA, and FAST-VQA [
20,
21,
26].
From the perspective of this survey, the core significance of VQA is therefore not simply that videos are longer than images, but that their perceptual quality is formed through temporal interaction, persistence, and variation. This temporal formation mechanism fundamentally distinguishes VQA from IQA and explains why subsequent advances in datasets, architectures, and evaluation protocols increasingly emphasize temporal reasoning rather than static distortion estimation alone.
At the same time, most conventional VQA methods still ultimately compress the viewing process into a single scalar score. Such a formulation is effective for many benchmark settings, yet it becomes increasingly strained when perceptual quality is entangled with semantic continuity, generated motion plausibility, or richer explanation demands. This limitation motivates the LMM-based VQA paradigms discussed in
Section 3.4.
3.2. Evolution of VQA Datasets
The evolution of VQA datasets reflects a clear expansion from controlled codec distortions to authentic temporal experience and, more recently, semantically complex generated videos (
Table 4).
Early benchmarks such as LIVE-VQA [
95] and CSIQ-VQA [
112] mainly focus on compression and packet-loss impairments under relatively controlled conditions. These datasets encouraged distortion-aware modeling with relatively simple temporal pooling strategies, where video quality was largely treated as a temporally extended version of image fidelity assessment.
The introduction of UGC-oriented datasets marked a major shift. KoNViD-1k [
98], LIVE-VQC [
99], and YouTube-UGC [
100] expose mixed authentic degradations caused by diverse capture devices, motion patterns, scene contents, and editing pipelines. These datasets demonstrate that real-world video quality is shaped by temporally varying and content-dependent distortions rather than by isolated codec artifacts alone. Large-scale datasets such as LSVQ [
101] further support data-driven temporal representation learning and more reliable cross-content evaluation. Large-scale and in-the-wild studies also continue to broaden the diversity of video sources and temporal distortions encountered in practice.
A related branch of dataset development emphasizes streaming QoE. Datasets such as LIVE-NFLX [
102] incorporate bitrate adaptation, delivery dynamics, and playback continuity, showing that perceptual quality depends not only on spatial fidelity within frames but also on temporal stability, interruption patterns, and viewing continuity across time. This branch broadens VQA from artifact recognition toward user-centered modeling of temporal viewing experience.
More recently, AI-generated benchmarks have pushed VQA toward semantically richer and temporally more challenging scenarios. Broad evaluation suites such as FETV [
32], VBench [
33], EvalCrafter [
34], TC-Bench [
109], and DEVIL [
110] increasingly emphasize temporal compositionality, motion plausibility, semantic consistency, and holistic generation quality. HVEval [
35] further specializes this trend in human-centric videos, where local facial realism, identity consistency, and articulated motion strongly influence perception. In these benchmarks, video quality is increasingly entangled with semantic continuity, motion plausibility, and generation faithfulness.
Overall, the evolution of VQA datasets shows that VQA is not merely an image-based problem with more frames. Instead, benchmark design has progressively expanded from controlled distortion evaluation to authentic temporal perception, streaming experience, and AI-generated video realism. This broader dataset landscape explains why effective VQA models must increasingly address temporal organization, motion-conditioned perception, and semantic video consistency rather than relying on frame-level distortion analysis alone.
3.3. Conventional Modeling Approaches for VQA
Conventional VQA modeling approaches can be broadly categorized into classical handcrafted methods and deep learning-based representation learning frameworks.
3.3.1. Handcrafted VQA Methods
Early VQA methods explicitly modeled spatio-temporal statistics and motion-aware perceptual cues. MT-STQA [
25] is an FR-VQA metric that models human visual sensitivity to motion by jointly analyzing spatial distortion and temporal motion information. Reduced-reference (RR) designs such as STRRED [
113] transmitted compact statistical features and computed entropic differences to estimate quality degradation. For NR settings, Video BLIINDS [
19] leveraged spatio-temporal NSS and motion coherence features to balance efficiency and prediction accuracy. VIIDEO [
114] models intrinsic statistical regularities in natural videos without using opinion labels, while STEM [
115] combines a temporal straightness-based perceptual measure with blind spatial quality cues to improve generalization on authentic UGC videos.
These classical approaches explicitly encode perceptual priors, but they rely on predefined distortion statistics and adapt poorly to complex in-the-wild degradations.
3.3.2. Conventional Deep Learning-Based VQA Methods
Deep learning shifted VQA toward representation-driven modeling of temporal dynamics. VSFA [
21] employed CNN feature extraction followed by GRU-based temporal modeling to capture long-term dependencies and perceptual memory effects, establishing a recurrent-learning paradigm for UGC videos. Subsequent works explored stronger spatio-temporal architectures. FAST-VQA [
26] introduced fragment-based sampling and attention-based aggregation to reduce computational cost while preserving temporal cues. Complementary studies such as RAPIQUE [
116] further emphasized efficient content-aware feature reuse and quality-aware pretraining for in-the-wild videos. DisCoVQA [
117] modeled distortion–content interaction through Transformer-based temporal coupling. STI-VQA [
118] emphasized spatial-motion interaction by tokenizing distortion statistics and motion features within a Transformer framework. SAMA [
119] proposed a scaling-and-masking sampling strategy for image and video quality assessment, aiming to preserve both local details and global semantics within a regular input size.
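The fragment-based sampling idea mentioned above can be illustrated with a short Python sketch: the frame is divided into a uniform grid, one small patch is cropped at native resolution from each cell, and the patches are stitched into a compact input. The function name, the 7 x 7 grid, the 32-pixel patch size, and the choice to share crop offsets across frames are assumptions made for illustration, not a reproduction of the FAST-VQA implementation.

```python
import numpy as np

def sample_fragments(video, grid=7, patch=32, rng=None):
    """Stitch grid-aligned raw-resolution patches into a small clip.

    video: array of shape (T, H, W, C).
    Returns an array of shape (T, grid * patch, grid * patch, C).
    """
    rng = np.random.default_rng() if rng is None else rng
    t, h, w, c = video.shape
    cell_h, cell_w = h // grid, w // grid
    if cell_h < patch or cell_w < patch:
        raise ValueError("frame too small for the requested grid and patch size")

    out = np.empty((t, grid * patch, grid * patch, c), dtype=video.dtype)
    for i in range(grid):
        for j in range(grid):
            # One random offset per grid cell, shared by all frames, so that
            # frame-to-frame differences inside a patch remain meaningful.
            dy = rng.integers(0, cell_h - patch + 1)
            dx = rng.integers(0, cell_w - patch + 1)
            y0, x0 = i * cell_h + dy, j * cell_w + dx
            out[:, i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = \
                video[:, y0:y0 + patch, x0:x0 + patch]
    return out

# Usage on a synthetic 4-frame clip at 480 x 270 resolution.
clip = np.zeros((4, 270, 480, 3), dtype=np.uint8)
print(sample_fragments(clip, rng=np.random.default_rng(0)).shape)
```

Because the patches are cropped rather than resized, local texture and compression artifacts are kept at their original scale, while the stitched input remains far smaller than the full frame.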
Table 5 compares representative VQA approaches across classical, deep, and LMM-based paradigms.
To address perceptual complexity beyond a single MOS score, DOVER [
120] disentangles technical distortion perception from aesthetic preference by modeling quality from both technical and aesthetic perspectives. Building on the need for interpretability, FineVQ [
104] introduces fine-grained, attribute-level supervision, incorporating multi-dimensional quality ratings (e.g., color, noise, blur, temporal) and degradation-type annotations (e.g., over-exposure, blur, jitter, frame drop, stall) to enable quality attribution and more explainable assessment. To address practical variability in video formats, ModularBVQA [
121] incorporates resolution-aware spatial rectifiers and frame-rate-aware temporal rectifiers to explicitly compensate for differences in spatial and temporal sampling conditions. Controlled analyses of spatio-temporal modeling and in-the-wild video quality datasets further investigated dataset bias and capacity limits in deep VQA architectures [
21,
125].
Overall, deep learning substantially improves spatio-temporal representation capacity compared with handcrafted approaches. However, these methods remain largely regression-centric and supervision-dependent, motivating the transition toward LMM-based reasoning and instruction-driven evaluation discussed in
Section 3.4.
3.4. Large Multimodal Model Approaches for VQA
Instead of treating VQA as a fixed scalar-regression task, LMM-based approaches allow quality criteria to be specified in natural language and, in some cases, generate structured explanations alongside scores. These paradigms are particularly valuable for AI-generated and human-centric videos, where degradations may be semantic (implausible content) or temporal (identity drift, motion discontinuity), and cannot be exhaustively enumerated as distortion labels.
3.4.1. VLM-Based VQA Methods
VLM-based VQA methods adapt pretrained vision–language encoders to inject semantic awareness into quality prediction without heavy MOS supervision.
BUONAVISTA [
46] introduced an opinion-unaware VQA framework that measures semantic affinity between video embeddings and quality-related textual prompts. By combining prompt-derived semantic scores with conventional technical indices, BUONAVISTA demonstrated that CLIP-like encoders encode perceptual plausibility cues useful for quality estimation. BVQI [
47] further refined text-prompted semantic criteria by designing more robust prompt formulations and introducing localized semantic affinity measures. These mechanisms improve sensitivity to high-level perceptual failures beyond low-level distortion statistics, particularly in AI-generated or semantically complex videos. More recently, CAMP-VQA [
126] extended this line of research from prompt-based semantic matching to caption-guided multimodal regression. It uses quality-aware prompts to guide a pretrained BLIP-2 model to generate fine-grained quality captions, and then fuses semantic, temporal, and spatial features through dedicated branches before regressing final MOS scores. Compared with earlier CLIP-affinity methods, CAMP-VQA preserves the regression-style output paradigm while offering stronger artifact-semantic awareness and better modeling of compressed UGC videos.
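The prompt-affinity principle behind such opinion-unaware methods can be sketched as follows, assuming that frame embeddings and the embeddings of a positive and a negative quality prompt have already been produced by some CLIP-like encoder. The function names, the antonym-prompt pairing, and the temperature value are illustrative assumptions; the sketch does not reproduce the exact formulation of BUONAVISTA or BVQI.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def prompt_affinity(frame_emb, pos_text_emb, neg_text_emb, temperature=0.01):
    """Return a quality-like score in (0, 1) from prompt similarities.

    frame_emb:    (T, D) per-frame visual embeddings.
    pos_text_emb: (D,) embedding of a prompt such as "a high quality photo".
    neg_text_emb: (D,) embedding of a prompt such as "a low quality photo".
    `temperature` is a hypothetical value chosen for illustration.
    """
    f = l2_normalize(np.asarray(frame_emb, dtype=float))
    p = l2_normalize(np.asarray(pos_text_emb, dtype=float))
    n = l2_normalize(np.asarray(neg_text_emb, dtype=float))
    sim_pos = f @ p  # cosine similarity to the positive prompt, shape (T,)
    sim_neg = f @ n  # cosine similarity to the negative prompt, shape (T,)
    # A two-way softmax over (pos, neg) equals a sigmoid of the scaled difference.
    per_frame = 1.0 / (1.0 + np.exp(-(sim_pos - sim_neg) / temperature))
    return float(per_frame.mean())

# Toy 3-dimensional embeddings standing in for real encoder outputs.
pos, neg = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
print(prompt_affinity(np.array([[0.9, 0.1, 0.0]]), pos, neg))
```

No opinion labels enter this computation, which is why such scores are usually combined with other indices or lightly calibrated before being compared with MOS.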
Such VLM-based approaches act as a transitional bridge between conventional deep VQA and fully instruction-driven MLLM evaluators. They enhance semantic sensitivity while largely preserving regression-style score outputs.
3.4.2. MLLM-Based VQA Methods
More recent approaches leverage MLLMs to reformulate VQA as an instruction-following and reasoning task.
LMM-VQA [
57] casts MOS prediction as a visual-question-answering-style problem, aligning spatio-temporal visual tokens with an MLLM to output quality levels conditioned on textual prompts. This design allows flexible quality specification and moves beyond fixed regression heads. CP-LLM [
58] introduced dual vision encoders (context-level and pixel-level) and multi-task objectives spanning scoring, description generation, and pairwise comparison. By integrating contextual and fine-grained cues, CP-LLM improves sensitivity to subtle distortions while maintaining semantic reasoning capability. VQAThinker [
122] introduces reinforcement learning with quality-aware reward designs to jointly optimize interpretable distortion analysis and robust score prediction. Quality-oriented pretraining is a complementary strategy: it aims to endow models with stable perceptual priors before task-specific adaptation.
MLLM-based VQA also expands evaluation targets beyond scalar scores.
The work in [
123] constructed dedicated instruction datasets for video quality perception, training models that interleave visual and motion tokens to support both scalar scoring and quality-related question answering. This work highlights the role of curated instruction data in strengthening controllable quality understanding. VQ-Insight [
127] further explores reasoning-oriented quality understanding for AI-generated videos through progressive visual reinforcement learning, emphasizing multi-dimensional scoring and temporal modeling in AIGC-specific settings.
In human-centric and face-focused scenarios, FVQ [
124] models quality with an emphasis on identity preservation, facial detail fidelity, and temporal coherence across frames, thereby improving sensitivity to subtle distortions that affect facial authenticity and recognition consistency.
Table 6 summarizes the performance of representative conventional and LMM-based NR-VQA methods on five common UGC benchmarks. Compared with the corresponding IQA results, a different trend can be observed here: LMM-based methods already show clearer competitiveness and, in several cases, stronger overall performance on common in-domain VQA benchmarks. In particular, native VQA-oriented MLLM methods such as LMM-VQA [
57], the method in [
123], and VQAThinker [
122] achieve highly competitive and, in some cases, leading results across multiple datasets.
This difference suggests that the role of LMMs may be more substantial in VQA than in IQA. Unlike static image quality prediction, VQA requires the joint modeling of spatial distortion, temporal coherence, motion continuity, and semantic plausibility over time, making the task inherently more aligned with temporally contextualized and reasoning-aware architectures. At the same time, the table also indicates that these gains are not automatic: strong conventional methods such as FAST-VQA [
26], DOVER [
120], and SAMA [
119] remain highly competitive, while transferred reasoning-oriented IQA models such as Q-Insight [
93] and VisualQuality-R1 [
94] fall clearly behind dedicated VQA designs. Therefore, the current evidence does not point to a universal replacement of conventional VQA pipelines, but it does suggest that LMM-based approaches have already become a particularly promising direction for VQA.
3.5. Summary of VQA Progression
In summary, the development of VQA reflects a transition from handcrafted spatio-temporal descriptors and shallow perceptual priors to increasingly expressive deep architectures and, more recently, LMMs. Early NR methods mainly relied on natural video statistics, motion coherence, and hand-crafted temporal features, while subsequent deep models substantially improved performance by learning richer spatial–temporal representations from large-scale UGC datasets.
As shown by the benchmark comparison above, recent LMM-based methods already exhibit a clearer advantage in VQA than is currently observed in IQA. In particular, native VQA-oriented MLLM frameworks achieve highly competitive and, in some cases, leading performance across multiple common UGC benchmarks, suggesting that VQA may benefit more directly from architectures with stronger temporal context modeling and higher-level reasoning capacity. At the same time, strong conventional methods remain highly competitive, and the table also shows that not every reasoning-based quality model transfers equally well to VQA.
Therefore, the progression reviewed in this section is best understood as an evolution toward richer temporal, semantic, and reasoning-aware quality modeling. Rather than indicating a simple replacement of conventional VQA methods, current evidence suggests that LMMs have become a particularly promising direction for NR-VQA, especially when quality judgment depends on the joint understanding of spatial distortion, motion dynamics, and temporal consistency.
4. 3D Quality Assessment: From Deep Learning to LMMs
4.1. Task Description of 3DQA
3DQA concerns the perceptual evaluation of three-dimensional visual content, including point clouds, textured meshes, and volumetric reconstructions acquired by sensing systems such as LiDAR, RGB-D cameras, structured-light devices, and multi-view capture pipelines. Given a 3D asset
X, 3DQA aims to learn a mapping:
$$f: X \mapsto q,$$
where
q approximates subjective quality scores obtained under human viewing protocols.
Unlike IQA and VQA, the perceptual target in 3D is defined over native geometric and attribute representations whose visual consequences are not trivially readable from Euclidean grids or temporal frames. The central challenge of 3DQA is therefore to assess how distortions in a 3D representation translate into perceptual quality degradation under human observation. Such perceptual evidence may be exposed through rendered views, viewpoint changes, and visibility conditions, but it may also be approximated directly from intrinsic geometric and attribute structures. In this sense, projection is an important perceptual interface rather than the sole defining basis of 3DQA.
4.1.1. Perceptual Characteristics of 3D Quality
Compared with IQA and VQA, the distinctive challenge of 3DQA lies in the gap between native 3D representations and their perceptual manifestation. A 3D object is not observed as a single fixed signal; instead, its perceived quality depends on how structural and appearance distortions become visible, salient, and interpretable under incomplete observation.
First, 3D representations exhibit structural irregularity. Point clouds are unordered and often non-uniformly sampled, while meshes and reconstructed surfaces may contain missing regions, topological defects, or unstable local structures. The perceptual importance of these errors is highly uneven: distortions near contours, thin components, articulated parts, or high-curvature regions are often much more noticeable than those in visually redundant areas. Therefore, 3DQA must account not only for distortion magnitude but also for how structural errors are distributed over perceptually sensitive regions.
Second, geometry and appearance are tightly coupled in perception. Geometric perturbations can alter normals, silhouettes, shading, and occlusion relationships, thereby changing perceived appearance even when texture values remain unchanged. Conversely, color and texture degradations may amplify the visual salience of structural defects. Perceived quality is thus determined by the joint effect of geometric fidelity and attribute consistency rather than by either factor alone.
Third, 3D quality perception is observation-dependent. Because only part of a 3D object is visible under any given observation condition, factors such as viewpoint, visibility, occlusion, and rendering strategy affect how perceptual evidence is exposed. However, these factors should be understood as mechanisms for revealing quality-relevant evidence, rather than as the only valid representation space for assessment. Accordingly, practical 3DQA methods may reason about them explicitly through rendered views, or implicitly through native 3D representations that encode perceptually relevant structural cues.
4.1.2. Structural Scope and Dynamic Extension
From this perspective, the canonical 3DQA problem is best understood as a perceptual assessment of native 3D distortions under observation-dependent evidence. Its central question is not simply how much a 3D representation deviates from a reference, nor solely how it appears under a particular rendering pipeline, but how geometric and attribute errors become perceptually meaningful to human observers. This broader formulation naturally supports multiple methodological routes, including intrinsic 3D modeling, projection-based perceptual modeling, and hybrid strategies that combine both.
Most existing 3DQA settings focus on static content, where the main challenge lies in reasoning over structural fidelity, geometry–appearance coupling, and incomplete perceptual exposure. Recent applications, however, increasingly involve dynamic 3D media such as dynamic point clouds, digital humans, and volumetric video. In these scenarios, temporal instability, flicker, topology inconsistency, and motion-dependent structural artifacts introduce an additional coherence dimension. Nevertheless, such temporal effects should be viewed as an extension of 3DQA to dynamic immersive content rather than the defining property of 3D perception itself.
Overall, the core significance of 3DQA is that perceptual quality must be inferred across a representation-to-perception gap: the object is stored in native 3D form, but its quality is judged through partially exposed perceptual consequences. This formulation explains why subsequent developments in datasets and models evolve along multiple directions, including rendered-view assessment, native 3D representation learning, and multimodal reasoning over geometry, appearance, and semantics. It also motivates the LMM-based 3DQA paradigms discussed in
Section 4.4, where rendered observations, intrinsic 3D structure, and language-guided interpretation begin to be integrated within a unified framework.
4.2. Evolution of 3DQA Datasets
The evolution of 3DQA datasets (
Table 7) reflects a gradual broadening of the task from controlled geometric distortion analysis to perceptual evaluation over richer 3D representations, observation conditions, and application scenarios.
Early point-cloud benchmarks such as G-PCD [
130,
131] and RG-PCD [
132] mainly focus on geometric distortions under relatively controlled settings, enabling systematic analysis of spatial fidelity but offering limited coverage of attribute-related perception. Subsequent datasets incorporate color information, distortion diversity, and more realistic processing pipelines, including IRPC [
134], SJTU-PCQA [
23], WPC [
135], and BASICS [
137]. Large-scale NR settings such as LS-PCQA [
3] further shift the emphasis toward robustness under realistic and mixed degradations.
Beyond static point clouds, the dataset landscape has expanded toward broader 3D content types and more diverse perceptual factors. Temporal 3DQA emerges with compressed or dynamic sequences, such as WPC2.0 [
136] and DPCD [
141], while volumetric video quality is represented by VsenseVVDB [
133]. On the mesh side, CMDM [
138] and textured mesh datasets such as TMQA [
139] explicitly highlight geometry–texture interactions. These developments indicate that 3DQA is no longer confined to geometry-only fidelity estimation but increasingly concerns the joint perceptual effect of structure, attributes, and content type.
More recent datasets further extend 3DQA toward semantically sensitive and human-centric scenarios. Human-centric 3DQA becomes prominent in head and full-body digital human settings, such as SJTU-H3D [
48], as well as dynamic and streaming scenarios, including DDHQA [
140] and DHQA-4D [
142]. In such settings, perceptual sensitivity is highly non-uniform, with semantically critical regions (e.g., face and articulated body parts) often dominating subjective judgments. Subjective evaluation may be conducted through rendered views or sequences, but the underlying assessment target remains the quality of native 3D content and its perceptual consequences.
Overall, the evolution of 3DQA datasets suggests that the field is moving from geometry-centric distortion evaluation toward representation-diverse, perception-oriented, and human-centric assessment. This broader dataset landscape naturally supports multiple methodological routes, including intrinsic 3D modeling, projection-based perceptual modeling, and hybrid approaches, and thereby motivates the progression from conventional deep models to LMM-driven 3DQA paradigms.
4.3. Conventional Modeling Approaches for 3DQA
Conventional 3DQA methods can be broadly grouped into two paradigms. Intrinsic 3D modeling operates directly on native geometry and attributes, aiming to characterize distortions in the original 3D representation. Projection-based perceptual modeling, by contrast, evaluates rendered observations that more closely mimic human viewing. Recent methods increasingly bridge the two by combining projection cues with auxiliary geometric representations.
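As a rough illustration of how projection-based pipelines expose a 3D asset to 2D quality models, the Python sketch below rotates a point cloud about its vertical axis and produces orthographic depth maps with a simple z-buffer. The function names, the number of views, the image size, and the orthographic camera are assumptions chosen for brevity; real pipelines typically use full renderers with color, shading, and perspective.

```python
import numpy as np

def rotation_y(angle):
    """Rotation matrix about the vertical (y) axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def render_depth_views(points, n_views=4, size=64):
    """Orthographic depth maps from viewpoints evenly spaced around the object.

    points: (N, 3) array. Returns (n_views, size, size); empty pixels are np.inf.
    """
    pts = points - points.mean(axis=0)
    radius = np.linalg.norm(pts, axis=1).max()
    pts = pts / (radius if radius > 0 else 1.0)  # fit inside the unit ball

    views = np.full((n_views, size, size), np.inf)
    for v in range(n_views):
        rot = pts @ rotation_y(2.0 * np.pi * v / n_views).T
        cols = np.clip(np.round((rot[:, 0] + 1) / 2 * (size - 1)).astype(int), 0, size - 1)
        rows = np.clip(np.round((1 - rot[:, 1]) / 2 * (size - 1)).astype(int), 0, size - 1)
        # z-buffer: when several points hit one pixel, keep the nearest depth,
        # so occluded points do not contribute to that view.
        np.minimum.at(views[v], (rows, cols), rot[:, 2])
    return views

# Usage on a random synthetic cloud.
cloud = np.random.default_rng(0).normal(size=(500, 3))
print(render_depth_views(cloud).shape)
```

Each view reveals only the points that survive the z-buffer, which mirrors the observation-dependent exposure of perceptual evidence discussed in Section 4.1.1.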
4.3.1. Intrinsic 3D Modeling
Intrinsic approaches assess quality directly in geometry–attribute space, without explicitly rendering the content into 2D observations. Early FR metrics mainly focus on robust geometric comparison under irregular sampling and non-uniform density. PCQM [
22] combines point-to-surface distance estimation with quadric surface fitting and fuses geometry- and color-related components, thereby improving robustness to density variations while jointly measuring geometric and attribute degradation. MS-GraphSIM [
143] further models local neighborhoods as multi-scale graphs and measures structural and color-gradient consistency across scales. Likewise, Point2Dist [
144] compares local geometry and color neighborhoods through statistical point-to-distribution discrepancies, offering a more distribution-aware alternative to pointwise correspondence.
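For orientation, the pointwise-correspondence baseline that these FR metrics refine can be sketched as a symmetric nearest-neighbor geometric error. This is a generic illustration, not the formulation of PCQM, MS-GraphSIM, or Point2Dist; the function names, the brute-force search, and the max-based symmetrization are assumptions chosen for brevity.

```python
def _mean_nn_sq_dist(src, dst):
    """Mean squared distance from each point in src to its nearest point in dst."""
    total = 0.0
    for p in src:
        # Brute-force nearest neighbor; fine for a toy example, too slow for real clouds.
        total += min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in dst)
    return total / len(src)


def symmetric_p2p_error(reference, distorted):
    """Symmetric point-to-point error: the worse of the two directed errors."""
    return max(
        _mean_nn_sq_dist(reference, distorted),
        _mean_nn_sq_dist(distorted, reference),
    )


reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
distorted = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.0)]
error = symmetric_p2p_error(reference, distorted)
```

Such a measure is sensitive to sampling density and ignores color, which is the kind of limitation the surface-fitting, graph-based, and distribution-based comparisons above aim to reduce.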
In the NR setting, intrinsic modeling has also been explored through learned quality representations directly extracted from distorted point clouds. LS-PCQA [
3] not only introduced a large-scale point-cloud quality database but also proposed the ResSCNN backbone for learning quality-aware representations in native 3D space. Related work also investigated task-driven supervision, where VisionTasks-PCQA [
145] uses the performance degradation of auxiliary vision tasks as a proxy signal for perceptual quality, thereby injecting machine-perception sensitivity into intrinsic quality modeling.
Overall, intrinsic methods are naturally suited to characterizing geometric perturbations, irregular sampling, and joint geometry–attribute distortions in the native 3D domain. However, because they do not explicitly model how the distorted content is actually observed after rendering, their connection to perceptual visibility is often indirect.
4.3.2. Projection-Based and Hybrid Perceptual Modeling
Because human observers ultimately perceive 3D content through rendered appearances, projection-based modeling has become a dominant paradigm in both FR and NR 3DQA. SJTU-PCQA [
23] established an influential 3D-to-2D projection framework by rendering point clouds into multiple views and aggregating projection-level quality evidence, demonstrating that rendered observations provide an effective perceptual interface for objective quality prediction. Deep NR models such as PQA-Net [
24] continue this direction by extracting view-wise features from multi-view projections and learning a regression model for perceptual quality prediction.
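The multi-view idea can be illustrated with a minimal sketch that places virtual cameras around an object and pools per-view scores. The orbit layout, the default of six views, and mean pooling are assumptions made for illustration; they are not the settings of SJTU-PCQA or PQA-Net, and the per-view scores would come from whatever renderer and image-level predictor a given method uses.

```python
import math


def orbit_viewpoints(num_views=6, radius=2.0, elevation_deg=0.0):
    """Camera positions evenly spaced in azimuth on a circle around the origin."""
    elev = math.radians(elevation_deg)
    views = []
    for k in range(num_views):
        az = 2.0 * math.pi * k / num_views
        x = radius * math.cos(elev) * math.cos(az)
        y = radius * math.cos(elev) * math.sin(az)
        z = radius * math.sin(elev)
        views.append((x, y, z))
    return views


def aggregate_view_scores(view_scores):
    """Pool per-view quality scores into one object-level score (simple mean)."""
    return sum(view_scores) / len(view_scores)


cameras = orbit_viewpoints(num_views=4, radius=2.0)
object_score = aggregate_view_scores([3.0, 4.0, 5.0])
```

Because the final score depends on where the cameras are placed and how views are pooled, changing either choice can change the predicted quality for the same 3D content.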
Subsequent work improved this projection-based pipeline from different angles. IT-PCQA [
146] transfers 2D IQA priors to point-cloud quality assessment through unsupervised domain adaptation on multi-perspective rendered views, revealing a practical bridge between image quality knowledge and 3D perceptual assessment. GMS-3DQA [
147] addresses the efficiency bottleneck of multi-view modeling by introducing grid mini-patch sampling over multiple projections and aggregating the sampled content into a compact quality map for Transformer-based feature extraction. MovingCam-PCQA [
148] further extends static multi-view observation into dynamic perceptual simulation by capturing point clouds as moving-camera videos and jointly modeling frame-level spatial evidence and clip-level temporal variation.
A more recent trend is to move from purely projection-based assessment toward hybrid perceptual modeling. MM-PCQA [
27] explicitly combines point-cloud sub-model features with projected image features and fuses them through cross-modal attention, showing that native 3D structure and rendered appearance provide complementary quality evidence. CoPA [
149] further improves NR point-cloud quality prediction through contrastive pretraining and semantic-guided multi-view fusion, highlighting the continued importance of data-efficient representation learning for 3D perceptual assessment. MPV-PCQA [
150] extends this idea by jointly modeling point clouds and captured dynamic video, integrating intrinsic geometric information with dynamic rendered observations in a unified NR framework.
Table 8 summarizes representative 3DQA approaches and their primary input modalities.
Beyond point clouds, related studies on other 3D content types also reinforce the importance of perceptual rendering while exposing content-specific factors. For textured meshes, Textured Mesh Quality Assessment (TMQA) [
139] introduced a large-scale subjective dataset together with a deep learning-based metric, highlighting the interaction between geometry simplification, texture degradation, and semantic content in perceived quality. For dynamic digital humans, DDH-QA [
140] established a dedicated quality assessment database covering both model-based and motion-based distortions, emphasizing that motion naturalness and animation artifacts are central to quality perception in human-centric 3D content.
In summary, projection-based methods better align objective modeling with the rendered perceptual interface actually seen by observers, but they are inherently shaped by viewpoint sampling and rendering design. Hybrid methods partially alleviate this limitation by reintroducing intrinsic geometric cues and therefore form an important bridge between pure rendering-based assessment and fully native 3D quality modeling.
4.4. Large Multimodal Model Approaches for 3DQA
LMM-based 3DQA explores how VLMs and MLLMs can introduce semantic grounding, language-conditioned reasoning, and explanation capability into 3DQA. Rather than forming two strictly separated categories, existing methods are better viewed as lying on a continuum: early approaches remain projection-primary and use rendered views as the main perceptual interface, while more recent models increasingly elevate native 3D representations to first-class inputs in multimodal reasoning.
At the projection-primary end, SJTU-H3D [
48] introduced a subjective quality assessment benchmark for textured mesh digital humans together with a zero-shot NR evaluator. Its metric combines CLIP-based semantic affinity, low-level distortion cues from rendered projections, and lightweight mesh-geometry descriptors, showing that pretrained vision–language priors can improve semantic plausibility assessment beyond handcrafted distortion statistics. This line of work established an important precursor for language-aware 3D quality analysis, although the core perceptual evidence still comes from 2D renderings. LMM-PCQA [
59] represents an early attempt to more explicitly integrate MLLMs into point-cloud quality assessment. It reformulates NR PCQA through text supervision by converting quality labels into natural-language descriptions, allowing the model to derive quality-related logits from rendered 2D projections. To alleviate the geometry loss introduced by rendering, it further incorporates multi-scale structural features extracted from the point cloud, yielding a hybrid projection-dominant yet structure-aware framework. BMPCQA [
151] can be regarded as an LMM-based multimodal PCQA framework rather than a purely conventional deep model. It integrates rendering-projection, normal-image, and point-cloud patch features, and feeds them into an LMM for joint feature fusion and quality prediction. Compared with earlier projection-based quality regressors, this line of work highlights how multimodal language supervision can turn 3D quality evaluation into a richer perceptual reasoning problem with multi-task outputs.
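The text-supervision step described for LMM-PCQA, in which numerical labels become natural-language descriptions, can be sketched as follows. The five level names, the equal-width binning of the MOS range, and the prompt wording are assumptions for illustration and are not taken from the cited papers.

```python
# Hypothetical rating vocabulary, ordered from worst to best.
LEVELS = ["bad", "poor", "fair", "good", "excellent"]


def mos_to_level(mos, mos_min, mos_max, levels=LEVELS):
    """Map a MOS value to a text level using equal-width bins over [mos_min, mos_max]."""
    span = (mos_max - mos_min) / len(levels)
    idx = int((mos - mos_min) / span)
    idx = min(max(idx, 0), len(levels) - 1)  # clamp so mos_max falls in the top bin
    return levels[idx]


def quality_prompt(level):
    """Build a hypothetical training sentence from a text level."""
    return f"The quality of this point cloud is {level}."


sentence = quality_prompt(mos_to_level(3.0, mos_min=1.0, mos_max=5.0))
```

Training on such sentences lets a language model express quality through token probabilities, at the cost of quantizing the continuous label into a few discrete levels.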
At the native 3D-inclusive end, PIT-QMM [
60] adopts an end-to-end point–image–text multimodal architecture for NR point-cloud quality assessment. By jointly consuming textual prompts, rendered image projections, and point-cloud inputs, it treats intrinsic 3D structure as a first-class modality rather than a lightweight auxiliary cue. This design enables the model to combine semantic context, projection-level appearance evidence, and native geometric structure within a unified multimodal pipeline, and also supports distortion localization and identification in addition to score prediction. More generally, large point-cloud language models such as PointLLM [
152] suggest a broader foundation for future 3DQA. PointLLM [
152] aligns point-cloud representations with LLM token spaces through large-scale pretraining and instruction tuning on point–text pairs. Although it was not originally designed for perceptual quality prediction, it provides a reusable 3D-language backbone that can potentially be adapted to quality labels, preference learning, or explanation-oriented supervision. From this perspective, PointLLM [
152] is not a direct 3DQA solution, but it offers an important architectural prior for future explanation-capable quality assessment systems.
Overall, current LMM-based 3DQA methods evolve from projection-primary perceptual scoring toward more unified multimodal reasoning over rendered appearance, intrinsic geometry, and language. This trend suggests that future 3DQA systems may move beyond score regression alone and increasingly support explanation, distortion diagnosis, and interactive quality analysis.
To facilitate a concise comparison between representative conventional and LMM-based 3DQA methods, Table 9 summarizes their performance on three common NR point-cloud benchmarks under aligned within-dataset evaluation. Compared with the benchmark evidence in IQA and VQA, the current empirical basis in 3DQA is noticeably narrower, as directly comparable results are still concentrated on static point-cloud NR-PCQA rather than evenly covering meshes, digital humans, or dynamic 3D content. Within this more constrained setting, however, recent LMM-based methods already show clear competitiveness. In particular, PIT-QMM [
60] achieves the best performance on all three benchmarks, while LMM-PCQA [
59] also demonstrates strong results on LS-PCQA and WPC.
At the same time, the table suggests that the benefit of LMMs in 3DQA is not merely inherited from 2D quality scoring. Strong conventional baselines such as MM-PCQA [
27] and CoPA+FT [
149] remain highly competitive, indicating that point-cloud-specific representation design, multi-view rendering, and geometric sensitivity are still central to 3D quality prediction. Therefore, the current evidence supports a cautious but meaningful conclusion: LMM-based approaches have become a promising direction for NR point-cloud quality assessment, yet it remains premature to generalize this advantage to the broader 3DQA landscape without more comparable results on textured meshes, human-centric assets, and dynamic 4D content.
4.5. Summary of 3DQA Progression
Overall, the development of 3DQA reflects a transition from geometry-oriented comparison and projection-based quality estimation to increasingly hybrid and multimodal perceptual modeling. Early research mainly focused on FR geometric or projection-based metrics, while later NR methods introduced sparse 3D representations, multi-view rendering, and cross-modal fusion to better capture perceptual quality in point clouds and related 3D content.
As shown by the benchmark comparison above, recent LMM-based methods already demonstrate clear competitiveness on common NR point-cloud quality assessment benchmarks. In particular, MLLM-based approaches begin to outperform strong conventional baselines on LS-PCQA, SJTU-PCQA, and WPC, suggesting that multimodal reasoning and richer cross-view integration are becoming increasingly relevant to 3D perceptual assessment. At the same time, current evidence remains concentrated on static point-cloud settings, and comparable benchmark results for textured meshes, digital humans, and dynamic 3D content are still limited.
Therefore, the progression reviewed in this section is best understood as an expansion from geometry- and rendering-centric regression toward multimodal and potentially explanation-capable 3D quality modeling. The strongest empirical support at present lies in NR point-cloud quality assessment, while the broader applicability of LMMs across the full 3DQA spectrum remains an open and important direction for future study.
Before discussing cross-modal trends, we briefly summarize commonly used evaluation metrics and benchmarking protocols shared across these quality assessment tasks.
4.6. Evaluation Metrics and Benchmarking Protocols
In addition to benchmark datasets, evaluation metrics and protocols play a critical role in assessing the performance of visual quality assessment methods across image, video, and 3D modalities. Most existing approaches are evaluated according to their consistency with human subjective judgments, which are typically represented by MOS or differential MOS (DMOS). Commonly used metrics quantify different aspects of prediction performance, including rank consistency, linear correlation, and absolute error.
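The Spearman rank-order correlation coefficient (SRCC) measures the monotonic relationship between predicted scores and subjective ratings:
$$\mathrm{SRCC} = 1 - \frac{6\sum_{i=1}^{N} d_i^{2}}{N\left(N^{2}-1\right)},$$
where $d_i$ denotes the rank difference between the predicted score and the corresponding ground-truth score for the $i$th sample, and $N$ is the total number of samples.
The Pearson linear correlation coefficient (PLCC) measures the linear correlation between predicted and subjective scores:
$$\mathrm{PLCC} = \frac{\sum_{i=1}^{N}\left(\hat{y}_i-\bar{\hat{y}}\right)\left(y_i-\bar{y}\right)}{\sqrt{\sum_{i=1}^{N}\left(\hat{y}_i-\bar{\hat{y}}\right)^{2}}\,\sqrt{\sum_{i=1}^{N}\left(y_i-\bar{y}\right)^{2}}},$$
where $\hat{y}_i$ and $y_i$ denote the predicted and ground-truth scores, respectively, and $\bar{\hat{y}}$ and $\bar{y}$ are their sample means.
The Kendall rank correlation coefficient (KRCC) evaluates ranking consistency based on concordant and discordant pairs:
$$\mathrm{KRCC} = \frac{N_c - N_d}{\frac{1}{2}N\left(N-1\right)},$$
where $N_c$ and $N_d$ are the numbers of concordant and discordant pairs, respectively.
The root mean square error (RMSE) reflects the average magnitude of prediction error:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i-y_i\right)^{2}}.$$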
Among these metrics, the SRCC and KRCC mainly reflect ranking consistency, while the PLCC and RMSE emphasize score fidelity after regression or score alignment. In practice, the SRCC and PLCC are the most commonly reported criteria in IQA, VQA, and 3DQA benchmarks, whereas the KRCC and RMSE are often provided as complementary indicators.
Beyond metric selection, benchmarking protocols also substantially affect performance comparison. Common evaluation settings include random train–test splits within a dataset, cross-dataset testing, and leave-one-dataset-out validation for assessing generalization ability. For VQA, the temporal sampling strategy, clip duration, and sequence-level pooling can influence the final results. For 3DQA, viewpoint sampling, rendering settings, and multi-view aggregation protocols may also lead to different performance estimates.
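Two of these settings, a random within-dataset split and leave-one-dataset-out validation, can be sketched as below. The function names, the 80/20 ratio, and the fixed seed are assumptions for illustration; published protocols differ in ratio, number of repetitions, and whether samples sharing the same reference content are kept on one side of the split.

```python
import random


def random_split(n_samples, train_ratio=0.8, seed=0):
    """Shuffle sample indices with a fixed seed and cut them into train and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_ratio * n_samples))
    return indices[:cut], indices[cut:]


def leave_one_dataset_out(dataset_names):
    """Yield (training datasets, held-out dataset) pairs, one per dataset."""
    for held_out in dataset_names:
        train = [name for name in dataset_names if name != held_out]
        yield train, held_out


train_idx, test_idx = random_split(10, train_ratio=0.8, seed=1)
folds = list(leave_one_dataset_out(["LS-PCQA", "SJTU-PCQA", "WPC"]))
```

Reporting the seed and the split ratio alongside the results makes within-dataset numbers easier to reproduce and compare.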
The emergence of LMM-based methods further complicates evaluation. Recent approaches may generate textual explanations, pairwise preferences, or multi-dimensional attribute predictions instead of directly outputting scalar scores. As a result, future benchmarking should go beyond conventional correlation-based metrics and also consider explanation faithfulness, prediction consistency, robustness across prompts or datasets, and generalization across modalities.
5. Cross-Modal Trends and Unified Perspectives
Across image, video, and 3D modalities, visual quality assessment is undergoing a structural transformation that extends beyond increasing data dimensionality. More importantly, recent studies suggest that this transformation is not merely architectural but also conceptual: quality assessment is progressively shifting from distortion-centric regression toward semantically grounded, explainable, and context-conditioned evaluation.
Importantly, the motivation for adapting LMMs should not be reduced to using larger models for marginal benchmark gains alone. As shown by Table 3, Table 6, and Table 9, conventional and deep quality predictors remain highly competitive, especially on mature in-domain benchmarks. Nevertheless, many emerging scenarios (including AI-generated content, instruction-conditioned evaluation, comparative judgment, and explainable assessment) require capabilities that are difficult to express within a single scalar-regression formulation. In this sense, VLMs and MLLMs are better interpreted as extending the scope of quality assessment rather than uniformly replacing established predictors.
This broader shift is also reflected in the recent publication landscape. As shown in Figure 2, the share of large-/foundation-model-based methods rises steadily across all three domains but at clearly different rates. VQA exhibits the fastest growth, increasing from 0% in 2021–2022 to 6.1% in 2023, 16.0% in 2024, and 24.5% in 2025. IQA follows a more moderate trajectory, reaching 2.4%, 13.9%, and 16.4% over 2023–2025, while 3DQA starts later and remains smaller in scale, rising from 0% in 2021–2023 to 2.9% in 2024 and 11.6% in 2025. This asymmetry is broadly consistent with the benchmark evidence in this survey: in IQA, LMM-based methods mainly broaden the methodological space without yet showing uniformly dominant in-domain gains; in VQA, native LMM-based methods already exhibit clearer competitiveness on common UGC benchmarks; and in 3DQA, recent results are promising but still concentrated on a narrower NR point-cloud setting. The publication trend therefore suggests that the adoption of LMMs is task-driven rather than uniform, and appears strongest where long-context modeling, semantic reasoning, or multimodal fusion is most central to perceptual judgment.
First, the three modalities differ in how perceptual evidence is observed and integrated. IQA is based on fixed spatial observation, where judgments are formed from a single image. VQA introduces temporal accumulation, requiring models to account for motion consistency, temporal masking, and perceptual memory. 3DQA further depends on viewpoint-conditioned and rendering-mediated observation, where visibility, occlusion, shading, and projection strategy all affect perceived quality. These differences indicate that perceptual evidence becomes progressively more structured and condition-dependent across modalities.
Second, despite these modality-specific differences, a shared methodological progression can be observed. Conventional approaches in IQA, VQA, and 3DQA mostly formulate quality prediction as supervised regression toward MOS/DMOS, emphasizing distortion characterization and statistical correlation with human ratings. Recent LMM-based methods, however, increasingly move beyond direct score prediction. In IQA, methods such as DepictQA [
52], DeQA-Score [
91], and Grounding-IQA [
54] demonstrate that quality evaluation can be coupled with descriptive reasoning, score distribution modeling, and region-aware grounding. In the video domain, CP-LLM [
58],
[
123], and VQAThinker [
122] further extend quality assessment toward description generation, question answering, and reasoning-aided interpretation. For 3D content, LMM-PCQA [
59] and PIT-QMM [
60] show that projection-based evidence and intrinsic 3D structure can be jointly incorporated into multimodal quality reasoning. These studies collectively suggest that quality assessment is evolving from black-box regression into a more explainable perceptual inference process.
Third, perceptual quality itself is increasingly defined by semantic plausibility and structural coherence rather than only low-level artifact visibility. This trend is especially evident in AI-generated and enhancement-oriented scenarios. Datasets such as AIGIQA-20K [
29], PKU-AIGIQA-4K [
74], and HVEval [
35] reveal that human judgments often depend on realism, alignment, and consistency beyond conventional distortions. In human-centered settings, AGHI-QA [
76] and FVQ [
124] further highlight identity preservation, motion plausibility, and structural consistency as essential perceptual factors. These developments indicate that future quality assessment systems must integrate distortion sensitivity with semantic understanding and high-level structural reasoning.
Fourth, recent work also points to a gradual transition from modality-specific predictors toward unified multimodal quality assessment models. For instance, Q-Align [
129] demonstrates that a single LMM can be trained to handle IQA, image aesthetic assessment (IAA), and VQA within a unified framework by leveraging text-defined rating levels. Q-Align also shows that jointly learning across IQA and VQA datasets can improve both accuracy and generalization, especially under mixed-data scenarios. Together with the publication-trend analysis in Figure 2, these findings suggest that quality assessment is increasingly moving toward a unified multimodal reasoning paradigm, although the pace of this transition still differs across image, video, and 3D domains.
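One way a model trained on text-defined rating levels can still report a scalar score is to take a probability-weighted average over the level tokens. The sketch below assumes five levels with weights from 5 to 1 and a softmax over the corresponding logits; the level names, the weights, and the function name are illustrative assumptions rather than a confirmed description of any particular method.

```python
import math

# Hypothetical numeric weight attached to each rating word.
LEVEL_WEIGHTS = {"excellent": 5.0, "good": 4.0, "fair": 3.0, "poor": 2.0, "bad": 1.0}


def score_from_level_logits(logits):
    """Softmax over the level logits, then the expected weight under that distribution."""
    peak = max(logits.values())
    exps = {name: math.exp(value - peak) for name, value in logits.items()}  # stable softmax
    total = sum(exps.values())
    return sum(LEVEL_WEIGHTS[name] * e / total for name, e in exps.items())


uniform = score_from_level_logits({name: 0.0 for name in LEVEL_WEIGHTS})
confident = score_from_level_logits(
    {"excellent": 10.0, "good": 0.0, "fair": 0.0, "poor": 0.0, "bad": 0.0}
)
```

Because the same rating vocabulary can be applied to images, videos, and other content, a mapping of this kind lets one model be compared against MOS labels from different datasets with the usual correlation metrics.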
6. Conclusions and Future Outlook
Visual quality assessment is undergoing a fundamental transformation driven by both the evolution of visual sensing modalities and the emergence of LMMs. From spatial observation in images to temporal integration in videos and viewpoint-conditioned perception in 3D content, each modality introduces new challenges in modeling perceptual evidence. Meanwhile, the field is shifting from distortion-centric regression toward semantically grounded and reasoning-capable evaluation paradigms.
Taken together, the benchmark comparisons and publication-trend analysis in this survey suggest that the rationale for adapting LMMs is not uniform across modalities. In IQA, where standard in-domain benchmarks are already highly mature, the present value of LMM-based methods lies more in promptable, explainable, and semantically grounded assessment than in uniformly superior correlation scores. In VQA, the joint modeling of spatial distortion, motion dynamics, and temporal coherence makes LMM-based approaches more directly competitive, which is also reflected in their faster recent growth. In 3DQA, early evidence on NR point-cloud quality assessment is promising, but the empirical basis remains narrower than in image and video settings. Therefore, the adoption of LMMs should be understood as a task-dependent methodological transition rather than a universal replacement of conventional quality predictors.
This survey presents a unified perspective on quality assessment across image, video, and 3D modalities. By reviewing datasets, conventional approaches, and recent LMM-based methods, we show that quality assessment is evolving from a static signal-to-score mapping to a multimodal perceptual reasoning process, where predictions can be conditioned on textual criteria and supported by explanatory evidence.
Building upon these observations, we outline several key research directions:
- (1) Unified cross-modal perceptual representation. Learning representations that generalize across image, video, and 3D modalities remains a fundamental challenge. Future work should jointly model spatial structure, temporal dynamics, and viewpoint-dependent perception within a unified framework.
- (2) Reliable and interpretable score calibration and benchmarking. LMM-based methods often produce textual, comparative, or multi-attribute outputs instead of scalar scores. Establishing consistent and interpretable mappings to quantitative quality measures, while also designing fair benchmarking protocols across prompts, datasets, and modalities, is essential for rigorous evaluation and practical deployment.
- (3) Long-context modeling for VQA. Video quality depends on long temporal dependencies involving motion consistency and temporal evolution. Although recent methods improve contextual reasoning, efficient and scalable long-context modeling remains largely underexplored.
- (4) Explainable quality assessment. LMM-based approaches enable textual explanations alongside quality predictions. However, ensuring the faithfulness and reliability of such explanations remains challenging, especially in aligning them with perceptual evidence.
- (5) Grounded quality assessment with spatial and structural alignment. Grounding quality judgments in spatial regions, temporal segments, or geometric structures is still underexplored. Improving localization accuracy and cross-modal consistency is essential for trustworthy assessment.
Overall, visual quality assessment is converging toward a unified multimodal paradigm that integrates distortion sensitivity, semantic understanding, temporal reasoning, and viewpoint-aware perception. Advancing this paradigm requires not only larger models but also principled evaluation frameworks and deeper integration of perceptual theory.
In this sense, future visual quality assessment systems may evolve from isolated evaluators into general multimodal perceptual intelligence systems.