Article

Point Cloud Quality Assessment via Complexity-Driven Patch Sampling and Attention-Enhanced Swin-Transformer

Key Laboratory of Industrial Vision and Industrial Intelligence, Zhejiang Wanli University, Ningbo 315100, China
* Author to whom correspondence should be addressed.
Information 2026, 17(1), 93; https://doi.org/10.3390/info17010093
Submission received: 12 December 2025 / Revised: 6 January 2026 / Accepted: 13 January 2026 / Published: 15 January 2026

Abstract

As an emerging immersive media format, the point cloud (PC) inevitably suffers from distortions such as compression artifacts and noise, where even local degradations may severely impair perceived visual quality and user experience. Accurately evaluating the perceived quality of PCs is therefore essential. In this paper, a no-reference point cloud quality assessment (PCQA) method based on complexity-driven patch sampling and an attention-enhanced Swin-Transformer is proposed. Given that projected PC maps effectively capture distortions and that the quality-related information density varies significantly across local patches, a complexity-driven patch sampling strategy is proposed: by quantifying patch complexity, regions with higher information density are preferentially sampled to enhance subsequent quality-sensitive feature representation. Given that indistinguishable response strengths between key and redundant channels during feature extraction may dilute effective features, an Attention-Enhanced Swin-Transformer is proposed to adaptively reweight critical channels, thereby improving feature extraction. Given that traditional regression heads typically use a single-layer linear mapping, which overlooks the heterogeneous importance of information across channels, a gated regression head is designed to enable adaptive fusion of global and statistical features via a statistics-guided gating mechanism. Experiments on the SJTU-PCQA and WPC datasets demonstrate that the proposed method consistently outperforms representative PCQA methods.

1. Introduction

As an unstructured data representation composed of discrete 3D coordinate points, the 3D point cloud (PC) can accurately characterize the geometric structure, depth relationships, and surface attributes of a scene. It has therefore become a fundamental medium for constructing digital representations of real-world environments [1]. Compared with 2D images, PCs provide more complete shape and spatial topology information in 3D space and have been widely adopted in demanding applications such as autonomous driving, robotic navigation, virtual/augmented reality, and intelligent manufacturing [2]. With the continuous improvement in sensing resolution and the growing scale of 3D perception applications, the number of points in a high-resolution PC often reaches millions or even hundreds of millions, imposing significant pressure on storage, bandwidth, and real-time computation. In recent years, to alleviate resource consumption in large-scale PC applications, both academia and industry have made substantial progress in PC compression, coding, and efficient processing, making them key components of the 3D data processing pipeline [3]. By removing redundancy and optimizing coding, PC compression can significantly reduce data volume and meet the requirements of resource-constrained scenarios such as mobile VR/AR devices and cooperative vehicle–infrastructure communication. However, compression inevitably introduces distortions, manifested as geometric shifts, loss of fine details, or attribute deviations, which directly degrade subsequent PC applications.
Existing no-reference point cloud quality assessment (PCQA) methods can be broadly categorized into point-based and projection-based methods. Point-based methods operate directly in the 3D domain to model local structure, topological variations, and statistical properties, thereby estimating the degree of geometric distortion. For example, 3D-NSS [4] exploits natural scene statistics of PC to model distortion-induced shifts under different degradation types. GraphSIM [5] builds a structure-aware graph similarity metric to characterize local topological degradation. SGR-PCQA [6], also belonging to the point-based category, learns to regress quality from raw PC features using deep neural networks. Although these methods exhibit advantages in physical consistency and geometric modeling, their robustness and fine-grained modeling capacity are still limited in sparse regions, under unstable neighborhood definitions, and in the presence of varying point densities.
Projection-based PCQA methods project 3D PC into 2D images from multiple viewpoints, leveraging mature image quality assessment techniques while preserving geometric and appearance cues. PQA-Net [7] and IT-PCQA [8] follow this paradigm by integrating multi-view consistency or image-transform-enhanced representations to improve quality modeling. PAME [9] employs a self-supervised masked autoencoder to learn cross-view representations from multi-view projections. LP-PCQM [10] adopts a layered projection strategy to fuse geometry and color information. Simple Baselines [11] provide a unified projection-and-regression framework for both full-reference (FR) and no-reference (NR) PCQA. Plain-PCQA [12] jointly models geometric and visual planes, whereas PQSM [13] utilizes 3D saliency to detect structurally degraded regions. These projection-based approaches advance PCQA performance through cross-modal fusion, cross-view consistency modeling and fine-detail preservation.
Despite these advances, existing projection-based methods still face several limitations. Most adopt fixed or uniform sampling for viewpoints and patches, without accounting for regional complexity or information density, limiting the extraction of quality-sensitive features in high-complexity areas. Feature extraction networks often lack fine-grained channel-wise selection, making it challenging to emphasize channels associated with geometric or textural degradation. These issues affect both FR and NR PCQA, though they are especially critical in NR scenarios where reference information is unavailable.
To address the above problems, a no-reference PCQA method via complexity-driven patch sampling and attention-enhanced Swin-Transformer is proposed to accurately measure the perceived quality of PC. Experiments on the SJTU-PCQA and WPC datasets validate the effectiveness of the proposed method. The main contributions can be summarized as follows:
  • Considering that the quality-related information density varies significantly across local patches in PC projection maps and that high-complexity patches contain richer distortion-sensitive cues, a complexity-driven patch sampling strategy is designed to preferentially select high-information-density patches and thereby enhance subsequent distortion-sensitive feature representation.
  • Considering that the indistinct response strengths among critical and redundant channels may weaken the representation capability during feature extraction, an Attention-Enhanced Swin-Transformer is proposed to adaptively highlight informative channels and thereby enhance feature extraction performance.
  • To address the inherent limitations of traditional linear regression heads in handling diverse distortion patterns, a gated regression head is constructed, which integrates global semantic features with channel statistical descriptors. Through a statistics-driven gating mechanism using channel-wise mean and standard deviation, the proposed model can adaptively balance the contributions of the two feature types, thereby improving prediction robustness and generalization.

2. Motivation

Existing projection-based PCQA methods typically construct a Quality Mapping Module (QMM) by randomly selecting grid patches from multiple projection views. However, the quality information density varies significantly across patches. High-complexity patches that cover object boundaries, texture details, or visibly distorted regions tend to contain richer quality-sensitive information, exhibiting higher gradient strength and entropy. In contrast, low-complexity patches dominated by background or smooth surfaces are highly redundant and contribute little to quality assessment. Random patch selection cannot distinguish between these cases, which may result in discarding high-complexity patches while low-complexity patches occupy the limited QMM space, thereby interfering with downstream feature extractors’ ability to capture critical distortion cues and limiting further improvements in assessment accuracy. Here, distortion cues refer to a set of feature indicators that reflect the distortion state of PC data, including geometric, textural, and topological anomalies, which serve as the basis for identifying the type and degree of PC distortion. Therefore, by analyzing patch complexity, local patches containing rich quality-sensitive information can be preferentially selected from PC projection maps to enhance subsequent feature representations.
In the SJTU-PCQA dataset, PC distortions mainly include geometric distortions, texture distortions, and topological distortions, as shown in Figure 1. Each PC sample is labeled with a perceptual quality score (MOS) and categorized into multiple distortion levels, ranging from mild to severe degradation, providing reliable reference standards for model training and evaluation.
During feature extraction, existing methods typically treat all channels equally. However, the physical meaning and information value of different channels vary significantly: some channels are sensitive to geometric distortions, others emphasize texture distortions, and a few may primarily encode projection noise. Without differentiating between key and redundant channels, effective features are diluted, and quality-critical distortion cues cannot be sufficiently highlighted, ultimately reducing the accuracy of subsequent quality regression. Therefore, an Attention-Enhanced Swin-Transformer is adopted, which assigns adaptive weights to channels during distortion feature extraction, strengthening informative features while suppressing noisy channels.
In the quality regression stage, conventional regression heads typically use a single linear mapping to project high-dimensional features onto a scalar quality score, ignoring the varying importance of different channels and failing to emphasize features critical to perceptual quality. The lack of nonlinear modeling and adaptive feature fusion limits the model’s ability to respond to complex distortions, potentially weakening structural and texture-related salient features. Single-layer mappings also fail to fully exploit channel statistics, restricting the expressive power of high-dimensional features. To address these issues, a gated regression head is designed, in which a statistics-driven gating mechanism adaptively fuses global semantic and channel-wise statistical features, enhancing prediction robustness and generalization.

3. Proposed Method

In this section, the proposed no-reference PCQA method based on complexity-driven patch sampling and an attention-enhanced Swin-Transformer is described in detail. As illustrated in Figure 2, the method first performs multi-view projection on the distorted PC, followed by complexity-driven patch sampling to select the 49 most informative patches (adopting the patch budget of [14]). These patches are then fed into the proposed Attention-Enhanced Swin-Transformer to extract discriminative features. Finally, a statistics-guided gated regression head generates the final quality score, enabling accurate PCQA.

3.1. Complexity-Driven Patch Sampling

To extract quality-sensitive local patches from PC projection maps while reducing the influence of highly redundant regions on quality assessment, a complexity-driven patch sampling strategy is designed. First, the distorted PC is projected from multiple viewpoints to generate the corresponding projection maps. Each projection map is divided into non-overlapping 32 × 32 pixel patches, and each patch is converted to its grayscale representation.
Let g(i, j) denote the grayscale value of the pixel at location (i, j) in a patch, b_t the background threshold, and D the effective pixel density of the patch. Then D is defined as:

D = \frac{1}{32 \times 32} \sum_{i=1}^{32} \sum_{j=1}^{32} \mathbb{I}\left( g(i,j) > b_t \right)

where b_t = 0.05 and \mathbb{I}(\cdot) is the indicator function; when D < 0.1, the patch complexity is set to zero.
Let G_x(i, j) and G_y(i, j) denote the horizontal and vertical gradient responses at pixel (i, j), computed using Sobel operators, and let H and W denote the height and width of the patch in pixels. The average gradient magnitude of the patch, G, is defined as follows:

G = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \sqrt{ G_x(i,j)^2 + G_y(i,j)^2 }
Let p_k denote the normalized histogram value of the k-th grayscale level in the patch. The grayscale entropy of the patch, E, is then expressed as follows:

E = -\sum_{k=1}^{K} p_k \log\left( p_k + 10^{-8} \right)

where K is the number of discrete grayscale levels and the constant 10^{-8} avoids taking the logarithm of zero.
Let C denote the overall complexity score of a patch, computed via weighted fusion of the previously defined metrics G, E, and D:

C = 0.4G + 0.4E + 0.2D

The patches from each view are ranked in descending order of C, and the 49 most complex patches are selected and fed into the subsequent network for distortion-aware feature extraction.
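To make the sampling procedure concrete, the following is a minimal NumPy/SciPy sketch of the complexity scoring and top-49 selection described above. It assumes grayscale projection maps normalized to [0, 1]; the histogram bin count (256 levels) and the use of scipy.ndimage.sobel are implementation assumptions not specified in the paper.

```python
import numpy as np
from scipy import ndimage

def patch_complexity(patch, bt=0.05, density_floor=0.1, k=256):
    """Complexity score C = 0.4*G + 0.4*E + 0.2*D for one grayscale patch in [0, 1]."""
    # Effective pixel density D: fraction of pixels above the background threshold.
    d = np.mean(patch > bt)
    if d < density_floor:                # near-empty patch: complexity forced to zero
        return 0.0
    # Average Sobel gradient magnitude G.
    gx = ndimage.sobel(patch, axis=1)
    gy = ndimage.sobel(patch, axis=0)
    g = np.mean(np.sqrt(gx ** 2 + gy ** 2))
    # Grayscale entropy E over k histogram bins (1e-8 guards against log(0)).
    hist, _ = np.histogram(patch, bins=k, range=(0.0, 1.0))
    p = hist / hist.sum()
    e = -np.sum(p * np.log(p + 1e-8))
    return 0.4 * g + 0.4 * e + 0.2 * d

def sample_patches(proj_map, patch=32, top_k=49):
    """Split a grayscale projection map into 32x32 patches and keep the top_k most complex."""
    h, w = proj_map.shape
    patches = [proj_map[i:i + patch, j:j + patch]
               for i in range(0, h - patch + 1, patch)
               for j in range(0, w - patch + 1, patch)]
    scores = [patch_complexity(p) for p in patches]
    order = np.argsort(scores)[::-1][:top_k]   # descending complexity
    return [patches[idx] for idx in order]
```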

3.2. Attention-Enhanced Swin-Transformer

To better extract features, particularly to enhance the model’s ability to discriminate the importance of channel information, an Attention-Enhanced Swin-Transformer is proposed, as shown in Figure 3. By quantifying the information value of each channel, the module strengthens key channels while suppressing redundant ones. The Attention-Enhanced Swin-Transformer is specifically designed to adapt to the uneven information density and structural characteristics of PC projection maps, assigning differentiated weights to channels to better capture distortion-sensitive features.
Specifically, the channel-attention-based feature enhancement module takes the final feature sequence output by the backbone network as input and performs global average pooling on it. Let X_avg denote the pooled features, X_i the feature of the i-th patch, and N the number of patches. The global average pooling is defined as follows:

X_{\mathrm{avg}} = \frac{1}{N} \sum_{i=1}^{N} X_i
The proposed module adopts a lightweight single-branch structure. The pooled descriptor X_avg is processed by an attention-weight generation module composed of two fully connected (FC) layers. Let W_0 and W_1 denote the weight matrices of the first and second FC layers, R(\cdot) the ReLU activation function, and Z_1^avg and Z_2^avg the intermediate and final vectors of the attention branch. The corresponding computations are as follows:

Z_1^{\mathrm{avg}} = R\left( W_0 X_{\mathrm{avg}} \right), \quad Z_2^{\mathrm{avg}} = W_1 Z_1^{\mathrm{avg}}
Let \sigma(\cdot) denote the Sigmoid activation function, which maps values to the range (0, 1). The channel attention weight vector W_att is therefore obtained as follows:

W_{\mathrm{att}} = \sigma\left( Z_2^{\mathrm{avg}} \right)
Finally, the original feature X_i is weighted by the attention vector W_att via channel-wise multiplication:

X_{\mathrm{att},i} = X_i \odot W_{\mathrm{att}}

where \odot denotes element-wise multiplication. Through this adaptive weighting mechanism, the Attention-Enhanced Swin-Transformer highlights channel responses that are strongly associated with PC structural quality while suppressing redundant or noisy information, thereby improving the model's sensitivity to compression distortion and structural degradation.
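A minimal PyTorch sketch of this channel attention module is given below, assuming token features of shape (B, N, C) from the Swin backbone. The bottleneck reduction ratio of 16 is an assumption borrowed from common squeeze-and-excitation designs; the paper does not report it.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention over a Swin-Transformer feature sequence of shape (B, N, C)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc0 = nn.Linear(channels, channels // reduction)  # W_0
        self.fc1 = nn.Linear(channels // reduction, channels)  # W_1

    def forward(self, x):                    # x: (B, N, C), N = number of patch tokens
        x_avg = x.mean(dim=1)                # global average pooling over tokens -> (B, C)
        z1 = torch.relu(self.fc0(x_avg))     # Z_1 = R(W_0 X_avg)
        w_att = torch.sigmoid(self.fc1(z1))  # W_att = sigmoid(W_1 Z_1), values in (0, 1)
        return x * w_att.unsqueeze(1)        # channel-wise reweighting of every token
```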

3.3. Gated Regression Head

To overcome the inherent limitations of traditional linear regression heads in handling diverse distortion patterns, a gated regression head is constructed, in which global semantic features are integrated with channel statistical descriptors. Specifically, when the input retains the spatial dimension M, global average pooling is applied to obtain an overall feature descriptor. Let f_global denote the pooled global feature:

f_{\mathrm{global}} = \frac{1}{M} \sum_{i=1}^{M} F_i

where F_i is the feature vector at the i-th spatial position.
Let \mu denote the channel-wise mean of f_global and \sigma its channel-wise standard deviation, where L is the number of feature channels:

\mu = \frac{1}{L} \sum_{l=1}^{L} f_{\mathrm{global}}(l), \quad \sigma = \sqrt{ \frac{1}{L} \sum_{l=1}^{L} \left( f_{\mathrm{global}}(l) - \mu \right)^2 }
Let the statistical feature vector be t = [\mu, \sigma], which reflects both the central tendency and the variation amplitude of the channel-wise feature distribution. The channel-wise mean and standard deviation are adopted to summarize the overall distribution tendency of the global semantic features, providing a compact and interpretable statistical descriptor. This choice not only captures the central tendency and variability of feature responses but also avoids the potential instability and additional computational complexity introduced by higher-order statistics, thereby achieving a favorable balance between robustness and efficiency.
Let W_{t1} and W_{t2} denote the weight matrices of the two-layer statistical branch, \phi(\cdot) the ReLU activation function, and y_t the statistical-branch prediction. Then y_t is expressed as follows:

y_t = W_{t2}\, \phi\left( W_{t1} t \right)
Let W_{n1} and W_{n2} denote the weight matrices of the two-layer gating network, which generates the fusion weight \alpha_gate. The gating operation is defined as follows:

\alpha_{\mathrm{gate}} = \sigma\left( W_{n2}\, \phi\left( W_{n1} t \right) \right)

where \phi(\cdot) denotes the ReLU activation function, \sigma(\cdot) is the sigmoid function, which produces a normalized scalar weight, and t is the aggregated channel descriptor.
Finally, let Q denote the final predicted quality score, obtained via the statistics-guided gated fusion mechanism:

Q = \left( 1 - \alpha_{\mathrm{gate}} \right) \cdot y_t

where y_t denotes the output of the statistical branch and \alpha_gate is the gating weight generated from the statistical descriptor t = [\mu, \sigma]. In this process, the global semantic feature extracted by the backbone is first transformed into a compact statistical descriptor through global average pooling and channel-wise statistical operations. Rather than being incorporated as an explicit regression term, the global semantic information influences the final quality prediction by guiding the gating mechanism: the adaptive gating weight \alpha_gate dynamically modulates the contribution of the statistical-branch output y_t, enabling the model to adjust its response according to the overall semantic distribution and distortion characteristics of the input PC. This design allows global semantic cues to participate in the quality prediction in an implicit yet effective manner, enhancing robustness while avoiding redundant modeling and preserving a concise, interpretable regression formulation.
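The following PyTorch sketch illustrates one possible realization of the gated regression head under the equations above. The hidden width of the two-layer branches (128) is an assumption, as the paper does not report it.

```python
import torch
import torch.nn as nn

class GatedRegressionHead(nn.Module):
    """Statistics-guided gated regression head predicting a scalar quality score Q."""
    def __init__(self, channels, hidden=128):
        super().__init__()
        # Statistical branch: y_t = W_t2 * ReLU(W_t1 * t), with t = [mu, sigma].
        self.stat_branch = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        # Gating network: alpha = sigmoid(W_n2 * ReLU(W_n1 * t)).
        self.gate = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, feats):                 # feats: (B, M, C) token features
        f_global = feats.mean(dim=1)          # GAP over the spatial dimension M -> (B, C)
        mu = f_global.mean(dim=1, keepdim=True)                  # channel-wise mean
        sigma = f_global.std(dim=1, keepdim=True, unbiased=False)  # 1/L normalization, as in the paper
        t = torch.cat([mu, sigma], dim=1)     # statistical descriptor t = [mu, sigma]
        y_t = self.stat_branch(t)             # statistical-branch prediction
        alpha = self.gate(t)                  # gating weight in (0, 1)
        return (1.0 - alpha) * y_t            # Q = (1 - alpha_gate) * y_t
```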

4. Experiments

This section introduces the PC dataset used in the experiments, as well as the implementation details and training settings of the proposed method. Then, experimental results are analyzed in detail and compared with mainstream PCQA methods. Finally, ablation studies and complexity analyses are conducted.

4.1. PC Dataset

SJTU-PCQA dataset [15]: As shown in Figure 4 and Table 1, this dataset contains 9 reference PCs. Each reference PC is subjected to 7 types of distortion (compression, color noise, geometric noise, downsampling, and three combined distortions), and each distortion type has 6 distortion levels, resulting in a total of 378 distorted PCs.
WPC dataset [16]: This dataset contains 20 reference PCs. For each reference PC, 37 distorted versions are generated by simulating five different types of distortions, resulting in a total of 740 distorted PCs.

4.2. Experimental Settings

The experimental environment includes an Intel Xeon Gold 5218R CPU (2.10 GHz), 128 GB RAM, and an NVIDIA GeForce RTX 3090 GPU. The implementation is based on Python 3.10.12, PyTorch 1.1.0, and PyCharm 2024.3.
Six-direction projection maps of PC are generated using the Open3D library (version 0.19.0), where the selection of the six viewing angles is fixed (i.e., along the positive/negative X, Y, and Z axes of the 3D Cartesian coordinate system). The backbone adopts the Swin-Transformer initialized with ImageNet-22K pre-trained weights. The Adam optimizer is used with an initial learning rate of 1 × 10⁻⁴, decayed exponentially with a rate of 0.9 per epoch. The batch size is 32, and the total number of training epochs is 50.
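As an illustration of the projection step, the sketch below captures six axis-aligned views with Open3D's legacy visualizer. The rendering resolution, off-screen window, and up-vector choices are assumptions; the paper's actual rendering parameters are not specified.

```python
import numpy as np
import open3d as o3d

def render_six_views(ply_path, width=1024, height=1024):
    """Capture projections of a point cloud along the +/-X, +/-Y, and +/-Z axes."""
    pcd = o3d.io.read_point_cloud(ply_path)
    fronts = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
              (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    images = []
    vis = o3d.visualization.Visualizer()
    vis.create_window(width=width, height=height, visible=False)
    vis.add_geometry(pcd)
    ctr = vis.get_view_control()
    for front in fronts:
        # The up vector must not be parallel to the viewing direction.
        up = (0, 0, 1) if front[2] == 0 else (0, 1, 0)
        ctr.set_front(front)
        ctr.set_lookat(pcd.get_center())
        ctr.set_up(up)
        vis.poll_events()
        vis.update_renderer()
        img = np.asarray(vis.capture_screen_float_buffer(do_render=True))
        images.append(img)
    vis.destroy_window()
    return images
```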
To ensure reliability and reproducibility, a 9-fold cross-validation strategy is adopted following the protocol of the SJTU-PCQA database (9 groups of samples). In each fold, 8 groups are used for training and the remaining 1 for testing. This process is repeated 9 times, and the averaged results across all folds are reported. All training and test sets are strictly disjoint to ensure fairness in evaluation.
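A minimal sketch of this leave-one-content-out protocol is shown below, assuming each distorted PC carries the group id (0–8) of its reference; per-fold metrics are accumulated and averaged over the nine folds.

```python
import numpy as np

def nine_fold_splits(sample_groups):
    """Leave-one-reference-out splits over the 9 SJTU-PCQA content groups.

    sample_groups: array of group ids (0..8), one per distorted PC.
    Yields (train_idx, test_idx) with content-disjoint train/test sets.
    """
    sample_groups = np.asarray(sample_groups)
    for g in range(9):
        test_idx = np.where(sample_groups == g)[0]
        train_idx = np.where(sample_groups != g)[0]
        yield train_idx, test_idx
```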

4.3. Results and Analysis

To evaluate the effectiveness of the proposed method, experiments are conducted on the SJTU-PCQA dataset and the WPC dataset. We compare the proposed model with several representative PCQA methods, including PQA-Net [7], IT-PCQA [8], VQA-PC [17], GMS-3DQA [14], MM-PCQA [18], 3D-NSS [4], VPI-PCQA [19], and GC-PCQA [20]. The objective is to evaluate the prediction accuracy, robustness, and stability of the model under various distortion conditions.
As shown in Table 2 and Table 3, the proposed method achieves clear advantages over multiple mainstream PCQA models on SJTU-PCQA. Specifically, it obtains SROCC = 0.9203, PLCC = 0.9370, KROCC = 0.7753, and RMSE = 0.8065, reaching the best or near-best performance among the comparison methods in both correlation and error metrics. On the WPC dataset, the proposed method also shows clear advantages. Compared with PQA-Net (SROCC = 0.7000) and IT-PCQA (SROCC = 0.5500), it achieves the best results across all metrics, with SROCC = 0.7689, PLCC = 0.7647, KROCC = 0.5817, and RMSE = 14.7151, demonstrating strong cross-dataset generalization capability. These results confirm the effectiveness of the proposed complexity-driven QMM construction and gated regression head in handling high-complexity local regions and enhancing quality-sensitive feature representations.
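For reference, the four reported criteria can be computed with SciPy as sketched below. Note that PLCC and RMSE are often computed after a nonlinear logistic mapping of the predictions onto the MOS scale; that step is omitted here for brevity.

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr, kendalltau

def pcqa_metrics(pred, mos):
    """SROCC, PLCC, KROCC, and RMSE between predicted scores and subjective MOS."""
    pred, mos = np.asarray(pred), np.asarray(mos)
    srocc = spearmanr(pred, mos).correlation   # rank correlation (monotonicity)
    plcc = pearsonr(pred, mos)[0]              # linear correlation (accuracy)
    krocc = kendalltau(pred, mos).correlation  # pairwise rank agreement
    rmse = np.sqrt(np.mean((pred - mos) ** 2)) # absolute prediction error
    return srocc, plcc, krocc, rmse
```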
The complexity-driven patch sampling strategy focuses on patches exhibiting structural changes and sensitivity to distortion, thereby improving the correlation between predicted scores and ground-truth subjective ratings. Meanwhile, the statistic-guided gated regression structure further strengthens the fusion of multi-scale quality information, yielding smoother and more stable predictions.
Overall, the experimental results indicate that the proposed method achieves notable advantages on the standard benchmarks SJTU-PCQA and WPC, further confirming the effectiveness of complexity-driven sampling and gated statistical fusion for no-reference PC quality assessment.

4.4. Ablation Study

To further validate the effectiveness of each proposed component, ablation studies are performed on SJTU-PCQA, analyzing the contributions of the complexity-driven patch sampling strategy (CDPS), the Attention-Enhanced Swin-Transformer (AEST), and the gated regression head (GRH). The results are summarized in Table 4, where w/o CDPS, w/o AEST, and w/o GRH denote the model without the corresponding module. The results demonstrate the performance gain contributed by each individual module and show that the method achieves its highest overall performance when all components operate jointly: the complete model (Proposed) attains the best results on SJTU-PCQA, with SROCC = 0.9203, PLCC = 0.9370, and RMSE = 0.8065.

4.5. Complexity Analysis

In addition, Table 5 presents the complexity metrics of the proposed model: the model file size is 317.0 MB, GPU memory usage during training and testing is 6.0 GB, the training time on the entire dataset is 7613.9 s (approximately 2.1 h), and the average testing time per single PC is 8.5 s. In comparison, IT-PCQA requires 18,021.2 s, and PQA-Net requires 24,729.6 s for training on the same dataset using the same hardware. These results indicate that the proposed model maintains a performance advantage while significantly reducing training and inference costs, achieving a good balance between performance and efficiency.

5. Conclusions

In this paper, a new no-reference point cloud quality assessment (PCQA) method via complexity-driven patch sampling and an attention-enhanced Swin-Transformer is proposed. The proposed complexity-driven patch sampling strategy effectively emphasizes high-information-density regions, thereby enhancing the representativeness of quality-sensitive local features. The Attention-Enhanced Swin-Transformer further strengthens feature extraction by adaptively reweighting critical channels and suppressing redundant ones. In addition, the gated regression head provides more reliable quality prediction by enabling the adaptive fusion of global semantic descriptors and channel-level statistical cues. Together, these components contribute to the superior performance and robustness observed across diverse distortion conditions. Experimental results on the SJTU-PCQA and WPC datasets demonstrate that the proposed method achieves competitive or superior performance compared with existing mainstream methods.
For future work, multi-scale statistical modeling, self-supervised representation learning, and lightweight network architectures will be further explored to improve the efficiency of quality perception and to enable dynamic assessment in real-time and large-scale PC scenarios.

Author Contributions

Conceptualization, X.S. and R.T.; methodology, X.S.; software, X.S.; validation, X.S., Q.L. and R.T.; formal analysis, X.S.; investigation, X.S.; resources, R.T.; data curation, X.S.; writing—original draft preparation, X.S.; writing—review and editing, R.T.; visualization, X.S.; supervision, R.T.; project administration, R.T.; funding acquisition, Y.B., D.G. and Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in SJTU-PCQA at https://smt.sjtu.edu.cn/database/point-cloud-subjective-assessment-database/ (accessed on 9 July 2025).

Acknowledgments

The authors would like to thank the members of the laboratory for their helpful discussions.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
PCQA	Point Cloud Quality Assessment
NR-PCQA	No-Reference Point Cloud Quality Assessment
QMM	Quality Mapping Module
SJTU-PCQA	Shanghai Jiao Tong University Point Cloud Quality Assessment Database
WPC	Waterloo Point Cloud

References

  1. Zhang, Y.; Yang, Q.; Zhou, Y.; Xu, X.; Yang, L.; Xu, Y. TCDM: Transformational Complexity Based Distortion Metric for Perceptual Point Cloud Quality Assessment. IEEE Trans. Vis. Comput. Graph. 2024, 30, 6707–6724. [Google Scholar] [CrossRef] [PubMed]
  2. Zhang, Z.; Wu, H.; Zhou, Y.; Li, C.; Sun, W.; Chen, C.; Min, X.; Liu, X.; Lin, W.; Zhai, G. LMM-PCQA: Assisting Point Cloud Quality Assessment with LMM. arXiv 2024, arXiv:2404.18203v2. [Google Scholar] [CrossRef]
  3. Duan, H.; Fu, K.; Wu, S.; Li, Y.; Zhang, Z.; Hu, Q.; Min, X.; Zhai, G. BMPCQA: Bioinspired Metaverse Point Cloud Quality Assessment Based on Large Multimodal Models. Adv. Intell. Syst. 2025, 2500504. [Google Scholar] [CrossRef]
  4. Zhang, Z.; Sun, W.; Min, X.; Wang, T.; Lu, W.; Zhai, G. No-Reference Quality Assessment for 3D Colored Point Cloud and Mesh Models. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 7618–7631. [Google Scholar] [CrossRef]
  5. Yang, Q.; Ma, Z.; Xu, Y.; Li, Z.; Sun, J. Inferring Point Cloud Quality via Graph Similarity. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 3015–3029. [Google Scholar] [CrossRef]
  6. Zhou, W.; Yang, Q.; Jiang, Q.; Zhai, G.; Lin, W. Blind Quality Assessment of 3D Dense Point Clouds with Structure Guided Resampling (SGR-PCQA). arXiv 2022, arXiv:2208.14603. [Google Scholar]
  7. Liu, Q.; Su, H.; Duanmu, Z.; Liu, W.; Wang, Z. PQA-Net: Deep No Reference PC Quality Assessment via Multi-View Projection. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 4645–4660. [Google Scholar] [CrossRef]
  8. Yang, Q.; Liu, Y.; Chen, S.; Xu, Y.; Sun, J. No-Reference Point Cloud Quality Assessment via Domain Adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 21147–21156. [Google Scholar] [CrossRef]
  9. Shan, Z.; Zhang, Y.; Yang, Q.; Yang, H.; Xu, Y.; Liu, S. PAME: Self-Supervised Masked Autoencoder for No-Reference Point Cloud Quality Assessment. arXiv 2024, arXiv:2403.10061. [Google Scholar]
  10. Chen, T.; Liu, Q.; Zhang, Y.; Xu, Y.; Tang, R.; Sun, J. Layered Projection-Based Quality Assessment of 3D Point Clouds. IEEE Access 2021, 9, 88108–88120. [Google Scholar] [CrossRef]
  11. Zhang, Z.; Zhou, Y.; Sun, W.; Min, X.; Zhai, G. Simple Baselines for Projection-based Full-reference and No-reference Point Cloud Quality Assessment. arXiv 2023, arXiv:2310.17147. [Google Scholar] [CrossRef]
  12. Chai, X.; Shao, F.; Mu, B.; Chen, H.; Jiang, Q.; Ho, Y.-S. Plain-PCQA: No-Reference Point Cloud Quality Assessment by Analysis of Plain Visual and Geometrical Components. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 6207–6223. [Google Scholar] [CrossRef]
  13. Wang, Z.; Zhang, Y.; Yang, Q.; Xu, Y.; Sun, J.; Liu, S. Point Cloud Quality Assessment using 3D Saliency Maps. In Proceedings of the 2023 IEEE International Conference on Visual Communications and Image Processing (VCIP), Jeju, Republic of Korea, 4–7 December 2023; pp. 1–5. [Google Scholar] [CrossRef]
  14. Zhang, L.; Zhu, L.; Ma, K. GMS-3DQA: Geometry–Texture Joint Modeling for No-Reference Point Clouds Quality Assessment. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 3775–3788. [Google Scholar]
  15. Yang, Q.; Chen, H.; Ma, Z.; Xu, Y.; Tang, R.; Sun, J. Predicting the Perceptual Quality of Point Cloud: A 3D-to-2D Projection-Based Exploration. IEEE Trans. Multimed. 2021, 23, 3877–3891. [Google Scholar] [CrossRef]
  16. Liu, Q.; Su, H.; Duanmu, Z.; Liu, W.; Wang, Z. Perceptual Quality Assessment of Colored 3D Point Clouds. IEEE Trans. Vis. Comput. Graph. 2023, 29, 3642–3655. [Google Scholar] [CrossRef] [PubMed]
  17. Zhang, Z.; Sun, W.; Zhu, Y.; Min, X.; Wu, W.; Chen, Y.; Zhai, G. Evaluating Point Cloud from Moving Camera Videos: A No-Reference Metric. arXiv 2023, arXiv:2208.14085v3. [Google Scholar] [CrossRef]
  18. Zhang, Z.; Sun, W.; Min, X.; Wang, Q.; He, J.; Zhou, Q.; Zhai, G. MM-PCQA: Multi-Modal Learning for No-Reference Point Clouds Quality Assessment. In Proceedings of the IJCAI, Macao, China, 19–25 August 2023; pp. 1759–1767. [Google Scholar]
  19. Zhang, Z.; Wu, W.; Min, X.; Zhu, Y.; Zhai, G.; Sun, W. Optimizing Projection-Based Point Cloud Quality Assessment with Human Preferred Viewpoints Selection. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6. [Google Scholar] [CrossRef]
  20. Chen, W.; Wu, Q.; Zhou, W.; Shao, F.; Zhai, G.; Lin, W. No-Reference Point Cloud Quality Assessment via Graph Convolutional Network. IEEE Trans. Multimed. 2025, 27, 2489–2502. [Google Scholar] [CrossRef]
Figure 1. Example of a distorted PC from the SJTU-PCQA dataset.
Figure 2. PCQA framework.
Figure 3. Attention-Enhanced Swin-Transformer.
Figure 4. PC examples in the SJTU-PCQA dataset.
Table 1. Distortion types in SJTU-PCQA.

Type      Description/Method                          Effect
OT        Octree-based compression via MPEG PCC.      Geometry
CN        Gaussian noise on RGB channels.             Color
DS        Uniform down-sampling of points.            Geometry
DS + CN   DS followed by RGB Gaussian noise.          Geometry + Color
DS + CGN  DS followed by Gaussian geometry noise.     Hybrid Geometry
CGN       Gaussian noise on point coordinates.        Geometry
BN        Random brightness changes on RGB.           Color
CN + BN   Gaussian color + brightness noise.          Color Compound
OT + DS   DS after octree compression.                Geometry Compound
Table 2. Performance comparison on SJTU-PCQA.

Model            SROCC     PLCC      KROCC     RMSE
PQA-Net [7]      0.8500    0.8200    –         –
IT-PCQA [8]      0.5800    0.6300    –         –
VQA-PC [17]      0.8509    0.8635    0.6585    1.1334
GMS-3DQA [14]    0.9108    0.9177    0.7735    0.7872
3D-NSS [4]       0.7382    0.7144    0.5174    1.7686
MM-PCQA [18]     0.8998    0.9202    0.7677    0.8801
VPI-PCQA [19]    0.9041    0.9155    –         0.9263
GC-PCQA [20]     0.9108    0.9301    0.7546    0.8691
Proposed         0.9203    0.9370    0.7753    0.8065

Bold values denote the best performance; dashes mark values not reported by the original papers.
Table 3. Performance comparison on WPC.

Model          SROCC     PLCC      KROCC     RMSE
PQA-Net [7]    0.7000    0.6900    0.5100    15.1800
IT-PCQA [8]    0.5500    0.5400    –         –
3D-NSS [4]     0.6514    0.6479    0.4417    16.5745
Proposed       0.7689    0.7647    0.5817    14.7151
Table 4. Ablation results on SJTU-PCQA.

Variant     SROCC     PLCC      KROCC     RMSE
w/o CDPS    0.9188    0.9302    0.7702    0.8129
w/o AEST    0.9132    0.9257    0.7684    0.8352
w/o GRH     0.9160    0.9146    0.7704    0.8286
Proposed    0.9203    0.9370    0.7753    0.8065
Table 5. Complexity analysis.

Metric              Value
Model File Size     317.0 MB
GPU Memory Usage    6.0 GB
Training Time       7613.9 s
Testing Time        8.5 s (per PC)
