Perceptual Image Quality Prediction: Are Contrastive Language–Image Pretraining (CLIP) Visual Features Effective?

In recent studies, the Contrastive Language–Image Pretraining (CLIP) model has showcased remarkable versatility in downstream tasks, ranging from image captioning and question-answering reasoning to image–text similarity rating. In this paper, we investigate the effectiveness of CLIP visual features in predicting perceptual image quality. CLIP is also compared with competitive large multimodal models (LMMs) for this task. In contrast to previous studies, the results show that CLIP and other LMMs do not always provide the best performance. Interestingly, our evaluation reveals that combining visual features from CLIP or other LMMs with some simple distortion features can significantly enhance their performance. In some cases, the improvements are even more than 10%, while the prediction accuracy surpasses 90%.


Introduction
In the dynamic realm of visual content, various types of distortions are inherent in its life cycle, such as generation, processing, and delivery [1][2][3]. This has given rise to an increasing need for evaluating image quality without any reference points. In the domain of image quality assessment, particularly in the blind image quality assessment task, the emphasis has historically been on leveraging distortion features exclusively [1,2,4,5]. Recent developments have witnessed significant progress in the alignment of images with accompanying text, such as Contrastive Language-Image Pretraining (CLIP) [6]. In addition to the multimodal uses of CLIP [7][8][9][10][11], the visual features provided by CLIP have showcased remarkable versatility in diverse applications, such as captioning [12][13][14][15], object detection [16], semantic image segmentation [17], and cross-modal retrieval [18][19][20]. This wide-ranging utilization underscores the broad applicability and robust performance of CLIP and its derivatives across a spectrum of interdisciplinary challenges. Fueling this progress is the availability of massive training data used to train the image and text encoder blocks.
In this paper, our goal is to investigate how effectively CLIP visual features work in image quality assessment. This is an interesting issue because, although CLIP was trained with a huge number of images and associated texts, there could be few hints for CLIP to learn which images have distortions or artifacts. So, it is not clear whether CLIP's visual features can represent the perceptual distortions present in images.
To this end, an evaluation of CLIP's visual features for the image quality assessment (IQA) task is carried out. In particular, CLIP's features are compared with those of related large multimodal models (LMMs), such as HPS [21], ALTCLIP [22], and ALIGN [23], and with the conventional features of image quality models. Through extensive experiments, it is shown that CLIP does not always provide the best performance, as seen in Figure 1. Here, the performance metrics are the Pearson Linear Correlation Coefficient (PLCC) and the Spearman Rank Order Correlation Coefficient (SROCC) [24]. Remarkably, it has been observed that augmenting LMMs with simple distortion features yields a substantial enhancement in performance. This implies that service providers can efficiently extract feature vectors using various APIs and then incorporate these vectors for the purpose of quality assessment. The synergistic combination of large pretrained models with more straightforward distortion features showcases a promising avenue for assessing image quality. To the best of our knowledge, this is the first work that comprehensively evaluates the image features of CLIP and related models for image quality assessment.
The remainder of this paper is organized as follows. Section 2 reviews and analyzes related research works, while Section 3 describes our evaluation architecture, the datasets used, and our implementation details. In Section 4, the evaluation results and the experimental findings are presented. Finally, concluding remarks are given in Section 5.

Related Work
This section reviews the relevant literature in image quality assessment. Additionally, LMMs and recent research exploring the use of their features are presented.

Perceptual Quality Assessment
Perceptual image quality evaluation takes into account human perception factors to assess how well an image reproduces the content, colors, textures, and overall visual experience. It can be divided into three broad categories: full-reference (FR), reduced-reference (RR), and no-reference (NR) or blind image quality assessment (BIQA) [25,26]. BIQA refers to the process of evaluating the quality of a distorted image without using a reference image for comparison. It is a challenging task, as it requires the model to analyze the image content and identify potential artifacts or distortions that may affect its quality. There are various approaches to IQA, ranging from hand-crafted features [27][28][29][30] to deep learning-based methods [24,[31][32][33][34][35][36].
Deep learning-based approaches have shown remarkable success in accurately evaluating image quality by leveraging the power of deep neural networks. However, a common challenge faced by deep learning methods is the availability of large-scale training datasets. In some cases, the limited availability of annotated image samples can affect the performance of IQA models [2,33]. To address this, transfer learning has been proposed as an effective solution [37,38]. Transfer learning involves pretraining a neural network on a large-scale image dataset; subsequently, the pretrained model is fine-tuned on the IQA task using the available annotated IQA data. This approach was successfully applied in [33][34][35], where the models were pretrained on ImageNet [39] and then adapted to predict image quality scores. Additionally, the meta-learning approach in [2] can be applied to enhance the ability of IQA models to learn other distortion types prevalent in images. By employing meta-learning, a model can acquire knowledge about a wide range of distortions, improving its generalization capabilities and adaptability to diverse image datasets [2].

Large Multimodal Models
In recent times, multimodal or cross-modal learning has emerged as a thriving research area, particularly for downstream tasks like image-text similarity evaluation and image retrieval. Researchers have made significant contributions by proposing robust learning models such as CLIP [6], ALIGN [23], HPS [21], and ALTCLIP [22], all of which aim to uncover and understand the intricate relationship between vision and language. These models often leverage advanced architectures, such as vision transformers [40], CNNs, or pretrained ResNet models [41], to effectively capture and analyze complex relationships within and between different modalities. The development of large multimodal models has been facilitated by advancements in deep learning, computational resources, and the availability of vast and diverse datasets.
The visual features of LMMs trained with text supervision can be applied to various downstream tasks. Notably, recent research conducted in [12] underscores the "unreasonable effectiveness" of CLIP features for image captioning. The authors discovered that CLIP is currently one of the best visual feature extractors. Also, the research in [42] showed that CLIP can understand image aesthetics well. In that experiment, the authors trained a linear regression model using CLIP visual features extracted from the AVA dataset. Their findings show that CLIP visual features outperform features from an ImageNet classification model. Similar to [12], frozen CLIP features were employed in [43] for the video recognition task. It was also found that CLIP features are very effective in representing spatial features at the frame level. In the context of embodied AI tasks (e.g., robotics) [44], the use of CLIP features has been shown to be very simple yet more effective than the features of ImageNet-pretrained models.
In this paper, we explore another important question: are CLIP visual features effective for perceptual image quality assessment? For this purpose, CLIP is compared with related large multimodal models, and its features are compared with those of conventional quality models.

Evaluation Methodology
In this section, we present a description of our evaluation architecture, LMM features, and distortion features. Also, an overview of the datasets and implementations used in our evaluation is provided.

Evaluation Architecture
Our purpose is to evaluate features from LMMs such as CLIP and related models. So, the selection and comparison of features are important for a meaningful evaluation. For this, features were selected from various models with different degrees of complexity, as presented in Sections 3.2 and 3.3. Also, similar to [12], as a large number of models and their feature combinations were considered in our study, we adopted a simple architecture, as depicted in Figure 2. The flow of feature processing and quality prediction is as follows.
First, we extract and compare visual features from each LMM. Distorted images are passed through the image encoder block, and the visual features are extracted and input into the fully connected (FC) layer block, which serves as a regressor. Features from three typical image quality models (namely, a pretrained image quality metric, a statistical metric, and a lightweight CNN metric) are also included for comparison purposes. Second, visual features from each LMM are concatenated with the distortion features of a certain quality model, which is controlled by a switch in our architecture. Then, the vector of concatenated features is processed by the FC block to predict a quality score. The final output from the FC block is passed through a sigmoid function, whose output is interpreted as a quality score between 0 and 1. During training, the objective is to minimize the Mean Squared Error (MSE) between the predicted score and the Mean Opinion Score (MOS). The MOS is a measure of perceived quality: an average score obtained through subjective experiments in which subjects are asked to rate the images. According to the ITU-T standard P.800.1 [45], the rating typically has five levels: 5 (excellent), 4 (good), 3 (fair), 2 (poor), and 1 (bad). For image quality prediction, the MOS serves as the ground truth against which the predicted quality score is judged.
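The flow above can be sketched as follows. This is a minimal illustration in PyTorch with assumed feature sizes (512 for the LMM vector, matching CLIP ViT-B/16, and 36 for BRISQUE); the class and variable names are ours, not taken from the paper's code.

```python
import torch
import torch.nn as nn

# Sketch of the Figure 2 pipeline: LMM features, optionally concatenated
# with distortion features (the "switch"), regressed to a quality score.
class QualityRegressor(nn.Module):
    def __init__(self, lmm_dim=512, dist_dim=36):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(lmm_dim + dist_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 1),
        )

    def forward(self, lmm_feat, dist_feat):
        x = torch.cat([lmm_feat, dist_feat], dim=1)  # switch closed
        return torch.sigmoid(self.fc(x))             # score in (0, 1)

model = QualityRegressor()
lmm_feat = torch.randn(8, 512)    # e.g., CLIP ViT-B/16 image features
dist_feat = torch.randn(8, 36)    # e.g., BRISQUE features
pred = model(lmm_feat, dist_feat)
mos = torch.rand(8, 1)            # normalized ground-truth MOS
loss = nn.functional.mse_loss(pred, mos)   # training objective
```

Opening the switch simply means feeding zeros of size `dist_dim` (or dropping the concatenation), so the same regressor structure covers both the LMM-only and the combined-feature cases.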

CLIP (Contrastive Language-Image Pretraining)
CLIP [6] is a powerful model created by OpenAI for learning visual representations from natural language supervision. Trained on a vast dataset pairing images with textual descriptions, CLIP excels in tasks like visual classification and image-text retrieval [6]. CLIP is trained on the WebImageText dataset [6], created by collecting 400 M image-text pairs publicly available on the Internet. It has both ViT-based [40] and ResNet-based [41] backbones. For ViT-based CLIP models, CLIP ViT-L/14 has 304 million parameters, while CLIP ViT-B/32 and CLIP ViT-B/16 have 87.8 million and 86.2 million parameters, respectively. ResNet-based CLIP versions include RN50, RN101 (where 50 and 101 indicate the model depth), RN50x4, RN50x16, and RN50x64 (where x denotes up-scaling). CLIP-RN50x64 is the largest, with 420 million parameters, while CLIP-RN50 is the smallest, with 38.3 million parameters. CLIP-RN101 has 56.2 million parameters.

ALTCLIP

ALTCLIP [22] is a bilingual model designed to understand both English and Chinese. It enhances the text encoder of CLIP with XLM-R [54], a multilingual pretrained model. This means that the text encoder is a student text encoder that has learned from a teacher model in a distillation training stage. It uses the Conceptual Captions (CC-3M) [46] and TSL2019-5M [55] datasets as training data. The student text encoder possesses capabilities similar to those of the multilingual model used as its teacher. Comparing CLIP and ALTCLIP on downstream tasks like image classification in both English and Chinese, CLIP outperformed ALTCLIP on English, while ALTCLIP outperformed CLIP on Chinese [22]. ALTCLIP enables cross-language understanding, allowing it to process and interpret information in both English and Chinese [22].

HPS (Human Preference Score)
HPS is a model designed to address the misalignment issues (artifacts) inherent in generated images using human preferences. The study was divided into two parts: first, the authors [21] collected a large-scale dataset guided by human preference; second, they designed a human preference classifier (HPC) using the collected dataset as training data [21]. The dataset is made up of 98,807 images created from 25,205 prompts of human choices. For each prompt, multiple images are provided; the user selects one image as the preferred choice, while the remaining images are considered non-preferred negatives [21]. The classifier was designed by fine-tuning CLIP ViT-L/14 with a few modifications: for the image and text encoder blocks, only the last 10 and 6 layers, respectively, were used. The collected dataset served as training data to achieve accurate semantic understanding and preference scoring. HPS was designed to guide models to generate human-preferred images.

ALIGN (A Large-Scale ImaGe and Noisy-Text Embedding)
ALIGN is an LMM that aligns images and noisy texts using straightforward dual encoders. It was developed as an LMM competitive with CLIP. Its image encoder backbone was designed using EfficientNet [56], while the BERT base model was used for the text encoder. Its training data comprise approximately one billion noisy image alt-text pairs [23]. The image and text encoders have approximately 480 M and 340 M parameters, respectively. ALIGN also employs a contrastive loss to train the image and text encoders. This approach facilitates the alignment of visual and textual information, making it an effective model for understanding the relationships between images and their associated noisy texts. ALIGN outperformed CLIP in downstream tasks such as image-text retrieval (Flickr30k-1K test set and MsCOCO-5K test set) and image classification (ImageNet) [23]. In this study, the implementation in [57], which is available in the transformers library, was used.

Distortion Features
For comparison, we also employed perceptual distortion features provided by various models that focus on perceptual image quality. Three models were selected with increasing levels of complexity, namely, a statistical model, a lightweight CNN model, and a meta-learning model. This is important for understanding how effective LMM features are in comparison to the features that have been conventionally used to predict image quality.

Statistical Features
We used the BRISQUE (Blind/Referenceless Image Spatial Quality Evaluator) algorithm [28]. BRISQUE was developed based on the observation that the normalized luminance coefficients of natural images follow a nearly Gaussian distribution. This means that the presence of distortions can be attributed to a possible loss of naturalness. The authors designed BRISQUE [28] by using the scene statistics of locally normalized luminance coefficients to account for losses of naturalness. BRISQUE features are generated by applying the BRISQUE algorithm to each image sample. The resulting feature vector has a size of 36.
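The first step of BRISQUE, computing the locally normalized (MSCN) luminance coefficients, can be sketched as below; the full 36-dimensional feature vector is then obtained by fitting generalized Gaussian models to these coefficients (and to products of neighboring coefficients) at two scales, which is omitted here. The exact window parameters used below are assumptions of this sketch.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mscn_coefficients(gray, sigma=7.0 / 6.0, c=1.0 / 255.0):
    """Mean-subtracted contrast-normalized (MSCN) coefficients."""
    gray = gray.astype(np.float64)
    mu = gaussian_filter(gray, sigma)                      # local mean
    var = gaussian_filter(gray * gray, sigma) - mu * mu    # local variance
    sd = np.sqrt(np.abs(var))                              # local std-dev
    return (gray - mu) / (sd + c)                          # normalize

image = np.random.rand(64, 64)       # stand-in for a grayscale image in [0,1]
coeffs = mscn_coefficients(image)    # nearly zero-mean for natural images
```

For pristine natural images, these coefficients are close to a unit Gaussian; distortions perturb their distribution, which is what the fitted statistics capture.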

Lightweight CNN Features
A very simple CNN model for image quality assessment, denoted CNNIQA [31], was employed. It is one of the earliest models developed for no-reference quality assessment. The structure is simple, with only one convolutional layer. The pooling operations include both max pooling and min pooling. For regression, 2 fully connected layers with a vector size of 800 are used. In training, feature learning and regression are jointly optimized.
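A sketch of this structure in PyTorch is given below. The layer sizes follow the description above (one convolutional layer, max+min pooling, two 800-node FC layers); details not stated in the text, such as the number of kernels and the kernel size, are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class CNNIQASketch(nn.Module):
    """Hypothetical sketch of the CNNIQA-style structure."""
    def __init__(self, n_kernels=50, fc_size=800):
        super().__init__()
        self.conv = nn.Conv2d(1, n_kernels, kernel_size=7)
        self.fc = nn.Sequential(
            nn.Linear(2 * n_kernels, fc_size), nn.ReLU(),
            nn.Linear(fc_size, fc_size), nn.ReLU(),
            nn.Linear(fc_size, 1),
        )

    def forward(self, x):                       # x: (B, 1, 32, 32) patches
        h = self.conv(x).flatten(start_dim=2)   # (B, n_kernels, H'*W')
        h_max = h.max(dim=2).values             # global max pooling
        h_min = h.min(dim=2).values             # global min pooling
        return self.fc(torch.cat([h_max, h_min], dim=1))

patches = torch.randn(4, 1, 32, 32)             # grayscale patches
scores = CNNIQASketch()(patches)                # one score per patch
```

Joint optimization simply means training the convolutional layer and the FC regressor end-to-end with a single loss on the predicted scores.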

Meta-Learning Features
The state-of-the-art MetaIQA model [2] was also included in our quality evaluation. This approach is based on meta-learning, which involves learning the shared meta-knowledge inherent in diverse image distortions, thereby mimicking the human visual system (HVS). The meta-learning model is based on a ResNet-18 model extensively trained on the TID2013 [58] and KADID-10k [59] datasets to learn the essential aspects of image quality distortions. The model uses bi-level gradient optimization to learn the shared meta-knowledge. To adapt to unknown data, we fine-tuned the pretrained MetaIQA model. Compared to BRISQUE and CNNIQA, the features of MetaIQA provide a much deeper representation of image quality characteristics.

Combined Features
In this study, combined features from an LMM and a quality model were also considered. Unlike image quality models, LMMs may not have been trained with distorted images. Therefore, augmenting LMM features with distortion features could be effective. This results in twelve (12) different combinations between the above four LMMs (CLIP, ALTCLIP, HPS, ALIGN) and the three IQA models (BRISQUE, CNNIQA, MetaIQA). As shown later in the experiments, it turns out that even the simple hand-crafted features of BRISQUE are surprisingly effective in boosting the performance of every LMM's features.

Datasets and Implementation Details
In our evaluation, four authentically distorted IQA datasets were used: KonIQ-10k [60], LIVE [61], SPAQ [62], and CID2013 [63]. Through this thorough assessment, we are able to see how well the features of each LMM perform across a wide range of authentically distorted IQA datasets. Table 1 shows the characteristics of each of the selected datasets, such as the MOS range, the number of samples, and the type of subjective experiment conducted. Our evaluation was implemented in PyTorch 1.10.2 for the training and testing tasks. The experimental settings, such as the learning rate, epochs, and batch size, were similar to those of MetaIQA [2]. To extract CLIP features, we used CLIP ViT-B/16 with a feature size of 512. The ALTCLIP and HPS models were designed with CLIP ViT-L/14, resulting in a feature size of 768. The ALIGN model is based on EfficientNet B7 and outputs an image feature size of 640. For distortion features, the CNNIQA [31] model was adapted by resizing the default feature vector size from 800 to 512. The feature vector is passed through an FC block consisting of 3 layers: layer 1 takes the feature vector as input, layer 2 has a size of 512, and layer 3 culminates in a single scalar value, which represents the quality score. Due to the small size of the BRISQUE features (36), the number of FC layers was reduced to 2, with an input size of 36 and an output size of 1. The Adam optimizer [64] was used in all experiments.
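For illustration, CLIP image features can be extracted with the Hugging Face transformers API as sketched below. To keep the sketch runnable offline, we build a tiny randomly initialized CLIP from a config; in practice one would instead load pretrained weights, e.g. `CLIPModel.from_pretrained("openai/clip-vit-base-patch16")`, which yields the 512-dimensional image features used in our setup. The tiny config values are assumptions of this sketch.

```python
import torch
from transformers import CLIPConfig, CLIPModel

# Tiny, randomly initialized CLIP so the sketch runs without downloads.
cfg = CLIPConfig(
    text_config={"hidden_size": 32, "intermediate_size": 64,
                 "num_hidden_layers": 2, "num_attention_heads": 2},
    vision_config={"hidden_size": 32, "intermediate_size": 64,
                   "num_hidden_layers": 2, "num_attention_heads": 2,
                   "image_size": 32, "patch_size": 8},
    projection_dim=512,              # ViT-B/16 also projects to 512 dims
)
model = CLIPModel(cfg).eval()

pixels = torch.randn(1, 3, 32, 32)   # a dummy preprocessed image
with torch.no_grad():
    feats = model.get_image_features(pixel_values=pixels)  # (1, 512)
```

The resulting vector is exactly what our FC regressor consumes, optionally concatenated with distortion features.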

Experiments
This section presents the experimental settings and the evaluation results. Also, the trends of the different feature combination cases are discussed in detail.

Experimental Settings
In our experiments, two widely recognized metrics, SROCC and PLCC, were used to evaluate no-reference quality assessment [1,24]. SROCC quantifies the monotonic (rank) relationship between the ground-truth MOS and the predicted scores, while PLCC quantifies the linear relationship. Both scales span from −1 to 1, where higher values signify superior predictive accuracy.
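Both metrics are available in `scipy.stats`. The toy example below (our own, with made-up numbers) shows how SROCC rewards any monotonic relationship while PLCC penalizes non-linearity:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

mos = np.linspace(0.0, 1.0, 20)          # dummy ground-truth scores
pred = mos ** 3                          # monotonic but non-linear predictor

plcc, _ = pearsonr(mos, pred)            # linear correlation
srocc, _ = spearmanr(mos, pred)          # rank correlation

# srocc is exactly 1 (ranks are preserved); plcc falls below 1
# because the relationship is not linear.
```

This is why IQA studies report both: a predictor can rank images perfectly (high SROCC) while still mapping MOS values non-linearly (lower PLCC).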
For LIVE and KonIQ-10k, we used the splitting described in [2]. For SPAQ, a splitting similar to that in [62] was used. In each of these three datasets, 80% of the images were used as training samples, while the remaining 20% were used for testing. CID2013 contains 6 subsets in each set; we used four (4) subsets from each set for training and two (2) subsets for testing. For each dataset, we performed the random train-test splitting procedure 10 times and report the mean PLCC and SROCC values, along with the corresponding standard deviation (std) values. The experiments were carried out on a computer with an NVIDIA RTX 3090 GPU.

Evaluation Results
In this part, we first compare features from different LMMs and quality models. Then, the combinations of LMM features with distortion features are investigated. Also, the performance trends and behaviors of the features are discussed in detail.
The performance comparison of the features is shown in Table 2. The results reveal that MetaIQA consistently achieves the highest performance in most cases. The performance of LMM features is good but usually lower than that of MetaIQA. The only exceptions are the SROCC values of CLIP features on the CID2013 dataset and of HPS features on the LIVE dataset, which are slightly higher than those of MetaIQA. Nevertheless, LMM features perform significantly better than the features of BRISQUE and CNNIQA. This means that large models, though trained in a general manner, can extract more distortion information than the hand-crafted or simple features of BRISQUE and CNNIQA.
The results also show that using CLIP features does not always provide the best results. More specifically, it is interesting that, when considering LMMs only, each LMM has the best performance on only one dataset (i.e., CLIP on CID2013, ALIGN on KonIQ-10k, ALTCLIP on SPAQ, and HPS on LIVE). So, no LMM can be considered the best model in this experiment.
The above results suggest that these LMMs might not have been as extensively trained with various visual distortions as MetaIQA. So, LMM features may be augmented by distortion-oriented features. For this reason, we further investigated various cases of feature combination between an LMM and a conventional quality model. Tables 3-5 show the results of LMM features combined with the features of BRISQUE, CNNIQA, and MetaIQA, respectively. In Tables 3-5, it is interesting to see that adding distortion features can significantly boost the performance of all LMMs. For instance, examining Table 3 (last row), the combination of HPS features and BRISQUE (hand-crafted) features provides the best results on three datasets (LIVE, KonIQ-10k, SPAQ), which are higher than those of MetaIQA. Meanwhile, on CID2013, the combination of CLIP and BRISQUE is the best among all LMMs and better than MetaIQA. It should be noted that BRISQUE's features are simply hand-crafted, and its own performance is very low, as shown in Table 2. However, it is surprisingly effective in improving the performance of LMM features. The reason could be that BRISQUE features contain low-level distortion information, which is missed by the high-level features of LMMs.
When the features of CNNIQA or MetaIQA are combined with the features of an LMM, the performance is, in most cases, further improved. This can be explained by the fact that the distortion features of CNNIQA or MetaIQA are learned features, so they are better than the hand-crafted features of BRISQUE.
The visualization of the data in Tables 2-5 is provided in Figure 3, where each curve shows the performance (in terms of PLCC and SROCC) of each LMM. The horizontal axis of each graph represents four cases: (1) original (denoted by "Orig"), (2) combination with BRISQUE (denoted by "Orig + BRISQUE"), (3) combination with CNNIQA (denoted by "Orig + CNNIQA"), and (4) combination with MetaIQA (denoted by "Orig + Meta"). In Figure 3, the performance trends of LMM features when combined with distortion features can be more easily recognized. It is obvious that, in most cases, the gains obtained by combining with BRISQUE, CNNIQA, or MetaIQA features increase in that order. This means that distortion features learned by meta-learning (i.e., MetaIQA) are richer than those learned by a lightweight CNN (i.e., CNNIQA), which are, in turn, better than the simple hand-crafted features of BRISQUE. Some exceptions occur with HPS, where some combinations do not follow this trend. For example, on the LIVE dataset (Figure 3a,b), when HPS features are combined with BRISQUE features and CNNIQA features, the PLCC value improves from 0.8290 to 0.8529 (approx. 3%) and 0.8602 (approx. 4%), respectively. However, the result of the combination with MetaIQA features is just 0.8503, which is lower than that of the combination with BRISQUE. Similar behavior can be found with HPS on the SPAQ dataset. Nevertheless, the results in such exception cases are still better than those of using HPS features alone. This phenomenon can be explained as follows: the combination of MetaIQA and HPS features may not be effective because both MetaIQA and HPS produce rich features that are possibly not complementary to each other. The evaluation results suggest that LMMs only need to be combined with simple distortion features to compete with state-of-the-art quality metrics.
From Figure 3, it can also be seen that the performance of LMM features when augmented by distortion features varies across datasets. In particular, CLIP features are the best only on the CID2013 dataset. On the other datasets, the performance of HPS features is mostly higher than or comparable to that of the other LMMs' features. Notably, although the ALIGN model is the main competitor of CLIP, the performance of its features is usually lower than that of CLIP features. One exception is the SROCC curve of ALIGN on KonIQ-10k (Figure 3d). This suggests that CLIP should be preferred over ALIGN in IQA tasks.

Conclusions
Recently, the Contrastive Language-Image Pretraining (CLIP) model and related large multimodal models (LMMs) have been introduced as key achievements in the deep learning era. In this study, we investigated the use of the visual features of CLIP and related LMMs for perceptual image quality assessment. Based on the obtained results and the above discussions, the key findings from this evaluation study can be summarized as follows:
• The features of LMMs are still not as effective as those of state-of-the-art IQA models. This could be because the training data of these models do not specifically include distorted images with quality scores as labels. However, the performance of LMM features is much better than that of hand-crafted features or lightweight CNN features.
• CLIP features achieve the best results on only one dataset. In fact, the performance of different LMMs varies widely, depending on the given dataset and the feature combination.

• The features of ALIGN, which is the main competitor of CLIP, are usually not as effective as the features of CLIP and its variants in image quality assessment. In addition, the cases involving features from HPS demonstrate exceptional performance, especially on large IQA datasets.
• An important finding is that distortion features can be combined with LMM features to significantly boost performance. Even the very basic distortion features of BRISQUE are useful in improving the performance of LMM features.
The above findings suggest that, in practice, developers may simply extract LMM features and simple distortion features using existing APIs, without having to build new and complex models for IQA. The synergistic combination of LMM visual features with traditional distortion features showcases a promising and efficient approach to image quality assessment. In future work, we will extend our investigation to the quality assessment of video content.

Figure 1 .
Figure 1. Performance of CLIP visual features compared with other LMMs on four public natural datasets.

Figure 2 .
Figure 2. Architecture of our evaluation study.

Figures 4-7.
Figures 4-7 present scatter plots, which are used to elucidate the relationship between the predicted quality score produced by our evaluation architecture and the ground-truth MOS. The scatter plots can also be used to identify anomalous patterns, outliers, or clusters between the two values. As previously mentioned, our experiment involved 10 random data splits; here, we show the results of the last split (10) at epoch 25. The scatter points marked in red, orange, and green signify the results provided by the combinations of MetaIQA + LMM, BRISQUE + LMM, and CNNIQA + LMM, respectively.

Table 2 .
Performance comparison using only LMM visual features on the four selected datasets. Red and blue boldfaced entries indicate the best and second best for each model.

Table 3 .
Performance comparison using LMM visual features and the BRISQUE model on the four selected datasets. Red and blue boldfaced entries indicate the best and second best for each fusion type.

Table 4 .
Performance comparison using LMM visual features and CNNIQA on the four selected datasets. Red and blue boldfaced entries indicate the best and second best for each fusion type.

Table 5 .
Performance comparison using LMM visual features and MetaIQA on the four selected datasets. Red and blue boldfaced entries indicate the best and second best for each fusion type.