Conv-Former: A Novel Network Combining Convolution and Self-Attention for Image Quality Assessment

To address the challenge of no-reference image quality assessment (NR-IQA) for authentically and synthetically distorted images, we propose a novel network that combines convolution and self-attention for image quality assessment (Conv-Former). Our model uses a multi-stage transformer architecture similar to that of ResNet-50 to model the perceptual mechanisms relevant to image quality assessment (IQA) and build an accurate IQA model. We employ an adaptive learnable position embedding to handle images of arbitrary resolution. We propose a new transformer block (TB) that uses self-attention to capture long-range dependencies and local information perception (LIP) to model local features for enhanced representation learning. This module improves the model's understanding of the image content. Dual path pooling (DPP) is used to retain more contextual image quality information during feature downsampling. Experimental results verify that Conv-Former not only outperforms state-of-the-art methods on authentic image databases but also achieves competitive performance on synthetic image databases, which demonstrates the strong fitting performance and generalization capability of the proposed model.


Introduction
Image quality assessment (IQA) is a branch of computer vision research that aims to give computers the same ability to judge image quality as humans. IQA is a crucial research area since it can be applied to image restoration and used to distinguish between distinct visual perception experiences [1]. These days, digital images captured by cameras on cell phones, professional imaging equipment, and remote sensing satellites are widely used [2]. Digital images introduce noise signals in the acquisition process, and they can lose perceptual information during compression, storage, and transmission [3]. As a result, IQA is becoming increasingly important and has numerous applications, such as designing image restoration models to remove blur, noise, clouds, and other artifacts from images using IQA algorithms. It can also assist image acquisition equipment in evaluating and debugging the product's imaging parameters and determining whether the imaging system is degraded. Although subjective human assessment can truly reflect human visual perception, and the assessment results are direct, accurate, and reliable, which is the ultimate basis for judging image quality, the implementation process is time-consuming and costly. It is also easily influenced by personal subjective emotions and preferences, and it has many limitations in meeting practical application needs [4]. To promote the use of IQA in real-world engineering, accurate and effective objective IQA algorithms must be developed. The objective IQA method is an evaluation method that builds a mathematical model based on the visual system of the human eye and scores the image to be measured. This method is low-cost, has the advantages of batch processing and reproducible results, and can be more easily applied to a variety of scenarios. 
Traditionally, objective IQA algorithms are classified into three categories according to whether a reference image is required: full-reference IQA (FR-IQA) [5], reduced-reference IQA (RR-IQA) [6], and no-reference IQA (NR-IQA) [7]. FR-IQA compares the image under test with the original reference image in an adequately defined image space. RR-IQA does not use the original image directly but uses some information about it, such as structural features and distortion types, to evaluate the image quality. NR-IQA does not use any reference image information in the evaluation process. The main goal of all three approaches is to predict a quality score relevant to human visual perception. NR-IQA has attracted considerable attention in recent years because, in many realistic situations, reference information is unavailable or does not exist at all. NR-IQA remains a challenging problem that existing methods have not fully solved.
Early models for NR-IQA were based on manual feature extraction [8][9][10], which relied heavily on our summarized knowledge of the probabilistic architecture of the visual world, the mechanisms of image degradation, and the composition of the human visual system (HVS). In recent years with the development of convolutional neural networks (CNN), more and more vision tasks have benefited from this [11][12][13]. Current deep learning-based image quality assessment methods have achieved remarkable success in extracting visual features using CNN [4,14], which have the following advantages over manual feature extraction: (1) By carefully designing a deep neural network with the problem to be solved and the input data, it is possible to automatically learn the relationships implicit within the data from the training dataset without the need for tedious manual feature extraction; (2) Deep neural network models can contain thousands of parameters; thus, deep features can have better differentiation and representation capabilities. Compared with manually extracted features, it has more prominent advantages in extracting multi-level features and contextual information of images; (3) Deep learning can change the model architecture by simply adjusting the parameters, which enables the network to automatically model itself according to the specific characteristics of the task, with good generalization and efficiency.
In recent years, the Self-Attention-based transformer architecture has achieved great success in the field of natural language processing (NLP) by establishing long-range interactions in a scalable manner [15][16][17][18], and it continues to make ground-breaking progress on this foundation. Researchers have applied the transformer directly to computer vision by slicing images into patches, as in ViT [19] and DETR [20]. The transformer has shown great promise in computer vision, outperforming CNNs in various mainstream tasks such as image classification [11] and object detection [12], and it is likely to replace the CNN as the new backbone in the future, mainly because of its ability to capture long-range pixel interactions and aggregate global information from the entire input sequence.
The transformer relies mainly on Multi-Headed Self-Attention (MHSA) to model long-range interactions; Figure 1a shows the schematic architecture of MHSA. The input image $X_{image} \in \mathbb{R}^{H \times W \times C}$ is cropped into $N = HW/P^2$ image blocks of size $P \times P \times C$, and each image block is flattened into a one-dimensional vector to obtain $X \in \mathbb{R}^{N \times P^2 C}$, where $H$ and $W$ represent the height and width of the image, respectively, $N$ represents the number of image blocks, i.e., the number of tokens, and $d = P^2 C$ represents the feature dimension of each token. As in Equation (1), $X$ is linearly transformed and, together with the class token, forms a new vector that serves as the input to the transformer encoder:

$$X_0 = [X_{class};\, XW] + E_{pos} \tag{1}$$

where $X_{class}$ is the class token added to implement the classification task, $W$ is the matrix that implements the linear mapping, and $E_{pos}$ is the position embedding. Implementing Self-Attention requires defining three key elements: the query $Q = XW^Q$, the key $K = XW^K$, and the value $V = XW^V$, where $W^Q, W^K, W^V \in \mathbb{R}^{d \times d}$ are the weight matrices that implement the linear mappings. The output $Z$ can be formulated as:

$$Z = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V \tag{2}$$

where $\sqrt{d}$ acts as an approximate normalization and the matrix product $QK^T$ calculates the similarity between each pair of tokens to achieve Self-Attention. As shown in Figure 1b, the input $X$ is split into $i$ parts, i.e., $X = [x_1, x_2, \cdots, x_i]$ with each part in $\mathbb{R}^{N \times d/i}$, and each part is fed separately into the Self-Attention module; the multi-headed attention value is obtained by concatenating the single attention values, $Z = [z_1, z_2, \cdots, z_i]$. The formulation can be further rewritten as:

$$Z = [\mathrm{Attention}(Q_j, K_j, V_j)]_{1:i} \tag{3}$$

where $\mathrm{Attention}(\cdot)$ is the standard Self-Attention function based on QKV, $Q_j$, $K_j$, and $V_j$ are the query, key, and value computed from the $j$-th part $x_j$, and $[\cdot]_{1:i}$ indicates that the results are concatenated.
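The scaled dot-product attention of Equation (2) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the helper names (`matmul`, `softmax`, `self_attention`), the toy token values, and the identity projection matrices are assumptions chosen only to make the computation easy to follow.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of similarity scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Single-head attention: softmax(Q K^T / sqrt(d)) V."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    # Pairwise token similarities, scaled by sqrt(d).
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights

# Toy input (illustrative values): N = 3 tokens, d = 2 features, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z, attn = self_attention(tokens, identity, identity, identity)
```

Each row of `attn` is a probability distribution over the tokens, so every output token in `z` is a weighted average of all value vectors. A multi-headed variant would run this function on each slice of the feature dimension and concatenate the outputs, as in Equation (3).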
Although the transformer has made breakthroughs in computer vision tasks, there is still a significant gap in performance and computational cost between the simple multilayer transformer encoder architecture and earlier convolutional neural networks, which makes the transformer impractical on limited hardware resources. Figure 2 shows an overview of ResNet50 [21], the vision transformer [19], and the multi-stage transformer [22]. The main idea of ResNet50 is to divide feature map extraction into different stages so that, given an input image, feature maps can be generated at different scales, which has proved to be very useful in many dense prediction tasks [21,23]. In the vision transformer, the image is split directly into non-overlapping patches, which are fed into the transformer encoder after linear projection. Inspired by these two architectures, researchers designed a novel transformer network architecture that generates hierarchical feature representations like ResNet50. Such networks usually apply a pooling layer before each stage to reduce the size of the intermediate features by 2× downsampling and stack several encoder blocks in each stage. Compared with the traditional vision transformer architecture, this multi-stage transformer network can significantly reduce the number of parameters, which makes it possible to train and extract multi-scale feature representations that have been shown to be beneficial for many vision tasks. We can obtain four hierarchical feature maps at different resolutions, similar to a typical convolutional neural network. Our proposed Conv-Former network also exploits the advantages of this hierarchical architecture and improves upon it.
Figure 2. Overview of (a) ResNet50 [21], (b) the vision transformer [19], and (c) the multi-stage transformer [22].
In this paper, we aim to design an IQA model (Conv-Former) that uses the long-range interaction capabilities of a transformer and the local feature extraction capabilities of a CNN, so that the network gives predictions that are more consistent with human visual system perception. To this end, we introduce a multi-stage transformer network architecture into our IQA model. By using a local information perception module and a transformer to capture local information-aware features and global semantic features in an image, the network is able to collect both fine-grained detail and global information. The dual-path pooling allows the multi-stage transformer to capture as much contextual information as possible, and Conv-Former is shown experimentally to be highly capable of local feature perception and image content understanding, both of which are important in IQA tasks. The main contributions of this paper are as follows:
(1) We designed an end-to-end neural network model called Conv-Former for no-reference image quality assessment. The overall design uses a multi-stage architecture similar to that of ResNet-50 to obtain multi-scale features, which can significantly reduce the number of parameters compared with the traditional transformer architecture. At the same time, multi-scale features are more conducive to the extraction of image quality features. This architecture models the perceptual mechanisms relevant to image quality assessment and thereby supports an accurate IQA model;
(2) We introduce an effective hybrid architecture for image quality assessment networks that uses local information from CNNs and global semantic information captured by the transformer to further improve the accuracy of IQA. It is implemented by replacing the linear layer that generates the QKV matrices with a local information-aware module, which captures local image quality information, acquires fine-grained features, and obtains both detailed and overall information representations of the image. Network analysis experiments also demonstrate that the network outperforms other models in understanding the content of the input images. This enables the neural network to focus better on the subtle differences between images and thus obtain a more accurate image quality score. To reduce the loss of image quality information during feature downsampling in the multi-stage architecture, we designed the dual path pooling module to retain more contextual information;
(3) The position embedding of traditional transformer networks cannot adapt to input images of different resolutions or to the use of local information perception modules. Therefore, this paper proposes an adaptive 2D position embedding module, which removes the fixed-resolution input restriction of traditional CNN-based models. In addition, the 2D position embedding is more in line with the characteristics of images and can effectively represent the position information between tokens;
(4) We evaluated Conv-Former on two authentic image quality assessment datasets, LIVE Challenge (LIVEC) and KonIQ-10k, as well as on the synthetic datasets LIVE, TID2013, and CSIQ, and compared the performance of the algorithm on different distortion types.
The extensive experimental results show that Conv-Former achieves competitive results, which demonstrates the strong fitting performance and generalization capability of our proposed model. As shown in Figure 3, the results of Conv-Former are more in line with the Mean Opinion Score (MOS).

Attention Mechanism in CNN
The attention mechanism is an important feature of the human visual system, which means that only a portion of all visible information is noticed by humans. The attention mechanism is introduced in CNNs by simulating the human visual perception process, which can ignore the interference of irrelevant information and thus improve the generalization performance of the network. This has enabled CNNs to make breakthroughs in areas such as object detection, image generation, and target tracking.
The attention mechanism was originally used to encode long input sentences as part of the encoder-decoder framework in recurrent neural networks (RNN) and has since been widely used in RNNs [24]. Attention is also widely used to enhance the representation of features. For example, Hu et al. [25] proposed SENet, which uses channel attention to explicitly model the interdependencies between feature maps: it learns the importance of each feature map and then reweights the original data according to this importance. In this way, SENet increases the weight of features that are more useful for the task and decreases the weight of less useful features to achieve better results. By embedding this module into other networks, the computational resources of the neural network can be allocated more rationally at the cost of a small increase in the number of parameters, resulting in a significant improvement in network performance. Wang et al. [26] proposed efficient channel attention as an improvement on SENet. It is a local cross-channel interaction strategy without dimensionality reduction that adaptively selects the kernel size of a one-dimensional convolution and obtains more accurate attention information by aggregating cross-channel information through this one-dimensional convolutional layer. The Convolutional Block Attention Module (CBAM) [27] is built by combining a spatial attention module (SAM) and a channel attention module (CAM). It aggregates attention information along the spatial and channel dimensions and fuses the two to obtain more comprehensive and reliable attention information and to provide more appropriate guidance on the allocation of computational resources. Based on CBAM, Fu et al. [28] proposed DA-Net, which also integrates channel attention and spatial attention. Unlike CBAM, which applies channel attention and spatial attention one after the other, DA-Net applies the two in parallel to capture global feature dependencies in the spatial and channel dimensions, using a spatial attention module to learn the spatial interdependencies of features and a channel attention module to model channel interdependencies.

No-Reference Image Quality Assessment
No-reference image quality assessment means that no reference image is required, and the quality is assessed only based on the distorted image's characteristics, which is also more in line with practical needs, as reference images are difficult to obtain or do not exist in practical applications. NR-IQA methods can be divided into two categories, distortion type specific IQA methods [29][30][31] and generic IQA methods [8][9][10]. Distortion type specific IQA methods are designed based on specific distortion types, such as noise, JPEG compression artifacts, blurred artifacts, and other distortion types, and these methods design specific feature extraction methods by looking at the histogram of the pixel distribution of the image after distortion, after which the quality prediction score of the image is obtained. However, this method is limited in that it can only detect quality losses caused by specific distortions and is therefore not widely used. The generic NR-IQA method is more effective because the image distortion type is usually unknown in advance.
As shown in Figure 4, most traditional NR-IQA methods are based on natural scene statistics (NSS): they first manually extract features from distorted images and then use probabilistic or regression models to predict the quality of the distorted images. In early NSS-based NR-IQA methods, features are extracted from transform domains such as the wavelet or cosine transform domain. Moorthy et al. [32,33] proposed a class of two-stage image quality evaluation methods that first identify the distortion type and then perform distortion-specific quality assessment. Examples are the Blind Image Quality Index (BIQI) [32] and the Distortion Identification-based Image Verity and Integrity Evaluation (DIIVINE) algorithm [33]. Both train a support vector machine (SVM) to obtain a classifier for the image distortion type, then extract image features and rely on a support vector regressor to regress the quality prediction scores for each distortion type. DIIVINE improves the feature extraction process of BIQI: it uses NSS to estimate the distribution of wavelet coefficients and extracts global features to determine the image quality score. Saad et al. proposed a blind image integrity notator based on discrete cosine transform (DCT) statistics (BLIINDS) [34], which extracts features from the DCT domain and then uses a multivariate Gaussian model to obtain the quality scores of distorted images. Saad et al. later proposed an optimized version, BLIINDS-II [35], which extracts more complex DCT features, uses a generalized Gaussian model to fit the distributions of multiscale DCT coefficients as frequency domain statistics, and uses a Bayesian model to predict the quality score; however, it is more time-consuming because of the transformation of the image domain. To avoid the domain transformation, methods based on spatial domain features have emerged. Mittal et al. [8] proposed a spatial domain no-reference image quality assessment algorithm (BRISQUE), which uses the local mean and variance of the image to calculate locally normalized luminance and then uses a generalized Gaussian model to model the locally normalized luminance distribution as its spatial domain natural statistical features to obtain the prediction score. Ye et al. [36] proposed a codebook-based NR-IQA algorithm with manual feature extraction, which uses K-means clustering to learn codebooks directly from training image blocks; the codebooks are then used to encode test images to obtain their features, and finally support vector regression (SVR) is used to predict the quality scores of distorted images. Zhang et al. [37] used these features to extract salient regions of semantic objects for quality estimation. Xu et al. [9] improved the feature set and predicted the quality score by merging the higher-order statistical information of the images. However, these manual feature extraction methods require specialized design and are very time-consuming. In addition, scene statistical features characterize image quality from a global perspective and thus cannot measure the local distortions that are common in authentically distorted images.
Inspired by the breakthroughs in deep learning for other vision tasks [11][12][13][21], researchers have proposed several learning-based methods for image quality assessment that extract quality-related image features and automatically learn the model parameters through deep learning. These methods obtain better results than traditional manual feature extraction.

Proposed Method
In this section, we describe in detail the architecture of our proposed Conv-Former model and outline the specific roles each block plays. First, we describe the overview of the Conv-Former block. On this basis, the key modules of the algorithm are described, such as the local information-awareness module and the adaptive position embedding.

Overview
An overview of the model is depicted in Figure 5. The processing of the input image can be divided into three stages; let the output features of the stages be $F_1$, $F_2$, and $F_3$, respectively. Each feature consists of spatial tokens $[X^1_{spatial}, \cdots, X^3_{spatial}]$ and a classification token $[X^1_{class}, \cdots, X^3_{class}]$ and can be expressed by Equation (4). In this paper, the feature channel numbers $D_1$, $D_2$, and $D_3$ are set to 192, 384, and 512, respectively. First, we map the input image $X \in \mathbb{R}^{H \times W \times C}$ to the feature map $F_1 \in \mathbb{R}^{\frac{H}{p} \times \frac{W}{p} \times D_1}$ by convolutional feature extraction, where $(H, W)$ is the resolution of the original image, $C$ is the number of channels, $p$ is the factor by which the convolution operation reduces the resolution, $D_1$ is the dimension of the feature map $F_1$, and $N = HW/p^2$ is the resulting number of tokens, which serves as the effective input sequence length for the transformer. The adaptive position embedding described in Section 3.2 is added to the patch embeddings to retain positional information. As shown in Equation (6), the process from feature $F_1$ to $F_2$ is similar to the process from $F_2$ to $F_3$: each stage has three layers of transformer modules, with a dual path pooling (DPP) module for 2× downsampling in between, which reduces the resolution of the features while increasing the number of feature channels. After feature $F_3$ is obtained, the classification token containing the image quality information is fed alone into the MLP to obtain the final image quality assessment score.
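The way token counts shrink across the three stages can be made concrete with a small sketch. The function name `stage_shapes`, the example resolution, and the value `p = 16` are hypothetical choices for illustration; the sketch also assumes that the spatial size is exactly divisible at each step and that one classification token is appended per stage.

```python
def stage_shapes(height, width, p, dims=(192, 384, 512)):
    """List the feature-map size and token count of each stage.

    Assumes the stem reduces the resolution by a factor p and that each
    pooling step between stages halves both spatial dimensions.
    """
    h, w = height // p, width // p
    shapes = []
    for channels in dims:
        # h * w spatial tokens plus one classification token.
        shapes.append({"h": h, "w": w, "channels": channels, "tokens": h * w + 1})
        h, w = h // 2, w // 2
    return shapes

# Hypothetical 384 x 512 input with an assumed reduction factor p = 16.
shapes = stage_shapes(384, 512, p=16)
```

Because self-attention cost grows with the square of the token count, the 2× downsampling between stages reduces the attention cost of later stages substantially while the channel width grows.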

Adaptive 2D Position Embedding
In CNN-based image quality evaluation models, the input images need to be resized or cropped to a fixed shape for batch training. However, this pre-processing changes the aspect ratio and composition of the image, which affects the image quality. In a transformer-based network, by contrast, an appropriate treatment of the position encoding makes it possible to input images of any resolution. No pre-processing of the input image is then required, which is in line with the human visual system.
Position encoding is an integral part of the transformer architecture: it explicitly models the position of each token and improves the representational power of the model. Its effectiveness has been well demonstrated in the field of natural language processing [15,16,18]. Since images can be considered two-dimensional sequences, the one-dimensional position encoding needs to be extended to a two-dimensional position encoding that works regardless of the input image size. The method described in this section can effectively provide the position information required for object localization.
The specific implementation steps are as follows. Suppose the size of the input feature map is $h \times w \times c$. We define a learnable parameter matrix $L \in \mathbb{R}^{s \times s}$, where the matrix size $s$ is a hyperparameter, set to 10 in this paper. We obtain the position code $A \in \mathbb{R}^{h \times w}$ by adaptively rescaling the learnable matrix so that the size of the position code is consistent with the feature map, as shown in Figure 6. Let $(h_i, w_j)$ be a point on the position code matrix $A$; the corresponding position code at that point can then be determined by the following equation:

$$A(h_i, w_j) = L\left(\mathrm{Round}\left(\frac{h_i \cdot s}{h}\right),\ \mathrm{Round}\left(\frac{w_j \cdot s}{w}\right)\right)$$

where $\mathrm{Round}(\cdot)$ stands for rounding the floating point number inside the brackets, and $A(\cdot, \cdot)$ and $L(\cdot, \cdot)$ represent the coded values at the corresponding positions.
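The rounding-based lookup described above can be sketched in a few lines of Python. This is an illustrative reading of the text rather than the authors' code: it assumes zero-based indices, a source index computed as the rounded scaled coordinate, and clamping so that the index stays inside the table. A fixed table of numbers stands in for the learnable matrix, and the function name `adaptive_position_code` is hypothetical.

```python
def adaptive_position_code(table, h, w):
    """Resize an s x s table to an h x w position code by rounded index lookup."""
    s = len(table)

    def source_index(i, n):
        # Scale target index i in [0, n) to the table and clamp to [0, s - 1].
        # Note: Python's round() rounds halves to the nearest even integer.
        return min(s - 1, round(i * s / n))

    return [[table[source_index(i, h)][source_index(j, w)] for j in range(w)]
            for i in range(h)]

# Stand-in for the learnable matrix with s = 10 (values encode row and column).
table = [[r * 10 + c for c in range(10)] for r in range(10)]
code = adaptive_position_code(table, 5, 20)
```

The same table therefore serves feature maps of any height and width, which is what allows images of different resolutions to share one set of position parameters.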

Transformer Block
Local features can be captured in a CNN by convolutional operations. Although global features can also be captured by continuously deepening the neural network, they suffer a significant loss in the process. In the transformer, the long-range dependencies between tokens are captured by Self-Attention and a multi-layer perceptron (MLP) architecture, but such architectures ignore local detail features. As shown in Figure 7, in order to combine the advantages of local features and global representations and thus improve the performance of the transformer network, we designed a novel local information perception (LIP) module to generate QKV, which improves the discriminability between background and foreground. As shown in Equation (7), the input tensor $X \in \mathbb{R}^{h \times w \times d}$ is projected into the query tensor $Q \in \mathbb{R}^{L \times d}$, the key tensor $K \in \mathbb{R}^{L \times d}$, and the value tensor $V \in \mathbb{R}^{L \times d}$, where $d$ is the dimension of each token and $L = h \times w + 1$ is the number of tokens.
where T cls stands for classification token.

Dual Path Pooling
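For the feature downsampling at the end of each stage, we designed a dual-path pooling (DPP) layer, as shown in Figure 5. It consists of two branches: one is a 3 × 3 depthwise convolution with a stride of two; the other is a pooling layer followed by a 1 × 1 convolution. Each branch downsamples the features by a factor of two. During feature downsampling, the features on both paths are fused by channel stacking to retain more contextual information. Experimental results show that DPP performs better than a direct maximum pooling layer. In equation terms, this can be described as follows:

$$F_{out} = \mathrm{Concat}\left(\mathrm{DWConv}_{3 \times 3}^{stride=2}(F_{in}),\ \mathrm{Conv}_{1 \times 1}(\mathrm{Pool}(F_{in}))\right)$$

where $F_{in}$ and $F_{out}$ denote the input and output features of the DPP layer and $\mathrm{Concat}(\cdot)$ denotes stacking along the channel dimension.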
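The two-branch structure can be illustrated on a single-channel feature map with a pure-Python sketch. Several details here are assumptions that the text does not specify: zero padding of one pixel for the strided convolution, 2 × 2 average pooling as the pooling layer, even height and width, and a 1 × 1 convolution that reduces to a scalar multiplication because only one channel is used. All function names and values are hypothetical.

```python
def conv3x3_stride2(x, kernel):
    """3x3 convolution with stride 2 and zero padding 1 on one channel."""
    h, w = len(x), len(x[0])

    def at(i, j):
        # Zero padding outside the feature map.
        return x[i][j] if 0 <= i < h and 0 <= j < w else 0.0

    return [[sum(kernel[a][b] * at(i + a - 1, j + b - 1)
                 for a in range(3) for b in range(3))
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]

def avg_pool2x2(x):
    """2x2 average pooling with stride 2 (assumed pooling type)."""
    h, w = len(x), len(x[0])
    return [[(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4.0
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]

def dual_path_pool(x, kernel, scale):
    """Return the two downsampled paths stacked as two output channels."""
    path_a = conv3x3_stride2(x, kernel)
    # On a single channel, a 1x1 convolution is a multiplication by one weight.
    path_b = [[scale * v for v in row] for row in avg_pool2x2(x)]
    return [path_a, path_b]

# Toy 4 x 4 feature map and a kernel that simply picks the centre pixel.
feature = [[float(4 * i + j) for j in range(4)] for i in range(4)]
center_kernel = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
stacked = dual_path_pool(feature, center_kernel, scale=1.0)
```

Both paths halve the spatial size, and stacking them doubles the channel count, which matches the behaviour described for the transitions between stages.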

Datasets
In this work, five widely used datasets in the field of image quality assessment were used, which can be split into authentic datasets and synthetic datasets based on the method of obtaining distorted images. The synthetic datasets include LIVE [38], TID2013 [39], and CSIQ [40]. The authentic distortion image dataset includes the LIVE Challenge (LIVEC) [41] and KonIQ-10k [42] datasets. A detailed description of them is given in Table 1.
The University of Texas at Austin's Image and Video Engineering Laboratory established the LIVE image quality assessment dataset [38] in 2006. It consists of 779 distorted images generated from 29 source images using five different forms of distortion (JP2K compression, JPEG compression, additive white Gaussian noise, Gaussian blur, and simulated fast-fading Rayleigh channel). The scores are expressed as the Differential Mean Opinion Score (DMOS), i.e., the difference between the human evaluation scores of the reference image and the distorted image, with lower values indicating higher visual quality. The TID2013 dataset is an extension of the TID2008 dataset [43] and contains 3000 distorted images based on 25 reference images, with 24 different distortion types and five distortion levels. The distortion categories include additive Gaussian noise, impulse noise, chromatic aberrations, and so on. Mean Opinion Score (MOS) values in the range [0, 9] are employed; the higher the value, the greater the visual quality. Because the TID2013 dataset contains more types of distortion, it places more demands on the algorithm, and many traditional methods cannot be used effectively. In Figure 8, we compare the attention maps of the three different network architectures. The Computational Perception and Image Quality Lab at Oklahoma State University created the CSIQ dataset [40], which contains 30 original images and 866 images distorted by JPEG compression, JP2K compression, Gaussian blur, Gaussian white noise, Gaussian pink noise, or contrast variation, with four or five levels of each distortion type. The images are 512 × 512 in size. The DMOS values are in the [0, 1] range, with lower values indicating greater visual quality. We show a selection of images from the dataset in Figure 9.
LIVE Challenge [41] contains 1162 images taken in a variety of natural environments. The images exhibit complex distortions that depend on the skill of the photographer and the imaging equipment used, typically a combination of overexposure or underexposure, blur, grain, or compression. The MOS values range over [0, 100], and a higher value indicates better quality. We show a selection of images from the dataset in Figure 10. The KonIQ-10k dataset consists of 10,073 images selected from the large public multimedia database YFCC100m [44]. The sampled images cover as wide and uniform a quality distribution as possible in terms of brightness, colour, contrast, and sharpness, and the types of distortion present in these images include noise, JPEG artifacts, blending, lens motion blur, over-sharpening, and so on. The researchers conducted a large-scale crowdsourcing experiment based on the collected dataset, receiving 1.2 million assessments from 1467 observers, and used statistical approaches such as taking the mean and deleting extreme scores to determine the final MOS values. The images are 1024 × 768 in size. MOS values are in the [0, 5] range, with higher values indicating less distortion.

Evaluation Metrics
In order to quantitatively compare the performance of IQA algorithms, researchers often use the following three evaluation criteria.
(1) Spearman rank-order correlation coefficient (SRCC). SRCC measures the monotonicity of the IQA algorithm's predictions and is calculated as follows:
$$\mathrm{SRCC} = 1 - \frac{6\sum_{i=1}^{I} d_i^2}{I\left(I^2 - 1\right)}$$
where $d_i$ denotes the difference between the subjective quality score rank and the objective quality score rank of the $i$-th image, and $I$ denotes the number of images in the test set.
(2) Pearson linear correlation coefficient (PLCC). PLCC assesses the accuracy and degree of linear correlation of the IQA model's predictions and is calculated as follows:
$$\mathrm{PLCC} = \frac{\sum_{i=1}^{I}\left(q_i - q_m\right)\left(\hat{q}_i - \hat{q}_m\right)}{\sqrt{\sum_{i=1}^{I}\left(q_i - q_m\right)^2}\,\sqrt{\sum_{i=1}^{I}\left(\hat{q}_i - \hat{q}_m\right)^2}}$$
where $q_i$ and $\hat{q}_i$ denote the MOS value and the algorithm's predicted score of the $i$-th image, respectively, and $q_m$ and $\hat{q}_m$ denote the mean MOS value and the mean predicted score over the test images, respectively.
(3) Root mean square error (RMSE). RMSE assesses the consistency of the IQA model's predictions. It measures the absolute error between the algorithm's predicted score and the subjective evaluation score and is calculated as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{I}\sum_{i=1}^{I}\left(q_i - \hat{q}_i\right)^2}$$
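The three criteria can be illustrated with a minimal pure-Python sketch. This is not the authors' evaluation code; the function names are hypothetical, and SRCC is computed from rank differences, which is exact only when there are no tied scores (tied values receive their average rank here).

```python
import math


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result


def srcc(mos, pred):
    """Spearman rank-order correlation from squared rank differences d_i."""
    n = len(mos)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(mos), ranks(pred)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def plcc(mos, pred):
    """Pearson linear correlation between subjective and predicted scores."""
    n = len(mos)
    q_mean = sum(mos) / n
    p_mean = sum(pred) / n
    cov = sum((q - q_mean) * (p - p_mean) for q, p in zip(mos, pred))
    var_q = sum((q - q_mean) ** 2 for q in mos)
    var_p = sum((p - p_mean) ** 2 for p in pred)
    return cov / math.sqrt(var_q * var_p)


def rmse(mos, pred):
    """Root mean square error between subjective and predicted scores."""
    return math.sqrt(sum((q - p) ** 2 for q, p in zip(mos, pred)) / len(mos))


if __name__ == "__main__":
    mos = [1.0, 2.0, 3.0, 4.0]
    pred = [1.0, 3.0, 2.0, 4.0]
    print(srcc(mos, pred), plcc(mos, pred), rmse(mos, pred))
```

SRCC depends only on the ordering of the scores, whereas PLCC and RMSE depend on their values, which is why a nonlinear but monotonic predictor can score well on SRCC and poorly on the other two.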

Implementation Details
In the experiments, for each dataset, 80% of the images were randomly selected for training and 20% for testing. Training was conducted using an SGD optimizer with a batch size of eight. We trained our models with an initial learning rate of 0.001 and a warm-up cosine learning rate decay scheduler. We adopted the MSE loss for training:
$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N}\sum_{n=1}^{N}\left(\mathrm{model}\left(I_{\mathrm{Dist},n}\right) - s_n\right)^2$$
where $\mathrm{model}(I_{\mathrm{Dist}})$ denotes the output of the proposed Conv-Former for a distorted image $I_{\mathrm{Dist}}$, $s$ denotes the ground-truth normalized MOS or DMOS value, and $N$ is the number of images in a batch. We implemented Conv-Former in PyTorch 1.12.0 with Python 3.9 and trained it on a single NVIDIA GeForce RTX 3090 GPU. The CUDA and cuDNN versions were 11.6 and 8.4.0, respectively.
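The training setup described above can be sketched in framework-free Python. The 80/20 split, the base learning rate of 0.001, and the MSE loss come from the text. The linear warm-up shape, the warm-up length, the total step count, the random seed, and all function names are assumptions made for illustration, not details reported by the authors.

```python
import math
import random


def split_indices(num_images, train_fraction=0.8, seed=0):
    """Randomly split image indices into train and test subsets (80/20 by default)."""
    indices = list(range(num_images))
    random.Random(seed).shuffle(indices)  # seed value is an assumption
    cut = int(round(num_images * train_fraction))
    return indices[:cut], indices[cut:]


def warmup_cosine_lr(step, total_steps, warmup_steps, base_lr=0.001):
    """Linear warm-up to base_lr, then cosine decay to zero (assumed schedule shape)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))


def mse_loss(predictions, targets):
    """Mean squared error between predicted scores and normalized MOS/DMOS targets."""
    return sum((p - s) ** 2 for p, s in zip(predictions, targets)) / len(predictions)


if __name__ == "__main__":
    train_idx, test_idx = split_indices(1162)  # e.g. the LIVE Challenge image count
    print(len(train_idx), len(test_idx))
    print([round(warmup_cosine_lr(s, 100, 10), 6) for s in (0, 9, 10, 55, 100)])
    print(mse_loss([0.5, 1.0], [0.0, 1.0]))
```

In a PyTorch training loop the same schedule would typically be applied by setting the optimizer's learning rate at every step; it is written as a plain function here so that its shape can be inspected without a GPU or framework.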

Comparison with the State of the Art (SOTA)
We assessed the performance of our model with PLCC and SRCC. PLCC assesses the linear correlation between ground truth and the predicted quality scores, whereas SRCC describes the level of monotonic correlation.
We evaluated the effectiveness of Conv-Former on five benchmark datasets, following the experimental setup above in all tests. As shown in Table 2, Conv-Former outperforms or is competitive with 14 NR-IQA methods: BRISQUE, NIQE, DIIVINE, HOSA, WaDIQaM, BIECON, SFA, PQR, DBCNN, SHN, RankIQA, ResNet-ft, TRIQ, and MUSIQ. We found that our method achieves the best PLCC/SRCC results in comparison with other works. On the more complex TID2013 dataset in particular, our model achieved a solid improvement over previous work. Conv-Former achieves 0.965 on PLCC and 0.964 on SRCC, which indicates that its predictions are consistent with human judgments. The effective fusion of CNN and ViT features and the proposed multiscale prediction module make our method substantially superior to other transformer-based image quality assessment networks such as TRIQ and MUSIQ. Although the model achieved better results on more complex datasets, there was no major improvement on datasets where most algorithms already perform well, such as LIVE. To visualise the advantages of Conv-Former, we present the data in Table 2 as a histogram in Figure 11.
To evaluate the effectiveness of Conv-Former on different types of distorted images, we also compare the PLCC/SRCC performance on the five distortion types in the LIVE dataset (JP2K compression, JPEG compression, additive white Gaussian noise, Gaussian blur, and simulated fast-fading Rayleigh channel) and the six distortion types in the CSIQ dataset (JPEG compression, JP2K compression, Gaussian blur, Gaussian white noise, Gaussian pink noise, and contrast variation). Tables 3 and 4 show the results of different IQA methods on each distortion type. Our method achieves the best or highly competitive performance on all distortion types compared with the other IQA methods.
To further investigate the effectiveness of the proposed Conv-Former, we show scatter plots of the MOS against the predicted scores and analyze their correlation. Figures 12 and 13 show the scatter plots of different IQA methods on the CSIQ and LIVEC datasets, respectively. The red points denote the test instances. The blue line is the ideal linear relationship between the MOS and the predicted score. The green line is the curve fitted to the test instances. All values are normalized to the range of −5 to 5 for a better view. Figures 12 and 13 show that the results of Conv-Former are more in line with the MOS. Compared with WaDIQaM and BIECON, Conv-Former has fewer outliers and is more consistent with the MOS.
To evaluate the generalization of the proposed Conv-Former, we conduct a cross-dataset evaluation on LIVE, CSIQ, TID2013, and LIVEC. We train the model on one dataset at a time and then test it on the full set of each of the other three benchmark datasets. As shown in Table 5, Conv-Former achieves good generalization ability.
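The cross-dataset protocol described above (train on one dataset, test on the full set of each of the others) can be sketched as a simple loop. The function `cross_dataset_evaluation` and the `train_fn` and `eval_fn` callables are hypothetical placeholders for a real training routine and a metric such as SRCC; they are not part of the authors' code.

```python
def cross_dataset_evaluation(datasets, train_fn, eval_fn):
    """Train on each dataset in turn and evaluate on every other dataset.

    datasets: dict mapping dataset name -> data object
    train_fn: callable(data) -> model            (hypothetical)
    eval_fn:  callable(model, data) -> metric    (hypothetical, e.g. SRCC)
    Returns a dict keyed by (train_name, test_name).
    """
    results = {}
    for train_name, train_data in datasets.items():
        model = train_fn(train_data)
        for test_name, test_data in datasets.items():
            if test_name != train_name:  # never test on the training dataset
                results[(train_name, test_name)] = eval_fn(model, test_data)
    return results


if __name__ == "__main__":
    # Dummy stand-ins so the loop can run without real data or a real model.
    dummy = {"LIVE": 1, "CSIQ": 2, "TID2013": 3, "LIVEC": 4}
    table = cross_dataset_evaluation(dummy, lambda d: d, lambda m, d: m * 10 + d)
    for pair, value in sorted(table.items()):
        print(pair, value)
```

With four datasets this yields twelve train/test pairs, which is the layout a cross-dataset results table would normally follow.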

Ablation Studies
In this section, we conduct ablation studies to evaluate the contribution of each proposed component. With different configurations and implementation strategies, we evaluate the effect of each of the four major components: the multi-scale transformer architecture (MS), adaptive position embedding (APE), dual path pooling (DPP), and the local information perception (LIP) module. We conduct ablation experiments on the LIVEC, KonIQ, CSIQ, and TID2013 databases. The results are shown in Table 6. We first examine the effectiveness of the proposed APE module. The results show a significant improvement in SRCC and PLCC when position embedding is used: on KonIQ, the SRCC and PLCC increased by 2.8% and 2.2%, respectively. This demonstrates the critical importance of position embedding for Conv-Former. All subsequent ablation experiments were carried out with position embedding enabled. Table 6 shows that the best performance is achieved when all three remaining components (MS, DPP, and LIP) are present, and that removing any one of them degrades the objective performance metrics.

Analysis and Discussion
To further validate the effectiveness of the proposed method and analyze why the networks differ in performance, we compare the attention maps of the three different network architectures in Figure 8.
As the attention maps show, the proposed Conv-Former tends to focus on the regions that most affect the quality score, so the predicted scores are highly consistent with subjective human assessment. For example, people tend to assess the quality of an image based on the target when it is set against a plain background, and some images are deliberately blurred to highlight the target, so people focus on the sharp parts when assessing quality. This behavior aligns with our intuition and leads to superior results compared with other methods. Compared with the traditional CNN-based approach, the transformer-based approach tends to activate more significant regions rather than local areas, implying enhanced long-range feature dependency. Because the local information perception (LIP) module supplies detailed local features, Conv-Former retains important local details that the vision transformer often degrades. In addition, the attended target areas are more complete within the context of the significant region, meaning that Conv-Former learns feature representations with higher discriminative power.
On the other hand, humans perceive image quality differently depending on the image content. Conv-Former identifies the image content well during quality evaluation and attempts to understand the image before predicting its score, which is more in line with how humans perceive the objective world.

Conclusions
In this paper, we propose Conv-Former, a novel network combining convolution and self-attention, for the no-reference image quality assessment task. Our model obtains global features through the transformer and local features through local information perception (LIP). We evaluated the effectiveness of Conv-Former on five benchmark datasets and found that our method achieves the best PLCC/SRCC results compared with other works. On the more complex TID2013 dataset in particular, our model achieved a solid improvement over previous work. Conv-Former achieves 0.965 on PLCC and 0.964 on SRCC, which indicates that its predictions are consistent with human judgments. Experimental results showed that the proposed approach outperforms state-of-the-art (SOTA) methods on IQA databases and has strong generalization ability, which opens prospects for the broader application of IQA. Future work will focus on developing more generic IQA models in which a single model can adapt to diverse image content and imaging devices. The model should be trained on a sufficiently large dataset so that it generalizes better and obtains excellent results on images it has not been trained on. Model size is also a key factor for deployment in practical applications, and we will continue to optimize the model so that it has fewer parameters and runs faster.