Client-Oriented Blind Quality Metric for High Dynamic Range Stereoscopic Omnidirectional Vision Systems

A high dynamic range (HDR) stereoscopic omnidirectional vision system can provide users with more realistic binocular and immersive perception, but the HDR stereoscopic omnidirectional image (HSOI) suffers distortions during encoding and visualization, making its quality evaluation more challenging. To solve this problem, this paper proposes a client-oriented blind HSOI quality metric based on visual perception. The proposed metric mainly consists of a monocular perception module (MPM) and a binocular perception module (BPM), which together combine monocular/binocular, omnidirectional and HDR/tone-mapping perception. The MPM extracts features from three aspects: global color distortion, symmetric/asymmetric distortion and scene distortion. In the BPM, the binocular fusion map and binocular difference map are generated by joint image filtering. Brightness segmentation is then performed on the binocular fusion map, and distinctive features are extracted from the segmented high/low/middle brightness regions. For the binocular difference map, natural scene statistical features are extracted from multi-coefficient derivative maps. Finally, feature screening is used to remove the redundancy among the extracted features. Experimental results on the HSOID database show that the proposed metric generally outperforms representative quality metrics and is more consistent with subjective perception.


Introduction
Virtual reality (VR) technologies can provide users with a unique immersive experience through their high resolution, high reproducibility, and full-field viewing [1][2][3]. As one of the most important carriers of VR systems, stereoscopic omnidirectional visual signals can provide users with 360° × 180° field of view (FoV) immersion, binocular perception and view interaction [4,5]. In stereoscopic omnidirectional imaging with a 360° × 180° FoV, the illumination intensity of the scene usually varies greatly. Thus, high dynamic range (HDR) stereoscopic omnidirectional vision (HSOV) systems, combining HDR imaging [6,7] and stereoscopic omnidirectional imaging technologies, can better describe the real information of a scene. In such a system, an HDR stereoscopic omnidirectional image (HSOI) may suffer distortion during its generation, encoding/transmission and visualization with a head-mounted display (HMD), which results in degradation of HSOI quality. Therefore, it is challenging to establish efficient blind HSOI quality metrics.
Generally, research related to stereoscopic omnidirectional image quality assessment (SOIQA) has evolved through 2D image quality assessment (2D-IQA), stereoscopic image quality assessment (SIQA) and omnidirectional image quality assessment (OIQA). Visual content quality metrics can be divided into full-reference, reduced-reference and blind/no-reference metrics, according to their usage of reference image information.
For SIQA, Zhang et al. [18] proposed a full-reference SIQA metric with multiscale perceptual features and genetic algorithm-trained support vector machine regression. Jiang et al. [19] proposed an SIQA metric by learning non-negative matrix factorization-based monocular perception and binocular interaction-based color visual features. Liu et al. [20] used a spatial activity model for weighting a cyclopean image of a stereoscopic image pair, and extracted the corresponding features to form an S3D integrated quality (SINQ) metric. Chen et al. [21] considered binocular perception and disparity information, and applied cyclopean images to design an SIQA metric. Li et al. [22] established a two-channel convolutional neural network (CNN) to simulate binocular fusion and binocular competition for SIQA. Meng et al. [23] combined a visual intersection model, multiscale information fusion model and attention-simulated binocular fusion model to design an SIQA metric.
The above 2D-IQA and SIQA metrics are designed for traditional 2D/3D images, which do not take visual perception of omnidirectional images (OIs) and stereoscopic omnidirectional image (SOI) quality assessment into account. For OIQA, initially, some full-reference OIQA metrics based on PSNR and SSIM were proposed. After that, starting from representations of the OI, Zheng et al. [24] proposed a segmented spherical projection-based blind OIQA metric (called SSP-OIQA), in which the bipolar and equatorial regions of the OI are obtained by the segmented spherical projection, and different feature extraction schemes are designed for evaluating the distorted OI. Jiang et al. [3] proposed a perception-driven blind OIQA framework based on cubemap projection (CMP). Considering the HMD viewport viewing mode of OIs and the effectiveness of deep learning in visual computing, Sun et al. [25] designed a multi-channel CNN for blind OIQA, which uses six parallel ResNet-34 networks to process viewport images and a quality regression to predict the quality score of distorted OIs. Li et al. [26] proposed an OI-oriented attentive deep stitching method and presented an attention-driven OIQA metric with global and local measures. Fu et al. [27] designed an adaptive hypergraph convolution network for OIQA, which consists of a multi-level viewport descriptor and models viewport interaction through a hypergraph.
For SOIQA, Qi et al. [28] considered perception factors such as the viewport, user behavior and binocular perception, and proposed a viewport perception-based blind SOIQA metric, which mainly includes a binocular perception model and an omnidirectional perception model. Zhou et al. [29] proposed an SOIQA metric based on projection-invariant features and visual saliency; they combined the visual saliency model of chrominance and contrast perception factors to improve the prediction accuracy. Xu et al. established a stereoscopic omnidirectional image database (named SOLID) and proposed a multiple viewports-based full-reference SOIQA metric [30], in which they used the difference map between left and right views to estimate depth perception related features. Chen et al. [4] further proposed a full-reference SOIQA metric based on predictive coding theory. In [31], deep learning was used to design an SOIQA metric.
To visualize HDR images on displays with standard dynamic range (SDR), an efficient approach is to perform tone-mapping (TM) on the HDR images, but this may result in a corresponding degradation of visual quality. Regarding this issue, Gu et al. [32] designed a blind tone-mapped quality index (BTMQI), which combined information entropy, structure and natural scene statistics for quality evaluation. Jiang et al. [33] analyzed the texture and color distortion of different brightness regions in tone-mapped HDR images, and designed a blind tone-mapped image quality assessment (BTMIQA) metric that considers the details of the bright and dark regions, as well as naturalness and aesthetics features, for quality evaluation. Fang et al. [34] proposed a tone-mapped HDR image quality metric based on gradient and color difference statistics, which used the sensitivity of human eyes to image structure changes to measure image degradation, and used the local binary pattern to describe color distortion. Yue et al. [35] presented a tone-mapped HDR image quality metric that extracts three quality-sensitive features, namely color, naturalness and structure. Zhao et al. [36] extracted features of a tone-mapped HDR image from the pixel domain, sharpness and chromaticity to predict the quality of the tone-mapped HDR image.
HSOI quality assessment (HSOIQA) involves not only binocular perception and OI perception, but also HDR/TM perception. To date, HSOIQA has remained an unstudied and challenging issue. To address it, this paper proposes a client-oriented blind HSOIQA metric based on visual perception, which includes two main modules: a monocular perception module (MPM) and a binocular perception module (BPM). In the MPM, the global color distortion, symmetric/asymmetric distortion and scene distortion are characterized. In the BPM, new feature extraction schemes for a binocular fusion map and a binocular difference map based on joint image filtering are designed. The corresponding features are extracted by brightness segmentation of the binocular fusion map, and natural statistical features are extracted from the binocular difference map. All viewport image-based features are aggregated according to their significance, feature screening is performed, and an objective quality score of the HSOI is predicted. Experimental results show that the proposed metric outperforms representative blind quality metrics. The main contributions of this paper are as follows: (1) A client-oriented blind HSOIQA metric based on visual perception is established for the client's distorted HSOI in the HSOV system, which combines binocular perception, OI perception and HDR/TM perception. (2) New feature extraction schemes for the binocular fusion map and the binocular difference map based on joint image filtering are designed to efficiently evaluate the quality of distorted HSOIs. (3) In feature extraction, the information expression and perception capabilities of HSOIs at different resolutions are further explored with multiscale computing methods.
The rest of this paper is arranged as follows. Section 2 describes the proposed metric for HSOIs at the client of the HSOV system. Section 3 gives experimental results and analyses. Finally, Section 4 concludes the paper.

The Proposed Metric
This section states the research ideas from the perspective of visual perception and gives an overview of the proposed metric; then, the proposed metric is described in detail.

Overview of the Proposed Metric
Generally, the HSOV system consists of processes such as HSOI generation, encoding/decoding with JPEG XT/H.265 and visualization using an HMD with SDR. These processes may produce corresponding distortions, such as encoding distortion, TM distortion and mixed distortion, resulting in degradation of the quality of the user's visual experience.
The human binocular vision system has two visual pathways: the dorsal pathway and the ventral pathway [37][38][39]. The dorsal pathway starts from the primary visual cortex V1 area and flows through the V2, V3 and V5 areas; its function is to guide visual information toward action [38]. The ventral pathway starts from the V1 area and flows through the V2, V3 and V4 areas to complete the perception and recognition of visual behaviors [39]; it is also related to long-term memory. For visual content quality assessment, visual perception of distortion is extremely important, so the ventral pathway with the V1, V2, V3 and V4 areas has a certain guiding significance for perceptual quality assessment. In the visual cortex, there are two types of cells: simple cells and complex cells [40]. The simple cells process the retinal information of the left and right views received from the corresponding lateral geniculate nucleus. The complex cells connect the left and right view signals with the binocular signals. Specifically, the V1 area responds to simple features such as luminance, chromaticity, edges and spatial frequency, while the V2 area recognizes higher-level features, such as texture and shape, in addition to transmitting the lower-level features of the V1 area. The V3 and V4 areas belong to the occipital cortex and actually have little correlation with perceptual quality, but they can further encode complex image features. The perception process of the V1, V2, V3 and V4 areas provides a theoretical basis for the subsequent modeling of HSOIQA.
Here, based on the human visual system (HVS), the distorted HSOI is input as a visual stimulus of the HVS to simulate the left-view/right-view information processed by simple cells in the V1 area, so as to model the monocular perception of the HSOI. At the same time, the complex-cell processing is simulated to model the binocular perception effect, as shown in Figure 1. For the monocular perception modeling of the distorted HSOI, the symmetric/asymmetric encoding distortion characteristics, OI viewport characteristics and scene characteristics (outdoor/indoor/night scenes) of the distorted HSOI are combined to extract global color information, symmetric/asymmetric distortion perception and scene distortion perception features, respectively. For the binocular perception modeling of the distorted HSOI, the information transfer mechanism of the V1, V2, V3 and V4 areas is simulated. Combined with the ventral pathway, primary features such as brightness are first perceived in the V1 area; when the information from the V1 area is transferred to the V2 area, higher-level features are recognized. Thereafter, color information is perceived in the V4 area. In addition, when the user wears an HMD to browse the image content in the current viewport, the image content may guide the user's behavior in selecting the next viewport to be browsed. For example, when the image browsed in the current viewport is incomplete, the user is very likely to rotate their head to observe the next viewport and browse the complete image content. This process of actively selecting the viewport for browsing is completed in the V5 area. After the next viewport is selected, the above process is repeated until the user completes their viewing of the entire HSOI.
Based on the above analysis, for client-oriented HSOIQA in the HSOV system, we propose a visual perception-based blind HSOIQA metric, as shown in Figure 2. The proposed metric mainly includes viewport sampling, a monocular perception module (MPM), a binocular perception module (BPM), feature screening and quality regression. For the MPM, firstly, the HSOI is transformed from the equirectangular projection (ERP) format to the CMP format, and the global color features are extracted in the spatial and discrete cosine transform (DCT) domains. Secondly, considering the unique symmetric/asymmetric distortion of the HSOI, its distorted left and right views are measured by multiscale retinex (MSR) decomposition. Then, combined with a Laplacian pyramid decomposition model, the distortions of the different scenes (indoor/outdoor and day/night) are measured from the characteristics of contrast, detail and structure. For the BPM, based on joint image filtering, a binocular fusion map is generated to represent the similarity of the HSOI's left and right views; the brightness-based viewport image is further segmented to distinguish the perceptual characteristics of different brightness regions, which is consistent with the information transmission process of the V1, V2 and V4 areas in the human visual system. The calculated binocular difference map represents the difference information between the left and right views. Before quality regression, all the extracted feature vectors are processed by feature screening. Finally, quality regression with a random forest model is used to predict the objective quality score of the distorted HSOI.
Figure 2. The framework of the proposed HSOIQA metric for the client in the HSOV system.
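As a rough illustration of this pipeline, its control flow can be sketched as follows. Every feature extractor here is a hypothetical placeholder standing in for the MPM/BPM features detailed in the following subsections, and a trivial function stands in for the trained random forest regressor; this is a sketch of the architecture, not the authors' implementation.

```python
import numpy as np

def extract_mpm_features(left, right):
    # Placeholder for the global color, symmetric/asymmetric and
    # scene-distortion features of the monocular perception module.
    return np.array([left.mean(), left.std(), right.mean(), right.std()])

def extract_bpm_features(left, right):
    # Placeholder for the binocular fusion-map and difference-map features
    # of the binocular perception module.
    fusion = 0.5 * (left + right)
    diff = left - right
    return np.array([fusion.std(), np.abs(diff).mean()])

def predict_quality(left, right, screened_idx, regressor):
    # Aggregate features, keep the screened subset, then regress a score.
    feats = np.concatenate([extract_mpm_features(left, right),
                            extract_bpm_features(left, right)])
    return regressor(feats[screened_idx])

# Usage with a trivial stand-in for the trained random forest:
rng = np.random.default_rng(0)
vp_left, vp_right = rng.random((64, 64)), rng.random((64, 64))
score = predict_quality(vp_left, vp_right, np.arange(6),
                        lambda f: float(f.mean()))
```

In the actual metric, the per-viewport features are aggregated over all M + 2 viewports before screening and regression.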

Viewport Sampling
Let I HSOI be the HSOI signal output from the HSOV system to the user's HMD, which may suffer encoding distortion, TM distortion and mixed distortion, where I HSOI = {I L , I R }; I L and I R represent the left and right views of the HSOI, respectively. I HSOI can be represented in the ERP format, CMP format, spherical format or as viewport images.
At the client of the HSOV system, the user can actively select a viewport according to the content of the HSOI through an HMD with SDR. For HSOIQA, the HSOI can be divided into the equatorial region and bipolar regions from the perspective of the user's behavior. Let M be the number of viewports uniformly sampled in the equatorial region; the bipolar regions are sampled according to the binocular product saliency, with one viewport taken for each polar region; thus, the total number of viewports is M + 2.

Assuming that the vertical FoV of a viewport is ϕ, the equatorial region corresponds to the latitude range [−ϕ/2, ϕ/2], and the bipolar regions correspond to the latitude ranges (ϕ/2, 90°] and [−90°, −ϕ/2), respectively. Since the vertical FoV angle of the HMD device is 110°, ϕ is set to 110°. In the equatorial region, M viewports are uniformly sampled at equiangular intervals, with the angle set to 2π/M. For the bipolar regions, the position with the largest pixel value in the corresponding binocular product saliency map S LR is selected as the center of the viewport. S LR is obtained as follows: (1) for the left and right views I L and I R , their saliency maps S L and S R are computed by the method in [41], respectively; (2) S LR = {S LR (i,j)} is the correlation measure between S L and S R after stereoscopic matching, expressed as the pixel-wise product S_LR(i,j) = S_L(i,j) · S_R(i, j + d_{i,j}), where d_{i,j} denotes the disparity of S R relative to S L at the pixel position (i,j), calculated by the optical flow method in [42].
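The bipolar viewport selection can be sketched as follows, assuming the "binocular product" is a pixel-wise product of the left saliency map and the disparity-compensated right saliency map; the product form and the horizontal-shift convention are assumptions, and the saliency and optical-flow methods of [41,42] are not reproduced here.

```python
import numpy as np

def binocular_product_saliency(s_l, s_r, disp):
    # S_LR(i, j) = S_L(i, j) * S_R(i, j + d_ij): product of the left
    # saliency map and the disparity-shifted right saliency map.
    h, w = s_l.shape
    cols = np.clip(np.arange(w)[None, :] + disp, 0, w - 1).astype(int)
    rows = np.repeat(np.arange(h)[:, None], w, axis=1)
    return s_l * s_r[rows, cols]

def polar_viewport_center(s_lr):
    # The viewport center is the position of the largest value in S_LR.
    return np.unravel_index(np.argmax(s_lr), s_lr.shape)
```

One such center is computed per polar region, giving the two extra viewports beyond the M equatorial ones.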

Monocular Perception Module (MPM) for Distorted HSOI
This subsection extracts the monocular perceptual features of the distorted HSOI from three aspects: global color information, symmetric/asymmetric encoding distortion perception, and scene distortion perception. Among them, color information is a subjective overall perception, and all viewport images would first need to be reconstructed when viewed by users; therefore, the CMP format of I HSOI is used for global color feature extraction. For symmetric/asymmetric distortion perception and scene distortion perception, the perceptual features are extracted based on the viewport images.
(1) Global color feature extraction
As the output at the client of the HSOV system, the distorted HSOI may contain encoding distortion, TM distortion and mixed distortion. Compared with the ERP representation of I HSOI , its CMP representation more easily describes the monocular distortion of the HSOI. Furthermore, numerous studies have shown that color information is processed with color-opponency in the human visual system. Therefore, the global color features are extracted in the spatial domain and DCT domain, respectively, based on the color-opponency space.
According to the work of Hasler et al. [43], the color in an image is quantized to qualitatively process or code its visual impact. Here, taking the left view I L of I HSOI as an example, I L = (R L , G L , B L ), its CMP format can be expressed as I L = {I Li , i = 1, 2, . . . , 6}, where I Li is the i-th of the six faces in the CMP format. First, I L is converted from RGB space to the red-green and yellow-blue opponency channels, denoted as ρ Lrg and ρ Lyb , with ρ Lrg = R L − G L and ρ Lyb = 0.5(R L + G L ) − B L . Let µ Lrg and µ Lyb denote the means of ρ Lrg and ρ Lyb , and σ Lrg and σ Lyb their variances; then, two statistical features µ²_{Lrg−Lyb} and σ²_{Lrg−Lyb} of ρ Lrg and ρ Lyb are expressed as follows:
µ²_{Lrg−Lyb} = µ²_{Lrg} + µ²_{Lyb}, σ²_{Lrg−Lyb} = σ²_{Lrg} + σ²_{Lyb}
The most intuitive TM operators (TMOs) generally change the mean of the pixel value distribution and then the degree of numerical dispersion of the pixels. Therefore, the joint statistical measure J_{Lrg−Lyb} of ρ Lrg and ρ Lyb is used to describe the spatial color feature of I L , and is calculated as follows [43]:
J_{Lrg−Lyb} = sqrt(σ²_{Lrg−Lyb}) + 0.3 · sqrt(µ²_{Lrg−Lyb})
Similarly, for the right view I R of I HSOI , its joint statistical measure J_{Rrg−Ryb} can also be obtained. Thus, the global spatial color feature F CS of the distorted HSOI is defined as F CS = (J_{Lrg−Lyb}, J_{Rrg−Ryb}).
Then, the color features in the transform domain are extracted. Taking the CMP format of I L as an example, for its two color-opponency channels ρ Lrg and ρ Lyb , their non-overlapping N u × N v blocks are transformed with the DCT. Let {ξ L,k (u,v); u = 1, 2, . . . , N u , v = 1, 2, . . . , N v } denote the DCT coefficients of a block, where k indexes the color-opponency channel ρ Lrg or ρ Lyb ; here, N u and N v are set to 5. For {ξ L,k (u,v)}, the DC component is discarded, and the AC components are divided into three frequency bands: low frequency (LF), middle frequency (MF) and high frequency (HF), as shown in Figure 3.
The variance of each of the three frequency bands of each image block is calculated separately as the band energy feature, and the mean of each band's variances over all image blocks is taken as the final energy feature of the corresponding frequency band; then, the energy features of the six faces of the CMP format of I L are averaged. For the two color-opponency channels ρ Lrg and ρ Lyb of I L , 6-dimensional features can thus be extracted. Similarly, for ρ Rrg and ρ Ryb of I R , 6-dimensional energy features can also be obtained, which together constitute the 12-dimensional DCT-domain color features F CD of the HSOI.
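A minimal sketch of these global color features follows, assuming the joint measure J matches Hasler's colorfulness statistic from [43] and guessing a plausible zig-zag LF/MF/HF split of the 5 × 5 AC coefficients (the paper's exact Figure 3 partition may differ).

```python
import numpy as np

def opponent_channels(img):
    # Red-green and yellow-blue opponency channels, as defined in the text.
    r, g, b = (img[..., c].astype(float) for c in range(3))
    return r - g, 0.5 * (r + g) - b

def joint_color_statistic(rg, yb):
    # J = sqrt(sigma^2_rg + sigma^2_yb) + 0.3 * sqrt(mu^2_rg + mu^2_yb),
    # following Hasler's colorfulness measure (an assumption based on [43]).
    mu = np.hypot(rg.mean(), yb.mean())
    sigma = np.hypot(rg.std(), yb.std())
    return sigma + 0.3 * mu

def dct2(block):
    # Orthonormal 2D DCT-II of a square block.
    n = block.shape[0]
    k = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * j + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c @ block @ c.T

def band_energies(channel, n=5):
    # Variance of LF/MF/HF AC coefficients per n-by-n block, averaged over
    # blocks; band boundaries here are a guess at the Figure 3 split.
    h, w = channel.shape
    u = np.arange(n)[:, None] + np.arange(n)[None, :]
    masks = [(u >= 1) & (u <= 2), (u >= 3) & (u <= 4), (u >= 5)]  # LF/MF/HF
    sums, count = np.zeros(3), 0
    for y in range(0, h - n + 1, n):
        for x in range(0, w - n + 1, n):
            coef = dct2(channel[y:y + n, x:x + n])
            for b, m in enumerate(masks):
                sums[b] += coef[m].var()
            count += 1
    return sums / max(count, 1)
```

Running `joint_color_statistic` on both views gives F_CS, and `band_energies` on each opponency channel (averaged over the six CMP faces) gives the 12-dimensional F_CD.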
Finally, the global color features of the HSOI are denoted as F CE , F CE = (F CS , F CD ).
(2) Symmetric/asymmetric encoding distortion measure
Different from 2D image coding, a stereoscopic image can be encoded asymmetrically by using different quantization parameters (QPs) for its left and right views, so as to improve the encoding efficiency by exploiting the binocular masking effect of the human eyes. The distortion-level difference between the left and right views has a great impact on the user's quality of experience of the encoded stereoscopic image. Here, a correlation measure between the left and right views is designed to evaluate the information difference caused by different distortion levels of the left and right views.
For I L and I R , viewport sampling is first performed to obtain the corresponding left and right viewport image sets {V L,m } and {V R,m }, respectively, where m = 1, 2, . . . , M + 2. In the multi-resolution perception of the human visual system, as image resolution is gradually reduced from high to low, the focus of the eyes shifts from fine textures to coarse structures. MSR decomposition [44] is therefore used for image preprocessing in this work: the complementary information at different scales can effectively reveal image content that is not easy to detect at a single scale.
For a given image, the illumination component can be estimated by MSR decomposition. Taking V L,m as an example, V L,m = {V L,m (x, y)}, its illumination component Ψ L,m = {Ψ L,m (x, y)} at scale η can be calculated as follows:
Ψ^η_{L,m}(x, y) = g(x, y) ⊗ V_{L,m}(x, y)
where ⊗ is the convolution operation and g(x,y) is a Gaussian function, g(x,y) = N g exp(−(x² + y²)/η), with N g a normalization factor and η the scale parameter of the Gaussian function. When the value of η is large, the detail recovery is coarse; when it is small, the detail recovery is fine. Here, in order to reflect the multiscale characteristics, three scale factors with significant differences (small, medium and large) are used, i.e., η is set to one element of {η 1 , η 2 , η 3 }. MSR decomposition can be used to describe illumination features by filtering the image at three different scales and then performing weighted summation; here, the gray-scale images of the viewport's left and right views are directly processed to obtain the corresponding illumination components for each η (η ∈ {η 1 , η 2 , η 3 }), denoted as Ψ L,m = {Ψ^η_{L,m}} and Ψ R,m = {Ψ^η_{R,m}}, respectively. Figure 4 shows an example of the MSR decomposition of a distorted HSOI in the HDR stereoscopic omnidirectional image database (HSOID) [45] at the client of the HSOV system (here, η 1 = 25, η 2 = 100, η 3 = 240). The original HSOI at the server of the HSOV system is encoded with an asymmetric encoding distortion level of (L1, L3), i.e., the encoding distortion level of the left view is L1 and that of the right view is L3. To visualize the compressed HSOI on an HMD with SDR, DurandTMO [46] is used in TM processing. The following can be observed: (i) for Ψ L,m and Ψ R,m , the MSR decomposition with η 1 shows more details than those with η 2 and η 3 , especially in the window regions in Figure 4.
It indicates that the image's MSR decomposition with different η values contains different information, and the three-scale MSR decompositions can complement each other. (ii) Compared with the viewport's left view, the distortion level of the viewport's right view is lower and its block effect is weaker, so Ψ R,m has a clearer texture than Ψ L,m, especially in the ceiling and ground regions. It indicates that MSR decomposition can, to a certain extent, reflect the different distortion characteristics of left and right views with different distortion levels.
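As an illustrative sketch of the multi-scale illumination estimation described above, the following Python fragment blurs a gray-scale image with separable Gaussian kernels at the three scale parameters η ∈ {25, 100, 240}. The kernel normalization and truncation radius are implementation assumptions, not taken from the paper.

```python
import numpy as np

def gaussian_kernel(eta):
    """1-D kernel of the separable Gaussian g(x, y) = N_g * exp(-(x^2 + y^2)/eta);
    the ~3-sigma truncation radius is an implementation assumption."""
    radius = int(3.0 * np.sqrt(eta / 2.0)) + 1
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-x**2 / eta)
    return k / k.sum()

def msr_illumination(img, etas=(25, 100, 240)):
    """One illumination component per scale parameter eta, obtained by
    Gaussian filtering the gray-scale image (MSR-style decomposition)."""
    img = np.asarray(img, dtype=float)
    out = {}
    for eta in etas:
        k = gaussian_kernel(eta)
        # separable 2-D convolution: filter rows, then columns
        rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, img)
        out[eta] = np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, rows)
    return out
```

A larger η yields a smoother illumination estimate, which matches the coarse/fine detail-recovery behaviour described in the text.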
For Ψ L,m and Ψ R,m, their feature maps are further calculated. First, de-mean normalization is performed on them. Second, a local derivative pattern (LDP) [47] is used to measure the texture information of Ψ L,m and Ψ R,m after computing their second derivatives in the four directions {0°, 45°, 90°, 135°}. The LDP map in each direction is quantized into a 10-dimensional histogram according to the rotation-invariant uniform local binary pattern. After the above operations, the quantized LDP histograms of Ψ L,m and Ψ R,m are obtained, respectively expressed as H η L,m and H η R,m, η ∈ {25, 100, 240}. The correlation coefficient can be used to measure the degree of correlation between two random variables; its value range is [−1, 1], and the larger the absolute value of the correlation coefficient, the higher the correlation between the two.
Then, based on H η L,m and H η R,m, the absolute correlation coefficient C A and correlation distance C D of each 10-dimensional histogram are calculated as the similarity features of the left and right views. C A and C D are computed as follows:

C A = |corr(H η L,m, H η R,m)|,
C D = pdist(H η L,m, H η R,m),

where corr(·) represents the correlation coefficient function, pdist(·) represents the correlation distance function, and |·| represents the absolute value operation.
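The similarity features C A and C D can be sketched as follows, assuming corr(·) is the Pearson correlation coefficient and pdist(·) is the correlation distance 1 − corr (the MATLAB `pdist(..., 'correlation')` convention); both are common conventions assumed here rather than definitions given in this excerpt.

```python
import numpy as np

def hist_similarity(h_l, h_r):
    """Absolute correlation coefficient C_A and correlation distance C_D
    between two 10-D LDP histograms of the left/right views."""
    h_l = np.asarray(h_l, float)
    h_r = np.asarray(h_r, float)
    rho = np.corrcoef(h_l, h_r)[0, 1]   # Pearson correlation coefficient
    c_a = abs(rho)                      # C_A = |corr(.)|
    c_d = 1.0 - rho                     # correlation-distance convention (assumption)
    return c_a, c_d
```

Identical histograms give C A = 1 and C D = 0, consistent with the "very strong correlation" readings discussed for symmetric distortion.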
Finally, the absolute correlation coefficients and correlation distances of the Gaussian functions with the three scale parameters are taken as the symmetric/asymmetric encoding distortion feature vector f corr. Taking the scenarios in the HSOID [45] as examples, Table 1 shows C A and C D of the distorted viewport's left and right views processed by DurandTMO and encoded with nine distortion levels (the first five are asymmetric encoding distortions, and the last four are symmetric encoding distortions). In addition, Table 2 shows the relationship between C A and the correlation degree, which reflects the correlation degree of left and right views with different distortion levels. Table 1. C A and C D of the distorted viewport's left and right views processed by DurandTMO under encoding with nine distortion levels.

Table 2. Relationship between C A and correlation degree.

The following can be found: (i) Although the distorted viewport's left and right views encoded with the five asymmetric distortion levels can all be judged as having very strong correlation degrees, their absolute correlation coefficients C A are distributed at different levels; the first two are in the range of 0.8 to 0.9, while the latter are in the range of 0.9 to 1.0. This shows that C A and C D can effectively distinguish different levels of asymmetric distortion. (ii) The correlation degrees of the distorted viewport's left and right views encoded with the four symmetric distortion levels are all very strong, and their C A values generally tend to be larger as the distortion level becomes lower. (iii) The C A values of the distorted viewports encoded with the symmetric distortion levels are generally larger than those with the asymmetric distortion levels; this indicates that C A and C D can, to a certain extent, effectively distinguish symmetric/asymmetric encoding distortion types and their degrees of distortion.
(3) Feature extraction with scene analysis The HSOID [45] includes different indoor, outdoor, daytime and night scenes. In the imaging of indoor scenes, most of the light usually comes from ceiling lights, which may not be as sufficient as the light in outdoor scenes; meanwhile, indoor scenes often contain window regions whose brightness is relatively high, which is prone to loss of detail in imaging. In the imaging of outdoor scenes during the daytime, the light is relatively sufficient and the contrast is relatively high. A large sky region may appear in outdoor imaging; because the sky region is relatively flat, the block effect caused by encoding distortion is more perceptible there. Especially when wearing an HMD to view the HSOI, near-eye perception makes distortion in this region more likely to affect subjective quality. Night scenes are generally dark, with relatively low contrast and fuzzy structure.
In summary, based on contrast, detail and structure of an image, feature extraction can be performed to synthesize perceptual distortion features of various scenes. Among them, the details and structures can be represented by the image's detail layer and base layer in combination with the idea of image decomposition.
To generate the base and detail layers of the image, Laplacian pyramid decomposition [48] can be used. Taking a distorted viewport's left view V L,m as an example, it is decomposed by a Laplacian pyramid at three scales, and a total of three detail layers and three base layers are obtained. The local binary pattern (LBP) [49] can describe spatial domain information by encoding the spatial position relationship between the center pixel and its neighboring pixels within a certain radius, and different patterns can characterize structures such as points, lines and edges; here, a contrast-weighted LBP (CLBP) is adopted in combination with the contrast information.
Taking D L,m as an example, its LBP can be expressed as follows:

LBP P,R = ∑_{i=0}^{P−1} T h (D L,m,i − D L,m,c) · 2^i,

where P is the number of neighbors and R is the neighborhood radius; P and R are set to 8 and 1 as in [49]. D L,m,c and D L,m,i represent the value of the center pixel in D L,m and the value of the i-th pixel in the neighborhood centered on it, respectively. T h (·) is the threshold function, expressed as follows:

T h (x) = 1 if x ≥ 0; otherwise, T h (x) = 0.

The rotation-invariant uniform LBP is expressed as follows:

LBP riu2 P,R = ∑_{i=0}^{P−1} T h (D L,m,i − D L,m,c) if u(LBP P,R) ≤ 2; otherwise, LBP riu2 P,R = P + 1,

where the superscript 'riu2' reflects the use of rotation-invariant uniform patterns whose u value is no greater than 2, and u(·) is the uniformity measure, expressed as follows:

u(LBP P,R) = |T h (D L,m,P−1 − D L,m,c) − T h (D L,m,0 − D L,m,c)| + ∑_{i=1}^{P−1} |T h (D L,m,i − D L,m,c) − T h (D L,m,i−1 − D L,m,c)|.

Generally, the rotation-invariant uniform LBP has P + 2 modes, and each mode represents different image content information. Let k be the mode index. The histogram of CLBP is expressed as follows:

H CLBP (k) = ∑_{n=1}^{N} C(n) · δ(LBP riu2 P,R (n) = k), k ∈ [0, P + 1],

where N is the number of pixels of the image D L,m and C is the contrast map of V L,m. Similarly, for B R,m, the corresponding eigenvalues of the structural tensor matrix can be obtained as λ R1, λ R2, λ R3 and λ R4. Then, the structural features of the distorted left and right views are denoted as f st.
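A minimal sketch of the rotation-invariant uniform LBP with P = 8 and R = 1 (R = 1 is fixed by the 3×3 ring here; the CLBP contrast weighting is omitted for brevity):

```python
import numpy as np

def lbp_riu2(img, P=8):
    """Rotation-invariant uniform LBP histogram (P + 2 = 10 modes, R = 1)."""
    img = np.asarray(img, float)
    h, w = img.shape
    # 8 neighbours of the 3x3 ring, in circular order
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    centre = img[1:-1, 1:-1]
    bits = np.stack([(img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx] >= centre).astype(int)
                     for dy, dx in offs])                 # shape (P, h-2, w-2)
    # u: number of 0/1 transitions in the circular bit pattern
    trans = np.sum(bits != np.roll(bits, 1, axis=0), axis=0)
    ones = bits.sum(axis=0)
    codes = np.where(trans <= 2, ones, P + 1)             # riu2 modes 0..P+1
    hist = np.bincount(codes.ravel(), minlength=P + 2).astype(float)
    return hist / hist.sum()
```

A constant image maps every pixel to mode P = 8 (all neighbours pass the threshold), which is a quick sanity check of the encoding.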
Finally, the feature set extracted by the MPM consists of the global color feature F CE, the symmetric/asymmetric distortion feature f corr, the detail feature f CLBP and the structural feature f st.
(4) Feature aggregation with viewport significance When viewing omnidirectional visual contents, users are usually guided by saliency to select viewports as regions of interest. The contribution of different viewports to the overall quality therefore needs to be weighted according to their saliency. For the binocular product saliency map S LR, viewport sampling is performed first to obtain a series of saliency viewport maps, S LR V = {S LR V 1, S LR V 2, . . . , S LR V M+2}. The viewport-normalized saliency value W S = {W m; m = 1, 2, . . . , M + 2} is calculated as the significance weight to express the user's preference for different viewport images, and can be calculated as follows:

W m = ∑_p S LR V m (p) / ∑_{m′=1}^{M+2} ∑_p S LR V m′ (p),

where S LR V m (p) is the pixel value at position p. For the above extracted features f corr, f CLBP and f st, Equation (12) is used to perform feature aggregation, and the aggregated features are, finally, expressed as F corr, F CLBP and F st.
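A sketch of saliency-weighted viewport aggregation; since Equation (12) is not reproduced in this excerpt, the weight W m is assumed to be the viewport's total saliency normalized over all viewports.

```python
import numpy as np

def aggregate_viewport_features(feats, sal_maps):
    """Saliency-weighted aggregation of per-viewport features.

    feats    : array of shape (M + 2, d), one feature vector per viewport
    sal_maps : list of M + 2 saliency viewport maps S^LR_Vm
    Assumed weighting: W_m is the viewport's saliency mass normalised
    over all viewports (Equation (12) itself is not in this excerpt).
    """
    feats = np.asarray(feats, float)
    s = np.array([float(np.sum(m)) for m in sal_maps])
    w = s / s.sum()           # normalised significance weights, sum to 1
    return w @ feats          # aggregated d-dimensional feature vector
```

With equal saliency maps the weights collapse to a plain average over viewports.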

Binocular Perception Module (BPM) for Distorted HSOI
Generally, human binocular perception has three aspects: the initial stage is binocular fusion; information that cannot be fused leads to binocular competition; and in the process of competition, if one view is completely dominant, binocular suppression occurs. Binocular perception is a complex physiological mechanism in which both the user's eyes and brain play a role, and it is difficult for traditional signal processing-based methods to model this process through mathematical formulas and derivation. Therefore, current research generally expresses the processes of fusion and competition by simulating biological mechanisms to establish binocular effects for perceptual quality evaluation. Based on joint image filtering, this subsection designs new binocular mechanism modeling schemes, and combines the perceptual characteristics of the V1, V2 and V4 areas of the cerebral cortex for feature extraction.
(1) Joint image filtering Previous studies [50] showed that the content difference between the left and right views of a stereoscopic image is due to the existence of parallax, but the final result of human binocular perception, after these fluctuations, is a stable stereoscopic image. Therefore, it can be inferred that there is an interactive filtering effect between the left and right views.
(2) Binocular fusion map and feature extraction Traditional binocular fusion images are mainly generated from gray-scale images. However, Den Ouden et al. [51] proved the contribution of color information to binocular matching; this indicates that color information helps solve the binocular matching problem of complex images, and that color and brightness information make independent contributions. Thus, for one of the distorted viewport image pairs, (V L,m, V R,m), it is first converted from RGB space to YUV space. On the Y, U and V channels, joint image filtering f g (·) is performed, and the filtered left and right views of the distorted viewport image are recorded as Φ L,m and Φ R,m, respectively. Φ L,m and Φ R,m are further weighted by E L,m and E R,m, respectively. Thus, the fusion image Φ m is expressed as follows:

Φ m = E L,m · Φ L,m + E R,m · Φ R,m,

where the weighting is applied per channel. For Φ m = {Φ Y m, Φ U m, Φ V m}, the final binocular fusion map Φ m is generated by using the Y, U and V channels of Φ m. Figure 5 shows an example of a binocular fusion map obtained by taking Figure 4 as the test viewport images, including the single-channel fusion maps of the Y, U and V channels and the fusion map Φ m after combining the three channels. Because YUV space is used in Figure 5, the color of the map in Figure 5d is pseudo-color and differs from the RGB image. The red part in Figure 5d is the high brightness region, which corresponds to the white highlight part of the general RGB image, and the green part in Figure 5d corresponds to the low brightness region of the RGB image. The Y, U and V channels display different information, respectively. After combining the three channels, both color information and fused image information are displayed.
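The per-channel weighted fusion can be sketched as follows, assuming the weights E L,m and E R,m (whose exact definition is not given in this excerpt) are non-negative maps normalized per pixel:

```python
import numpy as np

def fuse_views(phi_l, phi_r, e_l, e_r, eps=1e-8):
    """Weighted binocular fusion of one jointly filtered channel (Y, U or V).

    phi_l, phi_r : filtered channel of the left/right viewport view
    e_l, e_r     : per-pixel weighting maps E_L,m and E_R,m; their exact
                   definition is not in this excerpt (assumption: they are
                   non-negative and normalised per pixel here).
    """
    w_l = e_l / (e_l + e_r + eps)
    w_r = e_r / (e_l + e_r + eps)
    return w_l * phi_l + w_r * phi_r
```

Running this on each of the Y, U and V channels and recombining them yields the fusion map discussed above; equal weights reduce the fusion to a per-pixel average of the two views.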
Figure 6 shows the binocular fusion maps of the viewport images processed by five TMOs (from left to right: DurandTMO [46], Khan [52], KimKautz [53], Reinhard02 [54] and Reinhard05 [55]), where the high brightness and low brightness regions are marked with red and orange rectangular boxes, respectively. (a1-a5) Binocular fusion maps; (b1-b5) partially enlarged maps of the highlighted regions, corresponding to the red rectangular boxes in (a1-a5); (c1-c5) partially enlarged maps of the low-dark regions, corresponding to the orange rectangular boxes in (a1-a5).
To sum up, brightness segmentation based on the maximum entropy threshold [56] is performed on the fusion map Φ m, and the high brightness region Φ m,H, the low brightness region Φ m,L and the middle brightness region Φ m,M are obtained. Maximum entropy threshold segmentation separates the three brightness regions relatively completely and is in line with the subjective brightness perception of the human eyes.
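The maximum-entropy (Kapur) thresholding used for the brightness segmentation can be sketched as follows for a single threshold; applying such a threshold twice to obtain the three brightness regions is an assumption of this sketch, since the paper's exact segmentation procedure is not reproduced here.

```python
import numpy as np

def kapur_threshold(gray, bins=256):
    """Single maximum-entropy (Kapur) threshold for a gray image in [0, 1]."""
    hist, edges = np.histogram(gray, bins=bins, range=(0.0, 1.0))
    p = hist.astype(float) / max(hist.sum(), 1)
    cdf = np.cumsum(p)
    best_t, best_h = 1, -np.inf
    for t in range(1, bins):
        w0, w1 = cdf[t - 1], 1.0 - cdf[t - 1]
        if w0 <= 0.0 or w1 <= 0.0:
            continue                        # one class empty: skip
        p0, p1 = p[:t] / w0, p[t:] / w1     # class-conditional distributions
        h0 = -np.sum(p0[p0 > 0] * np.log(p0[p0 > 0]))
        h1 = -np.sum(p1[p1 > 0] * np.log(p1[p1 > 0]))
        if h0 + h1 > best_h:                # maximise total class entropy
            best_h, best_t = h0 + h1, t
    return edges[best_t]
```

On a bimodal brightness distribution the maximizer lands between the two modes, which is the behaviour the segmentation relies on.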
Some studies have shown that visual perception of bright/dark information is unbalanced [57], so different feature extraction schemes should be designed for the high/low brightness regions of TM-distorted images. This mirrors the functions of cone and rod cells among retinal photoreceptors: cone cells mainly work in bright regions and can recognize texture information, while rod cells work in dark regions and recognize contour features. Thus, texture features are extracted for high brightness regions and contour features for low brightness regions. To ensure information integrity, chrominance features are extracted in the middle brightness region. In particular, the process from brightness segmentation to the extraction of texture, contour and chromaticity features is in line with the information delivery mechanism of the human visual system: the V1 area perceives primary luminance features, the V2 area perceives higher-level features such as texture and shape, and the V4 area perceives color information.
For the high brightness region Φ m,H of the binocular fusion map Φ m, its gray-gradient co-occurrence matrix (GGCM) is calculated to characterize its texture features. GGCM combines the image's gray-scale elements and gradient elements. It can clearly describe the statistical characteristics of the gray-scale values and gradients of each pixel in an image and the spatial position relationship between each pixel and its neighboring pixels. Here, the gradient is added to the gray-level co-occurrence matrix to make its description of texture more accurate. Let Y denote the gray-scale map of Φ m,H and G Y denote the gradient map of Y. First, the normalized gradient map G′ Y is computed from G Y as follows:

G′ Y (i, j) = INT( (G Y (i, j) − G Ymin) · (L g − 1) / (G Ymax − G Ymin) ) + 1,

where INT(·) is the rounding function, (i, j) is the position of a pixel in G Y, G Ymax and G Ymin are the maximum and minimum values of G Y, respectively, and L g is set to 32. Similarly, the normalized gray-scale map Y′ is computed from Y as follows:

Y′(i, j) = INT( (Y(i, j) − Y min) · (L x − 1) / (Y max − Y min) ) + 1,

where Y max and Y min are the maximum and minimum values of Y, and L x is set to 32. Let P denote a GGCM; then, its normalized GGCM, P N, can be expressed as follows:

P N (a, b) = P(a, b) / ∑_a ∑_b P(a, b),

where a and b are the pixel values at the same position in Y′ and G′ Y, respectively. Based on the GGCM, a series of statistical measures can be derived to describe the image's texture. Here, five important statistical measures are adopted: gray mean square deviation T 1, gradient mean square deviation T 2, gray entropy T 3, gradient entropy T 4 and mixed entropy T 5, whose definitions involve the mean values μ Y and μ G of Y′ and G′ Y, respectively. Then, T 1, T 2, T 3, T 4 and T 5 are taken as the texture features f GGCM of the high brightness region Φ m,H.
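A sketch of GGCM-based texture statistics follows; the quantization to L = 32 levels follows the text, while the exact statistical formulas for T 1 to T 5 are standard definitions assumed here rather than taken from the paper.

```python
import numpy as np

def ggcm_features(gray, L=32):
    """Five GGCM texture statistics of a gray image (standard definitions)."""
    g = np.asarray(gray, float)
    gy, gx = np.gradient(g)
    grad = np.hypot(gx, gy)                    # gradient magnitude map G_Y

    def quant(a):                              # quantise to L levels
        return np.clip(((a - a.min()) / (np.ptp(a) + 1e-12) * (L - 1)).astype(int),
                       0, L - 1)

    Y, G = quant(g), quant(grad)
    P = np.zeros((L, L))
    np.add.at(P, (Y.ravel(), G.ravel()), 1)    # joint gray/gradient counts
    P /= P.sum()
    py, pg = P.sum(axis=1), P.sum(axis=0)      # marginals over gray / gradient
    lv = np.arange(L)
    t1 = np.sqrt(np.sum(py * (lv - np.sum(lv * py)) ** 2))  # gray m.s. deviation
    t2 = np.sqrt(np.sum(pg * (lv - np.sum(lv * pg)) ** 2))  # gradient m.s. deviation
    t3 = -np.sum(py[py > 0] * np.log(py[py > 0]))           # gray entropy
    t4 = -np.sum(pg[pg > 0] * np.log(pg[pg > 0]))           # gradient entropy
    t5 = -np.sum(P[P > 0] * np.log(P[P > 0]))               # mixed entropy
    return np.array([t1, t2, t3, t4, t5])
```

The mixed entropy is never smaller than either marginal entropy, which provides a quick consistency check of the joint-histogram construction.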
For the low brightness region Φ m,L of Φ m, its narrow-sense contour feature is extracted. The contour feature is usually an efficient representation of the shape of an object in an undistorted image. However, the encoding distortion in the low brightness region appears as a block effect, which visually presents rectangular contour information. The narrow-sense contour feature is therefore defined here to describe the visual distortion of object shapes and the block effect. Because different TMOs change the image information in their own ways, when performing brightness segmentation on the TM image, the segmented edges are first inconsistent; second, different TMOs cause different visual manifestations of encoding distortion in the low brightness region. In Figure 6, the encoding distortion of DurandTMO in Figure 6c1 is the most obvious in the low brightness region, and its block-effect outline is the most apparent, resulting in drastic changes in the gray-scale values of edge pixels; the block effect caused by Reinhard05 in Figure 6c5 is the second most obvious. The results of the other three TMOs in Figure 6c2-c4 are visually similar. Based on this, the energy of gradient (EoG) function is used to measure this change. Let E og denote the EoG of Φ m,L; then, it is calculated as follows:

E og = ∑_x ∑_y [ (Φ m,L (x + 1, y) − Φ m,L (x, y))² + (Φ m,L (x, y + 1) − Φ m,L (x, y))² ].

Here, the average value E og of the EoG is defined as the narrow-sense contour feature. Considering that the block effect at different resolutions may differ, Φ m,L is downsampled at three scales, and the E og values at the four scales are taken as the final narrow-sense contour features f EoG of the low brightness region Φ m,L. Obviously, f EoG is a multiscale feature vector. Table 3 lists the E og values of the five TMOs at the four scales of Φ m,L. The E og values with DurandTMO [46] and Reinhard05 [55] are in the top two positions, followed by Reinhard02 [54], while Khan [52] and KimKautz [53] are numerically similar.
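The energy-of-gradient measure and its multi-scale variant can be sketched as follows; naive 2× decimation stands in for the paper's unspecified downsampling.

```python
import numpy as np

def eog_mean(img):
    """Mean energy of gradient: average of squared forward differences."""
    img = np.asarray(img, float)
    dx = np.diff(img, axis=1)[:-1, :]   # horizontal forward differences
    dy = np.diff(img, axis=0)[:, :-1]   # vertical forward differences
    return (dx**2 + dy**2).mean()

def multiscale_eog(img, scales=4):
    """EoG at the original resolution plus three downsampled scales
    (naive 2x decimation is an assumption of this sketch)."""
    img = np.asarray(img, float)
    feats = []
    for _ in range(scales):
        feats.append(eog_mean(img))
        img = img[::2, ::2]
    return feats
```

A flat region yields zero EoG, while sharp rectangular block boundaries inflate it, which is exactly the behaviour exploited by the narrow-sense contour feature.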
According to Figure 6c1-c5, DurandTMO and Reinhard05 lead to a visually obvious block effect; for Reinhard02, a small amount of block effect can be observed, while the block effect can hardly be observed for Khan [52] and KimKautz [53]. This means that using EoG to measure narrow-sense contour features is effective and consistent with subjective perception. For the middle brightness region Φ m,M, chrominance statistical features are extracted. Considering that image distortion changes the natural scene statistical distribution of its mean subtracted contrast normalized (MSCN) coefficients, the asymmetric generalized Gaussian distribution (AGGD) model can fit this distribution, and differences in the fitted parameters represent the statistical distribution changes. Thus, the four parameters after AGGD fitting, that is, the mean δ m, shape parameter θ m, left scale parameter φ l ² and right scale parameter φ r ², are used as chrominance statistical features. The chrominance statistical features of the U and V channels are used as the natural statistical features f AGGD of the middle brightness region Φ m,M.
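The MSCN coefficients underlying the AGGD chrominance features can be computed as follows; a Gaussian-weighted local mean/std (a common choice, assumed here) is used, and the AGGD moment-matching fit itself is omitted.

```python
import numpy as np

def mscn(img, ksize=7, sigma=7.0 / 6.0, eps=1e-8):
    """Mean-subtracted contrast-normalised (MSCN) coefficients."""
    x = np.arange(ksize) - ksize // 2
    k = np.exp(-x**2 / (2.0 * sigma**2))
    k /= k.sum()

    def blur(a):                         # separable Gaussian filtering
        rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, a)
        return np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, rows)

    img = np.asarray(img, float)
    mu = blur(img)
    var = blur(img * img) - mu * mu
    sigma_map = np.sqrt(np.abs(var))     # local standard deviation
    return (img - mu) / (sigma_map + eps)
```

For natural undistorted content the MSCN coefficients are approximately zero-mean and unimodal, which is what makes parametric (GGD/AGGD) fitting informative under distortion.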
In summary, the above features extracted are expressed as the texture features f GGCM , narrow-sense contour features f EoG and natural scene statistical features f AGGD . These features are all based on viewport images, so according to Equation (12), they are aggregated according to the viewport saliency to obtain F GGCM , F EoG and F AGGD , respectively. Thus, the final features extracted by the binocular fusion model are F fus = {F GGCM , F EOG , F AGGD }.
(3) Binocular difference map and feature extraction Because the left and right views of the HSOI's viewports are processed by the same TMO for viewing the HSOI on the user's HMD with SDR, generally no new color difference is introduced between the left and right views after TM. Therefore, the binocular difference information is directly described in the gray-scale channel. Based on joint image filtering, let Ĥ L,m and Ĥ R,m be the viewport's left and right views after joint image filtering of their gray-scale channels, respectively. Ĥ L,m and Ĥ R,m can be regarded as the image contents that can be initially fused during binocular matching. The absolute difference maps between the distorted viewport images (V L,m, V R,m) and their jointly filtered versions are taken as the left and right monocular difference maps (MD L,m, MD R,m), where MD L,m = |V L,m − Ĥ L,m| and MD R,m = |V R,m − Ĥ R,m|. The monocular difference map represents the information that cannot be fused between the left and right views, and this unfusable information may lead to binocular competition.
Related studies [58] showed that binocular competition occurs at all contrast levels, and the higher the contrast of a monocular stimulus, the stronger its dominance in perception. Therefore, a contrast map is calculated as the competition factor, which weights the left and right monocular difference maps (MD L,m, MD R,m) to obtain the binocular difference map BD m. As mentioned, the contrast map is expressed as C = σ e /(μ e + ε). Let CE L,m and CE R,m be the contrast maps of MD L,m and MD R,m, respectively; then, the binocular difference map BD m is computed as follows:

BD m = [CE L,m / (CE L,m + CE R,m)] · MD L,m + [CE R,m / (CE L,m + CE R,m)] · MD R,m,

where the multiplication and division are performed pixel by pixel. Considering that the binocular difference map mainly represents the contour information dominated by structure, discrete multidimensional differentiators [59] are used to characterize the binocular difference map BD m, in which five types of derivative maps of BD m are computed, that is, the first-order horizontal derivative map g x, the first-order vertical derivative map g y, the second-order horizontal derivative map g xx, the second-order vertical derivative map g yy and the second-order mixed derivative map g xy. Figure 7a shows the MSCN coefficient distribution curves of the five derivative maps of the HSOI which is first compressed by JPEG XT and then processed by DurandTMO. Figure 7b shows the MSCN distribution curves of g x of the HSOI which is first compressed by JPEG XT and then processed by the five TMOs. In order to describe their MSCN coefficient distributions, the generalized Gaussian distribution (GGD) model, f(x; α g, σ g ²), is used for fitting, where α g and σ g ² represent the shape and variance parameters of the GGD model, respectively.
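The five derivative maps and a GGD fit of their coefficient distributions can be sketched as follows; np.gradient stands in for the discrete multidimensional differentiators of [59], and the GGD fit uses standard moment matching (both are assumptions of this sketch).

```python
import numpy as np
from math import gamma

def derivative_maps(bd):
    """Five derivative maps of a binocular difference map."""
    gy, gx = np.gradient(np.asarray(bd, float))   # first-order vertical/horizontal
    gyy, _ = np.gradient(gy)                      # second-order vertical
    gxy, gxx = np.gradient(gx)                    # mixed and second-order horizontal
    return {'gx': gx, 'gy': gy, 'gxx': gxx, 'gyy': gyy, 'gxy': gxy}

def fit_ggd(x):
    """Moment-matching GGD fit: returns shape alpha_g and variance sigma_g^2."""
    x = np.asarray(x, float).ravel()
    sigma_sq = np.mean(x**2)
    rho = sigma_sq / (np.mean(np.abs(x))**2 + 1e-12)
    alphas = np.arange(0.2, 10.0, 0.001)
    r = np.array([gamma(1/a) * gamma(3/a) / gamma(2/a)**2 for a in alphas])
    alpha = alphas[np.argmin((r - rho)**2)]       # invert rho(alpha) by grid search
    return alpha, sigma_sq
```

For Gaussian-distributed coefficients the fitted shape parameter is close to 2, a standard sanity check for this estimator.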
For BD m, the shape and variance parameters of the GGD model of its five types of derivative maps are extracted as the binocular difference features, expressed as f dif. With Equation (12), f dif is further weighted by viewport significance, and the aggregated features are generated as F dif.


Feature Screening and Quality Prediction
As mentioned above, a total of 133-dimensional features are extracted and denoted as F HSOI = {F ec, F corr, F clbp, F st, F fus, F dif}, as shown in Table 4. The proposed metric designs a variety of feature extraction processes for visual perception of the distorted HSOI; however, there may be redundancy in F HSOI. Therefore, by performing feature screening on F HSOI, we can obtain a screened feature vector that is conducive to achieving the best performance and seek a balance between feature dimension and objective quality evaluation performance. The Gini coefficient of the random forest [60] can be used to calculate the contribution of a single feature on each decision tree. The average contribution over all decision trees is the contribution value of this feature and also its weight in regression prediction.
The contribution values of all feature vectors are arranged in descending order, and the features with high contribution values are selectively retained for feature screening. The screened feature vector after feature screening is recorded as F FS_HSOI .
The screened feature vector F FS_HSOI is used as input, and the random forest model R F (·) is used as the quality regression model to realize the mapping from the screened feature vector to the objective quality score of the distorted HSOI, expressed as follows:

Q = R F (F FS_HSOI),

where Q is the predicted objective quality score of the distorted HSOI.
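The Gini-importance screening and random forest regression can be sketched with scikit-learn as follows; the number of retained features (`keep`) is a free parameter of this sketch, whereas the paper selects the dimension that best balances performance.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def screen_and_predict(F, mos, keep=20, seed=0):
    """Gini-importance feature screening plus RF quality regression (sketch).

    F    : (n_samples, n_features) feature matrix F_HSOI
    mos  : subjective quality scores
    keep : number of top-ranked features retained (free parameter here)
    """
    rf = RandomForestRegressor(n_estimators=100, random_state=seed)
    rf.fit(F, mos)
    order = np.argsort(rf.feature_importances_)[::-1]  # descending contribution
    sel = order[:keep]                                 # retained feature indices
    rf_fs = RandomForestRegressor(n_estimators=100, random_state=seed)
    rf_fs.fit(F[:, sel], mos)                          # model for Q = R_F(F_FS_HSOI)
    return sel, rf_fs
```

Calling `rf_fs.predict(F_new[:, sel])` then gives the predicted quality score Q for new samples.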

Experimental Results and Analyses
To verify the effectiveness of the proposed HSOIQA metric, it is tested on the HDR stereoscopic omnidirectional image database (HSOID) [45]. The HSOID includes ten scenes (indoor, outdoor and night scenes); Figure 8 shows the ten scenes in the ERP format in the HSOID, which are generated from the stereoscopic omnidirectional video dataset VRQ-TJU [61], the SOLID dataset [62], the NBU-SOID dataset [28] and YouTube. The HSOID includes nine JPEG XT encoding distortion levels and five kinds of TM distortion resulting from five different TMOs; thus, a total of 450 distorted images (10 scenes × 9 levels × 5 TMOs) are contained.
For the distorted HSOIs with JPEG XT encoding, nine encoding distortion levels are designed, including five asymmetric encoding distortions and four symmetric encoding distortions. For TM distortion, five representative TMOs are selected to map the HSOI to the SDR domain: DurandTMO [46], Khan [52], KimKautz [53], Reinhard02 [54] and Reinhard05 [55]. The distortion generation process includes JPEG XT coding and TM. JPEG XT decomposes the HSOI into a base layer and an extension layer. The distortion degree of the base layer is determined by one quality factor q, and that of the extension layer is determined by another quality factor Q. Four quality factor pairs are set for compressing each view of the HSOI with JPEG XT, and the corresponding distortion levels are represented by L1, L2, L3 and L4, respectively, from high to low. The four quality factor pairs (q, Q) are set to (16, 9), (30, 19), (50, 30) and (90, 72), which correspond to L1, L2, L3 and L4, respectively. Considering the stereoscopic perception of the HSOI, the left view and right view are each compressed with one of the four quality factor pairs to produce nine distortion levels, including four symmetric encoding distortion levels and five asymmetric encoding distortion levels, as shown in Table 5. Thirty subjects aged between 20 and 30 years, including male and female, professional and non-professional, were invited to participate in the subjective experiment, which was conducted at Ningbo University, to ensure that the experimental data were authentic and reliable. The experimental equipment is the HTC Vive Pro HMD with a monocular (left view or right view) viewport resolution of 1440 × 1600 and an FoV angle of 110°.
In the experiments, a rotatable seat was provided for the subject, and the subjects wore the HMD to view the omnidirectional images from various viewing angles. Oral guidance was given first before the subjective evaluation, and the subjects were informed of the characteristics of JPEG distortion, TM distortion, mixed distortion and the relevant information such as the scoring standard. To prevent the subjects from being unable to give accurate scores due to discomfort such as visual fatigue and dizziness, when the subjects finished evaluating 45 test images, they rested for 10 min to improve the reliability of the scores as much as possible [45]. Figure 8. (a-j) Ten scenes in the HSOID database. Table 5. Quality factors of nine encoding distortion levels.

(Table 5 columns: Distortion Type; Symmetric Encoding Distortion with Symbol, Left View and Right View; Asymmetric Encoding Distortion with Symbol, Left View and Right View.)
The subjective scores have nine quality levels; the higher the score, the better the quality. Outliers in the subjective scores were eliminated in strict accordance with the screening criteria, and the average of the remaining valid scores of each HSOI was taken as its MOS value [45]. Figure 9a-j show the MOS values of the symmetric distorted HSOIs of the ten scenes, and Figure 9k-t show those of the asymmetric distorted HSOIs. The MOS curves in Figure 9 are the same as the results in [45], where the subjective scores of the distorted HSOIs in the HSOID are displayed in different visual ways. It can be found that: (1) For symmetric/asymmetric distortion in all scenes, the MOS values of the HSOIs processed with the DurandTMO and Khan operators are relatively low, while the other three TMOs differ little. This may be because DurandTMO tends to "create an illusion" in color, and the TM distortion of Khan is more easily perceived. (2) For symmetric distortion, the MOS values of all scenes increase as the distortion decreases, and the change of MOS between adjacent levels is relatively "steep", which indicates that the quality factors are set reasonably. (3) For asymmetric distortion, the change of MOS values between ad1 and ad2 is gentler than that among ad2, ad4 and ad5, as shown in Figure 9l,q.
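The MOS computation can be illustrated with a minimal sketch. The exact screening criteria of [45] are not reproduced in the text, so a simple z-score outlier screen (an assumption, not the paper's procedure) stands in for them:

```python
import numpy as np

def mos(scores, z_thresh=2.0):
    """Average the valid subjective scores of one image after removing
    outliers; a z-score screen stands in for the criteria of [45]."""
    s = np.asarray(scores, dtype=float)
    mu, sigma = s.mean(), s.std()
    if sigma == 0:           # all raters agree: nothing to screen
        return float(mu)
    valid = s[np.abs(s - mu) <= z_thresh * sigma]
    return float(valid.mean())
```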
Referring to Table 5, when the distortion level of the left view is L1 and that of the right view changes from L3 to L4, the human eyes are not sensitive to the change; however, when the distortion level of the right view is L4 and that of the left view changes from L1 to L3, the MOS values increase rapidly. This is consistent with the fact that JPEG XT coding introduces additive information distortion, so the more severely distorted view generally dominates binocular perception, and this phenomenon persists even after TM processing.

In the experiments, a random forest model is used for the regression prediction task, and K-fold cross validation is used to divide the training and test sets. Specifically, the HSOID database is divided into K subsets, where K = 10, corresponding to the number of scenes in the dataset; all images of the same scene form one subset, which ensures that the training set does not overlap with the test set. The model trained on K − 1 subsets is tested on the remaining subset, iterating from the first scene until all scenes have been traversed, and the average performance over the K folds is reported. The accuracy of the proposed metric is evaluated with three classic indexes: PLCC, SROCC and RMSE. PLCC measures the correlation between subjective and objective scores, and SROCC measures the monotonic correlation between two ordered variables; both lie in [−1, 1], and the closer the absolute value is to one, the more accurate the regression. The closer the RMSE is to zero, the better.
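The evaluation protocol above can be sketched with scikit-learn's `LeaveOneGroupOut`, which implements exactly this scene-wise K-fold split. The random features and synthetic "MOS" values below are placeholders, not HSOID data:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut

# K = 10 folds, one per scene, so train and test never share content.
rng = np.random.default_rng(0)
n_scenes, per_scene, n_feat = 10, 14, 54            # e.g. 140 images, 54-D features
X = rng.normal(size=(n_scenes * per_scene, n_feat))
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=len(X))  # synthetic "MOS"
groups = np.repeat(np.arange(n_scenes), per_scene)        # scene index per image

plcc, srocc, rmse = [], [], []
for tr, te in LeaveOneGroupOut().split(X, y, groups):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[tr], y[tr])
    pred = model.predict(X[te])
    plcc.append(pearsonr(pred, y[te])[0])
    srocc.append(spearmanr(pred, y[te])[0])
    rmse.append(np.sqrt(np.mean((pred - y[te]) ** 2)))

# Average performance over the K = 10 scene-wise folds.
print(np.mean(plcc), np.mean(srocc), np.mean(rmse))
```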
Here, the relevant results of the influence of feature screening on the performance of the proposed metric are first discussed. Then, the proposed metric is compared with several representative 2D-IQA metrics as well as some blind IQA metrics which consider at least one feature of HSOI. The effects of different feature sets involved in the proposed metric are also analyzed, the symmetric/asymmetric distortion is discussed, and the influence of the number of viewports is analyzed.

Feature Screening in the Proposed Metric
The main purpose of feature screening is to select, from the features extracted by the proposed perceptual modules, the feature vector that optimizes the performance of the objective quality metric. Figure 10a depicts the contribution values of all extracted features in the proposed metric, arranged in descending order. As shown in Table 4, the initial feature set has 133 dimensions in total. When the feature dimension drops to 106, the contribution of the remaining features has fallen below 0.2, indicating that most of the extracted features are relatively effective. To further select the best feature set, quality regression is conducted with feature sets of different dimensions to test the objective quality evaluation performance; the results are shown in Figure 10b. Experiments show that selecting the first 54 features, screened according to feature importance, covers all perceptual modules and achieves the best performance. Thus, the first 54 features are selected as the final feature set F_FS_HSOI in the proposed metric.
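The importance-based screening can be sketched as follows, assuming random forest feature importances serve as the contribution values (consistent with the regression model used in the experiments); the data here is synthetic:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Rank the 133 initial features by importance and keep the top k = 54.
rng = np.random.default_rng(1)
X = rng.normal(size=(140, 133))                           # placeholder features
y = X[:, :10].sum(axis=1) + 0.1 * rng.normal(size=140)    # synthetic "MOS"

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
order = np.argsort(rf.feature_importances_)[::-1]  # descending contribution
selected = np.sort(order[:54])                     # indices of F_FS_HSOI
X_screened = X[:, selected]                        # screened feature matrix
print(X_screened.shape)
```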


Overall Performance of the Proposed Metric
To illustrate the effectiveness of the proposed metric, it is compared not only with some representative 2D-IQA metrics but also with four types of blind IQA metrics: an SIQA metric (SINQ [20]), an OIQA metric (SSP-OIQA [24]), two TM-IQA metrics (BTMQI [32] and BTMIQA [33]) and an SOIQA metric (Qi's metric [28]). These metrics cover one or more perceptual characteristics of HSOI, including binocular perception, OI viewport perception and HDR/TM visual perception. All supervised learning-based metrics are trained with K-fold cross validation, and, to ensure fairness and reliability, all metrics are tested with the codes released by their authors. Table 6 shows the objective assessment results of the different metrics on the HSOID dataset, with the best performance indicators highlighted in bold. SINQ [20] accounts for the perceptual features of stereo vision, SSP-OIQA [24] considers the characteristics of OIs, and the two TM-IQA metrics account for the perceptual features of TM distortion; each covers one aspect of HSOI perception. The TM-IQA metrics (BTMQI and BTMIQA) clearly outperform SINQ, and the PLCC and SROCC of Qi's metric are almost equal to those of BTMQI, indicating that TM distortion plays an important role in the visual perception of HSOI. This may be because human eyes are extremely sensitive to the color changes introduced by TM distortion, and color is an intuitive global attribute. SSP-OIQA evaluates OI quality in the SSP format; its poor performance may stem from ignoring stereoscopic perception and TM distortion. Table 6 shows that the proposed metric performs best, with PLCC and SROCC reaching 0.8766 and 0.8724, respectively.
This is because the proposed metric starts from the characteristics of stereoscopic perception, combines HDR/TM perception and omnidirectional perception, and applies a series of effective feature extraction schemes to the various distortions. Therefore, the proposed metric achieves better HSOIQA performance.

Performance Analysis of Different Feature Sets
There are six perceptual feature sets involved in the proposed metric, i.e., F_HSOI = {F_ec, F_corr, F_clbp, F_st, F_fus, F_dif}: the global color feature F_ec, the symmetric/asymmetric coding distortion feature F_corr, the detail contrast feature F_clbp, the structural feature F_st, the binocular fusion feature F_fus and the binocular difference feature F_dif. The random forest model is trained on each feature set or combination separately, and the corresponding performance is reported in Table 7, from which the following observations can be made.

Performance Analysis on Symmetric/Asymmetric Distortions
For HSOIQA, binocular perception is one of the important characteristics to be considered, and a symmetric/asymmetric distortion measurement model is purposely designed to measure the asymmetric encoding distortion in the HSOID dataset. To verify its effectiveness, the proposed metric is compared with the other metrics in Table 6, except the 2D-IQA metrics. The HSOID dataset is divided into two sub-datasets, the symmetric distorted HSOIs and the asymmetric distorted HSOIs, so that the metrics can be tested on the two kinds of HSOIs separately. Table 8 lists the performance of these metrics, where ∆PLCC is the PLCC value on the symmetric sub-dataset minus the overall PLCC value; it indicates the performance degradation caused by the asymmetric sub-dataset, and the smaller the ∆PLCC, the better. The best values in Table 8 are shown in bold. It can be found that: (1) All metrics perform better on the symmetric distorted HSOIs than on the asymmetric ones, with the overall performance lying between the two; moreover, the better a metric performs on the asymmetric distorted HSOIs, the better its overall performance, which implies that asymmetric distortion must be explicitly characterized. (2) For both symmetric and asymmetric distortion, all indexes of the proposed metric are significantly better than those of the other metrics, which means that the feature extraction schemes designed for symmetric/asymmetric distortions in the proposed metric are reasonable.
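The ∆PLCC index defined above can be expressed directly; this is a minimal sketch of the index, with the score arrays as placeholders:

```python
from scipy.stats import pearsonr

def delta_plcc(pred_sym, mos_sym, pred_all, mos_all):
    """∆PLCC = PLCC on the symmetric-distortion subset minus the overall
    PLCC; smaller values mean less degradation from the asymmetric subset."""
    plcc_sym = pearsonr(pred_sym, mos_sym)[0]
    plcc_all = pearsonr(pred_all, mos_all)[0]
    return plcc_sym - plcc_all
```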

Effect of the Number of Viewports
As described in Section 2.2, a total of M + 2 viewports are sampled from the HSOI in the proposed metric: M viewports in the equatorial region and two in the bipolar regions. Here, the effect of M on the performance of the proposed metric is tested, where M ∈ {4, 6, 8, 10}; when M ≥ 4, all image information of the equatorial region is covered. In the experiment, the performance of single feature sets as well as combined feature sets under different viewport numbers is compared, that is, F_corr, F_clbp, F_st, F_fus, F_dif, F_V (F_V = F_corr + F_clbp + F_st + F_fus + F_dif), the full initial feature set F_HSOI and the screened feature vector F_FS_HSOI. Table 9 shows the experimental results for different M; the best performance of each feature set is shown in bold, and the optimal number counts how many times the corresponding M achieves the best (bold) performance. Table 9 shows that when M = 8 the optimal number is 17, where the viewport model and the overall performance are the best. Therefore, the total number of viewports is finally set to 10: 8 viewports in the equatorial region and 2 sampled in the polar regions. The differences in performance across viewport numbers may be due to the redundant information introduced by overlapping viewport sampling and the sensitivity of the extracted features to viewport content.
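A plausible sketch of the M + 2 viewport sampling follows, assuming the M equatorial viewports are equally spaced in longitude and the two polar viewports are centered on the poles (the actual placement is defined in Section 2.2 of the paper):

```python
import numpy as np

def viewport_centers(M=8):
    """Return the (longitude, latitude) centers, in degrees, of the
    M equatorial viewports plus the two polar viewports."""
    lons = np.linspace(-180.0, 180.0, M, endpoint=False)
    centers = [(float(l), 0.0) for l in lons]   # equatorial region
    centers += [(0.0, 90.0), (0.0, -90.0)]      # bipolar regions
    return centers

print(len(viewport_centers(8)))  # M + 2 = 10 viewports in total
```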

Conclusions
From the perspective of the perception of a high dynamic range (HDR) stereoscopic omnidirectional vision system, a visual perception based blind HDR stereoscopic omnidirectional image (HSOI) quality assessment metric has been proposed in this paper. The proposed metric consists of two main modules: a monocular perception module and a binocular perception module. In the monocular perception module, the projection format of the HSOI is first transformed, and a metric combining the spatial domain and the discrete cosine transform domain, based on antagonistic color channels, is designed to measure the global color distortion. Secondly, to measure the symmetric/asymmetric distortion, the absolute correlation coefficient and correlation distance of the left and right views are calculated based on multiscale retinex decomposition; then, considering the characteristics of indoor, outdoor and night scenes, the detail contrast features and structural features are extracted with Laplacian pyramid decomposition. In the binocular perception module, the binocular fusion map and binocular difference map are calculated based on joint image filtering. Brightness segmentation is then performed on the binocular fusion map, so that texture, narrow-sense and chromaticity statistical features can be extracted separately for the high, low and middle brightness regions; this process conforms to the information transmission mechanism of the V1, V2 and V4 areas of the human visual system. For the binocular difference map, derivative maps are calculated and natural scene statistical features are extracted. The effectiveness of the proposed metric has been compared and analyzed on the HSOID dataset, and the experimental results show that it is an effective HSOI quality evaluator.
In future work, the performance of HSOI quality evaluation can be improved in two aspects. First, we will further explore the mechanism by which the visual system perceives HSOIs. Second, compared with hand-crafted feature extraction, learning-based feature extraction can be more consistent with the way the brain processes information; therefore, we will consider combining deep learning and visual perception in a new network to further improve the performance of HSOIQA.