Semantically Adaptive JND Modeling with Object-Wise Feature Characterization, Context Inhibition and Cross-Object Interaction

Performance bottlenecks have emerged in the optimization of JND modeling based on low-level, hand-crafted visual feature metrics. High-level semantics bear a considerable impact on perceptual attention and subjective video quality, yet most existing JND models do not adequately account for this impact. This indicates that there is still much room for performance optimization in semantic feature-based JND models. To address this, this paper investigates the visual attention response induced by heterogeneous semantic features from three aspects, i.e., object, context, and cross-object, to further improve the accuracy of JND models. On the object side, this paper first focuses on the main semantic features that affect visual attention, including semantic sensitivity, objective area and shape, and central bias. Following that, the coupling of heterogeneous visual features with HVS perceptual properties is analyzed and quantified. Second, based on the reciprocity of objects and contexts, contextual complexity is measured to gauge the inhibitory effect of context on visual attention. Third, cross-object interactions are dissected using the principle of biased competition, and a semantic attention model is constructed in conjunction with a model of attentional competition. Finally, to build an improved transform-domain JND model, a weighting factor is derived by fusing the semantic attention model with the basic spatial attention model. Extensive simulation results validate that the proposed JND profile is highly consistent with the HVS and highly competitive among state-of-the-art models.


Introduction
The human visual system (HVS) can be regarded as a communication system with limited bandwidth and processing capacity that continuously receives data input. As a result, the HVS can only sense variations in signal strength above a certain threshold, which is termed the just noticeable difference (JND) [1]. Research on JND can be traced back to the experimental psychology of Ernst Weber [2] in the 19th century and was transferred to the field of digital multimedia at the end of the 20th century. Visual JND models can be built by virtue of relevant physiological, psychological, and neural research, combined with feature detection. Over the past decades, JND has proven to be a multi-factorial problem, involving the contrast sensitivity function (CSF), luminance adaptation (LA), masking effects, and visual attention. An overview of these factors can be found in the survey in [3]. Using the computational domain as a classification criterion, there are two branches of existing JND models, i.e., the pixel domain (where JND thresholds are computed directly for each pixel) [4][5][6][7][8][9][10][11][12][13] and the transform domain (where the image is first transformed into a subband domain, and then JND thresholds are calculated for each subband) [14][15][16][17][18][19][20][21][22][23][24]. Nevertheless, both model types generally follow the same design philosophy of simulating the visual-masking effects of several elements before combining (multiplying or adding) them to obtain an overall JND estimate. The pixel-domain JND model has undergone a protracted evolution. Based on the study of luminance and contrast, the pioneering work of Chou et al. [4] computed luminance adaptation and contrast masking (CM), taking the dominant of the two as the JND threshold. Yang et al. [5] considered the overlap effect between LA and CM, and proposed a nonlinear additivity model for masking (NAMM). Furthermore, the image can be decomposed into plain, texture, and edge regions for more accurate modeling of CM [6].
Since the visual sensitivity decreases with increasing retinal eccentricity, the fovea masking was introduced [7]. Inspired by the internal generation mechanism, the masking effects of ordered and disordered components were quantified based on free energy theory [8], and the pattern masking was extended sequentially [9]. In addition, the RMS contrast was used as the spatial CSF of JND in the pixel domain [10]. Based on hierarchical predictive coding theory, self-information and information entropy were utilized to calculate the perceptual surprise and perceptual suppression effects at different levels to improve the accuracy of the JND model [12].
Transform coding is a widespread means of mainstream image/video coding, and transform domain JND modeling is also a significant research area. In 1992, Ahumada [14] proposed the first DCT domain JND model combining spatial CSF and LA. Based on [14], Watson [15] introduced the CM effect and proposed the DCTune model, which laid the research foundation for the DCT domain JND modeling. Subsequently, a more realistic LA function was proposed to improve JND estimation by integrating a block classification (plain, edge, and texture) strategy [16]. For video, corresponding video JND models for the DCT domain were proposed considering spatiotemporal CSF and eye movement compensation [17], LA effect based on gamma correction and CM effect based on more accurate block classification [18], texture complexity and frequency, visual sensitivity and visual attention [19], the various sizes of DCT blocks (from 4 × 4 to 32 × 32) [20], motion direction [21], fovea masking [22], temporal duration and residual fluctuations [23].
To fill the vacancy of JND databases for image and video compression, several scholars have suggested MCL-JCI [34], JND-pano [35], MCL-JCV [36], VideoSet [37], etc. Based on these datasets, various modeling techniques for JND eventually emerged, most notably subjective data regression [25,27], binary classification [26], picture/video-wise JND (PWJND/VWJND) or satisfied user ratio (SUR) modeling [27][28][29][30], and finding appropriate weighting factors for the JND models [25,26]. In particular, motivated by the smoothing performance (non-smoothed regions have better capability for hiding noise than smoothed regions), Wu et al. [31] proposed an unsupervised learning method for generating JND images in the pixel domain. Considering that JND should be evaluated in the human brain perception domain, Jin et al. [33] presented an HVS-based signal degradation network and employed visual attention loss to further regulate JND estimation. In addition, JND modeling also extends to machine vision [38,39]; however, this specific topic is outside the scope of this paper. Visual attention is a key attribute of HVS [40], and models combined with visual attention can improve the accuracy of JND threshold. In a study by Chen et al. [7], attention-induced fixation was detected and a scheme was designed to reflect the increase in JND with increasing retinal eccentricity. Video JND estimation for smooth pursuit eye movements (SPEM) and non-SPEM situations was accomplished by investigating the relationship between JND thresholds and parameters (spatial frequency, eccentricity, retinal motion velocity) [22]. Considering the special focus of HVS on people, Wang et al. [41] combined Itti's [42] and Judd's [43] attention model (with face detector and person detector) to adjust the JND threshold.
The main feature of the HVS attention mechanism is the information-selection process, in which only a small number of visual signals are transmitted to the brain for processing. Indeed, object-based attention theory suggests that humans are drawn to objects and high-level concepts [44]. Although Wang et al. [41] considered the high-level semantics of faces, images often contain more high-level semantics than faces alone. In any case, traditional visual attention models mainly consider low-level features such as color, luminance, and orientation, while underestimating high-level semantic information, which reduces JND model accuracy under the attention mechanism of the HVS. This is mainly due to the difficulty of semantic feature extraction and fusion. Meanwhile, it is well known that deep neural networks are rich in semantic information [45], but current deep learning-based JND models mainly try to construct PWJND/VWJND or SUR without fully considering semantic characteristics.
In addition, neurons transmit and express visual information by consuming energy [46]. The structure and function of neural networks follow the essential principles of resource allocation and minimization of energy constraints. Therefore, not all stimuli trigger neuronal responses, i.e., biased competition for visual attention influences neuronal activity [47]. The theory of biased competition suggests that selectivity of one (or more) points in the visual process is caused by interaction between visual objects for neural representations. More interestingly, with the limited attentional resources, there will always be certain items and their visual features that win out in the competition. Therefore, it is inevitable to consider such interaction while constructing a semantics-based attention model.
From the above discussion, it is clear that semantics is crucial for accurate estimation of JND thresholds. Furthermore, it should be noted that this paper concentrates on the tuning scheme for DCT-based JND estimation because the majority of image/video processing, especially in image/video compression, is conducted in the DCT domain. Particularly, this paper argues that there are three semantic features to be considered in the attention allocation problem of videos, namely, object, context, and cross-object. Since it is troublesome to extract semantic attributes manually, this paper uses grad-CAM [48] and VNext [49] to extract semantic objects and features through deep learning. Based on this, attention effects induced by object-based semantic features are analyzed, including the sensitivity and size of semantic objects and central bias. Second, contextual information is another vital element that enhances instance differentiation and is considered as a form of center-surround contrast. Furthermore, considering limited attentional resources and infinite information, this paper describes the attentional competition effects of different semantic features (including focus intensity, spatial distance, and relative area) in terms of cross-objects to achieve an accurate model of semantic attention.
Specifically, inspired by the framework of [23], this paper proposes a statistical probability model corresponding to the feature parameters and quantifies the response of visual attention in a perceptual sense using information theory. Then, the attentional competition factor for calibrating the semantic attention model is proposed. Finally, the adaptive semantic attention weight is obtained by unifying and fusing the perceptual feature parameters, and the JND model is adjusted using this weight.
The rest of this paper is organized as follows. Section 2 presents three features and describes the quantification strategies for their parameters. Section 3 analyzes the interaction between objects and context. Section 4 depicts the fusion approaches for these parameters together with the competition properties of attention. Section 5 details the proposed JND profile. Simulation results and conclusions are given in Sections 6 and 7, respectively.

Object-Wise Semantics Parameter Extraction and Quantification
As stated, the literature lacks a clear understanding of the mechanism of interaction between semantic perception and semantic features. This motivates us to formulate an effectual JND profile leveraging the semantic traits of the HVS, which is subject to several challenges. The first issue is how to precisely extract and quantify the feature parameters that thoroughly depict the perceptual response to semantic features in videos, such as semantic sensitivity, objective area, and central bias.

Semantic Instances Extraction
Accurate extraction of high-level semantic features of videos helps better model visual attention. On that basis, this paper uses grad-CAM for the extraction of high-level semantic features. In this network, the deeper the layer, the richer the semantics, while the convolutional feature maps still retain spatial information. In this respect, grad-CAM captures the semantics of our targets of interest. For more details, please see [48]. With grad-CAM, we can draw the heat maps shown in Figure 1. Moreover, to effectively utilize the various properties of semantic objects, this paper uses VNext [49] to extract semantic objects, and the results are shown in Figure 2.

Semantic Sensitivity Quantification
Research [50] has shown that attention is focused on informative content. One definition of attention describes it as a limited resource for information processing, with the area of interest carrying the bulk of the information transmitted to the brain. Hence, Bruce et al. [51] pointed out that Shannon self-information can be used to characterize attention. Accordingly, the attention value of a pixel located at (x, y) in a video frame is defined below.
where F denotes a random variable over the features of pixel (x, y), f (x, y) represents the image feature of pixel (x, y), and P(F = f (x, y)) indicates the probability of F taking the value f (x, y). The concept of constructing statistically sparse representations of images appears to be fundamental to the primate visual system, as evidenced by a large body of research. That is, to some extent, computing the probability distribution of features is essentially a matter of statistical feature sparsity. If a pixel associated with a feature has a lower probability of appearing, it has a better chance of attracting attention because it conveys a greater amount of information. In view of this, the probability density function (PDF) of feature F is measured by performing histogram statistics on the whole picture, as follows.
Accordingly, the initial attention based on semantic sensitivity (m) can be calculated as follows.
It is worth noting that, instead of using a uniform fitting function, the histogram of each frame is adaptively fitted to better suit the image content. Figure 3 shows the histogram of the 2nd frame of the HEVC standard test sequence "BasketballDrill" and its fitted PDF curve, respectively.
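As a minimal sketch of this self-information computation, the snippet below uses grayscale intensity as the feature F and a plain normalized histogram in place of the paper's adaptive PDF fit (an assumption of this sketch):

```python
import numpy as np

def self_information_map(gray, bins=256):
    """Per-pixel attention as Shannon self-information, -log2 P(F = f(x, y)),
    with P estimated from the frame's own intensity histogram (a stand-in
    for the paper's adaptively fitted PDF)."""
    hist, _ = np.histogram(gray, bins=bins, range=(0, bins))
    pdf = hist / hist.sum()                 # empirical P(F)
    pdf = np.clip(pdf, 1e-12, None)         # avoid log(0)
    return -np.log2(pdf[gray.astype(int)])  # rarer features -> more attention

# toy frame: a rare bright patch on a uniform background
frame = np.zeros((64, 64), dtype=np.uint8)
frame[30:34, 30:34] = 200
att = self_information_map(frame)
assert att[31, 31] > att[0, 0]  # the rare patch attracts more attention
```

Since low-probability features carry more information, the rare bright patch receives a much higher attention value than the dominant background, matching the sparsity argument above.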

Objective Area Quantification
The literature demonstrates that the larger the area of a semantic object, the higher the probability that it attracts attention, i.e., attention probability is proportional to size; however, this effect levels off beyond a certain saturation threshold [52]. This relationship can be summarized by a piecewise function [53], with values within [0,1].
where 0 < j ≤ N, N is the number of semantic objects in an image frame, and η j represents the size of a semantic object, calculated as η j = B j /(W × H), where B j is the number of object pixels and W, H refer to the width and height of the image frame. Figure 4 shows the relationship between area and attention. Likewise, for semantic objects of the same size, attention is affected by the aspect ratio. As is well known, the aspect ratio of current popular displays is 16:9. It follows that the closer the aspect ratio of a semantic object is to our field of vision, the more it attracts attention. Figure 5 is an illustration of the aspect ratio effect. For clarity, the blue rectangle on the left is labeled A, the one on the right B, and the container holding the two rectangles C. Despite having the same area, A and B have different aspect ratios: the aspect ratios of A and C are 16:9, while that of B is 9:16. It is obvious from the example that rectangle A attracts more attention than rectangle B. Note that rectangles are used merely for convenience of observation, and the result holds for other shapes as well.
Therefore, the joint effect of semantic object size and aspect ratio (R r ) is used to measure the attention induced by semantic object size.
where α 1 , β 1 , β 2 are adjustment control parameters, and r j denotes the aspect ratio of the j-th semantic object, normalized to [0,1]. The model is consistent with HVS subjective perception: perceptual attention intensity increases with the size of the object, but there is a saturation point. The reason is that when the object area is too large, the target and background become non-separable, i.e., the target is likely a background component at that point. In addition, targets with an aspect ratio closer to 16:9 attract greater HVS attention, which raises perceptual sensitivity and lowers JND thresholds.
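A hedged sketch of this joint size/aspect-ratio attention follows; the saturation threshold, the exponential ratio penalty, and their constants are illustrative assumptions, not the paper's fitted α 1 , β 1 , β 2 :

```python
import numpy as np

ETA_SAT = 0.2   # assumed saturation point for normalized object area
BETA = 3.0      # assumed penalty for deviating from the 16:9 field of view

def area_attention(num_obj_pixels, width, height, obj_w, obj_h):
    """Joint size/aspect-ratio attention for one semantic object:
    attention grows with normalized area eta_j = B_j / (W*H) up to a
    saturation threshold, and peaks for aspect ratios near 16:9."""
    eta = num_obj_pixels / (width * height)      # eta_j, normalized area
    size_term = min(eta / ETA_SAT, 1.0)          # piecewise, saturates at 1
    r = (obj_w / obj_h) / (16 / 9)               # aspect ratio relative to 16:9
    ratio_term = np.exp(-BETA * abs(np.log(r)))  # 1.0 exactly at 16:9
    return size_term * ratio_term

# rectangle A (16:9) vs rectangle B (9:16), same area, in a 1920x1080 frame
a = area_attention(160 * 90, 1920, 1080, 160, 90)
b = area_attention(90 * 160, 1920, 1080, 90, 160)
assert a > b   # the 16:9 rectangle attracts more attention
```

With equal areas, only the ratio term differs, reproducing the Figure 5 observation that rectangle A outdraws rectangle B.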

Central Bias Quantification
It is stated that most images have the foreground object in the center of the image frame [43]. It follows that when a person looks at an image, he or she habitually looks at the center of the image first. Eye-movement experiments have shown that HVS tends to focus on the center of the image, and the deviation from the target center to the image center, i.e., the central bias, has been considered a noteworthy prior in attention modeling [44]. Based on this prior knowledge, many attention modeling algorithms based on central bias use various methods to increase the attention value at the image center location as a way to highlight the salient targets in the image.
Nevertheless, the feedback brought by the central bias is not always favorable. For some unconventional cases where the target is off-center in the image, attention enhancement at the image center can lead to over/under estimation of the target far from the image center. In general, the problem with the classical center-priority based attention modeling algorithm is that it tends to lack flexibility, which leads to significant inaccuracies when the foreground targets of the image are not positioned as expected.
The formation of a top-notch attention map is the ultimate goal of attention modeling. The complexity of attention modeling is greatly reduced and the quality of the final attention map is improved if the locations of the salient targets are approximated before proceeding to formal attention modeling.
In this paper, the object (M) closest to the image center is identified, and the central bias is constructed with M as the experienced center of vision. The closer an object is to the visual center, the higher the probability that it is noticed, i.e., the attention probability of a semantic object is inversely related to its central bias distance, which is denoted as follows.
where d j is the distance from the semantic object to M, and normalized to [0,1].
here, O j is a semantic object, f denotes a pixel of O j , (x f , y f ) represents the coordinates of f , (x M , y M ) denotes the center coordinates of M, ∥·∥ is the Euclidean distance, and ω 1 , ω 2 serve as adjustment parameters. As illustrated in Figure 6, the modified central bias result is better in line with HVS perception. Since the target M in the figure is closest to the image center, it draws the highest attention, shown as the brightest region. In this instance, the target M stands in for the image center as the experienced center of vision; the closer the other objects are to M, the more attention they attract, and the brighter they appear in the figure.
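The experienced-center scheme can be sketched as follows; the exponential falloff and the values of ω 1 , ω 2 are assumptions for illustration, and object centers stand in for the pixel-wise distances:

```python
import numpy as np

OMEGA1, OMEGA2 = 1.0, 2.0   # assumed adjustment parameters

def central_bias(object_centers, frame_w, frame_h):
    """Central-bias attention: pick the object M closest to the image
    center as the experienced center of vision, then weight every object
    by its normalized Euclidean distance d_j to M."""
    centers = np.asarray(object_centers, dtype=float)
    img_center = np.array([frame_w / 2, frame_h / 2])
    m_idx = np.argmin(np.linalg.norm(centers - img_center, axis=1))
    d = np.linalg.norm(centers - centers[m_idx], axis=1)
    d = d / np.hypot(frame_w, frame_h)     # normalize d_j to [0, 1]
    return OMEGA1 * np.exp(-OMEGA2 * d)    # closer to M -> more attention

att = central_bias([(400, 300), (100, 100), (700, 500)], 800, 600)
assert att[0] == att.max()   # the object nearest the image center acts as M
```

Because M itself has d = 0, it receives the maximum attention, and off-center objects are graded by their proximity to M rather than to the geometric image center, matching the Figure 6 behavior.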

Context-Wise Attentional Inhibition Quantification
Given that HVS still beats the state-of-the-art computer vision systems, modeling visual attention in natural scenes is a hot topic of research today. How is such great effectiveness achieved? The ability of HVS to utilize the context of a scene to direct attention before most items are recognized appears to be a key factor compared to artificial systems [54].
How does the context of a scene serve to focus attention on objects in the scene? Davenport and Potter [55] suggest that there is an interaction between objects and context during scene processing. On this basis, this paper conceptualizes the local spatial arrangement of the image background as a proxy for context and focuses primarily on the complexity of its patterns.
In most situations, the region between the target and its background is heterogeneous and has a complex organization. The density of the visual information presented is normally used to describe the complexity of the visual background. In general, a scene is complex when its background displays a large amount of information and the unpredictability of this information is considerable [56]. The complexity of the visual context usually has a detrimental impact on the visual task: the higher the visual complexity, the longer the response latency of nerve cells and the stronger the degree of inhibition of the target [57].
Since the contextual complexity around a target suppresses attention strength, the more complex the background pattern, the stronger the suppression. Meanwhile, the findings of numerous human visual field experiments indicate that the human binocular visual field is approximately circular (oval), with the approximate form depicted in Figure 7 [58]. Hence, to represent the inhibition exerted by each semantic target's background, we use the average contextual complexity within the circumcircle of the semantic target to simplify the calculation. Since entropy reflects the degree of disorder of a system, we estimate contextual complexity by local entropy. As a result, attention and contextual complexity are related in the following way.
where u j is the contextual complexity, and normalized to [0,1].
here, ξ 1 and ξ 2 serve as scaling factors, circle(·) denotes the circumcircle, and entfilt(·) is the function that calculates local entropy. Figure 8 shows the inhibitory effect of contextual complexity on semantics.

Fusion Strategies and Cross-Object-Wise Attentional Competition
Furthermore, given the elaborately chosen feature parameters, the second challenge is how to quantify the interaction among these feature parameters, i.e., how to fuse these heterogeneous feature parameters.
Consolidating the semantics-based attention model is the aim of this part. In general, different feature maps do not have the same attributes, and different features contain complementary global contextual information and local detailed information between them. Consequently, to obtain reliable results for the attention distribution, it is essential to select the appropriate weights for each feature map.
In this paper, to estimate the weights of the feature maps above, the gradient change value of each feature map is first calculated, where the greater the variation of the data, the more information can be gleaned. The gradient variation is then used to estimate the relative weights of these feature maps. Finally, utilizing their individual weights, the feature maps are normalized and combined into a single attention map.
here, λ i (i = 1, 2, 3) serves as the weighting factor, and is calculated as follows.
vg(z i ) = [max(g(z i )) − min(g(z i ))] / mean(g(z i )) (12)

where z i ∈ {I(m), I(η), P(A|d)} is the attention map of a different semantic feature and g(·) denotes the gradient. Figure 9 displays the initial attention map with different weighting factors. As illustrated, compared to map (a), map (b) with the gradient-variation-based weights is more consistent with HVS subjective perception. This is partly because we fully take the content characteristics of the different feature maps into account. However, there is a competitive mechanism that manifests itself as a relative enhancement of responses to task-relevant objects or a relative inhibition of neglected objects within the brain [59]. That is, when building the semantic attention model, it is not sufficient to consider only the semantic object itself; the biased competition between different objects must also be taken into account.
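Equation (12) and the fusion step can be sketched as follows; normalizing λ i by the sum of the gradient variations, and min-max normalizing each map, are assumptions of this sketch:

```python
import numpy as np

def gradient_variation(z):
    """v_g(z_i) = (max(g(z_i)) - min(g(z_i))) / mean(g(z_i)), where g is
    the gradient magnitude of feature map z_i (Eq. (12))."""
    gy, gx = np.gradient(z.astype(float))
    g = np.hypot(gx, gy)
    return (g.max() - g.min()) / max(g.mean(), 1e-12)

def fuse(maps):
    """Combine feature maps with weights lambda_i proportional to their
    gradient variation; each map is min-max normalized first."""
    v = np.array([gradient_variation(z) for z in maps])
    lam = v / v.sum()   # assumed normalization of the weights
    norm = [(z - z.min()) / max(z.max() - z.min(), 1e-12) for z in maps]
    return sum(l * n for l, n in zip(lam, norm))

rng = np.random.default_rng(1)
maps = [rng.random((16, 16)) for _ in range(3)]  # stand-ins for I(m), I(eta), P(A|d)
fused = fuse(maps)
assert fused.shape == (16, 16) and fused.min() >= 0.0
```

Maps with larger gradient variation, i.e., more content change and hence more gleanable information, contribute more to the fused attention map.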

• Focused Intensity: when multiple stimuli are present in the visual field at the same time, inhibitory competition in the visual cortex affects the allocation of attention, which probably stems from the inability to bias the interaction toward a particular object [60]. In reality, the brain's representation of information is essential for human vision, and regardless of the distance between neurons or neuronal populations, they do not operate independently but constantly interact with each other. Because their activities are interconnected and competing, they cannot provide "independent attentional resources" [61]. That is, when there are multiple semantic objects in an image, attention to each semantic object is impacted by the others. The higher the focused intensity of an object, the more pronounced its suppression of the other objects;
• Spatial Distance: event-related potentials, functional MRI investigations, and single-cell recordings in monkeys [62] and humans [63] indicate that neural enhancement at the attentional focus may be accompanied by neural inhibition in peripheral regions. Detection is slower and discrimination performance is poorer at interference locations close to the target than at those far from the target [64]. Within the receptive fields of cells, shifting attention from one stimulus to another has a strong effect when the two stimuli are in close proximity, and a much weaker effect when they are far apart [59]. In other words, the more spatially distant objects are from each other, the smaller the inhibitory impact;
• Relative Area: since an object generally has several neighbors, the relative area of an object with respect to its neighbors affects its attentional competition. The rationale for selecting relative area is that, even if an object has an attractive intrinsic area, it may not stand out unless it exhibits the greatest contrast when all of its neighbors also have attractive areas [53]. In particular, the higher the area contrast, the more strongly the other objects are suppressed.
Based on the points discussed above, the attentional competition weight is defined in Eq. (14), in which D τ (k, l) and D ν (k, l) are the average focused-intensity distance and the spatial proximity distance between the k-th and l-th semantic objects, respectively.
where F(O k ) denotes the average focused intensity of the k-th object.
where µ is a scaling factor, (x k , y k ) and (x l , y l ) are the coordinates of the center point of the k-th, l-th semantic objects.
where S δ (k) represents the relative area competition weight of the k-th semantic object and G(k) is the relative area of the k-th object with respect to its neighbors; see [53] for more details. As a result, the semantics-based attention model is constructed by incorporating the attentional competition weight.
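Since the exact competition formulas are not reproduced here, the following is a hedged sketch of the idea only: an object is suppressed more by neighbors that are stronger (focused intensity), closer (spatial distance), and relatively larger (relative area). The functional forms and the value of the scaling factor µ are illustrative assumptions:

```python
import numpy as np

MU = 2.0   # spatial-distance scaling factor (assumed value)

def competition_weights(intensity, centers, rel_area, frame_diag):
    """Cross-object attentional competition sketch: each object's weight
    is reduced by the combined suppression from stronger, nearer, and
    relatively larger neighbors."""
    c = np.asarray(centers, dtype=float)
    n = len(intensity)
    w = np.ones(n)
    for k in range(n):
        suppression = 0.0
        for l in range(n):
            if l == k:
                continue
            d_tau = max(intensity[l] - intensity[k], 0.0)   # intensity gap
            d_nu = np.exp(-MU * np.linalg.norm(c[k] - c[l]) / frame_diag)
            suppression += d_tau * d_nu * rel_area[l]       # near + strong +
        w[k] = 1.0 / (1.0 + suppression)                    # large -> suppress
    return w

# two objects: a strong, large one near a weaker, smaller one
w = competition_weights([0.9, 0.3], [(100, 100), (140, 100)],
                        [0.8, 0.2], frame_diag=1000.0)
assert w[1] < w[0]   # the weaker object is suppressed by its strong neighbor
```

The weight vector would then calibrate the fused attention map, relatively enhancing competition winners and inhibiting neglected objects.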
Sem att = att pre · C A (19)

Figure 10a-c takes the 2nd frame of the "BasketballDrill" sequence as an illustration, showing the attention maps of semantic sensitivity, objective area, and central bias. Figure 10d displays the inhibition effect of contextual complexity. Figure 10e shows the attentional competition result. Figure 10f exhibits the final attention map. Intuitively, brighter areas indicate higher attention, inhibition, or competition.

Semantic-Based Spatio-Temporal Transform Domain JND Profile
As mentioned above, this paper combines semantic sensitivity, objective area, central bias, contextual complexity and attentional competition to measure semantic perceptual attention. Then, based on the model in [65], considering the impact of semantics, this study modifies the spatial attention factor, aiming for a more accurate JND model. The framework of the proposed JND profile is shown in Figure 11. By introducing the weighting factor of the semantics-based attention, we propose the following JND model.
here, JND st (t, n, i, j) is the spatio-temporal JND threshold of coefficient (i, j) of the n-th block in the t-th frame, considering the LA, CSF and masking effects [23].
Taking the spatio-temporal JND threshold JND st as the basis, this work proposes a patch-wise weighting factor, the attentional weight w s (t, n), accounting for the visual perceptual attention of semantics. The patch-level w s (t, n) is determined as the mean of the pixel-wise adjustment factors w s (t, i, j) in the spatial domain over the n-th image block of the t-th frame.
To be specific, by following [65], we estimate spatial attention A s (x), and combine semantic attention Sem att (x) to get semantics-based spatial attention A F (x) using NAMM [5].
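The NAMM-style fusion of the two attention sources can be sketched as follows; NAMM [5] combines two terms by subtracting a fraction of their overlap, and the overlap-reduction constant below is an assumed value for illustration:

```python
import numpy as np

GAIN_REDUCTION = 0.3   # assumed overlap-reduction constant

def namm_combine(a_s, sem_att, c=GAIN_REDUCTION):
    """NAMM-style fusion of spatial attention A_s and semantic attention
    Sem_att: add the two sources, then subtract a fraction of their
    overlap so the joint effect is not double-counted."""
    return a_s + sem_att - c * np.minimum(a_s, sem_att)

a_s = np.array([0.2, 0.8])   # toy spatial attention values
sem = np.array([0.5, 0.1])   # toy semantic attention values
combined = namm_combine(a_s, sem)
assert np.all(combined <= a_s + sem)   # overlap is discounted, not doubled
```

This mirrors how NAMM handles the overlap between LA and CM in [5]: simple addition would overstate regions where both attention sources are already strong.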
In general, the larger the visual attention A F (x), the smaller the JND threshold. Intuitively, it seems sensible to apply the sigmoid-like function to normalize A F (x) to [0,1] and measure the spatial attention weight w s (t, n) below.
where ϑ 1 , ϑ 2 are constants, and A F (t, n) is the mean of A F (x) over the n-th block. According to subjective experiments, ϑ 1 and ϑ 2 are set to 2.5 and 2, respectively. Figure 12a,b display the obtained attention map and the corresponding weight map. Figure 12c shows the final spatio-temporal JND threshold map. In Figure 12a,c, brighter areas indicate higher visual attention or masking, while the opposite holds for Figure 12b.
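One plausible form of the sigmoid-like mapping can be sketched as below; the exact expression is not reproduced here, so the functional form is an assumption, with only ϑ 1 = 2.5 and ϑ 2 = 2 taken from the text:

```python
import numpy as np

THETA1, THETA2 = 2.5, 2.0   # constants reported in the paper

def attention_weight(a_f_block_mean):
    """A sigmoid-like, monotonically decreasing mapping (assumed form):
    the higher the block's mean attention A_F(t, n), the smaller the
    weight w_s(t, n), and hence the smaller the JND threshold."""
    return THETA1 / (1.0 + np.exp(THETA2 * a_f_block_mean))

# high-attention blocks get a smaller weight (lower JND, less noise)
assert attention_weight(0.9) < attention_weight(0.1)
```

Note that the weight can exceed 1 for low-attention blocks, which raises the JND threshold there and lets more noise hide in unattended regions.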

Comparison of Model Performance
An ideal JND model should, in a sense, distribute noise more fairly, concealing more of it, with acceptable perceptual quality. Thus, we add coefficient-wise JND-guided noise to video sequences, as described in [22], to assess the effectiveness.
R̃(t, n, i, j) = R(t, n, i, j) + ρ · rand(t, n, i, j) · JND(t, n, i, j)

where R(t, n, i, j) denotes the transform coefficients of the original sequence, R̃(t, n, i, j) the JND-noise-contaminated coefficients, ρ regulates the energy of the JND noise, and rand(t, n, i, j) ∈ {+1, −1} is bipolar random noise. We evaluate the effectiveness of the proposed work against five benchmark models, namely Bae 2017 [22], Zeng 2019 [66], Wang 2020 [12], Xing 2021 [23], and Li 2022 [67]. Ten videos with diverse semantics were chosen from the HEVC standard sequences to cover different resolutions. Four of them are 1920 × 1080 full-HD videos, namely "Kimono1", "ParkScene", "BasketballDrive" and "BQTerrace", as shown in Figure 13a-d. Three of them, "FourPeople", "Johnny", and "KristenAndSara", are 1280 × 720 videos, as shown in Figure 13e-g. The remaining three, "BasketballDrill", "PartyScene", and "RaceHorses", are 832 × 480 videos, as shown in Figure 13h-j. Peak Signal-to-Noise Ratio (PSNR) is used as an objective evaluation criterion; a lower PSNR indicates a better ability to mask noise. However, although PSNR is the most popular objective quality metric, a large number of empirical psychological studies have demonstrated that PSNR scores fall short of properly capturing HVS perception due to the visual physiological and psychological mechanisms involved. Furthermore, the human eye is the receiver of the final visual signal, and therefore subjective quality metrics must also be taken into account when evaluating the proposed JND model.
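The noise-injection rule above can be sketched directly; the random seed and array shapes are incidental to the illustration:

```python
import numpy as np

def inject_jnd_noise(coeffs, jnd, rho=1.0, seed=0):
    """R~ = R + rho * rand * JND: add bipolar random noise, scaled per
    coefficient by the JND threshold, to the transform coefficients."""
    rng = np.random.default_rng(seed)
    sign = rng.choice([-1.0, 1.0], size=coeffs.shape)  # rand in {+1, -1}
    return coeffs + rho * sign * jnd

coeffs = np.zeros((4, 4))          # toy transform-coefficient block
jnd = np.full((4, 4), 2.0)         # toy per-coefficient JND thresholds
noisy = inject_jnd_noise(coeffs, jnd, rho=0.5)
assert np.allclose(np.abs(noisy - coeffs), 1.0)   # |noise| = rho * JND
```

A more accurate JND model lets the same ρ inject more noise energy (lower PSNR) while keeping the distortion perceptually invisible.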
In this paper, the visual quality of the JND-contaminated videos was assessed by recruiting 17 subjects with normal or corrected-to-normal vision, following the subjective viewing test in [68]. Specifically, the display was a 27-inch LED monitor, and the viewing distance was six times the height of the video frame. The difference between the original and processed sequences was observed using the Double Stimulus Continuous Quality Scale (DSCQS) method [23]. Figure 14 illustrates the testing procedure of the DSCQS. In each presentation, the reference and test sequences are arranged in a pseudo-randomized order. During the voting period, participants rate the quality of each of the two videos. The visual quality of the JND-contaminated videos is then measured by calculating the MOS difference (DMOS) between the original and matched processed sequences. The calculation is described below.
DMOS = MOS_ORI − MOS_JND, where MOS_ORI and MOS_JND are the measured mean opinion score values for the original and test videos, respectively. Five quality scales are used, i.e., excellent (80-100), good (60-80), fair (40-60), poor (20-40), and bad (0-20). The smaller the DMOS value, the better the visual quality of the JND-contaminated video.
Figure 14. DSCQS method, where the original sequence (ORI) and test sequence (TEST) are pseudorandomly ordered for each presentation.
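The DMOS computation and the five-grade scale can be expressed compactly as below. This is an illustrative sketch; the function names are hypothetical, and the grade boundaries assume the standard contiguous 20-point intervals listed above.

```python
def dmos(mos_ori, mos_jnd):
    """Difference Mean Opinion Score: DMOS = MOS_ORI - MOS_JND.
    A smaller DMOS means the JND-contaminated video is perceptually
    closer to the original."""
    return mos_ori - mos_jnd

def quality_scale(score):
    """Map a 0-100 opinion score to the five DSCQS quality grades."""
    for lower, label in ((80, "excellent"), (60, "good"),
                         (40, "fair"), (20, "poor")):
        if score >= lower:
            return label
    return "bad"
```

For instance, an original video rated 90 against a contaminated version rated 85 yields DMOS = 5, i.e., a barely perceptible quality loss.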
The detailed performance results of the different JND models are shown in Table 1. From a panoramic viewpoint, the proposed JND model achieves the best PSNR and DMOS performance on all videos except the "RaceHorses" sequence. This is mainly because foreground semantic objects occupy a large proportion of the "RaceHorses" frames, while our model guides only a small amount of noise into the foreground semantic object region. Therefore, the PSNR value of the proposed JND model is slightly higher than that of Xing 2021. In addition, as shown in Table 2, the Video Multimethod Assessment Fusion (VMAF) scores [69] of the noise-contaminated videos of the different JND models are measured, and the proposed model obtains optimal or suboptimal scores. A larger VMAF score indicates that the subjective quality of the noise-contaminated sequence is closer to that of the original. The results in Table 2 corroborate this observation. To visualize the data, Figure 15 displays bar graphs of the average PSNR, VMAF and accompanying DMOS results for the test sequences at the various resolutions. Apparently, the noise-contaminated video generated by our model achieves the best perceptual quality (smallest DMOS score) together with the largest distortion (lowest PSNR value) compared to the other four models. These experimental results validate the superiority of our JND profile in guiding noise injection.
Figure 15. (a) Average PSNR values in Table 1 for 832 × 480, 1280 × 720, and 1920 × 1080; (b) Average DMOS values in Table 1 for 832 × 480, 1280 × 720, and 1920 × 1080; (c) Average VMAF scores in Table 2 for 832 × 480, 1280 × 720, and 1920 × 1080 [12,22,23,66,67].
To compare these five JND models more clearly, Figure 16 takes the 4th frame of the "Kimono1" sequence as an example. In general, the HVS tends to pay more attention to semantic targets [65]. In Figure 16a, the woman is prominent relative to the background. Therefore, a lower JND value should be applied in this region to inject less noise.
We mainly focus on the head and body to facilitate visual comparison. Apparently, there is considerable visible noise in Figure 16d-g, while the eyes and ears in Figure 16c are blurred. In contrast, Figure 16h is significantly clearer despite having the lowest PSNR value. As far as the body is concerned, the noise in Figure 16j-m is plainly apparent without exception, while in Figure 16o it is almost undetectable. Consequently, the above objective and subjective evaluations validate that the proposed semantic attention-based weighting model is effective. As mentioned in the introduction, high-level semantic objects other than humans often appear in image and video frames. Frame 70 of the "Basketball Drill" sequence is depicted in Figure 17 as an illustration. Clearly, this is a scene depicting basketball training, so the basketball and the ball frame, in addition to the basketball players, are natural objects of interest. With respect to the ball frame, Figure 17c-g show significant distortion, while Figure 17h performs well. For the basketball, Figure 17j-l,n depict considerable distortion, and Figure 17m shows significant distortion at the lower right of the basketball, while the noise in Figure 17o is barely visible.

Ablation Experiment and Analysis
To verify the effectiveness of each module in the proposed algorithm, this paper performed ablation experiments on the three perception modules (object, context, and cross-object) and evaluated their impact on the perception results.
The experimental results in Table 3 demonstrate that the combination of all three modules achieves optimal results, while the other combinations do not achieve the desired effect. In other words, when considering the influence of semantics on JND, each of the three aspects of object, context and cross-object features is indispensable. It is worth noting that since the context and cross-object modules are calculated based on the object module, combinations that do not contain the object module are discarded. In practical applications, different kinds of videos emphasize semantic features at different scales, and the combination of all three modules achieves better generality.
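The set of valid ablation configurations described above can be enumerated as follows. This is a hypothetical sketch of the selection logic only (the module names mirror the text; no perceptual computation is implied): combinations lacking the object module are discarded, since context and cross-object are computed on top of it.

```python
from itertools import combinations

MODULES = ("object", "context", "cross-object")

def valid_ablation_combos():
    """Enumerate module combinations for the ablation study,
    keeping only those that include the object module."""
    combos = []
    for r in range(1, len(MODULES) + 1):
        for combo in combinations(MODULES, r):
            if "object" in combo:
                combos.append(combo)
    return combos
```

Of the seven non-empty subsets, four remain: object alone, object with context, object with cross-object, and all three together, matching the configurations compared in Table 3.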

Conclusions
A novel semantic-based JND model was proposed in this paper by thoroughly mining and characterizing the semantic feature parameters that affect attention. The interaction between semantic visual features and the HVS was investigated from object, contextual and cross-object perspectives. This analysis included HVS responses induced by semantic sensitivity, objective area and central bias, as well as perceptual suppression brought about by contextual complexity and cross-object competition for attention. In conjunction with underlying attention in the spatial domain, perceptual attention to stimuli was measured with information-theoretic support and incorporated into a patch-level weighting factor. Using the semantic-based attentional weight, the JND model for the spatiotemporal transform domain was modified. The experimental results demonstrate the effectiveness of the proposed JND model, with superior performance and stronger distortion-hiding ability compared to state-of-the-art JND models.
Existing JND models consider only the effects of unimodal signals, while cross-modal JND remains an open problem; there is, however, great incentive to transfer this issue from the laboratory to real-world adoption. In the future, we will further investigate multimodal asynchronous perception, seek a more unified approach to fusing multimodal feature parameters, and obtain more accurate JND thresholds.