3.1. Motivation and Framework Overview
The human visual system (HVS) does not perceive stereoscopic images through a simple additive process. Instead, it involves a sophisticated interplay between binocular fusion and rivalry [20,21,33]. When monocular inputs are consistent, the brain achieves smooth integration; however, in the presence of asymmetric distortions, perception is often dominated by the view with higher information salience or better quality [10,33].
Inspired by these biological principles, we propose two core mechanisms to enhance SIQA performance. First, the ASP strategy functionally simulates the progressive nature of binocular rivalry. Unlike static weighting schemes, ASP dynamically models competitive interactions, allowing the model to adaptively shift focus between views across processing stages. This mimics the HVS’s ability to handle dynamic perceptual transitions; specifically, the nonlinear weight reinforcement in ASP Stage 1 is grounded in Stevens’ Power Law [49], simulating the discriminative selection behavior where perceptual response scales nonlinearly with conflict intensity. Second, the HCF module abstracts the HVS’s parallel neural pathways. Recognizing that low-level textures and high-level semantics contribute differently to quality, HCF employs independent fusion channels with stage-specific weights. This strategy, analogous to ensemble learning, ensures that complementary quality-aware attributes at various levels are preserved and integrated in a manner consistent with hierarchical visual processing.
It is important to clarify that ASP and HCF are functional abstractions rather than detailed biophysical simulations. While they capture essential perceptual behaviors such as rivalry dynamics and hierarchical fusion, they do not replicate the continuous adaptation, recurrent connectivity, or neuromodulatory dynamics of the biological visual system. To translate these biological insights into a computationally feasible model, we explicitly adopt several engineering approximations. For instance, we utilize the entropy of the Mean Subtracted Contrast Normalization (MSCN) coefficients [50] to capture local information complexity and perceptual uncertainty, avoiding the complexity of simulating stochastic neural spiking. Similarly, the hierarchy of three pathways serves as a discrete abstraction of the continuous processing across multiple layers of the cortex. Furthermore, a fixed backbone with shared weights based on the Swin Transformer is employed to ensure feature stability during training, which contrasts with the dynamic synaptic plasticity of biological networks. Regarding the mathematical implementation, the functional forms governing rivalry and fusion are chosen for their computational efficiency and are parameterized within perceptually plausible ranges derived from established psychophysical models. This hybrid approach allows the MSCE framework to leverage fundamental perceptual principles while remaining implementable for practical quality assessment tasks.
Building on these motivations, we introduce the MSCE framework to align deep learning architectures with the biological principles of binocular perception. MSCE formalizes the NR-SIQA task as a progressive and selective integration process, mapping a stereoscopic image pair to a perceptual quality score Q through a hierarchy of quality-aware representations.
As illustrated in Figure 1, the proposed framework adopts a stage-wise processing pipeline consisting of three synergistic components. The input stereo image is first processed by a shared hierarchical backbone based on the Swin Transformer Tiny (SwinT) architecture [26]. By utilizing weight-sharing across sequential stages, the backbone extracts a range of multi-stage feature representations, capturing quality-aware cues ranging from detailed texture integrity to high-level semantic structures. This hierarchical design enables the model to adapt to diverse distortion types by incorporating features at multiple levels of abstraction.
The core of the integration process centers on the HCF module, which operates under the guidance of the ASP strategy. Rather than employing fusion at a single stage, the HCF component implements three independent pathways that correspond to the distinct computational stages of the backbone. This architecture ensures that complementary perceptual attributes are preserved without mutual interference. Within these pathways, the fusion process is dynamically regulated by the ASP strategy. By implementing a nonlinear reinforcement logic grounded in MSCN-based entropy statistics, ASP adaptively modulates binocular weights by characterizing binocular inconsistencies. This strategy ensures that features critical for quality discrimination from the dominant view are accentuated while the influence of heavily distorted counterparts is adaptively attenuated, effectively simulating the dynamic gain control of the HVS.
Finally, the fused features from the three independent pathways are concatenated into a global quality descriptor. This aggregated representation is subsequently mapped to the final perceptual quality score through SVR. By synthesizing these complementary features, the MSCE framework ensures a comprehensive evaluation that accounts for both local structural details and global semantic content. The strategy of extracting deep features from multiple layers and mapping them via Global Average Pooling (GAP) and SVR has proven effective in several NR-IQA studies. For instance, Varga [43,44] demonstrated that multi-scale pooling of deep features can effectively capture varied distortion granularities. While our framework adopts a similar regression backbone, the core distinction lies in the introduction of the ASP strategy, which transcends such static aggregation by modeling the dynamic, rivalry-aware binocular interactions across hierarchical levels [6], moving beyond simple spatial scales to address the complex perception specific to stereoscopic vision.
3.3. Global Binocular Weighting Baseline
To establish a statistical foundation for modeling binocular rivalry and fusion, we derive global priors to estimate the relative perceptual reliability of each view. This baseline weighting mechanism serves as the reference for subsequent stage-wise propagation. We first apply the MSCN transform [50] to extract structural information from local luminance. For a pixel $(i,j)$ of a luminance image $I$, the normalized coefficient $\hat{I}(i,j)$ is defined as:
$$\hat{I}(i,j) = \frac{I(i,j) - \mu(i,j)}{\sigma(i,j) + \epsilon},$$
where $\epsilon$ is a stability constant. The local mean $\mu(i,j)$ and standard deviation $\sigma(i,j)$ are computed using a Gaussian window $\{w_{k,l}\}$:
$$\mu(i,j) = \sum_{k}\sum_{l} w_{k,l}\, I(i+k, j+l), \qquad \sigma(i,j) = \sqrt{\sum_{k}\sum_{l} w_{k,l}\, \big(I(i+k, j+l) - \mu(i,j)\big)^{2}}.$$
Since distortions typically disrupt the natural statistics of MSCN coefficients, we quantify this degradation through information entropy. The entropy $E$ is defined as:
$$E = -\sum_{i} p_i \log_2 p_i,$$
where $p_i$ represents the probability distribution of quantized MSCN coefficients. Given the entropies of the left and right views, $E_L$ and $E_R$, the baseline fusion weights $(w_L, w_R)$ are formulated as:
$$w_L = \frac{E_R}{E_L + E_R}, \qquad w_R = \frac{E_L}{E_L + E_R}.$$
This formulation assigns a higher weight to the view with lower statistical uncertainty, effectively identifying the dominant view that likely governs the initial stage of binocular rivalry. These priors grounded in entropy measures serve as a robust statistical foundation for the subsequent reinforcement of binocular rivalry within the ASP stage.
To empirically validate the proposed weighting mechanism, we conduct a representative case study using samples from the Waterloo-IVC 3D Phase II (WIVC-II) [51] dataset. As illustrated in Figure 2, the weights assigned to the pristine view exhibit a monotonic decrease as monocular degradation intensifies. This trend reveals a distortion-governing behavior in binocular perception: rather than simply prioritizing the high-quality channel, the human visual system is acutely sensitive to significant monocular artifacts, which effectively become the dominant determinant of the overall viewing experience. Severe distortions in one channel can thus dominate the overall quality perception, overriding the contribution of the intact signal. This effect is particularly pronounced for structural degradations (GB and JP2K), where the weight converges toward the distorted channel to reflect the resulting perceptual collapse. As the distortion level increases, the weight shift toward the distorted channel tracks the degradation severity, aligning with the observed quality scores. The MSCN entropy-based weights thus ensure that the loss of structural integrity is fully captured in the feature representation, accurately reflecting the perceptual effects observed in human vision.
3.4. Adaptive Selective Propagation (ASP) Strategy
Although the entropy-based global weights establish a robust global prior, applying them uniformly across all network stages ignores the dynamic nature of binocular integration. The HVS processes visual information hierarchically, where weak rivalry typically requires different integration strategies compared to strong rivalry. To simulate this physiological behavior, we propose the ASP strategy to deterministically propagate and reinforce the baseline weights across multi-level feature hierarchies. The ASP operates through three conceptual phases applied to the global rivalry intensity: Baseline Preservation, Rivalry-Aware Reinforcement, and Adaptive Smoothing. These adjusted weights directly govern the three parallel pathways within the HCF module, ensuring that quality-aware features are integrated through a “sharpen-then-smooth” logic: the weights are first intensified in Stage 1 to emphasize the dominant view, and subsequently regularized in Stage 2 to achieve a balanced and stable feature fusion.
We first define the binocular rivalry intensity $C$ to quantify the deviation from equilibrium implied by the baseline weights:
$$C = |w_L - w_R| = |2 w_L - 1|.$$
The rivalry intensity $C$, derived from the global entropy priors, provides a stage-invariant measure of the competitive landscape, ranging from weak competition characterized by ambiguous dominance ($C \to 0$) to strong competition with clear dominance ($C \to 1$). Leveraging this consistent competitive context, the ASP strategy generates stage-specific fusion weights $w_L^{(s)}$ for $s \in \{0, 1, 2\}$ through a structured three-phase process:
(1) Stage 0: Baseline Preservation. At the earliest stage, the system strictly adheres to the global statistical prior to maintain the fundamental quality assessment derived from MSCN entropy. This ensures that the low-level texture integrity in Path A is preserved based on the initial reliability estimate. The Stage 0 weights are defined as:
$$w_L^{(0)} = w_L, \qquad w_R^{(0)} = w_R.$$
(2) Stage 1: Rivalry-Aware Reinforcement. Recognizing that mid-level features benefit from decision sharpening, we introduce a nonlinear reinforcement factor $\gamma(C)$ that modulates the perceptual gain according to the competitive landscape:
$$\gamma(C) = 1 + \alpha\, C^{\beta}.$$
The weights for Stage 1 are subsequently computed as:
$$w_L^{(1)} = \mathrm{clip}\big(0.5 + \gamma(C)\,(w_L - 0.5),\; w_{\min},\; w_{\max}\big),$$
where the clipping operation ensures that the weights remain within the valid range $[w_{\min}, w_{\max}]$. Based on extensive empirical validation, the reinforcement amplitude $\alpha$ and the bounds are fixed, and the adaptivity factor is set to $\beta = 0.6$. Crucially, setting $\alpha > 0$ introduces a significant nonlinear gain to reinforce the dominance of the leading view. The inclusion of the adaptivity factor $\beta$ optimizes the response curvature; our sensitivity analysis indicates that $\beta = 0.6$ provides high sensitivity to binocular disparities, effectively simulating the binocular rivalry mechanism. While more aggressive reinforcement (e.g., a larger $\alpha$) can further polarize weights in extreme asymmetric cases, our experiments reveal that the proposed configuration achieves an effective balance between enhancing the perceptual sensitivity to binocular rivalry and preserving the structural stability required for symmetric fusion.
(3) Stage 2: Adaptive Smoothing. At the deepest semantic level, we apply a smoothing factor $\lambda$ to prevent excessive feature polarization in Path C, ensuring that global semantic consistency is maintained through perceptual integration:
$$w_L^{(2)} = \lambda\, w_L^{(0)} + (1 - \lambda)\, w_L^{(1)}.$$
The hyperparameter $\lambda$ is empirically set to 0.3 based on grid-search validation. This value strikes an optimal balance between the sharp reinforcement of monocular rivalry and the stable integration of binocular fusion features. The corresponding right-view weights are consistently defined as $w_R^{(s)} = 1 - w_L^{(s)}$.
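The three-phase propagation can be sketched as follows. The functional forms are one plausible reading of the description above (a power-law reinforcement gain followed by smoothing toward the baseline); $\beta = 0.6$ and $\lambda = 0.3$ follow the text, while the amplitude `alpha` and the clipping bounds `w_min`, `w_max` are hypothetical placeholders:

```python
import numpy as np

def asp_weights(w_base, alpha=1.0, beta=0.6, lam=0.3, w_min=0.05, w_max=0.95):
    """Propagate a baseline left-view weight through the three ASP phases.

    beta (adaptivity) and lam (smoothing) follow the text; alpha and the
    clipping bounds are illustrative placeholders, not the paper's values.
    """
    c = abs(2.0 * w_base - 1.0)                  # rivalry intensity C = |w_L - w_R|
    # Stage 0: baseline preservation
    w0 = w_base
    # Stage 1: rivalry-aware reinforcement (power-law gain, then clipping)
    gain = 1.0 + alpha * c**beta
    w1 = float(np.clip(0.5 + gain * (w_base - 0.5), w_min, w_max))
    # Stage 2: adaptive smoothing back toward the baseline
    w2 = lam * w0 + (1.0 - lam) * w1
    return w0, w1, w2
```

A symmetric input ($w_L = 0.5$, hence $C = 0$) passes through unchanged, while any asymmetry is sharpened in Stage 1 and then partially relaxed toward the baseline in Stage 2, reproducing the “sharpen-then-smooth” behavior.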
The mathematical formulation in Equations (9)–(12) serves as a computational proxy for the nonlinear gain control observed in biological binocular rivalry. The reinforcement function $\gamma(C)$ simulates discriminative selection by incorporating a power law term $C^{\beta}$. This design is motivated by Stevens’ Power Law ($\Psi = k\, I^{n}$) [49], in which the perceived magnitude $\Psi$ corresponds to the feature reinforcement weight, and the physical stimulus intensity $I$ corresponds to the contrast between views. This modeling choice is further supported by studies on visual contrast discrimination [52], which show that perceptual responses follow a power law relationship with stimulus intensity differences.
In the proposed model, the adaptivity factor $\beta$ regulates the curvature of the response function. Setting $\beta$ to 0.6 produces a convex profile in which the response gradient becomes steeper near the state of zero conflict, increasing sensitivity to subtle binocular discrepancies. As a result, even minor quality differences between the two views can trigger a clear shift toward the prioritized view. The bounds $w_{\min}$ and $w_{\max}$ define the dynamic range of the rivalry process, allowing the network to transition smoothly between discriminative selection and binocular fusion according to the clarity of the input signal.
To validate the proposed ASP strategy, we perform a detailed analysis of its response dynamics. As illustrated in Figure 3, the mapping functions from the baseline entropy weight $w_L$ to the propagated weights reveal three critical behaviors. In the “Weak Competition” zone ($w_L \approx 0.5$), the Stage 1 curve (red solid line) exhibits a highly aggressive gradient compared to the Stage 0 baseline. This intensified nonlinear effect confirms that ASP performs decision sharpening, effectively amplifying subtle inter-view quality differences to facilitate the identification of dominant visual cues. As the input weight $w_L$ moves toward the extrema, the enhancement effect quickly reaches saturation. This ensures that for structurally evident asymmetric distortions, the strategy effectively reinforces the perceived dominance of the superior view by adhering to the established statistical priors. Furthermore, the Stage 2 curve (blue dash-dotted line) demonstrates an adaptive smoothing behavior, residing between the sharpened Stage 1 output and the baseline to provide essential regularization. This multi-stage evolution prevents over-fitting to noisy estimations and ensures a stable optimization landscape.
To verify this strategy in real-world scenarios, we visualized the weight distribution after Stage 1 propagation ($w_L^{(1)}$) for 460 image pairs from the WIVC-II dataset. As shown in Figure 4, the empirical data distribution closely aligns with these optimized response curves. First, symmetric samples (green circles, $w_L \approx 0.5$) are accurately captured by the steepest region of the gradient, confirming active decision sharpening for ambiguous inputs. Second, asymmetric samples (blue squares) cluster in the dominance preservation zones. This data-driven evidence suggests that the proposed ASP module effectively switches between sharpening and dominance-preservation modes in accordance with the competitive intensity of realistic stereoscopic content.
To further investigate the internal dynamics of the ASP strategy, we conduct a representative case study using samples from the WIVC-II [51] dataset. To isolate the impact of distortion-driven competition, all selected samples originate from the same scene (“CraftLoom”), covering symmetric and various asymmetric scenarios.
The results in Table 1 confirm the precision of the ASP strategy. For the symmetric sample (G2–G2), the weights remain near 0.5, indicating that the scheme maintains integrative consistency when competition is balanced. In contrast, for the asymmetric case (Re–J2), where the disparity between views is significant, the weights are nonlinearly modulated to capture the impact of the distorted channel. In Stage 1, the weight for the pristine view is reduced below its baseline value, reflecting the increased perceptual dominance of the distorted right view (J2) in anchoring the overall quality. In Stage 2, this is followed by a slight smoothing adjustment of the pristine-view weight, which ensures numerical stability while maintaining the distorted signal as the primary determinant of the final score. For cases with subtle binocular disparity (e.g., W2–W4), the strategy effectively sharpens the influence of the perceptually dominant view. This stage-wise evolution validates our proposed “sharpen-then-smooth” logic, where the initial reinforcement of competitive signals is followed by an adaptive smoothing process, ensuring robust feature integration across diverse distortion landscapes.
3.5. Hierarchical Complementary Fusion (HCF)
The HCF module serves as the structural core of the MSCE framework, designed to integrate binocular information across three independent and parallel pathways. By modeling the fusion process as stage-wise operations, HCF ensures that quality-aware features at different abstraction levels are preserved and modulated without mutual interference. Crucially, rather than relying on static aggregation, the HCF module operates under the deterministic guidance of the ASP-refined weights $w_L^{(s)}$, which adaptively govern the contribution of each view based on the competitive landscape.
These three pathways correspond directly to the ASP conceptual phases: Path A implements the baseline preservation phase by integrating Stage 0 features to preserve low-level texture integrity using the baseline weights $w_L^{(0)}$; Path B realizes the competition-aware reinforcement phase by fusing Stage 1 features guided by the nonlinearly reinforced weights $w_L^{(1)}$ to model the binocular rivalry mechanism, where dominant structural cues are adaptively accentuated through a selective dominance logic to ensure that quality-discriminative information prevails over distorted counterparts; and Path C executes the adaptive smoothing phase by merging Stage 2 semantic representations using the regularized weights $w_L^{(2)}$ to maintain global scene consistency through integrative consistency. The fused feature map $F^{(s)}$ for each stage $s \in \{0, 1, 2\}$ is formally expressed as:
$$F^{(s)} = w_L^{(s)}\, F_L^{(s)} + w_R^{(s)}\, F_R^{(s)},$$
where $F_L^{(s)}$ and $F_R^{(s)}$ denote the stage-$s$ feature maps of the left and right views.
To further optimize these initial representations, each pathway incorporates a complementary feature refinement stage consisting of residual convolutional layers and a local attention strategy. This unit re-calibrates the feature response $F^{(s)}$ based on the stage-specific receptive field to extract discriminative quality cues while mitigating potential fusion inconsistencies. The resulting refined maps $\tilde{F}^{(s)}$ are then transformed into compact global descriptors $f^{(s)}$ via GAP:
$$f^{(s)} = \mathrm{GAP}\big(\tilde{F}^{(s)}\big).$$
Finally, these descriptors are concatenated to form the integrated quality-aware representation:
$$f = \big[f^{(0)};\, f^{(1)};\, f^{(2)}\big].$$
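The per-pathway fusion and pooling can be sketched as follows, assuming weighted summation of the left/right stage features as described above; the refinement unit is omitted for brevity, and the channel dimensions are illustrative:

```python
import numpy as np

def hcf_descriptor(feats_left, feats_right, weights_left):
    """Fuse per-stage binocular features and pool them into one descriptor.

    feats_left / feats_right: lists of three (C_s, H_s, W_s) stage feature maps;
    weights_left: the three ASP left-view weights (Stage 0, 1, 2).
    The complementary refinement unit is omitted in this sketch.
    """
    descriptors = []
    for f_l, f_r, w_l in zip(feats_left, feats_right, weights_left):
        fused = w_l * f_l + (1.0 - w_l) * f_r        # weighted binocular fusion
        descriptors.append(fused.mean(axis=(1, 2)))  # GAP over spatial dims -> (C_s,)
    return np.concatenate(descriptors)               # integrated quality-aware vector
```

The concatenated vector keeps the three abstraction levels side by side rather than averaging them, so low-level texture cues and high-level semantic cues reach the regressor without mutual interference.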
This multi-stage ensemble vector encapsulates a wide spectrum of binocular interactions, ranging from coherent fusion in symmetric scenarios to the selective dominance induced by asymmetric rivalry. The final quality score $Q$ is regressed through an SVR module with a Radial Basis Function (RBF) kernel. The SVR effectively models the nonlinear relationship between multi-stage features and human subjective perception, while ensuring robust generalization on the typically small-sample datasets characteristic of SIQA tasks.
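The regression step can be illustrated with a lightweight stand-in: kernel ridge regression with the same RBF kernel. A full ε-SVR would normally come from a dedicated solver (e.g., LIBSVM or scikit-learn); the kernel width and regularization values below are illustrative, not the paper’s tuned settings:

```python
import numpy as np

def rbf_kernel(A, B, gamma=2.0):
    """RBF (Gaussian) kernel matrix between row-vector sets A (n, d) and B (m, d)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

class RBFRegressor:
    """Kernel ridge regression with an RBF kernel: a simple stand-in for the
    epsilon-SVR used in the paper (same kernel, closed-form solver)."""
    def __init__(self, gamma=2.0, reg=1e-6):
        self.gamma, self.reg = gamma, reg
    def fit(self, X, y):
        self.X = X
        K = rbf_kernel(X, X, self.gamma)
        # Dual coefficients from the regularized normal equations
        self.alpha = np.linalg.solve(K + self.reg * np.eye(len(X)), y)
        return self
    def predict(self, X):
        return rbf_kernel(X, self.X, self.gamma) @ self.alpha
```

As in the paper’s setup, the nonlinearity lives entirely in the kernel, which is why such regressors generalize reasonably on the small-sample datasets typical of SIQA.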