SCISA-Net: Scene-Constrained Inverse-to-Subband Attention for Semantic Inference from Wall-Mediated Indirect Observations

Dai, Jihao; Qin, Hongshuai; Li, Guowen; Liu, Jin; Zhang, Xiaoshuai; Qi, Huiyu; Zheng, Zhiwen; Huang, Xingru

doi:10.3390/photonics13060575


by Jihao Dai ¹, Hongshuai Qin ², Guowen Li ¹, Jin Liu ³, Xiaoshuai Zhang ⁴, Huiyu Qi ³, Zhiwen Zheng ³,* and Xingru Huang ³,*

¹ HDU-ITMO Joint Institute, Hangzhou Dianzi University, Hangzhou 310018, China

² Information Engineering College, Hangzhou Dianzi University, Hangzhou 310018, China

³ School of Communication Engineering, Hangzhou Dianzi University, Hangzhou 310018, China

⁴ Faculty of Information Science and Engineering, Ocean University of China, Qingdao 266404, China

* Authors to whom correspondence should be addressed.

Photonics 2026, 13(6), 575; https://doi.org/10.3390/photonics13060575

Submission received: 27 April 2026 / Revised: 31 May 2026 / Accepted: 7 June 2026 / Published: 11 June 2026

(This article belongs to the Special Issue AI for Photonics: Intelligent Imaging, Learning-Driven Optics, and Photonic Computing)


Abstract

We study whether the semantic category of a hidden display terminal can be inferred from a wall-mediated indirect observation when the display remains outside the camera field of view under a controlled and calibrated scene configuration. This setting provides a security-motivated feasibility test for indirect optical semantic leakage, but it remains challenging for two reasons. First, indirect propagation makes the wall pattern dominated by the occluder contour, while category-bearing evidence survives only as weak radiometric variations, making stable extraction difficult. Second, even after front-end recovery, low-frequency support is relatively stable, whereas the mid- and high-frequency details required for class separation remain weak and distortion-prone; as a result, the classifier may drift toward dominant but weakly informative coarse-grained patterns and fail to consistently accumulate fine-grained discriminative cues. We propose SCISA-Net, which combines scene-constrained inversion with multi-stage Haar-subband attention to reorganize indirect observations, compensate residual feature degradation, and aggregate class-relevant subband evidence. Experiments on a paired 31-class benchmark show stable recognition, robustness to illumination attenuation and ambient background interference, matched scene-operator re-parameterization capability, and clear degradation when key inverse or subband components are disrupted. These results support the feasibility of category-level semantic inference from calibrated wall-mediated indirect observations.

Keywords:

wall-mediated indirect observation; hidden display terminal; indirect optical semantic leakage; calibrated non-line-of-sight recognition; scene-constrained inversion; semantic inference; Haar-subband attention; subband-aware representation learning

1. Introduction

In security-relevant or access-controlled indoor spaces, display light that is blocked from direct view may still leave faint indirect traces on nearby surfaces after propagation through the surrounding geometry [1]. Such traces are typically weak, spatially mixed, and strongly dependent on the scene configuration; nevertheless, they motivate a focused feasibility question for wall-mediated non-line-of-sight semantic recognition: whether the semantic category carried by a hidden visual cue can be inferred when only the wall-borne optical response is available. Motivated by this question, we study a controlled and calibrated scenario in which a locally connected terminal presents a brief gesture cue to an on-site user, while a camera on the other side of a visual barrier records only the indirect intensity pattern formed on the wall after that cue is modulated by an occluder. In our implementation, this cue is instantiated as a gesture image displayed on a local screen. The experiments are conducted in a light-isolated dark room divided into two regions by a vertical partition, as shown in Figure 1, where the annotated Cartesian axes indicate the coordinate frame used for the calibrated scene parameterization. The display on the screened side is the only active light source in the scene; its emitted light is first modulated by an occluder and then projected onto the opposing wall, while the camera records only the resulting wall-borne intensity pattern. Because the surrounding walls and the partition jointly block all direct sightlines, the screen itself never enters the camera field of view. Under this geometry, the recorded wall image is mainly governed by the occluder contour, whereas the signal carried by the hidden screen remains only as weak intensity variations embedded in that pattern [2]. 
The security relevance of this geometry serves as the motivating context of this study, and the technical scope of the paper is a calibrated feasibility investigation of wall-mediated indirect semantic inference under the specified scene configuration.

According to the observation modality and the stage at which hidden information is recovered or discriminated, existing studies related to this problem can be broadly divided into two categories: indirect optical hidden scene recovery and recognition [3,4], and frequency- or subband-aware visual representation learning [5,6]. The first category studies how hidden geometry, appearance, or semantic content can be inferred from relay surface measurements under non-line-of-sight or indirect-view conditions [3,7,8]. O’Toole et al. proposed confocal non-line-of-sight imaging based on the light cone transform, which converts time-resolved transient measurements into hidden object reconstructions under a confocal scanning geometry [9]. Lindell et al. introduced a wave-based NLOS imaging method using fast f–k migration, which reformulates hidden scene reconstruction through a frequency domain wave propagation model [10]. Liu et al. proposed a phasor field virtual wave framework for NLOS imaging, showing that hidden scenes can be reconstructed by modeling transient light transport as virtual wave propagation [11]. Subsequently, Liu et al. proposed a Bayesian NLOS framework without requiring fixed illumination and detection patterns, and introduced a signal–object collaborative regularization strategy for reconstructing albedo and surface normals under more general relay settings [12]. Czajkowski and Murray-Bruce further showed that a full-colour three-dimensional hidden scene can be recovered from an ordinary indirect photograph by exploiting two scene edges as angular encoders [2]. Chen et al. proposed a long-range NLOS imaging method based on projected images from different light fields and deep learning, and Tian et al. introduced a sparse Bayesian learning front-end to enhance transient fidelity before geometric inversion under photon-starved NLOS conditions [13,14]. 
These studies indicate that hidden scene information may remain recoverable or discriminable after indirect optical propagation [3,4,7,8,15,16]; however, most existing NLOS-oriented studies are designed for hidden scene reconstruction, localization, transient enhancement, or protocol-specific passive recognition, and their sensing modalities, output objectives, and data protocols are not directly aligned with the single-wall-image 31-class semantic inference benchmark considered here.

The second category, frequency- and subband-aware visual representation learning, studies how discriminative cues can be preserved or emphasized when informative details are unevenly distributed across spectral components [5,6,17]. Li et al. proposed WaveCNets, which replace conventional downsampling operators with discrete wavelet transform and inverse wavelet transform, so that low- and high-frequency components can be explicitly maintained during feature propagation [18]. Qin et al. reinterpreted channel attention from a frequency analysis perspective and proposed FcaNet, showing that global average pooling can be viewed as a special case of frequency domain decomposition and extending it to multi-spectral channel attention [19]. Rao et al. proposed GFNet, which replaces self-attention with a global filter layer implemented by Fourier transform, learnable spectral filtering, and inverse Fourier transform [20]. Guibas et al. introduced adaptive Fourier neural operators for efficient token mixing in the spectral domain [21], and subsequent frequency-aware recognition models further incorporated DCT, Fourier, or wavelet domain operations into visual classification pipelines [22,23,24]. These methods show that explicit spectral decomposition or filtering can provide useful mechanisms for separating coarse structural support from fine-detail responses [18,19,20,21,22,23,24]. Nevertheless, they are mostly developed for ordinary line-of-sight visual recognition, where the input image already contains directly observable object structure. They do not explicitly address the preceding scene-constrained inverse recovery problem required when semantic evidence has first been diluted, low-pass-biased, spatially overlapped, and entangled with occluder-induced interference through indirect optical propagation.

Two challenges arise in this setting. The first is that light from the hidden screen must traverse a long and highly lossy physical path before it is recorded by the camera [25]. It originates from the hidden screen, is reshaped by the occluder, reaches the wall, and is captured by the camera only after diffuse reflection. As a result, the observed wall pattern is dominated by the global contour introduced by the occluder, whereas the class-related evidence from the hidden symbol is preserved only as weak radiometric variations distributed across space [2]. The useful signal is therefore diluted, exhibits a low-pass bias, and becomes partially entangled with scattering-related artifacts [2,25], which makes the class-relevant cues difficult to extract in a stable manner. The second challenge is that, even if the coarse structure of the hidden screen pattern can be partially recovered, discriminative evidence remains unevenly distributed across representation levels. Low-frequency information is more stable and easier to preserve, but it mainly conveys overall shape; by contrast, the subtle differences that distinguish symbol classes are more likely to reside in weaker mid- and high-frequency details [5,6,26]. As a result, the classifier tends to drift toward dominant yet weakly informative coarse-grained patterns rather than reliably accumulating class-relevant fine-grained cues during feature learning [27].

To address these challenges, we propose the Scene-Constrained Inverse-to-Subband Attention Network (SCISA-Net), a two-stage architecture for semantic inference from wall-mediated indirect observations. Its first component, the Scene-Constrained Inversion Module (SCIM), treats the recorded wall pattern as the output of a scene-constrained transport process, applies Scene-Aware Regularized Inverse Encoding to reorganize diluted class evidence under the guidance of the physical operator, suppresses unstable spectral components and scattering-induced artifacts through regularized compensation, and produces a coarse representation that is more consistent with the hidden screen content. Its second component, the Multi-Stage Haar-Subband Attention Network (MS-HSANet), operates on this compensated representation, explicitly separates low-frequency structure from weaker mid- and high-frequency cues through Haar-subband decomposition, adaptively reweights them over stages with attention, and progressively aggregates the mid- and high-frequency fine-grained evidence required for class discrimination. By coupling scene-constrained inverse compensation with subband-aware discriminative learning, SCISA-Net is designed to test the central feasibility question of this work: whether the semantic content carried by a hidden display can still be inferred from an occlusion-dominated wall pattern when the screen itself never enters view.

The contributions of this work can be summarized in four points. First, we propose SCISA-Net for semantic inference from wall-mediated indirect observations, which couples scene-constrained inversion with subband-aware discriminative learning, and we verify the feasibility of inferring class semantics when the hidden display never enters the field of view. Second, we introduce the Scene-Constrained Inversion Module (SCIM), which handles physically degraded and noise-entangled observations through scene-aware regularized inverse encoding and feature compensation, and reorganizes weak class evidence into a representation more consistent with the hidden screen content. Third, we introduce the Multi-Stage Haar-Subband Attention Network (MS-HSANet), which addresses the imbalance of discriminative fidelity across frequency bands through multi-stage Haar-subband decomposition and attention reweighting, and progressively aggregates the fine-grained cues required for stable class discrimination. Fourth, we construct a paired 31-class wall-mediated indirect observation benchmark based on the RGB Arabic Alphabets Sign Language Dataset, conduct baseline, ablation, robustness, ambient background, and matched scene re-parameterization experiments on this benchmark, and provide empirical support for controlled semantic inferability under calibrated wall-mediated indirect-view conditions.

2. Materials and Methods

2.1. Experimental Setup

This subsection specifies the experimental setting used for evaluating semantic inference from wall-mediated indirect observations. We first define the classification task under the indirect-view optical geometry, then describe the source domain symbol set and the paired indirect observation benchmark constructed from it, and finally introduce the evaluation metrics used for quantitative assessment.

Task Definition. The goal of this study is to evaluate whether the semantic category carried by a hidden display terminal can be inferred from wall-mediated indirect observations under a calibrated indirect-view optical geometry. In this setting, the camera does not directly observe the display screen, but performs inference based on the illumination pattern formed after occluder modulation and wall reflection. The task is formulated as a 31-class image-level classification problem, which is used to assess whether SCISA-Net can recover stable class-discriminative evidence from contour-dominated indirect observations. Except for the scene re-parameterization experiment, all quantitative evaluations keep the calibrated baseline geometry and its corresponding scene information encoding operator unchanged. In the degraded observation experiments, only the wall-mediated indirect observations are perturbed, while the underlying scene geometry and operator remain unchanged.
Dataset and Benchmark Construction. We use the public RGB Arabic Alphabets Sign Language (AASL) dataset as the source domain symbol set. According to the original dataset paper, AASL is a fully labeled RGB image dataset for Arabic sign language alphabet classification, collected from more than 200 participants under diverse conditions, including variations in lighting, background, orientation, image size, and image resolution, and released on Kaggle [28]. Based on this source domain symbol set, we construct a paired wall-mediated indirect observation benchmark following the controlled indirect optical acquisition geometry described in the Introduction and modeled below. In our experimental version, both the source domain symbol set and the paired indirect observation benchmark contain 7855 PNG images from 31 classes, with 6271 images for training and 1584 images for validation. Both datasets use image-level single-label annotations. The train/validation partition is kept strictly aligned across the two domains, and all indirect observation images are stored at a native resolution of $128 \times 128$ .
Scene Calibration. To improve the reproducibility of the controlled indirect-view experiment, the display plane, imaging wall, camera field of view, and occluder geometry were explicitly parameterized in a three-dimensional Cartesian coordinate system. Following the coordinate frame annotated in Figure 1, the x-axis denotes the horizontal direction in the scene, the z-axis denotes the vertical direction, and the y-axis denotes the depth direction from the effective display image plane toward the imaging wall. The effective display image plane was set as $y = 0$ , and the imaging wall was located at $y = 0.95 m$ . The active display region was calibrated as a $0.40 m \times 0.40 m$ square region, while the valid wall imaging region was fixed as a $0.375 m \times 0.375 m$ field of view. The occluder was modeled as an opaque calibrated geometric body placed between the display region and the imaging wall, and the camera position, focus, and wall-region cropping were kept fixed during data acquisition. The detailed scene parameters, calibration procedure, occluder geometry, and wall reflectance assumption are provided in the section “Experimental Scene Parameters” of the Supplementary Materials. The scene information encoding operator used by SCIM is constructed from this calibrated configuration. Accordingly, the use of $A$ assumes consistency between the physical setup, the calibrated scene description, and the operator used during SCIM inference. In the scene re-parameterization experiment, the scene parameters and $A$ are updated together before inference.
Evaluation Metrics. We report Precision, Recall, F1 Score, Accuracy, AUC Score, Cohen’s $κ$ , Brier score loss, g-mean, and specificity. For Precision, Recall, F1 Score, Accuracy, AUC Score, Cohen’s $κ$ , g-mean, and specificity, higher values indicate better performance, whereas a lower Brier score loss indicates better probabilistic calibration. We adopt macro-F1 as the primary model-selection criterion, since it provides a more balanced assessment of class-wise discrimination than overall accuracy in this 31-class setting [29]. All quantitative results in this paper are reported on the held-out validation split.
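
As an illustration of how the macro-averaged quantities listed above can be derived from a confusion matrix, the following minimal sketch may be useful. The function names are hypothetical, and the per-class g-mean (square root of recall times specificity, then macro-averaged) is an assumption, since the text does not state which multi-class g-mean variant is used.

```python
import numpy as np

def safe_div(num, den):
    """Element-wise num / den, returning 0 where the denominator is 0."""
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def macro_metrics(conf):
    """Macro-averaged metrics from a K x K confusion matrix.

    Rows are true labels, columns are predicted labels.
    """
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp          # predicted as class k but wrong
    fn = conf.sum(axis=1) - tp          # class k samples that were missed
    tn = conf.sum() - tp - fp - fn
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    specificity = safe_div(tn, tn + fp)
    f1 = safe_div(2 * precision * recall, precision + recall)
    # Assumed g-mean variant: per-class sqrt(recall * specificity).
    gmean = np.sqrt(recall * specificity)
    return {
        "accuracy": float(tp.sum() / conf.sum()),
        "macro_precision": float(precision.mean()),
        "macro_recall": float(recall.mean()),
        "macro_f1": float(f1.mean()),
        "macro_specificity": float(specificity.mean()),
        "macro_gmean": float(gmean.mean()),
    }

# Toy 2-class example; the benchmark itself uses a 31 x 31 matrix.
metrics = macro_metrics([[2, 0], [1, 1]])
```

Because every class contributes equally to the macro average, a class that is rarely predicted correctly lowers macro-F1 even when overall accuracy stays high, which is the property that motivates its use for model selection here.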

2.2. Overall Framework

This work builds SCISA-Net as a two-component framework for the calibrated wall-mediated semantic inference task defined above. The front-end Scene-Constrained Inversion Module (SCIM) uses the calibrated scene information encoding operator to reorganize the wall-mediated observation into a source-oriented intermediate representation. The back-end Multi-Stage Haar-Subband Attention Network (MS-HSANet) then performs hierarchical subband-aware classification on the compensated representation. The overall architecture of SCISA-Net is illustrated in Figure 2.

In SCIM, Scene-Aware Regularized Inverse Encoding (SARIE) applies truncated Tikhonov-filtered inversion and TV-based refinement under the calibrated transport operator, followed by Multi-Scale Channel-Adaptive Feature Compensation (MSCAFC). In MS-HSANet, stage0 provides the stem feature, while stage1–stage3 stack 3, 4, and 6 Haar-Subband Attention Blocks (HSABs), respectively. The outputs of the stem and three HSAB stages are aggregated for the final 31-class prediction.

2.3. Scene-Constrained Inversion Module

In the present setting, the wall-mediated observation recorded by the camera is not a directly readable presentation of the hidden screen pattern. After scene-constrained transport, the dominant visual energy is governed by the occluder-induced coarse contour, whereas the category-related evidence of the hidden screen remains embedded as weak radiometric deviations inside the modulated light spot [2,30]. The role of the Scene-Constrained Inversion Module (SCIM) is therefore to reorganize this contour-dominated observation into a representation that is more consistent with the hidden-screen source domain before discriminative classification. As shown in Figure 3, SCIM contains two consecutive components: Scene-Aware Regularized Inverse Encoding (SARIE) and Multi-Scale Channel-Adaptive Feature Compensation (MSCAFC).

Let $X^{(c)} \in \mathbb{R}^{N_{s}}$ and $Y^{(c)} \in \mathbb{R}^{N_{w}}$ denote the vectorized hidden-screen pattern and wall-mediated observation of channel $c \in \{r, g, b\}$, respectively. Under the calibrated dark-room geometry, the forward relation is written in a scene-dependent linear transport form [30,31,32,33]

$$Y^{(c)} = A X^{(c)},$$

where $A \in \mathbb{R}^{N_{w} \times N_{s}}$ is the scene information encoding operator. This operator encodes the display–occluder–wall transport response under the calibrated geometry, including distance attenuation, surface orientation factors, occluder visibility, and display angular emission response. In our implementation, the effective hidden display and wall receiving regions are discretized into $64 \times 64$ emitting elements and $128 \times 128$ receiving samples, respectively, yielding $A \in \mathbb{R}^{128^{2} \times 64^{2}}$. The continuous transport model, visibility definition, display angular response, and numerical construction of $A$ are provided in the section “Construction of the Scene-Information Encoding Operator A” of the Supplementary Materials. The implementation-level details of SARIE and MSCAFC are provided in the section “Details of the Scene-Constrained Inversion Module” of the Supplementary Materials. Accordingly, SCIM assumes that the operator used during inference is consistent with the calibrated physical configuration.
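
To make the shape of the linear forward relation concrete, the sketch below builds a toy transport matrix between two parallel planes and applies it to a flat display pattern. It is only an illustrative assumption: it uses an unoccluded Lambertian kernel (distance attenuation and the two orientation cosines) with both regions centred on the depth axis, and it omits the occluder visibility and display angular response that the actual operator includes according to the Supplementary Materials. The function name and the reduced grid sizes are hypothetical.

```python
import numpy as np

def toy_transport_operator(n_s=8, n_w=16, display_side=0.40,
                           wall_side=0.375, depth=0.95):
    """Toy unoccluded kernel between a display plane (y = 0) and a wall (y = depth).

    Rows index the n_w * n_w wall samples, columns index the n_s * n_s display
    elements. Illustrative assumption only, not the paper's operator.
    """
    # Cell centres of each square region, assumed centred on the depth axis.
    s = (np.arange(n_s) + 0.5) / n_s * display_side - display_side / 2
    w = (np.arange(n_w) + 0.5) / n_w * wall_side - wall_side / 2
    sx, sz = np.meshgrid(s, s, indexing="ij")
    wx, wz = np.meshgrid(w, w, indexing="ij")
    dx = wx.reshape(-1, 1) - sx.reshape(1, -1)
    dz = wz.reshape(-1, 1) - sz.reshape(1, -1)
    r2 = dx ** 2 + dz ** 2 + depth ** 2
    # Both surface normals lie along y, so each cosine equals depth / r.
    cos_product = depth ** 2 / r2
    element_area = (display_side / n_s) ** 2
    return cos_product / (np.pi * r2) * element_area

A = toy_transport_operator()            # shape (16*16, 8*8)
X = np.ones(A.shape[1])                 # flat display pattern, one channel
Y = A @ X                               # wall observation, Y = A X
```

At the resolution reported in the text, $A$ would have $128^{2} \times 64^{2}$ entries, so the toy grids here are kept small purely for readability.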

SARIE performs regularized inverse encoding on this operator rather than directly classifying the wall observation. Since the wall-side signal is diluted, low-pass-biased, and spatially overlapped after indirect propagation, direct reciprocal inversion of the full operator spectrum would amplify unstable spectral tail components. Therefore, $A$ is decomposed as

$$A = \sum_{\ell = 1}^{r} \sigma_{\ell} u_{\ell} v_{\ell}^{\top} = U \Sigma V^{\top},$$

and a truncated Tikhonov-filtered inverse encoder is used [34,35]:

$$\hat{X}^{(c)} = R_{k, \lambda}(Y^{(c)}; A) = V_{1:k} \, \mathrm{diag}\left(\left\{\frac{\sigma_{\ell}}{\sigma_{\ell}^{2} + \lambda^{2}}\right\}_{\ell = 1}^{k}\right) U_{1:k}^{\top} Y^{(c)}.$$

The truncation keeps the inverse-coded representation within the scene-dominant transport subspace, while the Tikhonov shrinkage suppresses unstable amplification associated with small singular values. Unless otherwise specified, the truncation number and Tikhonov coefficient are fixed at $k_{0} = 1024$ and $\lambda_{0} = 0.5$ throughout the experiments. No sample-wise or SNR-dependent retuning is applied at inference time.
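
The truncated Tikhonov-filtered inverse described above can be sketched in a few lines of NumPy. This is a minimal illustration on a small random matrix rather than the calibrated operator; the function name is hypothetical, and the input is assumed to be a single vectorized channel.

```python
import numpy as np

def truncated_tikhonov_inverse(A, y, k, lam):
    """Apply V_{1:k} diag(sigma / (sigma^2 + lam^2)) U_{1:k}^T to a 1-D vector y."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, s.size)                          # cannot keep more modes than exist
    filt = s[:k] / (s[:k] ** 2 + lam ** 2)      # Tikhonov filter factors
    coeffs = U[:, :k].T @ y                     # project onto leading left vectors
    return Vt[:k].T @ (filt * coeffs)           # map back to the source domain

# Toy check on a well-conditioned random system (not the scene operator).
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))
x_true = rng.normal(size=4)
y = A @ x_true
x_exact = truncated_tikhonov_inverse(A, y, k=4, lam=0.0)   # plain pseudo-inverse
x_shrunk = truncated_tikhonov_inverse(A, y, k=4, lam=0.5)  # damped estimate
```

With $\lambda = 0$ and all modes kept, the encoder reduces to the pseudo-inverse; a positive $\lambda$ shrinks every spectral coefficient, most strongly for small singular values, and a smaller $k$ discards the spectral tail altogether. In practice the SVD of a fixed $A$ would be computed once and reused for every sample and channel.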

The inverse-coded RGB channels are concatenated as

$$\hat{X} = \mathrm{Cat}(\hat{X}^{(r)}, \hat{X}^{(g)}, \hat{X}^{(b)}).$$

Because contour-coupled scattering and spectral truncation may still introduce ringing, stripe-like residuals, and sparse anomalous gradients, SARIE further applies TV-based variational refinement [36,37,38]:

$$X^{\star} = \arg\min_{Z} \sum_{c \in \{r, g, b\}} \left[\frac{1}{2} \|Z^{(c)} - \hat{X}^{(c)}\|_{2}^{2} + \beta \sum_{i, j} \left(|Z_{i, j + 1}^{(c)} - Z_{i, j}^{(c)}| + |Z_{i + 1, j}^{(c)} - Z_{i, j}^{(c)}|\right)\right].$$

This refinement contracts high-variation residuals while preserving the inverse-encoded structure needed by the downstream classifier.
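
The anisotropic TV objective above is separable across channels, so it can be illustrated on a single channel. The sketch below is not the solver used in the paper, which is not specified in this section; it swaps in plain gradient descent on a smoothed surrogate in which each absolute difference $|d|$ is replaced by $\sqrt{d^{2} + \varepsilon^{2}}$. The values of `beta`, `eps`, and `steps` are hypothetical.

```python
import numpy as np

def tv_objective(Z, X_hat, beta, eps=0.0):
    """Data term plus anisotropic TV; eps > 0 gives the smoothed surrogate."""
    dh = Z[:, 1:] - Z[:, :-1]                   # horizontal differences
    dv = Z[1:, :] - Z[:-1, :]                   # vertical differences
    tv = np.sqrt(dh ** 2 + eps ** 2).sum() + np.sqrt(dv ** 2 + eps ** 2).sum()
    return 0.5 * np.sum((Z - X_hat) ** 2) + beta * tv

def tv_refine(X_hat, beta=0.05, eps=0.05, steps=200):
    """Gradient descent on the smoothed objective for one channel."""
    # The gradient is Lipschitz with constant at most 1 + 8 * beta / eps,
    # so a step of 1 / L never increases the smoothed objective.
    step = 1.0 / (1.0 + 8.0 * beta / eps)
    Z = X_hat.copy()
    for _ in range(steps):
        dh = Z[:, 1:] - Z[:, :-1]
        dv = Z[1:, :] - Z[:-1, :]
        gh = dh / np.sqrt(dh ** 2 + eps ** 2)
        gv = dv / np.sqrt(dv ** 2 + eps ** 2)
        grad = Z - X_hat                        # gradient of the data term
        grad[:, 1:] += beta * gh                # adjoint of the difference operators
        grad[:, :-1] -= beta * gh
        grad[1:, :] += beta * gv
        grad[:-1, :] -= beta * gv
        Z = Z - step * grad
    return Z

# Toy input: a piecewise-constant square corrupted by noise.
rng = np.random.default_rng(0)
clean = np.zeros((16, 16))
clean[4:12, 4:12] = 1.0
noisy = clean + 0.1 * rng.normal(size=clean.shape)
refined = tv_refine(noisy)
```

A proximal or primal-dual solver would handle the non-smooth term exactly; the smoothed version is used here only because it keeps the example short.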

Although $X^{\star}$ is more aligned with the hidden-screen domain than the raw wall observation, regularized inversion can still leave residual texture blunting, truncated high-frequency deficiency, and chromatic mismatch across independently inverted channels. MSCAFC therefore performs feature domain compensation after SARIE. It first lifts $X^{\star}$ through a $3 \times 3$ convolution, processes the lifted feature with two multi-branch recalibration blocks, and adds a $1 \times 1$ residual bypass:

$$F_{SCIM} = C_{2}(C_{1}(\phi_{3 \times 3}(X^{\star}))) + \phi_{1 \times 1}(X^{\star}).$$

Each compensation block $C_{q}$ uses parallel receptive-field branches followed by channel-adaptive recalibration, following the principle of multi-scale branch aggregation and channel-wise feature reweighting [39,40]. The detailed branch-level formulation is provided in the section “Details of the Scene-Constrained Inversion Module” of the Supplementary Materials. In this way, SCIM keeps the main pipeline focused on three operations: scene-constrained inverse reorganization, posterior residual contraction, and feature domain compensation before the subsequent subband-aware classifier.

2.4. Multi-Stage Haar-Subband Attention Network

Although the Scene-Constrained Inversion Module (SCIM) reorganizes the wall-mediated observation into an intermediate representation that is more consistent with the hidden screen pattern, the resulting feature map $F_{SCIM}$ still exhibits frequency-dependent discriminative fidelity. After scene-constrained inversion and feature compensation, relatively stable low-frequency support is retained in the form of coarse contour continuity, principal region layout, and large-scale radiometric organization, whereas the semantic differences between hidden screen classes are more strongly tied to weaker mid- and high-frequency responses that carry edge-localized perturbations, subtle spatial-chromatic deviations, and fine texture transitions [41]. The methodological difficulty at this stage is therefore to prevent the classifier from relying mainly on globally dominant coarse support while overlooking finer class-dependent responses whose fidelity remains uneven after SCIM [27,42].

To address this difficulty, we introduce the Multi-Stage Haar-Subband Attention Network (MS-HSANet), which takes $F_{SCIM}$ as input and organizes the discriminative stream into one stem stage and three successive Haar-Subband Attention Block (HSAB) stages. The stem first contracts the compensated representation into a stable stage-wise carrier through a $7 \times 7$ convolution and max pooling. The three HSAB stages then use depths $(L_{1}, L_{2}, L_{3}) = (3, 4, 6)$ and channel dimensions $(C_{1}, C_{2}, C_{3}) = (128, 256, 512)$ to progressively propagate subband-aware discriminative evidence from shallower appearance-sensitive responses toward deeper semantic organization [43]. The architecture of MS-HSANet is illustrated in Figure 4.

Each HSAB consists of local convolutional mixing, Haar-prior depthwise subband decomposition, channel–spatial recalibration, and residual fusion. Specifically, the block first applies a $3 \times 3$ convolution to form a local feature carrier, and then uses grouped $2 \times 2$ depthwise filters initialized by the canonical $LL$, $LH$, $HL$, and $HH$ Haar subband prototypes [44]. These filters remain learnable during training and are only weakly anchored by the Haar regularization term, so the transform preserves a subband-oriented inductive bias while adapting to SCIM-compensated features. The resulting subband responses are fused by $1 \times 1$ projections and recalibrated by a CBAM-style channel–spatial attention operator [45]. This design allows the predominantly structural $LL$ response to provide coarse contextual support, while the weaker $LH / HL / HH$ responses can be selectively emphasized when they encode class-dependent deviations.
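
For reference, the canonical $2 \times 2$ Haar prototypes that serve as the initialization can be written out explicitly. The sketch below applies them to one channel over non-overlapping $2 \times 2$ blocks; the stride of 2 is an assumption, since the stride of the depthwise filters is not stated in this section, and the $LH$/$HL$ naming follows one common convention that differs between references.

```python
import numpy as np

# Orthonormal 2 x 2 Haar prototypes (each has unit Frobenius norm).
HAAR = {
    "LL": 0.5 * np.array([[1.0, 1.0], [1.0, 1.0]]),     # local average
    "LH": 0.5 * np.array([[1.0, 1.0], [-1.0, -1.0]]),   # change between rows
    "HL": 0.5 * np.array([[1.0, -1.0], [1.0, -1.0]]),   # change between columns
    "HH": 0.5 * np.array([[1.0, -1.0], [-1.0, 1.0]]),   # diagonal detail
}

def haar_subbands(x):
    """Decompose one (H, W) channel with even H and W into four (H/2, W/2) maps."""
    h, w = x.shape
    # blocks[i, j] holds the 2 x 2 patch whose top-left pixel is x[2i, 2j].
    blocks = x.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3)
    return {name: np.einsum("ijab,ab->ij", blocks, k) for name, k in HAAR.items()}

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
bands = haar_subbands(x)
```

Because the four prototypes form an orthonormal basis of each $2 \times 2$ patch, the decomposition preserves signal energy at initialization, and a locally flat region produces a response only in $LL$. Once the filters are trained they are no longer constrained to stay exactly orthonormal, only weakly pulled toward these prototypes.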

After the last block of each stage, MS-HSANet aggregates the stem feature and the outputs of the three HSAB stages by global average pooling:

$$z = \mathrm{Cat}(\mathrm{GAP}(G_{0}), \mathrm{GAP}(G_{1, L_{1}}), \mathrm{GAP}(G_{2, L_{2}}), \mathrm{GAP}(G_{3, L_{3}})) \in \mathbb{R}^{960}.$$

The concatenated descriptor is passed to a two-layer classifier with a hidden dimension of 1024 and a K-class softmax output. This cross-stage aggregation is used because frequency-selective cues are not confined to a single depth: earlier stages retain more local appearance-sensitive fluctuations, whereas deeper stages organize more abstract semantic configurations. The implementation-level equations for the stem, HSAB, Haar-prior depthwise convolution, CBAM-style recalibration, residual shortcut, and final classifier are provided in the section “Detailed Formulation of MS-HSANet” of the Supplementary Materials.
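
The 960-dimensional descriptor can be checked with a short shape-level sketch. The stage widths 128, 256, and 512 are taken from the text; the stem width of 64 is not stated explicitly and is inferred here from $960 - (128 + 256 + 512) = 64$. The spatial sizes and the function name are hypothetical.

```python
import numpy as np

def aggregate_stages(stage_maps):
    """Global-average-pool each (C, H, W) map and concatenate the results."""
    return np.concatenate([fm.mean(axis=(1, 2)) for fm in stage_maps])

channels = (64, 128, 256, 512)      # stem width 64 is inferred, see lead-in
spatial = (92, 92, 46, 23)          # hypothetical feature-map sizes
maps = [np.ones((c, s, s)) for c, s in zip(channels, spatial)]
z = aggregate_stages(maps)          # descriptor of length 64 + 128 + 256 + 512
```

Global average pooling removes the spatial dimensions, so the descriptor length depends only on the channel widths and not on the input resolution.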

2.5. Implementation Details

All experiments initialize SCISA-Net from scratch. Specifically, the filters in the Haar-prior depthwise convolution are initialized with the canonical $2 \times 2$ Haar bases corresponding to $LL$, $LH$, $HL$, and $HH$, and are weakly constrained during training by an $\ell_{2}$ regularization term toward these initial prototypes. All remaining convolutional and fully connected layers, including those in MSCAFC, the stage0 stem, the projection layers inside HSAB, CBAM, and the final classifier, are initialized with Kaiming normal initialization [46], while all batch-normalization layers are initialized with scale set to 1 and bias set to 0 [47].

The network input is resized to $368 \times 368$. The training objective is defined as

$$L = L_{CE} + \lambda_{haar} L_{haar},$$

where $L_{CE}$ denotes the standard cross-entropy loss for the 31-class semantic classification task, and $\lambda_{haar} = 10^{-4}$ is the coefficient of the $\ell_{2}$ regularization term imposed on the learnable Haar transform weights. This regularization term serves as a weak subband-structural prior, encouraging the HP-DConv filters to remain in a Haar-oriented decomposition regime while still allowing them to adapt during discriminative training. A sensitivity analysis of $\lambda_{haar}$ is provided in the “Haar Regularization as a Weak Subband Prior” section of the Supplementary Materials.
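
The structure of this objective can be illustrated numerically. In the sketch below the Haar term is taken to be the summed squared deviation of the learnable filters from their prototypes; whether the paper sums or averages this penalty is not stated here, so that reduction is an assumption, as are the function names.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy for integer labels, using a stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def haar_penalty(filters, prototypes):
    """Squared l2 distance to the Haar prototypes (summed; an assumed reduction)."""
    return float(np.sum((filters - prototypes) ** 2))

def total_loss(logits, labels, filters, prototypes, lam_haar=1e-4):
    return cross_entropy(logits, labels) + lam_haar * haar_penalty(filters, prototypes)

# Uninformative logits over 31 classes give a cross-entropy of ln(31).
logits = np.zeros((4, 31))
labels = np.array([0, 5, 17, 30])
prototypes = np.zeros((4, 2, 2))            # placeholder for the LL/LH/HL/HH kernels
loss_at_init = total_loss(logits, labels, prototypes.copy(), prototypes)
```

With $\lambda_{haar} = 10^{-4}$ the penalty is several orders of magnitude smaller than the classification term unless the filters move far from their prototypes, which is consistent with its description as a weak prior.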

We optimize the model using AdamW with an initial learning rate of $10^{-3}$ and a weight decay of $10^{-4}$, and adopt CosineAnnealingWarmRestarts with $T_{0} = 50$ and $T_{mult} = 2$ [48,49]. The batch size is 32 for both training and validation, and the maximum number of training epochs is 200.
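
The effect of $T_{0} = 50$ and $T_{mult} = 2$ over the 200-epoch budget can be traced with a small pure-Python sketch of the cosine warm-restart rule. It assumes a minimum learning rate of 0 and one scheduler step per epoch, neither of which is stated in the text; the function name is hypothetical.

```python
import math

def warm_restart_lr(epoch, base_lr=1e-3, t0=50, t_mult=2, eta_min=0.0):
    """Learning rate at a given epoch under cosine annealing with warm restarts."""
    t_i, t_cur = t0, epoch
    while t_cur >= t_i:             # walk through the completed cycles
        t_cur -= t_i
        t_i *= t_mult               # each new cycle is t_mult times longer
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))

schedule = [warm_restart_lr(e) for e in range(200)]
# A restart is any epoch where the learning rate jumps back up.
restarts = [e for e in range(1, 200) if schedule[e] > schedule[e - 1]]
```

Under these assumptions the cycles span epochs 0 to 49, 50 to 149, and 150 onward, so the learning rate returns to $10^{-3}$ at epochs 50 and 150, and the third cycle is cut short by the 200-epoch limit.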

During training, images are first resized to 368, and are then processed with RandomCrop(368), RandomHorizontalFlip(), RandomRotation(15), and ColorJitter(brightness = 0.2, contrast = 0.2). All samples are subsequently converted to tensors and normalized using mean = $[0.485, 0.456, 0.406]$ and std = $[0.229, 0.224, 0.225]$. During validation, only deterministic preprocessing is applied, namely, Resize(368), tensor conversion, and the same normalization. To stabilize optimization, we insert Dropout(0.5) before the classification head, apply gradient clipping with $\|\cdot\|_{2} \leq 1.0$, and perform validation every 10 epochs. We further adopt an early stopping strategy with validation macro-F1 as the monitoring criterion: training is terminated if macro-F1 fails to improve over three consecutive validation rounds, and the checkpoint achieving the best validation macro-F1 is retained as the final model.
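
The early stopping rule can be sketched as a small counter over validation rounds. The class name and the example scores are hypothetical, and treating a tie with the current best as a failure to improve is an assumption.

```python
class EarlyStopper:
    """Stop after `patience` consecutive validation rounds without improvement."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("-inf")
        self.best_round = None
        self.bad_rounds = 0

    def update(self, round_idx, macro_f1):
        """Record one validation result; return True when training should stop."""
        if macro_f1 > self.best:
            # New best: this is the round whose checkpoint would be kept.
            self.best, self.best_round, self.bad_rounds = macro_f1, round_idx, 0
        else:
            self.bad_rounds += 1
        return self.bad_rounds >= self.patience

# Made-up macro-F1 values, one per validation round (every 10 epochs).
scores = [0.40, 0.55, 0.60, 0.59, 0.60, 0.58, 0.70]
stopper = EarlyStopper(patience=3)
stopped_at = None
for round_idx, f1 in enumerate(scores):
    if stopper.update(round_idx, f1):
        stopped_at = round_idx
        break
```

With validation every 10 epochs, three rounds of patience correspond to 30 training epochs without a new best macro-F1. In the made-up sequence, training stops at round 5 and the checkpoint from round 2 is retained, so the later value of 0.70 is never reached.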

All experiments are implemented in PyTorch 2.4.0+cu121 and conducted on a single NVIDIA GeForce RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA). During inference, inputs follow the same deterministic preprocessing pipeline used in validation, with an input resolution of $368 \times 368$. The trained model is loaded from the saved best-performing .pth checkpoint for prediction.

3. Results

This section reports the experimental results on the paired 31-class wall-mediated indirect observation benchmark. Unless otherwise stated, all quantitative results are evaluated on the held-out validation split using the train/validation partition and evaluation protocol described in Section 2.1. The results are organized into overall baseline comparison, ablation study, robustness evaluation, scene re-parameterization evaluation, gesture-morphology-stratified evaluation, and attention feature visualization.

3.1. Overall Performance and Baseline Comparison

We benchmark SCISA-Net against three groups of representative competitors that can be trained and evaluated under the same 31-way wall-mediated indirect semantic inference protocol: conventional CNN backbones, including VGG, ResNetV2, DenseNet, RegNet, EfficientNet, and ConvNeXt [50,51,52,53,54,55]; transformer or hybrid vision backbones, including Swin, DaViT, and UniFormer [56,57,58]; and frequency-aware recognizers [19,20,21,22,23,59,60]. All compared direct baselines are initialized with pretrained weights provided by their corresponding official implementations, equipped with randomly initialized 31-class prediction heads, and fully fine-tuned on the same wall-mediated training split.

We further considered specialized NLOS recovery and recognition studies as potential references. However, a direct numerical comparison is not well-defined for the present benchmark because many existing NLOS-oriented methods rely on transient or scanning measurements, active illumination, task-specific passive measurement protocols, hidden scene reconstruction objectives, localization targets, or different hidden object datasets, rather than the single-image wall-observation-to-semantic-label setting used here. Therefore, Table 1 is intended as an in-protocol comparison among methods that can be trained on exactly the same input–output pairs. The scope of specialized NLOS baselines and the rationale for not mixing cross-protocol results are further discussed in the section “Scope of Specialized NLOS Recognition Baselines” of the Supplementary Materials.

To make the comparison reproducible, the baseline input preprocessing, data augmentation, optimizer, learning rate schedule, batch size, validation frequency, and checkpoint selection criterion are reported in the section “Baseline Comparison Settings” of the Supplementary Materials. A quantitative comparison is reported in Table 1.

The results show that SCISA-Net achieves more stable recognition performance than the compared in-protocol direct classification baselines under the present indirect-view setting. Generic CNN, transformer, hybrid, and frequency-aware classifiers remain limited on this task, whereas SCISA-Net reaches a macro-F1 of 0.7170 and an AUC of 0.9759 on the validation split. This comparison should not be interpreted as an exhaustive ranking over all NLOS sensing systems, because specialized NLOS methods often operate under different measurement modalities, reconstruction objectives, or dataset protocols. Rather, the result indicates that, when all compared models are constrained to the same wall-mediated single-image semantic inference protocol, directly applying ordinary visual classifiers to the wall observation is insufficient. The input image is not a directly observable semantic pattern but an occlusion-dominated indirect intensity distribution; therefore, scene-constrained inverse reorganization is required before discriminative classification.

Beyond the aggregate metrics, the ROC curves and confusion matrix further show that the recovered evidence is distributed across classes rather than concentrated on only a few easy categories. As shown in Figure 5 and Figure 6, the per-class ROC trajectories are largely close to the upper-left region, and the normalized confusion matrix exhibits a predominantly diagonal structure with sparse off-diagonal dispersion.

This behavior is consistent with the optical formation mechanism of the task. The wall observation is dominated by the occluder-carried coarse contour, while the hidden-screen semantics survive only as weak, low-pass, and spatially mixed radiometric deviations [2,25,30]. Direct classifiers are therefore easily driven toward strong but weakly informative contour energy. By contrast, SCISA-Net first reorganizes the observation through scene-constrained inverse encoding and feature compensation, and then separates low-frequency support from mid- and high-frequency detail responses for subband-aware discrimination.

3.2. Ablation Study

The ablation results are summarized in Table 2, the internal HSAB ablation within MS-HSANet is reported in Table 3, the stage depth analysis of the multi-stage HSAB hierarchy is reported in Table 4, and the inference time subband intervention analysis is reported in Table 5. These experiments examine whether the performance of SCISA-Net depends on the two proposed components, namely the Scene-Constrained Inversion Module (SCIM) and the Multi-Stage Haar-Subband Attention Network (MS-HSANet), rather than on a generic increase in network capacity or an unstructured latent shortcut.

Effect of SCIM. To isolate the role of SCIM, we compare the complete SCISA-Net with a variant that removes SCIM and feeds the raw wall-mediated observation directly into MS-HSANet, while keeping the train/validation split and optimization protocol unchanged. The full model improves macro-F1 from 0.0129 to 0.7170, raises AUC from 0.5206 to 0.9759, and markedly reduces the Brier score. This result shows that SCIM is not a dispensable pre-processing stage, but a necessary front-end for the present indirect semantic inference task. Without scene-constrained inversion, the back-end still receives a contour-dominated observation that is severely misaligned with the hidden-screen source domain, and the classifier remains close to chance level.

We further test whether SCIM can be replaced by generic front-end encoders by substituting it with representative CNN, MLP-based, and transformer-style modules, including ResNet-101, VGG19, RepVGG, MLP-Mixer, ResMLP, Swin-Transformer-V2-B, and ViT-B [50,51,61,62,63,64,65], while keeping the downstream MS-HSANet unchanged. All such replacements collapse to near-chance performance, with macro-F1 remaining only in the range of 0.0025–0.0056 and AUC staying around 0.498–0.518. This gap indicates that generic learned front-ends cannot substitute for the explicit scene-aware inverse reorganization provided by operator-constrained spectral inversion and posterior regularized refinement.

Effect of MS-HSANet. To evaluate the necessity of the back-end classifier, we remove MS-HSANet and attach a lightweight classifier head directly to the SCIM output, while keeping the front-end and training protocol unchanged. This SCIM-only variant reaches only 0.0306 macro-F1 and 0.5608 AUC, far below the full SCISA-Net with 0.7170 macro-F1 and 0.9759 AUC. This result shows that a physically improved intermediate representation is still not equivalent to a stable discriminative representation. After SCIM, low-frequency support is relatively more stable than mid- and high-frequency details, so a plain classifier head remains insufficient for stable 31-way discrimination.

We then replace MS-HSANet with representative frequency-aware classification networks, including GFNet, FCANet, FreqNet, Scattering-based Vision Transformer, AFNONet, FFCNet, and DCTNet [19,20,21,22,23,59,60], while fixing SCIM and all training settings. Although these alternatives are substantially stronger than removing the back-end entirely, none of them matches the original MS-HSANet. The best competing variants remain around 0.679–0.680 macro-F1, still below the 0.7170 achieved by the full model, and most also exhibit lower Cohen’s $κ$ and inferior calibration. This comparison indicates that the gain does not come from using a frequency-aware classifier in a generic sense, but from a structure matched to the feature characteristics of the SCIM output.

Effect of HSAB. Finally, we perform an internal ablation of MS-HSANet by replacing the stacked HSAB units in stage1–stage3 with parameter-matched alternatives, such as Inception-v2, Global Filter, gMLP, MobileViT, ShuffleNetV2, Ghost Bottleneck, ResNeXt blocks, and DenseNet bottlenecks [47,52,66,67,68,69,70,71]. Under this controlled setting, the original HSAB-based design remains the best performer, exceeding the strongest internal substitute by 4.25 percentage points in macro-F1. This result supports the design of HSAB, which explicitly decouples coarse and fine subband carriers and selectively emphasizes weak mid- and high-frequency evidence that encodes inter-class differences.

Effect of the multi-stage HSAB hierarchy. To directly examine whether MS-HSANet benefits from a hierarchical subband representation rather than from a single local Haar decomposition, we further conduct a stage depth ablation study, as reported in Table 4. In this experiment, the SCIM front-end, data split, preprocessing pipeline, and optimization protocol are kept unchanged, while only the number and stage-wise arrangement of HSAB units are modified. One_HSAB retains stage0 and places a single HSAB in stage1. Stage_01, stage_012, and stage_0123 progressively retain stage0–stage1, stage0–stage2, and stage0–stage3, respectively. HSAB_13 keeps the same total number of HSAB units as the complete model but places all 13 HSABs within stage1.

The results show a clear stage-wise improvement. The macro-F1 score increases from 0.2849 for One_HSAB to 0.5261 for stage_01, 0.6355 for stage_012, and 0.7170 for the complete stage_0123 setting. A similar trend is observed for AUC, which increases from 0.8819 to 0.9759 as the hierarchical stages are progressively introduced. Importantly, HSAB_13 reaches only 0.6006 macro-F1, although it uses the same total number of HSAB units as the complete model. This result indicates that the improvement cannot be explained merely by increasing the number of HSAB blocks. Instead, distributing HSABs across multiple stages is important because the local $2 \times 2$ Haar-prior decomposition is embedded into progressively deeper feature hierarchies with enlarged effective receptive fields and different semantic granularities. Therefore, MS-HSANet should be interpreted as a hierarchical multi-stage subband encoder rather than as a single-scale Haar decomposition module.

Inference time subband intervention in HSAB. To examine whether the trained classifier relies on structured subband evidence learned by HSAB, we conduct an inference time intervention analysis on the HP-DConv outputs, as reported in Table 5. The trained checkpoint is kept fixed throughout this analysis, and no retraining or parameter update is performed. Complete HP-DConv denotes the original setting in which the $LL$ , $LH$ , $HL$ , and $HH$ subbands are all preserved in their normal order. $LL$ retains only the low-frequency $LL$ branch; Remove $LL$ , Remove $LH$ , Remove $HL$ , and Remove $HH$ suppress the corresponding subband; and Random permutation averages the results over all 23 non-identity permutations of the four subband positions. This setting provides a controlled diagnostic test because the network parameters, input samples, and downstream classifier are unchanged, while the availability or positional consistency of the subband evidence is selectively perturbed.
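To make the intervention settings concrete, the sketch below applies them to a fixed orthonormal $2 \times 2$ Haar decomposition. This is only a stand-in: HP-DConv is a learned module whose implementation is not given here, the function names are hypothetical, "suppress" is assumed to mean zeroing the subband, and the $LH$/$HL$ naming convention varies between references.

```python
import itertools
import numpy as np

ORDER = ("LL", "LH", "HL", "HH")

def haar_subbands(x):
    """Split an (H, W) array with even H and W into four 2x2 Haar subbands."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    return {
        "LL": (a + b + c + d) / 2.0,  # local average (low-frequency support)
        "LH": (a + b - c - d) / 2.0,  # difference across rows (naming is an assumption)
        "HL": (a - b + c - d) / 2.0,  # difference across columns
        "HH": (a - b - c + d) / 2.0,  # diagonal difference
    }

def intervene(bands, keep_only=None, remove=None, order=ORDER):
    """Return a (4, H/2, W/2) stack whose channel slots follow `order`.

    keep_only: zero every subband except this one.
    remove:    zero this subband only.
    order:     which subband is placed in each of the LL/LH/HL/HH slots.
    """
    out = []
    for name in order:
        band = bands[name]
        if (keep_only is not None and name != keep_only) or name == remove:
            band = np.zeros_like(band)
        out.append(band)
    return np.stack(out)

# All reorderings of the four slots except the identity.
NON_IDENTITY_PERMS = [p for p in itertools.permutations(ORDER) if p != ORDER]

if __name__ == "__main__":
    x = np.random.default_rng(0).random((8, 8))
    bands = haar_subbands(x)
    print(len(NON_IDENTITY_PERMS))
    print(intervene(bands, remove="HH").shape)
```

The count of non-identity permutations is $4! - 1 = 23$, which matches the number of orderings averaged in the Random permutation setting.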

The complete HP-DConv setting obtains the best overall performance, with a macro-F1 of 0.7170 and an AUC of 0.9759. When only the $LL$ branch is retained, the macro-F1 decreases to 0.0121, indicating that the coarse low-frequency carrier alone cannot support the 31-class semantic decision. Suppressing $LL$ also leads to near-chance performance, with a macro-F1 of 0.0177, which shows that directional mid- and high-frequency responses need the low-frequency structural support to form a usable discriminative representation. Removing $LH$ or $HL$ further causes clear degradation, reducing the macro-F1 to 0.2228 and 0.0229, respectively. These results indicate that the final prediction depends on the coordinated use of low-frequency support and directional detail responses, rather than on a single dominant low-level cue.

The random permutation setting provides the most direct evidence against an unstructured shortcut interpretation. In this setting, all four subband responses are still retained, but their $LL$/$LH$/$HL$/$HH$ correspondence is deliberately disrupted before the subsequent projection and attention layers. The performance decreases to a macro-F1 of 0.0174 and an AUC of 0.5196, approaching the behavior of near-random discrimination. This collapse under subband-order perturbation indicates that the downstream classifier is sensitive to the semantic organization of the subband channels, instead of only exploiting arbitrary response magnitudes or other unstructured activation patterns. Therefore, the discriminative behavior of HSAB is tied to the intended ordered subband representation.

Removing $HH$ causes a comparatively mild decrease in macro-F1 from 0.7170 to 0.6953 and in AUC from 0.9759 to 0.9735. This suggests that, under the present geometry and dataset, the diagonal high-frequency branch is less dominant than the $LL$-supported $LH$/$HL$ directional responses, while the complete four-subband configuration still provides the strongest overall result.

3.3. Robustness Under Degraded Observation Conditions

To examine whether the semantic inference capability of SCISA-Net remains reliable when observation conditions are degraded, we perturb the wall-mediated indirect observations at inference time while keeping the trained checkpoint fixed. Three perturbation axes are considered: hidden-screen illumination attenuation, simulated observation noise, and ambient-light background interference on the wall observation.

Illumination attenuation. In our acquisition setup, the reference wall-mediated indirect observation is obtained under a $600 nit$ hidden-screen emission condition. We evaluate illumination robustness by progressively reducing the hidden-screen luminance from $600 nit$ to $300 nit$ while keeping the trained checkpoint fixed. This perturbation is physically meaningful because a decrease in screen emission directly compresses the photon budget that survives occluder modulation, reaches the wall, and is finally captured by the camera [72]. The corresponding quantitative results and representative attenuated observations are shown in Table 6 and Figure 7, respectively.

Table 6. Quantitative evaluation of SCISA-Net under progressive hidden-screen emission luminance attenuation on the wall-mediated indirect semantic inference task. Hidden-screen luminance is reduced from $600 nit$ (100%) to $300 nit$ (50%), and validation performance is reported as mean ± standard deviation in terms of Precision, Recall, Macro-F1, Accuracy, AUC, Cohen’s $κ$, Brier score, G-mean, and Specificity, where ↑ and ↓ denote preferable larger and smaller values, respectively. Red font denotes the $600 nit$ reference condition, and blue font denotes the best competing result for each metric.

Hidden-Screen Luminance	Precision ↑	Recall ↑	Macro-F1 ↑	Accuracy ↑	AUC ↑	Cohen’s $κ ↑$	Brier Score ↓	G-Mean ↑	Specificity ↑
$600 nit$ (100%)	0.7260 ± 0.1017	0.7160 ± 0.1025	0.7170 ± 0.0884	0.7160 ± 0.1025	0.9759 ± 0.0134	0.7141	0.0151 ± 0.0244	0.8400 ± 0.0622	0.9908 ± 0.0041
$540 nit$ (90%)	0.7046 ± 0.1103	0.6984 ± 0.1234	0.6971 ± 0.1026	0.6984 ± 0.1234	0.9720 ± 0.0206	0.6883	0.0162 ± 0.0249	0.8281 ± 0.0756	0.9899 ± 0.0044
$480 nit$ (80%)	0.6563 ± 0.1117	0.6516 ± 0.1341	0.6490 ± 0.1097	0.6516 ± 0.1341	0.9622 ± 0.0245	0.6400	0.0187 ± 0.0258	0.7980 ± 0.0857	0.9884 ± 0.0045
$420 nit$ (70%)	0.6164 ± 0.1338	0.6045 ± 0.1422	0.6031 ± 0.1215	0.6045 ± 0.1422	0.9490 ± 0.0311	0.5913	0.0209 ± 0.0262	0.7662 ± 0.0973	0.9868 ± 0.0069
$360 nit$ (60%)	0.5309 ± 0.1446	0.5081 ± 0.1487	0.5061 ± 0.1171	0.5081 ± 0.1487	0.9303 ± 0.0333	0.4917	0.0253 ± 0.0264	0.6982 ± 0.1083	0.9836 ± 0.0094
$300 nit$ (50%)	0.4845 ± 0.1512	0.4419 ± 0.1893	0.4329 ± 0.1174	0.4419 ± 0.1893	0.9017 ± 0.0371	0.4237	0.0288 ± 0.0263	0.6411 ± 0.1436	0.9814 ± 0.0154

Figure 7. Representative wall-mediated indirect observations under progressive hidden-screen emission attenuation, used in the illumination robustness evaluation of Section 3.3. Hidden-screen luminance decreases from $600 nit$ (100%) to $300 nit$ (50%) in 10% steps, yielding progressively weakened wall-carried radiometric contrast while largely preserving the occluder-shaped global carrier.

The results show a gradual degradation trend rather than an abrupt breakdown. Under the $600 nit$ reference condition, SCISA-Net attains 0.7170 macro-F1 and 0.9759 AUC. When the luminance is reduced to $540 nit$, performance remains close to the reference level, with macro-F1 and AUC still reaching 0.6971 and 0.9720, respectively. At $480 nit$ and $420 nit$, the model continues to preserve meaningful semantic discrimination, yielding 0.6490/0.9622 and 0.6031/0.9490 in macro-F1/AUC, respectively. Even when the luminance is further reduced to $360 nit$ and $300 nit$, SCISA-Net still retains non-trivial discriminative capability, with macro-F1/AUC of 0.5061/0.9303 and 0.4329/0.9017, respectively. Throughout this attenuation range, the Brier score rises from 0.0151 to 0.0288, while specificity remains high, decreasing from 0.9908 to 0.9814.

These results show that SCISA-Net degrades gradually under illumination attenuation, with meaningful discrimination retained down to $300 nit$ in the tested setting.

Noise contamination. We further evaluate robustness under three types of simulated corruption added to the wall-mediated observation: Gaussian noise, Poisson noise, and Scatter noise. These perturbations stress different failure modes of the inference pipeline: Gaussian noise introduces distributed random fluctuation, Poisson noise perturbs the observation through photon-statistics-related randomness, and Scatter noise more directly mimics structured contamination relevant to the present indirect-view optical formation process [73,74]. Because all tests are performed with the same trained checkpoint and without any retraining, the results reflect the perturbation tolerance of the learned representation.

For a wall-mediated observation $Y$, the Gaussian-corrupted observation is defined as

Y_{G} = Y + |ϵ_{G}|, ϵ_{G, p} \sim N (0, σ_{G}^{2}) .

The Poisson-corrupted observation is defined as

Y_{P} = Y + |P (Y + o_{P}) - Y|,

where $P (\cdot)$ denotes element-wise Poisson sampling. The Scatter-corrupted observation is defined as

Y_{S} = Y + η_{S}^{+}, η_{S, p, c} = m_{p} u_{p, c}, m_{p} \sim Bernoulli (ρ), u_{p, c} \sim U (- S, S) .

The corresponding quantitative results and representative corrupted observations are shown in Table 7 and Figure 8, respectively.
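The three corruption models defined above can be sketched in NumPy as follows. Several details are assumptions made for illustration: $o_{P}$ is treated as a scalar offset, $η_{S}^{+}$ is read as the positive part $\max(η_{S}, 0)$, the observation is assumed to be a non-negative array of shape (H, W, C), and the mapping from the reported dB levels to $σ_{G}$, $o_{P}$, $ρ$, and $S$ is not specified in this section, so these are left as free parameters. The function names are hypothetical.

```python
import numpy as np

def gaussian_corrupt(y, sigma_g, rng):
    """Y_G = Y + |eps_G|, with eps_G drawn i.i.d. from N(0, sigma_g^2)."""
    eps = rng.normal(0.0, sigma_g, size=y.shape)
    return y + np.abs(eps)

def poisson_corrupt(y, o_p, rng):
    """Y_P = Y + |P(Y + o_P) - Y|, with element-wise Poisson sampling.

    Assumes y + o_p is non-negative and already on a count-like scale.
    """
    sample = rng.poisson(y + o_p)
    return y + np.abs(sample - y)

def scatter_corrupt(y, rho, s, rng):
    """Y_S = Y + eta_S^+, with eta = m_p * u_{p,c}.

    m_p ~ Bernoulli(rho) is drawn once per pixel and shared across channels;
    u_{p,c} ~ U(-s, s) is drawn per pixel and channel. The superscript + is
    interpreted here as the positive part (an assumption).
    """
    mask = rng.random(y.shape[:2]) < rho
    u = rng.uniform(-s, s, size=y.shape)
    eta = mask[..., None] * u
    return y + np.maximum(eta, 0.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.random((4, 4, 3)) * 50.0  # made-up observation
    for out in (gaussian_corrupt(y, 2.0, rng),
                poisson_corrupt(y, 1.0, rng),
                scatter_corrupt(y, 0.1, 5.0, rng)):
        print(out.shape, bool(np.all(out >= y)))
```

Under this reading, all three perturbations are non-negative additive terms, so each corrupted observation is element-wise greater than or equal to the clean one.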

Under Gaussian noise, SCISA-Net shows strong tolerance once the perturbation is not overly severe. Macro-F1 rises from 0.3728 at $125 dB$ to 0.5661 at $130 dB$ and 0.6835 at $135 dB$, and reaches 0.7316 and 0.7182 at $145 dB$ and $155 dB$, respectively, with the corresponding AUC values recovering to 0.9770 and 0.9774. At the mild end of Gaussian perturbation, the model becomes comparable to, and occasionally slightly above, the clean baseline, indicating that the learned decision path is not brittle to weak stochastic fluctuation.

Under Poisson noise, the degradation is more persistent. Across the tested 80–$110 dB$ settings, macro-F1 remains in the range of 0.4055–0.4609 and AUC stays between 0.8541 and 0.8794, which is substantially below the clean condition. Compared with Gaussian corruption, Poisson perturbation damages the task more consistently, suggesting that count-like random fluctuations more directly disrupt the weak radiometric differences that encode class identity in the wall observation.

Under Scatter noise, performance is poor at the harshest setting but recovers rapidly as the perturbation weakens. At $110 dB$, macro-F1 is only 0.1888 and AUC is 0.7721, indicating that strong structured contamination severely interferes with semantic recovery. However, the model improves to 0.3925/0.8859 at $115 dB$ and 0.5892/0.9514 at $120 dB$, and returns close to the clean case at $130 dB$ and $140 dB$, where macro-F1 reaches 0.7011 and 0.7238 and AUC reaches 0.9734 and 0.9762, respectively. This trend suggests that SCISA-Net is not inherently fragile to structured disturbance itself; rather, failure occurs when such disturbance dominates the weak class-bearing residual.

Overall, the noise robustness results support a differentiated conclusion: SCISA-Net is highly tolerant to mild Gaussian perturbation, can recover to near-clean performance under moderate Scatter noise, but is more consistently affected by Poisson corruption. When the perturbation destroys weak radiometric cues too aggressively, especially in a photon-statistics-like manner, the residual evidence becomes insufficient for stable 31-way discrimination.

SCIM parameter sensitivity under Poisson noise. Since SCIM relies on a truncated Tikhonov-filtered inverse encoder, the singular-value truncation number k and the Tikhonov coefficient $λ$ can affect the balance between semantic detail preservation and noise suppression. To address the concern that fixed inversion parameters may be suboptimal under noisy observations, we further conduct a parameter sensitivity analysis under the $100 dB$ Poisson noise condition. This setting is selected because Poisson corruption produces the most persistent degradation among the tested noise types, and therefore provides a representative case for examining the stability of the SCIM inverse parameters.

Starting from the default setting $(k_{0}, λ_{0})$ used in the main experiments, we evaluate four controlled variants: $0.5 k_{0}$, $1.5 k_{0}$, $0.5 λ_{0}$, and $1.5 λ_{0}$. In each variant, only one SCIM parameter is changed at a time, while the trained checkpoint, downstream MS-HSANet, preprocessing pipeline, and all remaining inference settings are kept unchanged. The results are reported in Table 8.
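To illustrate how $k$ and $λ$ enter a truncated Tikhonov-filtered inverse, the following is a generic NumPy sketch rather than the authors' SCIM implementation. The filter form $σ_{i} / (σ_{i}^{2} + λ)$, the dense-matrix operator `A`, the rounding of $0.5 k_{0}$ and $1.5 k_{0}$ to integers, and all function names are assumptions.

```python
import numpy as np

def truncated_tikhonov_inverse(A, y, k, lam):
    """Estimate x from y = A x using the k largest singular modes.

    Each retained mode is weighted by s_i / (s_i^2 + lam), which damps
    small singular values instead of inverting them directly.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = int(min(max(k, 1), s.size))          # clamp to a valid mode count
    filt = s[:k] / (s[:k] ** 2 + lam)
    return Vt[:k].T @ (filt * (U[:, :k].T @ y))

def sensitivity_variants(k0, lam0):
    """The default setting plus the four one-at-a-time variants from the text."""
    return {
        "default": (k0, lam0),
        "0.5k0": (int(round(0.5 * k0)), lam0),
        "1.5k0": (int(round(1.5 * k0)), lam0),
        "0.5lam0": (k0, 0.5 * lam0),
        "1.5lam0": (k0, 1.5 * lam0),
    }

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.normal(size=(12, 8))             # made-up operator
    y = A @ rng.normal(size=8) + 0.05 * rng.normal(size=12)
    for name, (k, lam) in sensitivity_variants(k0=6, lam0=0.1).items():
        x_hat = truncated_tikhonov_inverse(A, y, k, lam)
        print(name, k, lam, round(float(np.linalg.norm(x_hat)), 4))
```

In this form, a smaller $k$ discards more singular modes, while a larger $λ$ shrinks the contribution of the retained modes, which is the trade-off between detail preservation and noise suppression described above.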

As shown in Table 8, the default setting achieves the best or near-best performance under the $100 dB$ Poisson noise condition. It obtains the highest Recall, F1, Accuracy, AUC, Cohen’s $κ$, G-mean, and Specificity, as well as the lowest Brier score. In particular, the default setting reaches 0.4609 macro-F1 and 0.8742 AUC. Reducing the truncation number to $0.5 k_{0}$ decreases macro-F1 to 0.4267, while increasing it to $1.5 k_{0}$ further reduces macro-F1 to 0.4073. This indicates that retaining too few singular modes removes useful class-bearing information, whereas retaining more modes can introduce additional noise-amplified components under Poisson corruption.

Changing the Tikhonov coefficient has a milder effect, but neither $0.5 λ_{0}$ nor $1.5 λ_{0}$ improves the overall result over the default setting. The macro-F1 values are 0.4572 and 0.4532, respectively, both slightly lower than the default value. The only metric for which a variant marginally exceeds the default setting is Precision under $0.5 λ_{0}$, but this small increase is not accompanied by improvements in Recall, F1, AUC, calibration, or G-mean. These results suggest that the fixed SCIM parameters used in the main experiments are not a fragile or pathological choice under the representative Poisson-corrupted condition. They also indicate that, within the tested range, the adopted $(k_{0}, λ_{0})$ setting provides a reasonable balance between preserving weak semantic evidence and suppressing noise-induced spectral amplification.

Ambient-light background interference. The experiments above are conducted under a light-isolated dark-room assumption, where the hidden display is the only active light source and the background term in the wall observation is negligible. In practical deployments, however, environmental illumination may also irradiate the wall and introduce an additional background component. Such background illumination increases the wall-side intensity floor and reduces the signal-to-background ratio ( $SBR$ ), making the already weak class-bearing radiometric deviations more difficult to recover. To examine this effect, we further evaluate SCISA-Net by adding controlled ambient-light background patterns directly to the wall-mediated indirect observation images at inference time, while keeping the trained checkpoint fixed.

Let $Y$ denote a clean wall-mediated indirect observation and let $B$ denote an added ambient-light background field. The perturbed observation is defined as

Y_{amb} = Y + B .

The signal-to-background ratio in decibels, ${SBR}_{dB}$, is given by

{SBR}_{dB} = 10 \log_{10} \frac{\frac{1}{C | Ω |} \sum_{c = 1}^{C} \sum_{p \in Ω} Y_{c, p}^{2}}{\frac{1}{C | Ω |} \sum_{c = 1}^{C} \sum_{p \in Ω} B_{c, p}^{2}},

where $Ω$ denotes the wall image pixel domain and $C$ is the number of color channels. A smaller $SBR$ therefore corresponds to stronger background interference. Three representative background patterns are considered. The first is a uniform background, which models spatially homogeneous wall irradiance. The second is a random-direction linear gradient background, whose spatial template is constructed as

B_{c, p}^{lin} = γ_{lin} (1 + α g_{θ} (p)), g_{θ} (u, v) = \frac{u \cos θ + v \sin θ}{\max_{(u, v) \in Ω} | u \cos θ + v \sin θ |}, θ \sim U (0, 2 π) .

The third is a random-center radial gradient background, defined by

B_{c, p}^{rad} = γ_{rad} (1 + α g_{r} (p)), g_{r} (u, v) = 1 - 2 \frac{r (u, v)}{r_{\max}},

where $r (u, v) = \sqrt{{(u - c_{x})}^{2} + {(v - c_{y})}^{2}}$ and the radial center $(c_{x}, c_{y})$ is randomly sampled within the normalized wall image coordinate range. For the two spatially varying backgrounds, $α$ is fixed to $1 / 3$ so that the unscaled background template has an approximate maximum-to-minimum intensity ratio of 2:1. The scaling coefficients $γ_{lin}$ and $γ_{rad}$, as well as the constant value of the uniform background, are determined by the target $SBR$.
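The background construction above can be sketched as follows: build an unscaled template, then choose the scale so that the mean-square ratio in the ${SBR}_{dB}$ definition hits the target. The coordinate convention (centered, normalized to $[-1, 1]$), taking $r_{\max}$ as the largest distance over the image, sharing one template across color channels, and all function names are assumptions for illustration.

```python
import numpy as np

ALPHA = 1.0 / 3.0  # gives a template range of [2/3, 4/3], i.e. a 2:1 ratio

def _grid(h, w):
    """Centered, normalized pixel coordinates (u horizontal, v vertical)."""
    v, u = np.meshgrid(np.linspace(-1.0, 1.0, h), np.linspace(-1.0, 1.0, w), indexing="ij")
    return u, v

def uniform_template(h, w):
    return np.ones((h, w))

def linear_template(h, w, theta):
    u, v = _grid(h, w)
    proj = u * np.cos(theta) + v * np.sin(theta)
    return 1.0 + ALPHA * proj / np.max(np.abs(proj))

def radial_template(h, w, cx, cy):
    u, v = _grid(h, w)
    r = np.sqrt((u - cx) ** 2 + (v - cy) ** 2)
    return 1.0 + ALPHA * (1.0 - 2.0 * r / r.max())

def measure_sbr_db(y, b):
    """10 log10 of mean(Y^2) / mean(B^2), averaged over channels and pixels."""
    return 10.0 * np.log10(np.mean(y ** 2) / np.mean(b ** 2))

def add_background(y, template, target_db):
    """Scale an (H, W) template so that Y + B has the requested SBR in dB."""
    b = np.broadcast_to(template[..., None], y.shape)
    gamma = np.sqrt(np.mean(y ** 2) / (10.0 ** (target_db / 10.0) * np.mean(b ** 2)))
    b = gamma * b
    return y + b, b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.random((33, 33, 3)) + 0.1  # made-up observation
    theta = rng.uniform(0.0, 2.0 * np.pi)
    y_amb, b = add_background(y, linear_template(33, 33, theta), target_db=60.0)
    print(round(float(measure_sbr_db(y, b)), 6))
```

Because the scale is solved directly from the ${SBR}_{dB}$ definition, the measured ratio of the returned background equals the target up to floating-point error.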

The results in Table 9 show that SCISA-Net degrades progressively as the ambient-light background becomes stronger, but it does not collapse under moderate background interference. Under uniform background illumination, macro-F1 decreases from the clean reference value of 0.7170 to 0.6903 at $70 dB$, 0.6381 at $60 dB$, and 0.5827 at $55 dB$. A similar trend is observed for the random-direction linear gradient background: macro-F1 remains 0.7030 at $65 dB$, decreases to 0.6280 at $55 dB$, and further drops to 0.4987 at $50 dB$. For the random-center radial gradient background, the model remains closest to the clean case at $70 dB$, with 0.7139 macro-F1 and 0.9755 AUC, and still maintains meaningful discrimination at $60 dB$ and $55 dB$, where macro-F1 values are 0.6789 and 0.6272, respectively.

These results indicate that the proposed method is affected not only by detector-like random noise but also by wall-side background irradiance, as expected for an indirect optical semantic inference task. When the added background is weak or moderate, the SCIM-based inverse reorganization and the subband-aware classifier can still preserve useful semantic evidence. However, when the $SBR$ is reduced more aggressively, the additional background raises the wall intensity floor and masks the subtle radiometric deviations that encode the hidden-screen category, leading to a clear performance decrease. Therefore, this experiment supports a more restrained conclusion: SCISA-Net remains usable under moderate ambient-light background interference, but strong background irradiance is still an important operating range limitation.

3.4. Evaluation Under Scene Re-Parameterization with Updated Operator

Since the Scene-Constrained Inversion Module uses the scene information encoding operator $A$ during inverse encoding, we further evaluated the proposed pipeline under matched scene re-parameterization. This experiment examines whether the pipeline can be re-instantiated when the calibrated scene description is changed and the corresponding operator $A$ is reconstructed consistently with the updated geometry.

Three representative scene parameter changes were considered: increasing the display–wall distance by $5 cm$ while keeping the display–occluder distance unchanged, shifting the occluder $5 cm$ toward the imaging wall, and shifting the effective display image region by $5 cm$ along the positive $x$ direction. For each modified setting, the scene information encoding operator $A$ was reconstructed according to the updated geometry and then used in the same SCIM-based inference pipeline. The scene parameter settings and the construction procedure of $A$ are described in the sections “Experimental Scene Parameters” and “Construction of the Scene-Information Encoding Operator A” of the Supplementary Materials, respectively. The results are reported in Table 10.

Compared with the default calibrated setting, the re-parameterized settings retain comparable recognition performance. The macro-F1 scores remain at 0.6894, 0.7239, and 0.7140 for the three modified configurations, respectively, while the AUC values remain above 0.971. These results show that SCISA-Net preserves effective semantic inference under consistent scene-operator re-parameterization. They also support the operating condition of SCIM: the operator used for inverse encoding should be constructed according to the calibrated physical configuration.

3.5. Gesture-Morphology-Stratified Evaluation

To further examine how the discriminative capability learned by SCISA-Net is distributed across different category types rather than only reflected by aggregate metrics, we perform a gesture-morphology-stratified evaluation on the held-out validation split. Specifically, the wall-mediated indirect observation samples in the validation split are regrouped according to the coarse morphology of their paired source domain gesture symbols from the RGB Arabic Alphabets Sign Language dataset [28], using the number of extended fingers as the grouping criterion. This yields three evaluation subsets: Closed, corresponding to indirect observations paired with source gestures containing 0–1 extended fingers; Half-open, corresponding to indirect observations paired with source gestures containing 2–3 extended fingers; and Spread, corresponding to indirect observations paired with source gestures containing 4–5 extended fingers. In this validation split, the Closed, Half-open, and Spread subsets contain 10 classes with 521 images, 15 classes with 758 images, and 6 classes with 305 images, respectively. No retraining is performed. The same best-performing checkpoint trained on the original 31-way task is directly applied to these three validation subsets without any subset-specific optimization. The subset-level quantitative results are summarized in Table 11.
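The grouping rule can be written down directly. The sketch below uses a hypothetical function name and only the subset sizes reported above; the per-class finger counts themselves are not listed in this section and are therefore not reproduced.

```python
def morphology_subset(extended_fingers):
    """Map the number of extended fingers (0-5) to its evaluation subset."""
    if not 0 <= extended_fingers <= 5:
        raise ValueError("extended_fingers must be between 0 and 5")
    if extended_fingers <= 1:
        return "Closed"
    if extended_fingers <= 3:
        return "Half-open"
    return "Spread"

# (number of classes, number of validation images) as reported in the text.
SUBSET_STATS = {"Closed": (10, 521), "Half-open": (15, 758), "Spread": (6, 305)}

if __name__ == "__main__":
    print([morphology_subset(n) for n in range(6)])
    print(sum(c for c, _ in SUBSET_STATS.values()),
          sum(n for _, n in SUBSET_STATS.values()))
```

The three subsets together account for all 31 classes and 1584 validation images, so the stratification partitions the validation split without overlap.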

The results show that SCISA-Net maintains meaningful semantic discrimination on all three subsets, but its performance is not uniformly distributed. Among them, the Half-open subset achieves the best overall balance, reaching 0.8388 Precision, 0.7562 Recall, 0.7918 F1 Score, 0.8307 Accuracy, 0.9793 AUC, 0.7482 Cohen’s $κ$, 0.8629 G-mean, and 0.9884 Specificity, while also obtaining the lowest Brier score of 0.0130. The Spread subset follows with 0.7455 F1 and 0.9645 AUC, whereas the Closed subset yields 0.7294 F1 and 0.9543 AUC. Precision remains relatively close across the three subsets, whereas Recall exhibits a clearer variation, especially dropping to 0.6738 on Closed.

These subset-level results provide a more fine-grained view of the feasibility question addressed in this work. The learned representation of SCISA-Net is most effective on the indirect observations paired with source gestures of intermediate openness, namely the Half-open subset. The Closed subset is more challenging because its paired source gestures are morphologically compact and contain fewer explicit fine-scale differences, so their already weak class-specific cues are more easily compressed into mutually similar indirect patterns after scene transport [75]. The Spread subset remains more difficult than the Half-open subset because its discrimination depends more heavily on thin fingers, inter-finger separations, and other high-frequency structural details in the paired source gestures, which are more vulnerable to attenuation, blur, and distortion along the indirect optical path [76,77]. Therefore, the subset-level results indicate that discriminative evidence is most stably recoverable from indirect observations whose paired source gestures have moderate structural openness, whereas the remaining difficulty is concentrated either in compact source morphologies with insufficient morphological diversity or in highly spread source morphologies whose class identity depends more strongly on fragile fine-detail carriers.

3.6. Attention Feature Visualization

To further understand how SCISA-Net performs semantic inference from wall-mediated indirect observations, we visualize class-specific Grad-CAM responses in a unified wall observation coordinate system [78]. Grad-CAM is computed at the SCIM output and at the outputs of stage0–stage3 in MS-HSANet using the final predicted class as the supervision target, and all responses are projected back to the wall observation domain for direct comparison. The resulting visualization reveals a stage-wise evolution of discriminative attention, as shown in Figure 9.
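
For reference, the core Grad-CAM computation at one layer can be sketched as follows. The sketch assumes that the layer activations and the gradients of the predicted-class score with respect to them have already been extracted as arrays of shape (C, H, W). Nearest-neighbor upsampling stands in for the projection back to the wall observation domain, which is an assumption; the projection used for Figure 9 may differ.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Standard Grad-CAM map from (C, H, W) activations and gradients."""
    weights = gradients.mean(axis=(1, 2))              # channel weights, shape (C,)
    cam = np.tensordot(weights, activations, axes=1)   # weighted sum over channels
    cam = np.maximum(cam, 0.0)                         # keep positive evidence only
    peak = cam.max()
    return cam / peak if peak > 0 else cam

def upsample_nearest(cam, factor):
    """Repeat each cell factor x factor times (illustrative projection step)."""
    return np.kron(cam, np.ones((factor, factor)))

# Toy example: channel 0 supports the class, channel 1 opposes it.
acts = np.array([[[1.0, 0.0], [0.0, 0.0]],
                 [[0.0, 0.0], [0.0, 2.0]]])
grads = np.stack([np.ones((2, 2)), -np.ones((2, 2))])
toy_cam = upsample_nearest(grad_cam(acts, grads), 2)
```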

As shown in Figure 9, the raw wall observation is dominated by the occluder-modulated light pattern, whereas the Grad-CAM response after SCIM is concentrated within the modulated light-spot region. This change is consistent with the ablation results in Table 2, where removing SCIM causes a severe performance drop. It suggests that SCIM helps convert the wall-mediated observation into an intermediate representation in which class-related evidence is more accessible to the downstream classifier.

After entering MS-HSANet, the response evolves from a compact activation in stage0 to broader structured responses in stage1–stage2, and finally to a more localized hotspot in stage3. This stage-wise trajectory agrees with the depth ablation results in Table 4, showing that the complete multi-stage hierarchy provides stronger discrimination than shallow or single-stage variants. Overall, the visualization supports the quantitative findings: SCIM localizes useful evidence in the compensated representation, while MS-HSANet further refines it through hierarchical subband-aware encoding.

4. Discussion

4.1. Feasibility of Wall-Mediated Semantic Inference

The results provide empirical support for the central feasibility question of this work: under the controlled and calibrated wall-mediated indirect-view geometry considered here, category-level semantic information carried by a hidden display terminal does not vanish completely after occluder modulation and wall reflection. Although the camera never observes the display directly, SCISA-Net achieves a macro-F1 of 0.7170 and an AUC of 0.9759 on the 31-class validation split. The per-class ROC curves and the normalized confusion matrix further show broadly distributed class separability across the evaluated categories, rather than recognition concentrated on only a few visually favorable classes. These observations indicate that the wall-borne pattern contains recoverable class-dependent radiometric evidence, even though this evidence is visually weak and embedded within a contour-driven indirect intensity distribution.

This finding is consistent with prior indirect optical studies showing that hidden scene information can survive relay surface propagation under suitable physical constraints [2,3,4]. In the present study, this principle is examined at the level of category-level semantic inference from a single wall-mediated observation. The security relevance of the geometry provides the motivating context, while the quantitative conclusion is tied to the calibrated display–occluder–wall–camera configuration and the perturbation conditions explicitly evaluated in this paper. Therefore, the reported performance should be interpreted as evidence of controlled semantic inferability under calibrated indirect-view conditions.

At the same time, the comparison with direct CNN, transformer, hybrid, and frequency-aware baselines shows that this inferability is not automatically accessible to ordinary visual classifiers. These models operate on the wall observation as if it were a directly observable image, whereas the useful class evidence in this task is weak, spatially mixed, and physically displaced from the source domain structure. The observed gap between SCISA-Net and the direct baselines therefore supports the working assumption that wall-mediated semantic inference requires both scene-constrained inverse reorganization and discriminative feature learning matched to the recovered representation.

4.2. Role of Scene-Constrained Inversion

The ablation results identify SCIM as the main prerequisite for making the indirect observation classifiable. When SCIM is removed and the raw wall-mediated observation is fed directly into MS-HSANet, macro-F1 decreases from 0.7170 to 0.0129 and AUC decreases from 0.9759 to 0.5206. Replacing SCIM with generic CNN-, MLP-, or transformer-based front-end encoders also leads to near-chance performance. These results indicate that the front-end gain cannot be explained by encoder capacity alone, because the wall observation domain must first be aligned with the hidden-screen source domain before discriminative learning becomes effective.

SCIM provides this alignment through the scene information encoding operator and a regularized inverse encoding procedure. The truncated Tikhonov-filtered inversion retains the dominant scene-consistent transport modes while avoiding unstable spectral tail amplification [34,35]. The subsequent TV-based refinement suppresses oscillatory residuals and sparse high-variation artifacts that remain after inverse encoding [36,37]. Therefore, the role of SCIM in the complete model is to expose weak class-related evidence in a physically constrained representation rather than to complete the classification task by itself.
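
The spectral behavior of a truncated Tikhonov-filtered inversion can be illustrated with a small sketch. It assumes a dense operator matrix `A`, keeps only the `k` largest singular modes, and applies the filter σ/(σ² + λ) to each retained mode. The filter form, the function name, and the toy operator are assumptions for illustration; the exact SARIE formulation and the subsequent TV refinement are not reproduced here.

```python
import numpy as np

def truncated_tikhonov_inverse(A, y, k, lam):
    """Invert y = A x using the k leading singular modes with Tikhonov filtering."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, s.size)
    filt = s[:k] / (s[:k] ** 2 + lam)     # damped inverse singular values
    coeffs = filt * (U[:, :k].T @ y)      # filtered spectral coefficients
    return Vt[:k].T @ coeffs              # back to the source domain

# Toy operator with one nearly singular mode (hypothetical values).
A_toy = np.diag([2.0, 1.0, 1e-3])
y_toy = A_toy @ np.ones(3)
x_trunc = truncated_tikhonov_inverse(A_toy, y_toy, k=2, lam=0.0)   # drops the unstable tail
x_damped = truncated_tikhonov_inverse(A_toy, y_toy, k=2, lam=0.1)  # additionally shrinks
```

In this toy case, truncation discards the mode with singular value 1e-3, whose naive inversion would amplify any measurement noise by a factor of 1000, while a positive `lam` further shrinks the retained modes.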

This distinction is supported by the SCIM-only result. Although SCIM is essential, the SCIM-only variant reaches only 0.0306 macro-F1 and 0.5608 AUC. The inverse representation still contains frequency-dependent discriminative imbalance, residual texture blunting, and weakened fine details. Thus, scene-constrained inversion is necessary for making the indirect observation informative, but stable 31-way recognition further requires a back-end that can selectively organize the weak class-bearing cues.

4.3. Role of Subband-Aware Discriminative Learning

MS-HSANet converts the physically reorganized SCIM output into a stable discriminative representation. In the ablation results, representative frequency-aware alternatives improve over the SCIM-only variant but remain below the full SCISA-Net; the best competing frequency-aware variants reach approximately 0.679–0.680 macro-F1, whereas the complete model reaches 0.7170. The internal ablation also shows that replacing HSAB with parameter-matched substitutes reduces performance, with the original HSAB design exceeding the strongest internal substitute by 4.25 percentage points in macro-F1. These results suggest that the gain comes from the coordinated use of subband separation, attention redistribution, residual propagation, and cross-stage aggregation.

The stage depth ablation further shows that the local 2 × 2 Haar-prior decomposition becomes effective only when it is embedded in a multi-stage hierarchy. A single HSAB reaches only 0.2849 macro-F1. As the hierarchy is progressively restored, macro-F1 increases from 0.5261 for stage_01 to 0.6355 for stage_012 and 0.7170 for the complete stage_0123 configuration. In contrast, HSAB_13, which uses the same total number of HSAB units as the complete model but concentrates them within stage1, reaches only 0.6006 macro-F1. This comparison indicates that the improvement is associated with distributing local subband primitives across stages, where they are repeatedly reorganized under enlarged receptive fields and higher semantic abstraction.
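
As a point of reference for the local primitive, a fixed 2 × 2 Haar decomposition of a single feature map can be sketched as below. The kernels are the standard orthonormal Haar filters applied with stride 2. The LH/HL naming follows one common convention and is an assumption here, and the learnable HP-DConv filters in the model are only anchored near such kernels rather than fixed to them.

```python
import numpy as np

# Orthonormal 2x2 Haar kernels (naming convention assumed for illustration).
HAAR_KERNELS = {
    "LL": 0.5 * np.array([[1.0, 1.0], [1.0, 1.0]]),
    "LH": 0.5 * np.array([[1.0, 1.0], [-1.0, -1.0]]),
    "HL": 0.5 * np.array([[1.0, -1.0], [1.0, -1.0]]),
    "HH": 0.5 * np.array([[1.0, -1.0], [-1.0, 1.0]]),
}

def haar_subbands(x):
    """Split an (H, W) map with even H and W into four (H/2, W/2) subbands."""
    h, w = x.shape
    # blocks[i, j] is the 2x2 patch whose top-left corner is x[2i, 2j]
    blocks = x.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3)
    return {name: np.einsum("ijab,ab->ij", blocks, k)
            for name, k in HAAR_KERNELS.items()}

# A constant map has only LL content.
toy_bands = haar_subbands(np.ones((4, 4)))
```

Because the four kernels are orthonormal, the total energy of the input equals the summed energy of the four subbands, so the decomposition redistributes information rather than discarding it.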

The inference-time subband intervention analysis provides a more direct check of the trained decision path. The complete LL/LH/HL/HH configuration gives the strongest result, whereas the LL-only setting reduces macro-F1 to 0.0121, and suppressing LL also leads to near-chance performance. Removing LH or HL causes substantial degradation, and random permutation of the four subband positions reduces macro-F1 to 0.0174 with an AUC of 0.5196, even though all four response maps are still retained. This order-sensitive degradation indicates that the classifier relies on the coordinated organization of low-frequency structural support and directional detail responses, rather than on unordered response energy alone.
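
The interventions discussed above can be mimicked on a stacked subband tensor, as in the following sketch. It assumes the four response maps are stored in the fixed order LL, LH, HL, HH along the first axis, and that suppression means zeroing a map. Both the function name and the zeroing choice are assumptions; the intervention hooks used in the actual experiments are not specified here.

```python
import numpy as np

ORDER = ("LL", "LH", "HL", "HH")

def intervene(subbands, mode, rng=None):
    """Apply an inference-time intervention to a (4, H, W) subband stack."""
    out = np.array(subbands, dtype=float)
    if mode == "full":
        return out
    if mode == "ll_only":
        out[1:] = 0.0                       # keep only the low-frequency map
    elif mode.startswith("drop_"):
        out[ORDER.index(mode[5:].upper())] = 0.0
    elif mode == "permute":
        if rng is None:
            rng = np.random.default_rng()
        perm = rng.permutation(4)
        while np.array_equal(perm, np.arange(4)):   # avoid the identity order
            perm = rng.permutation(4)
        out = out[perm]                     # same maps, wrong positions
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out

toy_stack = np.arange(16, dtype=float).reshape(4, 2, 2) + 1.0
toy_permuted = intervene(toy_stack, "permute", np.random.default_rng(0))
```

The permutation case keeps the total response energy unchanged, which is what makes the reported collapse under permutation informative about order sensitivity.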

The sensitivity analysis reported in the section “Haar Regularization as a Weak Subband Prior” of the Supplementary Materials further clarifies the role of the learnable Haar-prior filters. Removing the Haar regularizer still yields 0.7149 macro-F1 and 0.9737 AUC, while the default setting λ_haar = 10⁻⁴ obtains a comparable macro-F1 of 0.7139 and an AUC of 0.9755, together with the smallest normalized Haar-kernel deviation among the tested settings. A smaller coefficient, λ_haar = 10⁻⁵, gives the highest macro-F1 of 0.7203. These results indicate that the Haar term acts as a soft structural anchor: the classification objective remains the dominant supervision, while the regularizer keeps HP-DConv within a controlled Haar-oriented neighborhood during task-driven adaptation.
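
One common way to write such a soft anchor is a squared Frobenius penalty toward the reference Haar kernels, paired with a relative deviation measure. The form below is only an illustrative assumption: the symbols W_b (learned kernel for subband b), H_b (reference Haar kernel), and d(W) are introduced here, and the formulation given in the Supplementary Materials may differ.

```latex
\mathcal{L} = \mathcal{L}_{\mathrm{CE}}
  + \lambda_{haar} \sum_{b \in \{LL,\, LH,\, HL,\, HH\}} \left\| W_b - H_b \right\|_F^2,
\qquad
d(W) = \frac{\left\| W - H \right\|_F}{\left\| H \right\|_F}
```

Under this reading, a small λ_haar leaves the cross-entropy term dominant while still penalizing large excursions of the learned filters from the Haar reference.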

The Grad-CAM visualization is consistent with this interpretation. SCIM first concentrates useful evidence within the modulated light-spot region, and MS-HSANet then progressively refines this evidence from compact activation to structured stage-wise responses and finally to a focused discriminative hotspot. Together with the ablation results, this supports the interpretation that stable inference arises from physically constrained inverse reorganization followed by subband-aware discriminative refinement.

4.4. Interpretation of Matched Scene Re-Parameterization

The matched scene re-parameterization experiment in Section 3.4 examines whether SCISA-Net can be re-instantiated when the calibrated scene description is changed and the corresponding scene information encoding operator A is updated accordingly. The three settings modify representative components of the display–occluder–wall transport configuration, including the display–wall distance, the occluder position, and the effective display image region. Under these matched scene-operator conditions, the macro-F1 scores remain at 0.6894, 0.7239, and 0.7140, respectively, with AUC values above 0.971.

These results show that SCISA-Net can maintain effective semantic inference across different calibrated scene parameterizations. The updated operator A provides the transport prior required by SCIM to map the wall-mediated indirect observation into a representation aligned with the hidden display domain. Therefore, the comparable performance under the re-parameterized settings supports the role of SCIM as a scene-constrained inverse encoding module and indicates that the learned classifier is not simply tied to the default wall intensity pattern.

This experiment also specifies the operating condition of the current method. The evaluated setting assumes matched changes between the physical scene parameters and A. Robustness to unknown geometry perturbations, calibration mismatch, or an incorrect scene operator remains outside the scope of this experiment.

4.5. Robustness, Failure Modes, and Operating Range

The robustness experiments further clarify the operating range of the proposed method. Under illumination attenuation, SCISA-Net degrades gradually as the hidden-screen luminance is reduced from 600 nit to 300 nit. The model remains close to the reference condition at 540 nit, preserves clear semantic discrimination at 480–420 nit, and still avoids near-random collapse at 360–300 nit. This trend suggests that the method is not overly brittle under moderate photon budget reduction. However, the monotonic degradation also shows that the recoverable semantic evidence remains limited by the amount of radiometric signal that survives the indirect optical path.

The noise experiments reveal more differentiated failure behavior. Mild Gaussian perturbation is relatively well tolerated, and the model recovers to near-clean performance when the perturbation becomes weak. Scatter noise is damaging under severe contamination, but performance improves rapidly as the perturbation weakens, indicating that structured disturbance is not necessarily fatal unless it dominates the class-bearing residual. Poisson noise produces more persistent degradation across the tested settings, suggesting that photon-statistics-like fluctuations more directly disturb the weak intensity differences that encode hidden-display semantics. These results are consistent with the design of SCISA-Net: regularized inversion and TV refinement can suppress part of the unstable residual variation, and subband attention can preserve useful detail responses when they remain above the noise floor; once the perturbation destroys the weak radiometric cues too strongly, stable 31-way discrimination becomes difficult.
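
To make the two perturbation types concrete, the sketch below scales a wall observation linearly with display luminance and then applies signal-dependent Poisson noise. The linear luminance scaling, the `peak_photons` parameterization, and the function names are assumptions for illustration; the noise levels in this work are specified in dB, and that mapping is not reproduced here.

```python
import numpy as np

def attenuate(wall, nit, ref_nit=600.0):
    """Scale a non-negative wall observation linearly with display luminance."""
    return np.asarray(wall, dtype=float) * (nit / ref_nit)

def add_poisson_noise(wall, peak_photons, rng):
    """Simulate photon-counting noise: variance grows with the local intensity."""
    wall = np.asarray(wall, dtype=float)
    peak = wall.max()
    if peak <= 0:
        return wall.copy()
    expected = wall / peak * peak_photons          # expected photon counts
    return rng.poisson(expected) * (peak / peak_photons)

rng = np.random.default_rng(0)
toy_wall = np.full((100, 100), 0.8)
toy_dim = attenuate(toy_wall, 300)                 # half of the 600 nit reference
toy_noisy = add_poisson_noise(toy_dim, peak_photons=100, rng=rng)
```

In this model the relative fluctuation scales as one over the square root of the expected photon count, so dimmer observations are proportionally noisier, which is consistent with the persistent degradation described above for photon-statistics-like noise.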

The SCIM parameter sensitivity analysis under Poisson noise further refines this interpretation. As reported in Table 8, the default inverse encoding setting (k₀, λ₀) achieves the strongest overall behavior among the tested variants under the 100 dB Poisson noise condition, reaching 0.4609 macro-F1 and 0.8742 AUC. Reducing the truncation number to 0.5 k₀ lowers macro-F1 to 0.4267, whereas increasing it to 1.5 k₀ further lowers macro-F1 to 0.4073. This trend is consistent with the spectral trade-off in SCIM: retaining too few singular modes removes weak class-bearing components, while retaining too many modes exposes the inverse encoding to additional noise-amplified spectral components under Poisson corruption. Varying the Tikhonov coefficient has a milder effect, but neither 0.5 λ₀ nor 1.5 λ₀ improves overall discriminative or calibration behavior over the default setting. Therefore, the fixed (k₀, λ₀) protocol used throughout the experiments should be interpreted as a reasonable operating point within the tested range under representative Poisson corruption, rather than as a dynamically optimized or universally optimal inverse setting.

The ambient-light background experiment further extends this analysis from detector-like or noise-like perturbations to wall-side background irradiance. In contrast to zero-mean random noise, ambient background illumination increases the wall intensity floor and reduces the signal-to-background ratio, thereby masking the subtle radiometric deviations that carry hidden-screen semantic information. As reported in Table 9, SCISA-Net still maintains meaningful discrimination under moderate background interference. For example, the macro-F1 remains 0.6903 under uniform background illumination at 70 dB SBR, 0.7030 under linear gradient background at 65 dB SBR, and 0.7139 under radial gradient background at 70 dB SBR. However, performance decreases clearly as the background becomes stronger, dropping to 0.5827, 0.4987, and 0.6272 macro-F1 under the strongest tested uniform, linear gradient, and radial gradient settings, respectively. This behavior indicates that the proposed method is not limited to an ideal detector-dark-current-only noise assumption, but it also confirms that strong environmental irradiance remains an important failure factor for wall-mediated semantic inference.
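
The three background types can be emulated with a short sketch that adds a non-negative pattern whose mean level is set by a target SBR. The definition SBR = 10·log10(mean signal / mean background), the pattern shapes, and all function names are assumptions made for illustration; the SBR definition and the background profiles used for Table 9 may differ.

```python
import numpy as np

def background_pattern(shape, kind):
    """Unit-mean background profile: uniform, linear gradient, or radial gradient."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    if kind == "uniform":
        pat = np.ones(shape)
    elif kind == "linear":
        pat = 0.5 + xx / max(w - 1, 1)             # brighter toward one side
    elif kind == "radial":
        r = np.hypot(yy - (h - 1) / 2, xx - (w - 1) / 2)
        pat = 1.5 - r / max(r.max(), 1e-12)        # brightest at the center
    else:
        raise ValueError(f"unknown background kind: {kind}")
    return pat / pat.mean()

def add_background(wall, sbr_db, kind="uniform"):
    """Raise the intensity floor so that mean signal / mean background hits sbr_db."""
    wall = np.asarray(wall, dtype=float)
    level = wall.mean() / (10.0 ** (sbr_db / 10.0))
    return wall + level * background_pattern(wall.shape, kind)

toy_wall = np.full((8, 8), 2.0)
toy_lit = add_background(toy_wall, sbr_db=20.0, kind="radial")
```

Unlike zero-mean noise, the added term is strictly positive everywhere, so it shifts the intensity floor instead of averaging out.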

The gesture-morphology-stratified evaluation gives another view of the failure modes. The Half-open subset achieves the strongest subset-level performance, whereas the Closed and Spread subsets are more difficult for different reasons. Compact gestures provide fewer fine-scale structural differences, making their indirect patterns easier to compress into similar wall-borne observations. Highly spread gestures depend more on thin fingers and inter-finger separations, whose high-frequency carriers are more vulnerable to attenuation and blur along the indirect optical path [75,76,77]. Therefore, the remaining errors are not arbitrary: they are related to both the available photon budget and the morphology-dependent survival of fine discriminative details.

Taken together, the robustness results define a restrained operating range. SCISA-Net supports semantic inference under the controlled geometry and the moderate degradation levels tested in this work, but its reliability decreases when illumination is strongly reduced, when photon-statistics-like noise persistently disrupts the observation, when wall-side ambient irradiance substantially lowers the SBR, when the fixed inverse regularization setting no longer matches the available signal support, or when the class identity depends on fine structures that are poorly preserved by the wall-mediated transport process. This operating range is important for interpreting the security relevance of the study: the results show that wall-mediated indirect observations can contain recoverable semantic evidence under calibrated and controlled conditions, while the degree of inferability remains tied to the scene calibration, photon budget, background interference level, inverse regularization setting, and observation conditions evaluated in this work.

4.6. Limitations and Future Work

Several limitations remain. First, the present benchmark and SCIM inference are built upon a calibrated scene description, including the display position, occluder configuration, wall surface, camera viewpoint, and dark-room conditions. The scene information encoding operator A is constructed from this calibrated configuration and is used under the assumption that the physical setup, the calibrated scene description, and the operator remain consistent during inference. The matched scene re-parameterization experiment in Section 3.4 shows that the method can be re-instantiated when the scene parameters and A are updated consistently. However, automatic adaptation to calibration mismatch or unmodeled reflectance changes remains outside the current evaluation. Second, the semantic source domain is derived from the RGB Arabic Alphabets Sign Language dataset and is evaluated as a 31-class image-level classification task. Although this setting provides a clear and reproducible semantic benchmark, it does not cover broader symbol vocabularies, natural dynamic cues, or multi-frame temporal patterns. Third, the current robustness analysis includes illumination attenuation, simulated noise, and ambient-light background interference, while broader physical re-acquisition under different wall materials, occluder shapes, display radiance distributions, camera placements, and unseen display contents is left for future study.

Future work should therefore focus on extending the method from calibrated scene inference to scene-adaptive inference. One direction is to develop calibration, self-calibration, or joint scene operator estimation strategies that update or estimate the scene information encoding operator under geometry or reflectance changes, so that the inverse encoding stage can remain physically meaningful beyond a single calibrated setup. A second direction is photon-budget-aware regularization, where the truncation level, spectral shrinkage strength, and refinement parameters are adjusted according to the available signal support rather than fixed across all conditions. A third direction is to introduce domain generalization or adaptation across different displays, occluders, walls, and cameras, and to test whether the learned subband-discriminative representation remains stable under cross-scene shifts. Finally, extending the benchmark to larger semantic sets, temporal cues, and physically re-acquired degraded observations would provide a more complete assessment of both the security relevance and the practical boundary of wall-mediated hidden-display semantic inference.

5. Conclusions

This paper studies wall-mediated indirect semantic inference as a controlled, security-motivated feasibility problem: given only the indirect intensity pattern formed on a wall by a hidden display terminal outside the camera field of view, our goal is to predict the hidden semantic category and provide an experimental basis for analyzing indirect optical semantic leakage under calibrated non-line-of-sight transport. Although prior indirect optical studies have established hidden scene recovery, localization, and coarse recognition [3,4,7,8], stable single-image category-level semantic inference from occluder-contour-dominated wall observations remains largely underexplored. The first challenge is that after occluder modulation, wall projection, and diffuse reflection, the recorded wall pattern is dominated by the occluder contour, while class-related evidence survives only as weak, diluted, and noise-entangled intensity deviations, making physically consistent and class-discriminative cues difficult to extract stably. The second challenge is that, even after front-end recovery, relatively stable support remains concentrated in low-frequency structure, whereas the mid- and high-frequency details that carry inter-class differences stay weaker and more distortion-prone, making the classifier liable to organize its decision around coarse but weakly informative patterns.

To address these difficulties, we propose the Scene-Constrained Inverse-to-Subband Attention Network (SCISA-Net), which couples scene-constrained inverse reorganization with subband-aware discriminative learning for wall-mediated indirect semantic inference. Within SCISA-Net, the Scene-Constrained Inversion Module (SCIM) first combines Scene-Aware Regularized Inverse Encoding (SARIE) and Multi-Scale Channel-Adaptive Feature Compensation (MSCAFC), where SARIE performs operator-constrained inverse encoding and TV-based refinement to reorganize diluted evidence and suppress contour-coupled interference, and MSCAFC compensates residual texture blunting, spectral deficiency, and channel inconsistency. The subsequent Multi-Stage Haar-Subband Attention Network (MS-HSANet) uses a stage0 stem and stage1–stage3 stacks of Haar-Subband Attention Blocks (HSABs) to decompose features into ordered LL/LH/HL/HH subbands, adaptively reweight them with CBAM, and progressively aggregate fine-grained evidence for final classification.

On the 31-class benchmark constructed from the RGB Arabic Alphabets Sign Language dataset, SCISA-Net achieves a macro-F1 of 0.7170 and an AUC of 0.9759 on the validation split, retains meaningful discrimination under 420 nit illumination attenuation with 0.6031 macro-F1 and 0.9490 AUC, remains usable under moderate ambient background interference, and preserves comparable performance under matched scene-operator re-parameterization with macro-F1 scores of 0.6894, 0.7239, and 0.7140 and AUC values above 0.971. The stage depth and inference-time subband intervention analyses further support the hierarchical ordered-subband design, with macro-F1 decreasing to 0.6006 when all 13 HSABs are placed in stage1 and to 0.0174 under random subband permutation. A remaining boundary of the current study is that its strongest conclusion is confined to calibrated, operator-matched wall-mediated observations, where the scene information encoding operator A is constructed from the corresponding display position, occluder configuration, wall surface, camera viewpoint, and background condition. Guided by the frequency-selective degradation pattern observed in this study, in which fine-grained category-bearing deviations become less reliable earlier than coarse contour support along the indirect optical path, future work will investigate scene-adaptive transport calibration and photon-budget-aware dynamic regularization to extend calibrated wall-mediated indirect semantic inference to weaker signal and less controlled conditions.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/photonics13060575/s1, Supplementary Materials: experimental scene parameters, construction of the scene-information encoding operator A, details of the Scene-Constrained Inversion Module, detailed formulation of MS-HSANet, baseline comparison settings, scope of specialized NLOS recognition baselines, and Haar regularization as a weak subband prior; Figure S1: Shape and physical dimensions of the occluder used in the experimental setup; Table S1: Scope of specialized NLOS-oriented methods considered for baseline comparison. The main manuscript reports quantitative results only for methods that can be trained and evaluated under the same wall-mediated single-image semantic inference protocol; Table S2: Sensitivity analysis of the Haar regularization coefficient λ_haar. Each model is trained from scratch using the same initialization, data split, and training protocol, while only λ_haar is varied. Evaluation results are reported in terms of Macro-F1 and AUC, together with the two Haar-kernel deviation columns. For the two Haar-kernel deviation columns, values are reported as A/B, where A is the measured deviation under the corresponding λ_haar and B is the largest value in the same deviation column among all tested settings. Red font denotes the default setting used in the main experiments. The references cited in the Supplementary Materials are also included in the main reference list.

Author Contributions

Conceptualization, J.D., Z.Z. and H.Q. (Hongshuai Qin); methodology, J.D.; software, J.D.; validation, J.D. and G.L.; formal analysis, J.D.; resources, X.H., X.Z. and H.Q. (Hongshuai Qin); data curation, J.D., G.L. and X.Z.; writing—original draft preparation, J.D.; writing—review and editing, J.D., Z.Z. and H.Q. (Hongshuai Qin); visualization, J.D., G.L. and H.Q. (Huiyu Qi); supervision, Z.Z., X.H., J.L. and H.Q. (Huiyu Qi); funding acquisition, Z.Z., X.H., J.L. and X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the Young Scientists Fund of the National Natural Science Foundation of China under Grant 62506103; in part by the China Postdoctoral Science Foundation (General Program) under Grant 2024M760716; in part by the Special Financial Grant from the China Postdoctoral Science Foundation under Grant 2025T180962; in part by the Fundamental Research Funds for the Provincial Universities of Zhejiang under Grant GK259909299001-026; and in part by the Excellent Young Scientists Fund Program (Overseas) of Shandong Province under Grant 2025HWYQ-033.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source domain RGB Arabic Alphabets Sign Language dataset used in this study is publicly available on Kaggle at https://www.kaggle.com/datasets/muhammadalbrham/rgb-arabic-alphabets-sign-language-dataset (accessed on 6 June 2026) and is cited in the manuscript. The paired wall-mediated indirect observation dataset generated in this study, termed AASL_WMIO, has been publicly released and is available at https://github.com/PhoeMarx/AASL_WMIO/releases/download/v1.0.0/AASL_WMIO_v1.0.0.zip (accessed on 6 June 2026).

Acknowledgments

The authors used ChatGPT based on GPT-5.4 Thinking (OpenAI, https://chatgpt.com/) to assist with English-language refinement, Chinese–English translation, and wording revision during manuscript preparation. All AI-assisted outputs were manually checked, revised, and finalized by the authors, who take full responsibility for the accuracy, integrity, and content of the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

AASL	Arabic Alphabets Sign Language
AUC	Area Under the Receiver Operating Characteristic Curve
BN	Batch Normalization
CBAM	Convolutional Block Attention Module
CE	Cross-Entropy
CNN	Convolutional Neural Network
C-RF	Channel-Refined Feature
GAP	Global Average Pooling
GELU	Gaussian Error Linear Unit
Grad-CAM	Gradient-Weighted Class Activation Mapping
HH	High–High
HL	High–Low
LH	Low–High
LL	Low–Low
HP-DConv	Haar-Prior Depthwise Convolution
HPDC	Haar-Prior Depthwise Convolution
HSAB	Haar-Subband Attention Block
MLP	Multi-Layer Perceptron
MSCAFC	Multi-Scale Channel-Adaptive Feature Compensation
MS-HSANet	Multi-Stage Haar-Subband Attention Network
NLOS	Non-Line-of-Sight
ROC	Receiver Operating Characteristic
SARIE	Scene-Aware Regularized Inverse Encoding
SCIM	Scene-Constrained Inversion Module
SCISA-Net	Scene-Constrained Inverse-to-Subband Attention Network
SE Attention	Squeeze-and-Excitation Attention
SR-C-RF	Spatially Refined Channel-Refined Feature
SVD	Singular Value Decomposition
TV	Total Variation

References

Backes, M.; Dürmuth, M.; Unruh, D. Compromising Reflections-or-How to Read LCD Monitors around the Corner. In Proceedings of the 2008 IEEE Symposium on Security and Privacy, Oakland, CA, USA, 18–21 May 2008; IEEE Computer Society: Los Alamitos, CA, USA, 2008; pp. 158–169.
Czajkowski, R.; Murray-Bruce, J. Two-edge-resolved three-dimensional non-line-of-sight imaging with an ordinary camera. Nat. Commun. 2024, 15, 1162.
Faccio, D.; Velten, A.; Wetzstein, G. Non-line-of-sight imaging. Nat. Rev. Phys. 2020, 2, 318–327.
Nam, J.H.; Brandt, E.; Bauer, S.; Liu, X.; Renna, M.; Tosi, A.; Sifakis, E.; Velten, A. Low-latency time-of-flight non-line-of-sight imaging at 5 frames per second. Nat. Commun. 2021, 12, 6526.
Wu, J.; Li, J.; Yang, J.; Mei, S. Wavelet-integrated deep neural networks: A systematic review of applications and synergistic architectures. Neurocomputing 2025, 657, 131648.
Zhou, H.; Tian, C.; Zhang, Z.; Li, C.; Xie, Y.; Li, Z. Frequency-aware feature aggregation network with dual-task consistency for RGB-T salient object detection. Pattern Recognit. 2024, 146, 110043.
Wang, Y.; Zhang, Y.; Huang, M.; Chen, Z.; Jia, Y.; Weng, Y.; Xiao, L.; Xiang, X. Accurate but fragile passive non-line-of-sight recognition. Commun. Phys. 2021, 4, 88.
Wang, Z.; Huang, H.; Li, H.; Chen, Z.; Han, J.; Pu, J. Non-line-of-sight imaging and location determination using deep learning. Opt. Lasers Eng. 2023, 169, 107701.
O’Toole, M.; Lindell, D.B.; Wetzstein, G. Confocal non-line-of-sight imaging based on the light-cone transform. Nature 2018, 555, 338–341.
Lindell, D.B.; Wetzstein, G.; O’Toole, M. Wave-based non-line-of-sight imaging using fast f-k migration. ACM Trans. Graph. 2019, 38, 116.
Liu, X.; Guillén, I.; La Manna, M.; Nam, J.H.; Reza, S.A.; Huu Le, T.; Jarabo, A.; Gutierrez, D.; Velten, A. Non-line-of-sight imaging using phasor-field virtual wave optics. Nature 2019, 572, 620–623.
Liu, X.; Wang, J.; Xiao, L.; Shi, Z.; Fu, X.; Qiu, L. Non-line-of-sight imaging with arbitrary illumination and detection pattern. Nat. Commun. 2023, 14, 3230.
Chen, X.; Li, M.; Chen, T.; Zhan, S. Long-Range Non-Line-of-Sight Imaging Based on Projected Images from Multiple Light Fields. Photonics 2023, 10, 25.
Tian, Y.; Xu, W.; Wang, D.; Zhang, N.; Chen, S.; Gao, P.; Su, X.; Hao, W. Non-Line-of-Sight Imaging via Sparse Bayesian Learning Deconvolution. Photonics 2026, 13, 53.
Li, Y.; Zhang, Y. Deep-Learning-Based Real-Time Passive Non-Line-of-Sight Imaging for Room-Scale Scenes. Sensors 2024, 24, 6480.
Zhou, Y.; Li, W.; Li, W.; Dai, C.; Zeng, J.W.; Xu, F. Passive non-line-of-sight imaging at 10 meters. Opt. Lett. 2025, 50, 6333–6336.
Nam, J.H.; Lee, S.C. FSDA: Frequency re-scaling in data augmentation for corruption-robust image classification. Pattern Recognit. 2024, 150, 110332.
Li, Q.; Shen, L.; Guo, S.; Lai, Z. Wavelet Integrated CNNs for Noise-Robust Image Classification. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 7243–7252.
Qin, Z.; Zhang, P.; Wu, F.; Li, X. FcaNet: Frequency Channel Attention Networks. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 763–772.
Rao, Y.; Zhao, W.; Zhu, Z.; Zhou, J.; Lu, J. GFNet: Global Filter Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10960–10973.
Guibas, J.; Mardani, M.; Li, Z.; Tao, A.; Anandkumar, A.; Catanzaro, B. Efficient Token Mixing for Transformers via Adaptive Fourier Neural Operators. In Proceedings of the International Conference on Learning Representations, Online, 25–29 April 2022. [Google Scholar]
Xu, K.; Qin, M.; Sun, F.; Wang, Y.; Chen, Y.K.; Ren, F. Learning in the Frequency Domain. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1737–1746. [Google Scholar] [CrossRef]
Tan, C.; Zhao, Y.; Wei, S.; Gu, G.; Liu, P.; Wei, Y. Frequency-aware Deepfake Detection: Improving Generalizability Through Frequency Space Domain Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 5052–5060. [Google Scholar]
Patro, B.N.; Namboodiri, V.P.; Agneeswaran, V.S. SpectFormer: Frequency and Attention is what you need in a Vision Transformer. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; pp. 9543–9554. [Google Scholar] [CrossRef]
Velten, A.; Willwacher, T.; Gupta, O.; Veeraraghavan, A.; Bawendi, M.G.; Raskar, R. Recovering three-dimensional shape around a corner using ultrafast time-of-flight imaging. Nat. Commun. 2012, 3, 745. [Google Scholar] [CrossRef] [PubMed]
Chen, W.; Wei, F.; Kutulakos, K.N.; Rusinkiewicz, S.; Heide, F. Learned feature embeddings for non-line-of-sight imaging and recognition. ACM Trans. Graph. 2020, 39, 1–18. [Google Scholar] [CrossRef]
Zhang, K.; Weng, J.; Cai, Y.; Li, S.; Luo, Z. Mitigating Low-Frequency Bias: Feature Recalibration and Frequency Attention Regularization for Adversarial Robustness. Neural Netw. 2026, 193, 108070. [Google Scholar] [CrossRef]
Al-Barham, M.; Alomari, O.A.; Elnagar, A. Arabic Sign Language Alphabet Classification via Transfer Learning. In Proceedings of the International Conference on Emerging Trends and Applications in Artificial Intelligence; Springer: Berlin/Heidelberg, Germany, 2023; pp. 226–237. [Google Scholar]
Seo, S.; Kim, Y.; Han, H.J.; Son, W.C.; Hong, Z.; Sohn, I.; Shim, J.; Hwang, C.K. Predicting Successes and Failures of Clinical Trials with Outer Product-Based Convolutional Neural Network. Front. Pharmacol. 2021, 12, 670670. [Google Scholar] [CrossRef]
Heide, F.; O’Toole, M.; Zang, K.; Lindell, D.B.; Diamond, S.; Wetzstein, G. Non-line-of-sight Imaging with Partial Occluders and Surface Normals. ACM Trans. Graph. 2019, 38, 1–10. [Google Scholar] [CrossRef]
Marco, J.; Jarabo, A.; Nam, J.H.; Liu, X.; Cosculluela, M.Á.; Velten, A.; Gutierrez, D. Virtual Light Transport Matrices for Non-Line-of-Sight Imaging. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 2440–2449. [Google Scholar]
Young, S.I.; Lindell, D.B.; Girod, B.; Taubman, D.; Wetzstein, G. Non-Line-of-Sight Surface Reconstruction Using the Directional Light-Cone Transform. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
Badano, A.; Flynn, M.J.; Martin, S.; Kanicki, J. Angular Dependence of the Luminance and Contrast in Medical Monochrome Liquid Crystal Displays. Med. Phys. 2003, 30, 2602–2613. [Google Scholar] [CrossRef][Green Version]
Huang, G.; Liu, Y.; Yin, F. Tikhonov Regularization with MTRSVD Method for Solving Large-Scale Discrete Ill-Posed Problems. J. Comput. Appl. Math. 2022, 405, 113969. [Google Scholar] [CrossRef]
Cui, J.; Peng, G.; Lu, Q.; Huang, Z. A Special Modified Tikhonov Regularization Matrix for Discrete Ill-Posed Problems. Appl. Math. Comput. 2020, 377, 125165. [Google Scholar] [CrossRef]
Pragliola, M.; Calatroni, L.; Lanza, A.; Sgallari, F. On and Beyond Total Variation Regularization in Imaging: The Role of Space Variance. SIAM Rev. 2023, 65, 601–685. [Google Scholar] [CrossRef]
Pang, Z.F.; Zhou, Y.M.; Wu, T.; Li, D.J. Image Denoising via a New Anisotropic Total-Variation-Based Model. Signal Process. Image Commun. 2019, 74, 140–152. [Google Scholar] [CrossRef]
Liu, G.; Liu, Q.; Fang, H.; Chen, X. Robust Total Variation-Based Destriping Model via Sparse Representation Learning for Business Infrared Imaging Systems. Infrared Phys. Technol. 2022, 121, 104005. [Google Scholar] [CrossRef]
Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
Chen, G.; Dai, K.; Yang, K.; Hu, T.; Chen, X.; Yang, Y.; Dong, W.; Wu, P.; Zhang, Y.; Yan, Q. Bracketing Image Restoration and Enhancement with High-Low Frequency Decomposition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Seattle, WA, USA, 16–22 June 2024; pp. 6097–6107. [Google Scholar]
Wang, S.; Wan, Q.; Gao, J.; Zeng, Z. Language-Driven Multi-Label Zero-Shot Learning with Semantic Granularity. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–20 October 2025; pp. 1968–1978. [Google Scholar]
Zheng, H.; Yang, S.; He, Z.; Yang, J.; Huang, Z. Hierarchical Cross-Modal Prompt Learning for Vision-Language Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–20 October 2025; pp. 1891–1901. [Google Scholar]
Liu, J.; Wang, J.; Zhang, P.; Wang, C.; Xie, D.; Pu, S. Multi-Scale Wavelet Transformer for Face Forgery Detection. In Proceedings of the Asian Conference on Computer Vision (ACCV), Macao, China, 4–8 December 2022; pp. 1858–1874. [Google Scholar]
Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2015; pp. 448–456. [Google Scholar]
Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015; pp. 1–14. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar] [CrossRef]
Radosavovic, I.; Kosaraju, R.P.; Girshick, R.; He, K.; Dollár, P. Designing Network Design Spaces. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10425–10433. [Google Scholar] [CrossRef]
Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2019; pp. 6105–6114. [Google Scholar]
Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
Ding, M.; Xiao, B.; Codella, N.; Luo, P.; Wang, J.; Yuan, L. DaViT: Dual Attention Vision Transformers. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 74–92. [Google Scholar]
Li, K.; Wang, Y.; Zhang, J.; Gao, P.; Song, G.; Liu, Y.; Li, H.; Qiao, Y. UniFormer: Unifying Convolution and Self-Attention for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12581–12600. [Google Scholar] [CrossRef]
Patro, B.N.; Agneeswaran, V.S. Scattering Vision Transformer: Spectral Mixing Matters. In Proceedings of the Thirty-seventh Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
Chi, L.; Jiang, B.; Mu, Y. Fast Fourier Convolution. In Proceedings of the Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 4479–4488. [Google Scholar]
Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-style ConvNets Great Again. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13728–13737. [Google Scholar] [CrossRef]
Tolstikhin, I.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Steiner, A.P.; Keysers, D.; Uszkoreit, J.; et al. MLP-Mixer: An All-MLP Architecture for Vision. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–14 December 2021. [Google Scholar]
Touvron, H.; Bojanowski, P.; Caron, M.; Cord, M.; El-Nouby, A.; Grave, E.; Izacard, G.; Joulin, A.; Synnaeve, G.; Verbeek, J.; et al. ResMLP: Feedforward Networks for Image Classification with Data-Efficient Training. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 5314–5321. [Google Scholar] [CrossRef]
Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin Transformer V2: Scaling Up Capacity and Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12009–12019. [Google Scholar]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021. [Google Scholar]
Rao, Y.; Zhao, W.; Zhu, Z.; Lu, J.; Zhou, J. Global Filter Networks for Image Classification. In Proceedings of the Advances in Neural Information Processing Systems; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 980–993. [Google Scholar]
Liu, H.; Dai, Z.; So, D.; Le, Q.V. Pay Attention to MLPs. Adv. Neural Inf. Process. Syst. 2021, 34, 9204–9215. [Google Scholar]
Mehta, S.; Rastegari, M. MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. In Proceedings of the International Conference on Learning Representations, Online, 25–29 April 2022. [Google Scholar]
Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More Features From Cheap Operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500. [Google Scholar]
DaneshPanah, M.; Javidi, B.; Watson, E.A. Three Dimensional Object Recognition with Photon Counting Imagery in the Presence of Noise. Opt. Express 2010, 18, 26450–26460. [Google Scholar] [CrossRef] [PubMed]
Liu, Z.; Yang, B.; Zhang, Y.; Shen, J.; Yuan, X.; Chen, M.K.; Liu, F.; Geng, Z. Comprehensive Compensation of Real-World Degradations for Robust Single-Pixel Imaging. Light. Sci. Appl. 2025, 14, 365. [Google Scholar] [CrossRef] [PubMed]
Guo, R.; Yang, Q.; Chang, A.S.; Hu, G.; Greene, J.; Gabel, C.V.; You, S.; Tian, L. EventLFM: Event Camera Integrated Fourier Light Field Microscopy for Ultrafast 3D Imaging. Light. Sci. Appl. 2024, 13, 144. [Google Scholar] [CrossRef]
Zuo, R.; Wei, F.; Mak, B.K.W. Natural Language-Assisted Sign Language Recognition. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 14890–14900. [Google Scholar]
Gupta, M.; Agrawal, A.; Veeraraghavan, A.; Narasimhan, S.G. A Practical Approach to 3D Scanning in the Presence of Interreflections, Subsurface Scattering and Defocus. Int. J. Comput. Vis. 2013, 102, 33–55. [Google Scholar] [CrossRef]
Liu, C.; Narasimhan, S.G.; Dubrawski, A.W. Matting and Depth Recovery of Thin Structures Using a Focal Stack. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4782–4790. [Google Scholar] [CrossRef]
Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]

Figure 1. Schematic illustration of the wall-mediated indirect observation geometry and the calibrated Cartesian coordinate frame used for scene parameterization. The effective display image plane is defined as $y = 0$, and the imaging wall is located along the positive y direction at $y = 0.95$ m; the x-axis denotes the horizontal scene direction, and the z-axis denotes the vertical direction. The hidden display terminal does not enter the camera field of view; only the wall-borne intensity pattern formed after occluder modulation and wall reflection is recorded.

Figure 2. Architecture of SCISA-Net. The upper branch, Scene-Constrained Inversion Module (SCIM), converts a wall-mediated indirect observation into a source-oriented intermediate representation through construction of the scene information encoding operator $A$, truncated Tikhonov-filtered inverse encoding, TV-based refinement, and multi-scale channel-adaptive feature compensation. The lower branch, Multi-Stage Haar-Subband Attention Network (MS-HSANet), receives the compensated feature, performs stem encoding followed by three successive stages based on HSAB for Haar-subband-aware representation learning, aggregates cross-stage pooled features, and outputs the semantic class.

Figure 3. Architecture of the Scene-Constrained Inversion Module (SCIM). SCIM consists of Scene-Aware Regularized Inverse Encoding (SARIE) and Multi-Scale Channel-Adaptive Feature Compensation (MSCAFC). Within SARIE, (a) constructs the scene information encoding operator $A$, (b) performs truncated Tikhonov-filtered spectral inverse encoding through the dominant singular modes of $A$, and (c) applies TV-based refinement to contract inversion residuals. The refined representation is then processed by (d) MSCAFC, which conducts multi-branch channel-adaptive compensation and residual fusion to produce the compensated output feature.
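
As a rough illustration of step (b), a truncated Tikhonov-filtered inverse can be written with the singular value decomposition of $A$. The sketch below assumes the textbook filter-factor form $\sigma_i / (\sigma_i^2 + \lambda)$ restricted to the $k$ largest singular modes. The helper name `truncated_tikhonov_inverse`, the parameters `k` and `lam`, and the toy diagonal operator are illustrative assumptions, not the configuration used in the paper. It relies on NumPy.

```python
import numpy as np

def truncated_tikhonov_inverse(A, b, k, lam):
    """Tikhonov-filtered pseudo-inverse restricted to the k largest singular modes.

    Hypothetical helper: k (retained modes) and lam (damping) are illustrative
    parameters, not values taken from the paper.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s is sorted in descending order
    k = min(k, s.size)
    # Filter factors sigma / (sigma^2 + lam) damp the weak singular modes.
    filt = s[:k] / (s[:k] ** 2 + lam)
    coeffs = filt * (U[:, :k].T @ b)
    return Vt[:k].T @ coeffs

# Toy operator with one nearly vanishing mode (an assumption for illustration).
A = np.diag([1.0, 0.5, 1e-6])
x_true = np.array([1.0, 2.0, 3.0])
b = A @ x_true
x_hat = truncated_tikhonov_inverse(A, b, k=2, lam=1e-8)
```

In this toy case the two dominant modes are recovered almost exactly, while the component carried by the weakest mode is discarded rather than amplified. That discarded residual is the kind of error a later refinement stage would have to handle.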

Figure 4. Architecture of the Multi-Stage Haar-Subband Attention Network (MS-HSANet). A stem encoder first contracts the compensated feature, after which three successive Haar-Subband Attention Block (HSAB) stages perform hierarchical subband-aware encoding with depths 3, 4, and 6, respectively. Cross-stage pooled features are then concatenated for classification. The lower panels detail HSAB and Haar-Prior depthwise convolution (HPDC), where Haar-initialized per-channel filters decompose features into $LL$, $LH$, $HL$, and $HH$ responses, followed by channel–spatial recalibration and residual fusion.
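
To make the Haar-initialized decomposition concrete, the sketch below applies the four 2×2 Haar analysis kernels to a single feature channel. The kernel size, stride of 2, 1/2 normalization, and the LH/HL labeling are assumptions (conventions differ between references), and `haar_subbands` is a hypothetical helper. In a learned depthwise convolution these kernels would only be the initial filter values.

```python
# 2x2 Haar analysis kernels with 1/2 scaling. The LH/HL labels follow one common
# convention and are an assumption, not the paper's definition.
HAAR_KERNELS = {
    "LL": [[0.5, 0.5], [0.5, 0.5]],
    "LH": [[0.5, 0.5], [-0.5, -0.5]],
    "HL": [[0.5, -0.5], [0.5, -0.5]],
    "HH": [[0.5, -0.5], [-0.5, 0.5]],
}

def haar_subbands(channel):
    """Correlate one feature channel with each Haar kernel at stride 2."""
    rows, cols = len(channel), len(channel[0])
    out = {}
    for name, k in HAAR_KERNELS.items():
        out[name] = [
            [
                sum(k[i][j] * channel[r + i][c + j] for i in range(2) for j in range(2))
                for c in range(0, cols - 1, 2)
            ]
            for r in range(0, rows - 1, 2)
        ]
    return out

# Toy channel that is constant inside every 2x2 block, so only LL responds.
channel = [
    [1.0, 1.0, 3.0, 3.0],
    [1.0, 1.0, 3.0, 3.0],
    [5.0, 5.0, 7.0, 7.0],
    [5.0, 5.0, 7.0, 7.0],
]
bands = haar_subbands(channel)
```

For this input the LL band carries all the energy and the three detail bands are zero. This mirrors the situation described in the abstract, where low-frequency support is stable and the class-relevant mid- and high-frequency responses are comparatively weak.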

Figure 5. Per-class receiver operating characteristic (ROC) curves of SCISA-Net on the 31-class wall-mediated indirect semantic inference task over the held-out validation split. Most class trajectories remain concentrated near the upper-left region, with consistently high AUC values, indicating stable class separability under wall-mediated indirect observation.

Figure 6. Normalized confusion matrix of SCISA-Net on the held-out validation split of the 31-class wall-mediated indirect semantic inference task. The concentration of response mass along the main diagonal, together with sparse off-diagonal dispersion, suggests that class-discriminative evidence remains recoverable across most categories under wall-mediated indirect observation.

Figure 8. Representative noisy wall-mediated indirect observations used in the robustness evaluation of Section 4.4. Starting from the clean reference observation, three corruption families (Gaussian noise, Poisson noise, and scatter noise) are imposed at progressively varying dB levels, yielding distinct degradation patterns for assessing the perturbation tolerance of SCISA-Net.
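
One way to impose noise at a prescribed dB level is sketched below for the Gaussian case only. It assumes that the dB value denotes a signal-to-noise ratio defined on mean signal power; the caption does not state the definition, and the Poisson and scatter corruptions are not covered. The helper names and the synthetic signal are hypothetical.

```python
import math
import random

def add_gaussian_noise(pixels, snr_db, rng):
    """Add zero-mean Gaussian noise whose power gives the requested SNR in dB.

    Assumption: SNR is defined as mean signal power over noise variance.
    """
    signal_power = sum(p * p for p in pixels) / len(pixels)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [p + rng.gauss(0.0, sigma) for p in pixels]

def empirical_snr_db(clean, noisy):
    """Measure the SNR actually realized by a noisy copy of a clean signal."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(signal_power / noise_power)

rng = random.Random(0)  # fixed seed so the example is repeatable
clean = [0.5 + 0.4 * math.sin(0.01 * i) for i in range(20000)]  # synthetic stand-in
noisy = add_gaussian_noise(clean, snr_db=10.0, rng=rng)
measured = empirical_snr_db(clean, noisy)
```

Measuring the realized SNR after corruption, as `empirical_snr_db` does, is a cheap check that the nominal dB level and the applied noise agree.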

Figure 9. Stage-wise Grad-CAM responses of SCISA-Net in the wall-observation coordinate system for the 31-class wall-mediated indirect semantic inference task. Shown from left to right are the wall-mediated indirect observation, the response at the SCIM output, and the responses back-projected from the outputs of MS-HSANet stage0–stage3. The colored overlays represent Grad-CAM heatmaps, where warmer colors, especially red and yellow, indicate stronger class-discriminative activation, whereas cooler colors such as green and blue indicate weaker activation. The activation evolves from compact evidence recovery after scene-constrained inversion, through intermediate carrier-aligned spatial expansion, to a more concentrated hotspot at the deepest stage, suggesting progressive consolidation of class-discriminative cues within the wall-borne modulation region.

Table 1. Quantitative comparison of SCISA-Net with representative CNN, transformer/hybrid, and frequency-aware baselines on the 31-class wall-mediated indirect semantic inference task. Results are reported on the validation split as mean ± standard deviation for Precision, Recall, F1, Accuracy, AUC, Cohen’s $κ$, Brier score, G-mean, and Specificity; ↑ and ↓ indicate metrics for which larger and smaller values are preferred, respectively. Red font denotes SCISA-Net, and blue font denotes the best competing result in each metric.

Methods	Precision ↑	Recall ↑	F1 ↑	Accuracy ↑	AUC ↑	Cohen’s $κ$ ↑	Brier ↓	G-Mean ↑	Specificity ↑
SCISA-Net	0.7260 ± 0.1017	0.7160 ± 0.1025	0.7170 ± 0.0884	0.7160 ± 0.1025	0.9759 ± 0.0134	0.7141	0.0151 ± 0.0244	0.8400 ± 0.0622	0.9908 ± 0.0041
VGG16	0.0013 ± 0.0070	0.0323 ± 0.1767	0.0025 ± 0.0135	0.0323 ± 0.1767	0.5060 ± 0.0346	0.0000	0.0312 ± 0.0002	0.0000 ± 0.0000	0.9677 ± 0.1767
VGG19	0.0038 ± 0.0105	0.0262 ± 0.0875	0.0062 ± 0.0172	0.0262 ± 0.0875	0.5261 ± 0.0379	−0.0072	0.0312 ± 0.0003	0.0441 ± 0.1223	0.9675 ± 0.0981
ResNetV2-101	0.0022 ± 0.0083	0.0309 ± 0.1395	0.0039 ± 0.0153	0.0309 ± 0.1395	0.5042 ± 0.0372	−0.0016	0.0314 ± 0.0010	0.0251 ± 0.0956	0.9677 ± 0.1444
DenseNet201	0.0746 ± 0.1828	0.0392 ± 0.0595	0.0311 ± 0.0360	0.0392 ± 0.0595	0.5456 ± 0.0609	0.0073	0.0313 ± 0.0012	0.1305 ± 0.1374	0.9680 ± 0.0456
RegNetY-3.2GF	0.0752 ± 0.0691	0.0749 ± 0.0919	0.0595 ± 0.0522	0.0749 ± 0.0919	0.6559 ± 0.0662	0.0462	0.0310 ± 0.0032	0.2121 ± 0.1593	0.9692 ± 0.0300
Inception-v3	0.0010 ± 0.0000	0.0323 ± 0.0312	0.0020 ± 0.0001	0.0316 ± 0.0000	0.4995 ± 0.0000	0.0000	0.0625 ± 0.0274	0.1748 ± 0.0000	0.9677 ± 0.0312
EfficientNet-B5	0.0326 ± 0.0454	0.0533 ± 0.0785	0.0347 ± 0.0420	0.0533 ± 0.0785	0.5895 ± 0.0617	0.0229	0.0310 ± 0.0014	0.1445 ± 0.1659	0.9685 ± 0.0450
ConvNeXt-B	0.0188 ± 0.0049	0.0188 ± 0.0049	0.0164 ± 0.0029	0.0276 ± 0.0040	0.5073 ± 0.0074	−0.0048	0.9722 ± 0.0006	0.0000 ± 0.0000	0.9676 ± 0.0001
Swin-T V1-B	0.0326 ± 0.0454	0.0533 ± 0.0785	0.0347 ± 0.0420	0.0533 ± 0.0785	0.5895 ± 0.0617	0.0229	0.0310 ± 0.0014	0.1445 ± 0.1659	0.9685 ± 0.0450
Swin-T V2-B	0.0962 ± 0.0779	0.0997 ± 0.1028	0.0824 ± 0.0691	0.0997 ± 0.1028	0.7051 ± 0.0624	0.0754	0.0305 ± 0.0041	0.2487 ± 0.1803	0.9702 ± 0.0270
Swin-T V2-G	0.0013 ± 0.0070	0.0323 ± 0.1767	0.0025 ± 0.0135	0.0323 ± 0.1767	0.5026 ± 0.0338	0.0000	0.0312 ± 0.0003	0.0000 ± 0.0000	0.9677 ± 0.1767
DaViT-B	0.0072 ± 0.0218	0.0347 ± 0.0982	0.0101 ± 0.0274	0.0347 ± 0.0982	0.5169 ± 0.0339	0.0026	0.0312 ± 0.0003	0.0554 ± 0.1459	0.9678 ± 0.0967
UniFormer	0.0063 ± 0.0168	0.0383 ± 0.1182	0.0101 ± 0.0269	0.0383 ± 0.1182	0.5236 ± 0.0466	0.0069	0.0312 ± 0.0004	0.0557 ± 0.1479	0.9680 ± 0.1046
CoCa	0.0828 ± 0.0334	0.0816 ± 0.0322	0.0814 ± 0.0315	0.0816 ± 0.0322	0.5358 ± 0.0365	0.0511	0.0804 ± 0.0145	0.2748 ± 0.0806	0.9694 ± 0.0072
DFCIL-HGR	0.0013 ± 0.0070	0.0323 ± 0.1767	0.0025 ± 0.0135	0.0323 ± 0.1767	0.5003 ± 0.0627	0.0000	0.0312 ± 0.0002	0.0000 ± 0.0000	0.9677 ± 0.1767
Human–Object Relation Network	0.0125 ± 0.0245	0.0373 ± 0.0925	0.0140 ± 0.0267	0.0373 ± 0.0925	0.5046 ± 0.0414	0.0057	0.0312 ± 0.0003	0.0746 ± 0.1509	0.9679 ± 0.0854
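
For reading the table, it can help to see how several of these metrics behave for a degenerate classifier. The sketch below computes per-class recall and specificity, their geometric mean, and Cohen's $κ$ from a confusion matrix, using common one-vs-rest definitions and a population standard deviation across classes. These conventions, the balanced class sizes, and the helper names are assumptions, since the exact averaging used for the table is not stated here.

```python
import math

def per_class_rates(conf):
    """Per-class (recall, specificity) from a square confusion matrix
    with true classes on rows and predicted classes on columns."""
    n = len(conf)
    total = sum(sum(row) for row in conf)
    rates = []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp
        fp = sum(conf[r][c] for r in range(n)) - tp
        tn = total - tp - fn - fp
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        rates.append((recall, specificity))
    return rates

def cohen_kappa(conf):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(conf)
    total = sum(sum(row) for row in conf)
    observed = sum(conf[c][c] for c in range(n)) / total
    expected = sum(
        sum(conf[c]) * sum(conf[r][c] for r in range(n)) for c in range(n)
    ) / (total * total)
    return (observed - expected) / (1.0 - expected) if expected < 1.0 else 0.0

def mean_std(values):
    """Mean and population standard deviation of a list of numbers."""
    m = sum(values) / len(values)
    return m, math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

# Degenerate case (assumed balanced): 31 classes, every sample predicted as class 0.
n_classes, per_class = 31, 10
conf = [[per_class if col == 0 else 0 for col in range(n_classes)]
        for _ in range(n_classes)]
rates = per_class_rates(conf)
recall_mean, recall_std = mean_std([r for r, _ in rates])
spec_mean, spec_std = mean_std([s for _, s in rates])
gmean_mean, _ = mean_std([math.sqrt(r * s) for r, s in rates])
kappa = cohen_kappa(conf)
```

Under these assumptions a single-class predictor yields mean recall 1/31 ≈ 0.0323 with standard deviation ≈ 0.1767, mean specificity 30/31 ≈ 0.9677, a G-mean of 0, and $κ$ = 0. The same figures appear in several baseline rows of the table, so those rows are numerically indistinguishable from chance-level prediction on this metric set.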

Table 2. Ablation study of SCISA-Net on the 31-class wall-mediated indirect semantic inference task. The upper block removes or substitutes SCIM while MS-HSANet is retained; the lower block removes or substitutes MS-HSANet while SCIM is retained. A check mark denotes component retention, whereas a cross denotes component removal or replacement by the module indicated in parentheses. Results on the validation split are reported in terms of Precision, Recall, F1, Accuracy, AUC, Cohen’s $κ$, Brier score, G-mean, and Specificity; ↑ and ↓ indicate metrics for which larger and smaller values are preferred, respectively. Red font denotes the complete SCISA-Net model, and blue font highlights the best-performing ablated or replacement setting for each metric within the corresponding comparison block.

w/o		Classification Metrics
SCIM	MS-HSANet	Precision ↑	Recall ↑	F1 ↑	Accuracy ↑	AUC ↑	Cohen’s $κ$ ↑	Brier ↓	G-Mean ↑	Specificity ↑
✔	✔	0.7260 ± 0.1017	0.7160 ± 0.1025	0.7170 ± 0.0884	0.7160 ± 0.1025	0.9759 ± 0.0134	0.7141	0.0151 ± 0.0244	0.8400 ± 0.0622	0.9908 ± 0.0041
×	✔	0.0080 ± 0.0204	0.0384 ± 0.0937	0.0129 ± 0.0319	0.0384 ± 0.0937	0.5206 ± 0.0461	0.0066	0.0312 ± 0.0004	0.0687 ± 0.1586	0.9680 ± 0.0830
× (ResNet-101)	✔	0.0026 ± 0.0098	0.0322 ± 0.1738	0.0032 ± 0.0139	0.0322 ± 0.1738	0.4986 ± 0.0374	0.0000	0.0312 ± 0.0004	0.0095 ± 0.0368	0.9677 ± 0.1713
× (VGG19)	✔	0.0029 ± 0.0112	0.0338 ± 0.1684	0.0046 ± 0.0176	0.0338 ± 0.1684	0.4979 ± 0.0226	0.0019	0.0312 ± 0.0001	0.0183 ± 0.0698	0.9678 ± 0.1632
× (RepVGG)	✔	0.0053 ± 0.0167	0.0322 ± 0.1626	0.0054 ± 0.0176	0.0322 ± 0.1626	0.5175 ± 0.0882	−0.0003	0.0314 ± 0.0017	0.0198 ± 0.0622	0.9677 ± 0.1658
× (MLP-Mixer)	✔	0.0045 ± 0.0145	0.0338 ± 0.1678	0.0056 ± 0.0176	0.0338 ± 0.1678	0.5000 ± 0.0408	0.0019	0.0312 ± 0.0002	0.0231 ± 0.0712	0.9678 ± 0.1614
× (ResMLP)	✔	0.0013 ± 0.0070	0.0323 ± 0.1767	0.0025 ± 0.0135	0.0323 ± 0.1767	0.5000 ± 0.0000	0.0000	0.0312 ± 0.0002	0.0000 ± 0.0000	0.9677 ± 0.1767
× (Swin-Transformer-V2-B)	✔	0.0013 ± 0.0070	0.0323 ± 0.1767	0.0025 ± 0.0135	0.0323 ± 0.1767	0.4995 ± 0.0099	0.0000	0.0312 ± 0.0002	0.0000 ± 0.0000	0.9677 ± 0.1767
× (ViT-B)	✔	0.0013 ± 0.0070	0.0323 ± 0.1767	0.0025 ± 0.0135	0.0323 ± 0.1767	0.5148 ± 0.0359	0.0000	0.0312 ± 0.0002	0.0000 ± 0.0000	0.9677 ± 0.1767
✔	×	0.0854 ± 0.0630	0.0890 ± 0.0682	0.0855 ± 0.0636	0.0890 ± 0.0682	0.6335 ± 0.0793	0.0602	0.0339 ± 0.0086	0.2594 ± 0.1365	0.9697 ± 0.0133
✔	× (GFNet)	0.6712 ± 0.1203	0.6597 ± 0.0931	0.6622 ± 0.0959	0.6597 ± 0.0931	0.9666 ± 0.0142	0.6547	0.0178 ± 0.0251	0.8056 ± 0.0591	0.9889 ± 0.0048
✔	× (FCANet)	0.5287 ± 0.1309	0.5223 ± 0.1245	0.5228 ± 0.1219	0.5223 ± 0.1245	0.9292 ± 0.0311	0.5151	0.0244 ± 0.0262	0.7115 ± 0.0902	0.9844 ± 0.0049
✔	× (FreqNet)	0.6529 ± 0.1047	0.6359 ± 0.1003	0.6370 ± 0.0812	0.6359 ± 0.1003	0.9653 ± 0.0148	0.6293	0.0177 ± 0.0233	0.7900 ± 0.0643	0.9880 ± 0.0055
✔	× (Scattering-based ViT)	0.6875 ± 0.1143	0.6779 ± 0.0902	0.6791 ± 0.0906	0.6779 ± 0.0902	0.9575 ± 0.0187	0.6737	0.0137 ± 0.0179	0.8172 ± 0.0550	0.9895 ± 0.0045
✔	× (AFNONet)	0.4496 ± 0.1243	0.4392 ± 0.0994	0.4407 ± 0.1041	0.4392 ± 0.0994	0.9129 ± 0.0311	0.4258	0.0260 ± 0.0222	0.6522 ± 0.0770	0.9815 ± 0.0060
✔	× (FFCNet)	0.6860 ± 0.1059	0.6776 ± 0.1010	0.6789 ± 0.0938	0.6776 ± 0.1010	0.9732 ± 0.0120	0.6737	0.0174 ± 0.0255	0.8165 ± 0.0623	0.9895 ± 0.0041
✔	× (DCTNet)	0.6863 ± 0.0923	0.6795 ± 0.1001	0.6801 ± 0.0872	0.6795 ± 0.1001	0.9516 ± 0.0227	0.6743	0.0142 ± 0.0150	0.8177 ± 0.0623	0.9895 ± 0.0040

Table 3. Internal ablation of the HSAB design within MS-HSANet on the 31-class wall-mediated indirect semantic inference task. The original HSAB stack is compared with its removal and with parameter-matched substitutions, including Inception-v2, Global Filter, gMLP, MobileViT, ShuffleNetV2, Ghost Bottleneck, ResNeXt Block, and DenseNet Bottleneck. Validation results are reported as mean ± standard deviation in terms of Precision, Recall, F1, Accuracy, AUC, Cohen’s κ, Brier score, G-mean, and Specificity, where ↑ and ↓ denote preferable larger and smaller values, respectively. Red font marks the original HSAB configuration; blue font marks the best competing result for each metric.

Methods	Precision ↑	Recall ↑	F1 ↑	Accuracy ↑	AUC ↑	Cohen’s $κ ↑$	Brier ↓	G-Mean ↑	Specificity ↑
HSAB	0.7260 ± 0.1017	0.7160 ± 0.1025	0.7170 ± 0.0884	0.7160 ± 0.1025	0.9759 ± 0.0134	0.7141	0.0151 ± 0.0244	0.8400 ± 0.0622	0.9908 ± 0.0041
w/o HSAB	0.0885 ± 0.0586	0.0981 ± 0.0992	0.0844 ± 0.0702	0.0981 ± 0.0992	0.6795 ± 0.0755	0.0717	0.0310 ± 0.0052	0.2566 ± 0.1667	0.9701 ± 0.0216
Inception-v2	0.6497 ± 0.1376	0.6260 ± 0.1237	0.6283 ± 0.1043	0.6260 ± 0.1237	0.9753 ± 0.0103	0.6215	0.0155 ± 0.0142	0.7824 ± 0.0783	0.9878 ± 0.0069
Global Filter	0.5612 ± 0.0889	0.5527 ± 0.1092	0.5520 ± 0.0867	0.5527 ± 0.1092	0.9495 ± 0.0195	0.5437	0.0187 ± 0.0174	0.7340 ± 0.0751	0.9853 ± 0.0049
gMLP	0.5015 ± 0.1244	0.4913 ± 0.1113	0.4935 ± 0.1115	0.4913 ± 0.1113	0.9318 ± 0.0264	0.4824	0.0232 ± 0.0220	0.6902 ± 0.0825	0.9833 ± 0.0056
MobileViT	0.6801 ± 0.1018	0.6738 ± 0.0892	0.6745 ± 0.0869	0.6738 ± 0.0892	0.9739 ± 0.0127	0.6685	0.0147 ± 0.0194	0.8147 ± 0.0543	0.9893 ± 0.0039
ShuffleNetV2	0.6237 ± 0.0873	0.6198 ± 0.1088	0.6184 ± 0.0869	0.6198 ± 0.1088	0.9675 ± 0.0114	0.6130	0.0160 ± 0.0162	0.7792 ± 0.0700	0.9875 ± 0.0035
Ghost Bottleneck	0.6596 ± 0.1048	0.6561 ± 0.1047	0.6563 ± 0.1001	0.6561 ± 0.1047	0.9699 ± 0.0118	0.6508	0.0185 ± 0.0260	0.8027 ± 0.0676	0.9887 ± 0.0036
ResNeXt Block	0.6943 ± 0.1340	0.6664 ± 0.1127	0.6685 ± 0.0946	0.6664 ± 0.1127	0.9741 ± 0.0125	0.6607	0.0146 ± 0.0180	0.8088 ± 0.0685	0.9891 ± 0.0076
DenseNet Bottleneck	0.6442 ± 0.1128	0.6307 ± 0.1008	0.6325 ± 0.0910	0.6307 ± 0.1008	0.9659 ± 0.0149	0.6254	0.0164 ± 0.0187	0.7868 ± 0.0638	0.9879 ± 0.0048

Table 4. Stage depth ablation of the Multi-Stage Haar-Subband Attention Network. One_HSAB keeps stage0 and a single HSAB placed in stage1. Stage_01, stage_012, and stage_0123 progressively retain stage0–stage1, stage0–stage2, and stage0–stage3, respectively. HSAB_13 keeps stage0 and places 13 HSABs in stage1, where 13 equals the total number of HSABs used across stage1, stage2, and stage3 in the complete multi-stage setting. Validation results are reported as mean ± standard deviation in terms of Precision, Recall, F1, Accuracy, AUC, Cohen’s κ, Brier score, G-mean, and Specificity, where ↑ and ↓ denote preferable larger and smaller values, respectively. Red font denotes the complete stage_0123 setting.

MS-HSANet Setting	Precision ↑	Recall ↑	F1 ↑	Accuracy ↑	AUC ↑	Cohen’s $κ ↑$	Brier ↓	G-Mean ↑	Specificity ↑
One_HSAB	0.3480 ± 0.1455	0.2968 ± 0.1624	0.2849 ± 0.0997	0.2968 ± 0.1624	0.8819 ± 0.0424	0.2800	0.0262 ± 0.0100	0.5172 ± 0.1420	0.9768 ± 0.0173
stage_01	0.5640 ± 0.1268	0.5290 ± 0.1599	0.5261 ± 0.1122	0.5290 ± 0.1599	0.9552 ± 0.0189	0.5234	0.0188 ± 0.0172	0.7121 ± 0.1135	0.9846 ± 0.0097
stage_012	0.6452 ± 0.0969	0.6353 ± 0.1127	0.6355 ± 0.0906	0.6353 ± 0.1127	0.9699 ± 0.0129	0.6312	0.0152 ± 0.0182	0.7891 ± 0.0709	0.9881 ± 0.0047
stage_0123	0.7260 ± 0.1017	0.7160 ± 0.1025	0.7170 ± 0.0884	0.7160 ± 0.1025	0.9759 ± 0.0134	0.7141	0.0151 ± 0.0244	0.8400 ± 0.0622	0.9908 ± 0.0041
HSAB_13	0.6152 ± 0.0851	0.6012 ± 0.1288	0.6006 ± 0.0867	0.6012 ± 0.1288	0.9659 ± 0.0124	0.5972	0.0163 ± 0.0171	0.7657 ± 0.0820	0.9870 ± 0.0054

Table 5. Inference-time subband intervention analysis of HP-DConv in HSAB. Complete HP-DConv denotes the original setting in which the LL, LH, HL, and HH subbands are all preserved and arranged in their normal order. LL only preserves just the low-frequency LL subband, while Remove LL, Remove LH, Remove HL, and Remove HH each suppress the corresponding subband. Random permutation reports the average result over all 23 non-identity permutations of the four Haar subbands. ↑ and ↓ indicate metrics for which larger and smaller values are preferred, respectively. Red font denotes the complete HP-DConv setting.

Subband Intervention	Precision ↑	Recall ↑	F1 ↑	Accuracy ↑	AUC ↑	Cohen’s $κ ↑$	Brier ↓	G-Mean ↑	Specificity ↑
Complete HP-DConv	0.7260 ± 0.1017	0.7160 ± 0.1025	0.7170 ± 0.0884	0.7160 ± 0.1025	0.9759 ± 0.0134	0.7141	0.0151 ± 0.0244	0.8400 ± 0.0622	0.9908 ± 0.0041
LL only	0.0126 ± 0.0325	0.0362 ± 0.1210	0.0121 ± 0.0288	0.0362 ± 0.1210	0.4945 ± 0.0750	0.0037	0.0548 ± 0.0126	0.0603 ± 0.1338	0.9679 ± 0.1128
Remove LL	0.0282 ± 0.0501	0.0313 ± 0.0760	0.0177 ± 0.0267	0.0313 ± 0.0760	0.5365 ± 0.0940	−0.0021	0.0505 ± 0.0117	0.0939 ± 0.1351	0.9677 ± 0.0797
Remove LH	0.3842 ± 0.2951	0.2469 ± 0.2457	0.2228 ± 0.1736	0.2469 ± 0.2457	0.7857 ± 0.1220	0.2348	0.0396 ± 0.0242	0.4091 ± 0.2548	0.9753 ± 0.0478
Remove HL	0.0498 ± 0.1062	0.0473 ± 0.1739	0.0229 ± 0.0440	0.0473 ± 0.1739	0.6098 ± 0.1482	0.0156	0.0568 ± 0.0143	0.0740 ± 0.1322	0.9682 ± 0.1346
Remove HH	0.7137 ± 0.1016	0.6957 ± 0.1311	0.6953 ± 0.0926	0.6957 ± 0.1311	0.9735 ± 0.0146	0.6952	0.0159 ± 0.0247	0.8258 ± 0.0812	0.9902 ± 0.0050
Random permutation	0.0256 ± 0.0681	0.0402 ± 0.0407	0.0174 ± 0.0394	0.0402 ± 0.0407	0.5196 ± 0.0672	0.0087	0.0564 ± 0.0050	0.0665 ± 0.0730	0.9680 ± 0.0014
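
The interventions in Table 5 can be pictured with a single-level Haar decomposition. The sketch below is illustrative only: it operates on a plain 2D array rather than on HP-DConv feature maps, it assumes that "suppress" means zeroing a subband, and the LH/HL naming and the slot-reassignment reading of "permutation" are assumptions, since conventions differ between libraries. The helper names are hypothetical.

```python
import itertools
import numpy as np

ORDER = ("LL", "LH", "HL", "HH")  # assumed "normal order" of the four subbands

def haar_dwt2(x):
    """Single-level orthonormal 2D Haar transform (even height and width)."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    return {
        "LL": (a + b + c + d) / 2.0,  # local average
        "LH": (a + b - c - d) / 2.0,  # difference between rows
        "HL": (a - b + c - d) / 2.0,  # difference between columns
        "HH": (a - b - c + d) / 2.0,  # diagonal difference
    }

def haar_idwt2(sub):
    """Inverse of haar_dwt2."""
    ll, lh, hl, hh = (sub[k] for k in ORDER)
    out = np.empty((2 * ll.shape[0], 2 * ll.shape[1]), dtype=float)
    out[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    out[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    out[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    out[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return out

def remove_subband(sub, name):
    """'Remove X': zero one subband (assumed meaning of 'suppress')."""
    return {k: (np.zeros_like(v) if k == name else v) for k, v in sub.items()}

def keep_only(sub, name):
    """'LL only' style intervention: zero every subband except one."""
    return {k: (v if k == name else np.zeros_like(v)) for k, v in sub.items()}

def permute_subbands(sub, perm):
    """Place subband perm[i] into the slot normally held by ORDER[i]."""
    return {slot: sub[src] for slot, src in zip(ORDER, perm)}

# 4! = 24 orderings, minus the identity, gives the 23 permutations in the caption.
non_identity_perms = [p for p in itertools.permutations(ORDER) if p != ORDER]

rng = np.random.default_rng(0)
x = rng.random((8, 8))
sub = haar_dwt2(x)
x_rec = haar_idwt2(sub)                           # exact reconstruction
x_no_hh = haar_idwt2(remove_subband(sub, "HH"))   # 'Remove HH' on a toy input
```

Averaging a metric over `non_identity_perms` would mirror the Random permutation row, with the model's evaluation loop substituted for the toy array used here.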

Table 7. Quantitative robustness evaluation of SCISA-Net under simulated observation corruption on the 31-class wall-mediated indirect semantic inference task. Validation performance is reported as mean ± standard deviation in terms of Macro-Precision, Macro-Recall, Macro-F1, Accuracy, AUC, Cohen’s κ, Brier loss, G-mean, and Specificity. ↑ and ↓ indicate metrics for which larger and smaller values are preferred, respectively. Corruption settings comprise a noise-free reference and three noise families, namely Gaussian, Poisson, and Scatter noise, each examined at multiple intensity levels. Red font denotes the noise-free reference condition, and blue font denotes the best competing result for each metric.

Corruption Setting	Macro-Precision ↑	Macro-Recall ↑	Macro-F1 ↑	Accuracy ↑	AUC ↑	Cohen’s $κ ↑$	Brier Loss ↓	G-Mean ↑	Specificity ↑
Noise-Free	0.7260 ± 0.1017	0.7160 ± 0.1025	0.7170 ± 0.0884	0.7160 ± 0.1025	0.9759 ± 0.0134	0.7141	0.0151 ± 0.0244	0.8400 ± 0.0622	0.9908 ± 0.0041
Gaussian Noise ( $125 dB$ )	0.4301 ± 0.1491	0.3839 ± 0.1957	0.3728 ± 0.1235	0.3839 ± 0.1957	0.8772 ± 0.0406	0.3633	0.0332 ± 0.0269	0.5911 ± 0.1546	0.9795 ± 0.0189
Gaussian Noise ( $130 dB$ )	0.5840 ± 0.1436	0.5694 ± 0.1595	0.5661 ± 0.1319	0.5694 ± 0.1595	0.9466 ± 0.0266	0.5550	0.0233 ± 0.0271	0.7404 ± 0.1133	0.9856 ± 0.0079
Gaussian Noise ( $135 dB$ )	0.6909 ± 0.1113	0.6871 ± 0.1350	0.6835 ± 0.1120	0.6871 ± 0.1350	0.9669 ± 0.0267	0.6767	0.0168 ± 0.0251	0.8197 ± 0.0902	0.9896 ± 0.0043
Gaussian Noise ( $145 dB$ )	0.7403 ± 0.1124	0.7323 ± 0.1168	0.7316 ± 0.1009	0.7323 ± 0.1168	0.9770 ± 0.0176	0.7233	0.0148 ± 0.0244	0.8489 ± 0.0718	0.9911 ± 0.0046
Gaussian Noise ( $155 dB$ )	0.7279 ± 0.1127	0.7194 ± 0.1196	0.7182 ± 0.1019	0.7194 ± 0.1196	0.9774 ± 0.0169	0.7100	0.0154 ± 0.0248	0.8408 ± 0.0756	0.9906 ± 0.0048
Poisson Noise ( $80 dB$ )	0.4892 ± 0.1585	0.4387 ± 0.1120	0.4464 ± 0.1039	0.4387 ± 0.1120	0.8768 ± 0.0397	0.4200	0.0291 ± 0.0267	0.6508 ± 0.0805	0.9813 ± 0.0135
Poisson Noise ( $85 dB$ )	0.4847 ± 0.1819	0.4258 ± 0.1458	0.4331 ± 0.1351	0.4258 ± 0.1458	0.8541 ± 0.0515	0.4067	0.0305 ± 0.0269	0.6351 ± 0.1178	0.9809 ± 0.0157
Poisson Noise ( $90 dB$ )	0.4456 ± 0.1649	0.3968 ± 0.1224	0.4055 ± 0.1208	0.3968 ± 0.1224	0.8567 ± 0.0524	0.3767	0.0314 ± 0.0261	0.6160 ± 0.0970	0.9799 ± 0.0139
Poisson Noise ( $100 dB$ )	0.4852 ± 0.1462	0.4565 ± 0.1038	0.4609 ± 0.1090	0.4565 ± 0.1038	0.8742 ± 0.0397	0.4383	0.0281 ± 0.0260	0.6637 ± 0.0873	0.9819 ± 0.0104
Poisson Noise ( $110 dB$ )	0.4831 ± 0.1421	0.4613 ± 0.1528	0.4576 ± 0.1241	0.4613 ± 0.1528	0.8794 ± 0.0440	0.4433	0.0282 ± 0.0266	0.6610 ± 0.1243	0.9820 ± 0.0110
Scatter Noise ( $110 dB$ )	0.2877 ± 0.2408	0.2097 ± 0.2042	0.1888 ± 0.1346	0.2097 ± 0.2042	0.7721 ± 0.0692	0.1833	0.0425 ± 0.0228	0.3756 ± 0.2383	0.9737 ± 0.0432
Scatter Noise ( $115 dB$ )	0.4544 ± 0.1489	0.3984 ± 0.1673	0.3925 ± 0.1132	0.3984 ± 0.1673	0.8859 ± 0.0463	0.3783	0.0318 ± 0.0265	0.6087 ± 0.1342	0.9799 ± 0.0169
Scatter Noise ( $120 dB$ )	0.6050 ± 0.1258	0.5919 ± 0.1443	0.5892 ± 0.1215	0.5919 ± 0.1443	0.9514 ± 0.0250	0.5783	0.0215 ± 0.0260	0.7576 ± 0.0991	0.9864 ± 0.0069
Scatter Noise ( $130 dB$ )	0.7093 ± 0.1308	0.7032 ± 0.1420	0.7011 ± 0.1249	0.7032 ± 0.1420	0.9734 ± 0.0208	0.6933	0.0157 ± 0.0247	0.8293 ± 0.0931	0.9901 ± 0.0051
Scatter Noise ( $140 dB$ )	0.7356 ± 0.1163	0.7258 ± 0.1301	0.7238 ± 0.1076	0.7258 ± 0.1301	0.9762 ± 0.0173	0.7167	0.0150 ± 0.0245	0.8439 ± 0.0833	0.9909 ± 0.0051

Table 8. Sensitivity analysis of the SCIM inversion parameters under the 100 dB Poisson noise condition. The default setting uses the fixed singular-value truncation number $k_{0}$ and Tikhonov regularization parameter $λ_{0}$ adopted in the main experiments. Additional settings vary either $k$ or $λ$ while keeping the trained checkpoint and the remaining inference pipeline unchanged. Validation results are reported as mean ± standard deviation in terms of Precision, Recall, F1, Accuracy, AUC, Cohen’s κ, Brier score, G-mean, and Specificity, where ↑ and ↓ denote preferable larger and smaller values, respectively. Red font denotes the default setting, and blue font marks the best result for each metric among the non-default settings.

SCIM Setting	Precision ↑	Recall ↑	F1 ↑	Accuracy ↑	AUC ↑	Cohen’s $κ ↑$	Brier ↓	G-Mean ↑	Specificity ↑
Default	0.4852 ± 0.1462	0.4565 ± 0.1038	0.4609 ± 0.1090	0.4565 ± 0.1038	0.8742 ± 0.0397	0.4383	0.0281 ± 0.0260	0.6637 ± 0.0873	0.9819 ± 0.0104
$0.5 k_{0}$	0.4521 ± 0.1195	0.4242 ± 0.0974	0.4267 ± 0.0879	0.4242 ± 0.0974	0.8655 ± 0.0419	0.4050	0.0302 ± 0.0261	0.6401 ± 0.0778	0.9808 ± 0.0102
$1.5 k_{0}$	0.4354 ± 0.1577	0.4087 ± 0.1274	0.4073 ± 0.1124	0.4087 ± 0.1274	0.8508 ± 0.0470	0.3877	0.0306 ± 0.0261	0.6175 ± 0.1370	0.9802 ± 0.0112
$0.5 λ_{0}$	0.4866 ± 0.1405	0.4511 ± 0.0946	0.4572 ± 0.0983	0.4511 ± 0.0946	0.8721 ± 0.0404	0.4327	0.0283 ± 0.0260	0.6610 ± 0.0759	0.9817 ± 0.0112
$1.5 λ_{0}$	0.4845 ± 0.1456	0.4468 ± 0.0941	0.4532 ± 0.1014	0.4468 ± 0.0941	0.8720 ± 0.0430	0.4283	0.0283 ± 0.0261	0.6579 ± 0.0750	0.9816 ± 0.0114
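
The two parameters varied in Table 8 can be illustrated with a generic regularized inverse. This is a sketch under stated assumptions, not the SCIM implementation: it assumes the truncation and the Tikhonov term are combined by applying the filter factors σ/(σ² + λ) to the leading k singular components, and the function name `tsvd_tikhonov_solve`, the random operator `A`, and the values of `k0` and `lam0` are all hypothetical.

```python
import numpy as np

def tsvd_tikhonov_solve(A, y, k, lam):
    """Estimate x from y ≈ A x using the k largest singular components of A,
    each damped by the Tikhonov filter factor s / (s**2 + lam)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = int(min(k, s.size))
    filt = s[:k] / (s[:k] ** 2 + lam)
    return Vt[:k].T @ (filt * (U[:, :k].T @ y))

# Toy operator and signal (hypothetical sizes and values).
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 20))
x_true = rng.standard_normal(20)
y = A @ x_true

# With every component kept and no damping, the solve is an exact pseudo-inverse.
x_full = tsvd_tikhonov_solve(A, y, k=20, lam=0.0)

# Perturbations in the style of Table 8: scale k0 or lam0 by 0.5 and 1.5.
k0, lam0 = 10, 1e-2
x_default = tsvd_tikhonov_solve(A, y, k0, lam0)
x_half_k = tsvd_tikhonov_solve(A, y, max(1, round(0.5 * k0)), lam0)
x_more_lam = tsvd_tikhonov_solve(A, y, k0, 1.5 * lam0)
```

In this form, k removes small singular components outright while λ only damps the retained ones, which is one possible reason the λ rows in Table 8 stay closer to the default than the k rows.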

Table 9. Quantitative evaluation of SCISA-Net under ambient-light background interference on the wall-mediated indirect observation images. Three representative background patterns are considered, including uniform background, random-direction linear gradient background, and random-center radial gradient background. Results are reported on the validation split as mean ± standard deviation for Precision, Recall, F1, Accuracy, AUC, Brier score, G-mean, and Specificity, while Cohen’s κ is reported as a scalar value. A smaller SBR indicates stronger background interference. Red font denotes the no-background reference condition, and blue font marks the best result for each metric within each background type. ↑ and ↓ indicate metrics for which larger and smaller values are preferred, respectively.

Background Type	$SBR$	Precision ↑	Recall ↑	F1 ↑	Accuracy ↑	AUC ↑	Cohen’s $κ ↑$	Brier ↓	G-Mean ↑	Specificity ↑
No background	–	0.7260 ± 0.1017	0.7160 ± 0.1025	0.7170 ± 0.0884	0.7160 ± 0.1025	0.9759 ± 0.0134	0.7141	0.0151 ± 0.0244	0.8400 ± 0.0622	0.9908 ± 0.0041
Uniform background	$55 dB$	0.5984 ± 0.1127	0.5839 ± 0.1286	0.5827 ± 0.0980	0.5839 ± 0.1286	0.9481 ± 0.0207	0.5780	0.0219 ± 0.0265	0.7540 ± 0.0849	0.9864 ± 0.0062
Uniform background	$60 dB$	0.6490 ± 0.1093	0.6377 ± 0.1151	0.6381 ± 0.0983	0.6377 ± 0.1151	0.9622 ± 0.0188	0.6345	0.0190 ± 0.0261	0.7904 ± 0.0740	0.9882 ± 0.0050
Uniform background	$70 dB$	0.6974 ± 0.1060	0.6894 ± 0.1068	0.6903 ± 0.0966	0.6894 ± 0.1068	0.9715 ± 0.0158	0.6874	0.0165 ± 0.0252	0.8235 ± 0.0663	0.9899 ± 0.0040
Linear gradient background	$50 dB$	0.5302 ± 0.1321	0.5005 ± 0.1494	0.4987 ± 0.1089	0.5005 ± 0.1494	0.9259 ± 0.0282	0.4919	0.0260 ± 0.0269	0.6928 ± 0.1070	0.9836 ± 0.0105
Linear gradient background	$55 dB$	0.6424 ± 0.1125	0.6274 ± 0.1166	0.6280 ± 0.0962	0.6274 ± 0.1166	0.9577 ± 0.0203	0.6228	0.0197 ± 0.0263	0.7836 ± 0.0750	0.9878 ± 0.0057
Linear gradient background	$65 dB$	0.7109 ± 0.1214	0.7019 ± 0.1081	0.7030 ± 0.1037	0.7019 ± 0.1081	0.9735 ± 0.0161	0.6993	0.0158 ± 0.0247	0.8310 ± 0.0679	0.9903 ± 0.0045
Radial gradient background	$55 dB$	0.6382 ± 0.0990	0.6284 ± 0.1185	0.6272 ± 0.0936	0.6284 ± 0.1185	0.9585 ± 0.0198	0.6234	0.0197 ± 0.0261	0.7839 ± 0.0780	0.9879 ± 0.0046
Radial gradient background	$60 dB$	0.6850 ± 0.1082	0.6775 ± 0.1035	0.6789 ± 0.0987	0.6775 ± 0.1035	0.9701 ± 0.0157	0.6750	0.0167 ± 0.0249	0.8162 ± 0.0659	0.9895 ± 0.0039
Radial gradient background	$70 dB$	0.7224 ± 0.1111	0.7124 ± 0.1040	0.7139 ± 0.0970	0.7124 ± 0.1040	0.9755 ± 0.0146	0.7109	0.0152 ± 0.0244	0.8377 ± 0.0647	0.9907 ± 0.0044

Table 10. Performance under scene re-parameterization with the updated scene information encoding operator A. Validation results are reported as mean ± standard deviation in terms of Macro-Precision, Macro-Recall, Macro-F1, Accuracy, AUC, Brier loss, G-mean, and Specificity, while Cohen’s κ is reported as a scalar value. Red font denotes the default calibrated setting, and blue font marks the best result for each metric among the re-parameterized settings. ↑ and ↓ indicate metrics for which larger and smaller values are preferred, respectively.

Scene Parameter Setting	Macro-Precision ↑	Macro-Recall ↑	Macro-F1 ↑	Accuracy ↑	AUC ↑	Cohen’s $κ ↑$	Brier Loss ↓	G-Mean ↑	Specificity ↑
Default	0.7260 ± 0.1017	0.7160 ± 0.1025	0.7170 ± 0.0884	0.7160 ± 0.1025	0.9759 ± 0.0134	0.7141	0.0151 ± 0.0244	0.8400 ± 0.0622	0.9908 ± 0.0041
Wall–screen distance $D + 5 cm$	0.6975 ± 0.1091	0.6887 ± 0.1119	0.6894 ± 0.1014	0.6887 ± 0.1119	0.9715 ± 0.0133	0.6871	0.0162 ± 0.0248	0.8227 ± 0.0712	0.9899 ± 0.0042
Occluder shifted toward wall by $5 cm$	0.7317 ± 0.0988	0.7230 ± 0.0971	0.7239 ± 0.0877	0.7230 ± 0.0971	0.9777 ± 0.0117	0.7208	0.0147 ± 0.0243	0.8443 ± 0.0610	0.9910 ± 0.0040
Effective screen region shifted along $+ x$ by $5 cm$	0.7215 ± 0.1000	0.7130 ± 0.0970	0.7140 ± 0.0882	0.7130 ± 0.0970	0.9755 ± 0.0125	0.7106	0.0151 ± 0.0243	0.8383 ± 0.0607	0.9907 ± 0.0039

Table 11. Gesture-morphology-stratified evaluation of SCISA-Net on the held-out validation split. The original 31-class task is partitioned, without retraining, into three subset groups defined by the number of extended fingers: Closed (0–1), Half-Open (2–3), and Spread (4–5). Results are reported as mean ± standard deviation in terms of Macro-Precision, Macro-Recall, Macro-F1, Accuracy, AUC, Cohen’s κ, Brier Loss, G-Mean, and Specificity, where ↑ and ↓ indicate preferable larger and smaller values, respectively. Red font marks the full 31-class reference result, and blue font marks the best result for each metric among the morphology-stratified subset groups.

Evaluation Split	Macro-Precision ↑	Macro-Recall ↑	Macro-F1 ↑	Accuracy ↑	AUC ↑	Cohen’s $κ ↑$	Brier Loss ↓	G-Mean ↑	Specificity ↑
Full 31-Class Set	0.7260 ± 0.1017	0.7160 ± 0.1025	0.7170 ± 0.0884	0.7160 ± 0.1025	0.9759 ± 0.0134	0.7141	0.0151 ± 0.0244	0.8400 ± 0.0622	0.9908 ± 0.0041
Closed SubSet	0.8082 ± 0.1152	0.6738 ± 0.0915	0.7294 ± 0.0806	0.7907 ± 0.0815	0.9543 ± 0.0250	0.6509	0.0171 ± 0.0252	0.8094 ± 0.0566	0.9772 ± 0.0152
Half-Open SubSet	0.8388 ± 0.0716	0.7562 ± 0.0909	0.7918 ± 0.0663	0.8307 ± 0.0844	0.9793 ± 0.0083	0.7482	0.0130 ± 0.0232	0.8629 ± 0.0524	0.9884 ± 0.0060
Spread SubSet	0.8366 ± 0.0932	0.6860 ± 0.1081	0.7455 ± 0.0801	0.8178 ± 0.1075	0.9645 ± 0.0185	0.6443	0.0170 ± 0.0254	0.8104 ± 0.0653	0.9648 ± 0.0256

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Dai, J.; Qin, H.; Li, G.; Liu, J.; Zhang, X.; Qi, H.; Zheng, Z.; Huang, X. SCISA-Net: Scene-Constrained Inverse-to-Subband Attention for Semantic Inference from Wall-Mediated Indirect Observations. Photonics 2026, 13, 575. https://doi.org/10.3390/photonics13060575


Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
