Figure 1.
Schematic illustration of the wall-mediated indirect observation geometry and the calibrated Cartesian coordinate frame used for scene parameterization. The effective display image plane is defined as , and the imaging wall is located along the positive y direction at ; the x-axis denotes the horizontal scene direction, and the z-axis denotes the vertical direction. The hidden display terminal does not enter the camera field of view; only the wall-borne intensity pattern formed after occluder modulation and wall reflection is recorded.
Figure 2.
Architecture of SCISA-Net. The upper branch, Scene-Constrained Inversion Module (SCIM), converts a wall-mediated indirect observation into a source-oriented intermediate representation through construction of the scene information encoding operator , truncated Tikhonov-filtered inverse encoding, TV-based refinement, and multi-scale channel-adaptive feature compensation. The lower branch, Multi-Stage Haar-Subband Attention Network (MS-HSANet), receives the compensated feature, performs stem encoding followed by three successive stages based on HSAB for Haar-subband-aware representation learning, aggregates cross-stage pooled features, and outputs the semantic class.
Figure 3.
Architecture of the Scene-Constrained Inversion Module (SCIM). SCIM consists of Scene-Aware Regularized Inverse Encoding (SARIE) and Multi-Scale Channel-Adaptive Feature Compensation (MSCAFC). Within SARIE, (a) constructs the scene information encoding operator , (b) performs truncated Tikhonov-filtered spectral inverse encoding through the dominant singular modes of , and (c) applies TV-based refinement to contract inversion residuals. The refined representation is then processed by (d) MSCAFC, which conducts multi-branch channel-adaptive compensation and residual fusion to produce the compensated output feature.
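Step (b) of SARIE can be illustrated with a minimal NumPy sketch of truncated, Tikhonov-filtered inversion. The function name, the generic matrix `A` standing in for the scene information encoding operator, and the filter convention `s / (s**2 + lam)` are assumptions made for illustration; the source does not state which convention or parameter values SCIM uses.

```python
import numpy as np

def truncated_tikhonov_inverse(A, y, k, lam):
    """Estimate x from y ~ A @ x using the k dominant singular modes of A.

    Each retained mode is weighted by the Tikhonov filter s / (s**2 + lam).
    Hypothetical helper: the name and the filter convention are assumptions.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, s.size)                       # never keep more modes than exist
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    filt = s_k / (s_k**2 + lam)              # damped inverse singular values
    return Vt_k.T @ (filt * (U_k.T @ y))

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))                  # toy stand-in for the operator
x_true = rng.normal(size=5)
y = A @ x_true

x_exact = truncated_tikhonov_inverse(A, y, k=5, lam=0.0)   # plain pseudo-inverse
x_reg = truncated_tikhonov_inverse(A, y, k=5, lam=0.5)     # damped
x_trunc = truncated_tikhonov_inverse(A, y, k=2, lam=0.0)   # truncated
```

With `lam = 0` and all modes retained, the sketch reduces to the pseudo-inverse. Increasing `lam` or lowering `k` shrinks the estimate, which is the usual trade of fidelity for noise suppression.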
Figure 4.
Architecture of the Multi-Stage Haar-Subband Attention Network (MS-HSANet). A stem encoder first contracts the compensated feature, after which three successive Haar-Subband Attention Block (HSAB) stages perform hierarchical subband-aware encoding with depths 3, 4, and 6, respectively. Cross-stage pooled features are then concatenated for classification. The lower panels detail HSAB and Haar-Prior depthwise convolution (HPDC), where Haar-initialized per-channel filters decompose features into LL, LH, HL, and HH responses, followed by channel–spatial recalibration and residual fusion.
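The Haar decomposition that the per-channel filters are initialized from can be sketched in NumPy as follows. This is only an illustration of the fixed 2×2 Haar analysis step under stated assumptions: the stride of 2, the orthonormal 0.5 scaling, and the LH/HL naming convention are not specified in the source, and the filters in the network are described as Haar-initialized rather than fixed.

```python
import numpy as np

# Orthonormal 2x2 Haar analysis kernels (naming convention is an assumption:
# LH = difference across rows, HL = difference across columns).
HAAR = {
    "LL": 0.5 * np.array([[1.0, 1.0], [1.0, 1.0]]),
    "LH": 0.5 * np.array([[1.0, 1.0], [-1.0, -1.0]]),
    "HL": 0.5 * np.array([[1.0, -1.0], [1.0, -1.0]]),
    "HH": 0.5 * np.array([[1.0, -1.0], [-1.0, 1.0]]),
}

def haar_subbands(x):
    """Apply each Haar kernel per channel with stride 2.

    x has shape (C, H, W) with even H and W; each output has shape
    (C, H // 2, W // 2). Hypothetical helper for illustration only.
    """
    a = x[:, 0::2, 0::2]   # top-left sample of every 2x2 patch
    b = x[:, 0::2, 1::2]   # top-right
    c = x[:, 1::2, 0::2]   # bottom-left
    d = x[:, 1::2, 1::2]   # bottom-right
    return {
        name: k[0, 0] * a + k[0, 1] * b + k[1, 0] * c + k[1, 1] * d
        for name, k in HAAR.items()
    }

flat = haar_subbands(np.ones((2, 4, 4)))          # constant input
rng = np.random.default_rng(0)
feat = rng.normal(size=(3, 8, 8))
bands = haar_subbands(feat)
energy_in = float(np.sum(feat**2))
energy_out = float(sum(np.sum(v**2) for v in bands.values()))
```

For a constant input only the LL response is non-zero, and because the four kernels are orthonormal the total energy of the subbands equals that of the input.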
Figure 5.
Per-class receiver operating characteristic (ROC) curves of SCISA-Net on the 31-class wall-mediated indirect semantic inference task over the held-out validation split. Most class trajectories remain concentrated near the upper-left region, with consistently high AUC values, indicating stable class separability under wall-mediated indirect observation.
Figure 6.
Normalized confusion matrix of SCISA-Net on the held-out validation split of the 31-class wall-mediated indirect semantic inference task. The concentration of response mass along the main diagonal, together with sparse off-diagonal dispersion, suggests that class-discriminative evidence remains recoverable across most categories under wall-mediated indirect observation.
Figure 8.
Representative noisy wall-mediated indirect observations used in the robustness evaluation of Section 4.4. Starting from the clean reference observation, three corruption families (Gaussian noise, Poisson noise, and scatter noise) are imposed at progressively varying levels, yielding distinct degradation patterns for assessing the perturbation tolerance of SCISA-Net.
Figure 9.
Stage-wise Grad-CAM responses of SCISA-Net in the wall-observation coordinate system for the 31-class wall-mediated indirect semantic inference task. Shown from left to right are the wall-mediated indirect observation, the response at the SCIM output, and the responses back-projected from the outputs of MS-HSANet stage0–stage3. The colored overlays represent Grad-CAM heatmaps, where warmer colors, especially red and yellow, indicate stronger class-discriminative activation, whereas cooler colors such as green and blue indicate weaker activation. The activation evolves from compact evidence recovery after scene-constrained inversion, through intermediate carrier-aligned spatial expansion, to a more concentrated hotspot at the deepest stage, suggesting progressive consolidation of class-discriminative cues within the wall-borne modulation region.
Table 1.
Quantitative comparison of SCISA-Net with representative CNN, transformer/hybrid, and frequency-aware baselines on the 31-class wall-mediated indirect semantic inference task. Results are reported on the validation split as mean ± standard deviation for Precision, Recall, F1, Accuracy, AUC, Cohen’s κ, Brier score, G-mean, and Specificity; ↑ and ↓ indicate metrics for which larger and smaller values are preferred, respectively. Red font denotes SCISA-Net, and blue font denotes the best competing result in each metric.
| Methods | Precision ↑ | Recall ↑ | F1 ↑ | Accuracy ↑ | AUC ↑ | Cohen’s κ | Brier ↓ | G-Mean ↑ | Specificity ↑ |
|---|---|---|---|---|---|---|---|---|---|
| SCISA-Net | 0.7260 ± 0.1017 | 0.7160 ± 0.1025 | 0.7170 ± 0.0884 | 0.7160 ± 0.1025 | 0.9759 ± 0.0134 | 0.7141 | 0.0151 ± 0.0244 | 0.8400 ± 0.0622 | 0.9908 ± 0.0041 |
| VGG16 | 0.0013 ± 0.0070 | 0.0323 ± 0.1767 | 0.0025 ± 0.0135 | 0.0323 ± 0.1767 | 0.5060 ± 0.0346 | 0.0000 | 0.0312 ± 0.0002 | 0.0000 ± 0.0000 | 0.9677 ± 0.1767 |
| VGG19 | 0.0038 ± 0.0105 | 0.0262 ± 0.0875 | 0.0062 ± 0.0172 | 0.0262 ± 0.0875 | 0.5261 ± 0.0379 | −0.0072 | 0.0312 ± 0.0003 | 0.0441 ± 0.1223 | 0.9675 ± 0.0981 |
| ResNetV2-101 | 0.0022 ± 0.0083 | 0.0309 ± 0.1395 | 0.0039 ± 0.0153 | 0.0309 ± 0.1395 | 0.5042 ± 0.0372 | −0.0016 | 0.0314 ± 0.0010 | 0.0251 ± 0.0956 | 0.9677 ± 0.1444 |
| DenseNet201 | 0.0746 ± 0.1828 | 0.0392 ± 0.0595 | 0.0311 ± 0.0360 | 0.0392 ± 0.0595 | 0.5456 ± 0.0609 | 0.0073 | 0.0313 ± 0.0012 | 0.1305 ± 0.1374 | 0.9680 ± 0.0456 |
| RegNetY-3.2GF | 0.0752 ± 0.0691 | 0.0749 ± 0.0919 | 0.0595 ± 0.0522 | 0.0749 ± 0.0919 | 0.6559 ± 0.0662 | 0.0462 | 0.0310 ± 0.0032 | 0.2121 ± 0.1593 | 0.9692 ± 0.0300 |
| Inception-v3 | 0.0010 ± 0.0000 | 0.0323 ± 0.0312 | 0.0020 ± 0.0001 | 0.0316 ± 0.0000 | 0.4995 ± 0.0000 | 0.0000 | 0.0625 ± 0.0274 | 0.1748 ± 0.0000 | 0.9677 ± 0.0312 |
| EfficientNet-B5 | 0.0326 ± 0.0454 | 0.0533 ± 0.0785 | 0.0347 ± 0.0420 | 0.0533 ± 0.0785 | 0.5895 ± 0.0617 | 0.0229 | 0.0310 ± 0.0014 | 0.1445 ± 0.1659 | 0.9685 ± 0.0450 |
| ConvNeXt-B | 0.0188 ± 0.0049 | 0.0188 ± 0.0049 | 0.0164 ± 0.0029 | 0.0276 ± 0.0040 | 0.5073 ± 0.0074 | −0.0048 | 0.9722 ± 0.0006 | 0.0000 ± 0.0000 | 0.9676 ± 0.0001 |
| Swin-T V1-B | 0.0326 ± 0.0454 | 0.0533 ± 0.0785 | 0.0347 ± 0.0420 | 0.0533 ± 0.0785 | 0.5895 ± 0.0617 | 0.0229 | 0.0310 ± 0.0014 | 0.1445 ± 0.1659 | 0.9685 ± 0.0450 |
| Swin-T V2-B | 0.0962 ± 0.0779 | 0.0997 ± 0.1028 | 0.0824 ± 0.0691 | 0.0997 ± 0.1028 | 0.7051 ± 0.0624 | 0.0754 | 0.0305 ± 0.0041 | 0.2487 ± 0.1803 | 0.9702 ± 0.0270 |
| Swin-T V2-G | 0.0013 ± 0.0070 | 0.0323 ± 0.1767 | 0.0025 ± 0.0135 | 0.0323 ± 0.1767 | 0.5026 ± 0.0338 | 0.0000 | 0.0312 ± 0.0003 | 0.0000 ± 0.0000 | 0.9677 ± 0.1767 |
| DaViT-B | 0.0072 ± 0.0218 | 0.0347 ± 0.0982 | 0.0101 ± 0.0274 | 0.0347 ± 0.0982 | 0.5169 ± 0.0339 | 0.0026 | 0.0312 ± 0.0003 | 0.0554 ± 0.1459 | 0.9678 ± 0.0967 |
| UniFormer | 0.0063 ± 0.0168 | 0.0383 ± 0.1182 | 0.0101 ± 0.0269 | 0.0383 ± 0.1182 | 0.5236 ± 0.0466 | 0.0069 | 0.0312 ± 0.0004 | 0.0557 ± 0.1479 | 0.9680 ± 0.1046 |
| CoCa | 0.0828 ± 0.0334 | 0.0816 ± 0.0322 | 0.0814 ± 0.0315 | 0.0816 ± 0.0322 | 0.5358 ± 0.0365 | 0.0511 | 0.0804 ± 0.0145 | 0.2748 ± 0.0806 | 0.9694 ± 0.0072 |
| DFCIL-HGR | 0.0013 ± 0.0070 | 0.0323 ± 0.1767 | 0.0025 ± 0.0135 | 0.0323 ± 0.1767 | 0.5003 ± 0.0627 | 0.0000 | 0.0312 ± 0.0002 | 0.0000 ± 0.0000 | 0.9677 ± 0.1767 |
| Human–Object Relation Network | 0.0125 ± 0.0245 | 0.0373 ± 0.0925 | 0.0140 ± 0.0267 | 0.0373 ± 0.0925 | 0.5046 ± 0.0414 | 0.0057 | 0.0312 ± 0.0003 | 0.0746 ± 0.1509 | 0.9679 ± 0.0854 |
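As a reading aid for the table above, the following sketch computes chance-level reference values for a 31-class task. It is not part of the source: it assumes macro-averaged metrics, a Brier score averaged over classes as well as samples, and a degenerate classifier that either predicts a uniform distribution or always outputs a single class. Values close to these in the table would be consistent with chance-level behavior, but that interpretation is an assumption rather than a statement from the authors.

```python
import math

num_classes = 31          # from the task description
p = 1 / num_classes

# Accuracy (and macro recall) of a classifier that always predicts one class,
# assuming roughly balanced classes.
chance_accuracy = p

# Macro specificity of that classifier: 0 for the predicted class, 1 elsewhere.
chance_specificity = (num_classes - 1) / num_classes

# Per-class-averaged Brier score of a uniform prediction p = 1/K:
# p * (1 - p)**2 + (1 - p) * p**2, which simplifies to p * (1 - p).
chance_brier = p * (1 - p) ** 2 + (1 - p) * p ** 2

# Across-class standard deviation when one class scores 1 and the rest 0.
single_class_std = math.sqrt(p * (1 - p))

print(round(chance_accuracy, 4))     # 0.0323
print(round(chance_specificity, 4))  # 0.9677
print(round(chance_brier, 4))        # 0.0312
print(round(single_class_std, 4))    # 0.1767
```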
Table 2.
Ablation study of SCISA-Net on the 31-class wall-mediated indirect semantic inference task. The upper block removes or substitutes SCIM while MS-HSANet is retained; the lower block removes or substitutes MS-HSANet while SCIM is retained. A check mark denotes component retention, whereas a cross denotes component removal or replacement by the module indicated in parentheses. Results on the validation split are reported in terms of Precision, Recall, F1, Accuracy, AUC, Cohen’s κ, Brier score, G-mean, and Specificity; ↑ and ↓ indicate metrics for which larger and smaller values are preferred, respectively. Red font denotes the complete SCISA-Net model, and blue font highlights the best-performing ablated or replacement setting for each metric within the corresponding comparison block.
| SCIM | MS-HSANet | Precision ↑ | Recall ↑ | F1 ↑ | Accuracy ↑ | AUC ↑ | Cohen’s κ | Brier ↓ | G-Mean ↑ | Specificity ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| ✔ | ✔ | 0.7260 ± 0.1017 | 0.7160 ± 0.1025 | 0.7170 ± 0.0884 | 0.7160 ± 0.1025 | 0.9759 ± 0.0134 | 0.7141 | 0.0151 ± 0.0244 | 0.8400 ± 0.0622 | 0.9908 ± 0.0041 |
| × | ✔ | 0.0080 ± 0.0204 | 0.0384 ± 0.0937 | 0.0129 ± 0.0319 | 0.0384 ± 0.0937 | 0.5206 ± 0.0461 | 0.0066 | 0.0312 ± 0.0004 | 0.0687 ± 0.1586 | 0.9680 ± 0.0830 |
| × (ResNet-101) | ✔ | 0.0026 ± 0.0098 | 0.0322 ± 0.1738 | 0.0032 ± 0.0139 | 0.0322 ± 0.1738 | 0.4986 ± 0.0374 | 0.0000 | 0.0312 ± 0.0004 | 0.0095 ± 0.0368 | 0.9677 ± 0.1713 |
| × (VGG19) | ✔ | 0.0029 ± 0.0112 | 0.0338 ± 0.1684 | 0.0046 ± 0.0176 | 0.0338 ± 0.1684 | 0.4979 ± 0.0226 | 0.0019 | 0.0312 ± 0.0001 | 0.0183 ± 0.0698 | 0.9678 ± 0.1632 |
| × (RepVGG) | ✔ | 0.0053 ± 0.0167 | 0.0322 ± 0.1626 | 0.0054 ± 0.0176 | 0.0322 ± 0.1626 | 0.5175 ± 0.0882 | −0.0003 | 0.0314 ± 0.0017 | 0.0198 ± 0.0622 | 0.9677 ± 0.1658 |
| × (MLP-Mixer) | ✔ | 0.0045 ± 0.0145 | 0.0338 ± 0.1678 | 0.0056 ± 0.0176 | 0.0338 ± 0.1678 | 0.5000 ± 0.0408 | 0.0019 | 0.0312 ± 0.0002 | 0.0231 ± 0.0712 | 0.9678 ± 0.1614 |
| × (ResMLP) | ✔ | 0.0013 ± 0.0070 | 0.0323 ± 0.1767 | 0.0025 ± 0.0135 | 0.0323 ± 0.1767 | 0.5000 ± 0.0000 | 0.0000 | 0.0312 ± 0.0002 | 0.0000 ± 0.0000 | 0.9677 ± 0.1767 |
| × (Swin-Transformer-V2-B) | ✔ | 0.0013 ± 0.0070 | 0.0323 ± 0.1767 | 0.0025 ± 0.0135 | 0.0323 ± 0.1767 | 0.4995 ± 0.0099 | 0.0000 | 0.0312 ± 0.0002 | 0.0000 ± 0.0000 | 0.9677 ± 0.1767 |
| × (ViT-B) | ✔ | 0.0013 ± 0.0070 | 0.0323 ± 0.1767 | 0.0025 ± 0.0135 | 0.0323 ± 0.1767 | 0.5148 ± 0.0359 | 0.0000 | 0.0312 ± 0.0002 | 0.0000 ± 0.0000 | 0.9677 ± 0.1767 |
| ✔ | × | 0.0854 ± 0.0630 | 0.0890 ± 0.0682 | 0.0855 ± 0.0636 | 0.0890 ± 0.0682 | 0.6335 ± 0.0793 | 0.0602 | 0.0339 ± 0.0086 | 0.2594 ± 0.1365 | 0.9697 ± 0.0133 |
| ✔ | × (GFNet) | 0.6712 ± 0.1203 | 0.6597 ± 0.0931 | 0.6622 ± 0.0959 | 0.6597 ± 0.0931 | 0.9666 ± 0.0142 | 0.6547 | 0.0178 ± 0.0251 | 0.8056 ± 0.0591 | 0.9889 ± 0.0048 |
| ✔ | × (FCANet) | 0.5287 ± 0.1309 | 0.5223 ± 0.1245 | 0.5228 ± 0.1219 | 0.5223 ± 0.1245 | 0.9292 ± 0.0311 | 0.5151 | 0.0244 ± 0.0262 | 0.7115 ± 0.0902 | 0.9844 ± 0.0049 |
| ✔ | × (FreqNet) | 0.6529 ± 0.1047 | 0.6359 ± 0.1003 | 0.6370 ± 0.0812 | 0.6359 ± 0.1003 | 0.9653 ± 0.0148 | 0.6293 | 0.0177 ± 0.0233 | 0.7900 ± 0.0643 | 0.9880 ± 0.0055 |
| ✔ | × (Scattering-based ViT) | 0.6875 ± 0.1143 | 0.6779 ± 0.0902 | 0.6791 ± 0.0906 | 0.6779 ± 0.0902 | 0.9575 ± 0.0187 | 0.6737 | 0.0137 ± 0.0179 | 0.8172 ± 0.0550 | 0.9895 ± 0.0045 |
| ✔ | × (AFNONet) | 0.4496 ± 0.1243 | 0.4392 ± 0.0994 | 0.4407 ± 0.1041 | 0.4392 ± 0.0994 | 0.9129 ± 0.0311 | 0.4258 | 0.0260 ± 0.0222 | 0.6522 ± 0.0770 | 0.9815 ± 0.0060 |
| ✔ | × (FFCNet) | 0.6860 ± 0.1059 | 0.6776 ± 0.1010 | 0.6789 ± 0.0938 | 0.6776 ± 0.1010 | 0.9732 ± 0.0120 | 0.6737 | 0.0174 ± 0.0255 | 0.8165 ± 0.0623 | 0.9895 ± 0.0041 |
| ✔ | × (DCTNet) | 0.6863 ± 0.0923 | 0.6795 ± 0.1001 | 0.6801 ± 0.0872 | 0.6795 ± 0.1001 | 0.9516 ± 0.0227 | 0.6743 | 0.0142 ± 0.0150 | 0.8177 ± 0.0623 | 0.9895 ± 0.0040 |
Table 3.
Internal ablation of the HSAB design within MS-HSANet on the 31-class wall-mediated indirect semantic inference task. The original HSAB stack is compared with its removal and with parameter-matched substitutions, including Inception-v2, Global Filter, gMLP, MobileViT, ShuffleNetV2, Ghost Bottleneck, ResNeXt Block, and DenseNet Bottleneck. Validation results are reported as mean ± standard deviation in terms of Precision, Recall, F1, Accuracy, AUC, Cohen’s κ, Brier score, G-mean, and Specificity, where ↑ and ↓ denote preferable larger and smaller values, respectively. Red font marks the original HSAB configuration; blue font marks the best competing result for each metric.
| Methods | Precision ↑ | Recall ↑ | F1 ↑ | Accuracy ↑ | AUC ↑ | Cohen’s κ | Brier ↓ | G-Mean ↑ | Specificity ↑ |
|---|---|---|---|---|---|---|---|---|---|
| HSAB | 0.7260 ± 0.1017 | 0.7160 ± 0.1025 | 0.7170 ± 0.0884 | 0.7160 ± 0.1025 | 0.9759 ± 0.0134 | 0.7141 | 0.0151 ± 0.0244 | 0.8400 ± 0.0622 | 0.9908 ± 0.0041 |
| w/o HSAB | 0.0885 ± 0.0586 | 0.0981 ± 0.0992 | 0.0844 ± 0.0702 | 0.0981 ± 0.0992 | 0.6795 ± 0.0755 | 0.0717 | 0.0310 ± 0.0052 | 0.2566 ± 0.1667 | 0.9701 ± 0.0216 |
| Inception-v2 | 0.6497 ± 0.1376 | 0.6260 ± 0.1237 | 0.6283 ± 0.1043 | 0.6260 ± 0.1237 | 0.9753 ± 0.0103 | 0.6215 | 0.0155 ± 0.0142 | 0.7824 ± 0.0783 | 0.9878 ± 0.0069 |
| Global Filter | 0.5612 ± 0.0889 | 0.5527 ± 0.1092 | 0.5520 ± 0.0867 | 0.5527 ± 0.1092 | 0.9495 ± 0.0195 | 0.5437 | 0.0187 ± 0.0174 | 0.7340 ± 0.0751 | 0.9853 ± 0.0049 |
| gMLP | 0.5015 ± 0.1244 | 0.4913 ± 0.1113 | 0.4935 ± 0.1115 | 0.4913 ± 0.1113 | 0.9318 ± 0.0264 | 0.4824 | 0.0232 ± 0.0220 | 0.6902 ± 0.0825 | 0.9833 ± 0.0056 |
| MobileViT | 0.6801 ± 0.1018 | 0.6738 ± 0.0892 | 0.6745 ± 0.0869 | 0.6738 ± 0.0892 | 0.9739 ± 0.0127 | 0.6685 | 0.0147 ± 0.0194 | 0.8147 ± 0.0543 | 0.9893 ± 0.0039 |
| ShuffleNetV2 | 0.6237 ± 0.0873 | 0.6198 ± 0.1088 | 0.6184 ± 0.0869 | 0.6198 ± 0.1088 | 0.9675 ± 0.0114 | 0.6130 | 0.0160 ± 0.0162 | 0.7792 ± 0.0700 | 0.9875 ± 0.0035 |
| Ghost Bottleneck | 0.6596 ± 0.1048 | 0.6561 ± 0.1047 | 0.6563 ± 0.1001 | 0.6561 ± 0.1047 | 0.9699 ± 0.0118 | 0.6508 | 0.0185 ± 0.0260 | 0.8027 ± 0.0676 | 0.9887 ± 0.0036 |
| ResNeXt Block | 0.6943 ± 0.1340 | 0.6664 ± 0.1127 | 0.6685 ± 0.0946 | 0.6664 ± 0.1127 | 0.9741 ± 0.0125 | 0.6607 | 0.0146 ± 0.0180 | 0.8088 ± 0.0685 | 0.9891 ± 0.0076 |
| DenseNet Bottleneck | 0.6442 ± 0.1128 | 0.6307 ± 0.1008 | 0.6325 ± 0.0910 | 0.6307 ± 0.1008 | 0.9659 ± 0.0149 | 0.6254 | 0.0164 ± 0.0187 | 0.7868 ± 0.0638 | 0.9879 ± 0.0048 |
Table 4.
Stage-depth ablation of the Multi-Stage Haar-Subband Attention Network. One_HSAB keeps stage0 and a single HSAB placed in stage1. The stage_01, stage_012, and stage_0123 settings progressively retain stage0–stage1, stage0–stage2, and stage0–stage3, respectively. HSAB_13 keeps stage0 and places 13 HSABs in stage1, where 13 equals the total number of HSABs used across stage1, stage2, and stage3 in the complete multi-stage setting. Validation results are reported as mean ± standard deviation in terms of Precision, Recall, F1, Accuracy, AUC, Cohen’s κ, Brier score, G-mean, and Specificity, where ↑ and ↓ denote preferable larger and smaller values, respectively. Red font denotes the complete stage_0123 setting.
| MS-HSANet Setting | Precision ↑ | Recall ↑ | F1 ↑ | Accuracy ↑ | AUC ↑ | Cohen’s κ | Brier ↓ | G-Mean ↑ | Specificity ↑ |
|---|---|---|---|---|---|---|---|---|---|
| One_HSAB | 0.3480 ± 0.1455 | 0.2968 ± 0.1624 | 0.2849 ± 0.0997 | 0.2968 ± 0.1624 | 0.8819 ± 0.0424 | 0.2800 | 0.0262 ± 0.0100 | 0.5172 ± 0.1420 | 0.9768 ± 0.0173 |
| stage_01 | 0.5640 ± 0.1268 | 0.5290 ± 0.1599 | 0.5261 ± 0.1122 | 0.5290 ± 0.1599 | 0.9552 ± 0.0189 | 0.5234 | 0.0188 ± 0.0172 | 0.7121 ± 0.1135 | 0.9846 ± 0.0097 |
| stage_012 | 0.6452 ± 0.0969 | 0.6353 ± 0.1127 | 0.6355 ± 0.0906 | 0.6353 ± 0.1127 | 0.9699 ± 0.0129 | 0.6312 | 0.0152 ± 0.0182 | 0.7891 ± 0.0709 | 0.9881 ± 0.0047 |
| stage_0123 | 0.7260 ± 0.1017 | 0.7160 ± 0.1025 | 0.7170 ± 0.0884 | 0.7160 ± 0.1025 | 0.9759 ± 0.0134 | 0.7141 | 0.0151 ± 0.0244 | 0.8400 ± 0.0622 | 0.9908 ± 0.0041 |
| HSAB_13 | 0.6152 ± 0.0851 | 0.6012 ± 0.1288 | 0.6006 ± 0.0867 | 0.6012 ± 0.1288 | 0.9659 ± 0.0124 | 0.5972 | 0.0163 ± 0.0171 | 0.7657 ± 0.0820 | 0.9870 ± 0.0054 |
Table 5.
Inference-time subband intervention analysis of HP-DConv in HSAB. Complete HP-DConv denotes the original setting in which the LL, LH, HL, and HH subbands are all preserved and arranged in their normal order. LL only preserves only the low-frequency subband, while Remove LL, Remove LH, Remove HL, and Remove HH suppress the corresponding subband. Random permutation reports the average result over all 23 non-identity permutations of the four Haar subbands. ↑ and ↓ indicate metrics for which larger and smaller values are preferred, respectively. Red font denotes the complete HP-DConv setting.
| Subband Intervention | Precision ↑ | Recall ↑ | F1 ↑ | Accuracy ↑ | AUC ↑ | Cohen’s κ | Brier ↓ | G-Mean ↑ | Specificity ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Complete HP-DConv | 0.7260 ± 0.1017 | 0.7160 ± 0.1025 | 0.7170 ± 0.0884 | 0.7160 ± 0.1025 | 0.9759 ± 0.0134 | 0.7141 | 0.0151 ± 0.0244 | 0.8400 ± 0.0622 | 0.9908 ± 0.0041 |
| LL only | 0.0126 ± 0.0325 | 0.0362 ± 0.1210 | 0.0121 ± 0.0288 | 0.0362 ± 0.1210 | 0.4945 ± 0.0750 | 0.0037 | 0.0548 ± 0.0126 | 0.0603 ± 0.1338 | 0.9679 ± 0.1128 |
| Remove LL | 0.0282 ± 0.0501 | 0.0313 ± 0.0760 | 0.0177 ± 0.0267 | 0.0313 ± 0.0760 | 0.5365 ± 0.0940 | | 0.0505 ± 0.0117 | 0.0939 ± 0.1351 | 0.9677 ± 0.0797 |
| Remove LH | 0.3842 ± 0.2951 | 0.2469 ± 0.2457 | 0.2228 ± 0.1736 | 0.2469 ± 0.2457 | 0.7857 ± 0.1220 | 0.2348 | 0.0396 ± 0.0242 | 0.4091 ± 0.2548 | 0.9753 ± 0.0478 |
| Remove HL | 0.0498 ± 0.1062 | 0.0473 ± 0.1739 | 0.0229 ± 0.0440 | 0.0473 ± 0.1739 | 0.6098 ± 0.1482 | 0.0156 | 0.0568 ± 0.0143 | 0.0740 ± 0.1322 | 0.9682 ± 0.1346 |
| Remove HH | 0.7137 ± 0.1016 | 0.6957 ± 0.1311 | 0.6953 ± 0.0926 | 0.6957 ± 0.1311 | 0.9735 ± 0.0146 | 0.6952 | 0.0159 ± 0.0247 | 0.8258 ± 0.0812 | 0.9902 ± 0.0050 |
| Random permutation | 0.0256 ± 0.0681 | 0.0402 ± 0.0407 | 0.0174 ± 0.0394 | 0.0402 ± 0.0407 | 0.5196 ± 0.0672 | 0.0087 | 0.0564 ± 0.0050 | 0.0665 ± 0.0730 | 0.9680 ± 0.0014 |
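The count of 23 non-identity permutations follows from 4! − 1 = 23 orderings of the four subbands. A short enumeration makes this concrete; the tuple ordering used as the "normal order" here is an assumption for illustration.

```python
from itertools import permutations

SUBBANDS = ("LL", "LH", "HL", "HH")   # assumed normal order

# Every reordering of the four subbands except the identity ordering.
non_identity = [p for p in permutations(SUBBANDS) if p != SUBBANDS]

print(len(non_identity))   # 23
```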
Table 7.
Quantitative robustness evaluation of SCISA-Net under simulated observation corruption on the 31-class wall-mediated indirect semantic inference task. Validation performance is reported as mean ± standard deviation in terms of Macro-Precision, Macro-Recall, Macro-F1, Accuracy, AUC, Cohen’s κ, Brier loss, G-mean, and Specificity. ↑ and ↓ indicate metrics for which larger and smaller values are preferred, respectively. Corruption settings comprise a noise-free reference and three noise families, namely Gaussian, Poisson, and Scatter noise, each examined at multiple intensity levels. Red font denotes the noise-free reference condition, and blue font denotes the best competing result for each metric.
| Corruption Setting | Macro-Precision ↑ | Macro-Recall ↑ | Macro-F1 ↑ | Accuracy ↑ | AUC ↑ | Cohen’s κ | Brier Loss ↓ | G-Mean ↑ | Specificity ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Noise-Free | 0.7260 ± 0.1017 | 0.7160 ± 0.1025 | 0.7170 ± 0.0884 | 0.7160 ± 0.1025 | 0.9759 ± 0.0134 | 0.7141 | 0.0151 ± 0.0244 | 0.8400 ± 0.0622 | 0.9908 ± 0.0041 |
| Gaussian Noise () | 0.4301 ± 0.1491 | 0.3839 ± 0.1957 | 0.3728 ± 0.1235 | 0.3839 ± 0.1957 | 0.8772 ± 0.0406 | 0.3633 | 0.0332 ± 0.0269 | 0.5911 ± 0.1546 | 0.9795 ± 0.0189 |
| Gaussian Noise () | 0.5840 ± 0.1436 | 0.5694 ± 0.1595 | 0.5661 ± 0.1319 | 0.5694 ± 0.1595 | 0.9466 ± 0.0266 | 0.5550 | 0.0233 ± 0.0271 | 0.7404 ± 0.1133 | 0.9856 ± 0.0079 |
| Gaussian Noise () | 0.6909 ± 0.1113 | 0.6871 ± 0.1350 | 0.6835 ± 0.1120 | 0.6871 ± 0.1350 | 0.9669 ± 0.0267 | 0.6767 | 0.0168 ± 0.0251 | 0.8197 ± 0.0902 | 0.9896 ± 0.0043 |
| Gaussian Noise () | 0.7403 ± 0.1124 | 0.7323 ± 0.1168 | 0.7316 ± 0.1009 | 0.7323 ± 0.1168 | 0.9770 ± 0.0176 | 0.7233 | 0.0148 ± 0.0244 | 0.8489 ± 0.0718 | 0.9911 ± 0.0046 |
| Gaussian Noise () | 0.7279 ± 0.1127 | 0.7194 ± 0.1196 | 0.7182 ± 0.1019 | 0.7194 ± 0.1196 | 0.9774 ± 0.0169 | 0.7100 | 0.0154 ± 0.0248 | 0.8408 ± 0.0756 | 0.9906 ± 0.0048 |
| Poisson Noise () | 0.4892 ± 0.1585 | 0.4387 ± 0.1120 | 0.4464 ± 0.1039 | 0.4387 ± 0.1120 | 0.8768 ± 0.0397 | 0.4200 | 0.0291 ± 0.0267 | 0.6508 ± 0.0805 | 0.9813 ± 0.0135 |
| Poisson Noise () | 0.4847 ± 0.1819 | 0.4258 ± 0.1458 | 0.4331 ± 0.1351 | 0.4258 ± 0.1458 | 0.8541 ± 0.0515 | 0.4067 | 0.0305 ± 0.0269 | 0.6351 ± 0.1178 | 0.9809 ± 0.0157 |
| Poisson Noise () | 0.4456 ± 0.1649 | 0.3968 ± 0.1224 | 0.4055 ± 0.1208 | 0.3968 ± 0.1224 | 0.8567 ± 0.0524 | 0.3767 | 0.0314 ± 0.0261 | 0.6160 ± 0.0970 | 0.9799 ± 0.0139 |
| Poisson Noise () | 0.4852 ± 0.1462 | 0.4565 ± 0.1038 | 0.4609 ± 0.1090 | 0.4565 ± 0.1038 | 0.8742 ± 0.0397 | 0.4383 | 0.0281 ± 0.0260 | 0.6637 ± 0.0873 | 0.9819 ± 0.0104 |
| Poisson Noise () | 0.4831 ± 0.1421 | 0.4613 ± 0.1528 | 0.4576 ± 0.1241 | 0.4613 ± 0.1528 | 0.8794 ± 0.0440 | 0.4433 | 0.0282 ± 0.0266 | 0.6610 ± 0.1243 | 0.9820 ± 0.0110 |
| Scatter Noise () | 0.2877 ± 0.2408 | 0.2097 ± 0.2042 | 0.1888 ± 0.1346 | 0.2097 ± 0.2042 | 0.7721 ± 0.0692 | 0.1833 | 0.0425 ± 0.0228 | 0.3756 ± 0.2383 | 0.9737 ± 0.0432 |
| Scatter Noise () | 0.4544 ± 0.1489 | 0.3984 ± 0.1673 | 0.3925 ± 0.1132 | 0.3984 ± 0.1673 | 0.8859 ± 0.0463 | 0.3783 | 0.0318 ± 0.0265 | 0.6087 ± 0.1342 | 0.9799 ± 0.0169 |
| Scatter Noise () | 0.6050 ± 0.1258 | 0.5919 ± 0.1443 | 0.5892 ± 0.1215 | 0.5919 ± 0.1443 | 0.9514 ± 0.0250 | 0.5783 | 0.0215 ± 0.0260 | 0.7576 ± 0.0991 | 0.9864 ± 0.0069 |
| Scatter Noise () | 0.7093 ± 0.1308 | 0.7032 ± 0.1420 | 0.7011 ± 0.1249 | 0.7032 ± 0.1420 | 0.9734 ± 0.0208 | 0.6933 | 0.0157 ± 0.0247 | 0.8293 ± 0.0931 | 0.9901 ± 0.0051 |
| Scatter Noise () | 0.7356 ± 0.1163 | 0.7258 ± 0.1301 | 0.7238 ± 0.1076 | 0.7258 ± 0.1301 | 0.9762 ± 0.0173 | 0.7167 | 0.0150 ± 0.0245 | 0.8439 ± 0.0833 | 0.9909 ± 0.0051 |
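For orientation, two of the corruption families can be sketched generically in NumPy. This is not the authors' corruption pipeline: the parameter names `sigma` and `peak`, the [0, 1] intensity range, and the clipping are assumptions, the level parameters shown in the table were lost in extraction, and the scatter-noise model is not described in this section, so it is omitted.

```python
import numpy as np

def add_gaussian_noise(img, sigma, rng):
    """Additive zero-mean Gaussian noise on an image scaled to [0, 1]."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def add_poisson_noise(img, peak, rng):
    """Signal-dependent Poisson (shot) noise; larger peak means less noise."""
    return np.clip(rng.poisson(img * peak) / peak, 0.0, 1.0)

rng = np.random.default_rng(0)
clean = rng.uniform(0.0, 1.0, size=(16, 16))    # toy stand-in observation
noisy_gauss = add_gaussian_noise(clean, sigma=0.1, rng=rng)
noisy_poisson = add_poisson_noise(clean, peak=50.0, rng=rng)
```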
Table 8.
Sensitivity analysis of the SCIM inversion parameters under the Poisson noise condition. The default setting uses the fixed singular-value truncation number k and Tikhonov regularization parameter adopted in the main experiments. Additional settings vary either k or the regularization parameter while keeping the trained checkpoint and the remaining inference pipeline unchanged. Validation results are reported as mean ± standard deviation in terms of Precision, Recall, F1, Accuracy, AUC, Cohen’s κ, Brier score, G-mean, and Specificity, where ↑ and ↓ denote preferable larger and smaller values, respectively. Red font denotes the default setting, and blue font marks the best result for each metric among the non-default settings.
| SCIM Setting | Precision ↑ | Recall ↑ | F1 ↑ | Accuracy ↑ | AUC ↑ | Cohen’s κ | Brier ↓ | G-Mean ↑ | Specificity ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Default | 0.4852 ± 0.1462 | 0.4565 ± 0.1038 | 0.4609 ± 0.1090 | 0.4565 ± 0.1038 | 0.8742 ± 0.0397 | 0.4383 | 0.0281 ± 0.0260 | 0.6637 ± 0.0873 | 0.9819 ± 0.0104 |
| 0.4521 ± 0.1195 | 0.4242 ± 0.0974 | 0.4267 ± 0.0879 | 0.4242 ± 0.0974 | 0.8655 ± 0.0419 | 0.4050 | 0.0302 ± 0.0261 | 0.6401 ± 0.0778 | 0.9808 ± 0.0102 |
| 0.4354 ± 0.1577 | 0.4087 ± 0.1274 | 0.4073 ± 0.1124 | 0.4087 ± 0.1274 | 0.8508 ± 0.0470 | 0.3877 | 0.0306 ± 0.0261 | 0.6175 ± 0.1370 | 0.9802 ± 0.0112 |
| 0.4866 ± 0.1405 | 0.4511 ± 0.0946 | 0.4572 ± 0.0983 | 0.4511 ± 0.0946 | 0.8721 ± 0.0404 | 0.4327 | 0.0283 ± 0.0260 | 0.6610 ± 0.0759 | 0.9817 ± 0.0112 |
| 0.4845 ± 0.1456 | 0.4468 ± 0.0941 | 0.4532 ± 0.1014 | 0.4468 ± 0.0941 | 0.8720 ± 0.0430 | 0.4283 | 0.0283 ± 0.0261 | 0.6579 ± 0.0750 | 0.9816 ± 0.0114 |
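The two parameters varied in Table 8 both act on the singular-value spectrum of the scene information encoding operator. The following is a minimal NumPy sketch of a truncated, Tikhonov-filtered inverse, assuming the common filter form σ/(σ² + λ) applied to the k largest singular values. The function name `truncated_tikhonov_inverse`, the argument `lam`, and the exact filter are illustrative assumptions, not the implementation used in SCIM.

```python
import numpy as np

def truncated_tikhonov_inverse(A, y, k, lam):
    """Estimate x from y ~= A @ x, keeping the k largest singular values
    of A and damping each with a Tikhonov filter (hypothetical helper)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, s.size)                      # cannot keep more than exist
    # Filter factors sigma / (sigma^2 + lam); lam = 0 gives plain 1 / sigma.
    filt = s[:k] / (s[:k] ** 2 + lam)
    # Project y onto the retained left singular vectors, filter, map back.
    return Vt[:k].T @ (filt * (U[:, :k].T @ y))
```

With `lam = 0` and `k` equal to the rank of `A`, this reduces to the pseudo-inverse solution. A smaller `k` discards the weakest singular directions, and a larger `lam` damps those that remain, which is the trade-off the sensitivity settings probe.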
Table 9.
Quantitative evaluation of SCISA-Net under ambient-light background interference on the wall-mediated indirect observation images. Three representative background patterns are considered: a uniform background, a random-direction linear gradient background, and a random-center radial gradient background. Results are reported on the validation split as mean ± standard deviation for Precision, Recall, F1, Accuracy, AUC, Brier score, G-mean, and Specificity, while Cohen’s κ is reported as a scalar value. A smaller value of the background parameter (second column) indicates stronger background interference. Red font denotes the no-background reference condition, and blue font marks the best result for each metric within each background type. ↑ and ↓ indicate metrics for which larger and smaller values are preferred, respectively.
| Background Type | | Precision ↑ | Recall ↑ | F1 ↑ | Accuracy ↑ | AUC ↑ | Cohen’s κ ↑ | Brier ↓ | G-Mean ↑ | Specificity ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| No background | – | 0.7260 ± 0.1017 | 0.7160 ± 0.1025 | 0.7170 ± 0.0884 | 0.7160 ± 0.1025 | 0.9759 ± 0.0134 | 0.7141 | 0.0151 ± 0.0244 | 0.8400 ± 0.0622 | 0.9908 ± 0.0041 |
| Uniform background | | 0.5984 ± 0.1127 | 0.5839 ± 0.1286 | 0.5827 ± 0.0980 | 0.5839 ± 0.1286 | 0.9481 ± 0.0207 | 0.5780 | 0.0219 ± 0.0265 | 0.7540 ± 0.0849 | 0.9864 ± 0.0062 |
| Uniform background | | 0.6490 ± 0.1093 | 0.6377 ± 0.1151 | 0.6381 ± 0.0983 | 0.6377 ± 0.1151 | 0.9622 ± 0.0188 | 0.6345 | 0.0190 ± 0.0261 | 0.7904 ± 0.0740 | 0.9882 ± 0.0050 |
| Uniform background | | 0.6974 ± 0.1060 | 0.6894 ± 0.1068 | 0.6903 ± 0.0966 | 0.6894 ± 0.1068 | 0.9715 ± 0.0158 | 0.6874 | 0.0165 ± 0.0252 | 0.8235 ± 0.0663 | 0.9899 ± 0.0040 |
| Linear gradient background | | 0.5302 ± 0.1321 | 0.5005 ± 0.1494 | 0.4987 ± 0.1089 | 0.5005 ± 0.1494 | 0.9259 ± 0.0282 | 0.4919 | 0.0260 ± 0.0269 | 0.6928 ± 0.1070 | 0.9836 ± 0.0105 |
| Linear gradient background | | 0.6424 ± 0.1125 | 0.6274 ± 0.1166 | 0.6280 ± 0.0962 | 0.6274 ± 0.1166 | 0.9577 ± 0.0203 | 0.6228 | 0.0197 ± 0.0263 | 0.7836 ± 0.0750 | 0.9878 ± 0.0057 |
| Linear gradient background | | 0.7109 ± 0.1214 | 0.7019 ± 0.1081 | 0.7030 ± 0.1037 | 0.7019 ± 0.1081 | 0.9735 ± 0.0161 | 0.6993 | 0.0158 ± 0.0247 | 0.8310 ± 0.0679 | 0.9903 ± 0.0045 |
| Radial gradient background | | 0.6382 ± 0.0990 | 0.6284 ± 0.1185 | 0.6272 ± 0.0936 | 0.6284 ± 0.1185 | 0.9585 ± 0.0198 | 0.6234 | 0.0197 ± 0.0261 | 0.7839 ± 0.0780 | 0.9879 ± 0.0046 |
| Radial gradient background | | 0.6850 ± 0.1082 | 0.6775 ± 0.1035 | 0.6789 ± 0.0987 | 0.6775 ± 0.1035 | 0.9701 ± 0.0157 | 0.6750 | 0.0167 ± 0.0249 | 0.8162 ± 0.0659 | 0.9895 ± 0.0039 |
| Radial gradient background | | 0.7224 ± 0.1111 | 0.7124 ± 0.1040 | 0.7139 ± 0.0970 | 0.7124 ± 0.1040 | 0.9755 ± 0.0146 | 0.7109 | 0.0152 ± 0.0244 | 0.8377 ± 0.0647 | 0.9907 ± 0.0044 |
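The three background patterns named in the caption of Table 9 can be sketched as follows. This is an illustrative sketch only: the function names, the normalization to [0, 1], and the additive weight `level` are assumptions made for the example, and `level` is not the background parameter listed in the table, whose definition is not reproduced here.

```python
import numpy as np

def background(kind, shape, rng):
    """Return a background pattern with values in [0, 1] (hypothetical helper)."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w].astype(float)
    if kind == "uniform":
        return np.ones(shape)
    if kind == "linear":
        theta = rng.uniform(0.0, 2.0 * np.pi)   # random gradient direction
        ramp = xx * np.cos(theta) + yy * np.sin(theta)
    elif kind == "radial":
        cy = rng.uniform(0, h - 1)              # random center (row)
        cx = rng.uniform(0, w - 1)              # random center (column)
        ramp = -np.hypot(yy - cy, xx - cx)      # brightest near the center
    else:
        raise ValueError(f"unknown background kind: {kind}")
    # Rescale the ramp so its minimum is 0 and its maximum is 1.
    return (ramp - ramp.min()) / (ramp.max() - ramp.min())

def add_background(image, kind, level, rng):
    """Add a background pattern scaled by the assumed weight `level`."""
    return image + level * background(kind, image.shape, rng)
```

For example, `add_background(img, "radial", 0.2, np.random.default_rng(0))` superimposes a radial falloff with a randomly placed center on `img`.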
Table 10.
Performance under scene re-parameterization with the updated scene information encoding operator. Validation results are reported as mean ± standard deviation for Macro-Precision, Macro-Recall, Macro-F1, Accuracy, AUC, Brier loss, G-mean, and Specificity, while Cohen’s κ is reported as a scalar value. Red font denotes the default calibrated setting, and blue font marks the best result for each metric among the re-parameterized settings. ↑ and ↓ indicate metrics for which larger and smaller values are preferred, respectively.
| Scene Parameter Setting | Macro-Precision ↑ | Macro-Recall ↑ | Macro-F1 ↑ | Accuracy ↑ | AUC ↑ | Cohen’s κ ↑ | Brier Loss ↓ | G-Mean ↑ | Specificity ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Default | 0.7260 ± 0.1017 | 0.7160 ± 0.1025 | 0.7170 ± 0.0884 | 0.7160 ± 0.1025 | 0.9759 ± 0.0134 | 0.7141 | 0.0151 ± 0.0244 | 0.8400 ± 0.0622 | 0.9908 ± 0.0041 |
| Wall–screen distance | 0.6975 ± 0.1091 | 0.6887 ± 0.1119 | 0.6894 ± 0.1014 | 0.6887 ± 0.1119 | 0.9715 ± 0.0133 | 0.6871 | 0.0162 ± 0.0248 | 0.8227 ± 0.0712 | 0.9899 ± 0.0042 |
| Occluder shifted toward wall by | 0.7317 ± 0.0988 | 0.7230 ± 0.0971 | 0.7239 ± 0.0877 | 0.7230 ± 0.0971 | 0.9777 ± 0.0117 | 0.7208 | 0.0147 ± 0.0243 | 0.8443 ± 0.0610 | 0.9910 ± 0.0040 |
| Effective screen region shifted along by | 0.7215 ± 0.1000 | 0.7130 ± 0.0970 | 0.7140 ± 0.0882 | 0.7130 ± 0.0970 | 0.9755 ± 0.0125 | 0.7106 | 0.0151 ± 0.0243 | 0.8383 ± 0.0607 | 0.9907 ± 0.0039 |
Table 11.
Gesture-morphology-stratified evaluation of SCISA-Net on the held-out validation split. The original 31-class task is partitioned, without retraining, into three subsets defined by the number of extended fingers: Closed (0–1), Half-Open (2–3), and Spread (4–5). Results are reported as mean ± standard deviation for Macro-Precision, Macro-Recall, Macro-F1, Accuracy, AUC, Brier loss, G-mean, and Specificity, while Cohen’s κ is reported as a scalar value. ↑ and ↓ indicate metrics for which larger and smaller values are preferred, respectively. Red font marks the full 31-class reference result, and blue font marks the best result for each metric among the morphology-stratified subsets.
| Evaluation Split | Macro-Precision ↑ | Macro-Recall ↑ | Macro-F1 ↑ | Accuracy ↑ | AUC ↑ | Cohen’s κ ↑ | Brier Loss ↓ | G-Mean ↑ | Specificity ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Full 31-Class Set | 0.7260 ± 0.1017 | 0.7160 ± 0.1025 | 0.7170 ± 0.0884 | 0.7160 ± 0.1025 | 0.9759 ± 0.0134 | 0.7141 | 0.0151 ± 0.0244 | 0.8400 ± 0.0622 | 0.9908 ± 0.0041 |
| Closed Subset | 0.8082 ± 0.1152 | 0.6738 ± 0.0915 | 0.7294 ± 0.0806 | 0.7907 ± 0.0815 | 0.9543 ± 0.0250 | 0.6509 | 0.0171 ± 0.0252 | 0.8094 ± 0.0566 | 0.9772 ± 0.0152 |
| Half-Open Subset | 0.8388 ± 0.0716 | 0.7562 ± 0.0909 | 0.7918 ± 0.0663 | 0.8307 ± 0.0844 | 0.9793 ± 0.0083 | 0.7482 | 0.0130 ± 0.0232 | 0.8629 ± 0.0524 | 0.9884 ± 0.0060 |
| Spread Subset | 0.8366 ± 0.0932 | 0.6860 ± 0.1081 | 0.7455 ± 0.0801 | 0.8178 ± 0.1075 | 0.9645 ± 0.0185 | 0.6443 | 0.0170 ± 0.0254 | 0.8104 ± 0.0653 | 0.9648 ± 0.0256 |
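Several of the summary metrics reported in these tables follow directly from a multi-class confusion matrix. The sketch below shows one common way to compute them, assuming that Specificity and G-mean are macro-averaged over classes (with the per-class G-mean taken as the square root of recall times specificity) and that the Brier score is the mean squared error against one-hot labels. These averaging conventions and the function names are assumptions for illustration; the paper's exact definitions may differ.

```python
import numpy as np

def summary_metrics(cm):
    """Metrics from a confusion matrix, where cm[i, j] counts samples
    of true class i predicted as class j (hypothetical helper)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp        # missed samples of each class
    fp = cm.sum(axis=0) - tp        # samples wrongly assigned to each class
    tn = n - tp - fn - fp
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    p_o = tp.sum() / n                                       # observed agreement
    p_e = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n ** 2   # chance agreement
    return {
        "accuracy": p_o,
        "macro_recall": recall.mean(),
        "specificity": specificity.mean(),
        "g_mean": np.sqrt(recall * specificity).mean(),
        "kappa": (p_o - p_e) / (1.0 - p_e),                  # Cohen's kappa
    }

def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and one-hot labels."""
    probs = np.asarray(probs, dtype=float)
    onehot = np.eye(probs.shape[1])[np.asarray(labels)]
    return np.mean((probs - onehot) ** 2)
```

As a small check, the two-class matrix `[[5, 1], [2, 4]]` has observed agreement 0.75 and chance agreement 0.5, so Cohen's κ is 0.5.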