HOIMamba: Bidirectional State-Space Modeling for Monocular 3D Human–Object Interaction Reconstruction
Abstract
1. Introduction
- We propose HOIMamba, a bidirectional state-space framework for monocular 3D human–object interaction reconstruction.
- We introduce a multi-scale state-space module that captures hierarchical interaction structures and fine-grained contact details.
- We design a spatial-channel grouped SSM block that decouples geometric and semantic interaction reasoning into parallel, efficient state-space pathways.
- We achieve state-of-the-art performance on two benchmarks, with substantial gains on contact-sensitive evaluation metrics.
2. Related Work
2.1. 3D Human Reconstruction
2.2. 3D Human–Object Interaction Reconstruction
2.3. State-Space Models
3. Method
3.1. Initialization
- Human Initialization. Following [26], the backbone regresses the parameters of the SMPL-H model [22], including body shape $\beta$, body pose $\theta_b$, and hand pose $\theta_h$. The SMPL-H function $\mathcal{M}(\beta, \theta_b, \theta_h)$ produces an initial human mesh $V_h$ and 3D joints $J_{3D}$. For computational efficiency in subsequent refinement, the mesh is downsampled to $\bar{V}_h = D V_h$ using a predefined downsampling matrix $D$.
- Object Initialization. The initial 6D pose of the object is predicted by a lightweight regression head attached to the backbone, yielding rotation $R_o \in SO(3)$ and translation $t_o \in \mathbb{R}^3$. Applying the predicted pose to the object template vertices $V_o^{\mathrm{tmpl}}$ yields $V_o = V_o^{\mathrm{tmpl}} R_o^{\top} + t_o \in \mathbb{R}^{N_o \times 3}$, where $N_o$ denotes the number of sampled object vertices.
- 3D-Aware Query Construction. To integrate visual appearance with 3D geometry, we construct hybrid query tokens. A weak-perspective camera $\pi$ is estimated by the backbone to project 3D points onto the image plane. For each 3D point $p_i$, its 2D projection $\pi(p_i)$ is computed, and an appearance feature $f_i$ is extracted via bilinear grid sampling on the image feature map $F$. The appearance feature is concatenated with the corresponding 3D coordinates to form a position-aware query $q_i = [f_i; p_i]$. The final token sequence is $Q = \{q_i\}_{i=1}^{N}$, with $N = N_j + N_h + N_o$ tokens of dimension $C$.
- Sequence Ordering. Tokens are concatenated in a fixed semantic order (joints → human vertices → object vertices); within each entity, tokens follow the predefined joint hierarchy or mesh topology. This deterministic ordering imposes a structured traversal over the interaction graph, allowing the state-space models (SSMs) to accumulate interaction evidence in a controlled, spatially meaningful manner that mimics a progressive scan of the human–object constellation. It thus provides a strong structural prior for the SSM, which is crucial for capturing both long-range dependencies between distant body joints and the object and local contacts (a code sketch of the query construction and ordering follows this list).
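A minimal PyTorch sketch of the 3D-aware query construction and fixed token ordering described above. The tensor layouts, the weak-perspective parameterization $(s, t_x, t_y)$, and all function names are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def build_query_tokens(points_3d, feat_map, cam):
    """Builds position-aware query tokens from 3D points and image features.

    points_3d: (B, N, 3) points in the fixed order [joints, human verts, object verts].
    feat_map:  (B, C, H, W) backbone feature map.
    cam:       (B, 3) weak-perspective camera (scale s, translation tx, ty) -- assumed layout.
    Returns:   (B, N, C + 3) query tokens.
    """
    s, t = cam[:, :1], cam[:, 1:]                                  # (B, 1), (B, 2)
    # Weak-perspective projection to normalized [-1, 1] image coordinates.
    proj = s.unsqueeze(1) * points_3d[..., :2] + t.unsqueeze(1)    # (B, N, 2)
    # grid_sample expects a (B, H_out, W_out, 2) sampling grid.
    grid = proj.unsqueeze(2)                                       # (B, N, 1, 2)
    feats = F.grid_sample(feat_map, grid, align_corners=False)     # (B, C, N, 1)
    feats = feats.squeeze(-1).transpose(1, 2)                      # (B, N, C)
    # Concatenate appearance features with 3D coordinates (position-aware query).
    return torch.cat([feats, points_3d], dim=-1)                   # (B, N, C + 3)

# Fixed semantic ordering: joints -> human vertices -> object vertices.
# tokens = build_query_tokens(torch.cat([joints, human_verts, obj_verts], dim=1),
#                             feat_map, cam)
```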
3.2. Bidirectional State-Space Refinement Encoder
3.2.1. Preliminary: State-Space Models
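For reference, the standard formulation that such preliminaries cover: a continuous linear state-space system, its zero-order-hold discretization, and the resulting recurrence over a token sequence, as in S4 [15] and Mamba [16] (Mamba additionally makes $\Delta$, $B$, and $C$ input-dependent, i.e., selective).

```latex
% Continuous-time linear SSM with state h(t), input x(t), output y(t)
h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)

% Zero-order-hold (ZOH) discretization with step size \Delta
\bar{A} = \exp(\Delta A), \qquad
\bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B

% Discrete recurrence over the token sequence x_1, \dots, x_N
h_k = \bar{A}\,h_{k-1} + \bar{B}\,x_k, \qquad y_k = C\,h_k
```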
3.2.2. Multi-Scale State-Space (MSSM) Block
3.2.3. Spatial-Channel Grouped SSM (SCSSM) Block
- Spatial State Pathway. Built on the MSSM, this pathway operates along the token dimension to capture geometric proximity and long-range spatial dependencies: $Y_s = \mathrm{MSSM}(X)$.
- Channel State Pathway. This pathway operates along the feature dimension and models semantic correlations across channels. We transpose the input $X \in \mathbb{R}^{N \times C}$ to $X^{\top} \in \mathbb{R}^{C \times N}$ and apply a Mamba layer to capture channel-wise dependencies, $Y_c = \mathrm{Mamba}(X^{\top})$, followed by transposition back to $\mathbb{R}^{N \times C}$. Compared to channel-wise MLPs or attention, channel SSMs enable progressive and directional feature aggregation, yielding more stable semantic alignment. A channel attention gate is then applied to reweight the resulting channel features.
- Fusion. The final SCSSM output is obtained by fusing the two pathways, $Y = \mathrm{Fuse}(Y_s, Y_c)$. The forward pass processes the sequence in the order [joints, human vertices, object vertices], which can be interpreted as information flowing from the human to the object (i.e., how the human’s state influences the object’s pose). Conversely, the backward pass processes the reversed order [object vertices, human vertices, joints], simulating information flow from the object back to the human (i.e., how the object’s geometry and pose constrain the human’s posture). By fusing both directions, the model captures the full bidirectional and asymmetric nature of the interaction, leading to more physically plausible reconstructions (see the sketch after this list).
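A minimal sketch of an SCSSM-style block under stated assumptions: it uses the public `mamba_ssm` package’s `Mamba` layer (operating on `(batch, length, dim)` tensors), realizes the channel pathway by transposing so the scan runs over channels, and fuses forward and backward passes by flipping the token order. The channel attention gate, the exact fusion weights, and the residual form are assumptions, not the authors’ implementation.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumes the public mamba_ssm package

class SCSSMBlock(nn.Module):
    """Illustrative spatial-channel grouped SSM block (not the exact paper code)."""

    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        self.spatial_ssm = Mamba(d_model=dim)         # scans along the N tokens
        self.channel_ssm = Mamba(d_model=num_tokens)  # scans along the C channels
        self.fuse = nn.Linear(2 * dim, dim)           # assumed pathway fusion

    def _one_direction(self, x):                      # x: (B, N, C)
        y_spatial = self.spatial_ssm(x)               # token-dimension pathway
        # Channel pathway: transpose so the sequence axis is the channel axis.
        y_channel = self.channel_ssm(x.transpose(1, 2)).transpose(1, 2)
        return self.fuse(torch.cat([y_spatial, y_channel], dim=-1))

    def forward(self, x):
        # Forward scan: [joints, human verts, object verts] (human -> object).
        fwd = self._one_direction(x)
        # Backward scan: reversed order (object -> human), then un-reverse.
        bwd = self._one_direction(x.flip(1)).flip(1)
        return x + 0.5 * (fwd + bwd)                  # assumed residual fusion
```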
3.2.4. Bidirectional SSM Encoder Layer
3.2.5. Multi-Stage Encoder Architecture
3.3. Training Objectives
- Human Mesh Loss. The human loss combines multi-scale vertex supervision, joint supervision, edge-length regularization, and SMPL-H parameter supervision: $\mathcal{L}_{\mathrm{human}} = \lambda_{v}\mathcal{L}_{\mathrm{vert}} + \lambda_{j}\mathcal{L}_{\mathrm{joint}} + \lambda_{e}\mathcal{L}_{\mathrm{edge}} + \lambda_{p}\mathcal{L}_{\mathrm{param}}$.
- Multi-scale vertex loss. Let $\hat{V}^{(0)}$ denote the predicted coarse human vertices, and $\hat{V}^{(1)}$, $\hat{V}^{(2)}$ the upsampled vertices at two finer resolutions. Let $V^{(s)}$ be the corresponding ground-truth vertices at each scale $s$. We define $\mathcal{L}_{\mathrm{vert}} = \sum_{s=0}^{2} \lVert \hat{V}^{(s)} - V^{(s)} \rVert_{1}$. This constrains human reconstruction quality progressively, from coarse geometry to full-resolution details.
- Edge-length consistency. Let $\mathcal{E}$ be the edge set of the full-resolution human mesh topology, where $(i, j) \in \mathcal{E}$ indicates an edge between vertices $i$ and $j$. We penalize the discrepancy between predicted and ground-truth edge lengths: $\mathcal{L}_{\mathrm{edge}} = \frac{1}{|\mathcal{E}|} \sum_{(i,j)\in\mathcal{E}} \left| \lVert \hat{v}_i - \hat{v}_j \rVert_2 - \lVert v_i - v_j \rVert_2 \right|$.
- SMPL-H parameter loss. Let $\hat{\Theta}$ denote the predicted SMPL-H parameters (body pose, hand pose, and shape, if used) and $\Theta$ the ground truth. We apply an $\ell_2$ penalty: $\mathcal{L}_{\mathrm{param}} = \lVert \hat{\Theta} - \Theta \rVert_{2}^{2}$.
- Object Loss. We supervise object shape (vertices) and rigid pose: $\mathcal{L}_{\mathrm{object}} = \lambda_{ov}\mathcal{L}_{\mathrm{overt}} + \lambda_{t}\mathcal{L}_{t} + \lambda_{R}\mathcal{L}_{R}$.
- Object vertex loss. Let $\hat{V}_o$ be the predicted object vertices (in the canonical object frame or a chosen common frame) and $V_o$ the ground-truth vertices aligned in the same frame. We use $\mathcal{L}_{\mathrm{overt}} = \lVert \hat{V}_o - V_o \rVert_{1}$.
- Object pose loss. Let $(\hat{R}, \hat{t})$ and $(R, t)$ be the predicted and ground-truth object rotation and translation. We use a translation loss $\mathcal{L}_{t} = \lVert \hat{t} - t \rVert_{2}$ and a rotation geodesic loss on $SO(3)$, $\mathcal{L}_{R} = \lVert \log(\hat{R}^{\top} R) \rVert_{2}$, where $\log(\cdot)$ maps a rotation matrix to its Lie algebra (axis–angle vector). The edge-length and geodesic losses are sketched in code after this list.
- Hand Bounding Box Loss. Following common practice for whole-body mesh recovery, we supervise 2D hand bounding boxes. Let $\hat{B}$ and $B$ denote the predicted and ground-truth hand boxes (e.g., parameterized by corner coordinates): $\mathcal{L}_{\mathrm{hand}} = \lVert \hat{B} - B \rVert_{1}$.
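For concreteness, a minimal PyTorch sketch of the two less standard terms above, the edge-length consistency loss and the geodesic rotation loss; the reductions and any weighting are assumptions.

```python
import torch

def edge_length_loss(v_pred, v_gt, edges):
    """Edge-length consistency. v_pred, v_gt: (B, V, 3); edges: (E, 2) long
    tensor of vertex-index pairs from the full-resolution mesh topology."""
    i, j = edges[:, 0], edges[:, 1]
    len_pred = (v_pred[:, i] - v_pred[:, j]).norm(dim=-1)    # (B, E)
    len_gt = (v_gt[:, i] - v_gt[:, j]).norm(dim=-1)          # (B, E)
    return (len_pred - len_gt).abs().mean()

def geodesic_rotation_loss(R_pred, R_gt, eps=1e-7):
    """Geodesic distance on SO(3): the rotation angle of R_pred^T R_gt,
    i.e., the norm of its axis-angle (log-map) representation."""
    R_rel = R_pred.transpose(-1, -2) @ R_gt                  # (B, 3, 3)
    trace = R_rel.diagonal(dim1=-2, dim2=-1).sum(-1)         # (B,)
    cos = ((trace - 1.0) / 2.0).clamp(-1.0 + eps, 1.0 - eps)
    return torch.acos(cos).mean()                            # radians
```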
3.4. Implementation Details
4. Experiments
4.1. Datasets
4.2. Evaluation Metrics
4.3. Quantitative Results and Comparisons
- Runtime Analysis. We also compare the runtime of HOIMamba with the attention-based baseline CONTHO [12]. Using the same hardware configuration as in Table 1, we measure the average inference time per image (in seconds) and the peak memory usage (in MB). CONTHO takes 0.10 s per image with a peak memory usage of 928 MB, while HOIMamba takes 0.07 s per image with a peak memory usage of 884 MB. HOIMamba is therefore both faster and lighter than the attention-based baseline while achieving higher accuracy (a measurement sketch follows).
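For reproducibility, a minimal sketch of how such per-image latency and peak-memory figures can be collected in PyTorch; `model` and `image` are placeholders, and this is not the exact benchmarking script used in the paper.

```python
import time
import torch

@torch.no_grad()
def benchmark(model, image, warmup=10, iters=100):
    """Returns (average seconds per image, peak GPU memory in MB)."""
    model.eval().cuda()
    image = image.cuda()                       # (1, 3, H, W) placeholder input
    for _ in range(warmup):                    # warm up kernels and caches
        model(image)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    for _ in range(iters):
        model(image)
    torch.cuda.synchronize()                   # wait for queued GPU work
    latency = (time.perf_counter() - start) / iters
    peak_mb = torch.cuda.max_memory_allocated() / 2**20
    return latency, peak_mb
```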
4.4. Qualitative Results
4.5. Limitations
5. Ablation Study
5.1. Ablation Settings and Variant Definitions
5.2. Impact of State-Space Modeling
5.3. Effect of Multi-Scale State-Space Modeling
5.4. Spatial-Channel Factorization Analysis
5.5. Bidirectional Interaction Reasoning
5.6. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| HOI | Human–Object Interaction |
| SSM | State-Space Model |
| MSSM | Multi-Scale State-Space Module |
| SCSSM | Spatial-Channel Grouped SSM |
| SMPL | Skinned Multi-Person Linear model |
References
- Catalfamo, A.; Celesti, A.; Fazio, M.; Saif, A.S.; Lin, Y.S.; Silva, E.F.; Villari, M. An Approach to Enable Human–3D Object Interaction Through Voice Commands in an Immersive Virtual Environment. Big Data Cogn. Comput. 2025, 9, 188. [Google Scholar] [CrossRef]
- Wang, T.; Lu, T.; Fang, W.; Zhang, Y. Human–Object Interaction Detection with Ratio-Transformer. Symmetry 2022, 14, 1666. [Google Scholar] [CrossRef]
- Peng, W.; Li, C.; Tang, K.; Liu, X.; Fang, M. HOME: 3D Human–Object Mesh Topology-Enhanced Interaction Recognition in Images. Mathematics 2022, 10, 2841. [Google Scholar] [CrossRef]
- Luo, T.; Yuan, Q.; Zhu, B.; Guan, S.; Yang, R.; Smith, J.S.; Lim, E.G. Simple yet effective: An explicit query-based relation learner for human-object-interaction detection. Neurocomputing 2025, 649, 130709. [Google Scholar]
- Alhaji, B.; Beecken, J.; Ehlers, R.; Gertheiss, J.; Merz, F.; Müller, J.P.; Prilla, M.; Rausch, A.; Reinhardt, A.; Reinhardt, D.; et al. Engineering human–machine teams for trusted collaboration. Big Data Cogn. Comput. 2020, 4, 35. [Google Scholar] [CrossRef]
- Peng, B.; Chen, B. Bayesian prototypical pruning for transformers in human–robot collaboration. Mathematics 2025, 13, 1411. [Google Scholar] [CrossRef]
- Fernández-Rodicio, E.; Dondrup, C.; Sevilla-Salcedo, J.; Castro-González, Á.; Salichs, M.A. Predicting and Synchronising Co-Speech Gestures for Enhancing Human–Robot Interactions Using Deep Learning Models. Biomimetics 2025, 10, 835. [Google Scholar] [CrossRef]
- Luo, T.; Guan, S.; Yang, R.; Smith, J. From detection to understanding: A survey on representation learning for human-object interaction. Neurocomputing 2023, 543, 126243. [Google Scholar] [CrossRef]
- Luo, T.; Yuan, Q.; Zhu, B.; Guan, S.; Yang, R.; Smith, J.S.; Lim, E.G. Exploring interaction concepts for human–object-interaction detection via global- and local-scale enhancing. Neurocomputing 2025, 651, 130882. [Google Scholar] [CrossRef]
- Zhang, J.Y.; Pepose, S.; Joo, H.; Ramanan, D.; Malik, J.; Kanazawa, A. Perceiving 3d human-object spatial arrangements from a single image in the wild. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 34–51. [Google Scholar]
- Xie, X.; Bhatnagar, B.L.; Pons-Moll, G. Chore: Contact, human and object reconstruction from a single rgb image. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 125–145. [Google Scholar]
- Nam, H.; Jung, D.S.; Moon, G.; Lee, K.M. Joint reconstruction of 3d human and object via contact-based refinement transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 10218–10227. [Google Scholar]
- Wang, Z.; Zheng, Q.; Ma, S.; Ye, M.; Zhan, Y.; Li, D. End-to-End HOI Reconstruction Transformer with Graph-based Encoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 11–15 June 2025; pp. 27706–27715. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inform. Process. Syst. 2017, 30, 1–11. [Google Scholar]
- Gu, A.; Goel, K.; Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv 2021, arXiv:2111.00396. [Google Scholar]
- Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the First Conference on Language Modeling, Philadelphia, PA, USA, 7–9 October 2024. [Google Scholar]
- Bhatnagar, B.L.; Xie, X.; Petrov, I.A.; Sminchisescu, C.; Theobalt, C.; Pons-Moll, G. Behave: Dataset and method for tracking human object interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 15935–15946. [Google Scholar]
- Huang, Y.; Taheri, O.; Black, M.J.; Tzionas, D. InterCap: Joint markerless 3D tracking of humans and objects in interaction. In Proceedings of the DAGM German Conference on Pattern Recognition, Konstanz, Germany, 27–30 September 2022; pp. 281–299. [Google Scholar]
- Pavlakos, G.; Choutas, V.; Ghorbani, N.; Bolkart, T.; Osman, A.A.; Tzionas, D.; Black, M.J. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10975–10985. [Google Scholar]
- Omran, M.; Lassner, C.; Pons-Moll, G.; Gehler, P.; Schiele, B. Neural body fitting: Unifying deep learning and model based human pose and shape estimation. In Proceedings of the 2018 International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018; pp. 484–494. [Google Scholar]
- Yang, Y.; Dong, L.; Wang, W.; Zhong, Z.; Sun, X. Learnable SMPLify: A Neural Solution for Optimization-Free Human Pose Inverse Kinematics. arXiv 2025, arXiv:2508.13562. [Google Scholar] [CrossRef]
- Loper, M.; Mahmood, N.; Romero, J.; Pons-Moll, G.; Black, M.J. SMPL: A skinned multi-person linear model. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2; ACM: New York, NY, USA, 2023; pp. 851–866. [Google Scholar]
- Kanazawa, A.; Black, M.J.; Jacobs, D.W.; Malik, J. End-to-end recovery of human shape and pose. In Proceedings of the IEEE Conference Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7122–7131. [Google Scholar]
- Pavlakos, G.; Zhu, L.; Zhou, X.; Daniilidis, K. Learning to estimate 3D human pose and shape from a single color image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 459–468. [Google Scholar]
- Lin, K.; Wang, L.; Liu, Z. Mesh graphormer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 12939–12948. [Google Scholar]
- Lin, K.; Wang, L.; Liu, Z. End-to-end human pose and mesh reconstruction with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1954–1963. [Google Scholar]
- Krishnan, R.; Shalit, U.; Sontag, D. Structured inference networks for nonlinear state space models. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar]
- Dao, T.; Gu, A. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv 2024, arXiv:2405.21060. [Google Scholar] [CrossRef]
- Lieber, O.; Lenz, B.; Bata, H.; Cohen, G.; Osin, J.; Dalmedigos, I.; Safahi, E.; Meirom, S.; Belinkov, Y.; Shalev-Shwartz, S.; et al. Jamba: A hybrid transformer-mamba language model. arXiv 2024, arXiv:2403.19887. [Google Scholar] [CrossRef]
- Waleffe, R.; Byeon, W.; Riach, D.; Norick, B.; Korthikanti, V.; Dao, T.; Gu, A.; Hatamizadeh, A.; Singh, S.; Narayanan, D.; et al. An empirical study of mamba-based language models. arXiv 2024, arXiv:2406.07887. [Google Scholar] [CrossRef]
- Liu, X.; Zhang, C.; Huang, F.; Xia, S.; Wang, G.; Zhang, L. Vision mamba: A comprehensive survey and taxonomy. IEEE Trans. Neural Netw. Learn. Syst. 2026, 37, 505–525. [Google Scholar] [PubMed]
- Zhang, H.; Zhu, Y.; Wang, D.; Zhang, L.; Chen, T.; Wang, Z.; Ye, Z. A survey on visual mamba. Appl. Sci. 2024, 14, 5683. [Google Scholar] [CrossRef]
- Hatamizadeh, A.; Kautz, J. Mambavision: A hybrid mamba-transformer vision backbone. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 11–15 June 2025; pp. 25261–25270. [Google Scholar]
- Zhang, G.; Fan, L.; He, C.; Lei, Z.; Zhang, Z.; Zhang, L. Voxel mamba: Group-free state space models for point cloud based 3d object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 81489–81509. [Google Scholar]
- Jin, X.; Su, H.; Liu, K.; Ma, C.; Wu, W.; Hui, F.; Yan, J. UniMamba: Unified spatial-channel representation learning with group-efficient mamba for LiDAR-based 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 11–15 June 2025; pp. 1407–1417. [Google Scholar]
- Xing, Z.; Ye, T.; Yang, Y.; Cai, D.; Gai, B.; Wu, X.J.; Gao, F.; Zhu, L. Segmamba-v2: Long-range sequential modeling mamba for general 3d medical image segmentation. IEEE Trans. Med. Imaging 2026, 45, 4–15. [Google Scholar] [CrossRef]
- Wang, C.; Xie, Y.; Chen, Q.; Zhou, Y.; Wu, Q. A comprehensive analysis of Mamba for 3D volumetric medical image segmentation. Pattern Recognit. 2025, 173, 112701. [Google Scholar] [CrossRef]
- Xu, J.; Lan, Y.; Zhang, Y.; Zhang, C.; Stirenko, S.; Li, H. CDA-mamba: Cross-directional attention mamba for enhanced 3D medical image segmentation. Sci. Rep. 2025, 15, 21357. [Google Scholar] [CrossRef] [PubMed]
- Zhang, Z.; Liu, A.; Chen, Q.; Chen, F.; Reid, I.; Hartley, R.; Zhuang, B.; Tang, H. Infinimotion: Mamba boosts memory in transformer for arbitrary long motion generation. arXiv 2024, arXiv:2407.10061. [Google Scholar] [CrossRef]
- Wang, X.; Kang, Z.; Mu, Y. Text-controlled motion mamba: Text-instructed temporal grounding of human motion. IEEE Trans. Image Process. 2025, 34, 7079–7092. [Google Scholar] [CrossRef] [PubMed]
- Jordan, K.; Jin, Y.; Boza, V.; You, J.; Cesista, F.; Newhouse, L.; Bernstein, J. Muon: An Optimizer for Hidden Layers in Neural Networks. 2024. Available online: https://kellerjordan.github.io/posts/muon/ (accessed on 10 March 2026).





**BEHAVE**

| Methods | CD_human ↓ | CD_object ↓ | Contact_p ↑ | Contact_r ↑ |
|---|---|---|---|---|
| PHOSA [10] | 12.17 | 26.62 | 0.393 | 0.266 |
| CHORE [11] | 5.58 | 10.66 | 0.587 | 0.472 |
| CONTHO [12] | 4.99 | 8.42 | 0.628 | 0.496 |
| HOIMamba (Ours) | 4.56 | 7.91 | 0.658 | 0.563 |

**InterCap**

| Methods | CD_human ↓ | CD_object ↓ | Contact_p ↑ | Contact_r ↑ |
|---|---|---|---|---|
| PHOSA [10] | 11.20 | 20.57 | 0.228 | 0.159 |
| CHORE [11] | 7.01 | 12.81 | 0.339 | 0.253 |
| CONTHO [12] | 5.96 | 9.50 | 0.661 | 0.432 |
| HOIMamba (Ours) | 5.52 | 8.84 | 0.690 | 0.500 |
| Method | CD_human ↓ | CD_object ↓ | Contact_p ↑ | Contact_r ↑ |
|---|---|---|---|---|
| Attention-based [12] | 4.99 | 8.42 | 0.628 | 0.496 |
| Single-scale SSM | 4.83 | 8.26 | 0.631 | 0.502 |
| w/o Channel pathway | 4.65 | 8.32 | 0.640 | 0.558 |
| w/o Spatial pathway | 7.96 | 11.10 | 0.476 | 0.421 |
| Forward only (h→o) | 4.62 | 8.37 | 0.642 | 0.553 |
| Backward only (o→h) | 4.63 | 8.03 | 0.658 | 0.549 |
| HOIMamba (SSM) | 4.56 | 7.91 | 0.658 | 0.563 |