TriPose: A Multimodal Approach Integrating Images, Point Clouds, and Language for 3D Hand Pose Estimation
Abstract
1. Introduction
- Introducing structured language descriptions to assist 3D hand pose estimation: We incorporate structured textual representations as a third modality in the estimation framework. By leveraging CLIP’s text encoder, spatial cues embedded in language are aligned with visual and geometric features, compensating for the limitations of point clouds and images in handling complex poses and occlusions. This opens a new direction for multi-modal fusion in 3D hand pose estimation.
- Designing a unified image–text–point cloud alignment mechanism with spatial-awareness modules: We propose a tri-modal dual-alignment strategy based on contrastive learning, and introduce a spatial-aware Transformer in the image modality to enhance the ViT’s capacity for modeling local spatial structures. In the point cloud modality, we design a locality-sensitive Transformer encoder that efficiently captures the geometric configuration of hand shapes, enabling deep fusion of semantic and structural information.
- Tri-modal fusion strategy: We formalize a Tri-modal Symmetric Contrastive Learning (TSCL) objective that aligns image–point and text–point pairs (optionally image–text) in a shared feature space while keeping the CLIP encoders frozen. This fusion design narrows modality gaps and supports a single shared representation for downstream pose regression; its effect is isolated via ablations in Section 4.4.
- Proposing a stable and efficient two-stage training paradigm for robust estimation: Instead of end-to-end training, we adopt an “alignment-to-regression” two-stage optimization scheme that significantly improves training stability and the quality of cross-modal alignment. Combined with a multi-scale point cloud augmentation strategy, TriPose achieves strong generalization even on low-resource datasets like ICVL, demonstrating its robustness under data-scarce scenarios.
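The TSCL objective named above can be sketched as a symmetric InfoNCE loss over image–point and text–point pairs. The temperature value, feature shapes, and function names below are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def symmetric_infonce(a, b, tau=0.07):
    """Symmetric InfoNCE between two batches of features.

    a, b: (N, D) arrays; row i of `a` and row i of `b` form a positive pair.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / tau                       # (N, N) similarity matrix
    targets = np.arange(len(a))                  # positives sit on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)     # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[targets, targets].mean()

    # cross-entropy in both directions (a->b and b->a), then average
    return 0.5 * (xent(logits) + xent(logits.T))

def tscl_loss(img_f, pt_f, txt_f, tau=0.07):
    """Tri-modal objective: align image-point and text-point pairs."""
    return symmetric_infonce(img_f, pt_f, tau) + symmetric_infonce(txt_f, pt_f, tau)
```

With the CLIP encoders frozen, only the point cloud branch (and any projection heads) receives gradients from this objective.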
2. Related Work
2.1. 3D Hand Pose Estimation
2.2. Multi-Modal Learning for 3D Hand Pose Estimation
3. Method
3.1. Overall Architecture of TriPose
3.1.1. Image Encoder
3.1.2. Text Encoder
- Left-to-right ordering: “From left to right, the joints are: …”
- Top-to-bottom ordering: “From top to bottom, the joints are: …”
- Near-to-far ordering: “From near to far, the joints are: …”

3.1.3. Point Cloud Encoder
3.2. Tri-Modal Fusion and Alignment Strategy
Data Formats and Alignment
- Left-to-right: “From left to right: wrist, thumb_MCP, thumb_IP, index_MCP, …, pinky_TIP.”
- Top-to-bottom: “From top to bottom: pinky_TIP, ring_TIP, …, wrist.”
- Near-to-far: “From near to far: thumb_TIP, index_TIP, …, wrist.”
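The three ordered prompts above can be generated directly from the 3D joint coordinates. The sketch below is illustrative: the joint-name list is truncated, and the axis conventions (x for left–right, y for top–bottom, z for depth) are assumptions that depend on the camera coordinate frame:

```python
import numpy as np

# Truncated, illustrative subset of the 21 joint names used in the prompts.
JOINT_NAMES = ["wrist", "thumb_MCP", "thumb_IP", "index_MCP"]

AXES = {
    "From left to right": 0,   # sort by x (assumed convention)
    "From top to bottom": 1,   # sort by y (sign depends on image frame)
    "From near to far": 2,     # sort by depth z
}

def describe(joints_xyz, names=JOINT_NAMES):
    """Build the three ordered text prompts from (J, 3) joint coordinates."""
    prompts = []
    for phrase, axis in AXES.items():
        order = np.argsort(joints_xyz[:, axis])
        prompts.append(f"{phrase}: " + ", ".join(names[i] for i in order) + ".")
    return prompts
```

Each prompt is then tokenized and passed through the frozen CLIP text encoder to obtain the text-modality feature.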
3.3. Pose Regression Module
- Input: the fused multimodal feature vector;
- Linear(512) → ReLU → Dropout;
- Linear(256) → ReLU → Dropout;
- Linear(3 × J) → Reshape to J × 3 joint coordinates (J joints, 3D each).
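A minimal forward-pass sketch of this head, assuming a fused feature of dimension D and J = 21 joints; the weight initialization and dropout rate are placeholders, not the paper's settings:

```python
import numpy as np

def pose_head(feat, params, drop_p=0.0, rng=None):
    """Map a fused feature (D,) to joint coordinates (J, 3).

    Mirrors the layer list above: Linear(512) -> ReLU -> Dropout,
    Linear(256) -> ReLU -> Dropout, Linear(3*J) -> reshape to (J, 3).
    """
    x = feat
    for W, b in params[:-1]:
        x = np.maximum(W @ x + b, 0.0)            # Linear + ReLU
        if drop_p > 0.0 and rng is not None:      # inverted dropout (training only)
            mask = rng.random(x.shape) >= drop_p
            x = x * mask / (1.0 - drop_p)
    W, b = params[-1]
    return (W @ x + b).reshape(-1, 3)             # Linear(3*J) -> (J, 3)

def init_params(d_in, n_joints=21, seed=0):
    """Random placeholder weights for the three linear layers."""
    rng = np.random.default_rng(seed)
    sizes = [d_in, 512, 256, 3 * n_joints]
    return [(rng.normal(scale=0.02, size=(o, i)), np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]
```

At inference the dropout branches are skipped, so the head is a plain three-layer MLP.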

3.4. Loss Function
3.4.1. Cross-Modal Contrastive Loss
3.4.2. Pose Regression Loss
3.4.3. Total Loss
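Assuming the regression term is an MSE over joint coordinates and the total loss is a weighted sum of the regression and contrastive terms, the combination can be sketched as follows; the weight `lam` is an illustrative placeholder, and in the two-stage scheme the contrastive term is optimized in stage one before regression:

```python
import numpy as np

def pose_mse(pred, gt):
    """Mean squared error over all joint coordinates (Section 3.4.2)."""
    return float(np.mean((pred - gt) ** 2))

def total_loss(l_pose, l_contrast, lam=0.1):
    """Weighted sum of regression and contrastive terms (Section 3.4.3).

    `lam` is an assumed placeholder weight, not the paper's value.
    """
    return l_pose + lam * l_contrast
```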
4. Experiments
4.1. Training Protocol
4.2. Data Augmentation for Robustness
- Clean: original point cloud without distortion;
- Rotate: random 3D rotation;
- Scale: isotropic scaling transformations;
- Jitter: additive Gaussian noise;
- Drop Local: randomly remove points in local regions;
- Drop Global: uniformly drop points across the global surface;
- Add Local: add noisy points within localized areas;
- Add Global: inject global outliers to simulate sensor artifacts.
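A few of the corruptions above can be sketched in numpy; the ratios, noise scale, and neighborhood definition (nearest points to a random seed) are illustrative assumptions rather than the paper's exact parameters:

```python
import numpy as np

def jitter(pc, sigma=0.01, seed=0):
    """Jitter: additive Gaussian noise on every point."""
    rng = np.random.default_rng(seed)
    return pc + rng.normal(scale=sigma, size=pc.shape)

def drop_local(pc, ratio=0.1, seed=0):
    """Drop Local: remove the `ratio` fraction of points nearest a random seed point."""
    rng = np.random.default_rng(seed)
    anchor = pc[rng.integers(len(pc))]
    dist = np.linalg.norm(pc - anchor, axis=1)
    return pc[np.argsort(dist)[int(len(pc) * ratio):]]

def drop_global(pc, ratio=0.1, seed=0):
    """Drop Global: uniformly drop a `ratio` fraction of points."""
    rng = np.random.default_rng(seed)
    keep = rng.permutation(len(pc))[: int(len(pc) * (1 - ratio))]
    return pc[keep]

def add_global(pc, n=50, seed=0):
    """Add Global: inject uniform outliers within the cloud's bounding box."""
    rng = np.random.default_rng(seed)
    lo, hi = pc.min(axis=0), pc.max(axis=0)
    return np.vstack([pc, rng.uniform(lo, hi, size=(n, 3))])
```

Rotate and Scale are standard rigid/similarity transforms; Add Local follows the same pattern as Drop Local but inserts noisy points around the seed instead of removing them.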
4.3. Datasets
4.3.1. Validation Protocol and Fairness
4.3.2. Dataset Characteristics and Complementarity
4.4. Comparison with State-of-the-Art Methods
4.5. Ablation Study
4.6. Runtime and Model Size
4.6.1. Protocol
4.6.2. Positioning
4.6.3. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Glossary
| Abbreviation | Definition |
|---|---|
| 3D HPE | 3D Hand Pose Estimation |
| CLIP | Contrastive Language–Image Pre-training |
| ViT | Vision Transformer |
| MSA | Multi-Head Self-Attention |
| MLP | Multi-Layer Perceptron |
| PCK | Percentage of Correct Keypoints |
| MSE | Mean Squared Error |
| TSCL | Tri-modal Symmetric Contrastive Learning |
| ULIP | Unified Representation of Language, Images, and Point Clouds |
| NYU | New York University 3D Hand Pose Dataset |
| ICVL | Imperial College Vision Laboratory Hand Dataset |
| MSRA | Microsoft Research Asia Hand Pose Dataset |
References
- Cheng, W.; Tang, H.; Van Gool, L.; Ko, J.H. HandDiff: 3D Hand Pose Estimation with Diffusion on Image-Point Cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 2274–2284. [Google Scholar]
- Chen, X.; Wang, G.; Zhang, C.; Kim, T.K.; Ji, X. Shpr-net: Deep semantic hand pose regression from point clouds. IEEE Access 2018, 6, 43425–43439. [Google Scholar] [CrossRef]
- Chen, X.; Wang, G.; Guo, H.; Zhang, C. Pose guided structured region ensemble network for cascaded hand pose estimation. Neurocomputing 2020, 395, 138–149. [Google Scholar] [CrossRef]
- Cheng, W.; Park, J.H.; Ko, J.H. Handfoldingnet: A 3d hand pose estimation network using multiscale-feature guided folding of a 2D hand skeleton. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 11260–11269. [Google Scholar]
- Du, K.; Lin, X.; Sun, Y.; Ma, X. Crossinfonet: Multi-task information sharing based hand pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9896–9905. [Google Scholar]
- Fang, L.; Liu, X.; Liu, L.; Xu, H.; Kang, W. Jgr-p2o: Joint graph reasoning based pixel-to-offset prediction network for 3d hand pose estimation from a single depth image. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part VI 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 120–137. [Google Scholar]
- Ge, L.; Cai, Y.; Weng, J.; Yuan, J. Hand pointnet: 3d hand pose estimation using point sets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8417–8426. [Google Scholar]
- Ge, L.; Ren, Z.; Yuan, J. Point-to-point regression pointnet for 3d hand pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 475–491. [Google Scholar]
- Guo, H.; Wang, G.; Chen, X.; Zhang, C.; Qiao, F.; Yang, H. Region ensemble network: Improving convolutional network for hand pose estimation. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 4512–4516. [Google Scholar]
- Li, S.; Lee, D. Point-to-pose voting based hand pose estimation using residual permutation equivariant layer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11927–11936. [Google Scholar]
- Ren, P.; Sun, H.; Qi, Q.; Wang, J.; Huang, W. SRN: Stacked regression network for real-time 3D hand pose estimation. In Proceedings of the BMVC, Cardiff, UK, 9–12 September 2019; Volume 112. [Google Scholar]
- Ren, P.; Chen, Y.; Hao, J.; Sun, H.; Qi, Q.; Wang, J.; Liao, J. Two heads are better than one: Image-point cloud network for depth-based 3D hand pose estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, Koriyama, Japan, 8–13 July 2023; Volume 37, pp. 2163–2171. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
- Sun, X.; Wei, Y.; Liang, S.; Tang, X.; Sun, J. Cascaded hand pose regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 824–832. [Google Scholar]
- Tang, D.; Jin Chang, H.; Tejani, A.; Kim, T.K. Latent regression forest: Structured estimation of 3d articulated hand posture. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3786–3793. [Google Scholar]
- Tompson, J.; Stein, M.; Lecun, Y.; Perlin, K. Real-time continuous pose recovery of human hands using convolutional networks. ACM Trans. Graph. (ToG) 2014, 33, 1–10. [Google Scholar] [CrossRef]
- Oikonomidis, I.; Kyriazis, N.; Argyros, A.A. Efficient model-based 3D tracking of hand articulations using Kinect. In Proceedings of the BMVC, Dundee, UK, 29 August–2 September 2011; Volume 1, p. 3. [Google Scholar]
- Ballan, L.; Taneja, A.; Gall, J.; Van Gool, L.; Pollefeys, M. Motion capture of hands in action using discriminative salient points. In Proceedings of the Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Proceedings, Part VI 12. Springer: Berlin/Heidelberg, Germany, 2012; pp. 640–653. [Google Scholar]
- Tzionas, D.; Ballan, L.; Srikantha, A.; Aponte, P.; Pollefeys, M.; Gall, J. Capturing hands in action using discriminative salient points and physics simulation. Int. J. Comput. Vis. 2016, 118, 172–193. [Google Scholar] [CrossRef]
- Khamis, S.; Taylor, J.; Shotton, J.; Keskin, C.; Izadi, S.; Fitzgibbon, A. Learning an efficient model of hand shape variation from depth images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2540–2548. [Google Scholar]
- Romero, J.; Tzionas, D.; Black, M.J. Embodied hands: Modeling and capturing hands and bodies together. arXiv 2022, arXiv:2201.02610. [Google Scholar] [CrossRef]
- Tkach, A.; Tagliasacchi, A.; Remelli, E.; Pauly, M.; Fitzgibbon, A. Online generative model personalization for hand tracking. ACM Trans. Graph. (ToG) 2017, 36, 1–11. [Google Scholar] [CrossRef]
- Remelli, E.; Tkach, A.; Tagliasacchi, A.; Pauly, M. Low-dimensionality calibration through local anisotropic scaling for robust hand model personalization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2535–2543. [Google Scholar]
- Keskin, C.; Kıraç, F.; Kara, Y.E.; Akarun, L. Hand pose estimation and hand shape classification using multi-layered randomized decision forests. In Proceedings of the Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Proceedings, Part VI 12. Springer: Berlin/Heidelberg, Germany, 2012; pp. 852–863. [Google Scholar]
- Xu, C.; Cheng, L. Efficient hand pose estimation from a single depth image. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 3456–3462. [Google Scholar]
- Ge, L.; Liang, H.; Yuan, J.; Thalmann, D. 3d convolutional neural networks for efficient and robust hand pose estimation from single depth images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 June 2017; pp. 1991–2000. [Google Scholar]
- Ge, L.; Liang, H.; Yuan, J.; Thalmann, D. Robust 3d hand pose estimation in single depth images: From single-view cnn to multi-view cnns. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3593–3601. [Google Scholar]
- Oberweger, M.; Lepetit, V. Deepprior++: Improving fast and accurate 3d hand pose estimation. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 585–594. [Google Scholar]
- Choi, C.; Kim, S.; Ramani, K. Learning hand articulations by hallucinating heat distribution. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3104–3113. [Google Scholar]
- Taylor, J.; Bordeaux, L.; Cashman, T.; Corish, B.; Keskin, C.; Sharp, T.; Soto, E.; Sweeney, D.; Valentin, J.; Luff, B.; et al. Efficient and precise interactive hand tracking through joint, continuous optimization of pose and correspondences. ACM Trans. Graph. (ToG) 2016, 35, 1–12. [Google Scholar] [CrossRef]
- Wan, C.; Probst, T.; Van Gool, L.; Yao, A. Crossing nets: Dual generative models with a shared latent space for hand pose estimation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 June 2017; Volume 7. [Google Scholar]
- Oberweger, M.; Wohlhart, P.; Lepetit, V. Training a feedback loop for hand pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3316–3324. [Google Scholar]
- Ye, Q.; Yuan, S.; Kim, T.K. Spatial attention deep net with partial pso for hierarchical hybrid hand pose estimation. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VIII 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 346–361. [Google Scholar]
- Tang, D.; Taylor, J.; Kohli, P.; Keskin, C.; Kim, T.K.; Shotton, J. Opening the black box: Hierarchical sampling optimization for estimating human hand pose. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3325–3333. [Google Scholar]
- Zheng, Z.; Xie, S.; Dai, H.; Chen, X.; Wang, H. An overview of blockchain technology: Architecture, consensus, and future trends. In Proceedings of the 2017 IEEE International Congress on Big Data (BigData Congress), Boston, MA, USA, 11–14 December 2017; pp. 557–564. [Google Scholar]
- Moon, G.; Chang, J.Y.; Lee, K.M. V2v-posenet: Voxel-to-voxel prediction network for accurate 3d hand and human pose estimation from a single depth map. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5079–5088. [Google Scholar]
- Xiao, Y.; Liu, Y. Review on 3D Hand Pose Estimation Based on a RGB Image. J. Comput. Aided Des. Comput. Graph. 2024, 36, 161–172. [Google Scholar]
- Żywanowski, K.; Łysakowski, M.; Nowicki, M.R.; Jacques, J.T.; Tadeja, S.K.; Bohné, T.; Skrzypczyński, P. Vision-based hand pose estimation methods for Augmented Reality in industry: Crowdsourced evaluation on HoloLens 2. Comput. Ind. 2025, 171, 104328. [Google Scholar] [CrossRef]
- Neverova, N.; Wolf, C.; Taylor, G.W.; Nebout, F. Multi-scale deep learning for gesture detection and localization. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany; pp. 474–490. [Google Scholar]
- Molchanov, P.; Yang, X.; Gupta, P.; Kim, K.; Tyree, S.; Kautz, J. Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; Volume 1, p. 3. [Google Scholar]
- Garg, M.; Ghosh, D.; Pradhan, P.M. Gestformer: Multiscale wavelet pooling transformer network for dynamic hand gesture recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 2473–2483. [Google Scholar]
- Oord, A.v.d.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
- Guo, S.; Cai, Q.; Qi, L.; Dong, J. Clip-hand3D: Exploiting 3D hand pose estimation via context-aware prompting. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 4896–4907. [Google Scholar]
- Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 June 2017; pp. 652–660. [Google Scholar]
- Ren, J.; Pan, L.; Liu, Z. Benchmarking and analyzing point cloud classification under corruptions. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 18559–18575. [Google Scholar]
- Zhou, X.; Wan, Q.; Zhang, W.; Xue, X.; Wei, Y. Model-based deep hand pose estimation. arXiv 2016, arXiv:1606.06854. [Google Scholar] [CrossRef]
- Oberweger, M.; Wohlhart, P.; Lepetit, V. Hands deep in deep learning for hand pose estimation. arXiv 2015, arXiv:1502.06807. [Google Scholar]
- Guo, H.; Wang, G.; Chen, X.; Zhang, C. Towards good practices for deep 3d hand pose estimation. arXiv 2017, arXiv:1707.07248. [Google Scholar] [CrossRef]
- Wan, C.; Probst, T.; Van Gool, L.; Yao, A. Dense 3d regression for hand pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5147–5156. [Google Scholar]
- Cheng, W.; Ko, J.H. Handr2n2: Iterative 3d hand pose estimation using a residual recurrent neural network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 20904–20913. [Google Scholar]
- Rezaei, M.; Rastgoo, R.; Athitsos, V. TriHorn-net: A model for accurate depth-based 3D hand pose estimation. Expert Syst. Appl. 2023, 223, 119922. [Google Scholar] [CrossRef]
| Method | MSRA (mm) | ICVL (mm) | NYU (mm) |
|---|---|---|---|
| DeepModel [47] | – | 11.56 | 17.04 |
| DeepPrior [48] | – | 10.40 | 19.73 |
| Ren-4 × 6 × 6 [9] | – | 7.63 | 13.39 |
| Ren-9 × 6 × 6 [49] | 9.70 | 7.31 | 12.69 |
| DeepPrior++ [29] | 9.50 | 8.10 | 12.24 |
| Pose-REN [9] | 8.65 | 6.79 | 11.81 |
| DenseReg [50] | 7.20 | 7.30 | 10.20 |
| 3DCNN [27] | 9.60 | – | 14.10 |
| SHPR-Net [2] | 7.76 | 10.78 | 10.78 |
| HandPointNet [7] | 8.50 | 6.94 | 10.54 |
| HandFoldingNet [4] | 7.34 | 6.30 | 8.42 |
| CrossInfoNet [5] | 7.86 | 6.73 | 10.08 |
| V2V-PoseNet [37] | 7.59 | 6.28 | 8.42 |
| HandR2N2 [51] | 6.42 | 5.70 | 7.27 |
| TriHorn-Net [52] | 7.10 | 5.73 | 7.68 |
| IPNet [12] | 6.92 | 5.76 | 7.17 |
| HandDiff [1] | 6.53 | 5.72 | 7.38 |
| TriPose (Ours) | 6.98 ± 0.12 | 5.68 ± 0.09 | 7.43 ± 0.11 |
| Model Variant | MSRA ↓ | ICVL ↓ | NYU ↓ |
|---|---|---|---|
| TriPose (Full Model) | 6.98 | 5.68 | 7.43 |
| w/o Text | 7.54 | 6.47 | 8.26 |
| w/o Image Transformer | 7.36 | 6.18 | 7.92 |
| w/o Point Augmentations | 7.42 | 6.59 | 8.03 |
| Two-modal (Image + Point) | 7.25 | 6.22 | 7.86 |
| Point Only | 8.34 | 7.17 | 8.96 |
| Three-modal + End-to-End | 7.12 | 5.94 | 7.67 |
| Augmentation Variant | MSRA ↓ | ICVL ↓ | NYU ↓ |
|---|---|---|---|
| TriPose (All Augmentations) | 6.98 | 5.68 | 7.43 |
| w/o Add Global | 7.14 | 5.89 | 7.56 |
| w/o Add Local | 7.09 | 5.92 | 7.61 |
| w/o Drop Global | 7.18 | 6.03 | 7.67 |
| w/o Drop Local | 7.23 | 6.14 | 7.88 |
| w/o Jitter | 7.29 | 6.32 | 7.79 |
| w/o Rotate | 7.16 | 5.97 | 7.64 |
| w/o Scale | 7.14 | 5.94 | 7.59 |
| Method | Parameters | Speed (FPS) | Time (ms) | GPU |
|---|---|---|---|---|
| V2V-PoseNet [37] | 457.5 M | 3.5 | 23 + 5.5 | TITAN X |
| HandPointNet [9] | 2.58 M | 48 | 8.2 + 11.3 | GTX 1080 |
| Point-to-Point [8] | 4.3 M | 41.8 | 8.2 + 15.7 | TITAN XP |
| HandFoldingNet [4] | 1.28 M | 84 | 8.2 + 3.7 | TITAN XP |
| TriPose (Ours) | 95.4 M | 40.0 | 8.2 + 16.8 | RTX 3090 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
She, L.; Guo, X.; Sun, H.; Liang, H. TriPose: A Multimodal Approach Integrating Images, Point Clouds, and Language for 3D Hand Pose Estimation. Electronics 2025, 14, 4485. https://doi.org/10.3390/electronics14224485