Semantics-Driven 3D Scene Retrieval via Joint Loss Deep Learning
Abstract
1. Introduction
- A large and comprehensive scene semantic tree is constructed based on WordNet. This pioneering work leads the research on semantics-driven 3D scene retrieval and also provides a useful scene semantic infrastructure for many related applications.
- A novel and effective semantic tree-based 3D scene retrieval framework (see Figure 1) is proposed, which, according to our experimental results, substantially improves 2D scene image-based 3D scene retrieval performance.
2. Related Work
2.1. Deep Learning Technique-Based Scene Processing
2.2. Semantics-Based 3D Scene Understanding
2.3. Related 2D Scene Image and 3D Scene Benchmarks
2.3.1. SUN and SUN3D Datasets (2010 and 2016)
2.3.2. COCO Dataset (2014) and COCO-Stuff Dataset (2018)
2.3.3. SUNCG Dataset (2017)
2.3.4. Places Dataset (2018)
3. Semantics-Driven 2D Image-Based 3D Scene Model Retrieval
3.1. Step 1: Scene Semantic Tree Construction
3.2. Step 2: 3D Scene Model View Sampling
3.3. Step 3: Semantic Object Instance Segmentation
3.4. Step 4: Scene Semantic Information Learning
3.5. Step 5: VGG-Based Joint Loss Retrieval (JLR)
3.6. Computational Complexity Analysis of Our Approach
- (1)
- Object detection.
- Training. We adopt the default YOLO v3 model, which contains approximately 62 M parameters and performs around 65.9 G floating-point operations (FLOPs) per forward pass on a 640 × 640 image, offering a strong balance between speed and accuracy.
- Testing. The inference speed during the testing stage is ∼6–8 ms/image (∼125–160 FPS) on an NVIDIA RTX 2080 Ti GPU.
- (2)
- Scene semantic information (SSI) learning.
- Training. We design a 9-layer fully connected DNN model to learn SSI. The layer widths are 500, 625, 500, 400, 600, 300, 200, 120, and 210 nodes; the model contains about 1.36 M parameters and performs around 0.27 M FLOPs (a parameter-counting sketch is given after this list).
- Testing. This step is not needed during the testing stage.
- (3)
- VGG-based joint loss retrieval (JLR) model training.
- Training. The adopted VGG16 contains approximately 138 M parameters and performs ∼15.5–16 GFLOPs per forward pass.
- Testing. The inference speed during the testing stage is ∼6–8 ms/image (∼125–165 FPS) on an NVIDIA RTX 2080 Ti GPU.
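For reference, the parameter counts above can be reproduced with a few lines of PyTorch. The sketch below builds the 9-layer fully connected SSI network from Step 4 using the layer widths listed in item (2), assuming the first width (500) is the input dimension, and counts its trainable parameters (≈1.36 M). The module structure, activation choice, and function names are our own illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

# Layer widths from Section 3.6: the first value is treated as the input
# dimension; the remaining eight define fully connected layers (an assumption,
# since the text lists nine widths without marking the input explicitly).
WIDTHS = [500, 625, 500, 400, 600, 300, 200, 120, 210]

def build_ssi_dnn(widths=WIDTHS):
    """Build the fully connected SSI network with ReLU activations (illustrative)."""
    layers = []
    for in_dim, out_dim in zip(widths[:-1], widths[1:]):
        layers.append(nn.Linear(in_dim, out_dim))
        layers.append(nn.ReLU(inplace=True))
    layers.pop()  # no activation after the final layer
    return nn.Sequential(*layers)

def count_parameters(model):
    """Count trainable parameters of a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

if __name__ == "__main__":
    model = build_ssi_dnn()
    print(f"SSI DNN parameters: {count_parameters(model):,}")  # ~1.36 M
    x = torch.randn(1, WIDTHS[0])
    print("output shape:", model(x).shape)                     # (1, 210)
```

Applying the same counting function to torchvision.models.vgg16() reports roughly 138 M parameters, consistent with the figure quoted above for the VGG-based JLR model.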
4. Experiments and Discussion
4.1. Dataset
4.2. Scene Semantic Information Learning Results
4.3. 3D Scene Retrieval Results
4.3.1. Retrieval Accuracy Evaluation
4.3.2. Retrieval Efficiency and Scalability Evaluation
4.4. Discussion About Automatic Expansion of the Semantic Tree
5. Conclusions and Future Work
5.1. Conclusions
5.2. Limitations
5.3. Future Work
5.3.1. Improving Scene Object Detection Performance
5.3.2. Data Collection and Generation
- (1)
- Three-dimensional scene model data collection. As the main data sources, we will develop web crawlers to automatically download free 3D scenes from popular online public 3D repositories, such as 3D Warehouse [72], which hosts more than 4M free 3D models, as well as GrabCAD [73] (2.84M models) and Sketchfab [74] (1.5M models). Together, these repositories provide scene models across a diverse range of categories, including generic, CAD, architecture, watertight, and RGB-D types, as well as 3D printer models.
- (2)
- Two-dimensional scene sketch data collection—I2S2: Image-to-scene sketch translation using conditional input and adversarial networks. As a preliminary work, we have proposed a full-scene image-to-sketch synthesis algorithm [75] with CycleGAN [76] using holistically nested edge detection (HED) [77] maps. We plan to use the scene sketch data generated based on this approach to further extend our proposed algorithm presented in this paper for sketch-based 3D scene retrieval, as well as to further enlarge the sketch-based retrieval (SBR) portion of the Scene_SBR_IBR benchmark [1] to make it a large-scale one and to further promote this research direction.
5.3.3. Developing an Adaptive Approach for Processing Different Kinds of Scene Data
- (1)
- Scene data conversion. The first way is to convert all the data into the same type. For example, some sketches collected from online sources are very concise and contain little content, while others contain far more detail. The more detail a sketch contains, the more accurately a neural network can be trained on it and the better its predictions will be. Therefore, to improve retrieval performance, we may develop a scene sketch completion method that automatically adds detail to overly simple sketches so that all sketches carry a comparable level of detail.
- (2)
- Adaptive machine learning model. The second way is to train an adaptive machine learning model that can work on different types of scene data, whether detailed or simple, realistic or iconic. To achieve this goal, we could further develop a hybrid model that supports data of multiple types, modalities, and levels of detail by training our model on various types of large-scale scene data simultaneously. Meanwhile, to further improve retrieval performance, we may also incorporate the scene semantic relatedness between different types of scene data into the definition of the loss function of the final model (a minimal loss sketch is given after this list).
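To make the last point concrete, the following is a minimal sketch, assuming a PyTorch setting, of how class-level semantic relatedness could enter a joint loss: a standard cross-entropy term is combined with a penalty that charges more for placing probability mass on semantically unrelated classes. The SemanticJointLoss module, the relatedness matrix, and the weight lam are illustrative assumptions, not the exact joint loss used in our JLR model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticJointLoss(nn.Module):
    """Cross-entropy plus a semantic-relatedness-aware penalty (illustrative).

    `relatedness` is a [C, C] matrix with values in [0, 1], e.g., derived from
    WordNet similarity between scene classes; confusing two closely related
    classes is penalized less than confusing two unrelated ones.
    """

    def __init__(self, relatedness: torch.Tensor, lam: float = 0.5):
        super().__init__()
        self.register_buffer("relatedness", relatedness)
        self.lam = lam

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        ce = F.cross_entropy(logits, target)
        probs = F.softmax(logits, dim=1)            # [B, C]
        # Expected "unrelatedness" of the prediction w.r.t. the true class.
        unrelated = 1.0 - self.relatedness[target]  # [B, C]
        semantic_penalty = (probs * unrelated).sum(dim=1).mean()
        return ce + self.lam * semantic_penalty

# Usage sketch: 30 scene classes, relatedness from a (hypothetical) WordNet step.
# relatedness = torch.eye(30)  # placeholder; real values lie in [0, 1]
# criterion = SemanticJointLoss(relatedness, lam=0.5)
# loss = criterion(model(images), labels)
```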
5.3.4. Extension to Handle Partial 3D Model/Scene Similarity Retrieval
5.3.5. Evaluation on Additional Scene Datasets
5.3.6. Adaptive Loss Weighting
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Yuan, J.; Abdul-Rashid, H.; Li, B.; Lu, Y.; Schreck, T.; Bai, S.; Bai, X.; Bui, N.; Do, M.N.; Do, T.; et al. A comparison of methods for 3D scene shape retrieval. Comput. Vis. Image Underst. 2020, 201, 103070. [Google Scholar] [CrossRef]
- Miller, G.A. WordNet: A Lexical Database for English. Commun. ACM 1995, 38, 39–41. [Google Scholar]
- Yuan, J.; Wang, T.; Zhe, S.; Lu, Y.; Li, B. Semantic Tree Based 3D Scene Model Recognition. In Proceedings of the IEEE 3rd International Conference on Multimedia Information Processing and Retrieval, MIPR 2020, Shenzhen, China, 6–9 August 2020. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. arXiv 2015. [Google Scholar] [CrossRef]
- Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018. [Google Scholar] [CrossRef]
- He, X.; Zhou, Y.; Zhou, Z.; Bai, S.; Bai, X. Triplet Center Loss for Multi-View 3D Object Retrieval. In Proceedings of the CVPR, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
- Caglayan, A.; Imamoglu, N.; Can, A.B.; Nakamura, R. When CNNs Meet Random RNNs: Towards Multi-Level Analysis for RGB-D Object and Scene Recognition. arXiv 2020. [Google Scholar] [CrossRef]
- Wang, P.; Liu, Y.; Tong, X. Deep Octree-based CNNs with Output-Guided Skip Connections for 3D Shape and Scene Completion. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR Workshops 2020, Seattle, WA, USA, 14–19 June 2020; pp. 1074–1081. [Google Scholar]
- Murez, Z.; van As, T.; Bartolozzi, J.; Sinha, A.; Badrinarayanan, V.; Rabinovich, A. Atlas: End-to-End 3D Scene Reconstruction from Posed Images. In Proceedings of the Computer Vision-ECCV 2020—16th European Conference, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J., Eds.; Proceedings, Part VII; Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 2020; Volume 12352, pp. 414–431. [Google Scholar]
- Werner, D.; Al-Hamadi, A.; Werner, P. Truncated Signed Distance Function: Experiments on Voxel Size. In Image Analysis and Recognition; Campilho, A., Kamel, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2014; pp. 357–364. [Google Scholar]
- Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.A.; Nießner, M. ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 2432–2443. [Google Scholar]
- Tang, X.; Ma, Q.; Zhang, X.; Liu, F.; Ma, J.; Jiao, L. Attention Consistent Network for Remote Sensing Scene Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 2030–2045. [Google Scholar]
- Jiang, J.; Kang, Z.; Li, J. Construction of a Dual-Task Model for Indoor Scene Recognition and Semantic Segmentation Based on Point Clouds. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2023, X-1/W1-2023, 469–478. [Google Scholar]
- Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 5099–5108. [Google Scholar]
- Naseer, A.; Jalal, A. Holistic Scene Recognition through U-Net Semantic Segmentation and CNN. In Proceedings of the 2024 19th International Conference on Emerging Technologies (ICET), Topi, Pakistan, 19–20 November 2024; pp. 1–6. [Google Scholar]
- Song, C.; Wu, H.; Ma, X.; Li, Y. Semantic-embedded similarity prototype for scene recognition. Pattern Recognit. 2024, 155, 110725. [Google Scholar] [CrossRef]
- Quattoni, A.; Torralba, A. Recognizing Indoor Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; pp. 413–420. [Google Scholar] [CrossRef]
- Xiao, J.; Hays, J.; Ehinger, K.A.; Oliva, A.; Torralba, A. SUN database: Large-scale scene recognition from abbey to zoo. In Proceedings of the CVPR, San Francisco, CA, USA, 13–18 June 2010; pp. 3485–3492. [Google Scholar]
- Zhou, B.; Lapedriza, À.; Khosla, A.; Oliva, A.; Torralba, A. Places: A 10 Million Image Database for Scene Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1452–1464. [Google Scholar]
- Sharifuzzaman Sagar, A.; Chen, Y.; Xie, Y.; Kim, H.S. MSA R-CNN: A comprehensive approach to remote sensing object detection and scene understanding. Expert Syst. Appl. 2024, 241, 122788. [Google Scholar] [CrossRef]
- Vijayakumar, A.; Vairavasundaram, S. YOLO-based object detection models: A review and its applications. Multimed. Tools Appl. 2024, 83, 83535–83574. [Google Scholar] [CrossRef]
- Liu, Z.; Mao, H.; Wu, C.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 11966–11976. [Google Scholar]
- Trigka, M.; Dritsas, E. A Comprehensive Survey of Machine Learning Techniques and Models for Object Detection. Sensors 2025, 25, 214. [Google Scholar] [CrossRef] [PubMed]
- Leacock, C.; Chodorow, M. Combining Local Context and WordNet Similarity for Word Sense Identification; MIT Press: Cambridge, MA, USA, 1998; Volume 49, pp. 265–283. [Google Scholar]
- Wu, Z.; Palmer, M. Verb Semantics and Lexical Selection. arXiv 1994. [Google Scholar] [CrossRef]
- Pedersen, T.; Patwardhan, S.; Michelizzi, J. WordNet:: Similarity-Measuring the Relatedness of Concepts. In Proceedings of the 19th National Conference on Artificial Intelligence, AAAI’04, San Jose, CA, USA, 25–29 July 2004; McGuinness, D.L., Ferguson, G., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2004; pp. 1024–1025. [Google Scholar]
- Hirst, G.; St-Onge, D. Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms; MIT Press: Cambridge, MA, USA, 1995; Volume 305. [Google Scholar]
- Banerjee, S.; Pedersen, T. An adapted Lesk algorithm for word sense disambiguation using WordNet. In Proceedings of the International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City, Mexico, 17–23 February 2002; pp. 136–145. [Google Scholar]
- Patwardhan, S. Incorporating Dictionary and Corpus Information into a Context Vector Measure of Semantic Relatedness. Master’s Thesis, University of Minnesota Duluth, Duluth, MN, USA, July 2003. [Google Scholar]
- Patwardhan, S.; Banerjee, S.; Pedersen, T. Using Measures of Semantic Relatedness for Word Sense Disambiguation. In Proceedings of the Computational Linguistics and Intelligent Text Processing, 4th International Conference, CICLing 2003, Mexico City, Mexico, 16–22 February 2003; Gelbukh, A.F., Ed.; Proceedings; Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 2003; Volume 2588, pp. 241–257. [Google Scholar]
- Pedersen, T.; Pakhomov, S.V.; Patwardhan, S.; Chute, C.G. Measures of semantic similarity and relatedness in the biomedical domain. J. Biomed. Inform. 2007, 40, 288–299. [Google Scholar] [CrossRef]
- Guo, Y.; Liu, Y.; Oerlemans, A.; Lao, S.; Wu, S.; Lew, M.S. Deep learning for visual understanding: A review. Neurocomputing 2016, 187, 27–48. [Google Scholar]
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.S.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
- Chang, A.X.; Funkhouser, T.A.; Guibas, L.J.; Hanrahan, P.; Huang, Q.; Li, Z.; Savarese, S.; Savva, M.; Song, S.; Su, H.; et al. ShapeNet: An Information-Rich 3D Model Repository. arXiv 2015. [Google Scholar] [CrossRef]
- Huth, A.G.; Nishimoto, S.; Vu, A.T.; Gallant, J.L. A Continuous Semantic Space Describes the Representation of Thousands of Object and Action Categories across the Human Brain. Neuron 2012, 76, 1210–1224. [Google Scholar] [CrossRef]
- Bollacker, K.D.; Evans, C.; Paritosh, P.; Sturge, T.; Taylor, J. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, 10–12 June 2008; pp. 1247–1250. [Google Scholar]
- Ashburner, M.; Ball, C.A.; Blake, J.A.; Botstein, D.; Butler, H.; Cherry, J.M.; Davis, A.P.; Dolinski, K.; Dwight, S.S.; Eppig, J.T.; et al. Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 2000, 25, 25–29. [Google Scholar]
- Bowman, S.R.; Angeli, G.; Potts, C.; Manning, C.D. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, 17–21 September 2015; pp. 632–642. [Google Scholar]
- Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.; Shamma, D.A.; et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. Int. J. Comput. Vis. 2017, 123, 32–73. [Google Scholar] [CrossRef]
- Armeni, I.; He, Z.; Zamir, A.; Gwak, J.; Malik, J.; Fischer, M.; Savarese, S. 3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5663–5672. [Google Scholar]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R.B. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
- Lv, C.; Qi, M.; Li, X.; Yang, Z.; Ma, H. SGFormer: Semantic graph transformer for point cloud-based 3D scene graph generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 4035–4043. [Google Scholar]
- Huang, S.; Usvyatsov, M.; Schindler, K. Indoor Scene Recognition in 3D. arXiv 2020, arXiv:2002.12819. [Google Scholar]
- Caruana, R. Multitask Learning. Mach. Learn. 1997, 29, 41–75. [Google Scholar] [CrossRef]
- Li, J.; Han, K.; Wang, P.; Liu, Y.; Yuan, X. Anisotropic Convolutional Networks for 3D Semantic Scene Completion. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020; pp. 3348–3356. [Google Scholar]
- Wald, J.; Dhamo, H.; Navab, N.; Tombari, F. Learning 3D Semantic Scene Graphs From 3D Indoor Reconstructions. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020; pp. 3960–3969. [Google Scholar]
- Ku, T.; Veltkamp, R.C.; Boom, B.; Duque-Arias, D.; Velasco-Forero, S.; Deschaud, J.E.; Goulette, F.; Marcotegui, B.; Ortega, S.; Trujillo, A.; et al. SHREC 2020: 3D point cloud semantic segmentation for street scenes. Comput. Graph. 2020, 93, 13–24. [Google Scholar] [CrossRef]
- Chen, R.; Liu, Y.; Kong, L.; Zhu, X.; Ma, Y.; Li, Y.; Hou, Y.; Qiao, Y.; Wang, W. CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 7020–7030. [Google Scholar]
- Zemskova, T.; Yudin, D. 3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding. arXiv 2025, arXiv:2412.18450. [Google Scholar]
- Deng, M.; Hu, J.; Wen, J.; Zhang, X.; Jin, Q. Object Detection-Based Visual SLAM Optimization Method for Dynamic Scene. IEEE Sens. J. 2025, 25, 16480–16488. [Google Scholar] [CrossRef]
- Cai, F.; Qu, Z.; Xia, S.; Wang, S. A method of object detection with attention mechanism and C2f_DCNv2 for complex traffic scenes. Expert Syst. Appl. 2025, 267, 126141. [Google Scholar] [CrossRef]
- Hu, J.; Wei, Y.; Chen, W.; Zhi, X.; Zhang, W. CM-YOLO: Typical Object Detection Method in Remote Sensing Cloud and Mist Scene Images. Remote Sens. 2025, 17, 125. [Google Scholar] [CrossRef]
- Zhao, T.; Feng, R.; Wang, L. SCENE-YOLO: A One-Stage Remote Sensing Object Detection Network with Scene Supervision. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5401515. [Google Scholar] [CrossRef]
- Xiao, J.; Ehinger, K.A.; Hays, J.; Torralba, A.; Oliva, A. SUN Database: Exploring a Large Collection of Scene Categories. Int. J. Comput. Vis. 2016, 119, 3–22. [Google Scholar] [CrossRef]
- Xiao, J.; Owens, A.; Torralba, A. SUN3D: A Database of Big Spaces Reconstructed Using SfM and Object Labels. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, 1–8 December 2013; pp. 1625–1632. [Google Scholar]
- Lin, T.; Maire, M.; Belongie, S.J.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision-ECCV 2014—13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V. pp. 740–755. [Google Scholar]
- Caesar, H.; Uijlings, J.R.R.; Ferrari, V. COCO-Stuff: Thing and Stuff Classes in Context. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1209–1218. [Google Scholar]
- Song, S.; Yu, F.; Zeng, A.; Chang, A.X.; Savva, M.; Funkhouser, T.A. Semantic Scene Completion from a Single Depth Image. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017; pp. 190–198. [Google Scholar]
- Lesk, M. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual International Conference on Systems Documentation (SIGDOC ’86), Toronto, ON, Canada, 8–11 June 1986; pp. 24–26. [Google Scholar]
- Su, H.; Maji, S.; Kalogerakis, E.; Learned-Miller, E.G. Multi-view Convolutional Neural Networks for 3D Shape Recognition. In Proceedings of the ICCV, Santiago, Chile, 7–13 December 2015; pp. 945–953. [Google Scholar]
- Yuan, J.; Abdul-Rashid, H.; Li, B.; Lu, Y. Sketch/Image-Based 3D Scene Retrieval: Benchmark, Algorithm, Evaluation. In Proceedings of the 2nd IEEE Conference on Multimedia Information Processing and Retrieval, MIPR 2019, San Jose, CA, USA, 28–30 March 2019; pp. 264–269. [Google Scholar]
- Naseer, M.; Khan, S.H.; Porikli, F. Indoor Scene Understanding in 2.5/3D: A Survey. arXiv 2018. [Google Scholar] [CrossRef]
- Handa, A.; Patraucean, V.; Badrinarayanan, V.; Stent, S.; Cipolla, R. Understanding Real World Indoor Scenes with Synthetic Data. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016; pp. 4077–4085. [Google Scholar]
- Armeni, I.; Sener, O.; Zamir, A.R.; Jiang, H.; Brilakis, I.K.; Fischer, M.; Savarese, S. 3D Semantic Parsing of Large-Scale Indoor Spaces. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016; pp. 1534–1543. [Google Scholar]
- Armeni, I.; Sax, S.; Zamir, A.R.; Savarese, S. Joint 2D-3D-Semantic Data for Indoor Scene Understanding. arXiv 2017. [Google Scholar] [CrossRef]
- Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
- Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. Version 8.0.0, 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 13 November 2025).
- Song, S.; Lichtenberg, S.P.; Xiao, J. SUN RGB-D: A RGB-D scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, 7–12 June 2015; pp. 567–576. [Google Scholar]
- Xiang, Y.; Kim, W.; Chen, W.; Ji, J.; Choy, C.; Su, H.; Mottaghi, R.; Guibas, L.; Savarese, S. ObjectNet3D: A Large Scale Database for 3D Object Recognition. In Proceedings of the Computer Vision—ECCV 2016, 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 160–176. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.; Li, K.; Li, F. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Trimble. 3D Warehouse. 2024. Available online: http://3dwarehouse.sketchup.com/?hl=en (accessed on 13 November 2025).
- GrabCAD. 2024. Available online: https://grabcad.com/ (accessed on 13 November 2025).
- Sketchfab. 2024. Available online: https://sketchfab.com/ (accessed on 13 November 2025).
- McGonigle, D.; Wang, T.; Yuan, J.; He, K.; Li, B. I2S2: Image-to-Scene Sketch Translation Using Conditional Input and Adversarial Networks. In Proceedings of the 32nd IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2020, Baltimore, MD, USA, 9–11 November 2020; pp. 773–778. [Google Scholar]
- Zhu, J.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, 22–29 October 2017; pp. 2242–2251. [Google Scholar]
- Xie, S.; Tu, Z. Holistically-Nested Edge Detection. Int. J. Comput. Vis. 2017, 125, 3–18. [Google Scholar] [CrossRef]
- Tan, F.; Feng, S.; Ordonez, V. Text2Scene: Generating Compositional Scenes From Textual Descriptions. In Proceedings of the CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; pp. 6710–6719. [Google Scholar]
- Chandhok, S. SceneGPT: A Language Model for 3D Scene Understanding. arXiv 2024. [Google Scholar] [CrossRef]
- Poole, B.; Jain, A.; Barron, J.T.; Mildenhall, B. DreamFusion: Text-to-3D using 2D Diffusion. arXiv 2022. [Google Scholar] [CrossRef]
- Siddiqui, Y.; Alliegro, A.; Artemov, A.; Tommasi, T.; Sirigatti, D.; Rosov, V.; Dai, A.; Nießner, M. MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, 16–22 June 2024; pp. 19615–19625. [Google Scholar]
- Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In Proceedings of the Computer Vision-ECCV 2020—16th European Conference, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J., Eds.; Proceedings, Part I; Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 2020; Volume 12346, pp. 405–421. [Google Scholar]







| Dataset | Data Types | Annotation Types (Major) | Size | # Categories |
|---|---|---|---|---|
| SUN | 2D image | scene/2D object label | 130,519 | 908 |
| SUN3D | RGB-D video | 3D object label, semantic segmentation, 3D camera pose, 3D reconstruction | 41 | 254 |
| COCO | 2D image | 2D object label, image caption | ~164,000 (labeled) | 80 |
| COCO-Stuff | 2D image | 2D object label, panoptic segmentation annotation | ~164,000 (labeled) | 182 |
| SUNCG | 3D model | 3D object label, semantic segmentation, camera pose | 45,622 | 84 |
| Places | 2D image | scene/2D object label, scene attributes | 10,624,928 | 434 |
| Dataset | Key Applications |
|---|---|
| SUN | Scene classification/recognition/semantic segmentation/attribute prediction |
| SUN3D | 3D Object detection, 3D reconstruction, SLAM, semantic mapping, indoor scene understanding |
| SUNCG | 3D scene understanding, semantic/instance segmentation, 3D scene completion/reconstruction |
| COCO | Object detection, instance segmentation, keypoint detection, image captioning |
| COCO-Stuff | Semantic and panoptic segmentation, semantic scene understanding |
| Places | Scene classification/attribute recognition, semantic scene understanding |
| Accuracy | NN | FT | ST | E | DCG | AP |
|---|---|---|---|---|---|---|
| VMV [62] | 0.122 | 0.458 | 0.573 | 0.452 | 0.644 | 0.390 |
| DRF [3] | 0.597 | 0.357 | 0.500 | 0.358 | 0.690 | 0.358 |
| TCL [7] | 0.632 | 0.375 | 0.521 | 0.376 | 0.706 | 0.378 |
| JLR (DNN + SL) | 0.614 | 0.366 | 0.510 | 0.367 | 0.698 | 0.368 |
| JLR | 0.718 | 0.435 | 0.582 | 0.435 | 0.751 | 0.446 |
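The measures in the table above are the standard retrieval metrics used in the Scene_SBR_IBR evaluation [1]: Nearest Neighbor (NN), First Tier (FT), Second Tier (ST), E-measure (E), Discounted Cumulative Gain (DCG), and Average Precision (AP). The sketch below shows how NN, FT, and ST are typically computed from a ranked result list; it follows the commonly used Princeton Shape Benchmark-style definitions, and the function name and example data are illustrative rather than taken from our evaluation code.

```python
def nn_ft_st(ranked_labels, query_label, class_size):
    """Nearest Neighbor, First Tier, and Second Tier for a single query.

    ranked_labels: class labels of the retrieved 3D scenes, best match first
                   (the query itself excluded from the list).
    class_size:    number of relevant models for the query's class (|C|).
    """
    relevant = class_size  # some benchmarks use |C| - 1 when the query comes from the target set
    nn = 1.0 if ranked_labels[0] == query_label else 0.0
    ft = sum(lbl == query_label for lbl in ranked_labels[:relevant]) / relevant
    st = sum(lbl == query_label for lbl in ranked_labels[:2 * relevant]) / relevant
    return nn, ft, st

# Toy example: 4 relevant "bedroom" scenes; relevant hits at ranks 1, 3, 6, and 9.
ranked = ["bedroom", "kitchen", "bedroom", "office", "beach",
          "bedroom", "kitchen", "office", "bedroom"]
print(nn_ft_st(ranked, "bedroom", class_size=4))  # -> (1.0, 0.5, 0.75)
```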
| Method | GPU | Language | T |
|---|---|---|---|
| VMV [62] | 1 × NVIDIA Titan Xp | C++, Matlab | 0.04 |
| DRF [3] | 1 × NVIDIA Titan Xp | C++, Python | 0.03 |
| TCL [7] | 1 × NVIDIA Titan Xp | Python | 0.04 |
| JLR (DNN + SL) | 1 × NVIDIA Titan Xp | Python | 0.03 |
| JLR | 1 × NVIDIA Titan Xp | Python | 0.03 |