Search Results (224)

Search Parameters:
Keywords = 3D scene understanding

19 pages, 2874 KB  
Article
Point Cloud Classification and Segmentation Network Based on Adaptive Feature Extraction
by Chengzhi Deng, Huaipei Wang, Zhaoming Wu, Xiaowei Sun, Shaoquan Zhang and Shengqian Wang
Sensors 2026, 26(12), 3689; https://doi.org/10.3390/s26123689 (registering DOI) - 10 Jun 2026
Abstract
Point cloud classification and segmentation are key technologies for 3D perception and scene understanding, whose accuracy and efficiency directly affect the performance of high-level applications such as 3D modeling, object recognition, and intelligent interaction. Existing methods still exhibit obvious deficiencies in local feature representation, computational efficiency, and scene applicability. To address these issues, this paper proposes a lightweight point cloud classification and segmentation network based on adaptive feature extraction, referred to as AFE-PointNet. Firstly, an element-wise weighting set abstraction module based on the Hadamard product is designed. It leverages geometric topology learning to achieve adaptive feature enhancement, effectively improving the representation capability of local geometric structures. Meanwhile, a cascaded structure of feature aggregation and an inverted residual multi-layer perceptron (InvResMLP) is adopted for deep feature mining to achieve high-accuracy and high-efficiency point cloud classification and segmentation. Experimental results show that AFE-PointNet achieves an overall accuracy (OA) of 93.6% on the ModelNet40 dataset and 84.5% on the ScanObjectNN dataset, and attains a class mean intersection over union (Cls.mIoU) of 83.6% on the ShapeNetPart part segmentation dataset, yielding significant performance improvements over the PointNet++ model. The proposed adaptive feature enhancement and lightweight deep mining strategies effectively improve point cloud representation capability, providing a high-precision and efficient solution for 3D vision tasks. Full article
(This article belongs to the Section Sensing and Imaging)
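The abstract above describes set abstraction with element-wise (Hadamard) weighting of local features. As a minimal sketch of that idea, and not the AFE-PointNet module itself, the snippet below gates each neighbor's feature vector by a weight derived from its offset to the centroid and then max-pools. The function name, the `scale` parameter, and the exponential gate are assumptions made for illustration; the paper's learned weighting is not specified in the abstract.

```python
import math


def hadamard_weighted_pool(center, neighbors, features, scale=1.0):
    """Max-pool neighbor features after element-wise (Hadamard) gating.

    center:    (x, y, z) of the sampled centroid
    neighbors: list of (x, y, z) neighbor coordinates
    features:  list of 3-channel feature vectors, one per neighbor
    scale:     hypothetical stand-in for a learned parameter
    """
    pooled = [-math.inf] * len(features[0])
    for point, feat in zip(neighbors, features):
        # Geometric cue: per-axis offset of the neighbor from the centroid.
        offset = [p - c for p, c in zip(point, center)]
        # Per-channel gate in (0, 1]; a real network would learn this mapping.
        gate = [math.exp(-scale * abs(o)) for o in offset]
        # Hadamard product: multiply gate and feature channel by channel.
        weighted = [g * f for g, f in zip(gate, feat)]
        pooled = [max(a, b) for a, b in zip(pooled, weighted)]
    return pooled


pooled = hadamard_weighted_pool(
    center=(0.0, 0.0, 0.0),
    neighbors=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    features=[[1.0, 2.0, 3.0], [4.0, 1.0, 1.0]],
)
```

In this toy call the second neighbor's first channel is attenuated by its distance along x before pooling, which is the effect the weighting is meant to illustrate.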

19 pages, 12332 KB  
Article
Zero-Shot 3D Asset Detection and Localisation Through Visual Grounding in Industrial Point Clouds
by Masoud Kamali, Behnam Atazadeh, Abbas Rajabifard and Yiqun Chen
AI 2026, 7(6), 205; https://doi.org/10.3390/ai7060205 - 5 Jun 2026
Viewed by 193
Abstract
3D scene understanding in industrial environments is crucial for effective operation and maintenance (O&M) and asset monitoring. However, accurate asset detection and localisation face significant challenges due to asset diversity and scene complexity in these environments. Existing learning-based methods rely heavily on labelled training datasets, which are limited for industrial settings due to asset variability and intricate geometries. To address these challenges, this paper presents a novel framework for industrial asset detection and localisation without requiring labelled training datasets, using only point cloud data. Experimental results demonstrate the competitive performance of the proposed framework, achieving an average precision at 25% intersection over union (AP25) of 48.13% and an AP50 of 34.98%, significantly outperforming state-of-the-art (SOTA) methods. This framework can be employed to generate 3D digital models of brownfield industrial plants that lack up-to-date spatial information, serving as a foundational spatial layer for the development of digital twins within industrial environments. Full article
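AP25 and AP50 count a detection as correct when its 3D intersection over union (IoU) with a ground-truth box reaches 0.25 or 0.5, respectively. The sketch below shows that threshold test for axis-aligned boxes only. The box format `((x0, y0, z0), (x1, y1, z1))` and the function names are assumptions for illustration, and the full average-precision computation (ranking detections by confidence) is omitted.

```python
def box_volume(box):
    """Volume of an axis-aligned box given as (min_corner, max_corner)."""
    (x0, y0, z0), (x1, y1, z1) = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0) * max(0.0, z1 - z0)


def iou_3d(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    lo = tuple(max(p, q) for p, q in zip(a[0], b[0]))
    hi = tuple(min(p, q) for p, q in zip(a[1], b[1]))
    inter = box_volume((lo, hi))  # zero when the boxes do not overlap
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


predicted = ((0.0, 0.0, 0.0), (2.0, 2.0, 2.0))
ground_truth = ((1.0, 0.0, 0.0), (3.0, 2.0, 2.0))
iou = iou_3d(predicted, ground_truth)
matches_at_25 = iou >= 0.25
matches_at_50 = iou >= 0.50
```

Here the two boxes overlap in a third of their union, so the detection would count toward AP25 but not AP50.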

24 pages, 44455 KB  
Article
VISR-CNN: A Dual-Stream Framework for Meteorological Visibility Estimation via Multi-Scale Transmission Attention and Spectral Gating
by Wai Lun Lo, Kwok Wai Wong, Richard Tai Chiu Hsung, Henry Shu Hung Chung, Hong Fu, Harris Sik Ho Tsang and Tony Yulin Zhu
Algorithms 2026, 19(6), 434; https://doi.org/10.3390/a19060434 - 28 May 2026
Viewed by 373
Abstract
Accurate meteorological visibility estimation is vital for transportation safety and environmental monitoring. However, modeling the inherent nonlinear spatial and spectral degradations in hazy environments remains challenging. While recent Large Vision-Language Models (LVLMs) offer strong scene understanding, they lack the regression precision required for visibility estimation. In this paper, we propose the Visibility-Aware Refined CNN (VISR-CNN), a dual-stream architecture that synthesizes local spatial cues with global frequency-domain signatures. The model integrates a Multi-Scale Transmission Attention (MSTA) module, which uses parallel dilated convolutions to estimate atmospheric transmission, and a Global Frequency Branch that utilizes 2D Real Fast Fourier Transforms (RFFT) with Spectral Gating to quantify visibility-dependent blurring. A progressive training strategy is introduced to decouple spectral and spatial optimization, and a physics-informed loss function is designed to supervise numerical regression while enforcing a monotonic ranking constraint consistent with physical light-attenuation laws. Results on the HKCHC-VD dataset show that VISR-CNN achieves state-of-the-art performance (MAE: 1.54 km; RMSE: 2.31 km), representing a 13.0% improvement over VisNet. Further evaluations on the CP1 and SWH datasets confirm robust generalization, reducing overall MAE by 21% and 20%, respectively, compared with the hybrid ResNeXt-50 + ViT model. Notably, in the safety-critical range (0–10 km), VISR-CNN reduces RMSE for the HKCHC-VD, CP1, and SWH datasets by approximately 55%, 64%, and 71%, respectively, when compared with VisNet. These findings demonstrate the superiority of specialized, physics-grounded architectures over general-purpose LVLMs for high-precision meteorological regression. Full article
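The abstract mentions a loss that supervises regression while enforcing a monotonic ranking constraint, but it does not give the formula. The following is a generic sketch of that combination, assuming a mean-absolute-error term plus a pairwise hinge penalty that fires whenever two predictions are ordered differently from their targets. The function name and the `margin` and `weight` parameters are hypothetical, not taken from VISR-CNN.

```python
def visibility_loss(pred, target, margin=0.0, weight=1.0):
    """MAE plus a pairwise ranking penalty (illustrative, not the paper's loss)."""
    n = len(pred)
    mae = sum(abs(p - t) for p, t in zip(pred, target)) / n

    rank_penalty = 0.0
    pairs = 0
    for i in range(n):
        for j in range(n):
            # Only consider pairs where sample i is truly clearer than sample j.
            if target[i] > target[j]:
                pairs += 1
                # Penalize when the predicted gap falls below the margin.
                rank_penalty += max(0.0, margin - (pred[i] - pred[j]))
    if pairs:
        rank_penalty /= pairs
    return mae + weight * rank_penalty


# Visibility in km: the first call has the two predictions in the wrong order.
misordered = visibility_loss(pred=[5.0, 3.0], target=[4.0, 6.0])
perfect = visibility_loss(pred=[4.0, 6.0], target=[4.0, 6.0])
```

The misordered case pays both the regression error and the ranking penalty, while a correctly ordered, exact prediction incurs no loss.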

23 pages, 1836 KB  
Article
Long-Tail Aware Cross-Modal Graph Attention Network for Fine-Grained Indoor 3D Semantic Segmentation of Point Clouds
by Erdal Özbay and Feyza Altunbey Özbay
Sensors 2026, 26(11), 3401; https://doi.org/10.3390/s26113401 - 27 May 2026
Viewed by 353
Abstract
Accurate and efficient semantic segmentation of point cloud data is critical in many application areas involving indoor scene understanding. In particular, fine-grained object categories, high data density, and class imbalance in high-resolution indoor datasets significantly limit class discrimination in 3D semantic segmentation. The multimodal data structure, high-fidelity geometry, and long-tail class distribution of the recently popular ScanNet++ dataset further exacerbate these challenges. This study proposes a novel Long-Tail Aware Cross-Modal Graph Attention Network (LT-CM-GACNet++) to address fine-grained 3D semantic segmentation under long-tail distributions. The proposed method integrates dynamic graph-based geometric feature extraction with a lightweight visual feature extractor based on MobileNetV3, enabling effective fusion of geometric and RGB-based information. The proposed Cross-Modal Graph Attention (CMGA) module facilitates adaptive information transfer between modalities, enabling more effective representation learning of both local and global contextual features. To mitigate the adverse effects of long-tail class distributions, prototype-based representation learning and a class frequency-aware loss function are jointly employed. This strategy improves the learning of rare classes while enhancing the discrimination between visually and geometrically similar categories. In the preprocessing stage, density-based sampling, normal vector estimation, and block-based fixed-size point cloud generation are applied to high-resolution mesh-derived data. The proposed model is evaluated on 50 scenes and 100 semantic classes selected from the ScanNet++ dataset. Experimental results demonstrate that the proposed method achieves significant improvements over existing approaches in terms of both overall segmentation performance and rare-class performance. In particular, notable gains are observed in mean Intersection over Union (mIoU) and rare-class mIoU metrics. These results highlight the effectiveness of cross-modal learning for high-resolution 3D scene segmentation under long-tail distributions. Full article
(This article belongs to the Special Issue Advances in Point Clouds for Sensing Applications)
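A class frequency-aware loss, as named in the abstract above, generally gives rare classes more influence than frequent ones. The sketch below uses plain inverse-frequency weights inside a weighted cross-entropy; this is one common choice and an assumption here, since the abstract does not state which weighting LT-CM-GACNet++ uses. The function names are illustrative.

```python
import math


def class_weights(counts):
    """Inverse-frequency weights, rescaled so their mean is 1."""
    total = sum(counts)
    raw = [total / c if c > 0 else 0.0 for c in counts]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class) over all points."""
    num = 0.0
    den = 0.0
    for p, y in zip(probs, labels):
        num += -weights[y] * math.log(p[y])
        den += weights[y]
    return num / den


# Two classes: 90 points of a frequent class, 10 points of a rare class.
weights = class_weights([90, 10])
loss = weighted_cross_entropy(
    probs=[[0.5, 0.5], [0.5, 0.5]],
    labels=[0, 1],
    weights=weights,
)
```

With these counts the rare class receives nine times the weight of the frequent class, so errors on it dominate the loss.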

28 pages, 1975 KB  
Article
Adaptive Exposure Control for Aerial Cameras in Maritime Scenes
by Haiying Liu, Yingchao Li, Shilong Xu, Huaide Zhou and Huilin Jiang
J. Mar. Sci. Eng. 2026, 14(11), 970; https://doi.org/10.3390/jmse14110970 - 24 May 2026
Viewed by 135
Abstract
Maritime aerial imaging is strongly affected by rapid illumination variations induced by dynamic sea conditions, which often cause conventional exposure control approaches to misinterpret intrinsic scene brightness as overexposure resulting from elevated camera settings. To overcome this issue, an adaptive exposure control framework based on a Glare-Aware Attention Network is proposed, implemented within an end-to-end dual-branch architecture. The framework utilizes an Exposure State Encoding (ESE) module to encode the current frame’s exposure parameters as conditional vectors, thereby resolving physical ambiguities in scene understanding. A Glare-Aware Spatial Attention (GASA) mechanism is further introduced, incorporating a glare prior map (GPM) generated using a “high-luminance, low-texture” heuristic to explicitly suppress sun glint effects. A Scene Difficulty-Adaptive Loss Weighting (SDAW) scheme is designed to adaptively regulate loss weights, and region-aware evaluation metrics, KREA and ISR, are defined. On a self-collected maritime aerial imaging dataset, the proposed approach significantly outperforms both traditional and deep learning-based methods in terms of full-frame and region-level performance metrics. Compared with the multi-task CNN baseline that has the closest parameter count, it achieves a 1.7 dB gain in PSNR. Cross-dataset validation on SeaDronesSee, temporal consistency analysis, and embedded platform testing further support the generalization and real-time feasibility of the proposed solution. The framework thus offers a high-accuracy, region-aware exposure control solution for aerial cameras in complex sea surface scenarios. Full article
(This article belongs to the Section Ocean Engineering)
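The "high-luminance, low-texture" heuristic behind the glare prior map can be illustrated with a simple per-pixel rule: flag a pixel when it is bright and differs little from its four neighbors. This is a sketch under assumed thresholds and an assumed texture measure (maximum absolute neighbor difference); the abstract does not specify how the GPM is actually computed.

```python
def glare_prior_map(lum, lum_thresh=0.9, tex_thresh=0.05):
    """Binary map: 1 where a pixel is bright and locally smooth.

    lum is a 2D list of luminance values in [0, 1]. Both thresholds are
    illustrative values, not taken from the paper.
    """
    h, w = len(lum), len(lum[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Texture proxy: largest absolute difference to a 4-neighbor.
            texture = 0.0
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    texture = max(texture, abs(lum[y][x] - lum[ny][nx]))
            if lum[y][x] >= lum_thresh and texture <= tex_thresh:
                out[y][x] = 1
    return out


image = [
    [0.95, 0.96, 0.20],
    [0.95, 0.97, 0.20],
    [0.10, 0.10, 0.10],
]
gpm = glare_prior_map(image)
```

In this toy image only the top-left pixel is flagged, because every other bright pixel borders a much darker one and so fails the low-texture test.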

24 pages, 8366 KB  
Article
Multi-Error Coupling Simulation for ToF 3D Imaging Based on Optical Path Unit Decomposition
by Gang Chen, Wuyang Zhang, Xubing Kang, Junming Zhang and Xuanquan Wang
Photonics 2026, 13(6), 508; https://doi.org/10.3390/photonics13060508 - 22 May 2026
Viewed by 248
Abstract
Time-of-Flight (ToF) 3D imaging suffers from diverse systematic and non-systematic errors that limit its practical performance and reliability. Reliable simulation is critical for understanding these error mechanisms and guiding performance improvement. Therefore, this paper proposes a multi-error coupling simulation framework for ToF 3D imaging based on optical path unit decomposition. By decomposing the full light propagation chain and systematically integrating established typical error mechanisms into their corresponding physical stages, we produce simulation results that closely match real-world sensor measurements. Validated through laboratory and real-scene experiments, the proposed method outperforms mainstream approaches in RMSE, PSNR, and relative error metrics, accurately reproducing the depth distortion and noise characteristics of real ToF sensors. This multi-error coupled modeling method effectively bridges the gap between simulation and actual measurement, offering a credible reference for ToF system error evaluation, parameter optimization, and performance enhancement. Full article

19 pages, 6663 KB  
Article
Using a Visual Positioning System for a Geolocated Visualization of an Archaeological Site in Augmented Reality
by František Mužík and Lukáš Běloch
ISPRS Int. J. Geo-Inf. 2026, 15(5), 219; https://doi.org/10.3390/ijgi15050219 - 20 May 2026
Viewed by 399
Abstract
In recent years, augmented reality has become a popular method of spatial data visualization, both via the most popular and basic plane-based method and more advanced automatic positioning of visualizations based on predefined real-world locations. The aim of this study is to provide new insights into geolocated 3D visualizations in AR using a visual positioning system (VPS). VPS technology enables the creation of visualizations that can be displayed with high accuracy directly on a specific area of interest. This approach is especially well-suited to cultural heritage preservation, as it can be used to visualize destroyed buildings or archaeological sites. The result of the study is a mobile application created using the Unity game engine, which allows users to access AR visualizations as well as additional context in the form of pop-up texts or photographs. Thanks to the display of AR visualization directly at the chosen location, the user can better understand the context of the whole scene. This is because it is a more immersive experience than simply viewing a 3D model on a computer or mobile phone screen. Full article
(This article belongs to the Special Issue Cartography and Geovisual Analytics)

28 pages, 7499 KB  
Article
HOSG-Nav: Hierarchical Open-Vocabulary Semantic Graph Navigation for Language-Guided Global Planning in 3D Gaussian Scenes
by Yuchen Li, Kai Qin, Weiyi Chen and Haitao Wu
Electronics 2026, 15(10), 2179; https://doi.org/10.3390/electronics15102179 - 19 May 2026
Viewed by 336
Abstract
Natural-language-driven robot navigation in complex indoor environments requires the joint capability of high-fidelity scene representation, structured semantic reasoning, and executable path planning. To address this challenge, this paper proposes HOSG-Nav, a unified framework for natural-language-driven global navigation that integrates open-vocabulary 3D Gaussian scene representation, hierarchical semantic scene graph construction, and large-language-model-driven planning. First, an open-vocabulary 3D Gaussian field is constructed to jointly encode scene geometry, appearance, and semantic information, where compressed CLIP features are lifted into continuous 3D space and depth supervision is introduced to enhance geometric stability and metric-scale consistency. Second, the optimized Gaussian primitives are further abstracted into a semantic scene graph with a region–object hierarchical structure and traversable topological relations to support structured environment understanding. Finally, for natural language instructions, hierarchical semantic parsing is performed with the assistance of a large language model, and executable global navigation paths are generated through cross-modal target retrieval and graph-search-based planning. Experimental results on the Replica dataset demonstrate that HOSG-Nav achieves competitive performance in scene representation, semantic target retrieval, and global navigation, validating the effectiveness of jointly integrating multimodal 3D representation, hierarchical semantic abstraction, and language-guided planning. Full article
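The last stage described above, cross-modal target retrieval followed by graph-search-based planning, can be sketched in two steps: pick the object whose feature vector is most similar to the query embedding, then search the region graph for a path to that object's region. Everything below is a toy stand-in: the two-dimensional vectors replace real CLIP embeddings, breadth-first search replaces whatever planner HOSG-Nav uses, and all names are hypothetical.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_target(query, object_features):
    """Name of the object whose feature best matches the query."""
    return max(object_features, key=lambda name: cosine(query, object_features[name]))


def shortest_path(graph, start, goal):
    """Breadth-first search over a region adjacency graph."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


object_features = {"sofa": [1.0, 0.0], "sink": [0.0, 1.0]}
object_region = {"sofa": "living", "sink": "kitchen"}
region_graph = {
    "hall": ["living", "kitchen"],
    "living": ["hall"],
    "kitchen": ["hall"],
}

target = retrieve_target([0.1, 0.9], object_features)
route = shortest_path(region_graph, "living", object_region[target])
```

The region-to-object lookup mirrors the region–object hierarchy of the scene graph: retrieval happens at the object level, planning at the region level.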

21 pages, 2332 KB  
Article
GCA-Trans: Global Context-Aware Transformer for Robust Transparent Object Segmentation in Robotic Environments
by Deping Li, Zujian Dong, Zilong Yang, Ka-Kui Li and Yushen Huang
J. Imaging 2026, 12(5), 212; https://doi.org/10.3390/jimaging12050212 - 16 May 2026
Viewed by 403
Abstract
Transparent object segmentation plays a critical role in indoor and outdoor scene understanding, particularly driven by the rapid advancements in autonomous driving and robotics. However, this task presents significant challenges due to the lack of distinct texture and chromatic features in transparent objects, causing their appearance to blend into the background. Existing methods face inherent architectural limitations: CNNs are restricted by limited receptive fields, while Transformer-based methods may inadvertently suppress the weak feature details of transparent surfaces due to the inherent low-pass filtering property of self-attention mechanisms, treating them as background noise. Consequently, these approaches struggle to consistently segment transparent objects across diverse scales, failing to preserve both fine details and large-scale structures. To address these limitations, we propose the Global Context-Aware Transformer (GCA-Trans). Specifically, we design a Multi-scale Context Mining (MCM) module that leverages parallel dilated convolutions with varying receptive fields to simultaneously extract features at multiple scales. This design allows the model to capture and fuse fine-grained local details (e.g., edges and textures) with coarse-grained global spatial context (e.g., overall object shapes), ensuring robust segmentation performance for transparent objects of varying scales. Extensive experiments on four benchmark datasets demonstrate that GCA-Trans sets a new state of the art, achieving significant improvements of 2.53% mIoU on Trans10K-v2, 2.1% IoU on RGB-D GSD, 2.2% IoU on GDD, and 1.9% IoU on GSD, validating the effectiveness and robustness of our approach. Full article
(This article belongs to the Special Issue AI-Driven Robot Vision: Progress, Challenges, and Perspectives)

27 pages, 23708 KB  
Article
SGMR-LPR: A Semantic-Guided Network Robust to Movable Objects for LiDAR-Based Place Recognition
by Weizhong Jiang, Zhipeng Xiao, Lilin Qian, Erke Shang, Dawei Zhao, Qi Zhu and Liang Xiao
Sensors 2026, 26(10), 3050; https://doi.org/10.3390/s26103050 - 12 May 2026
Viewed by 559
Abstract
Robust LiDAR point cloud processing in dynamic outdoor environments, where movable objects such as vehicles and pedestrians introduce significant structural uncertainty, remains a key challenge for remote sensing and autonomous systems. This work addresses LiDAR-based place recognition (LPR), a critical component for loop closure and re-localization that is highly susceptible to such dynamics. While semantic information is beneficial, existing methods often require external segmentation models at inference or lack explicit mechanisms to suppress movable objects under uncertain predictions. To address these limitations, we propose SGMR-LPR, an end-to-end semantic-guided framework designed to explicitly counteract movable-object interference during feature encoding. Building on the “segmentation-while-describing” paradigm, SGMR-LPR incorporates an internal semantic segmentation branch and two novel modules: a probabilistic movable object masking (PMOM) module, which transforms semantic logits into continuous, uncertainty-aware masks of movable regions; and a movable-suppressed channel–spatial attention (MSCS) module, which uses these masks to adaptively modulate high-level BEV features—suppressing responses from movable-object regions while enhancing stable structural elements. By embedding explicit movable-awareness into feature modulation, SGMR-LPR achieves enhanced robustness without external semantic models at inference. Extensive experiments on multiple benchmarks demonstrate consistent performance gains, particularly in scenes with dense movable objects, advancing reliable point cloud-based scene understanding in dynamic environments. Full article
(This article belongs to the Section Radar Sensors)
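The PMOM idea of turning semantic logits into a continuous, uncertainty-aware mask can be sketched as follows: apply a softmax to the per-cell logits, sum the probability mass assigned to movable classes, and scale the cell's feature by the remainder. This is an assumed simplification for illustration; the actual PMOM and MSCS modules apply learned channel and spatial attention that the abstract does not detail.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def movable_probability(logits, movable_ids):
    """Total probability that a cell belongs to any movable class."""
    probs = softmax(logits)
    return sum(probs[i] for i in movable_ids)


def suppress(feature, logits, movable_ids):
    """Scale a feature vector by the probability that the cell is static."""
    keep = 1.0 - movable_probability(logits, movable_ids)
    return [keep * f for f in feature]


# Class 0 = building (static), class 1 = vehicle (movable); ids are illustrative.
uncertain = suppress([2.0, 4.0], logits=[0.0, 0.0], movable_ids={1})
confident_static = suppress([2.0, 4.0], logits=[10.0, -10.0], movable_ids={1})
```

An uncertain cell is only half suppressed, while a cell confidently predicted as static keeps almost all of its feature response, which is the soft behavior a hard binary mask would lose.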

26 pages, 4404 KB  
Article
Loop Closure with 3D Gaussian Splatting for Dynamic SLAM
by Zhanwu Ma, Wansheng Cheng and Song Fan
Sensors 2026, 26(9), 2669; https://doi.org/10.3390/s26092669 - 25 Apr 2026
Viewed by 890
Abstract
Robust pose estimation and high-fidelity scene reconstruction in dynamic environments represent core challenges in the field of Visual Simultaneous Localization and Mapping (SLAM). Although 3D Gaussian Splatting (3DGS)-based techniques have demonstrated significant potential, existing methods typically assume static scenes and struggle to address the inconsistency between photometric and geometric observations in dynamic settings, leading to a notable degradation in pose estimation and map accuracy. To address these issues, this paper presents a novel dynamic SLAM method: Loop Closure with 3D Gaussian Splatting for Dynamic SLAM (LCD-Splat). Taking RGB-D images as input, LCD-Splat integrates Mask R-CNN with an improved multi-view geometry approach to detect dynamic objects, generating static scene maps and filling in occluded backgrounds. By leveraging 3DGS submaps and a frame-to-model tracking strategy, LCD-Splat achieves dense map construction. The method initiates online loop closure detection and employs a novel coarse-to-fine 3DGS registration algorithm to compute loop closure constraints between submaps. Global consistency is ultimately ensured through robust pose graph optimization. Experimental results on real-world datasets such as TUM RGB-D and Bonn demonstrate that LCD-Splat outperforms existing state-of-the-art SLAM methods in terms of tracking, scene reconstruction, and rendering performance. This approach provides novel insights for high-precision SLAM in dynamic environments and holds significant implications for scene understanding in complex settings. Full article

23 pages, 2271 KB  
Article
Semantic Segmentation of Sparse Array-SAR 3D Point Clouds Using an Enhanced PointNet++ Framework
by Ya Shu, Lei Pang and Miao Li
Appl. Sci. 2026, 16(9), 4149; https://doi.org/10.3390/app16094149 - 23 Apr 2026
Viewed by 259
Abstract
The semantic segmentation of sparse array synthetic aperture radar (SAR) 3D point clouds remains a significant challenge. These datasets are characterized by extreme sparsity, irregular distribution, and structural discontinuity, factors that diminish the reliability of local neighborhoods and impede the performance of traditional segmentation algorithms. This study introduces an enhanced PointNet++ framework specifically tailored for the semantic segmentation of sparse array-SAR 3D point clouds. Utilizing PointNet++ as a hierarchical backbone, the proposed architecture incorporates three geometry-oriented modifications: a feature enhancement strategy integrating normalized height, surface normals, and local density; an EdgeConv module positioned at an intermediate abstraction stage to reinforce local geometric modeling; and an FP-Refine module designed to optimize cross-scale feature propagation and recovery within sparse regions. Rather than proposing a fundamentally distinct universal architecture, this research focuses on a task-oriented adaptation of PointNet++ to address the neighborhood instability and structural gaps inherent in sparse array-SAR data. Experimental evaluations using the SARMV3D-1.0 dataset indicate that the proposed method consistently outperforms the PointNet++ baseline, maintaining stable performance across various random seeds with an mIoU between 55% and 58%. Further validation through ablation studies, parameter sensitivity analyses, and perturbation-based robustness assessments confirms the utility of the integrated components. Additionally, cross-dataset experiments on S3DIS and Toronto3D suggest that the framework generalizes effectively to point clouds with varying densities and spatial configurations. The findings demonstrate that the method is particularly successful for categories defined by distinct vertical geometry and structural continuity, such as trees, roofs, and facades, though performance remains limited for weakly structured classes like roads. Full article

15 pages, 12377 KB  
Article
Gaussian Semantic Segmentation Based on Color and Shape Deformation Fields
by Yongtao Hao, Kaibin Bao and Wei Wu
Electronics 2026, 15(8), 1700; https://doi.org/10.3390/electronics15081700 - 17 Apr 2026
Viewed by 499
Abstract
Dynamic scene reconstruction has achieved significant milestones with the advent of 3D Gaussian Splatting (3DGS). However, extending this technology from geometric reconstruction to semantic understanding in dynamic environments remains a challenge. Existing methods often rely on external 2D trackers, which lead to temporal inconsistencies and semantic drift, or suffer from the high computational costs of high-dimensional feature fields. In this paper, we propose a novel framework, Gaussian Semantic Segmentation based on Color and Shape Deformation Fields (GSSBC), to address these issues. Building upon our GBC dynamic scene representation, we bind learnable semantic features to deformable Gaussian primitives. We introduce a spatiotemporal contrastive learning strategy guided by the Segment Anything Model (SAM) to enforce semantic consistency without explicit tracking. Furthermore, we employ a density-based clustering algorithm with label propagation to extract discrete object entities efficiently. Experimental results on the HyperNeRF and Neu3D datasets demonstrate that our method achieves superior segmentation accuracy and spatiotemporal stability compared to state-of-the-art approaches, enabling effective semantic understanding in complex dynamic scenes. Full article

27 pages, 4829 KB  
Article
Dual RANSAC with Rescue Midpoint Multi-Trend Vanishing Point Detection
by Nada Said, Bilal Nakhal, Ali El-Zaart and Lama Affara
J. Imaging 2026, 12(4), 172; https://doi.org/10.3390/jimaging12040172 - 16 Apr 2026
Viewed by 526
Abstract
Vanishing point detection is a fundamental step in computer vision that allows 3D scene understanding and autonomous navigation. Classical techniques have significant challenges when trying to understand scenes that are heavily cluttered and images containing multiple perspective cues, leading to poor or unreliable vanishing point determination. We present a Dual RANSAC with Rescue Midpoint-based Multi-Trend Vanishing Point Detection framework, which targets the simultaneous detection and fine-tuning of multiple, globally consistent vanishing points. The proposed framework introduces a novel Midpoint-based Multi-Trend Random Sample Consensus formulation that operates on line segment midpoints to infer dominant directional groups, thereby eliminating noisy or unstable midpoints and stabilizing subsequent vanishing point inference. The main novelty lies in using line segment midpoints to model the orientation variation as a linear regression in the midpoint–orientation space, which helps reduce sensitivity to endpoint instability. Candidate vanishing points are prioritized through inlier-based confidence ranking and subsequently optimized via an MSAC-based arbiter to resolve hypothesis conflicts and minimize geometric error. We evaluate our work against state-of-the-art techniques such as J-Linkage and Conditional Sample Consensus, over two of the current challenging public datasets that comprise the York Urban Dataset and the Toulouse Vanishing Point Dataset. The results show that the proposed framework achieves a recall of up to 95% and an image success rate of almost 84%, outperforming both J-Linkage and Conditional Sample Consensus, especially under tighter angular thresholds. This demonstrates the ability of the proposed framework to provide enhanced stability and localization accuracy. Full article
(This article belongs to the Section Computer Vision and Pattern Recognition)
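Two geometric building blocks sit underneath any RANSAC-style vanishing point search: intersecting two lines to form a candidate point, and scoring how well another segment agrees with it. The sketch below uses homogeneous coordinates for the intersection and, as an assumed consistency measure, the angle between a segment's direction and the direction from its midpoint to the candidate. It does not reproduce the paper's midpoint–orientation regression or its MSAC arbiter, and the function names are illustrative.

```python
import math


def line_through(p, q):
    """Homogeneous line (a, b, c) through two 2D points: a*x + b*y + c = 0."""
    (x1, y1), (x2, y2) = p, q
    return (y1 - y2, x2 - x1, x1 * y2 - x2 * y1)


def intersect(l1, l2):
    """Intersection of two homogeneous lines, or None if they are parallel."""
    a1, b1, c1 = l1
    a2, b2, c2 = l2
    w = a1 * b2 - a2 * b1
    if abs(w) < 1e-12:
        return None
    return ((b1 * c2 - b2 * c1) / w, (c1 * a2 - c2 * a1) / w)


def angular_error(segment, vp):
    """Angle in degrees between a segment and the ray from its midpoint to vp."""
    (x1, y1), (x2, y2) = segment
    mx, my = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    d1 = (x2 - x1, y2 - y1)
    d2 = (vp[0] - mx, vp[1] - my)
    n1 = math.hypot(*d1)
    n2 = math.hypot(*d2)
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    cos = abs(d1[0] * d2[0] + d1[1] * d2[1]) / (n1 * n2)
    return math.degrees(math.acos(min(1.0, cos)))


seg_a = ((0.0, 0.0), (1.0, 1.0))
seg_b = ((0.0, 4.0), (1.0, 3.0))
candidate = intersect(line_through(*seg_a), line_through(*seg_b))
error_a = angular_error(seg_a, candidate)
parallel = intersect(line_through(*seg_a), line_through((0.0, 1.0), (1.0, 2.0)))
```

In a RANSAC loop, segments whose angular error to a candidate falls below a threshold would be counted as inliers; measuring from the midpoint rather than an endpoint is a common way to reduce sensitivity to endpoint noise.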

35 pages, 3098 KB  
Article
ImmerseFM-3D: A Foundation Model Framework for Generalizable 360-Degree Video Streaming with Cross-Modal Scene Understanding
by Reka Sandaruwan Gallena Watthage and Anil Fernando
Appl. Sci. 2026, 16(7), 3424; https://doi.org/10.3390/app16073424 - 1 Apr 2026
Viewed by 336
Abstract
Current 360-degree video streaming systems treat viewport prediction, adaptive bitrate allocation, tile selection, and quality-of-experience (QoE) estimation as independent tasks, yielding fragmented pipelines that scale poorly across content types, network conditions, and individual users. We propose ImmerseFM-3D, a foundation model that jointly solves all four sub-tasks through a single shared representation. Seven input modalities, namely video frames, network traces, head-motion trajectories, ambisonics audio, depth maps, eye-tracking signals, and CLIP scene semantics, are fused by four-layer cross-modal attention and compressed into a 256-dimensional bottleneck latent via a variational information bottleneck. Four task-specific decoders operate on this shared latent simultaneously. A model-agnostic meta-learning adapter augmented with episodic memory and a hypernetwork personalizes the model from as little as 1 s of user interaction data. An extended branch supports six-degrees-of-freedom volumetric content through spherical harmonic viewport decoding and depth-aware tile importance weighting. Trained and evaluated on the IMMERSE-1M combined dataset (1000 h of 360° and volumetric video, 524 users, and over 50,000 mean opinion scores), ImmerseFM-3D reduces the mean angular viewport error by 34%, lowers the bandwidth violation rate from 8.3% to 3.1%, and achieves a QoE Pearson correlation of 0.891. The personalization adapter reaches 90% of peak performance in 22 s, while zero-shot cross-format transfer attains 72% of full in-domain accuracy. Full article
(This article belongs to the Section Computing and Artificial Intelligence)