Search Results (168)

Search Parameters:
Keywords = multimodal semantic segmentation

23 pages, 5036 KB  
Article
Distilling Vision Foundation Models into LiDAR Networks via Manifold-Aware Topological Alignment
by Yuchuan Yang and Xiaosu Xu
Computers 2026, 15(4), 234; https://doi.org/10.3390/computers15040234 - 9 Apr 2026
Viewed by 264
Abstract
LiDAR point cloud semantic segmentation is essential for autonomous driving, yet LiDAR-only methods remain constrained by sparsity and limited texture cues. We propose Cross-Modal Collaborative Manifold Distillation (CMCMD), which transfers open-world semantic priors from the DINOv3 Vision Foundation Model to a LiDAR student network. The framework combines an Adaptive Relation Convolution (ARConv) backbone with geometry-conditioned aggregation, a Unified Bidirectional Mapping Module (UBMM) for explicit 2D–3D interaction, and Manifold-Aware Topological Distillation (MATD), which aligns inter-sample affinity structures in a shared latent manifold rather than enforcing pointwise feature matching. By preserving relational topology instead of absolute feature coordinates, CMCMD mitigates negative transfer across heterogeneous modalities. Experiments on SemanticKITTI and nuScenes yield mIoU values of 72.9% and 81.2%, respectively, surpassing the compared distillation baselines and approaching the performance of multimodal fusion methods at lower inference cost. Additional evaluation on real-world campus scenes further supports the cross-domain robustness of the proposed framework. Full article
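
The abstract's central idea, aligning how samples relate to one another rather than matching features point by point, follows the general relational knowledge-distillation pattern. The sketch below is a minimal illustration of that pattern, not the paper's MATD formulation: the cosine affinities, the row-softmax construction, the temperature `tau`, and the KL objective are assumptions chosen for clarity.

```python
import torch
import torch.nn.functional as F

def _row_affinities(feats: torch.Tensor, tau: float) -> torch.Tensor:
    """Scaled pairwise cosine similarities over a batch of embeddings, shape (N, N)."""
    z = F.normalize(feats, dim=1)
    sim = (z @ z.t()) / tau
    # Mask trivial self-similarity so each row describes a sample's neighbors only.
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return sim.masked_fill(eye, float("-inf"))

def relational_distillation_loss(student_feats: torch.Tensor,
                                 teacher_feats: torch.Tensor,
                                 tau: float = 0.1) -> torch.Tensor:
    """Match the teacher's inter-sample affinity structure (row-wise neighbor
    distributions) instead of forcing the student onto the teacher's coordinates."""
    with torch.no_grad():
        p_teacher = F.softmax(_row_affinities(teacher_feats, tau), dim=1)
    log_q_student = F.log_softmax(_row_affinities(student_feats, tau), dim=1)
    return F.kl_div(log_q_student, p_teacher, reduction="batchmean")
```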

19 pages, 4608 KB  
Article
SGH-Net: An Efficient Hierarchical Fusion Network with Spectrally Guided Attention for Multi-Modal Landslide Segmentation
by Jing Wang, Haiyang Li, Shuguang Wu, Yukui Yu, Guigen Nie and Zhaoquan Fan
Remote Sens. 2026, 18(8), 1115; https://doi.org/10.3390/rs18081115 - 9 Apr 2026
Viewed by 274
Abstract
Accurate landslide segmentation from remote sensing imagery is important for geohazard assessment and emergency response, yet it remains challenging because landslide regions are often spectrally confused with bare soil, riverbeds, shadows, and disturbed surfaces while also suffering from severe foreground–background imbalance. To address these issues, we propose an Efficient Spectrally Guided Hierarchical Fusion Network (SGH-Net) for multi-modal landslide segmentation. Instead of directly concatenating heterogeneous inputs at the image level, SGH-Net adopts an asymmetric encoder–decoder design in which a pretrained EfficientNet-B4 extracts RGB features, while two lightweight guidance encoders capture complementary multispectral band and DEM-derived terrain cues. These guidance features are progressively injected into the RGB backbone through multi-stage Guided Attention Blocks, enabling selective feature recalibration and reducing cross-modal interference. In addition, a hybrid Dice–Focal loss is used to alleviate class imbalance. Experiments on the Landslide4Sense dataset show that SGH-Net achieves the best overall performance among the compared methods under the adopted evaluation protocol, reaching 81.15% IoU and a 77.86% F1-score. Compared with representative multi-modal baselines, the proposed method delivers more accurate boundary delineation and fewer false alarms while maintaining favorable model complexity. These results indicate that modality-guided hierarchical fusion is an effective and efficient strategy for multi-modal landslide segmentation. Full article
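
The hybrid Dice–Focal objective mentioned for handling foreground–background imbalance is a standard combination; a minimal binary-segmentation sketch follows. The mixing weight `mix`, the focusing parameter `gamma`, and the sigmoid/binary setup are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def dice_focal_loss(logits: torch.Tensor, targets: torch.Tensor,
                    mix: float = 0.5, gamma: float = 2.0, eps: float = 1e-6) -> torch.Tensor:
    """Hybrid Dice + focal loss for binary landslide/background segmentation.
    logits, targets: (N, 1, H, W); targets in {0, 1}."""
    probs = torch.sigmoid(logits)
    # Dice term: overlap-based, so the dominant background area cannot swamp it.
    inter = (probs * targets).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + targets.sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * inter + eps) / (union + eps)
    # Focal term: down-weights easy pixels, concentrating gradient on hard ones.
    bce = F.binary_cross_entropy_with_logits(logits, targets.float(), reduction="none")
    p_t = probs * targets + (1.0 - probs) * (1.0 - targets)
    focal = ((1.0 - p_t) ** gamma * bce).mean(dim=(1, 2, 3))
    return (mix * dice + (1.0 - mix) * focal).mean()
```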

32 pages, 21661 KB  
Article
Robust Human-to-Robot Handover System Under Adverse Lighting
by Yifei Wang, Baoguo Xu, Huijun Li and Aiguo Song
Biomimetics 2026, 11(4), 231; https://doi.org/10.3390/biomimetics11040231 - 1 Apr 2026
Viewed by 465
Abstract
Human-to-robot (H2R) handovers are critical in human–robot interaction but are challenged by complex environments that impact robot perception. Traditional RGB-based perception methods exhibit severe performance degradation under harsh lighting (e.g., glare and darkness). Furthermore, H2R handovers occur in unstructured environments populated with fine-grained visual details, such as multi-angle hand configurations and novel object geometries, where conventional semantic segmentation and grasp generation approaches struggle to generalize. To overcome lighting disturbances, we present an H2R handover system with a dual-path perception pipeline. The system fuses perception data from a stereo RGB-D camera (eye-in-hand) and a time-of-flight (ToF) camera (fixed scene) under normal lighting, and switches to the ToF camera for reliable perception under glare and darkness. In parallel, to address the complex spatial and geometric features, we augment the Point Transformer v3 (PTv3) architecture by integrating a T-Net module and a self-attention mechanism to fuse the relative positional angle features between human and robot, enabling efficient real-time 3D semantic segmentation of both the object and the human hand. For grasp generation, we extend GraspNet with a grasp selection module optimized for H2R scenarios. We validate our approach through extensive experiments: (1) a semantic segmentation dataset with 7500 annotated point clouds covering 15 objects and 5 relative angles and tested on 750 point clouds from 15 unseen objects, where our method achieves 84.4% mIoU, outperforming Swin3D-L by 3.26 percentage points with 3.2× faster inference; (2) 250 real-world handover trials comparing our method with the baseline across 5 objects, 5 hand postures, and 5 angles, showing an improvement of 18.4 percentage points in success rate; (3) 450 trials under controlled adverse lighting (darkness and glare), where our dual-path perception method achieves 82.7% overall success, surpassing single-camera baselines by up to 39.4 percentage points; and (4) a comparative experiment against a state-of-the-art multimodal H2R handover method under identical adverse lighting, where our system achieves 75.0% success (15/20) versus the baseline’s 15.0% (3/20), further confirming the lighting robustness of our design. These results demonstrate the system’s robustness and generalization in challenging H2R handover scenarios. Full article
(This article belongs to the Special Issue Human-Inspired Grasp Control in Robotics 2025)
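
The abstract states that perception switches from the fused RGB-D/ToF path to the ToF camera under glare and darkness, but does not describe the switching criterion. The sketch below is only one plausible heuristic; the luminance and saturation thresholds are invented for illustration and are not the paper's rule.

```python
import numpy as np

def select_perception_path(rgb_frame: np.ndarray,
                           dark_thresh: float = 40.0,
                           glare_thresh: float = 235.0,
                           glare_ratio: float = 0.25) -> str:
    """Return "fused" (RGB-D + ToF) under normal lighting, "tof_only" under darkness/glare.
    rgb_frame: (H, W, 3) uint8 image from the eye-in-hand camera."""
    lum = rgb_frame.astype(np.float32).mean(axis=2)          # crude per-pixel luminance proxy
    too_dark = lum.mean() < dark_thresh                       # overall scene too dark
    too_bright = (lum > glare_thresh).mean() > glare_ratio    # large saturated (glare) area
    return "tof_only" if (too_dark or too_bright) else "fused"
```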

18 pages, 10514 KB  
Article
Hierarchical Compositional Alignment for Zero-Shot Part-Level Segmentation
by Shan Yang, Shujie Ji, Zhendong Xiao, Xiongding Liu and Wu Wei
Sensors 2026, 26(7), 2130; https://doi.org/10.3390/s26072130 - 30 Mar 2026
Viewed by 497
Abstract
In robotic fine-grained tasks (e.g., grasping and assembly), precise interaction requires a detailed understanding of object components. While Visual Language Models (VLMs) excel at object-level recognition, they struggle with part-level segmentation (e.g., knife handles), limiting performance in complex scenarios. VLMs face three key challenges: (1) visual granularity mismatch: object-level features lack part-level details; (2) semantic hierarchy gaps: parts and objects differ significantly in semantics; (3) cross-modal bias: CLIP’s text–image alignment favors global over local features. To address these, we propose a one-stage VLM-based part segmentation method. First, the Hierarchy-Aware Feature Selection mechanism analyzes Transformer features at different hierarchies to enhance spatial and semantic precision for part segmentation. Second, the Multi-Hierarchy Feature Adapter bridges the object-to-part feature granularity gap via hierarchical adaptation. Finally, the Hierarchical Multimodal Alignment Module harmonizes classification accuracy and mask integrity via hierarchical vision–language alignment, mitigating the bias of CLIP’s object-level prior knowledge. Experiments show that the proposed method improves zero-shot part segmentation performance, achieving 25.86% on Pascal-Part and 13.09% on ADE20K-Part (gains of +0.81% and +2.96% hIoU over the baseline). This work advances robotic visual perception, with applications in intelligent manufacturing and intelligent services. Full article
(This article belongs to the Section Sensors and Robotics)

30 pages, 3658 KB  
Article
TB-DLossNet: Fine-Grained Segmentation of Tea Leaf Diseases Based on Semantic-Visual Fusion
by Shuqi Zheng, Hao Zhou, Ziyang Shi, Fulin Su, Wei Shi, Ruifeng Liu, Lin Li and Fangying Wan
Plants 2026, 15(7), 1035; https://doi.org/10.3390/plants15071035 - 27 Mar 2026
Viewed by 457
Abstract
Camellia oleifera is an economically vital woody oil crop. Its productivity and oil quality are severely compromised by various diseases. Implementing pixel-level lesion segmentation within complex field environments is crucial for advancing precision plant protection. Despite recent progress, existing segmentation methods struggle with three primary challenges: semantic ambiguity arising from evolving pathological stages, blurred boundaries due to overlapping lesions, and the high omission rate of micro-lesions. To address these issues, this paper presents TB-DLossNet (Text-Conditioned Boundary-Aware Network with Dynamic Loss Reweighting), a novel segmentation framework based on semantic-visual multi-modal fusion. Leveraging VMamba as the visual backbone, the proposed model innovatively integrates BERT-encoded structured text as an auxiliary modality to resolve visual ambiguities through cross-modal semantic guidance. Furthermore, a boundary enhancement branch is incorporated alongside a multi-scale deep supervision strategy to mitigate boundary displacement and ensure the topological continuity of lesion structures. To tackle the detection of small-scale targets, we designed a dynamic weight loss function conditioned on lesion area, significantly bolstering the model’s sensitivity to minute pathological features. Additionally, to alleviate the scarcity of high-quality data, we curated a comprehensive multi-modal dataset encompassing seven typical diseases of Camellia oleifera. Experimental results demonstrate that TB-DLossNet achieves a Mean Intersection over Union (mIoU) of 87.02%, outperforming the state-of-the-art unimodal VMamba and multimodal Lvit by 4.9% and 2.59%, respectively. Qualitative evaluations confirm that our model exhibits lower false-negative rates and superior boundary-fitting precision in heterogeneous field scenarios. Finally, generalization tests on an apple disease dataset further validate the robustness and transferability of the proposed framework. Full article
(This article belongs to the Special Issue Advances in Artificial Intelligence for Plant Research—2nd Edition)
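
The dynamic, lesion-area-conditioned loss weighting is described only at a high level. The sketch below shows one way such a scheme can look, up-weighting the foreground pixels of images whose lesion area is small; the inverse-area rule, `ref_area`, and `max_weight` are illustrative assumptions rather than the paper's function.

```python
import torch
import torch.nn.functional as F

def area_weighted_ce(logits: torch.Tensor, targets: torch.Tensor,
                     ref_area: float = 256.0, max_weight: float = 8.0) -> torch.Tensor:
    """Cross-entropy whose foreground pixels are up-weighted when the lesion area is small,
    so micro-lesions are not drowned out by large, easy regions.
    logits: (N, C, H, W); targets: (N, H, W) with 0 = background."""
    ce = F.cross_entropy(logits, targets, reduction="none")   # per-pixel loss, (N, H, W)
    weights = torch.ones_like(ce)
    for i in range(targets.shape[0]):
        fg = targets[i] > 0
        area = fg.sum().clamp(min=1).float()
        # Inverse-area weight, capped so tiny lesions do not destabilize training.
        w = torch.clamp(ref_area / area, min=1.0, max=max_weight)
        weights[i][fg] = w
    return (weights * ce).mean()
```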

25 pages, 13685 KB  
Article
Vision and Language Reference for a Segment Anything Model for Few-Shot Segmentation
by Kosuke Sakurai, Ryotaro Shimizu and Masayuki Goto
J. Imaging 2026, 12(4), 143; https://doi.org/10.3390/jimaging12040143 - 24 Mar 2026
Viewed by 459
Abstract
Segment Anything Model (SAM)-based few-shot segmentation models traditionally rely solely on annotated reference images as prompts, which inherently limits their accuracy due to an over-reliance on visual cues and a lack of semantic context. This reliance leads to incorrect segmentation, where visually similar objects from different categories are incorrectly identified as the target object. We propose Vision and Language Reference Prompt into SAM (VLP-SAM), a novel few-shot segmentation model that integrates both visual information of reference images and semantic information of text labels into SAM. VLP-SAM introduces a vision-language model (VLM) with pixel–text matching into the prompt encoder for SAM, effectively leveraging textual semantic consistency while preserving SAM’s extensive segmentation knowledge. By incorporating task-specific structures such as an attention mask, our model achieves superior few-shot segmentation performance with only 1.4 M learnable parameters. Evaluations on PASCAL-5i and COCO-20i datasets demonstrate that VLP-SAM significantly outperforms previous methods by 6.8% and 9.3% in mIoU, respectively. Furthermore, VLP-SAM exhibits strong generalization across unseen objects and cross-domain scenarios, highlighting the robustness provided by textual semantic guidance. This study offers an effective and scalable framework for few-shot segmentation with multimodal prompts. Full article

30 pages, 2362 KB  
Article
SGCAD: A SAR-Guided Confidence-Gated Distillation Framework of Optical and SAR Images for Water-Enhanced Land-Cover Semantic Segmentation
by Junjie Ma, Zhiyi Wang, Yanyi Yuan and Fengming Hu
Remote Sens. 2026, 18(6), 962; https://doi.org/10.3390/rs18060962 - 23 Mar 2026
Viewed by 361
Abstract
Multimodal fusion of synthetic aperture radar (SAR) and optical imagery is widely used in Earth observation for applications such as land-cover mapping and surface-water mapping (including post-event flood mapping under near-synchronous acquisitions) and land-use inventory. Optical images provide rich spectral and texture cues, whereas SAR offers all-weather structural information that is complementary but heterogeneous. In practice, this heterogeneity often introduces fusion conflicts in multi-class segmentation, causing critical categories such as water bodies to be under-optimized. To address this issue, this paper presents a SAR-guided class-aware knowledge distillation (SGCAD) method for multimodal semantic segmentation. First, a SAR-only HRNet is trained as a water-expert teacher to learn discriminative backscattering and boundary priors for water extraction. Second, a lightweight multimodal student model (LightMCANet) is optimized using a class-aware distillation strategy that transfers teacher knowledge only within high-confidence water regions, thereby suppressing noisy supervision and reducing interference to other classes. Third, a SAR edge guidance module (SEGM) is introduced in the decoder to enhance boundary continuity for slender structures such as water bodies and roads. Overall, SGCAD improves targeted category learning while maintaining stable performance across the remaining classes. Experiments on a self-built dataset from GF-1 optical and LuTan-1 SAR imagery demonstrate higher overall accuracy and more coherent water/road predictions than representative baselines. Future work will extend the proposed distillation scheme to additional categories and broader geographic scenes. Full article
(This article belongs to the Section Remote Sensing Image Processing)
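
The class-aware, high-confidence gating of the distillation signal can be written compactly as a masked KL term; the sketch below shows that gating idea. The water class index, confidence threshold, and temperature are illustrative, and the exact SGCAD loss may differ from this form.

```python
import torch
import torch.nn.functional as F

def gated_water_distillation(student_logits: torch.Tensor,
                             teacher_logits: torch.Tensor,
                             water_idx: int = 1,
                             conf_thresh: float = 0.9,
                             temperature: float = 2.0) -> torch.Tensor:
    """Distill the SAR teacher into the multimodal student only at pixels where the
    teacher is highly confident about water; other pixels get no distillation signal.
    Both logits: (N, C, H, W)."""
    with torch.no_grad():
        p_teacher = F.softmax(teacher_logits / temperature, dim=1)
        gate = p_teacher[:, water_idx] > conf_thresh              # (N, H, W) confidence mask
    if gate.sum() == 0:
        return student_logits.new_zeros(())                       # nothing confident to transfer
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=1)  # per-pixel KL
    return (kl * gate).sum() / gate.sum() * temperature ** 2
```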

21 pages, 1669 KB  
Article
Robust BEV Perception via Dual 4D Radar–Camera Fusion Under Adverse Conditions with Fog-Aware Enhancement
by Zhengqing Li and Baljit Singh
Electronics 2026, 15(6), 1284; https://doi.org/10.3390/electronics15061284 - 19 Mar 2026
Viewed by 439
Abstract
Bird’s-eye-view (BEV) perception has emerged as a key representation for unified scene understanding in autonomous driving. However, current BEV methods relying solely on monocular cameras suffer from severe degradation under adverse weather and dynamic scenes due to limited depth cues and illumination dependency. To address these challenges, we propose a robust multi-modal BEV perception framework that integrates dual-source 4D millimeter-wave radar and multi-view camera images. The proposed architecture systematically exploits Doppler velocity and temporal information from 4D radar to model dynamic object motion, while introducing a deformable fusion strategy in the BEV space for accurate semantic alignment across modalities. Our design includes four key modules: a Doppler-Aware Radar Encoder (DARE) that enhances motion-sensitive features via velocity-guided attention; a Fog-Aware Feature Denoising Module (FADM) that suppresses modality inconsistency in low-visibility conditions through cross-modal attention and residual enhancement; a Multi-Modal Temporal Fusion Module (TFM) that encodes radar temporal sequences using a Transformer encoder for motion continuity modeling; and a confidence-aware multi-task loss that jointly supervises semantic segmentation, motion estimation, and object detection. Extensive experiments on the DualRadar dataset and adverse-weather simulations demonstrate that our method achieves significant gains over state-of-the-art baselines in BEV segmentation accuracy, detection robustness, and motion stability. The proposed framework offers a scalable and resilient solution for real-world autonomous perception, especially under challenging environmental conditions. Full article
(This article belongs to the Special Issue Image Processing Based on Convolution Neural Network: 2nd Edition)
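
The abstract mentions a confidence-aware multi-task loss over segmentation, motion estimation, and detection without giving its form. One common way to realize such weighting is the learnable homoscedastic-uncertainty scheme of Kendall et al. (2018), sketched below purely as an illustration; it is not claimed to be the paper's loss.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Learnable per-task log-variances weight each task loss; tasks the model is
    uncertain about are automatically down-weighted during training."""
    def __init__(self, num_tasks: int = 3):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        # task_losses: e.g. [seg_loss, motion_loss, det_loss], each a scalar tensor
        total = torch.zeros((), device=self.log_vars.device)
        for i, loss in enumerate(task_losses):
            precision = torch.exp(-self.log_vars[i])
            total = total + precision * loss + self.log_vars[i]
        return total
```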

22 pages, 4742 KB  
Article
PromptSeg: An End-to-End Universal Medical Image Segmentation Method via Visual Prompts
by Minfan Zhao, Bingxun Wang, Jun Shi and Hong An
Entropy 2026, 28(3), 342; https://doi.org/10.3390/e28030342 - 18 Mar 2026
Viewed by 351
Abstract
Deep learning has achieved remarkable advancements in medical image segmentation, yet its generalization capability across unseen tasks remains a significant challenge. The variety of task objectives, disease-dependent labeling variations, and multi-center data contribute to the high uncertainty of task-specific models on unseen distributions. In this study, we propose PromptSeg, an innovative Transformer-based unified framework for universal 2D medical image segmentation. From an information-theoretic perspective, PromptSeg formulates the segmentation process as a conditional entropy minimization problem, utilizing visual prompts as side information to reduce the uncertainty of the target task. Guided by the information bottleneck principle, PromptSeg aims to utilize the provided visual prompts to filter out redundant noise and learn contextual representations, thereby breaking the restrictions of the task-specific paradigm. When faced with unseen datasets or segmentation targets, our method only requires a few annotated visual prompt pairs to extract task-specific semantics and segment the query images without retraining. Extensive experiments on CT and MRI datasets demonstrate that PromptSeg not only outperforms state-of-the-art methods but also exhibits strong multi-modality generalization capabilities. Full article
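
The abstract frames prompted segmentation as conditional-entropy minimization with an information-bottleneck flavor. The lines below are a minimal formalization of that framing with illustrative notation (X = query image, Y = target mask, S = annotated visual prompt pairs, Z = learned representation), not the paper's exact objective.

```latex
% Prompts as side information: conditioning on S cannot increase uncertainty about Y.
H(Y \mid X, S) \le H(Y \mid X)

% Information-bottleneck view: keep the bits of the prompted input that predict Y,
% discard the rest (beta trades compression against task relevance).
\min_{\theta} \; I(Z; X, S) - \beta \, I(Z; Y)
```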

26 pages, 12325 KB  
Article
Pairwise Comparison-Based Salient Object Ranking Using Multimodal Large Models
by Yifan Liu, Jia Song and Chenglizhao Chen
Sensors 2026, 26(6), 1913; https://doi.org/10.3390/s26061913 - 18 Mar 2026
Viewed by 285
Abstract
Salient object ranking aims to assign a relative importance order to multiple objects in an image, aligning with human visual attention. However, existing methods struggle with ranking ambiguity in complex scenes, particularly when objects are numerous, occluded, or semantically similar, leading to decreased accuracy for low-saliency objects. To address this, we propose PairwiseSOR-MLMs, a novel framework leveraging multimodal large models and pairwise comparison to achieve salient object ranking. The approach decomposes global ranking into a series of pairwise comparison tasks. It first employs object detection and instance segmentation to identify objects, uses image inpainting to reconstruct scenes by removing occlusions, and then prompts MLMs to perform pairwise comparisons based on visual saliency cues. Finally, another MLM inference aggregates these comparisons into a consistent global ranking. Experiments on ASSR and IRSR benchmarks show our method achieves state-of-the-art or competitive performance across metrics, demonstrating robustness in handling occlusion and semantic similarity. Its pairwise comparison paradigm can extend to other relative assessment tasks. Full article
(This article belongs to the Section Sensors and Robotics)
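
Decomposing a global saliency ranking into pairwise comparisons and then aggregating them is the core mechanism described. In the paper the aggregation itself is done by a further MLM inference; the deterministic win-count (Copeland-style) aggregation below is only meant to make the pairwise-to-global step concrete, and the comparator shown is a stand-in placeholder.

```python
from itertools import combinations
from typing import Callable, Dict, List

def rank_by_pairwise(objects: List[str],
                     more_salient: Callable[[str, str], str]) -> List[str]:
    """Aggregate pairwise saliency judgments into a global order by counting wins.
    `more_salient(a, b)` returns whichever of a, b is judged more salient
    (in the paper this judgment comes from a multimodal large model prompt)."""
    wins: Dict[str, int] = {obj: 0 for obj in objects}
    for a, b in combinations(objects, 2):
        wins[more_salient(a, b)] += 1
    return sorted(objects, key=lambda obj: wins[obj], reverse=True)

# Stand-in comparator for demonstration only; a real system would query the MLM here.
print(rank_by_pairwise(["person", "bicycle", "lamp"],
                       more_salient=lambda a, b: min(a, b)))
```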

19 pages, 9029 KB  
Article
AIM-SEEM: Adapting SEEM for Open-Vocabulary Terrain Segmentation Across Arbitrary Imaging Modalities
by Yuqian Wang, Xuefu Xiang, Yongcun Wu, Yong Zhang and Xinyue Li
Sensors 2026, 26(6), 1869; https://doi.org/10.3390/s26061869 - 16 Mar 2026
Viewed by 375
Abstract
Terrain segmentation performance directly affects the reliability of robotic environmental perception and decision making, yet most existing methods are built upon the assumptions of fixed sensing configurations and closed label sets. As a result, they struggle to meet real-world outdoor requirements, where the available modalities may change dynamically and semantic classes continually expand. This paper systematically studies open-vocabulary terrain segmentation under arbitrary imaging modality combinations and proposes a unified foundation-model-based framework named AIM-SEEM (SEEM for Arbitrary Imaging Modalities). Built upon Segment Everything Everywhere All at Once (SEEM), AIM-SEEM performs stable input-side adaptation and controlled fusion of heterogeneous modalities, maximizing the reuse of pre-trained visual priors to accommodate different modality types and counts. Furthermore, to address the distribution shifts and the resulting vision–text alignment degradation caused by modality extension, a vision-guided text calibration mechanism is introduced to preserve open-vocabulary segmentation capability under combined multi-modality inputs. Experiments on two benchmarks under three evaluation settings (full-modality, modality-agnostic, and open-vocabulary) show that AIM-SEEM consistently outperforms prior methods. Full article
(This article belongs to the Section Sensors and Robotics)

16 pages, 6152 KB  
Article
DisasterReliefGPT: Multimodal AI for Autonomous Disaster Impact Assessment and Crisis Communication
by Lekshmi Chandrika Reghunath, Athikkal Sudhir Abhishek, Arjun Changat, Arjun Unnikrishnan, Ayush Kumar Rai, Christian Napoli and Cristian Randieri
Technologies 2026, 14(3), 179; https://doi.org/10.3390/technologies14030179 - 16 Mar 2026
Viewed by 376
Abstract
This work proposes DisasterReliefGPT, a multimodal AI system for automating crisis communication and post-disaster assessment. The system integrates three tightly coupled components: a vision module called DisasterOCS for structural damage detection in satellite images, a Large Vision–Language Model (LVLM) for enhanced visual understanding and contextual reasoning, and a Large Language Model (LLM) that produces detailed, clear assessment reports. DisasterOCS relies on a ResNet34-based encoder with partial weight sharing and event-specific decoders, coupled with a custom MultiCrossEntropyDiceLoss function for multi-class segmentation on pre- and post-disaster image pairs. On the benchmark xBD dataset, the system reaches a damage-classification F1-score of 78.8%, identifying destroyed buildings with 81.3% precision and undamaged structures with 90.7%. Together, these components let emergency responders obtain reliable, readable damage assessments that directly support urgent decision-making. Full article

23 pages, 12466 KB  
Article
Real-Time LiDAR 3D Semantic Segmentation via Multi-View and Cross-Modal Compact Featuring Two-Branch Knowledge Distillation
by Yun Zhang, Kun Qian, Zihan Zhang, Min’ao Zhang and Hai Yu
Sensors 2026, 26(6), 1860; https://doi.org/10.3390/s26061860 - 15 Mar 2026
Viewed by 536
Abstract
Simultaneous online mapping and semantic segmentation using handheld scanners supports various environmental inspection and measurement tasks. For such scanners, combining visual and LiDAR data is beneficial for improving segmentation performance, but the direct fusion of multi-modal and multi-view features faces challenges in terms of both real-time performance and robustness. To address these challenges, this paper proposes a multi-view and cross-modal knowledge distillation method that supports runtime LiDAR-only semantic segmentation. The proposed method hierarchically compacts multi-view and cross-modal priors and distills them into two branches to improve segmentation accuracy. In addition, we design an improved data augmentation technique based on PolarMix for rendering more realistic point cloud scenes. Experimental results on the SemanticKITTI and nuScenes datasets demonstrate that our approach outperforms state-of-the-art knowledge-distillation-based methods in mIoU. In addition, mapping experiments using a handheld scanner demonstrate the proposed method’s superior real-time performance and accuracy. Full article
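
PolarMix, which the augmentation builds on, mixes two LiDAR scans in polar coordinates; its scene-level step swaps the points falling in an azimuthal sector between the scans. The sketch below implements only that standard scene-level swap to illustrate the idea; the paper's improved variant is not described in the abstract, and the sector parameters here are arbitrary.

```python
import numpy as np

def polar_scene_swap(pts_a: np.ndarray, labels_a: np.ndarray,
                     pts_b: np.ndarray, labels_b: np.ndarray,
                     start_deg: float = 0.0, span_deg: float = 180.0):
    """Scene-level PolarMix-style mixing: points of scan A inside an azimuthal sector are
    replaced by scan B's points from the same sector, keeping labels consistent.
    pts_*: (N, 3+) arrays with x, y, z in the sensor frame; labels_*: (N,) class ids."""
    def in_sector(pts):
        az = np.degrees(np.arctan2(pts[:, 1], pts[:, 0])) % 360.0
        end = (start_deg + span_deg) % 360.0
        if start_deg < end:
            return (az >= start_deg) & (az < end)
        return (az >= start_deg) | (az < end)   # sector wraps past 360 degrees

    keep_a, take_b = ~in_sector(pts_a), in_sector(pts_b)
    mixed_pts = np.concatenate([pts_a[keep_a], pts_b[take_b]], axis=0)
    mixed_labels = np.concatenate([labels_a[keep_a], labels_b[take_b]], axis=0)
    return mixed_pts, mixed_labels
```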

21 pages, 4406 KB  
Article
An Abnormal File Access Detection Model for Containers Based on eBPF Listening
by Naqin Zhou, Hao Chen, Zeyu Chen, Chao Li and Fan Li
Mathematics 2026, 14(6), 991; https://doi.org/10.3390/math14060991 - 14 Mar 2026
Viewed by 503
Abstract
With the widespread adoption of container technology, its shared kernel architecture has made abnormal file access behavior a key precursor to container escape and lateral attacks, necessitating precise and efficient runtime detection mechanisms. However, existing monitoring methods typically suffer from issues such as insufficient granularity in data collection, limited path semantic modeling capabilities, and low anomaly detection accuracy. To address these challenges, this paper proposes an eBPF-based method for detecting abnormal file access in containers. A lightweight kernel-level monitoring mechanism is constructed to capture access behavior in real time at the system call level, effectively enhancing both the granularity of data collection and the completeness of context. At the feature modeling layer, a multimodal path semantic representation method is designed, combining risk-layer rules and semantic vectorization strategies to enhance the hierarchical expression of path structures and improve context modeling ability. In the detection layer, an attention-enhanced autoencoder model is introduced, achieving high-precision identification of abnormal access behavior and low false-positive monitoring under unsupervised conditions through a path segment attention mechanism and weighted reconstruction loss function. Experiments in real container environments show that the proposed method achieves a recall rate of 82.0%, a false-positive rate of 0.79%, and a Matthews correlation coefficient of 0.852, significantly outperforming mainstream unsupervised detection methods such as Isolation Forest, One-Class SVM, and Local Outlier Factor. These results verify the advantages of the proposed method in terms of detection accuracy, real-time performance, and system friendliness, providing an efficient and feasible solution for enhancing the detection of unknown attacks in container runtimes. Full article
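
The detection layer scores access paths with an attention-enhanced autoencoder trained under a weighted reconstruction loss. The sketch below shows the weighted-reconstruction part in isolation; the per-segment risk weights and the threshold-based flagging are illustrative assumptions, and the attention mechanism itself is omitted.

```python
import torch

def weighted_reconstruction_loss(recon: torch.Tensor, target: torch.Tensor,
                                 seg_weights: torch.Tensor) -> torch.Tensor:
    """Training loss: per-segment squared error, scaled by a risk weight per path segment.
    recon, target: (N, L, D) reconstructed vs. original segment embeddings;
    seg_weights:   (N, L), e.g. larger for segments under sensitive directories."""
    seg_err = ((recon - target) ** 2).mean(dim=-1)        # (N, L)
    return (seg_weights * seg_err).sum(dim=1).mean()

def anomaly_scores(recon: torch.Tensor, target: torch.Tensor,
                   seg_weights: torch.Tensor) -> torch.Tensor:
    """Detection-time score per access path; a path is flagged when its score exceeds
    a threshold chosen on normal (attack-free) traffic."""
    seg_err = ((recon - target) ** 2).mean(dim=-1)
    return (seg_weights * seg_err).sum(dim=1)             # (N,)
```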

43 pages, 2166 KB  
Article
Research on Root Cause Analysis Method for Certain Civil Aircraft Based on Ensemble Learning and Large Language Model Reasoning
by Wenyou Du, Jingtao Du, Haoran Zhang and Dongsheng Yang
Machines 2026, 14(3), 322; https://doi.org/10.3390/machines14030322 - 12 Mar 2026
Viewed by 501
Abstract
To address the challenges commonly encountered in civil aircraft operating under multi-mode, strongly coupled closed-loop control (scarce fault samples, pronounced distribution shift, and root-cause explanations that are easily confounded by covariates), this paper proposes a root-cause analysis method that integrates ensemble learning with constraint-guided reasoning by large language models (LLMs). First, for Full Authority Digital Engine Control (FADEC) monitoring sequences, a feature system comprising environment-normalized ratios, mechanism-informed mixing indices, and multi-scale temporal statistics is constructed, thereby improving cross-mode comparability and enhancing engineering-semantic expressiveness. Second, in the anomaly detection stage, a cost-sensitive LightGBM model is adopted and a validation-set-based adaptive thresholding strategy is introduced to achieve robust identification under highly imbalanced fault conditions. Furthermore, for Root Cause Analysis (RCA), a “computation–reasoning decoupling” framework is developed: SHapley Additive exPlanations (SHAP) are used to generate segment-level contribution evidence, while causal chains, engineering prohibitions, and structured output templates are injected into prompts to constrain the LLM, enabling it to infer root-cause candidates and produce structured explanations under mechanism-consistency constraints. Experiments on real flight data demonstrate that our method yields an anomaly detection F1-score of 0.9577 and improves overall RCA accuracy to 97.1% (versus 62.3% for a pure SHAP baseline). Practically, by translating complex high-dimensional data into actionable natural language diagnostic reports, the proposed method provides reliable and interpretable decision support for rapid RCA. Full article
(This article belongs to the Section Automation and Control Systems)
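
The anomaly-detection stage pairs a cost-sensitive LightGBM with a decision threshold selected on a validation set; a minimal sketch of that pairing follows. The class weights, hyperparameters, and F1-maximizing threshold grid are illustrative assumptions, not the paper's settings, and the feature matrices are expected to be precomputed from the FADEC-derived features.

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.metrics import f1_score

def train_cost_sensitive_detector(X_train, y_train, X_val, y_val,
                                  fault_weight: float = 20.0):
    """Cost-sensitive LightGBM plus a validation-chosen decision threshold.
    `fault_weight` up-weights the rare fault class (illustrative value)."""
    clf = LGBMClassifier(n_estimators=400, learning_rate=0.05,
                         class_weight={0: 1.0, 1: fault_weight})
    clf.fit(X_train, y_train)
    # Adaptive thresholding: sweep probability cut-offs on the validation set,
    # keep the one maximizing F1 on the imbalanced fault class.
    val_prob = clf.predict_proba(X_val)[:, 1]
    grid = np.linspace(0.05, 0.95, 19)
    best_t = max(grid, key=lambda t: f1_score(y_val, val_prob >= t))
    return clf, best_t

# At inference: clf.predict_proba(X)[:, 1] >= best_t flags an anomalous segment.
```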
