Search Results (659)

Search Parameters:
Keywords = multi-scale supervision

28 pages, 8358 KB  
Article
Deep Climate Model Distillation for Localized Flood Forecasting in Low-Resource Areas
by Julius Olaniyan, Deborah Olaniyan, Ibidun C. Obagbuwa and Madison N. Ngafeeson
Meteorology 2026, 5(2), 16; https://doi.org/10.3390/meteorology5020016 (registering DOI) - 19 Jun 2026
Abstract
Floods remain among the most devastating natural disasters globally, disproportionately impacting low-resource regions where real-time flood forecasting is constrained by limited computational infrastructure and the scarcity of fine-resolution predictive models. Although state-of-the-art global climate models achieve high predictive accuracy, their scale and computational complexity restrict their applicability in localized and resource-constrained settings. This study proposes a deep climate model distillation framework that transfers knowledge from a high-capacity Fourier Neural Operator (FNO)-based global climate model inspired by FourCastNet into lightweight, regionally adaptive student networks suitable for edge deployment. The framework combines climate variables, satellite observations, and hydrological measurements to improve localized flood prediction. Knowledge transfer is achieved through a multi-objective distillation strategy that combines supervised learning, soft-target alignment, and intermediate feature matching. Experimental evaluation across multiple flood-prone regions in Sub-Saharan Africa and South Asia shows that the distilled student model achieves an average classification accuracy of 0.89, an AUC of 0.91, and an F1-score of 0.88, retaining approximately 96.7% of the teacher model’s predictive performance. In continuous discharge estimation, the model attains a mean absolute error of 0.17, RMSE of 0.24, and an R² score of 0.85. The proposed distillation approach yields an 8× reduction in inference latency and over a 20× reduction in model size, enabling real-time execution on low-power edge devices such as the Raspberry Pi 4 and NVIDIA Jetson Nano. The student model further demonstrates robust regional and temporal generalization, with limited performance degradation in unseen geographic areas and during extreme flood years.
(This article belongs to the Special Issue Early Career Scientists’ (ECS) Contributions to Meteorology (2026))
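The multi-objective distillation strategy named in the abstract above (supervised learning, soft-target alignment, intermediate feature matching) could look roughly like the following sketch. The abstract does not give the loss formulation, so the temperature-scaled KL term, the mean-squared feature term, and the weights `alpha`, `beta`, `gamma` are all assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      student_feat, teacher_feat,
                      alpha=0.5, beta=0.3, gamma=0.2, temperature=2.0):
    """Weighted sum of three terms (weights are hypothetical):
    supervised cross-entropy, soft-target KL, and feature-matching MSE."""
    # Supervised term: cross-entropy against the ground-truth class.
    ce = -math.log(softmax(student_logits)[label])

    # Soft-target term: KL(teacher || student) on temperature-softened
    # distributions, scaled by T^2 as is common in distillation.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s)
             for t, s in zip(p_teacher, p_student) if t > 0)
    kl *= temperature ** 2

    # Feature-matching term: mean squared error between intermediate
    # features (assumed here to already share the same dimension).
    mse = sum((a - b) ** 2
              for a, b in zip(student_feat, teacher_feat)) / len(student_feat)

    return alpha * ce + beta * kl + gamma * mse
```

When the student reproduces the teacher exactly, the second and third terms vanish and only the supervised term remains, which is a quick sanity check on any implementation of this kind.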
25 pages, 1244 KB  
Article
Semi-SwinUNeTR: Towards 3D Swin Vision Transformer-Based UNet for Medical Image Segmentation with Limited Annotations
by Yinbing Tian, Ziyang Wang and Li Guo
Bioengineering 2026, 13(6), 695; https://doi.org/10.3390/bioengineering13060695 - 17 Jun 2026
Viewed by 7
Abstract
Accurate brain tumor segmentation from magnetic resonance imaging (MRI) is essential for computer-assisted diagnosis, treatment planning, and disease monitoring. However, brain tumors usually exhibit irregular, heterogeneous, and multi-scale spatial patterns with complex and ambiguous boundaries. At the same time, the performance of deep segmentation models is often constrained by the limited availability of voxel-level annotations, which are expensive and time-consuming to obtain. To address these challenges, this paper proposes Semi-SwinUNeTR, a semi-supervised framework for 3D brain tumor segmentation with limited annotated data. The proposed method adopts SwinUNeTR as the segmentation backbone, enabling hierarchical volumetric representation learning through shifted-window self-attention while preserving the encoder–decoder structure required for dense prediction. On top of this backbone, we introduce a dual-consistency semi-supervised learning strategy, consisting of mean teacher-based model consistency and interpolation consistency-based data consistency. In addition, voxel-wise consistency weights are used to redistribute semi-supervised supervision toward structurally complex and boundary-irregular tumor regions without changing the SwinUNeTR backbone. Experiments on the BraTS 2019 benchmark demonstrate that the proposed framework achieves strong performance across different annotation ratios. The original Semi-SwinUNeTR achieves Dice scores of 84.93%, 86.25%, 87.05%, and 87.83% under the 10%, 20%, 40%, and 80% labeled-data settings, respectively. With the weighted consistency extension, the Dice scores are further improved to 85.64%, 87.94%, and 88.59% under the 10%, 20%, and 80% labeled-data settings, respectively, while the corresponding HD95 values are reduced to 8.9826, 8.1854, and 7.4533. 
These results indicate that combining a SwinUNeTR backbone with complementary model consistency, data consistency, and voxel-wise consistency weighting is an effective strategy for semi-supervised volumetric medical image segmentation under limited annotation.
(This article belongs to the Special Issue AI and Robotics for Multimodal Psychophysiological Health Monitoring)
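Two ingredients mentioned in the Semi-SwinUNeTR abstract, the mean-teacher update and voxel-wise weighted consistency, can be illustrated with a small sketch. The decay value, the flat-list representation of parameters and voxels, and the squared-error form of the consistency term are assumptions; the abstract does not specify them.

```python
def ema_update(teacher_params, student_params, decay=0.99):
    """Mean-teacher step: the teacher is an exponential moving average
    of the student parameters (decay value is a hypothetical choice)."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]


def weighted_consistency(student_probs, teacher_probs, voxel_weights):
    """Weighted mean squared difference between student and teacher
    predictions. Larger weights shift the unsupervised signal toward
    the corresponding voxels, for example near irregular boundaries."""
    weighted_error = sum(w * (s - t) ** 2
                         for s, t, w in zip(student_probs,
                                            teacher_probs,
                                            voxel_weights))
    total_weight = sum(voxel_weights)
    return weighted_error / total_weight if total_weight > 0 else 0.0
```

With uniform weights the second function reduces to an ordinary mean squared consistency loss, so the weighting can be added without changing the backbone.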
18 pages, 1868 KB  
Article
Self-Supervised Spectral Representation Learning for LAMOST
by Wenjun Zhang, Anhua Zhou, Lei Yuan, Yuchen Liang, Yihan Song and Zhenping Yi
Universe 2026, 12(6), 181; https://doi.org/10.3390/universe12060181 - 17 Jun 2026
Viewed by 36
Abstract
The Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST) has collected tens of millions of spectra, providing an unprecedented resource for large-scale spectroscopic studies. Efficient retrieval techniques are therefore essential for exploring such massive datasets. Existing approaches often rely on predefined templates or manually labeled training samples, which can limit their applicability in large and diverse spectral archives. In this work, we present a general similarity-retrieval framework that combines self-supervised contrastive learning based on a convolutional neural network with Facebook AI Similarity Search (FAISS) for efficient large-scale spectral retrieval. The framework learns spectral representations directly from unlabeled data and enables flexible retrieval from user-defined wavelength regions based on feature similarity. We evaluate the framework on several stellar populations in LAMOST DR8. For late-type M8-star retrieval, 90.5% of the top 1000 retrieved spectra are later than M6. For M0–M5 giants, the mean retrieval accuracy across six subtypes reaches 94.8%. Using a C-H star spectrum as the query spectrum, 90.8% of the top 1000 retrieved candidates are classified as carbon stars by the LAMOST pipeline. Cross-matching with SIMBAD further confirms 255 C-H stars and 47 C-R stars among the retrieved candidates. These results demonstrate that the proposed framework can efficiently identify spectrally similar objects across large spectroscopic databases and can serve as a useful tool for searching for rare or spectrally distinctive stellar populations.
(This article belongs to the Special Issue New Discoveries in Astronomical Data (II))
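The retrieval step in the LAMOST abstract ranks spectra by feature similarity to a query. The sketch below shows that idea with a brute-force cosine search in plain Python. It is a stand-in for illustration only: it does not use the FAISS library the paper names, and the embedding vectors are assumed to come from some already-trained encoder.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def top_k_similar(query, embeddings, k):
    """Return the indices of the k embeddings with the highest cosine
    similarity to the query, best match first. Ties keep index order."""
    q = l2_normalize(query)
    scored = []
    for index, embedding in enumerate(embeddings):
        e = l2_normalize(embedding)
        similarity = sum(a * b for a, b in zip(q, e))
        scored.append((similarity, index))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:k]]
```

A brute-force scan like this costs one dot product per archived spectrum, which is why an approximate or optimized index becomes attractive at the scale of tens of millions of spectra.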
30 pages, 13578 KB  
Article
A Semi-Supervised Topographic Inversion Algorithm for Small-Scale Tidal Flats Based on Multi-Source Data Fusion Under Spatially Clustered ICESat-2 Label Distributions
by Hao Chen, Xiaowen Luo, Feng Gui, Jiaxin Cui, Jiayang Chen and Qi Li
Remote Sens. 2026, 18(12), 2017; https://doi.org/10.3390/rs18122017 - 17 Jun 2026
Viewed by 61
Abstract
High-precision topography of tidal flats is essential for coastal monitoring, geomorphic change analysis, and ecological assessment. Although satellite remote sensing supports repeated and large-area observation, topographic inversion over small-scale tidal flats (here defined as localized intertidal patches with limited areal extent, represented in this study by a 1.11 km² tidal flat near Dafeng Port) remains challenging, because ICESat-2 laser altimetry tracks across such areas are typically sparse and spatially clustered within narrow sub-regions, leaving extensive observation-blind zones without direct elevation labels. This label-clustering problem constrains the applicability of traditional empirical models and tends to cause deep learning models to generalize poorly beyond the spatial distribution of training samples. To address this issue, this study proposes a Residual Attention Physical-constraint Semi-supervised U-Net (RAPS-UNet) that fuses ICESat-2 ATL03/ATL08 elevation labels with Sentinel-1 SAR and Sentinel-2 optical features. The preprocessing pipeline comprises refined ICESat-2 photon filtering, adaptive inundation-frequency extraction, multi-source feature selection, and baseline DEM construction. RAPS-UNet integrates residual learning, attention-based multi-source fusion, physics-constrained loss, and confidence-weighted pseudo-label augmentation to improve extrapolation under clustered-label conditions. A four-level validation protocol (in-distribution validation, spatial holdout testing, and field-based assessment over both interpolation and extrapolation zones) was designed to evaluate spatial generalization. Against a field-surveyed DEM, RAPS-UNet achieved an overall RMSE of 0.20 m, an MAE of 0.16 m, and an R² of 0.91; the field-based interpolation and extrapolation zones yielded RMSEs of 0.17 m and 0.22 m, respectively, while the spatial holdout test reached an RMSE of 0.23 m and an R² of 0.81.
Relative to the traditional inundation frequency–elevation linear model (RMSE = 0.35 m), RAPS-UNet reduced the field-validation RMSE by approximately 43%. The proposed framework therefore offers a practical approach for fine-scale coastal-zone topographic mapping under sparse and spatially clustered altimetry conditions.
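The "approximately 43%" figure in the RAPS-UNet abstract can be checked from the two RMSE values it reports (0.35 m for the linear baseline, 0.20 m overall for RAPS-UNet). The helper name below is hypothetical; the arithmetic is just the relative reduction.

```python
def relative_reduction(baseline, improved):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - improved) / baseline


# Values taken from the abstract: baseline RMSE 0.35 m, RAPS-UNet RMSE 0.20 m.
rmse_reduction = relative_reduction(0.35, 0.20)
rmse_reduction_percent = round(rmse_reduction * 100)
```

The unrounded ratio is 0.15 / 0.35, a little under 0.43, which is consistent with the rounded value quoted in the abstract.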
16 pages, 3093 KB  
Article
LapDINO: A DINOv3 and Laplacian Pyramid-Based Approach for Outdoor Terrain Segmentation
by Shiquan Ling, Xingchen Qin, Wenkang Xu, Mingmin Fu, Hao Huang, Shijie Ma and Zhenyu Liu
Sensors 2026, 26(12), 3843; https://doi.org/10.3390/s26123843 - 17 Jun 2026
Viewed by 98
Abstract
As autonomous driving technology expands from structured urban roads to unstructured outdoor environments, precise understanding of complex terrain has become a critical requirement for ensuring safe vehicle navigation. However, outdoor environments are characterized by high dynamics, drastic illumination variations, ambiguous category boundaries, and prohibitive annotation costs, making traditional supervised learning methods that rely on large amounts of pixel-level annotations difficult to generalize. In this paper, we propose a novel dual-path bidirectional interactive encoder, termed LapDINO, that effectively combines the strong semantic generalization capability of the self-supervised foundation model DINOv3 with the multi-scale frequency analysis capacity of the Laplacian pyramid. Specifically, we leverage DINOv3 to extract global semantic features as a “semantic map”, while simultaneously obtaining multi-scale high-frequency details through Laplacian pyramid decomposition as “structural contours”. Building upon this, we design a bidirectional cross-attention fusion mechanism that enables dynamic interaction and mutual refinement between semantic information and geometric details. Furthermore, we introduce a multi-branch attention enhancement module that extracts pyramid features from three complementary perspectives. To address domain shift, we design lightweight visual adapters that enable efficient fine-tuning of the frozen DINOv3 backbone. Finally, we construct two off-road terrain segmentation datasets, VOTD and VOCD, to facilitate research in this domain. Experimental results demonstrate that the proposed method achieves state-of-the-art performance, striking an optimal balance between accuracy and computational efficiency, thereby providing a robust and efficient engineering solution for terrain perception in off-road environments.
(This article belongs to the Section Vehicular Sensing)
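The Laplacian pyramid decomposition that LapDINO uses for multi-scale high-frequency detail can be illustrated on a one-dimensional signal. This sketch is a deliberate simplification and an assumption on several counts: it works in 1D rather than on images, and it uses pair averaging and sample repetition in place of the Gaussian filtering of a standard image pyramid.

```python
def downsample(signal):
    """Halve the resolution by averaging adjacent pairs."""
    return [(signal[i] + signal[i + 1]) / 2.0
            for i in range(0, len(signal) - 1, 2)]


def upsample(signal, length):
    """Double the resolution by repeating samples, padded to `length`."""
    out = []
    for value in signal:
        out.extend([value, value])
    while len(out) < length:
        out.append(out[-1])
    return out[:length]


def laplacian_pyramid(signal, levels):
    """Split a signal into per-level detail bands plus a coarse residual."""
    details = []
    current = list(signal)
    for _ in range(levels):
        if len(current) < 2:
            break  # nothing left to downsample
        coarse = downsample(current)
        approx = upsample(coarse, len(current))
        # The detail band is what the coarse level cannot represent.
        details.append([c - a for c, a in zip(current, approx)])
        current = coarse
    return details, current


def reconstruct(details, residual):
    """Invert the decomposition by adding detail bands back, coarse to fine."""
    current = list(residual)
    for detail in reversed(details):
        approx = upsample(current, len(detail))
        current = [a + d for a, d in zip(approx, detail)]
    return current
```

Because each detail band stores exactly the difference between a level and its upsampled coarse version, adding the bands back recovers the input, which is the property that makes the pyramid a lossless multi-scale representation.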
19 pages, 1102 KB  
Article
SR-VLN: Implicit Spatial Reasoning Vision-and-Language Navigation
by Ruolin Zhu, Shaobin Li and Min Yang
Sensors 2026, 26(12), 3809; https://doi.org/10.3390/s26123809 - 15 Jun 2026
Viewed by 191
Abstract
Vision-and-language navigation (VLN) traditionally relies on explicit reasoning chains, which, despite being interpretable, impose severe constraints on inference efficiency and scalability in long-range environments. Existing multimodal large language models (MLLMs) frequently encounter latency bottlenecks due to the generation of verbose textual narratives during decision-making. To address these limitations, we propose spatial reasoning vision-and-language navigation (SR-VLN), a novel framework that shifts the paradigm from explicit chain-of-thought (CoT) to an implicit spatial representation space. SR-VLN introduces a pyramidal hierarchical history framework integrated with perceptual compression to condense historical trajectories into multi-scale representations, effectively minimizing token overhead while preserving critical spatial semantics. Rather than generating verbose textual reasoning steps, SR-VLN employs compact, learnable spatial tokens (S-Tokens) to perform agile inference directly within the latent feature space. To establish robust causal mappings between these implicit states and navigational actions, we employ a hybrid training strategy that combines sparse reward supervision with reinforcement learning via GRPO. Extensive evaluations on the R2R, REVERIE, and SOON datasets demonstrate that SR-VLN achieves state-of-the-art overall navigation performance, while maintaining a comparable balance between accuracy and efficiency. Compared to explicit reasoning baselines, our method reduces token consumption by 68% and achieves a 4.1× speedup in inference while reaching a 76.02% success rate and a 73.80% SPL on the R2R unseen split, thereby facilitating near-real-time action prediction in long-range navigation environments.
(This article belongs to the Section Navigation and Positioning)
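The SR-VLN abstract mentions reinforcement learning via GRPO with sparse rewards. A core step of GRPO as commonly described is to score each sampled rollout relative to the other rollouts in its group. The sketch below shows only that normalization; the reward values and the `eps` constant are illustrative assumptions, and nothing here reflects the paper's actual training code.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts:
    advantage_i = (r_i - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical sparse rewards: 1.0 if the episode reached the goal, else 0.0.
example_advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

With a sparse success reward, a group in which every rollout succeeds or every rollout fails yields zero advantages, so such groups contribute no gradient signal.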
19 pages, 3589 KB  
Article
DIDW-YOLOv11: The Steel Surface Defect Detection Method Based on Improved YOLOv11 Network
by Jiajun Jiang, Yaodan Zhang, Ziyang Xue and Chuzheng Wang
Electronics 2026, 15(12), 2593; https://doi.org/10.3390/electronics15122593 - 12 Jun 2026
Viewed by 122
Abstract
Steel surface defect detection is crucial for steel quality and safe use. High computational cost and low detection accuracy remain the main issues in current steel defect detection models. To address these issues, this paper proposes a steel surface defect detection model named DIDW-YOLOv11. First, the YOLOv11 C3k2 module is improved with C3K2-DIMB, which integrates C3K2 and DIMB by introducing DynamicInceptionDWConv2d (DIDW) to strengthen detailed feature extraction for tiny and weak-texture defects and to better match multi-scale receptive fields. Second, the YOLOv11 SPPF module is enhanced with the IDWFSPPF module, which combines average pooling and max pooling to improve the fusion of local and global information and the model’s multi-scale feature fusion capability. Finally, an auxiliary detection head (ADH) with an additional coarse loss function applies extra supervision to shallow features, suppressing background noise and reducing false detections. Experimental results on the NEU-DET and GC10-DET datasets show that DIDW-YOLOv11 improves mAP@0.5 by 4.9% and 3.8%, respectively, over the baseline model YOLOv11s. These results indicate that DIDW-YOLOv11 exhibits stronger recognition ability and robustness in complex and diverse defect detection, providing an effective solution for steel defect detection in industrial production.
40 pages, 9816 KB  
Article
CORE-Net: A Collaborative Optimization Framework for Rotated Ship Detection in Complex SAR Scenes
by Yongqi Kang and Haiping Qu
Sensors 2026, 26(12), 3707; https://doi.org/10.3390/s26123707 - 10 Jun 2026
Viewed by 240
Abstract
Rotated ship detection in complex synthetic aperture radar (SAR) scenes remains a critical yet challenging task for maritime remote sensing applications. Existing methods are plagued by three core bottlenecks: inconsistent directional responses across multi-scale features, unstable rotation angle regression, and non-uniform supervision quality of positive samples during training, which collectively lead to elevated false alarms, missed detections, and severe localization degradation, especially under high IoU thresholds in complex inshore environments. To address these challenges, we propose CORE-Net, a collaborative optimization framework integrating three dedicated modules in the forward detection stage: a Rotation-Consistent Feature Pyramid (RCFP) to alleviate cross-scale directional mismatch, a Progressive Cascade Rotation Head (PCR Head) to improve progressive angle prediction stability, and an Orientation-Aware Regression Enhancement Unit (OAREU) to strengthen directional geometric representation in regression features, alongside an Uncertainty-Aware Sample Reliability Steering (UARS) module for training-stage optimization to softly downweight the regression contribution of positive samples with high classification confidence but low geometric consistency. Extensive experiments on three public SAR ship detection datasets (RSDD-SAR, SSDD+, and RSAR) demonstrate that the proposed method consistently improves AP50:95 while maintaining high Recall and Precision, validating that joint optimization of feature representation, rotated regression, and sample reliability is an effective strategy to enhance both the robustness and fine-grained localization capability of rotated ship detection in complex SAR scenes. In addition, large-scene inference experiments on uncropped Sentinel-1 and RSDD-SAR images further demonstrate that CORE-Net can be extended from patch-based evaluation to high-resolution SAR scene interpretation using a sliding-window inference strategy. 
(This article belongs to the Special Issue Application of SAR and Remote Sensing Technology in Earth Observation)
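The sliding-window inference strategy that the CORE-Net abstract applies to uncropped SAR scenes depends on tiling the scene so that every pixel is covered. A minimal sketch of the tiling step follows; the window size, stride, and function names are hypothetical, and the merging of per-window detections is not shown.

```python
def window_starts(length, window, stride):
    """Start offsets along one axis so that windows cover [0, length).
    A final window is added flush with the border if the stride
    would otherwise leave the edge uncovered."""
    if stride <= 0 or window <= 0:
        raise ValueError("window and stride must be positive")
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)
    return starts


def tile_origins(height, width, window, stride):
    """Top-left (row, col) corner of every square window over a scene."""
    return [(row, col)
            for row in window_starts(height, window, stride)
            for col in window_starts(width, window, stride)]
```

Choosing a stride smaller than the window makes neighbouring tiles overlap, which helps avoid cutting ships at tile borders at the cost of duplicate detections that must be merged afterwards.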
29 pages, 10118 KB  
Article
A Unified Explainable Autonomous Driving Framework via Cross-Attention Scene Selection and Semantic–Object Fusion
by Habib Dhahri, Fahad Alotaibi, Awais Mahmood and Mousa Jari
Machines 2026, 14(6), 677; https://doi.org/10.3390/machines14060677 - 10 Jun 2026
Viewed by 201
Abstract
Intelligent autonomous driving systems must not only predict the appropriate driving manoeuvre but also provide human-interpretable evidence that justifies the decision. However, existing methods typically address these objectives separately, leading to three practical limitations: multi-stage perception-to-language pipelines can propagate upstream perception errors into downstream explanations; post hoc saliency methods often produce pixel-level highlights that are difficult to interpret semantically; and decoupled decision and explanation modules cannot guarantee that the explanation reflects the same scene evidence used for behaviour prediction. In this paper, we propose a unified framework that jointly performs vehicle behaviour prediction and human-centric interpretation from a shared visual backbone. Specifically, a hierarchical Swin Transformer encodes the driving scene into a sequence of spatial tokens, which are processed by two complementary branches. The first branch, termed the Object Selection Module (OSM), learns a compact scene-level semantic representation through query-guided cross-attention, while the second branch extracts a small set of class-agnostic object-centric tokens without requiring bounding-box or segmentation supervision. These two representations are subsequently integrated by a Semantic–Object Fusion (SOF) module based on scaled dot-product attention, residual connections, and a feed-forward network. The behaviour prediction head operates on the fused representation, whereas the interpretation head leverages the semantic representation through a skip connection to preserve decision-relevant context. For surround-view perception, learnable per-camera embeddings are introduced to maintain viewpoint identity with negligible additional parameter cost. Furthermore, a compact language model fine-tuned via Low-Rank Adaptation (LoRA) generates fluent, label-conditioned natural-language justifications. 
Extensive experiments on two public benchmarks, BDD-OIA and nu-AD, demonstrate that the proposed framework consistently delivers superior performance and provides effective, human-readable interpretations of driving decisions.
30 pages, 5690 KB  
Article
M3DANet: A Lightweight Semi-Supervised Network and Embedded System for Bee Colony Counting
by Xue Li, Mingzhen Ma, Ying Kong, Huijun Huang, Qian Li, Feng Liu, Zhenguo Liu and Guangming Wang
Agriculture 2026, 16(12), 1284; https://doi.org/10.3390/agriculture16121284 - 10 Jun 2026
Viewed by 263
Abstract
Accurate bee counting is important for colony monitoring, pollination assessment, and precision beekeeping, but manual counting and dense point annotation are labor-intensive. This study proposes M3DANet, a lightweight semi-supervised density regression network with a handheld edge deployment system for bee colony counting. A dataset containing 586 valid high-resolution images and 34,869 point annotations was constructed for training and evaluation. M3DANet uses the first seven stages of MobileNetV3-Large as the lightweight backbone and combines multi-scale context encoding, attention-guided low-level feature fusion, and teacher–student consistency learning with confidence masking and warm-up training. The 10%, 30%, and 50% labeled data settings refer to the proportions of labeled images in the training set, and the remaining training images are used as unlabeled data. Mean absolute error (MAE) and root mean square error (RMSE) are used as evaluation metrics. On the main dataset, M3DANet achieved MAE values of 9.937, 7.003, and 5.570 and RMSE values of 13.093, 9.387, and 7.620 under the 10%, 30%, and 50% settings, respectively, outperforming representative semi-supervised baselines. Under the fully supervised setting, it achieved an MAE of 5.201 and an RMSE of 6.989 with only 2.095 M parameters and 416.64 FPS, using 87.1% fewer parameters and running 17.7 times faster than CSRNet. Cross-species experiments confirmed its low-label generalization ability. Jetson Orin NX deployment achieved 65.75 ms/image inference latency and 10.44 FPS complete-pipeline throughput. These results show that M3DANet balances counting accuracy, annotation efficiency, generalization, and edge deployment practicality.
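The M3DANet abstract reports counting quality as MAE and RMSE over per-image counts. The sketch below shows how these two metrics are computed; the example counts are made up for illustration and are not taken from the paper.

```python
import math


def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and true counts."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def root_mean_square_error(predicted, actual):
    """Square root of the average squared count difference.
    Penalizes large per-image errors more heavily than MAE."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


# Hypothetical per-image bee counts, for illustration only.
predicted_counts = [10, 20, 30]
true_counts = [12, 18, 30]
example_mae = mean_absolute_error(predicted_counts, true_counts)
example_rmse = root_mean_square_error(predicted_counts, true_counts)
```

RMSE is never smaller than MAE on the same data, so a large gap between the two points to a few images with large counting errors.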
22 pages, 13414 KB  
Article
Boundary-Aware Multi-Scale Feature Enhancement Based Few-Shot Hyperspectral Image Semantic Segmentation
by Xiaorong Zhang, Siyuan Li and Xi Zheng
Remote Sens. 2026, 18(12), 1911; https://doi.org/10.3390/rs18121911 - 9 Jun 2026
Viewed by 176
Abstract
To address the issues of model overfitting under scarce samples and poor segmentation performance on slender objects in the task of semantic segmentation of remote sensing hyperspectral images, this paper proposes a hyperspectral image semantic segmentation framework that integrates edge awareness and multi-scale feature enhancement under extremely few-shot conditions. This architecture effectively integrates orthogonal-direction convolutions, elongated feature enhancement, multi-scale feature fusion, and deep supervision mechanisms, solving challenges such as difficulty in extracting features of slender objects, model overfitting under few-sample conditions, and insufficient generalization ability. The experimental results on multiple public datasets show that the proposed algorithm achieves excellent segmentation performance with just one small-sized sample per labeled category, surpassing existing popular algorithms and thereby confirming the algorithm’s effectiveness and superiority. On the PaviaU dataset, the overall accuracy (OA) and mean intersection over union (mIoU) improved by approximately 9.7% and 15.5% compared to the second-best model; especially for the segmentation of the key elongated feature ‘road’, the intersection over union reached 94.75%, highlighting the effectiveness of the proposed mechanism. This paper provides a novel and efficient solution for fine interpretation of hyperspectral images under few-sample conditions.
18 pages, 3959 KB  
Article
Blind Self-Supervised Denoising of In Situ BOTDR Strain Data Using TrendBlend-BSFormer for Underwater Flexible Mattress Monitoring
by Jing Liu, Pengfei Jin, Zhixuan Zhang and Xianglong Wei
Sensors 2026, 26(12), 3663; https://doi.org/10.3390/s26123663 - 8 Jun 2026
Viewed by 227
Abstract
The long-term stability of submerged sandbars and protected shorelines in large alluvial rivers depends on the serviceability of flexible mattresses installed on the riverbed. Distributed fiber optic sensing is one of the few practical methods for monitoring deformation along these underwater systems over engineering-scale distances. Yet BOTDR-derived strain-difference profiles are often heavily contaminated by noise and rarely have reliable clean references. To address this issue, this study develops TrendBlend-BSFormer, a blind self-supervised denoising framework for in situ BOTDR strain data from underwater flexible mattresses. The framework combines four key features: blind-spot masking, a one-dimensional encoder-decoder backbone, a Transformer bottleneck for long-range spatial dependence, and a multi-scale trend-detail blending branch with dual signal-noise heads. The framework was validated using annual and daily BOTDR field data from the Yudaizhou shoreline protection project in the Yangtze River, containing 9343 and 9875 valid measurement points, respectively. TrendBlend-BSFormer achieved pseudo-SNR/RMSE/MAE values of 14.22 dB, 15.03 με and 12.05 με for the annual data set and 5.32 dB, 8.02 με and 6.45 με for the daily data set, improving the pseudo-SNR by 1.45 dB and 2.95 dB relative to the published BiLSTM-CNN benchmark. It also reduced the high-frequency energy ratio from 0.172 to 0.011 for the annual data and from 0.424 to 0.112 for the daily data. The denoised profiles suppress isolated spikes while preserving mechanically plausible peaks, valleys, and short-range fluctuations, indicating that blind self-supervised denoising can provide a more physically credible strategy for BOTDR-based monitoring in complex underwater environments.
(This article belongs to the Special Issue Underwater Vision Sensing System: 2nd Edition)
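Blind-spot masking, the first of the four components listed in the TrendBlend-BSFormer abstract, trains a denoiser without clean references by hiding selected samples and scoring the network only where the input was hidden. The sketch below is an assumption-laden illustration: it replaces each masked sample with the mean of its immediate neighbours (other blind-spot schemes use a randomly chosen neighbour), and it takes the masked indices as an explicit argument rather than sampling them.

```python
def mask_blind_spots(noisy_signal, masked_indices):
    """Return a copy of the signal in which each masked sample is
    replaced by the mean of its immediate neighbours, so the network
    cannot simply copy the noisy value at that position."""
    masked = list(noisy_signal)
    for i in masked_indices:
        neighbours = [noisy_signal[j] for j in (i - 1, i + 1)
                      if 0 <= j < len(noisy_signal)]
        if neighbours:
            masked[i] = sum(neighbours) / len(neighbours)
    return masked


def blind_spot_loss(prediction, noisy_signal, masked_indices):
    """Mean squared error evaluated only at the masked positions,
    using the original noisy samples as targets."""
    errors = [(prediction[i] - noisy_signal[i]) ** 2 for i in masked_indices]
    return sum(errors) / len(errors) if errors else 0.0
```

Because the target at a masked position carries noise the network never saw, the loss is minimized by predicting the underlying signal from context, provided the noise is roughly independent between neighbouring samples.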
20 pages, 6566 KB  
Communication
Consistency-Guided Distillation from Vision Foundation Models for Zero-Shot Airborne Point Cloud Segmentation
by Yuan Gao, Jindong Zhao, Shaobo Xia, Sheng Nie, Cheng Wang and Xiaohuan Xi
Remote Sens. 2026, 18(12), 1875; https://doi.org/10.3390/rs18121875 - 6 Jun 2026
Viewed by 193
Abstract
Semantic segmentation of large-scale airborne point clouds traditionally relies on labor-intensive 3D manual annotations. While recent zero-shot methods attempt to alleviate this burden by distilling knowledge from 2D Vision–Language Models (VLMs) via 2D-to-3D projection, they suffer from performance degradation in complex urban environments. Specifically, lacking 3D geometric awareness, 2D VLMs frequently exhibit “semantic bleeding”, where large-scale background categories (e.g., ground) erroneously submerge small-scale targets (e.g., vehicles and street elements). To address this issue, we propose a geometry-constrained pseudo-label generation and purification framework. Our approach tackles the problem through a dual-branch design: extracting open-vocabulary semantics via SAM3-based multi-view projection while simultaneously deriving sharp, class-agnostic instances using SAM2 on Gamma-transformed elevation maps. By introducing a geometric–semantic consistency module, we evaluate the internal semantic purity and external spatial homogeneity of these instances, detecting and filtering out semantic misclassifications. The purified pseudo-labels are then used to supervise a 3D sparse convolutional network via a Masked Cross-Entropy Loss. Experiments on the H3D and Turin3D datasets demonstrate that our method recovers small-scale targets that are prone to being submerged, outperforming existing zero-shot baselines by improving mIoU from 52.15% to 63.45% on H3D and from 29.52% to 58.51% on Turin3D, thereby narrowing the performance gap with fully supervised approaches.
(This article belongs to the Section AI Remote Sensing)
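The abstract above mentions supervising a network with purified pseudo-labels through a masked cross-entropy loss. As a rough illustration only, the sketch below shows one common way such a loss can be written: points whose pseudo-label was filtered out are skipped, and the loss is averaged over the remaining points. The function name `masked_cross_entropy` and the `ignore_label=-1` convention are assumptions made for this example, not details taken from the article.

```python
import math


def masked_cross_entropy(logits, labels, ignore_label=-1):
    """Mean cross-entropy over points that kept a pseudo-label.

    logits: list of per-point class score lists.
    labels: list of integer class ids; ignore_label marks filtered points
            (hypothetical convention for this sketch).
    """
    total, count = 0.0, 0
    for point_logits, label in zip(logits, labels):
        if label == ignore_label:
            continue  # filtered pseudo-label: contributes nothing to the loss
        # Numerically stable log-sum-exp for the softmax normaliser.
        m = max(point_logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in point_logits))
        total += log_sum_exp - point_logits[label]
        count += 1
    # If every point was filtered, return zero rather than dividing by zero.
    return total / count if count else 0.0


# Two points, two classes; the second point's pseudo-label was filtered out.
loss = masked_cross_entropy([[0.0, 0.0], [5.0, -5.0]], [0, -1])
print(loss)
```

Averaging over kept points only, rather than over all points, keeps the loss scale independent of how many pseudo-labels the purification step discards.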

20 pages, 2070 KB  
Article
Temporal-Enhanced and Visual-Text Adaptive Fusion for Weakly Supervised Video Anomaly Detection in Public Safety
by Jin Si, Qifen Dong and Xue Yang
J. Imaging 2026, 12(6), 249; https://doi.org/10.3390/jimaging12060249 - 6 Jun 2026
Viewed by 235
Abstract
In the realm of public safety, the automated identification of potential threats from voluminous surveillance streams is pivotal for developing intelligent security systems. Manual monitoring of such massive video feeds is highly inefficient, prone to human fatigue, and often leads to missed detections or false alarms. Leveraging deep learning for automatic anomaly detection is therefore essential to improve response efficiency and mitigate security risks. Weakly supervised video anomaly detection (WS-VAD) has emerged as a critical yet challenging task in this domain. In this study, we propose the Temporal-Enhanced and Visual-Text Adaptive Fusion (TE-VTAF) model for robust WS-VAD. Specifically, a Dynamic Local–Global Temporal Adaptive Module (DLG-TAM) is designed to capture multi-scale temporal dependencies and extract high-level video semantics. Concurrently, a Visual-Text Adaptive Fusion Module (VTAFM) is introduced to aggregate complementary cross-modal features, utilizing a competitive activation mechanism to suppress redundant information and enhance the discriminative power between normal and anomalous events. To further refine the learning process within the Multiple Instance Learning (MIL) framework, we incorporate a Top-K outer bag loss and a K-maxmin inner bag loss. These constraints maximize inter-class separability while suppressing label noise from normal instances within positive bags, thereby bolstering the detector’s robustness. Extensive experiments demonstrate that the proposed TE-VTAF consistently outperforms state-of-the-art methods on two large-scale benchmarks, achieving an AUC of 88.93% on UCF-Crime and an AP of 85.62% on XD-Violence.
(This article belongs to the Section Computer Vision and Pattern Recognition)
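The abstract above refers to a Top-K bag loss within a Multiple Instance Learning framework. The sketch below illustrates a generic top-k MIL formulation that is widely used in weakly supervised anomaly detection: a video (bag) score is the mean of its k highest snippet scores, and that score is trained against the video-level label with binary cross-entropy. This is not the article's Top-K outer bag loss or K-maxmin inner bag loss, whose exact definitions are not given in the abstract; the names `topk_bag_score` and `bag_bce_loss` are hypothetical.

```python
import math


def topk_bag_score(snippet_scores, k):
    """Mean of the k highest snippet anomaly scores in one video (bag)."""
    k = max(1, min(k, len(snippet_scores)))  # clamp k to a valid range
    top = sorted(snippet_scores, reverse=True)[:k]
    return sum(top) / k


def bag_bce_loss(snippet_scores, video_label, k=3, eps=1e-7):
    """Binary cross-entropy between the top-k bag score and the video label.

    snippet_scores are assumed to lie in [0, 1]; video_label is 1 for an
    anomalous video and 0 for a normal one.
    """
    p = min(max(topk_bag_score(snippet_scores, k), eps), 1.0 - eps)
    return -(video_label * math.log(p) + (1 - video_label) * math.log(1.0 - p))


scores = [0.1, 0.9, 0.8, 0.2]
print(topk_bag_score(scores, k=2))
print(bag_bce_loss(scores, video_label=1, k=2))
```

Using the top-k mean instead of the single maximum makes the bag score less sensitive to one spurious high-scoring snippet, which is one reason this aggregation is popular when only video-level labels are available.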

28 pages, 2738 KB  
Article
BCAR-Net: A Bidirectional Cross-Attention Network with Auxiliary Reconstruction for Tree Counting in Complex Forest Scenes Using Airborne RGB and LiDAR Data
by Xiaoyu Wu, Xijian Fan, Mengjiao Tang and Size Dai
Plants 2026, 15(12), 1762; https://doi.org/10.3390/plants15121762 - 6 Jun 2026
Viewed by 384
Abstract
Accurate tree counting from remote sensing data is essential for forest inventory, biomass estimation, carbon accounting, and ecological monitoring. However, existing approaches predominantly rely on airborne RGB imagery and often struggle in complex forest scenes where neighboring crowns exhibit highly similar textures and colors and where overlapping crown boundaries become ambiguous. To address this limitation, the LiDAR-derived Canopy Height Model (CHM) is introduced as a complementary modality that provides explicit cues on canopy height variation and vertical structure to support RGB-based analysis. Building on this, we propose BCAR-Net, a broker-guided RGB and depth (RGB-D) multimodal framework that couples bidirectional cross-modal interaction, adaptive tri-branch fusion, and auxiliary reconstruction within a two-stage optimization scheme. Specifically, a bidirectional cross-attention U-Net generates an intermediate broker RGB-D representation from paired RGB images and depth maps through symmetric bidirectional cross-attention between the two modalities and direction-aware gating. The original RGB image, depth map, and broker representation are then jointly encoded by three weight-sharing branches and adaptively aggregated by a spatial fusion gate for density-map regression. To regularize the fused latent feature, a multi-scale cross-attention reconstruction decoder provides auxiliary RGB and depth reconstruction supervision by querying multi-scale BCA-UNet encoder features through 2D cross-attention, and a reconstruction-oriented first stage replaces externally generated fused-image supervision, yielding a task-consistent optimization scheme. Experiments on the NEONTreeEvaluation benchmark show that BCAR-Net consistently outperforms single-modality settings and a direct RGB-D concatenation baseline. Additional experiments on a public UAV RGB–LiDAR dataset provide a small-scale supplementary evaluation under a different acquisition setting, where BCAR-Net achieves modest but consistent improvements over RGB-only and depth-only baselines. These results demonstrate that the proposed framework offers an effective but computationally cautious solution for tree counting in complex forest environments.
(This article belongs to the Special Issue Computer Vision Techniques for Plant Phenomics Applications)