Search Results (131)

Search Parameters:
Keywords = low-texture scenes

28 pages, 3012 KB  
Article
Context-Aware Visual Emotion Recognition Through Hierarchical Fusion of Facial Micro-Features and Scene Semantics
by Karn Yongsiriwit, Parkpoom Chaisiriprasert, Thannob Aribarg and Sokliv Kork
Appl. Sci. 2025, 15(24), 13160; https://doi.org/10.3390/app152413160 - 15 Dec 2025
Abstract
Visual emotion recognition in unconstrained environments remains challenging, as single-stream deep learning models often fail to capture the localized facial cues and contextual information necessary for accurate classification. This study introduces a hierarchical multi-level feature fusion framework that systematically combines low-level micro-textural features (Local Binary Patterns), mid-level facial cues (Facial Action Units), and high-level scene semantics (Places365) with ResNet-50 global embeddings. Evaluated on the large-scale EmoSet-3.3M dataset, which contains 3.3 million images across eight emotion categories, the framework demonstrates marked performance gains: its best configuration (LBP-FAUs-Places365-ResNet) achieves 74% accuracy and a macro-averaged F1-score of 0.75, a five-percentage-point improvement over the ResNet-50 baseline. The approach excels at distinguishing high-intensity emotions while maintaining efficient inference (2.2 ms per image, 29 M parameters), and analysis confirms that integrating facial muscle activations with scene context enables nuanced emotional differentiation. These results validate that hierarchical feature integration significantly advances robust, human-aligned visual emotion recognition, making it suitable for real-world Human–Computer Interaction (HCI) and affective computing applications. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
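
As a rough illustration of the late-fusion idea described above (concatenating LBP, Facial Action Unit, Places365, and ResNet-50 features before classification), here is a minimal PyTorch sketch; all feature dimensions and layer sizes are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class HierarchicalFusionClassifier(nn.Module):
    def __init__(self, dims=(59, 17, 365, 2048), num_classes=8):
        # dims: hypothetical sizes for an LBP histogram, facial action units,
        # Places365 scene logits, and a ResNet-50 global embedding.
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(sum(dims), 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes),
        )

    def forward(self, lbp, faus, scene, global_emb):
        fused = torch.cat([lbp, faus, scene, global_emb], dim=1)  # (B, sum(dims))
        return self.head(fused)

# Usage with random stand-in features for a batch of 4 images.
model = HierarchicalFusionClassifier()
logits = model(torch.rand(4, 59), torch.rand(4, 17),
               torch.rand(4, 365), torch.rand(4, 2048))
print(logits.shape)  # torch.Size([4, 8])
```
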
20 pages, 5083 KB  
Article
MDR–SLAM: Robust 3D Mapping in Low-Texture Scenes with a Decoupled Approach and Temporal Filtering
by Kailin Zhang and Letao Zhou
Electronics 2025, 14(24), 4864; https://doi.org/10.3390/electronics14244864 - 10 Dec 2025
Viewed by 159
Abstract
Realizing real-time dense 3D reconstruction on resource-limited mobile platforms remains a significant challenge, particularly in low-texture environments that demand robust multi-frame fusion to resolve matching ambiguities. However, the inherent tight coupling of pose estimation and mapping in traditional monolithic SLAM architectures imposes a severe restriction on integrating high-complexity fusion algorithms without compromising tracking stability. To overcome these limitations, this paper proposes MDR–SLAM, a modular and fully decoupled stereo framework. The system features a novel keyframe-driven temporal filter that synergizes efficient ELAS stereo matching with Kalman filtering to effectively accumulate geometric constraints, thereby enhancing reconstruction density in textureless areas. Furthermore, a confidence-based fusion backend is employed to incrementally maintain global map consistency and filter outliers. Quantitative evaluation on the NUFR-M3F indoor dataset demonstrates the effectiveness of the proposed method: compared to the standard single-frame baseline, MDR–SLAM reduces map RMSE by 83.3% (to 0.012 m) and global trajectory drift by 55.6%, while significantly improving map completeness. The system operates entirely on CPU resources with a stable 4.7 Hz mapping frequency, verifying its suitability for embedded mobile robotics. Full article
(This article belongs to the Special Issue Recent Advance of Auto Navigation in Indoor Scenarios)
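
A much-simplified sketch of the temporal-filtering idea described above: repeated per-pixel depth measurements from successive keyframes are fused with a scalar Kalman update to densify low-texture regions. The data layout and noise values are assumptions, not the MDR–SLAM implementation.

```python
import numpy as np

def kalman_fuse_depth(depth_est, var_est, depth_meas, var_meas):
    """One scalar Kalman update per pixel; NaNs mark missing measurements."""
    gain = var_est / (var_est + var_meas)
    fused = depth_est + gain * (depth_meas - depth_est)
    fused_var = (1.0 - gain) * var_est
    valid = ~np.isnan(depth_meas)
    return np.where(valid, fused, depth_est), np.where(valid, fused_var, var_est)

# Accumulate three noisy keyframe depth maps of a 2x2 patch (metres).
state = np.full((2, 2), 2.0)          # initial depth estimate
state_var = np.full((2, 2), 1.0)      # high initial uncertainty
for meas in [np.array([[2.1, np.nan], [1.9, 2.0]]),
             np.array([[2.0, 2.2], [np.nan, 2.1]]),
             np.array([[2.05, 2.1], [1.95, np.nan]])]:
    state, state_var = kalman_fuse_depth(state, state_var, meas, var_meas=0.04)
print(np.round(state, 3))
```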

22 pages, 32335 KB  
Article
MAIENet: Multi-Modality Adaptive Interaction Enhancement Network for SAR Object Detection
by Yu Tong, Kaina Xiong, Jun Liu, Guixing Cao and Xinyue Fan
Remote Sens. 2025, 17(23), 3866; https://doi.org/10.3390/rs17233866 - 28 Nov 2025
Viewed by 195
Abstract
Synthetic aperture radar (SAR) object detection offers significant advantages in remote sensing applications, particularly under adverse weather conditions or low-light environments. However, single-modal SAR image object detection encounters numerous challenges, including speckle noise, limited texture information, and interference from complex backgrounds. To address these issues, we present the Modality-Aware Adaptive Interaction Enhancement Network (MAIENet), a multimodal detection framework designed to effectively extract complementary information from both SAR and optical images, thereby enhancing object detection performance. MAIENet comprises three primary components: a batch-wise splitting and channel-wise concatenation (BSCC) module, a modality-aware adaptive interaction enhancement (MAIE) module, and a multi-directional focus (MF) module. The BSCC module extracts and reorganizes features from each modality to preserve their distinct characteristics. The MAIE module facilitates deeper cross-modal fusion through channel reweighting, deformable convolutions, atrous convolution, and attention mechanisms, enabling the network to emphasize critical modal information while reducing interference. By integrating features from various spatial directions, the MF module expands the receptive field, allowing the model to adapt more effectively to complex scenes. The MAIENet framework is end-to-end trainable and can be seamlessly integrated into existing detection networks with minimal modifications. Experimental results on the publicly available OGSOD-1.0 dataset demonstrate that MAIENet achieves superior performance compared with existing methods, reaching 90.8% mAP50. Full article
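
A minimal sketch of what a batch-wise splitting and channel-wise concatenation step could look like, assuming paired SAR and optical feature maps interleaved along the batch dimension; the data layout is an assumption for illustration, not the published BSCC module.

```python
import torch

def bscc(features):
    """Batch-wise split / channel-wise concat sketch (assumed data layout):
    the batch interleaves the SAR and optical feature maps of each scene pair;
    pull the two modalities apart and stack each pair along the channel axis."""
    sar, opt = features[0::2], features[1::2]   # (B/2, C, H, W) each
    return torch.cat([sar, opt], dim=1)         # (B/2, 2C, H, W)

x = torch.rand(4, 64, 32, 32)                   # two interleaved SAR/optical pairs
print(bscc(x).shape)                            # torch.Size([2, 128, 32, 32])
```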

20 pages, 21569 KB  
Article
Single Image Haze Removal via Multiple Variational Constraints for Vision Sensor Enhancement
by Yuxue Feng, Weijia Zhao, Luyao Wang, Hongyu Liu, Yuxiao Li and Yun Liu
Sensors 2025, 25(23), 7198; https://doi.org/10.3390/s25237198 - 25 Nov 2025
Viewed by 384
Abstract
Images captured by vision sensors in outdoor environments often suffer from haze-induced degradations, including blurred details, faded colors, and reduced visibility, which severely impair the performance of sensing and perception systems. To address this issue, we propose a haze-removal algorithm for hazy images using multiple variational constraints. Based on the classic atmospheric scattering model, a mixed variational framework is presented that incorporates three regularization terms for the transmission map and scene radiance. Concretely, an ℓp norm and an ℓ2 norm are imposed jointly on the transmission map to smooth details while preserving structures, and a weighted ℓ1 norm constrains the scene radiance to suppress noise. Furthermore, the devised weight function takes into account both the local variances and the gradients of the scene radiance, adaptively perceiving textures and structures and controlling the smoothness during image restoration. To solve the mixed variational model, a re-weighted least squares strategy is employed to iteratively solve two separated subproblems. Finally, a gamma correction is applied to adjust the overall brightness, yielding the final recovered result. Extensive comparisons with state-of-the-art methods demonstrate that the proposed algorithm produces visually satisfactory results with superior clarity and vibrant colors. In addition, it generalizes well to diverse degradation scenarios, including low-light and remote sensing hazy images, and effectively improves the performance of high-level vision tasks. Full article
(This article belongs to the Section Sensing and Imaging)
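
The final recovery step follows the classic atmospheric scattering model I = J·t + A·(1 − t) referenced in the abstract. Below is a minimal sketch of that inversion plus the gamma correction, assuming the transmission map and atmospheric light have already been estimated by the variational solver; the clipping floor and gamma value are illustrative assumptions.

```python
import numpy as np

def recover_scene_radiance(hazy, transmission, atmosphere, t_min=0.1, gamma=0.8):
    """Invert the atmospheric scattering model I = J*t + A*(1 - t), then apply
    gamma correction for brightness; t_min and gamma are assumed values."""
    t = np.clip(transmission, t_min, 1.0)[..., None]        # avoid division blow-up
    radiance = (hazy - atmosphere) / t + atmosphere
    radiance = np.clip(radiance, 0.0, 1.0)
    return radiance ** gamma                                 # gamma correction

hazy = np.random.rand(4, 4, 3)            # toy hazy image in [0, 1]
t_map = np.full((4, 4), 0.6)              # transmission from the variational solver
A = np.array([0.9, 0.9, 0.92])            # estimated atmospheric light
print(recover_scene_radiance(hazy, t_map, A).shape)   # (4, 4, 3)
```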

21 pages, 13550 KB  
Article
A Robust and Reliable Positioning Method for Complex Environments Based on Quality-Controlled Multi-Sensor Fusion of GNSS, INS, and LiDAR
by Ziteng Zhang, Chuanzhen Sheng, Shuguo Pan, Xingxing Wang, Baoguo Yu and Jingkui Zhang
Remote Sens. 2025, 17(22), 3760; https://doi.org/10.3390/rs17223760 - 19 Nov 2025
Viewed by 471
Abstract
The multi-source fusion localization algorithm demonstrates advantages in achieving continuous localization. However, its reliability and robustness cannot be guaranteed in complex environments, especially under severe occlusion and in low-texture scenes in non-cooperative scenarios. In this paper, we propose a GNSS/INS/LiDAR multi-source fusion localization framework. To enhance the algorithm's performance, the availability of the different sensors is evaluated quantitatively through GNSS/INS status detection, and LiDAR data feature-repeatability quality control is implemented at the front end. Both the variability of the standard deviation of feature differences and the standard deviation of real-time features are proposed as the principal indicators for characterizing the repeatability of LiDAR 3D point clouds. The prior probability of the sensor covariance within the factor graph improves the algorithm's fusion weight adjustment capability. Finally, a GNSS/INS/LiDAR multi-sensor positioning test platform is developed, and experiments are conducted in sheltered and semi-sheltered environments, such as urban, tunnel, campus, and mountainous settings. The results show that, compared with state-of-the-art methods, the proposed algorithm exhibits superior adaptability, significantly enhancing both reliability and robustness in four typical real, complex environments, and it improves the robust running time by 44% in terms of availability in large-scale urban tests. In addition, the algorithm demonstrates superior positioning accuracy compared with other methods, achieving an RMSE of 0.18 m and 0.21 m in large-scale, long-duration urban and mountainous settings, respectively. Full article
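
To illustrate the kind of repeatability indicator described here, a toy sketch that scores the standard deviation of frame-to-frame differences of a scalar LiDAR feature statistic over a short window; the feature choice, window length, and any decision threshold are assumptions, not the paper's formulation. A score above a tuned threshold could then be used to down-weight the corresponding LiDAR factor in the graph.

```python
import numpy as np

def repeatability_score(feature_counts, window=5):
    """Illustrative quality indicator (assumed form): the standard deviation of
    frame-to-frame differences of a scalar LiDAR feature statistic over a sliding
    window; large values suggest unstable, low-repeatability geometry."""
    feats = np.asarray(feature_counts, dtype=float)
    diffs = np.diff(feats[-(window + 1):])
    return float(np.std(diffs))

# Plane/edge feature counts from recent scans; a jump signals degraded geometry.
stable   = [820, 815, 830, 824, 818, 826]
degraded = [820, 610, 790, 420, 700, 350]
print(repeatability_score(stable), repeatability_score(degraded))
```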

21 pages, 1479 KB  
Article
Neural Radiance Fields-Driven Exploration of Visual Communication and Spatial Interaction Design for Immersive Digital Installations
by Wanshu Li and Yuanhui Hu
J. Imaging 2025, 11(11), 411; https://doi.org/10.3390/jimaging11110411 - 13 Nov 2025
Viewed by 538
Abstract
In immersive digital devices, high environmental complexity can lead to rendering delays and loss of interactive details, resulting in a fragmented experience. This paper proposes a lightweight NeRF (Neural Radiance Fields) modeling and multimodal perception fusion method. First, a sparse hash encoding is constructed based on Instant-NGP (Instant Neural Graphics Primitives) to accelerate scene radiance field generation. Second, parameter distillation and channel pruning are used to reduce the model's size and computational overhead. Next, multimodal data from a depth camera and an IMU (Inertial Measurement Unit) are fused, and Kalman filtering is used to improve pose tracking accuracy. Finally, the optimized NeRF model is integrated into the Unity engine, utilizing custom shaders and asynchronous rendering to achieve low-latency viewpoint responsiveness. Experiments show that the file size of this method in high-complexity scenes is only 79.5 MB ± 5.3 MB, and the first loading time is only 2.9 s ± 0.4 s, effectively reducing rendering latency. At 1.5 m/s, the SSIM is 0.951 ± 0.016 and the GME is 7.68 ± 0.15, so the method stably restores texture details and edge sharpness under dynamic viewing angles. In scenarios that support 3–5 people interacting simultaneously, the average interaction response delay is only 16.3 ms, and the average jitter error is controlled at 0.12°, significantly improving spatial interaction performance. In conclusion, this study provides effective technical solutions for high-quality immersive interaction in complex public scenarios. Future work will explore the framework's adaptability in larger-scale dynamic environments and further optimize the network synchronization mechanism for multi-user concurrency. Full article
(This article belongs to the Section Image and Video Processing)
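
The pose-fusion step pairs IMU prediction with depth-camera measurements via Kalman filtering. Below is a one-axis constant-acceleration sketch of that predict/update cycle; the state layout and all noise parameters are assumed values, not the paper's filter.

```python
import numpy as np

def kf_step(x, P, accel, z_cam, dt=0.02, q=1e-3, r=4e-4):
    """One predict/update cycle for one translation axis: IMU acceleration drives
    the prediction, the depth-camera position is the measurement. Noise values
    q and r are assumptions."""
    F = np.array([[1.0, dt], [0.0, 1.0]])            # state: [position, velocity]
    B = np.array([0.5 * dt * dt, dt])
    x = F @ x + B * accel                            # predict with IMU input
    P = F @ P @ F.T + q * np.eye(2)
    H = np.array([[1.0, 0.0]])                       # camera observes position only
    K = P @ H.T / (H @ P @ H.T + r)                  # Kalman gain
    x = x + (K * (z_cam - H @ x)).ravel()            # correct with camera position
    P = (np.eye(2) - K @ H) @ P
    return x, P

x, P = np.zeros(2), np.eye(2)
for accel, z in [(0.1, 0.001), (0.1, 0.004), (0.0, 0.006)]:
    x, P = kf_step(x, P, accel, z)
print(np.round(x, 4))
```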

17 pages, 2260 KB  
Article
CONTI-CrackNet: A Continuity-Aware State-Space Network for Crack Segmentation
by Wenjie Song, Min Zhao and Xunqian Xu
Sensors 2025, 25(22), 6865; https://doi.org/10.3390/s25226865 - 10 Nov 2025
Viewed by 633
Abstract
Crack segmentation in cluttered scenes with slender and irregular patterns remains difficult, and practical systems must balance accuracy and efficiency. We present CONTI-CrackNet, a lightweight visual state-space network that integrates a Multi-Directional Selective Scanning Strategy (MD3S). MD3S performs bidirectional scanning along the horizontal, vertical, and diagonal directions, and it fuses the complementary paths with a Bidirectional Gated Fusion (BiGF) module to strengthen global continuity. To preserve fine details while completing global texture, we propose a Dual-Branch Pixel-Level Global–Local Fusion (DBPGL) module that incorporates a Pixel-Adaptive Pooling (PAP) mechanism to dynamically weight max-pooled and average-pooled responses. Evaluated on two public benchmarks, the proposed method achieves an F1 score of 0.8332 and a mean Intersection over Union (mIoU) of 0.8436 on the TUT dataset, and an mIoU of 0.7760 on the CRACK500 dataset, surpassing competitive Convolutional Neural Network (CNN), Transformer, and Mamba baselines. With 512 × 512 input, the model requires 24.22 G floating point operations (GFLOPs), has 6.01 M parameters (Params), and operates at 42 frames per second (FPS) on an RTX 3090 GPU, delivering a favorable accuracy–efficiency balance. These results show that CONTI-CrackNet improves continuity and edge recovery for thin cracks while remaining lightweight in both parameter count and computational cost. Full article
(This article belongs to the Section Sensor Networks)
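
A minimal sketch of a pixel-adaptive pooling idea along the lines described above: a learned per-pixel gate blends max-pooled and average-pooled responses, so edges can favour the max branch and flat regions the average branch. The gating design is an assumption, not the published PAP mechanism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelAdaptivePool(nn.Module):
    """Sketch of pixel-adaptive pooling (assumed design): a 1x1 conv predicts a
    per-pixel weight that blends max-pooled and average-pooled responses."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.gate = nn.Conv2d(channels, 1, kernel_size=1)
        self.k = kernel_size

    def forward(self, x):
        pad = self.k // 2
        mx = F.max_pool2d(x, self.k, stride=1, padding=pad)
        av = F.avg_pool2d(x, self.k, stride=1, padding=pad)
        w = torch.sigmoid(self.gate(x))          # (B, 1, H, W) blending weight
        return w * mx + (1.0 - w) * av

x = torch.rand(1, 32, 64, 64)
print(PixelAdaptivePool(32)(x).shape)            # torch.Size([1, 32, 64, 64])
```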

19 pages, 1895 KB  
Article
Cross-Context Aggregation for Multi-View Urban Scene and Building Facade Matching
by Yaping Yan and Yuhang Zhou
ISPRS Int. J. Geo-Inf. 2025, 14(11), 425; https://doi.org/10.3390/ijgi14110425 - 31 Oct 2025
Viewed by 500
Abstract
Accurate and robust feature matching across multi-view urban imagery is fundamental for urban mapping, 3D reconstruction, and large-scale spatial alignment. Real-world urban scenes involve significant variations in viewpoint, illumination, and occlusion, as well as repetitive architectural patterns that make correspondence estimation challenging. To address these issues, we propose the Cross-Context Aggregation Matcher (CCAM), a detector-free framework that jointly leverages multi-scale local features, long-range contextual information, and geometric priors to produce spatially consistent matches. Specifically, CCAM integrates a multi-scale local enhancement branch with a parallel self- and cross-attention Transformer, enabling the model to preserve detailed local structures while maintaining a coherent global context. In addition, an independent positional encoding scheme is introduced to strengthen geometric reasoning in repetitive or low-texture regions. Extensive experiments demonstrate that CCAM outperforms state-of-the-art methods, achieving up to +31.8%, +19.1%, and +11.5% improvements in AUC@{5°, 10°, 20°} over detector-based approaches and up to 1.72% higher precision compared with detector-free counterparts. These results confirm that CCAM delivers reliable and spatially coherent matches, thereby facilitating downstream geospatial applications. Full article
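
As an illustration of giving each view its own positional information before attention-based matching, here is a sketch of a 2D sinusoidal positional encoding; the exact scheme used in CCAM may differ, so treat this as an assumption.

```python
import torch

def sinusoidal_pe_2d(height, width, dim):
    """Sketch of an independent 2D sinusoidal positional encoding (assumed form):
    half the channels encode the row coordinate and half the column coordinate."""
    assert dim % 4 == 0
    d = dim // 4
    freq = 1.0 / (10000 ** (torch.arange(d, dtype=torch.float32) / d))      # (d,)
    ys = torch.arange(height, dtype=torch.float32)[:, None] * freq          # (H, d)
    xs = torch.arange(width, dtype=torch.float32)[:, None] * freq           # (W, d)
    pe = torch.zeros(dim, height, width)
    pe[0:d]         = ys.sin().T[:, :, None].expand(d, height, width)
    pe[d:2 * d]     = ys.cos().T[:, :, None].expand(d, height, width)
    pe[2 * d:3 * d] = xs.sin().T[:, None, :].expand(d, height, width)
    pe[3 * d:]      = xs.cos().T[:, None, :].expand(d, height, width)
    return pe

feat = torch.rand(1, 256, 60, 80)
feat = feat + sinusoidal_pe_2d(60, 80, 256)    # add position info to one view's features
print(feat.shape)                               # torch.Size([1, 256, 60, 80])
```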

23 pages, 11949 KB  
Article
MDAS-YOLO: A Lightweight Adaptive Framework for Multi-Scale and Dense Pest Detection in Apple Orchards
by Bo Ma, Jiawei Xu, Ruofei Liu, Junlin Mu, Biye Li, Rongsen Xie, Shuangxi Liu, Xianliang Hu, Yongqiang Zheng, Hongjian Zhang and Jinxing Wang
Horticulturae 2025, 11(11), 1273; https://doi.org/10.3390/horticulturae11111273 - 22 Oct 2025
Cited by 1 | Viewed by 855
Abstract
Accurate monitoring of orchard pests is vital for green and efficient apple production. Yet images captured by intelligent pest-monitoring lamps often contain small targets, weak boundaries, and crowded scenes, which hamper detection accuracy. We present MDAS-YOLO, a lightweight detection framework tailored for smart pest monitoring in apple orchards. At the input stage, we adopt the LIME++ enhancement to mitigate low illumination and non-uniform lighting, improving image quality at the source. On the model side, we integrate three structural innovations: (1) a C3k2-MESA-DSM module in the backbone to explicitly strengthen contours and fine textures via multi-scale edge enhancement and dual-domain feature selection; (2) an AP-BiFPN in the neck to achieve adaptive cross-scale fusion through learnable weighting and differentiated pooling; and (3) a SimAM block before the detection head to perform zero-parameter, pixel-level saliency re-calibration, suppressing background redundancy without extra computation. On a self-built apple-orchard pest dataset, MDAS-YOLO attains 95.68% mAP, outperforming YOLOv11n by 6.97 percentage points while maintaining a superior trade-off among accuracy, model size, and inference speed. Overall, the proposed synergistic pipeline—input enhancement, early edge fidelity, mid-level adaptive fusion, and end-stage lightweight re-calibration—effectively addresses small-scale, weak-boundary, and densely distributed pests, providing a promising and regionally validated approach for intelligent pest monitoring and sustainable orchard management, and offering methodological insights for future multi-regional pest monitoring research. Full article
(This article belongs to the Section Insect Pest Management)
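
The SimAM block mentioned above is commonly implemented as a closed-form, parameter-free energy-based recalibration; here is a sketch of that widely used formulation (the lambda value is an assumed default, and the paper's placement of the block may differ).

```python
import torch

def simam(x, lam=1e-4):
    """Zero-parameter SimAM-style recalibration (commonly used formulation):
    per-pixel saliency weights come from a closed-form energy, so no learnable
    parameters or extra computation graphs are added."""
    n = x.shape[2] * x.shape[3] - 1
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
    v = d.sum(dim=(2, 3), keepdim=True) / n
    e_inv = d / (4 * (v + lam)) + 0.5
    return x * torch.sigmoid(e_inv)

feat = torch.rand(1, 64, 40, 40)
print(simam(feat).shape)    # torch.Size([1, 64, 40, 40])
```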

31 pages, 3160 KB  
Article
Multimodal Image Segmentation with Dynamic Adaptive Window and Cross-Scale Fusion for Heterogeneous Data Environments
by Qianping He, Meng Wu, Pengchang Zhang, Lu Wang and Quanbin Shi
Appl. Sci. 2025, 15(19), 10813; https://doi.org/10.3390/app151910813 - 8 Oct 2025
Viewed by 883
Abstract
Multi-modal image segmentation is a key task in various fields such as urban planning, infrastructure monitoring, and environmental analysis. However, it remains challenging due to complex scenes, varying object scales, and the integration of heterogeneous data sources (such as RGB, depth maps, and infrared). To address these challenges, we proposed a novel multi-modal segmentation framework, DyFuseNet, which features dynamic adaptive windows and cross-scale feature fusion capabilities. This framework consists of three key components: (1) Dynamic Window Module (DWM), which uses dynamic partitioning and continuous position bias to adaptively adjust window sizes, thereby improving the representation of irregular and fine-grained objects; (2) Scale Context Attention (SCA), a hierarchical mechanism that associates local details with global semantics in a coarse-to-fine manner, enhancing segmentation accuracy in low-texture or occluded regions; and (3) Hierarchical Adaptive Fusion Architecture (HAFA), which aligns and fuses features from multiple modalities through shallow synchronization and deep channel attention, effectively balancing complementarity and redundancy. Evaluated on benchmark datasets (such as ISPRS Vaihingen and Potsdam), DyFuseNet achieved state-of-the-art performance, with mean Intersection over Union (mIoU) scores of 80.40% and 80.85%, surpassing MFTransNet by 1.91% and 1.77%, respectively. The model also demonstrated strong robustness in challenging scenes (such as building edges and shadowed objects), achieving an average F1 score of 85% while maintaining high efficiency (26.19 GFLOPs, 30.09 FPS), making it suitable for real-time deployment. This work presents a practical, versatile, and computationally efficient solution for multi-modal image analysis, with potential applications beyond remote sensing, including smart monitoring, industrial inspection, and multi-source data fusion tasks. Full article
(This article belongs to the Special Issue Signal and Image Processing: From Theory to Applications: 2nd Edition)
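
A sketch of deep channel-attention fusion of two modality streams in a squeeze-and-excitation style, as a rough analogue of the fusion role HAFA plays; the reduction ratio and projection layer are assumptions rather than the paper's design.

```python
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    """Sketch of channel-attention fusion for two modality streams (assumed
    SE-style design): global pooling of the concatenated features drives
    per-channel weights before a 1x1 projection back to the working width."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2 * channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(2 * channels // reduction, 2 * channels, 1),
            nn.Sigmoid(),
        )
        self.proj = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, rgb_feat, aux_feat):
        x = torch.cat([rgb_feat, aux_feat], dim=1)
        return self.proj(x * self.attn(x))

rgb, depth = torch.rand(2, 64, 32, 32), torch.rand(2, 64, 32, 32)
print(ChannelAttentionFusion(64)(rgb, depth).shape)   # torch.Size([2, 64, 32, 32])
```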

25 pages, 12740 KB  
Article
GM-DETR: Infrared Detection of Small UAV Swarm Targets Based on Detection Transformer
by Chenhao Zhu, Xueli Xie, Jianxiang Xi and Xiaogang Yang
Remote Sens. 2025, 17(19), 3379; https://doi.org/10.3390/rs17193379 - 7 Oct 2025
Viewed by 761
Abstract
Infrared object detection is an important prerequisite for small unmanned aerial vehicle (UAV) swarm countermeasures. Owing to the limited imaging area and texture features of small UAV targets, accurate infrared detection of UAV swarm targets is challenging. In this paper, the GM-DETR is proposed for the detection of densely distributed small UAV swarm targets in infrared scenarios. Specifically, high-level and low-level features are fused by the Fine-Grained Context-Aware Fusion module, which augments texture features in the fused feature map. Furthermore, a Supervised Sampling and Sparsification module is proposed as an explicit guiding mechanism, which helps the GM-DETR focus on high-quality queries according to their confidence values. The Geometric Relation Encoder is introduced to encode geometric relations among queries, compensating for the information loss caused by query serialization. In the second stage of the GM-DETR, a long-term memory mechanism is introduced to make UAV detection more stable and distinguishable in motion-blur scenes. In the decoder, the self-attention mechanism is improved by introducing memory blocks as additional decoding information, which enhances the robustness of the GM-DETR. As a further contribution, we constructed the UAV Swarm Dataset (USD), which comprises 7000 infrared images of low-altitude UAV swarms. The experimental results on the USD show that the GM-DETR outperforms other state-of-the-art detectors and obtains the best scores (90.6 on AP75 and 63.8 on APS), which demonstrates its effectiveness in detecting small UAV targets. The good performance of the GM-DETR on the Drone Vehicle dataset also demonstrates the superiority of the proposed modules in detecting small targets. Full article
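
A minimal sketch of confidence-guided query sparsification in the spirit described above: score each query by its class confidence and keep only the top-k so the decoder concentrates on high-quality candidates. The scoring rule, query count, and k are assumptions, not the published SSS module.

```python
import torch

def select_queries(queries, class_logits, k=300):
    """Sketch of confidence-guided query sparsification (assumed mechanism):
    score each query by its highest class probability and keep the top-k."""
    scores = class_logits.sigmoid().max(dim=-1).values        # (B, N)
    idx = scores.topk(k, dim=-1).indices                      # (B, k)
    gathered = torch.gather(
        queries, 1, idx.unsqueeze(-1).expand(-1, -1, queries.shape[-1]))
    return gathered, idx

q = torch.rand(2, 900, 256)          # 900 candidate queries, 256-d embeddings
logits = torch.rand(2, 900, 1)       # single-class (UAV) logits
kept, kept_idx = select_queries(q, logits, k=300)
print(kept.shape)                    # torch.Size([2, 300, 256])
```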

18 pages, 6931 KB  
Article
Research on Multi-Sensor Data Fusion Based Real-Scene 3D Reconstruction and Digital Twin Visualization Methodology for Coal Mine Tunnels
by Hongda Zhu, Jingjing Jin and Sihai Zhao
Sensors 2025, 25(19), 6153; https://doi.org/10.3390/s25196153 - 4 Oct 2025
Viewed by 953
Abstract
This paper proposes a multi-sensor data-fusion-based method for real-scene 3D reconstruction and digital twin visualization of coal mine tunnels, aiming to address issues such as low accuracy in non-photorealistic modeling and difficulties in feature object recognition during traditional coal mine digitization processes. The research employs cubemap-based mapping technology to project acquired real-time tunnel images onto six faces of a cube, combined with navigation information, pose data, and synchronously acquired point cloud data to achieve spatial alignment and data fusion. On this basis, inner/outer corner detection algorithms are utilized for precise image segmentation, and a point cloud region growing algorithm integrated with information entropy optimization is proposed to realize complete recognition and segmentation of tunnel planes (e.g., roof, floor, left/right sidewalls) and high-curvature feature objects (e.g., ventilation ducts). Furthermore, geometric dimensions extracted from segmentation results are used to construct 3D models, and real-scene images are mapped onto model surfaces via UV (U and V axes of texture coordinate) texture mapping technology, generating digital twin models with authentic texture details. Experimental validation demonstrates that the method performs excellently in both simulated and real coal mine environments, with models capable of faithfully reproducing tunnel spatial layouts and detailed features while supporting multi-view visualization (e.g., bottom view, left/right rotated views, front view). This approach provides efficient and precise technical support for digital twin construction, fine-grained structural modeling, and safety monitoring of coal mine tunnels, significantly enhancing the accuracy and practicality of photorealistic 3D modeling in intelligent mining applications. Full article
(This article belongs to the Section Sensing and Imaging)

19 pages, 4672 KB  
Article
Monocular Visual/IMU/GNSS Integration System Using Deep Learning-Based Optical Flow for Intelligent Vehicle Localization
by Jeongmin Kang
Sensors 2025, 25(19), 6050; https://doi.org/10.3390/s25196050 - 1 Oct 2025
Viewed by 1012
Abstract
Accurate and reliable vehicle localization is essential for autonomous driving in complex outdoor environments. Traditional feature-based visual–inertial odometry (VIO) suffers from sparse features and sensitivity to illumination, limiting robustness in outdoor scenes. Deep learning-based optical flow offers dense and illumination-robust motion cues. However, existing methods rely on simple bidirectional consistency checks that yield unreliable flow in low-texture or ambiguous regions. Global navigation satellite system (GNSS) measurements can complement VIO, but often degrade in urban areas due to multipath interference. This paper proposes a multi-sensor fusion system that integrates monocular VIO with GNSS measurements to achieve robust and drift-free localization. The proposed approach employs a hybrid VIO framework that utilizes a deep learning-based optical flow network, with an enhanced consistency constraint that incorporates local structure and motion coherence to extract robust flow measurements. The extracted optical flow serves as visual measurements, which are then fused with inertial measurements to improve localization accuracy. GNSS updates further enhance global localization stability by mitigating long-term drift. The proposed method is evaluated on the publicly available KITTI dataset. Extensive experiments demonstrate its superior localization performance compared to previous similar methods. The results show that the filter-based multi-sensor fusion framework with optical flow refined by the enhanced consistency constraint ensures accurate and reliable localization in large-scale outdoor environments. Full article
(This article belongs to the Special Issue AI-Driving for Autonomous Vehicles)
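
The baseline bidirectional check that this paper strengthens can be sketched as the standard forward-backward consistency test on dense optical flow; the thresholds follow the common alpha·|f|² + beta form and are assumed values, and the paper's local-structure and motion-coherence enhancement is not reproduced here.

```python
import numpy as np

def fb_consistency_mask(flow_fwd, flow_bwd, alpha=0.01, beta=0.5):
    """Standard forward-backward consistency check: a pixel is reliable if
    following the forward flow and then the backward flow (sampled at the
    displaced location) returns close to the starting point."""
    h, w = flow_fwd.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.clip(np.rint(xs + flow_fwd[..., 0]), 0, w - 1).astype(int)
    yt = np.clip(np.rint(ys + flow_fwd[..., 1]), 0, h - 1).astype(int)
    bwd_at_target = flow_bwd[yt, xt]                       # nearest-neighbour sampling
    diff = np.sum((flow_fwd + bwd_at_target) ** 2, axis=-1)
    bound = alpha * (np.sum(flow_fwd ** 2, -1) + np.sum(bwd_at_target ** 2, -1)) + beta
    return diff < bound                                    # True where flow is trusted

fwd = np.ones((48, 64, 2)) * np.array([2.0, -1.0])         # constant toy flow field
bwd = -fwd                                                 # perfectly consistent reverse flow
print(fb_consistency_mask(fwd, bwd).mean())                # 1.0 -> every pixel passes
```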

20 pages, 18992 KB  
Article
Application of LMM-Derived Prompt-Based AIGC in Low-Altitude Drone-Based Concrete Crack Monitoring
by Shijun Pan, Zhun Fan, Keisuke Yoshida, Shujia Qin, Takashi Kojima and Satoshi Nishiyama
Drones 2025, 9(9), 660; https://doi.org/10.3390/drones9090660 - 21 Sep 2025
Viewed by 718
Abstract
In recent years, large multimodal models (LMMs), such as ChatGPT 4o and DeepSeek R1—artificial intelligence systems capable of multimodal (e.g., image and text) human–computer interaction—have gained traction in industrial and civil engineering applications. Concurrently, insufficient real-world drone-view data (specifically close-distance, high-resolution imagery) for civil engineering scenarios has heightened the importance of artificially generated content (AIGC) or synthetic data as supplementary inputs. AIGC is typically produced via text-to-image generative models (e.g., Stable Diffusion, DALL-E) guided by user-defined prompts. This study leverages LMMs to interpret key parameters for drone-based image generation (e.g., color, texture, scene composition, photographic style) and applies prompt engineering to systematize these parameters. The resulting LMM-generated prompts were used to synthesize training data for a You Only Look Once version 8 segmentation model (YOLOv8-seg). To address the need for detailed crack-distribution mapping in low-altitude drone-based monitoring, the trained YOLOv8-seg model was evaluated on close-distance crack benchmark datasets. The experimental results confirm that LMM-prompted AIGC is a viable supplement for low-altitude drone crack monitoring, achieving >80% classification accuracy (images with/without cracks) at a confidence threshold of 0.5. Full article
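
A toy example of assembling a text-to-image prompt from the parameter slots mentioned above (color, texture, scene composition, photographic style); every slot value is invented for illustration and is not a prompt from the study.

```python
# Illustrative prompt assembly for text-to-image generation (all slot values are
# made-up examples, not prompts from the paper).
def build_prompt(slots):
    order = ["subject", "surface_texture", "color", "scene", "camera", "style"]
    return ", ".join(slots[key] for key in order if key in slots)

crack_slots = {
    "subject": "hairline crack on a concrete bridge pier",
    "surface_texture": "rough weathered concrete with fine aggregate",
    "color": "grey surface, dark crack line",
    "scene": "low-altitude drone view, close distance",
    "camera": "high-resolution photo, sharp focus, natural daylight",
    "style": "photorealistic",
}
print(build_prompt(crack_slots))
```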

21 pages, 8671 KB  
Article
IFE-CMT: Instance-Aware Fine-Grained Feature Enhancement Cross Modal Transformer for 3D Object Detection
by Xiaona Song, Haozhe Zhang, Haichao Liu, Xinxin Wang and Lijun Wang
Sensors 2025, 25(18), 5685; https://doi.org/10.3390/s25185685 - 12 Sep 2025
Viewed by 787
Abstract
In recent years, multi-modal 3D object detection algorithms have experienced significant development. However, current algorithms primarily focus on designing overall fusion strategies for multi-modal features, neglecting finer-grained representations, which leads to a decline in the detection accuracy of small objects. To address this issue, this paper proposes the Instance-aware Fine-grained feature Enhancement Cross Modal Transformer (IFE-CMT) model. We designed an Instance feature Enhancement Module (IE-Module), which can accurately extract object features from multi-modal data and use them to enhance overall features while avoiding view transformations and maintaining low computational overhead. Additionally, we design a new point cloud branch network that effectively expands the network's receptive field, enhancing the model's semantic expression capabilities while preserving texture details of the objects. Experimental results on the nuScenes dataset demonstrate that, compared to the CMT model, our proposed IFE-CMT model improves mAP and NDS by 2.1% and 0.8% on the validation set, respectively. On the test set, it improves mAP and NDS by 1.9% and 0.7%, respectively. Notably, for small object categories such as bicycles and motorcycles, the mAP improved by 6.6% and 3.7%, respectively, significantly enhancing the detection accuracy of small objects. Full article
(This article belongs to the Section Vehicular Sensing)
