Search Results (122)

Search Parameters:
Keywords = low textured scene

31 pages, 3160 KB  
Article
Multimodal Image Segmentation with Dynamic Adaptive Window and Cross-Scale Fusion for Heterogeneous Data Environments
by Qianping He, Meng Wu, Pengchang Zhang, Lu Wang and Quanbin Shi
Appl. Sci. 2025, 15(19), 10813; https://doi.org/10.3390/app151910813 - 8 Oct 2025
Viewed by 146
Abstract
Multi-modal image segmentation is a key task in various fields such as urban planning, infrastructure monitoring, and environmental analysis. However, it remains challenging due to complex scenes, varying object scales, and the integration of heterogeneous data sources (such as RGB, depth maps, and infrared). To address these challenges, we proposed a novel multi-modal segmentation framework, DyFuseNet, which features dynamic adaptive windows and cross-scale feature fusion capabilities. This framework consists of three key components: (1) Dynamic Window Module (DWM), which uses dynamic partitioning and continuous position bias to adaptively adjust window sizes, thereby improving the representation of irregular and fine-grained objects; (2) Scale Context Attention (SCA), a hierarchical mechanism that associates local details with global semantics in a coarse-to-fine manner, enhancing segmentation accuracy in low-texture or occluded regions; and (3) Hierarchical Adaptive Fusion Architecture (HAFA), which aligns and fuses features from multiple modalities through shallow synchronization and deep channel attention, effectively balancing complementarity and redundancy. Evaluated on benchmark datasets (such as ISPRS Vaihingen and Potsdam), DyFuseNet achieved state-of-the-art performance, with mean Intersection over Union (mIoU) scores of 80.40% and 80.85%, surpassing MFTransNet by 1.91% and 1.77%, respectively. The model also demonstrated strong robustness in challenging scenes (such as building edges and shadowed objects), achieving an average F1 score of 85% while maintaining high efficiency (26.19 GFLOPs, 30.09 FPS), making it suitable for real-time deployment. This work presents a practical, versatile, and computationally efficient solution for multi-modal image analysis, with potential applications beyond remote sensing, including smart monitoring, industrial inspection, and multi-source data fusion tasks. Full article
(This article belongs to the Special Issue Signal and Image Processing: From Theory to Applications: 2nd Edition)
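For readers who want a concrete picture of the fusion step, here is a minimal PyTorch sketch of channel-attention-based fusion of an RGB feature map with a depth or infrared feature map, in the spirit of the shallow-alignment plus deep-channel-attention idea described in the abstract. The module name, channel sizes, and structure are illustrative assumptions, not the authors' DyFuseNet code.

```python
# Illustrative sketch (not the authors' code): fuse an RGB feature map with a
# depth/IR feature map via a 1x1 alignment conv followed by channel attention.
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.align = nn.Conv2d(2 * channels, channels, kernel_size=1)  # shallow alignment
        self.gate = nn.Sequential(                                     # deep channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat: torch.Tensor, aux_feat: torch.Tensor) -> torch.Tensor:
        fused = self.align(torch.cat([rgb_feat, aux_feat], dim=1))  # balance complementary cues
        return fused * self.gate(fused)                             # reweight channels, damp redundancy

if __name__ == "__main__":
    rgb, depth = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
    print(ChannelAttentionFusion(64)(rgb, depth).shape)  # torch.Size([1, 64, 32, 32])
```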

25 pages, 12740 KB  
Article
GM-DETR: Infrared Detection of Small UAV Swarm Targets Based on Detection Transformer
by Chenhao Zhu, Xueli Xie, Jianxiang Xi and Xiaogang Yang
Remote Sens. 2025, 17(19), 3379; https://doi.org/10.3390/rs17193379 - 7 Oct 2025
Viewed by 167
Abstract
Infrared object detection is an important prerequisite for small unmanned aerial vehicle (UAV) swarm countermeasures. Owing to the limited imaging area and texture features of small UAV targets, accurate infrared detection of UAV swarm targets is challenging. In this paper, the GM-DETR is proposed for the detection of densely distributed small UAV swarm targets in infrared scenarios. Specifically, high-level and low-level features are fused by the Fine-Grained Context-Aware Fusion module, which augments texture features in the fused feature map. Furthermore, a Supervised Sampling and Sparsification module is proposed as an explicit guiding mechanism, which assists the GM-DETR in focusing on high-quality queries according to the confidence value. The Geometric Relation Encoder is introduced to encode geometric relations among queries, which compensates for the information loss caused by query serialization. In the second stage of the GM-DETR, a long-term memory mechanism is introduced to make UAV detection more stable and distinguishable in motion-blur scenes. In the decoder, the self-attention mechanism is improved by introducing memory blocks as additional decoding information, which enhances the robustness of the GM-DETR. In addition, we constructed a small UAV swarm dataset, UAV Swarm Dataset (USD), which comprises 7000 infrared images of low-altitude UAV swarms, as another contribution. The experimental results on the USD show that the GM-DETR outperforms other state-of-the-art detectors and obtains the best scores (90.6 on AP75 and 63.8 on APS), which demonstrates the effectiveness of the GM-DETR in detecting small UAV targets. The good performance of the GM-DETR on the Drone Vehicle dataset also demonstrates the superiority of the proposed modules in detecting small targets. Full article
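As a rough illustration of the confidence-guided query sparsification idea mentioned above, the following PyTorch snippet keeps only the top-k most confident decoder queries. Tensor shapes, names, and the selection rule are assumptions for illustration, not the paper's Supervised Sampling and Sparsification module.

```python
# Minimal sketch: keep only the k decoder queries with the highest class confidence.
import torch

def select_top_queries(queries: torch.Tensor, class_logits: torch.Tensor, k: int):
    """queries: (N, Q, D); class_logits: (N, Q, C). Keep the k most confident queries."""
    conf = class_logits.sigmoid().max(dim=-1).values          # (N, Q) best-class confidence
    topk = conf.topk(k, dim=1).indices                        # (N, k) indices of kept queries
    idx = topk.unsqueeze(-1).expand(-1, -1, queries.size(-1)) # broadcast to feature dim
    return queries.gather(1, idx), topk

queries, logits = torch.randn(2, 300, 256), torch.randn(2, 300, 80)
kept, kept_idx = select_top_queries(queries, logits, k=100)
print(kept.shape)  # torch.Size([2, 100, 256])
```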

18 pages, 6931 KB  
Article
Research on Multi-Sensor Data Fusion Based Real-Scene 3D Reconstruction and Digital Twin Visualization Methodology for Coal Mine Tunnels
by Hongda Zhu, Jingjing Jin and Sihai Zhao
Sensors 2025, 25(19), 6153; https://doi.org/10.3390/s25196153 - 4 Oct 2025
Viewed by 324
Abstract
This paper proposes a multi-sensor data-fusion-based method for real-scene 3D reconstruction and digital twin visualization of coal mine tunnels, aiming to address issues such as low accuracy in non-photorealistic modeling and difficulties in feature object recognition during traditional coal mine digitization processes. The research employs cubemap-based mapping technology to project acquired real-time tunnel images onto six faces of a cube, combined with navigation information, pose data, and synchronously acquired point cloud data to achieve spatial alignment and data fusion. On this basis, inner/outer corner detection algorithms are utilized for precise image segmentation, and a point cloud region growing algorithm integrated with information entropy optimization is proposed to realize complete recognition and segmentation of tunnel planes (e.g., roof, floor, left/right sidewalls) and high-curvature feature objects (e.g., ventilation ducts). Furthermore, geometric dimensions extracted from segmentation results are used to construct 3D models, and real-scene images are mapped onto model surfaces via UV (U and V axes of texture coordinate) texture mapping technology, generating digital twin models with authentic texture details. Experimental validation demonstrates that the method performs excellently in both simulated and real coal mine environments, with models capable of faithfully reproducing tunnel spatial layouts and detailed features while supporting multi-view visualization (e.g., bottom view, left/right rotated views, front view). This approach provides efficient and precise technical support for digital twin construction, fine-grained structural modeling, and safety monitoring of coal mine tunnels, significantly enhancing the accuracy and practicality of photorealistic 3D modeling in intelligent mining applications. Full article
(This article belongs to the Section Sensing and Imaging)
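The plane and feature segmentation step relies on point cloud region growing; below is a toy NumPy/SciPy version that grows regions from seed points by normal similarity. It omits the paper's information-entropy optimization, and the thresholds, function name, and synthetic data are illustrative assumptions.

```python
# Toy region-growing segmentation: group neighbouring points with similar normals.
import numpy as np
from scipy.spatial import cKDTree

def region_grow(points, normals, radius=0.2, angle_thresh_deg=10.0):
    tree = cKDTree(points)
    labels = np.full(len(points), -1, dtype=int)
    cos_thresh = np.cos(np.deg2rad(angle_thresh_deg))
    current = 0
    for seed in range(len(points)):
        if labels[seed] != -1:
            continue
        stack = [seed]
        labels[seed] = current
        while stack:
            i = stack.pop()
            for j in tree.query_ball_point(points[i], r=radius):
                if labels[j] == -1 and abs(np.dot(normals[i], normals[j])) >= cos_thresh:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Two parallel planar patches, 1 m apart, should end up in separate regions.
pts = np.concatenate([np.random.rand(200, 3) * [1, 1, 0],
                      np.random.rand(200, 3) * [1, 1, 0] + [0, 0, 1]])
nrm = np.tile([0.0, 0.0, 1.0], (400, 1))
print(np.unique(region_grow(pts, nrm)))
```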

19 pages, 4672 KB  
Article
Monocular Visual/IMU/GNSS Integration System Using Deep Learning-Based Optical Flow for Intelligent Vehicle Localization
by Jeongmin Kang
Sensors 2025, 25(19), 6050; https://doi.org/10.3390/s25196050 - 1 Oct 2025
Viewed by 427
Abstract
Accurate and reliable vehicle localization is essential for autonomous driving in complex outdoor environments. Traditional feature-based visual–inertial odometry (VIO) suffers from sparse features and sensitivity to illumination, limiting robustness in outdoor scenes. Deep learning-based optical flow offers dense and illumination-robust motion cues. However, existing methods rely on simple bidirectional consistency checks that yield unreliable flow in low-texture or ambiguous regions. Global navigation satellite system (GNSS) measurements can complement VIO, but often degrade in urban areas due to multipath interference. This paper proposes a multi-sensor fusion system that integrates monocular VIO with GNSS measurements to achieve robust and drift-free localization. The proposed approach employs a hybrid VIO framework that utilizes a deep learning-based optical flow network, with an enhanced consistency constraint that incorporates local structure and motion coherence to extract robust flow measurements. The extracted optical flow serves as visual measurements, which are then fused with inertial measurements to improve localization accuracy. GNSS updates further enhance global localization stability by mitigating long-term drift. The proposed method is evaluated on the publicly available KITTI dataset. Extensive experiments demonstrate its superior localization performance compared to previous similar methods. The results show that the filter-based multi-sensor fusion framework with optical flow refined by the enhanced consistency constraint ensures accurate and reliable localization in large-scale outdoor environments. Full article
(This article belongs to the Special Issue AI-Driving for Autonomous Vehicles)
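The abstract contrasts its enhanced constraint with the simple bidirectional (forward-backward) consistency check used by existing methods; the sketch below shows what such a baseline check looks like in NumPy. Array layouts, the nearest-neighbour lookup, and the tolerance are assumptions for illustration only.

```python
# Sketch of a forward-backward optical flow consistency check (the baseline the
# paper strengthens with local structure and motion coherence).
import numpy as np

def fb_consistency_mask(flow_fw, flow_bw, tol=1.5):
    """flow_fw, flow_bw: (H, W, 2) forward and backward flow in pixels."""
    H, W = flow_fw.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    # Where each pixel lands in the second image (nearest-neighbour lookup for brevity).
    x2 = np.clip(np.round(xs + flow_fw[..., 0]).astype(int), 0, W - 1)
    y2 = np.clip(np.round(ys + flow_fw[..., 1]).astype(int), 0, H - 1)
    # A consistent flow should roughly cancel when chained with the backward flow.
    err = np.linalg.norm(flow_fw + flow_bw[y2, x2], axis=-1)
    return err < tol  # True where the flow measurement is trusted

fw = np.ones((4, 4, 2)); bw = -np.ones((4, 4, 2))
print(fb_consistency_mask(fw, bw).all())  # True: perfectly consistent flows
```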

20 pages, 18992 KB  
Article
Application of LMM-Derived Prompt-Based AIGC in Low-Altitude Drone-Based Concrete Crack Monitoring
by Shijun Pan, Zhun Fan, Keisuke Yoshida, Shujia Qin, Takashi Kojima and Satoshi Nishiyama
Drones 2025, 9(9), 660; https://doi.org/10.3390/drones9090660 - 21 Sep 2025
Viewed by 408
Abstract
In recent years, large multimodal models (LMMs), such as ChatGPT 4o and DeepSeek R1—artificial intelligence systems capable of multimodal (e.g., image and text) human–computer interaction—have gained traction in industrial and civil engineering applications. Concurrently, insufficient real-world drone-view data (specifically close-distance, high-resolution imagery) for civil engineering scenarios has heightened the importance of artificially generated content (AIGC) or synthetic data as supplementary inputs. AIGC is typically produced via text-to-image generative models (e.g., Stable Diffusion, DALL-E) guided by user-defined prompts. This study leverages LMMs to interpret key parameters for drone-based image generation (e.g., color, texture, scene composition, photographic style) and applies prompt engineering to systematize these parameters. The resulting LMM-generated prompts were used to synthesize training data for a You Only Look Once version 8 segmentation model (YOLOv8-seg). To address the need for detailed crack-distribution mapping in low-altitude drone-based monitoring, the trained YOLOv8-seg model was evaluated on close-distance crack benchmark datasets. The experimental results confirm that LMM-prompted AIGC is a viable supplement for low-altitude drone crack monitoring, achieving >80% classification accuracy (images with/without cracks) at a confidence threshold of 0.5. Full article
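To make the prompt-engineering step more tangible, here is a hypothetical Python sketch that expands LMM-derived image parameters (crack appearance, texture, lighting, photographic style) into text-to-image prompts. The template wording and parameter values are invented for illustration and are not the authors' prompts.

```python
# Hypothetical prompt templating: systematize LMM-derived parameters into prompts
# that could be fed to a text-to-image model such as Stable Diffusion.
from itertools import product

TEMPLATE = ("low-altitude drone photo of a concrete surface, {crack}, "
            "{texture} texture, {lighting} lighting, {style} style")

params = {
    "crack":    ["a thin hairline crack", "a branching crack", "no visible crack"],
    "texture":  ["rough", "smooth", "weathered"],
    "lighting": ["overcast", "harsh midday"],
    "style":    ["high-resolution photographic"],
}

prompts = [TEMPLATE.format(crack=c, texture=t, lighting=l, style=s)
           for c, t, l, s in product(*params.values())]
print(len(prompts), prompts[0])  # 18 prompts to drive synthetic data generation
```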

21 pages, 8671 KB  
Article
IFE-CMT: Instance-Aware Fine-Grained Feature Enhancement Cross Modal Transformer for 3D Object Detection
by Xiaona Song, Haozhe Zhang, Haichao Liu, Xinxin Wang and Lijun Wang
Sensors 2025, 25(18), 5685; https://doi.org/10.3390/s25185685 - 12 Sep 2025
Viewed by 445
Abstract
In recent years, multi-modal 3D object detection algorithms have experienced significant development. However, current algorithms primarily focus on designing overall fusion strategies for multi-modal features, neglecting finer-grained representations, which leads to a decline in the detection accuracy of small objects. To address this issue, this paper proposes the Instance-aware Fine-grained feature Enhancement Cross Modal Transformer (IFE-CMT) model. We designed an Instance feature Enhancement Module (IE-Module), which can accurately extract object features from multi-modal data and use them to enhance overall features while avoiding view transformations and maintaining low computational overhead. Additionally, we design a new point cloud branch network that effectively expands the network’s receptive field, enhancing the model’s semantic expression capabilities while preserving texture details of the objects. Experimental results on the nuScenes dataset demonstrate that compared to the CMT model, our proposed IFE-CMT model improves mAP and NDS by 2.1% and 0.8% on the validation set, respectively. On the test set, it improves mAP and NDS by 1.9% and 0.7%, respectively. Notably, for small object categories such as bicycles and motorcycles, the mAP improved by 6.6% and 3.7%, respectively, significantly enhancing the detection accuracy of small objects. Full article
(This article belongs to the Section Vehicular Sensing)
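A rough sketch of the instance-aware enhancement idea follows: pool features inside each instance mask and re-inject the pooled descriptor into the shared feature map. Shapes, the blending weight, and the function name are assumptions rather than the IE-Module's actual design.

```python
# Sketch: masked pooling per instance, then additive re-injection into the feature map.
import torch

def enhance_with_instances(feat, masks, alpha=0.5):
    """feat: (C, H, W) fused feature map; masks: (K, H, W) binary instance masks."""
    out = feat.clone()
    for m in masks:
        area = m.sum()
        if area == 0:
            continue
        inst_vec = (feat * m).sum(dim=(1, 2)) / area       # (C,) per-instance descriptor
        out = out + alpha * inst_vec[:, None, None] * m    # re-inject it inside the mask
    return out

feat = torch.randn(64, 32, 32)
masks = (torch.rand(3, 32, 32) > 0.9).float()
print(enhance_with_instances(feat, masks).shape)  # torch.Size([64, 32, 32])
```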

25 pages, 7964 KB  
Article
DSCSRN: Physically Guided Symmetry-Aware Spatial-Spectral Collaborative Network for Single-Image Hyperspectral Super-Resolution
by Xueli Chang, Jintong Liu, Guotao Wen, Xiaoyu Huang and Meng Yan
Symmetry 2025, 17(9), 1520; https://doi.org/10.3390/sym17091520 - 12 Sep 2025
Viewed by 388
Abstract
Hyperspectral images (HSIs), with their rich spectral information, are widely used in remote sensing; yet the inherent trade-off between spectral and spatial resolution in imaging systems often limits spatial details. Single-image hyperspectral super-resolution (HSI-SR) seeks to recover high-resolution HSIs from a single low-resolution input, but the high dimensionality and spectral redundancy of HSIs make this task challenging. In HSIs, spectral signatures and spatial textures often exhibit intrinsic symmetries, and preserving these symmetries provides additional physical constraints that enhance reconstruction fidelity and robustness. To address these challenges, we propose the Dynamic Spectral Collaborative Super-Resolution Network (DSCSRN), an end-to-end framework that integrates physical modeling with deep learning and explicitly embeds spatial–spectral symmetry priors into the network architecture. DSCSRN processes low-resolution HSIs with a Cascaded Residual Spectral Decomposition Network (CRSDN) to compress redundant channels while preserving spatial structures, generating accurate abundance maps. These maps are refined by two Synergistic Progressive Feature Refinement Modules (SPFRMs), which progressively enhance spatial textures and spectral details via a multi-scale dual-domain collaborative attention mechanism. The Dynamic Endmember Adjustment Module (DEAM) then adaptively updates spectral endmembers according to scene context, overcoming the limitations of fixed-endmember assumptions. Grounded in the Linear Mixture Model (LMM), this unmixing–recovery–reconstruction pipeline restores subtle spectral variations alongside improved spatial resolution. Experiments on the Chikusei, Pavia Center, and CAVE datasets show that DSCSRN outperforms state-of-the-art methods in both perceptual quality and quantitative performance, achieving an average PSNR of 43.42 and a SAM of 1.75 (×4 scale) on Chikusei. The integration of symmetry principles offers a unifying perspective aligned with the intrinsic structure of HSIs, producing reconstructions that are both accurate and structurally consistent. Full article
(This article belongs to the Section Computer)
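Since the unmixing-recovery-reconstruction pipeline is grounded in the Linear Mixture Model, a tiny worked example helps: each pixel spectrum is a non-negative, sum-to-one combination of endmember spectra. The endmember and abundance values below are made up purely for illustration.

```python
# Worked toy example of the Linear Mixture Model: pixel = sum_i a_i * e_i.
import numpy as np

endmembers = np.array([[0.9, 0.7, 0.3, 0.1],    # e.g. vegetation-like spectrum
                       [0.2, 0.3, 0.6, 0.8]])   # e.g. soil-like spectrum (2 x bands)
abundances = np.array([0.75, 0.25])             # fractions for one pixel, sum to 1

pixel = abundances @ endmembers                 # mixed spectrum observed at that pixel
print(pixel)                                    # [0.725 0.6   0.375 0.275]
```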

27 pages, 51271 KB  
Article
Surface Damage Detection and Analysis for Reduction-Fired Cyan Square Bricks in Jiangnan Gardens via YOLOv12
by Lina Yan, Yile Chen, Xingkang Jia and Liang Zheng
Coatings 2025, 15(9), 1066; https://doi.org/10.3390/coatings15091066 - 11 Sep 2025
Viewed by 453
Abstract
As an outstanding UNESCO World Heritage Site, the Jiangnan gardens feature both exquisite and fragile components. Reduction-fired cyan square bricks, serving as crucial paving materials, are exposed over long periods to natural and anthropogenic factors, making them prone to various types of surface damage and urgently requiring efficient, non-destructive detection methods to support scientific conservation. Traditional manual inspection methods suffer from low efficiency, strong subjectivity, and potential disturbance to the fragile heritage structures. This study focuses on developing an intelligent detection method based on advanced computer vision, employing the YOLOv12 object detection model to achieve non-contact, automated identification of typical tile surface damage types in the Jiangnan gardens (such as cracking, stains, water stains, and wear). A total of 691 images of reduction-fired cyan square bricks collected on-site were used as training samples. The main conclusions of this study are as follows: (1) By constructing a dataset containing multiple samples and multiple scenes of reduction-fired cyan square brick images in Jiangnan gardens, the YOLOv12 model was trained and optimized, enabling it to accurately identify subtle damage features under complex texture backgrounds. (2) Overall indicators: Through the comparison of the confusion matrices of the four key training nodes, model C (the 159th epoch, highest mAP50–95) has the most balanced overall performance in multiple categories, with an accuracy of 0.73 for cracking, 0.77 for wear, 0.60 for water stain, and 0.65 for stains, which can meet basic detection requirements. (3) Difficulty of discrimination: Compared with stains and water stains, cracking and wear are easier to distinguish. Experimental results indicate that the detection method is feasible and effective in identifying the surface damage types of reduction-fired cyan square bricks in Jiangnan gardens. This research provides a practical and efficient “surface technology” solution for the preventive protection of cultural heritage, contributing to the sustainable preservation and management of world heritage. Full article
(This article belongs to the Special Issue Solid Surfaces, Defects and Detection, 2nd Edition)
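The per-class scores quoted in conclusion (2) are read off a confusion matrix (rows = true class, columns = predicted class); the snippet below shows the calculation as diagonal counts over row sums. The counts are invented, chosen only so the resulting rates match the figures quoted above.

```python
# How per-class accuracy (recall) is computed from a confusion matrix; counts are invented.
import numpy as np

classes = ["cracking", "stains", "water stain", "wear"]
cm = np.array([[73,  8, 10,  9],     # hypothetical counts, each row sums to 100
               [10, 65, 15, 10],
               [ 8, 20, 60, 12],
               [ 9,  6,  8, 77]])

per_class = cm.diagonal() / cm.sum(axis=1)
for name, r in zip(classes, per_class):
    print(f"{name}: {r:.2f}")   # cracking 0.73, stains 0.65, water stain 0.60, wear 0.77
```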

29 pages, 3367 KB  
Article
Small Object Detection in Synthetic Aperture Radar with Modular Feature Encoding and Vectorized Box Regression
by Xinmiao Du and Xihong Wu
Remote Sens. 2025, 17(17), 3094; https://doi.org/10.3390/rs17173094 - 5 Sep 2025
Viewed by 1056
Abstract
Object detection in synthetic aperture radar (SAR) imagery poses significant challenges due to low resolution, small objects, arbitrary orientations, and complex backgrounds. Standard object detectors often fail to capture sufficient semantic and geometric cues for such tiny targets. To address this issue, a new Convolutional Neural Network (CNN) framework called Deformable Vectorized Detection Network (DVDNet) has been proposed, specifically designed for detecting small, oriented, and densely packed objects in SAR images. The DVDNet consists of Grouped-Deformable Convolution for adaptive receptive field adjustment to diverse object scales, a Local Binary Pattern (LBP) Enhancement Module that enriches texture representations and enhances the visibility of small or camouflaged objects, and a Vector Decomposition Module that enables accurate regression of oriented bounding boxes via learnable geometric vectors. The DVDNet is embedded in a two-stage detection architecture and is particularly effective in preserving fine-grained features critical for small object localization. The performance of DVDNet is validated on two SAR small target detection datasets, HRSID and SSDD, and experiments demonstrate that it achieves 90.9% mAP on HRSID and 87.2% mAP on SSDD. The generalizability of DVDNet was also verified on the self-built SAR ship dataset and the remote sensing optical dataset HRSC2016. All these experiments show that DVDNet outperforms standard detectors. Notably, our framework shows substantial gains in precision and recall for small object subsets, validating the importance of combining deformable sampling, texture enhancement, and vector-based box representation for high-fidelity small object detection in complex SAR scenes. Full article
(This article belongs to the Special Issue Deep Learning Techniques and Applications of MIMO Radar Theory)
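The LBP Enhancement Module builds on the classical Local Binary Pattern texture descriptor; for reference, here is a plain NumPy version of the 8-neighbour LBP. It is a generic textbook formulation, not the module's implementation.

```python
# Minimal 8-neighbour Local Binary Pattern on a grayscale image.
import numpy as np

def lbp(img):
    """img: 2-D grayscale array; returns an LBP code map for the interior pixels."""
    c = img[1:-1, 1:-1]
    neighbours = [img[0:-2, 0:-2], img[0:-2, 1:-1], img[0:-2, 2:],
                  img[1:-1, 2:],   img[2:,   2:],   img[2:,   1:-1],
                  img[2:,   0:-2], img[1:-1, 0:-2]]
    code = np.zeros_like(c, dtype=int)
    for bit, n in enumerate(neighbours):          # each neighbour contributes one bit
        code |= (n >= c).astype(int) << bit
    return code

img = np.random.randint(0, 256, (8, 8)).astype(np.float32)
print(lbp(img).shape)  # (6, 6) LBP codes in [0, 255]
```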

22 pages, 8901 KB  
Article
D3Fusion: Decomposition–Disentanglement–Dynamic Compensation Framework for Infrared-Visible Image Fusion in Extreme Low-Light
by Wansi Yang, Yi Liu and Xiaotian Chen
Appl. Sci. 2025, 15(16), 8918; https://doi.org/10.3390/app15168918 - 13 Aug 2025
Viewed by 629
Abstract
Infrared-visible image fusion quality is critical for nighttime perception in autonomous driving and surveillance but suffers severe degradation under extreme low-light conditions, including irreversible texture loss in visible images, thermal boundary diffusion artifacts, and overexposure under dynamic non-uniform illumination. To address these challenges, a Decomposition–Disentanglement–Dynamic Compensation framework, D3Fusion, is proposed. Firstly, a Retinex-inspired Decomposition Illumination Net (DIN) decomposes inputs into enhanced images and degradative illumination maps for joint low-light recovery. Secondly, an illumination-guided encoder and a multi-scale differential compensation decoder dynamically balance cross-modal features. Finally, a progressive three-stage training paradigm from illumination correction through feature disentanglement to adaptive fusion resolves optimization conflicts. Compared to state-of-the-art methods on the LLVIP, TNO, MSRS, and RoadScene datasets, D3Fusion achieves an average improvement of 1.59% in standard deviation (SD), 6.9% in spatial frequency (SF), 2.59% in edge intensity (EI), and 1.99% in visual information fidelity (VIF), demonstrating superior performance in extreme low-light scenarios. The framework effectively suppresses thermal diffusion artifacts while mitigating exposure imbalance, adaptively brightening scenes while preserving texture details in shadowed regions. This significantly improves fusion quality for nighttime images by enhancing salient information, establishing a robust solution for multimodal perception under illumination-critical conditions. Full article
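As background for the Retinex-inspired decomposition, the toy NumPy sketch below splits an image into illumination and reflectance, using a Gaussian estimate of illumination in place of the learned Decomposition Illumination Net. The smoothing scale, clipping, and brightening rule are illustrative assumptions, not the DIN.

```python
# Toy single-scale Retinex-style decomposition: image = reflectance * illumination.
import numpy as np
from scipy.ndimage import gaussian_filter

def retinex_decompose(img, sigma=15.0, eps=1e-6):
    """img: 2-D array in [0, 1]; returns (reflectance, illumination)."""
    illumination = gaussian_filter(img, sigma=sigma) + eps   # smooth, low-frequency light
    reflectance = np.clip(img / illumination, 0.0, 3.0)      # texture / detail component
    return reflectance, illumination

img = np.random.rand(64, 64) * 0.2                 # a dim, low-light image
refl, illum = retinex_decompose(img)
enhanced = np.clip(refl * np.sqrt(illum), 0, 1)    # brighten by compressing illumination
print(enhanced.mean() > img.mean())                # True: the scene is brightened
```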

31 pages, 47425 KB  
Article
T360Fusion: Temporal 360 Multimodal Fusion for 3D Object Detection via Transformers
by Khanh Bao Tran, Alexander Carballo and Kazuya Takeda
Sensors 2025, 25(16), 4902; https://doi.org/10.3390/s25164902 - 8 Aug 2025
Viewed by 788
Abstract
Object detection plays a significant role in various industrial and scientific domains, particularly in autonomous driving. It enables vehicles to detect surrounding objects, construct spatial maps, and facilitate safe navigation. To accomplish these tasks, a variety of sensors have been employed, including LiDAR, radar, RGB cameras, and ultrasonic sensors. Among these, LiDAR and RGB cameras are frequently utilized due to their advantages. RGB cameras offer high-resolution images with rich color and texture information but tend to underperform in low light or adverse weather conditions. In contrast, LiDAR provides precise 3D geometric data irrespective of lighting conditions, although it lacks the high spatial resolution of cameras. Recently, thermal cameras have gained significant attention in both standalone applications and in combination with RGB cameras. They offer strong perception capabilities under low-visibility conditions or adverse weather conditions. Multimodal sensor fusion effectively overcomes individual sensor limitations. In this paper, we propose a novel multimodal fusion method that integrates LiDAR, a 360 RGB camera, and a 360 thermal camera to fully leverage the strengths of each modality. Our method employs a feature-level fusion strategy that temporally accumulates and synchronizes multiple LiDAR frames. This design not only improves the detection accuracy but also enhances the spatial coverage and robustness. The use of 360 images significantly reduces blind spots and provides comprehensive environmental awareness, which is especially beneficial in complex or dynamic scenes. Full article
(This article belongs to the Section Sensing and Imaging)
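The temporal accumulation of LiDAR frames amounts to transforming past point clouds into the current frame with known ego poses before fusion; the NumPy sketch below shows that step. The pose values, point counts, and function name are invented for illustration.

```python
# Sketch of temporal LiDAR accumulation: move past frames into the current frame.
import numpy as np

def accumulate(frames, poses_to_current):
    """frames: list of (N_i, 3) points; poses_to_current: list of 4x4 transforms."""
    merged = []
    for pts, T in zip(frames, poses_to_current):
        homo = np.hstack([pts, np.ones((len(pts), 1))])   # (N_i, 4) homogeneous coords
        merged.append((homo @ T.T)[:, :3])                # transform into current frame
    return np.vstack(merged)

T_prev = np.eye(4); T_prev[0, 3] = 1.5      # previous frame: ego moved 1.5 m forward
frames = [np.random.rand(100, 3), np.random.rand(120, 3)]
cloud = accumulate(frames, [T_prev, np.eye(4)])
print(cloud.shape)  # (220, 3) points expressed in the current LiDAR frame
```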

17 pages, 2072 KB  
Article
Barefoot Footprint Detection Algorithm Based on YOLOv8-StarNet
by Yujie Shen, Xuemei Jiang, Yabin Zhao and Wenxin Xie
Sensors 2025, 25(15), 4578; https://doi.org/10.3390/s25154578 - 24 Jul 2025
Viewed by 659
Abstract
This study proposes an optimized footprint recognition model based on an enhanced StarNet architecture for biometric identification in the security, medical, and criminal investigation fields. Conventional image recognition algorithms exhibit limitations in processing barefoot footprint images characterized by concentrated feature distributions and rich texture patterns. To address this, our framework integrates an improved StarNet into the backbone of the YOLOv8 architecture. Leveraging the unique advantages of element-wise multiplication, the redesigned backbone efficiently maps inputs to a high-dimensional nonlinear feature space without increasing channel dimensions, achieving enhanced representational capacity with low computational latency. Subsequently, an Encoder layer facilitates feature interaction within the backbone through multi-scale feature fusion and attention mechanisms, effectively extracting rich semantic information while maintaining computational efficiency. In the feature fusion part, a feature modulation block processes multi-scale features by synergistically combining global and local information, thereby reducing redundant computations and decreasing both parameter count and computational complexity to achieve model lightweighting. Experimental evaluations on a proprietary barefoot footprint dataset demonstrate that the proposed model exhibits significant advantages in terms of parameter efficiency, recognition accuracy, and computational complexity. The number of parameters has been reduced by 0.73 million, further improving the model’s speed. GFLOPs have been reduced by 1.5, lowering the performance requirements for computational hardware during model deployment. Recognition accuracy has reached 99.5%, with further improvements in model precision. Future research will explore how to capture shoeprint images with complex backgrounds from shoes worn at crime scenes, aiming to further enhance the model’s recognition capabilities in more forensic scenarios. Full article
(This article belongs to the Special Issue Transformer Applications in Target Tracking)
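The element-wise multiplication ("star") operation at the heart of the redesigned backbone can be sketched as two pointwise-convolution branches whose outputs are multiplied before projection back to the input width; the widths and activations below are assumptions, not the paper's exact block.

```python
# Minimal "star" block: two linear branches combined by element-wise multiplication.
import torch
import torch.nn as nn

class StarBlock(nn.Module):
    def __init__(self, channels: int, expand: int = 3):
        super().__init__()
        self.f1 = nn.Conv2d(channels, channels * expand, 1)
        self.f2 = nn.Conv2d(channels, channels * expand, 1)
        self.proj = nn.Conv2d(channels * expand, channels, 1)
        self.act = nn.ReLU6()

    def forward(self, x):
        # The element-wise product of the two branches implicitly maps the input into
        # a high-dimensional nonlinear space without widening the channels kept around.
        return x + self.proj(self.act(self.f1(x)) * self.f2(x))

x = torch.randn(1, 32, 16, 16)
print(StarBlock(32)(x).shape)  # torch.Size([1, 32, 16, 16])
```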

17 pages, 1927 KB  
Article
ConvTransNet-S: A CNN-Transformer Hybrid Disease Recognition Model for Complex Field Environments
by Shangyun Jia, Guanping Wang, Hongling Li, Yan Liu, Linrong Shi and Sen Yang
Plants 2025, 14(15), 2252; https://doi.org/10.3390/plants14152252 - 22 Jul 2025
Viewed by 1004
Abstract
To address the challenges of low recognition accuracy and substantial model complexity in crop disease identification models operating in complex field environments, this study proposed a novel hybrid model named ConvTransNet-S, which integrates Convolutional Neural Networks (CNNs) and transformers for crop disease identification tasks. Unlike existing hybrid approaches, ConvTransNet-S uniquely introduces three key innovations: First, a Local Perception Unit (LPU) and Lightweight Multi-Head Self-Attention (LMHSA) modules were introduced to synergistically enhance the extraction of fine-grained plant disease details and model global dependency relationships, respectively. Second, an Inverted Residual Feed-Forward Network (IRFFN) was employed to optimize the feature propagation path, thereby enhancing the model’s robustness against interferences such as lighting variations and leaf occlusions. This novel combination of an LPU, LMHSA, and an IRFFN achieves a dynamic equilibrium between local texture perception and global context modeling, effectively resolving the trade-offs inherent in standalone CNNs or transformers. Finally, through a phased architecture design, efficient fusion of multi-scale disease features is achieved, which enhances feature discriminability while reducing model complexity. The experimental results indicated that ConvTransNet-S achieved a recognition accuracy of 98.85% on the PlantVillage public dataset. This model operates with only 25.14 million parameters, a computational load of 3.762 GFLOPs, and an inference time of 7.56 ms. Testing on a self-built in-field complex scene dataset comprising 10,441 images revealed that ConvTransNet-S achieved an accuracy of 88.53%, which represents improvements of 14.22%, 2.75%, and 0.34% over EfficientNetV2, Vision Transformer, and Swin Transformer, respectively. Furthermore, the ConvTransNet-S model achieved up to 14.22% higher disease recognition accuracy under complex background conditions while reducing the parameter count by 46.8%. This confirms that its unique multi-scale feature mechanism can effectively distinguish disease from background features, providing a novel technical approach for disease diagnosis in complex agricultural scenarios and demonstrating significant application value for intelligent agricultural management. Full article
(This article belongs to the Section Plant Modeling)
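For concreteness, here is a generic inverted-residual feed-forward block of the kind the IRFFN name suggests (expand, depthwise convolution, project, plus a skip connection). The expansion ratio and activations are assumptions, not the paper's exact layer configuration.

```python
# Illustrative inverted-residual feed-forward block (expand -> depthwise -> project).
import torch
import torch.nn as nn

class IRFFN(nn.Module):
    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1), nn.GELU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden), nn.GELU(),  # depthwise
            nn.Conv2d(hidden, channels, 1),
        )

    def forward(self, x):
        return x + self.block(x)  # residual path preserves local texture information

x = torch.randn(1, 48, 14, 14)
print(IRFFN(48)(x).shape)  # torch.Size([1, 48, 14, 14])
```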

27 pages, 1868 KB  
Article
SAM2-DFBCNet: A Camouflaged Object Detection Network Based on the Heira Architecture of SAM2
by Cao Yuan, Libang Liu, Yaqin Li and Jianxiang Li
Sensors 2025, 25(14), 4509; https://doi.org/10.3390/s25144509 - 21 Jul 2025
Viewed by 981
Abstract
Camouflaged Object Detection (COD) aims to segment objects that are highly integrated with their background, presenting significant challenges such as low contrast, complex textures, and blurred boundaries. Existing deep learning methods often struggle to achieve robust segmentation under these conditions. To address these limitations, this paper proposes a novel COD network, SAM2-DFBCNet, built upon the SAM2 Hiera architecture. Our network incorporates three key modules: (1) the Camouflage-Aware Context Enhancement Module (CACEM), which fuses local and global features through an attention mechanism to enhance contextual awareness in low-contrast scenes; (2) the Cross-Scale Feature Interaction Bridge (CSFIB), which employs a bidirectional convolutional GRU for the dynamic fusion of multi-scale features, effectively mitigating representation inconsistencies caused by complex textures and deformations; and (3) the Dynamic Boundary Refinement Module (DBRM), which combines channel and spatial attention mechanisms to optimize boundary localization accuracy and enhance segmentation details. Extensive experiments on three public datasets—CAMO, COD10K, and NC4K—demonstrate that SAM2-DFBCNet outperforms twenty state-of-the-art methods, achieving maximum improvements of 7.4%, 5.78%, and 4.78% in key metrics such as S-measure (Sα), F-measure (Fβ), and mean E-measure (Eϕ), respectively, while reducing the Mean Absolute Error (M) by 37.8%. These results validate the superior performance and robustness of our approach in complex camouflage scenarios. Full article
(This article belongs to the Special Issue Transformer Applications in Target Tracking)
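The Dynamic Boundary Refinement Module combines channel and spatial attention; the PyTorch sketch below shows a generic CBAM-style combination of the two, offered only as an illustration of those ingredients rather than the DBRM itself. Kernel sizes and the reduction ratio are assumptions.

```python
# Generic channel + spatial attention (CBAM-style), illustrating the two mechanisms.
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        self.spatial = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel(x)                                   # reweight channels
        pooled = torch.cat([x.mean(1, keepdim=True),
                            x.amax(1, keepdim=True)], dim=1)      # avg + max spatial maps
        return x * self.spatial(pooled)                           # emphasize boundary regions

x = torch.randn(1, 32, 24, 24)
print(ChannelSpatialAttention(32)(x).shape)  # torch.Size([1, 32, 24, 24])
```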

23 pages, 10392 KB  
Article
Dual-Branch Luminance–Chrominance Attention Network for Hydraulic Concrete Image Enhancement
by Zhangjun Peng, Li Li, Chuanhao Chang, Rong Tang, Guoqiang Zheng, Mingfei Wan, Juanping Jiang, Shuai Zhou, Zhenggang Tian and Zhigui Liu
Appl. Sci. 2025, 15(14), 7762; https://doi.org/10.3390/app15147762 - 10 Jul 2025
Cited by 1 | Viewed by 489
Abstract
Hydraulic concrete is a critical infrastructure material, with its surface condition playing a vital role in quality assessments for water conservancy and hydropower projects. However, images taken in complex hydraulic environments often suffer from degraded quality due to low lighting, shadows, and noise, making it difficult to distinguish defects from the background and thereby hindering accurate defect detection and damage evaluation. In this study, following systematic analyses of hydraulic concrete color space characteristics, we propose a Dual-Branch Luminance–Chrominance Attention Network (DBLCANet-HCIE) specifically designed for low-light hydraulic concrete image enhancement. Inspired by human visual perception, the network simultaneously improves global contrast and preserves fine-grained defect textures, which are essential for structural analysis. The proposed architecture consists of a Luminance Adjustment Branch (LAB) and a Chroma Restoration Branch (CRB). The LAB incorporates a Luminance-Aware Hybrid Attention Block (LAHAB) to capture both the global luminance distribution and local texture details, enabling adaptive illumination correction through comprehensive scene understanding. The CRB integrates a Channel Denoiser Block (CDB) for channel-specific noise suppression and a Frequency-Domain Detail Enhancement Block (FDDEB) to refine chrominance information and enhance subtle defect textures. A feature fusion block is designed to fuse and learn the features of the outputs from the two branches, resulting in images with enhanced luminance, reduced noise, and preserved surface anomalies. To validate the proposed approach, we construct a dedicated low-light hydraulic concrete image dataset (LLHCID). Extensive experiments conducted on both LOLv1 and LLHCID benchmarks demonstrate that the proposed method significantly enhances the visual interpretability of hydraulic concrete surfaces while effectively addressing low-light degradation challenges. Full article
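To illustrate the luminance-branch input and the frequency-domain detail idea, the NumPy sketch below converts RGB to a luminance channel with the standard BT.601 weights and amplifies its high-frequency content with an FFT mask. The cutoff radius and gain are arbitrary assumptions, not the FDDEB's design.

```python
# Toy frequency-domain detail boost on the luminance channel.
import numpy as np

def boost_high_freq(luma, gain=1.5, radius=8):
    """Amplify frequencies outside a low-pass disc of the given radius."""
    F = np.fft.fftshift(np.fft.fft2(luma))
    H, W = luma.shape
    yy, xx = np.mgrid[0:H, 0:W]
    dist = np.hypot(yy - H / 2, xx - W / 2)
    F *= np.where(dist > radius, gain, 1.0)          # keep low freqs, boost fine details
    return np.real(np.fft.ifft2(np.fft.ifftshift(F)))

rgb = np.random.rand(64, 64, 3)
luma = rgb @ np.array([0.299, 0.587, 0.114])          # luminance channel (BT.601 weights)
print(boost_high_freq(luma).shape)                    # (64, 64)
```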