Search Results (15)

Search Parameters:
Keywords = visual attention mamba

21 pages, 5527 KiB  
Article
SGNet: A Structure-Guided Network with Dual-Domain Boundary Enhancement and Semantic Fusion for Skin Lesion Segmentation
by Haijiao Yun, Qingyu Du, Ziqing Han, Mingjing Li, Le Yang, Xinyang Liu, Chao Wang and Weitian Ma
Sensors 2025, 25(15), 4652; https://doi.org/10.3390/s25154652 - 27 Jul 2025
Viewed by 307
Abstract
Segmentation of skin lesions in dermoscopic images is critical for the accurate diagnosis of skin cancers, particularly malignant melanoma, yet it is hindered by irregular lesion shapes, blurred boundaries, low contrast, and artifacts such as hair interference. Conventional deep learning methods, typically based on UNet or Transformer architectures, often fail to fully exploit lesion features and incur high computational costs, compromising precise lesion delineation. To overcome these challenges, we propose SGNet, a structure-guided network that integrates a hybrid CNN–Mamba framework for robust skin lesion segmentation. SGNet employs the Visual Mamba (VMamba) encoder to efficiently extract multi-scale features, followed by the Dual-Domain Boundary Enhancer (DDBE), which refines boundary representations and suppresses noise through spatial and frequency-domain processing. The Semantic-Texture Fusion Unit (STFU) adaptively integrates low-level texture with high-level semantic features, while the Structure-Aware Guidance Module (SAGM) generates coarse segmentation maps to provide global structural guidance. The Guided Multi-Scale Refiner (GMSR) further optimizes boundary details through a multi-scale semantic attention mechanism. Comprehensive experiments on the ISIC2017, ISIC2018, and PH2 datasets demonstrate SGNet’s superior performance, with average improvements of 3.30% in mean Intersection over Union (mIoU) and 1.77% in Dice Similarity Coefficient (DSC) over state-of-the-art methods. Ablation studies confirm the effectiveness of each component, highlighting SGNet’s exceptional accuracy and robust generalization for computer-aided dermatological diagnosis.
(This article belongs to the Section Biomedical Sensors)
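
As a rough illustration of the dual-domain idea behind the DDBE, the sketch below boosts the high-frequency (boundary and texture) content of a feature map with an FFT-based high-pass residual. The mask shape, the `alpha` scale, and the `frequency_boundary_enhance` name are illustrative assumptions, not SGNet’s published implementation.

```python
import torch


def frequency_boundary_enhance(feat: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Emphasize high-frequency (boundary/texture) content of a (B, C, H, W) feature map.

    A generic frequency-domain enhancement sketch: high-pass the spectrum and add the
    result back as a residual, scaled by `alpha`.
    """
    B, C, H, W = feat.shape
    spec = torch.fft.rfft2(feat, norm="ortho")                       # complex spectrum
    # Smooth high-pass mask: zero at DC, growing with radial frequency.
    fy = torch.fft.fftfreq(H, device=feat.device).abs().view(1, 1, H, 1)
    fx = torch.fft.rfftfreq(W, device=feat.device).abs().view(1, 1, 1, W // 2 + 1)
    highpass = torch.sqrt(fy ** 2 + fx ** 2)
    highpass = highpass / highpass.max()
    high = torch.fft.irfft2(spec * highpass, s=(H, W), norm="ortho")
    return feat + alpha * high                                       # boost edges, keep content


if __name__ == "__main__":
    x = torch.randn(2, 16, 64, 64)
    print(frequency_boundary_enhance(x).shape)                       # torch.Size([2, 16, 64, 64])
```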

24 pages, 8344 KiB  
Article
Research and Implementation of Travel Aids for Blind and Visually Impaired People
by Jun Xu, Shilong Xu, Mingyu Ma, Jing Ma and Chuanlong Li
Sensors 2025, 25(14), 4518; https://doi.org/10.3390/s25144518 - 21 Jul 2025
Viewed by 341
Abstract
Blind and visually impaired (BVI) people face significant challenges in perception, navigation, and safety during travel. Existing infrastructure (e.g., blind lanes) and traditional aids (e.g., walking sticks, basic audio feedback) provide limited flexibility and interactivity in complex environments. To solve this problem, we propose a real-time travel assistance system based on deep learning. The hardware comprises an NVIDIA Jetson Nano controller, an Intel D435i depth camera for environmental sensing, and SG90 servo motors for feedback. To address the computational constraints of embedded devices, we developed a lightweight object detection and segmentation algorithm. Key innovations include a multi-scale attention feature extraction backbone, a dual-stream fusion module incorporating the Mamba architecture, and adaptive context-aware detection/segmentation heads. This design ensures high computational efficiency and real-time performance. The system workflow is as follows: (1) the D435i captures real-time environmental data; (2) the processor analyzes these data, converting obstacle distances and path deviations into electrical signals; (3) servo motors deliver vibratory feedback for guidance and alerts. Preliminary tests confirm that the system can effectively detect obstacles and correct path deviations in real time, suggesting its potential to assist BVI users. However, as this is a work in progress, comprehensive field trials with BVI participants are required to fully validate its efficacy.
(This article belongs to the Section Intelligent Sensors)
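
A minimal sketch of workflow steps (2)–(3), assuming a simple linear mapping from obstacle distance and lateral path deviation to vibration commands; the thresholds and the `haptic_feedback` helper are hypothetical, not taken from the paper.

```python
def haptic_feedback(obstacle_dist_m: float, path_deviation_m: float,
                    max_dist: float = 3.0, max_dev: float = 1.0) -> dict:
    """Map sensed obstacle distance and lateral path deviation to vibration commands.

    Returns intensities in [0, 1]: closer obstacles and larger deviations vibrate harder.
    The 3 m sensing range and 1 m deviation limit are illustrative assumptions.
    """
    obstacle_level = max(0.0, min(1.0, 1.0 - obstacle_dist_m / max_dist))
    deviation_level = max(0.0, min(1.0, abs(path_deviation_m) / max_dev))
    side = "left" if path_deviation_m > 0 else "right"   # vibrate on the side to correct toward
    return {"obstacle_alert": obstacle_level, "steer_side": side, "steer_strength": deviation_level}


if __name__ == "__main__":
    # An obstacle 0.8 m ahead while drifting 0.4 m to the left of the planned path.
    print(haptic_feedback(0.8, +0.4))
```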

36 pages, 25361 KiB  
Article
Remote Sensing Image Compression via Wavelet-Guided Local Structure Decoupling and Channel–Spatial State Modeling
by Jiahui Liu, Lili Zhang and Xianjun Wang
Remote Sens. 2025, 17(14), 2419; https://doi.org/10.3390/rs17142419 - 12 Jul 2025
Viewed by 467
Abstract
As the resolution and data volume of remote sensing imagery continue to grow, achieving efficient compression without sacrificing reconstruction quality remains a major challenge: traditional handcrafted codecs often fail to balance rate-distortion performance and computational complexity, while deep learning-based approaches offer superior representational capacity yet still struggle to balance fine-detail adaptation and computational efficiency. Mamba, a state-space model (SSM)-based architecture, offers linear-time complexity and excels at capturing long-range dependencies in sequences, and it has been adopted in remote sensing compression tasks to model long-distance dependencies between pixels. However, despite its effectiveness in global context aggregation, Mamba’s uniform bidirectional scanning is insufficient for capturing high-frequency structures such as edges and textures. Moreover, existing visual state-space (VSS) models built upon Mamba typically treat all channels equally and lack mechanisms to dynamically focus on semantically salient spatial regions. To address these issues, we present an innovative architecture for remote sensing image compression, the Multi-scale Channel Global Mamba Network (MGMNet). MGMNet integrates a spatial–channel dynamic weighting mechanism into the Mamba architecture, enhancing global semantic modeling while selectively emphasizing informative features. It comprises two key modules. The Wavelet Transform-guided Local Structure Decoupling (WTLS) module applies multi-scale wavelet decomposition to disentangle and separately encode low- and high-frequency components, enabling efficient parallel modeling of global contours and local textures. The Channel–Global Information Modeling (CGIM) module enhances conventional VSS by introducing a dual-path attention strategy that reweights spatial and channel information, improving the modeling of long-range dependencies and edge structures. Extensive evaluations on three distinct remote sensing datasets show that MGMNet outperforms current state-of-the-art models across various performance metrics.
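
The low/high-frequency decoupling behind the WTLS module can be illustrated with a plain 2D wavelet decomposition. The sketch below (using PyWavelets, with the Haar wavelet and two levels as assumed defaults) splits an image into a low-frequency contour component and a high-frequency detail residual; it is a stand-in for the idea, not the paper’s exact module.

```python
import numpy as np
import pywt


def wavelet_split(image: np.ndarray, wavelet: str = "haar", level: int = 2):
    """Split an image into low-frequency contours and high-frequency details.

    Returns (lowpass, highpass), both the same shape as `image`.
    """
    coeffs = pywt.wavedec2(image, wavelet=wavelet, level=level)
    approx, details = coeffs[0], coeffs[1:]
    # Low-frequency branch: keep only the approximation, zero every detail sub-band.
    low = pywt.waverec2([approx] + [tuple(np.zeros_like(d) for d in ds) for ds in details], wavelet)
    low = low[: image.shape[0], : image.shape[1]]
    # High-frequency branch: the residual carries edges and textures.
    return low, image - low


if __name__ == "__main__":
    img = np.random.rand(128, 128).astype(np.float32)
    lo, hi = wavelet_split(img)
    print(lo.shape, hi.shape)   # (128, 128) (128, 128)
```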

18 pages, 4391 KiB  
Article
UWMambaNet: Dual-Branch Underwater Image Reconstruction Based on W-Shaped Mamba
by Yuhan Zhang, Xinyang Yu and Zhanchuan Cai
Mathematics 2025, 13(13), 2153; https://doi.org/10.3390/math13132153 - 30 Jun 2025
Viewed by 276
Abstract
Underwater image enhancement is a challenging task due to the unique optical properties of water, which often lead to color distortion, low contrast, and detail loss. Existing CNN-based methods suffer from insufficient global attention, while Transformer-based methods generally incur quadratic complexity. To address this challenge, we propose a dual-branch network architecture based on a W-shaped Mamba: UWMambaNet. Our method integrates a color contrast enhancement branch and a detail enhancement branch, each dedicated to improving specific aspects of underwater images. The color contrast enhancement branch utilizes the RGB and Lab color spaces and uses the Mamba block for advanced feature fusion to enhance color fidelity and contrast. The detail enhancement branch adopts a multi-scale feature extraction strategy to capture fine and contextual details through parallel convolutional paths. The Mamba module is added to both branches, and state-space modeling is used to capture the long-range dependencies and spatial relationships in the image data, enabling effective modeling of the complex interactions and light propagation effects inherent in the underwater environment. Experimental results show that our method significantly improves the visual quality of underwater images and outperforms existing methods in both quantitative metrics and visual comparisons; compared to the best candidate models on the UIEB and EUVP datasets, UWMambaNet improves UCIQE by 3.7% and 2.4%, respectively.
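
A small sketch of how the color branch’s RGB + Lab input might be assembled, assuming scikit-image for the color conversion and a simple per-channel rescaling; the exact preprocessing used by UWMambaNet is not specified in the abstract.

```python
import torch
from skimage import color


def rgb_lab_stack(rgb: torch.Tensor) -> torch.Tensor:
    """Build a color-branch input by stacking RGB with its Lab conversion.

    rgb: (B, 3, H, W) in [0, 1]. Returns (B, 6, H, W) with channels [R, G, B, L, a, b],
    with Lab rescaled to roughly [0, 1] so both color spaces share a similar range.
    """
    lab_imgs = []
    for img in rgb:                                                    # per-image skimage conversion
        lab = color.rgb2lab(img.permute(1, 2, 0).cpu().numpy())       # L in [0,100], a/b ~[-128,127]
        lab = (lab + [0.0, 128.0, 128.0]) / [100.0, 255.0, 255.0]
        lab_imgs.append(torch.from_numpy(lab).permute(2, 0, 1).float())
    return torch.cat([rgb, torch.stack(lab_imgs).to(rgb.device)], dim=1)


if __name__ == "__main__":
    x = torch.rand(2, 3, 64, 64)
    print(rgb_lab_stack(x).shape)   # torch.Size([2, 6, 64, 64])
```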

18 pages, 1471 KiB  
Article
LST-BEV: Generating a Long-Term Spatial–Temporal Bird’s-Eye-View Feature for Multi-View 3D Object Detection
by Qijun Feng, Chunyang Zhao, Pengfei Liu, Zhichao Zhang, Yue Jin and Wanglin Tian
Sensors 2025, 25(13), 4040; https://doi.org/10.3390/s25134040 - 28 Jun 2025
Viewed by 506
Abstract
This paper presents a novel multi-view 3D object detection framework, Long-Term Spatial–Temporal Bird’s-Eye View (LST-BEV), designed to improve performance in autonomous driving. Traditional 3D detection relies on sensors such as LiDAR, but visual perception using multi-camera systems is emerging as a more cost-effective solution. Existing methods struggle to capture long-range dependencies and cross-task information due to limitations in their attention mechanisms. To address this, we propose a Long-Range Cross-Task Detection Head (LRCH) to capture these dependencies and integrate cross-task information for accurate predictions. Additionally, we introduce the Long-Term Temporal Perception Module (LTPM), which efficiently extracts temporal features by combining Mamba and linear attention, overcoming challenges in temporal frame extraction. Experimental results on the nuScenes dataset demonstrate that the proposed LST-BEV outperforms its baseline (SA-BEVPool) by 2.1% mAP and 2.7% NDS, a significant performance improvement.
(This article belongs to the Section Vehicular Sensing)
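
For context on the linear-attention half of the LTPM, the sketch below shows the standard O(N) kernelized attention with feature map phi(x) = elu(x) + 1, which avoids forming the N x N attention matrix. This is the generic formulation, not the paper’s exact module.

```python
import torch
import torch.nn.functional as F


def linear_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """O(N) attention: compute phi(Q) @ (phi(K)^T V) instead of the N x N softmax map.

    q, k, v: (B, N, D); cost is O(N * D^2) rather than O(N^2 * D).
    """
    q, k = F.elu(q) + 1.0, F.elu(k) + 1.0
    kv = torch.einsum("bnd,bne->bde", k, v)                            # (B, D, D) key/value summary
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)      # per-query normalizer
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)


if __name__ == "__main__":
    B, N, D = 2, 1024, 64
    out = linear_attention(torch.randn(B, N, D), torch.randn(B, N, D), torch.randn(B, N, D))
    print(out.shape)   # torch.Size([2, 1024, 64])
```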

28 pages, 114336 KiB  
Article
Mamba-STFM: A Mamba-Based Spatiotemporal Fusion Method for Remote Sensing Images
by Qiyuan Zhang, Xiaodan Zhang, Chen Quan, Tong Zhao, Wei Huo and Yuanchen Huang
Remote Sens. 2025, 17(13), 2135; https://doi.org/10.3390/rs17132135 - 21 Jun 2025
Viewed by 603
Abstract
Spatiotemporal fusion techniques can generate remote sensing imagery with high spatial and temporal resolutions, thereby facilitating Earth observation. However, traditional methods are constrained by linear assumptions; generative adversarial networks suffer from mode collapse; convolutional neural networks struggle to capture global context; and Transformers are hard to scale due to quadratic computational complexity and high memory consumption. To address these challenges, this study introduces an end-to-end remote sensing image spatiotemporal fusion approach based on the Mamba architecture (Mamba-spatiotemporal fusion model, Mamba-STFM), marking the first application of Mamba in this domain and presenting a novel paradigm for spatiotemporal fusion model design. Mamba-STFM consists of a feature extraction encoder and a feature fusion decoder. At the core of the encoder is the visual state space-FuseCore-AttNet block (VSS-FCAN block), which deeply integrates linear complexity cross-scan global perception with a channel attention mechanism, significantly reducing quadratic-level computation and memory overhead while improving inference throughput through parallel scanning and kernel fusion techniques. The decoder’s core is the spatiotemporal mixture-of-experts fusion module (STF-MoE block), composed of our novel spatial expert and temporal expert modules. The spatial expert adaptively adjusts channel weights to optimize spatial feature representation, enabling precise alignment and fusion of multi-resolution images, while the temporal expert incorporates a temporal squeeze-and-excitation mechanism and selective state space model (SSM) techniques to efficiently capture short-range temporal dependencies, maintain linear sequence modeling complexity, and further enhance overall spatiotemporal fusion throughput. Extensive experiments on public datasets demonstrate that Mamba-STFM outperforms existing methods in fusion quality; ablation studies validate the effectiveness of each core module; and efficiency analyses and application comparisons further confirm the model’s superior performance.
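
The temporal expert’s squeeze-and-excitation idea can be sketched as an SE gate over the time axis of a (batch, time, channel, height, width) stack; the reduction ratio and shapes below are illustrative assumptions rather than the paper’s configuration.

```python
import torch
import torch.nn as nn


class TemporalSE(nn.Module):
    """Squeeze-and-excitation along the temporal axis of a (B, T, C, H, W) sequence.

    Each time step is squeezed to a scalar descriptor and re-weighted, so the model can
    emphasize the most informative acquisition dates before fusion.
    """

    def __init__(self, num_frames: int, reduction: int = 2):
        super().__init__()
        hidden = max(1, num_frames // reduction)
        self.fc = nn.Sequential(
            nn.Linear(num_frames, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, num_frames), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W) -> per-frame descriptor (B, T) -> gate (B, T, 1, 1, 1)
        desc = x.mean(dim=(2, 3, 4))
        gate = self.fc(desc).view(x.size(0), x.size(1), 1, 1, 1)
        return x * gate


if __name__ == "__main__":
    seq = torch.randn(2, 4, 8, 32, 32)
    print(TemporalSE(num_frames=4)(seq).shape)   # torch.Size([2, 4, 8, 32, 32])
```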

14 pages, 3525 KiB  
Article
MRD: A Linear-Complexity Encoder for Real-Time Vehicle Detection
by Kaijie Li and Xiaoci Huang
World Electr. Veh. J. 2025, 16(6), 307; https://doi.org/10.3390/wevj16060307 - 30 May 2025
Viewed by 606
Abstract
Vehicle detection algorithms constitute a fundamental pillar in intelligent driving systems and smart transportation infrastructure. Nevertheless, the inherent complexity and dynamic variability of traffic scenarios present substantial technical barriers to robust vehicle detection. While visual transformer-based detection architectures have demonstrated performance breakthroughs through enhanced perceptual capabilities, establishing themselves as the dominant paradigm in this domain, their practical implementation faces critical challenges due to the quadratic computational complexity inherent in the self-attention mechanism, which imposes prohibitive computational overhead. To address these limitations, this study introduces Mamba RT-DETR (MRD), an optimized architecture featuring three principal innovations: (1) We devise an efficient vehicle detection Mamba (EVDMamba) network that strategically integrates a linear-complexity state space model (SSM) to substantially mitigate computational overhead while preserving feature extraction efficacy. (2) To counteract the constrained receptive fields and suboptimal spatial localization associated with conventional SSM sequence modeling, we implement a multi-branch collaborative learning framework that synergistically optimizes channel dimension processing, thereby augmenting the model’s capacity to capture critical spatial dependencies. (3) Comprehensive evaluations on the BDD100K benchmark demonstrate that the MRD architecture achieves a 3.1% enhancement in mean average precision (mAP) relative to state-of-the-art RT-DETR variants, while concurrently reducing the parameter count by 55.7%, a dual optimization of accuracy and efficiency.
(This article belongs to the Special Issue Recent Advances in Intelligent Vehicle)
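
A minimal sketch of multi-branch processing over the channel dimension: channels are split into parallel convolution branches with different receptive fields and then re-fused. The kernel sizes and the residual 1x1 fusion are assumptions for illustration, not the exact EVDMamba design.

```python
import torch
import torch.nn as nn


class MultiBranchBlock(nn.Module):
    """Split channels into parallel branches with different receptive fields, then re-fuse."""

    def __init__(self, channels: int):
        super().__init__()
        assert channels % 4 == 0
        c = channels // 4
        self.branches = nn.ModuleList([
            nn.Conv2d(c, c, kernel_size=k, padding=k // 2) for k in (1, 3, 5, 7)
        ])
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        chunks = torch.chunk(x, 4, dim=1)
        out = torch.cat([branch(c) for branch, c in zip(self.branches, chunks)], dim=1)
        return self.fuse(out) + x                      # residual connection keeps the original signal


if __name__ == "__main__":
    print(MultiBranchBlock(64)(torch.randn(1, 64, 40, 40)).shape)   # torch.Size([1, 64, 40, 40])
```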

27 pages, 1868 KiB  
Article
MACA-Net: Mamba-Driven Adaptive Cross-Layer Attention Network for Multi-Behavior Recognition in Group-Housed Pigs
by Zhixiong Zeng, Zaoming Wu, Runtao Xie, Kai Lin, Shenwen Tan, Xinyuan He and Yizhi Luo
Agriculture 2025, 15(9), 968; https://doi.org/10.3390/agriculture15090968 - 29 Apr 2025
Viewed by 725
Abstract
The accurate recognition of pig behaviors in intensive farming is crucial for health monitoring and growth assessment. To address multi-scale recognition challenges caused by perspective distortion (non-frontal camera angles), this study proposes MACA-Net, a YOLOv8n-based model capable of detecting four key behaviors: eating, lying on the belly, lying on the side, and standing. The model incorporates a Mamba Global–Local Extractor (MGLE) module, which leverages Mamba to capture global dependencies while preserving local details through convolutional operations and channel shuffle, overcoming Mamba’s limitation in retaining fine-grained visual information. Additionally, an Adaptive Multi-Path Attention (AMPA) mechanism integrates spatial-channel attention to enhance feature focus, ensuring robust performance in complex environments and low-light conditions. To further improve detection, a Cross-Layer Feature Pyramid Transformer (CFPT) neck employs non-upsampled feature fusion, mitigating the semantic gap in which small-target features are overshadowed by large-target features during feature transmission. Experimental results demonstrate that MACA-Net achieves a precision of 83.1% and an mAP of 85.1%, surpassing YOLOv8n by 8.9% and 4.4%, respectively, while reducing parameters by 48.4% and FLOPs by 39.5%. Compared with leading detectors such as RT-DETR, Faster R-CNN, and YOLOv11n, MACA-Net maintains a consistently strong balance of computational efficiency and accuracy. These findings validate the efficacy of MACA-Net for intelligent livestock management and welfare-driven breeding, offering a practical and efficient solution for modern pig farming.
(This article belongs to the Special Issue Modeling of Livestock Breeding Environment and Animal Behavior)
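
The channel shuffle mentioned for the MGLE module is the standard ShuffleNet-style interleaving, sketched below; how MACA-Net combines it with the Mamba and convolution paths is not detailed in the abstract.

```python
import torch


def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups (as in ShuffleNet) so information mixes
    between grouped convolution paths."""
    b, c, h, w = x.shape
    assert c % groups == 0
    x = x.view(b, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(b, c, h, w)


if __name__ == "__main__":
    feat = torch.arange(8, dtype=torch.float32).view(1, 8, 1, 1)
    # With 2 groups, channels [0..3, 4..7] become interleaved [0, 4, 1, 5, 2, 6, 3, 7].
    print(channel_shuffle(feat, groups=2).flatten().tolist())
```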

16 pages, 3599 KiB  
Article
Classification of Diabetic Retinopathy Based on Efficient Computational Modeling
by Jiao Xue, Jianyu Wu, Yingxu Bian, Shiyan Zhang and Qinsheng Du
Appl. Sci. 2024, 14(23), 11327; https://doi.org/10.3390/app142311327 - 4 Dec 2024
Cited by 1 | Viewed by 1842
Abstract
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have long been the main backbone networks for visual classification in deep learning. Although ViTs have recently received more attention than CNNs due to their excellent fitting ability, their scalability is largely limited by the quadratic complexity of attention computation. Grading diabetic retinopathy requires characterizing fundus lesions as well as the width, angle, and branching pattern of retinal blood vessels. Inspired by the ability of Mamba and VMamba to efficiently model long sequences, this paper proposes VMamba-m, a general-purpose visual backbone designed to reduce computational complexity to linear while retaining the advantageous features of ViTs. By modifying the cross-entropy loss function, we enhance the model’s attention to rare categories, especially in large-scale multi-category classification tasks. To enhance the adaptability of the VMamba-m model in processing visual data, we introduce the SE (squeeze-and-excitation) channel attention mechanism, which enables the model to learn the importance of each channel and assign it a corresponding weight through the excitation stage. In addition, this paper improves the implementation details and architectural design by introducing a novel attention mechanism based on local windowing, which optimizes the model’s ability to process long sequence data, enhancing the performance of VMamba-m and improving its inference speed. Extensive experimental results show that VMamba-m performs well on the retinopathy classification task, with significant advantages in accuracy and computation time over existing benchmark models.
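
One common way to modify cross-entropy so that rare categories receive more attention is inverse-frequency class weighting; the sketch below shows that variant with hypothetical class counts, since the abstract does not state the exact weighting scheme used by VMamba-m.

```python
import torch
import torch.nn as nn


def class_balanced_ce(class_counts: list) -> nn.CrossEntropyLoss:
    """Cross-entropy with inverse-frequency class weights so rare grades contribute more.

    `class_counts` lists how many training samples each class has; this weighting choice
    is illustrative, not the paper's exact formulation.
    """
    counts = torch.tensor(class_counts, dtype=torch.float32)
    weights = counts.sum() / (len(counts) * counts)      # inverse frequency, mean roughly 1
    return nn.CrossEntropyLoss(weight=weights)


if __name__ == "__main__":
    # Hypothetical counts for five retinopathy grades: healthy eyes dominate.
    loss_fn = class_balanced_ce([20000, 2000, 4000, 700, 600])
    logits, labels = torch.randn(8, 5), torch.randint(0, 5, (8,))
    print(loss_fn(logits, labels).item())
```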

22 pages, 4188 KiB  
Article
Hyperspectral Object Detection Based on Spatial–Spectral Fusion and Visual Mamba
by Wenjun Li, Fuqiang Yuan, Hongkun Zhang, Zhiwen Lv and Beiqi Wu
Remote Sens. 2024, 16(23), 4482; https://doi.org/10.3390/rs16234482 - 29 Nov 2024
Cited by 3 | Viewed by 2103
Abstract
Hyperspectral object-detection algorithms based on deep learning have been receiving increasing attention due to their ability to operate without relying on prior spectral information about the target and their strong real-time inference performance. However, current methods are unable to efficiently extract spatial and spectral information from hyperspectral image data simultaneously. In this study, an innovative hyperspectral object-detection algorithm is proposed that improves detection accuracy over benchmark algorithms and state-of-the-art hyperspectral object-detection algorithms. Specifically, to integrate spectral and spatial information, we propose an edge-preserving dimensionality reduction (EPDR) module. This module applies edge-preserving dimensionality reduction, based on spatial texture-weighted fusion, to the raw hyperspectral data, producing hyperspectral data that integrate both spectral and spatial information. Subsequently, to enhance the network’s perception of the aggregated spatial and spectral data, we combine a CNN with Visual Mamba to construct a spatial feature enhancement module (SFEM) with linear complexity. The experimental results demonstrate the effectiveness of our method.
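
A rough sketch of texture-weighted band fusion in the spirit of the EPDR module: adjacent spectral bands are grouped and averaged with weights proportional to their gradient energy, so bands with strong spatial structure dominate the fused channel. The grouping strategy, weighting, and `texture_weighted_reduce` name are illustrative assumptions, not the paper’s formulation.

```python
import numpy as np


def texture_weighted_reduce(cube: np.ndarray, out_bands: int) -> np.ndarray:
    """Reduce a hyperspectral cube (H, W, B) to `out_bands` fused channels."""
    h, w, b = cube.shape
    groups = np.array_split(np.arange(b), out_bands)
    fused = np.empty((h, w, out_bands), dtype=cube.dtype)
    for i, idx in enumerate(groups):
        bands = cube[..., idx]                                   # (H, W, g)
        gy, gx = np.gradient(bands, axis=(0, 1))
        energy = (gx ** 2 + gy ** 2).mean(axis=(0, 1)) + 1e-8    # texture score per band
        weights = energy / energy.sum()
        fused[..., i] = (bands * weights).sum(axis=-1)           # texture-weighted fusion
    return fused


if __name__ == "__main__":
    hsi = np.random.rand(64, 64, 96).astype(np.float32)
    print(texture_weighted_reduce(hsi, out_bands=16).shape)      # (64, 64, 16)
```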

20 pages, 5352 KiB  
Article
Facial Expression Recognition-You Only Look Once-Neighborhood Coordinate Attention Mamba: Facial Expression Detection and Classification Based on Neighbor and Coordinates Attention Mechanism
by Cheng Peng, Mingqi Sun, Kun Zou, Bowen Zhang, Genan Dai and Ah Chung Tsoi
Sensors 2024, 24(21), 6912; https://doi.org/10.3390/s24216912 - 28 Oct 2024
Cited by 2 | Viewed by 1648
Abstract
In studying the joint object detection and classification problem for facial expression recognition (FER) within the YOLOX framework, we introduce a novel feature extractor, called neighborhood coordinate attention Mamba (NCAMamba), to substitute for the original feature extractor in the Feature Pyramid Network (FPN). NCAMamba combines the background-information reduction capability of Mamba, the local neighborhood relationship understanding of neighborhood attention, and the directional relationship understanding of coordinate attention. The resulting FER-YOLO-NCAMamba model, when applied to two unaligned FER benchmark datasets, RAF-DB and SFEW, obtains significantly improved mean average precision (mAP) scores compared with other state-of-the-art methods. In ablation studies, the NCA module proves relatively more important than the Visual State Space (VSS) module, a version of Mamba adapted for image processing. Visualization studies using the Grad-CAM method reveal that regions around the nose tip are critical to recognizing the expression: if the attended region is too large, it may lead to erroneous predictions, while a small, focused region leads to correct recognition, which may explain why FER on unaligned faces is such a challenging problem.
(This article belongs to the Section Sensing and Imaging)
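
For reference, a simplified coordinate-attention block (in the style of Hou et al., 2021), which pools along height and width separately and produces per-row and per-column gates; batch normalization and the neighborhood-attention part of NCAMamba’s NCA module are omitted here.

```python
import torch
import torch.nn as nn


class CoordinateAttention(nn.Module):
    """Simplified coordinate attention: directional pooling followed by row/column gates."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        hidden = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, hidden, kernel_size=1)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(hidden, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        pooled_h = x.mean(dim=3, keepdim=True)                     # (B, C, H, 1)
        pooled_w = x.mean(dim=2, keepdim=True).transpose(2, 3)     # (B, C, W, 1)
        y = self.act(self.conv1(torch.cat([pooled_h, pooled_w], dim=2)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        gate_h = torch.sigmoid(self.conv_h(y_h))                   # per-row gate (B, C, H, 1)
        gate_w = torch.sigmoid(self.conv_w(y_w)).transpose(2, 3)   # per-column gate (B, C, 1, W)
        return x * gate_h * gate_w


if __name__ == "__main__":
    print(CoordinateAttention(32)(torch.randn(2, 32, 24, 24)).shape)  # torch.Size([2, 32, 24, 24])
```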

19 pages, 9439 KiB  
Article
MFAD-RTDETR: A Multi-Frequency Aggregate Diffusion Feature Flow Composite Model for Printed Circuit Board Defect Detection
by Zhihua Xie and Xiaowei Zou
Electronics 2024, 13(17), 3557; https://doi.org/10.3390/electronics13173557 - 7 Sep 2024
Cited by 6 | Viewed by 2516
Abstract
To address the challenges of excessive model parameters and low detection accuracy in printed circuit board (PCB) defect detection, this paper proposes a novel PCB defect detection model based on the improved RTDETR (Real-Time Detection, Embedding and Tracking) method, named MFAD-RTDETR. Specifically, the proposed model introduces the designed Detail Feature Retainer (DFR) into the original RTDETR backbone to capture and retain local details. Subsequently, based on the Mamba architecture, the Visual State Space (VSS) module is integrated to enhance global attention while reducing the original quadratic complexity to a linear level. Furthermore, by exploiting a deformable attention mechanism that dynamically adjusts reference points, the model achieves precise localization of target defects and improves the accuracy of the transformer in complex visual tasks. Meanwhile, a receptive field synthesis mechanism is incorporated to enrich multi-scale semantic information and reduce parameter complexity. In addition, the scheme proposes a novel Multi-frequency Aggregation and Diffusion feature composite paradigm (MFAD-feature composite paradigm), which consists of the Aggregation Diffusion Fusion (ADF) module and the Refiner Feature Composition (RFC) module; it aims to strengthen features with fine-grained awareness while preserving a certain level of global attention. Finally, the Wise IoU (WIoU) dynamic nonmonotonic focusing mechanism is used to reduce competition among high-quality anchor boxes and mitigate the harmful gradients from low-quality examples, thereby concentrating on anchor boxes of average quality to promote the overall performance of the detector. Extensive experiments on the PCB defect dataset released by Peking University validate the effectiveness of the proposed model. The results show that our approach achieves 97.0% and 51.0% in mean Average Precision (mAP)@0.5 and mAP@0.5:0.95, respectively, significantly outperforming the original RTDETR, while reducing the number of parameters by approximately 18.2%.
(This article belongs to the Special Issue Deep Learning for Computer Vision Application)
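
The Wise-IoU idea can be sketched as an IoU loss scaled by a detached, exponentially weighted center-distance term (roughly the WIoU v1 form); the dynamic non-monotonic focusing factor of the full mechanism is omitted here, so treat this as an approximation rather than the paper’s loss.

```python
import torch


def wiou_style_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """WIoU-style loss for boxes given as (x1, y1, x2, y2), shape (N, 4).

    The IoU loss is scaled by exp(center distance^2 / enclosing-box diagonal^2), detached so
    the scale acts as attention rather than a gradient path; the v3 focusing factor is omitted.
    """
    x1 = torch.maximum(pred[:, 0], target[:, 0]); y1 = torch.maximum(pred[:, 1], target[:, 1])
    x2 = torch.minimum(pred[:, 2], target[:, 2]); y2 = torch.minimum(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + 1e-7)

    # Smallest enclosing box dimensions and squared center distance.
    cw = torch.maximum(pred[:, 2], target[:, 2]) - torch.minimum(pred[:, 0], target[:, 0])
    ch = torch.maximum(pred[:, 3], target[:, 3]) - torch.minimum(pred[:, 1], target[:, 1])
    dx = (pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) / 2
    dy = (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) / 2
    scale = torch.exp((dx ** 2 + dy ** 2) / (cw ** 2 + ch ** 2 + 1e-7).detach())
    return (scale * (1.0 - iou)).mean()


if __name__ == "__main__":
    p = torch.tensor([[10., 10., 50., 60.]], requires_grad=True)
    t = torch.tensor([[12., 14., 48., 58.]])
    print(wiou_style_loss(p, t).item())
```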

14 pages, 7579 KiB  
Article
Optimization and Application of Improved YOLOv9s-UI for Underwater Object Detection
by Wei Pan, Jiabao Chen, Bangjun Lv and Likun Peng
Appl. Sci. 2024, 14(16), 7162; https://doi.org/10.3390/app14167162 - 15 Aug 2024
Cited by 27 | Viewed by 1961
Abstract
The You Only Look Once (YOLO) series of object detection models is widely recognized for its efficiency and real-time performance, particularly under the challenging conditions of underwater environments, characterized by insufficient lighting and visual disturbances. By modifying the YOLOv9s model, this study aims to improve the accuracy and real-time capabilities of underwater object detection, resulting in the YOLOv9s-UI detection model. The proposed model incorporates the Dual Dynamic Token Mixer (D-Mixer) module from TransXNet to improve feature extraction capabilities. Additionally, it integrates a feature fusion network design from the LocalMamba network, employing channel and spatial attention mechanisms. These attention modules effectively guide the feature fusion process, significantly enhancing detection accuracy while maintaining a compact model size of only 9.3 M. Experimental evaluation on the UCPR2019 underwater object dataset shows that the YOLOv9s-UI model achieves higher accuracy and recall than the existing YOLOv9s model, as well as excellent real-time performance. The model significantly improves underwater target detection by introducing advanced feature extraction and attention mechanisms, meets portability requirements, and provides a more efficient solution for underwater detection.
(This article belongs to the Section Marine Science and Engineering)
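
A generic sketch of channel- and spatial-attention-guided fusion of two feature maps (CBAM-style gates followed by a 1x1 fusion convolution); the actual LocalMamba-derived fusion design in YOLOv9s-UI may differ, so the gate layout below is an assumption.

```python
import torch
import torch.nn as nn


class AttentionGuidedFusion(nn.Module):
    """Fuse two feature maps under channel and spatial attention gates."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        x = a + b
        x = x * self.channel_gate(x)                                    # reweight channels
        spatial = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        x = x * self.spatial_gate(spatial)                              # reweight locations
        return self.fuse(x)


if __name__ == "__main__":
    f1, f2 = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
    print(AttentionGuidedFusion(64)(f1, f2).shape)                      # torch.Size([1, 64, 32, 32])
```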

34 pages, 4902 KiB  
Review
A Survey on Visual Mamba
by Hanwei Zhang, Ying Zhu, Dan Wang, Lijun Zhang, Tianxiang Chen, Ziyang Wang and Zi Ye
Appl. Sci. 2024, 14(13), 5683; https://doi.org/10.3390/app14135683 - 28 Jun 2024
Cited by 54 | Viewed by 13811
Abstract
State space models (SSMs) with selection mechanisms and hardware-aware architectures, namely Mamba, have recently shown significant potential in long-sequence modeling. Since the complexity of the transformer self-attention mechanism grows quadratically with image size, bringing increasing computational demands, researchers are currently exploring how to adapt Mamba for computer vision tasks. This paper is the first comprehensive survey aiming to provide an in-depth analysis of Mamba models within the domain of computer vision. It begins by exploring the foundational concepts contributing to Mamba’s success, including the SSM framework, selection mechanisms, and hardware-aware design. Then, we review vision Mamba models by categorizing them into foundational models and those enhanced with techniques including convolution, recurrence, and attention to improve their sophistication. Furthermore, we investigate the widespread applications of Mamba in vision tasks, including its use as a backbone at various levels of vision processing. This encompasses general visual tasks, medical visual tasks (e.g., 2D/3D segmentation, classification, and image registration), and remote sensing visual tasks. In particular, we introduce general visual tasks from two levels: high/mid-level vision (e.g., object detection, segmentation, and video classification) and low-level vision (e.g., image super-resolution, image restoration, and visual generation). We hope this endeavor will spark additional interest within the community to address current challenges and further apply Mamba models in computer vision.
(This article belongs to the Special Issue Application of Artificial Intelligence in Visual Processing)
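
For readers new to SSMs, the core linear-time recurrence Mamba builds on is h_t = A h_{t-1} + B x_t, y_t = C h_t. The sketch below runs that (already discretized, diagonal-state) scan; in a selective SSM, A, B, and C would additionally be functions of the input x_t, which is omitted here.

```python
import torch


def ssm_scan(x: torch.Tensor, A: torch.Tensor, B: torch.Tensor, C: torch.Tensor) -> torch.Tensor:
    """Minimal discretized state-space recurrence: h_t = A*h_{t-1} + B*x_t, y_t = C*h_t.

    x: (batch, length, dim); A, B, C: (dim, state) with an element-wise (diagonal) state update.
    The loop is O(length), the linear-complexity property that Mamba exploits.
    """
    batch, length, dim = x.shape
    state = A.shape[-1]
    h = torch.zeros(batch, dim, state, device=x.device)
    ys = []
    for t in range(length):
        h = A * h + B * x[:, t].unsqueeze(-1)      # (batch, dim, state)
        ys.append((h * C).sum(-1))                 # (batch, dim)
    return torch.stack(ys, dim=1)                  # (batch, length, dim)


if __name__ == "__main__":
    b, l, d, n = 2, 128, 16, 8
    y = ssm_scan(torch.randn(b, l, d), torch.rand(d, n) * 0.9, torch.randn(d, n), torch.randn(d, n))
    print(y.shape)   # torch.Size([2, 128, 16])
```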

16 pages, 22655 KiB  
Article
LightCF-Net: A Lightweight Long-Range Context Fusion Network for Real-Time Polyp Segmentation
by Zhanlin Ji, Xiaoyu Li, Jianuo Liu, Rui Chen, Qinping Liao, Tao Lyu and Li Zhao
Bioengineering 2024, 11(6), 545; https://doi.org/10.3390/bioengineering11060545 - 27 May 2024
Cited by 12 | Viewed by 2284
Abstract
Automatically segmenting polyps from colonoscopy videos is crucial for developing computer-assisted diagnostic systems for colorectal cancer. Existing automatic polyp segmentation methods often struggle to fulfill the real-time demands of clinical applications due to their substantial parameter count and computational load, especially those based on Transformer architectures. To tackle these challenges, a novel lightweight long-range context fusion network, named LightCF-Net, is proposed in this paper. This network attempts to model long-range spatial dependencies while maintaining real-time performance, to better distinguish polyps from background noise and thus improve segmentation accuracy. A novel Fusion Attention Encoder (FAEncoder) is designed in the proposed network, which integrates Large Kernel Attention (LKA) and channel attention mechanisms to extract deep representational features of polyps and unearth long-range dependencies. Furthermore, a newly designed Visual Attention Mamba module (VAM) is added to the skip connections, modeling long-range context dependencies in the encoder-extracted features and reducing background noise interference through the attention mechanism. Finally, a Pyramid Split Attention module (PSA) is used in the bottleneck layer to extract richer multi-scale contextual features. The proposed method was thoroughly evaluated on four renowned polyp segmentation datasets: Kvasir-SEG, CVC-ClinicDB, BKAI-IGH, and ETIS. Experimental findings demonstrate that the proposed method delivers higher segmentation accuracy in less time, consistently outperforming the most advanced lightweight polyp segmentation networks.
(This article belongs to the Section Biosignal Processing)
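
The Large Kernel Attention used in the FAEncoder follows the VAN-style decomposition (depthwise conv, depthwise dilated conv, pointwise conv, then gating); a minimal version is sketched below, without the projection layers that usually surround it.

```python
import torch
import torch.nn as nn


class LargeKernelAttention(nn.Module):
    """Large Kernel Attention: decompose a large kernel into a depthwise conv,
    a depthwise dilated conv, and a pointwise conv, and use the result as a gating map."""

    def __init__(self, channels: int):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        self.dw_dilated = nn.Conv2d(channels, channels, 7, padding=9, dilation=3, groups=channels)
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.pw(self.dw_dilated(self.dw(x)))
        return x * attn                              # modulate input by the long-range context map


if __name__ == "__main__":
    print(LargeKernelAttention(32)(torch.randn(1, 32, 48, 48)).shape)  # torch.Size([1, 32, 48, 48])
```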