Search Results (136)

Search Parameters:
Keywords = top-down visual attention

17 pages, 1388 KB  
Article
HISF: Hierarchical Interactive Semantic Fusion for Multimodal Prompt Learning
by Haohan Feng and Chen Li
Multimodal Technol. Interact. 2026, 10(1), 6; https://doi.org/10.3390/mti10010006 - 6 Jan 2026
Viewed by 170
Abstract
Recent vision-language pre-training models such as CLIP generalize well across a variety of multimodal tasks, yet their generalization to downstream tasks remains limited. As a lightweight adaptation approach, prompt learning enables task transfer by optimizing only a few learnable vectors and is thus more flexible for pre-trained models. However, current methods concentrate mainly on the design of unimodal prompts and overlook effective means of multimodal semantic fusion and label alignment, which limits their representational power. To tackle these problems, this paper proposes a Hierarchical Interactive Semantic Fusion (HISF) framework for multimodal prompt learning. On top of frozen CLIP backbones, HISF injects visual and textual signals simultaneously into intermediate Transformer layers through a cross-attention mechanism together with fitted category embeddings. This architecture realizes hierarchical semantic fusion at the modality level while keeping structural consistency at each layer. In addition, a Label Embedding Constraint and a Semantic Alignment Loss are proposed to promote category consistency and alleviate semantic drift during training. Extensive experiments on 11 few-shot image classification benchmarks show that HISF improves average accuracy by around 0.7% over state-of-the-art methods and is remarkably robust in cross-domain transfer tasks. Ablation studies verify the effectiveness of each proposed component and their combination: the hierarchical structure, cross-modal attention, and semantic alignment work together to enrich representational capacity. In conclusion, HISF offers a new hierarchical view of multimodal prompt learning and a more lightweight, generalizable paradigm for adapting vision-language pre-trained models.
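The cross-attention fusion the abstract describes can be illustrated with a minimal single-head sketch, in which learnable text-prompt vectors attend over visual tokens. All names, shapes, and dimensions below (`cross_attention`, `text_prompts`, 8-dimensional embeddings) are illustrative assumptions, not HISF's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Single-head cross-attention: prompt queries attend to visual tokens."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (n_q, n_kv) similarity matrix
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ values                  # fused prompts, shape (n_q, d_v)

rng = np.random.default_rng(0)
text_prompts  = rng.normal(size=(4, 8))   # 4 learnable prompt vectors
visual_tokens = rng.normal(size=(16, 8))  # 16 patch embeddings from the image branch
fused = cross_attention(text_prompts, visual_tokens, visual_tokens)
print(fused.shape)  # (4, 8)
```

In a hierarchical scheme like the one described, a block of this kind would be applied at several intermediate Transformer layers rather than once at the output.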

16 pages, 1260 KB  
Article
DAR-Swin: Dual-Attention Revamped Swin Transformer for Intelligent Vehicle Perception Under NVH Disturbances
by Xinglong Zhang, Zhiguo Zhang, Huihui Zuo, Chaotan Xue, Zhenjiang Wu, Zhiyu Cheng and Yan Wang
Machines 2026, 14(1), 51; https://doi.org/10.3390/machines14010051 - 31 Dec 2025
Viewed by 269
Abstract
In recent years, deep learning-based image classification has made significant progress, especially in safety-critical perception fields such as intelligent vehicles. Factors such as vibrations caused by NVH (noise, vibration, and harshness), sensor noise, and road surface roughness pose challenges to robustness and real-time deployment. The Transformer architecture has become a fundamental component of high-performance models. However, in complex visual environments, shifted window attention mechanisms exhibit inherent limitations: although computationally efficient, local window constraints impede cross-region semantic integration, while deep feature processing obstructs robust representation learning. To address these challenges, we propose DAR-Swin (Dual-Attention Revamped Swin Transformer), enhancing the framework through two complementary attention mechanisms. First, Scalable Self-Attention universally substitutes the standard Window-based Multi-head Self-Attention via sub-quadratic complexity operators. These operators decouple spatial positions from feature associations, enabling position-adaptive receptive fields for comprehensive contextual modeling. Second, Latent Proxy Attention integrated before the classification head adopts a learnable spatial proxy to integrate global semantic information into a fixed-size representation, while preserving relational semantics and achieving linear computational complexity through efficient proxy interactions. Extensive experiments demonstrate significant improvements over Swin Transformer Base, achieving 87.3% top-1 accuracy on CIFAR-100 (+1.5% absolute improvement) and 57.0% mAP on COCO2017 (+1.3% absolute improvement). These characteristics are particularly important for the active and passive safety features of intelligent vehicles. Full article

16 pages, 1906 KB  
Article
Visual Attention and User Preference Analysis of Children’s Beds in Interior Environments Using Eye-Tracking Technology
by Yunxi Nie, Jinjing Wang and Yushu Chen
Buildings 2026, 16(1), 44; https://doi.org/10.3390/buildings16010044 - 22 Dec 2025
Viewed by 355
Abstract
Visual attention plays a critical role in users’ cognitive evaluation of safety and functionality in interior furniture, particularly for children’s beds, which are inherently safety-sensitive products. This study adopts an integrated approach combining eye-tracking experiments and questionnaire surveys to examine users’ visual cognition and preference patterns toward children’s solid wood beds under controlled viewing conditions, focusing on material attributes, bed typologies, and key structural components. The results indicate that natural solid wood materials with clear textures and warm tones attract higher visual attention, while storage-integrated bed designs significantly enhance exploratory gaze behavior. At the component level, safety-related elements such as guardrails and headboards consistently receive the earliest visual attention, highlighting their cognitive priority in safety assessment and spatial perception. Overall, the findings support a dual-path visual cognition mechanism in which bottom-up visual salience interacts with top-down concerns related to safety and usability. This study provides evidence-based insights for material selection and structural emphasis in children’s furniture design within interior environments. The applicability of the conclusions is primarily limited to adult observers under controlled visual conditions. Full article
(This article belongs to the Section Architectural Design, Urban Science, and Real Estate)

17 pages, 7561 KB  
Article
Fine-Grained Image Recognition with Bio-Inspired Gradient-Aware Attention
by Bing Ma, Junyi Li, Zhengbei Jin, Wei Zhang, Xiaohui Song and Beibei Jin
Biomimetics 2025, 10(12), 834; https://doi.org/10.3390/biomimetics10120834 - 12 Dec 2025
Viewed by 591
Abstract
Fine-grained image recognition is a key task in computer vision, but subtle inter-class differences and large intra-class variation make it challenging. Conventional approaches often struggle with background interference and feature degradation. To address these issues, we draw inspiration from the human visual system, which adeptly focuses on discriminative regions, and propose a bio-inspired gradient-aware attention mechanism. Our method explicitly models gradient information to guide attention, mimicking biological edge sensitivity and thereby sharpening the distinction between global structures and local details. Experiments on the CUB-200-2011, iNaturalist2018, NABirds, and Stanford Cars datasets demonstrate the superiority of our method, with Top-1 accuracy rates of 92.9%, 90.5%, 93.1%, and 95.1%, respectively.
(This article belongs to the Special Issue Biologically Inspired Vision and Image Processing 2025)

25 pages, 3819 KB  
Article
Cross-Modal and Contrastive Optimization for Explainable Multimodal Recognition of Predatory and Parasitic Insects
by Mingyu Liu, Liuxin Wang, Ruihao Jia, Shiyu Ji, Yalin Wu, Yuxin Wu, Luozehan Xie and Min Dong
Insects 2025, 16(12), 1187; https://doi.org/10.3390/insects16121187 - 22 Nov 2025
Viewed by 677
Abstract
Natural enemies play a vital role in pest suppression and ecological balance within agricultural ecosystems. However, conventional vision-based recognition methods are highly susceptible to illumination variation, occlusion, and background noise in complex field environments, making it difficult to accurately distinguish morphologically similar species. To address these challenges, a multimodal natural enemy recognition and ecological interpretation framework, termed MAVC-XAI, is proposed to enhance recognition accuracy and ecological interpretability in real-world agricultural scenarios. The framework employs a dual-branch spatiotemporal feature extraction network for deep modeling of both visual and acoustic signals, introduces a cross-modal sampling attention mechanism for dynamic inter-modality alignment, and incorporates cross-species contrastive learning to optimize inter-class feature boundaries. Additionally, an explainable generation module is designed to provide ecological visualizations of the model’s decision-making process in both visual and acoustic domains. Experiments conducted on multimodal datasets collected across multiple agricultural regions confirm the effectiveness of the proposed approach. The MAVC-XAI framework achieves an accuracy of 0.938, a precision of 0.932, a recall of 0.927, an F1-score of 0.929, an mAP@50 of 0.872, and a Top-5 recognition rate of 97.8%, all significantly surpassing unimodal models such as ResNet, Swin-T, and VGGish, as well as multimodal baselines including MMBT and ViLT. Ablation experiments further validate the critical contributions of the cross-modal sampling attention and contrastive learning modules to performance enhancement. The proposed framework not only enables high-precision natural enemy identification under complex ecological conditions but also provides an interpretable and intelligent foundation for AI-driven ecological pest management and food security monitoring. Full article

20 pages, 3079 KB  
Article
EABI-DETR: An Efficient Aerial Small Object Detection Network
by Fufang Li, Yuehua Zhang and Yuxuan Fan
Biomimetics 2025, 10(11), 770; https://doi.org/10.3390/biomimetics10110770 - 13 Nov 2025
Viewed by 690
Abstract
Small object detection, as an important research topic in computer vision, has been widely applied in aerial visual tasks such as remote sensing and UAV imagery. However, due to challenges such as small object size, large-scale variations, and complex backgrounds, existing detection models often struggle to capture fine-grained semantics and high-resolution texture information in aerial scenes, leading to limited performance. To address these issues, this paper proposes an efficient aerial small object detection model, EABI-DETR (Efficient Attention and Bi-level Integration DETR), based on the RT-DETR framework. The proposed model introduces systematic enhancements from three aspects: (1) A lightweight backbone network, C2f-EMA, is developed by integrating the C2f structure with an efficient multi-scale attention (EMA) mechanism. This design jointly models channel semantics and spatial details with minimal computational overhead, thereby strengthening the perception of small objects. (2) A P2-BiFPN bi-directional multi-scale fusion module is further designed to incorporate shallow high-resolution features. Through top-down and bottom-up feature interactions, this module enhances cross-scale information flow and effectively preserves the fine details and textures of small objects. (3) To improve localization robustness, a Focaler-MPDIoU loss function is introduced to better handle hard samples during regression optimization. Experiments conducted on the VisDrone2019 dataset demonstrate that EABI-DETR achieves 53.4% mAP@0.5 and 34.1% mAP@0.5:0.95, outperforming RT-DETR by 6.2% and 5.1%, respectively, while maintaining high inference efficiency. These results confirm the effectiveness of integrating lightweight attention mechanisms and shallow feature fusion for aerial small object detection, offering a new paradigm for efficient UAV-based visual perception. Full article
(This article belongs to the Special Issue Exploration of Bioinspired Computer Vision and Pattern Recognition)
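The MPDIoU term underlying the Focaler-MPDIoU loss mentioned above can be sketched in scalar form: the IoU is penalized by the squared distances between matching corner points, normalized by the image size. This is a minimal illustration (the Focaler reweighting of easy/hard samples is omitted), and the function names and image dimensions are assumptions, not the EABI-DETR code:

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two axis-aligned [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(box_a) + area(box_b) - inter)

def mpdiou_loss(pred, gt, img_w, img_h):
    """1 - MPDIoU: IoU minus corner-distance penalties normalized by image size."""
    d1 = (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2  # top-left corner gap
    d2 = (pred[2] - gt[2]) ** 2 + (pred[3] - gt[3]) ** 2  # bottom-right corner gap
    norm = img_w ** 2 + img_h ** 2
    return 1.0 - (iou(pred, gt) - d1 / norm - d2 / norm)

perfect = mpdiou_loss([10, 10, 50, 50], [10, 10, 50, 50], 640, 640)
shifted = mpdiou_loss([12, 12, 52, 52], [10, 10, 50, 50], 640, 640)
print(perfect, shifted)  # 0.0 for an exact match; larger when the box drifts
```

Unlike plain IoU loss, the corner-distance terms still provide a gradient signal when predicted and ground-truth boxes barely overlap, which matters for the small objects this paper targets.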

13 pages, 1410 KB  
Article
The Effect and Time Course of Prediction and Perceptual Load on Category-Based Attentional Orienting Across Color and Shape Dimensions
by Yunpeng Jiang, Tianyu Chen, Fangyuan Ou, Yun Wang, Ruixi Feng, Xia Wu and Lin Lin
Brain Sci. 2025, 15(11), 1210; https://doi.org/10.3390/brainsci15111210 - 9 Nov 2025
Viewed by 561
Abstract
Objectives: This study investigated the temporal dynamics of category-based attentional orienting (CAO) under the influences of prediction (top-down) and perceptual load (bottom-up) across color and shape dimensions, combining behavioral and event-related potential (ERP) measures. Methods: Across two experiments, we manipulated predictive validity and perceptual load during a visual search for category-defined targets. Results: The results revealed a critical dimension-specific effect of prediction: invalid predictions elicited a larger N2pc component (indexing attentional selection) for shape-defined targets, but not color-defined targets, indicating that shape CAO relies more heavily on predictive information during early processing. At the behavioral level, a combined analysis of the two experiments revealed an interaction between prediction and perceptual load on accuracy, suggesting their integration can occur at later stages. Conclusions: These findings demonstrate that prediction and perceptual load exhibit distinct temporal profiles, primarily independently modulating early attentional orienting, with their interactive effects on behavior being more nuanced and dimension-dependent. This study elucidates the distinct temporal and dimensional mechanisms through which top-down and bottom-up sources of uncertainty shape attentional orienting to categories. Full article
(This article belongs to the Section Neuropsychology)

23 pages, 59318 KB  
Article
BAT-Net: Bidirectional Attention Transformer Network for Joint Single-Image Desnowing and Snow Mask Prediction
by Yongheng Zhang
Information 2025, 16(11), 966; https://doi.org/10.3390/info16110966 - 7 Nov 2025
Viewed by 450
Abstract
In the wild, snow is not merely additive noise; it is a non-stationary, semi-transparent veil whose spatial statistics vary with depth, illumination, and wind. Because conventional two-stage pipelines first detect a binary mask and then inpaint the occluded regions, any early mis-classification is irreversibly baked into the final result, leading to over-smoothed textures or ghosting artifacts. We propose BAT-Net, a Bidirectional Attention Transformer Network that frames desnowing as a coupled representation learning problem, jointly disentangling snow appearance and scene radiance in a single forward pass. Our core contributions are as follows: (1) A novel dual-decoder architecture where a background decoder and a snow decoder are coupled via a Bidirectional Attention Module (BAM). The BAM implements a continuous predict–verify–correct mechanism, allowing the background branch to dynamically accept, reject, or refine the snow branch’s occlusion hypotheses, dramatically reducing error accumulation. (2) A lightweight yet effective multi-scale feature fusion scheme comprising a Scale Conversion Module (SCM) and a Feature Aggregation Module (FAM), enabling the model to handle the large scale variance among snowflakes without a prohibitive computational cost. (3) The introduction of the FallingSnow dataset, curated to eliminate the label noise caused by irremovable ground snow in existing benchmarks, providing a cleaner benchmark for evaluating dynamic snow removal. Extensive experiments on synthetic and real-world datasets demonstrate that BAT-Net sets a new state of the art. It achieves a PSNR of 35.78 dB on the CSD dataset, outperforming the best prior model by 1.37 dB, and also achieves top results on SRRS (32.13 dB) and Snow100K (34.62 dB) datasets. The proposed method has significant practical applications in autonomous driving and surveillance systems, where accurate snow removal is crucial for maintaining visual clarity. Full article
(This article belongs to the Special Issue Intelligent Image Processing by Deep Learning, 2nd Edition)

39 pages, 4559 KB  
Article
Effects of Biases in Geometric and Physics-Based Imaging Attributes on Classification Performance
by Bahman Rouhani and John K. Tsotsos
J. Imaging 2025, 11(10), 333; https://doi.org/10.3390/jimaging11100333 - 25 Sep 2025
Viewed by 674
Abstract
Learned systems in the domain of visual recognition and cognition impress in part because even though they are trained with datasets many orders of magnitude smaller than the full population of possible images, they exhibit sufficient generalization to be applicable to new and previously unseen data. Since training data sets typically represent such a small sampling of any domain, the possibility of bias in their composition is very real. But what are the limits of generalization given such bias, and up to what point might it be sufficient for a real problem task? There are many types of bias as will be seen, but we focus only on one, selection bias. In vision, image contents are dependent on the physics of vision and geometry of the imaging process and not only on scene contents. How do biases in these factors—that is, non-uniform sample collection across the spectrum of imaging possibilities—affect learning? We address this in two ways. The first is theoretical in the tradition of the Thought Experiment. The point is to use a simple theoretical tool to probe into the bias of data collection to highlight deficiencies that might then deserve extra attention either in data collection or system development. Those theoretical results are then used to motivate practical tests on a new dataset using several existing top classifiers. We report that, both theoretically and empirically, there are some selection biases rooted in the physics and imaging geometry of vision that challenge current methods of classification. Full article
(This article belongs to the Section Computer Vision and Pattern Recognition)

21 pages, 25636 KB  
Article
SARFT-GAN: Semantic-Aware ARConv Fused Top-k Generative Adversarial Network for Remote Sensing Image Denoising
by Haotian Sun, Ruifeng Duan, Guodong Sun, Haiyan Zhang, Feixiang Chen, Feng Yang and Jia Cao
Remote Sens. 2025, 17(17), 3114; https://doi.org/10.3390/rs17173114 - 7 Sep 2025
Cited by 1 | Viewed by 1112
Abstract
Optical remote sensing images play a pivotal role in numerous applications, notably feature recognition and scene semantic segmentation. Nevertheless, their efficacy is frequently compromised by various noise types, which detrimentally impact practical usage. We have meticulously crafted a novel attention module amalgamating Adaptive Rectangular Convolution (ARConv) with Top-k Sparse Attention. This design dynamically modifies feature receptive fields, effectively mitigating superfluous interference and enhancing multi-scale feature extraction. Concurrently, we introduce a Semantic-Aware Discriminator, leveraging visual-language prior knowledge derived from the Contrastive Language–Image Pretraining (CLIP) model, steering the generator towards a more realistic texture reconstruction. This research introduces an innovative image denoising model termed the Semantic-Aware ARConv Fused Top-k Generative Adversarial Network (SARFT-GAN). Addressing shortcomings in traditional convolution operations, attention mechanisms, and discriminator design, our approach facilitates a synergistic optimization between noise suppression and feature preservation. Extensive experiments on RRSSRD, SECOND, a private Jilin-1 set, and real-world NWPU-RESISC45 images demonstrate consistent gains. Across three noise levels and four scenarios, SARFT-GAN attains state-of-the-art perceptual quality—achieving the best FID in all 12 settings and strong LPIPS—while remaining competitive on PSNR/SSIM. Full article
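The Top-k Sparse Attention half of the module described above can be sketched simply: keep only the k largest scores per query and mask the rest before the softmax, so each query aggregates from its most relevant keys only. The names and shapes here are illustrative assumptions, not the SARFT-GAN implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def topk_sparse_attention(q, k, v, top_k):
    """Per query, keep only the top_k highest scores; mask the rest to -inf."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    kth = np.sort(scores, axis=-1)[:, -top_k][:, None]  # k-th largest per row
    masked = np.where(scores >= kth, scores, -np.inf)   # suppress weak keys
    weights = softmax(masked, axis=-1)                  # zeros where masked
    return weights @ v, weights

rng = np.random.default_rng(1)
q, kmat, v = rng.normal(size=(2, 4)), rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
out, w = topk_sparse_attention(q, kmat, v, top_k=2)
print((w > 0).sum(axis=-1))  # each query attends to exactly 2 keys
```

The masking zeroes out low-relevance positions, which is the "superfluous interference" the abstract says the module is meant to suppress.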

16 pages, 1077 KB  
Case Report
Investigating the Impact of Presentation Format on Reading Ability in Posterior Cortical Atrophy: A Case Study
by Jeremy J. Tree and David R. Playfoot
Reports 2025, 8(3), 160; https://doi.org/10.3390/reports8030160 - 31 Aug 2025
Cited by 2 | Viewed by 837
Abstract
Background and Clinical Significance: Patients with the neurodegenerative condition posterior cortical atrophy (PCA) can present with attention impairments across a variety of cognitive contexts, but the consequences of these impairments are little explored in the context of single-word reading. Case Presentation: We present a detailed single-case study of KL, a resident of South Wales diagnosed with PCA in 2018, whose reading and letter-naming abilities are selectively disrupted under non-canonical visual presentations. In particular, KL shows significantly impaired reading accuracy for words presented in tilted (rotated 90°) format, whereas his reading under conventional horizontal (canonical) presentation is nearly flawless. Other presentation formats, including mixed-case text (e.g., TaBLe) and vertical (marquee) format, led to only mild performance decrements, even though mixed-case formats are generally thought to increase attentional 'crowding' effects. Discussion: These findings indicate that impairments of word reading can emerge in PCA when visual-attentional demands are sufficiently high and access to 'top-down' orthographic information is severely attenuated. Next, we explored a cardinal feature of attentional dyslexia, the word-letter reading dissociation, in which word reading is superior to letter-in-string naming. In KL, a similar dissociative pattern could be provoked by non-canonical formats: conditions that disrupted his word reading led to a pronounced disparity between word and letter-in-string naming performance. Moreover, different orientation formats revealed the availability (or otherwise) of distinct compensatory strategies. KL successfully relied on an oral (letter-by-letter) spelling strategy when reading vertically presented words or naming letters in strings, whereas he had no ability to engage compensatory mental rotation processes for tilted text. Thus, the observed impact of non-canonical presentations was moderated by the success or failure of alternative compensatory strategies. Conclusions: Importantly, our results suggest that an attentional 'dyslexia-like' profile can be unmasked in PCA under sufficiently taxing visual-attentional conditions. This approach may prove useful in clinical assessment, highlighting subtle reading impairments that conventional testing might overlook.

19 pages, 9845 KB  
Article
TriQuery: A Query-Based Model for Surgical Triplet Recognition
by Mengrui Yao, Wenjie Zhang, Lin Wang, Zhongwei Zhao and Xiao Jia
Sensors 2025, 25(17), 5306; https://doi.org/10.3390/s25175306 - 26 Aug 2025
Cited by 1 | Viewed by 1164
Abstract
Artificial intelligence has shown great promise in advancing intelligent surgical systems. Among its applications, surgical video action recognition plays a critical role in enabling accurate intraoperative understanding and decision support. However, the task remains challenging due to the temporal continuity of surgical scenes and the long-tailed, semantically entangled distribution of action triplets composed of instruments, verbs, and targets. To address these issues, we propose TriQuery, a query-based model for surgical triplet recognition and classification. Built on a multi-task Transformer framework, TriQuery decomposes the complex triplet task into three semantically aligned subtasks using task-specific query tokens, which are processed through specialized attention mechanisms. We introduce a Multi-Query Decoding Head (MQ-DH) to jointly model structured subtasks and a Top-K Guided Query Update (TKQ) module to incorporate inter-frame temporal cues. Experiments on the CholecT45 dataset demonstrate that TriQuery achieves improved overall performance over existing baselines across multiple classification tasks. Attention visualizations further show that task queries consistently attend to semantically relevant spatial regions, enhancing model interpretability. These results highlight the effectiveness of TriQuery for advancing surgical video understanding in clinical environments. Full article
(This article belongs to the Section Intelligent Sensors)

28 pages, 4317 KB  
Article
Multi-Scale Attention Networks with Feature Refinement for Medical Item Classification in Intelligent Healthcare Systems
by Waqar Riaz, Asif Ullah and Jiancheng (Charles) Ji
Sensors 2025, 25(17), 5305; https://doi.org/10.3390/s25175305 - 26 Aug 2025
Cited by 4 | Viewed by 1357
Abstract
The increasing adoption of artificial intelligence (AI) in intelligent healthcare systems has elevated the demand for robust medical imaging and vision-based inventory solutions. For an intelligent healthcare inventory system, accurate recognition and classification of medical items, including medicines and emergency supplies, are crucial for ensuring inventory integrity and timely access to life-saving resources. This study presents a hybrid deep learning framework, EfficientDet-BiFormer-ResNet, that integrates three specialized components: EfficientDet’s Bidirectional Feature Pyramid Network (BiFPN) for scalable multi-scale object detection, BiFormer’s bi-level routing attention for context-aware spatial refinement, and ResNet-18 enhanced with triplet loss and Online Hard Negative Mining (OHNM) for fine-grained classification. The model was trained and validated on a custom healthcare inventory dataset comprising over 5000 images collected under diverse lighting, occlusion, and arrangement conditions. Quantitative evaluations demonstrated that the proposed system achieved a mean average precision (mAP@0.5:0.95) of 83.2% and a top-1 classification accuracy of 94.7%, outperforming conventional models such as YOLO, SSD, and Mask R-CNN. The framework excelled in recognizing visually similar, occluded, and small-scale medical items. This work advances real-time medical item detection in healthcare by providing an AI-enabled, clinically relevant vision system for medical inventory management. Full article
(This article belongs to the Section Intelligent Sensors)
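The triplet loss with online hard negative mining used for the fine-grained classification stage can be sketched in its common "batch-hard" form: for each anchor, take its farthest positive and nearest negative within the batch. The function name and margin value are illustrative assumptions, not the paper's code:

```python
import numpy as np

def hardest_triplet_loss(embeddings, labels, margin=0.3):
    """Batch-hard triplet loss: per anchor, use the farthest same-class sample
    and the nearest different-class sample (online hard mining)."""
    d = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    losses = []
    for i in range(len(labels)):
        pos = d[i][same[i] & (np.arange(len(labels)) != i)]  # exclude the anchor itself
        neg = d[i][~same[i]]
        if pos.size and neg.size:
            losses.append(max(0.0, pos.max() - neg.min() + margin))
    return float(np.mean(losses))

# Well-separated classes: even the hardest negative is far, so the loss is zero.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(hardest_triplet_loss(emb, [0, 0, 1, 1]))  # 0.0
```

Mining the hardest pairs inside each batch is what lets the loss keep pushing apart the visually similar medical items the abstract highlights, instead of averaging over many already-easy triplets.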

22 pages, 3139 KB  
Article
A Counterfactual Fine-Grained Aircraft Classification Network for Remote Sensing Images Based on Normalized Coordinate Attention
by Zeya Zhao, Wenyin Tuo, Shuai Zhang and Xinbo Zhao
Appl. Sci. 2025, 15(16), 8903; https://doi.org/10.3390/app15168903 - 12 Aug 2025
Viewed by 817
Abstract
Fine-grained aircraft classification in remote sensing is a critical task within the field of remote sensing image processing, aiming to precisely distinguish between different types of aircraft in aerial images. Due to the high visual similarity among aircraft targets in remote sensing images, accurately capturing subtle and discriminative features becomes a key technical challenge for fine-grained aircraft classification. In this context, we propose a Normalized Coordinate Attention-Based Counterfactual Classification Network (NCC-Net), which emphasizes the spatial positional information of aircraft targets and effectively captures long-range dependencies, thereby enabling precise localization of various aircraft components. Furthermore, we analyze the proposed network from a causal perspective, encouraging the model to focus on key discriminative features of the aircraft while minimizing distraction from the surrounding environment and background. Experimental results on three benchmark datasets demonstrate the superiority of our method. Specifically, NCC-Net achieves Top-1 classification accuracies of 97.7% on FAIR1M, 95.2% on MTARSI2, and 98.4% on ARSI120, outperforming several state-of-the-art methods. These results highlight the effectiveness and generalizability of our proposed method for fine-grained remote sensing target recognition. Full article
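The characteristic step of coordinate attention is direction-aware pooling: averaging over width and over height separately, so that the spatial position of a feature survives, unlike global average pooling, which collapses it. A minimal sketch under illustrative names (not NCC-Net's code, which additionally normalizes these descriptors):

```python
import numpy as np

def coordinate_pool(feat):
    """Direction-aware pooling over a (C, H, W) feature map: a per-row
    descriptor (pooled over width) and a per-column descriptor (pooled
    over height), preserving positional information."""
    h_desc = feat.mean(axis=2)  # (C, H): one value per row
    w_desc = feat.mean(axis=1)  # (C, W): one value per column
    return h_desc, w_desc

feat = np.zeros((1, 4, 6))
feat[0, 2, :] = 1.0             # a horizontal stripe at row 2
h_desc, w_desc = coordinate_pool(feat)
print(h_desc[0])  # [0. 0. 1. 0.] -- the row position survives pooling
print(w_desc[0])  # uniform 0.25 -- the stripe spans all columns
```

It is this preserved positional signal that lets such a mechanism localize aircraft components (nose, wings, tail) rather than only re-weighting channels.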

20 pages, 2759 KB  
Article
Visual Attention Fusion Network (VAFNet): Bridging Bottom-Up and Top-Down Features in Infrared and Visible Image Fusion
by Yaochen Liu, Yunke Wang and Zixuan Jing
Symmetry 2025, 17(7), 1104; https://doi.org/10.3390/sym17071104 - 9 Jul 2025
Viewed by 843
Abstract
Infrared and visible image fusion aims to integrate useful information from the source image to obtain a fused image that not only has excellent visual perception but also promotes the performance of the subsequent object detection task. However, due to the asymmetry between image fusion and object detection tasks, obtaining superior visual effects while facilitating object detection tasks remains challenging in real-world applications. Addressing this issue, we propose a novel visual attention fusion network for infrared and visible image fusion (VAFNet), which can bridge bottom-up and top-down features to achieve high-quality visual perception while improving the performance of object detection tasks. The core idea is that bottom-up visual attention is utilized to extract multi-layer bottom-up features for ensuring superior visual perception, while top-down visual attention determines object attention signals related to object detection tasks. Then, a bidirectional attention integration mechanism is designed to naturally integrate two forms of attention into the fused image. Experiments on public and collection datasets demonstrate that VAFNet not only outperforms seven state-of-the-art (SOTA) fusion methods in qualitative and quantitative evaluation but also has advantages in facilitating object detection tasks. Full article
(This article belongs to the Special Issue Symmetry in Next-Generation Intelligent Information Technologies)
