Search Results (1,239)

Search Parameters:
Keywords = visual information consistency

23 pages, 9399 KB  
Article
Restoring Geometric and Probabilistic Symmetry for Tiny Football Localization in Dynamic Environments
by Hongyang Liu, Longying Wang, Qiang Zheng, Gang Zhao and Huiteng Xu
Symmetry 2026, 18(4), 587; https://doi.org/10.3390/sym18040587 - 30 Mar 2026
Abstract
The precise identification of minute, high-velocity entities within unconstrained visual fields represents a significant hurdle in computational perception. This difficulty primarily arises from the geometric degradation stemming from scale volatility, motion-induced asymmetry, and heterogeneous background clutter. To mitigate the critical deficit of high-fidelity benchmarks for dynamic micro-targets, we present Soccer-Wild. This comprehensive dataset is characterized by the extreme visual complexity of microscopic objects in diverse ecological settings. Built upon this empirical foundation, we introduce GOAL (Global Object Alignment for Localization). This novel computational paradigm is designed to enhance the weak features of tiny targets by integrating frequency-domain filtering, dynamic feature routing, and entropy-guided probabilistic modeling. The GOAL framework rigorously preserves spatial-structural equilibrium and information fidelity through three synergetic mechanisms: (1) Spectral Purification: We implement a Frequency-aware Spectral Gating approach that operates in the Fourier manifold, suppressing stochastic noise to accentuate the spectral signatures of the targets; (2) Geometric Adaptation: A Multi-Granularity Mixture of Experts (MG-MoE) is formulated with heterogeneous receptive fields to dynamically rectify anisotropic distortions caused by kinetic blurring. This adaptive routing ensures cross-state representation consistency; (3) Information Recovery: We propose Information-Guided Gaussian Distribution Estimation (IGDE), which utilizes information entropy to conceptualize target coordinates as radially symmetric probability densities. This facilitates the implicit recovery of latent signals typically discarded by rigid deterministic regression. Empirical validations on the Soccer-Wild and VisDrone2019 benchmarks reveal that the proposed methodology yields substantial gains in precision. Specifically, our model achieves 40.0% and 40.4% AP (Average Precision), respectively, establishing a new state-of-the-art for localizing highly dynamic, micro-scale objects.
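
The IGDE idea of treating a target coordinate as a radially symmetric probability density can be made concrete with a standard Gaussian-heatmap encoding. The sketch below is a minimal illustration of that general technique, not the paper's implementation; the grid size, centre, and sigma are arbitrary.

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma):
    """Encode a target centre as a radially symmetric 2D Gaussian density.

    Illustrates the general idea of modelling coordinates as probability
    densities rather than hard points; the exact IGDE formulation in the
    paper is entropy-guided and more involved.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2   # squared radial distance
    return np.exp(-d2 / (2.0 * sigma ** 2))

# A tiny ball at (40, 25) in an 80x60 feature map:
heat = gaussian_heatmap(60, 80, cx=40, cy=25, sigma=2.5)
print(heat.shape, heat.max())  # (60, 80) 1.0
```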

17 pages, 1639 KB  
Article
Cascade Registration and Fusion for Unaligned Infrared and Visible Images in Autonomous Driving
by Long Xiao, Yidong Xie and Chengda Yao
Electronics 2026, 15(7), 1427; https://doi.org/10.3390/electronics15071427 - 30 Mar 2026
Abstract
Infrared and visible image fusion is a critical technology for enhancing the all-weather perception capabilities of autonomous driving systems. However, the inherent physical parallax of vehicle-mounted sensors combined with motion-induced vibrations makes it difficult to achieve strict alignment between the source images. Direct fusion of such misaligned pairs leads to ghosting artifacts, which significantly compromises driving safety. To address this challenge, this paper proposes a cascaded deep fusion framework tailored for autonomous driving scenarios. A dual-modal perception dataset is first constructed, incorporating realistic physical parallax and non-rigid deformations. Subsequently, a decoupled strategy is established, characterized by geometric correction followed by semantic fusion: the Static-Feature Recursive Registration (SFRR) network is utilized to explicitly correct the spatial misalignments caused by parallax, thereby establishing geometric consistency; then, the Hierarchical Invertible Block Fusion (HIBF) network achieves lossless integration of cross-modal features by combining spatial frequency separation with invertible interaction techniques. Experimental results demonstrate that the proposed method outperforms representative algorithms across several metrics, including Mutual Information (MI), Visual Information Fidelity (VIF), Structural Similarity (SSIM), and Correlation Coefficient (CC), producing high-quality fused images with clear structural definitions.
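
Of the metrics listed, Mutual Information is the least self-explanatory. A common histogram-based estimator, shown below as a hedged sketch (the bin count and greyscale uint8 inputs are assumptions), measures how much information a fused image retains from a source image.

```python
import numpy as np

def mutual_information(a, b, bins=64):
    """Histogram-based MI (in nats) between two greyscale images.

    A standard estimator, shown only to make the MI metric concrete;
    published evaluations typically use tuned bin counts.
    """
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()                 # joint distribution
    px, py = pxy.sum(axis=1), pxy.sum(axis=0) # marginals
    nz = pxy > 0                              # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px[:, None] * py[None, :])[nz])))
```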

31 pages, 11688 KB  
Article
RShDet: An Adaptive Spectral-Aware Network for Remote Sensing Object Detection Under Haze Corruption
by Wei Zhang, Yuantao Wang, Haowei Yang and Xuerui Mao
Remote Sens. 2026, 18(7), 1020; https://doi.org/10.3390/rs18071020 - 29 Mar 2026
Abstract
Remote sensing (RS) object detection faces intrinsic challenges arising from the overhead imaging paradigm and the diversity of climatic conditions. In particular, atmospheric phenomena such as clouds and haze cause severe visual degradation, making reliable object detection difficult. However, most existing detectors are developed under clear-weather conditions, which limits their generalization capability in realistic haze-degraded RS scenarios. To alleviate this issue, an adaptive spectral-aware network for RS object detection under haze interference is proposed, termed RShDet, which is designed to handle both high-altitude RS imagery and low-altitude Unmanned Aerial Vehicle (UAV) scenarios. Firstly, the Object-Centered Dynamic Enhancement (OCDE) module dynamically adjusts the spatial positions of key-value pairs through query-agnostic offsets, enabling the network to emphasize object-relevant regions while suppressing haze-induced background interference. Secondly, the Dynamic Multi-Spectral Perception and Filtering (DSPF) module introduces a multi-spectral attention mechanism that adaptively selects informative frequency components, thereby enhancing discriminative feature representations in hazy environments. Thirdly, the Frequency-Domain Multi-Feature Fusion (FDMF) module employs learnable weights to complementarily integrate amplitude and phase information in the frequency domain, enabling effective cross-task feature interaction between the enhancement and detection branches. Extensive experiments demonstrate that RShDet consistently achieves superior detection performance under hazy conditions across both synthetic and real-world benchmarks. Specifically, it achieves improvements of 2.4% mAP50 on Hazy-DOTA, 1.9% mAP on HazyDet, and 2.33% mAP on the real-world foggy dataset RTTS, surpassing existing state-of-the-art methods.
(This article belongs to the Special Issue Advances in Remote Sensing Image Target Detection and Recognition)
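
The FDMF module's amplitude-phase blending can be illustrated with a single-channel sketch: transform two feature maps with the FFT, mix magnitudes and phases with separate weights, and invert. In RShDet the weights are learned and the module operates on multi-channel network features; the fixed scalar weights below are a simplification.

```python
import numpy as np

def frequency_fuse(feat_a, feat_b, w_amp=0.5, w_pha=0.5):
    """Blend two 2D feature maps in the Fourier domain.

    Amplitude and phase are mixed with separate weights, mimicking the
    spirit of FDMF; this is a toy stand-in, not the paper's module.
    """
    Fa, Fb = np.fft.fft2(feat_a), np.fft.fft2(feat_b)
    amp = w_amp * np.abs(Fa) + (1 - w_amp) * np.abs(Fb)    # magnitude mix
    pha = w_pha * np.angle(Fa) + (1 - w_pha) * np.angle(Fb)  # phase mix
    return np.real(np.fft.ifft2(amp * np.exp(1j * pha)))
```
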
27 pages, 7770 KB  
Article
Structured Data Visualization Instruction in Graduate Education: An Empirical Study of Conceptual and Procedural Development
by Simón Gutiérrez de Ravé, Eduardo Gutiérrez de Ravé and Francisco José Jiménez-Hornero
Educ. Sci. 2026, 16(4), 533; https://doi.org/10.3390/educsci16040533 - 27 Mar 2026
Abstract
Information visualization is a crucial yet often underdeveloped research skill in graduate education. This study examined how practice-based visualization instruction enhances graduate students’ conceptual understanding and procedural competence in scientific graph construction. Forty first-year graduate students participated in a ten-week instructional program combining diagnostic assessment, guided exercises, and a complex graph replication task. Conceptual and procedural competence were evaluated using validated analytic rubrics to ensure reliability and depth of analysis. Results showed substantial improvement in students’ ability to select suitable chart types, label axes accurately, and apply coherent color schemes. Consistent with the study’s hypotheses, significant gains were observed in conceptual understanding (H1) and technical execution (H2), and a moderate positive correlation between the two domains (H3) confirmed that stronger conceptual grasp aligned with higher visualization proficiency. Iterative feedback and guided reflection supported the integration of theory and practice. However, challenges in detailed annotation and multivariable coordination persisted. Overall, structured, practice-based visualization training enhanced methodological competence and communication clarity. Embedding such experiential learning within graduate curricula can strengthen visualization literacy and support the development of research independence.
(This article belongs to the Section Higher Education)
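
Hypothesis H3 is a plain correlation between the two rubric domains. The snippet below shows how such a check is typically run; the scores are synthetic placeholders, since the study's data are not reproduced in this listing.

```python
import numpy as np
from scipy import stats

# Hypothetical rubric scores for illustration only.
conceptual = np.array([3.1, 3.8, 2.9, 4.2, 3.5, 4.0])
procedural = np.array([2.8, 3.9, 3.0, 4.4, 3.2, 4.1])

# A moderate positive r with p below the chosen alpha would support H3.
r, p = stats.pearsonr(conceptual, procedural)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```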

41 pages, 3554 KB  
Review
Towards Reliable Power Grid Modeling from Drawings: A Review of Intelligent Understanding, Topology Inference, and Model Generation
by Congying Wu, Haozheng Yu, Yu Liu and Chao Gong
Machines 2026, 14(4), 371; https://doi.org/10.3390/machines14040371 - 27 Mar 2026
Abstract
This paper presents a comprehensive review of the intelligent understanding of power grid drawings, with the aim of enabling reliable and executable grid modeling. First, a unified pipeline is established to describe the transformation from drawings to grid models, covering visual understanding, topology inference, and consistency validation. Second, existing methods are systematically analyzed within this framework, where visual understanding extracts components and textual information and topology inference reconstructs electrical connectivity and network structure. Third, model generation methods are investigated as a critical yet underexplored component, focusing on topology correctness and physical constraint verification. Compared with existing review studies that primarily focus on perception-level tasks such as detection and recognition, this paper explicitly emphasizes the reliability of the resulting models. It highlights that errors in connectivity inference and the lack of validation mechanisms significantly limit practical deployment. Key challenges, including connectivity ambiguity, error propagation, and the absence of standardized validation frameworks, are analyzed. Furthermore, emerging directions such as topology-aware learning and physics-constrained validation are discussed. This review provides a structured perspective on transforming power grid drawings into reliable models and offers insights for future research into power system digitalization.
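
A minimal version of the consistency validation the review calls for is a graph-reachability check on the inferred topology: any component unreachable from a source bus signals a dropped connection. The sketch below uses networkx on an invented toy diagram; real validation would also enforce electrical constraints.

```python
import networkx as nx

# Toy single-line-diagram graph; node names are illustrative only.
g = nx.Graph()
g.add_edges_from([
    ("bus_1", "transformer_T1"),
    ("transformer_T1", "bus_2"),
    ("bus_2", "breaker_B1"),
    ("breaker_B1", "load_L1"),
])

# Every component must be reachable from the source bus, otherwise
# connectivity inference likely dropped or misrouted an edge.
unreachable = set(g.nodes) - nx.node_connected_component(g, "bus_1")
assert not unreachable, f"dangling components: {unreachable}"
```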

33 pages, 172200 KB  
Article
HDCGAN+: A Low-Illumination UAV Remote Sensing Image Enhancement and Evaluation Method Based on WPID
by Kelly Chen Ke, Min Sun, Xinyi Wang, Dong Liu and Hanjun Yang
Remote Sens. 2026, 18(7), 999; https://doi.org/10.3390/rs18070999 - 26 Mar 2026
Abstract
Remote sensing images acquired by UAVs under nighttime or low-illumination conditions suffer from insufficient illumination, leading to degraded image quality, detail loss, and noise, which restrict their application in public security and disaster emergency scenarios. Although existing machine learning-based enhancement methods can recover part of the missing information, they often cause color distortion and texture inconsistency. This study proposes an improved low-illumination image enhancement method based on a Weakly Paired Image Dataset (WPID), combining the Hierarchical Deep Convolutional Generative Adversarial Network (HDCGAN) with a low-rank image fusion strategy to enhance the quality of low-illumination UAV remote sensing images. First, YCbCr color channel separation is applied to preserve color information from visible images. Then, a Low-Rank Representation Fusion Network (LRRNet) is employed to perform structure-aware fusion between thermal infrared (TIR) and visible images, thereby enabling effective preservation of structural details and realistic color appearance. Furthermore, a weakly paired training mechanism is incorporated into HDCGAN to enhance detail restoration and structural fidelity. To achieve objective evaluation, a structural consistency assessment framework is constructed based on semantic segmentation results from the Segment Anything Model (SAM). Experimental results demonstrate that the proposed method outperforms state-of-the-art approaches in both visual quality and application-oriented evaluation metrics.
(This article belongs to the Section Remote Sensing Image Processing)
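
The YCbCr separation step is standard colour-space arithmetic: luminance (Y) is enhanced and fused while chrominance (Cb/Cr) is carried through unchanged, which is what preserves colour. Below is a generic BT.601 conversion as a sketch; the paper's pipeline differs beyond this step.

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Full-range BT.601 RGB -> YCbCr for an (..., 3) uint8/float array.

    A generic conversion, not the paper's code: enhance rgb_to_ycbcr(x)[..., 0]
    (the Y channel) and keep Cb/Cr fixed to preserve colour information.
    """
    m = np.array([[ 0.299,    0.587,    0.114  ],
                  [-0.1687,  -0.3313,   0.5    ],
                  [ 0.5,     -0.4187,  -0.0813 ]])
    ycc = rgb.astype(np.float32) @ m.T
    ycc[..., 1:] += 128.0  # centre the chroma channels
    return ycc
```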

30 pages, 11967 KB  
Article
Incorporating Occupant Age Structure into Building Energy Simulation for Envelope Retrofit Evaluation in Existing Residential Buildings
by Zexin Man, Yutong Tan, Han Lin, Zhengtao Ai and Rongpeng Zhang
Buildings 2026, 16(7), 1323; https://doi.org/10.3390/buildings16071323 - 26 Mar 2026
Abstract
The retrofit of existing residential buildings plays a critical role in reducing energy consumption and carbon emissions in the building sector. However, previous retrofit evaluations often fail to account for the age-related thermal and lighting requirements of residents in aging residential buildings, thereby overlooking the substantial behavioral heterogeneity that shapes retrofit effectiveness. This study evaluates the comprehensive performance of different building envelope retrofit strategies, considering occupants’ thermal and visual comfort, from the perspectives of energy efficiency, economic feasibility, and environmental sustainability. First, age-specific differences in occupancy patterns, thermal preferences, and lighting requirements between elderly occupants and a non-elderly comparison group were systematically extracted from the literature. Then, a typical high-rise residential building was modeled in EnergyPlus to serve as the reference building, within which the differentiated occupant behavior models were implemented, and the pre-retrofit condition was defined as the baseline scenario. Next, six commonly applied exterior wall insulation materials and different glass configurations and window frames were parameterized and evaluated under varying insulation thicknesses and remaining building service life scenarios. Finally, the energy-saving performance, economic benefits, and carbon reduction potential of envelope retrofit measures were quantitatively assessed across three primary functional zones (bedroom, living room, and study), using area-normalized indicators. The results indicate that, in the retrofit of existing residential buildings, bedrooms and study rooms exhibit greater retrofit benefits than living rooms, primarily due to longer occupancy durations and higher heating demand. In terms of retrofit strategies, exterior wall insulation consistently outperforms window retrofitting in energy-saving potential, with energy-saving rates of approximately 3.2–4.3% depending on functional zone, material type, and insulation thickness. Among the evaluated materials, vitrified microbead insulation performs best overall in terms of energy, economic, and carbon benefits at 40–60 mm thickness. These findings support occupant-informed, low-carbon retrofit decision-making for existing residential buildings.
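
The headline energy-saving rate is simple arithmetic over simulated annual intensities, as the toy numbers below illustrate (placeholders, not the study's EnergyPlus outputs).

```python
# Area-normalised energy-saving rate, the indicator quoted above.
baseline_kwh_m2 = 58.0   # hypothetical pre-retrofit annual intensity
retrofit_kwh_m2 = 55.6   # hypothetical intensity after wall insulation
saving_rate = (baseline_kwh_m2 - retrofit_kwh_m2) / baseline_kwh_m2
print(f"energy-saving rate: {saving_rate:.1%}")  # ~4.1%, inside the reported 3.2-4.3% band
```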

14 pages, 1333 KB  
Article
Enhancing Pilot ‘Mission’ Projection Through a Virtual Reality Flight Simulator: A Quasi-Transfer of Training Study
by Alexander Somerville, Keith Joiner and Graham Wild
Sci 2026, 8(4), 70; https://doi.org/10.3390/sci8040070 - 26 Mar 2026
Abstract
The purported benefits of Virtual Reality for pilot flight simulator training, such as increased immersion and presence, would be of great benefit in training those flight skills that rely on visuospatial awareness. The implementation of this technology for the training of pilots requires careful consideration of its ability to transfer required skills and of any comparative advantages over conventional flight simulators. To examine this question, a quasi-transfer-of-training study was conducted using a separate-sample pretest–posttest design. The ability of a low-cost VR simulator to transfer flying skills and mission projection skills, using internally valid measures, during a common flight manoeuvre was evaluated. Results were consistent with improved post-intervention flying performance (g = 0.875) and ‘mission projection’ performance (g = 0.661), with no statistically significant difference between the two effect sizes; the combined measure yielded g = 0.768. The findings indicate that the VR simulator was associated with better performance in the quasi-transfer of basic flying skills, in those skills that require understanding of spatial relationships based on visual information, and in the broader training of technique. These findings must, however, be considered in the context of the noted limitations of the technology and the research design.
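
The reported effect sizes are Hedges' g, a standardised mean difference with a small-sample bias correction. The sketch below computes it from two score samples; the data shown are synthetic, not the study's.

```python
import numpy as np

def hedges_g(x, y):
    """Bias-corrected standardised mean difference (Hedges' g) of y vs. x."""
    nx_, ny = len(x), len(y)
    sp = np.sqrt(((nx_ - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1))
                 / (nx_ + ny - 2))                 # pooled standard deviation
    d = (np.mean(y) - np.mean(x)) / sp             # Cohen's d
    return d * (1 - 3 / (4 * (nx_ + ny) - 9))      # small-sample correction

pre = np.array([62, 58, 70, 65, 61, 67])   # synthetic pretest scores
post = np.array([71, 66, 74, 72, 69, 75])  # synthetic posttest scores
print(round(hedges_g(pre, post), 3))
```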

22 pages, 2243 KB  
Article
Multimodal Fake News Detection via Evidence Retrieval and Visual Forensics with Large Vision-Language Models
by Liwei Dong, Yanli Chen, Wei Ke, Hanzhou Wu, Lunzhi Deng and Guixiang Liao
Information 2026, 17(4), 317; https://doi.org/10.3390/info17040317 - 25 Mar 2026
Abstract
Fake news has caused significant harm and disruption across various sectors of society. With the rapid advancement of the Internet and social media platforms, both academic and industrial communities have shown growing interest in multimodal fake news detection. In this work, we propose MERF (Multimodal Evidence Retrieval and Forensics with LVLM), a unified framework for multimodal fake news detection that leverages the reasoning capabilities of Large Vision-Language Models (LVLMs). While LVLMs outperform traditional Large Language Models (LLMs) in processing multimodal content, our study reveals that their reasoning abilities remain limited in the absence of sufficient supporting evidence. MERF addresses this challenge by integrating web-based content retrieval, reverse image search, and image manipulation detection into a coherent pipeline, enabling the model to generate informed and explainable veracity judgments. Specifically, our approach performs cross-modal consistency checking, retrieves corroborative information for both textual and visual content, and applies forensic analysis to detect potential visual forgeries. The aggregated evidence is then fed into the LVLM, facilitating comprehensive reasoning and evidence-based decision-making. Experimental results on two public benchmark datasets—Weibo and Twitter—demonstrate that MERF consistently outperforms state-of-the-art baselines across all major evaluation metrics, achieving substantial improvements in accuracy, robustness, and interpretability.
(This article belongs to the Section Artificial Intelligence)
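
The first gate in a pipeline like MERF's, cross-modal consistency checking, reduces to comparing text and image embeddings from a joint vision-language space. The sketch below is a deliberately simplified stand-in: `text_emb`, `image_emb`, and the threshold are assumptions, and the full system adds retrieval and forensic evidence before the LVLM reasons over it.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def consistency_flag(text_emb, image_emb, threshold=0.25):
    """Toy cross-modal consistency check.

    `text_emb` and `image_emb` are assumed to come from any joint
    vision-language embedding model; a low similarity marks the pair as
    a candidate mismatch for downstream evidence gathering.
    """
    return "possible mismatch" if cosine(text_emb, image_emb) < threshold else "consistent"
```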

28 pages, 13123 KB  
Article
A Generative Augmentation and Physics-Informed Network for Interpretable Prediction of Mining-Induced Deformation from InSAR Data
by Yuchen Han, Jiajia Yuan, Mingzhi Sun and Lu Liu
Remote Sens. 2026, 18(7), 987; https://doi.org/10.3390/rs18070987 - 25 Mar 2026
Abstract
Accurate forecasting of mining-induced surface deformation is critical for coal-mine safety assessment and hazard mitigation. InSAR deformation time series are often short, temporally sparse, and strongly nonlinear. These characteristics can make purely data-driven predictors unreliable in small-sample settings. To address this issue, we propose a generation–prediction–interpretation framework that combines generative augmentation with physics-informed forecasting. We first develop a TCN-TimeGAN model to synthesize high-fidelity deformation sequences and expand the training set. Recurrent modules in the generator and discriminator are replaced with causal TCN residual blocks, and a temporal self-attention layer is further stacked on top of the TCN backbone to adaptively reweight informative time steps. We then construct a physics-informed Kolmogorov–Arnold Network, termed PI-KAN. Subsidence-consistency and smoothness priors are embedded in the learning objective to promote physically plausible predictions while retaining spline-based interpretability. Experiments on SBAS-InSAR deformation series from the Guqiao coal mine show that the framework achieves an RMSE of 0.825 mm and an R2 of 0.968. It outperforms TGAN-KAN, CNN-BiGRU, and BiGRU under the same evaluation protocol. Visualizations of the learned spline-based edge functions further reveal stronger nonlinear responses for lagged inputs closer to the forecast horizon, providing interpretable evidence of short-term temporal sensitivity under sparse observations.
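
The two headline numbers, RMSE and R2, are computed as below; the inputs would be observed versus predicted line-of-sight deformation on held-out epochs (the function is generic, not the authors' evaluation code).

```python
import numpy as np

def rmse_r2(y_true, y_pred):
    """Root-mean-square error and coefficient of determination."""
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return rmse, 1.0 - ss_res / ss_tot
```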

25 pages, 3612 KB  
Article
Learning Modality Complementarity for RGB-D Salient Object Detection via Dynamic Neural Network
by Yuanhao Li, Jia Song, Chenglizhao Chen and Xinyu Liu
Electronics 2026, 15(7), 1361; https://doi.org/10.3390/electronics15071361 - 25 Mar 2026
Abstract
RGB-D salient object detection (RGB-D SOD) aims to accurately localize and segment visually salient objects by jointly leveraging RGB images and depth maps. Some existing methods rely on static fusion strategies with fixed paths and weights, which treat all regions equally and fail to capture the varying importance of different regions and modalities. Although some attention-based methods alleviate the limitations of static fusion by assigning adaptive weights to different regions and modalities, the quality of RGB and depth data may degrade in real-world scenarios due to sensor noise, illumination changes, or environmental interference. These attention-based methods often overlook inter-modality quality differences and complementarity, making them prone to over-relying on a certain modality, which can lead to noise introduction, feature conflicts, and performance degradation. To address these limitations, this paper proposes a novel dynamic feature routing and fusion framework for RGB-D SOD, which adaptively adjusts the fusion strategy according to the quality of input modalities. To enable modality quality awareness, the proposed method characterizes the modality complementarity between RGB and depth features in a task-driven manner inspired by information-theoretic principles. We introduce a task-relevance scoring function which is integrated with a mutual information estimator to quantify such complementarity, and emphasizes task-relevant features while suppressing redundancy. A dynamic routing module is then designed to perform feature selection guided by the captured complementarity. In addition, we propose a novel cross-modal fusion module to adaptively fuse the features selected by the dynamic routing module, which effectively enhances complementary representations while suppressing redundant features and noise interference. Extensive experiments conducted on seven public RGB-D SOD benchmark datasets demonstrate that the proposed method consistently achieves competitive performance, outperforming existing methods by an average of approximately 1% across multiple evaluation metrics. Notably, in challenging scenarios with severe modality quality degradation, the proposed method outperforms existing best-performing methods by up to 1.8%, demonstrating strong robustness against cluttered backgrounds, complex object structures, and diverse object scales. Overall, the proposed dynamic fusion framework provides a novel solution to modality quality imbalance in RGB-D salient object detection.
(This article belongs to the Section Artificial Intelligence)
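
The routing principle, down-weighting a degraded modality according to a quality or relevance score, can be sketched as a temperature-controlled softmax gate. In the paper the scores come from a learned task-relevance function and a mutual-information estimator; the scalar scores below are placeholders.

```python
import numpy as np

def modality_gate(score_rgb, score_depth, temperature=1.0):
    """Turn per-modality relevance scores into fusion weights.

    A deliberately simplified stand-in for the paper's dynamic routing;
    the gating principle (higher score -> larger fusion weight) is the same.
    """
    s = np.array([score_rgb, score_depth]) / temperature
    w = np.exp(s - s.max())  # numerically stable softmax
    return w / w.sum()

print(modality_gate(2.0, 0.5))  # a degraded depth stream gets down-weighted
```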

25 pages, 13685 KB  
Article
Vision and Language Reference for a Segment Anything Model for Few-Shot Segmentation
by Kosuke Sakurai, Ryotaro Shimizu and Masayuki Goto
J. Imaging 2026, 12(4), 143; https://doi.org/10.3390/jimaging12040143 - 24 Mar 2026
Abstract
Segment Anything Model (SAM)-based few-shot segmentation models traditionally rely solely on annotated reference images as prompts, which inherently limits their accuracy due to an over-reliance on visual cues and a lack of semantic context. This reliance leads to incorrect segmentation, where visually similar objects from different categories are incorrectly identified as the target object. We propose Vision and Language Reference Prompt into SAM (VLP-SAM), a novel few-shot segmentation model that integrates both visual information of reference images and semantic information of text labels into SAM. VLP-SAM introduces a vision-language model (VLM) with pixel–text matching into the prompt encoder for SAM, effectively leveraging textual semantic consistency while preserving SAM’s extensive segmentation knowledge. By incorporating task-specific structures such as an attention mask, our model achieves superior few-shot segmentation performance with only 1.4 M learnable parameters. Evaluations on PASCAL-5i and COCO-20i datasets demonstrate that VLP-SAM significantly outperforms previous methods by 6.8% and 9.3% in mIoU, respectively. Furthermore, VLP-SAM exhibits strong generalization across unseen objects and cross-domain scenarios, highlighting the robustness provided by textual semantic guidance. This study offers an effective and scalable framework for few-shot segmentation with multimodal prompts.
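
The reported gains are in mean intersection-over-union; a generic implementation of that metric over integer label maps is shown below for reference (it is the standard definition, not VLP-SAM's evaluation script).

```python
import numpy as np

def miou(pred, gt, num_classes):
    """Mean IoU over classes present in either map; inputs are integer label maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union:
            ious.append(inter / union)
    return float(np.mean(ious))
```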

11 pages, 602 KB  
Review
A Pharmacovigilance Analysis of Ocular Adverse Events Associated with GLP-1 Receptor Agonists
by Abdullah Virk and Karen Allison
J. Clin. Med. 2026, 15(6), 2464; https://doi.org/10.3390/jcm15062464 - 23 Mar 2026
Abstract
Background/Objectives: Glucagon-like peptide-1 receptor agonists (GLP-1 RAs) are increasingly prescribed for type 2 diabetes in addition to other conditions such as obesity. As their use expands, understanding potential ocular safety signals is important, particularly in populations already at risk for diabetic eye disease. The aim of this study is to identify potential pharmacovigilance safety signals for ocular adverse events (AEs) related to GLP-1 RA medications to better inform future clinical practice. Methods: This study utilized the publicly available FDA Adverse Event Reporting System (FAERS) to obtain AE reports related to exenatide, tirzepatide, dulaglutide, liraglutide, and semaglutide from 2005 to 2024. Reports were categorized by demographic and geographic variables. Disproportionality analysis using reporting odds ratios (RORs) was performed to detect potential safety signals. Year-over-year trends in the proportional representation of each drug were also assessed through linear regression and time series plots. Results: Ocular AEs represented 3.61% of all GLP-1 RA related reports. Median age was 63 years, and 62.6% of reports involved female patients. Exenatide accounted for 33.61% of ocular AEs but showed a significant annual decline in reporting (–5.15% per year, p < 0.001). Semaglutide (31.37%) and tirzepatide (12.19%) demonstrated significant year-over-year increases in proportional reporting (2.23% and 0.79% per year, respectively; both p < 0.05), consistent with rapid uptake in clinical practice. Semaglutide demonstrated a modestly elevated ROR (1.46), while tirzepatide showed a low ROR (0.42), though this likely reflects shorter post-marketing exposure rather than lower clinical risk. The most frequently reported events were visual impairment, followed by vision blurred, cataract, and blindness. Conclusions: This pharmacovigilance analysis identifies potential ocular AE signals associated with GLP-1 RAs, particularly semaglutide. While semaglutide showed a statistically significant disproportional reporting signal for ocular AEs, the absence of exposure denominators, comparator groups, and the susceptibility of FAERS to reporting bias means these findings are hypothesis-generating rather than causal. Clinicians should remain vigilant and consider eye care referrals when indicated. Further research is needed to validate these associations and clarify underlying mechanisms.
(This article belongs to the Section Ophthalmology)
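
The disproportionality signal behind the quoted RORs comes from a 2x2 table of report counts. The sketch below computes an ROR with the usual 95% confidence interval on the log scale; the counts are invented for illustration, not FAERS data.

```python
import numpy as np

def reporting_odds_ratio(a, b, c, d):
    """ROR from a 2x2 disproportionality table of report counts:
        a = target drug, target AE     b = target drug, other AEs
        c = other drugs, target AE     d = other drugs, other AEs
    Returns the ROR and a 95% CI computed on the log scale.
    """
    ror = (a / b) / (c / d)
    se = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(ROR)
    lo, hi = np.exp(np.log(ror) - 1.96 * se), np.exp(np.log(ror) + 1.96 * se)
    return ror, (lo, hi)

# Made-up counts: a signal is usually flagged when the CI excludes 1.
print(reporting_odds_ratio(120, 8200, 950, 94000))
```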

37 pages, 5953 KB  
Article
Fire Detection Using Sound Analysis Based on a Hybrid Artificial Intelligence Algorithm
by Robert-Nicolae Boştinaru, Sebastian-Alexandru Drǎguşin, Nicu Bizon, Dumitru Cazacu and Gabriel-Vasile Iana
Algorithms 2026, 19(3), 240; https://doi.org/10.3390/a19030240 - 23 Mar 2026
Abstract
Fire detection is a critical task for early warning systems, particularly in environments where visual sensing is unreliable. While most existing approaches rely on image-based or smoke-based detection, acoustic signals provide complementary information capable of capturing early combustion-related events. This study investigates deep learning models for sound-based fire detection, focusing on convolutional and Transformer-based architectures. VGG16 and VGG19 convolutional neural networks are adapted to process time-frequency audio representations for binary classification into Fire and No-Fire classes. An Audio Spectrogram Transformer (AST) is further employed to model long-range temporal dependencies in acoustic data. Finally, a hybrid VGG19-AST architecture is proposed, in which convolutional layers extract local spectral–temporal features, and Transformer-based self-attention performs global sequence modeling. The models are evaluated on a curated dataset containing fire sounds and diverse environmental background noises under multiple noise conditions. Experimental results demonstrate competitive performance across convolutional and Transformer-based models, while the proposed hybrid VGG19-AST architecture achieves the most consistent overall results. The findings suggest that integrating convolutional feature extraction with self-attention-based global modeling enhances robustness under complex acoustic variability. The proposed hybrid framework provides a scalable and cost-effective solution for sound-based fire detection, particularly in scenarios where visual monitoring may be obstructed or ineffective.
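
The time-frequency representation such models consume is typically a log-mel spectrogram. The sketch below uses librosa with assumed front-end settings (sample rate, FFT size, mel bands) and a hypothetical clip path; the paper's exact preprocessing may differ.

```python
import librosa
import numpy as np

# Hypothetical clip path; any mono recording works.
y, sr = librosa.load("fire_clip.wav", sr=16000)

# Log-mel spectrogram: the kind of input the VGG and AST branches consume.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (n_mels, frames), fed to the CNN as an image
```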

24 pages, 39455 KB  
Article
Information Bottleneck Scores for Identifying Causally Informative Attention Heads in Vision–Language Models
by Yiyou Zhang and Liyan Ma
Algorithms 2026, 19(3), 238; https://doi.org/10.3390/a19030238 - 23 Mar 2026
Abstract
Vision–language models (VLMs) have demonstrated remarkable performance on a wide range of multimodal reasoning tasks, yet their visual grounding mechanisms remain poorly understood and are often unreliable for fine-grained visual concepts. Existing approaches typically rely on raw attention maps or gradient-based saliency, which provide heuristic explanations but lack a causal interpretation of how visual evidence contributes to model predictions. In this paper, we propose an Information Bottleneck Score (IBS) framework that explicitly quantifies the causal importance of visual patches through interventional analysis. By masking candidate image patches and measuring the induced change in the model prediction, the IBS captures patch-level causal contributions rather than correlation-based signals. We further lift patch-level importance to the attention-head level by aggregating the IBS with text-to-image attention, enabling the identification of a small subset of information-transmitting attention heads responsible for visual grounding. Building on the selected heads, we construct refined importance maps that guide visual cropping in a fully training-free manner. Extensive experiments on multiple detail-sensitive benchmarks, including TextVQA, V*, POPE, and DocVQA, demonstrate consistent improvements in fine-grained visual understanding, while evaluations on general-purpose datasets such as GQA, AOKVQA, and VQAv2 confirm that overall reasoning performance is preserved. Additional ablation studies further validate the effectiveness of each component in the proposed framework. Overall, our work provides a causal perspective on visual grounding in VLMs and offers a model-agnostic, training-free approach for both interpreting and enhancing multimodal reasoning.
(This article belongs to the Section Evolutionary Algorithms and Machine Learning)
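
The interventional core of the IBS (mask a patch, re-run the model, score the prediction shift) can be sketched model-agnostically. `predict_fn` below is an assumed user-supplied scalar scorer, e.g. the log-probability of the model's original answer; the paper's full score additionally aggregates over attention heads via text-to-image attention.

```python
import numpy as np

def patch_causal_scores(image, predict_fn, patch=16, fill=0.0):
    """Interventional patch importance in the spirit of the IBS.

    Masks each patch in turn, re-runs the model via `predict_fn`
    (image -> scalar), and records the absolute prediction shift.
    """
    base = predict_fn(image)
    h, w = image.shape[:2]
    scores = np.zeros((h // patch, w // patch))
    for i in range(0, h - h % patch, patch):
        for j in range(0, w - w % patch, patch):
            masked = image.copy()
            masked[i:i + patch, j:j + patch] = fill  # the intervention
            scores[i // patch, j // patch] = abs(base - predict_fn(masked))
    return scores
```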
