Search Results (4,901)

Search Parameters:
Keywords = visual attention

22 pages, 78412 KB  
Article
DADNet: Dual-Branch Low-Light Image Enhancement Network Based on Attention Mechanism and Dark Channel Prior
by Lingyun Wang, Minli Tang, Hua Li, Feiyan Yang and Ming Yuan
Symmetry 2026, 18(4), 564; https://doi.org/10.3390/sym18040564 (registering DOI) - 26 Mar 2026
Abstract
Images captured in low-light conditions often have poor visibility, low contrast, and color distortion due to uneven lighting. Existing enhancement methods often suffer from unstable brightness recovery and color cast, which degrade both visual quality and the performance of downstream vision tasks. To address these issues, we propose DADNet, a dual-branch network with an attention mechanism and dark channel prior, comprising an Illumination Enhancement Module (IEM) and a Color Transformation Module (CTM). The IEM extracts multi-scale features and improves lighting based on the dark channel prior, while the CTM employs the attention mechanism to handle color features and adjust saturation adaptively. Experimental results on three datasets show that DADNet performs well in both qualitative and quantitative evaluations. It effectively preserves image structure and texture details while achieving a good balance between overall brightness and color quality. Full article
(This article belongs to the Section Computer)
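The dark channel prior that guides DADNet's illumination enhancement is a standard construct from image dehazing: the per-pixel minimum over color channels, min-filtered over a local patch. A minimal NumPy sketch of that construct, not the authors' implementation:

```python
import numpy as np

def dark_channel(image: np.ndarray, patch: int = 15) -> np.ndarray:
    """Dark channel prior: per-pixel minimum over the RGB channels,
    followed by a minimum filter over a local square patch."""
    min_rgb = image.min(axis=2)                # (H, W) minimum over channels
    pad = patch // 2
    padded = np.pad(min_rgb, pad, mode="edge")
    h, w = min_rgb.shape
    out = np.empty_like(min_rgb)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + patch, j:j + patch].min()
    return out
```

In well-lit haze-free images this map is close to zero almost everywhere; in dark or degraded regions it rises, which is what makes it a useful illumination cue.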

16 pages, 1730 KB  
Case Report
Neurorehabilitation and Functional Improvement in Joubert Syndrome: A 12-Month Case Report
by Łukasz Mański, Aleksandra Moluszys, Eliza Wasilewska, Agnieszka Rosa, Krzysztof Szczałuba, Jan Szumlicki, Krystyna Szymańska and Jolanta Wierzba
Children 2026, 13(4), 452; https://doi.org/10.3390/children13040452 (registering DOI) - 26 Mar 2026
Abstract
Background: Joubert syndrome (JS) is a rare ciliopathy characterized by cerebellar and brainstem malformations and the molar tooth sign on magnetic resonance imaging. Motor impairment is primarily driven by axial hypotonia, impaired postural control, and disrupted respiratory-postural integration. Longitudinal reports describing structured neurorehabilitation with standardized functional outcomes remain limited. Case presentation: We report a female child with prenatally suspected vermian hypoplasia and postnatally MRI-confirmed Joubert syndrome. Subsequent molecular testing performed at the age of 3 years and 11 months identified heterozygous variants in the B9D2 gene associated with Joubert syndrome. Early development was marked by axial hypotonia, global motor delay, impaired trunk stabilization, sleep-disordered breathing, and early hip migration. At 2.5 years of age, following motor plateau under conventional therapy, a structured 12-month rehabilitation programme was introduced, combining Vojta-based reflex locomotion, respiratory therapy targeting thoraco-diaphragmatic synchronization, daily home-based practice, and supported standing. Results: After 12 months, gross motor function improved substantially, with GMFM-88 increasing from 12% to 52% (+40 percentage points). PEDI scaled scores improved across all domains, with mobility increasing from 8 to 40, self-care from 15 to 45, and social function from 25 to 50. Ataxia severity decreased from 22 to 15 on the modified Brief Ataxia Rating Scale, consistent with improved trunk stability and coordination. Postural and respiratory organization improved, reflected by a reduction in the subcostal angle from 137° to 90°, an increase in sacral slope from 5° to 10°, and increased expiratory pressure from 10 to 25 mmHg. 
Caregiver-reported assessment combined with structured clinical observation indicated improved functional visual performance, including enhanced visual attention, visuomotor coordination, and environmental visual interaction. Conclusions: Structured neurorehabilitation was associated with substantial functional improvement across motor, postural, and respiratory domains. These findings support the clinical relevance of mechanism-oriented neurorehabilitation and standardized longitudinal outcome assessment in Joubert syndrome. Full article
(This article belongs to the Special Issue Physical Therapy in Pediatric Developmental Disorders)

16 pages, 1547 KB  
Article
Prospect and Refuge in the Workplace: An Exploratory Pilot EEG Investigation of Desk Orientation and Hypervigilance Among Adults with ADHD
by Jinoh Park, Michelle Boyoung Huh, Marjan Miri, Melissa Hoelting, Samantha Flores, Yashaswini Karagaiah and Mahdi Afkhami
Architecture 2026, 6(2), 51; https://doi.org/10.3390/architecture6020051 (registering DOI) - 25 Mar 2026
Abstract
Open-plan workplaces are often associated with increased sensory exposure, which may present challenges for adults with Attention-Deficit/Hyperactivity Disorder (ADHD), a condition characterized by atypical arousal regulation and sensory sensitivity. Although the Prospect–Refuge Theory suggests that spatial configuration may influence perceived security and attentional states, objective neurophysiological evidence in workplace contexts remains limited. This exploratory pilot study employed a mixed design to examine whether desk orientation and office enclosure were associated with differences in neural activity among adults with ADHD (n = 6). Four desk configurations were tested within each office setting, while two office types (Open Office and Enclosed Private Office) were examined between participants. Neurophysiological data were collected using portable electroencephalography (EEG), and power spectral density (PSD) across canonical frequency bands was analyzed during standardized cognitive tasks. Results indicated context-dependent spatial effects. In the Open Office setting, configurations providing both outward visibility and visual backing were associated with lower beta and gamma power relative to orientations lacking these features. In the Enclosed Private Office, orientation-related differences were not statistically significant. These preliminary findings suggest that desk orientation may influence neural indicators of cognitive demand in open-plan environments. Given the small sample size, results should be interpreted cautiously but contribute initial physiological evidence to neurodiversity-informed workplace research. Full article
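Power spectral density across canonical EEG frequency bands, as analyzed in this study, is commonly estimated with Welch's method and then summed per band. A rough sketch of that pipeline; the band edges below are illustrative conventions, not the study's exact definitions:

```python
import numpy as np
from scipy.signal import welch

# Illustrative band edges in Hz; exact definitions vary between studies.
BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30), "gamma": (30, 45)}

def band_power(signal: np.ndarray, fs: float) -> dict:
    """Welch PSD of one EEG channel, summed within each canonical band."""
    freqs, psd = welch(signal, fs=fs, nperseg=min(len(signal), int(2 * fs)))
    return {name: float(psd[(freqs >= lo) & (freqs < hi)].sum())
            for name, (lo, hi) in BANDS.items()}
```

Comparing beta and gamma power between desk configurations, as the study does, then reduces to comparing these per-band sums across conditions.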
18 pages, 21058 KB  
Article
MSSA-Net: Multi-Modal Structural and Semantic-Adaptive Network for Low-Light Image Enhancement
by Tianxiang Chen, Xiaoyi Wang, Tongshun Zhang and Qiuzhan Zhou
Sensors 2026, 26(7), 2059; https://doi.org/10.3390/s26072059 - 25 Mar 2026
Abstract
Low-light image enhancement (LLIE) remains challenging due to severe degradation of high-frequency structures and semantic ambiguity under extreme darkness. Although existing methods achieve satisfactory brightness recovery, they often suffer from structural inconsistency and semantic drift, as diverse scenes are typically processed with uniform enhancement strategies or static text prompts. To address these issues, we propose a Multi-Modal Structural and Semantic-Adaptive Network (MSSA-Net) under a structure-anchored paradigm. First, we design a Multi-Scale Self-Refinement Block (MSRB) to enhance degraded visible representations through multi-scale feature extraction and progressive refinement. Meanwhile, a pseudo-infrared structural prior derived from the input image is introduced to provide noise-insensitive geometric cues. These cues are extracted via a Structure-Guided Cross-Attention (SGCA) module to produce structure-dominant features. The refined visible features and structural features are then adaptively integrated through an adaptive residual fusion (ARF) module to achieve balanced restoration. Furthermore, we develop a Large Multi-modal Model (LMM)-Driven Scene-Adaptive Attention mechanism that generates instance-aware scene tags from a coarse preview and injects semantic embeddings into visual features. Extensive experiments demonstrate that MSSA-Net improves structural fidelity, brightness recovery, and semantic naturalness across multiple benchmarks. Full article
(This article belongs to the Special Issue AI-Based Computer Vision Sensors & Systems—2nd Edition)
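Cross-attention of the kind underlying modules like SGCA follows the standard scaled dot-product form, with one feature set (e.g. structural cues) attending over another. A bare NumPy sketch of that form, omitting the learned projections and the paper's specific architecture:

```python
import numpy as np

def cross_attention(query: np.ndarray, context: np.ndarray) -> np.ndarray:
    """Scaled dot-product cross-attention: each 'query' row attends over
    'context' rows (used here as both keys and values, for brevity)."""
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over context
    return weights @ context
```

In a full model, separate learned linear maps would produce queries, keys, and values; the attention weights themselves are what let structural features dominate where the visible signal is degraded.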

23 pages, 131728 KB  
Article
Hyperspectral Image Reconstruction Based on State Space Models
by Xuguang Wang, Haozhe Zhou, Tongxin Wei and Yanchao Zhang
Remote Sens. 2026, 18(7), 990; https://doi.org/10.3390/rs18070990 - 25 Mar 2026
Abstract
To address the high hardware costs associated with hyperspectral imaging in precision agriculture, spectral reconstruction (SR) is emerging as a feasible solution for obtaining hyperspectral images. However, existing methods, mainly CNN- and Transformer-based, face a notable dilemma: convolutional neural networks (CNNs) are limited by their local receptive fields, while Transformers encounter the problem of quadratic computational complexity. Effectively balancing computational efficiency with the capture of long-range spatial dependencies remains a significant challenge. To this end, this study proposes FGA-Mamba (Frequency-Gradient Attention Mamba), a novel reconstruction network based on the Mamba architecture. This network introduces a Frequency-Visual State Space (F-VSS) module, which combines the linear long-range modeling capability of state space models (SSMs) with a frequency-domain self-calibration mechanism to enhance global structural consistency by explicitly modulating frequency features. In addition, we designed an Enhanced Gradient Attention Module (EGAM). This module optimizes local feature representation through a gradient-aware mechanism, effectively compensating for the loss of spatial details. Experimental results on three datasets show that FGA-Mamba achieves significant improvements in both quantitative and qualitative metrics. Moreover, the high consistency observed in vegetation index (VI) calculations confirms its potential for practical agricultural application. Full article
(This article belongs to the Special Issue AI-Driven Remote Sensing Image Restoration and Generation)

25 pages, 3612 KB  
Article
CrtNet: A Cross-Model Residual Transformer Network for Structure-Guided Remote Sensing Scene Classification
by Chaoran Chen, Tianyuan Zhu, Tao Cui, Dalin Li, Adriano Tavares, Yanchun Liang and Yanheng Liu
Electronics 2026, 15(7), 1366; https://doi.org/10.3390/electronics15071366 - 25 Mar 2026
Abstract
Accurate remote sensing scene classification is essential for large-scale Earth observation but remains challenging due to significant inter-class similarity and complex spatial layouts in medium- and low-resolution imagery. Conventional convolutional neural networks (CNNs) effectively capture local structural patterns but struggle to model long-range semantic dependencies, whereas Vision Transformers excel at global context modeling yet often show reduced sensitivity to fine-grained spatial structures. To address these limitations, we propose CrtNet, a structure-aware Cross-Model Residual Transformer Network that establishes a dual-stream collaborative architecture integrating convolutional structural representations with Transformer-based semantic modeling through gated residual cross-model interactions. In this framework, a convolutional branch first extracts stable local structural features with strong spatial inductive biases. These features are continuously injected into the Transformer encoding process via residual cross-model connections, enabling persistent structural guidance during global attention modeling. In addition, a sample-adaptive dynamic gating mechanism is introduced to flexibly balance structural and semantic features during prediction. Extensive experiments conducted on two public remote sensing benchmarks, EuroSAT and UCM, demonstrate that CrtNet consistently outperforms representative CNN-based, Transformer-based, and hybrid state-of-the-art models, particularly in visually ambiguous scene categories. Full article
(This article belongs to the Special Issue Computer Vision and Machine Learning: Real-World Applications)

26 pages, 16104 KB  
Article
Multi-Slot Attention with State Guidance for Egocentric Robotic Manipulation
by Sofanit Wubeshet Beyene and Ji-Hyeong Han
Electronics 2026, 15(7), 1365; https://doi.org/10.3390/electronics15071365 - 25 Mar 2026
Abstract
Visual perception is fundamental to robotic manipulation for recognizing objects, goals, and contextual details. Third-person cameras provide global views but can miss contact-rich interactions and require calibration. Wrist-mounted egocentric cameras reduce these limitations but introduce occlusion, motion blur, and partial observability, which complicate visuomotor learning. Furthermore, existing perception modules that rely solely on pixels or fuse imagery with proprioception as flat vectors do not explicitly model structured scene representations in dynamic egocentric views. To address these challenges, a multi-slot attention fusion encoder for egocentric manipulation is introduced. Learnable slot queries extract localized visual features from image tokens, and Feature-wise Linear Modulation (FiLM) conditions each slot on the robot’s joint states, producing a structured slot-based latent representation that adapts to viewpoint and configuration changes without requiring object labels or external camera priors. The resulting structured slot-based latent representation is used as input to a Soft Actor–Critic (SAC) agent, which achieves a higher mean cumulative return than pixel-only CNN/DrQ and state-only baselines on a ManiSkill3 egocentric manipulation task. Probing experiments and real-camera evaluation further show that the learned representation remains stable under egocentric viewpoint shifts and partial occlusions, indicating robustness in practical manipulation settings. Full article
(This article belongs to the Section Artificial Intelligence)
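FiLM conditioning, used above to inject joint states into slot features, is a simple learned affine modulation (Perez et al., 2018): the conditioning vector predicts a per-feature scale and shift. A minimal NumPy sketch, with random weights standing in for a trained linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)

class FiLM:
    """Feature-wise Linear Modulation: a conditioning vector (here, a
    hypothetical joint-state vector) is linearly mapped to a per-feature
    scale (gamma) and shift (beta) applied to the features."""
    def __init__(self, cond_dim: int, feat_dim: int):
        self.W = rng.standard_normal((cond_dim, 2 * feat_dim)) * 0.1
        self.b = np.zeros(2 * feat_dim)

    def __call__(self, features: np.ndarray, cond: np.ndarray) -> np.ndarray:
        gamma, beta = np.split(cond @ self.W + self.b, 2, axis=-1)
        return (1.0 + gamma) * features + beta      # identity when gamma=beta=0
```

The `1 + gamma` parameterization keeps the modulation near identity at initialization, which is a common stabilization choice rather than a requirement of the method.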

28 pages, 13123 KB  
Article
A Generative Augmentation and Physics-Informed Network for Interpretable Prediction of Mining-Induced Deformation from InSAR Data
by Yuchen Han, Jiajia Yuan, Mingzhi Sun and Lu Liu
Remote Sens. 2026, 18(7), 987; https://doi.org/10.3390/rs18070987 - 25 Mar 2026
Abstract
Accurate forecasting of mining-induced surface deformation is critical for coal-mine safety assessment and hazard mitigation. InSAR deformation time series are often short, temporally sparse, and strongly nonlinear. These characteristics can make purely data-driven predictors unreliable in small-sample settings. To address this issue, we propose a generation–prediction–interpretation framework that combines generative augmentation with physics-informed forecasting. We first develop a TCN-TimeGAN model to synthesize high-fidelity deformation sequences and expand the training set. Recurrent modules in the generator and discriminator are replaced with causal TCN residual blocks, and a temporal self-attention layer is further stacked on top of the TCN backbone to adaptively reweight informative time steps. We then construct a physics-informed Kolmogorov–Arnold Network, termed PI-KAN. Subsidence-consistency and smoothness priors are embedded in the learning objective to promote physically plausible predictions while retaining spline-based interpretability. Experiments on SBAS-InSAR deformation series from the Guqiao coal mine show that the framework achieves an RMSE of 0.825 mm and an R² of 0.968. It outperforms TGAN-KAN, CNN-BiGRU, and BiGRU under the same evaluation protocol. Visualizations of the learned spline-based edge functions further reveal stronger nonlinear responses for lagged inputs closer to the forecast horizon, providing interpretable evidence of short-term temporal sensitivity under sparse observations. Full article

28 pages, 7008 KB  
Article
Multimodal Deep Learning Framework for Profiling Socio-Economic Indicators and Public Health Determinants in Urban Environments
by Esaie Dufitimana, Jean Pierre Bizimana, Ernest Uwayezu, Paterne Gahungu and Emmy Mugisha
Urban Sci. 2026, 10(4), 177; https://doi.org/10.3390/urbansci10040177 (registering DOI) - 25 Mar 2026
Abstract
Urbanization significantly enhances socio-economic conditions, health, and well-being for many by improving access to services, education, and economic opportunities. However, socio-economic and public health disparities are also being exacerbated by urbanization. The reliable data required to monitor these conditions are often unavailable, outdated, or inconsistent. This study introduces a multimodal deep learning framework that integrates satellite imagery with street network datasets to predict urban socio-economic indicators and public health determinants at the sector level as a political administrative unit of public health planning in Rwanda. We extracted latent visual and topological embeddings of the urban built environment, using a Convolutional Neural Network (CNN) and Graph Neural Network (GNN). These embeddings were fused through an attentional mechanism to train a multi-task regression model that simultaneously predicts multiple socio-economic indicators and public health determinants. This framework was applied to the City of Kigali in Rwanda. Overall, the multimodal fusion model achieved the best average performance across targets, with an average correlation of 0.68 and MAE of 1.26 for socio-economic indicators, and 0.68 and 1.46 for public health determinants, demonstrating the benefit of integrating visual and topological information. The learned fused embedding space arranges socio-economic indicators and public health determinant deciles along a continuous morphological gradient from sparsely built rural settings to dense urban settings, demonstrating that the urban form encodes latent signals that capture socio-economic indicators and health determinants. Moreover, the study reveals a strong relationship between socio-economic indicators and the public health index, with education, cooking materials, and floor materials exhibiting a correlation above 0.96. 
This work demonstrates the utility of an integrated framework for socio-economic indicator profiling and public health planning in data-scarce urban contexts, offering a scalable approach for monitoring the indicators of Sustainable Development Goals in rapidly changing urban environments. Full article
(This article belongs to the Topic Geospatial AI: Systems, Model, Methods, and Applications)

25 pages, 3612 KB  
Article
Learning Modality Complementarity for RGB-D Salient Object Detection via Dynamic Neural Network
by Yuanhao Li, Jia Song, Chenglizhao Chen and Xinyu Liu
Electronics 2026, 15(7), 1361; https://doi.org/10.3390/electronics15071361 - 25 Mar 2026
Abstract
RGB-D salient object detection (RGB-D SOD) aims to accurately localize and segment visually salient objects by jointly leveraging RGB images and depth maps. Some existing methods rely on static fusion strategies with fixed paths and weights, which treat all regions equally and fail to capture the varying importance of different regions and modalities. Although some attention-based methods alleviate the limitations of static fusion by assigning adaptive weights to different regions and modalities, the quality of RGB and depth data may degrade in real-world scenarios due to sensor noise, illumination changes, or environmental interference. These attention-based methods often overlook inter-modality quality differences and complementarity, making them prone to over-relying on a certain modality, which can lead to noise introduction, feature conflicts, and performance degradation. To address these limitations, this paper proposes a novel dynamic feature routing and fusion framework for RGB-D SOD, which adaptively adjusts the fusion strategy according to the quality of input modalities. To enable modality quality awareness, the proposed method characterizes the modality complementarity between RGB and depth features in a task-driven manner inspired by information-theoretic principles. We introduce a task-relevance scoring function which is integrated with a mutual information estimator to quantify such complementarity, and emphasizes task-relevant features while suppressing redundancy. A dynamic routing module is then designed to perform feature selection guided by the captured complementarity. In addition, we propose a novel cross-modal fusion module to adaptively fuse the features selected by the dynamic routing module, which effectively enhances complementary representations while suppressing redundant features and noise interference. 
Extensive experiments conducted on seven public RGB-D SOD benchmark datasets demonstrate that the proposed method consistently achieves competitive performance, outperforming existing methods by an average of approximately 1% across multiple evaluation metrics. Notably, in challenging scenarios with severe modality quality degradation, the proposed method outperforms existing best-performing methods by up to 1.8%, demonstrating strong robustness against cluttered backgrounds, complex object structures, and diverse object scales. Overall, the proposed dynamic fusion framework provides a novel solution to modality quality imbalance in RGB-D salient object detection. Full article
(This article belongs to the Section Artificial Intelligence)

21 pages, 38078 KB  
Article
Development and Evaluation of a Deep Learning Model for Ovarian Cancer Histotype Classification Using Whole-Slide Imaging
by Dagoberto Pulido and Nathalia Arias-Mendoza
J. Imaging 2026, 12(4), 144; https://doi.org/10.3390/jimaging12040144 - 25 Mar 2026
Abstract
The histopathological classification of ovarian carcinoma is fundamental for patient management. While microscopic evaluation by pathologists is the current diagnostic standard, it is known to be subject to interobserver variability, which can affect consistency in treatment decisions. This study addresses this clinical need by developing and validating a deep learning-based diagnostic support tool designed to enhance the objectivity and reproducibility of this classification. In this work, we address a key challenge in computational pathology—the tendency of attention mechanisms to overfit by concentrating on limited features—by systematically evaluating a direct regularization method within multiple instance learning (MIL) models. The models were trained and validated using 10-fold cross-validation on a public training set of 538 whole-slide images and further tested on an independent public dataset for the more challenging task of molecular subtype classification. We utilized features from a foundational model pre-trained on histopathology data to represent tissue morphology. Our findings demonstrate that directly regularizing the attention mechanism with a stochastic approach provides a statistically significant improvement in accuracy and generalization, highlighting its power as a robust technique to mitigate overfitting for this clinical task. In direct contrast to the reported variability in manual assessment, our final model achieved high consistency and accuracy, with a balanced accuracy of 0.854 and a Cohen’s Kappa of 0.791. The model also demonstrated strong generalization on the molecular classification task. Its attention mechanism provides visual heatmaps for pathologist review, fostering interpretability and trust. We have developed a highly accurate and generalizable artificial intelligence tool that directly addresses the challenge of interobserver variability in ovarian cancer classification. 
Its performance highlights the potential for artificial intelligence to serve as a decision support system, standardizing histopathological assessment. Full article
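The reported balanced accuracy (0.854) and Cohen's Kappa (0.791) are standard multi-class agreement metrics. As a reference for how they are computed, a plain NumPy sketch intended to match the usual definitions (mean per-class recall, and chance-corrected agreement):

```python
import numpy as np

def balanced_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean of per-class recall: robust to class imbalance."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))

def cohens_kappa(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Agreement corrected for chance: (p_observed - p_expected) / (1 - p_expected)."""
    classes = np.unique(np.concatenate([y_true, y_pred]))
    po = np.mean(y_true == y_pred)
    pe = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in classes)
    return float((po - pe) / (1 - pe))
```

A Kappa near 0.79 indicates substantial agreement beyond chance, which is why it is often reported alongside accuracy for diagnostic classification tasks with imbalanced classes.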

19 pages, 2375 KB  
Article
Beyond the Black Box: An Interpretable Saliency Framework for Abstract Art via Theory-Driven Heuristics
by Evaldas Vaičekauskas and Vytautas Abromavičius
Appl. Sci. 2026, 16(7), 3145; https://doi.org/10.3390/app16073145 - 24 Mar 2026
Abstract
Visual saliency modeling has achieved high predictive performance in natural image domains, yet its generalization to abstract art remains limited by the lack of explicit semantic structure and the scarcity of eye-tracking data. In such semantically ambiguous contexts, understanding the underlying drivers of attention is as critical as predictive accuracy. This paper presents an interpretable, "white-box" saliency framework tailored to abstract art, which constructs predictions through a weighted combination of 35 modular heuristics grounded in perceptual psychology and art theory, including contrast, grouping, isolation, and symmetry. Heuristic weights are optimized via a genetic algorithm and refined by a context-aware modulation mechanism that adapts to image-level visual features. Evaluation against eye-tracking data from 40 abstract paintings demonstrates that the model with the expanded activation variant produces stable, meaningful predictions while achieving a competitive KL-divergence score (1.11 ± 0.55), comparable to the SalGAN baseline (1.11 ± 0.53). Analysis of the optimized weights reveals strong contributions from contrast, texture, and grouping mechanisms, while nearly half of the heuristics, including most horizontal symmetry heuristics, are systematically pruned by the model. Moreover, context-aware modulation reveals that these weights are not static but shift dynamically based on image-level features such as edge density and intensity variation. By prioritizing transparency over raw predictive performance, this study demonstrates that explainable saliency models can function as robust investigative tools for decoding the principles of human visual perception in data-scarce domains. Full article
(This article belongs to the Special Issue Explainable Machine Learning and Computer Vision)
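The KL-divergence score used above for saliency evaluation compares the predicted map and the ground-truth fixation density as probability distributions. A minimal sketch of the common formulation; epsilon handling varies between benchmarks, so the exact constant here is an assumption:

```python
import numpy as np

def kl_divergence(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """KL divergence between a predicted saliency map and a ground-truth
    fixation density map, each normalized to sum to 1. Lower is better."""
    p = pred / (pred.sum() + eps)
    q = gt / (gt.sum() + eps)
    # Sum over pixels of Q * log(Q / P); eps guards zeros in both map and log.
    return float((q * np.log(q / (p + eps) + eps)).sum())
```

Because the ground-truth distribution Q weights the log-ratio, this metric penalizes predictions that assign low probability to regions people actually fixated, which is why it is a common choice for saliency benchmarks.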

21 pages, 9626 KB  
Article
An Improved AlexNet-Based Image Recognition Method for Transmission Line Wildfires
by Zilin Zhao and Guoyong Duan
Algorithms 2026, 19(4), 245; https://doi.org/10.3390/a19040245 - 24 Mar 2026
Abstract
The wildfires in the vicinity of the power transmission corridors are famous for their sudden occurrence, rapid growth, and susceptibility to interference from fire-like interferences at night, which can easily lead to line discharge and trip accidents, thus affecting the safe operation of the power system. In order to address the issue of the high false alarm rate and poor generalization performance of wildfire image recognition in complex power transmission corridor environments, a wildfire image recognition method based on an improved AlexNet is proposed in this paper. The proposed method improves the description of flame and smoke properties at different scales by designing a reparameterized multi-scale feature extraction structure, and effectively alleviates the influence of strong light reflection and fire-like interference at night by using lightweight multi-scale attention and hybrid pooling attention mechanisms. A wildfire image dataset is constructed based on 1246 on-site images of the power transmission corridor captured by a visual monitoring device and 600 wildfire images downloaded from the internet, and tested in real-world imbalanced distribution scenarios. The experimental results show that the proposed method can recognize wildfire images with an accuracy of 96.9% and an F1 value of 94.9% on the test dataset, which is much higher than that of the original AlexNet, and has a strong ability to adapt to cross-dataset tests. The research work can provide technical support for online monitoring and operation and maintenance of wildfires in power transmission corridors. Full article
(This article belongs to the Special Issue AI-Based Techniques in Smart Grid Operations)
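As a quick reference for the metrics quoted above, accuracy and F1 for a binary wildfire/non-wildfire classifier can be computed from confusion-matrix counts (an illustrative sketch, not the paper's implementation):

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Accuracy and F1 score from binary confusion-matrix counts.

    F1 is the harmonic mean of precision and recall, which makes it a
    more informative summary than accuracy on imbalanced test sets like
    the one described above.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1
```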
25 pages, 13685 KB  
Article
Vision and Language Reference for a Segment Anything Model for Few-Shot Segmentation
by Kosuke Sakurai, Ryotaro Shimizu and Masayuki Goto
J. Imaging 2026, 12(4), 143; https://doi.org/10.3390/jimaging12040143 - 24 Mar 2026
Abstract
Segment Anything Model (SAM)-based few-shot segmentation models traditionally rely solely on annotated reference images as prompts, which inherently limits their accuracy due to an over-reliance on visual cues and a lack of semantic context. This reliance leads to incorrect segmentation, where visually similar objects from different categories are incorrectly identified as the target object. We propose Vision and Language Reference Prompt into SAM (VLP-SAM), a novel few-shot segmentation model that integrates both visual information of reference images and semantic information of text labels into SAM. VLP-SAM introduces a vision-language model (VLM) with pixel–text matching into the prompt encoder for SAM, effectively leveraging textual semantic consistency while preserving SAM’s extensive segmentation knowledge. By incorporating task-specific structures such as an attention mask, our model achieves superior few-shot segmentation performance with only 1.4 M learnable parameters. Evaluations on PASCAL-5i and COCO-20i datasets demonstrate that VLP-SAM significantly outperforms previous methods by 6.8% and 9.3% in mIoU, respectively. Furthermore, VLP-SAM exhibits strong generalization across unseen objects and cross-domain scenarios, highlighting the robustness provided by textual semantic guidance. This study offers an effective and scalable framework for few-shot segmentation with multimodal prompts. Full article
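The mIoU gains reported above rest on the standard intersection-over-union metric, which can be sketched as follows (illustrative only; binary masks assumed, with per-class averaging details omitted for brevity):

```python
import numpy as np

def mean_iou(pred_masks, gt_masks):
    """Mean intersection-over-union over pairs of binary masks.

    Each IoU is |pred AND gt| / |pred OR gt|; an empty union (both
    masks all-background) is scored as a perfect match.
    """
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        ious.append(inter / union if union else 1.0)
    return float(np.mean(ious))
```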
20 pages, 6234 KB  
Article
Wafer Defect Recognition for Industrial Inspection: FCS-VMamba Model and Experimental Validation
by Yijia Zhang, Ziyi Ma, Tongji Cui, Tiejun Zhao, Qi Wang and Jianhua Wang
J. Imaging 2026, 12(4), 142; https://doi.org/10.3390/jimaging12040142 - 24 Mar 2026
Abstract
In industrial imaging scenarios, semiconductor wafer defect classification is crucial for chip manufacturing yield and reliability. However, numerous challenges persist, including weak imaging responses and detail loss during downsampling, complex backgrounds that interfere with feature extraction, and the trade-off between performance and efficiency on edge devices. Traditional CNNs and ViTs exhibit limitations in modeling long-range dependencies and managing edge deployment costs. To address these issues, we leverage the VMamba architecture, a Visual State Space Model (SSM) that achieves global contextual modeling with linear computational complexity. Based on the VMamba architecture, we propose FCS-VMamba, a domain-adapted model that integrates three core modules, namely Frequency Attention (FA), Cross-Layer Cross-Attention (CLCA), and Saliency Feature Suppression (SFS). The experimental results show that FCS-VMamba achieved 86.06% macro-precision and 87.91% Top-1 accuracy with only 1.2 M parameters. These results demonstrate that FCS-VMamba provides a practical and parameter-efficient baseline for industrial wafer defect recognition. Full article
(This article belongs to the Section AI in Imaging)
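The macro-precision and Top-1 accuracy figures quoted above can be computed from raw predictions as follows (a minimal sketch, not the authors' evaluation code; classes that receive no predictions are skipped when averaging):

```python
import numpy as np

def top1_accuracy(logits, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    return float((logits.argmax(axis=1) == labels).mean())

def macro_precision(preds, labels, num_classes):
    """Unweighted mean of per-class precision over predicted classes."""
    per_class = []
    for c in range(num_classes):
        predicted_c = preds == c
        if predicted_c.any():
            per_class.append(np.mean(labels[predicted_c] == c))
    return float(np.mean(per_class))
```

Macro averaging weights every defect class equally, which is why it is reported alongside Top-1 accuracy on class-imbalanced wafer data.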