Search Results (3,479)

Search Parameters:
Keywords = vision transformer

20 pages, 1844 KB  
Article
AI-Enhanced Prognostic Model for Predicting Polyp Recurrence and Guiding Post-Polypectomy Surveillance Intervals Using the ERCPMP-V5 Dataset
by Sri Harsha Boppana, Sachin Sravan Kumar Komati, Ritwik Raj, Gautam Maddineni, Raja Chandra Chakinala, Pradeep Yarra, Venkata C. K. Sunkesula and Cyrus David Mintz
J. Clin. Med. 2026, 15(9), 3303; https://doi.org/10.3390/jcm15093303 (registering DOI) - 26 Apr 2026
Abstract
Introduction: Colorectal cancer remains a leading cause of cancer-related morbidity and mortality, with adenomatous polyps representing a common precursor. Post-polypectomy polyp recurrence represents a significant risk of colorectal cancer, driving periodic colonoscopy surveillance and polypectomy as needed. In this study, we explore a multimodal machine learning approach that integrates endoscopic imaging with clinical and pathology data to improve recurrence risk prediction and support individualized surveillance planning. Methods: We developed and evaluated a multimodal artificial intelligence (AI) model to predict post-polypectomy colorectal polyp recurrence using the ERCPMP-v5 dataset. The cohort included 217 patients with 796 high-resolution endoscopic RGB images and 21 endoscopic videos; video data were converted to still frames at 2 frames per second. Images and frames were resized to 224 × 224 pixels and normalized. Patient-level demographic, morphological (Paris, Kudo Pit, JNET), anatomical, and pathological variables were encoded using standard scaling for continuous features and one-hot encoding for categorical features. Visual representations were extracted using a pretrained Vision Transformer backbone (ViT-Base-Patch16-224) with frozen weights. Structured metadata (79 variables) was encoded using a multilayer perceptron. A late fusion framework used image and metadata representations to generate a recurrence probability via a sigmoid classifier; probabilities were thresholded at 0.5 for binary prediction. Model performance was evaluated on a held-out test set using accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC). We additionally compared fusion performance with image-only and metadata-only baselines. Predicted probabilities were translated to surveillance recommendations using risk tiers: low risk (0.00 ≤ p < 0.20), moderate risk (0.20 ≤ p < 0.50), and high risk (p ≥ 0.50). 
Results: On the test set, the multimodal fusion model achieved 90.4% accuracy, 86.7% precision, 83.1% recall, 84.9% F1-score, and an AUC of 0.920. The image-only model achieved 84.6% accuracy (AUC 0.880), and the metadata-only model achieved 81.9% accuracy (AUC 0.850), indicating improved performance with multimodal fusion. Risk stratification enabled surveillance recommendations of 1–3 years for low risk, 6–12 months for moderate risk, and 3–6 months for high risk. Conclusions: A late-fusion multimodal model integrating endoscopic imaging with structured clinical and pathology variables demonstrated excellent performance for predicting post-polypectomy recurrence and generated actionable risk-based surveillance intervals. This approach may support individualized follow-up planning and more efficient allocation of surveillance resources, while prioritizing timely evaluation for patients at higher predicted risk. Full article
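The abstract's risk stratification is a simple thresholding step on the predicted probability. A minimal sketch (the function name and return format are ours; the tier boundaries and surveillance intervals are taken from the abstract):

```python
def surveillance_tier(p: float) -> tuple[str, str]:
    """Map a predicted recurrence probability to a risk tier and the
    corresponding surveillance interval, per the abstract's cutoffs:
    low (0.00 <= p < 0.20), moderate (0.20 <= p < 0.50), high (p >= 0.50)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if p < 0.20:
        return "low", "1-3 years"
    if p < 0.50:
        return "moderate", "6-12 months"
    return "high", "3-6 months"
```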

24 pages, 1435 KB  
Article
Physically Guided Attention Mechanism for Underwater Motion Deblurring via Cepstrum-Based Blur Estimation
by Ning Hu, Shuai Li and Jindong Tan
J. Imaging 2026, 12(5), 186; https://doi.org/10.3390/jimaging12050186 (registering DOI) - 26 Apr 2026
Abstract
Underwater images often suffer from mixed degradations, including motion blur, which reduce structural clarity and adversely affect downstream vision tasks. To address this problem, we propose a physically guided Transformer framework for underwater motion deblurring. The proposed method combines two-stage cepstrum-based blur estimation with a point spread function (PSF)-guided self-attention mechanism. Specifically, blur parameters are first robustly estimated through cepstrum analysis, ellipse fitting, and negative-peak refinement, and the resulting PSF is then embedded into the Transformer attention module to guide feature aggregation. On the real underwater benchmark datasets UIEB Challenge-60 and EUVP330, the proposed method achieves UIQM/UCIQE scores of 4.09/0.56 and 3.40/0.58, respectively, significantly outperforming UFPNet and Phaseformer, thereby demonstrating superior perceptual restoration in terms of sharpness, contrast, and color consistency. On the synthetic test set, the proposed method attains 24.23 dB PSNR and 0.918 SSIM, outperforming both recent deep models and classical non-blind deconvolution methods, which confirms its strong restoration fidelity and structural consistency. In the controlled water-tank experiments, the proposed method consistently achieves the best performance under different camera motion speeds, demonstrating excellent robustness and practical applicability. Overall, the proposed framework provides an effective and physically interpretable solution for underwater motion deblurring. Full article
(This article belongs to the Section Image and Video Processing)
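The cepstrum step underlying the blur estimation is standard signal processing: the inverse FFT of the log magnitude spectrum, in which linear motion blur leaves periodic spectral zeros that appear as negative peaks. A minimal sketch of that first stage (the ellipse fitting and negative-peak refinement described in the abstract are not shown):

```python
import numpy as np

def image_cepstrum(img: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Standard 2-D cepstrum: inverse FFT of the log magnitude spectrum.
    Linear motion blur imprints periodic zeros on the spectrum, which show
    up as negative peaks in the cepstrum; their spacing and orientation
    give the blur length and angle."""
    spectrum = np.fft.fft2(img)
    log_mag = np.log(np.abs(spectrum) + eps)   # eps avoids log(0)
    cep = np.real(np.fft.ifft2(log_mag))
    return np.fft.fftshift(cep)                # center the zero-quefrency term
```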
50 pages, 17736 KB  
Article
Swin–YOLOv12: A Hybrid Transformer-Based Deep Learning Approach for Enhanced Real-Time Brain Tumor Detection in MRI Images
by Mubashar Tariq and Kiho Choi
Mathematics 2026, 14(9), 1447; https://doi.org/10.3390/math14091447 (registering DOI) - 25 Apr 2026
Abstract
Brain tumors (BTs) arise from the abnormal growth of cells within brain tissue and may spread rapidly, making them a major cause of mortality worldwide. Early detection of BTs remains highly challenging due to the brain’s complex structure and the heterogeneous nature of tumors. Magnetic Resonance Imaging (MRI) provides detailed information about tumor size, location, and shape, thereby supporting clinical decision-making for treatments such as chemotherapy, radiation therapy, and surgery. Traditional machine learning (ML) approaches mainly rely on manual feature extraction, whereas recent advances in Computer-Aided Diagnosis (CAD) and deep learning (DL) have enabled more accurate detection of small and complex tumor regions. To improve automated tumor detection, we propose a hybrid Swin–YOLO framework that combines the Swin Transformer (ST) with the latest CNN-based YOLOv12 model. In this framework, the Swin Transformer serves as the main backbone for feature extraction, while the Feature Pyramid Network (FPN) and Path Aggregation Network (PANet) are employed in the neck to better capture multi-scale features. For training, we used the publicly available Br35H dataset and applied data augmentation to enhance the model’s robustness and generalization capability. The experimental results show that the proposed framework achieved 99.7% accuracy, 99.4% mAP@50, and 87.2% mAP@50:95. Furthermore, we incorporated Explainable Artificial Intelligence (XAI) techniques, including Grad-CAM and SHAP, to improve the interpretability of the model by visually highlighting the tumor regions that contributed most to the prediction. In addition, we developed NeuroVision AI, a web-based application designed to support faster and more accurate clinical decision-making. Although the proposed model demonstrated strong performance on the dataset, these results should be interpreted within the context of the current experimental setting. Full article
22 pages, 3386 KB  
Article
UAV Visual Localization via Multimodal Fusion and Multi-Scale Attention Enhancement
by Yiheng Wang, Yushuai Zhang, Zhenyu Wang, Jianxin Guo, Feng Wang, Rui Zhu and Dejing Lin
Sustainability 2026, 18(9), 4277; https://doi.org/10.3390/su18094277 (registering DOI) - 25 Apr 2026
Abstract
For power-grid applications such as transmission corridor inspection, substation asset inspection, and post-disaster emergency repair, reliable UAV self-localization under GNSS-degraded or GNSS-denied conditions is critical to ensuring operational safety and accurate defect geotagging. Due to substantial discrepancies in viewpoint, scale, and geometric structure between oblique UAV images and nadir satellite images, conventional RGB-based cross-view retrieval methods often suffer from unstable alignment and insufficient geometric modeling, particularly in scenarios with repetitive textures and partial overlap. To address these challenges, we propose a cross-view visual geo-localization model that integrates RGBD multimodal inputs with multi-scale attention enhancement. Specifically, MiDaS is used to estimate relative depth from UAV imagery, which is concatenated with RGB to form a four-channel input, while satellite images are padded with an additional zero channel to maintain dimensional consistency. A shared-weight ViTAdapter is adopted to learn joint semantic–geometric representations, and a lightweight Efficient Multi-scale Attention (EMA) module is adopted on spatial feature maps to strengthen multi-scale spatial consistency. In addition, an IoU-weighted InfoNCE loss is employed to accommodate partial matching during training, thereby improving the robustness of feature alignment. Experiments on the GTA-UAV dataset under the cross-area protocol show stable performance across both retrieval and localization metrics. Specifically, Recall@1, Recall@5, and Recall@10 reach 18.12%, 38.83%, and 49.47%, respectively; AP is 28.01 and SDM@3 is 0.53; meanwhile, the top-1 geodesic distance error Dis@1 is 1052.73 m. These results indicate that explicit geometric priors combined with multi-scale spatial enhancement can effectively improve cross-view feature alignment, leading to enhanced robustness and accuracy for localization in challenging power inspection scenarios. 
Full article
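The IoU-weighted InfoNCE loss can be sketched as a standard InfoNCE objective whose per-pair terms are weighted by the ground-overlap IoU of each UAV–satellite pair. The exact weighting used in the paper is not given in the abstract, so the scheme below (normalized IoU weighting of the positive terms) is an assumption:

```python
import numpy as np

def iou_weighted_infonce(sim: np.ndarray, iou: np.ndarray, tau: float = 0.1) -> float:
    """Hypothetical sketch of an IoU-weighted InfoNCE loss.
    sim: (N, N) cosine similarities between UAV queries (rows) and satellite
         references (cols); the diagonal holds the paired views.
    iou: (N,) ground-overlap IoU of each pair, used to down-weight partial
         matches (the paper's exact weighting may differ)."""
    logits = sim / tau
    logits = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_pair = -np.diag(log_prob)                                # standard InfoNCE term
    return float((iou * per_pair).sum() / max(iou.sum(), 1e-8))  # IoU-normalized mean
```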

21 pages, 20196 KB  
Article
VMMedSAM-X: A State-Enhanced Dual-Branch Encoder for Efficient Promptable Medical Image Segmentation
by Hengwei Zhang, Wei Li and Yazhi Liu
Appl. Sci. 2026, 16(9), 4199; https://doi.org/10.3390/app16094199 (registering DOI) - 24 Apr 2026
Abstract
Medical image segmentation plays a crucial role in clinical diagnosis and treatment planning. However, existing segmentation frameworks frequently exhibit high computational complexity and often fail to retain fine-grained structural details—especially along intricate anatomical boundaries such as blood vessels and tumor margins. To overcome these limitations, we propose VMMedSAM-X, an efficient and computationally economical medical image segmentation framework that incorporates structured state space modeling into the Medical Segment Anything Model (MedSAM) architecture. The proposed method adopts a state-enhanced encoder that combines extended long short-term memory (xLSTM) with two-dimensional selective scanning (SS2D) and a dual-path cross-attention mechanism to enhance long-range dependency modeling while maintaining linear computational complexity. Experiments conducted on the 1024×1024 ACDC cardiac MRI dataset show that the proposed encoder reduces floating-point operations from 369.44 G to 17.36 G and achieves a 2.4× improvement in inference speed compared with the Vision Transformer (ViT)-based encoder. Additional evaluations on the SegTHOR and MSD-Lung datasets demonstrate consistent improvements in Dice Similarity Coefficient (DSC) and Intersection over Union (IoU) metrics over MedSAM and Vision Mamba U-Net (VM-UNet) baselines. These results indicate that the proposed framework provides an effective and computationally efficient solution for high-resolution medical image segmentation. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
17 pages, 1226 KB  
Article
Design and Laboratory Validation of a Low-Cost Vision-Based Strain Monitoring System Using ESP32-CAM with Centralized Processing
by Asare Kwaku Anim, Weijie Li, Xuefeng Zhao, Jun Ma, Ronghuan Liu and Dong Sun
Buildings 2026, 16(9), 1681; https://doi.org/10.3390/buildings16091681 - 24 Apr 2026
Abstract
Vision-based structural health monitoring offers a promising alternative to conventional wired sensing systems; however, its adoption is often limited by high hardware costs and computational constraints at sensing nodes. This study presents the design and laboratory validation of a low-cost vision-based system for displacement and strain monitoring using a centralized processing architecture. The proposed system separates image acquisition from computation: an ESP32-CAM module serves as a lightweight edge node for grayscale image capture and wireless transmission, while computational tasks including displacement tracking, subpixel localization, scale calibration, and strain estimation are performed on a centralized unit. This enables low-cost deployment at USD 60 per node with low power consumption of 1 W. System performance was evaluated through controlled experiments, including a 24 h zero-drift test and quasi-static displacement tests up to 15 μm. Validation against a Linear Variable Differential Transformer (LVDT) shows close agreement, with an absolute error of 2.63 µε and drift within ±2 μm. The system achieves an effective strain range of ±35,000 με. These results demonstrate that low-cost, centralized vision-based systems hold strong potential for practical deployment in structural health monitoring applications. Full article
(This article belongs to the Section Building Structures)
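The subpixel localization step mentioned above is commonly implemented with a three-point parabolic fit around a correlation peak; the paper's exact method is not stated in the abstract, so the sketch below is illustrative:

```python
import numpy as np

def subpixel_peak_1d(corr: np.ndarray) -> float:
    """Parabolic (three-point) sub-pixel refinement of a correlation peak,
    a common trick in vision-based displacement tracking: fit a parabola
    through the peak sample and its two neighbours and return the vertex."""
    i = int(np.argmax(corr))
    if i == 0 or i == len(corr) - 1:
        return float(i)                      # peak at the border: no refinement
    y0, y1, y2 = corr[i - 1], corr[i], corr[i + 1]
    denom = y0 - 2 * y1 + y2
    if denom == 0:
        return float(i)                      # flat top: keep integer location
    return i + 0.5 * (y0 - y2) / denom
```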
20 pages, 5677 KB  
Article
Robust Image Watermarking via Clustered Visual State-Space Modeling
by Bo Liu and Jianhua Ren
Appl. Sci. 2026, 16(9), 4166; https://doi.org/10.3390/app16094166 - 24 Apr 2026
Abstract
Most existing DNN-based image watermarking methods adopt an “encoder–noise–decoder” paradigm, where the watermark is typically replicated and expanded in a straightforward manner and then directly fused with image features, which limits robustness under complex distortions. Although Transformers improve fusion via attention mechanisms, their quadratic computational complexity makes high-resolution processing prohibitively expensive. To address these issues, we propose CCViM, a robust watermarking framework built on Vision Mamba, which leverages the linear-complexity property of state-space models (SSMs) to enable efficient global interactions. We design a Watermark Representation Learning Module (WRLM) that performs hierarchical feature extraction and structured expansion of the watermark through cascaded VSS blocks, yielding semantically rich and perturbation-resistant watermark representations. In addition, we introduce an Interwoven Fusion Enhancement Module (IFEM), which employs a CCS6 structure to treat the watermark as a dynamic guidance signal. By combining contextual clustering with the Mamba mechanism, IFEM deeply interweaves the watermark into host features at both local and global levels. Experiments on COCO, DIV2K, and ImageNet demonstrate that CCViM consistently improves imperceptibility, robustness, and efficiency to varying degrees, and remains stable and high quality under attacks such as JPEG compression, cropping, and Gaussian blur. Full article
(This article belongs to the Special Issue Advanced Pattern Recognition & Computer Vision, 2nd Edition)

30 pages, 1401 KB  
Article
Feasibility Analysis of Static-Image-Based Traffic Accident Detection Under Domain Shift for Edge-AI Surveillance Systems
by Chien-Chung Wu and Wei-Cheng Chen
Electronics 2026, 15(9), 1803; https://doi.org/10.3390/electronics15091803 - 23 Apr 2026
Abstract
Traffic accident detection is a critical component of intelligent transportation systems (ITS), enabling timely incident response and traffic management. While most existing approaches rely on temporal information from video sequences, such methods are not always applicable in resource-constrained surveillance environments. This study investigates the feasibility of detecting traffic accidents from single static images by formulating the task as a binary classification problem. Representative architectures, including Vision Transformer (ViT), Swin Transformer, and ResNet-50, are systematically evaluated on the Car Crash Dataset (CCD) under multiple training configurations. To assess generalization capability, cross-domain evaluation is conducted using an external crash video dataset (ECVD) constructed to approximate real-world deployment conditions. Experimental results show that all models achieve strong performance under in-domain evaluation. However, cross-domain testing reveals substantial performance degradation, particularly in recall, indicating limited generalization capability under domain shift. Qualitative analysis further shows that missed detections are associated with weak visual cues, occlusion, and complex traffic environments, while false positives are caused by visually ambiguous patterns resembling accident scenarios. Unlike prior studies that primarily report performance improvements, this work provides empirical evidence that model behavior in static-image-based accident detection is governed by dataset composition rather than architectural design. Therefore, static-image-based accident detection should be interpreted as a coarse-level screening tool rather than a fully reliable decision-making system. This study highlights the importance of data-centric design and cross-domain evaluation for improving real-world applicability. Full article
(This article belongs to the Section Computer Science & Engineering)
32 pages, 2211 KB  
Article
An Automated Vision-Based Inspection System for Metallic Lock Surface Defects Using a Transformer-Enhanced U-Net
by Hong-Dar Lin, Shun-Yan Li and Chou-Hsien Lin
Sensors 2026, 26(9), 2608; https://doi.org/10.3390/s26092608 - 23 Apr 2026
Abstract
Surface defect inspection of metallic lock components remains challenging due to strong specular reflections, low-contrast defect patterns, and geometric variability, which limit the consistency of manual inspection and conventional automated optical inspection (AOI) systems. This study presents an integrated visual inspection framework that combines controlled image acquisition with deep learning-based semantic segmentation to enable reliable and repeatable defect detection. A standardized rotational fixture with ring illumination was developed to stabilize imaging geometry, reduce reflection variability, and support consistent multi-view acquisition. A region-of-interest (ROI) masking strategy was further applied to suppress background interference and isolate the effective inspection region. At the algorithmic level, a Transformer-enhanced U-Net (TransU-Net) architecture was employed to jointly model local spatial features and global contextual dependencies, thereby improving boundary delineation and the detection of irregular surface anomalies. In addition, a boundary-aware weighted evaluation scheme was introduced to provide a more robust and application-relevant assessment by accounting for annotation uncertainty near defect edges. Experimental results demonstrate that the proposed method achieved an F1-score of 85.15%, with an average inference time of 0.3357 s per image for model prediction. Considering additional processes such as multi-view image acquisition, mechanical rotation, and preprocessing, the overall system-level inspection time is expected to be on the order of seconds per component in practical deployment. Full article
47 pages, 9682 KB  
Article
Unsupervised Hierarchical Visual Taxonomy of Marble Natural Stone Using Cluster-Aware Self-Supervised Vision Transformers
by Margarida Figueiredo, Carlos M. A. Diogo, Gustavo Paneiro, Pedro Amaral and António Alves de Campos
Appl. Sci. 2026, 16(9), 4137; https://doi.org/10.3390/app16094137 - 23 Apr 2026
Abstract
The marble industry relies on proprietary commercial names rather than objective visual categories, creating market inefficiencies for stakeholders who select stones based on appearance. Supervised classification perpetuates this problem by replicating inconsistent commercial labels instead of discovering intrinsic visual structure. We propose an unsupervised pipeline combining a two-stage training strategy: A pure self-supervised pretraining followed by cluster-aware fine-tuning of a DINO Vision Transformer, with empirically selected dimensionality reduction and agglomerative hierarchical clustering. Systematic ablation studies on 1480 marble images spanning 10 commercial varieties validate each design choice: cluster-aware training at k = 10 yields geometrically improved embeddings over the self-supervised baseline (mean Silhouette Score 0.693 ± 0.053 vs. 0.660 ± 0.030; mean Davies–Bouldin Index 0.386 ± 0.075 vs. 0.569 ± 0.012; N = 9 independent evaluations across 3 data partitions × 3 training initializations). The resulting taxonomy reveals three phenomena invisible to commercial classification: cross-category merging of visually indistinguishable stones carrying different market names, intra-category splitting of heterogeneous sub-populations within single varieties, and coherent grouping where commercial and visual boundaries coincide, with all three confirmed in every independent run. We further demonstrate that standard extrinsic metrics are misaligned with unsupervised taxonomy objectives when reference labels encode the inconsistencies the method aims to resolve. Validating this methodology across diverse stone types, larger datasets, and varied acquisition conditions represents a natural and necessary next step toward establishing its cross-domain generalizability. Full article
18 pages, 4961 KB  
Article
A Generalizable Low-Precision Softmax Approximation for Small-FPGA Deployment of Vision Transformers
by Samuel Aboagye, Lujun Zhai and Suxia Cui
Electronics 2026, 15(9), 1774; https://doi.org/10.3390/electronics15091774 - 22 Apr 2026
Abstract
Softmax is a fundamental yet computationally expensive operation in vision transformer attention, posing significant challenges for deployment on resource-constrained FPGAs (Field Programmable Gate Arrays): because of the exponential function, internal buffer sizes and precision demands grow rapidly at the softmax stage of the attention pipeline. This paper proposes a low-precision softmax approximation that combines a truncated Maclaurin-series exponential with input-range clamping to enable efficient hardware realization without sacrificing reconstruction quality. By bounding extreme attention scores that contribute negligibly to final outputs, the proposed method mitigates the instability of low-order polynomial approximations while preserving their hardware efficiency. The approach is first validated in software using SwinIR (image restoration using the Swin Transformer) super-resolution to ensure reconstruction fidelity and is then analyzed for FPGA deployment. SwinIR shares its multi-stage Transformer design with models such as DeiT and ViT, making it a suitable testbed for assessing the reconstruction fidelity of the modified softmax. Experimental results demonstrate that the proposed fourth-order clamped approximation achieves near-reference performance, incurring only 0.15 dB PSNR and 0.0059 SSIM degradation on SwinIR-M, while significantly reducing precision and memory requirements. For the large SwinIR model (SwinIR-L), a PSNR increase with a less than 0.01 SSIM loss is observed, further highlighting the insignificance of extreme values as model size increases. A Horner-form reformulation further improves hardware efficiency by limiting intermediate precision growth.
Overall, this work presents a reconstruction-aware and hardware-friendly softmax reformulation that enables practical deployment of vision transformers on small FPGA platforms. This contribution is also applied to improve the ViTA accelerator design, to which we add bias initialization and a runtime-variable PE loop bound. Full article
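The clamped truncated-series softmax can be sketched in a few lines. The polynomial is evaluated in Horner form, as the abstract suggests; the clamp bound below is illustrative (the paper's exact range is not given in the abstract):

```python
import numpy as np

def exp_maclaurin4(x: np.ndarray) -> np.ndarray:
    """Fourth-order Maclaurin series of exp, evaluated in Horner form to
    limit intermediate precision growth."""
    return 1 + x * (1 + x * (0.5 + x * (1 / 6 + x * (1 / 24))))

def clamped_softmax(scores: np.ndarray, lo: float = -2.0) -> np.ndarray:
    """Softmax with max-subtraction, input clamping, and a truncated-series
    exponential. After subtracting the row max all inputs are <= 0; scores
    below `lo` contribute negligibly to the output, so clamping them keeps
    the low-order polynomial inside its accurate range. The value of `lo`
    here is illustrative, not taken from the paper."""
    z = scores - scores.max(axis=-1, keepdims=True)
    z = np.clip(z, lo, 0.0)
    e = exp_maclaurin4(z)            # strictly positive for all real inputs
    return e / e.sum(axis=-1, keepdims=True)
```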

8 pages, 1161 KB  
Proceeding Paper
Human Event and Action Analysis Using Transformer-Based Multimodal AI
by Ralph Edcel R. Fabian, Peter Miles Anthony L. Laporre, Louis Raphael Q. Lagare, Paul Emmanuel G. Empas and John Paul T. Cruz
Eng. Proc. 2026, 134(1), 72; https://doi.org/10.3390/engproc2026134072 - 22 Apr 2026
Abstract
With the increasing demand for enhanced security and surveillance, the integration of multimodal AI has shown significant promise. We developed and fine-tuned a transformer-based model, the Large Language and Vision Assistant–OneVision, tailored for human event and action recognition. By utilizing a multimodal approach, we identified specific human actions, including eating, running, fighting, sitting, and sleeping, within diverse real-world settings. Through knowledge distillation and Low-Rank Adaptation, the model’s performance was optimized, demonstrating substantial improvements in context-aware recognition and response generation. Evaluation results showed a Recall-Oriented Understudy for Gisting Evaluation (ROUGE)-1 score of 0.6844, a ROUGE-2 score of 0.5751, a ROUGE-L score of 0.6520, and a bilingual evaluation understudy (BLEU) score of 68.20, demonstrating significant gains in accuracy and interpretability. The model’s success highlights its potential for real-time applications in surveillance, healthcare, and interactive AI systems, providing reliable, efficient, and context-sensitive human action detection. Full article

16 pages, 1285 KB  
Article
A SMOTE–ViT Framework for Advanced Soil Classification on a Self-Generated Geotechnical Image Database
by Atousa Zohouri Rad, Ahmet Topal, Burcu Tunga and Müge Balkaya
Appl. Sci. 2026, 16(9), 4063; https://doi.org/10.3390/app16094063 - 22 Apr 2026
Abstract
Accurate soil type classification is fundamental to geotechnical engineering, yet traditional laboratory methods are often time consuming and labor intensive. This study investigates the potential of a Transformer-based deep learning framework for the automated classification of complex soil compositions. An image database for geotechnical analysis is constructed using six distinct geotechnical samples comprising gravel, sand, silt, and clay systematically blended into 80 ternary mixtures. To address the inherent class imbalances in the multi-component dataset, the Synthetic Minority Oversampling Technique (SMOTE) is employed, ensuring robust representation across all categories. The proposed framework utilizes a Vision Transformer (ViT) architecture, leveraging its self-attention mechanism to capture both intricate textural patterns and long-range structural dependencies within the soil matrices. Experimental results demonstrate that the SMOTE–ViT pipeline achieved an overall accuracy of 95.83%, with high precision and recall across diverse ternary compositions. This interdisciplinary approach provides a scalable and high-precision alternative for soil characterization, offering significant potential for real-time decision-making in geotechnical investigation workflows. Full article
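SMOTE itself is straightforward to sketch: each synthetic minority sample interpolates between a real sample and one of its k nearest minority neighbours. A minimal numpy illustration (not the authors' implementation, which presumably uses a standard library such as imbalanced-learn):

```python
import numpy as np

def smote_like_samples(X: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Minimal SMOTE-style oversampling sketch for one minority class:
    each synthetic point is a random interpolation between a minority
    sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        d = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]          # skip the point itself
        j = rng.choice(nbrs)
        lam = rng.random()                      # interpolation factor in [0, 1)
        out.append(X[i] + lam * (X[j] - X[i]))
    return np.stack(out)
```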
3 pages, 133 KB  
Editorial
Artificial Intelligence for the Food Industry
by Malik A. Hussain and Azharul Karim
Foods 2026, 15(9), 1456; https://doi.org/10.3390/foods15091456 - 22 Apr 2026
Abstract
Artificial Intelligence (AI) is transforming the food industry by enhancing food safety (contamination detection, traceability), optimizing supply chains (demand forecasting, waste reduction, logistics), personalizing nutrition (customized recommendations), and driving product innovation (new flavor creation, formulation) through data analysis, machine vision, and predictive analytics, leading to greater efficiency, sustainability, and consumer satisfaction from farm to fork [...] Full article
(This article belongs to the Special Issue Artificial Intelligence for the Food Industry)
24 pages, 506 KB  
Review
Processing of Amplitude-Temporal Acoustic Parameters in the Auditory System During Signal Coding for Image Recognition: Analytical Review
by Sergey Lytaev
Appl. Sci. 2026, 16(8), 4047; https://doi.org/10.3390/app16084047 - 21 Apr 2026
Abstract
In the study of sensory processes, the visual system has received far more research attention than the other sensory systems. The primary difference between visual and auditory perception lies in the nature of the stimuli and the reception processes: vision perceives electromagnetic radiation, while audition perceives acoustic signals of mechanical origin. This review analyzes modern approaches to, and controversies surrounding, the mechanisms of auditory perception in psychophysics, psychophysiology, psychopathology, modern research on hearing in human–computer interaction (HCI) systems, and machine learning methods. Modern studies of acoustic patterns include a comprehensive assessment of the physical characteristics of perception, complex nonverbal auditory cues, verbalization, perception and memory, as well as individual differences in auditory perception. An analysis of the scientific literature leads to the conclusion that acoustic signals, transformed in the brain into auditory images, retain (encode) a number of amplitude-temporal parameters that facilitate auditory discrimination (filtering) but interfere with auditory detection (recognition). This signal processing often, though not necessarily, involves brain regions engaged in other forms of perception. It depends on subvocalization; incorporates semantically interpreted information and expectations, as well as pictorial (visual) and descriptive components; functions as a mnemonic; and is linked to individual musical ability and experience (although the mechanisms of this connection remain unclear). Full article
(This article belongs to the Special Issue Cognitive, Affective and Behavior Neuroscience)