Search Results (394)

Search Parameters:
Keywords = ViT-CNN

25 pages, 12002 KB  
Article
Evaluating Convolutional and Transformer Architectures for Photovoltaic Defect Classification via Electroluminescence Imagery
by Seda Bayat Toksöz, Gültekin Işık, Gökhan Şahin and Erdal Akin
Sensors 2026, 26(12), 3775; https://doi.org/10.3390/s26123775 (registering DOI) - 13 Jun 2026
Abstract
Electroluminescence (EL) imaging is widely used for photovoltaic (PV) defect inspection, yet fair comparison of deep learning backbones remains difficult because datasets, labels, and protocols vary across studies. This work presents a controlled image-level benchmark of six architectures (ConvNeXt-T, ViT-B/16, DeiT-B/16, Swin-T, DenseNet121, and MobileNetV3-Large) across five hierarchical tasks for monocrystalline and polycrystalline cells with binary and multi-class labels. A balanced proprietary dataset of 20,000 single-cell EL images was evaluated with identical preprocessing, augmentation, training, and stratified five-fold cross-validation, yielding 150 runs. ConvNeXt-T achieved the highest mean macro-F1 (93.12%) while using about one-third of the parameters of base ViT/DeiT models. On the four-class polycrystalline task, it reached 84.94 ± 0.45% macro-F1, compared with 70.08 ± 1.19% for DenseNet121 and 59.43 ± 1.71% for MobileNetV3-Large. Error analysis revealed conservative missed-defect behavior in lightweight CNNs, especially for surface-level degradation and crack categories. The results provide image-level cross-validation evidence for controlled benchmarking and motivate future module-level grouped validation. Full article
(This article belongs to the Special Issue Sensing and Imaging for Defect Detection: 2nd Edition)
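
The run count and the headline metric in this abstract can be checked with a short sketch. The Python below is illustrative only: `macro_f1` is a from-scratch helper written for this note rather than the authors' code, and the toy labels are invented to show why macro-F1 penalizes a missed minority class more than accuracy does.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (each class counts equally)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Run count stated in the abstract: architectures x tasks x folds
n_architectures, n_tasks, n_folds = 6, 5, 5
total_runs = n_architectures * n_tasks * n_folds

# Toy example (made-up labels): accuracy is 0.75, but the missed class
# drags macro-F1 down to 3/7.
toy_true = [0, 0, 0, 1]
toy_pred = [0, 0, 0, 0]
toy_score = macro_f1(toy_true, toy_pred)
```

In practice a library routine such as scikit-learn's `f1_score` with `average="macro"` would normally be used instead of a hand-written helper.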

44 pages, 3129 KB  
Article
Early Sepsis Detection Using Heterogeneous Structured ICU Data with Explainable Deep Learning
by Attaphongse Taparugssanagorn, Mariella Särestöniemi, Matti Hämäläinen and Jari Iinatti
Sensors 2026, 26(12), 3648; https://doi.org/10.3390/s26123648 - 8 Jun 2026
Viewed by 224
Abstract
Sepsis is life-threatening organ dysfunction caused by a dysregulated host response to infection, making early detection critical for improving outcomes in intensive care units (ICUs). This study presents a retrospective comparative evaluation of deep learning architectures for predicting sepsis up to 6 h before the PhysioNet/Computing in Cardiology 2019 Challenge onset label using hourly structured electronic health record (EHR) variables, including vital signs, laboratory measurements, and demographics. Evaluated architectures include Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Bidirectional Long Short-Term Memory (Bi-LSTM), Temporal Convolutional Network (TCN), Transformer, and hybrid Convolutional Neural Network–Vision Transformer (CNN-ViT) models. Median imputation and class-weighted loss were applied to address missing values and severe class imbalance, while Shapley Additive Explanations (SHAP) and attention analyses were used as complementary interpretability approaches. Among the evaluated models, CNN-ViT achieved the strongest overall minority-class performance, with 88.25% accuracy, 0.7480 recall, a 0.454 F1-score, and a 0.48 area under the precision–recall curve (AUPRC), although the numerical gains over other advanced temporal and hybrid architectures were modest. Leave-one-unit-out evaluation further demonstrated relatively stable performance under internal distribution shifts. The results suggest that combining local feature extraction with temporal and attention-based modeling can improve early sepsis prediction from structured ICU data. However, the study represents a retrospective computational benchmark using a public dataset and does not constitute prospective clinical validation or real-world deployment assessment. Full article
(This article belongs to the Section Communications)
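
The gap between 88.25% accuracy and a 0.454 F1-score is typical of severe class imbalance. As a rough check, the precision implied by the reported recall and F1 can be back-calculated from the F1 definition. This is a sketch only: precision is not reported in the abstract, the inputs are rounded, and `implied_precision` is a hypothetical helper name.

```python
def implied_precision(f1, recall):
    """Solve F1 = 2PR / (P + R) for P, given F1 and R."""
    return f1 * recall / (2 * recall - f1)


# Values quoted in the abstract for the CNN-ViT model
reported_recall = 0.7480
reported_f1 = 0.454

# Roughly 0.33: about one in three positive alerts would be a true
# sepsis case, under the assumption that the rounded figures are exact.
precision_estimate = implied_precision(reported_f1, reported_recall)
```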

35 pages, 1263 KB  
Systematic Review
Advances in Artificial Intelligence-Enabled Crop Pest and Disease Detection: A Systematic Review
by Zhen Ma, Cundeng Wang, Xinzhong Wang and Xuegeng Chen
Agriculture 2026, 16(12), 1262; https://doi.org/10.3390/agriculture16121262 - 7 Jun 2026
Viewed by 420
Abstract
The detection technology of crop diseases and pests is transitioning from single-sensor monitoring to intelligent perception and multimodal fusion. Following the PRISMA 2020 standard, this paper systematically reviews the relevant core literature. It summarizes the development history of spectral sensing technology and analyzes the physical mechanisms of hyperspectral and multispectral imaging in the early identification of crop diseases. The focus is on the architectural evolution of deep learning models, including lightweight convolutional neural networks (CNNs), vision transformers (ViTs) with long-range dependency modeling capabilities, and the computationally efficient state space model Mamba. In addition, research progress in spatial-spectral joint learning, heterogeneous data fusion, and vision-language models (VLMs) for improving system robustness and interpretability is introduced. By synthesizing the integrated applications of UAV remote sensing, Internet of Things (IoT) edge computing, and intelligent robots in staple and cash crops, this paper summarizes the implementation of integrated perception, decision-making, and execution systems. To address the insufficient cross-domain generalization and uneven allocation of computing resources in existing models, this paper offers perspectives on the future development of agricultural artificial intelligence (AI) toward foundation-model-driven approaches, edge-intelligent collaboration, and green, sustainable development, providing a theoretical reference for engineering applications in intelligent plant protection. Full article
(This article belongs to the Section Crop Protection, Diseases, Pests and Weeds)

25 pages, 2614 KB  
Article
Ensemble Artificial Intelligence for Dermoscopic Decision Support: Development and External Validation of the HAM20000 Dataset
by Ming-Hseng Tseng, Wen-Da Liu and Yu-Hsien Chen
Electronics 2026, 15(11), 2367; https://doi.org/10.3390/electronics15112367 - 31 May 2026
Viewed by 397
Abstract
Artificial intelligence (AI) has shown strong potential in dermoscopic image classification; however, reliable clinical deployment remains challenging due to limited generalizability across populations and institutions. This challenge is further compounded by the performance saturation of single models: continued architectural scaling often results in diminishing returns within the training domain and fails to yield corresponding gains in cross-domain robustness. Previous work often relies on a single benchmark dataset and evaluates models under matched data distributions, which can overestimate real-world performance. In this study, we examine the robustness and generalizability of dermoscopic AI systems under realistic deployment conditions. We curate HAM20000, a controlled expansion of the HAM10000 dataset integrating multiple public ISIC releases from 2017 to 2024, designed as a stress-test dataset to analyze performance saturation and architectural sensitivity. A total of 27 pretrained deep learning models, including convolutional neural networks, Vision Transformers, and hybrid CNN–Transformer architectures, were systematically evaluated. To mitigate the limitations of single-model optimization, we applied greedy ensemble selection (GES) to construct a compact heterogeneous ensemble using soft voting. All models were trained and selected exclusively on HAM20000 and evaluated under strict zero-shot external validation on two independent datasets: CSMUH (East Asian population) and BCN20000 (European cohort). Although the proposed ensemble achieved over 93% accuracy on the HAM20000 dataset, its performance declined substantially under external validation, highlighting a pronounced cross-population domain gap. On the BCN20000 external validation set, the ensemble reached 62% accuracy, outperforming the strongest individual model (59%) by 3 percentage points.
On the CSMUH external validation set, the proposed GES achieved an accuracy of approximately 57%, demonstrating improved performance over representative single-model baselines, although it did not exceed the best-performing individual architecture in this dataset. These results indicate that, despite a marked absolute performance drop under distribution shift, the ensemble provides a modest yet consistent improvement over single-model baselines rather than a substantial performance leap. This finding underscores the importance of rigorous external validation and suggests that heterogeneous ensemble learning represents a practical and promising strategy for achieving incremental robustness gains in clinically applicable dermoscopic decision support systems. Full article
(This article belongs to the Special Issue Feature Papers in Bioelectronics: 2025–2026 Edition)
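
Greedy ensemble selection with soft voting, as named in this abstract, can be sketched in a few lines. The version below is a minimal illustration under stated assumptions: forward selection without replacement, validation accuracy as the selection metric, and a stop as soon as no candidate improves the score. The paper's actual procedure, metric, and stopping rule are not given in the abstract, and the model names and probabilities here are toy values.

```python
def soft_vote(prob_sets):
    """Average class probabilities over models; prob_sets[m][i][c]."""
    n_models = len(prob_sets)
    n_samples = len(prob_sets[0])
    n_classes = len(prob_sets[0][0])
    return [
        [sum(p[i][c] for p in prob_sets) / n_models for c in range(n_classes)]
        for i in range(n_samples)
    ]


def accuracy(probs, labels):
    """Fraction of samples whose arg-max class matches the label."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in probs]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def greedy_ensemble(model_probs, labels, max_size=5):
    """Add, one at a time, the model that most improves soft-vote accuracy."""
    selected, best = [], -1.0
    remaining = set(model_probs)
    while remaining and len(selected) < max_size:
        scored = [
            (accuracy(soft_vote([model_probs[n] for n in selected + [name]]), labels), name)
            for name in sorted(remaining)
        ]
        score, name = max(scored)
        if score <= best:  # no candidate improves the ensemble: stop
            break
        selected.append(name)
        remaining.remove(name)
        best = score
    return selected, best


# Toy validation set (made-up probabilities for three hypothetical models)
labels = [0, 1, 1, 0]
model_probs = {
    "a": [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]],
    "b": [[0.6, 0.4], [0.6, 0.4], [0.1, 0.9], [0.4, 0.6]],
    "c": [[0.4, 0.6], [0.7, 0.3], [0.8, 0.2], [0.3, 0.7]],
}
selected, best = greedy_ensemble(model_probs, labels)
```

In this toy case models `a` and `b` make complementary errors, so their averaged probabilities classify every sample correctly even though neither does alone.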

24 pages, 44455 KB  
Article
VISR-CNN: A Dual-Stream Framework for Meteorological Visibility Estimation via Multi-Scale Transmission Attention and Spectral Gating
by Wai Lun Lo, Kwok Wai Wong, Richard Tai Chiu Hsung, Henry Shu Hung Chung, Hong Fu, Harris Sik Ho Tsang and Tony Yulin Zhu
Algorithms 2026, 19(6), 434; https://doi.org/10.3390/a19060434 - 28 May 2026
Viewed by 452
Abstract
Accurate meteorological visibility estimation is vital for transportation safety and environmental monitoring. However, modeling the inherent nonlinear spatial and spectral degradations in hazy environments remains challenging. While recent Large Vision-Language Models (LVLMs) offer strong scene understanding, they lack the regression precision required for visibility estimation. In this paper, we propose the Visibility-Aware Refined CNN (VISR-CNN), a dual-stream architecture that synthesizes local spatial cues with global frequency-domain signatures. The model integrates a Multi-Scale Transmission Attention (MSTA) module, which uses parallel dilated convolutions to estimate atmospheric transmission, and a Global Frequency Branch that utilizes 2D Real Fast Fourier Transforms (RFFT) with Spectral Gating to quantify visibility-dependent blurring. A progressive training strategy is introduced to decouple spectral and spatial optimization, and a physics-informed loss function is designed to supervise numerical regression while enforcing a monotonic ranking constraint consistent with physical light-attenuation laws. Results on the HKCHC-VD dataset show that VISR-CNN achieves state-of-the-art performance (MAE: 1.54 km; RMSE: 2.31 km), representing a 13.0% improvement over VisNet. Further evaluations on the CP1 and SWH datasets confirm robust generalization, reducing overall MAE by 21% and 20%, respectively, compared with the hybrid ResNeXt-50 + ViT model. Notably, in safety-critical range (0–10 km), VISR-CNN reduces RMSE for the HKCHC-VD, CP1, and SWH datasets by approximately 55%, 64%, and 71%, respectively, when compared with VisNet. These findings demonstrate the superiority of specialized, physics-grounded architectures over general-purpose LVLMs for high-precision meteorological regression. Full article
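
The monotonic ranking constraint mentioned in this abstract can be illustrated with a pairwise hinge penalty. This is only one plausible form: the abstract does not state the loss, so the hinge formulation, the `margin` parameter, and the function name are assumptions made for this sketch.

```python
def pairwise_ranking_penalty(pred_km, true_km, margin=0.0):
    """Mean hinge penalty over pairs whose predicted order contradicts
    the true visibility order (true_i > true_j should imply pred_i > pred_j)."""
    n = len(true_km)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            if true_km[i] > true_km[j]:
                # Penalize when pred_i fails to exceed pred_j by the margin
                total += max(0.0, margin - (pred_km[i] - pred_km[j]))
                count += 1
    return total / count if count else 0.0


true_visibility = [5.0, 10.0, 20.0]  # made-up ground truth in km
ordered = pairwise_ranking_penalty([1.0, 2.0, 3.0], true_visibility)
reversed_order = pairwise_ranking_penalty([3.0, 2.0, 1.0], true_visibility)
```

A term like this would typically be added to a regression loss such as MAE, so that predictions are pushed toward both the correct values and the correct ordering.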

27 pages, 4390 KB  
Article
Underwater Image Feature Extraction and Classification Using a Multi-Scale Vision Transformer with Cross-Scale Biased Attention Fusion
by Abdullah Faiz, Kun Li and Ping Chen
Appl. Sci. 2026, 16(11), 5358; https://doi.org/10.3390/app16115358 - 27 May 2026
Viewed by 137
Abstract
Underwater image analysis is affected by light scattering, wavelength-dependent attenuation, low contrast, and suspended particles, which reduce the discriminative visual features. Current multi-scale Vision Transformers are not well-suited to these degradations because they cannot effectively fuse features across scales to achieve accurate classification. Although Vision Transformers (ViTs) can model long-range interactions, single-scale patch tokenization remains suboptimal for underwater images, where both fine-grained textures and global structures are important. This study proposes a Multi-Scale Vision Transformer (MS-ViT) with Cross-Scale Biased Attention Fusion (CSBAF) for underwater image classification. Before transformer encoding, the CSBAF introduces a learnable source–target scale-pair bias and an input-dependent scale-reliability gate. This differs from standard multi-scale fusion and cross-attention methods, which mainly concatenate features or exchange information between scale branches. The proposed design enables the model to emphasize reliable scales while suppressing degraded-scale responses. A hybrid dataset containing 14,000 images from the Roboflow Aquarium and RUIE datasets across five classes was used for evaluation. MS-ViT with CSBAF achieved 88.9% accuracy and an 88.8% F1-score, outperforming the CNN baseline by 7.6% and state-of-the-art transformer models, including UWFormer, DP-ViT, and CvT, by 2.3–4.2%. Ablation studies showed a 1.7% accuracy improvement over simple multi-scale concatenation, whereas cross-dataset testing achieved 84.4% accuracy, indicating reasonable cross-dataset robustness. These results demonstrate that explicit scale–aware fusion can improve transformer-based underwater visual understanding. Full article

31 pages, 17767 KB  
Article
Integration of Superpixel Segmentation, Convolutional Neural Networks and Vision Transformers for Automatic Benthic Habitats Classification
by Hassan Mohamed and Kazuo Nadaoka
Remote Sens. 2026, 18(11), 1711; https://doi.org/10.3390/rs18111711 - 26 May 2026
Viewed by 177
Abstract
Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have achieved significant success in various computer vision applications, including the classification of high-resolution imagery. However, a notable limitation of these deep learning approaches is their tendency to inadequately preserve the precise edges and shapes of target objects. In contrast, Object-Based Image Analysis (OBIA) emphasizes the preservation of object boundaries by segmenting images into meaningful objects. Combining CNNs and ViTs with OBIA leverages the feature extraction capabilities of these deep learning algorithms and the boundary-preserving advantages of OBIA, leading to enhanced classification accuracy and improved delineation of object boundaries in high-resolution images. Still, the main challenge in combining these methods lies in effectively aligning the irregularly shaped image objects produced by OBIA with the regular image patches required by CNN and ViT architectures. In this study, we propose a novel approach that integrates superpixel segmentation with CNNs and ViTs for the automatic classification of benthic habitats using high-resolution orthomosaic images. Initially, the Simple Linear Iterative Clustering (SLIC) algorithm was applied to segment the high-resolution orthomosaic images into superpixels. Subsequently, the central points of the resulting superpixels were used to generate square image patches. These patches served as inputs to pre-trained ConvNeXt-Base and EfficientNet-B0 CNNs, which extracted fine-grained features, and to DINOv2 ViTs, which extracted high-level features. Then, a Support Vector Machine (SVM) classifier was trained on these features to classify benthic habitats. Finally, the class label predicted by the SVM was assigned to each superpixel segment. This method achieved an average overall accuracy of 0.96 in classifying benthic habitats.
Overall, we demonstrate that combining CNNs, ViTs, and superpixel segmentation is an effective approach to benthic habitats classification, providing accurate high-resolution maps of heterogeneous reef environments. Full article
(This article belongs to the Section Ocean Remote Sensing)
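
The step that aligns irregular superpixels with the square patches a CNN or ViT expects can be sketched as follows. The sketch assumes a label map is already available (for example from a SLIC implementation, which is not shown), uses integer-floored centroids, and shifts patches that would cross the image border back inside. The patch size and the border handling are assumptions, not details taken from the paper.

```python
def superpixel_centroids(labels):
    """Map each superpixel label to its (row, col) centroid, floored to ints."""
    sums = {}
    for r, row in enumerate(labels):
        for c, lab in enumerate(row):
            s = sums.setdefault(lab, [0, 0, 0])
            s[0] += r
            s[1] += c
            s[2] += 1
    return {lab: (s[0] // s[2], s[1] // s[2]) for lab, s in sums.items()}


def patch_bounds(cy, cx, size, height, width):
    """Return (top, left, bottom, right) of a size x size window centred on
    (cy, cx), shifted so it stays inside the image. Requires size <= height
    and size <= width."""
    half = size // 2
    top = min(max(cy - half, 0), height - size)
    left = min(max(cx - half, 0), width - size)
    return top, left, top + size, left + size


# Toy label map with two superpixels (made-up data)
toy_labels = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
centroids = superpixel_centroids(toy_labels)

# Patch windows for a hypothetical 1000 x 1000 orthomosaic tile
interior = patch_bounds(500, 500, 64, 1000, 1000)
corner = patch_bounds(990, 10, 64, 1000, 1000)
```

Each window would then be cropped from the orthomosaic and passed to the feature extractors, and the predicted class written back to every pixel carrying that superpixel's label.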

15 pages, 2296 KB  
Article
Implementation of Vision Transformer Model for Robust Tool Wear Monitoring in Milling of Inconel 718
by Garvit Singh, Ankit Agarwal, Kaushal A. Desai and Laine Mears
Machines 2026, 14(6), 589; https://doi.org/10.3390/machines14060589 - 25 May 2026
Viewed by 312
Abstract
Tool wear monitoring is essential for ensuring machining efficiency and product quality, particularly for difficult-to-machine materials such as Inconel 718 (IN718). Traditional deep learning models, such as Conventional Convolutional Neural Networks (CNNs), often struggle to capture complex wear patterns and lack accuracy across varying machining conditions while developing image-based tool wear identification systems. To address these limitations, this paper presents a Vision Transformer (ViT) model for identifying tool-wear categories during end-milling of IN718. The performance of the ViT-based model is systematically compared with a CNN-based EfficientNet-b0 model. The robustness and generalization of the ViT-based model are validated on two previously unseen image datasets: one with conditions similar to those of the training data and another acquired under varying lighting conditions. The results indicate that the ViT model outperforms the EfficientNet-b0 model in terms of classification accuracy and computational efficiency. The ViT model achieves higher accuracy with fewer training epochs and faster convergence. Furthermore, it exhibits strong generalization across different lighting conditions, demonstrating robustness to variations in the machining environment. The findings presented in this work clearly demonstrate ViT’s effectiveness in tool wear classification and its potential as a reliable, efficient algorithm for developing tool wear monitoring systems for practical machining applications. Full article
(This article belongs to the Special Issue Intelligent Tool Wear Monitoring)

16 pages, 2294 KB  
Article
A Quantitative Evaluation of Gradient-Based Visual Explainability Methods Across Convolutional and Transformer-Based Vision Models
by Angelos Tzirtis, Christos Troussas, Akrivi Krouska, Phivos Mylonas and Cleo Sgouropoulou
Electronics 2026, 15(11), 2241; https://doi.org/10.3390/electronics15112241 - 22 May 2026
Viewed by 226
Abstract
Explainable Artificial Intelligence (XAI) has become a critical requirement for the responsible deployment of deep learning systems in safety-critical and regulated domains, particularly in medical imaging. In computer vision, gradient-based explanation methods such as Saliency Maps and Gradient-weighted Class Activation Mapping (Grad-CAM) are widely used for interpreting convolutional neural networks (CNNs). However, the increasing adoption of Vision Transformers (ViTs) introduces structural differences in internal representations that challenge the direct transfer of convolutional explainability mechanisms. This study presents a systematic, quantitative, and statistically validated evaluation of gradient-based visual explainability across CNN architectures (VGG16 and ResNet50) and a Vision Transformer (ViT-B/16), using both a domain-specific medical imaging dataset (brain MRI, tumor vs. non-tumor classification). Beyond qualitative heatmap inspection, we conduct deletion-based faithfulness analysis, sensitivity-to-noise evaluation, feature masking validation, and statistical hypothesis testing over 30 independent runs. All models achieve strong predictive performance on the domain dataset (mean accuracy ≈ 0.99), enabling a fair and meaningful comparison of explanation methods across architectures. Results demonstrate that explanation reliability is highly method- and architecture-dependent. Sensitivity differences are consistently statistically significant, whereas deletion-based faithfulness does not always yield equally strong separation under the adopted masking protocol. Masking-based analysis reveals substantial false-positive rates in certain configurations, indicating that visually plausible heatmaps do not necessarily isolate decision-necessary evidence. 
These findings underscore the importance of coupling visual explanations with behavioral validation metrics, particularly in high-risk domains governed by emerging regulatory frameworks such as the EU AI Act. Overall, the study advocates for empirically validated, architecture-aware, and statistically grounded approaches to medical XAI. Full article
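
Deletion-based faithfulness, one of the metrics named in this abstract, can be illustrated on a toy model. The sketch below removes features in order of decreasing attribution and records the model score after each removal; a faithful attribution makes the score fall quickly, giving a smaller area under the curve. The linear `toy_model`, the zero baseline, and the per-feature granularity are assumptions for illustration and do not reproduce the paper's masking protocol.

```python
def deletion_curve(model, x, attribution, baseline=0.0):
    """Model scores as features are masked from most to least attributed."""
    order = sorted(range(len(x)), key=lambda i: attribution[i], reverse=True)
    masked = list(x)
    scores = [model(masked)]
    for i in order:
        masked[i] = baseline
        scores.append(model(masked))
    return scores


def curve_auc(scores):
    """Trapezoidal area under the curve with the x-axis scaled to [0, 1]."""
    n = len(scores) - 1
    return sum((scores[k] + scores[k + 1]) / 2 for k in range(n)) / n


# Toy linear "model": the weights are the true feature importances
weights = [0.5, 0.3, 0.2, 0.0]


def toy_model(x):
    return sum(w * v for w, v in zip(weights, x))


x = [1.0, 1.0, 1.0, 1.0]
faithful_auc = curve_auc(deletion_curve(toy_model, x, weights))
unfaithful_auc = curve_auc(deletion_curve(toy_model, x, weights[::-1]))
```

Here the attribution equal to the true weights yields an area of about 0.3, while the reversed attribution yields about 0.7, which is the kind of separation a deletion test looks for.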

29 pages, 4755 KB  
Article
DenseViT-OCT: A Hybrid CNN-Transformer Architecture with Multi-Scale Dense Feature Aggregation for Automated Epiretinal Membrane Severity Classification
by Elif Yusufoğlu, Salih Taha Alperen Özçelik, Orhan Atila, Numan Halit Guldemir and Abdulkadir Sengur
Tomography 2026, 12(6), 76; https://doi.org/10.3390/tomography12060076 - 22 May 2026
Viewed by 228
Abstract
Background/Objectives: Epiretinal membrane (ERM) is a common vitreoretinal disorder characterized by fibrocellular proliferation on the inner retinal surface, often leading to progressive visual impairment. Accurate grading of ERM severity using optical coherence tomography (OCT) is critical for treatment planning and surgical decision-making; however, manual grading is labor-intensive and subjective. This study aims to develop an automated and reliable deep learning-based method for ERM severity classification. Methods: We propose DenseViT-OCT, a hybrid deep learning model that integrates dense convolutional neural networks (CNN) and vision transformers (ViT). The model introduces three key modules: Multi-Scale Dense Feature Aggregation (MDFA) for capturing hierarchical features across multiple spatial scales, Adaptive Feature Calibration (AFC) for enhancing feature discrimination through channel and spatial attention, and Cross-Attention Feature Fusion (CAFF) for enabling bidirectional interaction between convolutional and transformer representations. The model was trained and evaluated on 2195 OCT B-scan images obtained from 397 patients. Results: DenseViT-OCT achieved an overall accuracy of 94.76% on the internal four-class test set, outperforming 19 benchmark models, including ConvNeXt, EfficientNet, ViT, and Swin Transformers. The model demonstrated balanced performance with a macro-averaged precision of 93.76%, recall of 93.22%, F1-score of 93.47%, Cohen’s kappa of 92.62%, and macro-Area Under the Curve (AUC) of 98.95%. Ablation experiments confirmed the contribution of the proposed MDFA, AFC, CAFF, and deep supervision components, with the full model consistently outperforming reduced variants and standalone DenseNet121 and ViT-B/16 backbones. In repeated experiments across five random seeds, DenseViT-OCT also achieved the best mean accuracy (0.9399 ± 0.0052). 
External validation on the public multicenter OCTDL dataset, performed as binary ERM-versus-normal classification because of label availability, yielded 90.76% accuracy and 97.61% AUC, indicating promising generalization beyond the development cohort. Conclusions: DenseViT-OCT provides a robust framework for automated ERM severity classification from OCT B-scans. The combination of local CNN features, global transformer context, and dedicated fusion modules improves classification performance and yields clinically meaningful error patterns. Although further stage-wise multicenter validation, volumetric OCT analysis, and prospective clinical assessment are required, the proposed method shows promise as a research-oriented decision-support framework for B-scan-level ERM assessment. Full article
(This article belongs to the Special Issue Medical Image Analysis in CT Imaging)

26 pages, 3005 KB  
Article
EcoTomHybridNet: Policy-Guided Adaptive CNN–Transformer Inference for Resource-Aware Edge-Based Tomato Leaf Disease Classification
by Oussama Nabil and Cherkaoui Leghris
Future Internet 2026, 18(5), 271; https://doi.org/10.3390/fi18050271 - 21 May 2026
Viewed by 308
Abstract
Tomato (Solanum lycopersicum) cultivation is highly vulnerable to fungal, bacterial, and viral leaf diseases that can significantly reduce crop yield and fruit quality when not detected at early stages. Although recent deep learning approaches have achieved remarkable performance in plant disease classification, many state-of-the-art architectures remain computationally expensive and therefore difficult to deploy on resource-constrained edge devices commonly used in smart agriculture environments. To address this challenge, this paper introduces EcoTomHybridNet, an adaptive resource-aware CNN–Transformer framework designed for efficient tomato leaf disease classification under edge-computing constraints. The proposed architecture combines a lightweight convolutional backbone with a dual-branch inference mechanism composed of a fast convolutional branch for computationally efficient prediction and a Transformer-enhanced branch with local self-attention for richer contextual feature extraction. Unlike conventional lightweight hybrid models relying on static inference pipelines, EcoTomHybridNet integrates a lightweight policy-guided routing mechanism that dynamically allocates inputs between the fast convolutional branch and the Transformer-enhanced branch according to input complexity. This adaptive inference strategy dynamically reduces unnecessary Transformer computations for simpler samples while preserving strong predictive performance on more challenging inputs through policy-guided branch allocation. To further improve representation capability without significantly increasing computational complexity, the proposed student network is trained using knowledge distillation from a ViT-Tiny teacher model. Experimental results on the PlantVillage tomato dataset demonstrate that EcoTomHybridNet achieves 99.42% test accuracy and 99.0% validation accuracy under the full hybrid inference configuration. 
Additional validation strategies, including 5-fold cross-validation and robustness evaluation under Gaussian noise and motion blur perturbations, indicate stable performance across different data splits and moderate image degradations, suggesting improved generalization capability beyond simple dataset memorization. Furthermore, adaptive routing experiments using a lightweight threshold-based policy mechanism achieved 99.20% test accuracy while reducing computational complexity from 0.36 GFLOPs to 0.25 GFLOPs per image, corresponding to approximately 30% computational savings. These results demonstrate the effectiveness of policy-guided adaptive inference for balancing predictive performance and computational efficiency in edge-oriented plant disease classification. Overall, EcoTomHybridNet provides an efficient and adaptive framework for intelligent plant disease monitoring in IoT-enabled smart agriculture systems. Full article
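
The threshold-based routing and the reported compute saving can be sketched briefly. The routing criterion below (confidence of the fast branch) and the threshold value are assumptions, since the abstract only says that inputs are routed according to their complexity. The GFLOPs figures are the ones quoted above.

```python
def route(fast_probs, threshold=0.9):
    """Keep the fast branch's answer when it is confident; otherwise defer
    to the heavier Transformer-enhanced branch."""
    return "fast" if max(fast_probs) >= threshold else "transformer"


def relative_saving(full_gflops, adaptive_gflops):
    """Fraction of compute saved relative to always running the full model."""
    return 1.0 - adaptive_gflops / full_gflops


easy_sample = route([0.95, 0.05])  # confident: stays on the fast branch
hard_sample = route([0.60, 0.40])  # uncertain: escalated

# 0.36 -> 0.25 GFLOPs per image, i.e. roughly 30% saved
saving = relative_saving(0.36, 0.25)
```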

23 pages, 2142 KB  
Article
ECAViT-Net: A Lightweight Hybrid CNN-Transformer Architecture for Efficient Cervical Cytological Cell Classification
by Mamadou Eric Sangare, Boujemaa Nassiri, Youssef El Habouz, Yousef El Mourabit, Hamidou Tembine and Bsiss Mohammed Aziz
Appl. Sci. 2026, 16(10), 4995; https://doi.org/10.3390/app16104995 - 17 May 2026
Viewed by 335
Abstract
Cervical cancer, primarily caused by human papillomavirus (HPV) infection, remains a major cause of cancer-related mortality among women worldwide, making early detection through cytological screening essential. However, manual analysis of cytology images is time-consuming and subject to variability, while recent deep learning approaches, particularly transformer-based architectures, often require high computational resources, limiting their use in resource-constrained settings. In this study, we propose ECAViTNet, a lightweight hybrid CNN–Transformer architecture for cervical cytology image classification that balances accuracy and efficiency. The model integrates Efficient Channel Attention modules for adaptive feature recalibration, residual connections for stable optimization, MobileViT blocks to capture local and global dependencies, and gated multi-scale fusion mechanisms to enhance feature representation, along with progressive downsampling and skip connections to preserve fine-grained details. The proposed approach was evaluated on the SIPaKMeD dataset, achieving a test accuracy of 96.42%, a macro-average F1-score of 0.96, and a weighted-average F1-score of 0.96 with only 982,491 parameters, while maintaining balanced class-wise performance and reduced computational cost compared to recent methods. These results demonstrate that ECAViTNet is an effective and efficient solution for automated cervical cytology classification, with strong potential for deployment in mobile health systems and low-resource clinical environments.
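
To make the "adaptive feature recalibration" step concrete, here is a minimal pure-Python sketch of Efficient Channel Attention in its commonly described form: global average pooling per channel, a small 1D convolution across neighbouring channels, a sigmoid gate, and channel-wise rescaling. The abstract does not give the module's configuration, so the kernel size, the zero padding, and the function name `efficient_channel_attention` are assumptions for illustration rather than the paper's implementation.

```python
import math


def efficient_channel_attention(feature_map, kernel):
    """Rescale each channel by a gate computed from its neighbouring channels.

    feature_map: list of C channels, each a list of rows of floats (H x W).
    kernel: odd-length list of 1D convolution weights shared across channels.
    """
    # Squeeze: one scalar per channel via global average pooling.
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]

    # Local cross-channel interaction: 1D convolution with zero padding.
    pad = len(kernel) // 2
    padded = [0.0] * pad + pooled + [0.0] * pad
    gates = []
    for c in range(len(pooled)):
        z = sum(w * padded[c + j] for j, w in enumerate(kernel))
        gates.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid gate in (0, 1)

    # Excite: multiply every value in a channel by that channel's gate.
    return [
        [[value * gate for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
```

Because the kernel is shared and spans only a few channels, the number of learnable weights equals the kernel length, which is one reason this kind of attention suits parameter-constrained models.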
(This article belongs to the Special Issue AI for Medical Systems: Algorithms, Applications, and Challenges)

12 pages, 2066 KB  
Article
Automated Classification of Maxillary Sinus Ostium Patency Using a ConvNeXt-Tiny + DeiT Gated MLP-Based Hybrid Deep Learning Model: A Retrospective CBCT Study
by Furkan Talo, Nurullah Duger, Emre Aslan, Muhammed Yildirim, Mahmut Kaya, Ahmet Bedri Ozer and Tuba Talo Yildirim
Diagnostics 2026, 16(10), 1512; https://doi.org/10.3390/diagnostics16101512 - 16 May 2026
Abstract
Background/Objectives: The patency and anatomical location of the maxillary sinus ostium are critical for preventing postoperative complications in dental implant planning and sinus lift surgeries in the posterior maxilla. Narrowing or obstruction of the ostium carries risks, including the development of acute/chronic sinusitis and bone graft failure after surgery. These risks must be carefully evaluated using preoperative radiographic images. Manual evaluation is time-consuming for physicians, and details can be overlooked when clinical experience is limited, which can increase surgical risks. Methods: This study aims to overcome these clinical challenges and improve the reliability of radiographic evaluation. A hybrid deep learning model is proposed for the automatic detection of the maxillary sinus ostium. The proposed model combines the local feature extraction power of CNN-based models with the global context modeling capabilities of transformer-based models. Additionally, the gated fusion technique efficiently combines features from the different architectures, significantly enhancing classification performance. Results: The proposed model was compared with six different ViT and CNN architectures established in the literature. While the highest test accuracy among pre-trained models was 89.36%, the proposed hybrid model achieved 95.03%, demonstrating strong clinical diagnostic performance. Conclusions: Based on the performance metrics obtained, we believe the proposed model can be used to determine the patency of the maxillary sinus ostium. This will lighten the workload for specialists and minimize traditional errors.

30 pages, 7003 KB  
Article
Facial Expression Recognition in Anime and Manga Characters: A Comparative Study of Vision Transformers and Convolutional Neural Networks
by Marco Parrillo, Elia Santoro, Luigi Laura and Valerio Rughetti
Information 2026, 17(5), 484; https://doi.org/10.3390/info17050484 - 15 May 2026
Abstract
Facial expression recognition (FER) is a well-established task in computer vision, yet its application to non-photorealistic domains, such as anime and manga, remains largely underexplored. The stylized, exaggerated, and often non-proportional facial features of illustrated characters present unique challenges for deep learning models trained predominantly on realistic imagery. In this work, we construct a balanced dataset of 3000 manga and anime face images spanning six emotion categories (Angry, Embarrassed, Happy, Manic–Euphoric, Sad, Scared) and conduct a systematic comparison of two major deep learning paradigms: Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). Specifically, we evaluate ResNet-18, ResNet-50, ViT-B/16, and ViT-S/16 under four fine-tuning strategies: linear probing, partial fine-tuning, full fine-tuning, and progressive unfreezing, enabling a controlled comparison of both architectural families and transfer learning depth. Our results show that fine-tuning strategy significantly impacts performance: the best configuration (ViT-B/16 with progressive unfreezing) achieves 81.33% test accuracy (single run, seed 42), compared to 61.33% for the weakest linear probe baseline (ViT-S/16), a gap of 20.00 percentage points. To isolate architectural differences from strategy effects, we note that under full fine-tuning, the only strategy applied identically to all four models, ViT-S/16 (76.00%) outperforms ResNet-18 (74.44%) by 1.56 percentage points and ViT-B/16 (74.22%) by 1.78 percentage points, confirming a modest but consistent architectural advantage for Transformers once backbone adaptation is permitted. Vision Transformers benefit disproportionately from fine-tuning, and the relative ranking of architectures changes across fine-tuning regimes. Confusion matrix analysis reveals persistent cross-class confusion between visually similar emotions (e.g., Happy vs. Embarrassed), while the highly distinctive Manic–Euphoric category is consistently well recognized across all architectures. To the best of our knowledge, this is the first work to conduct a controlled multi-architecture, multi-strategy transfer learning benchmark specifically for FER in anime and manga, revealing findings that are not predictable from photographic FER literature and that carry direct practical implications for model selection in non-photorealistic visual recognition tasks. The anime and manga domain provides a uniquely controlled testbed for studying transfer learning under deliberate stylization, where the domain gap from realistic imagery is not an artifact of image degradation or environmental noise but a principled artistic choice with codified visual conventions; observing that fine-tuning depth dominates architectural choice in this domain suggests the same conclusion likely holds in other non-photorealistic transfer scenarios such as medical illustrations, architectural drawings, and synthetic training data.
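
The four fine-tuning strategies named in this abstract differ mainly in which parts of the pretrained network are allowed to update. The sketch below expresses that as a small schedule function. The abstract does not specify how layers were grouped or how quickly they were unfrozen, so the group names, the `partial_k` and `epochs_per_stage` parameters, and the function name `trainable_groups` are hypothetical choices for illustration only.

```python
def trainable_groups(strategy, groups, epoch=0, partial_k=1, epochs_per_stage=2):
    """Return the layer groups that receive gradient updates.

    groups is ordered from input to output; the last entry is the classifier head.
    """
    backbone_size = len(groups) - 1

    if strategy == "linear_probe":
        unfrozen = 0  # head only, backbone stays frozen
    elif strategy == "partial":
        unfrozen = min(partial_k, backbone_size)  # head plus top-k backbone groups
    elif strategy == "full":
        unfrozen = backbone_size  # everything trains from the start
    elif strategy == "progressive":
        # Unfreeze one more backbone group, top-down, every few epochs.
        unfrozen = min(epoch // epochs_per_stage, backbone_size)
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    return groups[backbone_size - unfrozen:]


# Hypothetical grouping of a backbone into four stages plus a head.
layer_groups = ["stem", "stage1", "stage2", "stage3", "head"]
for epoch in (0, 2, 4, 8):
    print(epoch, trainable_groups("progressive", layer_groups, epoch=epoch))
```

In a real training loop, the returned names would decide which parameters have gradients enabled at each epoch. Progressive unfreezing starts like a linear probe and ends like full fine-tuning.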

27 pages, 2306 KB  
Article
Multimodal Emotion Detection in Low-Resource Languages Using Lightweight Transformer Architectures: A Dual-Level Fusion Framework Integrating DistilBERT, CNN-BiGRU, and MobileViT for Efficient Real-Time Urdu Affective Computing
by Muhammad Azhar, Adeen Amjad, Muhammad Arman and Deshinta Arrova Dewi
Information 2026, 17(5), 458; https://doi.org/10.3390/info17050458 - 8 May 2026
Abstract
This paper addresses emotion recognition in low-resource language settings for healthcare and human-computer interaction (HCI). Most existing multimodal systems rely on resource-intensive transformers or high-resource languages, limiting their applicability to low-resource languages like Urdu. We propose an efficiency-driven, lightweight multimodal framework for Urdu emotion detection integrating facial expressions, speech, and text. We utilize DistilBERT for text, CNN-BiGRU for audio, and MobileViT-XXS for visual processing with a dual-level fusion strategy. We evaluate on the publicly available UMED corpus, the only multimodal Urdu emotion dataset. Our system recognizes expressed emotional signals rather than internal affective states. Experimental results demonstrate competitive performance (83.72% accuracy) while requiring 76.5% fewer parameters and 4.4× faster inference than heavyweight baselines, enabling accessible, real-time emotion recognition in low-resource contexts.
