Search Results (205)

Search Parameters:
Keywords = Swin Transformer neural network

15 pages, 2358 KB  
Article
Optimized Lung Nodule Classification Using CLAHE-Enhanced CT Imaging and Swin Transformer-Based Deep Feature Extraction
by Dorsaf Hrizi, Khaoula Tbarki and Sadok Elasmi
J. Imaging 2025, 11(10), 346; https://doi.org/10.3390/jimaging11100346 (registering DOI) - 4 Oct 2025
Abstract
Lung cancer remains one of the most lethal cancers globally. Its early detection is vital to improving survival rates. In this work, we propose a hybrid computer-aided diagnosis (CAD) pipeline for lung cancer classification using Computed Tomography (CT) scan images. The proposed CAD pipeline integrates ten image preprocessing techniques, ten pretrained deep learning models for feature extraction (including convolutional neural networks and transformer-based architectures), and four classical machine learning classifiers. Unlike traditional end-to-end deep learning systems, our approach decouples feature extraction from classification, enhancing interpretability and reducing the risk of overfitting. A total of 400 model configurations were evaluated to identify the optimal combination. The proposed approach was evaluated on the publicly available Lung Image Database Consortium and Image Database Resource Initiative dataset, which comprises 1018 thoracic CT scans annotated by four thoracic radiologists. For the classification task, the dataset included a total of 6568 images labeled as malignant and 4849 images labeled as benign. Experimental results show that the best-performing pipeline, combining Contrast Limited Adaptive Histogram Equalization, Swin Transformer feature extraction, and eXtreme Gradient Boosting, achieved an accuracy of 95.8%. Full article
(This article belongs to the Special Issue Advancements in Imaging Techniques for Detection of Cancer)
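For readers who want to prototype the decoupled pipeline this abstract describes, the sketch below shows the general pattern under stated assumptions: OpenCV's CLAHE for contrast enhancement, a frozen timm Swin backbone as the feature extractor, and XGBoost as the classical classifier. The backbone name, hyperparameters, and helper functions are illustrative choices, not the authors' code.

```python
# Sketch of a decoupled CAD pipeline: CLAHE preprocessing -> frozen Swin features -> XGBoost.
# Assumes OpenCV, timm, torch and xgboost are installed; names and settings are illustrative.
import cv2
import numpy as np
import timm
import torch
from xgboost import XGBClassifier

clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))

def preprocess(gray_ct_slice: np.ndarray) -> torch.Tensor:
    """Apply CLAHE to a grayscale CT slice and convert it to a 3-channel tensor."""
    enhanced = clahe.apply(gray_ct_slice.astype(np.uint8))
    enhanced = cv2.resize(enhanced, (224, 224))
    rgb = np.repeat(enhanced[None, :, :], 3, axis=0) / 255.0
    return torch.tensor(rgb, dtype=torch.float32)

# Frozen Swin backbone used purely as a feature extractor (num_classes=0 returns pooled features).
backbone = timm.create_model("swin_tiny_patch4_window7_224", pretrained=True, num_classes=0).eval()

@torch.no_grad()
def extract_features(images: list) -> np.ndarray:
    batch = torch.stack([preprocess(img) for img in images])
    return backbone(batch).numpy()

# Classical classifier trained on the extracted deep features, kept separate from the backbone.
def train_classifier(train_images, train_labels) -> XGBClassifier:
    feats = extract_features(train_images)
    clf = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)
    clf.fit(feats, train_labels)
    return clf
```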
22 pages, 782 KB  
Article
Hybrid CNN-Swin Transformer Model to Advance the Diagnosis of Maxillary Sinus Abnormalities on CT Images Using Explainable AI
by Mohammad Alhumaid and Ayman G. Fayoumi
Computers 2025, 14(10), 419; https://doi.org/10.3390/computers14100419 - 2 Oct 2025
Abstract
Accurate diagnosis of sinusitis is essential due to its widespread prevalence and its considerable impact on patient quality of life. While multiple imaging techniques are available for detecting maxillary sinus abnormalities, computed tomography (CT) remains the preferred modality because of its high sensitivity and spatial resolution. Although recent advances in deep learning have led to the development of automated methods for sinusitis classification, many existing models perform poorly in the presence of complex pathological features and offer limited interpretability, which hinders their integration into clinical workflows. In this study, we propose a hybrid deep learning framework that combines EfficientNetB0, a convolutional neural network, with the Swin Transformer, a vision transformer, to improve feature representation. An attention-based fusion module is used to integrate both local and global information, thereby enhancing diagnostic accuracy. To improve transparency and support clinical adoption, the model incorporates explainable artificial intelligence (XAI) techniques using Gradient-weighted Class Activation Mapping (Grad-CAM). This allows for visualization of the regions influencing the model’s predictions, helping radiologists assess the clinical relevance of the results. We evaluate the proposed method on a curated maxillary sinus CT dataset covering four diagnostic categories: Normal, Opacified, Polyposis, and Retention Cysts. The model achieves a classification accuracy of 95.83%, with precision, recall, and F1-score all at 95%. Grad-CAM visualizations indicate that the model consistently focuses on clinically significant regions of the sinus anatomy, supporting its potential utility as a reliable diagnostic aid in medical practice. Full article
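A minimal sketch of the CNN–transformer fusion idea described above, assuming timm backbones for EfficientNetB0 and Swin-Tiny and a simple two-way attention gate; the authors' exact fusion module and layer sizes are not given in the abstract, so these are illustrative choices. Grad-CAM could then be attached to the CNN branch's final convolutional stage for visualization.

```python
# Rough sketch of fusing CNN (EfficientNetB0) and Swin features with a small attention gate.
# Layer sizes and the gating design are assumptions for illustration, not the published module.
import timm
import torch
import torch.nn as nn

class AttentionFusionClassifier(nn.Module):
    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.cnn = timm.create_model("efficientnet_b0", pretrained=True, num_classes=0)              # 1280-d
        self.vit = timm.create_model("swin_tiny_patch4_window7_224", pretrained=True, num_classes=0)  # 768-d
        self.proj_cnn = nn.Linear(1280, 512)
        self.proj_vit = nn.Linear(768, 512)
        # The gate decides how much to trust local (CNN) vs. global (Swin) features per image.
        self.gate = nn.Sequential(nn.Linear(1024, 2), nn.Softmax(dim=-1))
        self.head = nn.Linear(512, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_local = self.proj_cnn(self.cnn(x))
        f_global = self.proj_vit(self.vit(x))
        w = self.gate(torch.cat([f_local, f_global], dim=-1))   # (B, 2) attention weights
        fused = w[:, :1] * f_local + w[:, 1:] * f_global        # weighted sum of the two branches
        return self.head(fused)

logits = AttentionFusionClassifier()(torch.randn(2, 3, 224, 224))  # -> shape (2, 4)
```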
23 pages, 18084 KB  
Article
WetSegNet: An Edge-Guided Multi-Scale Feature Interaction Network for Wetland Classification
by Li Chen, Shaogang Xia, Xun Liu, Zhan Xie, Haohong Chen, Feiyu Long, Yehong Wu and Meng Zhang
Remote Sens. 2025, 17(19), 3330; https://doi.org/10.3390/rs17193330 - 29 Sep 2025
Abstract
Wetlands play a crucial role in climate regulation, pollutant filtration, and biodiversity conservation. Accurate wetland classification through high-resolution remote sensing imagery is pivotal for the scientific management, ecological monitoring, and sustainable development of these ecosystems. However, the intricate spatial details in such imagery pose significant challenges to conventional interpretation techniques, necessitating precise boundary extraction and multi-scale contextual modeling. In this study, we propose WetSegNet, an edge-guided Multi-Scale Feature Interaction network for wetland classification, which integrates a convolutional neural network (CNN) and Swin Transformer within a U-Net architecture to synergize local texture perception and global semantic comprehension. Specifically, the framework incorporates two novel components: (1) a Multi-Scale Feature Interaction (MFI) module employing cross-attention mechanisms to mitigate semantic discrepancies between encoder–decoder features, and (2) a Multi-Feature Fusion (MFF) module that hierarchically enhances boundary delineation through edge-guided spatial attention (EGA). Experimental validation on GF-2 satellite imagery of Dongting Lake wetlands demonstrates that WetSegNet achieves state-of-the-art performance, with an overall accuracy (OA) of 90.81% and a Kappa coefficient of 0.88. Notably, it achieves classification accuracies exceeding 90% for water, sedge, and reed habitats, surpassing the baseline U-Net by 3.3% in overall accuracy and 0.05 in Kappa. The proposed model effectively addresses heterogeneous wetland classification challenges, validating its capability to reconcile local–global feature representation. Full article
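The cross-attention at the heart of the MFI module can be sketched roughly as a generic decoder-attends-to-encoder block in PyTorch; the channel count, head count, and normalization below are assumptions, not the published design.

```python
# Illustrative cross-attention fusion in the spirit of the Multi-Scale Feature Interaction module:
# decoder features (queries) attend to encoder skip features (keys/values). Dimensions are assumed.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, dec_feat: torch.Tensor, enc_feat: torch.Tensor) -> torch.Tensor:
        # dec_feat, enc_feat: (B, C, H, W) maps from the decoder and the matching encoder level.
        b, c, h, w = dec_feat.shape
        q = dec_feat.flatten(2).transpose(1, 2)    # (B, H*W, C) queries from the decoder
        kv = enc_feat.flatten(2).transpose(1, 2)   # keys/values from the encoder
        fused, _ = self.attn(q, kv, kv)
        fused = self.norm(fused + q)               # residual connection then LayerNorm
        return fused.transpose(1, 2).reshape(b, c, h, w)

out = CrossAttentionFusion()(torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32))
```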
18 pages, 1694 KB  
Article
FAIR-Net: A Fuzzy Autoencoder and Interpretable Rule-Based Network for Ancient Chinese Character Recognition
by Yanling Ge, Yunmeng Zhang and Seok-Beom Roh
Sensors 2025, 25(18), 5928; https://doi.org/10.3390/s25185928 - 22 Sep 2025
Viewed by 150
Abstract
Ancient Chinese scripts—including oracle bone carvings, bronze inscriptions, stone steles, Dunhuang scrolls, and bamboo slips—are rich in historical value but often degraded due to centuries of erosion, damage, and stylistic variability. These issues severely hinder manual transcription and render conventional OCR techniques inadequate, as they are typically trained on modern printed or handwritten text and lack interpretability. To tackle these challenges, we propose FAIR-Net, a hybrid architecture that combines the unsupervised feature learning capacity of a deep autoencoder with the semantic transparency of a fuzzy rule-based classifier. In FAIR-Net, the deep autoencoder first compresses high-resolution character images into low-dimensional, noise-robust embeddings. These embeddings are then passed into a Fuzzy Neural Network (FNN), whose hidden layer leverages Fuzzy C-Means (FCM) clustering to model soft membership degrees and generate human-readable fuzzy rules. The output layer uses Iteratively Reweighted Least Squares Estimation (IRLSE) combined with a Softmax function to produce probabilistic predictions, with all weights constrained as linear mappings to maintain model transparency. We evaluate FAIR-Net on CASIA-HWDB1.0, HWDB1.1, and ICDAR 2013 CompetitionDB, where it achieves a recognition accuracy of 97.91%, significantly outperforming baseline CNNs (p < 0.01, Cohen’s d > 0.8) while maintaining the tightest confidence interval (96.88–98.94%) and lowest standard deviation (±1.03%). Additionally, FAIR-Net reduces inference time to 25 s, improving processing efficiency by 41.9% over AlexNet and up to 98.9% over CNN-Fujitsu, while preserving >97.5% accuracy across evaluations. To further assess generalization to historical scripts, FAIR-Net was tested on the Ancient Chinese Character Dataset (9233 classes; 979,907 images), achieving 83.25% accuracy—slightly higher than ResNet101 but 2.49% lower than SwinT-v2-small—while reducing training time by over 5.5× compared to transformer-based baselines. Fuzzy rule visualization confirms enhanced robustness to glyph ambiguities and erosion. Overall, FAIR-Net provides a practical, interpretable, and highly efficient solution for the digitization and preservation of ancient Chinese character corpora. Full article
(This article belongs to the Section Sensing and Imaging)
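FAIR-Net's interpretable hidden layer rests on Fuzzy C-Means memberships; the standard FCM membership formula is sketched below in NumPy, assuming the cluster centers have already been fitted (e.g., on autoencoder embeddings). The IRLS/softmax output layer is omitted.

```python
# Minimal sketch of soft fuzzy memberships of an embedding to FCM cluster centers.
import numpy as np

def fcm_memberships(x: np.ndarray, centers: np.ndarray, m: float = 2.0) -> np.ndarray:
    """Fuzzy C-Means membership of sample x (d,) to each of k centers (k, d); fuzzifier m > 1."""
    d = np.linalg.norm(x - centers, axis=1) + 1e-12            # distances to the k centers
    ratio = (d[:, None] / d[None, :]) ** (2.0 / (m - 1.0))     # pairwise distance ratios
    return 1.0 / ratio.sum(axis=1)                             # memberships sum to 1

centers = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])       # toy centers for illustration
u = fcm_memberships(np.array([0.1, 0.2]), centers)
print(u, u.sum())   # soft, human-readable degrees of membership; total is 1.0
```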
26 pages, 1061 KB  
Article
EEViT: Efficient Enhanced Vision Transformer Architectures with Information Propagation and Improved Inductive Bias
by Rigel Mahmood, Sarosh Patel and Khaled Elleithy
AI 2025, 6(9), 233; https://doi.org/10.3390/ai6090233 - 17 Sep 2025
Viewed by 523
Abstract
The Transformer architecture has been the foundational cornerstone of the recent AI revolution, serving as the backbone of Large Language Models, which have demonstrated impressive language understanding and reasoning capabilities. When pretrained on large amounts of data, Transformers have also shown to be highly effective in image classification via the advent of the Vision Transformer. However, they still lag in vision application performance compared to Convolutional Neural Networks (CNNs), which offer translational invariance, whereas Transformers lack inductive bias. Further, the Transformer relies on the attention mechanism, which despite increasing the receptive field, makes it computationally inefficient due to its quadratic time complexity. In this paper, we enhance the Transformer architecture, focusing on its above two shortcomings. We propose two efficient Vision Transformer architectures that significantly reduce the computational complexity without sacrificing classification performance. Our first enhanced architecture is the EEViT-PAR, which combines features from two recently proposed designs of PerceiverAR and CaiT. This enhancement leads to our second architecture, EEViT-IP, which provides implicit windowing capabilities akin to the SWIN Transformer and implicitly improves the inductive bias, while being extremely memory and computationally efficient. We perform detailed experiments on multiple image datasets to show the effectiveness of our architectures. Our best performing EEViT outperforms existing SOTA ViT models in terms of execution efficiency and surpasses or provides competitive classification accuracy on different benchmarks. Full article
22 pages, 3585 KB  
Article
A Novel 3D U-Net–Vision Transformer Hybrid with Multi-Scale Fusion for Precision Multimodal Brain Tumor Segmentation in 3D MRI
by Fathia Ghribi and Fayçal Hamdaoui
Electronics 2025, 14(18), 3604; https://doi.org/10.3390/electronics14183604 - 11 Sep 2025
Viewed by 419
Abstract
In recent years, segmentation for medical applications using Magnetic Resonance Imaging (MRI) has received increasing attention. Working in this field has emerged as an ambitious task and a major challenge for researchers; particularly, brain tumor segmentation from MRI is a crucial task for accurate diagnosis, treatment planning, and patient monitoring. With the rapid development of deep learning methods, significant improvements have been made in medical image segmentation. Convolutional Neural Networks (CNNs), such as U-Net, have shown excellent performance in capturing local spatial features. However, these models cannot explicitly capture long-range dependencies. Therefore, Vision Transformers have emerged as an alternative segmentation method recently, as they can exploit long-range correlations through the self-attention mechanism (MSA). Despite their effectiveness, ViTs require large annotated datasets and may compromise fine-grained spatial details. To address these problems, we propose a novel hybrid approach for brain tumor segmentation that combines a 3D U-Net with a 3D Vision Transformer (ViT3D), aiming to jointly exploit local feature extraction and global context modeling. Additionally, we developed an effective fusion method that uses upsampling and convolutional refinement to improve multi-scale feature integration. Unlike traditional fusion approaches, our method explicitly refines spatial details while maintaining global dependencies, improving the quality of tumor border delineation. We evaluated our approach on the BraTS 2020 dataset, achieving a global accuracy score of 99.56%, an average Dice similarity coefficient (DSC) of 77.43% (corresponding to the mean across the three tumor subregions), with individual Dice scores of 84.35% for WT, 80.97% for TC, and 66.97% for ET, and an average Intersection over Union (IoU) of 71.69%. These extensive experimental results demonstrate that our model not only localizes tumors with high accuracy and robustness but also outperforms a selection of current state-of-the-art methods, including U-Net, SwinUnet, M-Unet, and others. Full article
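The fusion-by-upsampling-and-refinement step can be sketched generically as below; the token grid size, channel widths, and normalization are assumptions for illustration rather than the paper's exact configuration.

```python
# Sketch of the multi-scale fusion idea: low-resolution ViT3D tokens are upsampled and refined
# with a 3D convolution, then merged with U-Net decoder features. All shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampleFusion3D(nn.Module):
    def __init__(self, vit_dim: int = 384, cnn_ch: int = 64):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv3d(vit_dim, cnn_ch, kernel_size=3, padding=1),
            nn.InstanceNorm3d(cnn_ch),
            nn.ReLU(inplace=True),
        )
        self.merge = nn.Conv3d(2 * cnn_ch, cnn_ch, kernel_size=1)

    def forward(self, vit_tokens: torch.Tensor, cnn_feat: torch.Tensor) -> torch.Tensor:
        # vit_tokens: (B, N, C) from a 3D ViT with a cubic token grid; cnn_feat: (B, cnn_ch, D, H, W).
        b, n, c = vit_tokens.shape
        g = round(n ** (1 / 3))                                  # side length of the token grid
        vol = vit_tokens.transpose(1, 2).reshape(b, c, g, g, g)  # tokens -> low-res 3D volume
        vol = F.interpolate(vol, size=cnn_feat.shape[2:], mode="trilinear", align_corners=False)
        return self.merge(torch.cat([self.refine(vol), cnn_feat], dim=1))

out = UpsampleFusion3D()(torch.randn(1, 512, 384), torch.randn(1, 64, 32, 32, 32))
```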
25 pages, 4660 KB  
Article
Dual-Stream Former: A Dual-Branch Transformer Architecture for Visual Speech Recognition
by Sanghun Jeon, Jieun Lee and Yong-Ju Lee
AI 2025, 6(9), 222; https://doi.org/10.3390/ai6090222 - 9 Sep 2025
Viewed by 877
Abstract
This study proposes Dual-Stream Former, a novel architecture that integrates a Video Swin Transformer and Conformer designed to address the challenges of visual speech recognition (VSR). The model captures spatiotemporal dependencies, achieving a state-of-the-art character error rate (CER) of 3.46%, surpassing traditional convolutional neural network (CNN)-based models, such as 3D-CNN + DenseNet-121 (CER: 5.31%), and transformer-based alternatives, such as vision transformers (CER: 4.05%). The Video Swin Transformer captures multiscale spatial representations with high computational efficiency, whereas the Conformer back-end enhances temporal modeling across diverse phoneme categories. Evaluation of a high-resolution dataset comprising 740,000 utterances across 185 classes highlighted the effectiveness of the model in addressing visually confusing phonemes, such as diphthongs (/ai/, /au/) and labio-dental sounds (/f/, /v/). Dual-Stream Former achieved phoneme recognition error rates of 10.39% for diphthongs and 9.25% for labiodental sounds, surpassing those of CNN-based architectures by more than 6%. Although the model’s large parameter count (168.6 M) poses resource challenges, its hierarchical design ensures scalability. Future work will explore lightweight adaptations and multimodal extensions to increase deployment feasibility. These findings underscore the transformative potential of Dual-Stream Former for advancing VSR applications such as silent communication and assistive technologies by achieving unparalleled precision and robustness in diverse settings. Full article
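Since the headline results are reported as character error rate (CER), a small reference implementation of CER as normalized edit distance may be useful; this is the standard metric definition, not code from the paper.

```python
# CER = (insertions + deletions + substitutions) / reference length, via dynamic programming.
def cer(reference: str, hypothesis: str) -> float:
    r, h = list(reference), list(hypothesis)
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(r)][len(h)] / max(len(r), 1)

print(cer("visual speech", "visual speach"))  # one substitution over 13 characters -> ~0.077
```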
26 pages, 6612 KB  
Article
A Comparative Survey of Vision Transformers for Feature Extraction in Texture Analysis
by Leonardo Scabini, Andre Sacilotti, Kallil M. Zielinski, Lucas C. Ribas, Bernard De Baets and Odemir M. Bruno
J. Imaging 2025, 11(9), 304; https://doi.org/10.3390/jimaging11090304 - 5 Sep 2025
Cited by 1 | Viewed by 693
Abstract
Texture, a significant visual attribute in images, plays an important role in many pattern recognition tasks. While Convolutional Neural Networks (CNNs) have been among the most effective methods for texture analysis, alternative architectures such as Vision Transformers (ViTs) have recently demonstrated superior performance on a range of visual recognition problems. However, the suitability of ViTs for texture recognition remains underexplored. In this work, we investigate the capabilities and limitations of ViTs for texture recognition by analyzing 25 different ViT variants as feature extractors and comparing them to CNN-based and hand-engineered approaches. Our evaluation encompasses both accuracy and efficiency, aiming to assess the trade-offs involved in applying ViTs to texture analysis. Our results indicate that ViTs generally outperform CNN-based and hand-engineered models, particularly when using strong pre-training and in-the-wild texture datasets. Notably, BeiTv2-B/16 achieves the highest average accuracy (85.7%), followed by ViT-B/16-DINO (84.1%) and Swin-B (80.8%), outperforming the ResNet50 baseline (75.5%) and the hand-engineered baseline (73.4%). As a lightweight alternative, EfficientFormer-L3 attains a competitive average accuracy of 78.9%. In terms of efficiency, although ViT-B and BeiT(v2) have a higher number of GFLOPs and parameters, they achieve significantly faster feature extraction on GPUs compared to ResNet50. These findings highlight the potential of ViTs as a powerful tool for texture analysis while also pointing to areas for future exploration, such as efficiency improvements and domain-specific adaptations. Full article
(This article belongs to the Special Issue Celebrating the 10th Anniversary of the Journal of Imaging)
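Evaluating pretrained ViTs as frozen feature extractors, as surveyed above, typically follows a linear-probe pattern; a compact sketch using timm and scikit-learn is given below. The backbone name, transform handling, and logistic-regression probe are illustrative assumptions, not the survey's protocol.

```python
# Frozen ViT backbone from timm produces features; a simple linear classifier is trained on top.
import numpy as np
import timm
import torch
from sklearn.linear_model import LogisticRegression

def build_extractor(name: str = "beitv2_base_patch16_224"):
    model = timm.create_model(name, pretrained=True, num_classes=0).eval()
    cfg = timm.data.resolve_model_data_config(model)                 # model-specific preprocessing
    transform = timm.data.create_transform(**cfg, is_training=False)
    return model, transform

@torch.no_grad()
def features(model, transform, pil_images) -> np.ndarray:
    batch = torch.stack([transform(im) for im in pil_images])
    return model(batch).numpy()

# Linear probe on texture features (train/test splits of PIL images are assumed to exist).
def linear_probe(model, transform, train_imgs, train_y, test_imgs, test_y) -> float:
    clf = LogisticRegression(max_iter=2000).fit(features(model, transform, train_imgs), train_y)
    return clf.score(features(model, transform, test_imgs), test_y)
```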
19 pages, 3770 KB  
Article
Segmentation of 220 kV Cable Insulation Layers Using WGAN-GP-Based Data Augmentation and the TransUNet Model
by Liang Luo, Song Qing, Yingjie Liu, Guoyuan Lu, Ziying Zhang, Yuhang Xia, Yi Ao, Fanbo Wei and Xingang Chen
Energies 2025, 18(17), 4667; https://doi.org/10.3390/en18174667 - 2 Sep 2025
Viewed by 606
Abstract
This study presents a segmentation framework for images of 220 kV cable insulation that addresses sample scarcity and blurred boundaries. The framework integrates data augmentation using the Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP) and the TransUNet architecture. Considering the difficulty and high cost of obtaining real cable images, WGAN-GP generates high-quality synthetic data to expand the dataset and improve the model’s generalization. The TransUNet network, designed to handle the structural complexity and indistinct edge features of insulation layers, combines the local feature extraction capability of convolutional neural networks (CNNs) with the global context modeling strength of Transformers. This combination enables accurate delineation of the insulation regions. The experimental results show that the proposed method achieves mDice, mIoU, MP, and mRecall scores of 0.9835, 0.9677, 0.9840, and 0.9831, respectively, with improvements of approximately 2.03%, 3.05%, 2.08%, and 1.98% over a UNet baseline. Overall, the proposed approach outperforms UNet, Swin-UNet, and Attention-UNet, confirming its effectiveness in delineating 220 kV cable insulation layers under complex structural and data-limited conditions. Full article
(This article belongs to the Special Issue Fault Detection and Diagnosis of Power Distribution System)
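The WGAN-GP augmentation stage hinges on the gradient penalty; a standard, self-contained PyTorch sketch of that term is shown below. The critic and image batches are placeholders, and the penalty weight of 10 follows the common default rather than this paper's setting.

```python
# Gradient penalty: ((||grad critic(x_hat)||_2 - 1)^2 averaged over interpolates) * lambda.
import torch

def gradient_penalty(critic, real: torch.Tensor, fake: torch.Tensor, lambda_gp: float = 10.0):
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)        # per-sample mixing factor
    x_hat = (eps * real + (1.0 - eps) * fake).requires_grad_(True)     # random interpolates
    scores = critic(x_hat)
    grads = torch.autograd.grad(outputs=scores, inputs=x_hat,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True, retain_graph=True)[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()

# Critic loss in WGAN-GP: E[critic(fake)] - E[critic(real)] + gradient_penalty(critic, real, fake)
```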
16 pages, 1500 KB  
Article
Emotion Recognition in Autistic Children Through Facial Expressions Using Advanced Deep Learning Architectures
by Petra Radočaj and Goran Martinović
Appl. Sci. 2025, 15(17), 9555; https://doi.org/10.3390/app15179555 - 30 Aug 2025
Viewed by 807
Abstract
Atypical and subtle facial expression patterns in individuals with autism spectrum disorder (ASD) pose a significant challenge for automated emotion recognition. This study evaluates and compares the performance of convolutional neural networks (CNNs) and transformer-based deep learning models for facial emotion recognition in this population. Using a labeled dataset of emotional facial images, we assessed eight models across four emotion categories: natural, anger, fear, and joy. Our results demonstrate that transformer models consistently outperformed CNNs in both overall and emotion-specific metrics. Notably, the Swin Transformer achieved the highest performance, with an accuracy of 0.8000 and an F1-score of 0.7889, significantly surpassing all CNN counterparts. While CNNs failed to detect the fear class, transformer models showed a measurable capability in identifying complex emotions such as anger and fear, suggesting an enhanced ability to capture subtle facial cues. Analysis of the confusion matrix further confirmed the transformers’ superior classification balance and generalization. Despite these promising results, the study has limitations, including class imbalance and its reliance solely on facial imagery. Future work should explore multimodal emotion recognition, model interpretability, and personalization for real-world applications. Research also demonstrates the potential of transformer architectures in advancing inclusive, emotion-aware AI systems tailored for autistic individuals. Full article
16 pages, 1458 KB  
Article
Deep Ensemble Learning for Multiclass Skin Lesion Classification
by Tsu-Man Chiu, I-Chun Chi, Yun-Chang Li and Ming-Hseng Tseng
Bioengineering 2025, 12(9), 934; https://doi.org/10.3390/bioengineering12090934 - 29 Aug 2025
Viewed by 601
Abstract
The skin, the largest organ of the body, acts as a protective shield against external stimuli. Skin lesions, which can be the result of inflammation, infection, tumors, or autoimmune conditions, can appear as rashes, spots, lumps, or scales, or remain asymptomatic until they become severe. Conventional diagnostic approaches such as visual inspection and palpation often lack accuracy. Artificial intelligence (AI) improves diagnostic precision by analyzing large volumes of skin images to detect subtle patterns that clinicians may not recognize. This study presents a multiclass skin lesion diagnostic model developed using the CSMUH dataset, which focuses on the Eastern population. The dataset was categorized into seven disease classes for model training. A total of 25 pre-trained models, including convolutional neural networks (CNNs) and vision transformers (ViTs), were fine-tuned. The top three models were combined into an ensemble using the hard and soft voting methods. To ensure reliability, the model was tested through five randomized experiments and validated using the holdout technique. The proposed ensemble model, Swin-ViT-EfficientNetB4, achieved the highest test accuracy of 98.5%, demonstrating strong potential for accurate and early skin lesion diagnosis. Full article
(This article belongs to the Special Issue Mathematical Models for Medical Diagnosis and Testing)
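The hard/soft voting used to ensemble the top three fine-tuned models can be sketched in a few lines of NumPy; the probability arrays below are toy placeholders, not model outputs from the study.

```python
import numpy as np

def soft_vote(probs_per_model: list) -> np.ndarray:
    """Average class probabilities across models, then take the argmax per sample."""
    return np.mean(np.stack(probs_per_model), axis=0).argmax(axis=1)

def hard_vote(probs_per_model: list) -> np.ndarray:
    """Majority vote over each model's predicted class (ties resolve to the lowest class index)."""
    preds = np.stack([p.argmax(axis=1) for p in probs_per_model])   # (n_models, n_samples)
    n_classes = probs_per_model[0].shape[1]
    return np.array([np.bincount(preds[:, i], minlength=n_classes).argmax()
                     for i in range(preds.shape[1])])

p1 = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
p2 = np.array([[0.4, 0.4, 0.2], [0.1, 0.3, 0.6]])
p3 = np.array([[0.5, 0.2, 0.3], [0.3, 0.3, 0.4]])
print(soft_vote([p1, p2, p3]), hard_vote([p1, p2, p3]))   # -> [0 2] [0 2]
```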
18 pages, 2884 KB  
Article
Research on Multi-Path Feature Fusion Manchu Recognition Based on Swin Transformer
by Yu Zhou, Mingyan Li, Hang Yu, Jinchi Yu, Mingchen Sun and Dadong Wang
Symmetry 2025, 17(9), 1408; https://doi.org/10.3390/sym17091408 - 29 Aug 2025
Viewed by 438
Abstract
Recognizing Manchu words can be challenging due to their complex character variations, subtle differences between similar characters, and homographic polysemy. Most studies rely on character segmentation techniques for character recognition or use convolutional neural networks (CNNs) to encode word images for word recognition. However, these methods can lead to segmentation errors or a loss of semantic information, which reduces the accuracy of word recognition. To address the limitations of CNNs in long-range dependency modeling and to enhance semantic coherence, we propose a hybrid architecture that fuses the spatial features of the original images with spectral features. Specifically, we first apply the Short-Time Fourier Transform (STFT) to preprocess the raw input images and thereby obtain multi-view spectral features. Then, we use a primary CNN block and a pair of symmetric CNN blocks to construct a symmetric spectral enhancement module, which encodes the raw input features and the multi-view spectral features. Subsequently, we design a Swin Transformer-based feature fusion module that fuses the multi-view spectral embeddings and concatenates them with the raw input embedding. Finally, a Transformer decoder produces the target output. We conducted extensive experiments on Manchu word benchmark datasets to evaluate the effectiveness of the proposed framework. The experimental results demonstrate that our framework performs robustly in word recognition tasks and exhibits excellent generalization capability. Additionally, our model outperformed other baseline methods on recognition tasks spanning multiple writing styles and fonts. Full article
(This article belongs to the Section Computer)
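How the STFT yields multi-view spectral features from a word image is not spelled out in the abstract; the sketch below shows one plausible reading (treating each pixel row as a 1-D signal and stacking magnitude spectrograms), which should be taken as an assumption rather than the authors' preprocessing.

```python
# Hedged sketch: row-wise STFT magnitudes as spectral "views" of a grayscale word image.
import numpy as np
from scipy.signal import stft

def rowwise_stft_magnitude(gray_img: np.ndarray, nperseg: int = 16) -> np.ndarray:
    """Return a stack of |STFT| spectrograms, one per image row."""
    spectra = []
    for row in gray_img.astype(np.float32):
        _, _, z = stft(row, nperseg=nperseg)   # short-time spectrum of one pixel row
        spectra.append(np.abs(z))
    return np.stack(spectra)                   # (H, freq_bins, time_frames)

views = rowwise_stft_magnitude(np.random.rand(48, 192))   # toy 48x192 word image
print(views.shape)
```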
37 pages, 3806 KB  
Article
Comparative Evaluation of CNN and Transformer Architectures for Flowering Phase Classification of Tilia cordata Mill. with Automated Image Quality Filtering
by Bogdan Arct, Bartosz Świderski, Monika A. Różańska, Bogdan H. Chojnicki, Tomasz Wojciechowski, Gniewko Niedbała, Michał Kruk, Krzysztof Bobran and Jarosław Kurek
Sensors 2025, 25(17), 5326; https://doi.org/10.3390/s25175326 - 27 Aug 2025
Viewed by 784
Abstract
Understanding and monitoring the phenological phases of trees is essential for ecological research and climate change studies. In this work, we present a comprehensive evaluation of state-of-the-art convolutional neural networks (CNNs) and transformer architectures for the automated classification of the flowering phase of Tilia cordata Mill. (small-leaved lime) based on a large set of real-world images acquired under natural field conditions. The study introduces a novel, automated image quality filtering approach using an XGBoost classifier trained on diverse exposure and sharpness features to ensure robust input data for subsequent deep learning models. Seven modern neural network architectures, including VGG16, ResNet50, EfficientNetB3, MobileNetV3 Large, ConvNeXt Tiny, Vision Transformer (ViT-B/16), and Swin Transformer Tiny, were fine-tuned and evaluated under a rigorous cross-validation protocol. All models achieved excellent performance, with cross-validated F1-scores exceeding 0.97 and balanced accuracy up to 0.993. The best results were obtained for ResNet50 and ConvNeXt Tiny (F1-score: 0.9879 ± 0.0077 and 0.9860 ± 0.0073, balanced accuracy: 0.9922 ± 0.0054 and 0.9927 ± 0.0042, respectively), indicating outstanding sensitivity and specificity for both flowering and non-flowering classes. Classical CNNs (VGG16, ResNet50, and ConvNeXt Tiny) demonstrated slightly superior robustness compared to transformer-based models, though all architectures maintained high generalization and minimal variance across folds. The integrated quality assessment and classification pipeline enables scalable, high-throughput monitoring of flowering phases in natural environments. The proposed methodology is adaptable to other plant species and locations, supporting future ecological monitoring and climate studies. Our key contributions are as follows: (i) introducing an automated exposure-quality filtering stage for field imagery; (ii) publishing a curated, season-long dataset of Tilia cordata images; and (iii) providing the first systematic cross-validated benchmark that contrasts classical CNNs with transformer architectures for phenological phase recognition. Full article
(This article belongs to the Special Issue Application of UAV and Sensing in Precision Agriculture)
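The automated quality-filtering stage pairs hand-crafted exposure/sharpness descriptors with an XGBoost classifier; a hedged sketch of that pattern follows, with the specific features (Laplacian-variance sharpness, brightness, contrast, clipping fraction) chosen for illustration rather than taken from the paper.

```python
# Simple exposure/sharpness descriptors per image feed a binary "usable vs. discard" XGBoost filter.
import cv2
import numpy as np
from xgboost import XGBClassifier

def quality_features(bgr: np.ndarray) -> np.ndarray:
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()       # blur indicator
    brightness, contrast = gray.mean(), gray.std()          # exposure indicators
    clipped = (gray < 5).mean() + (gray > 250).mean()       # fraction of under/over-exposed pixels
    return np.array([sharpness, brightness, contrast, clipped])

def fit_quality_filter(images: list, labels: list) -> XGBClassifier:
    X = np.stack([quality_features(im) for im in images])
    return XGBClassifier(n_estimators=200, max_depth=3).fit(X, labels)

# Only images the filter accepts are passed to the CNN/transformer flowering-phase classifiers.
```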
33 pages, 8494 KB  
Article
Enhanced Multi-Class Brain Tumor Classification in MRI Using Pre-Trained CNNs and Transformer Architectures
by Marco Antonio Gómez-Guzmán, Laura Jiménez-Beristain, Enrique Efren García-Guerrero, Oscar Adrian Aguirre-Castro, José Jaime Esqueda-Elizondo, Edgar Rene Ramos-Acosta, Gilberto Manuel Galindo-Aldana, Cynthia Torres-Gonzalez and Everardo Inzunza-Gonzalez
Technologies 2025, 13(9), 379; https://doi.org/10.3390/technologies13090379 - 22 Aug 2025
Viewed by 1102
Abstract
Early and accurate identification of brain tumors is essential for determining effective treatment strategies and improving patient outcomes. Artificial intelligence (AI) and deep learning (DL) techniques have shown promise in automating diagnostic tasks based on magnetic resonance imaging (MRI). This study evaluates the performance of four pre-trained deep convolutional neural network (CNN) architectures for the automatic multi-class classification of brain tumors into four categories: Glioma, Meningioma, Pituitary, and No Tumor. The proposed approach utilizes the publicly accessible Brain Tumor MRI Msoud dataset, consisting of 7023 images, with 5712 provided for training and 1311 for testing. To assess the impact of data availability, subsets containing 25%, 50%, 75%, and 100% of the training data were used. A stratified five-fold cross-validation technique was applied. The CNN architectures evaluated include DeiT3_base_patch16_224, Xception41, Inception_v4, and Swin_Tiny_Patch4_Window7_224, all fine-tuned using transfer learning. The training pipeline incorporated advanced preprocessing and image data augmentation techniques to enhance robustness and mitigate overfitting. Among the models tested, Swin_Tiny_Patch4_Window7_224 achieved the highest classification Accuracy of 99.24% on the test set using 75% of the training data. This model demonstrated superior generalization across all tumor classes and effectively addressed class imbalance issues. Furthermore, we deployed and benchmarked the best-performing DL model on embedded AI platforms (Jetson AGX Xavier and Orin Nano), demonstrating their capability for real-time inference and highlighting their feasibility for edge-based clinical deployment. The results highlight the strong potential of pre-trained deep CNN and transformer-based architectures in medical image analysis. The proposed approach provides a scalable and energy-efficient solution for automated brain tumor diagnosis, facilitating the integration of AI into clinical workflows. Full article
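The evaluation protocol (stratified five-fold cross-validation over 25/50/75/100% training subsets) can be sketched with scikit-learn as below; `train_model` and `score_model` stand in for fine-tuning and evaluating the pre-trained backbones and are placeholders, not the authors' code.

```python
# Stratified subsampling of the training data followed by stratified five-fold cross-validation.
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

def cross_validate_on_fraction(X, y, fraction, train_model, score_model, seed=42):
    if fraction < 1.0:   # stratified subsample preserves the class balance of the full training set
        X, _, y, _ = train_test_split(X, y, train_size=fraction, stratify=y, random_state=seed)
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        model = train_model(X[train_idx], y[train_idx])
        scores.append(score_model(model, X[val_idx], y[val_idx]))
    return float(np.mean(scores)), float(np.std(scores))
```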
25 pages, 4241 KB  
Article
Deep Learning for Comprehensive Analysis of Retinal Fundus Images: Detection of Systemic and Ocular Conditions
by Mohammad Mahdi Aghabeigi Alooghareh, Mohammad Mohsen Sheikhey, Ali Sahafi, Habibollah Pirnejad and Amin Naemi
Bioengineering 2025, 12(8), 840; https://doi.org/10.3390/bioengineering12080840 - 3 Aug 2025
Viewed by 1966
Abstract
The retina offers a unique window into both ocular and systemic health, motivating the development of AI-based tools for disease screening and risk assessment. In this study, we present a comprehensive evaluation of six state-of-the-art deep neural networks, including convolutional neural networks and vision transformer architectures, on the Brazilian Multilabel Ophthalmological Dataset (BRSET), comprising 16,266 fundus images annotated for multiple clinical and demographic labels. We explored seven classification tasks: Diabetes, Diabetic Retinopathy (2-class), Diabetic Retinopathy (3-class), Hypertension, Hypertensive Retinopathy, Drusen, and Sex classification. Models were evaluated using precision, recall, F1-score, accuracy, and AUC. Among all models, the Swin-L generally delivered the best performance across scenarios for Diabetes (AUC = 0.88, weighted F1-score = 0.86), Diabetic Retinopathy (2-class) (AUC = 0.98, weighted F1-score = 0.95), Diabetic Retinopathy (3-class) (macro AUC = 0.98, weighted F1-score = 0.95), Hypertension (AUC = 0.85, weighted F1-score = 0.79), Hypertensive Retinopathy (AUC = 0.81, weighted F1-score = 0.97), Drusen detection (AUC = 0.93, weighted F1-score = 0.90), and Sex classification (AUC = 0.87, weighted F1-score = 0.80). These results reflect excellent to outstanding diagnostic performance. We also employed gradient-based saliency maps to enhance explainability and visualize decision-relevant retinal features. Our findings underscore the potential of deep learning, particularly vision transformer models, to deliver accurate, interpretable, and clinically meaningful screening tools for retinal and systemic disease detection. Full article
(This article belongs to the Special Issue Machine Learning in Chronic Diseases)
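The gradient-based saliency maps mentioned above follow a standard recipe: backpropagate the top-class score to the input and visualize the per-pixel gradient magnitude. A minimal, model-agnostic PyTorch sketch (not the authors' implementation) is given below.

```python
# Vanilla gradient saliency: gradient of the top class score w.r.t. the input image.
import torch

def saliency_map(model: torch.nn.Module, image: torch.Tensor) -> torch.Tensor:
    """image: (1, 3, H, W). Returns an (H, W) map of max-abs input gradients."""
    model.eval()
    x = image.clone().requires_grad_(True)
    score = model(x).max(dim=1).values.sum()   # logit of the predicted class
    score.backward()
    return x.grad.abs().max(dim=1).values[0]   # collapse the channel dimension -> (H, W)
```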