Search Results (638)

Search Parameters:
Keywords = SwinTransformer

27 pages, 3948 KB  
Article
Fully Automated Segmentation of Cervical Spinal Cord in Sagittal MR Images Using Swin-Unet Architectures
by Rukiye Polattimur, Emre Dandıl, Mehmet Süleyman Yıldırım and Utku Şenol
J. Clin. Med. 2025, 14(19), 6994; https://doi.org/10.3390/jcm14196994 - 2 Oct 2025
Abstract
Background/Objectives: The spinal cord is a critical component of the central nervous system that transmits neural signals between the brain and the body’s peripheral regions through its nerve roots. Despite being partially protected by the vertebral column, the spinal cord remains highly vulnerable to trauma, tumors, infections, and degenerative or inflammatory disorders. These conditions can disrupt neural conduction, resulting in severe functional impairments, such as paralysis, motor deficits, and sensory loss. Therefore, accurate and comprehensive spinal cord segmentation is essential for characterizing its structural features and evaluating neural integrity. Methods: In this study, we propose a fully automated method for segmentation of the cervical spinal cord in sagittal magnetic resonance (MR) images. This method facilitates rapid clinical evaluation and supports early diagnosis. Our approach uses a Swin-Unet architecture, which integrates vision transformer blocks into the U-Net framework. This enables the model to capture both local anatomical details and global contextual information. This design improves the delineation of the thin, curved, low-contrast cervical cord, resulting in more precise and robust segmentation. Results: In experimental studies, the proposed Swin-Unet model (SWU1), which uses transformer blocks in the encoder layer, achieved Dice Similarity Coefficient (DSC) and Hausdorff Distance 95 (HD95) scores of 0.9526 and 1.0707 mm, respectively, for cervical spinal cord segmentation. These results confirm that the model can consistently deliver precise, pixel-level delineations that are structurally accurate, which supports its reliability for clinical assessment. Conclusions: The attention-enhanced Swin-Unet architecture demonstrated high accuracy in segmenting thin and complex anatomical structures, such as the cervical spinal cord. Its ability to generalize with limited data highlights its potential for integration into clinical workflows to support diagnosis, monitoring, and treatment planning.
(This article belongs to the Special Issue Artificial Intelligence and Deep Learning in Medical Imaging)
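
Note: the two metrics reported above (DSC and HD95) have standard definitions; a minimal NumPy/SciPy sketch for binary masks is shown below for reference. This is illustrative only, not the authors' evaluation code, and the `spacing` handling is an assumption.

```python
import numpy as np
from scipy.ndimage import binary_erosion

def dice_coefficient(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice Similarity Coefficient between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * intersection / denom if denom > 0 else 1.0

def hd95(pred: np.ndarray, gt: np.ndarray, spacing=(1.0, 1.0)) -> float:
    """95th-percentile symmetric Hausdorff distance between mask boundaries (in mm if spacing is in mm)."""
    def surface_points(mask):
        boundary = mask & ~binary_erosion(mask)          # boundary pixels of the mask
        return np.argwhere(boundary) * np.asarray(spacing)

    p, g = surface_points(pred.astype(bool)), surface_points(gt.astype(bool))
    d = np.linalg.norm(p[:, None, :] - g[None, :, :], axis=-1)   # pairwise boundary distances
    return float(max(np.percentile(d.min(axis=1), 95), np.percentile(d.min(axis=0), 95)))
```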

22 pages, 782 KB  
Article
Hybrid CNN-Swin Transformer Model to Advance the Diagnosis of Maxillary Sinus Abnormalities on CT Images Using Explainable AI
by Mohammad Alhumaid and Ayman G. Fayoumi
Computers 2025, 14(10), 419; https://doi.org/10.3390/computers14100419 - 2 Oct 2025
Abstract
Accurate diagnosis of sinusitis is essential due to its widespread prevalence and its considerable impact on patient quality of life. While multiple imaging techniques are available for detecting maxillary sinus abnormalities, computed tomography (CT) remains the preferred modality because of its high sensitivity and spatial resolution. Although recent advances in deep learning have led to the development of automated methods for sinusitis classification, many existing models perform poorly in the presence of complex pathological features and offer limited interpretability, which hinders their integration into clinical workflows. In this study, we propose a hybrid deep learning framework that combines EfficientNetB0, a convolutional neural network, with the Swin Transformer, a vision transformer, to improve feature representation. An attention-based fusion module is used to integrate both local and global information, thereby enhancing diagnostic accuracy. To improve transparency and support clinical adoption, the model incorporates explainable artificial intelligence (XAI) techniques using Gradient-weighted Class Activation Mapping (Grad-CAM). This allows for visualization of the regions influencing the model’s predictions, helping radiologists assess the clinical relevance of the results. We evaluate the proposed method on a curated maxillary sinus CT dataset covering four diagnostic categories: Normal, Opacified, Polyposis, and Retention Cysts. The model achieves a classification accuracy of 95.83%, with precision, recall, and F1 score all at 95%. Grad-CAM visualizations indicate that the model consistently focuses on clinically significant regions of the sinus anatomy, supporting its potential utility as a reliable diagnostic aid in medical practice.
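
Note: the abstract describes an attention-based module that fuses EfficientNetB0 (local) and Swin Transformer (global) features. A minimal PyTorch sketch of one plausible gated-attention fusion of two pooled feature vectors follows; the feature dimensions (1280 and 768) and the fusion design are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse a CNN feature vector and a transformer feature vector with learned attention weights."""
    def __init__(self, cnn_dim=1280, vit_dim=768, fused_dim=512, num_classes=4):
        super().__init__()
        self.proj_cnn = nn.Linear(cnn_dim, fused_dim)    # project local (CNN) features
        self.proj_vit = nn.Linear(vit_dim, fused_dim)    # project global (transformer) features
        self.attn = nn.Sequential(nn.Linear(2 * fused_dim, 2), nn.Softmax(dim=-1))
        self.head = nn.Linear(fused_dim, num_classes)    # four sinus categories in the paper

    def forward(self, f_cnn, f_vit):
        a, b = self.proj_cnn(f_cnn), self.proj_vit(f_vit)
        w = self.attn(torch.cat([a, b], dim=-1))         # per-sample weights over the two branches
        fused = w[:, :1] * a + w[:, 1:] * b
        return self.head(fused)

# Usage with dummy pooled backbone outputs
logits = AttentionFusion()(torch.randn(2, 1280), torch.randn(2, 768))
```
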
23 pages, 18084 KB  
Article
WetSegNet: An Edge-Guided Multi-Scale Feature Interaction Network for Wetland Classification
by Li Chen, Shaogang Xia, Xun Liu, Zhan Xie, Haohong Chen, Feiyu Long, Yehong Wu and Meng Zhang
Remote Sens. 2025, 17(19), 3330; https://doi.org/10.3390/rs17193330 - 29 Sep 2025
Abstract
Wetlands play a crucial role in climate regulation, pollutant filtration, and biodiversity conservation. Accurate wetland classification through high-resolution remote sensing imagery is pivotal for the scientific management, ecological monitoring, and sustainable development of these ecosystems. However, the intricate spatial details in such imagery pose significant challenges to conventional interpretation techniques, necessitating precise boundary extraction and multi-scale contextual modeling. In this study, we propose WetSegNet, an edge-guided Multi-Scale Feature Interaction network for wetland classification, which integrates a convolutional neural network (CNN) and Swin Transformer within a U-Net architecture to synergize local texture perception and global semantic comprehension. Specifically, the framework incorporates two novel components: (1) a Multi-Scale Feature Interaction (MFI) module employing cross-attention mechanisms to mitigate semantic discrepancies between encoder–decoder features, and (2) a Multi-Feature Fusion (MFF) module that hierarchically enhances boundary delineation through edge-guided spatial attention (EGA). Experimental validation on GF-2 satellite imagery of Dongting Lake wetlands demonstrates that WetSegNet achieves state-of-the-art performance, with an overall accuracy (OA) of 90.81% and a Kappa coefficient of 0.88. Notably, it achieves classification accuracies exceeding 90% for water, sedge, and reed habitats, surpassing the baseline U-Net by 3.3% in overall accuracy and 0.05 in Kappa. The proposed model effectively addresses heterogeneous wetland classification challenges, validating its capability to reconcile local–global feature representation.
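
Note: the MFI module is described as using cross-attention to reconcile encoder and decoder features. A generic cross-attention sketch with PyTorch's built-in nn.MultiheadAttention is shown below; the tensor shapes and the choice of query/key roles are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Decoder tokens attend to encoder tokens to pull in complementary context."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, dec_tokens, enc_tokens):
        # Query: decoder features; Key/Value: encoder features
        attended, _ = self.attn(dec_tokens, enc_tokens, enc_tokens)
        return self.norm(dec_tokens + attended)   # residual connection

# Example: 16x16 feature maps flattened to 256 tokens of width 256
fused = CrossAttentionFusion()(torch.randn(1, 256, 256), torch.randn(1, 256, 256))
```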

26 pages, 7399 KB  
Article
ECL-ConvNeXt: An Ensemble Strategy Combining ConvNeXt and Contrastive Learning for Facial Beauty Prediction
by Junying Gan, Wenchao Xu, Hantian Chen, Zhen Chen, Zhenxin Zhuang and Huicong Li
Electronics 2025, 14(19), 3777; https://doi.org/10.3390/electronics14193777 - 24 Sep 2025
Abstract
Facial beauty prediction (FBP) is a cutting-edge topic in deep learning, aiming to endow computers with human-like esthetic judgment capabilities. Current facial beauty datasets are characterized by multi-class classification and imbalanced sample distributions. Most FBP methods focus on improving accuracy (ACC) as their primary goal, aiming to indirectly optimize other metrics. In contrast to ACC, which is well known to be a poor metric in cases of highly imbalanced datasets, the recall measures the proportion of correctly identified samples for each class, effectively evaluating classification performance across all classes without being affected by sample imbalances, thereby providing a fairer assessment of minority class performance. Therefore, targeting recall improvement facilitates balanced classification across all classes. The Macro Recall (MR), which averages the recall of all the classes, serves as a comprehensive metric for evaluating a model’s performance. Among numerous classic models, ConvNeXt, which integrates the designs of the Swin Transformer and ResNet, performs exceptionally well regarding its MR but still suffers from inter-class confusion in certain categories. To address this issue, this paper introduces contrastive learning (CL) to enhance the class separability by optimizing feature representations and reducing confusion. However, directly applying CL to all the classes may degrade the performance for high-recall categories. To this end, we propose using an ensemble strategy, ECL-ConvNeXt: First, ConvNeXt is used for multi-class prediction on the whole of dataset A to identify the most confused class pairs. Second, samples predicted to belong to these class pairs are extracted from the multi-class results to form dataset B. Third, true samples of these class pairs are extracted from dataset A to form dataset C, and CL is applied to improve their separability, training a dedicated auxiliary binary classifier (ConvNeXtCL-ABC) based on ConvNeXt. Subsequently, ConvNeXtCL-ABC is used to reclassify dataset B. Finally, the predictions of ConvNeXtCL-ABC replace the corresponding class predictions of ConvNeXt, while preserving the high recall performance for the other classes. The experimental results demonstrate that ECL-ConvNeXt significantly improves the classification performance for confused class pairs while maintaining strong performance for high-recall classes. On the LSAFBD dataset, it achieves 72.09% ACC and 75.43% MR; on the MEBeauty dataset, 73.23% ACC and 67.50% MR; on the HotOrNot dataset, 62.62% ACC and 49.29% MR. The approach is also generalizable to other multi-class imbalanced data scenarios.
(This article belongs to the Special Issue Applications of Computer Vision, 3rd Edition)
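
Note: the ensemble logic described above (find the most confused class pair, then reroute only those predictions through a dedicated binary classifier) can be summarized in a short sketch. The function names and the binary-classifier interface below are hypothetical placeholders, not the authors' code.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def most_confused_pair(y_true, y_pred):
    """Return the off-diagonal class pair (i, j) with the largest total confusion."""
    cm = confusion_matrix(y_true, y_pred)
    np.fill_diagonal(cm, 0)
    sym = cm + cm.T                        # confusion in either direction
    i, j = np.unravel_index(np.argmax(np.triu(sym, 1)), sym.shape)
    return int(i), int(j)

def ensemble_predict(y_pred, samples, pair, binary_clf):
    """Reclassify samples predicted as either confused class with an auxiliary binary classifier."""
    i, j = pair
    y_out = y_pred.copy()
    mask = np.isin(y_pred, [i, j])         # "dataset B" in the paper's terminology
    if mask.any():
        binary = binary_clf.predict(samples[mask])     # hypothetical 0/1 output
        y_out[mask] = np.where(binary == 0, i, j)
    return y_out
```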

27 pages, 5776 KB  
Article
R-SWTNet: A Context-Aware U-Net-Based Framework for Segmenting Rural Roads and Alleys in China with the SQVillages Dataset
by Jianing Wu, Junqi Yang, Xiaoyu Xu, Ying Zeng, Yan Cheng, Xiaodong Liu and Hong Zhang
Land 2025, 14(10), 1930; https://doi.org/10.3390/land14101930 - 23 Sep 2025
Abstract
Rural road networks are vital for rural development, yet narrow alleys and occluded segments remain underrepresented in digital maps due to irregular morphology, spectral ambiguity, and limited model generalization. Traditional segmentation models struggle to balance local detail preservation and long-range dependency modeling, prioritizing either local features or global context alone. Hypothesizing that integrating hierarchical local features and global context will mitigate these limitations, this study aims to accurately segment such rural roads by proposing R-SWTNet, a context-aware U-Net-based framework, and constructing the SQVillages dataset. R-SWTNet integrates ResNet34 for hierarchical feature extraction, Swin Transformer for long-range dependency modeling, ASPP for multi-scale context fusion, and CAM-Residual blocks for channel-wise attention. The SQVillages dataset, built from multi-source remote sensing imagery, includes 18 diverse villages with adaptive augmentation to mitigate class imbalance. Experimental results show R-SWTNet achieves a validation IoU of 54.88% and F1-score of 70.87%, outperforming U-Net and Swin-UNet, and with less overfitting than R-Net and D-LinkNet. Its lightweight variant supports edge deployment, enabling on-site road management. This work provides a data-driven tool for infrastructure planning under China’s Rural Revitalization Strategy, with potential scalability to global unstructured rural road scenes.
(This article belongs to the Section Land Innovations – Data and Machine Learning)
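
Note: among the components listed above, ASPP (atrous spatial pyramid pooling) has a standard form; a compact PyTorch version is sketched below for reference. The dilation rates and channel widths are illustrative, not the values used in R-SWTNet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Parallel dilated convolutions plus an image-level pooling branch, fused by a 1x1 conv."""
    def __init__(self, in_ch=512, out_ch=256, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3 if r > 1 else 1, padding=r if r > 1 else 0, dilation=r)
            for r in rates
        ])
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))
        self.fuse = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        feats.append(F.interpolate(self.pool(x), size=(h, w), mode="bilinear", align_corners=False))
        return self.fuse(torch.cat(feats, dim=1))

out = ASPP()(torch.randn(1, 512, 32, 32))   # -> (1, 256, 32, 32)
```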

18 pages, 1694 KB  
Article
FAIR-Net: A Fuzzy Autoencoder and Interpretable Rule-Based Network for Ancient Chinese Character Recognition
by Yanling Ge, Yunmeng Zhang and Seok-Beom Roh
Sensors 2025, 25(18), 5928; https://doi.org/10.3390/s25185928 - 22 Sep 2025
Abstract
Ancient Chinese scripts—including oracle bone carvings, bronze inscriptions, stone steles, Dunhuang scrolls, and bamboo slips—are rich in historical value but often degraded due to centuries of erosion, damage, and stylistic variability. These issues severely hinder manual transcription and render conventional OCR techniques inadequate, as they are typically trained on modern printed or handwritten text and lack interpretability. To tackle these challenges, we propose FAIR-Net, a hybrid architecture that combines the unsupervised feature learning capacity of a deep autoencoder with the semantic transparency of a fuzzy rule-based classifier. In FAIR-Net, the deep autoencoder first compresses high-resolution character images into low-dimensional, noise-robust embeddings. These embeddings are then passed into a Fuzzy Neural Network (FNN), whose hidden layer leverages Fuzzy C-Means (FCM) clustering to model soft membership degrees and generate human-readable fuzzy rules. The output layer uses Iteratively Reweighted Least Squares Estimation (IRLSE) combined with a Softmax function to produce probabilistic predictions, with all weights constrained as linear mappings to maintain model transparency. We evaluate FAIR-Net on CASIA-HWDB1.0, HWDB1.1, and ICDAR 2013 CompetitionDB, where it achieves a recognition accuracy of 97.91%, significantly outperforming baseline CNNs (p < 0.01, Cohen’s d > 0.8) while maintaining the tightest confidence interval (96.88–98.94%) and lowest standard deviation (±1.03%). Additionally, FAIR-Net reduces inference time to 25 s, improving processing efficiency by 41.9% over AlexNet and up to 98.9% over CNN-Fujitsu, while preserving >97.5% accuracy across evaluations. To further assess generalization to historical scripts, FAIR-Net was tested on the Ancient Chinese Character Dataset (9233 classes; 979,907 images), achieving 83.25% accuracy—slightly higher than ResNet101 but 2.49% lower than SwinT-v2-small—while reducing training time by over 5.5× compared to transformer-based baselines. Fuzzy rule visualization confirms enhanced robustness to glyph ambiguities and erosion. Overall, FAIR-Net provides a practical, interpretable, and highly efficient solution for the digitization and preservation of ancient Chinese character corpora.
(This article belongs to the Section Sensing and Imaging)
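
Note: the fuzzy rule layer rests on Fuzzy C-Means memberships; the standard membership formula, u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1)), is sketched below in NumPy. The fuzzifier m and the random data are illustrative, not FAIR-Net's settings.

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0, eps=1e-9):
    """Soft membership of each sample to each cluster center (rows sum to 1)."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1) + eps   # (n_samples, n_clusters)
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))             # (d_ij / d_ik)^(2/(m-1))
    return 1.0 / ratio.sum(axis=2)

U = fcm_memberships(np.random.rand(5, 16), np.random.rand(3, 16))
print(U.sum(axis=1))   # each value is ~1.0
```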

22 pages, 5746 KB  
Article
AGSK-Net: Adaptive Geometry-Aware Stereo-KANformer Network for Global and Local Unsupervised Stereo Matching
by Qianglong Feng, Xiaofeng Wang, Zhenglin Lu, Haiyu Wang, Tingfeng Qi and Tianyi Zhang
Sensors 2025, 25(18), 5905; https://doi.org/10.3390/s25185905 - 21 Sep 2025
Abstract
The performance of unsupervised stereo matching in complex regions such as weak textures and occlusions is constrained by the inherently local receptive fields of convolutional neural networks (CNNs), the absence of geometric priors, and the limited expressiveness of the MLP in conventional ViTs. To address these problems, we propose an Adaptive Geometry-aware Stereo-KANformer Network (AGSK-Net) for unsupervised stereo matching. Firstly, to resolve the conflict between the isotropic nature of traditional ViT and the epipolar geometry priors in stereo matching, we propose Adaptive Geometry-aware Multi-head Self-Attention (AG-MSA), which embeds epipolar priors via an adaptive hybrid structure of geometric modulation and penalty, enabling geometry-aware global context modeling. Secondly, we design Spatial Group-Rational KAN (SGR-KAN), which integrates the nonlinear capability of rational functions with the spatial awareness of deep convolutions, replacing the MLP with flexible, learnable rational functions to enhance the nonlinear expression ability of complex regions. Finally, we propose a Dynamic Candidate Gated Fusion (DCGF) module that employs dynamic dual-candidate states and spatially aware pre-enhancement to adaptively fuse global and local features across scales. Experiments demonstrate that AGSK-Net achieves state-of-the-art accuracy and generalizability on Scene Flow, KITTI 2012/2015, and Middlebury 2021.
(This article belongs to the Special Issue Deep Learning Technology and Image Sensing: 2nd Edition)

23 pages, 3623 KB  
Article
WSC-Net: A Wavelet-Enhanced Swin Transformer with Cross-Domain Attention for Hyperspectral Image Classification
by Zhen Yang, Huihui Li, Feiming Wei, Jin Ma and Tao Zhang
Remote Sens. 2025, 17(18), 3216; https://doi.org/10.3390/rs17183216 - 17 Sep 2025
Abstract
This paper introduces the Wavelet-Enhanced Swin Transformer Network (WSC-Net), a novel dual-branch architecture that resolves the inherent tradeoff between global spatial context and fine-grained spectral details in hyperspectral image (HSI) classification. While transformer-based models excel at capturing long-range dependencies, their patch-based nature often overlooks intra-patch high-frequency details, hindering the discrimination of spectrally similar classes. Our framework synergistically couples a two-stage Swin Transformer with a parallel Wavelet Transform Module (WTM) for local frequency information capture. To address the semantic gap between spatial and frequency domains, we propose the Cross-Domain Attention Fusion (CDAF) module, a bi-directional attention mechanism that facilitates intelligent feature exchange between the two streams. CDAF explicitly models cross-domain dependencies, amplifies complementary features, and suppresses noise through attention-guided integration. Extensive experiments on four benchmark datasets demonstrate that WSC-Net consistently outperforms state-of-the-art methods, confirming its effectiveness in balancing global contextual modeling with local detail preservation.
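
Note: the Wavelet Transform Module is described as capturing intra-patch high-frequency detail; the kind of sub-band decomposition involved can be reproduced with PyWavelets as below. The 'haar' wavelet and single-level transform are assumptions for illustration, not the paper's configuration.

```python
import numpy as np
import pywt

# Single-level 2D discrete wavelet transform of one band of a hyperspectral patch
patch = np.random.rand(64, 64).astype(np.float32)
cA, (cH, cV, cD) = pywt.dwt2(patch, "haar")

# cA: low-frequency approximation; cH/cV/cD: horizontal/vertical/diagonal high-frequency detail
high_freq = np.stack([cH, cV, cD])        # candidate input for a frequency-domain branch
print(cA.shape, high_freq.shape)          # (32, 32) (3, 32, 32)
```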

26 pages, 1061 KB  
Article
EEViT: Efficient Enhanced Vision Transformer Architectures with Information Propagation and Improved Inductive Bias
by Rigel Mahmood, Sarosh Patel and Khaled Elleithy
AI 2025, 6(9), 233; https://doi.org/10.3390/ai6090233 - 17 Sep 2025
Abstract
The Transformer architecture has been the foundational cornerstone of the recent AI revolution, serving as the backbone of Large Language Models, which have demonstrated impressive language understanding and reasoning capabilities. When pretrained on large amounts of data, Transformers have also been shown to be highly effective in image classification via the advent of the Vision Transformer. However, they still lag in vision application performance compared to Convolutional Neural Networks (CNNs), which offer translational invariance, whereas Transformers lack inductive bias. Further, the Transformer relies on the attention mechanism, which, despite increasing the receptive field, makes it computationally inefficient due to its quadratic time complexity. In this paper, we enhance the Transformer architecture, focusing on these two shortcomings. We propose two efficient Vision Transformer architectures that significantly reduce the computational complexity without sacrificing classification performance. Our first enhanced architecture is the EEViT-PAR, which combines features from two recently proposed designs, PerceiverAR and CaiT. This enhancement leads to our second architecture, EEViT-IP, which provides implicit windowing capabilities akin to the Swin Transformer and implicitly improves the inductive bias, while being extremely memory and computationally efficient. We perform detailed experiments on multiple image datasets to show the effectiveness of our architectures. Our best-performing EEViT outperforms existing SOTA ViT models in terms of execution efficiency and surpasses or provides competitive classification accuracy on different benchmarks.

25 pages, 4520 KB  
Article
A Multimodal Fake News Detection Model Based on Bidirectional Semantic Enhancement and Adversarial Network Under Web3.0
by Ying Xing, Changhe Zhai, Zhanbin Che, Heng Pan, Kunyang Li, Bowei Zhang, Zhongyuan Yao and Xueming Si
Electronics 2025, 14(18), 3652; https://doi.org/10.3390/electronics14183652 - 15 Sep 2025
Abstract
Web3.0 aims to foster a trustworthy environment enabling user trust and content verifiability. However, the proliferation of fake news undermines this trust and disrupts social ecosystems, making the effective alignment of visual-textual semantics and accurate content verification a pivotal challenge. Existing methods still struggle with deep cross-modal interaction and the adaptive calibration of discrepancies. To address this, we introduce the Bidirectional Semantic Enhancement and Adversarial Network (BSEAN). BSEAN first extracts features using large pre-trained models: a hybrid encoder for text and the Swin Transformer for images. It then employs a Bidirectional Modality Mapping Network, governed by cycle consistency, to achieve preliminary semantic alignment. Building on this, a Semantic Enhancement and Calibration Network explores inter-modal dependencies and quantifies semantic deviations to enhance discriminative capability. Finally, a Dual Adversarial Learning framework bolsters event generalization and representation consistency through adversarial training with event and modality discriminators. Experiments on public Weibo and Twitter datasets validate BSEAN’s superior performance across all metrics, demonstrating its efficacy in tackling the complex challenges of deep cross-modal interaction and dynamic modality calibration within Web3.0 social networks.
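
Note: the Bidirectional Modality Mapping Network is said to be governed by cycle consistency; a bare-bones version of such a constraint over pooled text and image embeddings is sketched below. The linear mappers, feature sizes, and L1 loss are simplifying assumptions, not the paper's networks.

```python
import torch
import torch.nn as nn

text2img = nn.Linear(768, 1024)   # map text embeddings into the image feature space
img2text = nn.Linear(1024, 768)   # map image embeddings into the text feature space
l1 = nn.L1Loss()

t = torch.randn(8, 768)           # pooled text features (e.g., from a hybrid text encoder)
v = torch.randn(8, 1024)          # pooled image features (e.g., from a Swin Transformer)

# Mapping a feature across modalities and back should reconstruct the original feature
cycle_loss = l1(img2text(text2img(t)), t) + l1(text2img(img2text(v)), v)
cycle_loss.backward()
```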

26 pages, 3973 KB  
Article
ViT-DCNN: Vision Transformer with Deformable CNN Model for Lung and Colon Cancer Detection
by Aditya Pal, Hari Mohan Rai, Joon Yoo, Sang-Ryong Lee and Yooheon Park
Cancers 2025, 17(18), 3005; https://doi.org/10.3390/cancers17183005 - 15 Sep 2025
Abstract
Background/Objectives: Lung and colon cancers remain among the most prevalent and fatal diseases worldwide, and their early detection is a serious challenge. The data used in this study was obtained from the Lung and Colon Cancer Histopathological Images Dataset, which comprises five different classes of image data, namely colon adenocarcinoma, colon normal, lung adenocarcinoma, lung normal, and lung squamous cell carcinoma, split into training (80%), validation (10%), and test (10%) subsets. In this study, we propose the ViT-DCNN (Vision Transformer with Deformable CNN) model, with the aim of improving cancer detection and classification using medical images. Methods: The combination of the ViT’s self-attention capabilities with deformable convolutions allows for improved feature extraction, while also enabling the model to learn both holistic contextual information and fine-grained localized spatial details. Results: On the test set, the model performed remarkably well, with an accuracy of 94.24%, an F1 score of 94.23%, recall of 94.24%, and precision of 94.37%, confirming its robustness in detecting cancerous tissues. Furthermore, our proposed ViT-DCNN model outperforms several state-of-the-art models, including ResNet-152, EfficientNet-B7, SwinTransformer, DenseNet-201, ConvNext, TransUNet, CNN-LSTM, MobileNetV3, and NASNet-A, across all major performance metrics. Conclusions: By using deep learning and advanced image analysis, this model enhances the efficiency of cancer detection, thus representing a valuable tool for radiologists and clinicians. This study demonstrates that the proposed ViT-DCNN model can reduce diagnostic inaccuracies and improve detection efficiency. Future work will focus on dataset enrichment and enhancing the model’s interpretability to evaluate its clinical applicability. This paper demonstrates the promise of artificial-intelligence-driven diagnostic models in transforming lung and colon cancer detection and improving patient diagnosis.
(This article belongs to the Special Issue Image Analysis and Machine Learning in Cancers: 2nd Edition)
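
Note: torchvision ships a deformable convolution operator, so the kind of local branch the abstract describes can be sketched as follows. The offset-predictor design and channel sizes are illustrative assumptions, not the ViT-DCNN configuration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    """3x3 deformable convolution whose sampling offsets are predicted from the input itself."""
    def __init__(self, in_ch=3, out_ch=64, k=3):
        super().__init__()
        self.offset = nn.Conv2d(in_ch, 2 * k * k, k, padding=1)   # (dx, dy) per kernel position
        self.deform = DeformConv2d(in_ch, out_ch, k, padding=1)

    def forward(self, x):
        return self.deform(x, self.offset(x))

# Produces a (1, 64, 224, 224) feature map that could feed downstream attention layers
feat = DeformBlock()(torch.randn(1, 3, 224, 224))
```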

32 pages, 6397 KB  
Article
Enhancing YOLO-Based SAR Ship Detection with Attention Mechanisms
by Ranyeri do Lago Rocha and Felipe A. P. de Figueiredo
Remote Sens. 2025, 17(18), 3170; https://doi.org/10.3390/rs17183170 - 12 Sep 2025
Abstract
This study enhances Synthetic Aperture Radar (SAR) ship detection by integrating attention mechanisms, namely Bi-Level Routing Attention (BRA), the Swin Transformer, and a Convolutional Block Attention Module (CBAM), into state-of-the-art YOLO architectures (YOLOv11 and YOLOv12). Addressing challenges like small ship sizes and complex maritime backgrounds in SAR imagery, we systematically evaluate the impact of adding and replacing attention layers at strategic positions within the models. Experiments reveal that replacing the original attention layer at position 4 (C3k2 module) with the CBAM in YOLOv12 achieves optimal performance, attaining an mAP@0.5 of 98.0% on the SAR Ship Dataset (SSD), surpassing baseline YOLOv12 (97.8%) and prior works. The optimized CBAM-enhanced YOLOv12 also reduces computational costs (5.9 GFLOPS vs. 6.5 GFLOPS in the baseline). Cross-dataset validation on the SAR Ship Detection Dataset (SSDD) confirms consistent improvements, underscoring the efficacy of targeted attention-layer replacement for SAR-specific challenges. Additionally, tests on the SADD and MSAR datasets demonstrate that this optimization generalizes beyond ship detection, yielding gains in aircraft detection and multi-class SAR object recognition. This work establishes a robust framework for efficient, high-precision maritime surveillance using deep learning.
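
Note: CBAM itself is a published, well-documented block (channel attention from pooled descriptors followed by a 7x7 spatial attention map); a compact PyTorch rendition is given below for context, independent of the specific YOLOv12 integration point studied in the paper.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention followed by spatial attention."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        # Channel attention: shared MLP over global average- and max-pooled descriptors
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention: 7x7 conv over channel-wise average and max maps
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

out = CBAM(256)(torch.randn(1, 256, 40, 40))
```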

22 pages, 3585 KB  
Article
A Novel 3D U-Net–Vision Transformer Hybrid with Multi-Scale Fusion for Precision Multimodal Brain Tumor Segmentation in 3D MRI
by Fathia Ghribi and Fayçal Hamdaoui
Electronics 2025, 14(18), 3604; https://doi.org/10.3390/electronics14183604 - 11 Sep 2025
Abstract
In recent years, segmentation for medical applications using Magnetic Resonance Imaging (MRI) has received increasing attention. Working in this field has emerged as an ambitious task and a major challenge for researchers; particularly, brain tumor segmentation from MRI is a crucial task for accurate diagnosis, treatment planning, and patient monitoring. With the rapid development of deep learning methods, significant improvements have been made in medical image segmentation. Convolutional Neural Networks (CNNs), such as U-Net, have shown excellent performance in capturing local spatial features. However, these models cannot explicitly capture long-range dependencies. Therefore, Vision Transformers have recently emerged as an alternative segmentation method, as they can exploit long-range correlations through the self-attention mechanism (MSA). Despite their effectiveness, ViTs require large annotated datasets and may compromise fine-grained spatial details. To address these problems, we propose a novel hybrid approach for brain tumor segmentation that combines a 3D U-Net with a 3D Vision Transformer (ViT3D), aiming to jointly exploit local feature extraction and global context modeling. Additionally, we developed an effective fusion method that uses upsampling and convolutional refinement to improve multi-scale feature integration. Unlike traditional fusion approaches, our method explicitly refines spatial details while maintaining global dependencies, improving the quality of tumor border delineation. We evaluated our approach on the BraTS 2020 dataset, achieving a global accuracy score of 99.56%, an average Dice similarity coefficient (DSC) of 77.43% (corresponding to the mean across the three tumor subregions), with individual Dice scores of 84.35% for WT, 80.97% for TC, and 66.97% for ET, and an average Intersection over Union (IoU) of 71.69%. These extensive experimental results demonstrate that our model not only localizes tumors with high accuracy and robustness but also outperforms a selection of current state-of-the-art methods, including U-Net, SwinUnet, M-Unet, and others.
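
Note: the described fusion step (upsample transformer features and refine them jointly with CNN features via convolution) can be expressed as a short 3D sketch; the channel counts and single refinement conv are assumptions for illustration, not the paper's fusion method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionRefine3D(nn.Module):
    """Upsample coarse ViT features to the CNN resolution, concatenate, and refine with a 3D conv."""
    def __init__(self, cnn_ch=64, vit_ch=256, out_ch=64):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv3d(cnn_ch + vit_ch, out_ch, 3, padding=1),
            nn.InstanceNorm3d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, f_cnn, f_vit):
        f_vit = F.interpolate(f_vit, size=f_cnn.shape[-3:], mode="trilinear", align_corners=False)
        return self.refine(torch.cat([f_cnn, f_vit], dim=1))

fused = FusionRefine3D()(torch.randn(1, 64, 32, 32, 32), torch.randn(1, 256, 8, 8, 8))
```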

24 pages, 6369 KB  
Article
DeepSwinLite: A Swin Transformer-Based Light Deep Learning Model for Building Extraction Using VHR Aerial Imagery
by Elif Ozlem Yilmaz and Taskin Kavzoglu
Remote Sens. 2025, 17(18), 3146; https://doi.org/10.3390/rs17183146 - 10 Sep 2025
Abstract
Accurate extraction of building features from remotely sensed data is essential for supporting research and applications in urban planning, land management, transportation infrastructure development, and disaster monitoring. Despite the prominence of deep learning as the state-of-the-art (SOTA) methodology for building extraction, substantial challenges remain, largely stemming from the diversity of building structures and the complexity of background features. To mitigate these issues, this study introduces DeepSwinLite, a lightweight architecture based on the Swin Transformer, designed to extract building footprints from very high-resolution (VHR) imagery. The model integrates a novel local-global attention module to enhance the interpretation of objects across varying spatial resolutions and facilitate effective information exchange between different feature abstraction levels. It comprises three modules: multi-scale feature aggregation (MSFA), improving recognition across varying object sizes; multi-level feature pyramid (MLFP), fusing detailed and semantic features; and AuxHead, providing auxiliary supervision to stabilize and enhance learning. Experimental evaluations on the Massachusetts and WHU Building Datasets reveal the superior performance of the DeepSwinLite architecture when compared to existing SOTA models. On the Massachusetts dataset, the model attained an OA of 92.54% and an IoU of 77.94%, while on the WHU dataset, it achieved an OA of 98.32% and an IoU of 92.02%. Following the correction of errors identified in the Massachusetts ground truth and iterative enhancement, the model’s performance further improved, reaching 94.63% OA and 79.86% IoU. A key advantage of the DeepSwinLite model is its computational efficiency, requiring fewer floating-point operations (FLOPs) and parameters compared to other SOTA models. This efficiency makes the model particularly suitable for deployment in mobile and resource-constrained systems.
(This article belongs to the Special Issue Advances in Deep Learning Approaches: UAV Data Analysis)
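
Note: since the efficiency claims above are stated in parameters and FLOPs, a quick way to reproduce that kind of accounting for any PyTorch model is shown below. fvcore is one commonly used FLOP counter; the toy model is a placeholder, not DeepSwinLite.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 1, 1))  # placeholder network
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {n_params:,}")

# FLOP counting with fvcore (pip install fvcore), one of several available counters
from fvcore.nn import FlopCountAnalysis
flops = FlopCountAnalysis(model, torch.randn(1, 3, 512, 512)).total()
print(f"FLOPs per forward pass: {flops / 1e9:.2f} G")
```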

22 pages, 5732 KB  
Article
Explainable Transformer-Based Framework for Glaucoma Detection from Fundus Images Using Multi-Backbone Segmentation and vCDR-Based Classification
by Hind Alasmari, Ghada Amoudi and Hanan Alghamdi
Diagnostics 2025, 15(18), 2301; https://doi.org/10.3390/diagnostics15182301 - 10 Sep 2025
Abstract
Glaucoma is an eye disease caused by increased intraocular pressure (IOP) that affects the optic nerve head (ONH), leading to vision problems and irreversible blindness. Background/Objectives: Glaucoma is the second leading cause of blindness worldwide, and the number of people affected is increasing each year and is expected to reach 111.8 million by 2040. This escalating trend is alarming due to the lack of ophthalmology specialists relative to the population. This study proposes an explainable end-to-end pipeline for automated glaucoma diagnosis from fundus images. It also evaluates the performance of Vision Transformers (ViTs) relative to traditional CNN-based models. Methods: The proposed system uses three datasets: REFUGE, ORIGA, and G1020. It begins with YOLOv11 for object detection of the optic disc. Then, the optic disc (OD) and optic cup (OC) are segmented using U-Net with ResNet50, VGG16, and MobileNetV2 backbones, as well as MaskFormer with a Swin-Base backbone. Glaucoma is classified based on the vertical cup-to-disc ratio (vCDR). Results: MaskFormer outperforms all models in segmentation in all aspects, including IoU OD, IoU OC, DSC OD, and DSC OC, with scores of 88.29%, 91.09%, 93.83%, and 93.71%. For classification, it achieved accuracy and F1-scores of 84.03% and 84.56%. Conclusions: By relying on the interpretable features of the vCDR, the proposed framework enhances transparency and aligns well with the principles of explainable AI, thus offering a trustworthy solution for glaucoma screening. Our findings show that Vision Transformers offer a promising approach for achieving high segmentation performance with explainable, biomarker-driven diagnosis.
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)
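
Note: the classification stage reduces to a single interpretable biomarker, the vertical cup-to-disc ratio. A minimal computation from predicted OD/OC masks is sketched below; the 0.6 decision threshold is a commonly cited example value used here for illustration, not necessarily the paper's cut-off.

```python
import numpy as np

def vertical_extent(mask: np.ndarray) -> int:
    """Vertical diameter of a binary mask in pixels (top-most to bottom-most occupied row)."""
    rows = np.flatnonzero(np.any(mask > 0, axis=1))
    return int(rows[-1] - rows[0] + 1) if rows.size else 0

def vcdr(cup_mask: np.ndarray, disc_mask: np.ndarray) -> float:
    disc = vertical_extent(disc_mask)
    return vertical_extent(cup_mask) / disc if disc else 0.0

def classify(cup_mask, disc_mask, threshold=0.6):   # threshold is an assumed example value
    return "glaucoma suspect" if vcdr(cup_mask, disc_mask) >= threshold else "normal"
```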
