Search Results (99)

Search Parameters:
Keywords = Swin Transformer V2

19 pages, 2136 KB  
Article
Transformer-Based Multi-Class Classification of Bangladeshi Rice Varieties Using Image Data
by Israt Tabassum and Vimala Nunavath
Appl. Sci. 2026, 16(3), 1279; https://doi.org/10.3390/app16031279 - 27 Jan 2026
Abstract
Rice (Oryza sativa L.) is a staple food for over half of the global population, with significant economic, agricultural, and cultural importance, particularly in Asia. Thousands of rice varieties exist worldwide, differing in size, shape, color, and texture, making accurate classification essential for quality control, breeding programs, and authenticity verification in trade and research. Traditional manual identification of rice varieties is time-consuming, error-prone, and heavily reliant on expert knowledge. Deep learning provides an efficient alternative by automatically extracting discriminative features from rice grain images for precise classification. While prior studies have primarily employed deep learning models such as CNN, VGG, InceptionV3, MobileNet, and DenseNet201, transformer-based models remain underexplored for rice variety classification. This study addresses this gap by applying two transformer-based deep learning models, the Swin Transformer and the Vision Transformer (ViT), for multi-class classification of rice varieties using the publicly available PRBD dataset from Bangladesh. Experimental results demonstrate that the ViT model achieved an accuracy of 99.86% with precision, recall, and F1-score all at 0.9986, while the Swin Transformer model obtained an accuracy of 99.44% with a precision of 0.9944, recall of 0.9944, and F1-score of 0.9943. These results highlight the effectiveness of transformer-based models for high-accuracy rice variety classification. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
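As a concrete illustration of the fine-tuning setup described in the abstract, the sketch below loads a pretrained ViT from the timm library and trains it on an image-folder dataset; the folder path, class count, and hyperparameters are placeholders, not the authors' configuration.

```python
# Minimal fine-tuning sketch (not the authors' code): a pretrained ViT from timm
# adapted to an N-class rice-variety dataset stored as class-named folders.
import timm
import torch
from torch import nn
from torchvision import datasets, transforms

NUM_CLASSES = 5          # placeholder; the actual PRBD class count may differ
tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),
])
train_ds = datasets.ImageFolder("prbd/train", transform=tfm)   # hypothetical path
loader = torch.utils.data.DataLoader(train_ds, batch_size=32, shuffle=True)

model = timm.create_model("vit_base_patch16_224", pretrained=True,
                          num_classes=NUM_CLASSES)
# For a Swin V2 variant, e.g. "swinv2_tiny_window8_256", resize inputs to 256x256.
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:
    opt.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    opt.step()
```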
Show Figures

Figure 1

16 pages, 1206 KB  
Article
HASwinNet: A Swin Transformer-Based Denoising Framework with Hybrid Attention for mmWave MIMO Systems
by Xi Han, Houya Tu, Jiaxi Ying, Junqiao Chen and Zhiqiang Xing
Entropy 2026, 28(1), 124; https://doi.org/10.3390/e28010124 - 20 Jan 2026
Abstract
Millimeter-wave (mmWave) massive multiple-input, multiple-output (MIMO) systems are a cornerstone technology for integrated sensing and communication (ISAC) in sixth-generation (6G) mobile networks. These systems provide high-capacity backhaul while simultaneously enabling high-resolution environmental sensing. However, accurate channel estimation remains highly challenging due to intrinsic noise sensitivity and clustered sparse multipath structures. These challenges are particularly severe under limited pilot resources and low signal-to-noise ratio (SNR) conditions. To address these difficulties, this paper proposes HASwinNet, a deep learning (DL) framework designed for mmWave channel denoising. The framework integrates a hierarchical Swin Transformer encoder for structured representation learning. It further incorporates two complementary branches. The first branch performs sparse token extraction guided by angular-domain significance. The second branch focuses on angular-domain refinement by applying discrete Fourier transform (DFT), squeeze-and-excitation (SE), and inverse DFT (IDFT) operations. This generates a mask that highlights angularly coherent features. A decoder combines the outputs of both branches with a residual projection from the input to yield refined channel estimates. Additionally, we introduce an angular-domain perceptual loss during training. This enforces spectral consistency and preserves clustered multipath structures. Simulation results based on the Saleh–Valenzuela (S–V) channel model demonstrate that HASwinNet achieves significant improvements in normalized mean squared error (NMSE) and bit error rate (BER). It consistently outperforms convolutional neural network (CNN), long short-term memory (LSTM), and U-Net baselines. Furthermore, experiments with reduced pilot symbols confirm that HASwinNet effectively exploits angular sparsity. The model retains a consistent advantage over baselines even under pilot-limited conditions. These findings validate the scalability of HASwinNet for practical 6G mmWave backhaul applications. They also highlight its potential in ISAC scenarios where accurate channel recovery supports both communication and sensing. Full article
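The angular-domain refinement branch can be pictured as a DFT over the antenna axis, a squeeze-and-excitation gate on the resulting angular bins, and an IDFT back; the following minimal sketch illustrates that pattern with assumed tensor shapes and is not the published HASwinNet code.

```python
# Illustrative sketch of an angular-domain refinement branch: DFT over the antenna
# axis, squeeze-and-excitation gating of angular bins, then IDFT. Shapes and the
# gate design are assumptions made for clarity.
import torch
from torch import nn

class AngularRefine(nn.Module):
    def __init__(self, n_antennas: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(n_antennas, n_antennas // reduction),
            nn.ReLU(),
            nn.Linear(n_antennas // reduction, n_antennas),
            nn.Sigmoid(),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: complex channel estimate, shape (batch, subcarriers, antennas)
        h_ang = torch.fft.fft(h, dim=-1)            # antenna -> angular domain
        energy = h_ang.abs().mean(dim=1)            # (batch, antennas) angular energy
        mask = self.gate(energy).unsqueeze(1)       # per-bin weights in [0, 1]
        h_ang = h_ang * mask                        # suppress weak (noisy) bins
        return torch.fft.ifft(h_ang, dim=-1)        # back to the antenna domain

h_noisy = torch.randn(8, 64, 32, dtype=torch.complex64)
h_refined = AngularRefine(n_antennas=32)(h_noisy)
```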

35 pages, 5337 KB  
Article
Enhancing Glioma Classification in Magnetic Resonance Imaging Using Vision Transformers and Convolutional Neural Networks
by Marco Antonio Gómez-Guzmán, José Jaime Esqueda-Elizondo, Laura Jiménez-Beristain, Gilberto Manuel Galindo-Aldana, Oscar Adrian Aguirre-Castro, Edgar Rene Ramos-Acosta, Cynthia Torres-Gonzalez, Enrique Efren García-Guerrero and Everardo Inzunza-Gonzalez
Electronics 2026, 15(2), 434; https://doi.org/10.3390/electronics15020434 - 19 Jan 2026
Abstract
Brain tumors, encompassing subtypes with distinct progression and risk profiles, are a serious public health concern. Magnetic resonance imaging (MRI) is the primary imaging modality for non-invasive assessment, providing the contrast and detail necessary for diagnosis, subtype classification, and individualized care planning. In this paper, we evaluate the capability of modern deep learning models to classify gliomas as high-grade (HGG) or low-grade (LGG) using reduced training data from MRI scans. Utilizing the BraTS 2019 best-slice dataset (2185 images in two classes, HGG and LGG) divided into two folders, training and testing, with images obtained from different patients, we created subsets including 10%, 25%, 50%, 75%, and 100% of the dataset. Six deep learning architectures, DeiT3_base_patch16_224, Inception_v4, Xception41, ConvNextV2_tiny, swin_tiny_patch4_window7_224, and EfficientNet_B0, were evaluated utilizing three-fold cross-validation (k = 3) and increasingly large training datasets. Explainability was assessed using Grad-CAM. With 25% of the training data, DeiT3_base_patch16_224 achieved an accuracy of 99.401% and an F1-Score of 99.403%. Under the same conditions, Inception_v4 achieved an accuracy of 99.212% and an F1-Score of 99.222%. Considering how the models performed across both data subsets and their compute demands, Inception_v4 struck the best balance for MRI-based glioma classification. Both convolutional networks and vision transformers achieved superior discrimination between HGGs and LGGs, even under data-limited conditions. Architectural disparities became increasingly apparent as training data diminished, highlighting unique inductive biases and efficiency characteristics. Even with a relatively limited amount of training data, current deep learning (DL) methods can achieve reliable performance in classifying gliomas from MRI scans. Among the architectures evaluated, Inception_v4 offered the most consistent balance between accuracy, F1-Score, and computational cost, making it a strong candidate for integration into MRI-based clinical workflows. Full article
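The data protocol above (stratified training fractions plus three-fold cross-validation) can be sketched as follows; the labels are dummies standing in for the BraTS slices and the training call is left abstract.

```python
# Sketch of the evaluation protocol: stratified fractions of the training set plus
# 3-fold cross-validation. Dummy labels stand in for the 2185 BraTS best slices.
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=2185)          # placeholder: 0 = LGG, 1 = HGG
indices = np.arange(len(labels))

for frac in (0.10, 0.25, 0.50, 0.75, 1.00):
    if frac < 1.0:
        subset_idx, _ = train_test_split(indices, train_size=frac,
                                         stratify=labels, random_state=0)
    else:
        subset_idx = indices
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    for fold, (tr, va) in enumerate(skf.split(subset_idx, labels[subset_idx])):
        train_ids, val_ids = subset_idx[tr], subset_idx[va]
        # train one of the six architectures on train_ids, validate on val_ids
        print(f"fraction={frac:.2f} fold={fold} "
              f"train={len(train_ids)} val={len(val_ids)}")
```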

25 pages, 19621 KB  
Article
Scrap-SAM-CLIP: Assembling Foundation Models for Typical Shape Recognition in Scrap Classification and Rating
by Guangda Bao, Wenzhi Xia, Haichuan Wang, Zhiyou Liao, Ting Wu and Yun Zhou
Sensors 2026, 26(2), 656; https://doi.org/10.3390/s26020656 - 18 Jan 2026
Abstract
To address the limitation of 2D methods in inferring absolute scrap dimensions from images, we propose Scrap-SAM-CLIP (SSC), a vision-language model integrating the segment anything model (SAM) and contrastive language-image pre-training in Chinese (CN-CLIP). The model enables identification of canonical scrap shapes, establishing a foundational framework for subsequent 3D reconstruction and dimensional extraction within the 3D recognition pipeline. Individual modules of SSC are fine-tuned on the self-constructed scrap dataset. For segmentation, the combined box-and-point prompt yields optimal performance among various prompting strategies. MobileSAM and SAM-HQ-Tiny serve as effective lightweight alternatives for edge deployment. Fine-tuning the SAM decoder significantly enhances robustness under noisy prompts, improving accuracy by at least 5.55% with a five-positive-points prompt and up to 15.00% with a five-positive-points-and-five-negative-points prompt. In classification, SSC achieves 95.3% accuracy, outperforming Swin Transformer V2_base by 2.9%, with t-SNE visualizations confirming superior feature learning capability. The performance advantages of SSC stem from its modular assembly strategy, enabling component-specific optimization through subtask decoupling and enhancing system interpretability. This work refines the scrap 3D identification pipeline and demonstrates the efficacy of adapted foundation models in industrial vision systems. Full article
(This article belongs to the Section Intelligent Sensors)
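A combined box-and-point prompt of the kind compared above can be issued with the segment-anything package roughly as sketched below; the checkpoint, image, and coordinates are placeholders, and the CN-CLIP classification stage is omitted.

```python
# Sketch of a combined box-and-point SAM prompt; checkpoint, image, and coordinates
# are placeholders, not the authors' setup.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("scrap_pile.jpg"), cv2.COLOR_BGR2RGB)  # hypothetical image
predictor.set_image(image)

box = np.array([120, 80, 560, 430])            # xyxy box around one scrap piece
points = np.array([[340, 250]])                # one positive point inside the box
masks, scores, _ = predictor.predict(
    point_coords=points,
    point_labels=np.array([1]),                # 1 = foreground, 0 = background
    box=box,
    multimask_output=False,
)
print(masks.shape, scores)                     # (1, H, W) binary mask and its score
```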

28 pages, 3553 KB  
Article
GCN-Embedding Swin–Unet for Forest Remote Sensing Image Semantic Segmentation
by Pingbo Liu, Gui Zhang and Jianzhong Li
Remote Sens. 2026, 18(2), 242; https://doi.org/10.3390/rs18020242 - 12 Jan 2026
Abstract
Forest resources are among the most important ecosystems on the earth. The semantic segmentation and accurate positioning of ground objects in forest remote sensing (RS) imagery are crucial to the emergency treatment of forest natural disasters, especially forest fires. Currently, most existing methods for image semantic segmentation are built upon convolutional neural networks (CNNs). Nevertheless, these techniques face difficulties in directly accessing global contextual information and accurately detecting geometric transformations within the image’s target regions. This limitation stems from the inherent locality of convolution operations, which are restricted to processing data structured in Euclidean space and confined to square-shaped regions. Inspired by the graph convolution network (GCN) with robust capabilities in processing irregular and complex targets, as well as Swin Transformers renowned for exceptional global context modeling, we present a hybrid semantic segmentation framework for forest RS imagery termed GSwin–Unet. This framework embeds the GCN model into Swin–Unet architecture to address the issue of low semantic segmentation accuracy of RS imagery in forest scenarios, which is caused by the complex texture features, diverse shapes, and unclear boundaries of land objects. GSwin–Unet features a parallel dual-encoder architecture of GCN and Swin Transformer. First, we integrate the Zero-DCE (Zero-Reference Deep Curve Estimation) algorithm into GSwin–Unet to enhance forest RS image feature representation. Second, a feature aggregation module (FAM) is proposed to bridge the dual encoders by fusing GCN-derived local aggregated features with Swin Transformer-extracted features. Our study demonstrates that, compared with the baseline models TransUnet, Swin–Unet, Unet, and DeepLab V3+, the GSwin–Unet achieves improvements of 7.07%, 5.12%, 8.94%, and 2.69% in the mean Intersection over Union (MIoU) and 3.19%, 1.72%, 4.3%, and 3.69% in the average F1 score (Ave.F1), respectively, on the RGB forest RS dataset. On the NIRGB forest RS dataset, the improvements in MIoU are 5.75%, 3.38%, 6.79%, and 2.44%, and the improvements in Ave.F1 are 4.02%, 2.38%, 4.72%, and 1.67%, respectively. Meanwhile, GSwin–Unet shows excellent adaptability on the selected GID dataset with high forest coverage, where the MIoU and Ave.F1 reach 72.92% and 84.3%, respectively. Full article
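The dual-encoder fusion idea can be pictured as a graph-convolution step whose output is concatenated with the window-attention features and mixed by a 1x1 convolution; the toy module below illustrates that pattern under assumed shapes and is not the GSwin–Unet implementation.

```python
# Toy sketch of a graph-convolution step fused with transformer-branch features by
# concatenation and a 1x1 convolution; shapes and the adjacency are stand-ins.
import torch
from torch import nn

class SimpleGCNLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Linear(dim, dim, bias=False)

    def forward(self, x, adj):
        # x: (B, N, C) node features; adj: (B, N, N) row-normalized adjacency
        return torch.relu(adj @ self.weight(x))

class FeatureAggregation(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, swin_feat, gcn_feat):
        # swin_feat, gcn_feat: (B, C, H, W) maps from the two encoder branches
        return self.fuse(torch.cat([swin_feat, gcn_feat], dim=1))

B, C, H, W = 2, 96, 32, 32
nodes = torch.randn(B, H * W, C)
adj = torch.softmax(torch.randn(B, H * W, H * W), dim=-1)   # stand-in adjacency
gcn_out = SimpleGCNLayer(C)(nodes, adj).transpose(1, 2).reshape(B, C, H, W)
fused = FeatureAggregation(C)(torch.randn(B, C, H, W), gcn_out)
```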

28 pages, 25509 KB  
Article
Deep Learning for Semantic Segmentation in Crops: Generalization from Opuntia spp.
by Arturo Duarte-Rangel, César Camacho-Bello, Eduardo Cornejo-Velazquez and Mireya Clavel-Maqueda
AgriEngineering 2026, 8(1), 18; https://doi.org/10.3390/agriengineering8010018 - 5 Jan 2026
Abstract
Semantic segmentation of UAV–acquired RGB orthomosaics is a key component for quantifying vegetation cover and monitoring phenology in precision agriculture. This study evaluates a representative set of CNN–based architectures (U–Net, U–Net Xception–Style, SegNet, DeepLabV3+) and Transformer–based models (Swin–UNet/Swin–Transformer, SegFormer, and Mask2Former) under a unified and reproducible protocol. We propose a transfer–and–consolidation workflow whose performance is assessed not only through region–overlap and pixel–wise discrepancy metrics, but also via boundary–sensitive criteria that are explicitly linked to orthomosaic–scale vegetation–cover estimation by pixel counting under GSD (Ground Sample Distance) control. The experimental design considers a transfer scenario between morphologically related crops: initial training on Opuntia spp. (prickly pear), direct (“zero–shot”) inference on Agave salmiana, fine–tuning using only 6.84% of the agave tessellated set as limited target–domain supervision, and a subsequent consolidation stage to obtain a multi–species model. The evaluation integrates IoU, Dice, RMSE, pixel accuracy, and computational cost (time per image), and additionally reports the BF score and HD95 to characterize contour fidelity, which is critical when area is derived from orthomosaic–scale masks. Results show that Transformer-based approaches tend to provide higher stability and improved boundary delineation on Opuntia spp., whereas transfer to Agave salmiana exhibits selective degradation that is mitigated through low–annotation–cost fine-tuning. On Opuntia spp., Mask2Former achieves the best test performance (IoU 0.897 ± 0.094; RMSE 0.146 ± 0.002) and, after consolidation, sustains the highest overlap on both crops (IoU 0.894 ± 0.004 on Opuntia and IoU 0.760 ± 0.046 on Agave), while preserving high contour fidelity (BF score 0.962 ± 0.102/0.877 ± 0.153; HD95 2.189 ± 3.447 px/8.458 ± 16.667 px for Opuntia/Agave), supporting its use for final vegetation–cover quantification. Overall, the study provides practical guidelines for architecture selection under hardware constraints, a reproducible transfer protocol, and an orthomosaic–oriented implementation that facilitates integration into agronomic and remote–sensing workflows. Full article
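The region-overlap metrics and the GSD-controlled cover estimate referred to above reduce to simple pixel arithmetic, as in the short sketch below (mask contents and the 2 cm ground sample distance are illustrative only).

```python
# Worked sketch of IoU/Dice and vegetation cover by pixel counting under a given
# ground sample distance; mask values and the GSD are illustrative.
import numpy as np

def iou_dice(pred: np.ndarray, gt: np.ndarray):
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / union if union else 1.0
    total = pred.sum() + gt.sum()
    dice = 2 * inter / total if total else 1.0
    return iou, dice

pred = np.zeros((512, 512), dtype=np.uint8); pred[100:300, 100:300] = 1
gt = np.zeros((512, 512), dtype=np.uint8);   gt[120:320, 120:320] = 1
print(iou_dice(pred, gt))

gsd_m = 0.02                                   # hypothetical 2 cm/pixel orthomosaic
cover_m2 = pred.sum() * gsd_m ** 2             # vegetation cover by pixel counting
print(f"estimated cover: {cover_m2:.2f} m^2")
```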

20 pages, 7656 KB  
Article
Remote Sensing Extraction and Spatiotemporal Change Analysis of Time-Series Terraces in Complex Terrain on the Loess Plateau Based on a New Swin Transformer Dual-Branch Deformable Boundary Network (STDBNet)
by Guobin Kan, Jianhua Xiao, Benli Liu, Bao Wang, Chenchen He and Hong Yang
Remote Sens. 2026, 18(1), 85; https://doi.org/10.3390/rs18010085 - 26 Dec 2025
Abstract
Terrace construction is a critical engineering practice for soil and water conservation as well as sustainable agricultural development on the Loess Plateau (LP), China, where high-precision dynamic monitoring is essential for informed regional ecological governance. To address the challenges of inadequate extraction accuracy and poor model generalization in time-series terrace mapping amid complex terrain and spectral confounding, this study proposes a novel Swin Transformer-based Terrace Dual-Branch Deformable Boundary Network (STDBNet) that seamlessly integrates multi-source remote sensing (RS) data with deep learning (DL). The STDBNet model integrates the Swin Transformer architecture with a dual-branch attention mechanism and introduces a boundary-assisted supervision strategy, thereby significantly enhancing terrace boundary recognition, multi-source feature fusion, and model generalization capability. Leveraging Sentinel-2 multi-temporal optical imagery and terrain-derived features, we constructed the first 10-m-resolution spatiotemporal dataset of terrace distribution across the LP, encompassing nine annual periods from 2017 to 2025. Performance evaluations demonstrate that STDBNet achieved an overall accuracy (OA) of 95.26% and a mean intersection over union (MIoU) of 86.84%, outperforming mainstream semantic segmentation models including U-Net and DeepLabV3+ by a significant margin. Further analysis reveals the spatiotemporal evolution dynamics of terraces over the nine-year period and their distribution patterns across gradients of key terrain factors. This study not only provides robust data support for research on terraced ecosystem processes and assessments of soil and water conservation efficacy on the LP but also lays a scientific foundation for informing the formulation of regional ecological restoration and land management policies. Full article
(This article belongs to the Special Issue Temporal and Spatial Analysis of Multi-Source Remote Sensing Images)
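The boundary-assisted supervision strategy can be approximated by deriving a boundary target from the label map and adding an auxiliary binary cross-entropy term, as in the rough sketch below (a generic pattern, not the STDBNet loss).

```python
# Rough sketch of boundary-assisted supervision: boundary targets derived from the
# label map drive an auxiliary BCE loss alongside the segmentation cross-entropy.
import torch
import torch.nn.functional as F

def boundary_target(labels: torch.Tensor) -> torch.Tensor:
    # labels: (B, H, W) integer terrace mask; a pixel is a boundary pixel if its
    # local dilation and erosion disagree (cheap morphological-gradient proxy).
    lab = labels.float().unsqueeze(1)
    dil = F.max_pool2d(lab, 3, stride=1, padding=1)
    ero = -F.max_pool2d(-lab, 3, stride=1, padding=1)
    return (dil != ero).float()                      # (B, 1, H, W)

def total_loss(seg_logits, boundary_logits, labels, lam: float = 0.5):
    seg = F.cross_entropy(seg_logits, labels)
    bnd = F.binary_cross_entropy_with_logits(boundary_logits, boundary_target(labels))
    return seg + lam * bnd                           # lam is an assumed weighting

seg_logits = torch.randn(2, 2, 128, 128)             # terrace vs. background
boundary_logits = torch.randn(2, 1, 128, 128)
labels = torch.randint(0, 2, (2, 128, 128))
print(total_loss(seg_logits, boundary_logits, labels))
```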

23 pages, 6281 KB  
Article
Empirical Mode Decomposition-Based Deep Learning Model Development for Medical Imaging: Feasibility Study for Gastrointestinal Endoscopic Image Classification
by Mou Deb, Mrinal Kanti Dhar, Poonguzhali Elangovan, Keerthy Gopalakrishnan, Divyanshi Sood, Aaftab Sethi, Sabah Afroze, Sourav Bansal, Aastha Goudel, Charmy Parikh, Avneet Kaur, Swetha Rapolu, Gianeshwaree Alias Rachna Panjwani, Rabiah Aslam Ansari, Naghmeh Asadimanesh, Shiva Sankari Karuppiah, Scott A. Helgeson, Venkata S. Akshintala and Shivaram P. Arunachalam
J. Imaging 2026, 12(1), 4; https://doi.org/10.3390/jimaging12010004 - 22 Dec 2025
Abstract
This study proposes a novel two-dimensional Empirical Mode Decomposition (2D EMD)-based deep learning framework to enhance model performance in multi-class image classification tasks and potential early detection of diseases in healthcare using medical imaging. To validate this approach, we apply it to gastrointestinal (GI) endoscopic image classification using the publicly available Kvasir dataset, which contains eight GI image classes with 1000 images each. The proposed 2D EMD-based design procedure decomposes images into a full set of intrinsic mode functions (IMFs) to enhance image features beneficial for AI model development. Integrating 2D EMD into a deep learning pipeline, we evaluate its impact on four popular models (ResNet152, VGG19bn, MobileNetV3L, and SwinTransformerV2S). The results demonstrate that subtracting IMFs from the original image consistently improves accuracy, F1-score, and AUC for all models. The study reveals a notable enhancement in model performance, with an approximately 9% increase in accuracy compared to counterparts without EMD integration for ResNet152. Similarly, there is an increase of around 18% for VGG19bn, 3% for MobileNetV3L, and 8% for SwinTransformerV2S. Additionally, explainable AI (XAI) techniques, such as Grad-CAM, illustrate that the model focuses on GI regions for predictions. This study highlights the efficacy of 2D EMD in enhancing deep learning model performance for GI image classification, with potential applications in other domains. Full article
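The preprocessing step described above amounts to decomposing each image into IMFs and subtracting selected IMFs before training; the sketch below shows that step with a hypothetical emd2d() helper standing in for a concrete 2D EMD routine.

```python
# Conceptual sketch of the IMF-subtraction preprocessing; emd2d() is a hypothetical
# stand-in for a 2D EMD routine (the concrete library API varies), and the IMF
# indices to drop are illustrative only.
import numpy as np

def emd2d(image: np.ndarray):
    """Hypothetical 2-D EMD routine returning a list of IMF arrays; stubbed here
    because the concrete implementation is library-specific."""
    raise NotImplementedError

def enhance(image: np.ndarray, drop_imfs=(0,)) -> np.ndarray:
    """Return the image with the selected intrinsic mode functions subtracted."""
    imfs = emd2d(image)
    removed = sum(imfs[i] for i in drop_imfs)
    out = image.astype(np.float32) - removed
    return np.clip(out, 0, 255).astype(np.uint8)

# The enhanced image would then replace the original as input to ResNet152,
# VGG19bn, MobileNetV3L, or Swin Transformer V2 in a standard training pipeline.
```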

16 pages, 4888 KB  
Article
PGSUNet: A Phenology-Guided Deep Network for Tea Plantation Extraction from High-Resolution Remote Sensing Imagery
by Xiaoyong Zhang, Bochen Jiang and Hongrui Sun
Appl. Sci. 2025, 15(24), 13062; https://doi.org/10.3390/app152413062 - 11 Dec 2025
Abstract
Tea, recognized as one of the world’s three principal beverages, plays a significant role both economically and culturally. The accurate, large-scale mapping of tea plantations is crucial for quality control, industry regulation, and ecological assessments. Challenges arise in high-resolution imagery due to the spectral similarities with other land covers and the intricate nature of their boundaries. We introduce a Phenology-Guided SwinUnet (PGSUNet), a semantic segmentation network that amalgamates Swin Transformer encoding with a parallel phenology context branch. An intelligent fusion module within this network generates spatial attention informed by phenological priors, while a dual-head decoder enhances the precision through explicit edge supervision. Using Hangzhou City as the case study, PGSUNet was compared with seven mainstream models, including DeepLabV3+ and SegFormer. It achieved an F1-score of 0.84, outperforming the second-best model by 0.03, and obtained an mIoU of 84.53%, about 2% higher than the next-best result. This study demonstrates that the integration of phenological priors with edge supervision significantly improves the fine-scale extraction of agricultural land covers from complex remote sensing imagery. Full article
(This article belongs to the Section Agricultural Science and Technology)
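The phenology-guided fusion can be pictured as a spatial attention map computed from the phenology branch and multiplied onto the encoder features; the simplified module below illustrates that idea under assumed channel counts and is not the PGSUNet release.

```python
# Simplified sketch of a phenology-guided spatial attention gate: phenology-branch
# features are projected to a single-channel map that gates the encoder features.
import torch
from torch import nn

class PhenologyGate(nn.Module):
    def __init__(self, enc_ch: int, pheno_ch: int):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(pheno_ch, enc_ch // 2, 3, padding=1), nn.ReLU(),
            nn.Conv2d(enc_ch // 2, 1, 1), nn.Sigmoid(),
        )

    def forward(self, enc_feat, pheno_feat):
        # enc_feat: (B, enc_ch, H, W) Swin encoder features
        # pheno_feat: (B, pheno_ch, H, W) phenological-prior features (assumed input)
        return enc_feat * self.attn(pheno_feat)

gate = PhenologyGate(enc_ch=96, pheno_ch=8)
out = gate(torch.randn(2, 96, 64, 64), torch.randn(2, 8, 64, 64))
```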

23 pages, 11094 KB  
Article
RSDB-Net: A Novel Rotation-Sensitive Dual-Branch Network with Enhanced Local Features for Remote Sensing Ship Detection
by Danshu Zhou, Yushan Xiong, Shuangming Yu, Peng Feng, Jian Liu, Nanjian Wu, Runjiang Dou and Liyuan Liu
Remote Sens. 2025, 17(23), 3925; https://doi.org/10.3390/rs17233925 - 4 Dec 2025
Abstract
Ship detection in remote sensing imagery is hindered by cluttered backgrounds, large variations in scale, and random orientations, limiting the performance of detectors designed for natural images. We propose RSDB-Net, a Rotation-Sensitive Dual-Branch Detection Network that introduces innovations in feature extraction, fusion, and detection. The Swin Transformer–CNN Backbone (STCBackbone) combines a Swin Transformer for global semantics with a CNN branch for local spatial detail, while the Feature Conversion and Coupling Module (FCCM) aligns and fuses heterogeneous features to handle multi-scale objects, and a Rotation-sensitive Cross-branch Fusion Head (RCFHead) enables bidirectional interaction between classification and localization, improving detection of randomly oriented targets. Additionally, an enhanced Feature Pyramid Network (eFPN) with learnable transposed convolutions restores semantic information while maintaining spatial alignment. Experiments on DOTA-v1.0 and HRSC2016 show that RSDB-Net performs better than the state of the art (SOTA), with mAP-ship values of 89.13% and 90.10% (+5.54% and +44.40% over the baseline, respectively), and reaches 72 FPS on an RTX 3090. RSDB-Net also demonstrates strong generalization and scalability, providing an effective solution for rotation-aware ship detection. Full article
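The conversion-and-coupling step can be approximated by reshaping the Swin token sequence back to a spatial map, aligning channels with a 1x1 convolution, and adding it to the CNN features, as in the toy sketch below (an assumed structure, not the published FCCM).

```python
# Toy sketch of coupling transformer tokens with CNN feature maps: reshape the
# token sequence to a map, align channels, and add; shapes are assumptions.
import torch
from torch import nn

class ConvertAndCouple(nn.Module):
    def __init__(self, token_dim: int, cnn_ch: int):
        super().__init__()
        self.align = nn.Conv2d(token_dim, cnn_ch, kernel_size=1)

    def forward(self, tokens, cnn_feat):
        # tokens: (B, H*W, token_dim) from the Swin branch
        # cnn_feat: (B, cnn_ch, H, W) from the CNN branch
        b, _, h, w = cnn_feat.shape
        token_map = tokens.transpose(1, 2).reshape(b, -1, h, w)
        return cnn_feat + self.align(token_map)

fused = ConvertAndCouple(token_dim=96, cnn_ch=256)(
    torch.randn(2, 32 * 32, 96), torch.randn(2, 256, 32, 32))
```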

20 pages, 594 KB  
Article
CFNet: Achieving Practical Speedup in Lightweight CNNs via Channel-Focused Design and Cross-Channel Mixing
by Xin Lv, Jing Liang, Qi Wang, Haipeng Du, Caixia Yan and Jiageng Zhang
Appl. Sci. 2025, 15(23), 12620; https://doi.org/10.3390/app152312620 - 28 Nov 2025
Abstract
Convolutional Neural Networks (CNNs) have achieved remarkable performance in computer vision tasks, but their deployment on resource-constrained devices remains challenging. While existing lightweight CNNs reduce FLOPs significantly, their practical inference speed is limited by memory access bottlenecks. We hence propose CFNet, an efficient architecture that bridges the gap between theoretical efficiency and practical speed through synergistic design of channel-focused convolution (CFConv) and channel mixed unit (CMU). CFConv dynamically selects informative channels via learnable GroupNorm scaling factors and reparameterization, reducing both FLOPs and memory access, while CMU enables cross-channel communication through a split-transform-and-mix strategy to mitigate information loss. Experiments on CIFAR/ImageNet classification and MS COCO object detection demonstrate CFNet’s superior performance. On ImageNet-1K, CFNet-A achieves 35.5% and 189.4% GPU throughput improvements over MobileNetV2 and MobileViTv1-XXS respectively, while delivering 1.76% and 4.09% accuracy gains. CFNet-E attains 83.5% top-1 accuracy, outperforming Swin-S by 0.47% with 44.6% higher GPU throughput and 43.6% lower CPU inference latency. Full article
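The channel-focused idea can be pictured as using GroupNorm scale factors as learned channel-importance scores and forwarding only the top-scoring channels; the loose sketch below illustrates that mechanism and is not the CFNet implementation (the reparameterization step is omitted).

```python
# Loose sketch of channel selection via GroupNorm scale factors: |gamma| acts as a
# per-channel importance score and only the top-k channels are kept.
import torch
from torch import nn

class ChannelFocusedBlock(nn.Module):
    def __init__(self, channels: int, keep_ratio: float = 0.5):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.norm = nn.GroupNorm(num_groups=channels // 4, num_channels=channels)
        self.keep = max(1, int(channels * keep_ratio))
        self.mix = nn.Conv2d(self.keep, channels, kernel_size=1)  # cheap re-expansion

    def forward(self, x):
        x = torch.relu(self.norm(self.conv(x)))
        # |gamma| of the GroupNorm serves as a learned channel-importance score
        idx = self.norm.weight.abs().topk(self.keep).indices
        return self.mix(x[:, idx])

y = ChannelFocusedBlock(64)(torch.randn(2, 64, 56, 56))
```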

22 pages, 10489 KB  
Article
From Contemporary Datasets to Cultural Heritage Performance: Explainability and Energy Profiling of Visual Models Towards Textile Identification
by Evangelos Nerantzis, Lamprini Malletzidou, Eleni Kyratzopoulou, Nestor C. Tsirliganis and Nikolaos A. Kazakis
Heritage 2025, 8(11), 447; https://doi.org/10.3390/heritage8110447 - 24 Oct 2025
Abstract
The identification and classification of textiles play a crucial role in archaeometric studies, given their technological, economic, and cultural significance. Traditional textile analysis relies mainly on optical microscopy and observation, while other microscopic, analytical, and spectroscopic techniques are used for fiber identification and compositional analysis. This protocol can be invasive and destructive for the artifacts under study, time-consuming, and it often relies on personal expertise. In this preliminary study, an alternative, macroscopic approach is proposed, based on texture and surface textile characteristics, using low-magnification images and deep learning models. Under this scope, a publicly available, imbalanced textile image dataset was used to pretrain and evaluate six computer vision architectures (ResNet50, EfficientNetV2, ViT, ConvNeXt, Swin Transformer, and MaxViT). In addition to accuracy, the energy efficiency and ecological footprint of the process were assessed using the CodeCarbon tool. The results indicate that two of the evaluated models, Swin and EfficientNetV2, deliver competitive accuracies together with low carbon emissions in comparison with the remaining models. This alternative, promising, sustainable, and non-invasive approach for textile classification demonstrates the feasibility of developing a custom, heritage-based image dataset. Full article
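Per-model energy and emissions logging with the CodeCarbon tool mentioned above follows the pattern sketched below; train_and_evaluate() is a hypothetical stand-in for the actual fine-tuning and evaluation routine.

```python
# Sketch of per-model emissions logging with CodeCarbon; the training routine is a
# hypothetical placeholder, not the study's code.
from codecarbon import EmissionsTracker

def train_and_evaluate(model_name: str) -> float:
    """Hypothetical placeholder: fine-tune `model_name` and return accuracy."""
    return 0.0

for name in ["resnet50", "efficientnetv2", "vit", "convnext", "swin", "maxvit"]:
    tracker = EmissionsTracker(project_name=f"textile-{name}")
    tracker.start()
    try:
        acc = train_and_evaluate(name)
    finally:
        emissions_kg = tracker.stop()        # estimated kg CO2-eq for this run
    print(name, acc, emissions_kg)
```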

18 pages, 1694 KB  
Article
FAIR-Net: A Fuzzy Autoencoder and Interpretable Rule-Based Network for Ancient Chinese Character Recognition
by Yanling Ge, Yunmeng Zhang and Seok-Beom Roh
Sensors 2025, 25(18), 5928; https://doi.org/10.3390/s25185928 - 22 Sep 2025
Abstract
Ancient Chinese scripts—including oracle bone carvings, bronze inscriptions, stone steles, Dunhuang scrolls, and bamboo slips—are rich in historical value but often degraded due to centuries of erosion, damage, and stylistic variability. These issues severely hinder manual transcription and render conventional OCR techniques inadequate, as they are typically trained on modern printed or handwritten text and lack interpretability. To tackle these challenges, we propose FAIR-Net, a hybrid architecture that combines the unsupervised feature learning capacity of a deep autoencoder with the semantic transparency of a fuzzy rule-based classifier. In FAIR-Net, the deep autoencoder first compresses high-resolution character images into low-dimensional, noise-robust embeddings. These embeddings are then passed into a Fuzzy Neural Network (FNN), whose hidden layer leverages Fuzzy C-Means (FCM) clustering to model soft membership degrees and generate human-readable fuzzy rules. The output layer uses Iteratively Reweighted Least Squares Estimation (IRLSE) combined with a Softmax function to produce probabilistic predictions, with all weights constrained as linear mappings to maintain model transparency. We evaluate FAIR-Net on CASIA-HWDB1.0, HWDB1.1, and ICDAR 2013 CompetitionDB, where it achieves a recognition accuracy of 97.91%, significantly outperforming baseline CNNs (p < 0.01, Cohen’s d > 0.8) while maintaining the tightest confidence interval (96.88–98.94%) and lowest standard deviation (±1.03%). Additionally, FAIR-Net reduces inference time to 25 s, improving processing efficiency by 41.9% over AlexNet and up to 98.9% over CNN-Fujitsu, while preserving >97.5% accuracy across evaluations. To further assess generalization to historical scripts, FAIR-Net was tested on the Ancient Chinese Character Dataset (9233 classes; 979,907 images), achieving 83.25% accuracy—slightly higher than ResNet101 but 2.49% lower than SwinT-v2-small—while reducing training time by over 5.5× compared to transformer-based baselines. Fuzzy rule visualization confirms enhanced robustness to glyph ambiguities and erosion. Overall, FAIR-Net provides a practical, interpretable, and highly efficient solution for the digitization and preservation of ancient Chinese character corpora. Full article
(This article belongs to the Section Sensing and Imaging)
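The fuzzy-membership step in the hidden layer follows the standard fuzzy C-means membership formula; the small numeric sketch below computes such soft memberships for autoencoder embeddings against given cluster centres and is not the FAIR-Net code.

```python
# Numeric sketch of standard FCM soft memberships:
#   u[k, i] = 1 / sum_j (d[k, i] / d[k, j]) ** (2 / (m - 1))
# applied to stand-in embeddings; rows sum to 1.
import numpy as np

def fcm_memberships(x: np.ndarray, centres: np.ndarray, m: float = 2.0) -> np.ndarray:
    # x: (N, D) embeddings, centres: (C, D); returns (N, C) memberships
    d = np.linalg.norm(x[:, None, :] - centres[None, :, :], axis=-1) + 1e-12
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

emb = np.random.default_rng(0).normal(size=(4, 16))      # stand-in embeddings
centres = emb[:2]                                        # stand-in cluster centres
u = fcm_memberships(emb, centres)
print(u, u.sum(axis=1))                                  # each row sums to 1
```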

26 pages, 3973 KB  
Article
ViT-DCNN: Vision Transformer with Deformable CNN Model for Lung and Colon Cancer Detection
by Aditya Pal, Hari Mohan Rai, Joon Yoo, Sang-Ryong Lee and Yooheon Park
Cancers 2025, 17(18), 3005; https://doi.org/10.3390/cancers17183005 - 15 Sep 2025
Cited by 3
Abstract
Background/Objectives: Lung and colon cancers remain among the most prevalent and fatal diseases worldwide, and their early detection is a serious challenge. The data used in this study was obtained from the Lung and Colon Cancer Histopathological Images Dataset, which comprises five different classes of image data, namely colon adenocarcinoma, colon normal, lung adenocarcinoma, lung normal, and lung squamous cell carcinoma, split into training (80%), validation (10%), and test (10%) subsets. In this study, we propose the ViT-DCNN (Vision Transformer with Deformable CNN) model, with the aim of improving cancer detection and classification using medical images. Methods: The combination of the ViT’s self-attention capabilities with deformable convolutions allows for improved feature extraction, while also enabling the model to learn both holistic contextual information as well as fine-grained localized spatial details. Results: On the test set, the model performed remarkably well, with an accuracy of 94.24%, an F1 score of 94.23%, recall of 94.24%, and precision of 94.37%, confirming its robustness in detecting cancerous tissues. Furthermore, our proposed ViT-DCNN model outperforms several state-of-the-art models, including ResNet-152, EfficientNet-B7, SwinTransformer, DenseNet-201, ConvNext, TransUNet, CNN-LSTM, MobileNetV3, and NASNet-A, across all major performance metrics. Conclusions: By using deep learning and advanced image analysis, this model enhances the efficiency of cancer detection, thus representing a valuable tool for radiologists and clinicians. This study demonstrates that the proposed ViT-DCNN model can reduce diagnostic inaccuracies and improve detection efficiency. Future work will focus on dataset enrichment and enhancing the model’s interpretability to evaluate its clinical applicability. This paper demonstrates the promise of artificial-intelligence-driven diagnostic models in transforming lung and colon cancer detection and improving patient diagnosis. Full article
(This article belongs to the Special Issue Image Analysis and Machine Learning in Cancers: 2nd Edition)
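The deformable-convolution component can be illustrated with torchvision's DeformConv2d, where a small convolution predicts the sampling offsets; the generic block below shows that pattern and is not the ViT-DCNN model.

```python
# Minimal deformable-convolution block: an auxiliary conv predicts (dy, dx) offsets
# for each kernel position, then DeformConv2d samples at the shifted locations.
import torch
from torch import nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        # one (dy, dx) pair per kernel position, predicted from the input itself
        self.offset = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x):
        return self.deform(x, self.offset(x))

feat = torch.randn(2, 64, 56, 56)            # e.g., features from an earlier stage
out = DeformBlock(64, 128)(feat)             # (2, 128, 56, 56)
```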

32 pages, 6397 KB  
Article
Enhancing YOLO-Based SAR Ship Detection with Attention Mechanisms
by Ranyeri do Lago Rocha and Felipe A. P. de Figueiredo
Remote Sens. 2025, 17(18), 3170; https://doi.org/10.3390/rs17183170 - 12 Sep 2025
Cited by 1
Abstract
This study enhances Synthetic Aperture Radar (SAR) ship detection by integrating attention mechanisms, namely Bi-Level Routing Attention (BRA), the Swin Transformer, and the Convolutional Block Attention Module (CBAM), into state-of-the-art YOLO architectures (YOLOv11 and v12). Addressing challenges like small ship sizes and complex maritime backgrounds in SAR imagery, we systematically evaluate the impact of adding and replacing attention layers at strategic positions within the models. Experiments reveal that replacing the original attention layer at position 4 (C3k2 module) with the CBAM in YOLOv12 achieves optimal performance, attaining an mAP@0.5 of 98.0% on the SAR Ship Dataset (SSD), surpassing baseline YOLOv12 (97.8%) and prior works. The optimized CBAM-enhanced YOLOv12 also reduces computational costs (5.9 GFLOPS vs. 6.5 GFLOPS in the baseline). Cross-dataset validation on the SAR Ship Detection Dataset (SSDD) confirms consistent improvements, underscoring the efficacy of targeted attention-layer replacement for SAR-specific challenges. Additionally, tests on the SADD and MSAR datasets demonstrate that this optimization generalizes beyond ship detection, yielding gains in aircraft detection and multi-class SAR object recognition. This work establishes a robust framework for efficient, high-precision maritime surveillance using deep learning. Full article
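The CBAM swapped into the C3k2 block above is, in its generic form, channel attention followed by spatial attention; the compact sketch below shows that module, not the authors' exact YOLOv12 integration.

```python
# Compact generic CBAM: channel attention (shared MLP over avg/max descriptors)
# followed by spatial attention (7x7 conv over channel-wise avg/max maps).
import torch
from torch import nn

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16, spatial_kernel: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        # channel attention
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # spatial attention
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

y = CBAM(256)(torch.randn(2, 256, 40, 40))
```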
