MDPI - Publisher of Open Access Journals

30 pages, 3147 KB

Open Access Article

AsTexNet: A Discriminative Scale-Aware Hybrid Texture Representation with Compact Embedding

by Vandana Gupta, Ashish Mishra and Nishant Shrivastava

Appl. Sci. 2026, 16(12), 5743; https://doi.org/10.3390/app16125743 - 7 Jun 2026

Viewed by 125

Texture classification remains challenging due to high inter-class similarity, scale variability, and limited labeled data. While handcrafted descriptors capture fine-grained microstructures and CNNs encode global semantics, existing hybrid approaches often rely on direct feature concatenation or heuristic fusion strategies that ignore the varying discriminative importance of texture scales. Moreover, many deep texture representations introduce high-dimensional redundant embeddings and require computationally expensive backbone adaptation, limiting their effectiveness in small-data scenarios. This paper introduces AsTexNet, a discriminative scale-aware hybrid representation framework that reformulates multi-scale feature integration as a structured weighting problem. The proposed method quantifies the class-separability contribution of each handcrafted scale using a data-driven weighting mechanism, enabling adaptive emphasis on informative texture patterns. In parallel, pretrained CNN features are enhanced using Generalized Mean (GeM) pooling and compacted via cross-validated PCA to improve separability while controlling redundancy. A normalized late-fusion formulation integrates local and global cues into a compact 512-dimensional embedding without requiring backbone fine-tuning. Extensive experiments on six benchmark datasets demonstrate consistent performance gains, achieving 100% accuracy on Kylberg and Brodatz, 99.4% on UIUC, 99.0% on UMD, 90.6% on KTH-TIPS2b, and 78.71% on FMD. These results demonstrate that explicitly modeling scale-wise discriminative contributions leads to more effective and compact texture representations, particularly in limited-data scenarios. Full article
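
Generalized Mean (GeM) pooling, named in the abstract above, has a standard closed form: for one channel with activations x_1, ..., x_n, the pooled value is ((1/n) Σ x_i^p)^(1/p). The following is a minimal pure-Python sketch of that formula only. The function names, the default p = 3, and the `eps` clamp are illustrative assumptions, not details taken from the AsTexNet paper.

```python
def gem_pool(activations, p=3.0, eps=1e-6):
    """Generalized Mean pooling over one channel's activations.

    p = 1 reduces to average pooling; large p approaches max pooling.
    Names and defaults are illustrative assumptions.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    # Clamp to a small positive value so fractional powers are well defined.
    clamped = [max(a, eps) for a in activations]
    mean_of_powers = sum(a ** p for a in clamped) / len(clamped)
    return mean_of_powers ** (1.0 / p)


def gem_pool_channels(feature_map, p=3.0):
    """Pool a feature map given as a list of channels (each a flat list)."""
    return [gem_pool(channel, p=p) for channel in feature_map]


if __name__ == "__main__":
    # Two hypothetical channels, each flattened to four spatial positions.
    demo = [[0.1, 0.2, 0.9, 0.3], [1.0, 1.0, 1.0, 1.0]]
    print(gem_pool_channels(demo))
```

With p as a tunable exponent, the same function interpolates between average and max pooling, which is why GeM is often used to sharpen pretrained CNN descriptors.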

(This article belongs to the Special Issue Machine Learning in Computer Vision and Image Processing)

26 pages, 16906 KB

Open Access Article

Linear-Aware Attention: Enhancing Art Style Classification with Structural Edge Priors

by Wanglong Yu and Xuefeng Liu

Electronics 2026, 15(11), 2314; https://doi.org/10.3390/electronics15112314 - 27 May 2026

Viewed by 164

Abstract

While deep learning has achieved impressive success in art style classification, standard convolutional neural networks (CNNs) often exhibit a “texture bias”, prioritizing local brushstrokes and color patterns over the global structural logic essential for stylistic identification. Drawing inspiration from Heinrich Wölfflin’s “Linear and Painterly” theory, we propose the Edge-Guided Spatial Attention Network (ESA-Net) to bridge the gap between feature extraction and aesthetic structure. ESA-Net utilizes a dual-stream architecture that decouples artistic representation into semantic textures and structural contours. At its core, the proposed Edge-Guided Convolutional Block Attention Module (EG-CBAM) treats exogenous edge maps as spatial gates, recalibrating the model’s focus toward salient outlines while suppressing textural noise. The experimental results on the WikiArt dataset demonstrate that ESA-Net achieves a state-of-the-art top-1 accuracy of 69.40%. Qualitative visualizations via Grad-CAM further confirm that our model effectively aligns its decision-making process with the structural layouts favored by human experts, providing a theoretically grounded approach to computational connoisseurship. Full article

(This article belongs to the Section Electronic Multimedia)

25 pages, 14321 KB

Open Access Article

A Woodblock New Year Painting Style Classification Method Based on Structural-Aware Attention and Frequency-Domain Style Statistics

by Hua Wei, Zhihua Diao, Junxiang Diao, Liqin Wen, Binbin Sun, Xiaoxuan Chen and Luping Yin

Electronics 2026, 15(10), 2158; https://doi.org/10.3390/electronics15102158 - 18 May 2026

Viewed by 185

Abstract

To address the problems of subtle style differences, high inter-class similarity, and complex structural and texture features in woodblock New Year paintings, this paper proposes a style classification method for woodblock New Year paintings based on an improved ResNeXt-50. The method introduces SA-CBAM at the middle- and high-level feature stages. Through the synergistic effect of channel attention and edge-enhanced spatial attention, the model is guided to focus on key structural regions such as human contours. Furthermore, single-stage 2D-DWT is introduced to separate deep features into low-frequency global structural components and high-frequency local detail components, thereby enabling effective representation of overall composition information and fine-grained carving textures. The Gram matrix is introduced to conduct statistical modeling of the fusion features, so as to characterize the overall style distribution from the perspective of channel correlation. The model is trained and tested on a dataset of 4043 independent images across six categories, achieving an overall classification accuracy of 97.68%, which is significantly superior to mainstream models such as Vision Transformer. Ablation experiments further verify the complementary effects of each module in structural perception, frequency-domain feature representation, and style statistical modeling, demonstrating the effectiveness and application potential of the proposed method for digital preservation and fine-grained style recognition of woodblock New Year paintings. Full article
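
The Gram-matrix style statistic mentioned in the abstract above captures channel correlation: for C feature channels, each flattened to N spatial positions, entry (i, j) is the inner product of channels i and j. The sketch below is a minimal pure-Python illustration. The function name `gram_matrix` and the division by N are assumptions for illustration; the abstract does not state which normalization the authors use.

```python
def gram_matrix(features):
    """Channel-correlation (Gram) matrix of a feature map.

    features: list of C channels, each a flat list of N activations.
    Returns a C x C nested list with G[i][j] = (1/N) * sum_k F[i][k] * F[j][k].
    The 1/N normalization is an illustrative assumption.
    """
    if not features or not features[0]:
        raise ValueError("features must contain at least one non-empty channel")
    n = len(features[0])
    if any(len(channel) != n for channel in features):
        raise ValueError("all channels must have the same number of positions")
    return [
        [sum(a * b for a, b in zip(ci, cj)) / n for cj in features]
        for ci in features
    ]


if __name__ == "__main__":
    # Two hypothetical channels over two spatial positions.
    print(gram_matrix([[1.0, 2.0], [3.0, 4.0]]))
```

Because the spatial positions are summed out, the result describes which channels co-activate rather than where they activate, which is the property that makes it a style descriptor.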

(This article belongs to the Section Artificial Intelligence)

24 pages, 6147 KB

Open Access Article

Multi-Scale Transformer-Based Neural Architecture Search for Hyperspectral Image Classification

by Aili Wang, Xinyu Liu and Haisong Chen

Remote Sens. 2026, 18(10), 1586; https://doi.org/10.3390/rs18101586 - 15 May 2026

Viewed by 244

Abstract

Hyperspectral image classification (HSIC) is a crucial task for remote sensing applications, requiring accurate pixel-level labeling while effectively capturing both spectral and spatial information. Traditional convolutional neural network architectures often struggle to balance local texture detail and global contextual consistency, and existing neural architecture search (NAS) methods rarely incorporate attention mechanisms, limiting their performance. To address these challenges, this study proposes a multi-scale Transformer-based NAS framework (TR-NAS) for fine-grained hyperspectral image classification. The framework combines local cube sampling, shallow and deep multi-scale convolutions, and a searchable Transformer module that adaptively selects global, local window, and multi-scale attention operators. Lightweight enhanced convolution operators, including dual-gated (DG-Conv) and mixed depthwise (MixConv) convolutions, are incorporated to improve spectral discrimination and scale robustness. Extensive experiments on the PU and Hanchuan datasets demonstrate that TR-NAS achieves superior classification accuracy, stability, and boundary consistency compared to traditional methods and existing NAS architectures, showing improved robustness to spectral similarity and spatial heterogeneity in complex remote sensing scenes. Full article

(This article belongs to the Special Issue Deep Learning for Multi-Sensor Remote Sensing: Advancements in Image Classification and Semantic Segmentation)

24 pages, 17355 KB

Open Access Article

A Deep Feature Approach to Visual Similarity Analysis of Ethnic Brocades in Southwest China

by Quan Li, Huaxing Lu, Shichen Liu, Dengwei Sun and Biao Zhang

Appl. Sci. 2026, 16(10), 4928; https://doi.org/10.3390/app16104928 - 15 May 2026

Viewed by 225

Abstract

Visual similarity analysis of ethnic brocades is valuable for image retrieval, style comparison, and digital archiving in cultural heritage informatics. However, although deep neural networks provide powerful visual representations, their encoded similarity structures are often difficult to interpret. This study presents an interpretable deep feature framework for analyzing inter-ethnic visual similarity in brocade images from ten minority groups in Southwest China. Four convolutional neural network backbones, including AlexNet, VGG-16, ResNet-18, and an SE-enhanced ResNet-18 (SResNet-18), were first evaluated to identify a reliable feature extractor. The best-performing model was then used to construct deep feature-based similarity and distance relationships among ethnic categories. To interpret this structure, five handcrafted descriptor types, namely color, texture, geometric, local-structure, and frequency-domain features, were compared with the deep feature similarity matrix using Spearman correlation analysis and weighted descriptor fusion. Experimental results showed that SResNet-18 achieved the best classification performance, with an accuracy of 95.15% and an F1-score of 95.14%. Among the handcrafted descriptors, color showed the strongest correspondence with the RGB-based deep similarity structure ($r = 0.643$), followed by local-structure descriptors ($r = 0.416$), whereas classical texture descriptors showed near-zero correspondence ($r = -0.063$). The optimal weighted fusion further improved the correlation to $r = 0.731$. These findings suggest that the SResNet-18 feature space is more strongly associated with color composition and local motif organization than with the specific grayscale texture, global geometric, or frequency-domain descriptors used in this study. The proposed framework provides an interpretable approach for understanding deep visual similarity in cultural heritage images and offers methodological support for pattern-based retrieval, comparative style analysis, and digital documentation. Full article
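
The comparison described in the abstract above, correlating a handcrafted-descriptor similarity matrix with a deep-feature similarity matrix, can be illustrated with a Spearman rank correlation over the off-diagonal entries. The sketch below is a minimal pure-Python version under stated assumptions: the helper names are hypothetical, ties receive average ranks, and only the upper triangle of each symmetric matrix is compared. The abstract does not specify how the authors vectorize the matrices.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def upper_triangle(matrix):
    """Off-diagonal upper-triangle entries of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


if __name__ == "__main__":
    # Hypothetical 3-class similarity matrices (not data from the paper).
    deep = [[1.0, 0.9, 0.2], [0.9, 1.0, 0.4], [0.2, 0.4, 1.0]]
    color = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.3], [0.1, 0.3, 1.0]]
    print(spearman(upper_triangle(deep), upper_triangle(color)))
```

Working on ranks rather than raw values means the comparison is insensitive to the different scales of handcrafted and deep similarity scores.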

27 pages, 3832 KB

Open Access Article

DualMambaFormer: A Parallel Hybrid Transformer–Mamba Network for Hyperspectral Image Classification

by Jiang Yu, Jingwei Li, Gan Sun, Jingying Lu, Xuejun Cheng, Ruimeng Zhou, Wei Sun and Xianjun Gao

Remote Sens. 2026, 18(10), 1516; https://doi.org/10.3390/rs18101516 - 11 May 2026

Viewed by 375

Abstract

Hyperspectral image classification (HSIC) plays a crucial role in fine-grained Earth observation tasks. However, balancing efficient long-range dependency modeling with the extraction of fine-grained local features remains a significant challenge, primarily due to the inherent high-dimensional spectral redundancy and complex spatial variability of hyperspectral data. Existing modeling paradigms exhibit distinct limitations: Convolutional Neural Networks (CNNs) are constrained by localized receptive fields, while Vision Transformers (ViTs), despite their global receptive capabilities, incur prohibitive quadratic computational complexity. Meanwhile, the emerging Mamba architecture has demonstrated remarkable effectiveness in sequence modeling with linear complexity, but it often lacks sufficient sensitivity to local textures when directly applied to non-causal 2D images. To address these limitations, this paper proposes a novel parallel hybrid architecture termed DualMambaFormer. Deviating from the traditional serial stacking paradigm, the proposed network utilizes a dual-stream design to achieve the complementary fusion of global static attention and dynamic sequence reasoning. Specifically, the model first employs an SS-ResNet for spectral dimensionality reduction and local feature embedding. Subsequently, the architecture bifurcates into a parallel encoding stage: one branch leverages Multi-Head Self-Attention (MHSA) to capture global spatial correlations, while the other introduces a Local Enhanced Mamba (LEM) branch. By integrating State Space Models (SSM) with depthwise separable convolutions, the LEM branch simultaneously captures long-range causal dependencies and local spatial context. Finally, a dual class token fusion strategy is designed to integrate heterogeneous representations at the decision level. 
Extensive experiments on four benchmark datasets—Indian Pines, Pavia University, Salinas, and WHU-HongHu—show that DualMambaFormer achieves OA values of 96.56%, 98.95%, 97.60%, and 96.09%, respectively, with consistently high AA and Kappa coefficients. These results demonstrate the effectiveness, robustness, and generalization capability of the proposed method for hyperspectral image classification. Compared with the second-best competing methods, DualMambaFormer improves OA by 5.55, 2.30, 1.68, and 4.30 percentage points on the Pavia University, Indian Pines, Salinas, and WHU-HongHu datasets, respectively. Full article
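
The OA, AA, and Kappa figures quoted above are all derived from a confusion matrix. As a reading aid, the sketch below computes the three metrics with their standard definitions: overall accuracy is the fraction of correct samples, average accuracy is the mean per-class recall, and Cohen's kappa corrects overall accuracy for chance agreement. The function name and the example matrix are hypothetical and are not results from the paper.

```python
def classification_scores(confusion):
    """Return (OA, AA, kappa) from a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    Assumes every class has at least one true sample.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))
    row_sums = [sum(row) for row in confusion]
    col_sums = [
        sum(confusion[i][j] for i in range(n_classes)) for j in range(n_classes)
    ]

    overall_accuracy = correct / total
    # Mean of per-class recall, so small classes weigh as much as large ones.
    average_accuracy = (
        sum(confusion[i][i] / row_sums[i] for i in range(n_classes)) / n_classes
    )
    # Agreement expected by chance from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    kappa = (overall_accuracy - expected) / (1.0 - expected)
    return overall_accuracy, average_accuracy, kappa


if __name__ == "__main__":
    # Hypothetical two-class confusion matrix.
    print(classification_scores([[8, 2], [1, 9]]))
```

Reporting all three together matters for hyperspectral benchmarks because class sizes are usually very unbalanced, so OA alone can hide poor performance on rare classes.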

(This article belongs to the Special Issue Advances in Hyperspectral Remote Sensing Image Processing: 2nd Edition)

19 pages, 2758 KB

Open Access Article

Protecting Digital Identities: Deepfake Face Detection Using Dual-Decoder U-Net Semantic Segmentation

by Rodrigo Eduardo Arevalo-Ancona, Manuel Cedillo-Hernandez, Antonio Cedillo-Hernandez and Francisco Javier Garcia-Ugalde

Future Internet 2026, 18(5), 233; https://doi.org/10.3390/fi18050233 - 25 Apr 2026

Viewed by 627

Abstract

Deepfake content forgery compromises the integrity of digital media and the protection of personal identity, making its detection essential for preserving trust and enabling effective forensic analysis. Most deepfake detection approaches focus on global classification with a binary decision, which is inadequate for precise localization of manipulated regions. This limitation becomes particularly evident under image processing distortions. This paper proposes a dual-decoder architecture for the detection and segmentation of original and deepfake facial manipulations. Unlike conventional single-decoder segmentation models, the proposed approach introduces two decoding branches that learn complementary feature representations of authentic and forged facial textures. In addition, attention modules are incorporated to refine encoder features based on decoder context, introducing adaptive feature selection during reconstruction. This architectural design reduces feature interference during reconstruction and enhances the localization of subtle inconsistencies introduced by deepfake manipulations. This approach generates complementary masks for real and forged regions, providing more precise boundary delineation. Experimental results highlight the robustness of the proposed method under image processing distortions, achieving intersection over union (IoU) scores of 0.9387 for real faces and 0.9254 for deepfake segmentation. These results underscore the effectiveness of the dual-decoder architecture in accurately detecting and localizing deepfake facial manipulations. Full article
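
The intersection over union (IoU) scores reported above compare a predicted mask with a ground-truth mask: the number of pixels marked in both, divided by the number marked in either. The sketch below is a minimal pure-Python illustration for binary masks. The function name and the convention of returning 1.0 when both masks are empty are assumptions, not details from the paper.

```python
def mask_iou(pred, truth):
    """Intersection over union of two binary masks given as nested lists.

    Any truthy pixel counts as foreground. Returns 1.0 when both masks are
    empty (an assumed convention; some tools return 0.0 or skip the sample).
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of rows")
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        if len(pred_row) != len(truth_row):
            raise ValueError("mask rows must have the same length")
        for p, t in zip(pred_row, truth_row):
            p, t = bool(p), bool(t)
            intersection += int(p and t)
            union += int(p or t)
    if union == 0:
        return 1.0
    return intersection / union


if __name__ == "__main__":
    # Hypothetical 2 x 2 masks: one shared pixel, two pixels in the union.
    print(mask_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]]))
```

For a dual-mask output such as the one described, the same function would be applied once to the real-region mask and once to the forged-region mask.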

(This article belongs to the Collection Information Systems Security)

23 pages, 9832 KB

Open Access Article

A Fine-Scale Urban Impervious Surface Extraction Method Based on UAV LiDAR and Visible Imagery

by Yanni Bao, Yu Zhao, Shirong Hu, Zhanwei Wang and Hui Deng

Remote Sens. 2026, 18(9), 1275; https://doi.org/10.3390/rs18091275 - 23 Apr 2026

Viewed by 402

Abstract

Accurate extraction of impervious surface areas (ISA) is essential for urban environmental monitoring, yet severe spectral confusion among complex urban land-cover types limits the performance of classifications based solely on optical imagery. To address this issue within a localized context, this study proposes a multi-source framework integrating UAV-based LiDAR (UAV-LiDAR) and high-resolution visible imagery for fine-scale ISA extraction. An improved segmentation optimization strategy, termed EGS-Optimizer, is developed to enhance boundary delineation within the object-based image analysis (OBIA) framework by coupling edge detection with global segmentation quality evaluation. A comprehensive feature set including spectral, index, texture, geometric, and terrain features is constructed, and Shapley Additive Explanations (SHAP) is applied to select the most informative variables while reducing dimensionality. The proposed framework is validated in a typical 1.45 km² built-up area in Deyang City, Sichuan Province. Experimental results demonstrate that, within this specific study area, multi-source data fusion improves classification accuracy by 3.59–5.79% compared with single-source data, while feature selection reduces the feature dimension from 45 to 21. Among the evaluated classifiers, the random forest (RF) model achieves the highest performance, with an overall accuracy of 97.24% (Kappa = 0.96). While the high accuracy highlights the efficacy of synergizing spectral and structural information for micro-landscape mapping, these findings are constrained to the demonstrated fine-scale local environment. The results provide an effective, interpretable solution for detailed neighborhood-level ISA mapping, though further validation is required before the framework can be generalized to larger or more heterogeneous urban scenarios. Full article

(This article belongs to the Special Issue Machine Learning for Feature Extraction and Classification in Remote Sensing Images)

25 pages, 6534 KB

Open Access Article

Spectral–Spatial State Space Model with Hybrid Attention for Hyperspectral Image Classification

by Mengdi Cheng, Haixin Sun, Fanlei Meng, Qiuguang Cao and Jingwen Xu

Algorithms 2026, 19(4), 300; https://doi.org/10.3390/a19040300 - 11 Apr 2026

Viewed by 625

Abstract

Hyperspectral image (HSI) classification requires the extraction of discriminative features from high-dimensional spatial–spectral data. While the Mamba architecture has shown promise in long-sequence modeling with linear complexity, its application to HSI remains constrained by two major hurdles: the unidirectional causal scanning which fails to capture non-causal global dependencies, and the serialization-induced loss of two-dimensional spatial topology and local textures. To overcome these limitations, we propose HAMamba, a novel Hybrid Attention State Space Model. HAMamba facilitates deep representation learning through two core components: a Multi-Scale Dynamic Fusion (MSDF) module and a Hybrid Attention Mamba Encoder (HAME). Specifically, the MSDF module augments spatial perception through parallelized feature extraction and dynamically weighted integration. The HAME synergizes a Bidirectional Sequence Scan Mamba (BSSM) to establish global semantic context and a Spatial–Spectral Gated Attention (SSGA) module to refine local structural details. Comprehensive experiments on four public benchmark datasets demonstrate that the proposed HAMamba significantly outperforms state-of-the-art approaches, achieving a superior balance between classification accuracy and computational efficiency. Full article

(This article belongs to the Section Evolutionary Algorithms and Machine Learning)

34 pages, 10089 KB

Open Access Article

GateProtoNet: A Compute-Aware Two-Stage Hybrid Framework with Prototype Evidence and Faithfulness-Verified Explainability for Wheat and Cotton Leaf Disease Classification

by Muhammad Irfan Sharif, Yong Zhong, Muhammad Zaheer Sajid and Francesco Marinello

AgriEngineering 2026, 8(4), 152; https://doi.org/10.3390/agriengineering8040152 - 10 Apr 2026

Viewed by 654

Abstract

Accurate diagnosis of wheat leaf diseases in real farming conditions requires models that are not only highly accurate but also computationally efficient and interpretable for practical deployment on edge devices. We propose GateProtoNet (GPN), a two-stage, compute-aware, and explainable framework for multi-class leaf disease recognition. Stage-1 performs ultra-light healthy-versus-diseased screening, enabling early exit for healthy samples and substantially reducing average expected inference cost. For diseased samples, Stage-2 applies a novel hybrid backbone featuring a frequency-factorized Discrete Wavelet Transform (DWT) stem, parallel micro-lesion convolutional encoding for fine texture patterns, and a linear token mixer for global context modeling. A cross-gated fusion module adaptively integrates local and global evidence with minimal computational overhead. To ensure trustworthy predictions, GPN introduces a prototype evidence head that performs classification via similarity to learned class prototypes, providing human-interpretable explanations, along with a faithfulness constraint that enforces explanation reliability by measuring confidence degradation under salient region removal. Rigorous evaluation on four publicly available wheat and cotton leaf disease datasets demonstrates that GateProtoNet achieves 99.2% classification accuracy, 99.1% macro-F1 score, and 99.3% AUC, significantly outperforming existing CNN, transformer, and hybrid baselines while requiring substantially fewer parameters and FLOPs. The two-stage inference strategy reduces average computational cost by avoiding full model execution on healthy leaves, enabling real-time, on-device diagnosis for resource-constrained agricultural environments. Full article
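
The cost saving from the two-stage, early-exit design described above follows from simple arithmetic: every sample pays for Stage-1, and only samples that do not exit early also pay for Stage-2. The sketch below works through that expectation. The function name, the cost units, and the example numbers are all hypothetical; the abstract does not report per-stage costs or the fraction of samples that exit early.

```python
def expected_two_stage_cost(stage1_cost, stage2_cost, early_exit_fraction):
    """Average per-sample cost of a two-stage cascade with early exit.

    Every sample runs Stage-1. A fraction `early_exit_fraction` stops there;
    the rest additionally run Stage-2. Costs can be in any consistent unit
    (for example FLOPs or milliseconds).
    """
    if not 0.0 <= early_exit_fraction <= 1.0:
        raise ValueError("early_exit_fraction must lie in [0, 1]")
    return stage1_cost + (1.0 - early_exit_fraction) * stage2_cost


if __name__ == "__main__":
    # Hypothetical numbers: Stage-1 costs 1 unit, Stage-2 costs 10 units,
    # and 60% of samples exit after Stage-1.
    cascade = expected_two_stage_cost(1.0, 10.0, 0.6)
    always_full = expected_two_stage_cost(1.0, 10.0, 0.0)
    print(cascade, always_full)
```

The formula also shows the limit of the approach: if few samples exit early, the cascade costs slightly more than running the full model alone, because Stage-1 is pure overhead for those samples.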

19 pages, 1666 KB

Open Access Article

MTLL: A Novel Multi-Task Learning Approach for Lymphocytic Leukemia Classification and Nucleus Segmentation

by Cuisi Ou, Zhigang Hu, Xinzheng Wang, Kaiwen Cao and Yipei Wang

Electronics 2026, 15(7), 1419; https://doi.org/10.3390/electronics15071419 - 28 Mar 2026

Viewed by 397

Abstract

Bone marrow cell classification and nucleus segmentation in microscopic images are fundamental tasks for computer-aided diagnosis of lymphocytic leukemia. However, bone marrow cells from different subtypes exhibit high morphological similarity, and structural information is often constrained under optical microscopic imaging, posing challenges for stable and effective feature representation. To address this issue, we propose MTLL (Multitask Model on Lymphocytic Leukemia), a novel multitask approach that performs cell classification and nucleus segmentation within a unified network to exploit their complementary information. The model constructs a hybrid backbone for shared feature representation based on a CNN-Transformer architecture, in which Fuse-MBConv modules are tightly integrated with multilayer multi-scale transformers to enable deep fusion of local texture and global semantic information. For the segmentation branch, we design an AM (Atrous Multilayer Perceptron) decoder that combines atrous spatial pyramid pooling with multilayer perceptrons to fuse multi-scale information and accurately delineate nucleus boundaries. The classification branch incorporates prior knowledge of cell nuclei structures to capture subtle variations in cellular morphology and texture, thereby enhancing the model’s ability to distinguish between leukemia subtypes. Experimental results demonstrate that the MTLL model significantly outperforms existing advanced single-task and multi-task models in both lymphocytic leukemia classification and cell nucleus segmentation. These results validate the effectiveness of the multi-task feature-sharing strategy for lymphocytic leukemia diagnosis using bone marrow microscopic images. Full article

27 pages, 9112 KB

Open Access Article

MSWKN: Multi-Scale Wavelet Kolmogorov–Arnold Network with Spectral–Spatial and Frequency Domain Optimization for Hyperspectral Crop Classification

by Ziwei Li, Bingjie Liang, Weizhen Zhang, Zhenqiang Xu, Baowei Zhang, Ning Li, Weiran Luo and Jianzhong Guo

Agriculture 2026, 16(7), 740; https://doi.org/10.3390/agriculture16070740 - 27 Mar 2026

Viewed by 545

Abstract

Accurate crop classification provides fundamental data for agricultural resource management and ecological research. Hyperspectral image (HSI) classification is the core technique for achieving precise crop mapping. However, existing models often suffer from excessive parameters, limited robustness under few-shot conditions, and a trade-off between efficiency and robustness. To address these issues, this paper proposes a Multi-Scale Wavelet Kolmogorov–Arnold Network (MSWKN). The model employs a Two-Branch Feature Extractor (TBFE) to capture both spectral correlations and spatial textures; a Channel Cross-Spatial (CCS) module to suppress background clutter and highlight discriminative regions; a group convolution-based Fixed Wavelet Multi-Scale Convolutional Layer (FW-MSCL) that leverages the time–frequency localization of wavelets and learnable linear combinations to enhance robustness against spectral distortion while reducing parameters; and a Fourier-based Transformer encoder to enable global frequency–space modeling. Experiments on the WHU-Hi-HanChuan and WHU-Hi-HongHu hyperspectral crop datasets show that MSWKN achieves high overall accuracy and performs favorably on few-shot categories. Under lower parameter counts and fast inference conditions, the model demonstrates a reasonable trade-off between accuracy and computational efficiency. Ablation studies and wavelet kernel comparisons further confirm the contribution of each module and the advantage of the wavelet. The proposed framework provides an efficient and robust solution for fine-grained hyperspectral crop classification. Full article

(This article belongs to the Section Artificial Intelligence and Digital Agriculture)

22 pages, 2186 KB

Open Access Article

ConvDeiT-Tiny: Adding Local Inductive Bias to DeiT-Ti for Enhanced Maize Leaf Disease Classification

by Damaris Waema, Waweru Mwangi and Petronilla Muriithi

Plants 2026, 15(6), 982; https://doi.org/10.3390/plants15060982 - 23 Mar 2026

Viewed by 605

Abstract

Reliable identification of maize leaf diseases is critical for mitigating crop losses, particularly in regions where farmers have limited access to experts. Although vision transformers (ViTs) have recently demonstrated strong performance in image recognition, their weak inductive bias and limited modeling of local texture patterns make them non-ideal for fine-grained maize leaf disease classification. To address these limitations, we propose ConvDeiT-Tiny, a lightweight hybrid ViT that improves DeiT-Ti by placing depthwise convolutions in parallel with multi-head self-attention modules in the first three transformer blocks. The local and global features captured by the convolution and attention modules are concatenated along the embedding dimension and fused using a multilayer perceptron. This results in richer token representations without significantly increasing model size. Across three datasets, ConvDeiT-Tiny (6.9 M parameters) consistently outperformed DeiT-Ti, DeiT-Ti-Distilled, and DeiT-S (21.7 M parameters) when trained from scratch. With transfer learning, ConvDeiT-Tiny achieved an accuracy of 99.15%, 99.35%, and 98.60% on the CD&S, primary, and Kaggle datasets, respectively, surpassing many previous studies with far fewer parameters. For explainability, we present gradient-weighted transformer attribution visualizations showing the disease lesions driving model predictions. These results indicate that injecting local inductive bias in early transformer blocks is beneficial for accurate maize leaf disease classification. Full article

(This article belongs to the Special Issue AI-Driven Machine Vision Technologies in Plant Science)

28 pages, 43592 KB

Open AccessArticle

TreeSpecViT: Fine-Grained Tree Species Classification from UAV RGB Imagery for Campus-Scale Human–Vegetation Coupling Analysis

by Yinghui Yuan, Yunfeng Yang, Zhulin Chen and Sheng Xu

Remote Sens. 2026, 18(6), 928; https://doi.org/10.3390/rs18060928 - 18 Mar 2026

Cited by 1 | Viewed by 521

Abstract

On university campuses, trees and green spaces shape how students and staff move and use outdoor spaces. To support planning, tree species information is needed at the level of individual trees. Tree species classification from UAV RGB imagery remains difficult in complex campus scenes because roads, buildings, shadows, and subtle interspecies differences degrade recognition. To address background interference, the loss of subtle fine-grained cues before tokenization, and insufficient local structure modeling in lightweight transformer-based classification, we propose TreeSpecViT for tree species classification. It uses a MobileViT backbone and a Background Suppression Module (BSM) to reduce clutter from non-canopy regions. A Fine-Grained Feature Guidance (FGF) module is inserted before the unfold operation to enhance canopy details and guide tokenization toward key regions. $1 \times 1$ convolutional neck layers align channels, and a Global and Local Fusion (GLF) module jointly models overall crown semantics and local textures for species recognition. From the predicted masks and species labels, we build an individual tree digital archive. The archive stores per-tree geometric attributes and can be linked with grids of campus activity intensity to analyze how activity patterns relate to vegetation structure. TreeSpecViT achieves an Accuracy of 87.88% (+6.06%) and an F1 score of 76.48% (+5.08%) on the SZUTreeDataset. On our self-constructed NJFUDataset, it reaches 76.30% (+5.10%) in Accuracy and 70.10% (+7.20%) in F1. These results surpass mainstream models. Ablation experiments show that the modules jointly reduce background clutter and enhance canopy features. Overall, TreeSpecViT supports campus-scale analyses that link human activity intensity to vegetation patterns and provides a practical basis for planning and adjusting campus green spaces.
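
As a brief illustration of how a 1 × 1 convolution can align channels, the sketch below treats it as the same linear map applied independently at every pixel. The channel counts, array layout, and the function name `conv1x1` are hypothetical; the abstract does not give the neck's actual configuration.

```python
import numpy as np

def conv1x1(feature_map, weight, bias=None):
    """Apply a 1x1 convolution.
    feature_map: (C_in, H, W); weight: (C_out, C_in); bias: (C_out,) or None.
    Spatial positions are untouched; only the channel dimension is remapped."""
    out = np.einsum("oc,chw->ohw", weight, feature_map)
    if bias is not None:
        out = out + bias[:, None, None]
    return out

# Hypothetical example: map 3 input channels to 5 output channels.
x = np.arange(3 * 2 * 2, dtype=float).reshape(3, 2, 2)
w = np.ones((5, 3))
y = conv1x1(x, w)
print(y.shape)  # (5, 2, 2)
```

Because the kernel covers a single pixel, the operation changes the number of channels without mixing information between neighboring positions, which is why it is commonly used to match feature widths between modules.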

28 pages, 12746 KB

Open AccessArticle

PSTNet: A Hyperspectral Image Classification Method Based on Adaptive Spectral–Spatial Tokens and Parallel Attention

by Shaokang Yu, Yong Mei, Xiangsuo Fan, Song Guo, Wujun Xu and Jinlong Fan

Remote Sens. 2026, 18(6), 901; https://doi.org/10.3390/rs18060901 - 15 Mar 2026

Viewed by 604

Abstract

Hyperspectral image classification has important applications across multiple domains because hyperspectral images carry rich spectral and spatial information. However, it faces challenges such as spectral variation within the same object, spectral variation across different objects, and noise interference. Existing methods such as convolutional neural networks perform well in local feature extraction but model long-range dependencies inadequately. Although Transformers can capture global relationships, they struggle to coordinate spectral and spatial information modeling effectively. To address these limitations, this paper proposes a dual-branch collaborative Transformer network (PST-Net). This architecture integrates an adaptive spectral–spatial token (ASST) module, a Parallel Attention-Augmented lightweight CNN branch (PA-SSCNN), and a collaborative fusion layer. The ASST constructs joint representation tokens through local spectral smoothing and learnable spatial embedding. PA-SSCNN employs 3D-2D cascaded convolutions and channel–spatial attention mechanisms to enhance local texture and spatial feature extraction; CHIB enables deep interaction and synergistic fusion of dual-branch features across different levels and scales. Experimental results demonstrate that, with only 2% labeled samples, PST-Net achieves overall classification accuracies of 96.31%, 96.59%, 95.27%, and 89.06% on the Salinas and Whuhh datasets and the two complex urban scene datasets, Qingyun and Houston, respectively. It is particularly robust in fine-grained categories and complex scenes. The ablation experiments further validate the effectiveness and complementarity of each module. This study provides an efficient collaborative modeling framework for hyperspectral image classification that balances global dependencies and local details.

(This article belongs to the Special Issue Multi-Task Remote Sensing Image Analysis: Classification, Segmentation, and Change Detection)
