Transformer-Based Multi-Class Classification of Bangladeshi Rice Varieties Using Image Data

Tabassum, Israt; Nunavath, Vimala

doi:10.3390/app16031279

by Israt Tabassum * and Vimala Nunavath

Department of Science and Industry Systems, University of South-Eastern Norway, 3616 Kongsberg, Norway

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(3), 1279; https://doi.org/10.3390/app16031279

Submission received: 10 December 2025 / Revised: 19 January 2026 / Accepted: 23 January 2026 / Published: 27 January 2026

(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Rice (Oryza sativa L.) is a staple food for over half of the global population, with significant economic, agricultural, and cultural importance, particularly in Asia. Thousands of rice varieties exist worldwide, differing in size, shape, color, and texture, making accurate classification essential for quality control, breeding programs, and authenticity verification in trade and research. Traditional manual identification of rice varieties is time-consuming, error-prone, and heavily reliant on expert knowledge. Deep learning provides an efficient alternative by automatically extracting discriminative features from rice grain images for precise classification. While prior studies have primarily employed deep learning models such as CNN, VGG, InceptionV3, MobileNet, and DenseNet201, transformer-based models remain underexplored for rice variety classification. This study addresses this gap by applying two transformer-based models, the Swin Transformer and the Vision Transformer (ViT), to multi-class classification of rice varieties using the publicly available PRBD dataset from Bangladesh. Experimental results demonstrate that the ViT model achieved an accuracy of 99.86% with precision, recall, and F1-score all at 0.9986, while the Swin Transformer model obtained an accuracy of 99.44% with a precision of 0.9944, recall of 0.9944, and F1-score of 0.9943. These results highlight the effectiveness of transformer-based models for high-accuracy rice variety classification.

Keywords: Bangladeshi rice; PRBD; deep learning; Swin Transformer; Vision Transformer (ViT); multi-class classification

1. Introduction

Rice is the primary dietary staple for more than half of the world’s population and remains a cornerstone of global food security, particularly in Asia [1]. As a major source of carbohydrates and energy, rice underpins not only human nutrition but also the livelihoods of millions of farmers across developing and developed economies. Beyond consumption, rice quality and varietal identity are critical for seed certification, breeding programs, yield optimization, market pricing, and international trade. However, rice grains exhibit substantial intra- and inter-varietal variability due to genetic diversity, environmental conditions, and post-harvest processing, making accurate variety identification a persistent challenge [2].

Bangladesh is one of the world’s leading rice-producing countries, where rice holds exceptional economic, agricultural, and cultural significance. The country cultivates a wide range of indigenous and high-yielding rice varieties adapted to diverse agroecological conditions across the Aman, Aus, and Boro seasons [3]. Accurate identification of Bangladeshi rice varieties is essential for ensuring varietal purity, maintaining quality standards, supporting breeding initiatives, and preventing economic losses caused by mislabeling or adulteration. Nevertheless, many local varieties share subtle visual similarities in grain shape, size, color, and texture, making reliable multi-class classification particularly difficult.

Traditional approaches to rice variety identification, such as manual visual inspection, morphometric analysis, and biochemical or genetic testing, are often labor-intensive, time-consuming, costly, and susceptible to subjective bias [4]. These limitations restrict their scalability and practical applicability, especially in resource-constrained agricultural settings. As a result, there has been growing interest in automated, image-based classification techniques that can provide fast, objective, and reproducible results.

Recent advances in deep learning have significantly improved image-based agricultural analysis, including grain classification, seed identification, and crop disease detection [5]. Convolutional Neural Networks (CNNs) have been widely adopted for such tasks and have demonstrated promising performance [6]. However, CNN-based models often struggle to capture long-range spatial dependencies and fine-grained visual distinctions among highly similar classes, an issue that is particularly pronounced in multi-class rice variety classification. Transformer-based architectures, such as the Vision Transformer (ViT) and Swin Transformer, address these limitations by leveraging self-attention mechanisms that model global contextual relationships within images, leading to superior performance in complex visual recognition tasks [7,8].

Despite the rapid adoption of transformer models in general computer vision and agricultural applications, their use in rice variety classification, particularly for Bangladeshi rice, remains largely unexplored. Existing studies on the PRBD dataset [9] have primarily focused on conventional machine learning or CNN-based approaches, leaving a clear research gap in evaluating the effectiveness of modern transformer-based architectures on this dataset. To the best of our knowledge, this study represents the first systematic application of Vision Transformer and Swin Transformer models for multi-class classification of Bangladeshi rice varieties using the PRBD dataset.

The objective of this research is to develop an automated and robust rice variety classification system tailored to Bangladeshi rice grains. This study represents the first application of transformer-based deep learning models, specifically ViT and Swin Transformer, to the PRBD dataset, enabling accurate discrimination of rice varieties based on subtle visual and structural characteristics. In doing so, the work seeks to advance the existing literature by demonstrating the advantages of transformer architectures over traditional CNN-based approaches for fine-grained agricultural image classification. Ultimately, this research contributes to the digital transformation of agricultural quality assessment in Bangladesh by providing an efficient, scalable, and reliable solution for rice variety identification.

Accordingly, this study addresses the following research questions:

How effectively can transformer-based deep learning models classify multiple Bangladeshi rice varieties using image data?
Which transformer architecture provides the highest accuracy and robustness for rice variety classification on the PRBD dataset?

Overall, this study presents a comprehensive approach to classifying rice varieties with transformer-based models. The proposed method is evaluated against state-of-the-art models to demonstrate its efficacy. The rest of this paper is organized as follows: Section 2 reviews related work on rice classification and image-based grain analysis; Section 3 describes the dataset and methodology used in this research; Section 4 presents the experimental results and performance evaluation; Section 5 discusses the findings; and Section 6 concludes the study.

2. Literature Review

Significant research has explored the application of deep learning (DL) algorithms for rice type classification during the years 2020 to 2025. This section presents existing studies that focus on multi-class rice classification, the DL models employed, the datasets used, and their obtained results. According to the analysis of previous works in Table 1, recent studies have made significant progress in rice variety classification using both traditional machine learning (ML) and modern DL techniques. These studies utilized diverse datasets, including publicly available Kaggle datasets, private collections, and custom datasets from different regions of Bangladesh. The main goal has been to enhance multi-class rice classification accuracy.

Early ML-based approaches primarily used morphological features of rice grains. For example, Ref. [11] applied Support Vector Machines (SVMs), Random Forests (RFs), Logistic Regression (LR), Decision Trees (DTs), Gaussian Naive Bayes (GNB), and K-Nearest Neighbors (K-NN) on a Kaggle (https://www.kaggle.com/datasets/muratkokludataset/rice-image-dataset, accessed on 22 January 2026) rice image dataset. Among these, K-NN achieved the highest accuracy of 97.8%, followed closely by RF and DT. Similarly, Ref. [13] classified three red rice varieties using size, shape, and texture features, where K-NN reached 98.67% and SVM reached 97.34%.

Deep learning methods have demonstrated superior performance due to their automatic feature extraction capabilities. Ref. [10] proposed a hybrid CNN–Transformer model to classify 20 and 5 rice varieties across two datasets, achieving accuracies of 99.6% and 100%, respectively. Other CNN-based studies include [12], which classified four similar-looking Sona-Masuri varieties with 87.5% accuracy, and [15], combining CNN with Random Forest to classify five rice varieties, achieving 94.87%.

Transfer learning has been extensively applied to boost performance. Ref. [14] evaluated several pre-trained models, including MobileNet, ResNet50, VGG16, and their proposed Deep Rice Transfer (DRT) model, for classifying 10 rice varieties. VGG16 achieved the highest accuracy of 99.47%, followed by MobileNet (98.94%) and DRT (98.44%). Similarly, Ref. [17] combined deep transfer learning features with ML classifiers, achieving over 99% accuracy using VGG19 and InceptionV3 features.

Lightweight and custom-designed architectures have also been explored. Ref. [16] proposed OpLW-CNN for five-class rice recognition, attaining 98.14% accuracy. RiceNet, introduced by [18], achieved 94% accuracy on a custom dataset collected from SKUAST. Moreover, Ref. [5] applied DenseNet201 on 10 PRBD rice varieties, achieving 93% on the original dataset and 94% on the augmented dataset, demonstrating the benefits of data augmentation.

Comparative studies such as [19,20] evaluated multiple DL architectures on local Bangladeshi datasets. Ref. [19] reported a hybrid CNN + SVM model reaching 99% accuracy for five Southern Bangladesh rice varieties, outperforming VGG16 (95%) and MobileNetV2 (93%). Ref. [20] tested VGG16, MobileNet, InceptionV3, InceptionResNet, and Xception, achieving up to 96% individually and 98% using an ensemble.

Recent advancements in rice classification have shifted toward more optimized and hybrid deep learning architectures. Enhanced residual networks, such as deeper or fine-tuned ResNet variants, improve feature reuse and gradient flow, enabling more robust learning compared to conventional CNNs. However, their increased depth often results in higher computational cost. EfficientNet models, which employ compound scaling to balance network depth, width, and input resolution, have recently gained attention in agricultural image classification due to their superior accuracy–efficiency trade-off, particularly for resource-constrained applications.

Moreover, hybrid CNN–Transformer architectures represent a notable advancement by integrating CNN-based local feature extraction with Transformer-based global attention mechanisms. These models have demonstrated superior performance in distinguishing visually similar rice varieties by capturing long-range dependencies and contextual relationships that traditional CNNs may overlook. Despite their high accuracy, CNN–Transformer models often require greater computational resources, highlighting an ongoing trade-off between performance and efficiency in modern rice classification systems.

Overall, the literature indicates that deep learning, transfer learning, and ensemble methods significantly improve rice classification performance. While traditional ML remains effective for smaller and less complex datasets, modern CNN-based, EfficientNet, and CNN–Transformer architectures consistently outperform earlier approaches in complex multi-class scenarios.

Research Gap and Motivation

After reviewing the related literature, it is evident that only one study has so far utilized the PRBD dataset [5], where DenseNet201 was employed for rice variety classification. To the best of our knowledge, no existing work has explored Transformer-based architectures such as Vision Transformer (ViT) or Swin Transformer on this dataset. Therefore, this study investigates the performance of two Transformer-based models, Swin Transformer and ViT, on the PRBD dataset. The proposed work focuses on a comparative analysis using accuracy, precision, recall, and F1-score as evaluation metrics.

3. Methodology

This section outlines the detailed steps followed to obtain the results. First, the data collection and description are presented, followed by image preprocessing. In the third step, the selection and implementation of deep learning models, along with hyperparameter optimization and performance evaluation, are described.

Figure 1 illustrates the overall experimental procedure of this research. After collecting the data, the next step involves data processing, followed by splitting the dataset into training, testing, and validation sets. Subsequently, we apply deep learning models such as Swin Transformer and ViT to the training data, and finally, we evaluate the outcomes using performance metrics.

3.1. Data Collection and Description

In this study, a publicly available dataset called PRBD was used, which was downloaded from Mendeley Data [9]. The dataset consists of 2000 carefully selected rice kernel images spanning ten distinct Bangladeshi rice varieties: Aush, Beroi, BR-28, BR-29, Ghee Bhog, Katari Nazir, Katari Siddho, Swarna, Miniket, and Chinigura.

As reported by the authors of the PRBD dataset [5,9], the rice kernels were collected from local markets in Karwan Bazar, Dhaka, and imaged using an HP Wide Vision HD digital camera (HP Inc., Palo Alto, CA, USA) configured as a microscope camera under controlled conditions. The imaging setup included 1000× magnification, eight white LED lights for consistent illumination, and a fixed imaging distance to ensure reproducibility. Image acquisition was performed using Pluggable Digital Viewer software (Digital Viewer 3.1.07, Pluggable Technologies, Redmond, WA, USA) at a resolution of 640 × 480 pixels, 120 × 120 dpi, and 30 frames per second. The images were stored in JPG format, resulting in a total of 2000 original images.

The dataset was organized into two main directories, Original_images and Augmented_images, each containing ten sub-folders corresponding to the rice categories. Table 2 presents the number of images per variety; the images are equally distributed across the ten classes. More details about the data collection and augmentation process can be found in [5].

Ethical Approval for the Dataset

The PRBD dataset complies with ethical standards as outlined in its original publication [5]. The study does not involve animal experimentation, human subjects, or data obtained from social media platforms [9].

3.2. Data Processing

The dataset package included both the original images and an additional set of 8000 augmented samples provided by Tahsin et al. [5]. After downloading the original and augmented data, we split the original images into training, validation, and test sets in a 60–20–20 ratio.

Data augmentation was applied to the training set to increase diversity and improve model generalization. Specifically, the augmented samples were generated using 30° rotations, approximately 20% shear transformations, and horizontal and vertical flips. By introducing these variations, the model is exposed to different perspectives of the same images, reducing the likelihood of memorizing specific patterns and overfitting to the training data. This is particularly important given the near-perfect accuracy observed in early experiments, which can indicate overfitting. Augmentation thus helps the model learn more generalizable features, improving performance on unseen validation and test data. This resulted in 8000 images in total.

Data Splitting

The preprocessed PRBD dataset was split per class into training, validation, and test sets using a 60–20–20 ratio. Data augmentation was applied only to the training set, combining the original training images with 8000 augmented samples to increase diversity. The resulting numbers per class are shown in Table 3. Minor variations in the number of images across classes arise from rounding during the split to maintain integer counts and class balance. These small differences do not significantly affect model training or evaluation. All columns indicate the number of images per subset.
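The per-class 60–20–20 split described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `stratified_split`, the random seed, and the synthetic label list (ten classes of 200 images each) are assumptions introduced here.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.6, 0.2, 0.2), seed=42):
    """Split sample indices class by class into train/val/test index lists."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    train, val, test = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = round(ratios[0] * len(idxs))
        n_val = round(ratios[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # the remainder absorbs any rounding
    return train, val, test


# Hypothetical labels: 10 classes with 200 original images each.
labels = [c for c in range(10) for _ in range(200)]
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Splitting inside each class keeps the class proportions the same in all three subsets, and because the three index lists are disjoint, augmenting only the training indices cannot leak transformed copies of validation or test images into training.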

3.3. Model Architecture

In this section, we present the architectures of the Swin Transformer and Vision Transformer (ViT) models and explain how each component is utilized for classifying rice varieties in the PRBD dataset.

3.4. Swin Transformer Architecture

In this study, the first deep learning model employed was the Swin Transformer (microsoft/swin-tiny-patch4-window7-224) [21], implemented using the Hugging Face Transformers library (v4.38.0). Figure 2 illustrates the Swin Transformer architecture used in this research.

The input layer received images of size 224 × 224 pixels, which were divided into non-overlapping 4 × 4 patches. Each patch was projected into a feature embedding vector, forming the initial input sequence of tokens. The embeddings passed through four hierarchical stages, each consisting of multiple Swin Transformer blocks. Within each block, shifted window-based multi-head self-attention was applied, allowing local attention computation while enabling cross-window interaction through alternating window shifts. Patch merging layers between stages progressively reduced the spatial resolution while increasing the channel dimensionality, capturing both local details and broader contextual information.
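The hierarchical token layout described above can be followed with a little arithmetic. The sketch below assumes the published Swin-Tiny configuration (base embedding width 96, window size 7), which is implied by the checkpoint name but not stated in the text; the function name `swin_stage_shapes` is hypothetical.

```python
def swin_stage_shapes(image_size=224, patch_size=4, embed_dim=96,
                      window_size=7, num_stages=4):
    """Return (tokens_per_side, channels, num_windows) for each stage."""
    side = image_size // patch_size  # 224 / 4 = 56 tokens per side
    channels = embed_dim             # assumed Swin-Tiny base width
    stages = []
    for _ in range(num_stages):
        windows = (side // window_size) ** 2
        stages.append((side, channels, windows))
        side //= 2      # patch merging halves each spatial dimension
        channels *= 2   # and doubles the channel dimension
    return stages


for i, (side, ch, win) in enumerate(swin_stage_shapes(), start=1):
    print(f"stage {i}: {side}x{side} tokens, {ch} channels, {win} windows")
```

Under these assumptions the first stage holds 56 × 56 = 3136 tokens split into 64 windows of 7 × 7 tokens, while the last stage holds 7 × 7 tokens in a single window, so attention in that stage effectively spans the whole image.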

The model utilized pretrained ImageNet-1k weights provided by the Hugging Face library for all layers except the classification head, which was replaced to match the number of classes in the PRBD dataset. This transfer learning setup allows efficient feature extraction while adapting to the specific grain classification task.

3.5. Vision Transformer (ViT) Architecture

The second deep learning model employed was Vision Transformer (ViT) (google/vit-base-patch16-224) [22], implemented using the Hugging Face Transformers library (v4.38.0). Figure 3 provides a schematic overview of the ViT architecture used in this study.

The input images had dimensions 224 × 224 pixels and were divided into 16 × 16 non-overlapping patches. Each patch was flattened and linearly projected into embedding vectors of dimension 768. A class token was prepended to the sequence of patch embeddings, and positional embeddings were added to retain spatial information.
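As a quick check of the sequence layout described above, the following sketch computes the number of patches and tokens. It assumes three-channel RGB input, which the text does not state explicitly; `vit_token_layout` is a hypothetical helper name.

```python
def vit_token_layout(image_size=224, patch_size=16, channels=3):
    """Return (patches_per_side, num_patches, flat_patch_dim, sequence_length)."""
    patches_per_side = image_size // patch_size         # 224 / 16 = 14
    num_patches = patches_per_side ** 2                 # 14 * 14 = 196
    flat_patch_dim = patch_size * patch_size * channels # pixels per flattened patch
    sequence_length = num_patches + 1                   # plus the class token
    return patches_per_side, num_patches, flat_patch_dim, sequence_length


print(vit_token_layout())
```

Each image therefore becomes a sequence of 197 tokens. With RGB input, a flattened 16 × 16 patch happens to contain 768 values, the same size as the embedding dimension, although the linear projection is still a learned mapping.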

The sequence was processed through 12 Transformer encoder layers, each containing multi-head self-attention and feed-forward sub-layers, followed by layer normalization and residual connections. The final output corresponding to the class token was passed through a linear classification head, which was replaced to match the number of grain categories in the dataset. The model leveraged pretrained ImageNet-21k weights provided by the Hugging Face library, enabling transfer learning and efficient adaptation to the PRBD dataset while maintaining high classification performance.

3.6. Experimental Setup and Hyperparameter Tuning

To enhance the performance of both models on the public PRBD dataset, a common hyperparameter configuration was used. Both models were trained for a maximum of 50 epochs, with early stopping applied if the validation loss did not improve for three consecutive epochs, which also helps prevent overfitting. A learning rate of $5 \times 10^{-5}$ was used to ensure stable adaptation of the pretrained weights while maintaining convergence speed. The batch size was set to 20 to balance GPU memory usage and gradient estimation quality. The AdamW optimizer provided weight decay regularization, and a StepLR scheduler decayed the learning rate by a multiplicative factor of 0.9 after each epoch. For multi-class classification, the cross-entropy loss nn.CrossEntropyLoss (PyTorch version 2.10) was employed; like sparse categorical cross-entropy in other frameworks, it computes the loss directly from integer class labels without requiring one-hot encoding. The classification heads were adapted to the dataset’s class count, and evaluation included both overall and per-class accuracy metrics to comprehensively assess performance across all rice grain types.
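To make the schedule and the loss concrete, the sketch below reproduces both in plain Python. It is an illustration under two assumptions, not the training code: the decay is taken to be a multiplication by 0.9 after every epoch, and the loss is written out for a single sample. The helper names `step_lr` and `cross_entropy` are hypothetical.

```python
import math


def step_lr(initial_lr=5e-5, gamma=0.9, epochs=5):
    """Learning rate in each epoch when it is multiplied by gamma after every epoch."""
    return [initial_lr * gamma ** epoch for epoch in range(epochs)]


def cross_entropy(logits, target):
    """Cross-entropy for one sample, given raw logits and an integer class label."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]  # equals -log softmax(logits)[target]


print(step_lr())                      # learning rates for the first five epochs
print(cross_entropy([0.0] * 10, 3))   # uniform logits over 10 classes
```

With uniform logits over ten classes the loss equals ln 10, roughly 2.30, which is a useful reference value for a freshly initialized ten-class head. The integer label is used only to select one logit, which is why no one-hot encoding is needed.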

3.7. Model Performance Evaluation

To assess the performance of the proposed models, 20% of the total dataset was held out for testing purposes. During training, early stopping with a patience value of three epochs was employed to minimize overfitting. This technique ensures that the training process terminates automatically if the validation performance does not improve for three consecutive epochs, thus promoting better model generalization. After training completion, the reserved test set was utilized to evaluate the final model performance.
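The early-stopping rule with a patience of three epochs can be sketched as below. The class name `EarlyStopping` and the validation-loss values are hypothetical, and the sketch treats a tie with the best loss as "no improvement", which is one reasonable reading of the rule.

```python
class EarlyStopping:
    """Signal a stop once the validation loss fails to improve `patience` times in a row."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical validation losses, one per epoch.
val_losses = [0.50, 0.30, 0.25, 0.26, 0.27, 0.25, 0.20]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at)
```

In this example the best loss is reached in epoch 3, the next three epochs do not improve on it, and training stops after epoch 6 without ever seeing the lower value in epoch 7. In practice the weights from the best epoch are usually the ones kept for testing.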

A set of comprehensive evaluation metrics, namely the confusion matrix, accuracy, precision, recall, and F1-score [7], were computed to measure the predictive effectiveness of the models. We also employed the saliency maps technique [23] to provide visual explanations for the test images.

3.7.1. Confusion Matrix

The confusion matrix summarizes the classification performance of the model across all ten classes (0–9). Each row represents the actual class, while each column corresponds to the predicted class. The diagonal elements indicate the number of correctly classified instances for each class (true positives), and the off-diagonal elements represent misclassifications, showing how often samples from one class were incorrectly predicted as another. This matrix provides a detailed overview of the model’s strengths and weaknesses in distinguishing between all classes [7].
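A minimal sketch of how such a matrix is assembled from true and predicted labels is shown below, using the same convention as above (rows are actual classes, columns are predicted classes). The three-class label lists are made up for illustration.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Count samples so that matrix[actual][predicted] holds the number of occurrences."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


# Hypothetical labels for a three-class problem.
y_true = [0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 1, 2, 2]
cm = confusion_matrix(y_true, y_pred, 3)
for row in cm:
    print(row)
```

Here the single off-diagonal entry in row 1, column 2 records one class-1 sample that was predicted as class 2.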

3.7.2. Accuracy

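For multi-class classification, accuracy represents the ratio of correctly predicted instances to the total number of instances in the dataset [24]. It can be calculated using Equation (1), where i denotes the class index, N = 10 is the total number of classes, and TP(i) and FN(i) are the true positives and false negatives of class i. Because every instance of class i is either a true positive or a false negative for that class, the denominator equals the total number of instances:

Accuracy = \frac{\sum_{i = 0}^{9} T P (i)}{\sum_{i = 0}^{9} (T P (i) + F N (i))}

(1)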

3.7.3. Precision

Precision measures how effectively the model avoids false-positive predictions for each class [24]. For a given class i, it is defined as the proportion of true-positive predictions out of all instances predicted as that class:

Precision (i) = \frac{T P (i)}{T P (i) + F P (i)}

(2)

The macro-averaged precision, which provides an overall precision score across all classes, is expressed as [7]:

MacroPrecision = \frac{1}{10} \sum_{i = 0}^{9} \frac{T P (i)}{T P (i) + F P (i)}

(3)

3.7.4. Recall

Recall quantifies the model’s ability to correctly identify all relevant instances for each class [24]. It is defined as the proportion of true-positive predictions relative to the total number of actual positives for that class:

Recall (i) = \frac{T P (i)}{T P (i) + F N (i)}

(4)

The macro-averaged recall, obtained by averaging recall values across all classes, is given by [7]:

MacroRecall = \frac{1}{10} \sum_{i = 0}^{9} \frac{T P (i)}{T P (i) + F N (i)}

(5)

3.7.5. F1-Score

The F1-score provides a balanced evaluation by combining precision and recall into a single metric [24]. For class i, it is defined as the harmonic mean of precision and recall:

F1\text{-}Score (i) = \frac{2 \times Precision (i) \times Recall (i)}{Precision (i) + Recall (i)}

(6)

The macro-averaged F1-score, representing the overall balance between precision and recall across all ten classes, is computed as [7]:

MacroF1\text{-}Score = \frac{1}{10} \sum_{i = 0}^{9} \frac{2 \times Precision (i) \times Recall (i)}{Precision (i) + Recall (i)}

(7)
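The per-class and macro-averaged definitions above can be computed directly from a confusion matrix whose rows are actual classes and whose columns are predicted classes. The sketch below is illustrative only: the three-class matrix is made up, and returning 0 when a denominator is zero is a convention assumed here rather than one stated in the text.

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 for an empty denominator (an assumed convention)."""
    return numerator / denominator if denominator else 0.0


def macro_metrics(cm):
    """Accuracy plus macro-averaged precision, recall, and F1 from a confusion matrix."""
    n = len(cm)
    precisions, recalls, f1_scores = [], [], []
    for i in range(n):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                  # rest of row i
        fp = sum(row[i] for row in cm) - tp   # rest of column i
        p = safe_div(tp, tp + fp)
        r = safe_div(tp, tp + fn)
        precisions.append(p)
        recalls.append(r)
        f1_scores.append(safe_div(2 * p * r, p + r))
    total = sum(sum(row) for row in cm)
    return {
        "accuracy": safe_div(sum(cm[i][i] for i in range(n)), total),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1_scores) / n,
    }


# Hypothetical three-class confusion matrix.
toy = [[2, 0, 0], [0, 2, 1], [0, 0, 2]]
print(macro_metrics(toy))
```

Macro averaging gives every class the same weight regardless of how many samples it has, so a single weak class lowers the score even when the overall accuracy stays high.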

3.8. Visual Explainability

In this section, we describe the methods used to visually interpret and explain the predictions of our model, focusing on how the model attends to different regions of the input images. We present two complementary techniques, saliency maps (Section 3.8.1) and attention maps (Section 3.8.2), which highlight the most informative areas that influence the model’s classification decisions.

3.8.1. Visual Explainability Using Saliency Maps

To present the visual explainability of our model, we apply saliency maps [23] on the best-performing model among the Swin and ViT architectures. For each class, we select a representative image and compute the gradient of the predicted class score with respect to the input pixels. The absolute values of these gradients form the saliency map, highlighting the regions that most strongly influence the model’s prediction. We normalize both the original image and the saliency map, then overlay the saliency map on the image using a semi-transparent heatmap. Finally, we arrange all classes in a grid so that the saliency patterns across different classes can be compared easily.

This approach allows us to visually inspect the fine-grained patterns, such as textures, edges, or distinguishing details that the model relies on, providing concrete evidence of its discriminative capabilities.

3.8.2. Visual Explainability Using Attention Maps

To present the visual explainability on the best-performing model among the Swin and ViT architectures using attention maps [25], we generate attention maps from the last transformer layer, capturing how the model attends to different image patches for each class. For each class, we select a representative image and extract the attention weights from the [CLS] token to all image patches, averaging across attention heads.

These attention weights form the attention map, highlighting the regions that most strongly influence the model’s prediction. We normalize both the original image and the attention map and remove low-attention areas to focus on the most relevant regions. The attention map is then overlaid on the original image using a semi-transparent heatmap. Finally, we arrange all classes in a grid to allow direct comparison of attention patterns across classes.
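The post-processing steps described above (averaging over heads, normalizing, suppressing low-attention patches) can be sketched in plain Python as follows. The input is assumed to be the class-token row of the last layer's attention for each head, for example as obtained from a Hugging Face model called with `output_attentions=True`. Min-max normalization, the threshold value, and the function name `cls_attention_map` are assumptions introduced for illustration.

```python
def cls_attention_map(cls_attention, grid=14, threshold=0.5):
    """Turn per-head class-token attention into a grid x grid map.

    cls_attention: list of heads, each a list of 1 + grid*grid weights
    (attention from the class token to itself and to every image patch).
    """
    num_heads = len(cls_attention)
    num_tokens = len(cls_attention[0])
    if num_tokens != 1 + grid * grid:
        raise ValueError("expected one class token plus grid*grid patch tokens")

    # Average over heads and drop the class-token-to-itself entry (index 0).
    mean = [sum(head[j] for head in cls_attention) / num_heads
            for j in range(1, num_tokens)]

    # Min-max normalize to [0, 1] (an assumed normalization).
    lo, hi = min(mean), max(mean)
    span = (hi - lo) or 1.0
    norm = [(v - lo) / span for v in mean]

    # Zero out low-attention patches, then reshape into rows.
    kept = [v if v >= threshold else 0.0 for v in norm]
    return [kept[r * grid:(r + 1) * grid] for r in range(grid)]


# Hypothetical example: 2 heads, a 2 x 2 patch grid (5 tokens per head).
heads = [[0.2, 0.1, 0.3, 0.1, 0.3],
         [0.2, 0.3, 0.1, 0.1, 0.3]]
for row in cls_attention_map(heads, grid=2, threshold=0.4):
    print(row)
```

For the ViT configuration in this study the grid would be 14 × 14, and the resulting map would then be upsampled to 224 × 224 before being overlaid on the image.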

This approach provides a clear visual explanation of the fine-grained spatial regions, such as textures, edges, or discriminative features, that the best performing model relies on for classification.

4. Results

This section presents and analyzes the results obtained on the PRBD dataset, using aggregate performance metrics, confusion matrices, and visual explanations.

4.1. Experimental Results

Table 4 presents the quantitative performance comparison of the Swin Transformer and Vision Transformer (ViT) models on the PRBD image dataset using four standard evaluation metrics: accuracy, precision, recall, and F1-score.

The Swin Transformer achieved a test accuracy of 0.9944, indicating that it correctly classified the vast majority of test samples. Its precision (0.9944) and recall (0.9944) values further confirm the model’s strong capability to correctly identify relevant classes while maintaining a very low false-positive and false-negative rate. The resulting F1-score of 0.9943 demonstrates a well-balanced performance between precision and recall.

The ViT model exhibited slightly superior performance across all evaluation metrics. It attained a test accuracy of 0.9986, reflecting near-perfect classification performance on the PRBD dataset. The precision, recall, and F1-score are all 0.9986, highlighting the model’s exceptional ability to consistently detect true positives while minimizing misclassifications.

Overall, these results indicate that although both models generalize effectively to unseen data, the ViT model marginally outperforms the Swin Transformer in terms of overall classification accuracy and robustness.

4.1.1. Performance Evaluation Using Confusion Matrix

To further analyze the classification behavior of both models, confusion matrices are presented in Figure 4 and Figure 5. These matrices provide a class-wise breakdown of correct and incorrect predictions, offering deeper insight beyond aggregate performance metrics.

Figure 4 illustrates the confusion matrix of the Swin Transformer model, revealing a strong diagonal dominance that indicates high classification accuracy across all classes. For Class 0, all 70 samples are correctly classified, demonstrating perfect class accuracy. Class 1 shows 71 correctly classified samples, with only two misclassifications, where one sample is incorrectly predicted as Class 7 and another as Class 9. Classes 2, 3, and 4 exhibit perfect classification performance, with 74, 72, and 73 samples correctly identified, respectively. In Class 5, 68 samples are correctly classified, while two samples are misclassified as Class 6, representing the only noticeable confusion between neighboring classes. Classes 6, 7, 8, and 9 all achieve complete correct classification, with 69, 68, 70, and 75 samples correctly predicted, respectively. Overall, the Swin Transformer demonstrates stable class-level performance, with misclassifications being rare, limited in number, and restricted to a small subset of classes, indicating strong discriminative capability across the dataset.

Figure 5 presents the confusion matrix for the Vision Transformer (ViT) model and shows an even stronger diagonal structure compared to the Swin Transformer. Class 0 achieves perfect classification with all 70 samples correctly identified. Class 1 records 72 correctly classified samples, with only one sample misclassified as Class 7. All remaining classes, Classes 2 through 9, exhibit perfect classification performance, with no observed misclassifications. Specifically, Classes 2, 3, 4, 5, 6, 7, 8, and 9 record 74, 72, 73, 70, 69, 68, 70, and 75 correctly classified samples, respectively. The near absence of off-diagonal entries confirms the ViT model’s exceptional ability to distinguish between class-specific patterns and maintain consistent predictions across all categories.

In summary, both models demonstrate excellent class-wise accuracy and minimal misclassification across the PRBD dataset. However, the ViT model exhibits fewer classification errors and more consistent per-class performance than the Swin Transformer. The reduced confusion among classes in the ViT confusion matrix highlights its superior feature representation and stronger class separability, reinforcing its effectiveness for high-precision image classification tasks.
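As a consistency check, the aggregate metrics can be recomputed from the class-wise counts quoted in the text above. The sketch below rebuilds both confusion matrices from those counts (not from the figures themselves) and applies macro averaging; the helper names `build_matrix` and `summarize` are hypothetical.

```python
def build_matrix(support, errors):
    """support[i]: test samples of class i; errors: {(actual, predicted): count}."""
    n = len(support)
    cm = [[0] * n for _ in range(n)]
    for i, count in enumerate(support):
        cm[i][i] = count
    for (actual, predicted), count in errors.items():
        cm[actual][actual] -= count
        cm[actual][predicted] += count
    return cm


def summarize(cm):
    """Return (accuracy, macro precision, macro recall, macro F1), rounded to 4 places."""
    n = len(cm)
    total = sum(map(sum, cm))
    acc = sum(cm[i][i] for i in range(n)) / total
    prec = [cm[i][i] / sum(row[i] for row in cm) for i in range(n)]
    rec = [cm[i][i] / sum(cm[i]) for i in range(n)]
    f1 = [2 * p * r / (p + r) for p, r in zip(prec, rec)]
    return tuple(round(x, 4) for x in (acc, sum(prec) / n, sum(rec) / n, sum(f1) / n))


# Per-class test counts and misclassifications as described in the text.
support = [70, 73, 74, 72, 73, 70, 69, 68, 70, 75]
swin = build_matrix(support, {(1, 7): 1, (1, 9): 1, (5, 6): 2})
vit = build_matrix(support, {(1, 7): 1})

print(summarize(swin))  # (0.9944, 0.9944, 0.9944, 0.9943)
print(summarize(vit))   # (0.9986, 0.9986, 0.9986, 0.9986)
```

The recomputed values agree with the reported metrics for both models: the counts correspond to 710 of 714 test images correct for the Swin Transformer and 713 of 714 for ViT.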

4.1.2. Performance Evaluation for Visual Explainability

Since the ViT model performs better than the Swin model, we applied both saliency map and attention map techniques to visualize the ViT model’s predictions on representative test images from all ten classes. As shown in Figure 6 and Figure 7, the model correctly classified every representative image, and the visualizations highlight the regions that most strongly contributed to these predictions. In the figures, T refers to the true class and P refers to the predicted class. The saliency maps (Figure 6) overlay heatmaps on the original images to reveal important features such as textures, edges, and distinguishing details, while keeping the original context intact. Similarly, the attention maps (Figure 7) emphasize the regions most attended by the model, highlighting discriminative features such as shape, texture, and internal patterns. Together, these visualizations demonstrate that the model not only achieves accurate predictions but also focuses on relevant, fine-grained spatial patterns, providing interpretability and insight into its decision-making process.

In summary, the ViT model achieves highly accurate predictions across nearly all ten classes, and both saliency and attention maps demonstrate that it attends to meaningful, fine-grained features, thereby offering strong performance and interpretability.
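To make the notion of an attention map concrete, the sketch below shows one common way to turn the attention weights of a single transformer layer into a patch-level heat map: the class-token row is averaged over heads, normalized, and reshaped to the patch grid. This is a generic illustration under stated assumptions and not necessarily the procedure used to produce Figure 7. The function name `cls_attention_map`, the token layout (class token first, then patches in row-major order), and the toy weights are all hypothetical.

```python
def cls_attention_map(attn, grid):
    """Turn one layer's attention weights into a patch-level heat map.

    attn: nested list indexed as [head][query_token][key_token]; token 0 is
          assumed to be the class token, followed by grid * grid patch tokens.
    grid: number of patches per image side (for example, 14 for a 224-pixel
          image split into 16-pixel patches).
    """
    heads = len(attn)
    tokens = grid * grid + 1
    # Average the class-token row over heads, skipping its self-attention entry.
    cls_row = [
        sum(attn[h][0][j] for h in range(heads)) / heads
        for j in range(1, tokens)
    ]
    # Min-max normalize to [0, 1] so the map can be overlaid on the image.
    lo, hi = min(cls_row), max(cls_row)
    span = (hi - lo) or 1.0
    scaled = [(v - lo) / span for v in cls_row]
    # Reshape the flat patch list into a grid x grid map (row-major order).
    return [scaled[r * grid:(r + 1) * grid] for r in range(grid)]


# Toy input: 2 heads and a 2 x 2 patch grid (5 tokens). Only the class-token
# row affects the result; the other rows are uniform placeholders.
toy_attn = [
    [[0.2, 0.1, 0.4, 0.2, 0.1]] + [[0.2] * 5] * 4,
    [[0.2, 0.3, 0.2, 0.2, 0.1]] + [[0.2] * 5] * 4,
]
heat = cls_attention_map(toy_attn, grid=2)
print(heat)
```

In practice the resulting grid would be upsampled to the input resolution and blended with the grain image, so that brighter cells mark the patches the class token attended to most.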

5. Discussion

This study examined the effectiveness of transformer-based vision models for fine-grained classification of Bangladeshi rice varieties using the PRBD dataset. The findings indicate that ViT and Swin Transformer models provide a clear performance advantage over previously reported approaches on this dataset. Rather than reiterating numerical results, this discussion interprets these outcomes in relation to existing research and examines the methodological factors underlying the observed differences.

Table 5 summarizes prior studies and illustrates the evolution of rice classification methods, from traditional machine learning approaches to Convolutional Neural Networks (CNNs), and more recently, transformer-based architectures.

5.1. Comparison with Traditional Machine Learning Approaches

Earlier rice classification studies predominantly employed handcrafted morphological or texture features combined with classifiers such as SVM, Random Forest, Decision Trees, and K-NN [11,13]. Although these methods achieved relatively high accuracy on small or controlled datasets, their reliance on manually engineered features limits their ability to generalize to datasets with higher inter-class similarity and larger numbers of classes.

The performance gap observed between these approaches and the transformer-based models in this study suggests that handcrafted features may not adequately capture the subtle visual differences present in fine-grained rice varieties. In contrast, transformer architectures learn discriminative representations directly from data, enabling more robust modeling of complex visual patterns without the need for dataset-specific feature design.

5.2. Comparison with CNN-Based Deep Learning Models

CNN-based models such as VGG, Inception, DenseNet, and custom architectures have been widely adopted in rice classification tasks [14,17,19,20]. While several studies reported strong performance, many relied on private datasets or datasets with fewer classes and limited variability, which restricts direct comparability.

The most relevant benchmark is the work in [5], which evaluated DenseNet201 on the PRBD dataset. The markedly improved performance observed with transformer-based models in this study suggests that CNNs may be less effective at capturing the global structural relationships required for discriminating visually similar rice varieties. This limitation is commonly attributed to the local receptive fields inherent to convolutional operations, which prioritize local texture patterns over long-range spatial dependencies.
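The locality argument can be quantified with the standard receptive-field recurrence for stacked convolution and pooling layers. The sketch below is illustrative only: the layer configurations are hypothetical and do not correspond to DenseNet201 or to any model evaluated in [5], and the function name `receptive_field` is introduced here for the example.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit.

    layers: sequence of (kernel_size, stride) pairs, applied in order.
    Uses the recurrence r <- r + (k - 1) * j, then j <- j * s, where j is
    the cumulative stride between neighbouring units of the current layer.
    """
    r, j = 1, 1
    for kernel, stride in layers:
        r += (kernel - 1) * j
        j *= stride
    return r


# Five stacked 3x3 convolutions with stride 1.
print(receptive_field([(3, 1)] * 5))   # 11

# Two 3x3 convolutions followed by 2x2 pooling with stride 2, repeated twice.
block = [(3, 1), (3, 1), (2, 2)]
print(receptive_field(block * 2))      # 16
```

The receptive field grows only gradually with depth, so early convolutional layers see small neighbourhoods of the grain image. In a standard ViT encoder, by contrast, each patch token can attend to every other patch token from the first layer onward.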

5.3. Comparison with Hybrid and Transformer-Based Approaches

Hybrid approaches that combine CNN feature extractors with external classifiers, such as SVMs, have demonstrated competitive performance on smaller regional datasets [19]. However, these models retain the locality bias of CNN features and introduce additional architectural complexity.

Transformer-based approaches remain relatively underexplored in rice variety classification. Although CNN-transformer hybrids have been proposed [10], differences in dataset composition and experimental settings limit direct comparison. In this context, the present study provides evidence that pure transformer architectures can outperform both CNN and hybrid models on a challenging public dataset.

The improved performance of ViT and Swin Transformer can be attributed to their self-attention mechanisms, which enable the integration of long-range contextual information with fine-grained local features. This capability is particularly important for rice classification, where class-level distinctions are often defined by subtle variations in grain shape, boundary structure, and surface texture.
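As a simplified illustration of how self-attention mixes information across the whole image, the sketch below implements single-head scaled dot-product self-attention on a handful of toy patch vectors. It is a deliberate simplification: the learned query, key, and value projections and the multiple heads used in ViT and Swin Transformer are omitted (identity projections are assumed), and the names `self_attention` and `patches` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections.

    tokens: list of equal-length feature vectors, one per image patch.
    Returns (outputs, weights); weights[i][j] is the share of token j in the
    updated representation of token i.
    """
    d = len(tokens[0])
    weights = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[c] for w, v in zip(row, tokens)) for c in range(d)]
        for row in weights
    ]
    return outputs, weights


# Four toy patch embeddings; patches 0, 1, and 3 are similar, patch 2 differs.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.1]]
outputs, weights = self_attention(patches)
for row in weights:
    print([round(w, 3) for w in row])
```

Every row of `weights` is strictly positive and sums to one, so each updated token is a weighted combination of all patches regardless of their spatial distance, with similar patches receiving larger weights.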

Consistent performance across multiple evaluation metrics and low confusion among visually similar classes further suggest that the observed improvements reflect robust class discrimination rather than isolated gains. Attention and saliency map analyses support this interpretation by indicating that the models focus on semantically meaningful grain regions instead of background artifacts.

From a methodological perspective, these findings contribute to agricultural computer vision research by demonstrating that transformer-based models are well suited for fine-grained classification tasks characterized by limited inter-class separability. This supports emerging evidence that attention-based architectures can outperform CNNs when global contextual information is critical.

From a practical standpoint, improved rice variety classification has direct implications for automated quality assessment, seed certification, and large-scale monitoring in Bangladesh. Accurate and reliable automated systems can reduce human error and facilitate scalable deployment in real-world agricultural workflows.

5.4. Limitations of This Research

Despite the promising results, this study is constrained by the scope of the PRBD dataset, which represents region-specific rice varieties and may not capture broader variability in cultivation conditions, imaging environments, or labeling quality. Additionally, evaluation focused primarily on quantitative performance metrics and visual interpretability analyses.

6. Conclusions

This study demonstrates the effectiveness of transformer-based deep learning architectures for fine-grained classification of rice varieties using the PRBD dataset. Both the Swin Transformer and Vision Transformer (ViT) models showed strong discriminative capability across ten Bangladeshi rice varieties, indicating their ability to capture subtle visual differences and complex patterns in grain images.

Among the evaluated models, ViT consistently exhibited superior performance, reflected in higher classification accuracy and more reliable class-level predictions compared with the Swin Transformer. Analysis of the confusion matrices suggests that ViT provides improved class separability, reducing misclassifications among visually similar rice varieties. In addition to quantitative evaluation, saliency and attention map visualizations offered complementary insights into the image regions and fine-grained features that contributed to the model’s predictions.

Overall, the findings indicate that transformer-based models, particularly ViT, represent a robust and effective solution for automated rice variety classification. Beyond this specific application, the study highlights the broader potential of attention-based architectures for agricultural image analysis and fine-grained visual recognition tasks, providing a solid foundation for future research in precision agriculture and intelligent crop monitoring.

Future Work

Future research may focus on developing lightweight transformer architectures optimized for deployment in resource-constrained environments, such as local markets or other settings with limited computational capacity. Hybrid frameworks that combine CNNs for initial feature extraction with transformers for fine-grained classification could further improve computational efficiency while maintaining high accuracy.

Additional ablation studies and more detailed analysis of attention mechanisms would provide deeper insight into the architectural components that drive model performance. Furthermore, integrating multi-modal data sources, such as combining image data with genomic or agronomic information, may enhance robustness and scalability. Extending the proposed methodology to other staple crops would also support broader adoption of AI-driven solutions in precision agriculture, crop quality assessment, and decision-support systems.

Author Contributions

Conceptualization, I.T. and V.N.; Methodology, I.T. and V.N.; Software, I.T.; Validation, I.T.; Formal analysis, I.T. and V.N.; Investigation, I.T.; Resources, I.T.; Data curation, I.T. and V.N.; Writing—original draft preparation, I.T.; Writing—review and editing, V.N.; Visualization, I.T. and V.N.; Supervision, V.N.; Project administration, I.T. and V.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available online at https://data.mendeley.com/datasets/sfp9s96prh/1 (accessed on 22 January 2026).

Acknowledgments

The authors would like to thank all individuals and institutions who provided support and resources that contributed to the completion of this research.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ViT	Vision Transformer
DL	Deep Learning
ML	Machine Learning
CNN	Convolutional Neural Network
SVM	Support Vector Machine
K-NN	K-Nearest Neighbors
RF	Random Forest
VGG	Visual Geometry Group Network

References

1. FAOSTAT. Rice Market Monitor; Food and Agriculture Organization (FAO): Rome, Italy, 2024; Available online: https://www.fao.org/faostat/en/ (accessed on 22 January 2026).
2. Islam, M.M.; Himel, G.M.S.; Moazzam, M.G.; Uddin, M.S. Artificial Intelligence-based Rice Variety Classification: A State-of-the-art Review and Future Directions. Smart Agric. Technol. 2025, 10, 100788.
3. Islam, M.M.; Himel, G.M.S.; Uddin, M.S.; Moazzam, M.G. A visual dataset for recognition of rice varieties. Data Brief 2024, 54, 110442.
4. Islam, M.M.; Shahriar Himel, G.M.; Moazzam, G.; Uddin, M.S. Precision in Rice Variety Classification using Stacking-based Ensemble Learning. J. Cereal Sci. 2025, 122, 104128.
5. Tahsin, M.; Mahazabin, K.I.; Nuha, M.B.R.; Efad, A.R.; Momo, M.R.; Niloy, N.T.; Khan, M.S.H.; Tuhin, R.A.; Rashid, M.R.A.; Islam, R.U. Grain by grain: A microscopic image dataset of rice varieties from Bangladeshi rice markets. Data Brief 2025, 63, 112058.
6. Han, Q.-l.; Long, B.-x.; Yan, X.-j.; Wang, W.; Liu, F.-r.; Chen, X.; Ma, F. Exploration of using acoustic vibration technology to non-destructively detect moldy kernels of in-shell hickory nuts (Carya cathayensis Sarg.). Comput. Electron. Agric. 2023, 212, 108137.
7. Tabassum, I.; Nunavath, V. A Hybrid Deep Learning Approach for Multi-Class Cyberbullying Classification Using Multi-Modal Social Media Data. Appl. Sci. 2024, 14, 12007.
8. Tabassum, I.; Kabir, G.A.; Tasnuva, I.; Sultana, S. A deep learning approach for multi-class classification of air quality from image data. Comput. Sci. Eng. Res. 2025, 2, 10–18.
9. Tahsin, M.; Isat Mahazabin, K.; Rabbani Nuha, M.B.; Rahman Efad, A.; Mariya Rahman Momo, M.R.M.; Niloy, N.T. PRBD: Microscopic Image of Different Processed Rice Varieties of Bangladesh. 2025. Available online: https://data.mendeley.com/datasets/sfp9s96prh/1 (accessed on 22 January 2026).
10. Islam, M.M.; Himel, G.M.S.; Moazzam, M.G.; Uddin, M.S. Rice Variety Classification Using Next Generation Convolutional Networks. J. Eng. 2025, 2025, e70102.
11. Suma, D.; Narendra, V.G.; Raviraja Holla, M.; Darshan Holla, M. Morphological features for multi-model rice grain classification. Int. J. Electr. Comput. Eng. 2025, 15, 3212–3225.
12. Shadaksharappa, H.; Chakrasali, S.; Ningappa, K.G. Classification of morphologically similar Indian rice variety using machine learning algorithms. Int. J. Electr. Comput. Eng. (IJECE) 2025, 15, 3202–3211.
13. Suma, D.; Narendra, V.; Darshan Holla, M.; Shreyas; Raviraja Holla, M. Kernel to Computation: Identifying Optimal Feature set for Red Rice Classification. Smart Agric. Technol. 2025, 12, 101065.
14. Priya, P.K.; Kirupa, P.; Thilakaveni, P.; Devi, K.N.; Mahabooba, M.; Jayachitra, S. DeepRiceTransfer: Exploiting CNN Transfer Learning for Effective Rice Variety Classification. In Proceedings of the 2024 International Conference on Social and Sustainable Innovations in Technology and Engineering (SASI-ITE), Tadepalligudem, India, 23–25 February 2024; pp. 102–107.
15. Rajora, R.; Banerjee, D.; Upadhyay, D.; Dangi, S.; Rajora, A. Integrated CNN-Random Forest Model for Accurate Rice Variety Classification. In Proceedings of the 2024 4th Asian Conference on Innovation in Technology (ASIANCON), Pune, India, 23–25 August 2024; pp. 1–6.
16. Deepika, S.; Arunachalam, V. Design of Multi-Class Optimized Lightweight Convolution Neural Network for Rice Classification. In Proceedings of the 2024 Seventh International Women in Data Science Conference at Prince Sultan University (WiDS PSU), Riyadh, Saudi Arabia, 3–4 March 2024; pp. 10–15.
17. Terlapu, P.V.; Prasan, U.D.; Kumar, T.R.; Bendalam, V.; Pappu, S.R.; Rao, M.J.; Mohitha, M.R. Rice Category Identification through Deep Transfer Learning Features and Machine Learning Classifiers: An Intelligent Approach. IAENG Int. J. Comput. Sci. 2024, 51, 765–784.
18. Din, N.M.U.; Assad, A.; Dar, R.A.; Rasool, M.; Sabha, S.U.; Majeed, T.; Islam, Z.U.; Gulzar, W.; Yaseen, A. RiceNet: Convolutional neural networks-based model to classify Pakistani grown rice seed types. Multimed. Syst. 2021, 27, 867–875.
19. Mamun, M.A.A.; Karim, S.R.I.; Sarkar, M.I.; Alam, M.Z. Evaluating the Efficacy of Hybrid Deep Learning Models in Rice Variety Classification. 2024. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4749601 (accessed on 22 January 2026).
20. Hossain, N.; Sabbir Hossain, M.; Islam, R.; Rahman, R. Rice Grain Classification and Automated Quality Control Using Deep Learning Approaches. In Proceedings of the 3rd International Conference on Computing Advancements, Dhaka, Bangladesh, 17–18 October 2024; pp. 910–916.
21. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
22. Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Jiang, Z.H.; Tay, F.E.; Feng, J.; Yan, S. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 558–567.
23. Li, Z. A saliency map in primary visual cortex. Trends Cogn. Sci. 2002, 6, 9–16.
24. EvidentlyAI. Multi-Class Classification Metrics. 2024. Available online: https://www.evidentlyai.com/classification-metrics/multi-class-metrics#accuracy-in-multi-class (accessed on 18 December 2024).
25. Xie, C.; Liu, S.; Li, C.; Cheng, M.M.; Zuo, W.; Liu, X.; Wen, S.; Ding, E. Image inpainting with learnable bidirectional attention maps. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8858–8867.

Figure 1. Overview of the Methodological Pipeline for Multi-Class Classification of Bangladeshi Rice Varieties Using Image Data.

Figure 2. Architecture of the employed Swin Transformer model, showing input patch embedding, hierarchical stages with Swin Transformer blocks, shifted window self-attention, patch merging layers, and the classification head. Each module captures local and global image features, reduces spatial dimensions while increasing channel depth, and enables effective transfer learning for accurate grain classification.

Figure 3. Architecture of the employed Vision Transformer (ViT) model, showing input patch embedding, class token, positional embeddings, stacked Transformer encoder layers with multi-head self-attention and feed-forward sublayers, and the classification head. Each module enables the model to capture global contextual relationships, retain spatial information, and perform accurate transfer learning for grain classification.

Figure 4. Confusion Matrix of the Swin Transformer Model on the PRBD Dataset for Multi-Class Rice Variety Classification.

Figure 5. Confusion matrix of the ViT model on the PRBD Dataset for Multi-Class Rice Variety Classification.

Figure 6. Saliency map visualizations of representative test images for each class for the ViT model. The heatmaps, generated using the saliency map technique, highlight the regions that most influence the model’s predictions. The original images are shown in the background to provide context, allowing comparison of fine-grained features across classes. (T = True Class, P = Predicted Class).

Figure 7. Attention maps of the ViT model for one representative image per rice class. The heatmaps (red–yellow), overlaid on the original grain images, highlight the regions that most strongly influenced the model’s predictions. T: True class, P: Predicted class. Bright regions indicate higher attention.

Table 1. Comparative Overview of Machine Learning and Deep Learning Approaches for Bangladeshi Rice Classification using Image Data.

Ref.	Objective	Dataset	Total Classes	Model	Accuracy
[10]	Develop a CNN–Transformer model for accurate rice variety classification.	Dataset-01: Aruzz22.5k, Dataset-02: Cinar and Koklu	20, 5	CNN with Transformer-based architecture	99.6% (Dataset-01), 100% (Dataset-02)
[11]	Classify rice grains using ML based on morphological features.	Rice image dataset (Kaggle)	5	SVM, RF, LR, DT, GNB, K-NN	K-NN: 97.8%, RF: 97.51%, DT: 97.48%, GNB: 96.99%, SVM: 96.85%, LR: 96.05%
[12]	Classify similar-looking Sona-Masuri rice varieties.	Private dataset	4	CNN	CNN: 87.5%
[13]	Identify red rice varieties using ML and image features.	Zonal Agricultural & Horticultural Research Station	3	SVM, K-NN	K-NN: 98.67%, SVM: 97.34%
[14]	Improve rice classification using deep learning and transfer learning.	Private Dataset	10	VGG16	99.47%
[15]	Classify five rice varieties and assess model performance.	Private Dataset	5	CNN-Random Forest	94.87%
[16]	Develop a lightweight CNN for five-class rice recognition.	Custom dataset	5	OpLW-CNN	98.14%
[17]	Improve rice category identification using deep transfer learning features + machine learning.	Rice image dataset (Kaggle)	5	Inception V3, VGG-16, VGG-19 with MLP and SVM	MLP (VGG-19): 99.72%, SVM (VGG-19): 99.68%, Inception V3: 99.12%, VGG-16: 99%
[18]	Build RiceNet for accurate multi-class rice identification.	Custom dataset from SKUAST	5	RiceNet	94%
[5]	Classify ten PRBD rice varieties using DenseNet201.	PRBD	10	DenseNet201	Original Dataset: 93%, Augmented Dataset: 94%
[19]	Compare deep learning models for Southern Bangladesh rice varieties.	Southern Bangladesh dataset	5	Hybrid CNN + SVM (custom model), VGG16, MobileNetV2	Hybrid: 99%, VGG16: 95%, MobileNetV2: 93%
[20]	Enhance rice grain classification using CNN-based models.	Self-developed dataset	4	VGG16, MobileNet, InceptionV3, InceptionResNet, Xception, Ensemble	VGG16: 95%, MobileNet: 96%, InceptionV3: 94%, InceptionResNet: 94%, Xception: 96%, Ensemble: 98%

Table 2. Statistical Summary of Class-Wise Image Distribution in the PRBD Dataset Before and After Data Augmentation.

Class	Variety	Original Images (No. of Images)	Augmented Images (No. of Images)
0	Aush	200	800
1	Beroi	200	800
2	BR-28	200	800
3	BR-29	200	800
4	Chinigura	200	800
5	Miniket	200	800
6	Katari Nazir	200	800
7	Ghee Bhog	200	800
8	Katari Siddho	200	800
9	Swarna	200	800
Total		2000	8000

This table summarizes the class-wise distribution of images in the PRBD dataset.

Table 3. Distribution of Preprocessed PRBD Images Across Training, Testing, and Validation Sets.

Class	Variety	Training (No. of Images)	Testing (No. of Images)	Validation (No. of Images)
0	Aush	964	70	70
1	Beroi	973	73	72
2	BR-28	967	74	76
3	BR-29	970	72	70
4	Chinigura	972	73	77
5	Ghee Bhog	966	70	72
6	Katari Nazir	961	69	72
7	Katari Siddho	963	68	68
8	Miniket	965	70	68
9	Swarna	968	75	71
Total number of images:		9669	714	716

The table presents the class-wise distribution of images in the training, validation, and testing subsets after pre-processing.

Table 4. Comparative Performance Evaluation of Swin Transformer and ViT on the PRBD Rice Image Dataset.

Model	Test Accuracy	Precision	Recall	F1-Score
Swin Transformer	0.9944	0.9944	0.9944	0.9943
ViT	0.9986	0.9986	0.9986	0.9986

The table reports test accuracy, precision, recall, and F1-score of transformer-based models evaluated on the PRBD dataset.

Table 5. Comparison of Related Work and Proposed Method for Bangladeshi Rice Classification for the PRBD Dataset.

Previous Work	Objective	Dataset	Total Classes	Model	Accuracy
[10]	Rice variety classification	Dataset-01: Aruzz22.5k, Dataset-02: Cinar and Koklu	20, 5	CNN with Transformer-based architecture	99.6% (Dataset-01), 100% (Dataset-02)
[11]	Rice variety classification on morphological features	Rice image dataset (Kaggle)	5	SVM, RF, LR, DT, GNB, K-NN	K-NN: 97.8%, RF: 97.51%, DT: 97.48%, GNB: 96.99%, SVM: 96.85%, LR: 96.05%
[12]	Sona-Masuri rice variety classification	Private dataset	4	CNN	CNN: 87.5%
[13]	Rice variety identification	Zonal Agricultural & Horticultural Research Station	3	SVM, K-NN	K-NN: 98.67%, SVM: 97.34%
[14]	Rice classification	Private Dataset	10	VGG16	99.47%
[15]	Rice variety classification	Private Dataset	5	CNN-Random Forest	94.87%
[16]	Rice type recognition	Custom dataset	5	OpLW-CNN	98.14%
[17]	Rice category identification	Rice image dataset (Kaggle)	5	Inception V3, VGG-16, VGG-19 with MLP and SVM classifiers	MLP (VGG-19): 99.72%, SVM (VGG-19): 99.68%, Inception V3: 99.12%, VGG-16: 99%
[18]	Rice type identification	Custom dataset from SKUAST	5	RiceNet	94% (RiceNet)
[19]	Rice variety classification	Southern Bangladesh dataset	5	Hybrid CNN + SVM (custom model), VGG16, MobileNetV2	Hybrid: 99%, VGG16: 95%, MobileNetV2: 93%
[20]	Rice grain classification	Self-developed dataset	4	VGG16, MobileNet, InceptionV3, InceptionResNet, Xception, Ensemble	VGG16: 95%, MobileNet: 96%, InceptionV3: 94%, InceptionResNet: 94%, Xception: 96%, Ensemble: 98%
[5]	Rice variety classification	PRBD	10	DenseNet201	Original Dataset: 93%, Augmented Dataset: 94%
Our Work	PRBD rice variety classification	PRBD	10	Swin Transformer, ViT	Swin: 99.44%, ViT: 99.86%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
