1. Introduction
Rice is the primary dietary staple for more than half of the world’s population and remains a cornerstone of global food security, particularly in Asia [1]. As a major source of carbohydrates and energy, rice underpins not only human nutrition but also the livelihoods of millions of farmers across developing and developed economies. Beyond consumption, rice quality and varietal identity are critical for seed certification, breeding programs, yield optimization, market pricing, and international trade. However, rice grains exhibit substantial intra- and inter-varietal variability due to genetic diversity, environmental conditions, and post-harvest processing, making accurate variety identification a persistent challenge [2].
Bangladesh is one of the world’s leading rice-producing countries, where rice holds exceptional economic, agricultural, and cultural significance. The country cultivates a wide range of indigenous and high-yielding rice varieties adapted to diverse agroecological conditions across the Aman, Aus, and Boro seasons [3]. Accurate identification of Bangladeshi rice varieties is essential for ensuring varietal purity, maintaining quality standards, supporting breeding initiatives, and preventing economic losses caused by mislabeling or adulteration. Nevertheless, many local varieties share subtle visual similarities in grain shape, size, color, and texture, making reliable multi-class classification particularly difficult.
Traditional approaches to rice variety identification, such as manual visual inspection, morphometric analysis, and biochemical or genetic testing, are often labor-intensive, time-consuming, costly, and susceptible to subjective bias [4]. These limitations restrict their scalability and practical applicability, especially in resource-constrained agricultural settings. As a result, there has been growing interest in automated, image-based classification techniques that can provide fast, objective, and reproducible results.
Recent advances in deep learning have significantly improved image-based agricultural analysis, including grain classification, seed identification, and crop disease detection [5]. Convolutional Neural Networks (CNNs) have been widely adopted for such tasks and have demonstrated promising performance [6]. However, CNN-based models often struggle to capture long-range spatial dependencies and fine-grained visual distinctions among highly similar classes, an issue that is particularly pronounced in multi-class rice variety classification. Transformer-based architectures, such as the Vision Transformer (ViT) and Swin Transformer, address these limitations by leveraging self-attention mechanisms that model global contextual relationships within images, leading to superior performance in complex visual recognition tasks [7,8].
Despite the rapid adoption of transformer models in general computer vision and agricultural applications, their use in rice variety classification, particularly for Bangladeshi rice, remains largely unexplored. Existing studies on the PRBD dataset [9] have primarily focused on conventional machine learning or CNN-based approaches, leaving a clear research gap in evaluating the effectiveness of modern transformer-based architectures on this dataset. To the best of our knowledge, this study represents the first systematic application of Vision Transformer and Swin Transformer models for multi-class classification of Bangladeshi rice varieties using the PRBD dataset.
The objective of this research is to develop an automated and robust rice variety classification system tailored to Bangladeshi rice grains. To this end, transformer-based deep learning models, specifically ViT and Swin Transformer, are applied to the PRBD dataset to discriminate rice varieties based on subtle visual and structural characteristics. In doing so, the work seeks to advance the existing literature by demonstrating the advantages of transformer architectures over traditional CNN-based approaches for fine-grained agricultural image classification. Ultimately, this research contributes to the digital transformation of agricultural quality assessment in Bangladesh by providing an efficient, scalable, and reliable solution for rice variety identification.
Accordingly, this study addresses the following research questions:
How effectively can transformer-based deep learning models classify multiple Bangladeshi rice varieties using image data?
Which transformer architecture provides the highest accuracy and robustness for rice variety classification on the PRBD dataset?
Overall, this study presents a comprehensive approach to classifying rice varieties using transformer-based models. The proposed method is rigorously evaluated against state-of-the-art models to demonstrate its efficacy. The rest of this paper is organized as follows: Section 2 reviews the related works on rice classification and image-based grain analysis; Section 3 describes the dataset and methodology used in this research; Section 4 presents the experimental results with performance evaluation; Section 5 provides the discussion; and Section 6 concludes the study.
2. Literature Review
Significant research has explored the application of deep learning (DL) algorithms for rice type classification during the years 2020 to 2025. This section presents existing studies that focus on multi-class rice classification, the DL models employed, the datasets used, and their obtained results. According to the analysis of previous works in Table 1, recent studies have made significant progress in rice variety classification using both traditional machine learning (ML) and modern DL techniques. These studies utilized diverse datasets, including publicly available Kaggle datasets, private collections, and custom datasets from different regions of Bangladesh. The main goal has been to enhance multi-class rice classification accuracy.
Early ML-based approaches primarily used morphological features of rice grains. For example, Ref. [11] applied Support Vector Machines (SVMs), Random Forests (RFs), Logistic Regression (LR), Decision Trees (DTs), Gaussian Naive Bayes (GNB), and K-Nearest Neighbors (K-NN) on a Kaggle (https://www.kaggle.com/datasets/muratkokludataset/rice-image-dataset, accessed on 22 January 2026) rice image dataset. Among these, K-NN achieved the highest accuracy of 97.8%, followed closely by RF and DT. Similarly, Ref. [13] classified three red rice varieties using size, shape, and texture features, where K-NN reached 98.67% and SVM reached 97.34%.
Deep learning methods have demonstrated superior performance due to their automatic feature extraction capabilities. Ref. [10] proposed a hybrid CNN–Transformer model to classify 20 and 5 rice varieties across two datasets, achieving accuracies of 99.6% and 100%, respectively. Other CNN-based studies include [12], which classified four similar-looking Sona-Masuri varieties with 87.5% accuracy, and [15], which combined a CNN with Random Forest to classify five rice varieties, achieving 94.87%.
Transfer learning has been extensively applied to boost performance. Ref. [14] evaluated several pre-trained models, including MobileNet, ResNet50, VGG16, and their proposed Deep Rice Transfer (DRT) model, for classifying 10 rice varieties. VGG16 achieved the highest accuracy of 99.47%, followed by MobileNet (98.94%) and DRT (98.44%). Similarly, Ref. [17] combined deep transfer learning features with ML classifiers, achieving over 99% accuracy using VGG19 and InceptionV3 features.
Lightweight and custom-designed architectures have also been explored. Ref. [16] proposed OpLW-CNN for five-class rice recognition, attaining 98.14% accuracy. RiceNet, introduced by [18], achieved 94% accuracy on a custom dataset collected from SKUAST. Moreover, Ref. [5] applied DenseNet201 on 10 PRBD rice varieties, achieving 93% on the original dataset and 94% on the augmented dataset, demonstrating the benefits of data augmentation.
Comparative studies such as [19,20] evaluated multiple DL architectures on local Bangladeshi datasets. Ref. [19] reported a hybrid CNN + SVM model reaching 99% accuracy for five Southern Bangladesh rice varieties, outperforming VGG16 (95%) and MobileNetV2 (93%). Ref. [20] tested VGG16, MobileNet, InceptionV3, InceptionResNet, and Xception, achieving up to 96% individually and 98% using an ensemble.
Recent advancements in rice classification have shifted toward more optimized and hybrid deep learning architectures. Enhanced residual networks, such as deeper or fine-tuned ResNet variants, improve feature reuse and gradient flow, enabling more robust learning compared to conventional CNNs. However, their increased depth often results in higher computational cost. EfficientNet models, which employ compound scaling to balance network depth, width, and input resolution, have recently gained attention in agricultural image classification due to their superior accuracy–efficiency trade-off, particularly for resource-constrained applications.
Moreover, hybrid CNN–Transformer architectures represent a notable advancement by integrating CNN-based local feature extraction with Transformer-based global attention mechanisms. These models have demonstrated superior performance in distinguishing visually similar rice varieties by capturing long-range dependencies and contextual relationships that traditional CNNs may overlook. Despite their high accuracy, CNN–Transformer models often require greater computational resources, highlighting an ongoing trade-off between performance and efficiency in modern rice classification systems.
Overall, the literature indicates that deep learning, transfer learning, and ensemble methods significantly improve rice classification performance. While traditional ML remains effective for smaller and less complex datasets, modern CNN-based, EfficientNet, and CNN–Transformer architectures consistently outperform earlier approaches in complex multi-class scenarios.
Research Gap and Motivation
After reviewing the related literature, it is evident that only one study has so far utilized the PRBD dataset [5], where DenseNet201 was employed for rice variety classification. To the best of our knowledge, no existing work has explored Transformer-based architectures such as Vision Transformer (ViT) or Swin Transformer on this dataset. Therefore, this study aims to investigate the performance of multiple state-of-the-art deep learning models such as Swin Transformer and ViT on the PRBD dataset. The proposed work focuses on a comprehensive comparative analysis using evaluation metrics such as accuracy, precision, recall, and F1-score.
3. Methodology
This section outlines the detailed steps followed to obtain the results. First, the data collection and description are presented, followed by image preprocessing. In the third step, the selection and implementation of deep learning models, along with hyperparameter optimization and performance evaluation, are described.
Figure 1 illustrates the overall experimental procedure of this research. After collecting the data, the next step involves data processing, followed by splitting the dataset into training, testing, and validation sets. Subsequently, we apply deep learning models such as Swin Transformer and ViT to the training data, and finally, we evaluate the outcomes using performance metrics.
3.1. Data Collection and Description
In this study, a publicly available dataset called PRBD was used, which was downloaded from Mendeley Data [9]. The dataset consists of 2000 carefully selected rice kernel images spanning ten distinct Bangladeshi rice varieties: Aush, Beroi, BR-28, BR-29, Ghee Bhog, Katari Nazir, Katari Siddho, Swarna, Miniket, and Chinigura.
As reported by the authors of the PRBD dataset [5,9], the rice kernels were collected from local markets in Karwan Bazar, Dhaka, and imaged using an HP Wide Vision HD digital camera (HP Inc., Palo Alto, CA, USA) configured as a microscope camera under controlled conditions. The imaging setup included 1000× magnification, eight white LED lights for consistent illumination, and a fixed imaging distance to ensure reproducibility. Image acquisition was performed using Pluggable Digital Viewer software (Digital Viewer 3.1.07, Pluggable Technologies, Redmond, WA, USA) at a resolution of 640 × 480 pixels, 120 × 120 dpi, and 30 frames per second. The images were stored in JPG format, resulting in a total of 2000 original images.
The dataset was organized into two main directories, Original_images and Augmented_images, each containing ten sub-folders corresponding to the rice categories. The distribution of images per variety is presented in Table 2; the images are equally distributed across the ten varieties. More details about the data collection and augmentation process can be found in [5].
Ethical Approval for the Dataset
The PRBD dataset complies with ethical standards as outlined in its original publication [5]. The study does not involve animal experimentation, human subjects, or data obtained from social media platforms [9].
3.2. Data Processing
The dataset package included both the original images and an additional set of 8000 augmented samples provided by Tahsin et al. [5]. After downloading the original and augmented datasets, we split the original data into training, validation, and test sets in a 60–20–20 ratio.
Data augmentation was applied to the training set to increase diversity and improve model generalization. Specifically, the augmented samples were generated using 30° rotations, approximately 20% shear transformations, and horizontal and vertical flips. By introducing these variations, the model is exposed to different perspectives of the same images, reducing the likelihood of memorizing specific patterns and overfitting to the training data. This is particularly important given the near-perfect accuracy observed in early experiments, which can indicate overfitting. Augmentation thus helps the model learn more generalizable features, improving performance on unseen validation and test data. The augmentation process yielded 8000 augmented images in total.
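The transformations named above can be sketched in a few lines of plain Python. This is only an illustration of the geometry, not the pipeline used by the dataset authors: the helper names are hypothetical, images are represented as nested lists, and interpreting "approximately 20% shear" as a horizontal shear factor of 0.2 is an assumption.

```python
import math

def hflip(image):
    """Horizontal flip: mirror every row left to right."""
    return [row[::-1] for row in image]

def vflip(image):
    """Vertical flip: reverse the order of the rows."""
    return image[::-1]

def rotation_matrix(degrees):
    """2 x 2 matrix for a counter-clockwise rotation about the origin."""
    t = math.radians(degrees)
    return [[math.cos(t), -math.sin(t)],
            [math.sin(t), math.cos(t)]]

def shear_matrix(factor):
    """2 x 2 horizontal shear: x' = x + factor * y, y' = y."""
    return [[1.0, factor],
            [0.0, 1.0]]

def apply_matrix(matrix, point):
    """Map an (x, y) coordinate through a 2 x 2 matrix."""
    x, y = point
    return (matrix[0][0] * x + matrix[0][1] * y,
            matrix[1][0] * x + matrix[1][1] * y)

tiny_image = [[1, 2],
              [3, 4]]
print(hflip(tiny_image))  # [[2, 1], [4, 3]]
print(vflip(tiny_image))  # [[3, 4], [1, 2]]

# Where pixel coordinates move under a 30 degree rotation and an
# assumed 0.2 shear factor.
print(apply_matrix(rotation_matrix(30), (1.0, 0.0)))
print(apply_matrix(shear_matrix(0.2), (0.0, 1.0)))
```

In a real pipeline an image library would resample the pixel grid after the rotation or shear; the sketch only shows the coordinate mappings involved.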
Data Splitting
The preprocessed PRBD dataset was split per class into training, validation, and test sets using a 60–20–20 ratio. Data augmentation was applied only to the training set, combining the original training images with 8000 augmented samples to increase diversity. The resulting numbers per class are shown in Table 3. Minor variations in the number of images across classes arise from rounding during the split to maintain integer counts and class balance. These small differences do not significantly affect model training or evaluation. All columns indicate the number of images per subset.
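A per-class 60–20–20 split of the kind described above could be sketched as follows. The function name, the fixed seed, and the choice to give any rounding remainder to the test subset are illustrative assumptions; the exact splitting code used in this study is not specified.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.6, 0.2, 0.2), seed=42):
    """Split sample indices per class into train/val/test index lists."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed (assumed) for reproducibility
    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # any rounding remainder goes to test
    return train, val, test

# Ten varieties with 200 original images each, as in the PRBD description.
labels = [variety for variety in range(10) for _ in range(200)]
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 1200 400 400
```

Splitting before any augmented images are added keeps transformed copies of validation or test images out of the training subset.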
3.3. Model Architecture
In this section, we present the architectures of the Swin Transformer and Vision Transformer (ViT) models and explain how each component is utilized for classifying rice varieties in the PRBD dataset.
3.4. Swin Transformer Architecture
In this study, the first deep learning model employed was the Swin Transformer (microsoft/swin-tiny-patch4-window7-224) [21], implemented using the Hugging Face Transformers library (v4.38.0). Figure 2 illustrates the Swin Transformer architecture used in this research.
The input layer received images of size 224 × 224 pixels, which were divided into non-overlapping patches of 4 × 4 pixels. Each patch was projected into a feature embedding vector, forming the initial input sequence of tokens. The embeddings passed through four hierarchical stages, each consisting of multiple Swin Transformer blocks. Within each block, window-based multi-head self-attention was applied, with regular and shifted window partitions alternating between consecutive blocks; this keeps attention computation local while enabling cross-window interaction. Patch merging layers between stages progressively reduced the spatial resolution while increasing the channel dimensionality, capturing both local details and broader contextual information.
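The hierarchical structure described above can be made concrete with a little arithmetic. The sketch below assumes the standard Swin-Tiny configuration (embedding width 96, four stages, 2 × 2 patch merging); these values come from the published Swin-Tiny design rather than from this paper, and the function name is hypothetical.

```python
def swin_stage_shapes(image_size=224, patch_size=4, embed_dim=96, num_stages=4):
    """Return (tokens_per_side, channels) for each hierarchical stage."""
    side = image_size // patch_size  # 224 / 4 = 56 patches per side
    channels = embed_dim
    shapes = []
    for _ in range(num_stages):
        shapes.append((side, channels))
        side //= 2      # patch merging halves each spatial dimension
        channels *= 2   # and doubles the channel width
    return shapes

for stage, (side, channels) in enumerate(swin_stage_shapes(), start=1):
    print(f"stage {stage}: {side} x {side} tokens, {channels} channels")
```

Under these assumptions the token grid shrinks from 56 × 56 to 7 × 7 while the channel width grows from 96 to 768; the final 7 × 7 grid matches the window size in the checkpoint name.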
The model utilized pretrained ImageNet-1k weights provided by the Hugging Face library for all layers except the classification head, which was replaced to match the number of classes in the PRBD dataset. This transfer learning setup allows efficient feature extraction while adapting to the specific grain classification task.
3.5. Vision Transformer (ViT) Architecture
The second deep learning model employed was Vision Transformer (ViT) (google/vit-base-patch16-224) [22], implemented using the Hugging Face Transformers library (v4.38.0). Figure 3 provides a schematic overview of the ViT architecture used in this study.
The input images had dimensions 224 × 224 pixels and were divided into non-overlapping patches of 16 × 16 pixels, giving a 14 × 14 grid of 196 patches. Each patch was flattened and linearly projected into an embedding vector of dimension 768. A class token was prepended to the sequence of patch embeddings, and positional embeddings were added to retain spatial information.
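The token bookkeeping in this paragraph can be checked with a short calculation. The function below is a hypothetical helper; the three-channel RGB input is an assumption consistent with the JPG images described earlier.

```python
def vit_sequence_shape(image_size=224, patch_size=16, hidden_dim=768, channels=3):
    """Summarize how an image is turned into a ViT token sequence."""
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    return {
        "patches_per_side": patches_per_side,
        "num_patches": num_patches,
        # length of one flattened patch before the linear projection
        "flattened_patch_length": patch_size * patch_size * channels,
        # one extra token for the prepended class token
        "sequence_length": num_patches + 1,
        "hidden_dim": hidden_dim,
    }

print(vit_sequence_shape())
```

For this configuration the flattened patch length (16 × 16 × 3 = 768) happens to equal the embedding dimension, and the encoder processes 197 tokens per image.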
The sequence was processed through 12 Transformer encoder layers, each containing multi-head self-attention and feed-forward sub-layers, with layer normalization applied before each sub-layer and a residual connection around it. The final output corresponding to the class token was passed through a linear classification head, which was replaced to match the number of grain categories in the dataset. The model leveraged the pretrained weights provided by the Hugging Face library (this checkpoint is pretrained on ImageNet-21k and subsequently fine-tuned on ImageNet-1k), enabling transfer learning and efficient adaptation to the PRBD dataset while maintaining high classification performance.
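One plausible way to load either checkpoint with a fresh ten-class head is sketched below. It assumes the Hugging Face `transformers` package is installed and that the checkpoints can be downloaded; `CHECKPOINTS`, `NUM_CLASSES`, and `build_classifier` are illustrative names, not the authors' code.

```python
# Checkpoint identifiers as named in Sections 3.4 and 3.5.
CHECKPOINTS = {
    "swin": "microsoft/swin-tiny-patch4-window7-224",
    "vit": "google/vit-base-patch16-224",
}
NUM_CLASSES = 10  # ten PRBD rice varieties

def build_classifier(name, num_labels=NUM_CLASSES):
    """Load a pretrained backbone and attach a new classification head."""
    # Imported lazily so this module can be read without transformers installed.
    from transformers import AutoModelForImageClassification

    return AutoModelForImageClassification.from_pretrained(
        CHECKPOINTS[name],
        num_labels=num_labels,
        # The published checkpoints carry a 1000-class ImageNet head; this flag
        # lets the library discard it and initialize a head of the new size.
        ignore_mismatched_sizes=True,
    )

# Example (downloads weights on first use):
# model = build_classifier("vit")
```

Only the classification head is newly initialized in this setup; all other layers start from the pretrained weights.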
3.6. Experimental Setup and Hyperparameter Tuning
To improve the performance of both models on the public PRBD dataset, a common hyperparameter configuration was used. Both models were trained for a maximum of 50 epochs, with early stopping triggered when the validation loss did not improve for three consecutive epochs, which also helps prevent overfitting. The learning rate was chosen to ensure stable adaptation of the pretrained weights while maintaining convergence speed. The batch size was set to 20 to balance GPU memory usage and gradient estimation quality. The AdamW optimizer provided weight decay regularization, and a StepLR scheduler multiplied the learning rate by a factor of 0.9 after each epoch. For multi-class classification, the cross-entropy loss (nn.CrossEntropyLoss, PyTorch version 2.10) was employed; it acts as a sparse categorical cross-entropy, computing the loss directly from integer class labels without requiring one-hot encoding. The classification heads were adapted to the dataset’s class count, and evaluation included both overall and per-class accuracy metrics to comprehensively assess performance across all rice grain types.
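The effect of the per-epoch decay can be illustrated without committing to a particular base learning rate. The sketch below computes the multiplier that a step schedule with a decay factor of 0.9 applied every epoch produces; in PyTorch this would correspond to `torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)`, where `step_size=1` is inferred from "after each epoch". The function name is hypothetical.

```python
def lr_multiplier(epoch, gamma=0.9, step_size=1):
    """Factor applied to the initial learning rate after `epoch` completed epochs."""
    return gamma ** (epoch // step_size)

for epoch in (0, 1, 5, 10, 49):
    print(f"epoch {epoch:2d}: learning rate = initial x {lr_multiplier(epoch):.4f}")
```

After ten epochs the learning rate has fallen to roughly 35% of its initial value, so later epochs make progressively smaller updates to the pretrained weights.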
3.7. Model Performance Evaluation
To assess the performance of the proposed models, 20% of the total dataset was held out for testing purposes. During training, early stopping with a patience value of three epochs was employed to minimize overfitting. This technique ensures that the training process terminates automatically if the validation performance does not improve for three consecutive epochs, thus promoting better model generalization. After training completion, the reserved test set was utilized to evaluate the final model performance.
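The early-stopping rule described above can be sketched as a small helper. The class and function names are hypothetical, and treating any strictly lower validation loss as an improvement (with no minimum delta) is an assumption.

```python
class EarlyStopping:
    """Stop when validation loss fails to improve for `patience` epochs in a row."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

def epochs_run(val_losses, patience=3, max_epochs=50):
    """Number of epochs actually trained for a given validation-loss history."""
    stopper = EarlyStopping(patience)
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if stopper.step(loss):
            return epoch
    return min(len(val_losses), max_epochs)

# Hypothetical loss history: the best value occurs at epoch 3, and the
# following three epochs bring no improvement.
print(epochs_run([1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.6]))  # 6
```

In practice the weights from the best epoch (epoch 3 in this toy history) would typically be restored before evaluating on the test set.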
A set of comprehensive evaluation metrics, namely the confusion matrix, accuracy, precision, recall, and F1-score [7], was computed to measure the predictive effectiveness of the models. We also employed the saliency map technique [23] to provide visual explanations for the test images.
3.7.1. Confusion Matrix
The confusion matrix summarizes the classification performance of the model across all ten classes (0–9). Each row represents the actual class, while each column corresponds to the predicted class. The diagonal elements indicate the number of correctly classified instances for each class (true positives), and the off-diagonal elements represent misclassifications, showing how often samples from one class were incorrectly predicted as another. This matrix provides a detailed overview of the model’s strengths and weaknesses in distinguishing between all classes [7].
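A confusion matrix with this row and column convention can be built in a few lines. The sketch uses a three-class toy example with made-up labels purely for illustration; the function name is hypothetical.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the actual class, columns index the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

# Toy labels (not from the PRBD experiments).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
for row in confusion_matrix(y_true, y_pred, num_classes=3):
    print(row)
# [1, 1, 0]
# [0, 2, 0]
# [1, 0, 1]
```

The diagonal entries (1, 2, 1) are the per-class true positives, while the off-diagonal entry in row 0, column 1 records one class-0 sample that was predicted as class 1.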
3.7.2. Accuracy
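For multi-class classification, accuracy represents the ratio of correctly predicted instances to the total number of instances in the dataset [24]. It can be calculated using Equation (1), where i denotes the class index, N is the total number of classes (N = 10), TP_i is the number of correctly classified instances of class i, and FN_i is the number of class-i instances assigned to another class, so that the denominator equals the total number of instances:

$$\text{Accuracy} = \frac{\sum_{i=1}^{N} TP_i}{\sum_{i=1}^{N} \left( TP_i + FN_i \right)} \tag{1}$$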
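As a small worked example, accuracy can be read directly off a confusion matrix: the diagonal sum divided by the sum of all entries. The three-class matrix below is a toy example, not a result from this study, and the function name is hypothetical.

```python
def accuracy_from_confusion(matrix):
    """Correct predictions (diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Rows are actual classes, columns are predicted classes (toy values).
toy_matrix = [[1, 1, 0],
              [0, 2, 0],
              [1, 0, 1]]
print(round(accuracy_from_confusion(toy_matrix), 4))  # 0.6667
```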
3.7.3. Precision
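Precision measures how effectively the model avoids false-positive predictions for each class [24]. For a given class i, it is defined as the proportion of true-positive predictions (TP_i) out of all instances predicted as that class, where FP_i denotes the number of false positives for class i:

$$\text{Precision}_i = \frac{TP_i}{TP_i + FP_i} \tag{2}$$

The macro-averaged precision, which provides an overall precision score across all N classes, is expressed as [7]:

$$\text{Precision}_{\text{macro}} = \frac{1}{N} \sum_{i=1}^{N} \text{Precision}_i \tag{3}$$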
3.7.4. Recall
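Recall quantifies the model’s ability to correctly identify all relevant instances for each class [24]. For a given class i, it is defined as the proportion of true-positive predictions (TP_i) relative to the total number of actual positives for that class, where FN_i denotes the number of false negatives for class i:

$$\text{Recall}_i = \frac{TP_i}{TP_i + FN_i} \tag{4}$$

The macro-averaged recall, obtained by averaging recall values across all N classes, is given by [7]:

$$\text{Recall}_{\text{macro}} = \frac{1}{N} \sum_{i=1}^{N} \text{Recall}_i \tag{5}$$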
3.7.5. F1-Score
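The F1-score provides a balanced evaluation by combining precision and recall into a single metric [24]. For class i, it is defined as the harmonic mean of precision and recall:

$$\text{F1}_i = \frac{2 \cdot \text{Precision}_i \cdot \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i} \tag{6}$$

The macro-averaged F1-score, representing the overall balance between precision and recall across all ten classes (N = 10), is computed as [7]:

$$\text{F1}_{\text{macro}} = \frac{1}{N} \sum_{i=1}^{N} \text{F1}_i \tag{7}$$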
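The per-class and macro-averaged definitions above can be sketched from a confusion matrix whose rows are actual classes and whose columns are predicted classes. The function names are hypothetical, the matrix is a toy example rather than a result from this study, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def per_class_metrics(matrix):
    """Return a (precision, recall, f1) tuple for every class."""
    n = len(matrix)
    results = []
    for i in range(n):
        tp = matrix[i][i]
        fp = sum(matrix[r][i] for r in range(n)) - tp  # column total minus TP
        fn = sum(matrix[i]) - tp                       # row total minus TP
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append((precision, recall, f1))
    return results

def macro_average(results):
    """Unweighted mean of each metric over all classes."""
    n = len(results)
    return tuple(sum(r[k] for r in results) / n for k in range(3))

toy_matrix = [[1, 1, 0],
              [0, 2, 0],
              [1, 0, 1]]
macro_p, macro_r, macro_f1 = macro_average(per_class_metrics(toy_matrix))
print(round(macro_p, 4), round(macro_r, 4), round(macro_f1, 4))  # 0.7222 0.6667 0.6556
```

Because macro averaging weights every class equally, a single poorly recognized variety lowers the score as much as any other, which suits a balanced ten-class dataset.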
3.8. Visual Explainability
In this section, we describe the methods used to visually interpret and explain the predictions of our model, focusing on how the model attends to different regions of the input images. We present two complementary techniques, saliency maps (Section 3.8.1) and attention maps (Section 3.8.2), which highlight the most informative areas that influence the model’s classification decisions.
3.8.1. Visual Explainability Using Saliency Maps
To present the visual explainability of our model, we apply saliency maps [23] on the best-performing model among the Swin and ViT architectures. For each class, we select a representative image and compute the gradient of the predicted class score with respect to the input pixels. The absolute values of these gradients form the saliency map, highlighting the regions that most strongly influence the model’s prediction. We normalize both the original image and the saliency map, then overlay the saliency map on the image using a semi-transparent heatmap. Finally, we arrange all classes in a grid so that the saliency patterns across different classes can be compared easily.
This approach allows us to visually inspect the fine-grained patterns, such as textures, edges, or distinguishing details that the model relies on, providing concrete evidence of its discriminative capabilities.
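The saliency procedure can be illustrated on a toy problem. The sketch below is not the pipeline used in this study: it approximates the gradient with central finite differences instead of automatic differentiation, and the 2 × 2 "image", the linear `toy_score` function, and its `WEIGHTS` are all hypothetical stand-ins for a real model and input.

```python
def numerical_saliency(score_fn, image, eps=1e-4):
    """Absolute gradient of the class score with respect to every pixel."""
    saliency = []
    for r, row in enumerate(image):
        saliency_row = []
        for c, value in enumerate(row):
            image[r][c] = value + eps
            score_up = score_fn(image)
            image[r][c] = value - eps
            score_down = score_fn(image)
            image[r][c] = value  # restore the original pixel
            saliency_row.append(abs((score_up - score_down) / (2 * eps)))
        saliency.append(saliency_row)
    return saliency

def min_max_normalize(grid):
    """Rescale all values to the range [0, 1]."""
    flat = [v for row in grid for v in row]
    low, high = min(flat), max(flat)
    span = (high - low) or 1.0
    return [[(v - low) / span for v in row] for row in grid]

def overlay(image, heatmap, alpha=0.5):
    """Semi-transparent blend of a normalized heatmap over a normalized image."""
    return [[(1 - alpha) * p + alpha * h for p, h in zip(image_row, heat_row)]
            for image_row, heat_row in zip(image, heatmap)]

WEIGHTS = [[0.0, 2.0],
           [-1.0, 0.5]]  # hypothetical class-score weights

def toy_score(image):
    """Stand-in for the predicted class score of a trained model."""
    return sum(w * p for w_row, p_row in zip(WEIGHTS, image)
               for w, p in zip(w_row, p_row))

image = [[0.2, 0.4],
         [0.6, 0.8]]
saliency = min_max_normalize(numerical_saliency(toy_score, image))
blended = overlay(min_max_normalize(image), saliency)
print([[round(v, 3) for v in row] for row in saliency])  # [[0.0, 1.0], [0.5, 0.25]]
```

For a linear score the saliency map simply recovers the absolute weights; with a deep network the same steps would use gradients obtained by backpropagation.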
3.8.2. Visual Explainability Using Attention Maps
To present the visual explainability of the best-performing model among the Swin and ViT architectures using attention maps [25], we generate attention maps from the last transformer layer, capturing how the model attends to different image patches for each class. For each class, we select a representative image and extract the attention weights from the [CLS] token to all image patches, averaging across attention heads.
These attention weights form the attention map, highlighting the regions that most strongly influence the model’s prediction. We normalize both the original image and the attention map and remove low-attention areas to focus on the most relevant regions. The attention map is then overlaid on the original image using a semi-transparent heatmap. Finally, we arrange all classes in a grid to allow direct comparison of attention patterns across classes.
This approach provides a clear visual explanation of the fine-grained spatial regions, such as textures, edges, or discriminative features, that the best performing model relies on for classification.
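The attention-map post-processing can be sketched on a toy example. The code assumes attention weights arranged as heads × tokens × tokens with the [CLS] token at index 0, which matches the usual ViT layout; the function name, the 0.5 threshold for removing low-attention areas, and the numbers are illustrative assumptions.

```python
def cls_attention_map(attention, grid_size, keep_threshold=0.5):
    """Build a patch-grid heatmap from the [CLS] row of one layer's attention.

    attention[h][q][k] is the weight head h assigns from query token q to
    key token k; token 0 is [CLS] and tokens 1.. are the image patches.
    """
    num_heads = len(attention)
    num_patches = grid_size * grid_size
    # Average over heads the attention from [CLS] to each patch token.
    cls_to_patch = [
        sum(head[0][1 + j] for head in attention) / num_heads
        for j in range(num_patches)
    ]
    low, high = min(cls_to_patch), max(cls_to_patch)
    span = (high - low) or 1.0
    normalized = [(v - low) / span for v in cls_to_patch]
    # Suppress low-attention patches so only the most relevant regions remain.
    masked = [v if v >= keep_threshold else 0.0 for v in normalized]
    return [masked[r * grid_size:(r + 1) * grid_size] for r in range(grid_size)]

# Two heads, five tokens ([CLS] plus a 2 x 2 patch grid). Only the [CLS]
# query row is listed for each head because it is the only row used here.
toy_attention = [
    [[0.125, 0.125, 0.25, 0.5, 0.0]],
    [[0.125, 0.375, 0.25, 0.25, 0.0]],
]
heatmap = cls_attention_map(toy_attention, grid_size=2)
print([[round(v, 3) for v in row] for row in heatmap])  # [[0.667, 0.667], [1.0, 0.0]]
```

For a ViT with 16 × 16 pixel patches on 224 × 224 inputs the grid size would be 14, and the resulting 14 × 14 map would be upsampled to the image resolution before being overlaid.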
5. Discussion
This study examined the effectiveness of transformer-based vision models for fine-grained classification of Bangladeshi rice varieties using the PRBD dataset. The findings indicate that ViT and Swin Transformer models provide a clear performance advantage over previously reported approaches on this dataset. Rather than reiterating numerical results, this discussion interprets these outcomes in relation to existing research and examines the methodological factors underlying the observed differences.
Table 5 summarizes prior studies and illustrates the evolution of rice classification methods, from traditional machine learning approaches to Convolutional Neural Networks (CNNs), and more recently, transformer-based architectures.
5.1. Comparison with Traditional Machine Learning Approaches
Earlier rice classification studies predominantly employed handcrafted morphological or texture features combined with classifiers such as SVM, Random Forest, Decision Trees, and KNN [11,13]. Although these methods achieved relatively high accuracy on small or controlled datasets, their reliance on manually engineered features limits their ability to generalize to datasets with higher intra-class similarity and large class counts.
The performance gap observed between these approaches and the transformer-based models in this study suggests that handcrafted features may not adequately capture the subtle visual differences present in fine-grained rice varieties. In contrast, transformer architectures learn discriminative representations directly from data, enabling more robust modeling of complex visual patterns without the need for dataset-specific feature design.
5.2. Comparison with CNN-Based Deep Learning Models
CNN-based models such as VGG, Inception, DenseNet, and custom architectures have been widely adopted in rice classification tasks [14,17,19,20]. While several studies reported strong performance, many relied on private datasets or datasets with fewer classes and limited variability, which restricts direct comparability.
The most relevant benchmark is the work of Ref. [5], which evaluated DenseNet201 on the PRBD dataset. The markedly improved performance observed with transformer-based models in this study indicates that CNNs may be less effective at capturing the global structural relationships required for discriminating visually similar rice varieties. This limitation arises from the local receptive fields inherent to convolutional operations, which prioritize local texture patterns over long-range spatial dependencies.
5.3. Comparison with Hybrid and Transformer-Based Approaches
Hybrid approaches that combine CNN feature extractors with external classifiers, such as SVMs, have demonstrated competitive performance on smaller regional datasets [19]. However, these models retain the locality bias of CNN features and introduce additional architectural complexity.
Transformer-based approaches remain relatively underexplored in rice variety classification. Although CNN-transformer hybrids have been proposed [10], differences in dataset composition and experimental settings limit direct comparison. In this context, the present study provides evidence that pure transformer architectures can outperform both CNN and hybrid models on a challenging public dataset.
The improved performance of ViT and Swin Transformer can be attributed to their self-attention mechanisms, which enable the integration of long-range contextual information with fine-grained local features. This capability is particularly important for rice classification, where class-level distinctions are often defined by subtle variations in grain shape, boundary structure, and surface texture.
Consistent performance across multiple evaluation metrics and low confusion among visually similar classes further suggest that the observed improvements reflect robust class discrimination rather than isolated gains. Attention and saliency map analyses support this interpretation by indicating that the models focus on semantically meaningful grain regions instead of background artifacts.
From a methodological perspective, these findings contribute to agricultural computer vision research by demonstrating that transformer-based models are well suited for fine-grained classification tasks characterized by limited inter-class separability. This supports emerging evidence that attention-based architectures can outperform CNNs when global contextual information is critical.
From a practical standpoint, improved rice variety classification has direct implications for automated quality assessment, seed certification, and large-scale monitoring in Bangladesh. Accurate and reliable automated systems can reduce human error and facilitate scalable deployment in real-world agricultural workflows.
5.4. Limitations of This Research
Despite the promising results, this study is constrained by the scope of the PRBD dataset, which represents region-specific rice varieties and may not capture broader variability in cultivation conditions, imaging environments, or labeling quality. Additionally, evaluation focused primarily on quantitative performance metrics and visual interpretability analyses.
6. Conclusions
This study demonstrates the effectiveness of transformer-based deep learning architectures for fine-grained classification of rice varieties using the PRBD dataset. Both the Swin Transformer and Vision Transformer (ViT) models showed strong discriminative capability across ten Bangladeshi rice varieties, indicating their ability to capture subtle visual differences and complex patterns in grain images.
Among the evaluated models, ViT consistently exhibited superior performance, reflected in higher classification accuracy and more reliable class-level predictions compared with the Swin Transformer. Analysis of the confusion matrices suggests that ViT provides improved class separability, reducing misclassifications among visually similar rice varieties. In addition to quantitative evaluation, saliency and attention map visualizations offered complementary insights into the image regions and fine-grained features that contributed to the model’s predictions.
Overall, the findings indicate that transformer-based models, particularly ViT, represent a robust and effective solution for automated rice variety classification. Beyond this specific application, the study highlights the broader potential of attention-based architectures for agricultural image analysis and fine-grained visual recognition tasks, providing a solid foundation for future research in precision agriculture and intelligent crop monitoring.
Future Work
Future research may focus on developing lightweight transformer architectures optimized for deployment in resource-constrained environments, such as local markets or other settings with limited computational capacity. Hybrid frameworks that combine CNNs for initial feature extraction with transformers for fine-grained classification could further improve computational efficiency while maintaining high accuracy.
Additional ablation studies and more detailed analysis of attention mechanisms would provide deeper insight into the architectural components that drive model performance. Furthermore, integrating multi-modal data sources, such as combining image data with genomic or agronomic information, may enhance robustness and scalability. Extending the proposed methodology to other staple crops would also support broader adoption of AI-driven solutions in precision agriculture, crop quality assessment, and decision-support systems.