1. Introduction
Rice is one of the most important crops worldwide, providing the primary source of calories for more than half of the global population. However, its production is constantly threatened by a wide range of diseases caused by fungi, bacteria, and viruses, such as rice blast, sheath blight, and bacterial leaf blight [
1]. These diseases can lead to severe yield losses, significantly affecting food security and farmers’ income, particularly in developing countries. Early and accurate diagnosis of paddy diseases is, therefore, a key step in implementing timely control measures and ensuring sustainable rice production.
Traditional methods of plant disease diagnosis rely on manual inspection by trained specialists, which is time-consuming, subjective, and impractical at large scales [
2]. In recent years, advances in artificial intelligence (AI) and computer vision have enabled the automatic detection and classification of plant diseases from leaf images, providing an alternative to conventional practices [
3]. Convolutional Neural Networks (CNNs) and related deep learning architectures have achieved remarkable results in image-based recognition tasks [
1,
4]. However, their performance depends on the availability of large, balanced, and well-annotated datasets, conditions that are rarely met in real-world agricultural environments.
In paddy disease diagnosis, the problem is intensified by class imbalance and limited data availability, as certain diseases occur more frequently or are easier to capture than others. This imbalance may bias the learning process toward dominant classes, leading to poor generalization for rare diseases. Recent work highlights that class imbalance is a pervasive issue in precision-agriculture machine learning and can make overall accuracy misleading because models optimized for total error tend to favor majority classes. In particular, Miftahushudur et al. [
3] survey imbalance-handling strategies across agricultural applications and emphasize using imbalance-aware metrics, including F1-score, and techniques that explicitly target minority-class performance, including resampling and synthetic data generation.
Recent research in computer vision and pattern recognition has highlighted the importance of methods that can capture both similarities and dissimilarities among samples in a discriminative feature space. In this context, metric learning approaches have demonstrated strong potential, as they map samples into representations where semantic relationships are encoded through the distances between them. The recently proposed contrastive dissimilarity framework [
5] has demonstrated promising results in various domains, particularly in scenarios with imbalanced and limited data [
6,
7,
8]. By integrating representation learning with a dissimilarity-based contrastive metric, this approach offers robustness to unbalanced class distributions and improved generalization from scarce samples.
This work evaluates the applicability of contrastive dissimilarity [
5], a method designed to address data imbalance and limited sample availability. This approach integrates a dissimilarity-based representation with contrastive learning into a unified framework, using representation learning to extract meaningful features and metric learning to estimate a task-specific dissimilarity function. Using the Paddy Doctor dataset [
9], which provides annotated images of diseased and pest-affected rice leaves, we evaluate the capacity of this method to discriminate between visually similar conditions and to mitigate the adverse effects of dataset imbalance.
In summary, this work makes three contributions: (i) we propose contrastive dissimilarity for rice disease diagnosis, targeting the imbalanced datasets typical of real scenarios; (ii) we provide an empirical analysis showing that our approach performs well even for diseases with few samples in the original dataset; and (iii) we demonstrate consistent improvements through ablations that verify the robustness of our approach across multiple configurations.
The remainder of this work is organized as follows:
Section 2 presents related work.
Section 3 details the materials and methods.
Section 4 describes the experimental setup, including the dataset and the training and evaluation protocols.
Section 5 presents and discusses the results, and we point out the concluding remarks drawn from this work in
Section 6.
2. Related Works
Miftahushudur et al. [
3] provide an overview of methods for imbalanced agricultural datasets, organizing approaches into algorithm-level, data-level, and hybrid strategies, with an emphasis on resampling pipelines (e.g., over/under-sampling, SMOTE-style variants) and their trade-offs (e.g., overfitting, boundary overlap, computational cost). The survey also reviews the emerging use of deep generative models for synthetic augmentation in high-dimensional data, and discusses open challenges to field deployment, such as noisy and incomplete data, difficulty of early-stage disease separation, and limited availability of standardized public benchmarks for reproducible comparisons.
Zhang et al. [
10] introduced an ensemble framework that embeds a salient-position attention mechanism into different lightweight backbones (YOLO, EfficientNet, MobileNet, ShuffleNet) and fuses them. The approach improves single-model performance and reaches 98.33% accuracy for rice-leaf disease identification in complex field backgrounds, highlighting gains in robustness and generalization under natural conditions.
Mookkandi et al. [
11] propose a lightweight vision transformer with 814.7 k learnable parameters and 85 layers, designed for classifying crop diseases in paddy and wheat. The architecture combines a convolutional block attention module, squeeze-and-excitation (SE), and depth-wise convolution, followed by a ConvNeXt module. The proposed model was tested on a paddy dataset (7857 images, eight classes) and a wheat dataset (5000 images, five classes). The authors report an accuracy of 98.47% for the paddy dataset and 92.8% for the wheat dataset.
Using a custom dataset with 5932 diseased rice leaf images and 1500 healthy images, Petchiammal and Murugan [
12] evaluated a deep Convolutional Neural Network (CNN) and nine transfer-learning models (VGG19, VGG16, DenseNet121, MobileNetV2, DenseNet169, DenseNet201, InceptionV3, ResNet152V2, and NASNetMobile) using TensorFlow. Each model’s performance was assessed to find the most effective classification system, covering four disease categories and one non-disease category. The research aimed for shorter training times, higher accuracy, and easier retraining, with DenseNet121 achieving the highest classification accuracy of 97.6% on the paddy leaf image dataset.
Padhi et al. [
13] present the usage of EfficientNet B4, a deep learning architecture trained on the publicly available Paddy Doctor dataset. They used data augmentation to reach 19,131 labeled images for training and 4785 images for testing. The model aims to classify paddy leaf samples into nine disease categories or into normal specimens. EfficientNet B4, known for its structured scaling technique and Swish activation function, was trained using pretrained weights from ImageNet and optimized with the Adam optimizer, achieving an accuracy of 96.91%.
Also using the Paddy Doctor dataset, Thanuboddi and Nelakuditi [
14] proposed combining Self-Supervised Deep Hierarchical Reconstruction (SSDHR) and Long Short-Term Memory (LSTM) networks, which perform early disease detection from spatial and temporal data, respectively. The SSDHR network uses multi-branch convolution kernels to extract distinct discriminative characteristics rather than relying on conventional leaf-based indicators, and it incorporates spatial- and temporal-attention mechanisms (Symmetric Fusion Attention) to improve feature selection, combined with an XGBoost classifier. The proposed approach was evaluated on the Paddy Doctor dataset with 16,225 sample images from 13 classes, and experiments show that the framework achieves a 99.25% accuracy rate.
Bera et al. [
15] introduce an attention-based CNN framework for plant-disease recognition by focusing on the diseased regions via region-wise feature aggregation and attention weighting. They benchmark on several public datasets, including Paddy Doctor, and report accuracy across datasets when using standard CNN backbones (e.g., DenseNet/MobileNet) with ImageNet initialization. Their best reported accuracy result using Paddy Doctor is 99.65%.
Furthermore, in the work of Petchiammal et al. [
9], the Paddy Doctor dataset was introduced, together with benchmarking results obtained on the same dataset variant used in our work. The benchmarking results, covering different models, are presented in
Table 1.
4. Experimental Setup
In this section, we describe the experimental setup designed to evaluate the effectiveness of the proposed contrastive dissimilarity framework in the rice disease diagnosis task. The experiments were conducted using the Paddy Doctor dataset, which offers a realistic and imbalanced representation of field conditions. We outline the dataset characteristics, data preparation, and model training protocols, as well as the evaluation procedures used to assess performance under varying levels of data scarcity. The objective is to verify whether the proposed approach maintains competitive accuracy and robustness across different training regimes and dataset configurations.
4.1. Paddy Doctor Dataset
The Paddy Doctor dataset (Available at [
https://dx.doi.org/10.21227/hz4v-af08], accessed on 24 January 2026) was originally composed of visual and infrared images of paddy leaves obtained from real paddy fields in a village near the Tirunelveli district of Tamil Nadu, India [
9]. In this work, we explore the visual images aiming to perform automatic rice disease classification. The images were collected in real fields using high-resolution smartphone cameras between February and April 2021. After cleaning and annotation, the dataset contains 16,225 RGB images at a resolution of 1080 × 1440 pixels, organized into 13 classes: 12 disease categories and healthy leaves, as detailed in
Table 2.
The class distribution in the dataset is imbalanced, ranging from 450 samples for Bacterial Panicle Blight (BPB) to 2405 samples for the Normal class. Excluding the Normal class, approximately half of the remaining classes contain fewer than 1000 samples, while two classes exceed 2000 samples. This uneven distribution indicates a disparity that may bias the diagnostic performance of machine learning models.
Furthermore,
Figure 1 shows some sample images from the dataset.
The original authors of the Paddy Doctor dataset released four variants. In this work, we used the “Small and Split” variant, which consists of 16,225 images resized to 256 × 256 pixels and divided into training and test sets. The training set contains 12,980 images (80%), and the test set contains 3245 images (20%). Both sets were stratified according to class labels and paddy variety to ensure reproducibility.
4.2. Training Protocol
The input images were from the Paddy Doctor dataset variant “Small and Split” (256 × 256). As illustrated in
Figure 2, all dataset variants contain the same number of images and classes. The only differences between them are the image resolution, which also affects the archive size, and the fact that the “Small and Split” variant is already divided into training and test sets. This variant was also used by the Paddy Doctor authors to run their benchmark.
To train the proposed contrastive dissimilarity model, we adopted the same methodological foundations introduced by ref. [
5], where the training is performed through a metric learning framework integrated into a contrastive loss function.
Algorithm 1 summarizes the pseudocode used in our approach, adapted from ref. [
5]. The input consists of two sets,
im′ and
im″, each containing data points. The batch size determines the number of matched pairs in each set, where elements at the same index belong to the same class (i.e., im′₁ matches im″₁, im′₂ matches im″₂, and so on). In Line 1,
im′ and
im″ are concatenated into a single set x. Next, Lines 2–3 use tile and repeat to expand
x into
x′ and
x″: tile replicates the array along specified axes, while repeat duplicates each element a given number of times. This expansion is used to enumerate all possible pairings between elements originating from
im′ and
im″. In Line 4, the method computes dissimilarities for the resulting pairs, and in Line 5 reshapes them into the matrix form required by the subsequent contrastive loss computation (Line 6).
Algorithm 1: Contrastive dissimilarity training
Input: image vectors im′ and im″, labels y, batch size n, model m
Output: updated model m
1: x ← concat(im′, im″)
2: x′ ← tile(x, 2n)
3: x″ ← repeat(x, 2n)
4: d ← dissimilarity(m, x′, x″)
5: D ← reshape(d, (2n, 2n))
6: loss ← contrastive_loss(D, y)
7: m ← update(m, ∇loss)
8: return m
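Under the description above, Lines 1–5 of Algorithm 1 can be sketched in NumPy. The function and variable names here are ours, and the Euclidean distance stands in for the learned dissimilarity function:

```python
import numpy as np

def dissimilarity_matrix(im1, im2):
    """Sketch of Lines 1-5 of Algorithm 1: enumerate all pairings between
    two matched batches and compute their pairwise dissimilarities.
    Euclidean distance is a placeholder for the learned metric."""
    x = np.concatenate([im1, im2], axis=0)             # Line 1: stack into 2n rows
    m = x.shape[0]                                     # m = 2n
    x_tiled = np.tile(x, (m, 1))                       # Line 2: whole block repeated m times
    x_repeated = np.repeat(x, m, axis=0)               # Line 3: each row repeated m times
    d = np.linalg.norm(x_tiled - x_repeated, axis=1)   # Line 4: one value per pair
    return d.reshape(m, m)                             # Line 5: (2n, 2n) matrix for the loss
```

For a batch of n = 2 positive pairs, the result is a symmetric 4 × 4 matrix with a zero diagonal, where entry (i, j) holds the dissimilarity between embeddings i and j.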
During training, the training set was augmented by creating sample pairs according to class membership, ensuring the presence of both positive (same-class) and negative (different-class) pairs. The dissimilarity function was learned through a projection head consisting of fully connected layers.
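An illustrative pair sampler for this step is sketched below. The function name and sampling policy are ours, but it preserves the key property that elements at the same index share a class (positive pairs), so every cross-index pairing in the batch acts as a negative pair:

```python
import random
from collections import defaultdict

def make_matched_batch(samples, labels, n, seed=0):
    """Draw n positive pairs: im1[i] and im2[i] share a class.
    Illustrative sampler, not the authors' exact implementation."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for s, y in zip(samples, labels):
        by_class[y].append(s)
    # only classes with at least two samples can form a positive pair
    classes = [c for c, items in by_class.items() if len(items) >= 2]
    im1, im2, ys = [], [], []
    for _ in range(n):
        c = rng.choice(classes)
        a, b = rng.sample(by_class[c], 2)   # two distinct same-class samples
        im1.append(a)
        im2.append(b)
        ys.append(c)
    return im1, im2, ys
```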
In addition, to simulate scenarios of limited data availability, we conducted experiments with progressively smaller portions of the training set: 100%, 50%, 20%, 10%, and 5% of the original training data. Notably, the testing set remained unchanged across all experiments, ensuring a fair and consistent comparison of results under varying training data conditions.
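The progressive reduction can be implemented with a per-class (stratified) subsampling routine such as the hypothetical helper below, which keeps class proportions approximately intact at every fraction:

```python
import math
import random
from collections import defaultdict

def stratified_subset(indices, labels, fraction, seed=42):
    """Keep roughly `fraction` of the training indices while preserving
    per-class proportions (illustrative helper; rounding with ceil keeps
    at least one sample per class even at the 5% setting)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in zip(indices, labels):
        by_class[y].append(idx)
    subset = []
    for idxs in by_class.values():
        k = max(1, math.ceil(len(idxs) * fraction))
        subset.extend(rng.sample(idxs, k))
    return sorted(subset)
```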
Figure 3 shows the classification setup we adopted in this work.

4.3. Evaluation Protocol
The evaluation of the proposed approach was conducted using the Paddy Doctor dataset, a benchmark dataset designed for rice disease classification. To ensure reproducibility and prevent bias, we followed the official “Small and Split” variant released by the dataset authors, which includes predefined training and testing partitions. Additionally, we considered the stratified variant provided with the dataset to ensure class representation.
To assess different levels of data scarcity, we repeated the evaluation across the same progressively reduced training set sizes (100%, 50%, 20%, 10%, and 5%) while keeping the testing set unchanged in all scenarios.
For feature extraction, we employed EfficientNetV2 [
20] as the primary convolutional backbone, while prototype selection was carried out using K-Means clustering. Additionally, we evaluated different embedding sizes for both the dissimilarity space and the dissimilarity vector.
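A minimal sketch of this prototype-selection step, assuming backbone embeddings have already been extracted: each class is clustered independently with scikit-learn's KMeans and the centroids serve as prototypes (the number of prototypes k and the seed are illustrative choices):

```python
import numpy as np
from sklearn.cluster import KMeans

def select_prototypes(embeddings, labels, k=2, seed=42):
    """Per-class prototype selection via K-Means: cluster each class's
    backbone embeddings and keep the k centroids as prototypes."""
    prototypes = {}
    for y in np.unique(labels):
        feats = embeddings[labels == y]
        km = KMeans(n_clusters=min(k, len(feats)), n_init=10,
                    random_state=seed).fit(feats)
        prototypes[y] = km.cluster_centers_
    return prototypes
```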
For benchmarking, we compared our approach against a baseline Convolutional Neural Network (CNN), EfficientNetV2 [
20], trained directly on the same dataset splits. This comparison enabled us to assess whether contrastive dissimilarity provided consistent improvements over a standard deep learning pipeline.
All representations were evaluated using Logistic Regression as a classifier. This interpretable yet straightforward classifier was chosen to keep the focus on assessing the viability of contrastive dissimilarity in the disease recognition context, without introducing further complexity.
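This linear-probe evaluation can be sketched as follows; the feature vectors here are synthetic stand-ins for the learned representations, not the paper's actual features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, well-separated "representations" standing in for the
# contrastive-dissimilarity features of the Paddy Doctor splits.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (50, 8)), rng.normal(3, 1, (50, 8))])
y_train = np.array([0] * 50 + [1] * 50)
X_test = np.vstack([rng.normal(0, 1, (20, 8)), rng.normal(3, 1, (20, 8))])
y_test = np.array([0] * 20 + [1] * 20)

# Fit the simple linear probe and measure test accuracy.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```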
The evaluation metric employed was classification accuracy, as it is the primary metric reported in the Paddy Doctor benchmark and widely adopted for image classification tasks.
5. Results and Discussion
This section presents the results obtained in our experiments, followed by a discussion that contextualizes our achievements with existing literature results.
Table 3 summarizes the results obtained under different training set sizes. As expected, overall accuracy increased with the availability of more training data. In the most constrained scenario (5% of the training set), the baseline CNN achieved 80.6%, the contrastive dissimilarity space (CDS) 79.6%, and the contrastive dissimilarity vector (CDV) 79.8%. The difference was small (≈1 percentage point), demonstrating that the proposed approach remains competitive even under extreme scarcity conditions.
Table 4 shows that our proposed methods (CDS and CDV) consistently match or improve upon the baseline, with the largest gains observed in the low-data regime. At 10% training size, both variants improve precision (from 85.2 to 86.7 for CDS and 86.4 for CDV) and achieve higher F1-scores (from 84.7 to 85.2 and 85.3, respectively), indicating a better overall balance between false positives and false negatives. The effect is even more evident at 20%, where CDV achieves the best performance across all metrics, reaching 93.0 precision, 90.9 recall, and 91.7 F1, surpassing the CNN baseline (92.0/90.3/91.1). As the training size increases to 50% and 100%, both variants remain competitive with the baseline.
With 10% and 20% of the training data, all models converged toward similar performance levels (≈86–91%), with CDV showing a small advantage, as seen in
Figure 4. These results are consistent with the observations of [
5], who reported that contrastive dissimilarity maintains stable accuracy across varying levels of data availability. At 50% training data, CDS surpassed the CNN baseline (96.3% vs. 96.1%), and at full training size (100%), both CDS and CDV outperformed the baseline CNN (98.2% vs. 97.3%), and also the benchmark reported by Petchiammal et al. [
9] (98.2% vs. 97.5%).
The results in the most constrained scenario are also shown in
Figure 5, presenting the confusion matrix for training size with 5%.
Since both models predict on the same instances, we used McNemar’s test, which is appropriate for paired classification outcomes. The test results were not significant at the adopted significance level, so we do not claim a significant overall difference. Still, the confusion matrix in
Figure 5 shows the key effect: robust predictions for the minority class, which is the main outcome of our approach.
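McNemar's test compares the two models only on the instances where they disagree. A minimal exact (binomial) version is sketched below; the paper does not state which implementation was used, so this is illustrative:

```python
from math import comb

def mcnemar_exact(n01, n10):
    """Exact (binomial) McNemar test on the discordant pairs:
    n01 = model A correct / model B wrong, n10 = the reverse.
    Returns the two-sided p-value."""
    n = n01 + n10
    if n == 0:
        return 1.0  # no disagreements, nothing to test
    k = min(n01, n10)
    # two-sided p-value: double the tail probability under Binomial(n, 0.5)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)
```

A perfectly balanced disagreement (e.g., 5 vs. 5) yields p = 1.0, whereas a strongly one-sided disagreement (e.g., 9 vs. 1) falls below the conventional 0.05 threshold.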
Table 5 presents the impact of different embedding dimensionalities on model performance. Accuracy remained consistently above 90% across all tested dimensions, with the best results achieved using compact embeddings (32-D for CDS at 98.4% and 128-D for CDV at 98.2%). Larger embeddings (256-D) did not show improvements. These findings are also consistent with those of [
5], which stated that higher-dimensional projection heads do not necessarily enhance performance and suggested the use of compact embeddings instead.
To further investigate the robustness of our approach, we conducted an ablation study evaluating the influence of different data augmentation strategies on model performance. Three augmentation pipelines were tested: (i) no augmentation beyond random cropping; (ii) geometric transformations, comprising vertical and horizontal flips and rotations; and (iii) color-based augmentations, incorporating Gaussian blur and random brightness–contrast adjustments.
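The three pipelines can be sketched with NumPy as follows; the crop size, flip probabilities, and jitter ranges are our assumptions (arbitrary-angle rotation is simplified to 90° steps and Gaussian blur is omitted), since the exact parameters are not given in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(img, size):
    """Pipeline (i): random cropping only."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]

def geometric(img):
    """Pipeline (ii): random horizontal/vertical flips and rotations
    (simplified here to multiples of 90 degrees)."""
    if rng.random() < 0.5:
        img = np.flip(img, axis=1)   # horizontal flip
    if rng.random() < 0.5:
        img = np.flip(img, axis=0)   # vertical flip
    return np.rot90(img, k=rng.integers(0, 4))

def color_based(img):
    """Pipeline (iii): random brightness-contrast jitter on images
    scaled to [0, 1] (Gaussian blur omitted for brevity)."""
    alpha = rng.uniform(0.8, 1.2)    # contrast factor
    beta = rng.uniform(-0.1, 0.1)    # brightness shift
    return np.clip(img * alpha + beta, 0.0, 1.0)
```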
The results, summarized in
Table 6, reveal that the overall accuracy of both CDS and CDV models remained stable across all configurations, with variations of less than 0.3 percentage points. This consistency indicates that the proposed framework is largely invariant to the tested types of transformations.
Table 7 shows that CDS and CDV are insensitive to the number of prototypes. Accuracy is already high with one prototype (98.1%) and quickly saturates, with only minor fluctuations thereafter. CDV is highest at two prototypes (98.4%), while CDS stays essentially the same, around 98.2%, suggesting that a small prototype set is sufficient.
Table 8 shows that the choice of prototype selection algorithm has minimal impact on performance. All evaluated methods fall within a 0.2% range, with k-means giving the best overall results (98.2% for CDS and 98.3% for CDV). This indicates that our approach is not sensitive to how prototypes are selected, and simple clustering (k-means) is sufficient.
In
Table 9 we present a consolidation of results from the literature, our baseline and our approach. Across all configurations, our best-performing model achieves 98.4% accuracy, improving upon both the CNN baseline (98.4% vs. 97.3%) and the previously reported benchmark on the Paddy Doctor dataset (98.4% vs. 97.5%). We also evaluated a cost-sensitive (weighted) Support Vector Machine (wSVM), which introduces class-specific penalties by reweighting the hinge-loss so that minority-class errors are penalized more heavily than majority-class errors [
21]. This approach is widely used for learning under class imbalance and asymmetric error costs; we employed the standard balanced weighting scheme, which sets weights according to the inverse class-frequency ratio. We trained the wSVMs with hyperparameters selected using grid search over the kernel type, kernel ∈ [“rbf”, “linear”], together with the remaining SVM hyperparameters. For each hyperparameter combination, model fitting and selection were performed using 5-fold cross-validation on the training set, and the final model was refit on the full training set using the best parameters identified during the search.
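The tuning protocol can be sketched with scikit-learn as below; the synthetic data and the C grid are placeholders, since the exact value grids are not reproduced here:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic imbalanced data standing in for the CNN feature vectors.
X, y = make_classification(n_samples=200, n_features=10,
                           weights=[0.8, 0.2], random_state=42)

# Grid over kernel type plus a placeholder C grid; class_weight="balanced"
# applies the inverse class-frequency weighting described in the text.
param_grid = {"kernel": ["rbf", "linear"], "C": [0.1, 1, 10]}
search = GridSearchCV(SVC(class_weight="balanced"), param_grid, cv=5)
search.fit(X, y)  # refit=True (default) retrains on the full training set
best_svm = search.best_estimator_
```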
To ensure a fair comparison, the weighted SVM was trained on feature vectors extracted from the same baseline CNN used in our study. Under this protocol, the weighted SVM reaches 97.5% accuracy on the full dataset, falling short of our best result (98.4%).
In the constrained scenarios, the weighted SVM performs more competitively, surpassing the baseline CNN as well as the CDS and CDV variants. Notably, these gains stem from the control of the decision boundary via cost reweighting, whereas our method targets improvements at the representation level through a representation learning framework. In other words, the SVM’s advantage in these settings reflects calibration of error trade-offs, while our approach seeks robustness through learned features that better capture minority-class structure under limited or constrained conditions.
Also, we contextualize our results against the published numbers available for the Paddy Doctor dataset. Direct comparison to the top-performing approach is limited, as prior work does not report performance under the same constrained evaluation protocol considered here. Nevertheless, under our stricter and more diagnostic setting, our method remains competitive and consistently improves over standard baselines, indicating that the proposed framework provides an effective and principled path for addressing class imbalance without relying solely on post hoc cost tuning.
Finally, we evaluated our approach by constructing a classifier ensemble combining CNN, weighted SVM, and our methods CDS and CDV using the sum rule for decision fusion [
22]. Results are shown in
Table 10, and we can observe that the ensemble consistently outperforms each individual classifier, indicating that the gains cannot be attributed to simple redundancy or averaging effects. In particular, while the CNN captures highly discriminative hierarchical representations and the weighted SVM emphasizes margin-based decisions with class imbalance awareness, the proposed approach contributes with information that is not fully exploited by either model alone. The improvement observed after fusion suggests that our method focuses on distinct characteristics of the data, leading to complementary patterns across classifiers. This diversity is a key requirement for effective ensembles and provides empirical evidence that the proposed approach encodes novel and relevant information, rather than what is already learned by standard deep or kernel-based models.
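The sum rule simply adds the posterior estimates of the individual classifiers before taking the argmax; a minimal sketch:

```python
import numpy as np

def sum_rule_fusion(prob_list):
    """Combine classifiers by summing their posterior estimates and
    taking the argmax (the sum rule of decision fusion). Each array
    in prob_list has shape (n_samples, n_classes)."""
    total = np.sum(prob_list, axis=0)
    return np.argmax(total, axis=1)
```

For example, if one classifier favors class 0 on a sample and the other strongly favors class 1, the fused decision follows the larger accumulated posterior mass.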
It is important to note that cross-validation was used only within the SVM hyperparameter tuning procedure, during the grid search on the training set. No cross-validation was performed for any other analyses or model training in this study. The rationale is that the Paddy Doctor dataset already provides a predefined and standardized train–test split released by the original authors. To ensure full comparability, we adhered to the official partitioning protocol. Consequently, all results reported here correspond to this fixed evaluation split, which guarantees methodological consistency with the benchmark configuration.
Compared with the original Paddy Doctor benchmarks, where CNN variants typically achieved accuracies in the range of 95–97% under full training sets, the results presented here demonstrate that contrastive dissimilarity can provide competitive results. The achieved accuracy of 98.2–98.4% underlines that the proposed method is robust in imbalanced and low-data scenarios (as originally proposed), and also capable of improving classification in well-resourced conditions. This observation extends the scope of contrastive dissimilarity, aligning with the insights of [
6,
7], who reported comparable robustness for music genre classification and for writer identification tasks across both balanced and imbalanced datasets.
While the proposed method does not yet match the current state of the art, its primary contribution lies in introducing a novel perspective on feature representation and decision modeling. As widely recognized in the literature, methodological diversity is a key factor in extracting complementary information from the data, which is particularly valuable for building robust ensemble systems. Our approach demonstrates the ability to capture information that differs from that learned by standard pipelines, such as SVM classifiers trained on convolutional feature embeddings. These results suggest that the proposed framework represents a promising and functional direction, which can be further refined through future research, performance optimization, and systematic integration into ensemble strategies to enhance overall classification robustness and generalization.
Although the method shows encouraging performance, we identify two key practical limitations. First, the amount of data required for the approach to be advantageous is scenario-dependent, which makes it hard to define a single, clear cutoff where it consistently surpasses conventional baselines. Second, the framework relies on several tunable design choices that influence results and need calibration.
In addition, the Paddy Doctor dataset also highlights directions where complementary work would be valuable: for example, assessing robustness to field factors such as illumination changes, background clutter, and symptom severity, and validating transfer to unseen acquisition conditions such as different devices, seasons, or locations.
6. Conclusions
In this work, we investigated the use of contrastive dissimilarity for addressing class imbalance and limited data availability in rice disease diagnosis using the Paddy Doctor dataset. By integrating representation learning with a dissimilarity-based contrastive metric, the proposed approach successfully captured discriminative relationships between visually similar diseases, demonstrating competitive and, in some cases, superior performance compared to a baseline CNN model.
Our experiments revealed that contrastive dissimilarity not only maintained robustness under severe data scarcity (down to 5% of the training set) but also achieved the highest accuracy when trained with the full dataset (98.2–98.4%). This indicates that the method generalizes well across different levels of data availability, mitigating the adverse effects of class imbalance while preserving discriminative capability.
The results suggest that contrastive dissimilarity is a promising direction for precision agriculture, where data imbalance and limited annotations are common. Its ability to produce compact and effective embeddings while remaining resilient to uneven class distributions makes it suitable for broader applications in agricultural imaging and beyond, including other crops and disease diagnosis systems.
Future work will explore the integration of contrastive dissimilarity with transformer-based vision backbones and few-shot learning frameworks, aiming to further enhance performance in ultra-low data regimes. Additionally, incorporating multimodal information, such as spectral or environmental data, may provide complementary cues that further improve diagnostic reliability in real-world field conditions. Finally, we will evaluate cross-region, cross-season, and cross-dataset generalization under field acquisition conditions (lighting variation, occlusion, device shift) and will incorporate uncertainty calibration and open-set detection to reduce failures on unknown diseases.