1. Introduction
Potato crops are a cornerstone of the global food chain, ranking as the third most important food supply worldwide. It is therefore essential to prevent and mitigate the impact of stressors that can impair their production. Abiotic stresses pose a rigorous test for modern agriculture, causing yield losses in major food crops that can reach up to 82% and cost millions of USD [
1]. In this study, two major symptoms of herbicide damage and soft wind stress are examined: interveinal chlorosis and leaf curling. Interveinal chlorosis is characterized by the yellowing of leaf tissue, triggered by the reduction or destruction of chlorophyll, the pigment essential for photosynthesis [
2]. Leaf curling is a physical response of the plant that can be caused by “wind stress”; it is generally characterized by leaf margins curling upwards and potentially browning at the edges due to water loss [
3,
4]. To reduce the costs resulting from these factors, precise symptom detection using technological tools is vital.
Automated detection of abiotic stressors, using modern solutions such as artificial intelligence, can play a decisive role in combating these stressors and reducing the number of crops affected by them. Traditional methods, including irrigation practices and fertilizer application, have significant limitations, leading to salinity buildup in soil and pollution. Furthermore, manual scouting is labor-intensive, subjective, and often fails to detect early-stage symptoms that are not yet visually distinct. To overcome these limitations, Park et al. (2025) [
5] utilized hyperspectral imaging to identify drought and heat stress with high precision by analyzing multi-spectral bands and Jia et al. (2024) [
6] demonstrated the efficacy of UAV-based spectral sensors for detecting viral diseases on potato plants in a commercial grower’s field. While powerful, these spectral approaches often require expensive, specialized hardware that is generally unavailable to farmers. Moreover, Lapajne et al. (2025) [
7] noted that they can struggle with diagnostic ambiguity in potato crops, since spectral signatures of abiotic stresses can often overshadow those of biotic infections, complicating precise diagnosis. In contrast, deep learning models applied to computer vision can recognize complex patterns and make accurate predictions based on hidden features learned from RGB leaf imagery [
8,
9]. These methods can help transform modern agricultural practices toward more sustainable and economically viable pesticide management and water use.
In particular, Convolutional Neural Network (CNN)-based models have shown strong results in the detection and classification of abiotic plant stress [
10]. The limitation of these models is their tendency to focus on local feature extraction, like texture and edges, which may lead to suboptimal performance in some cases [
11]. Certain symptoms are not confined to a single region but may exhibit global patterns and long-range relationships across an image. To address these challenges, Transformer-based models, which excel in general object detection, have proven effective in a variety of applications [
12].
Transformers have played a pivotal role in the development of neural networks, since the attention mechanism addresses limitations in modeling long-range dependencies in data [
13]. Even though originally designed for Natural Language Processing, self-attention was adapted for image classification by the Vision Transformer (ViT) [
14], which divided images into patches, treating them as a sequence of tokens and proving that Transformers could deliver powerful results in Computer Vision. A significant leap occurred when Carion et al. (2020) [
15] introduced the Detection Transformer (DETR), a Transformer model with learnable object queries, which are trained to “ask” about specific objects in an image and are passed to a Transformer decoder. This innovation led to the elimination of anchor boxes [
16] and non-maximum suppression techniques, which had been used extensively to detect varying-sized objects and filter candidate bounding boxes. Cheng et al. (2021) [
17] extended this query-based approach to segmentation with MaskFormer, reframing segmentation as a “mask classification” problem in which queries predict pairs of masks and class labels, unifying semantic and instance segmentation. To address MaskFormer’s computationally intensive cross-attention, Cheng et al. (2022) [
18] proposed “masked attention” within the decoder, defining the Masked-attention Mask Transformer (Mask2Former) model, which constrains each query to attend only to localized features within its predicted mask region, achieving faster convergence, higher performance and greater efficiency. The Mask2Former architecture has been a breakthrough for image segmentation, since it combines the masked attention mechanism with a single framework that solves instance, semantic and panoptic segmentation, detecting multi-scale objects with high accuracy.
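The core idea of masked attention can be illustrated with a minimal NumPy sketch (an illustration of the mechanism, not the actual Mask2Former implementation): attention logits at feature locations outside a query's predicted mask region are suppressed before the softmax, so each query aggregates information only from its own foreground.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(queries, keys, values, region_mask):
    """Cross-attention where each query may only attend to features
    inside its predicted foreground region (the masked attention idea).

    queries:     (Q, d) object queries
    keys/values: (N, d) flattened image features
    region_mask: (Q, N) boolean, True where query q's mask covers feature n
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # (Q, N) attention logits
    scores = np.where(region_mask, scores, -1e9)    # suppress out-of-mask locations
    return softmax(scores, axis=-1) @ values        # (Q, d) updated queries

# toy check: a query whose mask covers only the first feature copies that value
q = np.ones((1, 4)); k = np.eye(4); v = np.arange(4.0).reshape(4, 1)
mask = np.array([[True, False, False, False]])
out = masked_attention(q, k, v, mask)
```

In the full model this replaces the standard cross-attention in each decoder layer, with the region mask taken from the previous layer's mask predictions.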
Recently, Zhang and Liu (2025) [
19] proposed a Swin Transformer-based model, utilizing TokenEmbedder and Axial Transformer modules to significantly reduce the computational cost and the number of parameters by 46% compared to the base architecture, and applied it to detect a variety of plant diseases, such as leaf blight on corn and septoria leaf spot on tomato leaves. To segment plant and leaf instances, Darbyshire, Sklar and Parsons (2023) [
20] followed an adapted architecture that uses separate Transformer decoders and a single layer per feature level, instead of the three used in the original Mask2Former implementation, achieving a Panoptic Quality [
21] of 70.18% for crops and 66.91% for leaves in the panoptic segmentation task. Wei et al. (2024) [
22] introduced a large-scale growth monitoring technique for lettuce seedlings, integrating multidimensional collaborative attention into Mask2Former for precise instance segmentation and localization, reaching a mean Average Precision over Intersection over Union thresholds from 0.5 to 0.95 (mAP0.50:0.95) of 78.23%.
In related research, Christakakis et al. (2024) [
23] optimized the early detection of
Botrytis cinerea in cucumber plants using ViTs and “Cut-and-Paste” data augmentation to address dataset imbalance, achieving an overall accuracy of 92%. Building on these methodologies, Kapetas et al. (2024) [
24] established a framework for early B. cinerea detection in tomato crops using YOLOv8 segmentation and Transformer ensembles, achieving 79.41% accuracy. This work was later advanced by Kapetas et al. (2025) [
25], who transitioned to a YOLOv11-based architecture and incorporated five derived vegetation indices (CVI, GNDVI, NDVI, NPCI, and PSRI) to detect symptoms on pepper plants. This refined approach accomplished a significantly improved overall accuracy of 87.42% and successfully validated the model’s sensitivity against molecular RT-qPCR assays for fungal biomass estimation.
This study presents a tool for instance segmentation of potato leaves and identification of interveinal chlorosis and leaf curling, using deep learning methods. The training pipeline is built around a Mask2Former model, selected specifically for its ability to model the long-range dependencies inherent in abiotic stress symptoms, fine-tuned on a binary dataset to develop a strong segmentation foundation on potato leaves. Current methods in precision agriculture rely on definitive, discrete labels, forcing a critical trade-off when facing ambiguity: ambiguous samples are either discarded or incorrectly labeled, introducing label noise. To overcome this limitation, this study’s dataset was allowed to retain ambiguous samples, and a Partial Label Learning (PLL) [
26] approach is introduced, allowing for a candidate set of labels with a single true, latent label. This latent label is extracted using an EfficientNet-b1 [
27] classifier, trained on instance masks cropped from the dataset while exploring different disambiguation and mask-cropping methodologies. Finally, the initial Mask2Former model is transferred and fine-tuned on this refined, pseudo-labeled dataset, targeting the two abiotic stress symptoms and yielding precise instance masks while maximizing data utilization. Experimental evaluations demonstrate the efficacy of this framework, achieving an instance segmentation mAP of 89% and a label disambiguation accuracy of 95% on the validation set.
The structure of this paper is organized as follows.
Section 2, Materials and Methods, details the dataset acquisition and the proposed two-stage training pipeline, including the initial fine-tuning of the Mask2Former model and the implementation of the partial label learning (PLL) framework using an EfficientNet-b1 classifier to refine ambiguous annotations.
Section 3, Results, presents the experimental evaluation of the segmentation model across different backbone configurations and image resolutions, analyzing the impact of the proposed methodology on detection accuracy. Finally,
Section 4, Discussion, concludes the study by summarizing the key findings and discussing potential future applications for integration into smart agriculture systems.
2. Materials and Methods
Figure 1 presents a high-level flowchart that outlines key stages of the proposed methodology, which are discussed in detail in the following sections.
2.1. Dataset Collection
The collection of data was performed utilizing a standard commercial smartphone camera (manufacturer: Xiaomi, Beijing, China) with approximately 12 MP effective resolution to introduce device-agnostic variability into the dataset. This approach ensures the model’s generalization capabilities across different hardware typically available to farmers. All images were captured in standard ‘Photo’ mode with automatic exposure and white balance settings to mimic real-world scouting conditions. High Dynamic Range (HDR) and AI scene enhancement features were strictly disabled to preserve raw texture fidelity. Images were saved in RGB color mode, JPEG format, without applying any external post-processing or color grading during the data collection phase. This approach ensures the system’s ease of use and eliminates the need for specialized equipment or specific photographic know-how.
The capture target classes were healthy potato leaves, or potato leaves with symptoms of interveinal chlorosis or leaf curling. The data capture was performed by expert phytopathologists in fields located in Volos, Magnesia, Greece for healthy potato plants on fields, and in La Palma, Cartagena, Murcia, Spain for symptomatic potato plants and further healthy leaf captures.
Figure 2 exhibits one example capture for each of the target classes of this work. The data collection comprises a total of 222 images at a resolution of 3456 × 3456 pixels.
For the data capturing process, the aim of each capture was to depict one main leaf to which a specific known class would be assigned. The remaining leaves potentially depicted in the image were assigned one of the other known classes when characterization was possible, or were assigned to an unknown class when explicit class assignment from the image was not feasible.
2.2. Dataset Annotation and Split
To enhance the model’s segmentation capabilities, an initial fine-tuning phase was conducted on the collected images. The objective was to refine the model’s generalization of potato leaf morphology, ensuring precise instance masking and effective separation between foreground and background elements.
To achieve a robust representation of abiotic stress features, it was vital to annotate the collected images using four labels: Interveinal Chlorosis, Leaf Curling (Softwind), Healthy and Unknown. The annotation procedure was performed using Roboflow (
https://roboflow.com) [
28], resulting in 8203 leaf annotations. An annotated image sample can be viewed in
Figure 3.
Approximately 15% of these annotations were classified as “unknown” due to visual ambiguity, necessitating a partial label learning (PLL) classifier to disambiguate this class and generate high-confidence pseudo-labels. The dataset used for the classifier’s training was constructed by extracting instance-level masks from the multi-class symptoms dataset used for Mask2Former training. To explore the classifier’s sensitivity to background noise, annotation masks were cropped from the original images and processed using two different configurations: one with background pixels replaced by black padding and another with the original background pixels intact. The number of annotations in the training and validation splits, for each class, is presented in
Table 1.
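The two mask-cropping configurations described above can be sketched as follows; this is a minimal NumPy illustration, and the function and argument names are assumptions rather than the actual implementation.

```python
import numpy as np

def crop_instance(image, mask, black_padding=True):
    """Crop a leaf instance by its binary mask's bounding box.

    image: (H, W, 3) uint8 RGB image
    mask:  (H, W)    boolean instance mask
    If black_padding is True, background pixels inside the crop are zeroed
    (the 'black padded' classifier input variant); otherwise the original
    background pixels are kept intact.
    """
    ys, xs = np.where(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    crop = image[y0:y1, x0:x1].copy()
    if black_padding:
        crop[~mask[y0:y1, x0:x1]] = 0   # zero out non-leaf pixels
    return crop
```

Both variants were used to probe the classifier's sensitivity to background context, as reported in the results.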
The relatively small number of annotations made the application of standardized dataset augmentation techniques necessary. The augmentations applied consisted of standardized techniques: horizontal and vertical flips with a probability of 25%, and rotation limited to 10 degrees with a probability of 50%, replicating the different points of view from which an instance can be captured and teaching the model orientation and rotational invariance; and adjustments in brightness, contrast, saturation and hue, accounting for fluctuations in lighting throughout the day and teaching photometric invariance. A final augmentation, added to bridge the gap between the different image domains introduced by the multiple capture locations (i.e., Spain and Greece), was Fourier Domain Adaptation (FDA) [
29]. The augmentations applied and their corresponding probabilities are presented in
Table 2.
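FDA transfers global style (illumination, color cast) between domains by swapping the low-frequency amplitude spectrum of a source image with that of a target-domain image while keeping the source phase, and hence its content. A minimal single-channel NumPy sketch follows; the band width `beta` and per-channel application are illustrative assumptions, not the exact configuration used.

```python
import numpy as np

def fda_transfer(source, target, beta=0.05):
    """Fourier Domain Adaptation: replace the centred low-frequency band
    of the source amplitude spectrum with the target's, preserving the
    source phase so image content is retained.

    source, target: (H, W) float arrays (one channel; apply per channel)
    beta: fraction of the spectrum centre treated as 'low frequency'
    """
    fs = np.fft.fftshift(np.fft.fft2(source))
    ft = np.fft.fftshift(np.fft.fft2(target))
    amp_s, pha_s = np.abs(fs), np.angle(fs)
    amp_t = np.abs(ft)

    h, w = source.shape
    b = int(min(h, w) * beta)
    cy, cx = h // 2, w // 2
    # swap the centred low-frequency amplitude band
    amp_s[cy - b:cy + b + 1, cx - b:cx + b + 1] = \
        amp_t[cy - b:cy + b + 1, cx - b:cx + b + 1]

    mixed = amp_s * np.exp(1j * pha_s)
    return np.real(np.fft.ifft2(np.fft.ifftshift(mixed)))
```

In practice a library implementation of FDA can be applied with the probability listed in Table 2, using reference images from the other capture location.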
After 20% of the training was completed, a coarse dropout augmentation was introduced, in which black pixel “holes” are created on the instance mask to prevent overfitting and mimic real-world imperfections. This forces the model to learn the entire symptom pattern rather than relying on its single most obvious cue. The augmentations mentioned are applied solely to the training set, covering a wide range of leaf capturing scenarios and conditions.
Once the PLL classifier assigned pseudo-labels to the unknown instances (all of which were restricted to the training set during the disambiguation phase), the dataset was consolidated into three final classes and divided into training and validation subsets following an image-level grouping approach. To prevent data leakage during this split, an iterative stratification strategy was employed, testing multiple candidate splits and selecting one where each definitive class (Healthy, Chlorosis, and Leaf Curling) remained proportionally distributed between the training and validation sets, at approximately 75%/25%. This helps ensure that the distribution of instances remains balanced without breaking the image-level grouping. The resulting number of annotations for each class is shown in
Table 3.
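The split-selection procedure described above can be sketched as follows. This is a minimal stdlib illustration under stated assumptions (random group-level trials scored by class-proportion deviation; the trial count and scoring criterion are illustrative, not the exact implementation).

```python
import random
from collections import Counter

def grouped_stratified_split(image_labels, train_frac=0.75, trials=200, seed=0):
    """Pick an image-level split whose per-class annotation proportions
    best match `train_frac`, without splitting one image's annotations
    across subsets (preserving the image-level grouping).

    image_labels: dict mapping image_id -> list of annotation class labels
    Returns (train_ids, val_ids).
    """
    rng = random.Random(seed)
    ids = list(image_labels)
    totals = Counter(l for labels in image_labels.values() for l in labels)
    best, best_dev = None, float("inf")
    for _ in range(trials):
        rng.shuffle(ids)
        cut = int(len(ids) * train_frac)
        train = ids[:cut]
        counts = Counter(l for i in train for l in image_labels[i])
        # worst-case deviation of a class's train share from the target fraction
        dev = max(abs(counts.get(c, 0) / totals[c] - train_frac) for c in totals)
        if dev < best_dev:
            best, best_dev = (list(train), ids[cut:]), dev
    return best
```

Keeping the grouping at image level ensures that overlapping leaves from the same photograph never end up on both sides of the split.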
2.3. Evaluation Metrics
The performance of the Partial Label Learning classifier was determined using the model’s predictive accuracy and F1-Score, which are standard classification benchmarks. Accuracy was calculated as the ratio of instances for which the predicted label matched the single true ground-truth label. To account for potential class imbalances within the dataset, the F1-Score was employed as the harmonic mean of precision and recall, providing detailed insight into the classifier’s disambiguation capabilities.
To assess the performance of the Mask2Former models employed for the leaf segmentation task, a comprehensive set of evaluation metrics was utilized. The key metrics selected were precision, recall, F1-Score and the mean Average Precision (mAP) at an Intersection over Union (IoU) for different threshold values. Precision indicates the fraction of positive predictions that are true leaf instances, recall measures the fraction of actual leaf instances correctly identified, while F1-Score balances them to a single metric. The mAP is a common benchmark for object detection tasks as it evaluates the accuracy of the predicted masks by counting a prediction as correct only if its overlap with the ground truth mask exceeds the threshold value applied. By analyzing these metrics alongside the parameter count for each model configuration, this study evaluates both the segmentation accuracy and the computational efficiency of the models.
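For concreteness, the mask-level IoU matching criterion and the derived precision, recall and F1-Score can be computed as in this minimal sketch (an illustration of the standard definitions, not the evaluation code used):

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection over Union between two boolean instance masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

def precision_recall_f1(tp, fp, fn):
    """Metrics from instance counts matched at a given IoU threshold:
    tp = predictions with IoU above the threshold, fp = unmatched
    predictions, fn = unmatched ground-truth instances."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

mAP then averages the precision over recall levels per class and threshold, and over the chosen IoU thresholds.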
2.4. Partial Label Learning
Partial label learning (PLL) [
26] is a weakly supervised multi-class classification framework in which each instance is associated with a set of candidate labels, only one of which is the ground truth. This framework is particularly advantageous for real-world problems characterized by inter-class ambiguity: in this context, leaves with mild symptoms can be difficult to categorize definitively as either “diseased” or “healthy” during manual annotation. This approach allows the model to learn features from unambiguous instances and subsequently disambiguate the partially labeled data, mitigating the risk of annotation noise. In this work, class labels were converted to a multi-hot encoding, following the format shown in
Table 4.
Within the PLL framework, instances labeled as unknown are assigned either the complete candidate set or a subset of it (Equation (1)).
This formulation states that the true label exists within this subset but remains latent. The core challenge is to train the model to disambiguate the candidate set by identifying the single, true latent label for each ‘unknown’ sample.
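Since Equation (1) is not reproduced here, the candidate-set formulation it describes can be written in conventional PLL notation as follows (a reconstruction consistent with the surrounding description, not the equation as typeset in the original):

```latex
% For each instance x_i, the latent true label y_i lies in a candidate
% subset S_i of the full label space \mathcal{Y}.
y_i \in S_i \subseteq \mathcal{Y}, \qquad
\mathcal{Y} = \{\text{Healthy},\ \text{Chlorosis},\ \text{Softwind}\}
```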
Classifier and Loss for Partial Labeling
The reliability of the label disambiguation process is pivotal for the downstream segmentation task. In order to balance between parameter efficiency and feature extraction capabilities, the EfficientNet-b1 [
27] architecture, pre-trained on ImageNet [
30] is selected. This allows the model to identify subtle symptomatic differences in leaf texture while minimizing computational cost.
The custom training loss had to take into account that the ground truth is latent within the candidate set. Two different loss objectives were implemented: a Progressive Identification (PRODEN) [
31] and a Naive Strategy [
26]. The fundamental difference between these methods lies in their target distribution: while the Naive strategy computes a fixed uniform probability over the candidate set (treating all candidates equally), PRODEN dynamically weights the candidate labels based on the model’s predicted confidence to identify the true latent label.
The goal here is to create a refined target distribution that is restricted to the candidate set and weighted by the model’s current belief. To achieve this, the logits produced by the classifier are filtered, setting non-candidate logits to negative infinity (−∞) while keeping the rest at their raw values. The soft pseudo-target probability qsoft,i(j) is calculated by applying the softmax function to these filtered logits.
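The filtered-softmax construction of the soft pseudo-targets can be sketched as follows. This is a NumPy illustration of the described computation (in training, the targets would be detached from the gradient and updated progressively), not the actual training code.

```python
import numpy as np

def proden_targets(logits, candidate_mask):
    """PRODEN-style soft pseudo-targets: restrict the softmax to the
    candidate set, weighting candidates by the model's current belief.

    logits:         (B, C) raw classifier outputs
    candidate_mask: (B, C) 1 where the class is in the candidate set, else 0
    """
    filtered = np.where(candidate_mask.astype(bool), logits, -np.inf)
    filtered = filtered - filtered.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(filtered)
    return e / e.sum(axis=1, keepdims=True)

def log_softmax(x):
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def proden_loss(logits, candidate_mask):
    """Cross-entropy against the soft targets; the Naive strategy would
    instead use fixed uniform weights over the candidate set."""
    q = proden_targets(logits, candidate_mask)
    return -(q * log_softmax(logits)).sum(axis=1).mean()
```

Non-candidate classes receive exactly zero target mass, so the loss never pushes probability toward labels outside the candidate set.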
The model’s final prediction was made by applying a simple Test-Time Augmentation (TTA) [
32], which means it receives three different [
33] 224 × 224 patches of the image: the original leaf image and two copies flipped horizontally and vertically. The model generates a prediction for each patch, and the three predictions are then averaged to produce the final output. Test-Time Augmentation improves the predictor’s robustness and reduces variance, since the model may be slightly more sensitive to features in one part of the image. The results of training the EfficientNet-b1 classifier under the different settings applied are presented in
Table 5.
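The described TTA scheme reduces, in essence, to averaging class probabilities over three views; a minimal sketch follows, where the `model` callable is a stand-in for the trained EfficientNet-b1 classifier.

```python
import numpy as np

def predict_with_tta(model, image):
    """Average class probabilities over three views of a 224x224 leaf
    crop: the original, a horizontal flip, and a vertical flip.

    model: any callable mapping an (H, W, 3) array to a probability vector
    """
    views = [image, image[:, ::-1], image[::-1, :]]   # original, h-flip, v-flip
    probs = np.stack([model(v) for v in views])       # (3, num_classes)
    return probs.mean(axis=0)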
Different loss strategies and black padding of the extracted masks do not appear to affect the classification metrics dramatically. As shown in
Figure 4, PRODEN strategy with black padded masks results in the most balanced performance across all classes while also breaking the 0.94 accuracy barrier for the healthy class.
Naive strategy with black padded masks delivers a high score for the chlorosis class, which is the minority class as shown in
Table 5, in contrast with the softwind class, which scores slightly lower compared to the other settings. Directly cropping the instance masks and feeding them to the model, regardless of loss strategy, performs similarly on average, with a slight improvement in the chlorosis label score when PRODEN is selected, while the scores of the remaining classes rise marginally under the Naive strategy.
Based on the showcased analysis, the model used for disambiguation was the EfficientNet-b1 model that was trained with PRODEN loss strategy and was fed with black padded images. The resulting labels were updated in the initial dataset, creating a new multi-class dataset that contained only instances of known classes. An example of this procedure is shown in
Figure 5.
A small subset of the disambiguation results was also examined by an expert phytopathologist, who concluded that up to 20% of the samples could possibly be incorrectly classified, in contrast to the 95% accuracy metric reported by the model.
2.5. Model Fine-Tuning in the Multi-Class Scenario
The Mask2Former model fine-tuned in the leaf segmentation setting is transferred to solve the three-class instance segmentation task. All instances are labeled explicitly, either manually or pseudo-labeled by the EfficientNet-b1 predictor.
To improve the model’s robustness in this scenario, images were augmented using either the mosaic method or standard augmentation techniques. Mosaic combines four images into a single 2 × 2 grid with a probability of 25%, and the standard augmentations include spatial, geometric, color and noise/blur transforms, with application probabilities as listed in
Table 2. This ensures the model is trained on a variety of different views of the images, enhancing generalization and feature-extraction accuracy. The instance segmentation Mask2Former model predicts an instance mask and a class for each object, and since it was already fine-tuned to detect potato leaves sufficiently, it is essential to apply different learning rates to different parts of the network. To improve the model’s ability to distinguish between classes without degrading the high-quality features already learned, parameters of the class predictor layer use a higher learning rate than those responsible for mask prediction. The key hyperparameters used for this setting are presented in
Table 6.
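The differential learning-rate setup can be sketched as optimizer parameter groups, as commonly done with `model.named_parameters()` in PyTorch; the `class_predictor` substring, the base rate and the multiplier here are illustrative assumptions, not the exact values from Table 6.

```python
def build_param_groups(named_params, base_lr=1e-5, head_lr_mult=10.0):
    """Assign a higher learning rate to the class-predictor head than to
    the rest of the fine-tuned network, so class discrimination is
    learned without disturbing the already-strong mask features.

    named_params: iterable of (name, param) pairs
    Returns optimizer-style parameter groups.
    """
    head, backbone = [], []
    for name, p in named_params:
        (head if "class_predictor" in name else backbone).append(p)
    return [
        {"params": backbone, "lr": base_lr},                 # mask/feature layers
        {"params": head, "lr": base_lr * head_lr_mult},      # class predictor
    ]
```

The returned groups can be passed directly to an optimizer constructor, which applies each group's learning rate independently.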
4. Discussion
The experimental results indicate that input resolution plays a critical role in the detection of subtle abiotic stress symptoms. While the 600 × 600 resolution was insufficient for capturing fine textural details, the 800 × 800 resolution proved to be the optimal trade-off, achieving a balance between feature extraction and computational efficiency. Interestingly, increasing the resolution to 1024 × 1024 did not yield further improvements and instead introduced higher computational demands, likely due to the model becoming more sensitive to background noise in the open-field setting.
Initially fine-tuning the model on a binary segmentation task yielded a substantial improvement in the evaluation metrics. Segmenting potato leaves in open-field environments is inherently difficult due to complex backgrounds and overlapping foliage. Decomposing the task into simpler individual stages allowed the model to optimize its mask-prediction parameters without the added complexity of class-label noise, which highlights the importance of establishing a strong segmentation foundation before addressing multi-class symptom categorization.
A notable disparity in performance was observed across the three target classes. The Chlorosis class performed exceptionally well (AP50 = 0.939), most probably due to the sharp contrast of yellowing leaf tissue against the green leaf color that provides a clear signature for the Masked Attention mechanism. The Softwind class also showed strong results (AP50 = 0.901), as the structural changes, specifically leaf curling, create unique geometric boundaries that the Transformer-based queries can effectively isolate. However, the Healthy class recorded lower metrics (AP50 = 0.827). This can be attributed to two main factors: inter-class ambiguity and the annotation logic. Early-stage stress symptoms may mimic healthy leaf morphology, since the distinct characteristics are not fully visible, leading to inherent classification difficulty. In addition, as observed in the visual assessment and the confusion matrix of the experiment, many “false positives” were correctly segmented healthy leaves that were simply not part of the manual annotation split because they were in the background or out of focus. This suggests the model’s true precision for healthy leaves is higher than the numerical evaluation indicates.
The high-level performance of the proposed framework can be attributed to the architectural advantages of Transformer models. Unlike CNN-based methods that focus on local feature extraction, the attention mechanism captures global context and long-range dependencies, which proved vital for distinguishing abiotic stress symptoms that extend across the leaf surface rather than being confined to sharp local edges. In addition, the query-based, anchor-free methodology of Mask2Former enables more flexible segmentation of highly irregular shapes that standard anchor-based models fail to enclose adequately. However, these advantages come at a significantly higher computational cost and memory footprint compared to CNN-based models. While the exclusive use of RGB imagery ensures the system’s ease of use, accessibility and economic viability for farmers, it inherently limits the available spectral information, which could otherwise lead to even better performance.
5. Conclusions
This study examined the detection of healthy potato leaves and of two abiotic stress factors, leaf curling and interveinal chlorosis, using deep learning and machine learning techniques in an open-field setting. The instance segmentation task was successfully addressed by employing a Masked-attention Mask Transformer model at different input image sizes, of which the 800 × 800 input yielded the best metrics, achieving a mAP50 of 89%. To reach this performance, a base instance segmentation model was first trained to learn potato leaf features and then transferred to the multi-class scenario of the final model. In addition, it was vital to disambiguate leaf instances that could not be explicitly assigned to a class; for that purpose, an EfficientNet-b1 classifier was trained, scoring an accuracy of 95%.
This 95% accuracy of the PLL model, although very significant, is not clearly reflected in the on-image validation by the expert phytopathologist on a small subset of the dataset, who noted a possible false positive rate of up to 20% for the PLL-classified samples. This creates a point for future examination, in which further data collection is performed and the PLL model’s results are validated more extensively. Furthermore, future work could involve detecting additional abiotic stress factors to further extend the Mask2Former model’s capabilities. Another point to be addressed during real-world operation of this system would be the inclusion of uncertainty estimation techniques and the integration of corrective mechanisms for high-uncertainty samples, such as expert phytopathologist validation, ensemble-based uncertainty modeling, or the incorporation of additional modalities (e.g., multispectral imagery) to improve robustness.
The use case of this framework is the real-time, automated monitoring of potato crops for the early and precise detection of abiotic stress symptoms related to herbicide and soft wind damage. The proposed system is designed for seamless integration into smart agriculture solutions, overcoming the limitations of traditional monitoring methods that include manual labor and subjective assessment. Deployment settings may include ground level robotics, for high-resolution, close-up detection and targeted treatment, or unmanned aerial vehicles, for rapid, large-scale field surveying and monitoring across vast areas. By providing accurate, early symptom detection, this implementation supports more sustainable and economically viable practices for potato crop management.