A Multiple Instance Learning Approach to Study Leaf Wilt in Soybean Plants

Banerjee, Sanjana; Ramos, Paula; Reberg-Horton, Chris; Mirsky, Steven; Locke, Anna; Lobaton, Edgar

doi:10.3390/agriculture15060614

Open AccessArticle

A Multiple Instance Learning Approach to Study Leaf Wilt in Soybean Plants

by

Sanjana Banerjee

^1,*,

Paula Ramos

²

,

Chris Reberg-Horton

²

,

Steven Mirsky

³,

Anna Locke

⁴

and

Edgar Lobaton

¹

Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC 27606, USA

²

Department of Crop and Soil Sciences, North Carolina State University, Raleigh, NC 27695, USA

³

US Department of Agriculture (USDA), Agricultural Research Service, Beltsville, MD 20705, USA

⁴

US Department of Agriculture (USDA), Agricultural Research Service, Raleigh, NC 27606, USA

^*

Author to whom correspondence should be addressed.

Agriculture 2025, 15(6), 614; https://doi.org/10.3390/agriculture15060614

Submission received: 5 February 2025 / Revised: 3 March 2025 / Accepted: 7 March 2025 / Published: 13 March 2025

(This article belongs to the Section Artificial Intelligence and Digital Agriculture)

Download

Browse Figures

Review Reports Versions Notes

Abstract

Recent years have seen significant technological advancements in precision farming and plant phenotyping. Remote sensing along with deep learning (DL) techniques can increase phenotyping efficiency and help on-farm decision making with rapid stress detection. In this work, we use these techniques to evaluate drought stress in soybean plants, a crop whose yield is significantly affected by water availability. Images were taken from a high vantage in the field at various times throughout the day. Each image is given a wilting score ranging from 0 to 4 by expert scorers. We implement a DL method called multiple instance learning (MIL) to perform wilt classification as well as generate heat maps that highlight wilt levels in specific regions of the image. Given the significant overlap between adjacent classes in our dataset, we were able to achieve an overall classification accuracy of 64% and a one-off accuracy of 96% on our holdout test set. Our model outperformed DenseNet121 in most metrics, and provided comparable performance to a vision transformer (ViT) while having fewer parameters overall, less complexity (useful for edge implementations), and some interpretability. Furthermore, we were able to show that our model outperformed expert human annotators by predicting more consistent and accurate wilt levels when considering single-image re-annotation. The results show that our proposed methodology can be a useful approach in detecting drought stress in soybean fields to facilitate efficient crop management and aid selection of drought-resilient varieties.

Keywords:

deep learning; computer vision; precision agriculture; remote sensing; soybean cultivation; drought stress; multi-class classification

1. Introduction

Drought poses an enormous threat to crop production. Altered precipitation patterns are expected to increase the frequency and severity of drought in coming years [1]. Soybean is the most widely grown legume crop on Earth, and water availability is a major determinant of its yield [2]. Improving soybean’s drought resilience is important for global food security [3]. Leaf wilting is a visually obvious symptom of drought stress in soybeans, resulting from reduced turgor pressure as leaf water loss via evapotranspiration exceeds water resupply into the leaf. Despite multiple methodologies for quantifying drought stress with instrumentation, visual scoring of leaf wilt remains the dominant method for plant breeders and agronomists due to the speed with which it can be conducted. Soybean researchers often use visual scores of leaf wilting to assess a particular genotype’s potential drought tolerance and select the best lines to advance towards a new variety. The method has significant disadvantages as well, including significant variations between observers and the inability of human observers to evaluate all plots or genotypes at the same moment in time [4,5,6]. These scores are also slow and labor-intensive to collect, normally assigned by walking through the crop fields and manually inspecting the level of wilting on the leaves, and assigning a value along a predetermined scale to indicate the degree of stress [7,8]. In the field of precision agriculture, machine learning (ML) and DL techniques have been widely used to automate many tedious and repetitive tasks on the field [9,10]. Low-cost imaging systems equipped with cloud access or local compute capabilities can be used to evaluate drought stress under real-world field conditions [11].

In our study, we have explored a DL technique called multiple instance learning (MIL) [12,13,14] to classify leaf wilting in soybean plants. MIL is a supervised learning technique that is used to find key instances among a collection of instances, also termed as a bag. Other than modeling ambiguity in datasets, it also provides a way to leverage weak labels on images, where a single label is provided for the entire bag instead of individual instances [15,16]. The MIL approach treats the input dataset as a collection of bags, each bag consisting of a collection of instances in the form of patches. A bag is assigned a positive label if it contains at least one instance belonging to the relevant class. If no such instances appear, the bag is labeled negative. Apart from predicting a bag/classification label, such a model is also able to highlight regions of interest (ROIs) in the image that were responsible for triggering a particular label. Dietrich et al. [13] first introduced the term MIL in an attempt to solve the drug activation problem. They used MIL to find specific molecular conformations among all observed low-energy shapes that contributed towards making a drug active. Maron and Ratan [14] used MIL in image classification of natural scenes. An image was considered a bag and each bag had subimages termed as instances. In the field of agriculture, MIL has been used for automatic wheat disease diagnosis [17], pest classification in citrus plants [18], and automatic counting of cotton flower from aerial imagery [19], among others. MIL also finds use in a wide variety of applications, from object tracking [20,21] to medical data analysis [22,23,24,25]. Ilse, Tomzack, and Welling [26] formulated the MIL problem as a Bernoulli distribution of the bag label. They provide a three-step procedure for modeling the bag probability to classify medical images into cancerous and non-cancerous categories. Additionally, they introduce the attention mechanism to their model to find key instances that highlight malignant cells in the images.

Our study focuses on the detection of drought stress in soybean plants by analyzing leaf wilt from images of the crop. This image-based wilt detection has the potential to be a valuable tool for characterizing the drought tolerance in different soybean varieties, a crucial factor given the strong impact of water availability on crop yield. We have implemented a deep MIL multi-class classification model for this task. The following are some of the key contributions made in our study:

We introduce a labeled dataset of soybean field images with each image assigned a value between 0 and 1 by expert annotators.
We introduce a deep MIL multi-class classification model by converting the MIL approach, generally well suited for binary classification, to handle multiple classes.
Along with providing classification scores, we explore the use of our model to provide interpretability by generating heat maps that highlight ROIs in the images with the intent to allow researchers to pinpoint relevant areas of drought stress in the field.
We demonstrate that our MIL model achieves classification accuracy on par with state-of-the-art classification models like DenseNet 121 [27] and ViT [28]. This model is also able to achieve high one-off accuracy, an important performance metric for our dataset as we will see later. Unlike larger models like ViT, the smaller architecture of our MIL model also ensures faster prediction time which is essential for real world deployment on IoT/edge devices.
Lastly, we are able to show the benefit of utilizing our deep learning model over expert human scorers, and that relying solely on expert decisions is often not the most accurate, efficient, effective, and reproducible approach.

The rest of the paper is organized as follows. In Section 2, we introduce our dataset of soybean field images, each assigned with a wilt label by expert annotators. In Section 3, we provide details on our MIL model along with it is mathematical formulation and compare model architecture and training parameters with our baseline models. We also provide our experimental details in this section. In Section 4, we report on classification performance of both our binary and multi-class MIL models and compare with the baseline models of DenseNet 121 and ViT. We also include results of our study of human variability in annotations and effect of varying patch size and patch overlap ratio on the different performance metrics such as accuracy, one-off accuracy, root mean square error (RMSE), and mean absolute error (MAE).

2. Dataset

The dataset produced in collaboration with the Crop and Soil Science Department of North Carolina State University and the United States Department of Agricultural Research Service (USDA-ARS) comprises 1788 images. Images were taken in 16 plots within a field at Sandhills Research Station in Jackson Springs, NC, USA. Cameras were mounted 1.52 m above the soybean canopy at an angle of 45 degrees and were programmed to capture an image of the plot every 15 min from sunrise to sunset over a period of 8 weeks in August and September of 2020. Half of the plots were irrigated throughout this period, and half of the plots were rain-fed only from 8 August through 15 September. The images represent soybean plants having different levels of wilting, each image being assigned a value between 0 and 5 by expert annotators, where 0 represents leaves with no wilting, 1 represents leaflets folding inward at secondary pulvinus with no turgor loss in leaflets or petioles, 2 represents slight leaflet or petiole turgor loss in upper canopy, 3 represents moderate turgor loss in upper canopy, 4 represents severe turgor loss throughout upper canopy and some turgor loss in lower canopy, and 5 represents severe turgor loss throughout canopy. For ease of analysis and due to the presence of fewer images in class 5, 4 and 5 were combined into a single class 4. Figure 1 shows some sample images from this dataset.

Table 1 shows the distribution of different classes and plot numbers in the dataset. Most plot numbers exhibited an unequal distribution of instances, with some plots missing instances in certain classes. The plot with the most balanced class distribution was 11–46, which was selected as a holdout test set for our experiments.

3. Materials and Methods

This section describes the models used for our prediction task and the computational experiments that were performed. Even though the inference task could be set up as a classification problem, we consider it more appropriate to pose it as a regression task since the ordering of the class labels does matter. That is, given a true class label of 0, it is worse to label something as 3 than it is to label it as 1. This motivates our use of metrics such as standard accuracy and one-off accuracy (i.e., predicted labels are considered correct if they are within 1 value of the true label).

3.1. Baseline Model

Different methods were explored in our attempt to create a baseline for the classification. Deep learning methods outperformed methods that used handcrafted features with traditional machine learning. As one of our baselines we included the popular image classification model, DenseNet121 [27]. It is a type of deep convolutional neural network that belongs to the DenseNet family of models. Each layer is connected to every other layer in a feed-forward manner. This means that the output of each layer is fed into all subsequent layers. DenseNet-121 requires fewer parameters compared to traditional networks like VGG or ResNet, as its dense connectivity promotes feature reuse. This leads to better performance, faster convergence, and reduced risk of vanishing gradients. As our second baseline we chose a ViT, the current state of the art in image classification tasks, which is also equipped with a self-attention mechanism [28]. A ViT is a transformer like architecture that handles vision processing tasks. It divides an input image into smaller patches (e.g., 16 × 16 or 32 × 32 pixels) and treats each patch as a “token”. Each patch is then flattened into a vector and passed through the transformer layers, where the self-attention mechanism learns relationships between patches, regardless of their spatial locations. Positional embeddings are added to the input vectors, providing information about each patch’s position in the image. Vision Transformers use the same transformer encoder architecture as in NLP, with multiple layers of self-attention and feed-forward networks to capture global context and long-range dependencies. When trained on large datasets like ImageNet [29] or JFT-300M [30], ViTs can outperform CNNs, achieving state-of-the-art performance in various vision tasks. A comparison of the model architectures and performance is presented in Section 3.3.

3.2. Multiple Instance Learning (MIL) Model

Drawing inspiration from [26], we modified their binary classification model to a multi-class classification approach. Our MIL model is able to classify images taken from soybean fields into five different wilting levels as well as find ROIs in the images in the form of heat maps. Previously reported work on the same dataset achieved a classification accuracy of 88% using DenseNet 121 [11]. In order to mimic real world scenarios, we have used holdout sets to test our model on data entirely unseen during training. We show that, apart from achieving comparable accuracy to DenseNet 121 on the holdout set, our MIL model can also introduce interpretability and help researchers identify relevant areas of drought stress in the images.

In their work, Ilse, Tomczak, and Welling [26] formulate the bag probability as a Bernoulli distribution, meaning their model can only classify images into two categories. The MIL approach in general is well suited for 0/1, yes or no solutions, where the problem is looking for a presence or absence of a category, for example if malignant cells are present or absent in an image and where such instances of malignancy occur. One way of making such a model suitable for multi-class classification is to convert it to a regression problem instead of classification. To implement such a model, we removed the concept of a positive and a negative bag and instead considered bags belonging to five separate classes corresponding to the five levels of wilting.

Ilse, Tomczak, and Welling proposed a bag of instances as input to their model. Proceeding along the same line, we represent an image X as a bag of instances given by

X = {x_{k}}_{k = 1}^{K}

, where

{x_{k}}_{k = 1}^{K}

is a collection of K patches all having the same dimensions. It is assumed that individual patch labels

{y_{k}}_{k = 1}^{K}

exists but are unknown or inaccessible during training. Instead, we assign a label Y to the entire bag/image, with

Y \in [0, 4] .

These assigned labels are discrete in nature, similar to labels in classification problems. However, we assume that these labels are derived from a continuous range in order to formulate our regression-based model.

Following the general three-step strategy in [26] for classifying a bag of instances in a binary classification problem, we modify it to a multi-class problem as follows:

A transformation function $f_{ψ}$ converts each input instance to a low dimensional embedding $h_{k} = f_{ψ} (x_{k}) \in R^{M}$ , where M is the dimension of the embedding and $f_{ψ}$ is a neural network made up of a combination of convolutional and fully connected layers with parameters $ψ$ .
A symmetric permutation invariant function $σ$ is applied to the transformed instances $H = {h_{k}}_{k = 1}^{K}$ to generate an aggregated score/bag representation z. This step is called MIL pooling. Three types of popular pooling methods were explored:

$MIL Mean Pooling : z = σ (H) = \frac{1}{k} \sum_{k = 1}^{K} h_{k},$

(1)

$MIL Max Pooling : z = σ (H) = max_{k = 1, \dots, K} h_{k},$

(2)

and

$MIL Attention Pooling : z = σ (H) = \sum_{k = 1}^{K} α_{k} \cdot h_{k} where α_{k} = \frac{e x p {w^{⊤} t a n h (V h_{k})}}{\sum_{j = 1}^{K} {w^{⊤} t a n h (V h_{j})}}$

(3)

with $w \in R^{L}$ and $V \in R^{L \times M}$ .
A function $g_{ϕ}$ transforms the aggregated instance scores z to generate a bag score $\hat{Y} = g_{ϕ} (z)$ . $g_{ϕ}$ is a neural network with parameters $ϕ$ .

The training is performed by minimizing the mean square error (MSE) loss between the generated bag scores and the corresponding ground truth values. That is,

L = \frac{1}{N} \sum_{i = 1}^{N} {(Y_{i} - {\hat{Y}}_{i})}^{2} .

(4)

The authors in [26] made use of the attention pooling to propose a novel Deep-Attention-based MIL. However, in our experiments, the more standard mean and max pooling outperformed their attention-based counterparts. This may be due to the small amount of data available for training. Our experiments make use of an instance-based MIL for which

f_{ψ}

is a neural network that returns a scalar value in the range

[0, 4]

, we use either mean or max pooling, and

g_{ϕ}

is set to be the identify function. The model architecture of our instance-based MIL is shown in Figure 2.

An advantage of using a multiple instance learning model over direct regression models is its ability to generate attention maps. The instance score

h_{k}

that each patch receives is a measure of its contribution in the classification of the entire image. The attention maps are generated by multiplying the patches with their corresponding scores and stitching them back together to generate the whole image,

I = \sum h_{k} \cdot x_{k}

. The resulting attention maps I thus highlight regions in the image that belong to a particular class and segments out all other areas. Example attention maps are shown in Section 4.1 for our binary classification approach. For our multi-class classification approach, we have included heat maps instead of attention maps. The heat maps were generated by extracting the attention scores from the model, performing bilinear interpolation on them, and then overlaying the resulting heat maps on the original images. These heat maps can be used to study areas in the image that the model considers important in contributing to the predicted wilting level.

3.3. Implementation Details

In Table 2, we show a comparison of our MIL multi-class classification model against the baseline models DenseNet 121 and a ViT. Here, K stands for number of patches generated per image having dimensions

H \times W

. Our images are of size

480 \times 640

. Patch size for the ViT and the MIL model are

p \times p

and

m \times n

, respectively. In our experiments, p was fixed at 16, which is consistent with the patch size selected in prior studies with ViTs, while we have varied patch size

m \times n

for the MIL model to study its effect on classification performance.

We perform hyperparameter tuning to obtain optimal values for all our models. Training was performed over 100 epochs and data augmentation was performed on the train set using transformations like rotate, horizontal flip, sheer, zoom, and brightness adjustment. This was primarily performed to handle a small dataset and the presence of class imbalance in the dataset. For our MIL model, we used an ADAM optimizer [32], learning rate (LR)

= 0.0001

,

β_{1} = 0.9

,

β_{2} = 0.999

, and weight decay

= 0.0005

. Following the protocol in [26], we select a batch size = 1 corresponding to 1 bag and 10-fold cross validation for training. The DenseNet 121 was pre-trained on ImageNet 1k and finetuned on our dataset. Images were downsized to

224 \times 224

, which is the size of the input for the pre-trained model used. Stochastic gradient descent (SGD) with Nesterov momentum [33] was used for training with an exponential decay LR scheduler with LR and weight decay

= 0.00001

, LR decay steps = 50,000, momentum

= 0.9

, and batch size

= 32

. Similarly for the ViT, pre-training was done on ImageNet 21k and then finetuned on our dataset. An ADAM optimizer with LR

= 0.00005

was used with batch size of 32. All experiments were run on an Ubuntu workstation with an NVIDIA GeForce GTX 1080 Ti GPU.

Before being fed to the model, images were divided into patches of the same dimensions. Since the patches (and not the whole image) are fed to the model as input, the size and the number of patches in each bag is expected to affect performance of the model. Thus, a batch had one image in the form of K patches. The value of K varied based on the patch size and overlap ratio. We considered patch sizes

m \times n

equal to

40 \times 40

,

40 \times 80

,

80 \times 80

,

80 \times 120

,

120 \times 120

, and

120 \times 160

; and overlap ratios of 0, 0.25, and 0.4. These values were varied to study their impact on performance.

Images belonging to plot number 11–46 were set aside as our test dataset. We have used confusion matrices and different metrics, namely accuracy, one-off accuracy, precision, F1 score, RMSE, and MAE to quantify the performance of our models. The one-off accuracy is an important metric in our analysis, as it demonstrated the difficulty of classifying the dataset into each individual class separately but showed that most incorrect classifications by the model occurred between adjacent classes. When reporting performance, we executed multiple runs of our experiments and reported the means and variances for all our metrics.

4. Results

We first explored the binary classification approach proposed in [26] in Section 4.1. Since this approach is suitable only for binary classification, we formulated our problem accordingly. One class or a combination of two classes were selected to represent a positive or a negative bag. Test results were reported on two MIL models and DenseNet121. Multi-class performance is reported in Section 4.2. We further analyzed the effect of patch size and overlap ratio on the MIL approach in Section 4.3.

Keeping in mind the complexity present in our dataset, we also wanted to quantify the amount of human variability present in labeling the images. This analysis gave us an insight into the difficulty in consistently determining a correct wilting level even for expert annotators. Images belonging to the holdout test set 11–46 were selected and sent to the experts for re-annotation with their IDs changed. Confusion matrices were computed by considering the original set of annotations as ground truth and the new set as predictions. We compared the results of the re-annotation to the performance of our MIL model in Section 4.4.

4.1. MIL Binary Classification

Table 3 shows the results of binary classification with images from a subset of classes representing positive instances and images from other classes denoting the negative instances. The accuracy obtained is an indicator of the difficulty in classifying the different bag combinations. Accuracy was highest with class 4 as positive and class 0 as negative bags or vice versa. The accuracy showed a considerable decrease when adjacent classes were treated as positive and negative bags or even when a combination of classes were considered as a single bag. As expected, this showed that images belonging to adjacent wilting levels were more difficult to classify. The instance-based MIL with mean pooling outperformed the max pooling model, the attention-based MIL pooling model [26], and showed results comparable to the baseline.

Some examples of attention maps of the soybean images generated by the model are shown in Figure 3. The attention maps indicate the ROIs in the image that were responsible for the classification by the model. Regions in the image that contribute to a particular class get highlighted while those that did not make any contributions get a low impact score by the model, and hence, get blacked out. The attention maps are a good way of interpreting the decisions made by the model. They also segment out any areas in the field that are not relevant for the classification, including regions that do not contain any soybean plant (e.g., patches or columns of field appearing in between the plantations). The model also places less importance on patches that appear out of perspective in the images; for example, a leaf may appear too small or too large inside a patch.

4.2. MIL Multi-Class Classification

Confusion matrices (Figure 4) were generated to study the performance of our MIL model on multi-class classification and compare them to the baseline models of DenseNet and ViT. Table 4 shows a detailed result on different performance metrics for each of the three models. For our MIL model, performances reported correspond to a patch size of 80 × 120. We report on accuracy, one-off accuracy, precision, and F1 score per class as well as with all classes combined. We also report on combined RMSE, MAE, and speed of predictions made by each model. DenseNet outperformed the other two models in terms of overall accuracy by a small margin. Overall accuracy of all three models was limited due to the complexity of the dataset, particularly the similarity between adjacent classes. However, both our MIL model as well as the ViT outperformed the DenseNet in one-off accuracy, which is an important metric in our analysis. All three models performed significantly better in one-off accuracy compared to classification accuracy. This gave us an insight into the difficulty in classifying adjacent wilting rates but the relative ease in classifying wilting levels separated by more than one class. In terms of classification performance per class, it is interesting to note the significant difference in results for each model. Our MIL model outperforms the DenseNet in four out of five classes, and the ViT in three out of five classes. However, because of it is significant low classification accuracy for class 0, it achieves the lowest overall classification accuracy out of the three models. A considerable difference in architecture of the three models could lead to differences in learnt features. This coupled with the complexity present in the dataset could explain the wide class wise variability in classification performance. Our MIL model, despite having a smaller architecture than the baselines, was second fastest in prediction accuracy speed. The DenseNet 121 had the smallest number of trainable parameters because of it is small input size (224 × 224) and no input patches used for training. As a result, it achieved the highest prediction speed (0.56 s per image). Our MIL model, which was trained on patches of images of size 80 × 120 and a patch overlap of 0.25 (42 patches per image), achieved a slightly slower prediction speed of 0.61 s. The ViT model, which had the most architectural complexity and patches of size 16 × 16, had the slowest prediction speed.

Figure 5 and Table 5 show results of multi-class classification when a subset of images is randomly selected as the test set. In total, 15% of the dataset is randomly selected from each class and set aside as the test set. This resulted in 79, 71, 55, 32, and 31 instances of class 0, 1, 2, 3, and 4, respectively, in the test set. The train set was then subjected to data augmentation, as discussed in Section 3. Comparing the results from the two multi-class classification experiments, we see that all three models for the randomly selected test subset case perform significantly better than the models which had holdout test images set aside during training. In Table 5, it is observed that a combined accuracy of 0.85, 0.82, and 0.87 is obtained for the three models, whereas for the holdout test set scenario, these values are 0.64, 0.66, and 0.65, respectively, as seen in Table 4. This is because the holdout method minimizes model bias. By having a select set of images from a specific plot set aside during model training, it ensures that during test time, the model produces realistic and not overly optimistic results on a new sample set, which was unseen during training.

Figure 6 illustrates heat maps generated using the attention scores received by the patches in our MIL multi-class classification model for the holdout test set. These heat maps were created by extracting attention scores from the model, applying bilinear interpolation to the attention scores, and then overlaying the heat maps onto the original images. Like the attention maps shown in Section 4.1, these heat maps highlight which areas of the image influenced the classification and which did not. It is evident that gaps between leaf columns and regions where leaves appear disproportionate receive less attention, as indicated by the blue regions on the superimposed images.

4.3. Patch Size vs. Patch Overlap Ratio

We also wanted to study how the number of patches present in a bag affected performance of the model. The number of patches an image is divided into and fed as input to the model depend on the patch size (

m \times n

) and the overlap between the patches. Our experiments included an overlap of 0, 0.25, and 0.4 and patch sizes of 40 × 40, 40 × 80, 80 × 80, 80 × 120, 120 × 120, and 120 × 160. The results are shown in Table 6. An overlap of 0.25 gave the best results for the majority of the patch sizes considered. Figure 7 shows performance plots of the MIL model for varying patch sizes and an overlap of 25%. The plots show how the four different performance metrics of the model change with patch size. Multiple re-runs of the experiments were performed to report on a mean, and their standard deviation are shown in the plots. From the experiments, it was observed that there was considerable variance in performance when patch size was changed. The best results were obtained with a patch size of 80 × 120 and 42 patches in a bag. For all the other sizes, there was a significant decrease in performance as well as more variations in between experiment reruns.

4.4. Human Variability in Annotations

Classification accuracy of deep learning models is entirely dependent on ground truth annotations provided by experts. For tasks like rating wilt levels of leaves, an absolute and precise wilt level is often difficult to determine. Thus, we wanted to quantify the extent of human variability in annotations and compare classification performance between humans and machines. In this experiment, the entire set of annotated images used for training our model was provided to the expert annotator as a reference. The task involved re-annotating the holdout test set, which consisted of 341 images. The test set was shuffled by assigning random IDs to the images. However, to create a realistic annotation scenario, the random ID was kept consistent with the date and the temporal and spatial ordering of the images were preserved. Consequently, all images taken on the same day were assigned the same random ID, with the time of day appended to it. The expert scorer was tasked with placing each image from the test set into the appropriate label folder (0–4), and a timer was used to record the time taken to annotate the entire holdout set. The re-annotated set was then compared to the original, with the latter serving as the ground truth. The re-annotated set was compared to the original set with the latter being set as the ground truth. Figure 8 shows a comparison between performance of our MIL model and the expert scorer. We observed that the machine outperformed the human scorer not only in all classification metrics but also in scoring time per image. The machine took an average of 0.61 s to score each image, compared to 11 s on average for the human scorer. This demonstrates the advantages of applying machine learning to complex datasets, where expert decisions are often prone to error and variability. When provided with images and ground truth annotations, a machine can consistently predict the correct labels with greater accuracy and in less time than a human. Consequently, using a trained machine to assess wilt levels, rather than relying on human scorers, would significantly improve labeling consistency and efficiency.

4.5. Discussion

One of the major limitations of our work is the absence of a large-scale dataset, with our input dataset containing only 1788 annotated images. A limited sample size can pose challenges during model training, such as overfitting, and can be negatively impacted by the presence of outliers. There is also a possibility that the data may not be representative of, and hence, may not accurately reflect, the larger population it is part of. Another limitation in our dataset is class imbalance, with class 0 containing 524 instances, while class 4 has only 206 instances. This imbalance could lead to models being biased toward the majority class and performing poorly on the minority class. To address these challenges, we performed data augmentation on the training set and employed k-fold cross-validation during training (as discussed in Section 3.3). We also used a holdout set for testing to obtain a less biased and more realistic estimate of model performance (as shown in Section 4.2).

As part of our future work, we plan to scale up data collection to improve the classification accuracy of wilt levels. We also intend to explore unsupervised learning methods on larger datasets to eliminate any human bias during the labeling process. Lastly, we acknowledge that some variability is expected in complex datasets used in supervised machine learning, especially those involving human annotators. However, establishing a set of human annotations as the ground truth can help us train supervised models to perform more consistent and accurate annotations on new images, which is a major motivation behind our study. Furthermore, with the help of AI-assisted labeling techniques, these trained models can generate ground truth annotations, leading to more reliable labels on any new datasets collected.

5. Conclusions

Multiple instance learning can be used in the field of precision agriculture to leverage weak annotations and find key instances in the dataset. In this paper, we proposed a MIL multi-class classification model to classify leaf wilting rates in soybean plants. Our dataset only had image level annotations, but we were able to highlight key instances in images that triggered a specific bag label. We also showed that the simpler binary classification model of MIL can be changed to a multi-class model by formulating the problem as a regression task. We achieved limited accuracy on our holdout test set due to our dataset being comparatively small and having significant correlation between adjacent classes. However, our model was able to outperform DenseNet121 in most metrics, and provided comparable if performance to the state-of-the-art Vision Transformer. A comparison with human annotators showed that our model was more consistent and accurate in predicting the correct wilt level. In addition to this, our model was also able to provide interpretability in the form of heat maps. Such an approach can be useful to many researchers in the field, in tasks such as automatic weed detection, disease detection, or presence of drought stress.

Author Contributions

Conceptualization, P.R., S.M., C.R.-H. and E.L.; methodology, S.B., P.R. and E.L.; software, S.B.; validation, S.B. and A.L.; formal analysis, S.B.; visualization, S.B.; investigation, P.R., S.M. and E.L.; data curation, P.R. and A.L.; resources, P.R., S.M., C.R.-H. and E.L.; writing—original draft preparation, S.B.; writing—review and editing, P.R., S.M., C.R.-H., A.L. and E.L.; funding acquisition, S.M., C.R.-H. and E.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the United States Department of Agriculture National Institute of Food and Agriculture under the Water CAP program 2018-68011-28372, the Agriculture and Food Research Initiative (AFRI) competitive grant 2018-67007-28423 and the United Soybean Board grant 1820-172-0133.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available on the data sharing platform Zenodo, accessed on 6 September 2023 https://zenodo.org/records/8256382 with DOI 10.5281/zenodo.8256382. Details about the dataset can be found in Section 2 as well as in the link provided in this section. We have also made our code available on github and can be accessed at https://github.com/ARoS-NCSU/Soybean-Leaf-Wilt-Classification, accessed on 4 March 2025. For any questions about the dataset or the code, please reach out to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Lee, H.; Calvin, K.; Dasgupta, D.; Krinner, G.; Mukherji, A.; Thorne, P.; Trisos, C.; Romero, J.; Aldunce, P.; Barrett, K.; et al. Climate Change 2023: Synthesis Report; Contribution of Working Groups I, II and III to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change; IPCC: Geneva, Switzerland, 2023.
Mishra, V.; Cherkauer, K.A. Retrospective droughts in the crop growing season: Implications to corn and soybean yield in the Midwestern United States. Agric. For. Meteorol. 2010, 150, 1030–1045. [Google Scholar] [CrossRef]
Kulkarni, K.P.; Tayade, R.; Asekova, S.; Song, J.T.; Shannon, J.G.; Lee, J.D. Harnessing the potential of forage legumes, alfalfa, soybean, and cowpea for sustainable agriculture and global food security. Front. Plant Sci. 2018, 9, 1314. [Google Scholar] [CrossRef]
Yang, X.; Lu, M.; Wang, Y.; Wang, Y.; Liu, Z.; Chen, S. Response Mechanism of Plants to Drought Stress. Horticulturae 2021, 7, 50. [Google Scholar] [CrossRef]
Sloane, R.J.; Patterson, R.P.; Carter, T.E., Jr. Field drought tolerance of a soybean plant introduction. Crop Sci. 1990, 30, 118–123. [Google Scholar] [CrossRef]
Ye, H.; Song, L.; Schapaugh, W.T.; Ali, M.L.; Sinclair, T.R.; Riar, M.K.; Mutava, R.N.; Li, Y.; Vuong, T.; Valliyodan, B.; et al. The importance of slow canopy wilting in drought tolerance in soybean. J. Exp. Bot. 2020, 71, 642–652. [Google Scholar] [CrossRef] [PubMed]
Pathan, S.; Lee, J.D.; Sleper, D.; Fritschi, F.; Sharp, R.; Carter Jr, T.; Nelson, R.; King, C.; Schapaugh, W.; Ellersieck, M.; et al. Two soybean plant introductions display slow leaf wilting and reduced yield loss under drought. J. Agron. Crop Sci. 2014, 200, 231–236. [Google Scholar] [CrossRef]
King, C.A.; Purcell, L.C.; Brye, K.R. Differential wilting among soybean genotypes in response to water deficit. Crop Sci. 2009, 49, 290–298. [Google Scholar] [CrossRef]
Jha, K.; Doshi, A.; Patel, P.; Shah, M. A comprehensive review on automation in agriculture using artificial intelligence. Artif. Intell. Agric. 2019, 2, 1–12. [Google Scholar] [CrossRef]
Saxena, L.; Armstrong, L. A Survey of Image Processing Techniques for Agriculture; Australian Society of Information and Communication Technologies in Agriculture: Perth, Australia, 2014. [Google Scholar]
Ramos-Giraldo, P.; Reberg-Horton, S.C.; Mirsky, S.; Lobaton, E.; Locke, A.M.; Henriquez, E.; Zuniga, A.; Minin, A. Low-cost smart camera system for water stress detection in crops. In Proceedings of the 2020 IEEE SENSORS, Rotterdam, The Netherlands, 25–28 October 2020; pp. 1–4. [Google Scholar]
Keeler, J.; Rumelhart, D.; Leow, W. Integrated segmentation and recognition of hand-printed numerals. Adv. Neural Inf. Process. Syst. 1990, 3. [Google Scholar]
Dietterich, T.G.; Lathrop, R.H.; Lozano-Pérez, T. Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell. 1997, 89, 31–71. [Google Scholar] [CrossRef]
Maron, O.; Ratan, A.L. Multiple-instance learning for natural scene classification. In Proceedings of the ICML, Madison, WI, USA, 24–27 July 1998; Citeseer: Princeton, NJ, USA; 1998; Volume 98, pp. 341–349. [Google Scholar]
Pathak, D.; Shelhamer, E.; Long, J.; Darrell, T. Fully convolutional multi-class multiple instance learning. arXiv 2014, arXiv:1412.7144. [Google Scholar]
Wu, J.; Yu, Y.; Huang, C.; Yu, K. Deep multiple instance learning for image classification and auto-annotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3460–3469. [Google Scholar]
Lu, J.; Hu, J.; Zhao, G.; Mei, F.; Zhang, C. An in-field automatic wheat disease diagnosis system. Comput. Electron. Agric. 2017, 142, 369–379. [Google Scholar] [CrossRef]
Bollis, E.; Maia, H.; Pedrini, H.; Avila, S. Weakly supervised attention-based models using activation maps for citrus mite and insect pest classification. Comput. Electron. Agric. 2022, 195, 106839. [Google Scholar] [CrossRef]
Petti, D.; Li, C. Weakly-supervised learning to automatically count cotton flowers from aerial imagery. Comput. Electron. Agric. 2022, 194, 106734. [Google Scholar] [CrossRef]
Babenko, B.; Yang, M.H.; Belongie, S. Robust Object Tracking with Online Multiple Instance Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 1619–1632. [Google Scholar] [CrossRef] [PubMed]
Zhang, K.; Song, H. Real-time visual tracking via online weighted multiple instance learning. Pattern Recognit. 2013, 46, 397–411. [Google Scholar] [CrossRef]
Quellec, G.; Cazuguel, G.; Cochener, B.; Lamard, M. Multiple-Instance Learning for Medical Image and Video Analysis. IEEE Rev. Biomed. Eng. 2017, 10, 213–234. [Google Scholar] [CrossRef]
Sudharshan, P.; Petitjean, C.; Spanhol, F.; Oliveira, L.E.; Heutte, L.; Honeine, P. Multiple instance learning for histopathological breast cancer image classification. Expert Syst. Appl. 2019, 117, 103–111. [Google Scholar] [CrossRef]
Sun, L.; Lu, Y.; Yang, K.; Li, S. ECG Analysis Using Multiple Instance Learning for Myocardial Infarction Detection. IEEE Trans. Biomed. Eng. 2012, 59, 3348–3356. [Google Scholar] [CrossRef]
Han, Z.; Wei, B.; Hong, Y.; Li, T.; Cong, J.; Zhu, X.; Wei, H.; Zhang, W. Accurate Screening of COVID-19 Using Attention-Based Deep 3D Multiple Instance Learning. IEEE Trans. Med. Imaging 2020, 39, 2584–2594. [Google Scholar] [CrossRef]
Ilse, M.; Tomczak, J.; Welling, M. Attention-based deep multiple instance learning. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; PMLR: Boston, MA, USA, 2018; pp. 2127–2136. [Google Scholar]
Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar] [CrossRef]
Sun, C.; Shrivastava, A.; Singh, S.; Gupta, A. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 843–852. [Google Scholar]
Sirinukunwattana, K.; Raza, S.E.A.; Tsang, Y.W.; Snead, D.R.; Cree, I.A.; Rajpoot, N.M. Locality sensitive deep learning for detection and classification of nuclei in routine colon cancer histology images. IEEE Trans. Med. Imaging 2016, 35, 1196–1206. [Google Scholar] [CrossRef] [PubMed]
Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Nesterov, Y.E. Minimization methods for nonsmooth convex and quasiconvex functions. Matekon 1984, 29, 519–531. [Google Scholar]

Figure 1. Dataset consisting of soybean plants with five different wilting levels. The labels on top show the wilting level assigned to each image. The numbers on the left denote the plot number to which the images belong. The first part of the number denotes the camera ID, while the second part identifies a particular plot ID. For example, 11–46 indicates camera 11, plot 46.

Figure 2. Our instance-based MIL multi-class classification model, trained on a mean square error (MSE) loss to generate leaf wilt scores for the purpose of classification. The CNN architecture for our model is adopted from [31].

Figure 3. Top row shows original images and bottom row shows the corresponding attention maps generated by our MIL model. (a,b) Images from class 0. Here, class 0 was treated as positive bag and class 4 was treated as negative bag. (c,d) Images from class 4. Here, class 4 was treated as positive bag and class 0 was treated as negative bag.

Figure 4. Confusion matrices of model predictions on holdout test set. (a) DenseNet121. (b) Our MIL model. (c) Vision Transformer.

Figure 5. Confusion matrices of model predictions when a randomly selected subset is used as the test set. (a) DenseNet121. (b) Our MIL model. (c) Vision Transformer.

Figure 6. Heat maps generated from attention weights of our MIL multi-class classification model for holdout set 11–46. Left: sample images belonging to classes 0–4, Middle: attention weights interpolated to generate heat maps, Right: heat map superimposed on original image.

Figure 7. Boxplots showing variation in accuracy, one-off accuracy, RMSE, and MAE with changing patch size for a patch overlap of 25%. The numbers in the parentheses indicate the number of patches generated in a bag. We observe a significant improvement in three of the metrics (Accuracy, RMSE, and MAE) when patch size is set to 80 × 120 with 42 patches in a bag.

Figure 8. Confusion matrices showing performance of a human annotator (left) and our MIL model (right) on holdout test set 11–46 (341 images).

Table 1. Distribution of plot numbers and class labels present in the dataset.

Class	Plot Number										Total
Class	10–87	11–46	14–198	15–29	16–60	3–106	4–228	5–180	5–27	7–130	Total
0	24	37	37	64	95	34	21	76	107	29	524
1	51	82	28	20	4	66	48	18	124	32	473
2	20	72	57	4	6	50	49	4	49	58	369
3	25	84	2	2	0	9	67	1	4	22	216
4	136	66	0	0	0	0	4	0	0	0	206
Total	256	341	124	90	105	159	189	99	284	141	1788

Table 2. Comparison of model architecture and training.

Categories	Instance-Based MIL	DenseNet121	Vision Transformer
Data Augmentation	Yes	Yes	Yes
Patches Created	Yes	No	Yes
Number of Patches	$K = (\frac{H}{m - m * o v} - 1) * (\frac{W}{n - n * o v} - 1)$ $H, W =$ Image dim $m, n =$ patch dim, $o v =$ overlap	No	$K = H * W / p * p$ $H, W =$ Image dim $p =$ patch dim
Input to Model	Pixel arrays of patches	Pixel arrays of resized images	Linear patch embedding + Positional embedding
Convolutional Layers	Yes	Yes	No
Pooling Layers	Max Pool	Max Pool + Average Pool + Global Average Pool	No Pooling layers
Activation Functions	RELU	RELU	GELU
Self-Attention	No	No	Yes
Pre-training	No pre-training, all layers trained from scratch	Pre-trained on ImageNet 1k	Pre-trained on ImageNet 21k
Model Architecture	conv(36,4×4) + ReLU maxpool(2,2) conv(48,3×3) + ReLU maxpool(2,2) Fc(512) + ReLU Dropout(0.5) Fc(512) + ReLU Dropout(0.5) Fc(1) + Sigmoid mil-max/mil-mean	121 layers (117-conv, 3-transition) Global Average Pooling Fc(512) + RELU Dropout(0.5) Fc(128) + RELU Dropout(0.5) Fc(5) + Softmax	12 transformer encoder layers (MLP + Layer Norm + Multi-head Attention) Dropout(0.1) Linear Layer(5)
Trainable Parameters	12,667,349 for m = 80, n = 120, overlap = 0.25	7,544,965	85,802,501
Loss Function	Mean Square Error	Categorical Cross Entropy	Categorical Cross Entropy
Training time	335 min	287 min	512 min

Table 3. Binary classification accuracy of MIL models and the DenseNet baseline on different combinations of positive and negative bag instances.

Positive	Negative	Instance + Max	Instance + Mean	Attention	DenseNet121
4	0	0.9610	0.9805	0.9720	0.9769
4	1	0.9581	0.9797	0.9662	0.9754
4	2	0.9456	0.9647	0.9471	0.9679
4	3	0.8559	0.9278	0.8423	0.9487
4, 3	0, 1	0.8871	0.9352	0.8912	0.9394
2	0, 1	0.7297	0.7614	0.6904	0.7810

Table 4. Multi-class classification comparisons with holdout test set. Accuracy, one-off accuracy, precision, F1 score, RMSE, MAE, and prediction speed per image are reported for our MIL and the baseline models.

Metrics	Our MIL Model				DenseNet121				Vision Transformer
Metrics	Acc	One-Off	Prec	F1	Acc	One-Off	Prec	F1	Acc	One-Off	Prec	F1
Class 0	0.22	0.81	0.67	0.33	0.95	1.00	0.47	0.62	0.76	0.97	0.68	0.72
Class 1	0.73	0.98	0.66	0.69	0.65	0.98	0.62	0.63	0.85	1.00	0.58	0.69
Class 2	0.57	0.99	0.51	0.54	0.53	0.83	0.72	0.61	0.39	0.96	0.49	0.43
Class 3	0.67	0.98	0.58	0.62	0.57	0.83	0.73	0.64	0.55	0.93	0.69	0.61
Class 4	0.79	1.00	0.85	0.82	0.76	0.97	0.82	0.79	0.73	0.97	0.86	0.79
Macro Average	0.64	0.96	0.65	0.60	0.66	0.91	0.67	0.66	0.65	0.96	0.66	0.65
RMSE	0.71				0.78				0.70
MAE	0.36				0.35				0.35
Pred Speed per Image	0.61 s				0.56 s				1.28 s

Table 5. Multi-class classification comparisons with randomly selected subset as test set.

Metrics	Our MIL Model				DenseNet121				Vision Transformer
Metrics	Acc	One-Off	Prec	F1	Acc	One-Off	Prec	F1	Acc	One-Off	Prec	F1
Class 0	0.78	0.95	0.97	0.88	0.89	0.94	0.93	0.91	0.82	0.97	0.96	0.88
Class 1	0.94	0.99	0.80	0.86	0.87	0.96	0.83	0.85	0.93	0.99	0.80	0.86
Class 2	0.82	0.96	0.85	0.83	0.78	0.96	0.77	0.77	0.89	0.96	0.88	0.88
Class 3	0.75	0.94	0.75	0.75	0.63	0.94	0.69	0.66	0.81	0.97	0.84	0.83
Class 4	0.94	1.00	0.88	0.91	0.84	0.94	0.79	0.81	0.90	0.97	0.90	0.90
Macro Average	0.85	0.97	0.85	0.85	0.82	0.95	0.80	0.80	0.87	0.97	0.88	0.87
RMSE	0.49				0.61				0.47
MAE	0.18				0.24				0.16

Table 6. Effect of varying overlap ratio and patch size on performance metrics.

Overlap Ratio	Patch Size (Num of Patches)	Accuracy	One-Off	RMSE	MAE
0	40 × 40 (192)	0.52	0.94	0.80	0.93
0	40 × 80 (96)	0.52	0.94	0.81	0.54
0	80 × 80 (48)	0.48	0.95	0.82	0.57
0	80 × 120 (30)	0.57	0.93	0.79	0.49
0	120 × 120 (20)	0.59	0.94	0.76	0.47
0	120 × 160 (16)	0.57	0.95	0.78	0.49
0.25	40 × 40 (315)	0.53	0.96	0.77	0.51
0.25	40 × 80 (150)	0.57	0.96	0.74	0.47
0.25	80 × 80 (70)	0.51	0.96	0.78	0.53
0.25	80 × 120 (42)	0.64	0.96	0.71	0.4
0.25	120 × 120 (30)	0.51	0.96	0.79	0.5
0.25	120 × 160 (25)	0.52	0.93	0.84	0.55
0.4	40 × 40 (494)	0.54	0.95	0.77	0.51
0.4	40 × 80 (228)	0.55	0.94	0.74	0.5
0.4	80 × 80 (108)	0.55	0.96	0.75	0.49
0.4	80 × 120 (72)	0.56	0.95	0.73	0.51
0.4	120 × 120 (48)	0.53	0.97	0.74	0.5
0.4	120 × 160 (36)	0.49	0.96	0.8	0.55

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Banerjee, S.; Ramos, P.; Reberg-Horton, C.; Mirsky, S.; Locke, A.; Lobaton, E. A Multiple Instance Learning Approach to Study Leaf Wilt in Soybean Plants. Agriculture 2025, 15, 614. https://doi.org/10.3390/agriculture15060614

AMA Style

Banerjee S, Ramos P, Reberg-Horton C, Mirsky S, Locke A, Lobaton E. A Multiple Instance Learning Approach to Study Leaf Wilt in Soybean Plants. Agriculture. 2025; 15(6):614. https://doi.org/10.3390/agriculture15060614

Chicago/Turabian Style

Banerjee, Sanjana, Paula Ramos, Chris Reberg-Horton, Steven Mirsky, Anna Locke, and Edgar Lobaton. 2025. "A Multiple Instance Learning Approach to Study Leaf Wilt in Soybean Plants" Agriculture 15, no. 6: 614. https://doi.org/10.3390/agriculture15060614

APA Style

Banerjee, S., Ramos, P., Reberg-Horton, C., Mirsky, S., Locke, A., & Lobaton, E. (2025). A Multiple Instance Learning Approach to Study Leaf Wilt in Soybean Plants. Agriculture, 15(6), 614. https://doi.org/10.3390/agriculture15060614

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Multiple Instance Learning Approach to Study Leaf Wilt in Soybean Plants

Abstract

1. Introduction

2. Dataset

3. Materials and Methods

3.1. Baseline Model

3.2. Multiple Instance Learning (MIL) Model

3.3. Implementation Details

4. Results

4.1. MIL Binary Classification

4.2. MIL Multi-Class Classification

4.3. Patch Size vs. Patch Overlap Ratio

4.4. Human Variability in Annotations

4.5. Discussion

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI