1. Introduction
Soybean (Glycine max [L.] Merr.), a major global oilseed crop and protein source, is crucial for agricultural sustainability and food security [1,2]. In soybean breeding work, precise seedling counting is a key step, as it forms the basis for evaluating germplasm quality and is an important criterion for selecting superior varieties [3]. Traditionally, seedling counting is conducted by sampling selected field locations and extrapolating the total number from these observations, a process that is both labor-intensive and error-prone in complex field environments. Developing an efficient and accurate automated seedling counting method is therefore crucial, and an automated counting platform tailored to agricultural breeding settings would make such a method practical to deploy.
In addition to manual counting, various classical image processing techniques have been applied to seedling counting. Liu et al. [4] extracted wheat seedlings from RGB images captured by a digital camera and counted them using excess green and Otsu’s thresholding methods, followed by skeleton optimization. Gnädinger and Schmidhalter [5] performed automated maize plant counting using RGB imagery captured by an unmanned aerial vehicle (UAV), applying decorrelation stretch, HSV thresholding, and pixel clustering. Koh et al. [6] estimated safflower seedling density through object-based image analysis combined with template matching applied to UAV RGB images. Furthermore, Liu et al. [7] demonstrated that traditional methods such as corner detection and linear regression can also achieve high accuracy in maize seedling counting using UAV imagery.
In recent years, UAV remote sensing [8] platforms equipped with cameras have facilitated the rapid and high-throughput acquisition of images from large-scale agricultural fields, providing robust technical support for crop seedling monitoring [9]. In such applications, orthomosaic maps have become widely used as fundamental data products. The generation of orthomosaics typically follows a standard workflow that begins with flight planning and UAV image acquisition over the target area, followed by photogrammetric processing, which includes image alignment and mosaicking [10]. These orthomosaics allow agricultural researchers and plant breeders to analyze phenotypic data efficiently and nondestructively [11]. Depending on the sensors mounted on UAVs, remote sensing images may include RGB [12], multispectral [13], or hyperspectral data [14], offering diverse options for crop monitoring and seedling assessment.
In crop seedling counting based on UAV remote sensing, deep learning techniques have played a crucial role [15,16], with RGB imagery serving as the most cost-effective data source [17,18]. Oh et al. [19] developed a method for separating, locating, and counting cotton seedlings using an improved YOLOv3 [20]. Barreto et al. [21] proposed a sugar beet seedling counting model based on a fully convolutional network (FCN) and demonstrated that extending the previously trained FCN pipeline to other crops was possible with a small training dataset. Chen et al. [22] introduced a cabbage seedling counting method by modifying YOLOv8n [23], where a Swin-conv block replaced the C2f block in the backbone, and ParNet attention modules were incorporated into both the backbone and neck parts, enabling accurate tracking and counting of cabbage seedlings. Shahid et al. [24] utilized U-Net [25] and YOLOv7 [26] as plant detection modules for tobacco counting and found the real-time system using YOLOv7 and SORT to be the most effective. Building on TasselNetV2+ [27], Xue et al. [28] proposed the TasselNetV2++ model, which demonstrated strong counting performance for soybean seedlings, maize tassels, sorghum heads, and wheat heads. De Souza et al. [29] employed a multilayer perceptron to classify and count soybean seedlings using RGB and multispectral images, with seedling counts per image ranging from 10 to 25.
Among crop seedling counting methods based on UAV imagery and deep learning, detection-based counting methods using the YOLO series have been the most widely applied. These methods can simultaneously locate and count seedlings; however, their counting process relies on detection results and is not end-to-end. In agricultural scenarios with severe occlusion, segmentation-based models encounter difficulties with sample annotation and target separation. Regression-based crop counting models directly regress counts from input images, representing an end-to-end counting approach. Among them, TasselNetV2++ exhibited strong adaptability to multiple crops and varying input sizes. However, TasselNetV2++ was only validated on small images containing 0 to 22 soybean seedlings, whereas in breeding practice it is the seedling count and emergence rate per plot that truly matter, not the number of seedlings in each small image.
Although the aforementioned deep learning-based crop counting methods have achieved good performance, they can still be further optimized by incorporating additional information. Multitask learning (MTL) is a method for integrating multi-source information into deep learning models. It can improve model generalization and accuracy by sharing information across related tasks, compared to learning each task independently [30,31]. In the field of crop counting, multitask learning has proven to be an effective approach for integrating information. Pound et al. [32] employed MTL for wheat head counting and classification of images with or without awns, showcasing the potential of MTL in plant phenotyping. Dobrescu et al. [33] used MTL to simultaneously extract different phenotypic traits from plant images, predicting leaf count, leaf area, and genotype category, and demonstrated that jointly learning multiple related tasks can reduce leaf counting errors. Jiao et al. [34] introduced confidence scores as a classification head based on YOLOX [35], improving counting performance by removing redundant detection boxes with classification confidence scores lower than 0.3 during the wheat head counting process. These MTL models generally require both input images and multitask ground truth during training. However, in practical use, only images are needed as input for the trained model to perform the counting task, eliminating the need for additional multitask information.
To provide an intuitive user interface and automated analysis capability, researchers have also focused on model deployment and application. Li and Yang [36] developed a few-shot cotton pest recognition model and integrated an intelligent recognition system into embedded hardware for practical agricultural applications. Sun et al. [37] developed a PC-based software application to extract phenotypic traits from images of rice panicles captured by smartphones, facilitating the analysis of panicle characteristics. To achieve high-precision simultaneous detection of mango fruits and fruiting stems on an edge computing platform, Gu et al. [38] improved the YOLOv8n model and deployed it on both PC and NVIDIA Jetson Orin Nano platforms. Xu et al. [39] developed a real-time pest detection algorithm based on YOLOv5s [40] with the lightweight GhostNet network, optimized its neck and head structures, and deployed the model as an Android application.
The development of large language models (LLMs) and other generative artificial intelligence (AI) has opened new opportunities [41]. Zhu et al. [42] proposed a multimodal model for potato disease detection, integrating a visual branch, a text description branch, and an image statistical feature branch. Based on this model and GPT-4, they developed an expert system for potato disease diagnosis and control. This system can utilize the LLM to offer suggestions for potato disease control, enabling real-time interaction with control recommendations. The LLM can serve as a useful explainer for the output of the self-built model. Moreover, deploying the self-built model with LLM embedding can enhance consultations by providing all necessary information, facilitating real-world applications.
Therefore, the objectives of this study are as follows: (1) achieve end-to-end soybean seedling counting at the plot scale based on TasselNetV2++, named PlotCounter; (2) explore the incorporation of sowing-related information into PlotCounter during the training stage via multitask learning, through category information integration, classifier design, and loss function construction, referred to as MTL-PlotCounter; and (3) develop web-based plot-scale seedling counting agents based on a multimodal LLM and evaluate their performance using a UAV-derived dataset tailored for plot-scale soybean seedling counting.
2. Materials and Methods
2.1. Dataset Preparation
2.1.1. Study Area and Experimental Design
The experiments were carried out in a soybean breeding field in Jinzhong, Shanxi Province, China (37°26′0.9″N–37°26′2.5″N, 112°35′39.9″E–112°35′40.4″E), located within a temperate continental climate zone that experiences an average annual temperature of 10.4 °C and 462 mm of precipitation.
In 2023, a randomized experiment with replications was conducted to evaluate the emergence performance of soybean under different varieties, numbers of seeds per hole (NSPH), and sowing densities. The soybean varieties selected for the experiment were “Dongdou 1” and “Tiefeng 31”, which are the primary varieties promoted in Jinzhong. The experiment simulated precision sowing with 1, 2, and 3 seeds per hole and tested three sowing densities: 120,000, 154,000, and 222,000 seeds per hectare (ha−1).
Table 1 presents the hole spacings derived from the NSPH values and sowing densities under a fixed row spacing of 50 cm. The field experiment tested two soybean varieties under a factorial design combining three NSPH levels and three sowing densities. Each treatment was replicated three times, yielding a total of 54 breeding plots.
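The hole spacings in Table 1 follow from simple row geometry. The derivation below is our own reconstruction (not stated explicitly in the text): one hectare contains 10,000 m² / row spacing metres of row, and the number of holes is the density divided by NSPH.

```python
def hole_spacing_cm(nsph, density_seeds_per_ha, row_spacing_m=0.5):
    """Hole spacing (cm) for a target sowing density.

    One hectare holds 10,000 / row_spacing_m metres of row; dividing that
    row length by the number of holes (density / NSPH) gives the spacing.
    """
    row_length_m = 10_000 / row_spacing_m          # metres of row per hectare
    holes_per_ha = density_seeds_per_ha / nsph     # each hole holds NSPH seeds
    return row_length_m / holes_per_ha * 100       # convert m -> cm

# Example: 1 seed per hole at 120,000 seeds/ha with 50 cm rows
print(round(hole_spacing_cm(1, 120_000), 1))  # 16.7
```

Doubling NSPH at a fixed density doubles the hole spacing, which is why occlusion within holes grows with NSPH.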
2.1.2. Plot-Scale Soybean Seedling Dataset Construction
A DJI Mavic 2 Pro UAV (SZ DJI Technology Co., Shenzhen, China) equipped with a Hasselblad L1D-20C camera was used to capture RGB imagery. Flights were conducted at a 5 m altitude and a speed of 2 m/s, achieving a ground sampling distance of 0.1 cm/pixel. Imaging was performed with the camera oriented downward before sunrise (22 June 2023, 5:30–6:00 AM, 16 days after sowing) to minimize shadows, with front and side overlap rates of 75% and 60%, respectively. Based on preliminary tests, the 5 m flight height was selected to ensure sufficient image resolution while meeting the minimum overlap requirements for image stitching. The images were stitched into a TIF-format mosaic using the DJI Smart Agriculture Platform (SZ DJI Technology Co., Ltd., Shenzhen, China).
Individual plot images were extracted from the stitched mosaic and resized to 3072 × 6144 pixels via zero-padding to standardize their dimensions. Seedling counts per plot ranged from 55 to 211, with a total of 5932 seedlings annotated across all images. Because of significant mutual occlusion among seedlings in the UAV imagery, which was captured after seedling emergence had ceased, direct image annotation was infeasible, necessitating field-based labeling. Each soybean seedling was manually annotated in the field using printed high-resolution plot images, and the bounding boxes were later converted into dot annotations indicating seedling centroids.
Figure 1 shows the field data acquisition and an overview of the experimental site.
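The box-to-dot conversion described above amounts to taking each bounding box's centroid; a minimal sketch (coordinate convention assumed):

```python
def bbox_to_dot(x1, y1, x2, y2):
    """Convert a seedling bounding box (corner coordinates) to a dot
    annotation at its centroid, mirroring the conversion described above."""
    return ((x1 + x2) / 2, (y1 + y2) / 2)

print(bbox_to_dot(100, 40, 140, 90))  # (120.0, 65.0)
```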
Additionally, sowing-related information, including the soybean variety, NSPH, and sowing density for each plot, was recorded in the form of a dictionary, where the key was the plot ID and the value was a list containing the labels for the soybean variety, NSPH, and sowing density. The plot-scale soybean seedling dataset consists of images of 54 plots, dot annotation data for each seedling, as well as soybean variety, NSPH, and sowing density information for each plot. All 54 plots were used in the study and were divided into training, validation, and testing sets for model development and evaluation.
Given that NSPH significantly affected the occlusion level among seedlings, the dataset was partitioned by NSPH to ensure model generalizability. Specifically, from each NSPH category, 10 plots were assigned to the training set, 4 plots to the validation set, and 4 plots to the testing set. Consequently, for the three NSPH levels, the training set comprised 30 plots, while the validation and testing sets each contained 12 plots.
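The NSPH-stratified 10/4/4 partition above can be sketched as follows; the plot IDs and the use of a seeded random shuffle are illustrative assumptions, not details from the paper.

```python
import random

def stratified_split(plot_ids_by_nsph, n_train=10, n_val=4, n_test=4, seed=0):
    """Split plots into train/val/test within each NSPH category,
    mirroring the per-category 10/4/4 partition described above."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for nsph, ids in plot_ids_by_nsph.items():
        ids = list(ids)
        rng.shuffle(ids)  # assign plots to splits within this NSPH stratum
        train += ids[:n_train]
        val += ids[n_train:n_train + n_val]
        test += ids[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

# 54 plots, 18 per NSPH level (IDs are illustrative)
plots = {1: range(1, 19), 2: range(19, 37), 3: range(37, 55)}
train, val, test = stratified_split(plots)
print(len(train), len(val), len(test))  # 30 12 12
```

Stratifying by NSPH keeps the occlusion levels balanced across the three splits.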
2.2. TasselNetV2++
TasselNetV2++ was composed of three concatenated components: an encoder, a counter, and a normalizer, with its network architecture illustrated in Figure 2. The encoder adopted a dual-branch architecture. One branch comprised three Conv-CBAM-Pool blocks and two Conv-CBAM blocks, while the other branch employed a simplified version of the YOLOv5s backbone tailored for this architecture. The encoder extracted features through gathering and distribution (GD), fusion via bilinear interpolation, average pooling (AvgPool), and feature concatenation, followed by a low-stage information fusion module (Low-IFM) [43]. In the counter component, the features were transformed into a redundant count map using AvgPool, a Conv-BN-SiLU (CBS) block, a convolutional block attention module (CBAM) [44], a Conv-BN-ReLU (CBR) block, and a final convolutional layer. A normalizer then converted this into the final count map, whose pixel values sum to each image’s seedling total. TasselNetV2++ proved robust across diverse tasks, including images with up to 22 soybean seedlings, but it has not yet been validated at the plot scale.
2.3. Plot-Scale Counting
2.3.1. Subimage Count Accumulation
Xue et al. [28] uniformly divided each plot image into 18 subimages and applied the TasselNetV2++ model to count the seedlings within these subimages. The most straightforward method for deriving a plot-scale counting result from subimages is to directly accumulate the counting results of all subimages within a particular plot. The flowchart of the subimage count accumulation method is shown in Figure 3: each subimage of a plot was individually input into the TasselNetV2++ model to obtain its corresponding count, and these individual counts were then summed to determine the total seedling count for the entire plot. Although the subimage count accumulation method is easy to operate, allowing for quick acquisition of counting results at the plot scale, it has some obvious limitations. One major issue is the possibility of counting the same soybean seedling twice in neighboring subimages. Moreover, during model validation, the global information at the plot scale is overlooked.
2.3.2. PlotCounter
Although the subimage count accumulation method was a straightforward way to obtain counting results at the plot scale, it inherently accumulated the errors of the individual subimage counts. As a result, outstanding subimage counting outcomes did not necessarily ensure superior counting results at the plot scale. PlotCounter aimed to achieve end-to-end seedling counting at the plot scale based on TasselNetV2++. Specifically, plot images were directly fed into the TasselNetV2++ model to obtain the counting results of soybean seedlings within the plot.
The training set comprised only 30 plots. Directly using these plot images for training would provide too few samples to train a deep learning model. To address this issue of limited training plots, this study introduced a novel approach for organizing training samples: in each training epoch, 20 random patches were cropped from each plot image and used as training samples, generating 600 randomly cropped image patches per epoch. For model validation, the 12 plot images in the validation set were used directly to evaluate the model’s counting performance at the plot scale, helping to select the most suitable weights for the plot-scale counting task. The flowchart of PlotCounter is presented in Figure 4. As illustrated, PlotCounter employed this sample organization strategy, in which randomly cropped image patches were used for training and entire plot images were used for validation to obtain weights tailored for the plot-scale counting task. Consequently, by inputting the plot images from the test set into PlotCounter, the total number of seedlings in each plot can be obtained, achieving end-to-end plot-scale counting.
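The patch-per-epoch sampling strategy can be sketched as below; the 512 × 512 patch size is an assumption for illustration (the paper does not specify it here), while the 20 patches per plot and 30 training plots come from the text.

```python
import random

def sample_patch_coords(img_h, img_w, patch_h, patch_w, n_patches=20, rng=None):
    """Return top-left corners of randomly cropped training patches.

    PlotCounter draws a fresh set of patches from every plot image in each
    epoch; with 30 training plots and 20 patches per plot, this yields
    600 training samples per epoch.
    """
    rng = rng or random.Random()
    return [(rng.randint(0, img_h - patch_h), rng.randint(0, img_w - patch_w))
            for _ in range(n_patches)]

# One epoch's worth of patches for 30 plot images of 3072 x 6144 pixels
epoch_patches = [sample_patch_coords(3072, 6144, 512, 512) for _ in range(30)]
print(sum(len(p) for p in epoch_patches))  # 600
```

Because a new set of crops is drawn every epoch, the effective number of distinct training samples grows with the number of epochs, which is what makes training on only 30 plots feasible.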
2.4. MTL-PlotCounter
To enhance the accuracy of soybean seedling counting at the plot scale, this study explored the integration of sowing-related information during the training stage. During testing and practical applications, the model was expected to rely exclusively on plot images for counting, without requiring any additional sowing-related inputs. Multitask learning offered an efficient way to integrate various data sources, and during real-world applications, it eliminated the need to supply the model with agronomy data. Therefore, multitask learning was employed to incorporate sowing-related data.
Building on PlotCounter, a multitask-driven plot-scale counting model, namely MTL-PlotCounter, was proposed. In MTL-PlotCounter, the main task was to count soybean seedlings at the plot scale, while a classification task served as an auxiliary task to improve counting accuracy. The overall framework of the proposed MTL-PlotCounter is illustrated in Figure 5. The encoder, counter, and normalizer components were consistent with those in TasselNetV2++ and PlotCounter. Following PlotCounter’s sample organization strategy, MTL-PlotCounter randomly cropped image patches from plot images for training. In addition, MTL-PlotCounter included a classifier that shared the features extracted by the encoder with the counter.
As depicted in Figure 5, during training the output of the counting task was the predicted count map for the image patch, while the output of the classification task was the predicted score for each class. The counting loss was determined by the deviation between the predicted count map for the image patch and the corresponding ground truth count map, which was obtained through Gaussian smoothing and mean smoothing of the dot annotation image. The classification loss was determined by the deviation between the predicted class scores and the ground truth class. Both losses jointly updated the model weights, with the loss function optimizing the counting and classification tasks simultaneously.
As shown in Figure 5, during validation, inputting a plot image directly into MTL-PlotCounter yields the predicted count map and the class scores for that plot. The predicted count for the plot was obtained by summing all the grayscale values in the predicted plot-scale count map. The best model weights for plot-scale counting were selected by calculating the mean absolute error (MAE) between the predicted counts and the ground truth counts for all plot samples in the validation set. In short, the model weights were updated through training, and the most suitable weights for plot-scale counting were selected through validation. At test time, inputting a plot image into MTL-PlotCounter predicts both the plot-scale count and the plot category.
The field experimental design incorporated three factors: soybean variety, NSPH, and sowing density. Utilizing these factors, we developed plot-scale soybean seedling counting models based on multitask learning, driven by variety, NSPH, and sowing density classifications. These models were named V-MTL-PlotCounter (variety-driven MTL-PlotCounter), NSPH-MTL-PlotCounter (NSPH-driven MTL-PlotCounter), and SD-MTL-PlotCounter (sowing-density-driven MTL-PlotCounter).
To develop the MTL-PlotCounter, three key issues needed to be solved. These involved integrating category information, designing an effective classifier, and formulating a loss function tailored for the multitask scenario. In the following, the implementation of V-MTL-PlotCounter, NSPH-MTL-PlotCounter, and SD-MTL-PlotCounter will be elaborated from these three aspects.
2.4.1. Category Information Incorporation
The plot IDs ranged from 1 to 54. The field experiment included two soybean varieties, labeled as 0 or 1, while both NSPH and sowing density had three levels, labeled from 0 to 2. The sowing-related category information was stored in a dictionary with the structure {plot ID: [soybean variety label, NSPH label, sowing density label]}. For the V-MTL-PlotCounter, NSPH-MTL-PlotCounter, and SD-MTL-PlotCounter models, the corresponding class label (variety label, NSPH label, and sowing density label) could be extracted from the dictionary using the plot ID. During training, each image patch inherited the category label of its source plot, retrieved via the plot ID.
2.4.2. Classifier Design
Given the complexity of the encoder component, the designed classifier structure was relatively simple. Its architecture is illustrated in Figure 6. It consisted of a convolutional layer, an adaptive AvgPool layer, a flatten layer, a dropout layer, a fully connected (FC) layer, batch normalization (BN), a rectified linear unit (ReLU), another dropout layer, and a final FC layer, all connected in series. The input to the classifier was the features extracted by the encoder, and the output was the score for each category. The classifier shared the features extracted by the encoder with the counter, and its simple structure could handle input features of various dimensions.
Assume the encoder’s output feature map is of size H × W × 128, where H and W denote its spatial height and width, respectively. The detailed parameters and output dimensions of each layer in the classifier are shown in Table 2. The convolutional layer used 64 kernels of size 3 × 3 × 128 with a stride of 2, reducing the height and width of the input feature map to half of their original size while changing the number of channels to 64. The AvgPool layer downsampled each feature map to 1 × 1. The flatten layer converted the 3D tensor into a 1D tensor of length 64. The dropout layer randomly set neurons to zero with a probability of 0.5 during training to reduce the risk of overfitting, leaving the output dimension at 64. The subsequent FC layer, BN layer, ReLU, and dropout layer also had an output dimension of 64. The output dimension of the final FC layer depended on the classification task of each model: 2 for V-MTL-PlotCounter, 3 for NSPH-MTL-PlotCounter, and 3 for SD-MTL-PlotCounter.
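A PyTorch sketch of the classifier in Table 2 is given below. The padding of the stride-2 convolution is our assumption (chosen so it halves H and W, as the text states); everything else follows the listed layers.

```python
import torch
import torch.nn as nn

class PlotClassifier(nn.Module):
    """Sketch of the auxiliary classifier described in Table 2."""
    def __init__(self, num_classes, in_channels=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, stride=2, padding=1),
            nn.AdaptiveAvgPool2d(1),     # any H x W -> 1 x 1
            nn.Flatten(),                # -> vector of length 64
            nn.Dropout(0.5),
            nn.Linear(64, 64),
            nn.BatchNorm1d(64),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(64, num_classes),  # 2 for variety, 3 for NSPH/density
        )

    def forward(self, x):
        return self.net(x)

# Encoder features: batch of 4, 128 channels, arbitrary spatial size
scores = PlotClassifier(num_classes=2)(torch.randn(4, 128, 24, 48))
print(scores.shape)  # torch.Size([4, 2])
```

The adaptive average pooling is what lets the same classifier handle encoder features of varying spatial dimensions, matching the paper's remark about input features of various sizes.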
2.4.3. Loss Function Construction for MTL-PlotCounter
In multitask-driven models, the loss function needs to account for the losses of multiple tasks, including the counting loss and the classification loss. The counting loss, denoted as Loss_count, computed the ℓ1 loss between the predicted count map from the model and the ground truth count map, as presented in Equation (1). The computation of the classification loss involved two steps. First, the scores for each class were converted into probabilities using the softmax function, as presented in Equation (2). Subsequently, the cross-entropy loss function was employed to calculate the classification loss, as presented in Equation (3).

$$\mathrm{Loss\_count} = \frac{1}{N}\sum_{i=1}^{N}\left\lVert \hat{M}_i - M_i \right\rVert_1 \quad (1)$$

$$p_{i,c} = \frac{\exp(s_{i,c})}{\sum_{c'=1}^{C}\exp(s_{i,c'})} \quad (2)$$

$$\mathrm{Loss\_cls} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\log(p_{i,c}) \quad (3)$$

where N is the number of samples, and C represents the number of classes in each classification task. $M_i$ represents the ground truth count map and $\hat{M}_i$ the predicted count map for the i-th sample, respectively. $s_{i,c}$ represents the predicted score and $p_{i,c}$ the predicted probability for the i-th sample belonging to the c-th class, respectively. $y_{i,c}$ is the one-hot encoding of the ground truth class label for the i-th sample, which equals 1 if the i-th sample belongs to the c-th class and 0 otherwise.
MTL-PlotCounter adopted a weighted sum approach to construct a loss function suitable for multitask learning. The multitask loss, denoted as MLoss, used a weight of 0.01 to balance the magnitudes of Loss_count and Loss_cls. The MLoss is formulated in Equation (4).

$$\mathrm{MLoss} = \mathrm{Loss\_count} + 0.01 \times \mathrm{Loss\_cls} \quad (4)$$

The counting task served as the primary task in MTL-PlotCounter, while the classification task acted as an auxiliary task aimed at driving improvements in counting accuracy. Consequently, MLoss was only used during the training process. During validation, the MAE between the predicted and ground truth plot-scale seedling counts was used to select the optimal model weights suitable for the plot-scale counting task. Given a validation set containing N samples, let $\hat{c}_i$ denote the predicted seedling count and $c_i$ the corresponding ground truth count for the i-th sample. The MAE is formulated in Equation (5).

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{c}_i - c_i\right| \quad (5)$$
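As a numerical sanity check of the multitask loss, a plain-Python sketch is given below. The tiny count maps, class scores, and the per-sample ℓ1 summation are illustrative assumptions; the 0.01 weight follows the text.

```python
import math

def softmax(scores):
    """Convert class scores to probabilities (Equation (2))."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multitask_loss(pred_maps, gt_maps, pred_scores, gt_labels, weight=0.01):
    """MLoss = Loss_count + 0.01 * Loss_cls (Equation (4)).

    Count maps are flat lists of pixel values; the l1 term sums absolute
    pixel deviations per sample and averages over samples.
    """
    n = len(pred_maps)
    loss_count = sum(
        sum(abs(p - g) for p, g in zip(pm, gm))
        for pm, gm in zip(pred_maps, gt_maps)
    ) / n
    loss_cls = -sum(
        math.log(softmax(s)[y]) for s, y in zip(pred_scores, gt_labels)
    ) / n
    return loss_count + weight * loss_cls

# Two samples with tiny 4-pixel count maps and a 2-class (variety) head
mloss = multitask_loss(
    pred_maps=[[0.1, 0.2, 0.0, 0.1], [0.3, 0.0, 0.1, 0.0]],
    gt_maps=[[0.0, 0.2, 0.1, 0.1], [0.3, 0.1, 0.0, 0.0]],
    pred_scores=[[2.0, 0.5], [0.2, 1.5]],
    gt_labels=[0, 1],
)
print(round(mloss, 4))
```

With the 0.01 weight, the classification term contributes only a small correction to the counting loss, reflecting its auxiliary role.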
2.5. The Design of the Plot-Scale Counting Agent
To facilitate the use of the plot-scale counting model, this study designed and deployed plot-scale counting agents on a web platform, which included both the PlotCounter Agent and the V-MTL-PlotCounter Agent. The core modules of the agent comprised image uploading, plot-scale counting, counting results displaying, and an AI analysis assistant that integrated a multimodal LLM. Technically, the backend employed the FastAPI framework for image uploading and model inference, while the frontend interface was designed using HTML (hypertext markup language) and CSS (cascading style sheets), featuring a green-themed layout. For the multimodal LLM component, the Gemini-2.0-flash model was selected to enable user interaction and intelligent analysis.
The overall framework of the plot-scale counting agent is illustrated in Figure 7. After users uploaded a plot image to the agent, it first retrieved the image dimensions and fed the image into the trained model, either PlotCounter or V-MTL-PlotCounter. Using the PlotCounter model, the agent inferred the count map and the count result for the plot (predicted plot count). In addition to predicting the count result, the V-MTL-PlotCounter model could also predict the soybean variety of the plot. The agent displayed the count map and count result on the left side of the webpage and included the predicted plot count, variety (unknown for PlotCounter), and plot image size in the image analysis report shown on the right side of the webpage. Subsequently, the agent passed the original plot image, the image analysis report, and an analysis prompt to the Gemini-2.0-flash model for further analysis of the emergence rate and professional advice.
2.6. Experimental Setting
The experiments were conducted on a laptop featuring a 14-core Intel(R) Core(TM) i7 CPU, 40 GB RAM, 2.28 TB storage, and an NVIDIA RTX 3070 Ti GPU. The operating system was Windows 11, and all models were developed in Python 3.8.3 using PyTorch 1.12. Training was initialized with a learning rate of 0.01, which was reduced ten-fold at the 200th and 400th epochs of the 500 training epochs. Momentum was set to 0.95, and a weight decay of 0.0005 was employed.
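The training schedule above corresponds to a standard step decay; a PyTorch sketch (assuming SGD as the optimizer, which the text implies via the momentum and weight decay settings but does not name explicitly):

```python
import torch

# SGD with momentum 0.95 and weight decay 5e-4; lr 0.01 reduced ten-fold
# at epochs 200 and 400 over 500 epochs, as described above.
params = [torch.nn.Parameter(torch.zeros(1))]  # placeholder model parameters
optimizer = torch.optim.SGD(params, lr=0.01, momentum=0.95, weight_decay=0.0005)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[200, 400], gamma=0.1)

lrs = []
for epoch in range(500):
    # ... training and validation for one epoch would go here ...
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()

print(lrs[0], round(lrs[250], 6), round(lrs[450], 6))  # 0.01 0.001 0.0001
```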
2.7. Evaluation Metrics
The counting performance of the models was evaluated using MAE, root mean squared error (RMSE), relative MAE (rMAE), relative RMSE (rRMSE), and the coefficient of determination (R²). The corresponding formulas are provided in the following equations.

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{c}_i - c_i\right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{c}_i - c_i\right)^2}$$

$$\mathrm{rMAE} = \frac{\mathrm{MAE}}{\bar{c}} \times 100\%$$

$$\mathrm{rRMSE} = \frac{\mathrm{RMSE}}{\bar{c}} \times 100\%$$

$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(c_i - \hat{c}_i\right)^2}{\sum_{i=1}^{N}\left(c_i - \bar{c}\right)^2}$$

where N represents the total number of samples, $c_i$ and $\hat{c}_i$ represent the ground truth count and predicted count for the i-th sample, respectively, and $\bar{c}$ denotes the mean value of all ground truth counts.
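The metrics above can be computed with a few lines of Python; the relative metrics are taken as percentages of the mean ground truth count, and the four-plot example is purely illustrative.

```python
import math

def counting_metrics(gt, pred):
    """MAE, RMSE, rMAE, rRMSE (% of mean ground truth count), and R^2."""
    n = len(gt)
    mean_gt = sum(gt) / n
    mae = sum(abs(p - g) for g, p in zip(gt, pred)) / n
    rmse = math.sqrt(sum((p - g) ** 2 for g, p in zip(gt, pred)) / n)
    ss_res = sum((g - p) ** 2 for g, p in zip(gt, pred))   # residual sum of squares
    ss_tot = sum((g - mean_gt) ** 2 for g in gt)           # total sum of squares
    return {
        "MAE": mae,
        "RMSE": rmse,
        "rMAE": 100 * mae / mean_gt,
        "rRMSE": 100 * rmse / mean_gt,
        "R2": 1 - ss_res / ss_tot,
    }

# Plot-scale counts for four hypothetical test plots
m = counting_metrics(gt=[100, 150, 200, 120], pred=[96, 155, 190, 123])
print({k: round(v, 3) for k, v in m.items()})
```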
The evaluation metric for the model’s classification performance is accuracy, which serves as the most intuitive measure of classification effectiveness. In this study, accuracy was defined task-specifically: for variety classification in V-MTL-PlotCounter, it represented the proportion of correctly identified soybean variety plots; for NSPH classification in NSPH-MTL-PlotCounter, it indicated the proportion of plots with accurately determined NSPH; and for sowing density classification in SD-MTL-PlotCounter, it denoted the proportion of plots with correctly assessed sowing densities.
4. Discussion
4.1. The Outset of PlotCounter
Based on discussions with agricultural breeding experts and a review of recent studies [45], it is clear that the emergence rate at the plot scale is the primary concern in breeding trials, rather than the seedling count in individual subimages. To address this practical requirement, the UAV mosaic image was divided according to plots, and an end-to-end soybean seedling counting model at the plot scale, called PlotCounter, was proposed. This model allows the total number of seedlings in a plot to be obtained simply by inputting the plot image.
PlotCounter adopted a new sample organization strategy to address the issue of insufficient training samples at the plot scale and used entire plot images during validation to select the optimal weights for the plot-scale counting task, thereby significantly improving the counting accuracy of soybean seedlings at the plot scale. As stated by Farjon and Edan [46], correct data augmentation strategies, systematic dataset partitioning, and reasonable image patching methods are crucial for enhancing model performance. For practical application scenarios, adopting a dataset partitioning method along with a sample organization and validation scale tailored to the specific task is an effective strategy for improving accuracy, and may be more efficient than modifying the model itself.
4.2. The Counting Effect of V-MTL-PlotCounter
To demonstrate the counting efficacy of V-MTL-PlotCounter, we compared it with PlotCounter and state-of-the-art detection-based counting models, including YOLOv5s [40], YOLOv7 [26], and YOLOv8s [23], in terms of plot-scale counting accuracy. The quantitative evaluation results are presented in Table 6. V-MTL-PlotCounter outperformed the other models on counting accuracy metrics such as MAE, RMSE, rMAE, and rRMSE, while achieving an R² comparable to PlotCounter. It is worth noting that PlotCounter and MTL-PlotCounter adopted a patch-based training strategy combined with full-plot validation, which allowed direct inference on full-plot images to obtain plot-scale seedling counts. In contrast, YOLOv5s, YOLOv7, and YOLOv8s generated plot-scale predictions by aggregating the counts from their subimage-scale outputs.
For an intuitive comparison of the overall counting accuracy of the models at the plot scale, Figure 10 presents a radar chart with five axes representing MAE, RMSE, rMAE, rRMSE, and R². To better visualize performance differences, each axis was scaled by setting its maximum to the best value among the models, extended outward by one-tenth of the best-to-worst range, and its minimum to the worst value, extended inward by the same proportion. The radar chart makes it apparent that the proposed V-MTL-PlotCounter exhibited superior plot-scale counting performance compared to the other models.
The YOLO series was selected as the state-of-the-art comparison model for two main reasons. It is widely used as the benchmark in recent counting studies, and one branch of the encoder in the proposed models incorporates a customized YOLOv5s backbone. Each model has its own strengths depending on the application scenario. The YOLO models are well recognized for their practicality and versatility in object detection tasks, but they are not end-to-end counting models, as their final counts are derived from detection results. In contrast, the proposed PlotCounter directly estimates the number of seedlings per plot, making it particularly suitable for breeding plot scenarios. With sufficient data across different varieties, V-MTL-PlotCounter can simultaneously perform variety classification and seedling counting, demonstrating strong potential for broader applications.
To evaluate the robustness of V-MTL-PlotCounter against variations in UAV image color characteristics, the model was tested on images with adjusted brightness, contrast, saturation, hue, and color temperature, including slight, moderate, and large levels of change. As shown in
Table 7, the model maintained stable accuracy under slight and moderate adjustments, while performance declined under large shifts. These results indicate that the model is generally robust to common color variations. In future deployments, integrating a color correction module into the counting agent could further improve adaptability to diverse UAV imaging conditions.
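The photometric perturbations used in the robustness test can be illustrated with a minimal per-pixel sketch; the `adjust_pixel` function and the specific factor levels below are hypothetical stand-ins for the image-level adjustments actually applied:

```python
def adjust_pixel(rgb, brightness=1.0, contrast=1.0, saturation=1.0):
    """Apply simple brightness/contrast/saturation shifts to one RGB pixel.

    A toy stand-in for image-level photometric perturbations; a real
    pipeline would operate on whole images with an image library.
    """
    def clamp(c):
        return max(0, min(255, round(c)))

    r, g, b = rgb
    # brightness: scale all channels
    r, g, b = (c * brightness for c in (r, g, b))
    # contrast: scale deviation from mid-gray (128)
    r, g, b = (128 + (c - 128) * contrast for c in (r, g, b))
    # saturation: blend each channel with the pixel's luma
    luma = 0.299 * r + 0.587 * g + 0.114 * b
    r, g, b = (luma + (c - luma) * saturation for c in (r, g, b))
    return (clamp(r), clamp(g), clamp(b))

# hypothetical perturbation levels: slight, moderate, large
levels = {"slight": 1.1, "moderate": 1.3, "large": 1.6}
```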
4.3. The Application Prospect of the Plot-Scale Counting Agent
Breeders can input plot images into the PlotCounter Agent via a web platform to obtain the total number of seedlings in the plot, significantly facilitating breeding efforts. Additionally, by leveraging the multimodal LLM, the PlotCounter Agent can analyze the emergence rate when the actual sowing parameters for the plot are provided. In the future, when the collected soybean data become sufficiently diverse, images can be fed into the V-MTL-PlotCounter Agent to simultaneously predict the total seedling count and the soybean variety. The V-MTL-PlotCounter Agent, which integrates seedling counting, variety identification, and intelligent analysis, greatly enhances the efficiency of obtaining plot-level information.
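The emergence rate analysis described above can be sketched as the ratio of the predicted seedling count to the expected seed number derived from the sowing parameters; the exact formula used by the agent is assumed here for illustration:

```python
def emergence_rate(counted_seedlings, plot_area_m2=None,
                   sowing_density_per_m2=None,
                   n_holes=None, seeds_per_hole=None):
    """Estimate the plot emergence rate from a predicted seedling count.

    The expected seed number is derived either from hole count x seeds
    per hole or from sowing density x plot area; the exact formula used
    by the agent is assumed here.
    """
    if n_holes is not None and seeds_per_hole is not None:
        expected = n_holes * seeds_per_hole
    elif sowing_density_per_m2 is not None and plot_area_m2 is not None:
        expected = sowing_density_per_m2 * plot_area_m2
    else:
        raise ValueError("need sowing density and area, or hole/seed counts")
    return counted_seedlings / expected

# hypothetical plot: 100 holes, 2 seeds per hole, 168 seedlings counted
rate = emergence_rate(168, n_holes=100, seeds_per_hole=2)
```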
4.4. Current Limitations and Insights for Future Studies
The primary limitation of this study lies in the limited number of plots and soybean varieties: only 54 plots in total, collected from a single site within a single year, which inevitably constrained both the model’s performance and its generalizability to broader field conditions. To train the deep learning model with only 54 plot images, an innovative sample organization strategy was employed, made possible by the regression-based counting model, which accommodates scale differences between training and validation samples. While the current limitations in plot number and variety diversity may restrict generalizability, the developed model offers a viable solution for deep learning applications in small-sample scenarios, achieving an accuracy level suitable for practical plot-scale counting. As data from multiple soybean varieties and diverse field environments become available in future studies, the model’s plot-scale counting accuracy is expected to improve further.
The motivation behind jointly optimizing classification and counting as dual tasks during training was the significant impact that different varieties and sowing methods have on emergence. Multitask learning was used to associate these effects and optimize both tasks jointly. Among the three MTL-PlotCounter models, only V-MTL-PlotCounter demonstrated an improvement in counting accuracy. During UAV image acquisition, owing to incomplete seed emergence, only the soybean variety remained consistent with the initial sowing settings in each plot, while the NSPH and planting density values deviated. The experiments showed that predicting accurate and task-relevant information as an auxiliary task can effectively enhance the performance of the main task.
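The joint optimization of counting and classification can be sketched as a weighted sum of a regression loss and a cross-entropy loss; the L1 counting loss, the weights `alpha` and `beta`, and the overall formulation are illustrative assumptions, not the exact loss used by MTL-PlotCounter:

```python
import math

def multitask_loss(count_pred, count_true, class_probs, class_true,
                   alpha=1.0, beta=0.5):
    """Weighted sum of a counting loss and a classification loss.

    L1 counting loss plus cross-entropy over variety classes; alpha and
    beta are illustrative task weights, not those used in the study.
    """
    counting = abs(count_pred - count_true)              # L1 counting loss
    classification = -math.log(class_probs[class_true])  # cross-entropy
    return alpha * counting + beta * classification

# hypothetical plot: predicted 48 seedlings (true 50), three varieties,
# ground-truth variety index 1 predicted with probability 0.7
loss = multitask_loss(48.0, 50.0, [0.1, 0.7, 0.2], class_true=1)
```

Sharing an encoder across the two heads lets gradients from the variety-classification branch shape features that also benefit counting, which is consistent with the auxiliary-task effect discussed above.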
Although RGB imagery is widely used in breeding trials due to its accessibility and low cost [17,18], future extensions of this work could benefit from incorporating multispectral and Light Detection and Ranging (LiDAR) data. Multispectral imagery enables the calculation of diverse vegetation indices and improves crop segmentation by including near-infrared spectral information [13,47]. LiDAR, in contrast, provides detailed three-dimensional canopy structure and height measurements [48,49]. With accurate geospatial alignment, these complementary data sources could be integrated into the proposed multitask learning framework to enable simultaneous seedling counting, variety identification, and phenological stage estimation [50].
The proposed V-MTL-PlotCounter, deployed as an interactive agent on a web platform, enables automated soybean variety identification, seedling counting, and emergence rate analysis powered by a multimodal LLM. Future work will focus on improving the model’s generalization by incorporating data from diverse crop varieties and enhancing its robustness across varying field environments. Once adapted for seedling counting in multiple crop species, the model will allow agricultural researchers to directly input UAV imagery of breeding plots, facilitating rapid and labor-efficient crop identification and seedling counting.
5. Conclusions
In this study, we constructed a specialized plot-scale dataset, tailored for breeding plot scenarios, which comprises UAV-captured soybean seedling imagery and key sowing-related metadata, including variety, number of seeds per hole, and sowing density, all of which are closely related to plot-scale emergence rate. Using this dataset, we developed PlotCounter, a plot-scale counting model built on the TasselNetV2++ architecture. Leveraging a patch-based training and full-plot validation strategy, PlotCounter achieved accurate seedling counting even under limited data conditions. To further enhance prediction accuracy, we proposed the MTL-PlotCounter framework, which incorporates auxiliary classification tasks through multitask learning. Among its variants, V-MTL-PlotCounter, which integrates soybean variety information, demonstrated the best performance, achieving relative reductions of 8.74% in RMSE and 5.17% in rMAE over PlotCounter. It also outperformed models such as YOLOv5s, YOLOv7, and YOLOv8s in plot-scale counting tasks. Both PlotCounter and V-MTL-PlotCounter were deployed as interactive agents on a web platform, enabling breeders to upload plot images, automatically count seedlings, analyze emergence rates, and perform prompt-driven interaction through a multimodal LLM. This study demonstrates the potential of integrating UAV remote sensing, agronomic data, specialized models, and multimodal large language models to enable intelligent crop phenotyping. In future work, we will focus on enhancing model generalization by incorporating data from multiple varieties and improving robustness across diverse field environments.