Article

Real-Time Joint-Stem Prediction for Agricultural Robots in Grasslands Using Multi-Task Learning

Department of Electrical and Photonics Engineering, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
* Author to whom correspondence should be addressed.
Agronomy 2023, 13(9), 2365; https://doi.org/10.3390/agronomy13092365
Submission received: 12 July 2023 / Revised: 25 August 2023 / Accepted: 29 August 2023 / Published: 12 September 2023
(This article belongs to the Special Issue The Applications of Deep Learning in Smart Agriculture)

Abstract
Autonomous weeding robots need to accurately detect the joint stem of grassland weeds in order to control those weeds in an effective and energy-efficient manner. In this work, keypoints on joint stems and bounding boxes around weeds in grasslands are detected jointly using multi-task learning. We compare a two-stage, heatmap-based architecture to a single-stage, regression-based architecture—both based on the popular YOLOv5 object detector. Our results show that introducing joint-stem detection as a second task boosts the individual weed detection performance in both architectures. Furthermore, the single-stage architecture clearly outperforms its competitors with an OKS of 56.3 in joint-stem detection while also achieving real-time performance of 12.2 FPS on Nvidia Jetson NX, suitable for agricultural robots. Finally, we make the newly created joint-stem ground-truth annotations publicly available for the relevant research community.

1. Introduction

This paper addresses the perception needs for robotic weeding in grasslands, thus contributing to precision farming. While weed detection is a problem often addressed in the relevant literature, the state of the art rarely considers the precise detection of the detected weeds’ joint stems in grasslands. Nevertheless, precise localization of the joint stem is crucial if robots are to autonomously control weeds on grasslands, e.g., by means of laser weeding, electrocution, or other targeted control methods.
Toward this direction, we have investigated potential multi-task learning (MTL) approaches that optimize jointly two relevant tasks—keypoint detection for joint stems as well as bounding box detection for grassland weeds (plants)—leading to improved performance. We apply two different architectures, namely YOLO-HRNet and YOLO-Pose, both of which are based on the popular YOLOv5 object detector. We use YOLO-HRNet, a heatmap-based architecture operating in two stages, as our baseline approach, whereas we show that YOLO-Pose, which operates in a single stage and is regression-based, can achieve impressive results in terms of detection precision and inference speed for both considered tasks. Even though the considered network architectures are used unmodified, it is not trivial to deploy a full-scale working deep learning approach in environments as challenging as grasslands. Thus, the novelty of this work lies in addressing, for the first time, joint-stem prediction of weeds in grasslands.
Our approach is a significant step toward robotic precision farming. Robotic manipulation requires precise localization of the manipulation target, and agricultural robotics is no exception. Targeted, organic-farming-friendly weed-control approaches, such as laser weeding and electrocution, require the precise location of the joint stem to target the laser beam or position the electrode for maximizing the damaging effect of the energy used. Therefore, fine-grained analysis of the detected weeds is needed to reveal the ideal intervention target position, as shown in Figure 1 for the “GALIRUMI” EU project robot.
The contribution of this paper is threefold: (i) First, we showcase that MTL is a viable approach to boost performance in the weed detection task in grasslands by integrating joint stem detection as a second task, and furthermore, (ii) we experimentally show that YOLO-Pose not only achieves superior results in both tasks but also achieves real-time detection speeds suitable for agricultural robots. Last but most importantly, (iii) to support all the above and contribute to the relevant research community, we make the newly created joint stem ground truth annotation publicly available as an extension of our RumexWeeds dataset at https://dtu-pas.github.io/RumexWeeds/ (accessed on 24 August 2023).

2. Related Work

Weed detection has been widely investigated in the past decade, and it has been posed as a classification problem [1,2,3,4,5,6], a detection problem via bounding boxes [7,8,9,10,11], as well as a segmentation problem [12,13,14,15,16]. The object detection architecture YOLO has evolved greatly over the last decade, resulting in different popular variants, such as YOLOv5 [17], YOLOX [18], and YOLOv8 [19], which have been widely applied in the agricultural domain [11,20,21,22,23].
However, agricultural robots that actively interact with detected plants need more fine-grained plant information in order to perform tasks such as weeding, picking, or cutting. This is the reason for an increased interest in joint-stem detection in recent years. Langer et al. [24] take the detected plants as input and apply geometric computer vision methods in order to retrieve the joint stems. The parameters of their method are tweaked to address very specific plants at specific growth stages; therefore, the approach does not adapt to different cases. Deep neural networks have also been applied to detect weeds and crops as well as their joint-stem keypoints in a multi-task setting. Lottes et al. [25] apply an encoder–decoder network to segment plants and predict joint-stem positions. On top of a shared encoder, two decoders follow—one for each task. The same authors improve their model’s robustness by incorporating spatial–temporal information (i.e., consecutive image sequences) into the network [26]. Unfortunately, Lottes et al. [25,26] do not segment the individual plants as instances, which risks multiple treatments of the same plant. In contrast, Weyler et al. [27] use CenterNet [28] to simultaneously detect crops/weeds with bounding boxes, plant-center keypoints, and leaf keypoints on an instance basis. Lac et al. [29] propose a two-stage method, where they use YOLOv4 for fast stem detection, followed by a temporal aggregation algorithm to refine the stem predictions. Note that they predict joint-stem positions with bounding boxes instead of keypoints. Zhang et al. [29] propose, similarly to [25], an architecture with one shared backbone but two separate decoder paths, generating segmentation masks for weeds as well as heatmap-based keypoint predictions for joint stems. They additionally feed the feature maps from the weed segmentation path into the joint-stem prediction path and report that this increases accuracy. Note that their architecture only handles input images containing no more than a single plant each.
Our work performs simultaneous weed and joint-stem detection in grasslands, which is an underexplored area in the domain of precision farming. To the best of our knowledge, the publications mentioned above have only been applied to crop fields, where plants can be clearly separated from their background (soil). However, grassland environments are more challenging, since weeds in grassland are much more difficult to differentiate and joint stems are generally covered by adjacent weeds or grasses.

3. Method

In this work, we consider two different architectures, both based on YOLOv5 (https://github.com/ultralytics/yolov5, accessed on 24 August 2023).
  • YOLO-HRNet serves as our baseline architecture; it is inspired by Mask R-CNN [30] while remaining comparable to YOLOv5-based architectures. As in Mask R-CNN, the object detection model is followed by a simple segmentation model. We choose HRNet16 [31] as the second-stage model because its architecture is particularly well suited for keypoint prediction. Because of its two stages and relatively large model size, accurate predictions are expected at the cost of slower inference speeds.
  • YOLO-Pose [32] proposes an extension of anchor-based architectures to predict keypoints. Compared to YOLO-HRNet, the model size of YOLO-Pose is reduced and predictions are made within a single stage. A significant improvement in inference speed but lower prediction accuracy are expected.
YOLO-HRNet as well as YOLO-Pose can be categorized as multi-task learning because hard parameter sharing is performed between the two tasks: weed detection (i.e., bounding box detection) and joint-stem detection (i.e., keypoint detection). While both architectures perform weed detection with the same approach, their strategies for predicting joint stems differ: YOLO-HRNet is heatmap-based and two-stage, whereas YOLO-Pose is regression-based and one-stage.
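As a rough, self-contained illustration of hard parameter sharing (not the actual YOLO-HRNet or YOLO-Pose code), both tasks can be thought of as separate heads attached to one shared feature extractor; the module and argument names below are placeholders.

```python
import torch.nn as nn

class SharedBackboneMultiTask(nn.Module):
    """Toy illustration of hard parameter sharing: one backbone, two heads."""
    def __init__(self, backbone, det_head, kp_head):
        super().__init__()
        self.backbone = backbone   # shared feature extractor (e.g., a CSP-DarkNet)
        self.det_head = det_head   # weed detection: bounding boxes + objectness
        self.kp_head = kp_head     # joint-stem detection: keypoints or heatmaps

    def forward(self, images):
        features = self.backbone(images)        # computed once, used by both tasks
        return self.det_head(features), self.kp_head(features)
```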

3.1. YOLO-HRNet

Figure 2 gives an overview of the YOLO-HRNet architecture, where HRNet16 [31] is integrated as a second stage into YOLOv5. The corresponding feature maps of the predicted bounding boxes serve as input for HRNet, meaning that each input contains only a single object. RoI Align, proposed by the authors of Mask R-CNN [30], is used to map the feature maps to the expected input size. HRNet outputs a heatmap, and the position of its maximum value is chosen as the final keypoint position. The input forwarded to HRNet differs between the training and inference passes of the network. In order to increase the amount of training data for HRNet, feature maps from all scales and all positive detections are forwarded to HRNet during training. During inference, in contrast, only the final predictions after non-maximum suppression, taken from the corresponding scale, are forwarded to HRNet.
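For illustration, the heatmap decoding step can be sketched as follows: the keypoint is taken as the argmax of the predicted heatmap and mapped back into image coordinates. This is a minimal sketch under assumed tensor shapes and names, not our exact implementation.

```python
import torch

def decode_keypoint(heatmap: torch.Tensor, box_xyxy: torch.Tensor):
    """Pick the heatmap maximum and map it back into image coordinates.

    heatmap:  (H, W) single-channel joint-stem heatmap for one detected weed.
    box_xyxy: (4,) bounding box of that weed in image coordinates.
    """
    h, w = heatmap.shape
    flat_idx = torch.argmax(heatmap)            # index of the peak activation
    py, px = flat_idx // w, flat_idx % w        # peak position inside the crop
    x1, y1, x2, y2 = box_xyxy
    # Rescale the crop-relative peak to absolute image coordinates.
    kx = x1 + (px.float() + 0.5) / w * (x2 - x1)
    ky = y1 + (py.float() + 0.5) / h * (y2 - y1)
    score = heatmap[py, px]                     # peak value acts as keypoint confidence
    return kx, ky, score
```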
The original loss function of YOLO is a weighted sum of the classification loss L_{cls}, the localization loss L_{box}, and the objectness loss L_{obj}. The objectness loss L_{obj} computes the binary cross entropy (BCE) of whether a box actually contains an object or not; it can also be understood as a confidence score. The objectness label is set to 1 for boxes that include an object and 0 for boxes without an object. The classification loss L_{cls} computes the BCE for all positive boxes with respect to the assigned class. Finally, the localization loss L_{box} uses the Complete-IoU (CIoU) loss as the regression criterion for the regressed box coordinates.
The loss function of YOLOv5 is extended with a mean squared error (MSE) loss term, L_{MSE}, handling the keypoint prediction during training. The loss for each scale s and anchor k at location (i, j) is summed up as follows, where w_{box}, w_{obj}, w_{cls}, and w_{kp} are the corresponding weighting factors of the different loss terms.
L_{YOLO-HRNet} = \sum_{s,i,j,k} \left( w_{box} L_{box} + w_{obj} L_{obj} + w_{cls} L_{cls} + w_{kp} L_{MSE} \right)    (1)
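A hedged sketch of how the weighted sum in Equation (1) can be assembled for one batch of matched anchors is shown below; the dictionary keys and the use of torchvision’s CIoU loss are assumptions, not our exact training code.

```python
import torch.nn as nn
from torchvision.ops import complete_box_iou_loss

bce = nn.BCEWithLogitsLoss()
mse = nn.MSELoss()

def yolo_hrnet_loss(preds, targets, w_box=0.05, w_obj=0.85, w_cls=0.5, w_kp=0.10):
    """Sketch of the YOLO-HRNet multi-task loss (Equation (1)) for matched anchors."""
    # Localization: CIoU on the regressed boxes (xyxy format).
    l_box = complete_box_iou_loss(preds["boxes"], targets["boxes"], reduction="mean")
    # Objectness: BCE on whether an anchor contains a weed.
    l_obj = bce(preds["obj_logits"], targets["objectness"])
    # Classification: BCE over the assigned class for positive anchors.
    l_cls = bce(preds["cls_logits"], targets["classes"])
    # Joint-stem keypoints: MSE between predicted and target HRNet heatmaps.
    l_kp = mse(preds["heatmaps"], targets["heatmaps"])
    return w_box * l_box + w_obj * l_obj + w_cls * l_cls + w_kp * l_kp
```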

3.2. YOLO-Pose [32]

Recently, in 2022, YOLO-Pose [32] introduced an extension of anchor-based object detection approaches to predict human poses end-to-end. The heatmap-free approach is simple yet effective: each detection head is extended with a pose head, which regresses keypoints with reference to the anchor center. Figure 3 gives an overview of the YOLO-Pose architecture.
YOLO-Pose introduces a loss function L_{OKS} that optimizes directly towards the popular object keypoint similarity (OKS) metric. This loss can only be applied to regression-based methods, not to heatmap-based (i.e., probability-map-based) approaches such as YOLO-HRNet. For each visible (v_n > 0) keypoint n, the exponential of the negative squared Euclidean distance between prediction and ground truth, d_n^2, divided by twice the product of the squared object scale s^2 and the squared keypoint-specific weight σ_{t,n}^2, is computed.
L_{OKS} = 1 - \frac{\sum_{n} \exp\left(-\frac{d_n^2}{2 s^2 \sigma_{t,n}^2}\right) \cdot \delta(v_n > 0)}{\sum_{n} \delta(v_n > 0)}    (2)
The original YOLOv5 loss function, as described previously, is extended with L_{OKS} and is summed up for each scale s and anchor k at location (i, j) as follows.
L_{YOLO-Pose} = \sum_{s,i,j,k} \left( w_{box} L_{box} + w_{obj} L_{obj} + w_{cls} L_{cls} + w_{kp} L_{OKS} \right)    (3)
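The OKS-based keypoint loss of Equation (2) can be sketched as follows for a single detection with N annotated keypoints (one joint stem in our case); tensor names and shapes are assumptions, not the authors’ code.

```python
import torch

def oks_loss(pred_kpts, gt_kpts, visibility, scale, sigma_t=0.2):
    """Sketch of the OKS loss in Equation (2) for one detection.

    pred_kpts, gt_kpts: (N, 2) predicted / ground-truth keypoint coordinates.
    visibility:         (N,)  visibility flags v_n (> 0 means annotated).
    scale:              scalar object scale s (e.g., derived from the box area).
    sigma_t:            keypoint-specific weight (one shared value here).
    """
    d2 = ((pred_kpts - gt_kpts) ** 2).sum(dim=-1)            # squared distance d_n^2
    oks = torch.exp(-d2 / (2 * scale ** 2 * sigma_t ** 2))   # per-keypoint similarity
    vis = (visibility > 0).float()                           # delta(v_n > 0)
    return 1.0 - (oks * vis).sum() / vis.sum().clamp(min=1)
```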

4. Experimental Setup

4.1. Dataset

The publicly available grassland weed dataset RumexWeeds [11] is considered in this work. It is a real-world dataset targeting the most problematic grassland weed, Rumex. Data have been collected with a Husky robot platform, shown in Figure 1, and its mounted sensors. The dataset includes 98 different image sequences as well as accompanying navigational data from IMU, GPS, and wheel encoders. For all 5510 images, ground-truth bounding box annotations are available, distinguishing between two species: Rumex obtusifolius and Rumex crispus. Please refer to [11] for a more detailed description of the RumexWeeds dataset and some baseline results.
We supplement that dataset with additional manually created keypoint annotations: for each bounding box in the dataset, a joint-stem annotation has been performed. Note that even for the human annotator, it can be challenging to identify the precise joint-stem position. Therefore, the annotator drew a circular region representing the potential joint-stem position. The joint-stem annotation consists of a position (k_{i,x}, k_{i,y}) as well as a radius r_i, indicating the uncertainty of the human annotator. The higher r_i, the more uncertain the annotator was about the performed annotation. In Figure 4, a sample image with bounding box as well as joint-stem annotations is shown. Our newly created joint-stem annotations will be merged and made publicly available at https://dtu-pas.github.io/RumexWeeds/ (accessed on 24 August 2023).
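For illustration only, a single annotated instance could be represented as in the snippet below; the field names are hypothetical and may differ from the schema of the released annotation files.

```python
# Hypothetical annotation record for one weed instance (field names assumed).
annotation = {
    "bbox": [412.0, 230.0, 640.0, 455.0],   # [x1, y1, x2, y2] in pixels
    "species": "rumex_obtusifolius",         # treated as a single class in our experiments
    "joint_stem": {
        "x": 527.0,                          # k_ix: horizontal keypoint position
        "y": 398.0,                          # k_iy: vertical keypoint position
        "radius": 14.0,                      # r_i: annotator uncertainty in pixels
    },
}
```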
For our experiments in Section 5, the data are split into train/val/test with a ratio of 0.70/0.15/0.15. Hereby, we make sure that sequences are not separated between the splits, so that the same plant does not appear in different splits, which would falsify the results. Furthermore, like the authors of [11], we treat both Rumex species as one class because they are equally undesired and should both be removed by the agricultural robot.
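A sequence-aware split can be sketched as follows; the mapping from sequence ids to image files and the random seed are assumptions, and the released split may have been created differently.

```python
import random

def split_by_sequence(sequences, ratios=(0.70, 0.15, 0.15), seed=0):
    """Assign whole image sequences to train/val/test so that no plant
    appears in more than one split (a sketch, not the released split)."""
    seq_ids = sorted(sequences)                 # e.g., {"seq_001": [image paths], ...}
    random.Random(seed).shuffle(seq_ids)
    n = len(seq_ids)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    splits = {
        "train": seq_ids[:n_train],
        "val": seq_ids[n_train:n_train + n_val],
        "test": seq_ids[n_train + n_val:],
    }
    # Expand the sequence assignment back to per-image lists.
    return {k: [img for s in v for img in sequences[s]] for k, v in splits.items()}
```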

4.2. Metrics

4.2.1. Mean Average Precision (mAP)

The object detection task is evaluated with the well-established mean average precision (mAP) metric according to the COCO evaluation toolkit (https://cocodataset.org/#detection-eval, accessed on 24 August 2023), and two variants are reported in our experiments: mAP_{50:95} and mAP_{50}. The mAP_{50} considers all predictions with an Intersection over Union (IoU) above 0.5 equally as positive examples. In contrast, mAP_{50:95} averages the mAP over 10 IoU thresholds from 0.5 to 0.95 in intervals of 0.05, which introduces higher weights for higher IoUs. Therefore, mAP_{50:95} also reflects the precision of the predicted bounding boxes.
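Both mAP variants can be computed directly with the COCO evaluation toolkit once ground truth and detections are exported to COCO-format JSON; the file paths below are placeholders.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: ground-truth annotations and detections in COCO JSON format.
coco_gt = COCO("rumexweeds_test_gt.json")
coco_dt = coco_gt.loadRes("predictions_bbox.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()

map_50_95 = evaluator.stats[0]   # mAP averaged over IoU 0.50:0.95
map_50 = evaluator.stats[1]      # mAP at IoU 0.50
```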

4.2.2. Object Keypoint Similarity (OKS)

The joint-stem detection task is evaluated with object keypoint similarity (OKS), which is the most popular metric for the evaluation of keypoints. As shown in Equation (4), for OKS, the Euclidean distance d_i between predicted and ground-truth points is taken and normalized by the scale s of the object to eliminate the effect of varying object sizes. It is additionally divided by the standard deviation σ_i between human annotations and true labels, i.e., keypoints that are difficult to identify are weighted with a higher σ. In our experiments, OKS_{50} is used to report the joint-stem detection performance.
OKS = \exp\left(-\frac{d_i^2}{2 s^2 \sigma_i^2}\right)    (4)

4.3. Fixed Training Settings

We fix a number of hyperparameters for all our experiments. The input size is 640 × 640, and during training, the following augmentations are applied to the input images: color jitter, random flip, random scale, and random shift. The networks are updated using the Adam optimizer [33]. YOLO-Pose is trained with an initial learning rate of 5 × 10^{-4}, decaying down to 2 × 10^{-1} using a cosine learning rate schedule. For YOLO-HRNet, two different initial learning rates are set: 5 × 10^{-4} for YOLOv5 and 1 × 10^{-3} for HRNet. Again, we use the cosine learning rate schedule to decay both down to 1 × 10^{-4}. For all models, we have a warm-up phase of 3 epochs. Unless mentioned otherwise, we train for 100 epochs and set the σ in the OKS metric as well as in the loss to 0.2. All models are trained and evaluated on an Nvidia Tesla V100, unless otherwise stated.
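The optimization setup can be sketched roughly as below (Adam, a 3-epoch warm-up, then cosine decay); the exact warm-up shape and the final learning rate passed in are assumptions and should be adapted to the values listed above.

```python
import math
import torch

def build_optimizer_and_scheduler(model, epochs=100, warmup_epochs=3,
                                  lr_init=5e-4, lr_final=1e-4):
    """Adam with a linear warm-up followed by cosine decay (a sketch)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr_init)

    def lr_lambda(epoch):
        if epoch < warmup_epochs:
            # Linear warm-up towards the initial learning rate.
            return (epoch + 1) / warmup_epochs
        # Cosine decay from lr_init down to lr_final over the remaining epochs.
        progress = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
        floor = lr_final / lr_init
        return floor + (1 - floor) * 0.5 * (1 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler   # call scheduler.step() once per epoch
```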

5. Experimental Evaluation

5.1. Ablation Study: Model Complexity

For the computer vision domain, RumexWeeds is a relatively small dataset with 5510 images, and due to its nature of providing image sequences, the variance between consecutive images is reduced. In this ablation study, we determine, for both methods (YOLO-HRNet and YOLO-Pose), the model complexity that leads to the best results without overfitting the data. Each experiment is performed once, and the results for both architectures can be found in Table 1.
For YOLO-HRNet, we vary the size of the shared backbone as well as the input size of the feature map crops to the HRNet in the second stage. As the results in Table 1 show, a bigger input size to HRNet does not lead to significant improvements. Therefore, we fix it to 16, because the smaller the input, the higher the training and inference speed. Increasing the backbone size to DarkNet-53 X leads to slight improvements.
For YOLO-Pose, we vary the size of the shared backbone as well as the complexity of the pose head. Results are presented in Table 1, and we can conclude that using a more complex pose head leads to improvement in both tasks, as both the mAP and OKS increase. However, when using the biggest backbone model, DarkNet-53 X, overfitting occurs, since both the mAP and OKS decrease compared to DarkNet-53 M.
According to our ablation study, we fix the backbone to DarkNet-53 M because it generates very good results for both architectures (YOLO-Pose and YOLO-HRNet) while being relatively small. Furthermore, we use an input size of 16 for the HRNet and a complex head for YOLO-Pose unless mentioned otherwise.

5.2. Ablation Study: Task/Loss Weighting

For multi-task learning, loss weighting is crucial in order to achieve good results. We treat the loss weights as hyperparameters and tune w_{box}, w_{obj}, and w_{kp} using random search. Note that w_{cls} has no relevance in our setup, since we only have one class. In this section, we show the most relevant results.
For YOLO-HRNet (see Table 2), lowering w_{kp} relative to the other weights leads to a performance boost for both tasks, as shown in the second row. On the contrary, further lowering w_{kp} in the third row increases the mAP only marginally while reducing the OKS drastically. For the following experiments, we fix the weights to w_{box} = 0.05, w_{obj} = 0.85, and w_{kp} = 0.10.
For YOLO-Pose (see Table 2), lowering w_{kp} relative to the other weights improves weed detection while sacrificing a bit of performance in joint-stem detection. However, weed detection is generally favored over joint-stem detection. We fix the weights to w_{box} = 0.05, w_{obj} = 0.95, and w_{kp} = 0.01. In accordance with our previous results, using YOLO-Pose with a complex head boosts the results even further, as shown in the last row of Table 2.

5.3. Ablation Study: Influence of σ_t

In accordance with the OKS metric, the loss function L_{OKS} weighs each keypoint with σ_t, which represents the standard deviation between human annotations and true labels. However, the true σ is unknown; therefore, we treat it as a hyperparameter and explore its influence on the final prediction. Table 3 shows that the selection of σ_t has a significant influence on the prediction performance. When choosing a σ_t that is too small, i.e., σ_t = 0.1, the introduced annotation noise has a negative impact on the detection performance. With increasing σ_t, label noise is reduced; hence, overfitting can be avoided. However, we observe that for σ_t > 0.4, there is a drastic negative impact on the detection performance. We set σ_t = 0.2 in our final models.

5.4. Final Results

In this section, the final YOLO-HRNet and YOLO-Pose models, as well as a basic YOLOv5 model, are evaluated on the test split. All models are trained once for 200 epochs before evaluation.
Table 4 shows the weed and joint-stem detection performance for all models. YOLO-Pose is the winner among its competitors while retaining a reasonable training time. The results contradict our expectation: the one-stage architecture (i.e., YOLO-Pose) can keep up with the performance of the two-stage architecture (i.e., YOLO-HRNet) in weed detection and significantly outperforms it in joint-stem detection. Furthermore, we can conclude that introducing the joint-stem detection task enhances the performance in the weed detection task. This effect can be seen for both YOLO-HRNet and YOLO-Pose. In Figure 5, we show qualitative results of the final models.
In Table 5, we present the inference performance. First, we test all final models on a single Nvidia V100 (NVIDIA, Santa Clara, CA, USA). As expected, YOLO-HRNet provides a fairly low inference speed due to its two-stage nature. YOLO-Pose is only slightly slower than YOLOv5, which is expected since the YOLO-Pose architecture has additional pose heads. However, its speed is still on the higher end with 101.0/88.5 FPS (simple/complex head), making it a serious candidate for real-time deployment on agricultural robots. Therefore, we deploy YOLO-Pose on the single-board computer Nvidia Jetson Xavier NX (NVIDIA, Santa Clara, CA, USA) and obtain 13.8 and 12.2 FPS for YOLO-Pose with the simple and complex heads, respectively, as shown in the lower part of Table 5.
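FPS figures of this kind can be reproduced with a simple timing loop such as the following sketch; the batch size, warm-up iterations, and the omission of pre-/post-processing are assumptions, so absolute numbers will differ from Table 5.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, device="cuda", img_size=640, warmup=20, iters=200):
    """Rough inference-only timing for a single 640x640 image per forward pass."""
    model = model.to(device).eval()
    dummy = torch.randn(1, 3, img_size, img_size, device=device)
    for _ in range(warmup):                 # let the GPU reach a steady state
        model(dummy)
    if device.startswith("cuda"):
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(dummy)
    if device.startswith("cuda"):
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return iters / elapsed                  # frames per second
```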

6. Discussion

In our work, we used almost out-of-the-box implementations of popular network architectures, which has the great advantage that relevant pre-trained weights are generally available, enabling efficient training and better results when only limited data are available. On the contrary, using a customized architecture requires generating pre-trained weights on benchmark datasets, which is costly. In applied domains, it is common to use such popular architectures with powerful pre-trained weights; e.g., Weyler et al. [27] used CenterNet [28], and the method of Lac et al. [29] is based on YOLOv4. More precisely, we combined YOLO with HRNet as a two-stage network and used YOLO-Pose as a single-stage network—both of them based on the popular YOLOv5 object detector. It is reasonable to expect that more advanced architectures, such as the recent YOLOv8, or proper modifications of them, will result in further performance gains. However, what we want to showcase is the applicability of such methods to the targeted task, rather than their absolute performance. The scarcity of previous works directly related to our problem reveals that it is not trivial to deploy a full-scale working deep learning approach in environments as challenging as grasslands.
We show that MTL of detecting grassland weeds and their joint stems leads to improved detection performance; therefore, both tasks complement each other well. These findings are in accordance with the work of Lottes et al. [25,26], who report performance improvements for MTL of plant segmentation and joint-stem detection in crop fields, i.e., on the BoniRob dataset [34]. Our work required the appropriate extension of our existing RumexWeeds dataset. We have extended it with additional joint-stem annotations, where each joint stem has been annotated with a position and a radius indicating the uncertainty of the human annotator. As we find the significance of such datasets paramount for advancing the field, we make our new annotations publicly available as part of our RumexWeeds dataset.
It is worth mentioning that before the considered robot can intervene to control the weeds, the 3D positions of the detected keypoints are needed. We take the detected joint-stem keypoints and align them with the corresponding depth map in order to retrieve the 3D information. Again, this allows us to use out-of-the-box architectures, since the corresponding pre-trained weights are predominantly obtained from large RGB datasets, such as ImageNet.
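Assuming a metric depth map registered to the RGB image and known pinhole intrinsics (placeholders below), the 3D joint-stem position can be recovered with a standard back-projection; this sketch is not the exact pipeline used on the robot.

```python
import numpy as np

def keypoint_to_3d(kx, ky, depth_map, fx, fy, cx, cy):
    """Back-project a 2D joint-stem keypoint into the camera frame.

    depth_map: (H, W) metric depth in meters, registered to the RGB image.
    fx, fy, cx, cy: pinhole intrinsics of the RGB camera (placeholders).
    """
    u, v = int(round(kx)), int(round(ky))
    z = float(depth_map[v, u])        # depth at the keypoint pixel
    x = (kx - cx) * z / fx            # standard pinhole back-projection
    y = (ky - cy) * z / fy
    return np.array([x, y, z])        # 3D position in the camera frame
```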

7. Conclusions

This work has investigated two different multi-task learning approaches to jointly solve the joint-stem and weed detection problems in image data from grasslands. More precisely, we compared a two-stage, heatmap-based architecture (YOLO-HRNet) to a single-stage, regression-based one (YOLO-Pose). Our results show that introducing joint-stem detection as a second task boosts the individual weed detection performance in both architectures. Moreover, YOLO-Pose outperforms its competitors in joint-stem detection with an OKS of 56.3 while also achieving real-time performance of 12.2 FPS on an Nvidia Jetson NX device. These characteristics make our final YOLO-Pose model an ideal candidate for deployment on autonomous weeding robots that need precise localization of weeds’ joint stems to control them.

Author Contributions

J.L.: Conceptualization, Software, Experiment Execution, Root Annotation, Draft Writing, Visualization; R.G.: Conceptualization, Data Acquisition, Software for Root Annotation, Draft Writing, Visualization; L.N.: Supervision, Review and Editing, Funding. All authors have read and agreed to the published version of the manuscript.

Funding

The work has been supported by the European Commission and European GNSS Agency through the project “Galileo-assisted robot to tackle the weed Rumex obtusifolius and increase the profitability and sustainability of dairy farming (GALIRUMI)”, H2020-SPACE-EGNSS-2019-870258.

Data Availability Statement

We make the newly created joint stem ground truth annotation publicly available as an extension of our RumexWeeds dataset at https://dtu-pas.github.io/RumexWeeds/ (accessed on 24 August 2023).

Acknowledgments

The authors would like to thank the organic farmers Erik Hansen (Lundholm), Allan Clausen (Hegnstrup), and Otto Stengaard (Stengaard) for offering their fields for data collection and for insightful discussions regarding Rumex treatment.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Farooq, A.; Jia, X.; Hu, J.; Zhou, J. Multi-Resolution Weed Classification via Convolutional Neural Network and Superpixel Based Local Binary Pattern Using Remote Sensing Images. Remote. Sens. 2019, 11, 1692. [Google Scholar] [CrossRef]
  2. Smith, L.N.; Byrne, A.; Hansen, M.F.; Zhang, W.; Smith, M.L. Weed classification in grasslands using convolutional neural networks. In Proceedings of the Applications of Machine Learning; Zelinski, M.E., Taha, T.M., Howe, J., Awwal, A.A.S., Iftekharuddin, K.M., Eds.; International Society for Optics and Photonics, SPIE: Bellingham, WA, USA, 2019; Volume 11139. [Google Scholar] [CrossRef]
  3. Dadashzadeh, M.; Abbaspour-Gilandeh, Y.; Mesri-Gundoshmian, T.; Sabzi, S.; Hernández-Hernández, J.L.; Hernández-Hernández, M.; Arribas, J.I. Weed Classification for Site-Specific Weed Management Using an Automated Stereo Computer-Vision Machine-Learning System in Rice Fields. Plants 2020, 9, 559. [Google Scholar] [CrossRef]
  4. Wu, X.; Aravecchia, S.; Lottes, P.; Stachniss, C.; Pradalier, C. Robotic weed control using automated weed and crop classification. J. Field Robot. 2020, 37, 322–340. [Google Scholar] [CrossRef]
  5. Reedha, R.; Dericquebourg, E.; Canals, R.; Hafiane, A. Transformer Neural Network for Weed and Crop Classification of High Resolution UAV Images. Remote Sens. 2022, 14, 592. [Google Scholar] [CrossRef]
  6. Garibaldi-Márquez, F.; Flores, G.; Mercado-Ravell, D.A.; Ramírez-Pedraza, A.; Valentín-Coronado, L.M. Weed Classification from Natural Corn Field-Multi-Plant Images Based on Shallow and Deep Learning. Sensors 2022, 22, 3021. [Google Scholar] [CrossRef] [PubMed]
  7. Yu, J.; Sharpe, S.M.; Schumann, A.W.; Boyd, N.S. Deep learning for image-based weed detection in turfgrass. Eur. J. Agron. 2019, 104, 78–84. [Google Scholar] [CrossRef]
  8. Jiang, H.; Zhang, C.; Qiao, Y.; Zhang, Z.; Zhang, W.; Song, C. CNN feature based graph convolutional network for weed and crop recognition in smart farming. Comput. Electron. Agric. 2020, 174, 105450. [Google Scholar] [CrossRef]
  9. Jin, X.; Sun, Y.; Che, J.; Bagavathiannan, M.; Yu, J.; Chen, Y. A novel deep learning-based method for detection of weeds in vegetables. Pest Manag. Sci. 2022, 78, 1861–1869. [Google Scholar] [CrossRef]
  10. Zhao, J.; Tian, G.; Qiu, C.; Gu, B.; Zheng, K.; Liu, Q. Weed Detection in Potato Fields Based on Improved YOLOv4: Optimal Speed and Accuracy of Weed Detection in Potato Fields. Electronics 2022, 11, 3709. [Google Scholar] [CrossRef]
  11. Güldenring, R.; van Evert, F.K.; Nalpantidis, L. RumexWeeds: A grassland dataset for agricultural robotics. J. Field Robot. 2023, 40, 1639–1656. [Google Scholar] [CrossRef]
  12. Fawakherji, M.; Youssef, A.; Bloisi, D.; Pretto, A.; Nardi, D. Crop and Weeds Classification for Precision Agriculture Using Context-Independent Pixel-Wise Segmentation. In Proceedings of the 2019 Third IEEE International Conference on Robotic Computing (IRC), Naples, Italy, 25–27 February 2019; pp. 146–152. [Google Scholar] [CrossRef]
  13. Champ, J.; Mora-Fallas, A.; Goëau, H.; Mata-Montero, E.; Bonnet, P.; Joly, A. Instance segmentation for the fine detection of crop and weed plants by precision agricultural robots. Appl. Plant Sci. 2020, 8, e11373. [Google Scholar] [CrossRef]
  14. Güldenring, R.; Boukas, E.; Ravn, O.; Nalpantidis, L. Few-leaf learning: Weed segmentation in grasslands. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 3248–3254. [Google Scholar]
  15. Sodjinou, S.G.; Mohammadi, V.; Sanda Mahama, A.T.; Gouton, P. A deep semantic segmentation-based algorithm to segment crops and weeds in agronomic color images. Inf. Process. Agric. 2022, 9, 355–364. [Google Scholar] [CrossRef]
  16. Fathipoor, H.; Shah-hosseini, R.; Arefi, H. Crop and Weed Segmentation on Ground-Based Images Using Deep Convolutional Neural Network. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2023, X-4/W1-2022, 195–200. [Google Scholar] [CrossRef]
  17. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; NanoCode012; Kwon, Y.; Michael, K.; Xie, T.; Fang, J.; imyhxy; et al. ultralytics/yolov5: v7.0—YOLOv5 SOTA Realtime Instance Segmentation. 2022. Available online: https://zenodo.org/record/7347926 (accessed on 24 August 2023). [CrossRef]
  18. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  19. Jocher, G.; Chaurasia, A.; Qiu, J. YOLO by Ultralytics. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 24 August 2023).
  20. Cardellicchio, A.; Solimani, F.; Dimauro, G.; Petrozza, A.; Summerer, S.; Cellini, F.; Renò, V. Detection of tomato plant phenotyping traits using YOLOv5-based single stage detectors. Comput. Electron. Agric. 2023, 207, 107757. [Google Scholar] [CrossRef]
  21. Wang, D.; He, D. Channel pruned YOLO V5s-based deep learning approach for rapid and accurate apple fruitlet detection before fruit thinning. Biosyst. Eng. 2021, 210, 271–281. [Google Scholar] [CrossRef]
  22. Sozzi, M.; Cantalamessa, S.; Cogato, A.; Kayad, A.; Marinello, F. Automatic Bunch Detection in White Grape Varieties Using YOLOv3, YOLOv4, and YOLOv5 Deep Learning Algorithms. Agronomy 2022, 12, 319. [Google Scholar] [CrossRef]
  23. Yang, G.; Wang, J.; Nie, Z.; Yang, H.; Yu, S. A Lightweight YOLOv8 Tomato Detection Algorithm Combining Feature Enhancement and Attention. Agronomy 2023, 13, 1824. [Google Scholar] [CrossRef]
  24. Langer, F.; Mandtler, L.P.; Milioto, A.; Palazzolo, E.; Stachniss, C. Geometrical Stem Detection from Image Data for Precision Agriculture. arXiv 2018, arXiv:1812.05415. [Google Scholar]
  25. Lottes, P.; Behley, J.; Chebrolu, N.; Milioto, A.; Stachniss, C. Joint Stem Detection and Crop-Weed Classification for Plant-Specific Treatment in Precision Farming. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2018, Madrid, Spain, 1–5 October 2018. [Google Scholar] [CrossRef]
  26. Lottes, P.; Behley, J.; Chebrolu, N.; Milioto, A.; Stachniss, C. Robust joint stem detection and crop-weed classification using image sequences for plant-specific treatment in precision farming. J. Field Robot. 2020, 37, 20–34. [Google Scholar] [CrossRef]
  27. Weyler, J.; Milioto, A.; Falck, T.; Behley, J.; Stachniss, C. Joint Plant Instance Detection and Leaf Count Estimation for In-Field Plant Phenotyping. IEEE Robot. Autom. Lett. 2021, 6, 3599–3606. [Google Scholar] [CrossRef]
  28. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Tian, Q. CenterNet: Keypoint Triplets for Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6568–6577. [Google Scholar] [CrossRef]
  29. Zhang, X.; Li, N.; Ge, L.; Xia, X.; Ding, N. A Unified Model for Real-Time Crop Recognition and Stem Localization Exploiting Cross-Task Feature Fusion. In Proceedings of the 2020 IEEE International Conference on Real-Time Computing and Robotics (RCAR), Virtual, 28–29 September 2020; pp. 327–332. [Google Scholar] [CrossRef]
  30. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar] [CrossRef]
  31. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. arXiv 2019, arXiv:1908.07919. [Google Scholar] [CrossRef] [PubMed]
  32. Maji, D.; Nagori, S.; Mathew, M.; Poddar, D. YOLO-Pose: Enhancing YOLO for Multi Person Pose Estimation Using Object Keypoint Similarity Loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar] [CrossRef]
  33. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2015, arXiv:1412.6980. [Google Scholar]
  34. Chebrolu, N.; Lottes, P.; Schaefer, A.; Winterhalter, W.; Burgard, W.; Stachniss, C. Agricultural robot dataset for plant classification, localization and mapping on sugar beet fields. Int. J. Robot. Res. 2017, 36, 1045–1052. [Google Scholar] [CrossRef]
Figure 1. The “GALIRUMI” EU project robot operating on grasslands needs to detect weeds (yellow bounding boxes) and their stem joints (cyan keypoints) to control them in an effective and efficient manner, through precisely targeted laser weeding or electrocution.
Figure 2. YOLO-HRNet: HRNet (blue) is integrated in YOLOv5 (yellow) as second stage. The corresponding feature maps of the predicted bounding boxes serve as input to the HRNet and one joint stem keypoint is predicted per input.
Figure 3. YOLO-Pose [32]: YOLOv5 (yellow) is extended with additional Pose Heads for each feature scale (blue). The Pose Heads regress keypoint positions relative to the anchor box center.
Figure 4. Example image from the RumexWeeds dataset [11] with the ground-truth bounding boxes and our additional joint-stem annotation: each joint stem has been annotated by the position (k_{i,x}, k_{i,y}) and a radius r_i, which indicates the uncertainty of the human annotator. Note that it is differentiated between two species: Rumex obtusifolius (yellow) and Rumex crispus (red).
Figure 5. Qualitative performance of the final models YOLO-HRNet (middle), YOLO-Pose (right), compared to the ground truth (left). Input images are shown with yellow bounding boxes for detected weeds and cyan joint stem positions.
Table 1. The model complexity of both architectures is explored. For both architectures, the model complexity mainly depends on the chosen backbone, while for YOLO-HRNet in (a), the HRNet input size is varied, and for YOLO-Pose in (b), the complexity of the pose head is varied.

(a) YOLO-HRNet
Backbone | HRNet Input Size | mAP_{50} | mAP_{50:95} | OKS
DarkNet-53 M | 16 | 45.4 | 23.1 | 44.1
DarkNet-53 M | 24 | 45.1 | 23.8 | 44.0
DarkNet-53 M | 32 | 45.7 | 23.3 | 44.2
DarkNet-53 S | 16 | 40.8 | 20.8 | 37.0
DarkNet-53 X | 16 | 46.1 | 23.8 | 44.2

(b) YOLO-Pose
Backbone | Pose Head | mAP_{50} | mAP_{50:95} | OKS
DarkNet-53 M | Simple | 52.1 | 26.3 | 47.4
DarkNet-53 M | Complex | 54.2 | 27.5 | 50.5
DarkNet-53 S | Complex | 49.1 | 24.5 | 48.9
DarkNet-53 X | Complex | 52.9 | 26.3 | 50.1
Table 2. Hyperparameter tuning of loss weights using random search for YOLO-HRNet (a) and YOLO-Pose (b). We only show the relevant results.

(a) YOLO-HRNet
w_{box} | w_{obj} | w_{kp} | mAP_{50} | mAP_{50:95} | OKS
0.05 | 0.65 | 0.30 | 43.4 | 22.2 | 43.2
0.05 | 0.85 | 0.10 | 45.4 | 23.1 | 44.1
0.05 | 0.90 | 0.05 | 46.3 | 22.9 | 37.9

(b) YOLO-Pose
Pose Head | w_{box} | w_{obj} | w_{kp} | mAP_{50} | mAP_{50:95} | OKS
Simple | 0.05 | 0.65 | 0.15 | 49.1 | 22.4 | 54.6
Simple | 0.05 | 0.75 | 0.10 | 50.8 | 25.2 | 54.6
Simple | 0.05 | 0.85 | 0.05 | 51.5 | 25.3 | 53.7
Simple | 0.05 | 0.95 | 0.01 | 52.1 | 26.3 | 47.4
Simple | 0.05 | 0.95 | 0.005 | 50.5 | 25.8 | 37.8
Complex | 0.05 | 0.95 | 0.01 | 54.2 | 27.5 | 50.5
Table 3. Influence of different keypoint weightings σ_t in the loss function L_{OKS}. A low σ_t introduces too much label noise, leading to reduced detection performance. On the contrary, too high values of σ_t > 0.4 likewise impact the detection performance negatively. For our final models, we fix σ_t = 0.2.

σ_t | mAP_{50} | mAP_{50:95} | OKS
0.1 | 50.0 | 25.6 | 50.0
0.2 | 52.2 | 26.9 | 50.4
0.4 | 55.0 | 26.4 | 48.1
0.6 | 52.3 | 26.9 | 37.9
0.8 | 52.7 | 26.6 | 34.6
Table 4. Evaluation of final models on the test split. YOLO-Pose clearly outperforms its competitors. Furthermore, introducing joint-stem detection as second task enhances the performance in weed detection.

Model | Training Time [h] | mAP_{50} | mAP_{50:95} | OKS
YOLOv5 | 5.3 | 46.5 | 23.7 | /
YOLO-HRNet | 42.1 | 50.0 | 27.9 | 46.3
YOLO-Pose | 5.8 | 50.1 | 26.1 | 56.3
Table 5. Inference speed of our final models. YOLO-Pose is slightly slower than YOLOv5 but still suitable for deployment on edge devices onboard agricultural robots.

GPU | Model | # Params | Preproc + NMS [ms] | Inference [ms] | FPS
Nvidia V100 | YOLOv5 | 20.9 M | 1.2 | 8.5 | 103.1
Nvidia V100 | YOLO-HRNet | 26.6 M |  | 33.6 | 28.7
Nvidia V100 | YOLO-Pose (Simple) | 20.9 M |  | 8.7 | 101.0
Nvidia V100 | YOLO-Pose (Complex) | 23.3 M |  | 10.1 | 88.5
Nvidia Jetson NX | YOLO-Pose (Simple) | 20.9 M | 5.5 | 67.0 | 13.8
Nvidia Jetson NX | YOLO-Pose (Complex) | 23.3 M |  | 76.5 | 12.2