A Unified Framework for Enhanced 3D Spatial Localization of Weeds via Keypoint Detection and Depth Estimation

Xie, Shuxin; Quan, Tianrui; Luo, Junjie; Ren, Xuesong; Miao, Yubin

doi:10.3390/agriculture15171854

by Shuxin Xie, Tianrui Quan, Junjie Luo, Xuesong Ren and Yubin Miao *

School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

* Author to whom correspondence should be addressed.

Agriculture 2025, 15(17), 1854; https://doi.org/10.3390/agriculture15171854

Submission received: 4 August 2025 / Revised: 25 August 2025 / Accepted: 28 August 2025 / Published: 30 August 2025

(This article belongs to the Special Issue Advanced Image Collection, Processing, and Analysis in Crop and Livestock Management)


Abstract

In this study, we propose WeedLoc3D, a lightweight deep neural network framework based on multi-task learning, to meet the demand for accurate three-dimensional positioning of weed targets in automatic laser weeding. From a single RGB image, it both locates the 2D keypoints (growth points) of weeds and estimates their depth with high accuracy. This departs from conventional approaches, which localize weeds only in the 2D image plane. To improve model performance, we introduce several structural modules: Gated Feature Fusion (GFF) for adaptive feature integration, the Hybrid Domain Block (HDB) for handling high-frequency details, and Cross-Branch Attention (CBA) for promoting synergy between tasks. Experimental validation on field datasets confirms the effectiveness of our method. It significantly reduces the positioning error of 3D keypoints and achieves stable performance across detection and estimation tasks. The demonstrated accuracy and robustness highlight its potential for practical application.

Keywords:

weed detection; multi-task learning; keypoint localization; depth estimation

1. Introduction

Global agricultural production faces severe challenges from weeds, which cause hundreds of millions of tons of crop losses annually and often lead to outbreaks of pests and diseases, seriously threatening agricultural sustainability. Traditional weeding methods suffer from significant limitations such as high labor intensity, low efficiency, soil structure damage, and environmental pollution, and they often lead to increased herbicide resistance in weeds [1,2]. These drawbacks have prompted researchers to seek more precise and environmentally friendly alternatives.

In this context, laser weeding technology has emerged as a research hotspot due to its unique advantages. This technology precisely targets weed meristems with high-energy laser beams, achieving efficient eradication. Its core lies in a three-stage closed-loop system: “target identification-dynamic localization-precise removal” [3]. Compared to traditional methods, laser weeding uses electricity as energy, precisely targets keypoints, and operates without physical contact, effectively avoiding chemical pollution, minimizing non-target damage, and preserving soil structure.

However, the efficacy of laser weeding highly depends on accurate weed detection and 3D spatial localization. Early image processing methods struggled to adapt to complex field environments. With the rise of deep learning, Convolutional Neural Networks (CNNs) have shown remarkable performance in weed detection, classification, and segmentation tasks, such as U-Net [4] and DeepLab [5] for pixel-level recognition, and YOLO [6] and Faster R-CNN [7] for precise target localization.

It is worth noting that conventional detection methods primarily focus on bounding boxes or segmentation masks of entire plants, but for laser weeding, merely targeting non-critical parts like leaves cannot achieve complete eradication. Research indicates that precise localization and destruction of plant keypoints (meristems) can significantly improve weeding efficiency. This has led us to introduce keypoint detection technology from computer vision into agriculture, similar to its applications in human pose estimation (OpenPose [8], HRNet [9]) and crop phenotyping.

Depth estimation, as a crucial technique for recovering 3D scene information, holds particular importance in laser weeding, where accurate depth information directly influences the laser focus position and energy parameters. Monocular depth estimation is appealing for its low hardware cost and flexible deployment, but it is inherently ill-posed in general scenarios. The relatively fixed operating height of weeding robots provides a strong prior on the approximate depth range, which lets us effectively mitigate this ill-posedness. We can therefore focus on estimating a small-scale depth compensation rather than absolute depth from scratch, significantly improving prediction accuracy for this specific task.

Although existing techniques can achieve 2D localization of weed keypoints, laser weeding demands even stricter accuracy in spatial coordinates. The characteristic depth of focus of laser beams means that even millimeter-level spatial deviations can lead to failures. To address this, we innovate by introducing a lightweight depth compensation module on top of 2D detection, constructing a complete 3D perception system that maintains accuracy while significantly reducing system complexity and cost, outperforming complex solutions like stereo vision.

The main contributions of this study include:

Proposing the first unified framework for 3D keypoint localization in laser weeding applications, achieving end-to-end learning for 2D planar localization and depth compensation.
Designing a multi-task optimization loss function to effectively balance the joint training of keypoint detection and depth estimation.
Validating the proposed method’s reliability and precision in real operating environments through experiments on datasets from real agricultural fields.

The remainder of this paper is structured as follows: Section 2 reviews related work; Section 3 details the method design; Section 4 presents experimental results and analysis; Section 5 concludes the paper and discusses future research directions.

2. Related Work

2.1. Weed Detection

Weed detection and localization are core tasks in automated weeding technology, and their accuracy directly impacts weeding effectiveness and crop protection. In recent years, with the development of computer vision and deep-learning techniques, significant progress has been made in weed detection and localization methods.

For weed detection, researchers have developed various deep-learning-based methods. For example, Osorio et al. [10] utilized SVM, YOLOv3, and Mask-RCNN for weed detection in lettuce fields. Sun et al. [11] combined a CNN with multimodal images (RGB and near-infrared) to improve recognition performance in low-light environments and introduced deformable convolutions for feature extraction. Nasiri et al. [12] used a U-Net architecture to achieve pixel-level semantic segmentation of sugar beet, weeds, and soil. Zhao et al. [13] integrated an EMA attention module and an SPPF module into YOLOv8-Pose to enhance the performance of laser weeding robots.

Furthermore, research on weed keypoint localization has been deepening. Zhang et al. [14] proposed a lightweight algorithm based on YOLO instance segmentation, calculating the centroid of segmentation masks for keypoint localization, but it is mainly applicable to isolated individuals. Arsa et al. [15] designed a UNet-based keypoint detection network, using EfficientNet B6 as backbone, fusing segmentation and heatmap prediction results to localize keypoints, and performed well on multiple agricultural datasets. Lottes et al. [13,16] developed a joint stem detection and crop/weed segmentation network, sharing an encoder for detection and segmentation, but with high computational complexity. Li et al. [17] proposed a two-stage keypoint detection method, using YOLOv5 and HRNet to achieve individual localization and keypoint heatmap generation.

Although the above studies have made significant progress in weed detection and keypoint localization, most methods only provide 2D planar localization, which struggles to meet the 3D spatial accuracy required for laser weeding. Zhu et al. [18] used a YOLOX network to detect the pixel coordinates of weeds and approximated the distance between the weed keypoints and the imaging plane by the height of the camera. When the sparseness of the laser array or the operating height changes, the lack of depth information can lead to significant localization errors, as shown in Figure 1. Depth information also benefits the precise control of the laser focal length and thus improves weeding efficiency. Moreover, laser scheduling and planning require depth information for optimal allocation and for avoiding accidental damage. Therefore, integrating depth estimation techniques into keypoint detection is a new research direction worth exploring.

2.2. Depth Estimation

Depth estimation, as a key technology providing 3D perception capabilities, has wide and significant application potential in precision agriculture, automated equipment, and intelligent management. In the agricultural domain, Coll-Ribes et al. [19] integrated a monocular depth estimation module into an instance segmentation network, enhancing the detection of grape clusters and pedicels. Tamrakar et al. [20] used ZoeDepth to generate depth maps, combining them with RGB images for precise segmentation and localization of strawberries and their pedicels. Cui et al. [21] proposed a lightweight MonoDA model, using depth and pose estimation subnetworks to provide environmental depth perception for unmanned agricultural vehicles. Kim et al. [22] developed a vision-based height measurement system that calculates the height of the crop from depth maps generated by disparities.

The aforementioned research demonstrates that depth information can significantly enhance the perception capabilities of agricultural vision systems. However, existing methods primarily focus on general scene depth reconstruction, lacking specific designs for weed keypoint localization. Additionally, the computational complexity and hardware cost of depth estimation still limit their widespread application in agricultural robots. Therefore, developing a lightweight depth estimation module tailored for laser weeding tasks is crucial.

2.3. Multi-Task Learning

Multi-task learning (MTL) is a machine-learning paradigm that improves model performance by sharing information and optimizing jointly across tasks. Compared to single-task learning, MTL can enhance model generalization, improve data utilization efficiency, and reduce computational costs [23]. In recent years, MTL has been increasingly applied in the agricultural domain.

For example, Goncalves et al. [24] proposed MTLSegFormer, which combines MTL with attention mechanisms to improve semantic segmentation performance in precision agriculture. Amrani et al. [25] designed a Bayesian MTL model, combining it with ResNet18 to estimate and classify aphid size. Duc et al. [26] constructed a cross-task distillation design that promotes early information sharing between the two tasks to improve task-specific performance. He et al. [27] built a decoder with three feature mappings (universal representation, semantic feature, and depth feature) for a dual-task network, added mapping modules from monocular depth features to segmentation features and from segmentation features to monocular depth features, and fused the information of the three features into the segmentation head and the depth prediction head, respectively. Zhang et al. [28] proposed Multitask GAN and developed a semantic-guided smoothness loss, so that the network not only performs semantic segmentation and depth completion but also uses the generated semantic images to improve the accuracy of depth completion. Zhang et al. [29] developed a multitask framework integrating object segmentation, keypoint estimation, and monocular image depth estimation for measuring the chest circumference of cattle.

Of particular note, Lottes et al. [16,30] experimentally confirmed that multi-task networks with shared encoders can learn more discriminative feature representations. This finding provides theoretical support for our joint learning of keypoint detection and depth estimation. Building on this, we further explore inter-task interaction mechanisms and design an MTL framework for laser weeding, achieving efficient synergy between keypoint localization and depth compensation.

In the field of weed detection and localization, significant progress has been made; however, these methods still have room for improvement in addressing practical needs such as 3D localization accuracy, model lightweight design, and inter-task collaborative optimization. Notably, existing representative studies have focused solely on pixel-level weed localization [10,11,15,16,17,18,30]. Our research proposes, for the first time, a method that combines pixel-level localization with depth estimation to achieve 3D localization, utilizing a lightweight design to meet practical application requirements.

3. Methods

3.1. Network Architecture

In existing research, weed keypoints are usually located using the two-dimensional plane information of the image, for example by predicting pixel coordinates, generating a heatmap to represent the probability distribution, or detecting an approximate position through bounding boxes. These methods generally regard the weed keypoint as a point on the horizontal plane, and its depth (or height difference from the camera) is either ignored or assumed to be a fixed value. However, for laser weeding, simply treating weeds as planar targets leads to inaccurate laser targeting, because weeds are distributed in three dimensions and their keypoints may differ significantly in height. Therefore, to overcome the inherent limitations of two-dimensional positioning and ensure that the laser beam strikes accurately, we introduce a depth compensation estimation task on top of 2D localization, predicting the actual height offset of keypoints relative to a reference plane.

To enhance model trainability and prediction stability, we normalize the depth map. The maximum depth value over all pixels in each image is used as the reference height, yielding a relative depth map with respect to the “deepest point”:

$$Z_{comp}(x, y) = \max_{(u, v)} Z_{raw}(u, v) - Z_{raw}(x, y) \tag{1}$$

where $Z_{comp}$ is the depth compensation amount, which is the prediction target of our network, and $Z_{raw}$ is the original depth map. This processing transforms the prediction target into a vertical offset relative to a “reference plane” (i.e., the “depth compensation amount”), effectively compressing the depth value range and improving gradient stability and convergence. During inference, we use the average of the per-image maximum depth values in the training set as a global fixed reference $Z_{ref}$ and subtract the predicted compensation value from it to recover the absolute depth of keypoints, $Z_{final}$:

$$Z_{final}(x, y) = Z_{ref} - Z_{comp}(x, y) \tag{2}$$

As shown in the experimental section, this approximation introduces only a small error. In this paper, we focus on estimating the compensation amount.
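The two relations above can be illustrated with a minimal sketch in plain Python. The function names and the sample depth values are hypothetical and chosen only for illustration; in practice $Z_{ref}$ would come from training-set statistics rather than from the image itself.

```python
def depth_compensation(z_raw):
    """Eq. (1): offset of every pixel from the deepest point of the image."""
    z_max = max(max(row) for row in z_raw)
    return [[z_max - z for z in row] for row in z_raw]


def recover_depth(z_comp, z_ref):
    """Eq. (2): absolute depth from a compensation map and a fixed reference."""
    return [[z_ref - c for c in row] for row in z_comp]


# Hypothetical 2x3 raw depth map (illustrative values, arbitrary units).
z_raw = [[500.0, 498.0, 495.0],
         [500.0, 490.0, 499.0]]

z_comp = depth_compensation(z_raw)            # training target
z_final = recover_depth(z_comp, z_ref=500.0)  # inference-time recovery
```

Here the reference equals the true maximum of the image, so the recovery is exact; with a global reference the recovered depth is shifted by the difference between the two.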

We employ a heatmap-based methodology for keypoint prediction, as it demonstrably enhances the robustness in handling the inherent ambiguity and labeling imprecision of keypoints. In addition, this approach provides a more versatile and efficient framework for simultaneously localizing multiple keypoints, offering distinct advantages over traditional regression-based alternatives.

To simultaneously predict 2D weed keypoint planar localization and depth compensation, we design a lightweight dual-task model, WeedLoc3D, as shown in Figure 2. The network employs EfficientNetB0 [31], proposed by Tan et al., as its encoder, taking an input RGB image $I \in \mathbb{R}^{W \times H \times 3}$ and outputting a heatmap $P \in \mathbb{R}^{W \times H \times 1}$ and a compensated depth map $D \in \mathbb{R}^{W \times H \times 1}$. Given the common feature requirements and clear mapping relationships between semantic and depth information for both tasks, the two branches share the same feature extractor to reduce computation and use an identical network structure to adapt to task demands.

The encoder extracts five levels of features $\{F_1, \dots, F_5\}$, where the high-level feature $F_5$ contains the most abstract semantic information and is directly fed into the decoder for upsampling. The decoder progressively integrates low-level features ($F_1$ to $F_4$) to compensate for detailed information. Finally, the model outputs predicted heatmaps and depth maps synchronously through dual task heads.

Traditional feature fusion methods, including concatenation and element-wise addition, are relatively simple in feature processing. On one hand, they lead to limited feature fusion effectiveness, and on the other hand, they may introduce noise information. To address these limitations and better fuse decoder and encoder features, we propose the Gated Feature Fusion (GFF) module. To simultaneously maintain output smoothness and high-frequency details, we introduce the Hybrid Domain Block (HDB), which learns features in both spatial and frequency domains in parallel, enhancing global perception and balancing high-frequency sensitivity. Furthermore, given that features from the two branches mutually promote each other, we propose the Cross-Branch Attention (CBA) module, which facilitates feature exchange between branches and improves overall task performance. Next, we will detail these modules.

3.2. Gated Feature Fusion

In encoder-decoder architectures, effective fusion of low-level and high-level semantic features is crucial for dense prediction tasks. Traditional fusion strategies, such as simple concatenation or element-wise addition, often treat all features statically, which can lead to semantic information being diluted by noisy details or critical details being overwhelmed by high-level features. To address this, we draw inspiration from the Squeeze-and-Excitation (SE) mechanism [32] and design the Gated Feature Fusion (GFF) module as shown in Figure 3. Its core lies in achieving adaptive feature fusion, dynamically enhancing informative features and suppressing redundant ones based on feature content.

Specifically, the GFF module first preprocesses the skip connection features $s \in \mathbb{R}^{C \times H \times W}$ from the encoder and the upsampled features $x \in \mathbb{R}^{C \times H \times W}$ from the decoder, generating their compact representations $\hat{s}$ and $\hat{x}$. These two representations are then added and fed into an SE-style attention gating unit to generate a channel attention weight vector $g \in \mathbb{R}^{C \times 1 \times 1}$, which is broadcast over the spatial dimensions:

$$g = \sigma(W_2(\mathrm{ReLU}(W_1(\mathrm{GAP}(\hat{s} + \hat{x}))))) \tag{3}$$

where $W_1 \in \mathbb{R}^{C/r \times C}$ and $W_2 \in \mathbb{R}^{C \times C/r}$ are the weights of two $1 \times 1$ convolutional layers ($r$ is the channel reduction ratio, experimentally set to 4 in this research), $\sigma$ represents the Sigmoid activation function, and GAP denotes Global Average Pooling.

We introduce a residual gating mechanism, utilizing the attention vector $g$ to recalibrate the channels of the original feature $s$, while simultaneously retaining its original information through a learnable scalar parameter $\alpha$, forming a dynamically modulated output feature $s_{out}$:

$$s_{out} = s \odot g + \alpha \cdot s \tag{4}$$

where $\odot$ denotes the Hadamard product. Finally, the gate-adjusted feature $s_{out}$ is concatenated with the decoder feature $x$ along the channel dimension, forming the final fused output $y \in \mathbb{R}^{2C \times H \times W}$.
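The gating in Equations (3) and (4) can be sketched in plain Python on nested lists. This is a simplified illustration, not the authors' implementation: the preprocessing that produces $\hat{s}$ and $\hat{x}$ is skipped (the raw features are pooled directly), biases are omitted, and the function names and weights are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(matrix, vector):
    """Dense matrix-vector product; stands in for a 1x1 convolution on a pooled vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gated_fusion(s, x, w1, w2, alpha):
    """s, x: feature maps as [C][H][W] nested lists.

    Returns the fused [2C][H][W] feature and the channel gate g.
    Assumption: s and x are used in place of their compact representations.
    """
    # Global average pooling of (s + x): one scalar per channel.
    pooled = []
    for s_ch, x_ch in zip(s, x):
        total = sum(a + b for s_row, x_row in zip(s_ch, x_ch)
                    for a, b in zip(s_row, x_row))
        pooled.append(total / (len(s_ch) * len(s_ch[0])))
    # Eq. (3): squeeze (C -> C/r), ReLU, excite (C/r -> C), sigmoid.
    hidden = [max(0.0, h) for h in matvec(w1, pooled)]
    g = [sigmoid(v) for v in matvec(w2, hidden)]
    # Eq. (4): s * g + alpha * s, applied channel by channel.
    s_out = [[[v * (g_c + alpha) for v in row] for row in ch]
             for ch, g_c in zip(s, g)]
    # Channel-wise concatenation with the decoder feature.
    return s_out + x, g


# Tiny example with C = 2, H = 1, W = 2 and reduction ratio r = 2.
s = [[[1.0, 2.0]], [[3.0, 4.0]]]
x = [[[5.0, 6.0]], [[7.0, 8.0]]]
w1 = [[0.0, 0.0]]      # shape (C/r) x C
w2 = [[0.0], [0.0]]    # shape C x (C/r)
fused, g = gated_fusion(s, x, w1, w2, alpha=0.5)
```

With all-zero weights the gate is 0.5 for every channel, so with $\alpha = 0.5$ the skip feature passes through unchanged; trained weights would instead emphasize some channels and suppress others.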

3.3. Hybrid Domain Block

In dense prediction tasks, traditional decoders often lead to excessive smoothing of high-frequency details (e.g., object edges) due to simple convolutional operations. To address this, we propose the Hybrid Domain Block (HDB). The core idea is to analyze and process the features simultaneously in the spatial and frequency domains, and then merge their advantages: the frequency domain excels at capturing global low-frequency and high-frequency components, while the spatial domain is adept at processing local neighborhood relationships. By optimizing the synergy between these two, HDB achieves cross-domain decoupling and enhancement of features. Its overall workflow can be represented as:

$$y = x_{init} + F(x_{init}) + S(F(x_{init})) \tag{5}$$

where $x_{init} = \mathrm{Conv}_{1 \times 1}(x)$ is the initial projected feature, and $F(\cdot)$ and $S(\cdot)$ represent the frequency-domain and spatial-domain processing units, respectively.

3.3.1. Frequency Unit

This unit maps features to the frequency domain via the Fourier transform to achieve explicit modeling and adaptive weighting of different frequency bands. As shown in Figure 4, for a given input feature $x \in \mathbb{R}^{C \times H \times W}$, we first apply the Fast Fourier Transform (FFT) to decompose it into its real part $R$ and imaginary part $I$:

$$F(x) = \mathrm{fft}(x) = R + jI \tag{6}$$

To dynamically adjust the importance of different frequency components, we introduce a learnable channel-wise filter $W_{filt} \in \mathbb{R}^{C \times 1 \times 1}$ and apply it to the real and imaginary parts:

$$\hat{R} = R \odot W_{filt}, \quad \hat{I} = I \odot W_{filt} \tag{7}$$

Next, to facilitate information exchange between different frequency components (represented by different channels), we apply independent MLP-style projections to the filtered $\hat{R}$ and $\hat{I}$, obtaining $\hat{R}'$ and $\hat{I}'$. Subsequently, we concatenate the projected real and imaginary parts (denoted $F_{attn}$) and enhance their interaction through a cross-component attention module:

$$F_{out} = F_{attn} \odot \sigma(\mathrm{Attention}(F_{attn})) + F_{attn} \tag{8}$$

Finally, the result is normalized and split back into real and imaginary parts, and the spatial-domain features are reconstructed via the Inverse FFT (IFFT) and added to the original input as a residual connection:

$$R'', I'' = \mathrm{Split}(\mathrm{Norm}(F_{out})) \tag{9}$$

$$F(x) = \mathrm{ifft}(R'' + jI'') \tag{10}$$
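The core of Equations (6), (7), and (10) can be illustrated with a deliberately reduced sketch: a 1D discrete Fourier transform per channel, a scalar filter weight applied to both the real and imaginary parts, and the inverse transform. The MLP projections and the attention step of Equations (8) and (9) are omitted, the signals are 1D rather than 2D, and all names are hypothetical.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)) / n for t in range(n)]


def frequency_filter(channels, w_filt):
    """Apply one filter weight per channel to the real and imaginary spectra.

    channels: list of 1D real signals, one per channel.
    w_filt:   list of scalars, one per channel (stands in for W_filt).
    """
    out = []
    for signal, w in zip(channels, w_filt):
        spectrum = dft(signal)                    # Eq. (6): R + jI
        real = [c.real * w for c in spectrum]     # Eq. (7): filtered real part
        imag = [c.imag * w for c in spectrum]     # Eq. (7): filtered imaginary part
        filtered = [complex(r, i) for r, i in zip(real, imag)]
        out.append([v.real for v in idft(filtered)])  # Eq. (10)
    return out


filtered = frequency_filter([[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, -1.0]],
                            [0.5, 2.0])
```

Because the transform is linear, a single weight per channel simply rescales that channel; the frequency-selective behavior of the full block comes from the projection and attention steps that this sketch leaves out.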

3.3.2. Spatial Unit

As shown in Figure 5, this unit aims to enhance the model’s adaptive perception capabilities for multi-scale spatial patterns. Inspired by [33], we design a multi-branch structure to process spatial information at different scales in parallel. For an input feature $x \in \mathbb{R}^{C \times H \times W}$, we first uniformly divide it into $L$ groups along the channel dimension (default $L = 4$), i.e., $x = [x_0, x_1, \dots, x_{L-1}]$, where $x_i \in \mathbb{R}^{(C/L) \times H \times W}$.

For the $i$-th feature group $x_i$ ($i > 0$), we first downsample it to $1/2^i$ resolution using adaptive max pooling to capture more macroscopic contextual information. Then, we apply a mixed receptive field module $M_i$ composed of square and rectangular convolutions ($1 \times k$ and $k \times 1$) for feature extraction. Finally, we upsample it back to the original resolution using nearest-neighbor interpolation. For the 0-th group $x_0$, we directly apply $M_0$ without downsampling.

$$y_i = \begin{cases} \mathrm{Upsample}(M_i(\mathrm{Downsample}(x_i))), & \text{if } i > 0 \\ M_i(x_i), & \text{if } i = 0 \end{cases} \tag{11}$$

The outputs of all branches are concatenated and passed through a $1 \times 1$ convolution for information aggregation. The final output is multiplied by the original input via a gating mechanism, achieving dynamic spatial feature weighting:

$$S(x) = x \odot \mathrm{GELU}(\mathrm{Conv}_{1 \times 1}[y_0, \dots, y_{L-1}]) \tag{12}$$

3.4. Cross-Branch Attention

In our dual-task learning framework, the feature requirements for keypoint detection and depth estimation are naturally complementary. To encourage these two tasks to mutually enhance rather than interfere, we design CBA to achieve explicit information exchange between task branches, as shown in Figure 6. Its core goal is to enable each task branch to query and integrate useful information from the other branch.

We draw inspiration from the efficient attention mechanism proposed by Shen et al. [34], realizing a lightweight multi-head cross-attention mechanism. This mechanism significantly reduces computational complexity to a linear level by decomposing attention calculation into independent weighting of Key and Query.

Given features $x_1 \in \mathbb{R}^{N \times C \times H \times W}$ from the keypoint branch and $x_2 \in \mathbb{R}^{N \times C \times H \times W}$ from the depth branch, the CBA module computes their updated features in parallel. We illustrate the process by detailing the update of $x_1$ (the keypoint branch).

First, through independent $1 \times 1$ convolutions, we generate Query ($Q_1$) from $x_1$, and Key ($K_2$) and Value ($V_2$) from $x_2$. These tensors are flattened into sequence form, where $N_{pix} = H \times W$:

$$Q_1 = f_Q(x_1), \quad K_2 = f_K(x_2), \quad V_2 = f_V(x_2) \tag{13}$$

where $Q_1, K_2 \in \mathbb{R}^{N \times C_k \times N_{pix}}$ and $V_2 \in \mathbb{R}^{N \times C_v \times N_{pix}}$. Here, $C_k, C_v \ll C$, which significantly reduces computation. Next, we divide these tensors along the channel dimension into $H_{head}$ heads and compute attention independently within each head. For the $i$-th head, we apply Softmax normalization to Key and Query along their sequence dimension:

$$\hat{K}_2^i = \mathrm{Softmax}(K_2^i), \quad \hat{Q}_1^i = \mathrm{Softmax}(Q_1^i) \tag{14}$$

Then, the attention output is computed through two matrix multiplications. First, the normalized Key $\hat{K}_2^i$ is weighted by Value $V_2^i$ to generate a global context vector. Subsequently, the normalized Query $\hat{Q}_1^i$ is weighted by this context vector to obtain the final attention output:

$$\mathrm{Context}^i = \mathrm{Softmax}(\hat{K}_2^i (V_2^i)^T) / \sqrt{d_k} \tag{15}$$

$$\mathrm{AttnOut}^i = (\mathrm{Context}^i)^T \hat{Q}_1^i \tag{16}$$

The outputs of all heads, $\mathrm{AttnOut}^i$, are concatenated and reshaped back to image dimensions, then mapped back to the original channel count $C$ via a $1 \times 1$ convolution. Finally, a residual connection adds this attention output to the original input $x_1$, yielding the updated feature map $y_1$:

$$y_1 = x_1 + f_{reproj}(\mathrm{Concat}(\mathrm{AttnOut}^1, \dots, \mathrm{AttnOut}^{H_{head}})) \tag{17}$$

The process for updating the depth branch feature map $x_2$ is entirely symmetrical: $x_2$ is used to generate Query, and $x_1$ to generate Key and Value.
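The order of the two matrix products is what keeps the cost linear in $N_{pix}$: the Key-Value product yields a small $C_k \times C_v$ context, which is then applied to every Query position. The following single-head sketch in plain Python illustrates only that ordering. It is a simplification under stated assumptions: Softmax is taken along the sequence dimension for both Key and Query as described in the text, the additional Softmax and $\sqrt{d_k}$ scaling of Equation (15) are omitted, the $1 \times 1$ projections and the residual of Equation (17) are left out, and all names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def efficient_cross_attention(q1, k2, v2):
    """q1, k2: [C_k][N_pix] lists; v2: [C_v][N_pix] list.

    Returns (context, output) with shapes [C_k][C_v] and [C_v][N_pix].
    """
    k_hat = [softmax(row) for row in k2]   # normalize Key along the sequence
    q_hat = [softmax(row) for row in q1]   # normalize Query along the sequence
    n_pix = len(k2[0])
    # context[k][c] = sum_n k_hat[k][n] * v2[c][n]; cost O(C_k * C_v * N_pix)
    context = [[sum(k_row[n] * v_row[n] for n in range(n_pix)) for v_row in v2]
               for k_row in k_hat]
    # output[c][n] = sum_k context[k][c] * q_hat[k][n]; again linear in N_pix
    output = [[sum(context[k][c] * q_hat[k][n] for k in range(len(q_hat)))
               for n in range(n_pix)] for c in range(len(v2))]
    return context, output


# Hypothetical sizes: C_k = 2, C_v = 3, N_pix = 3.
q1 = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0]]
k2 = [[0.5, 0.5, 0.5], [2.0, 1.0, 0.0]]
v2 = [[3.0, 3.0, 3.0], [1.0, 1.0, 1.0], [-2.0, -2.0, -2.0]]
context, output = efficient_cross_attention(q1, k2, v2)
```

No $N_{pix} \times N_{pix}$ attention map is ever formed, which is the source of the linear complexity mentioned above.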

3.5. Loss Function

To jointly optimize keypoint detection and depth estimation tasks, we design a multi-task loss function $L_{total}$, which is a weighted sum of the keypoint detection loss $L_{kp}$ and the depth estimation loss $L_{depth}$.

3.5.1. Keypoints Detection Loss

The model predicts keypoint locations in the form of Gaussian heatmaps, essentially performing pixel-level dense classification. To address the severe class imbalance between foreground and background pixels in heatmaps, we employ Focal Loss [35]. This loss function introduces a modulating factor that dynamically down-weights well-classified samples, allowing the training process to focus more on hard, easily misclassified samples.

Specifically, we first compute the standard pixel-wise binary cross-entropy loss. For a prediction $\hat{y}_{ij}$ and ground truth $y_{ij}$ at location $(i, j)$, the loss is calculated as:

$$L_{BCE}(y_{ij}, \hat{y}_{ij}) = -[y_{ij} \log(\hat{y}_{ij}) + (1 - y_{ij}) \log(1 - \hat{y}_{ij})] \tag{18}$$

Then, we compute a Focal weight, which consists of two parts: (1) a balancing factor $\alpha$ for positive and negative samples; (2) a modulating factor $(1 - \hat{y}_{ij})^{\gamma}$ or $\hat{y}_{ij}^{\gamma}$ based on prediction confidence. The Focal weight is given by:

$$w_{focal}(y_{ij}, \hat{y}_{ij}) = \alpha (1 - \hat{y}_{ij})^{\gamma} + (1 - \alpha) \hat{y}_{ij}^{\gamma} \tag{19}$$

The final keypoint loss is the mean over all pixels of the product of the Focal weight and the cross-entropy term:

$$L_{kp} = \frac{1}{N_{pix}} \sum_{i, j} w_{focal}(y_{ij}, \hat{y}_{ij}) \cdot L_{BCE}(y_{ij}, \hat{y}_{ij}) \tag{20}$$

where $N_{pix}$ is the total number of pixels. Based on our experience, we set $\gamma = 2$ and $\alpha = 0.75$ to suppress simple background samples while focusing on learning foreground keypoints.
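A minimal sketch of a focal-weighted cross-entropy in plain Python follows. One point is an assumption of this sketch rather than something stated above: the two modulating terms are additionally weighted by the label ($y$ for the positive term, $1 - y$ for the negative term), as in the standard Focal Loss, so that confident background predictions are down-weighted. The function name and the clamping constant `eps` are also hypothetical.

```python
import math


def focal_keypoint_loss(y_true, y_pred, alpha=0.75, gamma=2.0, eps=1e-7):
    """Mean focal-weighted binary cross-entropy over a 2D heatmap.

    y_true, y_pred: nested lists of equal shape with values in [0, 1].
    Assumption: the focal weight is label-weighted (standard Focal Loss style).
    """
    total, count = 0.0, 0
    for t_row, p_row in zip(y_true, y_pred):
        for y, p in zip(t_row, p_row):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            # Pixel-wise binary cross-entropy, Eq. (18).
            bce = -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
            # Balancing factor alpha times confidence-based modulating factor.
            weight = (alpha * y * (1.0 - p) ** gamma
                      + (1.0 - alpha) * (1.0 - y) * p ** gamma)
            total += weight * bce
            count += 1
    return total / count  # average over all pixels, Eq. (20)


target = [[1.0, 0.0]]
loss_confident = focal_keypoint_loss(target, [[0.9, 0.1]])
loss_wrong = focal_keypoint_loss(target, [[0.1, 0.9]])
```

Well-classified pixels contribute very little, so `loss_confident` is far smaller than `loss_wrong`, which is the intended focusing effect.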

3.5.2. Depth Estimation Loss

Depth estimation is a pixel-wise regression task. Considering that occlusions, reflections, and other abnormalities in field scenes can lead to outliers in depth predictions, which poses challenges for the stability of standard L2 loss, we opt for Huber loss, which provides a good balance between precision and robustness.

Given the normalized predicted depth map $\hat{D}$ and the ground truth depth map $D$, with pixel values $\hat{d}_{i,j}$ and $d_{i,j}$, the depth estimation loss is defined as:

$$L_{depth} = \frac{1}{N_{pix}} \sum_{i, j} H_{\delta}(\hat{d}_{i,j} - d_{i,j}) \tag{21}$$

where $H_{\delta}(x)$ is the Huber loss function:

$$H_{\delta}(x) = \begin{cases} \frac{1}{2} x^2 & \text{if } |x| \leq \delta \\ \delta |x| - \frac{1}{2} \delta^2 & \text{otherwise} \end{cases} \tag{22}$$

We set the threshold $\delta = 0.1$ (in units of normalized depth) to accommodate the expected error range in our task.
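Equations (21) and (22) translate directly into a few lines of plain Python. The function names are hypothetical, and the depth maps are assumed to be nested lists of equal shape.

```python
def huber(x, delta=0.1):
    """Eq. (22): quadratic near zero, linear for large residuals."""
    ax = abs(x)
    if ax <= delta:
        return 0.5 * x * x
    return delta * ax - 0.5 * delta * delta


def depth_loss(d_pred, d_true, delta=0.1):
    """Eq. (21): mean Huber loss over all pixels of two equally shaped maps."""
    residuals = [p - t
                 for p_row, t_row in zip(d_pred, d_true)
                 for p, t in zip(p_row, t_row)]
    return sum(huber(r, delta) for r in residuals) / len(residuals)


# A small residual is penalized quadratically, a large one only linearly.
small = huber(0.05)
large = huber(0.5)
```

With $\delta = 0.1$, a residual of 0.5 costs 0.045 instead of the 0.125 that a pure squared-error term would assign, which is why outliers from occlusions or reflections have less influence on the gradient.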

3.5.3. Total Loss

Finally, our total loss function is the weighted sum of the above two loss terms:

$$L_{total} = \lambda L_{kp} + L_{depth} \tag{23}$$

where $\lambda$ is a hyperparameter used to balance the two tasks. Through experimental exploration, we ultimately set $\lambda = 0.1$, ensuring that both tasks are adequately optimized in a balanced manner during training.

3.6. Three-Dimensional Coordinate Reconstruction of Keypoints

Our model’s output is a probability heatmap, and to obtain the final discrete keypoint coordinates, an efficient and precise post-processing pipeline is required.

For the raw output heatmap matrix $P \in \mathbb{R}^{h \times w}$, we first apply a Gaussian filter to suppress high-frequency noise and smooth the heatmap. Subsequently, we use a predefined threshold $\tau$ to binarize the smoothed heatmap, retaining the significant regions $M_{roi}$ with sufficient confidence:

$$M_{roi}[i, j] = \begin{cases} 1 & \text{if } (G_{\sigma} * P)[i, j] > \tau \\ 0 & \text{otherwise} \end{cases} \tag{24}$$

where $*$ denotes the convolution operation and $G_{\sigma}$ represents the Gaussian filter.

Within the significant region $M_{roi}$, we employ Local Non-Maximum Suppression (NMS) to find local peak points, ensuring that each target outputs only one keypoint.
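The thresholding and local NMS steps can be sketched as follows in plain Python. The heatmap is assumed to be already smoothed (the Gaussian filter is not reproduced here), and the function name, the threshold value, and the neighborhood radius are hypothetical choices for illustration.

```python
def find_peaks(heatmap, tau=0.5, radius=1):
    """Return (row, col) positions that exceed tau and are strict local maxima.

    heatmap: nested list indexed as heatmap[row][col], assumed pre-smoothed.
    Note: with a strict comparison, flat plateaus produce no peak.
    """
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for i in range(h):
        for j in range(w):
            value = heatmap[i][j]
            if value <= tau:          # outside the significant region M_roi
                continue
            neighbours = [heatmap[a][b]
                          for a in range(max(0, i - radius), min(h, i + radius + 1))
                          for b in range(max(0, j - radius), min(w, j + radius + 1))
                          if (a, b) != (i, j)]
            if all(value > n for n in neighbours):   # local non-maximum suppression
                peaks.append((i, j))
    return peaks


heat = [[0.0, 0.1, 0.0, 0.0, 0.0],
        [0.1, 0.9, 0.2, 0.0, 0.0],
        [0.0, 0.2, 0.1, 0.3, 0.0],
        [0.0, 0.0, 0.3, 0.8, 0.2],
        [0.0, 0.0, 0.0, 0.2, 0.1]]
peaks = find_peaks(heat)
```

Each of the two blobs in this toy heatmap yields exactly one candidate, which is then passed to the sub-pixel refinement step.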

To achieve higher localization precision, we perform sub-pixel level refinement for each candidate point

(x_{0}, y_{0})

. The principle is to approximate the true extremum of the function by Taylor expansion around the heatmap value

P (x, y)

at that point, and use its second-order approximation. By computing its neighborhood gradient information

Δ x, Δ y

and Hessian matrix

\nabla^{2} P_{X X}, \nabla^{2} P_{Y Y}

using central difference approximations, we derive the coordinate offsets

(δ x, δ y)

: Here, the first-order partial derivatives

Δ x

and

Δ y

are approximated as:

Δ x = (P (x_{0} + 1, y_{0}) - P (x_{0} - 1, y_{0})) / 2

(25)

Δ y = (P (x_{0}, y_{0} + 1) - P (x_{0}, y_{0} - 1)) / 2

(26)

And the second-order partial derivatives

\nabla^{2} P_{X X}

and

\nabla^{2} P_{Y Y}

(corresponding to the diagonal elements of the Hessian matrix) are approximated as:

\nabla^{2} P_{X X} = P (x_{0} + 1, y_{0}) - 2 P (x_{0}, y_{0}) + P (x_{0} - 1, y_{0})

(27)

\nabla^{2} P_{Y Y} = P (x_{0}, y_{0} + 1) - 2 P (x_{0}, y_{0}) + P (x_{0}, y_{0} - 1)

(28)

The coordinate offsets are then given by:

δ x = - \frac{Δ x}{\nabla^{2} P_{X X}}, δ y = - \frac{Δ y}{\nabla^{2} P_{Y Y}}

(29)

Adding these offsets to the candidate point yields the sub-pixel keypoint position:

\hat{x} = x_{0} + δ x, \hat{y} = y_{0} + δ y

(30)
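Equations (25) to (30) translate almost directly into code. The sketch below assumes the heatmap is indexed as `P[y][x]` and that the candidate point does not lie on the image border. The guard against a vanishing second derivative is an added assumption that the text does not mention.

```python
def refine_subpixel(P, x0, y0, eps=1e-9):
    """Sub-pixel refinement of an integer peak (x0, y0) on heatmap P[y][x]."""
    # First-order central differences, Eqs. (25) and (26).
    dx = (P[y0][x0 + 1] - P[y0][x0 - 1]) / 2.0
    dy = (P[y0 + 1][x0] - P[y0 - 1][x0]) / 2.0
    # Second-order central differences, Eqs. (27) and (28).
    dxx = P[y0][x0 + 1] - 2.0 * P[y0][x0] + P[y0][x0 - 1]
    dyy = P[y0 + 1][x0] - 2.0 * P[y0][x0] + P[y0 - 1][x0]
    # Offsets, Eq. (29); fall back to zero when the curvature vanishes (assumption).
    off_x = -dx / dxx if abs(dxx) > eps else 0.0
    off_y = -dy / dyy if abs(dyy) > eps else 0.0
    # Refined coordinates, Eq. (30).
    return x0 + off_x, y0 + off_y


# Hypothetical quadratic response whose true peak is at (3.3, 2.6).
P = [[1.0 - 0.1 * ((x - 3.3) ** 2 + (y - 2.6) ** 2) for x in range(7)]
     for y in range(7)]
x_hat, y_hat = refine_subpixel(P, 3, 3)
```

For an exactly quadratic response the refinement recovers the true peak; for a Gaussian-shaped response it is only an approximation.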

After obtaining the sub-pixel refined two-dimensional image coordinates

(\hat{x}, \hat{y})

, an essential step for accurate 3D reconstruction is to correct for lens distortion. Using the pre-calibrated camera distortion coefficients (

k_{1}, k_{2}, k_{3}

for radial distortion and

p_{1}, p_{2}

for tangential distortion), the observed distorted pixel coordinates are transformed into undistorted pixel coordinates

({\hat{x}}^{'}, {\hat{y}}^{'})

. This process typically involves applying the inverse of the standard camera distortion model, effectively mapping the points to a pinhole camera projection plane.
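One common way to invert the distortion model is fixed-point iteration in normalized image coordinates. The sketch below assumes the widely used Brown-Conrady form of the radial and tangential terms; the paper does not state which inversion procedure was used. The intrinsics and coefficients in the example are hypothetical, and the coefficient order follows the text (k1, k2, k3, p1, p2).

```python
def _distortion_terms(x, y, k1, k2, k3, p1, p2):
    """Radial factor and tangential shifts at normalized coordinates (x, y)."""
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    dx = 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    dy = p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return radial, dx, dy


def distort_pixel(u, v, K, dist):
    """Forward model: ideal pixel -> observed (distorted) pixel."""
    fx, fy, cx, cy = K
    x, y = (u - cx) / fx, (v - cy) / fy
    radial, dx, dy = _distortion_terms(x, y, *dist)
    return (x * radial + dx) * fx + cx, (y * radial + dy) * fy + cy


def undistort_pixel(u, v, K, dist, iterations=20):
    """Inverse model: observed pixel -> ideal pixel, by fixed-point iteration."""
    fx, fy, cx, cy = K
    xd, yd = (u - cx) / fx, (v - cy) / fy
    x, y = xd, yd  # initial guess: no distortion
    for _ in range(iterations):
        radial, dx, dy = _distortion_terms(x, y, *dist)
        x, y = (xd - dx) / radial, (yd - dy) / radial
    return x * fx + cx, y * fy + cy


# Hypothetical calibration: (fx, fy, cx, cy) and (k1, k2, k3, p1, p2).
K = (600.0, 600.0, 352.0, 352.0)
dist = (-0.05, 0.01, 0.0, 0.001, -0.0005)
u_obs, v_obs = distort_pixel(500.0, 200.0, K, dist)
u_hat, v_hat = undistort_pixel(u_obs, v_obs, K, dist)
```

The iteration converges quickly for mild distortion; strongly distorted lenses may need more iterations or a different solver.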

For each sub-pixel refined keypoint, we extract the corresponding depth value from the model’s normalized depth map

D_{final}

. This depth value directly represents the distance from the camera’s optical center along the Z-axis, i.e.,

Z_{c} = D_{final} [\hat{y}, \hat{x}]

.

Based on the standard pinhole camera model and utilizing the pre-calibrated camera intrinsic matrix K (including focal lengths

f_{x}, f_{y}

and principal point

c_{x}, c_{y}

), we can transform the two-dimensional undistorted pixel coordinates

({\hat{x}}^{'}, {\hat{y}}^{'})

and depth value

Z_{c}

into three-dimensional coordinates

(X_{c}, Y_{c}, Z_{c})

in the camera coordinate system. To obtain the intrinsic parameters used in this transformation, we calibrated the camera using Zhang’s method, which estimates both intrinsic and distortion parameters from multiple views of a planar calibration pattern.

The transformation formulas are as follows:

X_{c} = ({\hat{x}}^{'} - c_{x}) \cdot \frac{Z_{c}}{f_{x}}

(31)

Y_{c} = ({\hat{y}}^{'} - c_{y}) \cdot \frac{Z_{c}}{f_{y}}

(32)

Z_{c} = Z_{c}

(33)

Here,

(X_{c}, Y_{c}, Z_{c})

denotes the keypoint’s position in a coordinate system with its origin at the camera’s optical center, the Z-axis pointing forward, the X-axis to the right, and the Y-axis downwards (or according to a specific camera coordinate system convention).
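As a worked instance of Eqs. (31) to (33), the following sketch back-projects an undistorted pixel with known depth into the camera frame. The intrinsic values and the depth unit (metres) are hypothetical and serve only as an illustration.

```python
def backproject(u, v, z_c, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with depth z_c, Eqs. (31)-(33)."""
    x_c = (u - cx) * z_c / fx
    y_c = (v - cy) * z_c / fy
    return x_c, y_c, z_c


# Hypothetical intrinsics for a 704 x 704 crop and a keypoint 0.30 m from the camera.
X, Y, Z = backproject(412.0, 322.0, 0.30, fx=600.0, fy=600.0, cx=352.0, cy=352.0)
```

Here the keypoint lies 60 pixels to the right of and 30 pixels above the principal point, which at 0.30 m corresponds to about 3 cm to the right and 1.5 cm upward (negative Y under the Y-down convention).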

Through this integrated pipeline, we combine sub-pixel two-dimensional keypoint detection with per-pixel depth estimation, thereby achieving high-precision three-dimensional spatial localization of weed keypoints. This high-precision 3D information is crucial for laser weeding systems, as it ensures the precise focusing of the laser beam onto weed keypoints, significantly improving weeding efficiency and minimizing the risk of damaging crops.

4. Experiments

4.1. Datasets

4.1.1. Data Collection

The datasets used in this study were collected in real agricultural field environments, as shown in Figure 7, and include lettuce and more than five common weed species, as shown in Figure 8. To obtain 3D scene information, we used an Intel RealSense D405 depth camera, which is designed for high-precision depth perception at close range. All data were acquired during the spring of 2025 in rural farmland areas surrounding Shanghai, China, under typical field conditions. Data collection was conducted during daylight hours with sufficient natural illumination to ensure optimal sensor performance.

All samples were collected under a consistent top-down view, ensuring that the camera’s optical axis was approximately perpendicular to the ground. For each sample, a

1280 \times 720

resolution RGB image and a corresponding depth map of the same resolution were simultaneously captured. Thanks to the D405 camera’s built-in hardware synchronization and calibration, RGB images and depth maps were strictly spatially aligned at the pixel level, requiring no additional registration. The dataset contains only crops, weeds, and background soil, simulating real field operation scenarios. The dataset is challenging: Figure 9 shows the distribution of several of its parameters, where the vertical axis gives the frequency of each subset normalized by the total number.

4.1.2. Data Annotation and Splitting

We performed detailed manual annotation on all RGB images. We precisely marked the 2D coordinates of the keypoints (apical meristems) of each weed as Ground Truth Keypoints, which were then used to generate Gaussian heatmaps as supervision signals for the keypoint detection task. Depth maps were filtered and used directly as ground truth depths for the depth estimation task.

All 833 samples were randomly split into training (595 samples), validation (119 samples), and test (119 samples) sets in a 5:1:1 ratio.
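The split arithmetic can be checked with a short sketch: 833 = 7 × 119, so a 5:1:1 ratio gives exactly 595, 119, and 119 samples. The seed and the function name are hypothetical; the paper does not describe how the random split was implemented.

```python
import random


def split_indices(n=833, ratio=(5, 1, 1), seed=0):
    """Randomly split n sample indices into train/val/test by the given ratio."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    unit, remainder = divmod(n, sum(ratio))
    # Any remainder is assigned to the training set (assumption; it is 0 for n = 833).
    n_train = ratio[0] * unit + remainder
    n_val = ratio[1] * unit
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


train, val, test = split_indices()
```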

4.1.3. Data Preprocessing and Augmentation

To transform the raw collected data into a format suitable for model training, we centrally cropped the original

1280 \times 720

images, retaining a

704 \times 704

pixel region as effective input, to eliminate edge lens distortion. We filtered and denoised the depth maps. Based on manually annotated keypoint coordinates, we generated a corresponding

704 \times 704

8-bit grayscale heatmap for each RGB image. In the heatmap, the position of each keypoint is rendered as a 2D Gaussian kernel (peak value 255, with the standard deviation

σ

set to a fixed value), while background regions are zero.

After preprocessing, each training sample consists of an RGB image, a keypoint Gaussian heatmap, and a depth map, all with a resolution of

704 \times 704

and pixel-aligned, as shown in Figure 10.
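The cropping and heatmap rendering described above might look like the following sketch. The value of `sigma` is a placeholder because the paper only says it is fixed, and merging overlapping Gaussians with a pixel-wise maximum is an assumption.

```python
import numpy as np

CROP = 704


def crop_offsets(width=1280, height=720, size=CROP):
    """Top-left corner of the central size x size crop."""
    return (width - size) // 2, (height - size) // 2


def to_cropped(keypoint, offsets):
    """Shift an (x, y) keypoint from original to cropped image coordinates."""
    return keypoint[0] - offsets[0], keypoint[1] - offsets[1]


def render_heatmap(keypoints, size=CROP, sigma=8.0):
    """8-bit heatmap with a Gaussian of peak value 255 at each (x, y) keypoint."""
    ys, xs = np.mgrid[0:size, 0:size].astype(np.float64)
    heat = np.zeros((size, size), dtype=np.float64)
    for kx, ky in keypoints:
        g = np.exp(-((xs - kx) ** 2 + (ys - ky) ** 2) / (2.0 * sigma ** 2))
        heat = np.maximum(heat, g)  # overlapping kernels keep the larger response
    return np.round(255.0 * heat).astype(np.uint8)


offsets = crop_offsets()                      # (288, 8) for 1280 x 720 input
center_kp = to_cropped((640, 360), offsets)   # image centre -> (352, 352)
heatmap = render_heatmap([center_kp])
```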

To enhance model generalization and prevent overfitting, we employed geometric transformation data augmentation strategies: during training, samples were randomly flipped horizontally, randomly flipped vertically, and randomly rotated by multiples of 90 degrees with a certain probability. We intentionally avoided data augmentation methods that alter object size and image scale, such as random scaling, stretching, and affine transformations. This is because such transformations can severely distort scene depth information (e.g., shrinking an object should change its depth value, but the relationship is difficult to model precisely), thereby interfering with the depth estimation task. The rigid geometric transformations we adopted effectively expanded the dataset while maintaining the physical consistency of depth information.
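A minimal sketch of such depth-preserving augmentation is given below. The same flip or rotation is applied to the image, heatmap, and depth map so that they stay pixel-aligned, and depth values are only rearranged, never rescaled. The probability `p` and the function name are assumptions.

```python
import random

import numpy as np


def augment(image, heatmap, depth, rng, p=0.5):
    """Apply identical random flips and 90-degree rotations to all three arrays."""
    arrays = [image, heatmap, depth]
    if rng.random() < p:  # horizontal flip
        arrays = [np.flip(a, axis=1) for a in arrays]
    if rng.random() < p:  # vertical flip
        arrays = [np.flip(a, axis=0) for a in arrays]
    if rng.random() < p:  # rotation by 90, 180, or 270 degrees
        k = rng.randint(1, 3)
        arrays = [np.rot90(a, k, axes=(0, 1)) for a in arrays]
    return tuple(np.ascontiguousarray(a) for a in arrays)


# Tiny hypothetical sample: one keypoint at row 1, column 2.
image = np.arange(4 * 4 * 3).reshape(4, 4, 3)
heat = np.zeros((4, 4))
heat[1, 2] = 255
depth = np.zeros((4, 4))
depth[1, 2] = 7.0
aug_image, aug_heat, aug_depth = augment(image, heat, depth, random.Random(0))
```

Because the inputs are square, rotations by multiples of 90 degrees keep the 704 × 704 shape unchanged.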

4.2. Training Environment and Hyperparameters

All models were trained and tested using the PyTorch deep-learning framework. The hardware platform was equipped with an NVIDIA GeForce RTX 4090 GPU with 24 GB of VRAM. Training utilized the Adam optimizer, combined with a Cosine Annealing Learning Rate Scheduler, with an initial learning rate of

1 \times 10^{- 4}

and a final learning rate of

1 \times 10^{- 5}

. Models were trained for 200 epochs with a batch size of 32. The entire training process took approximately 6 h.
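For reference, the standard cosine annealing rule with these endpoints can be written in closed form. The sketch assumes the schedule is stepped once per epoch and spans the full 200 epochs without restarts, which the text does not state explicitly.

```python
import math


def cosine_lr(epoch, total_epochs=200, lr_init=1e-4, lr_final=1e-5):
    """Cosine-annealed learning rate at a given epoch (0 <= epoch <= total_epochs)."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return lr_final + (lr_init - lr_final) * cos_term


schedule = [cosine_lr(e) for e in range(201)]
```

The rate starts at 1e-4, passes through the midpoint 5.5e-5 at epoch 100, and ends at 1e-5.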

4.3. Performance Metrics

4.3.1. Keypoint Detection Metrics

To comprehensively evaluate the model on the keypoint detection task, we adopt several evaluation methods, including both a newly proposed metric and traditional coordinate-based metrics. Together they quantify the quality of the predicted heatmap and measure the accuracy of the keypoint coordinates extracted in practical application.

In keypoint detection tasks, existing evaluation methods face two limitations. Coordinate-based metrics (e.g., OKS or L2 distance) directly measure localization error but depend on post-processing, so they neither evaluate heatmap quality directly nor account for the predicted probability distribution. Traditional pixel-wise IoU is insensitive to small shifts in peak location and therefore does not reflect localization accuracy well. To address this gap, we propose a new metric, HeatmapIoU (HIoU). Its core idea is to introduce dynamic saliency weights so that the evaluation focuses on peak alignment rather than only on regional overlap.

Given the model’s predicted heatmap

P

and ground truth Gaussian heatmap

G

, we first define a saliency mask

M

, used to restrict the evaluation scope to regions where at least one of the prediction and ground truth has a high response (above a threshold

θ

), thereby suppressing interference from vast background regions.

M [i, j] = \{\begin{matrix} 1 & if G [i, j] > θ \\ 0 & otherwise \end{matrix}

(34)

We design a mixed weighting function

w

, which combines spatial location (via mask

M

) and heatmap peak intensity:

w = (1 - α) \cdot M + α \cdot \frac{G + P}{2}

(35)

where the spatial weight

(1 - α) \cdot M

provides a baseline weight, ensuring all salient regions are considered. The peak weight

α \cdot (G + P) / 2

is a critical component, reaching its maximum when prediction and ground truth peaks are aligned. This makes the metric highly sensitive to peak location shifts; the greater the offset, the smaller this term. The parameter

α

(set to 0.8 in experiments) is used to balance the importance of these two weights. Finally, HeatmapIoU is defined as the ratio of the weighted intersection to the union:

HeatmapIoU = \frac{\sum (w ⊙ min (G, P))}{\sum (w ⊙ max (G, P)) + ϵ}

(36)

where

ϵ

is a small stable constant. Through this design, HeatmapIoU effectively unifies the evaluation of overall heatmap shape quality and core localization precision, providing a more reliable basis for directly optimizing and comparing models that generate heatmaps.
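Equations (34) to (36) can be sketched as follows. The mask follows Eq. (34) as printed, i.e. it is built from the ground truth heatmap alone. Heatmaps are assumed to be scaled to [0, 1], and the values of `theta` and `eps` are placeholders because the paper does not report them.

```python
import numpy as np


def heatmap_iou(G, P, theta=0.1, alpha=0.8, eps=1e-6):
    """Saliency-weighted IoU between ground truth G and prediction P (Eq. 36)."""
    M = (G > theta).astype(np.float64)                # saliency mask, Eq. (34)
    w = (1.0 - alpha) * M + alpha * (G + P) / 2.0     # mixed weights, Eq. (35)
    intersection = np.sum(w * np.minimum(G, P))
    union = np.sum(w * np.maximum(G, P)) + eps
    return intersection / union


def gaussian(cx, cy, size=32, sigma=3.0):
    """Synthetic heatmap with a single Gaussian peak (illustration only)."""
    ys, xs = np.mgrid[0:size, 0:size]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))


gt = gaussian(16, 16)
score_same = heatmap_iou(gt, gaussian(16, 16))
score_near = heatmap_iou(gt, gaussian(18, 16))   # peak shifted by 2 pixels
score_far = heatmap_iou(gt, gaussian(22, 16))    # peak shifted by 6 pixels
```

In this synthetic case the score decreases as the predicted peak moves away from the ground truth peak, which is the behaviour the metric is designed to capture.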

Although HeatmapIoU effectively evaluates heatmap generation quality, practical applications ultimately need discrete keypoint coordinates. Therefore, we add a set of evaluation metrics based on the final extracted coordinates to measure model performance in actual detection and localization tasks. These metrics are computed by matching the predicted keypoint set

P = \{P_{i}\}_{i = 1}^{N}

with the ground truth keypoint set

G = \{g_{j}\}_{j = 1}^{M}

. Here,

P_{i}, g_{j} \in R^{2}

denote the predicted and ground truth 2D coordinates, respectively.

We employ a greedy matching strategy based on a distance threshold. As shown in Figure 11, for each ground truth keypoint

g_{j}

, we search for the closest unmatched predicted point

P_{i}

among all currently unmatched predicted points. If this minimum Euclidean distance

d = {∥P_{i} - g_{j}∥}_{2}

is less than a predefined distance threshold

τ

, then this pair of keypoints

(P_{i}, g_{j})

is considered a successful match. This predicted point

P_{i}

and ground truth point

g_{j}

will then be marked as matched and will not participate in subsequent matching processes. Based on the requirements of our application scenario (laser spot size), we set

τ = 16

pixels, which represents an acceptable error range in the physical world.
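The matching rule can be sketched as below. The iteration order over ground truth points and the handling of ties are not specified in the text, so this version simply processes ground truth points in list order; the function name is hypothetical.

```python
import math


def greedy_match(preds, gts, tau=16.0):
    """Greedy distance-threshold matching; returns (pred_idx, gt_idx, distance)."""
    unmatched = list(range(len(preds)))
    matches = []
    for j, g in enumerate(gts):
        if not unmatched:
            break
        # Closest prediction that has not been matched yet.
        i = min(unmatched, key=lambda k: math.dist(preds[k], g))
        d = math.dist(preds[i], g)
        if d < tau:
            matches.append((i, j, d))
            unmatched.remove(i)
    return matches


# Hypothetical predictions and ground truth points, in pixels.
preds = [(10.0, 10.0), (100.0, 100.0), (300.0, 300.0)]
gts = [(13.0, 14.0), (98.0, 103.0), (200.0, 200.0)]
matches = greedy_match(preds, gts)
```

In this example the first two ground truth points are matched, while the third has no prediction within 16 pixels and remains unmatched.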

Based on these matching results, we define the following statistics:

True Positives (TP): The number of successfully matched predicted keypoints.
False Positives (FP): The number of unmatched predicted keypoints, representing incorrect detections.
False Negatives (FN): The number of unmatched ground truth keypoints, representing missed detections.

According to the above statistics, we calculate the following metrics to comprehensively evaluate the model’s performance:

Precision (P):

Precision measures the accuracy of predictions, i.e., what percentage of all detected keypoints are correct.

P = \frac{TP}{TP + FP}

(37)

Recall (R):

Recall measures the completeness of detection, i.e., what percentage of all ground truth keypoints were successfully detected.

R = \frac{TP}{TP + FN}

(38)

Pixel Mean Absolute Error (PMAE):

This metric is specifically designed to evaluate localization accuracy. It only calculates the average Euclidean distance between all successfully matched (TP) keypoint pairs, directly reflecting the accuracy of the model’s localization. Given the set of all matched pairs

M = \{(P_{i}, g_{i})\}_{i = 1}^{| TP |}

:

PMAE = \frac{1}{| TP |} \sum_{(P_{i}, g_{i}) \in M} {∥P_{i} - g_{i}∥}_{2}

(39)
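Given the distances of the matched pairs, Eqs. (37) to (39) reduce to a few lines. The conventions for the degenerate cases (no predictions, no ground truth, or no matches) are assumptions that the paper does not cover.

```python
def detection_metrics(match_distances, num_preds, num_gts):
    """Precision, Recall, and PMAE from matched-pair distances (Eqs. 37-39)."""
    tp = len(match_distances)
    fp = num_preds - tp   # unmatched predictions
    fn = num_gts - tp     # unmatched ground truth keypoints
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    pmae = sum(match_distances) / tp if tp else float("nan")
    return precision, recall, pmae


# Two matches (3 px and 1 px error) out of 3 predictions and 4 ground truth points.
precision, recall, pmae = detection_metrics([3.0, 1.0], num_preds=3, num_gts=4)
```

Note that PMAE is averaged over matched pairs only, so it measures localization accuracy separately from missed or spurious detections.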

4.3.2. Depth Estimation Metrics

When evaluating the performance of depth estimation tasks, we choose to assess the depth estimation performance at all ground truth keypoint locations and employ the following two absolute error-based metrics:

Depth Mean Absolute Error (DMAE):

This metric directly measures the average absolute difference between the predicted depth and the ground truth depth at all ground truth keypoint locations. We convert the normalized depth values back to physical units (e.g., centimeters) to make the metric more intuitively interpretable.

DMAE = \frac{1}{M} \sum_{j = 1}^{M} | \hat{d} (g_{j}) - d (g_{j}) |

(40)

where M is the total number of ground truth keypoints,

g_{j}

is the coordinate of the j-th ground truth keypoint, and

\hat{d} (g_{j})

and

d (g_{j})

are the predicted and true depths at that location, respectively.

Accuracy With Threshold (AWT):

To quantify the extent to which model predictions meet practical application accuracy requirements, we introduce Accuracy With Threshold. This metric calculates the proportion of keypoints where the prediction error is less than a specific threshold

δ

:

AWT = \frac{1}{N} \sum_{i = 1}^{N} I (| d_{pr}^{(i)} - d_{gt}^{(i)} | < δ)

(41)

where

I (\cdot)

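is the indicator function, which returns 1 if the condition is met and 0 otherwise, N is the number of evaluated keypoints, and the two depth terms in the sum are the predicted and ground truth depths at the i-th keypoint.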
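Both depth metrics can be computed from paired depth values sampled at the ground truth keypoint locations. The sketch assumes the depths are already converted to centimetres; the sample values are made up.

```python
def dmae(pred_depths, gt_depths):
    """Depth Mean Absolute Error over all ground truth keypoints (Eq. 40)."""
    errors = [abs(p - g) for p, g in zip(pred_depths, gt_depths)]
    return sum(errors) / len(errors)


def awt(pred_depths, gt_depths, delta=1.455):
    """Accuracy With Threshold: share of keypoints with error below delta (Eq. 41)."""
    hits = sum(1 for p, g in zip(pred_depths, gt_depths) if abs(p - g) < delta)
    return hits / len(gt_depths)


# Hypothetical depths in cm; absolute errors are 0.5, 1.0, 0.0, and 3.0.
pred = [30.0, 31.0, 28.5, 33.0]
gt = [30.5, 30.0, 28.5, 30.0]
dmae_value = dmae(pred, gt)
awt_value = awt(pred, gt)
```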

The choice of threshold

δ

has clear physical and comparative significance. We determine it through a baseline experiment: treating the depth values of all weed keypoints as a fixed constant (i.e., the planar assumption of traditional methods), and finding the constant value that minimizes the mean absolute error. This minimum error is 1.455 cm. Therefore, we set

δ = 1.455

cm to reflect the advantage of our method over traditional planar methods.
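The constant that minimizes the mean absolute error of a set of values is their median, so the planar baseline can be illustrated as below. Whether the authors used the median or a numerical search is not stated, and the depth values here are made up.

```python
import statistics


def planar_baseline(depths_cm):
    """Best constant depth under MAE and the resulting error."""
    c = statistics.median(depths_cm)
    mae = sum(abs(d - c) for d in depths_cm) / len(depths_cm)
    return c, mae


def mae_for_constant(depths_cm, c):
    """MAE obtained when every keypoint is assigned the constant depth c."""
    return sum(abs(d - c) for d in depths_cm) / len(depths_cm)


depths = [28.0, 30.0, 31.0, 35.0, 29.0]
best_c, best_mae = planar_baseline(depths)
mean_mae = mae_for_constant(depths, statistics.mean(depths))
```

For these values the median (30.0 cm) gives an MAE of 1.8 cm, whereas using the mean (30.6 cm) gives a slightly larger MAE of 1.92 cm.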

4.4. Experimental Results

Our proposed multi-task learning framework achieves excellent performance in both keypoint detection and depth estimation tasks. This section will detail and analyze the quantitative evaluation results on the test set.

4.4.1. Keypoint Detection Performance

The model’s keypoint detection performance is evaluated using Precision, Recall, Pixel Mean Absolute Error (PMAE), and HeatmapIoU.

As shown in Table 1 and Figure 12, the model performs stably on the test set. The average Recall reached 87.05%, indicating that the model detects most weed keypoints within the field of view, which is crucial for weeding tasks. The average Precision reached 85.44%, showing that the model produces relatively few false positives while maintaining high Recall. In about a quarter of the test samples, Precision and Recall each reached 100%. Our proposed HeatmapIoU is mostly distributed between 0.7 and 0.9, with an average of 0.8045, indicating that the heatmaps generated by the model align well with the true distribution in shape and peak location. Meanwhile, the average PMAE is only 2.37 pixels and is below 3.5 pixels for most samples. According to our camera calibration, this corresponds to an average physical error of less than 2 mm on the ground, a precision that satisfies the stringent targeting requirements of laser weeding systems.

It is worth noting that, because of image resolution limitations and the large number of samples, it is difficult to label every weed accurately. As shown in Figure 13, our model can still correctly predict some unlabeled weeds (such as those within the black circle), but it may also miss some labeled weeds (such as those within the gold circle) as well as weeds whose labels are missing (such as those within the orange circle).

4.4.2. Depth Estimation Performance

The performance of depth estimation is directly related to the accuracy of 3D localization. As shown in Figure 14, our model also exhibits high accuracy and reliability in the depth estimation task. At all ground truth keypoint locations, most depth prediction errors are within 1 cm, well below those of the traditional method.

More importantly, when using 1.455 cm (the optimal error of the traditional “planar assumption” baseline) as the threshold, the model’s Accuracy With Threshold (AWT) reaches 83.46%. This means that in 83.46% of cases, the depth predicted by our model is more accurate than the average error of the optimal planar assumption. In terms of the per-pixel error distribution, our model performs significantly better than the traditional method, with smaller errors and a more stable distribution.

As shown in Table 2, to quantitatively validate the advantage of our depth estimation approach, we compare it against a conventional baseline that assumes a uniform height for all weed keypoints (i.e., the planar assumption). Specifically, this baseline assigns a constant depth value, the average depth of the training set, to all keypoints without spatial compensation. Our results demonstrate that this oversimplified assumption leads to significant localization errors, whereas our method estimates depth per keypoint and achieves higher accuracy. Following Section 3.1, we add the depth compensation predicted by the model to the depth baseline to obtain the final depth prediction. Although this approximation introduces additional error, our method still clearly outperforms the conventional method.

To comprehensively evaluate how well our multi-task framework achieves accurate 3D positioning, we introduce an additional metric: the mean absolute error of 3D coordinates (MAE-3D). Using the 3D coordinates computed as described in Section 3.6, this metric directly quantifies the average Euclidean distance between the predicted 3D keypoint coordinates and their corresponding ground truth 3D coordinates.
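A minimal sketch of MAE-3D is shown below. It assumes that predicted and ground truth 3D points are already paired (for example through the 2D matching step) and expressed in the same unit; the sample coordinates are made up.

```python
import math


def mae_3d(pred_points, gt_points):
    """Mean Euclidean distance between paired 3D keypoints."""
    distances = [math.dist(p, g) for p, g in zip(pred_points, gt_points)]
    return sum(distances) / len(distances)


# Hypothetical paired points in cm; the distances are 5.0 and 0.0.
pred_3d = [(0.0, 0.0, 30.0), (1.0, 2.0, 32.0)]
gt_3d = [(3.0, 4.0, 30.0), (1.0, 2.0, 32.0)]
mae3d_value = mae_3d(pred_3d, gt_3d)
```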

These results strongly prove that our proposed method, which estimates per-point depth compensation through deep learning, can significantly overcome the limitations of the traditional planar assumption, greatly enhancing the accuracy and robustness of 3D keypoint localization for weeds, thus laying a solid technical foundation for efficient and precise automated laser weeding.

4.5. Ablation Study

To systematically evaluate the effectiveness of each innovative module in our framework and the synergistic gain of multi-task learning, we designed a series of ablation experiments. By comparing the baseline performance under the “decoupled” and “coupled” frameworks, we can quantify the practical value of joint multi-task learning. By analyzing the performance of each module under different frameworks, we can clearly identify the sources of model performance improvement. In the “decoupled” setting, the two tasks use two separate encoders and the CBA is removed so that no information is exchanged between them, while the GFF and HDB are retained.

As shown in Figure 15, our proposed full model demonstrates excellent performance in weed keypoint detection, capable of accurately identifying and localizing the vast majority of weed keypoints, representing a significant improvement over the baseline method, especially in missed detection rate. Notably, the model also exhibits strong generalization capabilities, even effectively detecting weed instances that might have been missed in the ground truth annotations.

Regarding depth estimation, as depicted in Figure 16, the baseline model exhibits considerable single-point depth estimation errors along with noticeable noise. The integration of the Hybrid Domain Block (HDB) module effectively handles these high-frequency noises and details, producing smoother results that more closely approximate the true depth distribution, thereby significantly improving the quality and robustness of the predicted depths. Furthermore, the Gated Feature Fusion (GFF) module, through adaptive feature integration, and the Cross-Branch Attention (CBA) module, by facilitating inter-task information synergy, collectively enable the model to better learn and optimize depth feature representations, leading to more accurate and stable depth estimation results. Consequently, with the combined assistance of these three innovative modules, the full model achieves high-quality depth predictions, providing a solid foundation for subsequent 3D spatial localization.

The ablation study systematically validated the effectiveness of each proposed module (GFF, HDB, CBA) and the multi-task learning (MTL) framework’s “coupled” design. The complete model achieved the best performance across all metrics, underscoring the strong synergistic effects among GFF for adaptive feature fusion, HDB for detail preservation, and CBA for inter-task synergy. Crucially, experiments with a “decoupled” framework demonstrated a significant performance drop, confirming that sharing an encoder within an MTL setup is fundamental for learning more discriminative features and enabling robust information exchange. Specific values and standard deviations are given in Table 3 and Table 4.

Our ablation studies reveal a compelling advantage of coupling the keypoint detection and depth estimation tasks. By employing a shared encoder for both tasks, our framework not only significantly conserves computational resources compared to decoupled approaches but also yields superior performance across all metrics. This enhancement is further bolstered by the introduction of the Cross-Branch Attention mechanism, which facilitates information exchange between the task-specific branches. The consistent improvements demonstrate that the feature representations learned within a coupled framework, especially with inter-task attention, are more discriminative and beneficial for both keypoint detection and depth estimation. This strongly indicates the existence of significant information complementarity between these two tasks.

Beyond the benefits of task coupling, our investigations highlight the crucial contributions of each proposed module. Each innovative component—the Gated Feature Fusion (GFF) for adaptive feature integration, the Hybrid Domain Block (HDB) for robust detail preservation, and the Cross-Branch Attention (CBA) for enhanced inter-task synergy—independently contributed to improving model performance on their respective tasks, as well as the overall joint learning outcome. The comprehensive performance evaluation shows that the full model, integrating all three modules within the coupled framework, achieved the best results across all metrics. This conclusively validates the effectiveness and robustness of our proposed overall architecture, demonstrating the powerful complementary and synergistic advantages when these modules work in concert for joint learning tasks.

4.6. Backbone Comparison and Computational Analysis

In agricultural automation, the practicality of deep-learning models depends not only on accuracy but also on strict constraints regarding computational resource consumption and inference efficiency. Therefore, we conducted a backbone comparison experiment aimed at systematically evaluating the performance of conventional methods under different computational budgets, and using this as a reference to highlight the advantages of our lightweight, high-performance framework.

We constructed a series of standard dual-branch decoder architectures (keypoint heatmap + depth map output) based on different mainstream backbones (including MobileNetV3, ShuffleNetV2, RexNet, GhostNet, and RegNet). All models were trained and evaluated under unified input resolution and training strategies. We recorded their performance metrics for keypoint localization and depth estimation tasks, along with the number of model parameters (Parameters), theoretical computational complexity (FLOPs), and inference speed measured in frames per second (FPS). All FPS results were measured on an NVIDIA RTX 4090 GPU. The results are presented in Table 5.

The backbone comparison presented in Table 6 further underscores the overall strength of our framework, demonstrating its precision and computational efficiency. While our method does not lead in FPS, it maintains competitive inference speed without significant compromise, ensuring practical deployability. More importantly, it significantly outperforms other prominent lightweight networks in key 3D localization metrics (PMAE, DMAE, AWT), indicating that our carefully designed modules effectively leverage shared representations and task interdependencies. This balance between runtime efficiency and high-precision 3D localization is essential for real-world automated laser weeding applications.

5. Conclusions

This paper addresses the need for precise 3D localization of weed targets in automated laser weeding by designing and implementing a novel, lightweight deep neural network framework based on multi-task learning. This framework can simultaneously perform 2D planar localization of weed keypoints and high-precision depth compensation estimation from only a single RGB image. Notably, this is the first time that depth estimation has been used in the form of multi-task learning to improve weed localization accuracy.

The core contributions of our work include:

Proposing an end-to-end unified architecture that efficiently integrates keypoint detection and depth estimation tasks.
Designing a series of innovative structural modules, including Gated Feature Fusion (GFF), Hybrid Domain Block (HDB), and Cross-Branch Attention (CBA).

Extensive experimental results confirm the effectiveness of our method. Compared with the traditional planar-assumption baseline, our model greatly reduces the depth error, thereby improving spatial positioning accuracy, and performs well across the evaluated metrics. This indicates that our method is not only theoretically innovative but also robust and valuable for practical deployment, providing a solid technical foundation for the development of intelligent and precise agricultural robot systems.

Although this paper has achieved positive results, there is still a broad research space worth exploring. We plan to extend our current work in the following aspects:

Enhance Generalization and Robustness: We will build larger datasets containing image data of more species of weeds as well as multiple growth stages. Our future work will focus on validating and optimizing the model across diverse crop types, growth stages, and challenging field conditions (e.g., varying lighting and weather). We also plan to explore domain adaptation techniques to improve cross-scene generalization capabilities.
Reduce Annotation Burden: To mitigate the high cost of manual data annotation, we will investigate weak, semi-supervised, and self-supervised learning paradigms. This includes leveraging multi-view geometric constraints or time-series information for pseudo-label generation, aiming to significantly reduce reliance on human labeling efforts.
Robust Absolute Depth Estimation: While our current approach focuses on depth compensation leveraging a fixed operating height, future work will explore methods for robustly acquiring absolute depth measurements. This includes integrating low-cost auxiliary sensors (e.g., ultrasonic sensors, structured light) or advanced monocular depth estimation techniques that can reduce the inherent error in absolute depth prediction, thus broadening applicability to scenarios with variable robot heights.
Final Deployment Optimization: We aim for extreme model lightweighting through techniques such as knowledge distillation, network pruning, and quantization. This will enable real-time inference on resource-constrained embedded platforms, ultimately achieving seamless software-hardware integration for practical application deployment.
Perception-to-Action Closed-Loop Autonomy: Utilizing the high-precision 3D coordinates from our model, we envision constructing a complete 3D scene reconstruction and task planning module. This includes optimizing multi-target striking sequences and dynamically adjusting laser parameters based on energy models, ultimately forming a closed-loop system from environmental perception to intelligent decision making and precise execution, thereby endowing intelligent weeding robots with a higher level of autonomy.

Author Contributions

S.X.: Investigation; Methodology; Software; Writing—Original Draft. T.Q.: Investigation; Data Curation. J.L.: Data Curation, Software. X.R.: Conceptualization; Methodology. Y.M.: Project Administration; Funding Acquisition; Writing—Review and Editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Shanghai Agricultural Science and Technology Innovation Project (Grant Number 2024-02-08-00-12-F00039), and the National Natural Science Foundation of China (Grant Number 32472005). The authors gratefully acknowledge the financial support, technical guidance, access to experimental facilities, and collaborative assistance provided throughout the project period.

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Yaseen, M.U.; Long, J.M. Laser Weeding Technology in Cropping Systems: A Comprehensive Review. Agronomy 2024, 14, 2253.
Li, Y.; Guo, Z.; Shuang, F.; Zhang, M.; Li, X. Key technologies of machine vision for weeding robots: A review and benchmark. Comput. Electron. Agric. 2022, 196, 106880.
Lu, R.; Zhang, D.; Wang, S.; Hu, X. Progress and Challenges in Research on Key Technologies for Laser Weed Control Robot-to-Target System. Agronomy 2025, 15, 1015.
Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99.
Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 172–186.
Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703.
Osorio, K.; Puerto, A.; Pedraza, C.; Jamaica, D.; Rodríguez, L. A deep learning approach for weed detection in lettuce crops using multispectral images. AgriEngineering 2020, 2, 471–488.
Sun, J.; Yang, K.; He, X.; Luo, Y.; Wu, X.; Shen, J. Beet seedling and weed recognition based on convolutional neural network and multi-modality images. Multimed. Tools Appl. 2022, 81, 5239–5258.
Nasiri, A.; Omid, M.; Taheri-Garavand, A.; Jafari, A. Deep learning-based precision agriculture through weed recognition in sugar beet fields. Sustain. Comput. Inform. Syst. 2022, 35, 100759.
Zhao, P.; Chen, J.; Li, J.; Ning, J.; Chang, Y.; Yang, S. Design and Testing of an autonomous laser weeding robot for strawberry fields based on DIN-LW-YOLO. Comput. Electron. Agric. 2025, 229, 109808.
Zhang, D.; Lu, R.; Guo, Z.; Yang, Z.; Wang, S.; Hu, X. Algorithm for Locating Apical Meristematic Tissue of Weeds Based on YOLO Instance Segmentation. Agronomy 2024, 14, 2121.
Arsa, D.M.S.; Ilyas, T.; Park, S.H.; Won, O.; Kim, H. Eco-friendly weeding through precise detection of growing points via efficient multi-branch convolutional neural networks. Comput. Electron. Agric. 2023, 209, 107830.
Lottes, P.; Behley, J.; Chebrolu, N.; Milioto, A.; Stachniss, C. Joint stem detection and crop-weed classification for plant-specific treatment in precision farming. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, Madrid, Spain, 1–5 October 2018; pp. 8233–8238.
Li, J.; Güldenring, R.; Nalpantidis, L. Real-time joint-stem prediction for agricultural robots in grasslands using multi-task learning. Agronomy 2023, 13, 2365.
Zhu, H.; Zhang, Y.; Mu, D.; Bai, L.; Zhuang, H.; Li, H. YOLOX-based blue laser weeding robot in corn field. Front. Plant Sci. 2022, 13, 1017803.
Coll-Ribes, G.; Torres-Rodríguez, I.J.; Grau, A.; Guerra, E.; Sanfeliu, A. Accurate detection and depth estimation of table grapes and peduncles for robot harvesting, combining monocular depth estimation and CNN methods. Comput. Electron. Agric. 2023, 215, 108362.
Tamrakar, N.; Paudel, B.; Karki, S.; Deb, N.C.; Arulmozhi, E.; Kook, J.H.; Kang, M.Y.; Kang, D.Y.; Ogundele, O.M.; Nakarmi, B.; et al. Peduncle Detection of Ripe Strawberry to Localize Picking Point using DF-Mask R-CNN and Monocular Depth. IEEE Access 2025, 13, 73889–73902.
Cui, X.Z.; Feng, Q.; Wang, S.Z.; Zhang, J.H. Monocular depth estimation with self-supervised learning for vineyard unmanned agricultural vehicle. Sensors 2022, 22, 721.
Kim, W.S.; Lee, D.H.; Kim, Y.J.; Kim, T.; Lee, W.S.; Choi, C.H. Stereo-vision-based crop height estimation for agricultural robots. Comput. Electron. Agric. 2021, 181, 105937.
Vandenhende, S.; Georgoulis, S.; Van Gansbeke, W.; Proesmans, M.; Dai, D.; Van Gool, L. Multi-task learning for dense prediction tasks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3614–3633. [Google Scholar] [CrossRef]
Goncalves, D.N.; Junior, J.M.; Zamboni, P.; Pistori, H.; Li, J.; Nogueira, K.; Goncalves, W.N. MTLSegFormer: Multi-task learning with transformers for semantic segmentation in precision agriculture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 6290–6298. [Google Scholar]
Amrani, A.; Diepeveen, D.; Murray, D.; Jones, M.G.; Sohel, F. Multi-task learning model for agricultural pest detection from crop-plant imagery: A Bayesian approach. Comput. Electron. Agric. 2024, 218, 108719. [Google Scholar] [CrossRef]
Duc, C.D.; Lim, J. X-PDNet: Accurate Joint Plane Instance Segmentation and Monocular Depth Estimation with Cross-Task Distillation and Boundary Correction. arXiv 2023, arXiv:2309.08424. [Google Scholar]
He, L.; Lu, J.; Wang, G.; Song, S.; Zhou, J. Sosd-net: Joint semantic object segmentation and depth estimation from monocular images. Neurocomputing 2021, 440, 251–263. [Google Scholar] [CrossRef]
Zhang, C.; Tang, Y.; Zhao, C.; Sun, Q.; Ye, Z.; Kurths, J. Multitask GANs for semantic segmentation and depth completion with cycle consistency. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 5404–5415. [Google Scholar] [CrossRef] [PubMed]
Zhang, H.; Gu, D. Deep Multi-task Learning for Animal Chest Circumference Estimation from Monocular Images. Cogn. Comput. 2024, 16, 1092–1102. [Google Scholar] [CrossRef]
Lottes, P.; Behley, J.; Chebrolu, N.; Milioto, A.; Stachniss, C. Robust joint stem detection and crop-weed classification using image sequences for plant-specific treatment in precision farming. J. Field Robot. 2020, 37, 20–34. [Google Scholar] [CrossRef]
Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
Sun, L.; Dong, J.; Tang, J.; Pan, J. Spatially-adaptive feature modulation for efficient image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 13190–13199. [Google Scholar]
Shen, Z.; Zhang, M.; Zhao, H.; Yi, S.; Li, H. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 3531–3539. [Google Scholar]
Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]

Figure 1. Illustration of how a laser system might miss or hit a weed, and how depth compensation can improve targeting.

Figure 2. Network architecture overview.

Figure 3. Gated Feature Fusion.

Figure 4. Frequency unit.

Figure 5. Spatial block.

Figure 6. Cross-branch attention.

Figure 7. Data acquisition site.

Figure 8. Examples of weed species in the dataset.

Figure 9. Distribution of dataset.

Figure 10. Data sample. From left to right are the RGB image, depth map, and Gaussian heatmap.

Figure 11. Illustration of keypoint matching.

Figure 12. Box plots of the growing point detection performance metrics.

Figure 13. Visualization of missed predictions and ground truth omissions. The cross symbol indicates the predicted keypoint location. The black circle highlights correctly predicted but unlabeled samples. The gold circle marks missed predictions with true labels, while the orange circle denotes samples with missing ground truth annotations.

Figure 14. Depth estimation performance comparison.

Figure 15. Ablation study on keypoint detection performance.

Figure 16. Ablation study on depth estimation performance.

Table 1. Keypoint detection performance.

Metric Category	P (%)	R (%)	PMAE (pix)	HIoU (%)
Average	85.44	87.05	2.37	80.45

Table 2. Comparison of our method with baseline.

Metric Category	DMAE-Comp	AWT-Comp	DMAE-Final	AWT-Final	MAE-3D
WeedLoc3D	0.8358 cm	83.46%	1.2243 cm	67.84%	1.238 cm
Traditional Method	1.4547 cm	62.28%	1.5048 cm	51.22%	1.516 cm
Improvement	42.54%	33.99%	18.64%	32.45%	18.33%

Note: All values are averaged over the test set.
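
The Improvement row of Table 2 can be checked with a short calculation. The sketch below assumes that improvement is the relative change with respect to the traditional method, signed so that a positive value means WeedLoc3D is better (lower for the error metrics, higher for AWT). The function name `relative_improvement` and the `table2` dictionary are illustrative and not part of the original work; the numeric inputs are copied from Table 2.

```python
def relative_improvement(ours, baseline, higher_is_better):
    """Percentage change of `ours` relative to `baseline`; positive means better."""
    # Assumption: the improvement is normalized by the baseline value.
    delta = (ours - baseline) if higher_is_better else (baseline - ours)
    return 100.0 * delta / baseline


# (WeedLoc3D value, traditional-method value, higher_is_better), from Table 2.
table2 = {
    "DMAE-Comp": (0.8358, 1.4547, False),   # cm, lower is better
    "AWT-Comp": (83.46, 62.28, True),       # %, higher is better
    "DMAE-Final": (1.2243, 1.5048, False),  # cm, lower is better
    "AWT-Final": (67.84, 51.22, True),      # %, higher is better
    "MAE-3D": (1.238, 1.516, False),        # cm, lower is better
}

improvements = {
    name: relative_improvement(ours, baseline, higher)
    for name, (ours, baseline, higher) in table2.items()
}

for name, pct in improvements.items():
    print(f"{name}: {pct:.2f}%")
```

Under this assumption the computed values closely match the reported row; differences in the last digit for AWT-Comp and MAE-3D may reflect rounding of the tabulated inputs.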

Table 3. Ablation study (Part 1). Note: ↑ indicates that higher values are better, while ↓ indicates that lower values are preferred. Bold values indicate the best performance among all configurations.

Coupled	GFF	HDB	CBA	P (%) ↑	R (%) ↑	PMAE (pix) ↓
√	√	×	×	82.90 ± 0.46	85.59 ± 0.83	2.63
√	×	√	×	85.40 ± 0.39	84.47 ± 0.37	2.37
√	×	×	√	85.10 ± 0.88	85.94 ± 0.51	2.60
√	√	√	√	85.44 ± 0.52	87.05 ± 0.65	2.37
√	×	×	×	81.77 ± 0.38	85.44 ± 0.94	2.75
×	√	√	×	85.30 ± 0.36	84.18 ± 0.19	2.51

Table 4. Ablation study (Part 2).

Coupled	GFF	HDB	CBA	HIoU (%) ↑	DMAE (cm) ↓	AWT (%) ↑
√	√	×	×	77.28 ± 0.55	0.9295	79.69 ± 0.33
√	×	√	×	77.96 ± 0.80	0.9007	80.87 ± 0.65
√	×	×	√	77.38 ± 0.25	0.9090	80.02 ± 0.77
√	√	√	√	80.45 ± 0.47	0.8358	83.46 ± 0.39
√	×	×	×	73.42 ± 0.93	1.0577	74.11 ± 0.29
×	√	√	×	78.40 ± 0.29	0.9475	78.99 ± 0.48

Table 5. Efficiency comparison across backbone networks.

Backbone	Parameters (M)	FLOPs (G)	FPS
MobileNetV3	3.20	1.66	186.88
WeedLoc3D	6.09	5.85	98.24
RexNet	8.54	5.98	133.92
ShuffleNetV2	7.60	6.57	177.16
GhostNet	14.34	6.00	38.64
RegNet	15.37	19.62	125.57

Table 6. Keypoint location performance comparison with different backbones.

Backbone	P (%) ↑	R (%) ↑	PMAE (pix) ↓	HIoU (%) ↑	DMAE (cm) ↓	AWT (%) ↑
MobileNetV3	72.64 ± 0.97	89.26 ± 0.45	3.46	67.68 ± 0.83	1.0703	73.32 ± 0.83
WeedLoc3D	85.44 ± 0.52	87.05 ± 0.65	2.37	80.45 ± 0.47	0.8358	83.46 ± 0.39
RexNet	82.70 ± 0.22	84.01 ± 1.13	2.76	75.82 ± 0.66	0.9532	78.06 ± 0.38
ShuffleNetV2	78.48 ± 0.74	82.63 ± 1.08	3.45	73.96 ± 0.59	1.1414	70.36 ± 0.31
GhostNet	80.41 ± 0.90	86.03 ± 0.27	3.07	75.66 ± 0.91	0.9287	79.00 ± 0.28
RegNet	83.38 ± 1.00	81.50 ± 0.48	2.84	74.97 ± 0.76	1.0823	72.74 ± 0.85

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

