Article

An Object Detection Algorithm for Orchard Vehicles Based on AGO-PointPillars

School of Automotive Engineering, Shandong Jiaotong University, Jinan 250357, China
* Author to whom correspondence should be addressed.
Agriculture 2025, 15(14), 1529; https://doi.org/10.3390/agriculture15141529
Submission received: 29 May 2025 / Revised: 5 July 2025 / Accepted: 9 July 2025 / Published: 15 July 2025
(This article belongs to the Section Agricultural Technology)

Abstract

With the continuous expansion of orchard planting areas, there is an urgent need for autonomous orchard vehicles that can reduce the labor intensity of fruit farmers and improve operational efficiency. An object detection system that accurately identifies potholes, trees, and other orchard objects is essential for unmanned operation of such vehicles. To address the low recognition accuracy of existing object detection algorithms in orchard operation scenes, we propose an orchard vehicle object detection algorithm based on Attention-Guided Orchard PointPillars (AGO-PointPillars). Firstly, we use an RGB-D camera as the sensing hardware to collect orchard road information and convert the depth images obtained by the camera into 3D point cloud data. Then, Efficient Channel Attention (ECA) and the Efficient Up-Convolution Block (EUCB) are introduced into PointPillars to enhance feature extraction for orchard objects. Finally, we establish an orchard object detection dataset and validate the proposed algorithm on it. The results show that, compared with PointPillars, AGO-PointPillars improves the mean average detection precision for typical orchard objects such as potholes and trees by 4.64%, which demonstrates the reliability of our algorithm.

1. Introduction

Autonomous orchard vehicles can improve the efficiency of orchard operations and reduce the labor intensity of fruit farmers. Hilly orchard terrain is complex and varied, and potholes in uneven roads affect the stability and effectiveness of the orchard vehicle during operation. When the vehicle passes over a pothole or a raised section of road, it tilts laterally or longitudinally; if the tilt angle is too large, the vehicle may overturn. The vehicle's tilt affects both the safety of the operators on board and the accuracy of orchard operations, such as spraying the fruit trees on either side. When an operating vehicle encounters a pothole that would cause it to tilt and cannot be avoided, the size, depth, and location of the pothole need to be detected so that the vehicle's posture can be adjusted in advance, which improves the stability of travel and operation. In addition, the distribution of fruit trees in an orchard may be irregular. The vehicle needs to detect the positions of the trees during operations such as picking and spraying so that it always travels along the tree rows and operates precisely. Moreover, because the contours of fruit trees are varied and irregular, trees are easily confused with similarly shaped objects such as utility poles, which can prevent the vehicle from working accurately along the tree rows. At present, most orchard vehicles are still operated by human operators, who manually adjust the vehicle's posture to maintain stability, avoid dangerous roads that may cause the vehicle to tilt, and follow the rows of trees. Therefore, it is crucial to develop a high-precision object detection algorithm suitable for hilly orchard vehicles. Such an algorithm can quickly detect typical orchard objects such as potholes and trees and can provide recognition results for tilt leveling and for operation along the tree rows. This study is a key step toward fully autonomous operation of orchard vehicles.
Currently, object detection algorithms can be classified into two categories based on the type of data collected: object detection methods based on 2D images and object detection methods based on 3D point clouds.
Early 2D image-based object detection mainly relied on methods such as threshold segmentation, edge detection, and region growing. These methods first separate the object from the road and then classify it. However, they are susceptible to factors such as road clutter, lighting conditions, and stains, which makes it difficult to meet the needs of object detection in complex environments. In recent years, with the emergence of detection methods based on convolutional neural networks, the accuracy and efficiency of object detection have improved significantly.
Zhou et al. [1] proposed a field object detection model for unmanned agricultural machinery. They introduced the Convolutional Block Attention Module (CBAM) into the backbone network and integrated the Bi-directional Feature Pyramid Network (BiFPN) after it, which enhanced the detection of objects such as pedestrians, tractors, and utility poles in the field. Tang et al. [2] proposed a pothole detection algorithm based on feature fusion to address the difficulty of detecting potholes in complex road environments for autonomous vehicles. They designed a deformable convolution module and introduced a dynamic attention module into the backbone network to realize cross-channel fusion of pothole features, which improved pothole detection accuracy in different road scenes. Aljohani [3] proposed a heuristic road pothole detection model that uses a random forest for pothole detection and particle swarm optimization to remove irrelevant features; the reported pothole recognition accuracy reached 99.37%. Yasmin et al. [4] addressed the weak detection capability of autonomous vehicles for small objects such as barrels in high-speed driving scenes. They incorporated transfer learning into a semantic segmentation model for object detection, which improved the safety of self-driving cars on highways. Li et al. [5] proposed SS-DETR, an object detection model for autonomous vehicles. They introduced a multi-information fusion attention mechanism and built a backbone receptive-field attention model on top of it, which enhanced the detection of small and partially occluded objects.
To reduce the influence of the external environment on object detection, some scholars have proposed object detection methods based on 3D point clouds. These methods obtain three-dimensional information about the road and reduce visual interference such as lighting, shadows, and stains, and their detection accuracy and efficiency are generally better than those of 2D image-based methods. Pang et al. [6] designed a rapid obstacle detection algorithm for UGVs based on multi-sensor point cloud fusion; their ground segmentation method based on multi-plane fitting can accurately identify the boundary between the ground and obstacles. Wang et al. [7] proposed a real-time obstacle perception method for UAVs that addresses the difficulty of sensing obstacles in low-light environments by constructing a novel tracker based on the Kernel Correlation Filter (KCF) fused with a Kalman filter. Liu et al. [8] proposed a dual-LiDAR obstacle detection method for the difficult detection of potholes and other concave obstacles by autonomous land vehicles in field environments. They used an adaptive-threshold point cloud segmentation method with a horizontally mounted LiDAR to realize ground segmentation and positive obstacle detection, and a vertically mounted LiDAR with feature point clustering to realize negative obstacle detection.
In recent years, with the rise of convolutional neural network algorithms, the accuracy of obstacle detection based on 3D point clouds has been further improved. Qin et al. [9] designed an obstacle detection method based on focal voxel R-CNN for the farmland environment. They converted the point cloud data collected by LiDAR into a BEV feature map and used the improved voxel R-CNN algorithm to identify obstacles, which improved the detection of smaller objects, such as pedestrians, by autonomous agricultural machines. Zhang et al. [10] proposed an obstacle detection model for agricultural machinery in the field. They used the depth information obtained from a dual camera and fused Large Separable Kernel Attention (LSKA) to locate obstacles, which enhanced the recognition of obstacles such as tractors and pedestrians by unmanned agricultural machinery. Talha et al. [11] proposed a highway pothole detection algorithm based on convolutional neural networks. They converted the point cloud data collected by LiDAR into a two-dimensional histogram and used the YOLO algorithm for pothole detection, which significantly improved the accuracy of pothole recognition. Wang et al. [12] proposed an adaptive multimodal fusion object detection algorithm for unmanned vehicles. They fused features extracted from 2D images with the original point cloud data, converted them into BEV features, and performed object extraction with a sparse convolutional neural network. Experimental results showed that the algorithm improved object detection accuracy under adverse weather conditions.
PointPillars [13] runs fast and is easy to deploy, so it is widely used in the field of autonomous driving. PointPillars takes 3D point cloud data as input and detects objects such as vehicles, pedestrians, and cyclists. It builds on VoxelNet [14] and SECOND [15]: the point cloud is first divided into vertical pillars, the features within each pillar are extracted and compressed in a high-dimensional space to generate a pseudo-image, and a 2D neural network is then used for object detection. However, pillarization and pseudo-image generation discard a large amount of detailed information, which lowers the detection accuracy for small objects. To further improve the accuracy and robustness of object detection, researchers have proposed many improved PointPillars algorithms. Le et al. [16] introduced a pillar-aware attention module into the original algorithm and proposed a weighted multi-scale feature fusion network to improve pedestrian detection. The ET-PointPillars algorithm proposed by Liu et al. [17] introduced a feature point sampling module based on OVD into the Pillar Feature Network (PFN) of PointPillars to expand the feature dimension of the point cloud pillars, which enhanced the detection of small objects such as pedestrians and cyclists. Zhang et al. [18] addressed the loss of point cloud features caused by focusing only on local features during the pillar encoding of PointPillars; they used a transformer module to process global positional features and local structural features simultaneously, which improved the detection of objects such as cars and pedestrians. Shu et al. [19] proposed an improved PointPillars algorithm based on feature enhancement. They added channel and spatial attention mechanisms to the backbone to strengthen its feature extraction capability and introduced the up-sampling modules CARAFE and DySample to dynamically adjust the resolution of the feature maps, which avoided the loss and blurring of obstacle features and improved object detection accuracy in complex road scenes.
Most point cloud-based object detection algorithms rely on LiDAR to perceive the surrounding environment, which leads to high hardware costs and large sensor size. Moreover, LiDAR generates a large amount of point cloud data that is irrelevant to the detected objects, which increases the burden on the computing platform. Currently, PointPillars and most of its improved variants are mainly designed for autonomous vehicles on well-maintained urban roads and expressways. The orchard operation scene, by contrast, has complex terrain. Potholes are often located on uneven roads and contain leaves and small branches, which makes them harder to recognize, and the shapes of the trees are diverse and irregular compared with streetlamps and power poles in the city. Existing object detection algorithms are therefore less accurate at detecting orchard objects such as hidden potholes and irregularly shaped trees and sometimes fail to recognize them at all. To address the high price and large size of LiDAR, this study uses a lower-cost RGB-D camera as the object sensing hardware. To address the low recognition accuracy of existing algorithms in the orchard scene, we propose an object detection algorithm for orchard vehicles: Attention-Guided Orchard PointPillars (AGO-PointPillars). Based on the PointPillars algorithm, we add an Efficient Channel Attention (ECA) module [20] after the Pillar Feature Network and an Efficient Up-Convolution Block (EUCB) [21] to the backbone network. The main contributions of this study can be summarized as follows:
  • The RGB-D camera is used to replace the lidar as the object sensing hardware for data acquisition, and the acquired depth image data are converted into 3D point cloud data. The 3D point cloud data can lay the foundation for the subsequent object detection algorithm of orchard vehicles;
  • An orchard vehicle object detection algorithm is proposed to introduce the ECA module and the EUCB module to enhance the capability of feature extraction for orchard objects;
  • An orchard object detection dataset with multiple scenes is constructed based on the KITTI Vision Benchmark. We verify the effectiveness of the object detection algorithm for orchard vehicles by using the constructed dataset and comparing the proposed algorithm with others.

2. Methods

2.1. Data Acquisition and Preprocessing

Data acquisition and preprocessing of orchard objects were carried out using an orchard vehicle developed by our research group as the research platform. The Gemini335L RGB-D camera was selected as the sensing hardware; it uses a binocular infrared structured-light sensor to collect depth images and can operate under strong outdoor light. The camera has a maximum ideal working range of 6 m, which meets the working conditions of orchard vehicles. An aluminum profile bracket is mounted above the shell of the orchard vehicle, and the depth camera is installed on this bracket. A laptop computer connected to the depth camera serves as the control center for data acquisition and preprocessing. The hardware installation positions are shown in Figure 1.
(1) Data Acquisition
The orchard scene was selected for collecting data on the target objects. To eliminate the aberrations introduced during the manufacture and assembly of the camera, we calibrated the RGB-D camera with the checkerboard method to obtain its intrinsic parameters. The calibration process is shown in Figure 2. The ROS system installed on Ubuntu 20.04 was used to record the data of the orchard objects. Figure 3 shows one frame of the data: Figure 3a is the RGB image, and Figure 3b is the depth image.
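For illustration, a minimal sketch of checkerboard intrinsic calibration with OpenCV is given below. The board geometry, square size, and image folder are assumptions made for the example and are not values reported in this study.

```python
import glob
import cv2
import numpy as np

# Assumed checkerboard geometry (inner corners) and square size; not from this study.
BOARD_SIZE = (9, 6)
SQUARE_SIZE = 0.025  # meters

# 3D coordinates of the board corners in the board's own frame (Z = 0 plane).
objp = np.zeros((BOARD_SIZE[0] * BOARD_SIZE[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:BOARD_SIZE[0], 0:BOARD_SIZE[1]].T.reshape(-1, 2) * SQUARE_SIZE

obj_points, img_points = [], []
for path in glob.glob("calib_images/*.png"):  # hypothetical folder of checkerboard photos
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, BOARD_SIZE)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

# K holds the intrinsics fx, fy, cx, cy that appear in Equations (3) and (4).
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("reprojection RMS:", rms)
print("intrinsic matrix K:\n", K)
```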
(2) Data Preprocessing
Since the object detection algorithm for the orchard vehicle uses point cloud data in BIN format for training and validation, it is necessary to convert the depth image data captured by the RGB-D camera into 3D point cloud data.
Figure 4 shows the pinhole camera model. According to the triangle similarity principle, the relationship between world coordinate system points and camera coordinate system points is as follows:
$$ X' = f\frac{X}{Z}, \qquad Y' = f\frac{Y}{Z} \qquad (1) $$
where f indicates the focal length of the camera; X, Y, and Z indicate the coordinates of a space point in the world coordinate system; and X′ and Y′ indicate the coordinates of an image point in the camera coordinate system.
The relation between coordinate P′ and pixel coordinate [u, v]T is as follows:
$$ u = \alpha X' + c_x, \qquad v = \beta Y' + c_y \qquad (2) $$
where α and β indicate the scaling coefficients, and cx and cy indicate the coordinates of the camera's principal point.
Substituting Equation (1) into Equation (2) and writing fx = αf and fy = βf gives:
$$ u = f_x\frac{X}{Z} + c_x, \qquad v = f_y\frac{Y}{Z} + c_y \qquad (3) $$
where fx and fy indicate the camera’s intrinsics.
Rearranging Equation (3) into matrix form:
$$ Z\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} \qquad (4) $$
Equation (4) relates each pixel coordinate (u, v) to a 3D point through the camera intrinsics; since the depth value Z is known for every pixel, the equation can be inverted to back-project the depth image into a 3D point cloud. Figure 5 shows an example of the depth image conversion result.
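As a minimal sketch of this preprocessing step (assuming a millimeter depth scale and the intrinsics obtained from the calibration above, which are illustrative assumptions), the following function back-projects a depth image into a point cloud and writes it in a KITTI-style BIN layout; the intensity channel is set to zero because the depth camera provides no reflectance value.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy, depth_scale=0.001):
    """Back-project an H x W depth image into an N x 3 point cloud by inverting Equation (4).

    depth_scale converts raw depth units to meters (assumed: millimeter input)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth.astype(np.float32) * depth_scale
    valid = z > 0                                    # drop pixels with no depth measurement
    x = (u - cx) * z / fx                            # X = (u - cx) * Z / fx
    y = (v - cy) * z / fy                            # Y = (v - cy) * Z / fy
    return np.stack([x[valid], y[valid], z[valid]], axis=-1)

# Example: save as a KITTI-style BIN file (x, y, z, intensity) for training.
# pts = depth_to_point_cloud(depth_img, fx, fy, cx, cy)
# pts4 = np.hstack([pts, np.zeros((pts.shape[0], 1), np.float32)]).astype(np.float32)
# pts4.tofile("000000.bin")
```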

2.2. Attention-Guided Orchard PointPillars

The PointPillars algorithm uses a two-dimensional convolutional neural network for feature sampling in the backbone network. Although PointPillars is computationally efficient, it is less accurate at detecting objects such as potholes and trees in the orchard operation environment.

2.2.1. The Architecture of AGO-PointPillars

To improve the object recognition accuracy of the original algorithm in the orchard operation scenes, we propose an orchard vehicle object detection algorithm: AGO-PointPillars. Firstly, the Pillar Feature Network reduces the input point cloud data to a two-dimensional pseudo-image. Secondly, the Efficient Channel Attention module is used to enhance the spatial sensitivity of the algorithm and improve the representation ability for the orchard object features. Thirdly, the pseudo-image is sent to the backbone network integrated with EUCB to learn multi-scale orchard object features. Finally, the detection head is used to achieve object detection and regression of parameters such as the position, classification, and orientation of the three-dimensional detection box. The architecture of the AGO-PointPillars algorithm is shown in Figure 6.
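The skeleton below sketches how these four stages could be composed in PyTorch. The sub-modules are placeholders for the components described in Sections 2.2.2–2.2.5; their internal layer configurations are not reproduced here.

```python
import torch.nn as nn

class AGOPointPillars(nn.Module):
    """Schematic composition of the AGO-PointPillars pipeline (Figure 6)."""
    def __init__(self, pfn, eca, backbone_with_eucb, ssd_head):
        super().__init__()
        self.pfn = pfn                      # Pillar Feature Network -> pseudo-image (C, H, W)
        self.eca = eca                      # Efficient Channel Attention on the pseudo-image
        self.backbone = backbone_with_eucb  # 2D CNN backbone with EUCB up-sampling
        self.head = ssd_head                # SSD head: boxes, classes, orientation

    def forward(self, pillars, coords):
        pseudo_image = self.pfn(pillars, coords)
        pseudo_image = self.eca(pseudo_image)
        features = self.backbone(pseudo_image)
        return self.head(features)
```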

2.2.2. Pillar Feature Network

The point cloud data are transformed into a pseudo-image by the Pillar Feature Network. The PFN first divides the point cloud in the x–y plane into multiple pillar units and then encodes each point as a D-dimensional vector (x, y, z, r, xc, yc, zc, xp, yp), where D = 9; x, y, z, and r indicate the coordinates and reflection intensity of the point; xc, yc, and zc indicate the offsets of the point from the centroid of all points in the pillar; and xp and yp indicate the offsets of the point from the geometric center of the pillar.
Then, PointNet [22] is used to process the pillars. Each point with D-dimensional features passes through a linear layer, a BN layer, and a ReLU activation function, which converts its dimension from D to C and yields a tensor of dimensions (C, P, N), where P is the number of non-empty pillars and N is the number of points in each pillar. Next, a max pooling operation over the points in each pillar produces the feature vector of that pillar, giving a tensor of dimensions (C, P).
Finally, the (C, P) tensor is scattered back to a pseudo-image of size (C, H, W) according to the pillar index of each feature vector and sent to the ECA module.
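A minimal sketch of this encoding is given below; the output channel count and pseudo-image grid size are assumed values, and the padding and masking details of the full implementation are omitted.

```python
import torch
import torch.nn as nn

class SimplePFN(nn.Module):
    """Per-point Linear-BN-ReLU, max pooling over each pillar, and scattering to a pseudo-image.
    D = 9 input features per point; C and the grid size (H, W) are assumed values."""
    def __init__(self, d_in=9, c_out=64, grid_hw=(496, 432)):
        super().__init__()
        self.linear = nn.Linear(d_in, c_out)
        self.bn = nn.BatchNorm1d(c_out)
        self.grid_hw = grid_hw

    def forward(self, pillars, coords):
        # pillars: (P, N, D) padded points per pillar; coords: (P, 2) integer grid indices (row, col)
        x = self.linear(pillars)                          # (P, N, C)
        x = self.bn(x.permute(0, 2, 1)).permute(0, 2, 1)  # normalize over the channel dimension
        x = torch.relu(x)
        x = x.max(dim=1).values                           # (P, C): one feature vector per pillar
        h, w = self.grid_hw
        canvas = x.new_zeros(x.shape[1], h, w)            # empty (C, H, W) pseudo-image
        canvas[:, coords[:, 0], coords[:, 1]] = x.t()     # scatter pillar features by grid index
        return canvas
```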

2.2.3. Efficient Channel Attention Module

The attention mechanism enhances the representation of important information by mimicking human cognition, which makes better use of limited computational resources when processing large amounts of information. We introduce the Efficient Channel Attention module to further extract the features of orchard objects and to alleviate the feature redundancy caused by the fusion of different detection objects. The ECA module is shown in Figure 7.
The ECA module takes the pseudo-image with height H, width W, and channel dimension C as input and uses a convolutional operation to assign an attention weight to each channel, which distinguishes the importance of different orchard objects. Firstly, the spatial dimensions of the pseudo-image are compressed by a global average pooling layer to obtain a feature vector that contains only channel information.
Secondly, a one-dimensional convolution with an adaptively sized kernel is applied to this feature vector. When the ECA module calculates the channel weights, a fixed kernel size k limits the range of cross-channel interaction, which reduces the attention paid to important orchard objects; therefore, k is determined adaptively. In group convolutions, high-dimensional channels correspond to long-range convolutions and low-dimensional channels to short-range convolutions. Similarly, the coverage of cross-channel information interaction (the kernel size k of the one-dimensional convolutional layer) should grow with the channel dimension C, i.e., there is a mapping between k and C. A linear mapping is too limited to describe this relationship, and the channel dimension is usually a power of 2, so an exponential function with base 2 is used to represent the nonlinear mapping:
$$ C = \phi(k) = 2^{\gamma k - b} \qquad (5) $$
where C indicates the pseudo-image channel dimension, k indicates the size of the convolution kernel in the one-dimensional convolutional layer, and the values of parameters γ and b are set to 1 and 2 in this study.
Therefore, for the given channel dimension C, k is calculated as shown in Equation (6):
$$ k = \psi(C) = \left| \frac{\log_2 C}{\gamma} + \frac{b}{\gamma} \right|_{odd} \qquad (6) $$
where $|\cdot|_{odd}$ indicates the nearest odd number to the enclosed value.
Finally, the convolution result is normalized by an activation function to obtain the weight of each channel, as shown in Equation (7). Each channel weight is then multiplied element-wise with the corresponding channel of the original pseudo-image, producing a reweighted pseudo-image that reflects the importance of the orchard objects. In this study, σ is the Sigmoid function.
$$ \omega = \sigma\left( \mathrm{C1D}_k(y) \right) \qquad (7) $$
where ω indicates the channel weights, σ indicates the Sigmoid activation function that maps the channel-interaction result to weights, and C1D_k indicates a one-dimensional convolutional layer with kernel size k applied to the pooled channel descriptor y.
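A compact PyTorch sketch of the ECA module described above is shown below. It uses the adaptive kernel size of Equation (6) with γ = 1 and b = 2 as stated in the text; the exact configuration used in this study may differ. In AGO-PointPillars, this module sits between the PFN output and the backbone, so it reweights the pseudo-image channels before multi-scale feature extraction.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention over a (B, C, H, W) pseudo-image."""
    def __init__(self, channels, gamma=1, b=2):
        super().__init__()
        k = int(abs(math.log2(channels) / gamma + b / gamma))
        k = k if k % 2 == 1 else k + 1           # Equation (6): keep the kernel size odd
        self.avg_pool = nn.AdaptiveAvgPool2d(1)  # global average pooling over H x W
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        y = self.avg_pool(x)                               # (B, C, 1, 1) channel descriptor
        y = self.conv(y.squeeze(-1).transpose(1, 2))       # 1D convolution across channels
        w = self.sigmoid(y.transpose(1, 2).unsqueeze(-1))  # (B, C, 1, 1) channel weights, Eq. (7)
        return x * w                                       # reweight each pseudo-image channel
```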

2.2.4. Improved 2D CNN Backbone Network

The backbone network extracts multi-scale features by convolutional down-sampling and then restores them to the same scale after up-sampling for feature fusion. To address the problem of insufficient feature extraction capability of the backbone network, this study integrates the Efficient Up-Convolution Block into the backbone network. The EUCB is shown in Figure 8.
The EUCB first up-samples the convolved feature maps (Up(·)). Secondly, a Depth-wise Convolution (DWC(·)) performs the convolution separately for each channel. Then, Batch Normalization (BN(·)) and the ReLU activation (ReLU(·)) are applied to strengthen the mapping of object features. Finally, a 1 × 1 convolution (C1×1(·)) reduces the number of channels to match the next stage. The EUCB is expressed in Equation (8):
$$ \mathrm{EUCB}(x) = \mathrm{C}_{1\times1}\left( \mathrm{ReLU}\left( \mathrm{BN}\left( \mathrm{DWC}\left( \mathrm{Up}(x) \right) \right) \right) \right) \qquad (8) $$
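A minimal PyTorch sketch of the block is given below; the depth-wise kernel size and the up-sampling scale factor are assumed values, since they are not specified in the text.

```python
import torch.nn as nn

class EUCB(nn.Module):
    """Efficient Up-Convolution Block of Equation (8): Up -> DWC -> BN -> ReLU -> 1x1 Conv."""
    def __init__(self, in_channels, out_channels, scale=2, kernel_size=3):
        super().__init__()
        self.up = nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False)
        self.dwc = nn.Conv2d(in_channels, in_channels, kernel_size,
                             padding=kernel_size // 2, groups=in_channels, bias=False)
        self.bn = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.c1x1 = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x):
        return self.c1x1(self.relu(self.bn(self.dwc(self.up(x)))))
```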

2.2.5. Detection Head

The Single Shot Detector (SSD) algorithm [23] is used for 3D object detection in this study. The feature maps output by the backbone network are used for object recognition following the Visual Geometry Group (VGG) design, on top of which convolutional layers of different sizes are added to capture object information at different scales. The SSD first generates multiple candidate boxes near the location of each object in the 2D pseudo-image. It then classifies each candidate box and fine-tunes its position to determine the final object location. Finally, an additional height regression is performed to output the final 3D detection box. The structure of the detection head network is shown in Figure 9.

2.3. Training Loss Evaluation Indicators

This study adopts the same loss function as the PointPillars algorithm as the evaluation indicator during training. The detection head predicts the position of the detection box, the object class, and the object orientation, so the loss function consists of three parts: the detection box localization loss, the classification loss, and the orientation loss. Each three-dimensional detection box is represented by a 7-dimensional vector (x, y, z, w, l, h, θ), where (x, y, z) is the center coordinate of the detection box; w, l, and h are, respectively, its width, length, and height; and θ is its rotation angle around the z-axis. A schematic diagram of the detection box is shown in Figure 10. The regression residuals between the prior (anchor) box predicted by the detection head and the ground-truth three-dimensional bounding box are defined as follows:
$$ \Delta x = \frac{x^{gt} - x^{a}}{d^{a}}, \quad \Delta y = \frac{y^{gt} - y^{a}}{d^{a}}, \quad \Delta z = \frac{z^{gt} - z^{a}}{d^{a}}, $$
$$ \Delta w = \log\frac{w^{gt}}{w^{a}}, \quad \Delta l = \log\frac{l^{gt}}{l^{a}}, \quad \Delta h = \log\frac{h^{gt}}{h^{a}}, \quad \Delta\theta = \sin(\theta^{gt} - \theta^{a}) \qquad (9) $$
where xgt, ygt, zgt, wgt, lgt, hgt, and θgt indicate the position parameters of the object's ground-truth bounding box; xa, ya, za, wa, la, ha, and θa indicate the position parameters of the predicted prior box; and $d^{a} = \sqrt{(w^{a})^{2} + (l^{a})^{2}}$.
The localization loss (Lls) of the detection box uses the SmoothL1 function, which effectively avoids gradient explosion in the localization loss:
$$ L_{ls} = \sum_{b \in (x, y, z, w, l, h, \theta)} \mathrm{SmoothL1}(\Delta b) \qquad (10) $$
where Δb indicates the residual of the localization regression.
To address the imbalance between positive and negative samples in object detection, the classification loss (Lcls) uses the Focal Loss function, as shown in Equation (11):
$$ L_{cls} = -\alpha_{a}\,(1 - p^{a})^{\gamma}\log p^{a} \qquad (11) $$
where pa indicates the category probability of the a priori value of the detection frame, and α and γ indicate the weighting factors, which are taken as α = 0.25 and γ = 2.
In addition, to avoid orientation discrimination errors, an orientation loss Ldir is introduced to learn the orientation of the detection box. The overall loss function is as follows:
$$ L = \frac{1}{N_{pos}}\left( \beta_{ls} L_{ls} + \beta_{cls} L_{cls} + \beta_{dir} L_{dir} \right) \qquad (12) $$
where L indicates the overall loss function; Npos indicates the number of positive samples; and Lls, Lcls, and Ldir, respectively, indicate the loss functions for the detection frame position, classification, and orientation angle, which are weighted with βls = 2, βcls = 1, and βdir = 0.2.
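The sketch below shows how the terms of Equations (9)–(12) fit together. It is a simplified stand-in rather than the exact OpenPCDet implementation: the classification term operates on the predicted probability of the assigned class for each anchor instead of the full multi-class focal loss, and anchor matching is assumed to have been done elsewhere.

```python
import torch
import torch.nn.functional as F

def box_residuals(gt, anchor):
    """Regression targets of Equation (9); gt and anchor are (..., 7) tensors (x, y, z, w, l, h, theta)."""
    xg, yg, zg, wg, lg, hg, tg = gt.unbind(-1)
    xa, ya, za, wa, la, ha, ta = anchor.unbind(-1)
    da = torch.sqrt(wa ** 2 + la ** 2)               # anchor diagonal d^a
    return torch.stack([
        (xg - xa) / da, (yg - ya) / da, (zg - za) / da,
        torch.log(wg / wa), torch.log(lg / la), torch.log(hg / ha),
        torch.sin(tg - ta)], dim=-1)

def detection_loss(pred_deltas, target_deltas, cls_prob, dir_loss, num_pos,
                   alpha=0.25, gamma=2.0, beta_ls=2.0, beta_cls=1.0, beta_dir=0.2):
    loc_loss = F.smooth_l1_loss(pred_deltas, target_deltas, reduction="sum")        # Equation (10)
    cls_loss = (-alpha * (1.0 - cls_prob) ** gamma * torch.log(cls_prob)).sum()     # Equation (11)
    return (beta_ls * loc_loss + beta_cls * cls_loss + beta_dir * dir_loss) / num_pos  # Equation (12)
```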

3. Results and Discussion

3.1. Dataset

We follow the standards of the KITTI Vision Benchmark Suite [24] to establish an orchard object detection dataset. Firstly, the depth image data captured by the RGB-D camera were converted into point cloud data according to the formulas in Section 2.1. Then, the SUSTechPOINTS point cloud labeling tool was used to annotate the point clouds with 3D boxes (Figure 11), and the labeling results were saved in JSON format. Finally, a Python script was used to convert the JSON files into KITTI-format TXT files. The recognition classes were set to potholes and trees based on the importance and frequency of each object's appearance. A total of 1560 data samples were collected and divided into training, validation, and test sets in a ratio of 8:1:1.
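A sketch of the JSON-to-KITTI conversion step is given below. The SUSTechPOINTS field names ('obj_type', 'psr', 'position', 'scale', 'rotation') follow that tool's usual export format but are assumptions here, and the 2D bounding box, truncation, occlusion, and alpha fields are filled with placeholder zeros because only 3D boxes were annotated.

```python
import json

def sustech_to_kitti(json_path, txt_path):
    """Convert one SUSTechPOINTS JSON label file into a KITTI-style TXT label file."""
    with open(json_path) as f:
        objects = json.load(f)
    lines = []
    for obj in objects:
        p = obj["psr"]["position"]   # box center (x, y, z)
        s = obj["psr"]["scale"]      # box size; z/y/x mapped to KITTI h/w/l below
        r = obj["psr"]["rotation"]   # yaw stored in the z component
        # KITTI label order: type trunc occl alpha bbox(4) h w l x y z rotation_y
        lines.append("{} 0 0 0 0 0 0 0 {:.2f} {:.2f} {:.2f} {:.2f} {:.2f} {:.2f} {:.2f}".format(
            obj["obj_type"], s["z"], s["y"], s["x"], p["x"], p["y"], p["z"], r["z"]))
    with open(txt_path, "w") as f:
        f.write("\n".join(lines))
```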

3.2. Analysis of Experimental Results

This experiment used the 3D point cloud object detection framework OpenPCDet based on PyTorch 2.1.1. The computer used for model training and validation ran Ubuntu 20.04 with an Intel Ultra 7 265K processor, 32 GB of RAM, and an NVIDIA GeForce RTX 4080 Super GPU. The Adam optimizer was used with a learning rate of 0.0003, a weight decay of 0.01, and a momentum of 0.9. The trends of the total loss, position loss, classification loss, and orientation loss during training are shown in Figure 12.
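Expressed in plain PyTorch, the reported optimizer settings correspond roughly to the sketch below; in Adam the momentum of 0.9 maps to the first moment coefficient β1, and the actual training was configured through OpenPCDet rather than this code.

```python
import torch

model = torch.nn.Linear(4, 2)  # placeholder module standing in for the detection network
optimizer = torch.optim.Adam(
    model.parameters(), lr=3e-4, betas=(0.9, 0.999), weight_decay=0.01)
```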
(1) Quantitative Analysis
To validate the effectiveness of the proposed object detection algorithm for orchard vehicles, we compared AGO-PointPillars with three established detection models. The same training parameters and orchard object detection dataset were used for training, and Average Precision (AP), mean Average Precision (mAP), and detection speed were used to evaluate performance. The detection results are shown in Table 1.
The experimental results show that our algorithm outperforms the other three algorithms in the detection of orchard objects. Compared with VoxelNet, SECOND, and PointPillars, AGO-PointPillars improves pothole recognition accuracy by 17.75%, 9.43%, and 5.15%, respectively, and tree recognition accuracy by 18.10%, 13.49%, and 4.13%, respectively; the mAP increases by 17.92%, 11.46%, and 4.64%, respectively. The detection speed of AGO-PointPillars differs little from that of the original algorithm, giving a good balance between detection accuracy and efficiency.
(2) Qualitative Analysis
We selected three orchard scenes to visualize and compare the detection performance of AGO-PointPillars and PointPillars. Figure 13 shows the detection results of both algorithms in the three scenes. In Scene 1, PointPillars incorrectly recognizes two trees as one because some branches of the two trees in the middle of the scene overlap; AGO-PointPillars is not disturbed by the overlapping branches and recognizes the trees accurately. In Scene 2, PointPillars does not recognize the smaller potholes, while AGO-PointPillars detects and labels both potholes. In Scene 3, the tree shapes are not regular cylinders, which causes PointPillars to detect only the trunk portion of the tree, whereas our algorithm does not make this misrecognition.
(3) Ablation Experiment
To assess the impact of each module on the algorithm's detection performance, a series of ablation experiments was designed and analyzed. All experiments were trained on the self-built orchard object detection dataset and evaluated on the validation set. Table 2 shows the results of the ablation experiments.
The experiments show that each module in AGO-PointPillars helps to improve object detection performance. Introducing the ECA module alone improves the pothole detection accuracy from 61.53% to 63.58% and the tree detection accuracy from 81.39% to 84.52%. Introducing the EUCB module alone improves the pothole detection accuracy from 61.53% to 62.75% and the tree detection accuracy from 81.39% to 83.49%. With both ECA and EUCB added together, the pothole and tree detection accuracies improve by 5.15% and 4.13%, respectively. The accuracy thus improves when each module is added separately, and adding both modules together exceeds either module alone, which demonstrates the effectiveness of ECA and EUCB and shows that the interference between the modules is small.

4. Conclusions

The orchard operation scene is complex, with small branches, leaves, and other interfering objects. Existing object detection algorithms have low recognition accuracy for partially obscured and small objects and sometimes fail to recognize orchard objects at all. To solve these problems, an object detection algorithm for orchard vehicles based on AGO-PointPillars was proposed. Firstly, we used a low-cost RGB-D camera as the sensing hardware to collect orchard road information and converted the depth images obtained by the camera into 3D point cloud data. Then, the ECA module was introduced to enhance the algorithm's recognition of orchard objects, and the EUCB was introduced into the backbone network to further enhance its feature extraction capability. Finally, we established an orchard object detection dataset and validated the proposed algorithm on it. The experimental results showed that AGO-PointPillars was more accurate than the other algorithms evaluated. Compared with VoxelNet, SECOND, and PointPillars, AGO-PointPillars increased pothole recognition accuracy by 17.75%, 9.43%, and 5.15%, respectively, and tree recognition accuracy by 18.10%, 13.49%, and 4.13%, respectively; the mAP increased by 17.92%, 11.46%, and 4.64%, respectively. AGO-PointPillars achieved a detection speed of 58 Hz for orchard objects. These results demonstrate the effectiveness of the proposed algorithm, which can provide recognition results for tilt leveling and for operation along the tree rows of orchard vehicles. In future work, we will add more object types, such as raised road sections, pedestrians, and agricultural machinery, to the orchard object detection dataset to further improve detection across object types. We will also continue to optimize the algorithm to improve its accuracy and apply it to other fields, such as autonomous driving on city roads.

Author Contributions

Conceptualization, P.R. and X.Q.; methodology, P.R. and X.Q.; software, P.R.; validation, P.R., X.Q. and Y.S.; formal analysis, P.R.; investigation, Y.S.; resources, Y.S.; data curation, P.R.; writing—original draft preparation, P.R.; writing—review and editing, P.R.; visualization, P.R. and Q.G.; supervision, X.Q.; project administration, Q.G.; funding acquisition, X.Q. and Q.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Research and Development Program in Shandong Province (No. 2022CXGC020706) and the Municipal University Integration Development Strategy Project in Jinan (No. JNSX2023072).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhou, X.; Chen, W.; Wei, X. Improved Field Obstacle Detection Algorithm Based on YOLOv8. Agriculture 2024, 14, 2263. [Google Scholar] [CrossRef]
  2. Tang, P.; Lv, M.; Ding, Z.; Xu, W.; Jiang, M. Pothole detection-you only look once: Deformable convolution based road pothole detection. IET Image Process. 2025, 19, e13300. [Google Scholar] [CrossRef]
  3. Aljohani, A. Optimized Convolutional Forest by Particle Swarm Optimizer for Pothole Detection. Int. J. Comput. Intell. Syst. 2024, 17, 7. [Google Scholar] [CrossRef]
  4. Yasmin, S.; Durrani, M.Y.; Gillani, S.; Bukhari, M.; Maqsood, M.; Zghaibeh, M. Small obstacles detection on roads scenes using semantic segmentation for the safe navigation of autonomous vehicles. J. Electron. Imaging 2022, 31, 061806. [Google Scholar] [CrossRef]
  5. Li, X.; Deng, X.; Wu, X.; Xie, Z. SS-DETR: A strong sensing DETR road obstacle detection model based on camera sensors for autonomous driving. Meas. Sci. Technol. 2025, 36, 025105. [Google Scholar] [CrossRef]
  6. Pang, F.; Chen, Y.; Luo, Y.; Lv, Z.; Sun, X.; Xu, X.; Luo, M. A Fast Obstacle Detection Algorithm Based on 3D LiDAR and Multiple Depth Cameras for Unmanned Ground Vehicles. Drones 2024, 8, 676. [Google Scholar] [CrossRef]
  7. Wang, H.; Wang, H.; Liu, Y.-J.; Liu, L. Real-time obstacle perception method for UAVs with an RGB-D camera in low-light environments. Signal Image Video Process. 2025, 19, 256. [Google Scholar] [CrossRef]
  8. Liu, Z.; Fan, G.; Rao, L.; Cheng, S.; Chen, N.; Song, X.; Yang, D. Positive and negative obstacles detection based on dual-lidar in field environments. IEEE Robot. Autom. Lett. 2024, 9, 6768–6775. [Google Scholar] [CrossRef]
  9. Qin, J.; Sun, R.; Zhou, K.; Xu, Y.; Lin, B.; Yang, L.; Chen, Z.; Wen, L.; Wu, C. Lidar-based 3D obstacle detection using focal voxel R-CNN for farmland environment. Agronomy 2023, 13, 650. [Google Scholar] [CrossRef]
  10. Zhang, Y.; Tian, K.; Huang, J.; Wang, Z.; Zhang, B.; Xie, Q. Field Obstacle Detection and Location Method Based on Binocular Vision. Agriculture 2024, 14, 1493. [Google Scholar] [CrossRef]
  11. Talha, S.A.; Manasreh, D.; Nazzal, M.D. The Use of Lidar and Artificial Intelligence Algorithms for Detection and Size Estimation of Potholes. Buildings 2024, 14, 1078. [Google Scholar] [CrossRef]
  12. Wang, S.; Xie, X.; Li, M.; Wang, M.; Yang, J.; Li, Z.; Zhou, X.; Zhou, Z. An Adaptive Multimodal Fusion 3D Object Detection Algorithm for Unmanned Systems in Adverse Weather. Electronics 2024, 13, 4706. [Google Scholar] [CrossRef]
  13. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12697–12705. [Google Scholar]
  14. Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4490–4499. [Google Scholar]
  15. Yan, Y.; Mao, Y.; Li, B. Second: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef] [PubMed]
  16. Le, D.T.; Shi, H.; Rezatofighi, H.; Cai, J. Accurate and real-time 3D pedestrian detection using an efficient attentive pillar network. IEEE Robot. Autom. Lett. 2022, 8, 1159–1166. [Google Scholar] [CrossRef]
  17. Liu, Y.; Yang, Z.; Tong, J.; Yang, J.; Peng, J.; Zhang, L.; Cheng, W. ET-PointPillars: Improved PointPillars for 3D object detection based on optimized voxel downsampling. Mach. Vis. Appl. 2024, 35, 56. [Google Scholar] [CrossRef]
  18. Zhang, L.; Meng, H.; Yan, Y.; Xu, X. Transformer-based global PointPillars 3D object detection method. Electronics 2023, 12, 3092. [Google Scholar] [CrossRef]
  19. Shu, X.; Zhang, L. Research on PointPillars Algorithm Based on Feature-Enhanced Backbone Network. Electronics 2024, 13, 1233. [Google Scholar] [CrossRef]
  20. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  21. Rahman, M.M.; Munir, M.; Marculescu, R. Emcad: Efficient multi-scale convolutional attention decoding for medical image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 11769–11779. [Google Scholar]
  22. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  23. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. 2016; pp. 21–37. [Google Scholar]
  24. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
Figure 1. Hardware installation position of the orchard vehicle. 1. RGB-D camera; 2. laptop computer; 3. orchard vehicle.
Figure 2. RGB-D camera calibration process.
Figure 3. Examples of collected data. (a) RGB image; (b) Depth image.
Figure 4. Pinhole camera model. o indicates the optical center of the camera; f indicates the focal length of the camera; P [X,Y,Z]T indicates a space point in the world coordinate system (OwXwYwZw); P′ [X′,Y′]T indicates an image point in the camera coordinate system (OcXcYcZc); [u, v]T indicates a coordinate of P′ in the image coordinate system (OiXiYi).
Figure 5. The result of depth image conversion.
Figure 6. Architecture of AGO-PointPillars.
Figure 7. ECA module. σ indicates the Sigmoid function; C indicates the dimension of the pseudo-image channel; the value of the convolution kernel size k in the figure is 5.
Figure 8. Efficient Up-Convolution Block.
Figure 9. Detection head network structure.
Figure 10. Examples of bounding boxes. (a) Three-dimensional view; (b) top view. c indicates the center of the bounding box; (x, y, z) indicate the center coordinate of the bounding box; w, l, and h, respectively, indicate the width, length, and height of the bounding box.
Figure 11. An example of object data labeling for orchards.
Figure 12. The trends in loss values during training.
Figure 13. Detection results of AGO-PointPillars and PointPillars in three scenes. (a) shows the front view of obstacle detection results in point cloud; (b) shows the oblique view of obstacle detection results in point cloud. Yellow dashed circles mark the locations of potholes not detected by the PointPillars.
Table 1. Comparison of algorithm detection results.
| Method | Modality | AP (%) Pothole | AP (%) Tree | mAP (%) | Speed (Hz) |
|---|---|---|---|---|---|
| VoxelNet | LiDAR | 48.93 | 67.42 | 58.18 | 22 |
| SECOND | LiDAR | 57.25 | 72.03 | 64.64 | 36 |
| PointPillars | LiDAR | 61.53 | 81.39 | 71.46 | 59 |
| AGO-PointPillars | RGB-D camera | 66.68 | 85.52 | 76.10 | 58 |
Table 2. Results of ablation experiments.
| Model | AP (%) Pothole | AP (%) Tree | mAP (%) |
|---|---|---|---|
| PointPillars | 61.53 | 81.39 | 71.46 |
| PointPillars + ECA | 63.58 | 84.52 | 74.05 |
| PointPillars + EUCB | 62.75 | 83.49 | 73.12 |
| AGO-PointPillars | 66.68 | 85.52 | 76.10 |