Article

3D Semantic Map Reconstruction for Orchard Environments Using Multi-Sensor Fusion

College of Engineering, South China Agricultural University, Guangzhou 510642, China
*
Author to whom correspondence should be addressed.
Agriculture 2026, 16(4), 455; https://doi.org/10.3390/agriculture16040455
Submission received: 23 January 2026 / Revised: 9 February 2026 / Accepted: 12 February 2026 / Published: 15 February 2026
(This article belongs to the Special Issue Advances in Robotic Systems for Precision Orchard Operations)

Abstract

Semantic point cloud maps play a pivotal role in smart agriculture. They provide not only core three-dimensional data for orchard management but also empower robots with environmental perception, enabling safer and more efficient navigation and planning. However, traditional point cloud maps primarily model surrounding obstacles from a geometric perspective and fail to capture the distinctions and characteristics of individual obstacles. In contrast, semantic maps encompass semantic information and even topological relationships among objects in the environment. Furthermore, existing semantic map construction methods are predominantly vision-based, making them ill-suited to the rapid lighting changes in agricultural settings that can cause positioning failures. Therefore, this paper proposes a positioning and semantic map reconstruction method tailored for orchards. It integrates visual, LiDAR, and inertial sensors to obtain high-precision poses and point cloud maps. By combining open-vocabulary detection and semantic segmentation models, it projects two-dimensional semantic detections onto the three-dimensional point cloud, ultimately generating a point cloud map enriched with semantic information. The resulting 2D occupancy grid map is used for robotic motion planning. Experimental results on a custom dataset demonstrate that the proposed method achieves a semantic segmentation accuracy of 74.33% mIoU, a relative error of 12.4% for fruit counting, and a mean translation error of 0.038803 m for localization. The deployed semantic segmentation network FastSAM achieves a processing speed of 13.36 ms per frame. These results demonstrate that the proposed method combines high accuracy with real-time performance in semantic map reconstruction. This exploratory work provides theoretical and technical references for future research on more precise localization and more complete semantic mapping, offering broad application prospects and providing key technological support for intelligent agriculture.

1. Introduction

The construction of environmental maps is crucial for the navigation tasks of orchard robots. It not only serves as a prerequisite for autonomous robot movement but also forms the foundation for subsequent precision variable operations such as inspection, targeted spraying, selective harvesting, and digital orchard management. In many precision agricultural operations, such as orchard tree monitoring [1], pesticide spraying and yield estimation [2,3,4], weeding between rice and foxtail plants in paddy fields [5,6], and intelligent robot path planning and navigation, additional processing of fruits and the perceived environment is often required. In such scenarios, reconstructing and segmenting agricultural environments using semantic maps [7,8,9] assists agricultural robots in executing complex tasks while advancing the intelligence and automation of agricultural production. However, traditional point cloud maps can only model the environment from a geometric perspective and fail to classify the various point cloud obstacles. In contrast, semantic maps not only contain geometric information about the environment but, more importantly, also incorporate semantic information, such as the categories of obstacles, the positional relationships among obstacles, and their affiliations. By constructing semantic maps, orchard robots are endowed with the ability to understand their environment during navigation, enabling more intelligent and efficient decision-making. For instance, when combined with large language models (LLMs), semantic maps can facilitate language-guided autonomous navigation and efficient path planning.
In recent years, the advancement of Simultaneous Localization and Mapping (SLAM) technology has promoted the wide application of LiDAR and depth cameras in orchard mapping and positioning. However, existing methods for constructing three-dimensional semantic maps mostly rely on visual sensors, such as binocular cameras and RGB-D cameras [10,11]. Papadimitriou et al. [12] proposed a Graph-SLAM system that utilizes convolutional neural networks for grapevine segmentation while extracting visual features to enhance segmentation robustness. However, such sensors are severely affected by lighting and cannot function reliably under challenging conditions such as darkness or a lack of texture. Moreover, they impose a heavy computational load, their dynamic performance is limited, the maps they construct are difficult to use directly for path planning and navigation, and mapping accumulates errors that require back-end optimization. Therefore, in large-scale outdoor orchards with variable lighting and complex structures, the robustness and reliability of environmental modeling methods that rely solely on vision are often difficult to guarantee. In recent years, as the technology has matured, the cost of LiDAR has decreased, and its application in orchard modeling has become increasingly mature. Compared with visual methods, modeling based on LiDAR point clouds offers higher accuracy, stronger robustness, and better adaptability to orchard environments [13,14]. Dong et al. [15] addressed the issue of incomplete single-sided reconstruction in traditional orchard mapping by proposing a semantic mapping method that integrates reconstruction results from both sides of tree rows. This approach utilizes a pre-trained MobileNetV2 network to extract high-level semantic features and project them onto a 3D point cloud, providing effective map references for orchard mobile robots. Peng et al. [16] addressed the challenges of dense tree coverage and severe occlusion in orchard scenarios by proposing an integrated solution for UAV image-based 3D reconstruction and semantic segmentation using Neural Radiance Fields (NeRF). This approach deeply integrates semantic segmentation with NeRF reconstruction, achieving high-precision individual tree segmentation and semantic information completion for occluded areas. However, positioning drift and semantic inconsistencies still occur when handling large scenes and long tree rows. Peng et al. [9] focused on semantic mapping requirements in unstructured orchard environments. By combining an improved MobileNetV2 architecture with a fully connected CRF framework, they constructed a semantic map generation scheme. Pan et al. [17] proposed a solution for sensor perception and structured semantic mapping in unstructured orchards. By integrating a 3D-ODN detection network with an improved CSF algorithm, they constructed a 3D semantic map, ensuring efficient path planning and orchard operations for robotic systems. Fu et al. [18] proposed the DSC-DeepLabv3+ model for high-speed semantic segmentation of maize crops and weeds in cornfields by optimizing DeepLabv3+ and implementing lightweight modifications. The model achieved a weed segmentation IoU of 87.5%. Sodano et al. [19] developed a three-dimensional semantic mapping algorithm for hierarchical panoptic segmentation of orchards.
By integrating multi-sensor data to acquire 3D orchard data, the 3D-HPS algorithm directly outputs hierarchical semantic information and instance IDs. Jose et al. [20] proposed an active semantic mapping framework based on a lightweight semantic segmentation network combining MobileNetV3 and U-Net. This framework employs multimodal fusion and next-best-view (NBV) planning algorithms. In recent years, the accurate segmentation of slender and tortuous plant structures such as branches and vines in agricultural environments has become a research focus for robotic picking and pruning vision systems. In particular, to cope with the slender morphology and complex topology of branches, researchers have proposed a variety of dedicated network designs. For example, the EDSC-HRAFNet model proposed by Liu et al. [21] significantly improves the accuracy and continuity of apple branch segmentation in complex orchard scenes, achieving advanced performance in multiple challenging scenarios.
However, existing mapping approaches face pronounced limitations in orchard environments. Vision-only methods are susceptible to illumination changes and occlusions while lacking geometric stability. LiDAR-only systems fail to capture semantically rich information. Most learning-based semantic mapping frameworks operate on a closed set of object categories, rendering them ineffective against novel, unexpected objects prevalent in orchards. Moreover, although multi-sensor fusion is promising, achieving real-time and accurate data alignment remains challenging, and perceptual ambiguities often lead to inconsistent semantic labels across viewpoints.
To solve the above-mentioned problems, this paper focuses on the fusion of LiDAR and cameras. By combining a SLAM framework with neural networks, it aims to achieve robust three-dimensional semantic map construction. This map not only supports robot navigation and positioning but also provides a foundation for high-level scene understanding and interaction tasks such as path planning and target search, thereby comprehensively enhancing the overall efficiency and intelligence of autonomous mobile systems in orchard environments. Specifically, multi-sensor fusion is used to combine LiDAR point clouds, camera color images, and inertial measurement unit (IMU) data through an error-state iterative Kalman filter (ESIKF) to output precise poses and colored point clouds. Unlike approaches that directly apply a segmentation network, we account for the fact that unexpected objects may appear in the orchard environment and cannot be segmented by a model without prior training. Instead, we first use open-vocabulary detection to detect all objects and then input the detected regions into the segmentation model for further fine segmentation. At the same time, to solve the problem of the same object being detected as different categories, we adopt a Bayesian fusion method to improve robustness. It effectively improves the label consistency of point clouds with similar spatial structures, giving them consistent color features in the scene. The main contributions of this article are as follows:
  • Data from LiDAR, cameras, and an IMU are integrated for real-time positioning and construction of 3D semantic maps, addressing the efficiency and accuracy problems of vision-only approaches.
  • Semantic information extraction is divided into two stages to cope with the complex and changeable characteristics of the orchard environment.
  • A Bayesian fusion method is used to further improve the accuracy of the constructed semantic maps.

2. Materials and Methods

2.1. Data Collection and Processing

The dataset is composed of two parts. One part consists of 898 publicly available images from Roboflow, and the other part comprises 1200 images collected from a dragon fruit orchard in Conghua District, Guangzhou City. The dataset was collected from June to July 2025. All data were captured on sunny days to ensure good visibility and sensor performance. The orchard environment presented natural variation in lighting conditions: while substantial areas were under direct sunlight, other sections were partially shaded by the surrounding terrain and structures, creating a mix of illumination intensities and shadow patterns. This variability provides a realistic testbed for evaluating the robustness of perception algorithms against the challenging and uneven lighting commonly encountered in orchard operations. Color images were captured in an orchard using an Intel RealSense D435i camera (Intel Corporation, Santa Clara, CA, USA) and a Hikvision MV-CS050-10UC global shutter camera (Hikvision, Hangzhou, China), mounted on a self-built drone. LiDAR point clouds were captured using a Livox MID360 scanner (Livox, Shenzhen, China), while depth maps were acquired via the Intel RealSense D435i camera. Because the RealSense D435i's built-in IMU was too noisy and negatively impacted state estimation, IMU data were sourced from the LiDAR's integrated IMU and the PX4 6C Mini flight controller's built-in IMU. The LiDAR point cloud, IMU data, and Hikvision MV-CS050-10UC images serve as inputs to the LIVO system. For comparison, the Intel RealSense D435i camera's depth maps and color images, together with the built-in IMU data from the PX4 6C Mini, are used as inputs to the VIO system. Data acquisition was performed under the Ubuntu operating system, with ROS message bags recorded for storage. To reduce annotation complexity, objects in the images are categorized into only three common classes: fruit, branches and leaves, and background. Remaining objects receive no specific annotation. As shown in Figure 1, objects too distant from the camera were treated as background during annotation.

2.2. Methods

Figure 2 presents the system flowchart for a large-scale outdoor semantic map reconstruction system based on visual-LiDAR fusion. System inputs include RGB images, LiDAR point clouds, and IMU data. Upon receiving input data, the FAST-LIVO2 framework first fuses multi-source sensor information through an iterative error state Kalman filter to achieve real-time localization, outputting dense color point cloud maps and camera pose sequences. Simultaneously, the RGB image sequence feeds into the YOLO-world [22] object detection network to generate 2D bounding boxes. These detection results serve as input prompts for the FAST-SAM [23] segmentation model, which produces pixel-level instance masks. During the semantic fusion stage, the system projects the point cloud onto images from various viewpoints using refined camera parameters. Valid points are filtered through 3D bounding box constraints and depth consistency checks, while multi-view semantic evidence is accumulated using a Bayesian log-likelihood framework. Finally, the semantic annotation results undergo multi-scale spatial consistency refinement and boundary smoothing processing to generate a dense semantic point cloud map.

2.2.1. Open Vocabulary Detection and Semantic Segmentation Algorithm

The construction of semantic maps relies on accurate identification and segmentation of object categories within the environment. Compared to traditional rule-based or manually annotated methods, deep neural networks can automatically learn hierarchical feature representations, demonstrating superior generalization and robustness in complex scenarios. In recent years, deep learning-based semantic segmentation methods for 3D point clouds have made significant progress. PointNet [24] pioneered a deep network architecture that directly processes unordered point clouds, while PointNet++ [25] further introduced hierarchical feature learning mechanisms. For large-scale outdoor scenes, RandLA-Net [26] employs random sampling and local feature aggregation strategies, whereas Cylinder3D [27] leverages cylindrical partition representations to enhance LiDAR point cloud segmentation accuracy. However, these methods suffer from the following limitations: training data dependency (large-scale point-labeled 3D datasets with high annotation costs and fixed categories are required); lack of texture information (pure geometric features struggle to distinguish visually similar yet semantically distinct objects); limited generalization (models exhibit significant performance degradation in cross-domain scenarios); and weak open-vocabulary capability (new categories not covered in the training set cannot be recognized). In contrast, 2D vision models benefit from pre-training on large-scale datasets such as ImageNet [28] and COCO [29], demonstrating greater maturity in object detection and segmentation tasks. Therefore, this paper employs image-based neural networks to extract pixel-level semantic information from RGB images. Through multi-view fusion, this information is projected onto a 3D point cloud map, enabling the construction of semantic maps without requiring 3D annotation data.
We employ a two-stage semantic extraction strategy as shown in Figure 3: first, YOLO-world is used for object detection to obtain 2D bounding boxes and confidence scores. Subsequently, these detected boxes are fed as input to FAST-SAM to generate precise instance masks. Compared to directly employing end-to-end semantic segmentation networks, this approach offers several advantages: First, each detection box corresponds to an independent instance, facilitating the construction of instance-level semantic maps. Simultaneously, the 3D bounding boxes constructed with depth information effectively suppress background noise. Furthermore, the detection confidence supports multi-view Bayesian probability fusion, significantly enhancing the robustness of object category classification.
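To make the two-stage pipeline concrete, the following minimal sketch shows how open-vocabulary detections can prompt the segmentation model. It assumes the Ultralytics implementations of YOLO-World and FastSAM; the weight files, text prompts, and confidence threshold are illustrative placeholders rather than the exact configuration used in this work.

```python
# Minimal sketch of the two-stage semantic extraction: open-vocabulary
# detection followed by box-prompted segmentation. Model weights, class
# prompts, and thresholds are illustrative assumptions.
from ultralytics import YOLOWorld, FastSAM

detector = YOLOWorld("yolov8s-world.pt")
detector.set_classes(["dragon fruit", "branch", "leaf"])   # open-vocabulary text prompts
segmenter = FastSAM("FastSAM-s.pt")

def extract_semantics(image_path, conf_thres=0.25):
    """Return (class_id, bbox, score, mask) tuples for one RGB frame."""
    det = detector(image_path, conf=conf_thres)[0]
    results = []
    for box in det.boxes:
        xyxy = box.xyxy[0].tolist()                        # 2D bounding box [x1, y1, x2, y2]
        # Use the detection box as a prompt so FastSAM refines it to a pixel mask
        seg = segmenter(image_path, bboxes=[xyxy])[0]
        mask = seg.masks.data[0].cpu().numpy() if seg.masks is not None else None
        results.append((int(box.cls), xyxy, float(box.conf), mask))
    return results
```

Each returned tuple pairs an instance mask with the detection confidence that later feeds the multi-view Bayesian fusion.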

2.2.2. LiDAR-Based Positioning and Point Cloud Map Construction

In recent years, Simultaneous Localization and Mapping (SLAM) technology has made remarkable progress. Due to its ability to estimate pose and reconstruct maps in real time, SLAM has become an indispensable tool for various robotic navigation tasks. Dense 3D maps provide critical environmental information, while colored maps also carry substantial semantic information. Traditional visual SLAM lacks depth perception and is sensitive to lighting conditions, whereas LiDAR SLAM lacks color information and tends to fail in geometrically degenerate scenes. This paper employs FAST-LIVO2 [30] for simultaneous localization and mapping. This method efficiently fuses IMU, LiDAR, and image measurements through an error-state iterative Kalman filter (ESIKF). By utilizing a sequential update strategy within the Kalman filter to address the dimensional mismatch between LiDAR and image measurements, it achieves reliable localization and precise mapping; its system framework is shown in Figure 4. The framework employs four key mechanisms for robust operation:
  • Sequential ESIKF Updates: LiDAR measurements are processed first via point-to-plane residuals, followed by visual updates using sparse direct image alignment. This sequential approach avoids dimensional mismatch while maintaining theoretical equivalence to joint updates.
  • Unified Voxel Map Representation: A single adaptive voxel hash map stores both geometric (LiDAR points, plane parameters) and visual (image patches) information, enabling efficient data association and memory management through ring-buffer sliding.
  • Adaptive Robustness Features: The system incorporates several adaptive mechanisms to enhance robustness. First, plane priors derived from LiDAR points are utilized to improve the accuracy of visual alignment. Second, online exposure time estimation dynamically compensates for varying illumination conditions. Third, an on-demand raycasting strategy actively retrieves map points in scenarios where LiDAR measurements become sparse, ensuring continuous constraints for state estimation.
  • Fault-Tolerant Design: The ESIKF naturally weights sensor measurements based on their noise characteristics. During sensor degradation (e.g., LiDAR in featureless environments or camera in low-light conditions), the system automatically relies on remaining reliable sensors, preventing catastrophic failure.
In the LIVO system, it is assumed that the time offsets between the three sensors (LiDAR, inertial measurement unit, and camera) are known. These offsets can be determined through prior calibration or synchronization. This paper employs a calibration method using a target plate (featuring four circular holes and four ArUco markers) to calibrate external parameters. Time calibration was not performed, as external synchronization was used to achieve precise alignment. The calibration process for external parameters is shown in Figure 5.
First, the four ArUco markers on the calibration plate are identified. The 3D pose of each marker relative to the camera coordinate system is computed from the known marker dimensions and camera intrinsic parameters. The average of the four marker poses is taken as the calibration plate's pose, from which the 3D coordinates of the four circular hole centers in the camera coordinate system are derived, denoted as the point set $P_C$. Next, pass-through filtering is used to roughly extract the point cloud region containing the calibration plate. A plane is fitted using RANSAC to extract the point cloud of the calibration plate's plane. The plane point cloud is rotated and aligned to the $z = 0$ plane, converting it into a 2D point cloud. If a point's neighborhood contains angular gaps exceeding 25°, it is classified as an edge point. The edge points are clustered, and an ellipse is fitted to each cluster as shown in the following formula:
$$A x^{2} + B x y + C y^{2} + D x + E y + F = 0 \tag{1}$$
where the parameters $(A, B, C, D, E, F)$ are obtained through fitting. Subsequently, the ellipse center is calculated as shown in Equation (2) and transformed back to the original LiDAR coordinate system, yielding the point set $P_L$.
$$x_c = \frac{2CD - BE}{B^{2} - 4AC}, \qquad y_c = \frac{2AE - BD}{B^{2} - 4AC} \tag{2}$$
Ultimately, by leveraging the known one-to-one correspondence between the centers of the four circular apertures in $P_C$ (camera coordinate system) and $P_L$ (LiDAR coordinate system), the optimal rigid transformation $T_{CL}$ is solved to minimize the distance error between corresponding points:
$$\frac{1}{4 \cdot N} \sum_{i=1}^{4 \cdot N} \left\| p_i^{C} - T_{CL} \, p_i^{L} \right\|^{2} \tag{3}$$
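Since the paper does not specify the solver used for this least-squares alignment, the following sketch illustrates one standard choice, a closed-form SVD (Kabsch) solution over the matched hole centers; the function and variable names are illustrative.

```python
# Hedged sketch of solving T_CL from matched circle centers via the standard
# SVD (Kabsch) closed-form alignment; an assumption, not the authors' exact solver.
import numpy as np

def solve_rigid_transform(p_cam: np.ndarray, p_lidar: np.ndarray) -> np.ndarray:
    """p_cam, p_lidar: (4N, 3) arrays of corresponding hole centers.
    Returns the 4x4 homogeneous transform mapping LiDAR points into the camera frame."""
    mu_c, mu_l = p_cam.mean(axis=0), p_lidar.mean(axis=0)
    # Cross-covariance of the centered correspondences
    H = (p_lidar - mu_l).T @ (p_cam - mu_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # guard against a reflection solution
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = mu_c - R @ mu_l
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T
```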
After obtaining reliable extrinsic parameters, the asynchronous LiDAR point cloud and camera images are first temporally aligned, followed by state prediction using the IMU. During the update phase, geometric constraints are first established between the LiDAR point cloud and the map planes to optimize the state and update the map structure. Subsequently, visual optimization is performed by minimizing the photometric error with the current image, utilizing visual map points extracted from the map that are associated with reference image patches. The raw data are processed using a direct approach, eliminating the need for feature extraction methods such as ORB. Mechanisms including plane priors, real-time exposure estimation, and on-demand raycasting are incorporated to enhance accuracy and robustness. Due to space constraints, this paper does not provide detailed derivations of FAST-LIVO2's core algorithms (such as ESIKF state estimation and incremental VoxelMap construction); readers are encouraged to consult references [31,32] for further details. This section focuses on FAST-LIVO2's functional role within the system and its output data format. The final outputs, high-precision state estimates and textured dense point cloud maps, are illustrated in Figure 6 and Figure 7.

2.2.3. 3D Semantic Maps Construction and Postprocessing

Typically, the camera and LiDAR are configured to the same acquisition frequency, while the inertial measurement unit (IMU) operates at a higher frequency. ROS's time synchronization mechanism can receive multiple topic data streams and invoke a callback on the synchronized result, allowing the synchronized data to be processed within that callback. However, ROS's message_filters synchronization relies on software-layer message matching; affected by network latency, its synchronization accuracy is only millisecond-level, and timestamps across sensors suffer from drift. In highly dynamic scenarios, millisecond-level timing errors can lead to significant inaccuracies. To address this, the system employs external hardware clock synchronization. As shown in Figure 8, a unified clock source triggers simultaneous data acquisition from multiple sensors, followed by strict timestamp alignment through Linux's shared memory mechanism. In this hardware synchronization architecture, the STM32 microcontroller and the onboard PC (Intel NUC) assume distinct roles. The STM32 acts as the precision timing core, generating and outputting the hardware synchronization signals, including the absolute time reference (GPRMC) and periodic trigger pulses (PPS/PWM), to physically align the data acquisition moments of the LiDAR and camera. The onboard PC serves as the high-performance computing and alignment unit. It receives the sensor data streams that have been pre-synchronized at the hardware level, performs a secondary timestamp alignment in the software layer using Linux's shared memory mechanism to further eliminate residual jitter, and subsequently executes all compute-intensive algorithms for state estimation and semantic mapping. The synchronization accuracy of this hardware scheme was measured and validated: after the initial frame, all subsequent LiDAR scans and camera frames are aligned with microsecond-level precision. This high degree of temporal coherence is critical for our fusion-based semantic mapping pipeline.
Before fusion, ground point clouds must be removed. This paper employs the CSF algorithm to achieve this. The fusion process takes the dense point cloud map $P = \{ p_i \}_{i=1}^{N}$ output by the SLAM system, where $p_i \in \mathbb{R}^{3}$ represents the 3D coordinates of the i-th point, along with the corresponding RGB image sequence $\{ I^{(k)} \}_{k=1}^{M}$ and camera poses $\{ T_{cw}^{(k)} \}_{k=1}^{M}$. The objective is to assign a semantic category label $c_i \in \{ 0, 1, \ldots, C-1 \}$ to each point $p_i$. For each image frame, detection results are first extracted using the YOLO-World object detection network:
$$D^{(k)} = \left\{ \left( c_j, b_j, s_j \right) \right\}_{j=1}^{J_k} \tag{4}$$
where $c_j$ denotes the category label, $b_j = (x_1, y_1, x_2, y_2)$ represents the bounding box coordinates, and $s_j \in [0, 1]$ indicates the detection confidence score. Subsequently, the detection box is input as a prompt into the FastSAM segmentation model to generate a pixel-level binary mask $M_j \in \{0, 1\}^{H \times W}$.
For a point $p_i$ in the world coordinate system, it is transformed to the camera coordinate system using the camera extrinsic matrix $T_{cw}^{(k)} \in SE(3)$:
$$p_i^{cam} = T_{cw}^{(k)} \cdot \tilde{p}_i \tag{5}$$
where $\tilde{p}_i = [p_i^{T}, 1]^{T}$ is the homogeneous coordinate. Let $p_i^{cam} = [X_i, Y_i, Z_i]^{T}$; it is projected onto the pixel plane via the pinhole camera model:
$$u_i = f_x \cdot \frac{X_i}{Z_i} + c_x, \qquad v_i = f_y \cdot \frac{Y_i}{Z_i} + c_y \tag{6}$$
where $(f_x, f_y)$ are the focal lengths and $(c_x, c_y)$ denotes the principal point coordinates. Only points satisfying $Z_i > 0$ and lying within the image bounds are retained.
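The projection step of Equations (5) and (6) can be implemented compactly; the sketch below is illustrative NumPy code under the assumed pinhole model, not the authors' implementation.

```python
# Minimal sketch of projecting map points into one camera frame (Eqs. (5)-(6)).
import numpy as np

def project_points(points_w, T_cw, fx, fy, cx, cy, width, height):
    """points_w: (N, 3) world-frame points; T_cw: 4x4 world-to-camera transform.
    Returns pixel coordinates (N, 2), camera-frame points (N, 3), and a validity mask."""
    pts_h = np.hstack([points_w, np.ones((len(points_w), 1))])   # homogeneous coordinates
    pts_cam = (T_cw @ pts_h.T).T[:, :3]                          # Eq. (5)
    X, Y, Z = pts_cam[:, 0], pts_cam[:, 1], pts_cam[:, 2]
    with np.errstate(divide="ignore", invalid="ignore"):
        u = fx * X / Z + cx                                      # Eq. (6)
        v = fy * Y / Z + cy
    # Keep only points in front of the camera and inside the image bounds
    valid = (Z > 0) & (u >= 0) & (u < width) & (v >= 0) & (v < height)
    return np.stack([u, v], axis=1), pts_cam, valid
```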
To avoid mislabeling background points as foreground targets, bounding box constraints are constructed in 3D space based on 2D detection boxes. First, the depth of the detection box center is estimated (by taking the median of the projection point depths in the box’s central region). Then, the 3D bounding box dimensions are computed via backprojection:
$$W_{3D} = \frac{Z_{center} \cdot w_{pixel}}{f_x}, \qquad H_{3D} = \frac{Z_{center} \cdot h_{pixel}}{f_y} \tag{7}$$
$$D_{3D} = \max \left( W_{3D}, \, 0.5 \right) \tag{8}$$
where $w_{pixel} = x_2 - x_1$ and $h_{pixel} = y_2 - y_1$ represent the pixel dimensions of the detection box. The center coordinates of the 3D bounding box are
$$X_{center} = Z_{center} \cdot \frac{(x_1 + x_2)/2 - c_x}{f_x}, \qquad Y_{center} = Z_{center} \cdot \frac{(y_1 + y_2)/2 - c_y}{f_y} \tag{9}$$
Then only retain points that satisfy the following constraints:
$$p_i^{cam} \in \left[ X_{center} \pm \frac{W_{3D}}{2} \right] \times \left[ Y_{center} \pm \frac{H_{3D}}{2} \right] \times \left[ Z_{center} \pm \frac{D_{3D}}{2} \right] \tag{10}$$
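The following sketch illustrates the 3D bounding-box constraint of Equations (7)-(10); the choice of the central region used for the median-depth estimate is an assumption about an implementation detail that the text leaves open.

```python
# Sketch of filtering camera-frame points against the back-projected 3D box
# (Eqs. (7)-(10)); the 0.25-box central window for the median depth is illustrative.
import numpy as np

def filter_points_in_box(pts_cam, uv, bbox, fx, fy, cx, cy):
    """pts_cam: (N, 3) camera-frame points; uv: (N, 2) projected pixels;
    bbox: 2D detection box (x1, y1, x2, y2). Returns a boolean mask of points
    lying inside the back-projected 3D bounding box."""
    x1, y1, x2, y2 = bbox
    cx_box, cy_box = (x1 + x2) / 2, (y1 + y2) / 2
    w_pix, h_pix = x2 - x1, y2 - y1
    # Median depth of projections falling in the central region of the box
    central = (np.abs(uv[:, 0] - cx_box) < 0.25 * w_pix) & \
              (np.abs(uv[:, 1] - cy_box) < 0.25 * h_pix)
    if not np.any(central):
        return np.zeros(len(pts_cam), dtype=bool)
    Z_c = np.median(pts_cam[central, 2])
    # Back-project the 2D box to metric dimensions (Eqs. (7)-(9))
    W3d, H3d = Z_c * w_pix / fx, Z_c * h_pix / fy
    D3d = max(W3d, 0.5)
    X_c, Y_c = Z_c * (cx_box - cx) / fx, Z_c * (cy_box - cy) / fy
    inside = (np.abs(pts_cam[:, 0] - X_c) <= W3d / 2) & \
             (np.abs(pts_cam[:, 1] - Y_c) <= H3d / 2) & \
             (np.abs(pts_cam[:, 2] - Z_c) <= D3d / 2)      # Eq. (10)
    return inside
```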
For points passing the above screening, a Bayesian framework is employed to fuse multi-view semantic observations. A log-likelihood vector over the categories is maintained for each point, initialized as a zero vector. When a point is observed by a detection box of category $c_j$ with confidence $s_j$ in frame k, the log-likelihood of the corresponding category is updated:
$$L_i(c_j) \leftarrow L_i(c_j) + \log \frac{\hat{s}_j}{1 - \hat{s}_j} \tag{11}$$
where $\hat{s}_j = \mathrm{clamp}(s_j, 0.55, 0.95)$ is the truncated confidence used to avoid numerical instability. This update rule originates from the log-odds form of the Bayesian posterior. Let $O_0 = P(c) / P(\neg c)$ denote the prior odds that point $p$ belongs to class $c$. Then, after n independent observations, the posterior odds are
$$O_n = O_0 \cdot \prod_{t=1}^{n} \frac{P(\mathrm{obs}_t \mid c)}{P(\mathrm{obs}_t \mid \neg c)} \tag{12}$$
This Bayesian log-odds update framework is derived from the assumption of independent observations across frames. It operates in log-odds space ($l = \log(p / (1 - p))$), which provides numerical stability and allows efficient additive updates. Each point is initialized with a prior log-odds of 0 for all categories, corresponding to a neutral prior probability of $p = 0.5$ (i.e., no initial bias). When a point is observed within a detected bounding box of category $c_j$ with confidence score $s_j$ (after truncation), the observation is treated as independent evidence. The term $\log(\hat{s}_j / (1 - \hat{s}_j))$ acts as the log-likelihood ratio, which is added to the accumulated evidence for category $c_j$ at that point (Equation (11)). After processing all frames, the category with the highest accumulated log-odds is assigned to the point.
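A minimal sketch of the per-point log-odds accumulation (Equation (11)) is given below; the dictionary-based data layout and the default background label are illustrative assumptions.

```python
# Sketch of multi-view Bayesian log-odds accumulation (Eq. (11)).
import numpy as np

N_CLASSES = 3                      # fruit, branches/leaves, background (illustrative ordering)
log_odds = {}                      # point index -> evidence vector (prior = 0, i.e. p = 0.5)

def update_point(idx, cls, score):
    """Add one observation (class, confidence) for point idx."""
    s = np.clip(score, 0.55, 0.95)                 # confidence truncation from Eq. (11)
    vec = log_odds.setdefault(idx, np.zeros(N_CLASSES))
    vec[cls] += np.log(s / (1.0 - s))              # additive log-likelihood ratio

def final_label(idx, default=2):
    """After all frames, assign the class with the highest accumulated evidence."""
    vec = log_odds.get(idx)
    if vec is None or not np.any(vec):             # never observed -> background
        return default
    return int(np.argmax(vec))
```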
Due to potential inconsistencies in the 2D segmentation results across different viewpoints, 3D spatial consistency constraints are introduced for refinement. A vertical columnar grid structure is employed, with the grid partitioned in the $XY$ plane at resolution r and spanning the full Z range. For each grid cell $G$, the number of points belonging to each category, $n_c = |\{ i : p_i \in G,\ c_i = c \}|$, is counted. If the dominant category exceeds a threshold:
$$\frac{\max_{c} n_c}{\sum_{c} n_c} \geq \gamma \tag{13}$$
all points within the cell are uniformly relabeled with the dominant category. A multi-scale strategy is employed, sequentially refining the resolution over $r \in \{1.0, 0.5, 0.2\}$ m to progressively optimize from coarse to fine scales. To further suppress noisy annotations at category boundaries, a K-nearest-neighbor boundary smoothing strategy is employed: boundary points are first identified, and local majority voting is then applied to them:
$$c_i^{smooth} = \arg\max_{c} \sum_{j \in N_r(i)} \mathbf{1}\left[ c_j = c \right] \tag{14}$$
A point is relabeled only when the dominant category exceeds 60%, to avoid excessive smoothing. The key parameters described above are listed in Table 1.
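The columnar-grid consistency refinement (Equation (13)) can be sketched as follows; the resolution schedule follows the text, while the threshold value and the array-based data layout are illustrative (the actual parameters are listed in Table 1).

```python
# Sketch of multi-scale columnar-grid consistency refinement (Eq. (13));
# gamma is an illustrative value, see Table 1 for the parameters actually used.
import numpy as np
from collections import defaultdict

def columnar_refine(points, labels, resolutions=(1.0, 0.5, 0.2), gamma=0.6):
    """points: (N, 3) array; labels: (N,) integer class labels (modified in place)."""
    for r in resolutions:                               # coarse-to-fine passes
        cells = defaultdict(list)
        # Vertical columns: partition only in the XY plane, span the full Z range
        keys = np.floor(points[:, :2] / r).astype(int)
        for i, key in enumerate(map(tuple, keys)):
            cells[key].append(i)
        for idx in cells.values():
            counts = np.bincount(labels[idx])
            dominant = counts.argmax()
            if counts[dominant] / counts.sum() >= gamma:  # Eq. (13)
                labels[idx] = dominant
    return labels
```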
The semantic map optimization results are shown in Figure 9. The red point cloud represents the optimized fruit point cloud, while the blue point cloud represents the redundant points that were previously mis-segmented. The figure indicates that the segmentation error is caused by an incorrect mask in a particular frame. Some scattered points may result from insufficient point cloud density within the corresponding mask area, leading to incorrect coloring of the background. Experimental results validate the effectiveness of the semantic segmentation optimization method.

3. Experiments and Results

3.1. Experiments Platform

We independently built an autonomous navigation drone, shown in Figure 10, for data collection tasks. It is important to note that the drone is solely responsible for data acquisition, while verification runs offline. This offline verification stems from our specific experimental platform choice. Although the drone's flexible and compact design makes it ideal for data acquisition in diverse environments, it lacks an onboard NVIDIA GPU for real-time visual model inference. Critically, this hardware limitation pertains only to the validation setup and does not reflect a limitation of the proposed technology itself. The segmentation model achieves 74.85 FPS on a standard GPU, a rate far exceeding the sensor's acquisition frequency. Therefore, when deployed on a platform with integrated GPU compute (e.g., a robotic platform with a Jetson AGX Orin), the entire pipeline is fully capable of real-time, end-to-end semantic mapping, as claimed in the abstract. The flight platform is primarily constructed from lightweight carbon fiber and aluminum, ensuring structural strength and low weight. The core sensor suite includes a Livox MID360 LiDAR providing a 360° field of view with an integrated six-axis inertial measurement unit (IMU), an MV-CS050-10UC camera, and an Intel RealSense D435i camera for RGB-D image capture; the LiDAR and IMU operate at 10 Hz and 200 Hz, respectively. The onboard Intel NUC 12 Pro serves as the core computing unit, with the LiDAR and MV-CS050-10UC camera synchronized via an STM32 microcontroller.

3.2. Semantic Segmentation Performance Analysis

To train a semantic segmentation model for dragon fruit, the annotated dataset was divided into training, validation, and test sets at an 8:1:1 ratio, and the model was fine-tuned from FastSAM pretrained weights. Figure 11 compares segmentation results between traditional and deep learning approaches. The conventional threshold-based segmentation method using the HSV color space achieves moderate results under uniform lighting and simple backgrounds, but its performance degrades significantly in scenarios involving leaf occlusion, backlighting, and complex backgrounds. In contrast, the FastSAM model learns high-level semantic features, enabling accurate segmentation of fruit and foliage regions even under these challenging conditions. Affected by environmental factors and annotation quality, the model still exhibits missed detections in severely occluded regions. Nevertheless, its overall segmentation accuracy and robustness surpass traditional methods.
To comprehensively evaluate the performance of the FastSAM segmentation model, we also selected several classical networks for comparison, namely U-Net, DeepLabV3+, and BiSeNet. Testing was conducted on our own dataset using mean Intersection over Union (mIoU), pixel accuracy (PA), F1 score, and inference speed (FPS) as evaluation metrics. mIoU measures the overlap between predicted and ground-truth regions, PA reflects overall pixel classification accuracy, the F1 score jointly evaluates precision and recall, and FPS assesses real-time processing capability. The relevant formulas are as follows:
$$mIoU = \frac{1}{k+1} \sum_{i=0}^{k} \frac{TP}{FN + FP + TP} \tag{15}$$
$$PA = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k} \sum_{j=0}^{k} p_{ij}} \tag{16}$$
Here, $k + 1$ denotes the total number of classes, while $TP$, $FP$, and $FN$ represent the pixel counts of true positives, false positives, and false negatives, respectively. $p_{ij}$ indicates the number of pixels in the confusion matrix whose true class is i but whose predicted class is j, and $p_{ii}$ denotes the number of correctly classified pixels.
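For reference, the metrics of Equations (15) and (16) can be computed from a confusion matrix as in the following illustrative helper code.

```python
# Illustrative computation of mIoU and pixel accuracy from a confusion matrix
# (Eqs. (15)-(16)).
import numpy as np

def confusion_matrix(gt, pred, num_classes):
    """gt, pred: flattened integer label arrays of equal length."""
    idx = gt * num_classes + pred
    return np.bincount(idx, minlength=num_classes**2).reshape(num_classes, num_classes)

def miou_and_pa(cm):
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1)       # per-class IoU, Eq. (15)
    pa = tp.sum() / cm.sum()                     # pixel accuracy, Eq. (16)
    return iou.mean(), pa
```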
Table 2 compares the performance of different segmentation methods on the dragon fruit test set. mIoU, PA, and F1 are expressed as percentages, while FPS was measured on an NVIDIA RTX 3090 Ti. Although its accuracy is lower than that of BiSeNet, FastSAM supports text prompts, so uncommon objects do not require costly separate training. It is worth noting that several recently released segmentation models, such as the Segment Anything Model (SAM) [33] and Mask2Former [34], have demonstrated strong performance in open-domain segmentation tasks. However, these models often rely on large-scale pre-training datasets and substantial computational resources, which can pose challenges for deployment in resource-constrained agricultural environments. In particular, SAM requires significant inference time and memory overhead, which may hinder real-time performance in dynamic orchard scenarios. Similarly, transformer-based architectures such as Mask2Former demand extensive training data and fine-tuning to adapt to specific agricultural objects, which may not be feasible when labeled data are scarce. Therefore, while these advanced models represent promising directions for future research, we selected FastSAM for its favorable balance between accuracy, inference speed, and adaptability, which are the key considerations for real-time semantic mapping in orchards. Overall, FastSAM maintains a high level of segmentation accuracy, and although its inference speed is lower than that of the other networks, at over 70 FPS it remains sufficient for mapping tasks, making it suitable for visual segmentation with high real-time requirements.

3.3. Evaluation of 3D Semantic Map Construction in an Orchard

To evaluate the reconstruction effectiveness and accuracy of the system, we conducted experiments in a greenhouse dragon fruit orchard at South China Agricultural University in Guangzhou, as shown in Figure 12. The environment featured densely distributed plants approximately 2 m in height.
First, the reconstruction results for a small group of trees were evaluated to determine whether the number of fruits obtained from the reconstructed point cloud matches the actual count. In the reconstructed point cloud, fruits are colored red, paper strips are colored green, and other unintended objects are colored with different random hues, while the remaining points retain their original RGB colors.
Observation of the reconstructed individual fruit trees reveals that most semantic elements (e.g., branches and fruits) are correctly classified, with point clouds colored green and red, respectively. This finding is closely related to the accuracy of semantic segmentation. The Bayesian log-likelihood fusion approach effectively mitigates discrepancies in color assignment caused by the same point cloud being segmented into different categories. Compared to point clouds constructed using visual methods, those obtained via SLAM exhibit significantly more pronounced geometric features and appear more regular and distinct. This is because visual methods do not directly capture distance information for each point but instead calculate it indirectly through means such as stereo disparity, resulting in lower absolute accuracy, particularly at long distances.
To quantitatively assess the quality of semantic reconstruction, we conducted the following experiments. First, conditional filtering was used to extract dragon fruit point clouds from the semantic map based on color. Subsequently, since manual counting in semantic point cloud maps is challenging, a clustering algorithm was used to group the discrete point clouds into multiple clusters for counting. Based on point cloud density and actual fruit size, the minimum allowed points per cluster was set to 30, with a minimum radius of 0.03 m and a maximum radius of 0.08 m, while other parameters remained at their default values. The clustering results are shown in Figure 13, where each small bounding box is regarded as one fruit. However, it should be noted that, due to errors in the mask segmentation area, some branch points may be mistakenly assigned to fruit clusters, inflating the count. Finally, based on the fitting results, the predicted value $N_p$ is determined and compared with the ground-truth value $N_t$. The relative error $\alpha$ is calculated using Formula (17) to evaluate the system's reliability. Experimental statistical results are shown in Table 3.
$$\alpha = \frac{\left| N_p - N_t \right|}{N_t} \tag{17}$$
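The counting step can be sketched as follows, using Open3D's DBSCAN clustering as a stand-in since the exact clustering implementation is not named in the text; the color tolerance and eps value are illustrative, chosen to respect the stated 30-point minimum and 0.03-0.08 m radius range.

```python
# Hedged sketch of cluster-based fruit counting from the semantic point cloud;
# DBSCAN parameters and the fruit label color are illustrative assumptions.
import numpy as np
import open3d as o3d

def count_fruits(semantic_pcd, fruit_color=(1.0, 0.0, 0.0), eps=0.05, min_points=30):
    """semantic_pcd: open3d.geometry.PointCloud with semantic label colors.
    Returns the predicted fruit count N_p."""
    colors = np.asarray(semantic_pcd.colors)
    # Conditional filtering: keep only points carrying the fruit label color (red)
    mask = np.all(np.isclose(colors, fruit_color, atol=0.1), axis=1)
    fruit_pcd = semantic_pcd.select_by_index(np.where(mask)[0])
    labels = np.array(fruit_pcd.cluster_dbscan(eps=eps, min_points=min_points))
    # Cluster labels start at 0; noise points are labeled -1
    return int(labels.max()) + 1 if labels.size and labels.max() >= 0 else 0
```

The predicted count from this step is then compared against the manually counted ground truth via Formula (17).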
Relative error serves as a crucial metric for evaluating semantic reconstruction quality, with lower values indicating more reliable reconstruction results. As shown in the table, the average relative error of the proposed method is 12.4%, demonstrating overall high prediction accuracy, though some counting bias remains. The sources of error can be summarized as follows:
  • Occlusion issues: In natural orchard environments, dragon fruit are often obscured by dense branches and foliage. Some fruits have only a small portion of their surface exposed within the sensor’s field of view, preventing complete reconstruction in the 3D point cloud and leading to missed detections.
  • Point cloud noise: Influenced by factors such as LiDAR measurement accuracy, multipath reflections, and edge effects, semantic point clouds inevitably contain outliers and areas with blurred boundaries. This noise interferes with density-based clustering algorithms, leading to adjacent fruits being incorrectly merged or individual fruits being excessively fragmented;
  • Semantic segmentation errors: During the projection of 2D image segmentation results onto the 3D point cloud, imprecision in segmentation boundaries is transferred to point cloud annotations, affecting final clustering and counting outcomes.
Despite these error sources, the average relative error of 12.4% remains within an acceptable range, validating the practicality of this method in complex orchard environments.
To situate the fruit counting performance of our 3D semantic mapping pipeline within the broader research landscape, we compare our results with recent, representative work in orchard fruit detection and counting. Vasconez et al. [35] conducted a comprehensive evaluation of 2D CNN-based methods, reporting the following best-case relative counting errors on different crops: approximately 7% for avocados, 10–13% for apples, and 20–25% for lemons. Our method, applied in a densely planted and severely occluded dragon fruit orchard, achieves an average relative counting error of 12.4%. This result is directly comparable to the performance reported for apples and avocados in structured, well-managed orchards. More fundamentally, the compared methods rely on 2D image sequences coupled with multi-object tracking to associate detections across frames and avoid double-counting. In contrast, our approach intrinsically resolves data association through 3D geometric fusion within the semantic mapping process. The fruit count is therefore a direct product of a globally consistent 3D map, which simultaneously provides a rich, actionable environmental model for navigation and targeted operation, an output that pure 2D video analysis cannot provide.
To further validate the semantic mapping performance in large-scale scenarios, testing was conducted across the entire orchard, traversing multiple tree rows to cover all areas. As shown in Figure 14 and Figure 15, branches in each row were reconstructed effectively, and fruits were correctly segmented. Furthermore, quantitative analysis of the reconstructed rows indicates that fruit spacing is relatively uniform: excluding occasional clusters, the typical inter-fruit distance along a row ranges between 2 m and 3 m. As shown in Table 3, the mean counting error in large-scale rows is 13.7%, slightly higher than in small-scale scenes, which is primarily attributed to increased occlusion. This demonstrates the value of our method for semantic mapping in orchards. It should be noted that, due to the non-repetitive scanning pattern of the LiDAR, distant noise points may be detected. Before converting the point cloud into a grid map for navigation, filtering can be applied to remove these outliers. While this process may inevitably eliminate some valid points, it has minimal impact on semantic information and navigation maps. Indeed, our experiments confirmed that filtering did not significantly reduce the point cloud volume but instead enhanced structural clarity.
To systematically evaluate the advantage of our multi-sensor fusion approach, we constructed a vision-only semantic mapping baseline that adheres to the typical “geometric SLAM + 3D point-cloud fusion” paradigm. In this baseline, we first reconstructed a 3D point cloud for each frame using the depth map from the Intel RealSense D435i camera and its intrinsic parameters. The same Fast-SAM model was then applied to segment this per-frame point cloud based on the corresponding RGB image. Finally, the camera poses estimated by ORB-SLAM3 were used to align and merge all individual point clouds into a unified global map. This pipeline represents a standard vision-centric approach that relies solely on visual and depth data, without incorporating LiDAR measurements for geometry or pose optimization.
A qualitative comparison is shown in Figure 16, Figure 17 and Figure 18. To comprehensively assess geometric consistency, we inspected the point cloud reconstructed by the vision-only method from three representative viewpoints: the view parallel to the direction of motion (Figure 16) reveals noticeable fragmentation and breakage in the point cloud, especially in texture-sparse vegetation areas; the 45-degree oblique view (Figure 17) further shows surface blurring and loss of detail due to depth-estimation noise; and the view perpendicular to the direction of motion (Figure 18) clearly exposes duplicated and misaligned structures caused by loop-closure errors. In contrast, the map generated by our multi-sensor fusion system (Figure 15) exhibits continuous, complete, and sharply defined geometric structures across all viewpoints, demonstrating its significant advantage in reconstructing robust geometry in complex orchard environments.

3.4. Localization Accuracy Evaluation

Although semantic maps can help robots operate according to human rules and plan and execute high-level tasks, all of this relies on reliable positioning. Whether robots can truly leverage the value of semantic maps ultimately depends on the accuracy of their positioning. Given these considerations, we also evaluated the system's positioning accuracy. We selected a starting point within the orchard, navigated around the tree rows, and returned to the origin. In our orchard environment, acquiring external ground-truth data (e.g., from motion-capture systems or continuous RTK-GPS under dense canopy) was infeasible. We therefore adopted an internal reference protocol: the trajectory estimated by the visual-inertial (VI) subsystem, which fuses camera and IMU data, was used as a high-quality baseline. This choice is justified because VI odometry typically provides smooth and locally accurate pose estimates when visual features are well tracked.
To assess the robustness of our full LiDAR-Inertial-Visual (LIV) fusion system, we performed two closed-loop experiments along similar paths but under different operational conditions. The absolute trajectory error (ATE) between the LIV trajectory and the VI reference for each experiment is summarized in Table 4, and the performance comparison is shown in Figure 19.
As shown in Table 4, Figure 20 and Figure 21, the system achieved a mean absolute trajectory error of 0.529 m under stable conditions with slow motion and uniform lighting (Experiment 1). When the platform moved rapidly under drastically changing illumination (Experiment 2), the error increased to 1.290 m, a rise of approximately 2.4 times. This increase stems primarily from two factors: high-speed motion amplifies sensor noise and can introduce motion blur, degrading visual feature tracking, while rapid lighting variations violate the photometric consistency assumption essential for visual odometry, leading to increased drift. Notably, even under the more challenging conditions of Experiment 2, our LiDAR-Inertial-Visual fusion system maintained a bounded error without catastrophic failure. This demonstrates that the LiDAR-derived geometric constraints, which are invariant to lighting changes, effectively stabilize the system. The result highlights the essential role of multi-sensor fusion in enhancing robustness in dynamic orchard environments.
We also recorded the FAST-LIVO2 and visual-inertial odometry (VIO) measurements, and the results are shown in Figure 22.
Finally, to provide an independent measure of absolute localization accuracy, we evaluated the same FAST-LIVO2 pipeline on the publicly available HILTI dataset, which offers precise ground-truth trajectories. On this benchmark, our method achieved a mean translation error of 0.022 m, demonstrating competitive performance in a controlled setting and confirming the generalizability of the underlying fusion algorithm.

4. Conclusions

This study proposes a method for constructing semantic point clouds of dragon fruit orchards based on the fusion of 2D semantic segmentation and 3D LiDAR point clouds. The core approach involves using FastSAM to perform semantic segmentation on captured orchard images, obtaining pixel-level annotations for fruits and foliage. Subsequently, through camera-LiDAR fusion, the 2D semantic labels are projected onto the 3D point cloud space, enabling automatic construction of the semantic point cloud. Quantitative evaluation results demonstrate that the proposed method achieves an average intersection-over-union (IoU) of 74.33%, pixel accuracy of 85.02%, and an F1 score of 84.72% on the dragon fruit test dataset. Regarding inference efficiency, FastSAM maintains a processing speed of 75 FPS, meeting real-time requirements. Experiments on fruit counting using the semantic point cloud revealed an average relative error of 12.4% between predicted and actual values. Errors primarily stemmed from incomplete fruit reconstruction due to branch and leaf occlusion, as well as point cloud noise interfering with clustering algorithms. These results validate the utility of our method for providing actionable information for orchard management and for supporting robots in executing higher-level tasks. Furthermore, the proposed method is designed with robustness in mind to address the challenges inherent in unstructured orchard environments, such as variable lighting, terrain unevenness, and occlusion. To mitigate the impact of lighting and weather variations, our multi-sensor fusion framework reduces reliance on pure visual appearance by utilizing LiDAR to provide stable geometric structure information under diverse illumination conditions. For terrain unevenness and severe shading, the integration of LiDAR with inertial sensing helps maintain pose estimation accuracy, while the multi-view Bayesian fusion strategy semantically consolidates information from different perspectives, alleviating the occlusion problem inherent in single viewpoints. Regarding static and dynamic obstacles, the current pipeline assumes a predominantly static environment for map construction; however, transient dynamic objects (e.g., moving personnel) can be filtered through temporal consistency checks in the point cloud sequence, an avenue for future refinement. Concerning seasonal changes (e.g., leaf growth or fruit maturation), the current model, while robust within the trained season, would benefit from continuous learning or seasonal adaptation to maintain segmentation accuracy, which we identify as a key direction for subsequent research. Thus, the system demonstrates a foundational robustness through its multi-modal design, with clear pathways identified for enhancing its adaptability in perpetually changing extreme environments.

5. Discussion and Future Work

Although the method proposed in this paper can achieve semantic reconstruction of dragon fruit orchards, the following limitations remain: A primary limitation is that the generated semantic map, while rich in object-level annotations, remains a static and descriptive representation rather than an actionable model for direct navigation. It cannot be directly used for high-level, task-oriented navigation planning in its current form. As highlighted in recent literature on embodied navigation, a semantic map primarily provides the foundational layer of environmental understanding; however, to enable actual robot movement, especially in complex, unstructured environments like orchards, this representation must be translated into a navigation-centric structure, such as a semantically annotated costmap or a graph representation of traversable space [36]. For instance, while our map identifies “thin branches” and “fruit,” it does not inherently encode which regions are safely traversable by a ground robot, nor does it quantify the risk or cost associated with moving near certain objects (e.g., avoiding low-hanging branches that could damage the robot or the crop). In agricultural settings, navigation planning often requires not only collision avoidance but also task-aware optimization, such as minimizing damage to produce, adhering to preferred crop row patterns, or dynamically replanning around temporary obstacles. Future work should therefore focus on bridging this gap, by integrating the reconstructed semantic map with a costmap generation module that incorporates terrain traversability, semantic risk penalties, and dynamic updates, thereby enabling closed-loop, semantically guided navigation in the orchard.
Another important consideration is the generalization capability of the proposed framework. The present study is validated within a dragon fruit orchard environment. To assess its applicability to other orchard types and crop species, such as apple trees or vineyards, the inherent flexibility of the pipeline’s design should be noted. The open vocabulary detection stage reduces dependency on a fixed set of categories, allowing the system to respond to textual prompts for novel objects. However, the performance on other crops would likely require adaptation of the two-dimensional segmentation module, as its current training is specific to dragon fruit imagery. A comprehensive cross-crop validation is planned as future work.
The accuracy of semantic segmentation has a decisive impact on the quality of semantic map reconstruction. As shown in Figure 23, the dragon fruit orchard environment is complex, primarily manifested in the following aspects: (1) Randomly distributed branches cause significant variations in the distance between the camera and the target during data collection, resulting in incomplete representation of some fruits within the image field of view; (2) To ensure synchronization between the camera and LiDAR frequencies, the camera operates at a frame rate of 10 Hz. Motion blur during movement reduces segmentation accuracy. (3) Rapid changes in orchard lighting conditions, coupled with severe occlusion of fruits by branches and leaves, resulted in incorrect segmentation of some targets. Furthermore, the complex morphological structure of dragon fruit plants, where branches and fruits intertwine, made dataset annotation laborious and error-prone. Inconsistencies and errors in annotation directly impact training data quality, causing the model to learn incorrect semantic features and consequently degrading segmentation performance. (4) The three-class semantic design (fruit, branches/leaves, background) allows us to focus evaluation on the core objects essential for navigation and fruit-level tasks. Our two-stage pipeline inherently supports open-set detection, enabling recognition of unlisted objects (e.g., tools, structures) during operation, though these are not quantitatively assessed here. To address these challenges, future research will focus on the following directions: First, optimizing the dataset to provide higher-quality training samples. Second, implementing lightweight improvements to the segmentation network tailored to the practical deployment requirements of orchard robots. This will enable adaptation to computationally constrained embedded platforms while meeting real-time demands in highly dynamic operational environments. The framework can be directly extended to more categories in future work by expanding the training dataset.

Author Contributions

Conceptualization, Q.W. and Y.C. (Yiheng Chen); Methodology, Q.W. and Y.C. (Yiheng Chen); Software, Q.W. and Y.C. (Yiheng Chen); Validation, Q.W., J.L. and Y.C. (Yongxing Chen); Formal analysis, Q.W. and J.L.; Data curation, Q.W.; Writing—original draft, Q.W.; Writing—review & editing, H.W.; Supervision, H.W.; Project administration, H.W.; funding acquisition, H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the National Natural Science Foundation of China under Grant 32372001.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kazerouni, I.A.; Fitzgerald, L.; Dooly, G.; Toal, D. A survey of state-of-the-art on visual SLAM. Expert Syst. Appl. 2022, 205, 117734. [Google Scholar] [CrossRef]
  2. Liu, T.; Kang, H.; Chen, C. ORB-Livox: A real-time dynamic system for fruit detection and localization. Comput. Electron. Agric. 2023, 209, 107834. [Google Scholar] [CrossRef]
  3. Yuan, Q.; Wang, P.; Luo, W.; Zhou, Y.; Chen, H.; Meng, Z. Simultaneous Localization and Mapping System for Agricultural Yield Estimation Based on Improved VINS-RGBD: A Case Study of a Strawberry Field. Agriculture 2024, 14, 784. [Google Scholar] [CrossRef]
  4. Kutyrev, A.; Khort, D.; Smirnov, I.; Zubina, V. UAV-based sustainable orchard management: Deep learning for apple detection and yield estimation. In E3S Web of Conferences; EDP Sciences: Les Ulis, France, 2025; Volume 614, p. 03021. [Google Scholar]
  5. Casado-García, A.; Heras, J.; Milella, A.; Marani, R. Semi-supervised deep learning and low-cost cameras for the semantic segmentation of natural images in viticulture. Precis. Agric. 2022, 23, 2001–2026. [Google Scholar] [CrossRef]
  6. Liu, Z.; Wang, J.; Liu, C.; Li, Z.; Jiang, H.; Ma, Y.; Zhang, Y.; Wang, Z. Online Point Coverage Path Planning for Prior-Free Robotic Weeding Using Deep Reinforcement Learning. Authorea Prepr. 2025. Available online: https://www.techrxiv.org/doi/full/10.36227/techrxiv.175338547.70910319 (accessed on 13 January 2026).
  7. Rapado-Rincon, D.; Kootstra, G. Tree-SLAM: Semantic object SLAM for efficient mapping of individual trees in orchards. Smart Agric. Technol. 2025, 12, 101439. [Google Scholar] [CrossRef]
  8. Lei, J.; Prabhu, A.; Liu, X.; Cladera, F.; Mortazavi, M.; Ehsani, R.; Chaudhari, P.; Kumar, V. Spatio-Temporal Metric-Semantic Mapping for Persistent Orchard Monitoring: Method and Dataset. IEEE Robot. Autom. Lett. 2025, 10, 8610–8617. [Google Scholar] [CrossRef]
  9. Peng, C.; Roy, P.; Luby, J.; Isler, V. Semantic mapping of orchards. IFAC-PapersOnLine 2016, 49, 85–89. [Google Scholar] [CrossRef]
  10. Xiong, J.; Liang, J.; Zhuang, Y.; Hong, D.; Zheng, Z.; Liao, S.; Hu, W.; Yang, Z. Real-time localization and 3D semantic map reconstruction for unstructured citrus orchards. Comput. Electron. Agric. 2023, 213, 108217. [Google Scholar] [CrossRef]
  11. Nakaguchi, V.M.; Abeyrathna, R.R.D.; Liu, Z.; Noguchi, R.; Ahamed, T. Development of a Machine stereo vision-based autonomous navigation system for orchard speed sprayers. Comput. Electron. Agric. 2024, 227, 109669. [Google Scholar] [CrossRef]
  12. Papadimitriou, A.; Kleitsiotis, I.; Kostavelis, I.; Mariolis, I.; Giakoumis, D.; Likothanassis, S.; Tzovaras, D. Loop closure detection and slam in vineyards with deep semantic cues. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 2251–2258. [Google Scholar]
  13. Wang, S.; Song, J.; Qi, P.; Yuan, C.; Wu, H.; Zhang, L.; Liu, W.; Liu, Y.; He, X. Design and development of orchard autonomous navigation spray system. Front. Plant Sci. 2022, 13, 960686. [Google Scholar] [CrossRef]
  14. Blok, P.M.; van Boheemen, K.; van Evert, F.K.; IJsselmuiden, J.; Kim, G.H. Robot navigation in orchards with localization based on Particle filter and Kalman filter. Comput. Electron. Agric. 2019, 157, 261–269. [Google Scholar] [CrossRef]
  15. Dong, W.; Roy, P.; Isler, V. Semantic mapping for orchard environments by merging two-sides reconstructions of tree rows. J. Field Robot. 2020, 37, 97–121. [Google Scholar] [CrossRef]
  16. Peng, H.; Guo, S.; Zou, X.; Wang, H.; Xiong, J.; Liang, Q. UAVO-NeRF: 3D reconstruction of orchards and semantic segmentation of fruit trees based on neural radiance field in UAV images. Comput. Electron. Agric. 2025, 237, 110631. [Google Scholar] [CrossRef]
  17. Pan, Y.; Hu, K.; Cao, H.; Kang, H.; Wang, X. A novel perception and semantic mapping method for robot autonomy in orchards. Comput. Electron. Agric. 2024, 219, 108769. [Google Scholar] [CrossRef]
  18. Fu, H.; Li, X.; Zhu, L.; Xin, P.; Wu, T.; Li, W.; Feng, Y. DSC-DeepLabv3+: A lightweight semantic segmentation model for weed identification in maize fields. Front. Plant Sci. 2025, 16, 1647736. [Google Scholar] [CrossRef] [PubMed]
  19. Sodano, M.; Magistri, F.; Marks, E.; Hosn, F.; Zurbayev, A.; Marcuzzi, R.; Malladi, M.V.; Behley, J.; Stachniss, C. 3D Hierarchical Panoptic Segmentation in Real Orchard Environments Across Different Sensors. arXiv 2025, arXiv:2503.13188. [Google Scholar] [CrossRef]
  20. Cuaran, J.; Ahluwalia, K.S.; Koe, K.; Uppalapati, N.K.; Chowdhary, G. Active Semantic Mapping with Mobile Manipulator in Horticultural Environments. In Proceedings of the 2025 IEEE International Conference on Robotics and Automation (ICRA), Atlanta, GA, USA, 19–23 May 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 12716–12722. [Google Scholar]
  21. Liu, Z.; Feng, Q.; Qin, C.; Lin, Y.; Xia, P.; Wang, H.; Gong, L.; Liu, C. EDSC-HRAFNet: An apple tree branch semantic segmentation model for harvesting robots under complex orchard conditions. Artif. Intell. Agric. 2026; in press. [Google Scholar] [CrossRef]
  22. Cheng, T.; Song, L.; Ge, Y.; Liu, W.; Wang, X.; Shan, Y. YOLO-World: Real-Time Open-Vocabulary Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 19–21 June 2024. [Google Scholar]
  23. Zhao, X.; Ding, W.; An, Y.; Du, Y.; Yu, T.; Li, M.; Tang, M.; Wang, J. Fast Segment Anything. arXiv 2023, arXiv:2306.12156. [Google Scholar] [CrossRef]
  24. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  25. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5105–5114. [Google Scholar]
  26. Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. Randla-net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020; pp. 11108–11117. [Google Scholar]
  27. Zhou, H.; Zhu, X.; Song, X.; Ma, Y.; Wang, Z.; Li, H.; Lin, D. Cylinder3d: An effective 3d framework for driving-scene lidar semantic segmentation. arXiv 2020, arXiv:2008.01550. [Google Scholar]
  28. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255. [Google Scholar]
  29. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  30. Zheng, C.; Xu, W.; Zou, Z.; Hua, T.; Yuan, C.; He, D.; Zhou, B.; Liu, Z.; Lin, J.; Zhu, F.; et al. Fast-livo2: Fast, direct lidar-inertial-visual odometry. IEEE Trans. Robot. 2024, 41, 326–346. [Google Scholar] [CrossRef]
  31. Liu, Z.; Li, H.; Yuan, C.; Liu, X.; Lin, J.; Li, R.; Zheng, C.; Zhou, B.; Liu, W.; Zhang, F. Voxel-slam: A complete, accurate, and versatile lidar-inertial slam system. arXiv 2024, arXiv:2410.08935. [Google Scholar]
  32. Zheng, C.; Zhu, Q.; Xu, W.; Liu, X.; Guo, Q.; Zhang, F. Fast-livo: Fast and tightly-coupled sparse-direct lidar-inertial-visual odometry. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 4003–4009. [Google Scholar]
  33. Carion, N.; Gustafson, L.; Hu, Y.T.; Debnath, S.; Hu, R.; Suris, D.; Ryali, C.; Alwala, K.V.; Khedr, H.; Huang, A.; et al. SAM 3: Segment Anything with Concepts. arXiv 2025, arXiv:2511.16719. [Google Scholar] [CrossRef]
  34. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention Mask Transformer for Universal Image Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 1290–1299. [Google Scholar]
  35. Vasconez, J.P.; Delpiano, J.; Vougioukas, S.; Cheein, F.A. Comparison of convolutional neural networks in fruit detection and counting: A comprehensive evaluation. Comput. Electron. Agric. 2020, 173, 105348. [Google Scholar] [CrossRef]
  36. Yuan, M.; Wang, L.; Waslander, S.L. Opennav: Open-world navigation with multimodal large language models. In Proceedings of the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hangzhou, China, 19–25 October 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 18948–18955. [Google Scholar]
Figure 1. Schematic diagram of semantic annotation.
Figure 2. System overview of 3D semantic map construction.
Figure 3. Two-stage semantic segmentation model diagram.
Figure 4. Multi-sensor fusion framework.
Figure 5. System overview of external calibration.
Figure 6. LiDAR SLAM experiment results. (a,b) The colored point cloud map; (c) The point cloud map.
Figure 7. Detailed magnified illustration. (a1,b1) are magnified versions of Figure 6a,b.
Figure 8. The hardware synchronization scheme.
Figure 9. Optimization results.
Figure 10. The experimental platform.
Figure 11. Comparison of Segmentation Effects.
Figure 12. Experimental environment setup.
Figure 13. Small-scale point cloud reconstruction results.
Figure 14. Color-based point cloud preprocessing.
Figure 15. Fruit clustering results.
Figure 16. Point cloud reconstructed by the vision-only baseline, viewed parallel to the direction of motion.
Figure 17. Point cloud from the vision-only baseline, viewed at approximately 45° to the motion direction.
Figure 18. Point cloud from the vision-only baseline, viewed perpendicular to the direction of motion. (a) The perspective is perpendicular to the direction of motion; (b) Prediction masks illustrating segmentation errors.
Figure 19. Algorithm performance comparison.
Figure 20. Experiment 1 (slow motion, stable lighting): Trajectory comparison between LIV fusion and visual-inertial.
Figure 21. Experiment 2 (fast motion, rapid lighting changes): Trajectory comparison between LIV fusion and visual-inertial.
Figure 22. Visual ablation positioning error experiment. (a) Trajectory schematic diagram. (b) Error comparison chart, based on visual standards.
Figure 23. Example of model segmentation error. (a1,a2): Original picture. (b1,b2): Prediction Mask.
Table 1. Key parameters in semantic mapping pipeline.

Parameter           | Value            | Description
Confidence range    | [0.55, 0.95]     | Bayesian fusion score bounds
Voxel resolutions   | 1.0, 0.5, 0.2 m  | Multi-scale grid sizes
Label threshold γ   | 0.6              | Dominant category ratio
Smoothing KNN       | 10               | Boundary point neighbors
Smoothing threshold | 60%              | Minimum vote for reassignment
Min 3D depth        | 0.5 m            | Detection box depth lower bound
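To make the roles of the label threshold γ, the smoothing KNN, and the 60% reassignment vote concrete, the sketch below (our own illustration under the parameter values listed in Table 1, not the released implementation; function names are hypothetical) shows dominant-category voting inside a voxel and KNN-based boundary smoothing:

import numpy as np
from collections import Counter

LABEL_THRESHOLD = 0.6   # gamma: dominant category ratio required inside a voxel
SMOOTH_K = 10           # number of neighbors used for boundary smoothing
SMOOTH_VOTE = 0.6       # minimum vote (60%) required to reassign a point

def voxel_dominant_label(labels_in_voxel):
    """Return the dominant label if its share exceeds gamma, otherwise None (ambiguous voxel)."""
    counts = Counter(labels_in_voxel)
    label, n = counts.most_common(1)[0]
    return label if n / len(labels_in_voxel) >= LABEL_THRESHOLD else None

def knn_smooth(points, labels, k=SMOOTH_K, vote=SMOOTH_VOTE):
    """Reassign a point's label when at least `vote` of its k nearest neighbors agree."""
    smoothed = labels.copy()
    for i, p in enumerate(points):
        d = np.linalg.norm(points - p, axis=1)
        nn = np.argsort(d)[1:k + 1]              # skip the point itself
        counts = Counter(labels[nn])
        lab, n = counts.most_common(1)[0]
        if n / len(nn) >= vote and lab != labels[i]:
            smoothed[i] = lab
    return smoothed

# Toy usage on random points with three classes.
pts = np.random.rand(30, 3)
labs = np.random.randint(0, 3, size=30)
print(voxel_dominant_label(labs), knn_smooth(pts, labs)[:5])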
Table 2. Performance comparison of different segmentation methods on pitaya test set.

Method     | mIoU/% | PA/%  | F1/%  | FPS
BiSeNet    | 78.61  | 87.75 | 88.01 | 15.15
U-Net      | 66.65  | 83.75 | 87.56 | 67.3
DeepLabV3+ | 67.14  | 82.24 | 79.28 | 133.08
FastSAM    | 74.33  | 85.02 | 84.72 | 74.85
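For reference, the FastSAM throughput in Table 2 and the per-frame latency quoted in the abstract describe the same measurement: t = 1000 / FPS = 1000 / 74.85 ≈ 13.36 ms per frame.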
Table 3. Comparison of actual and predicted fruit counts (Exp. 1–2: small-scale; Exp. 3–5: large-scale).

Experiment ID | Actual Value (Count) | Predicted Value (Count) | Relative Error α (%)
1             | 24                   | 19                      | 20.8
2             | 20                   | 20                      | 0.0
3             | 19                   | 16                      | 15.7
4             | 19                   | 17                      | 10.5
5             | 20                   | 23                      | 15.0
Mean          |                      |                         | 12.4
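The relative error column is consistent with the standard count-based definition (stated here as our reading, since it reproduces the tabulated values): α = |actual − predicted| / actual × 100%. For example, Experiment 1 gives |24 − 19| / 24 × 100% ≈ 20.8%, and averaging the five experiments yields the reported mean of 12.4%.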
Table 4. Absolute Trajectory Error (APE) of LIV system under different operational conditions.

Condition                           | Max (m) | Mean (m) | Median (m) | RMSE (m) | Std (m)
Slow motion, stable lighting        | 1.256   | 0.529    | 0.434      | 0.628    | 0.337
Fast motion, rapid lighting changes | 2.357   | 1.290    | 1.375      | 1.376    | 0.479
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
