1. Introduction
With the continuous expansion of modern agricultural production, it has become increasingly difficult and costly to rely solely on manual labor for tasks such as pest and disease monitoring, phenotypic trait acquisition, and yield estimation. As a result, integrating artificial intelligence technologies into various aspects of agricultural production [1,2,3] has emerged as a key research focus in the field of smart agriculture. Three-dimensional reconstruction of crops enables a comprehensive understanding of plants’ phenotypic traits, which reflect their current growth status and cultivation conditions [4]. These phenotypic characteristics facilitate the extraction of various plant parameters [5,6,7] and support automated monitoring.
With the advancement of smart agriculture, numerous studies have been conducted on three-dimensional (3D) reconstruction methods for crops. Traditional 3D reconstruction techniques include multi-view stereo (MVS) [8], structure from motion (SfM) [9], and time-of-flight (ToF) [10]. Both MVS and SfM reconstruct 3D structures from two-dimensional images captured from multiple viewpoints. Bao et al. [11] employed MVS to reconstruct 3D models of sorghum plants and successfully extracted their phenotypic traits. Van et al. [12] combined MVS with the A* algorithm on an autonomous robot to enable automatic cucumber harvesting. Zhu et al. [13] applied SfM to reconstruct 3D models of rice and maize and further achieved high-fidelity visualization using clustering and fitting methods. Liu et al. [14] proposed a combined SfM-MVS approach for measuring the morphological parameters of potato tubers. ToF technology is typically integrated into scanning devices such as LiDAR and depth cameras. Shi et al. [15] utilized LiDAR for 3D reconstruction of maize and achieved real-time phenotypic trait measurement. Marinello et al. [16] used a Kinect depth camera to estimate the volume and mass of grape bunches. Thapa et al. [17] employed an LMS 511 LiDAR scanner to acquire phenotypic parameters such as the leaf area and leaf inclination angles for maize and sorghum. However, traditional 3D reconstruction methods are often sensitive to lighting conditions or dependent on expensive scanning equipment, making them challenging to apply in outdoor environments. Over the past decade, visual simultaneous localization and mapping (vSLAM) has undergone rapid development. Owing to its low implementation cost and reliable reconstruction performance in outdoor settings, it has gradually become a mainstream approach for 3D plant reconstruction in smart agriculture.
Numerous studies have employed visual SLAM to acquire the phenotypic trait parameters of plants and guide agricultural development. Fan et al. [18] proposed a portable RGB-D SLAM system to reconstruct large-scale forests, obtaining tree location and height information. Dong et al. [19] utilized a depth camera combined with the ORB-SLAM2 system to capture and track the point cloud data of fruits in orchards. Pierzchala et al. [20] integrated multiple sensors, including stereo cameras and LiDAR, to map forest environments, providing precise tree location data for forestry vehicles. Kim et al. [21] developed a framework named P-AgSLAM to extract distinctive morphological features of maize in cornfields. Lobefaro et al. [22] achieved high-precision spatiotemporal matching and data association in pepper fields by combining RGB-D SLAM with visual place recognition. Nellithimaru et al. [23] integrated deep learning with visual SLAM and used stereo cameras to capture regions of interest, enabling stable grape counting in vineyards. Fan et al. [24] reconstructed maize crops using RGB-D cameras and estimated the maize stalk diameters by analyzing the convexities and projections in the point clouds. Wang et al. [25] improved ORB-SLAM2 and combined it with the YOLOv4 and DeepSORT algorithms to localize and count melons. Islam et al. [26] proposed the AGRI-SLAM system, which integrates image enhancement techniques to enable real-time environmental reconstruction under a wider range of agricultural conditions, including low-light scenarios. Tang et al. [27] employed a multi-sensor fusion SLAM system to conduct fruit tree localization experiments in pear and persimmon orchards, achieving high localization accuracy and constructing global maps of the orchards through 3D reconstruction. In previous studies, SLAM has been utilized to construct 3D maps for acquiring crop phenotypic traits. However, these maps typically represent only geometric features and exhibit limited expressiveness for regions of interest. By incorporating semantic information into the map, the representation of such regions can be enhanced, thereby improving both the expressiveness and the usability of the 3D map.
Integrating semantic or instance segmentation into SLAM to construct semantic point cloud maps that emphasize the phenotypic trait information of crops has become a research hotspot in agriculture. Xiong et al. [28] incorporated the semantic segmentation network BiSeNetV1 into VINS-RGBD to achieve 3D semantic reconstruction of citrus trees and estimated the fruit yield through conditional filtering and spherical fitting. Liu et al. [29] proposed a fruit counting framework using Faster R-CNN for fruit detection followed by semantic SfM for reconstruction and counting. Yuan et al. [30] combined the PP-LiteSeg-T network with VINS-RGBD to map strawberry fields and count strawberry fruits. Giang et al. [31] employed ORB-SLAM3 alongside a semantic segmentation network to generate semantic point cloud maps of sweet peppers, facilitating the detection of the lowest branch pruning points. Liu et al. [32] integrated ORB-SLAM3 with semantic segmentation to classify mature strawberries, guiding harvesting activities. Xie et al. [33] proposed a multi-object semantic segmentation network, MM-ED, combined with ORB-SLAM3 to build semantic point cloud maps of unstructured lawns, providing a reference and guidance for weeding operations. Pan et al. [34] designed a novel semantic mapping framework that employs the 3D-ODN detection network to extract object information from point clouds. The experimental results demonstrated that the system enables high-precision understanding of agricultural environments. Although these studies enrich point cloud representation by incorporating semantic segmentation and highlighting regions of interest, the data acquisition is performed manually, which is highly challenging in large-scale agricultural plantations. The use of embedded devices such as unmanned aerial vehicles (UAVs) or unmanned ground vehicles (UGVs) to assist with data collection can alleviate labor demands and reduce long-term operational costs.
Integrating UAVs or UGVs enables more flexible 3D reconstruction of crops, thereby improving the efficiency of plant phenotypic trait acquisition. The ultimate goal of smart agriculture is to combine intelligent equipment to support automated agricultural production. UAVs can rapidly acquire phenotypic trait data for small-scale fields, such as the plant height [35] or canopy coverage [36]. Sun et al. [37] utilized UAVs for 3D reconstruction of apple trees to obtain parameters including the tree height and canopy volume. Sarron et al. [38] performed 3D reconstruction of mango orchard trees using UAVs, achieving accurate mango yield estimation. UGVs are more suitable for detailed 3D reconstruction of shorter plants and provide finer mapping results. Siebers et al. [39] employed a UGV combined with LiDAR to generate precise 3D point clouds of grapevine rows for rapid phenotypic information extraction and conducted a preliminary assessment of the grape nutritional status. Qiu et al. [40] designed a LiDAR-equipped mobile robot to reconstruct maize plants and measured phenotypic parameters such as the plant height and row spacing. Wang et al. [41] developed a 3D reconstruction robot for greenhouse environments and validated the system in an eggplant cultivation base, successfully predicting the fruit counts. Nguyen et al. [42] developed DairyBioBot, a fully automated phenotyping platform equipped with LiDAR and RTK sensors, to reconstruct perennial ryegrass and estimate its field biomass. Qi et al. [43] proposed a CNN-based loop closure detection method tailored for intelligent embedded agricultural devices and suitable for RGB-D SLAM; the system was validated in greenhouse environments and demonstrated high accuracy and real-time performance in loop detection. Ma et al. [44] employed an improved YOLOv5 model to detect tree trunks and pedestrians, integrating it into a visual SLAM system to assist intelligent agricultural equipment in achieving more precise localization. These studies primarily focus on embedded devices for crop 3D reconstruction, but most rely on high-cost sensors such as LiDAR and operate offline. Therefore, employing lightweight visual SLAM for crop 3D modeling can reduce investment costs while enabling real-time acquisition of phenotypic traits and growth status.
To address the challenges encountered in the application of 3D modeling technologies in smart agriculture, this study proposes an ORB-SLAM3-based system integrated with the lightweight semantic segmentation network BiSeNetV2. The system aims to overcome the lack of semantic information in point cloud maps and the limited real-time performance of existing approaches. A two-stage filtering method is employed to remove noise from the semantic point cloud, and OctoMap is used for efficient storage, thereby addressing the issue of large-scale point cloud data management. The entire system is deployed on an unmanned ground vehicle equipped with the A* algorithm and validated in a tomato plantation. The experimental platform demonstrates the potential to significantly reduce labor requirements. The main contributions of this study are as follows:
- 1. We developed a 3D reconstruction robotic platform tailored for greenhouse environments, enabling accurate extraction of crop phenotypic traits from 3D maps. The platform was validated in a tomato plantation.
- 2. We designed a reconstruction system optimized for low-profile crops, integrating a lightweight semantic segmentation module to enhance the representation of regions of interest while maintaining real-time performance by segmenting only keyframes.
- 3. We proposed a two-stage filtering method to enhance point cloud accuracy and adopted OctoMap for efficient storage on embedded devices.
- 4. We constructed a precisely annotated tomato fruit semantic segmentation dataset, providing valuable data support for future research on tomato plants.
2. Materials and Methods
2.1. Study Area and Data Acquisition
In this study, a tomato cultivation base located in Gu’an County, Langfang City, was selected as the study area for data acquisition. Gu’an County is situated in the central part of the North China Plain (39°26′–39°45′ N, 116°09′–116°32′ E) and has a semi-humid continental monsoon climate, characterized by four distinct seasons, abundant sunlight, and rich thermal resources. The average annual temperature is approximately 12.1 °C, and the annual precipitation ranges from 500 to 600 mm, providing favorable conditions for agricultural development. In recent years, Gu’an County has actively promoted modern agriculture, with a continuously expanding scale of protected vegetable cultivation. As one of the major crops, tomatoes are widely grown with diverse management practices, making the region a representative sample area for studying crop phenotypic traits.
On 18 April 2025, image data of tomato plants were collected at the tomato cultivation base in Gu’an County using an Intel RealSense D435i depth camera (Intel, Bangkok, Thailand) and a Nikon Z30 camera (Nikon, Tokyo, Japan) equipped with a 50–250 mm f/4.5–6.3 lens. The image acquisition covered tomato plants and fruits at varying maturity levels under varying lighting conditions. The high-resolution images clearly captured the morphological structures of the plants, providing a reliable data foundation for subsequent tasks such as semantic segmentation, feature extraction, and 3D reconstruction of tomato plants.
Images captured by the D435i and Nikon Z30 cameras were jointly used to train the semantic segmentation network. In the self-constructed tomato dataset, the D435i was used to acquire wide-view images of tomato plants, while the Nikon Z30 was used to capture close-up images. A total of 300 wide-view and 300 close-up images were collected. All the images were uniformly resized and cropped to a resolution of 640 × 480 pixels and annotated using the Labelme software (version 5.4.1). The final dataset consisted of 600 annotated images. Sample images from the dataset are shown in Figure 1.
The system was tested in real-world scenarios using a UGV platform, as shown in Figure 2a, while Figure 2b presents the corresponding visualization results during the experiment. The UGV platform consists of a sensor module, a chassis module, and a control module. The sensor module includes an Intel RealSense D435i depth camera, a 2D LiDAR, and an RTK positioning system. The chassis module supports keyboard and joystick control, with a maximum payload capacity of 50 kg and a maximum operating speed of 0.8 m/s. The control module is PC-based, running Ubuntu 20.04 and ROS. Equipped with the A* algorithm, the UGV performs autonomous navigation in the field to collect data and conduct 3D reconstruction. During the experiment, RTK was used to obtain the ground-truth trajectory of the UGV; the Intel RealSense D435i was used to capture RGB and depth images; a monitor was used to visualize the experimental results; and a LiDAR sensor was employed for obstacle avoidance and path planning.
2.2. Improved ORB-SLAM3 System
This study performs 3D semantic reconstruction of tomato plants based on an improved ORB-SLAM3 system. Compared to ORB-SLAM2, ORB-SLAM3 incorporates inertial measurement unit (IMU) input to assist in camera pose estimation and introduces a multi-map system, resulting in improved reconstruction quality and localization accuracy. Based on the practical application scenario, we further enhanced ORB-SLAM3. The specific workflow is illustrated in Figure 3.
Since the 3D reconstruction results of ORB-SLAM3 consider only geometric information and lack semantic expressiveness, we integrated a semantic module into the system to enhance the representation of the point cloud map. Considering that the system runs on embedded devices, measures are necessary to reduce the computational and storage burdens. As the majority of the computational load originates from the semantic segmentation module’s inference, we selected the lightweight semantic segmentation network BiSeNetV2 and performed segmentation only on keyframe images. For the point cloud map storage, OctoMap was employed to significantly reduce the storage requirements, making it suitable for large-scale agricultural 3D reconstruction.
It is worth noting that this study adopted a “segment-then-reconstruct” approach to generate semantic point clouds. This strategy is based on the maturity of current image semantic segmentation techniques, which can provide more detailed and accurate semantic information at the image level. By generating point clouds from segmented semantic images, the complexity of the subsequent point cloud processing is effectively reduced, thereby improving the computational efficiency. Moreover, this module is easier to integrate in practical applications, demonstrating favorable engineering feasibility.
The system first performs tracking and pose estimation on each input frame and determines whether the frame qualifies as a keyframe based on the motion information. The keyframe selection is guided by the pose change and tracking quality between the current frame and the reference keyframe. Specifically, a frame is marked as a keyframe when the translation or rotation relative to the previous keyframe exceeds the predefined thresholds, or when the number of successfully tracked map points drops significantly. For each selected keyframe, semantic segmentation is performed, and the pixel-level semantic labels are projected into the 3D space using the corresponding depth map, pose information, and camera intrinsics to generate a 3D semantic point cloud. The resulting semantic point cloud is then processed through a two-stage filtering pipeline to remove noise and redundant information, and it is subsequently integrated into an OctoMap to construct the 3D semantic map. The system retains the local optimization and loop closure modules of ORB-SLAM3 to ensure the global consistency and accuracy of the reconstructed map. The detailed workflow of the improved ORB-SLAM3 system is presented in Section 1 of Appendix A.
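To make the back-projection step concrete, the following C++ sketch generates a labeled point cloud from one segmented keyframe using the depth map, camera pose, and intrinsics. This is a minimal sketch, not the authors' implementation; the intrinsic parameter names (fx, fy, cx, cy), the millimeter depth convention, and the PCL point type are illustrative assumptions.

```cpp
// Minimal sketch: back-project a segmented keyframe into a semantic point cloud.
#include <Eigen/Geometry>
#include <opencv2/core.hpp>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>

pcl::PointCloud<pcl::PointXYZRGBL>::Ptr backProjectKeyframe(
    const cv::Mat& depth,         // 16-bit depth image, assumed in millimeters
    const cv::Mat& labels,        // 8-bit per-pixel class ids from BiSeNetV2
    const cv::Mat& rgb,           // color image aligned to the depth image
    const Eigen::Isometry3d& Twc, // keyframe pose from ORB-SLAM3 tracking
    double fx, double fy, double cx, double cy, double depthScale = 1000.0)
{
    pcl::PointCloud<pcl::PointXYZRGBL>::Ptr cloud(
        new pcl::PointCloud<pcl::PointXYZRGBL>);
    for (int v = 0; v < depth.rows; ++v) {
        for (int u = 0; u < depth.cols; ++u) {
            const unsigned short d = depth.at<unsigned short>(v, u);
            if (d == 0) continue;              // skip pixels with no depth
            const double z = d / depthScale;   // convert to meters
            // Pinhole back-projection into the camera frame.
            Eigen::Vector3d pc((u - cx) * z / fx, (v - cy) * z / fy, z);
            Eigen::Vector3d pw = Twc * pc;     // transform to the world frame
            pcl::PointXYZRGBL p;
            p.x = pw.x(); p.y = pw.y(); p.z = pw.z();
            const cv::Vec3b& c = rgb.at<cv::Vec3b>(v, u);
            p.b = c[0]; p.g = c[1]; p.r = c[2];
            p.label = labels.at<unsigned char>(v, u);  // semantic class id
            cloud->push_back(p);
        }
    }
    return cloud;
}
```

The returned cloud would then feed the two-stage filtering and OctoMap stages described in Sections 2.4 and 2.5.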
2.3. Semantic Segmentation Module
In this study, we employed the lightweight bilateral segmentation network BiSeNetV2 for the pixel-level semantic segmentation of tomato plant images. The network architecture is illustrated in Figure 4. BiSeNetV2 is particularly well-suited for real-time 3D semantic reconstruction of low-height crops in agricultural scenarios due to its efficient network design and excellent segmentation performance.
The core architecture of the BiSeNetV2 network consists of three main components: the bilateral segmentation network, the guided aggregation layer, and an enhanced training strategy. This design allows the network to maintain a high inference speed while meeting the accuracy requirements of semantic segmentation. As shown in Figure 4, the green box represents the bilateral segmentation network, which is composed of a detail branch (highlighted in purple) and a semantic branch (highlighted in yellow). The detail branch adopts a wide and shallow structure to effectively extract high-resolution boundary and texture information, which is beneficial for delineating structurally complex regions such as the leaves and fruits of tomato plants. In contrast, the semantic branch employs a deep and narrow architecture, incorporating depthwise separable convolutions, fast downsampling, and global average pooling to capture rich contextual semantics. This enhances the network’s ability to interpret complex scenes involving background clutter and occlusion. The dual-branch structure enables the complementary fusion of semantic and geometric features, resulting in robust segmentation performance even under conditions of dense crop distribution or leaf occlusion.
The blue box represents the guided aggregation layer, which employs a bidirectional attention mechanism to efficiently fuse the detail and semantic features extracted by the bilateral segmentation network. Compared with traditional operations such as concatenation or element-wise addition, this module minimizes the information loss during the fusion process. As a result, it offers improved discriminative capability when dealing with challenges such as blurred plant edges and leaf occlusions in tomato plants.
BiSeNetV2 also incorporates an enhanced training strategy, indicated by the orange box in the figure. This strategy embeds lightweight FCNHead auxiliary heads at different levels of the semantic branch to guide the network in learning multi-scale features more effectively. During training, these auxiliary heads help improve the convergence speed and robustness of the network. However, they are completely removed during inference, thus introducing no additional computational overhead into the semantic segmentation process and ensuring the overall efficiency of the system.
The semantic segmentation network integrated into the SLAM system performs segmentation only on selected keyframes to avoid compromising the system’s real-time performance. BiSeNetV2 accurately and efficiently segments the keyframe images, and the semantic segmentation results are used to assign semantic labels to the point clouds generated during mapping, thereby constructing a structurally clear and semantically explicit 3D semantic point cloud map. Its excellent inference speed and segmentation accuracy ensure the real-time capability and precision of the semantic map construction, facilitating subsequent tasks such as fruit identification and region extraction in tomato plants.
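The keyframe-only policy can be summarized with a short C++ sketch. This is an assumed integration pattern rather than the authors' code; segmentWithBiSeNetV2() is a hypothetical placeholder for the actual model runtime (e.g., an exported inference engine).

```cpp
// Sketch of keyframe-gated segmentation: inference runs once per keyframe and
// the label map is cached for the mapping thread, so ordinary frames carry no
// segmentation cost and real-time tracking is preserved.
#include <opencv2/core.hpp>

// Hypothetical inference hook; a real deployment would invoke the exported
// BiSeNetV2 model here. The zero-filled output is only a placeholder.
static cv::Mat segmentWithBiSeNetV2(const cv::Mat& rgb) {
    return cv::Mat::zeros(rgb.size(), CV_8UC1);
}

struct Keyframe {
    cv::Mat rgb;
    cv::Mat labels;  // filled once, reused when building the semantic cloud
};

void onNewKeyframe(Keyframe& kf) {
    kf.labels = segmentWithBiSeNetV2(kf.rgb);  // per-pixel class ids
}
```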
2.4. Two-Stage Filtering Method
Abnormal points may appear in the 3D reconstruction map due to factors such as lighting variations and dynamic objects. The presence of these outliers degrades the quality of the final point cloud map and increases the error in estimating the phenotypic parameters of tomato plants from the point cloud data. Therefore, to enable the point cloud map to more accurately represent tomato plants, we applied a two-stage filtering process to the tomato point cloud, consisting of pass-through filtering followed by statistical filtering.
In the first stage, pass-through filtering is applied to the point cloud of tomato plants along the vertical direction to remove a large number of non-plant points in the background. This process can be expressed by Equation (1):

$$P' = \left\{ (x, y, z) \in P \;\middle|\; x_{\min} \le x \le x_{\max},\; y_{\min} \le y \le y_{\max},\; z_{\min} \le z \le z_{\max} \right\} \tag{1}$$

where $x$, $y$, and $z$ represent the coordinates of points in the point cloud $P$, and $[x_{\min}, x_{\max}]$, $[y_{\min}, y_{\max}]$, and $[z_{\min}, z_{\max}]$ denote the filtering ranges along the three coordinate axes, respectively.
Since pass-through filtering only provides an approximate range of valid tomato plant points, a secondary filtering step is necessary to improve the accuracy of the local point cloud. In the second stage, statistical filtering is applied to the remaining point cloud. Statistical filtering removes outliers by analyzing the distances between neighboring points within each point’s local neighborhood. Specifically, for each point $p_i$, the average distance $\bar{d}_i$ to its $k$ nearest neighbors $p_j\ (j = 1, \dots, k)$ is calculated using Equation (2):

$$\bar{d}_i = \frac{1}{k} \sum_{j=1}^{k} \left\| p_i - p_j \right\| \tag{2}$$

Then, the mean $\mu$ and standard deviation $\sigma$ of these average distances are computed, assuming they follow a Gaussian distribution. The valid range of the point cloud distances is defined as $[\mu - \alpha\sigma,\ \mu + \alpha\sigma]$, where $\alpha$ is the standard deviation multiplier. For each point $p_i$, if its average distance $\bar{d}_i$ lies within this interval, it is retained; otherwise, it is removed. By tuning the parameters of the statistical filter, a balance between mapping accuracy and filtering speed for tomato plant reconstruction is achieved.
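A minimal sketch of this two-stage pipeline using PCL is given below. The vertical-axis limits, the neighbor count $k$, and the standard deviation multiplier $\alpha$ are illustrative assumptions; the tuned values used in this study are not reproduced here.

```cpp
// Two-stage filtering sketch with PCL: pass-through (Equation (1)) followed
// by statistical outlier removal (Equation (2)).
#include <pcl/filters/passthrough.h>
#include <pcl/filters/statistical_outlier_removal.h>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>

using Cloud = pcl::PointCloud<pcl::PointXYZRGB>;

Cloud::Ptr twoStageFilter(const Cloud::Ptr& input) {
    // Stage 1: pass-through along the vertical axis discards background
    // points outside the assumed plant height range.
    Cloud::Ptr stage1(new Cloud);
    pcl::PassThrough<pcl::PointXYZRGB> pass;
    pass.setInputCloud(input);
    pass.setFilterFieldName("z");    // vertical axis; assumed convention
    pass.setFilterLimits(0.0, 2.0);  // assumed height range in meters
    pass.filter(*stage1);

    // Stage 2: statistical outlier removal. Points whose mean distance to
    // their k nearest neighbors falls outside mu +/- alpha*sigma are removed.
    Cloud::Ptr stage2(new Cloud);
    pcl::StatisticalOutlierRemoval<pcl::PointXYZRGB> sor;
    sor.setInputCloud(stage1);
    sor.setMeanK(50);              // k nearest neighbors (assumed value)
    sor.setStddevMulThresh(1.0);   // alpha in mu + alpha*sigma (assumed value)
    sor.filter(*stage2);
    return stage2;
}
```

Increasing $k$ or tightening $\alpha$ removes more noise at the cost of filtering speed, which is the accuracy-speed trade-off noted above.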
2.5. OctoMap Construction
In this study, the map constructed by ORB-SLAM3 is a dense point cloud, which requires substantial storage space. This poses a challenge for embedded devices with limited storage capacity, making it difficult to store large-scale 3D reconstruction results in agricultural scenarios. To address this issue, we employed OctoMap to convert and store the dense point cloud. Maps constructed with OctoMap offer flexible storage and can be incrementally updated, making them well-suited for large-scale agricultural applications.
The principle of representing dense point clouds using OctoMap is illustrated in Figure 5. OctoMap models the 3D space as a cube, which can be recursively subdivided into eight smaller cubes of equal size. This process begins from a root node, where each node can expand into eight child nodes, and the subdivision continues iteratively. Based on this principle, the entire dense point cloud is progressively partitioned so that each node represents a smaller spatial region, until a predefined resolution is reached. By adjusting the threshold parameters, OctoMaps of different resolutions can be generated to meet varying precision requirements. Section 2 in Appendix A describes the algorithmic process of OctoMap.
In the final OctoMap, each node contains information indicating whether the corresponding space is occupied. The occupancy status of a node is represented using a log-odds formulation, as defined in Equation (3):

$$l = \log \frac{p}{1 - p} \tag{3}$$

where $l$ denotes the log-odds and $p$ represents the probability of occupancy. By rearranging Equation (3), we obtain Equation (4):

$$p = \frac{1}{1 + e^{-l}} \tag{4}$$

which shows that as $l$ varies within the range $(-\infty, +\infty)$, the corresponding probability $p$ lies within the interval $(0, 1)$.
When all eight child nodes within a block are consistently marked as either occupied or free, the parent node corresponding to that block ceases further subdivision. When the occupancy value of a node exceeds the defined threshold, the node is considered occupied and can be visualized, and is thus regarded as a stable node.
In this study, the improved ORB-SLAM3 system integrates with OctoMap through the mapping thread, which outputs dense semantic point clouds. When the mapping thread processes keyframes and generates the corresponding dense semantic point clouds, these point clouds are immediately passed to the OctoMap module for spatial voxelization and incremental map updates. Compared to directly storing point clouds, this representation significantly reduces the memory consumption, improves the processing efficiency, and enhances the system’s deployability on resource-constrained platforms. As a result, it increases the feasibility of performing large-scale 3D semantic reconstruction in agricultural environments.
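The sketch below illustrates this integration step with the octomap library: a filtered semantic cloud is voxelized by per-point occupancy updates, which internally accumulate the log-odds of Equation (3). The 5 cm resolution and the choice of ColorOcTree (storing color rather than an explicit label field) are assumptions for illustration, not the exact configuration used in this study.

```cpp
// Sketch: insert a filtered point cloud into an OctoMap and write it to disk.
#include <octomap/ColorOcTree.h>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>

void insertCloud(octomap::ColorOcTree& tree,
                 const pcl::PointCloud<pcl::PointXYZRGB>& cloud) {
    for (const auto& p : cloud) {
        // Mark the voxel containing this point as occupied; the tree applies
        // the log-odds occupancy update of Equation (3) internally.
        tree.updateNode(octomap::point3d(p.x, p.y, p.z), true);
        tree.integrateNodeColor(p.x, p.y, p.z, p.r, p.g, p.b);
    }
    tree.updateInnerOccupancy();  // propagate occupancy up to parent nodes
}

int main() {
    octomap::ColorOcTree tree(0.05);           // 5 cm leaf resolution (assumed)
    pcl::PointCloud<pcl::PointXYZRGB> cloud;   // would come from the filter stage
    insertCloud(tree, cloud);
    tree.write("tomato_semantic_map.ot");      // compact on-disk representation
    return 0;
}
```

Because identical child voxels are merged into their parent, large free or occupied regions collapse into a few coarse nodes, which is the source of the storage savings reported for OctoMap.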
4. Conclusions
This study proposes an ORB-SLAM3 system integrated with the lightweight semantic segmentation network BiSeNetV2, which is embedded on an unmanned vehicle for phenotypic trait acquisition of tomato plants in a greenhouse environment. To validate the feasibility of the proposed method, field experiments were conducted in a tomato cultivation base. The integration of BiSeNetV2 enables real-time semantic point cloud reconstruction, achieving a segmentation accuracy of 95.37% mIoU and a processing speed of 61.98 FPS on the tomato dataset. Outlier points in the point cloud were removed using a combination of pass-through and statistical filtering, and the processed data were stored using OctoMap, resulting in an average storage space reduction of 96.70%. Analysis of the point cloud allowed extraction of the plant height, canopy width, and plant volume, with relative errors of 3.86%, 14.34%, and 27.14%, respectively, compared to the ground truth measurements. The number and size of the tomato fruits were estimated via spherical fitting, yielding relative errors of 14.36% and 14.25%, respectively. In practical greenhouse scenarios, the system demonstrated robust localization performance, achieving an mATE of 0.16 m and an RMSE of 0.21 m on the field dataset.
Although the proposed system has achieved promising results in acquiring phenotypic traits of tomato plants, several limitations remain. Since the system relies on an RGB-D camera for data acquisition, it is sensitive to lighting variations—strong illumination or shadowed regions can introduce depth errors, thereby reducing the accuracy of the point cloud. In real-world operating environments, issues such as leaf overlap and fruit occlusion may occur, leading to incomplete semantic segmentation. These segmentation gaps can subsequently result in inaccuracies in the phenotypic parameter estimation.
Future work will explore the application prospects of the system across various other greenhouse fruits and vegetables to evaluate its generalizability. Meanwhile, efforts will be made to optimize the methods for acquiring accurate ground truth values for plant phenotypic traits.