1. Introduction
Modern agriculture is facing increasingly severe food security challenges. As a representative form of controlled-environment agriculture, plant factories play a critical role in enabling precision agriculture, where intelligent technologies are essential for sustainable and efficient production [
1]. The core of precise process management and yield prediction in plant factories lies in the real-time, non-destructive acquisition of key phenotypic traits that accurately reflect crop growth status, such as plant height, leaf number, and leaf area. However, traditional phenotyping approaches, including manual measurement and destructive sampling, suffer from low efficiency, strong subjectivity, and disruption of plant growth continuity, making them inadequate for the high-frequency and high-throughput monitoring demands of plant factories [
2]. Advances in computer vision have spurred the development of vision-based phenotyping platforms [
3], which can be broadly classified into 2D image processing methods and 3D reconstruction techniques [
4,
5]. However, the complex structural patterns and severe mutual occlusion of leafy vegetables under intensive cultivation conditions significantly limit the effectiveness of 2D imaging approaches by introducing self-occlusion, loss of depth information, and perspective distortion. As a result, projection-based 2D methods can only capture basic traits (e.g., leaf length and projected area) and often yield inaccurate measurements for complex phenotypes [
6]. In contrast, 3D reconstruction enables non-invasive spatial analysis and provides more comprehensive phenotypic information [
7]. Therefore, accurate 3D crop modeling is crucial for continuous phenotypic monitoring throughout the production process [
8].
Active 3D data acquisition methods, such as laser scanning and depth cameras, still face notable limitations. Laser scanners are prohibitively expensive [
9], while depth sensors typically suffer from low point cloud quality and limited robustness to illumination variations [
3]. More critically, both approaches perform poorly when reconstructing small-scale targets such as leafy vegetables, often failing to capture fine structural details. Deep learning–based 3D reconstruction methods provide a promising alternative to address these challenges. In recent years, neural rendering techniques, represented by Neural Radiance Fields (NeRF) [
10] and 3D Gaussian Splatting (3DGS) [
11], have revolutionized 3D reconstruction and novel view synthesis. These methods reconstruct realistic 3D digital models of real-world scenes using only multi-view 2D images and camera poses, without requiring explicit 3D or depth supervision, thereby significantly improving reconstruction efficiency and accuracy. They have been widely applied in fields such as 3D surface extraction, human avatar modeling, large-scale urban scene representation, and view synthesis. Mildenhall et al. [
10] first introduced NeRF, which represents a scene as a continuous volumetric function parameterized by a neural network mapping spatial coordinates and viewing directions to color and density values. Hu et al. [
12] demonstrated the feasibility of NeRF-based methods for measuring plant phenotypic parameters, including leaf morphology, plant height, and canopy structure, in complex agricultural environments. Despite these advances, NeRF-based methods [
13,
14,
15,
16,
17] generally require substantial computational resources during optimization. Recently, 3DGS has emerged as a novel approach for 3D reconstruction and rendering. By explicitly modeling scenes using a set of structured Gaussian primitives and adopting a splatting-based rendering strategy, 3DGS achieves substantially faster training and millisecond-level (real-time) rendering, effectively addressing the high computational cost of NeRF while preserving high-quality 3D reconstruction. Unlike implicit neural representations, 3DGS relies on interpretable geometric primitives, offering a favorable balance between reconstruction accuracy, rendering efficiency, and model interpretability. Chen et al. [
18] proposed an improved 3DGS-based framework for high-quality orchard reconstruction, achieving accurate multi-scale reconstruction of peach orchards. Shen et al. [
19] leveraged 3DGS to address leaf overlap and incomplete structural information in complex outdoor environments, enabling accurate 3D reconstruction and biomass estimation of oilseed rape.
Nevertheless, the application of 3DGS in agricultural scenarios remains limited. Numerous studies [
20,
21] have shown that the unordered and irregular nature of Gaussian primitives makes it difficult for standard 3DGS to accurately model real scene surfaces. Moreover, optimizing 3DGS solely based on image reconstruction objectives often leads to local minima, resulting in inaccurate depth estimation and poor geometric fidelity. To alleviate geometric ambiguity, 2DGS [
22] and PGSR [
23] flatten 3D volumes into sets of view-oriented planar Gaussian ellipses, providing inspiration for addressing geometric uncertainty in 3D Gaussian representations. Yu et al. [
24] introduced a Gaussian Opacity Field (GOF) to facilitate geometry extraction. However, for leafy vegetable phenotyping, geometric reconstruction accuracy is a critical requirement. Existing 3DGS-based methods still struggle to generate high-precision depth maps and maintain multi-view geometric consistency, leading to severe depth artifacts when applied to leafy vegetables with complex surfaces and intricate geometries. These limitations significantly hinder downstream phenotyping and related agricultural applications. Therefore, there is an urgent need for an efficient and high-quality geometric reconstruction framework specifically tailored to leafy vegetable scenes.
To address these challenges, this study proposes LV-3DGS, an improved 3DGS-based framework designed to enhance 3D reconstruction performance in leafy vegetable scenarios and enable precise phenotypic measurement. Specifically, multi-view image data of various leafy vegetables are captured using RGB cameras. A blurred reconstruction module, a planar optimization strategy, and a Gaussian pruning strategy are introduced and integrated into the reconstruction pipeline. Based on this framework, reconstruction and phenotypic measurement experiments are conducted in real cultivation environments. The main contributions of this work are summarized as follows:
(1) Motion blur reconstruction. A blurred reconstruction method is proposed based on improvements to the original 3DGS model. By estimating the camera movement trajectory and sampling sub-frames along the approximated motion path, a clear new view is rendered. This effectively addresses reconstruction artifacts caused by motion blur in the sampled data under real agricultural production environments.
(2) Planar optimization strategy. To address the difficulty of reconstructing realistic leaf surfaces and geometries with conventional 3DGS, we propose: (i) a prior depth-guided initialization to bootstrap geometry in low-texture regions; (ii) Gaussian flattening to enforce the planar prior of leaves; (iii) a normal-constrained rendering module for geometrically accurate rasterization; and (iv) a median depth optimization to robustly handle severe occlusions. These strategies jointly enhance surface reconstruction fidelity and reduce geometric errors.
(3) Gaussian pruning strategy. Based on the analysis of individual Gaussian contributions, we introduce a contribution-based pruning strategy that selectively removes inaccurate structures and learns Gaussian primitives with precise geometry, achieving accurate 3D reconstruction while reducing memory consumption and improving rendering efficiency.
(4) Geometric regularization and quantitative evaluation metrics. We propose local geometric consistency constraints between rendered normals and depth maps, as well as global geometric consistency across multiple views. Furthermore, a quantitative geometric evaluation metric is introduced based on global geometric consistency to assess the geometric quality of the reconstruction results.
The remainder of this paper is organized as follows.
Section 2 describes the data acquisition and dataset construction process.
Section 3 presents the proposed LV-3DGS model and the associated phenotypic measurement methodology.
Section 4 evaluates the reconstruction performance of LV-3DGS through comparative and ablation experiments and reports phenotypic measurement results based on reconstructed leafy vegetables.
Section 5 concludes the paper, discusses current limitations, and outlines future research directions.
3. Methods
High-quality reconstruction of the 3D morphology of leafy vegetables is essential for phenotypic growth monitoring and yield analysis, and also provides a fundamental basis for exploring the feasibility of digital twin technologies in agricultural scenarios. However, existing NeRF-based and 3DGS-based models struggle to achieve high-quality reconstruction in leafy vegetable scenes due to inherent challenges such as monochromatic appearance, highly similar surface textures, and severe inter-plant and inter-leaf occlusions. In addition, motion blur introduced during image acquisition significantly degrades rendering quality, further limiting the applicability of these models in real-world agricultural environments. An overview of the proposed LV-3DGS framework is illustrated in
Figure 1. The integrated blurred reconstruction module addresses reconstruction under motion-blurred conditions and is described in
Section 3.1. The proposed high-quality surface reconstruction strategy, consisting of Prior Depth-Guided Initialization (PDGI), Gaussian Flattening, Normal Constraint (NC), and Median Depth Rendering (MDR), is presented in
Section 3.2. The Gaussian Pruning (GP) strategy selectively removes redundant Gaussians to obtain accurate scene geometry while reducing memory consumption and improving computational efficiency, as detailed in
Section 3.3.
3.1. Blurred Reconstruction Module
The clarity of scene reconstruction is critical for accurate phenotypic analysis. Multi-view image data captured in real agricultural environments are often affected by camera motion blur, which significantly degrades the quality of 3D reconstruction [
27]. Although certain motion-blurred images can be filtered during preprocessing using blur detection methods (e.g., Fast Fourier Transform (FFT)-based techniques), such FFT-based approaches cannot fundamentally resolve the issue and only reduce the amount of low-quality data. The original 3DGS framework is designed to reconstruct 3D scenes from clean input images and, to the best of our knowledge, does not explicitly address optimization from motion-blurred inputs. To improve data usability and adapt to natural acquisition conditions, we propose a blurred reconstruction module that can be seamlessly integrated into the existing 3DGS framework. This module synthesizes clear views by estimating the camera motion trajectory and rendering approximate sub-frames along the estimated motion path, thereby reducing reconstruction artifacts. Furthermore, it prevents the generation of inaccurate Gaussian primitives caused by unreliable camera poses during the early stages of training.
From a physical perspective, camera motion blur arises from the temporal integration of irradiance over the exposure duration during unintended camera motion, such as hand shake or jitter [
28]. During the shutter interval, the camera cannot maintain a stable pose, causing the accumulated clear sub-frame images to appear blurred. A blurred image $B$ can therefore be modeled as the temporal integration of irradiance $I$ from camera pose $P_t$ over the exposure interval $\tau$, as defined in Equation (1):
$$B = \frac{1}{\tau}\int_{0}^{\tau} I(P_t)\,dt \approx \frac{1}{N}\sum_{i=1}^{N} I(P_i) \tag{1}$$
where $I(P_i)$ denotes a sharp image captured at pose $P_i$, and $P_i$ represents the $i$-th sub-frame pose sampled during the exposure time. The integral is approximated by uniformly dividing the exposure duration into $N$ sub-frames and accumulating their irradiance contributions. In practice, we set $N = 12$, which provides a sufficiently accurate approximation of motion blur while maintaining a reasonable computational cost. Further increasing $N$ yields only marginal improvements in reconstruction quality but significantly increases rendering time. Initial camera trajectories and poses are obtained from COLMAP, and the $N$ sub-frames are uniformly sampled over the normalized time interval $[0, 1]$.
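The discrete approximation described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in this work: `synthesize_blur` and `toy_render` are hypothetical names, and the toy renderer merely stands in for a Gaussian splatting rasterizer that would return a sharp image for a given sub-frame pose.

```python
import numpy as np

def synthesize_blur(render_fn, sub_poses):
    """Average sharp sub-frame renders to approximate one blurred exposure."""
    frames = [np.asarray(render_fn(p), dtype=np.float64) for p in sub_poses]
    return np.mean(frames, axis=0)

def toy_render(shift, width=8):
    """Hypothetical stand-in renderer: one bright pixel whose column depends on the pose."""
    img = np.zeros((1, width))
    img[0, int(round(shift))] = 1.0
    return img

# Four uniformly sampled sub-frame "poses" (a 1D shift in this toy setting).
sub_poses = np.linspace(0.0, 3.0, 4)
blurred = synthesize_blur(toy_render, sub_poses)
```

In the toy case the single bright pixel is smeared evenly over four columns, which is the streak one would expect from a camera translating during the exposure.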
As illustrated in
Figure 2, sub-frame sampling is performed along the estimated camera motion trajectory. Following ExBluRF [
29] and DeBlur-GS [
30], we parameterize the rigid camera motion using Bézier curves in the Lie algebra se(3) associated with the rigid-motion group SE(3). For each sub-frame pose
, a sub-frame alignment parameter
is introduced to refine the pose along the estimated trajectory, yielding an optimized camera pose
that better approximates the latent camera pose at time
. Specifically, the alignment parameters
are initialized as identity transformations, i.e., the zero element of the Lie algebra se(3), based on the assumption of locally smooth camera motion between adjacent frames. The blurred image can thus be expressed as Equations (
2) and (
3):
The corrected poses are accumulated across N temporal samples and rendered using Gaussian splatting rasterization to synthesize motion-blurred images. Given a set of M blurred input images , the optimization objective is to estimate the alignment parameters that best describe the underlying camera motion trajectory while producing a sharp scene representation. This is achieved by minimizing the Manhattan distance between the reconstructed images and the observed blurred inputs.
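To make the trajectory parameterization concrete, the following sketch evaluates a Bézier curve over 6-D tangent vectors at uniformly spaced times. It is an illustration under stated assumptions: the function name `bezier_se3_tangent`, the control points, and the ordering of the six components are hypothetical, and the exponential map that would turn each tangent vector into a rigid transform, as well as the per-sub-frame alignment parameters, are omitted.

```python
import numpy as np
from math import comb

def bezier_se3_tangent(control_points, t):
    """Evaluate a Bezier curve over 6-D se(3) tangent vectors at times t in [0, 1]."""
    ctrl = np.asarray(control_points, dtype=np.float64)    # (n + 1, 6)
    t = np.atleast_1d(np.asarray(t, dtype=np.float64))     # (N,)
    n = ctrl.shape[0] - 1
    # Bernstein basis B_{k,n}(t) = C(n, k) t^k (1 - t)^(n - k), shape (N, n + 1)
    basis = np.stack(
        [comb(n, k) * t**k * (1.0 - t) ** (n - k) for k in range(n + 1)], axis=1
    )
    return basis @ ctrl                                     # (N, 6)

# Hypothetical control points: a small rotation about z plus a translation along x.
ctrl = np.array([
    [0.0, 0.0, 0.00, 0.00, 0.0, 0.0],
    [0.0, 0.0, 0.01, 0.05, 0.0, 0.0],
    [0.0, 0.0, 0.02, 0.10, 0.0, 0.0],
])
times = np.linspace(0.0, 1.0, 12)      # N = 12 sub-frames over normalized time
sub_frames = bezier_se3_tangent(ctrl, times)
```

Because only the control points are optimized, the number of free parameters per image stays fixed regardless of how many sub-frames are sampled.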
Following prior work [
27,
29,
31], a gamma correction function is applied to the synthesized blurred views to accurately model the camera imaging process. Specifically,
is used to convert irradiance to image intensity, together with a nonlinear response function to approximate the physical image formation process.
3.2. The Planar Optimization Strategy
In this section, we first address the issue of point cloud sparsity in weakly textured regions of leafy vegetable scenes when using conventional 3DGS by introducing a Prior Depth-Guided Initialization (PDGI) module. Next, we discuss how 3D Gaussian primitives can be transformed into planar representations. Based on this planar formulation, we propose a Normal-Constrained (NC) planar Gaussian rendering method, which jointly renders plane-to-camera distances and surface normals. The rendered depth values are further constrained by surface normals and converted into depth maps, thereby improving geometric accuracy. Finally, to handle severe occlusions commonly observed in leafy vegetable scenes, we introduce a Median Depth Rendering (MDR) strategy to improve the robustness of depth estimation in 3DGS.
3.2.1. Prior Depth-Guided Initialization (PDGI)
Inspired by the work of PlanarGS [
32], we observe that SfM-based initialization in 3DGS heavily depends on feature extraction results. In scenes dominated by similar textures, such as leafy vegetables, this dependency often leads to sparse point clouds over large regions. To alleviate this issue, we back-project prior depth information into dense 3D space to supplement missing point clouds in texture-similar areas. Specifically, a pretrained monocular depth estimation model (Depth Anything [
33]) is first employed to predict depth maps. As illustrated in
Figure 3, for each pixel
in the depth map, four neighboring (radius = 1) pixels (
) are sampled under a local planar assumption to estimate the distance from pixel
to the camera. These pixels are then back-projected into 3D space. The normal
of the local plane at pixel
is computed as Equation (
4):
Based on the local plane normal and the depth value in the depth map, the distance from the local plane to the camera can be computed using Equation (5).
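The local planar computation can be sketched as follows. This is one plausible reading of the procedure rather than the exact formulation of Equations (4) and (5): the normal is taken as the cross product of the horizontal and vertical neighbor differences, the plane distance as the projection of the back-projected point onto that normal, and the function names are hypothetical.

```python
import numpy as np

def backproject(depth, K):
    """Lift every pixel to a camera-space 3D point: P = depth * K^-1 [u, v, 1]^T."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # (h, w, 3)
    rays = pix @ np.linalg.inv(K).T
    return rays * depth[..., None]

def local_plane(depth, K):
    """Per-pixel normal from the 4-neighborhood and plane-to-camera distance (interior pixels)."""
    P = backproject(depth, K)
    dx = P[1:-1, 2:] - P[1:-1, :-2]       # right neighbor minus left neighbor
    dy = P[2:, 1:-1] - P[:-2, 1:-1]       # bottom neighbor minus top neighbor
    n = np.cross(dx, dy)
    n /= np.linalg.norm(n, axis=-1, keepdims=True)
    dist = np.abs(np.sum(n * P[1:-1, 1:-1], axis=-1))
    return n, dist
```

For a fronto-parallel plane at depth 2, the sketch returns the normal (0, 0, 1) and a plane distance of 2 at every interior pixel.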
3.2.2. 3D Gaussian Flattening
Accurate geometric reconstruction and high-quality rendering require Gaussian primitives to closely approximate the true surface geometry of the target scene. Leafy vegetables are characterized by multi-leaf structures with approximately planar surfaces, making faithful surface representation particularly important for reconstruction accuracy. Inspired by prior work such as 2DGS [
22] and PGSR [
23], we observe that representing surfaces using 3D Gaussian ellipsoids often leads to geometric ambiguity and blurred surface reconstructions that deviate from true geometry. In contrast, planar Gaussian primitives provide a better approximation of local planar structures and enable direct rendering of depth and surface normals. Therefore, we flatten 3D Gaussians into 2D planar Gaussians to more accurately represent the geometric surfaces of leafy vegetables. In the geometric reconstruction process, each Gaussian ellipsoid is flattened into a plane so that it more closely aligns with the surface of real leaf-like objects, thereby reducing depth and normal blurring.
In 3DGS, the covariance matrix is defined as $\Sigma_i = R_i S_i S_i^{\top} R_i^{\top}$, which represents the shape of the Gaussian ellipsoid. Here, $i$ denotes the $i$-th Gaussian primitive, $R_i$ denotes the orientation of the ellipsoid’s principal axes, and $S_i$ contains the scaling factors along each axis. By compressing the scaling factor along a specific axis, the Gaussian ellipsoid can be flattened into a planar structure. Specifically, we identify the minimum scaling factor
and compress the Gaussian ellipsoid along the corresponding axis direction [
34]. We adopt an adaptive compression strategy as Equation (
6). The scaling result is updated as
:
where
is defined as the adaptive compression coefficient. In our implementation, we set
and
. The clip function ensures that
remains within the range
. This design ensures that the Gaussian is flattened into a near-planar structure while preserving a small but non-zero thickness.
This operation effectively flattens the ellipsoid into a planar Gaussian that best approximates the local leaf surface geometry. The shortest axis direction is then defined as the normal $n$ of the planar Gaussian. The orientation of $n$ is determined according to the camera viewing direction, and the angle between the viewing direction and the normal is constrained to be greater than $90^\circ$.
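A minimal sketch of the flattening step is given below. The compression rule here (scaling the shortest axis by a fixed `ratio` and clipping it to stay above `s_floor`) is an assumption made for illustration and is not the adaptive coefficient of Equation (6); `flatten_gaussian`, `ratio`, and `s_floor` are hypothetical names. The view direction is assumed to point from the camera toward the Gaussian.

```python
import numpy as np

def flatten_gaussian(R, scales, view_dir, ratio=0.1, s_floor=1e-4):
    """Compress the shortest axis; return (covariance, unit normal facing the camera)."""
    R = np.asarray(R, dtype=np.float64)
    s = np.asarray(scales, dtype=np.float64).copy()
    k = int(np.argmin(s))                          # axis with the minimum scaling factor
    s[k] = np.clip(ratio * s[k], s_floor, s[k])    # assumed rule; keeps a non-zero thickness
    S = np.diag(s)
    cov = R @ S @ S.T @ R.T                        # Sigma = R S S^T R^T
    n = R[:, k].copy()                             # shortest axis is the plane normal
    if np.dot(n, view_dir) > 0:                    # flip so the normal faces the camera
        n = -n
    return cov, n
```

With an identity rotation and scales (1.0, 0.5, 0.2), the third axis is compressed to 0.02 and the normal is flipped to (0, 0, -1) for a camera looking along +z.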
3.2.3. Normal-Constrained (NC) Planar Gaussian Rendering
Unlike prior surface reconstruction methods [
21,
22,
23,
24], which focus primarily on appearance modeling, we propose a normal-constrained planar Gaussian rendering module driven by geometry for accurate surface reconstruction in leafy scenes. Given planar Gaussian primitives, we first render a surface normal map $N$ from the current viewpoint via $\alpha$-blending and the rotation matrix $R$ from the camera coordinate system to the global coordinate system. The normal map serves as a geometric descriptor of local surface orientation rather than a purely rendering attribute, and is defined as:
$$N = \sum_{i=1}^{G} R^{\top} n_i\, \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)$$
where $\alpha_i$ is the opacity value, and $G$ is the number of Gaussians that the ray passes through.
Beyond surface normals, accurate depth recovery is crucial for enforcing geometric consistency in thin, heavily occluded leafy structures. Unlike the original 3DGS [
11], which uses the distance to the Gaussian center as depth, we explicitly distinguish planar distance from true ray depth. As illustrated in
Figure 4, it is important to note that the camera viewing direction
v is not necessarily aligned with the normals
n of all planar Gaussians. Therefore, the planar distance
is not equivalent to depth
, and a geometric angle exists between them. For each planar Gaussian, the distance
from the camera center
u to the plane is computed as the projection of the Gaussian center
onto the normal direction
, and is defined as:
The rendered planar distance map $\mathcal{D}$ is obtained via $\alpha$-blending of the per-Gaussian planar distances $d_i$ over the $G$ Gaussians along the ray and is defined as:
$$\mathcal{D} = \sum_{i=1}^{G} d_i\, \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)$$
Inspired by PGSR [
23], we finally render a depth map for geometric optimization in leafy vegetable reconstruction. The rendered distance and normals not only enable precise geometric depth computation but also provide supervisory signals for subsequent local and global consistency optimization, which is critical for handling severe self-occlusions and thin-layered leaves. The final rendered depth map $D$ is derived from the planar distance map $\mathcal{D}$ and the normal map $N$ as:
$$D(p) = \frac{\mathcal{D}(p)}{N(p)^{\top} K^{-1} \tilde{p}}$$
where $p$ denotes a 2D pixel location on the image plane, $\tilde{p}$ is its homogeneous coordinate, $K$ is the camera intrinsic matrix, and $K^{-1}\tilde{p}$ represents the direction of the ray that passes through the camera’s optical center and the pixel on the camera’s imaging plane, which can be regarded as the camera viewing direction $v$.
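The conversion from planar distance and normal to ray depth can be illustrated with a short sketch. It assumes the PGSR-style relation in which the depth equals the planar distance divided by the dot product of the normal with the ray direction $K^{-1}\tilde{p}$, and it assumes that the distance is signed consistently with the normal; `depth_from_plane` is a hypothetical name.

```python
import numpy as np

def depth_from_plane(dist_map, normal_map, K):
    """Convert a planar distance map and a normal map into a per-pixel depth map."""
    h, w = dist_map.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T                 # K^-1 p~ per pixel, z-component is 1
    cos_term = np.sum(normal_map * rays, axis=-1)   # normal projected on the ray direction
    return dist_map / cos_term
```

For a tilted plane with normal (0.6, 0, 0.8) at distance 4 from the camera, the depth along the optical axis is 4 / 0.8 = 5, and pixels on the side toward which the plane tilts are closer.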
3.2.4. Median Depth Rendering (MDR)
Standard 3DGS employs mean depth rendering. Specifically, it collects all the Gaussians that the ray passes through, sorts them by depth from near to far, and computes the weighted mean of their depths:
where
and
denote the depth from the camera plane to the
i-th Gaussian and the contribution weight of the
i-th Gaussian, respectively. Although effective in general scenarios, this strategy produces unstable depth estimates in discontinuous regions (e.g., overlapping leaves and cavities), which are common in densely planted leafy vegetable scenes, leading to sharp depth variations over short spatial distances. To enhance robustness, inspired by 2DGS [
22], we adopt median depth rendering [
1]. When the accumulated alpha
along a ray does not reach 0.5, 2DGS uses the depth of the last Gaussian. However, in dense leafy vegetable scenes, such cases are widespread and typically correspond to occluded regions. Therefore, we instead set the depth to half of the default maximum depth used in 3DGS, which better reflects the invisibility of occluded points. Specifically, along each pixel ray, we accumulate the blending weights of the depth-sorted Gaussians and select the depth at which the cumulative weight is closest to 0.5 as the pixel depth estimate:
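A per-ray sketch of this rule is shown below. As a simplification, it returns the depth of the first Gaussian at which the cumulative blending weight reaches 0.5 rather than searching for the closest value, and it falls back to half of a user-supplied `max_depth` when the threshold is never reached; `median_depth` and `max_depth` are hypothetical names.

```python
import numpy as np

def median_depth(depths, alphas, max_depth):
    """Depth at which the accumulated blending weight first reaches 0.5 along one ray."""
    order = np.argsort(depths)                                # front to back
    d = np.asarray(depths, dtype=np.float64)[order]
    a = np.asarray(alphas, dtype=np.float64)[order]
    T = np.concatenate([[1.0], np.cumprod(1.0 - a)[:-1]])     # transmittance before each Gaussian
    w = a * T                                                 # blending weights
    hit = np.nonzero(np.cumsum(w) >= 0.5)[0]
    if hit.size == 0:
        return 0.5 * max_depth                                # ray never becomes half opaque
    return float(d[hit[0]])
```

For depths (3, 1, 2) with opacities (0.9, 0.3, 0.4), the sorted blending weights are 0.3, 0.28, and 0.378, so the cumulative weight crosses 0.5 at the second Gaussian and the median depth is 2.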
3.3. The Gaussian Pruning (GP) Optimization Strategy
Pruning is a crucial technique in 3DGS. Although 3DGS achieves significantly faster reconstruction than NeRF-based methods, it suffers from high memory consumption and a large number of redundant Gaussian primitives. Training all Gaussians indiscriminately may cause the model to overlook fine-grained scene geometry, leading to degraded geometric accuracy. In real leafy vegetable growth scenarios, spatial structures are highly interwoven and folded. An appropriate pruning strategy can selectively remove inaccurate or redundant Gaussians while preserving essential geometric structures, thereby improving reconstruction accuracy while reducing memory usage and training time.
The core of Gaussian pruning lies in accurately evaluating the contribution of each Gaussian primitive. In the original 3DGS framework, a Gaussian’s contribution is implicitly measured by its opacity. This strategy tends to preserve Gaussians with high opacity while discarding those with low opacity. However, previous work [
35] has shown that although high-opacity Gaussians often contribute significantly to image rendering, they have limited capacity to represent complex geometric structures. This limitation can result in blurred artifacts in high-frequency regions, substantially degrading perceptual quality and geometric fidelity. Consequently, opacity-based pruning may mistakenly remove geometrically important Gaussians while retaining floating artifacts with high opacity. To address this issue, we propose a more precise contribution metric and adopt a progressive pruning strategy. In the original 3DGS formulation, during
$\alpha$-blending, the blending weight $w_i$ represents the contribution of a Gaussian to a pixel and is defined as:
$$w_i = \alpha_i T_i, \qquad T_i = \prod_{j=1}^{i-1} (1 - \alpha_j)$$
where $T_i$ denotes the transmittance. The overall contribution $\hat{C}_k$ of a Gaussian to the $k$-th rendered image can be computed as the sum of blending weights over all pixels $p$, and is defined as:
$$\hat{C}_k = \sum_{p} \alpha_{i(p)}\, T_{i(p)}$$
where $i(p)$ denotes the index of the Gaussian sorted by depth along the ray corresponding to pixel $p$. This formulation inherently favors large Gaussians that contribute to many pixels, while assigning very low contribution scores to small Gaussians. However, large Gaussians have limited ability to represent fine geometric details and are often difficult to optimize. To enhance the model’s sensitivity to geometric structures, we normalize the contribution by the number of projected pixels $P_k$ and introduce a hyperparameter $\gamma$ to balance opacity and transmittance. The final contribution metric is defined as:
$$C_k = \frac{1}{P_k} \sum_{p} \alpha_{i(p)}^{\gamma}\, T_{i(p)}^{1-\gamma}$$
As $\gamma$ increases, the influence of transmittance diminishes. In particular, when $\gamma = 1$, the formulation degenerates to the original opacity-based pruning strategy used in 3DGS. Moreover, $\gamma$ controls contribution bias: Gaussians deviating from the true surface often exhibit higher transmittance near the outer surface and lower transmittance internally. By adjusting $\gamma$, the model achieves bidirectional bias control, dynamically balancing internal and external Gaussian distributions. This mechanism enables adaptive pruning tailored to specific geometric characteristics of the scene, ultimately leading to more accurate geometric reconstruction.
In multi-view reconstruction, the contribution of a Gaussian across views typically follows a long-tail distribution. A Gaussian usually contributes significantly to only a limited number of views—approximately 15–30% [
35]—where strong geometric cues such as sharp edges and clear occlusion relationships are present. Contributions from other views are often diminished due to motion blur, viewpoint redundancy, or sensor noise. Therefore, we compute the overall contribution
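$C$ as the average contribution over a small set of high-contribution views:
$$C = \frac{1}{|V|} \sum_{k \in V} C_k$$
where $C_k$ is the contribution of the Gaussian to the $k$-th view and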
V denotes the set of views with the highest contributions, and in practice, we select the top five views. This choice provides a robust trade-off between accuracy and computational efficiency: using fewer views leads to unstable estimation due to noise, while including more views introduces low-contribution observations that dilute the geometric signal. During training, pruning is performed progressively at predefined iteration intervals. At each pruning step, we evaluate the overall contribution of each Gaussian across the training set and remove a fixed proportion of Gaussians with the lowest contribution scores.
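The view selection and pruning step can be sketched as follows, assuming the per-view contribution scores have already been computed and stored as a views-by-Gaussians array. The function name `prune_mask` and its defaults (top five views, a pruning ratio of 0.1) are illustrative choices, not the exact training code.

```python
import numpy as np

def prune_mask(per_view_scores, top_k=5, prune_ratio=0.1):
    """Return (keep-mask, overall score) from per-view contributions of shape (views, Gaussians)."""
    scores = np.asarray(per_view_scores, dtype=np.float64)
    k = min(top_k, scores.shape[0])
    top = np.sort(scores, axis=0)[-k:]             # k highest-contribution views per Gaussian
    overall = top.mean(axis=0)                     # average over the selected views
    n_prune = int(prune_ratio * overall.size)
    keep = np.ones(overall.size, dtype=bool)
    keep[np.argsort(overall)[:n_prune]] = False    # drop the lowest-scoring fraction
    return keep, overall
```

Averaging only over the strongest views means that a Gaussian seen clearly in a handful of images is not penalized for being invisible or blurred in the rest.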
3.4. Regularization Functions for Model Training
3.4.1. Image Reconstruction Loss
Following the original 3DGS framework, we compute the image reconstruction loss
as the Manhattan distance between the rendered RGB image and the corresponding ground truth image:
Optimizing 3DGS solely based on image reconstruction loss can easily lead to geometric ambiguities and local geometric overfitting. To mitigate this issue, we introduce both local and global geometric consistency regularization terms, encouraging Gaussians to better conform to the true scene geometry.
3.4.2. Local Geometric Consistency Loss
Under the local planar assumption, a pixel and its neighboring pixels (within the radius of one pixel) can be approximated as lying on a local plane. During training, the model renders both a depth map
and a normal map
. For each pixel, four neighboring pixels are sampled, and a local planar normal is estimated based on the rendered depth values. Repeating this process over the entire image yields a locally estimated normal map
derived from the depth map. We then minimize the difference between the rendered normal map and the locally estimated normal map to enforce consistency between depth and normal geometry:
3.4.3. Global Geometric Consistency Loss
While local geometric regularization enforces consistency between depth and normals within a single view, the irregular and discrete nature of Gaussian optimization may still lead to inconsistencies across multiple views. Therefore, we further introduce a global geometric consistency constraint to enforce cross-view geometric alignment. Inspired by stereo matching and optical flow [
1], depth values rendered from different views should correspond to the same 3D spatial locations. As illustrated in
Figure 5, for a pixel
in view
with depth
, its corresponding world coordinate
is computed as:
where
K is the camera intrinsic matrix and
denotes the camera pose of view
. Projecting
into view
yields the corresponding pixel
:
where
is the camera pose of view
, and
denotes the depth of
in view
. Mapping
back to world coordinates using the corresponding depth
yields
:
If the depth estimates are geometrically consistent,
should coincide with
. Otherwise, we minimize the discrepancy between their depth values to enforce global geometric consistency:
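The forward and backward projection can be sketched for a single pixel as follows. This is an illustrative reading of the constraint: poses are assumed to be world-to-camera pairs (R, t), the depth of the neighboring view is supplied through a hypothetical lookup function, and the residual is measured as the world-space distance between the original and re-lifted points.

```python
import numpy as np

def pixel_to_world(p, depth, K, R, t):
    """Back-project pixel p = (u, v) with the given depth; (R, t) maps world to camera."""
    x_cam = depth * (np.linalg.inv(K) @ np.array([p[0], p[1], 1.0]))
    return R.T @ (x_cam - t)

def world_to_pixel(X, K, R, t):
    """Project a world point; returns the pixel (u, v) and its depth in that camera."""
    x_cam = R @ X + t
    uvw = K @ x_cam
    return uvw[:2] / uvw[2], x_cam[2]

def consistency_error(p_r, depth_r, depth_n_lookup, K, pose_r, pose_n):
    """Distance between the point lifted from view r and its re-lifted match in view n."""
    X = pixel_to_world(p_r, depth_r, K, *pose_r)
    p_n, _ = world_to_pixel(X, K, *pose_n)
    X_back = pixel_to_world(p_n, depth_n_lookup(p_n), K, *pose_n)
    return float(np.linalg.norm(X_back - X))
```

When both views render the same depth for the shared surface point, the error vanishes; an overestimated depth in the neighboring view pushes the re-lifted point away from the original one.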
3.4.4. Blur Reconstruction Loss
When the optional blurred reconstruction module is enabled, we additionally introduce a blur reconstruction loss. After convergence, the reconstructed blurred observation
B is compared with the input image
I:
The final total loss function of the proposed LV-3DGS framework is defined as a weighted combination of all loss terms:
3.5. Leafy Vegetable Phenotyping
Due to the low-density noise around leaf edges and the high-density point cloud of the leafy vegetable, a statistical outlier removal (SOR) filter is applied to eliminate outliers with significant density differences. Following our previous work [
1], the cleaned point cloud is then used for phenotypic measurements.
(1) Height: Under the natural growth conditions of leafy vegetables, the optimal plane is fitted based on the normal vector of the root to serve as the XY plane of the Cartesian coordinate system. The direction perpendicular to the XY plane and pointing upwards from the root is taken as the Z-axis for the coordinate correction of the leafy vegetables. The height H of the leafy vegetable is then the difference between the highest and lowest Z-coordinates of its points.
(2) Number of Leaves: Unlike other plants, the internal structure of leafy vegetables is very complex. The stems and leaves overlap with each other, making it difficult to calculate the number of leaves by extracting the skeletal structure. However, 3D point cloud models can provide comprehensive spatial structure information. Based on the positional relationship and density difference between the point clouds, different leaves can be clustered, as shown in
Figure 6a. We perform conditional Euclidean clustering on the complete point cloud of the leafy plant, using a KD-Tree as the neighbor search structure. Starting from a seed point, all points within a threshold distance are assigned to the same cluster, and the cluster grows until no further points fall within this distance; the remaining points seed new clusters. Each independent cluster is identified as a leaf, so the number of clusters gives the number of leaves.
(3) Surface Area: The Delaunay triangulation is used to reconstruct the three-dimensional mesh of the leaf and stem, as shown in
Figure 6b. The triangles need to meet two conditions. First, no other points lie within the circumscribed sphere of each triangle. Second, the edges of the triangle are shorter than a given threshold, to avoid connecting discontinuous surfaces. For each triangle obtained from the triangulation, the area is calculated using Heron’s formula. The areas of all triangles within the outermost convex hull are then summed to obtain the surface area of each leaf and stem. The area computation is shown in Equations (25) and (26):
$$A_i = \sqrt{p_i (p_i - a_i)(p_i - b_i)(p_i - c_i)}, \qquad p_i = \frac{a_i + b_i + c_i}{2} \tag{25}$$
$$S = \sum_{i=1}^{n} A_i \tag{26}$$
where $p_i$ is half of the perimeter of the $i$-th triangle, $a_i$, $b_i$, and $c_i$ are the side lengths of the $i$-th triangle, and $n$ is the total number of triangles. The total surface area $S$ is obtained by summing the areas of all triangles within the convex hull.
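The height and surface area measurements can be sketched as follows, assuming the point cloud has already been aligned so that the Z-axis points upward and the triangulation is given as an index array. The function names are hypothetical, and the edge-length filtering and convex hull restriction described above are omitted.

```python
import numpy as np

def plant_height(points):
    """Height as the extent of the aligned point cloud along the Z-axis."""
    z = np.asarray(points, dtype=np.float64)[:, 2]
    return float(z.max() - z.min())

def mesh_area_heron(vertices, triangles):
    """Total mesh area as the sum of per-triangle areas from Heron's formula."""
    V = np.asarray(vertices, dtype=np.float64)
    F = np.asarray(triangles, dtype=np.int64)
    a = np.linalg.norm(V[F[:, 0]] - V[F[:, 1]], axis=1)
    b = np.linalg.norm(V[F[:, 1]] - V[F[:, 2]], axis=1)
    c = np.linalg.norm(V[F[:, 2]] - V[F[:, 0]], axis=1)
    p = 0.5 * (a + b + c)                                  # half perimeter per triangle
    # Clamp tiny negative values caused by rounding on degenerate triangles.
    areas = np.sqrt(np.maximum(p * (p - a) * (p - b) * (p - c), 0.0))
    return float(areas.sum())
```

As a sanity check, a unit square split into two triangles yields an area of 1, and a 3-4-5 right triangle yields 6.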
4. Results and Discussion
This section evaluates the performance of the proposed methodology. First, the implementation details of the experiments, including the model evaluation metrics, are introduced. Second, the proposed blurred reconstruction module is evaluated. Third, the optimized LV-3DGS model is compared with other mainstream models, and its accuracy and performance are analyzed. We also conduct ablation experiments covering both hyperparameters and modules. Finally, we evaluate and compare the performance of our method in phenotyping. Both quantitative metrics and qualitative assessments are provided.
4.1. Experimental Environment and Evaluation Indicators
All experiments were implemented on Ubuntu 20.04 with PyTorch 1.12.1 and CUDA 11.8. We extend the differentiable Gaussian splatting rasterizer to support depth, pose, and cumulative opacity in both forward and backward propagation. The model was trained on an NVIDIA RTX 4090 GPU (24 GB) for 30,000 iterations using the Adam optimizer, a stochastic gradient-based method, with a learning rate of 0.01. Pruning is performed every 1000 iterations, and at each pruning step the 10% of Gaussians with the lowest contribution scores are removed. The optimal hyperparameters are selected through the hyperparameter experiments in Section 4.4.1. To assess the reconstruction quality of the proposed LV-3DGS model, we employ the following evaluation metrics:
- (1)
Image fidelity metrics:
Peak Signal-to-Noise Ratio (PSNR) measures the degree of distortion of the rendered image; the larger the value, the better the rendering quality. It is calculated as shown in Equations (27) and (28):

$$\mathrm{MSE} = \frac{1}{hw} \sum_{i=1}^{h} \sum_{j=1}^{w} \left[ I(i,j) - K(i,j) \right]^2 \tag{27}$$

$$\mathrm{PSNR} = 10 \log_{10} \left( \frac{\mathrm{MAX}^2}{\mathrm{MSE}} \right) \tag{28}$$

where h and w are the height and width of the image, I and K are the ground truth image and the rendered image, respectively, MAX is the maximum possible pixel value of the image, and MSE is the mean square error.
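Structural Similarity Index Measure (SSIM) measures the similarity of edges and textures and is defined in Equation (29):

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \tag{29}$$

where $\mu_x$ and $\mu_y$ are the local window means of x and y, respectively, $\sigma_x^2$ and $\sigma_y^2$ are the variances, $\sigma_{xy}$ is the covariance, $C_1$ and $C_2$ are constants used to maintain stability, and L represents the dynamic range of pixel values, with $C_1 = (k_1 L)^2$ and $C_2 = (k_2 L)^2$ for small positive constants $k_1$ and $k_2$.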
Learned Perceptual Image Patch Similarity (LPIPS) captures more complex image features and perceptual differences than PSNR and SSIM. LPIPS compares the rendered image with the ground truth image using features from a deep network (with VGG as the backbone) and yields a score ranging from 0 to 1. The score is negatively correlated with rendering quality: lower LPIPS values indicate that the two images are more similar.
- (2)
Geometric accuracy:
Geometric Consistency (GC) is the global geometric consistency metric introduced in
Section 3.4. It serves as a quantitative indicator of the geometric accuracy of the reconstructed model and of the multi-view consistency of the depth values rendered by each method. Its unit is centimeters.
- (3)
Computational efficiency metrics:
Training time. We calculated the average training time for all scenarios across 30,000 iterations.
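The PSNR and SSIM definitions in item (1) can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: images are grayscale nested lists, `ssim_window` evaluates the SSIM expression on a single window rather than averaging over sliding local windows, and the defaults `k1=0.01`, `k2=0.03`, and `L=255` are the values commonly used in the literature, not values confirmed by this paper. The function names are hypothetical.

```python
import math


def mse(I, K):
    """Mean squared error between two equally sized grayscale images."""
    h, w = len(I), len(I[0])
    return sum((I[i][j] - K[i][j]) ** 2
               for i in range(h) for j in range(w)) / (h * w)


def psnr(I, K, max_val=255.0):
    """PSNR in dB; infinite when the two images are identical."""
    err = mse(I, K)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / err)


def ssim_window(x, y, L=255.0, k1=0.01, k2=0.03):
    """SSIM for one window, given as two flat sequences of pixel values."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((v - mu_x) ** 2 for v in x) / n
    var_y = sum((v - mu_y) ** 2 for v in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2   # stabilizing constants
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))


# Toy inputs (illustrative values only, not data from the paper).
gt = [[0, 0], [0, 0]]
render = [[10, 10], [10, 10]]
toy_psnr = psnr(gt, render)
toy_ssim = ssim_window([10, 20, 30, 40], [40, 30, 20, 10])
```

LPIPS is omitted from the sketch because it depends on a pretrained network rather than a closed-form expression.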
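The performance of the phenotyping measurements is evaluated using the coefficient of determination ($R^2$) and the Root Mean Square Error (RMSE). $R^2$ quantifies the strength of the linear relationship between the computed and ground truth phenotypic values and is calculated as in Equation (30). The RMSE assesses the consistency and overall accuracy of the phenotyping results and is calculated as in Equation (31):

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{30}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2} \tag{31}$$

where $\hat{y}_i$ and $y_i$ represent the predicted and ground truth phenotype values, respectively, $\bar{y}$ represents the average of $y_i$, and n is the number of samples.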
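The two phenotyping metrics can be sketched in a few lines of Python. The sketch assumes the common definition of the coefficient of determination as one minus the ratio of the residual sum of squares to the total sum of squares; the function names and the toy values are illustrative and are not taken from the paper.

```python
import math


def r_squared(pred, truth):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(truth) / len(truth)
    ss_res = sum((t - p) ** 2 for p, t in zip(pred, truth))
    ss_tot = sum((t - mean_t) ** 2 for t in truth)
    return 1.0 - ss_res / ss_tot


def rmse(pred, truth):
    """Root mean square error between predictions and ground truth."""
    n = len(truth)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / n)


# Toy values: every prediction is off by exactly one unit.
truth = [1.0, 2.0, 3.0, 4.0]
pred = [2.0, 3.0, 4.0, 5.0]
toy_r2 = r_squared(pred, truth)
toy_rmse = rmse(pred, truth)
```

A constant offset like the one in the toy data leaves the RMSE equal to the offset while lowering $R^2$ sharply, which is why the two metrics are reported together.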
4.2. Evaluation of 3D Rendering Performance in Motion-Blurred Scenes
We evaluated the performance of the proposed blurred reconstruction module by comparing it with Deblur-NeRF [
27] and a representative 2D image deblurring approach combined with 3DGS. Deblur-NeRF jointly optimizes neural radiance field reconstruction and pixel-wise blur kernel estimation. For the 2D deblurring baseline, we employed Restormer [
36] to independently deblur the input images before feeding them into the standard 3DGS pipeline for scene reconstruction. To ensure a fair comparison with the ground truth images, the Gaussian scene parameters were fixed during evaluation, and only a global transformation was optimized to estimate the appropriate camera pose alignment.
Table 1 presents the quantitative evaluation results of blur reconstruction and scene rendering on the self-collected leafy vegetable dataset under different levels of motion blur. The results demonstrate that the proposed blur reconstruction module integrated into the 3DGS framework consistently outperformed the comparison methods. Preprocessing with Restormer yielded limited performance gains, likely because it operates in isolation from the 3D reconstruction process: without scene geometry to guide deblurring, the 2D-only approach may introduce cross-view inconsistencies that degrade the quality of the reconstructed Gaussians. Deblur-NeRF was able to reconstruct 3D scenes with reasonable consistency by jointly estimating spatially varying blur kernels during training. However, its modeling of motion blur relies on image-space convolution and MLP-based point spread function estimation, without explicitly incorporating camera motion or scene occlusion information. Consequently, Deblur-NeRF required longer training time and larger blur kernels to handle severe motion blur, and the reconstructed scenes occasionally exhibited residual blur or discontinuities across views. In contrast, the proposed method explicitly models camera motion trajectories and samples sharp sub-frames during training. This strategy avoids generating inaccurate Gaussians at incorrect spatial locations and enables faster convergence while producing sharper rendering results. Qualitative comparisons in
Figure 7 further confirm that our method yields clearer textures and more consistent geometry than the competing approaches under motion-blurred conditions.
4.3. Comparison of Training and Rendering Efficiency Across Different Models
To ensure a fair comparison, all baseline and comparison methods were trained under identical experimental settings, except for the specific architectural or algorithmic modifications introduced by each method. We compared the proposed LV-3DGS with several advanced novel view synthesis and surface reconstruction approaches, including NeRF [
10], Neuralangelo [
37], the baseline 3DGS [
11], and recent Gaussian-based surface reconstruction methods such as SuGaR [
21], GOF [
24], 2DGS [
22], and PGSR [
23]. The evaluation focused on reconstruction quality, geometric accuracy, and computational efficiency in real crop production scenarios.
Neuralangelo extends traditional NeRF by combining multi-resolution 3D hash grid representations with neural surface rendering. SuGaR, GOF, 2DGS, and PGSR represent scenes using planar Gaussian primitives, with differences in geometric constraints and optimization strategies. GOF constructs a Gaussian opacity field and extracts geometry via level-set estimation, while 2DGS and PGSR introduce depth consistency constraints to improve surface reconstruction quality.
As shown in
Table 2, the proposed LV-3DGS achieved superior performance across various leafy vegetable scenes. In terms of training efficiency, LV-3DGS achieved the shortest average training time, improving efficiency by 11.67% compared with the baseline 3DGS, which can be attributed to the proposed contribution-based Gaussian pruning strategy. All other comparison methods required longer training times than 3DGS. In terms of image quality, LV-3DGS achieves the highest reconstruction quality in leafy vegetable scenes among the compared models, owing to its optimization of surface reconstruction. Specifically, compared with the baseline 3DGS model, the PSNR and SSIM values of the LV-3DGS model increased by 2.70% and 3.23%, respectively, while the LPIPS value decreased by 6.50%. A paired
t-test confirmed that these improvements over 3DGS are statistically significant (PSNR:
, SSIM:
, LPIPS:
). Compared with the current state-of-the-art PGSR method in surface reconstruction, the PSNR and SSIM values of the LV-3DGS model have increased by 0.92% and 1.82%, respectively, and the LPIPS value has decreased by 5.57%. These differences also reached statistical significance (PSNR:
, SSIM:
, LPIPS:
). In the geometric accuracy evaluation indicators, the LV-3DGS model achieves the smallest geometric error, with GC reduced by 1.566 cm compared with the baseline 3DGS model. The reduction in geometric error was statistically significant (
) compared with 3DGS. These results demonstrate that LV-3DGS achieves competitive rendering quality and geometric accuracy while also improving training efficiency.
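The paired t-test used above compares two methods on the same scenes by testing whether the mean per-scene difference is zero. A minimal sketch of the test statistic follows; the function name and the toy numbers are hypothetical placeholders, not results from this study. Converting the statistic to a p-value requires the Student t distribution with n - 1 degrees of freedom, which a statistics library such as SciPy (`scipy.stats.ttest_rel`) provides.

```python
import math


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    d = [x - y for x, y in zip(a, b)]       # per-scene differences
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((v - mean_d) ** 2 for v in d) / (n - 1)   # sample variance
    t = mean_d / math.sqrt(var_d / n)
    return t, n - 1


# Placeholder per-scene scores for two methods (made-up numbers).
method_a = [2.0, 4.0, 7.0]
method_b = [1.0, 2.0, 4.0]
t_stat, dof = paired_t_statistic(method_a, method_b)
```

Pairing by scene removes between-scene variation from the comparison, so a small but consistent gain can still reach significance.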
Furthermore,
Figure 8 compares the rendering results of the Neuralangelo, 3DGS, PGSR, and LV-3DGS models in different leafy vegetable scenes. LV-3DGS reconstructs fine details with higher quality: leaf veins are sharply reconstructed, stem textures are clearly stratified, and boundaries between vegetation and background remain well-defined. Neuralangelo captures the overall structure but produces blurred edges and background artifacts. PGSR, which optimizes planar Gaussians, renders smoother and more detailed leaf surfaces, but it performs poorly on geometric structure, and some veins still have unclear outlines; these problems weaken the realism of the reconstruction. The improvements of LV-3DGS are attributed to the combined effect of planar optimization, the pruning strategy, and local and global geometric consistency, which together steer the optimization towards more accurate surface geometry.
In addition to the general-purpose methods compared above, several 3DGS-based approaches have been successfully applied in agricultural domains, achieving promising results in large-scale scenarios such as orchards and farmlands [
18,
19]. However, these methods primarily target the reconstruction of large scenes, whereas our target is high-quality reconstruction and phenotypic monitoring of leafy vegetables in controlled-environment plant factories.
Overall, these results demonstrate that LV-3DGS achieves high-quality reconstruction, improved geometric accuracy, and enhanced training efficiency across a wide range of leafy vegetable scenes, highlighting its practical potential for large-scale agricultural 3D reconstruction applications.
4.4. Performance Comparison of Different Network Structures
4.4.1. Hyperparameter Optimization
In this section, we conducted ablation experiments to optimize the hyperparameters associated with the proposed loss function and Gaussian pruning strategy, including the image reconstruction loss weight
, local geometric consistency loss weight
, global geometric consistency loss weight
, and the transmittance exponent
. The blur reconstruction loss weight
was set to zero in this section, as the blurred reconstruction module was evaluated independently in
Section 4.2.
The local geometric regularization term constrains the geometric consistency between local regions of a single view, providing good initial geometric accuracy without relying on multi-view information. The global geometric regularization term constrains the geometric consistency between multiple views, improving the overall reconstruction accuracy. As shown in
Table 3, local and global geometric consistency are both crucial for improving the reconstruction accuracy of the model. Control experiments with either term disabled (
or
) yield increased GC values (0.997 and 1.253, respectively) and degraded rendering metrics, confirming that their combination is essential. Through experiments,
= 1.0:1.0:1.2 was selected as the hyperparameters of the final loss function for the model.
Table 4 reports the impact of different
values in the Gaussian pruning strategy. The results indicate that smaller
values yield better performance, which aligns with our theoretical analysis: a lower
gives more weight to transmittance, enabling the pruning metric to identify and remove low-opacity Gaussians that are not consistently visible along rays. When
, the transmittance term dominates, retaining Gaussians in thin or occluded regions, which improves reconstruction. As the
value increases to 0.75, transmittance is suppressed and the contribution score approaches the original opacity-based pruning criterion, so floating artifacts tend to be preserved. This further validates the effectiveness of the proposed contribution-based pruning strategy compared with the default opacity-based approach.
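The pruning schedule stated in Section 4.1 (every 1000 iterations, remove the 10% of Gaussians with the lowest contribution scores) can be sketched as follows. The contribution score itself is defined in the method section and is not reproduced here, so the sketch takes precomputed scores as input; the function names `should_prune` and `prune_lowest` and the toy scores are hypothetical.

```python
def should_prune(iteration, interval=1000):
    """True on iterations where a pruning step is scheduled."""
    return iteration > 0 and iteration % interval == 0


def prune_lowest(scores, fraction=0.10):
    """Return indices of Gaussians kept after dropping the lowest `fraction`."""
    n_remove = int(len(scores) * fraction)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    removed = set(order[:n_remove])         # lowest-scoring Gaussians
    return [i for i in range(len(scores)) if i not in removed]


# Toy contribution scores for ten Gaussians (illustrative values only).
scores = [0.9, 0.1, 0.5, 0.7, 0.3, 0.8, 0.6, 0.4, 0.2, 1.0]
kept = prune_lowest(scores, fraction=0.10)
```

Because the removal is a fixed fraction rather than a fixed threshold, the number of Gaussians removed per step shrinks as the model becomes more compact.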
4.4.2. Effectiveness of Different Modules
The hyperparameter ablation study above identified the best hyperparameter configuration for the model. In this section, we validate the contribution of each proposed module through module-level ablation experiments. Starting from the baseline 3DGS model, the following variants were independently trained and evaluated: 3DGS+PDGI (introducing Prior Depth-Guided Initialization), 3DGS+Flattening+NC (introducing 3D Gaussian Flattening and Normal Constraint), 3DGS+PDGI+Flattening+NC, 3DGS+MDR (introducing Median Depth Rendering), 3DGS+GP (introducing Gaussian Pruning), and LV-3DGS (all proposed modules). The experimental results are shown in
Table 5. The PDGI module effectively fills in the missing points in the leafy vegetable texture regions of the point cloud, thereby improving the perception of leafy vegetable texture details during model training: the PSNR and SSIM values increase by 1.29% and 1.64%, respectively, the LPIPS value decreases by 1.19%, and the GC value decreases by 0.309 cm. The Flattening and NC modules fit the leafy vegetable surface with planar Gaussians and constrain the depth rendering through surface normals, reducing geometric errors: the PSNR and SSIM values increase by 2.05% and 1.72%, respectively, the LPIPS value decreases by 1.14%, and the GC value decreases by 1.284 cm. Moreover, combining PDGI with the Flattening and NC modules yields a further improvement. Unlike the mean depth estimation in the baseline 3DGS, MDR uses the median of depth contributions to robustly handle occlusions and surface discontinuities, alleviating the erroneous depth estimates that occur in areas with surface discontinuities or incomplete reconstruction (for example, overlapping leaves). The GP module achieves a precise geometric representation by deleting redundant Gaussians according to the contribution-based pruning strategy; the GC value decreases by 1.548 cm, with a significant improvement in training efficiency. In summary, the 3D Gaussian model with the proposed modules shows better reconstruction quality in leafy vegetable scenes, reduces the perceptual difference between the reconstructed image and the ground truth image, and reduces geometric errors.
4.5. The Results of Leafy Vegetable Phenotypic Calculation and Regression
To validate the effectiveness of the proposed method, phenotypic traits including plant height, leaf number, and leaf surface area were estimated for all reconstructed leafy vegetable scenes using the phenotypic measurement pipeline described in
Section 3.5.
Figure 9 illustrates the comparison between the phenotypic values estimated from the reconstructed 3D models and the corresponding manual measurements. Meanwhile, the result of the paired
t-test (
p > 0.05) indicates that the differences between the three phenotypic measurement results and the true values are not statistically significant. In addition, we compared the phenotypic estimation performance of the proposed method with results reported in related studies, as summarized in
Table 6. Specifically, the coefficient of determination ($R^2$) for plant height estimation reached 0.9959 with a root mean square error (RMSE) of 0.33 cm. For leaf number estimation, the $R^2$ value was 0.9651 with an RMSE of 0.85. The estimation of leaf surface area achieved an $R^2$ of 0.9895 and an RMSE of 14.78 cm². These results demonstrate a strong linear correlation between the estimated phenotypic traits and manual measurements, indicating that the proposed LV-3DGS framework provides reliable phenotypic estimation performance. Although some reported methods achieved slightly higher accuracy in specific scenarios, for example multi-view stereo approaches applied to corn stems [38] and lettuce [39], such methods typically rely on extensive point cloud post-processing and manual intervention. In contrast, the proposed approach achieves competitive accuracy while maintaining a higher level of automation and computational efficiency. Furthermore, compared with previous binocular vision methods [
1], our multi-view approach provides richer scene information. These results suggest that high-quality 3D reconstruction based on LV-3DGS can serve as a robust and efficient foundation for crop phenotypic measurement.
5. Conclusions
This study proposed the LV-3DGS framework to address the limitations of conventional 3DGS methods in reconstructing leafy vegetable scenes characterized by low texture, uniform color distribution, and complex surface geometry in controlled agricultural environments. By integrating planar Gaussian surface modeling, contribution-aware Gaussian pruning, and local and global geometric consistency regularization, the proposed method significantly improves both reconstruction fidelity and geometric accuracy across diverse leafy vegetable scenarios in real plant factory settings. Unlike previous works that focused on planar Gaussian representation or blur correction, LV-3DGS is designed specifically for high-quality reconstruction and phenotypic measurement of leafy vegetables. It explicitly models camera motion during multi-view acquisition to handle motion blur; optimizes the Gaussian structure representation based on the spatial structural characteristics of leafy vegetables; and establishes a pruning strategy that improves geometric accuracy and computational efficiency through the analysis of Gaussian contributions. Experimental results demonstrate that LV-3DGS achieves superior rendering quality and geometric precision compared with NeRF, Neuralangelo, 3DGS, SuGaR, GOF, 2DGS, and PGSR. The proposed framework attains an average SSIM of 0.94, PSNR of 34.53 dB, LPIPS of 0.11, and a geometric consistency error of 0.317 cm, while maintaining high training efficiency with an average training time of approximately 10 min. Furthermore, the proposed motion-blurred reconstruction module effectively mitigates artifacts caused by camera motion during multi-view image acquisition, improving data utilization efficiency and reconstruction robustness. Based on the reconstructed 3D models, phenotypic traits including plant height, leaf number, and leaf surface area were accurately estimated.
The obtained phenotypic measurements achieved $R^2$ values of 0.9959, 0.9651, and 0.9895, with corresponding RMSE values of 0.33 cm, 0.85, and 14.78 cm², respectively. These results confirm that phenotypic extraction based on LV-3DGS enables accurate and efficient computation of key plant traits, providing a practical solution for precision agriculture and high-throughput crop phenotyping. It is important to note that these metrics were obtained in a controlled indoor vertical farm with static artificial lighting, uniform background, and limited occlusions.
Despite its promising performance, this study still has several limitations. The current validation remains confined to a single operational domain: indoor vertical farming with fixed artificial lighting and static backgrounds. Open-field scenarios, which involve larger spatial scales, increased environmental complexity, and more diverse crop architectures, have not yet been explored. In such scenarios, reconstruction would face higher computational demands due to the need for more Gaussians to represent expansive scenes, as well as robustness challenges arising from uncontrolled illumination changes and wind-induced plant motion. Extending LV-3DGS to these conditions may require additional components, such as lighting-invariant feature embedding to handle variable lighting and handling of wind-induced non-rigid deformation in the blur reconstruction module. Future work will focus on improving model scalability and computational efficiency to support large-scale agricultural applications. At the model architecture level, several inherent assumptions warrant further discussion. First, the flattening of 3D Gaussians into planar primitives, while effective for broad leaf surfaces, may under-represent regions with high curvature or sharp creases, where the piecewise planar approximation introduces discretization error. Second, the reliance on α-blending for normal and depth rendering can bias geometric estimates toward high-opacity primitives in occluded or semi-transparent regions, potentially causing surface bleeding artifacts. Finally, the PDGI module inherits the scale ambiguity of monocular depth predictors, though we observe that multi-view geometric optimization partially attenuates such errors during training. Future extensions could explore optimizing the Gaussian structure adaptively according to the spatial morphology of the target, unbiased depth estimation strategies, and multi-view deep fusion to address these structural limitations.
Furthermore, extreme leaf occlusion may lead to incomplete geometry where multi-view coverage is insufficient, and specular highlights on the blade surface can sometimes cause surface artifacts. Regarding initialization, we note that COLMAP is used only for initial camera poses; its sparse point cloud may be noisy in low-texture regions, but multi-view optimization and the depth prior effectively mitigate this limitation. In terms of geometric evaluation, the difficulty of acquiring high-fidelity 3D ground truth for delicate leafy vegetables means our validation relies on a custom Geometric Consistency (GC) metric. Importantly, GC reflects multi-view alignment consistency rather than true physical accuracy, and standard benchmarks such as Chamfer Distance or point-to-surface error are currently absent. We therefore explicitly acknowledge the lack of independent geometric validation against an objective external standard as a primary limitation of this study. Additionally, natural illumination variability poses challenges to data acquisition quality. Although this study incorporated monocular depth estimation as a supplementary data source, the reliance on vision-based data remains a limiting factor. Future research will investigate multi-sensor data fusion strategies that integrate complementary information from cameras, LiDAR, GPS, and IMU sensors to further enhance reconstruction accuracy and robustness. Ultimately, future efforts will aim to extend the proposed framework to a broader range of crop species and deploy it in real-world agricultural production systems, enabling automated, large-scale, and high-precision plant phenotypic analysis.