1. Introduction
Traditionally, phenotypic measurement and analysis have been laborious, costly, and time-consuming processes [1]. The use of three-dimensional (3D) imaging technologies has facilitated the study of plants in agricultural research, focusing on the systematic quantification of morphological and structural attributes, such as canopy architecture, leaf area, height, stem diameter, and biomass, among other relevant parameters [1,2,3]. These morphological traits serve as indicators of factors such as stress, yield, growth, and overall plant development [4,5]. Among 3D characterization methods, stereoscopic vision has established itself as an attractive alternative due to its balance between accuracy, low cost, and ease of implementation [6]. This approach captures the spatial and geometric relationships of plant structures, providing complementary information to two-dimensional methods [7,8,9].
Several studies have demonstrated the effectiveness of stereo vision in automated crop characterization [10,11]. Dandrifosse et al. [12] developed a stereo vision system to characterize wheat canopy architecture in the field, evaluating parameters such as height and leaf area. The results showed high accuracy compared to manual measurements, with an RMSE of 0.37 for leaf area and 97.1% agreement in canopy height estimation. According to Kim et al. [3], stereo vision also enables automated estimation of crop height, achieving high correlation with manual measurements (R² between 0.78 and 0.84). The study by Sampaio et al. [13] proposed a system based on RGB-D images that combines color (RGB) and depth (D), integrating segmentation and volumetric fusion to obtain accurate three-dimensional reconstructions of corn plants in dynamic conditions. On the other hand, Wen et al. [14] developed a stereo system to estimate the height of wheat stalks and adjust the position of a combine harvester's header in real time, achieving an average error of 5.5 cm compared to manual measurements.
The 3D data acquisition system developed in this work is based on a binocular stereoscopic vision scheme, supported by the favorable cost-benefit ratio widely documented in the literature. Stereoscopic systems are characterized by their low cost (approximately $100–$1000), high video transmission speed, and ability to operate both indoors and outdoors, making them suitable for agricultural environments with variable lighting conditions [6]. These characteristics are essential factors in ensuring accessibility and efficiency when capturing large volumes of phenotypic data. Despite these advantages, the literature also reports an inherent weakness of stereo vision-based systems: a high degree of dependence on calibration algorithms [15] and stereo correspondence algorithms [16,17,18], which can significantly affect the quality of the disparity map and, therefore, the accuracy of the three-dimensional reconstruction. However, with an appropriate methodology that integrates robust detection and stereo correspondence, it is possible to mitigate these limitations and obtain consistent results even under variable environmental conditions.
Extracting phenotypic characteristics from 3D data presents various difficulties in outdoor environments [12]. Unlike controlled laboratory settings, field-captured images include environmental elements, such as soil and adjacent vegetation, which can be mistaken for the plant of interest and affect segmentation. Therefore, prior detection of the plant is an essential step, as it allows the region of interest (ROI) to be delimited and the analysis to be focused solely on the plant area, avoiding background interference [19]. However, acquisition under variable conditions of natural lighting, cloud cover, or projected shadows directly affects detection accuracy, making it difficult to correctly identify the plant relative to its surroundings.
In this context, this article introduces a low-cost, automated 3D phenotyping system that reconstructs and analyzes the morphology of maize plants in the field using stereo vision. The main novelty and contribution of this research lies in the direct integration of a deep learning-based detector into the stereo matching process, allowing the region of interest to be narrowed dynamically and the disparity calculation to be confined solely to the plant volume. This strategy eliminates redundant background processing, shortens computation time, and improves the robustness of the depth map, enabling execution on low-power embedded platforms. Rather than replacing CNN-based segmentation strategies, our proposal represents an alternative suited to scenarios where on-device execution and constrained computing environments must be prioritized.
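As a concrete illustration of this integration, the sketch below (Python with OpenCV and Ultralytics YOLO) confines SGBM disparity computation and 3D reprojection to the detected plant boxes. The weight file name, SGBM parameters, and left-edge padding are illustrative assumptions, not the exact configuration of the system described in Section 2.

```python
import cv2
import numpy as np
from ultralytics import YOLO

model = YOLO("maize_yolov8n.pt")  # hypothetical path to fine-tuned weights

# Illustrative SGBM parameters; the actual values are given in Section 2.
sgbm = cv2.StereoSGBM_create(
    minDisparity=0, numDisparities=128, blockSize=5,
    P1=8 * 3 * 5 ** 2, P2=32 * 3 * 5 ** 2,
    uniquenessRatio=10, speckleWindowSize=100, speckleRange=2)

def roi_point_clouds(rect_left, rect_right, Q):
    """Detect plants in the rectified left image, compute disparity only
    inside each bounding box, and reproject those pixels to 3D."""
    h, w = rect_left.shape[:2]
    disparity = np.full((h, w), -1.0, dtype=np.float32)  # invalid everywhere
    boxes = model(rect_left)[0].boxes.xyxy.cpu().numpy().astype(int)
    for x1, y1, x2, y2 in boxes:
        x0 = max(0, x1 - 128)  # pad left so SGBM has room for its search range
        crop = sgbm.compute(rect_left[y1:y2, x0:x2],
                            rect_right[y1:y2, x0:x2]).astype(np.float32) / 16.0
        disparity[y1:y2, x1:x2] = crop[:, x1 - x0:]  # discard the padding
    # Reproject once with the full-image Q so principal-point offsets stay valid.
    points = cv2.reprojectImageTo3D(disparity, Q)
    valid = disparity > 0
    return points[valid], valid  # (N, 3) plant points and their pixel mask
```

Restricting the matcher to boxes whose combined area is much smaller than the frame is what yields the timing gains reported in Section 3.1.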
The rest of the paper is structured into four sections. Section 2 describes the hardware and algorithms employed in the implementation of the system. Section 3 presents the experimental results along with a comparative analysis with a commercial reference system. Section 4 discusses the advantages, applicability, and limitations of the proposed approach in real agricultural environments. Finally, the main contributions and future research directions are presented in Section 5.
3. Results
3.1. Performance of the Plant Detection Model
Figure 11 shows the evolution of the precision, recall, and mAP@0.5 metrics during the training of the YOLOv8n model, evaluated using five-fold cross-validation (k = 5). These metrics capture both the model's ability to correctly detect corn plants and its stability across different partitions of the dataset. During the first epochs (≈0–5), the metrics showed high dispersion, with average values below 0.5, reflecting the initial adjustment of the model weights. From epoch 10 onward, all three metrics increased, reaching values close to 0.9 and indicating a progressive improvement in the model's ability to correctly discriminate the regions of interest. At approximately epoch 20, precision, recall, and mAP@0.5 reached values close to 1.0 and remained virtually constant until training completed at epoch 50.
On the other hand, the gradual reduction in the shaded area (representing the standard deviation across the five folds) in the curves indicates low fold-to-fold variability, demonstrating stable and generalizable training. This behavior suggests that the morphological and textural characteristics of the plants were learned consistently, without noticeable overfitting to specific subsets of the dataset. The results above demonstrate the effectiveness and robustness of the YOLOv8n model in identifying corn plants under variable lighting conditions, including shadows and angular variations.
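A compact way to reproduce this kind of evaluation with the Ultralytics API is sketched below; the per-fold dataset YAML files (fold_1.yaml … fold_5.yaml) and the hyperparameters are placeholders for illustration, not the exact training setup.

```python
import numpy as np
from ultralytics import YOLO

metrics_per_fold = []
for k in range(1, 6):  # five-fold cross-validation
    model = YOLO("yolov8n.pt")  # restart from pretrained nano weights each fold
    model.train(data=f"fold_{k}.yaml", epochs=50, imgsz=640)
    m = model.val()  # evaluates on the fold's validation split
    metrics_per_fold.append((m.box.mp, m.box.mr, m.box.map50))

# Mean and fold-to-fold standard deviation, i.e., the curves and shaded
# bands plotted in Figure 11.
arr = np.array(metrics_per_fold)
print("mean P / R / mAP@0.5:", arr.mean(axis=0), "std:", arr.std(axis=0))
```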
Figure 12 shows the evolution of processing time throughout the measurements. The blue line represents the performance of the complete image processing, which includes disparity estimation using SGBM and three-dimensional reprojection of the entire rectified image. In contrast, the orange line corresponds to the total combined processing time, which integrates automatic ROI detection using YOLOv8, stereoscopic disparity calculation, and 3D reprojection only within the detected regions.
During the first few days, when plant coverage was low (approximately 90,000 to 200,000 pixels), the combined processing time was substantially lower, reaching values close to 2 s per frame. As the plants grew and the number of detected pixels increased (exceeding 600,000 in the last few days), the execution time of the combined method gradually rose, approaching the time required to process the entire image. These results indicate that YOLO-based automatic detection significantly reduces computational time by restricting stereoscopic processing and 3D reprojection to the regions of interest, while preserving spatial accuracy. The ROI was also key to isolating the plant effectively, ensuring that morphological measurements were derived from the plant's structure while minimizing the influence of neighboring vegetation.
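A minimal timing harness for producing the two curves in Figure 12 is sketched below; roi_point_clouds stands for the ROI-restricted pipeline sketched in the Introduction, and the matcher parameters are illustrative.

```python
import time
import cv2
import numpy as np

sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)

def time_full_frame(rect_left, rect_right, Q):
    """Wall-clock time for disparity + reprojection over the whole image."""
    t0 = time.perf_counter()
    disp = sgbm.compute(rect_left, rect_right).astype(np.float32) / 16.0
    cv2.reprojectImageTo3D(disp, Q)
    return time.perf_counter() - t0

def time_roi_pipeline(rect_left, rect_right, Q, roi_point_clouds):
    """Wall-clock time for detection + ROI-restricted disparity/reprojection."""
    t0 = time.perf_counter()
    roi_point_clouds(rect_left, rect_right, Q)
    return time.perf_counter() - t0
```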
3.2. Analysis of Phenotyping Metrics
Figure 13 shows the temporal evolution of the height of six corn plants between 20 August and 6 September 2025, a period corresponding to the vegetative elongation phase. Each curve represents an individual plant, with daily measurements expressed in centimeters (cm) obtained from the point cloud reconstructed by detecting the ground plane using RANSAC and subsequently estimating the maximum vertical distance (Z_max) within each plant region. As shown in Figure 13, all plants exhibit a sustained increase in height throughout the monitoring period, demonstrating the system's ability to capture growth dynamics at daily resolution. However, the growth rate differs among the individuals analyzed during the experiment, which is attributable to both biological variability and microenvironmental effects (lighting, soil moisture, and leaf density). In particular, Plant 3 (green line) reached the highest recorded height, exceeding 60 cm, while Plant 6 (brown line) showed the least growth, remaining below 40 cm at the end of the observation period.
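A minimal sketch of this height estimate is given below, assuming Open3D's RANSAC plane segmentation and height taken as the largest point-to-plane distance within the plant cloud; the thresholds are illustrative, not the system's exact parameters.

```python
import numpy as np
import open3d as o3d

def plant_height(scene_points, plant_points):
    """scene_points: (N, 3) cloud containing the ground; plant_points: (M, 3)
    cloud of one detected plant. Units follow the point cloud (here cm)."""
    pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(scene_points))
    (a, b, c, d), _ = pcd.segment_plane(distance_threshold=1.0,  # ~1 cm tolerance
                                        ransac_n=3, num_iterations=1000)
    normal = np.array([a, b, c])
    # Distance of every plant point to the fitted ground plane; Z_max is the
    # farthest point from the plane (abs() because the normal's sign is arbitrary).
    dist = np.abs(plant_points @ normal + d) / np.linalg.norm(normal)
    return dist.max()
```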
The negative fluctuations observed on specific dates do not correspond to measurement errors but rather to temporary structural alterations associated with leaf damage from rain and wind, which temporarily altered the aerial architecture and the vertical projection of the leaves. This behavior was confirmed by direct field visual inspection and reflects the sensitivity of the stereo vision system to record actual physical changes in plant morphology.
Considering only valid measurements, a Mean Absolute Error (MAE) of 1.1 cm and a Root Mean Square Error (RMSE) of 1.29 cm were obtained, demonstrating high stability in the three-dimensional estimates. These differences, on the order of 1 cm, are consistent with the expected growth of plants over short intervals and can be attributed to both slight environmental fluctuations and actual biological growth during the observation period. In practical terms, these results demonstrate that the variation observed between consecutive measurements does not arise from the vision system but from environmental fluctuations or actual growth, confirming the robustness of the method to changes in natural lighting.
Figure 14 shows the temporal evolution of the estimated three-dimensional volume of the aerial part of the plants during the monitoring period. This parameter acts as a structural indicator of biomass, reflecting the three-dimensional expansion of the canopy. The results reveal a progressive increase in volume across all plants, with varying growth slopes among individuals. Plant 3 reached the maximum value, close to 430 cm³, while Plant 5 showed the lowest volumetric development, less than 105 cm³. The temporal trend reflects a pattern consistent with corn vegetative growth, characterized by rapid leaf expansion and a sustained increase in aerial volume. Non-invasive monitoring of this parameter using 3D reconstruction demonstrates the system's ability to quantify biomass dynamics with daily resolution.
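The exact volume estimator is specified in Section 2.2.3 and is not reproduced here; as one plausible reading consistent with the magnitudes in Figure 14, the sketch below voxelizes the plant cloud and counts occupied voxels. The 1 cm voxel size is an assumption for illustration only.

```python
import numpy as np

def voxel_volume_cm3(plant_points, voxel=1.0):
    """Approximate aerial volume as (occupied voxel count) * voxel^3,
    with plant_points given in centimeters."""
    idx = np.floor(plant_points / voxel).astype(np.int64)
    occupied = np.unique(idx, axis=0).shape[0]
    return occupied * voxel ** 3
```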
Regarding the physiological consistency of the morphological metrics, Pearson's correlation coefficient between plant height and three-dimensional volume yielded a value of r = 0.802. These findings reveal a statistically significant positive association, demonstrating that increases in plant height were typically paralleled by proportional expansions in estimated biomass.
From a physiological perspective, this correspondence reflects a balance between stem elongation and leaf development, characteristic of vegetative growth of maize plants. However, the correlation was not perfect, suggesting differences in architecture between individuals. In some cases, growth was laterally oriented, with longer leaves or inclined stems, increasing volume without a significant change in height. This behavior is associated with morphophysiological responses to environmental factors, such as light direction, spatial competition, and mechanical stress caused by wind.
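For completeness, the height–volume association can be checked directly from the paired series; the arrays below are placeholders, and the study's actual values come from Figures 13 and 14.

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder paired observations (one value per plant per day).
heights_cm = np.array([32.0, 41.5, 48.2, 55.9, 61.3])
volumes_cm3 = np.array([120.0, 190.0, 260.0, 350.0, 430.0])

r, p = pearsonr(heights_cm, volumes_cm3)  # the study reports r = 0.802
print(f"Pearson r = {r:.3f} (p = {p:.4f})")
```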
3.3. Comparison Against a Commercial Reference System
Table 3 shows the MAE of the height measurements obtained with the proposed system compared to the ZED 2i camera. Three independent measurements were taken for each plant on 4–6 September. The MAE was calculated as the average absolute error (AE) between the measurements of the proposed system and those recorded with the ZED 2i camera, reflecting the average deviation from the reference system. The column “MAE per plant” indicates the average individual error for each plant, considering the three measurement dates. The results show that the proposed system reproduces height measurements with high fidelity compared to the ZED 2i, achieving an overall MAE of 1.48 cm, demonstrating its accuracy and consistency in phenotypic data acquisition.
Figure 15a presents the global correlation between the height measurements obtained with the proposed stereo-vision system and those recorded with the ZED 2i camera, pooling all 18 paired observations collected over three consecutive measurement days. Each point corresponds to a single plant on a given date, with ZED 2i measurements shown on the X-axis and those from the proposed system on the Y-axis. The resulting regression exhibits a strong linear relationship, with a coefficient of determination above 0.93, indicating that more than 93% of the variance in the proposed system's measurements is explained by the commercial reference device. The regression line lies close to the identity line (y = x), and the residuals remain limited across the entire height range, suggesting that the system maintains consistent accuracy for both shorter and taller individuals. The absence of any visible trend in the residual dispersion confirms the lack of proportional bias, while the interspersed distribution of points across different days demonstrates stability under varying illumination and canopy configurations.
To complement the correlation analysis, we quantified the agreement between the proposed system and the ZED 2i using several standard error metrics. The global MAE was 1.48 cm, and the RMSE was 1.87 cm, indicating that both average and larger deviations remained small. The MAPE of 3.67% shows that relative errors stayed below 4% across the full range of plant heights.
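These agreement metrics reduce to a few lines of code; a minimal sketch over the paired height series (both systems in the same units):

```python
import numpy as np

def agreement_metrics(proposed, reference):
    """MAE, RMSE, and MAPE between paired height measurements."""
    proposed, reference = np.asarray(proposed), np.asarray(reference)
    err = proposed - reference
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    mape = np.mean(np.abs(err) / reference) * 100.0  # percent
    return mae, rmse, mape
```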
A Bland–Altman analysis was also performed to evaluate systematic effects. The mean bias was −0.66 cm (95% CI: [−1.56, 0.23] cm), and because the confidence interval includes zero, no significant systematic overestimation or underestimation is present. The limits of agreement ranged from −4.18 cm to 2.85 cm, with their respective confidence intervals contained within agronomically acceptable ranges. These limits indicate that 95% of the measurements fall within approximately −4.2 cm to +2.9 cm of the reference device. The Bland–Altman plot (Figure 15b) further confirms an even distribution of differences across the height range, with no evidence of proportional bias.
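The quantities above follow the standard Bland–Altman formulas; a minimal sketch is given below (the sqrt(3/n) standard error of the limits is the usual approximation from Bland and Altman).

```python
import numpy as np
from scipy import stats

def bland_altman(proposed, reference, alpha=0.05):
    """Mean bias, 95% limits of agreement, and their confidence intervals."""
    d = np.asarray(proposed) - np.asarray(reference)
    n, bias, sd = len(d), d.mean(), d.std(ddof=1)
    t = stats.t.ppf(1 - alpha / 2, n - 1)
    se_bias = sd / np.sqrt(n)      # standard error of the mean bias
    se_loa = sd * np.sqrt(3 / n)   # approximate SE of each limit of agreement
    loa = (bias - 1.96 * sd, bias + 1.96 * sd)
    return {
        "bias": bias,
        "bias_ci": (bias - t * se_bias, bias + t * se_bias),
        "loa": loa,
        "loa_ci": [(l - t * se_loa, l + t * se_loa) for l in loa],
    }
```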
4. Discussion
The results demonstrate that the proposed stereo vision system is an efficient, accurate, and low-cost alternative for the three-dimensional phenotypic characterization of corn plants under real field conditions. The combination of automatic deep learning-based detection and SGBM automated the acquisition and analysis of morphological data, significantly reducing the manual intervention required by traditional photogrammetry or direct-measurement methodologies.
The low average reprojection error (0.374 px) confirms robust geometric calibration, comparable to that reported for laboratory stereo systems [34]. This accuracy translates into a depth resolution of around 2–3 mm, validating the suitability of the system for applications involving the reconstruction of delicate plant structures. Likewise, the low residual vertical disparity and intercalibration stability demonstrate the mechanical rigidity of the assembly and the repeatability of the acquisition process, critical factors in field contexts where vibration and variable lighting often degrade the quality of stereo correspondence. Additionally, the YOLOv8n model achieved precision, recall, and mAP values greater than 0.98, with a standard deviation across folds of less than 0.02, demonstrating stable convergence and strong generalization. These results are consistent with recent studies using YOLO architectures for crop detection and segmentation [19,35]. The observed interfold stability indicates that the model abstracts the distinctive morphological characteristics of corn across varying lighting and shade conditions, supporting its applicability in real agricultural settings.
On the other hand, the point cloud reconstruction using SGBM captured the plant geometry within the detected region of interest with high fidelity. The accuracy of the disparity map, together with the subsequent filtering and reprojection using the Q matrix, enabled the generation of clean, dense three-dimensional clouds without significant loss of leaf information. The average height error (MAE ≈ 1.48 cm) is comparable to those reported for reference stereo systems such as the ZED 2i or RealSense D435i [12,36]. Moreover, the consistency of complementary agreement indicators (RMSE, relative error, and the Bland–Altman analysis, which shows a small, non-significant bias and narrow limits of agreement) confirms that the reconstruction pipeline maintains stable performance across varying plant sizes and field conditions.
It is noteworthy that while both height and volume were extracted from the 3D reconstructions, direct validation was performed exclusively for height. Although we estimated both plant height and volume, as described in Section 2.2.3, height estimation from 3D point clouds is inherently more objective and metrologically tractable than volume estimation. In particular, plant height can be directly and unambiguously measured with conventional methods (rulers, lidar systems, or commercial stereo cameras such as the ZED 2i) to obtain a reference (ground-truth) measurement. In contrast, volumetric measurements of live plants are challenging, lack a universally accepted ground-truth methodology, and yield comparisons that can be system- or algorithm-dependent. In addition, volume estimation depends on a comprehensive 3D reconstruction of the entire plant structure, which is affected by the accurate segmentation of all leaf surfaces and by occlusions or overlapping regions. Therefore, the estimated biomass was included to corroborate that plant height does not follow a strictly linear pattern. Moreover, the positive correlation between height and volume (r = 0.802) confirms that increases in plant height typically parallel proportional increases in estimated biomass.
Finally, integrating automatic detection, stereo reconstruction, and morphological analysis on a low-power embedded platform opens new possibilities for real-time phenotyping and autonomous crop monitoring. In practical terms, the system can be scaled to mobile devices or terrestrial drones for continuous measurements without human intervention.
5. Conclusions
This study presented the design, implementation, and validation of a low-cost stereo vision system for 3D phenotyping of maize plants under real field conditions. The integration of deep-learning-based ROI detection (YOLOv8n) with SGBM enabled an automated, non-invasive workflow that captured structural plant traits with millimetric accuracy. The stereo calibration achieved sub-pixel reprojection errors, ensuring geometric stability and depth precision of approximately 2–3 mm within the working range. The detection model reached high performance with low cross-fold variability, demonstrating robust generalization to variations in illumination and shadow. Disparity-based 3D reconstruction yielded accurate height estimates (MAE = 1.1 cm; RMSE = 1.29 cm) and volumetric measurements strongly correlated with plant height, confirming the system’s ability to quantify phenotypic growth dynamics reliably.
Compared with a commercial stereo camera, the proposed setup reproduced height measurements with high fidelity, confirming its metrological reliability and cost-effectiveness. A comprehensive method-comparison analysis was conducted in accordance with established metrological guidelines, and the global error metrics demonstrate that the system maintains centimeter-level accuracy across all measurement days. Likewise, the Bland–Altman analysis further validates its robustness, revealing a non-significant mean bias of −0.66 cm (95% CI: [−1.56, 0.23] cm) and narrow limits of agreement (−4.18 cm to 2.85 cm), with confidence intervals fully contained within agronomically acceptable ranges. These findings confirm that the system introduces neither systematic nor proportional error and that it performs consistently across the entire measured height range.
The obtained results position the proposed system as a viable alternative for low-cost field phenotyping, particularly in research or precision-agriculture contexts where portability, autonomy, and affordability are essential. The inclusion of a statistically rigorous agreement analysis significantly strengthens the system’s validation and demonstrates its potential for reliable deployment in real-world agricultural conditions.
Finally, future work will focus on expanding the dataset to include different crop species and growth stages, enhancing segmentation accuracy through semantic and self-supervised learning, and integrating multispectral and thermal imaging for joint structural–physiological analysis. Additionally, deploying the platform on mobile or robotic units will enable large-scale, real-time phenotyping aligned with the vision of Agro 4.0 and smart farming.