3D Reconstruction of Plant/Tree Canopy Using Monocular and Binocular Vision

Three-dimensional (3D) reconstruction of a tree canopy is an important step in measuring canopy geometry, such as height, width, volume, and leaf cover area. In this research, binocular stereo vision was used to recover the 3D information of the canopy. Multiple images were taken from different views around the target. The structure-from-motion (SfM) method was employed to recover the camera calibration matrix for each image, together with the corresponding 3D coordinates of the feature points. Through this method, a sparse projective reconstruction of the target was realized. Subsequently, a ball pivoting algorithm was used for surface modeling to realize dense reconstruction. Finally, this dense reconstruction was transformed to a metric reconstruction through ground truth points obtained from the calibration of the binocular stereo cameras. Four experiments were completed: one on a box of known geometry, and three on plants, namely a croton plant with big leaves and salient features, a jalapeno pepper plant with medium leaves, and a lemon tree with small leaves. A whole-view reconstruction of each target was realized. Comparison of the reconstructed box's size with the real box's size shows that the 3D reconstruction is metric.


Introduction
Three-dimensional (3D) reconstruction of a plant/tree canopy can be used not only to measure the height, width, volume, area, and biomass of the target, but also to visualize the object in virtual 3D space. 3D reconstruction is also called 3D digitizing or 3D modeling. Plant/tree 3D reconstruction can be categorized into two types: (1) depth-based 3D modeling; and (2) image-based 3D modeling. Depth-based 3D modeling involves using sensors such as ultrasonic sensors, lasers, Time-of-Flight (ToF) cameras, and Microsoft red, green, and blue depth (RGB-D) cameras.
Using ultrasonic sensors, Sinoquet et al. [1] created a 3D model of corn plant profiles and canopy structure. The 3D results were used to calculate the leaf area and its distribution in the plant. Tumbo et al. [2] used ultrasonics in the field to measure citrus canopy volume. Twenty ultrasonic transducers were arranged on vertical boards (10 sensors per side). The ultrasonic sensors were installed behind a tractor, which was assumed to travel at an approximate speed of 0.5 km/h. A formula was provided to calculate the volume. To study the accuracy of this calculation, Zaman and Salyani [3] conducted research on the effect of ground speed and foliage density on canopy volume measurement. The experimental results showed that there was a 17.37% to 28.71% difference between the estimated and manually measured volumes. Tumbo et al. [2] also described how to measure citrus canopy volume using laser sensors. Comparisons were made between the estimated volume and the manually measured volume. The results showed high correlation. Wei and Salyani [4] employed a laser scanner and developed a laser scanning system, a data acquisition system, and a corresponding algorithm to calculate tree height, width, and canopy volume. To evaluate the accuracy of their system, a rectangular box was used as a target. Five repeated experiments were conducted to measure the box's height, length, and volume. However, no direct comparison between the estimated volume and the manually measured volume of citrus trees was made. Wei and Salyani [5] extended the same laser scanning system to calculate foliage density. They defined foliage density as the ratio of foliage volume to tree canopy volume, where foliage volume was defined as the space contained within the laser incident points and the tree row plane, while canopy volume was defined as the space enclosed between the outer canopy boundary and the tree row plane. Lee and Ehsani [6] developed a laser scanner-based system to measure citrus geometric characteristics. After the experimental trees were trimmed to an ellipsoid shape, whose volume was easy to measure manually, the surface area and volume were estimated by using a laser scanner. Rosell et al. [7] reported the use of a 2D light detection and ranging (LIDAR) scanner to obtain the 3D structures of plants. Sanz-Cortiella et al. [8] assumed that there was a linear relationship between the tree leaf area and the number of impacts of the laser beam on the target. The point clouds generated by the laser scanner were used to calculate the total leaf area. Both indoor and outdoor experiments were conducted to validate this assumption. Zhu et al. [9] reconstructed the shape of a tree crown from scanned data based on alpha shape modeling. A boundary mesh model was extracted from the boundary point cloud. This method resulted in a rough shape reconstruction of a big (20-meter high) tree.
Studying the application of a ToF camera, Cui et al. [10] described a 3D reconstruction method initiated by scanning the object using the ToF camera; the reconstruction was realized through the combination of 3D super-resolution and a probabilistic multiple scan alignment algorithm. In 3D reconstruction, a ToF camera was usually used in combination with a red, green, and blue (RGB) camera. The ToF camera provided depth information, and the RGB camera provided color information. Shim et al. [11] presented a method to calibrate a multiple view acquisition system composed of ToF cameras and RGB color cameras. This system has the ability to calibrate multi-modal sensors in real time. Song et al. [12] combined a ToF image with images taken from stereo cameras to estimate a depth map for plant phenotyping. The experiments were conducted in a glasshouse using green pepper plants as targets. Canopy characteristics such as stem length, leaf area, and fruit size were estimated. This estimation was a challenging task because of occlusion. The depth information from the ToF image was used to assist the determination of the disparity between left and right images. A global optimization method, using graph cuts developed by Boykov and Kolmogorov [13], was also used to find the disparity. The result using the graph cuts (GC) method was compared with the one resulting from combining graph cuts and ToF depth information. A quality evaluation was conducted, and GC + ToF gave the highest score. A smooth surface reconstruction of a pepper leaf was obtained using this method. Adhikari and Karkee [14] developed a 3D vision system to automatically prune apple trees. The vision system was composed of a ToF 3D camera and an RGB color camera. Experimental results showed that this system had about 90% accuracy in identifying pruning points.
The RGB-D camera is a Microsoft [15] product called Kinect that is designed for the Xbox 360. Kinect is composed of an RGB camera, a depth camera, and an infrared laser projector. Kinect was mostly used indoors for video games and view reconstruction. Izadi et al. [16] and Newcombe et al. [17] used a moving Kinect to reconstruct a dense indoor view. Kinect Fusion was employed to realize the reconstruction in real time, although using it placed a special requirement on the hardware, specifically the GPU. Chene et al. [18] applied Kinect to 3D phenotyping of plants. An algorithm was developed to segment the depth image from the top view of the plant. The 3D view of the plant was then reconstructed from the segmented depth image. Azzari et al. [19] used Kinect to characterize vegetation structure. The measurements calculated from their depth image matched well with the plant size measured manually. Different experiments were conducted in the lab and in an outdoor field under different light conditions, such as early afternoon, late afternoon, and night. Experimental results showed that the Kinect had a limitation under direct sunlight. Wang and Zhang [20] used two Kinect devices to make a 3D reconstruction of a dormant cherry tree that was moved into a laboratory environment. During the experiment, some parts of the branches were missed due to occlusion and the long distance between the camera and the tree. The reconstructed results could be used for automatic pruning.
Image-based 3D modeling involved reconstructing the 3D properties from 2D images by using a single camera or stereo cameras. Zhang et al. [21] used stereo vision to reconstruct a 3D corn model. The boundaries of the corn leaves were extracted and matched. The 3D leaves were modeled using a space intersection algorithm from 2D boundaries. This was a two-image reconstruction. Song [22] used stereo vision to model crops in horticulture. The cameras were installed above the crops, and a top view of the crop was reconstructed. Han and Burks [23] did work on 3D reconstruction of a citrus canopy. Multiple images were used, and consecutive images were stitched together through image mosaic techniques. The canopy was reconstructed from the stitched image. The results did not realize real-size reconstruction.
The estimation of camera matrices is the first step in 3D reconstruction. The method of self-calibration described by Pollefeys et al. [24,25] is usually used. Fitzgibbon and Zisserman [26] described a method to automatically recover camera matrices and 3D scene points from a sequence of images. These images were sequentially acquired through an uncalibrated camera, and image triplets were used to estimate camera matrices and 3D points. Then the consecutive image triplets were formed into a sequence through one-view overlapping or two-view overlapping. Snavely et al. [27] developed a novel method to recover camera matrices and 3D points from unordered images. All these technologies are known as Structure from Motion (SfM). Sparse feature points were used to match the images. The most often used features were Scale Invariant Feature Transform (SIFT) features, as described by Lowe [28].
Quan et al. [29] did research on plant modeling based on multiple images. SfM was used to estimate camera motion from multiple images. Here, instead of using sparse feature points, quasi-dense feature points as described by Lhuillier and Quan [30] were used to estimate camera matrices and 3D points of the plant. The leaves of the plant were modeled by segmenting the 2D images and computing the depths using the computed 3D points, and the branches were drawn through an interactive procedure. This modeling method was suitable for a plant with distinguishable leaves. To model a tree, which has small leaves, Tan et al. [31] did research on image-based 3D reconstruction. SfM was also employed to recover camera matrices and 3D quasi-dense points. To make a full 3D reconstruction of the tree, the visible branches were first reconstructed, followed by the occluded branches. The occluded branches were reconstructed through an unconstrained growth and constrained growth method. Subsequently, the leaves were added to the branches. Some of the leaves were from segmented images, while others were synthesized. Teng et al. [32] used machine vision to recover the sparse and unoccluded leaves in three dimensions. The method used was similar to the work of Quan et al. [29]. The results of the 3D reconstruction were used to classify the leaves and to identify the plant's type.
Furukawa and Ponce [33] provided a patch-based multiple view stereo (PMVS) algorithm to produce dense points to model the target. Small rectangular patches, called surfels, were used as feature points. The camera matrices were pre-calibrated using the method provided by Snavely et al. [27]. Features in each image were detected, then matched across multiple images. An expansion procedure, similar to the method provided by Lhuillier and Quan [30], was used to produce a denser set of patches.
Santos and Oliveira [34] applied the PMVS method to agricultural crops, such as basil and ixora. Plants with big and unoccluded leaves were well reconstructed. The reported processing time was approximately 110 min for 143 basil images and almost 40 min for 77 ixora images. The number of images increases with the plant's size; consequently, the required processing time also increases. Most of the processing time was spent on feature detection and matching. The matching procedure was conducted through serial computation; if it could be conducted in parallel, the processing time would be significantly reduced.
Currently, a graphics processing unit (GPU)-based SIFT, known as SiftGPU and described by Wu [35], is available for keypoint detection and matching via parallel computing. The Bundler package described by Snavely [36] and the PMVS package developed by Furukawa and Ponce [33] were combined into a single package called VisualSFM by Wu [37], which uses parallel computing technology. This significantly decreases the running time.
The objectives of our study were to:

• Provide a new method to calibrate the camera calibration matrix at the metric level.

• Apply the fast software 'VisualSFM' to complicated objects, e.g., a plant/tree, to generate a full-view 3D reconstruction.

• Generate the metric 3D reconstruction from the projective reconstruction and achieve real-size 3D reconstruction for complicated agricultural plant scenes.

Hardware
In this paper, two Microsoft LifeCam Studio high-definition (HD) web cameras (1080p) were assembled inside a wooden box and mounted approximately in parallel, with a baseline of 30 mm, as shown in Figure 1. To acquire images, they were connected to a Lenovo IdeaPad Y500 laptop with an NVIDIA GeForce GT650M GPU, which can be used for parallel computation to accelerate feature point detection and matching.

Stereo Camera Calibration
A 3D point ($\vec{X}$) and its projection ($\vec{x}$) in a 2D image are related through the camera calibration matrix P. The relationship is expressed as $s\vec{x} = P\vec{X}$, where $\vec{x} = (x, y, 1)^T$ is in homogeneous form in 2D, $\vec{X} = (X, Y, Z, 1)^T$ is in homogeneous form in 3D, P is a 3 × 4 matrix, and s is a scale. The objective of camera calibration is to determine the camera calibration matrix P, which includes both intrinsic parameters and extrinsic parameters. Zhang [38] provided a flexible technique for camera calibration using only five images taken from different angles. A checkerboard was used as the calibration pattern. For each image, the plane of the checkerboard was assumed to be the plane Z = 0, so the Z coordinates for all the 3D points were zero. The X and Y coordinates could be obtained from the actual checkerboard size. All these provided ground truth. A Matlab toolbox, developed by Bouguet [39], was used to solve the camera calibration matrix using Zhang's algorithm. This toolbox is suitable not only for a single camera, but also for stereo cameras.
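As a minimal illustration of the projection relation above, the sketch below projects a 3D point through a 3 × 4 matrix and divides out the scale s. The matrix values (focal length 1000 px, principal point (960, 540), identity rotation, zero translation) and the function name are hypothetical; they are not the calibrated values of the cameras used in this work.

```python
import numpy as np

def project(P, X):
    """Project a 3D point X = (X, Y, Z) with a 3x4 camera matrix P.

    Returns the pixel coordinates (x, y) and the scale s from s*x = P*X.
    """
    X_h = np.append(np.asarray(X, dtype=float), 1.0)  # homogeneous 3D point
    sx = P @ X_h                                       # equals s * (x, y, 1)
    s = sx[2]
    return sx[:2] / s, s

# Hypothetical intrinsics (illustrative only).
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
# Camera at the world origin looking along +Z: P = K [I | 0].
P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])

pixel, s = project(P, [50.0, -20.0, 500.0])  # pixel = (1060, 500), s = 500
```

With this choice of P the scale s is simply the depth Z of the point in millimetres.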
The external camera parameters provided by Zhang's [38] method were built on each checkerboard's own coordinate system, not on the same world coordinate system. In order to build the same world coordinate system, a large 2D x-z coordinate system was plotted on an A0-size paper, together with a vertical checkerboard (Figure 2), and all of these provided the 3D ground truth in the same coordinate system. The detailed 2D x-z coordinate system is shown in Figure 3. The lines in the x-z plane were spaced 50 mm apart. The middle line (oz) was rotated −10° around $O_{orig}$ to get the left line, and rotated +10° around $O_{orig}$ to get the right line. The checkerboard was then placed at different locations on the left, middle, and right lines (marked as 1 through 45 in Figure 3). The 3D coordinates of each corner on the checkerboard, at each location, could be solved as the ground truth. Two images were taken at each location, one from the left camera and one from the right camera. From these 2D images, the 2D projections of these corners could also be solved.
Based on these 2D and 3D coordinates, the gold standard algorithm of Hartley and Zisserman [40] was used to calculate the camera matrices for both left and right cameras.
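The Gold Standard algorithm begins with a linear estimate of the camera matrix and then refines it by minimizing the reprojection error. Only the linear step (the direct linear transformation, DLT) is sketched below, without the data normalization and nonlinear refinement of the full algorithm. The function names and the synthetic camera are illustrative assumptions, not the authors' implementation, and the 3D points must not all lie in one plane.

```python
import numpy as np

def dlt_camera_matrix(X_world, x_image):
    """Linear (DLT) estimate of a 3x4 camera matrix from n >= 6 correspondences.

    X_world: (n, 3) array of 3D points; x_image: (n, 2) array of pixel coordinates.
    The result is defined only up to an overall scale factor.
    """
    rows = []
    for (X, Y, Z), (u, v) in zip(X_world, x_image):
        Xh = [X, Y, Z, 1.0]
        # Two linear equations per correspondence in the 12 entries of P.
        rows.append([0, 0, 0, 0] + [-c for c in Xh] + [v * c for c in Xh])
        rows.append(Xh + [0, 0, 0, 0] + [-u * c for c in Xh])
    A = np.array(rows, dtype=float)
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)  # right singular vector of the smallest singular value

def project_all(P, X):
    """Project an (n, 3) array of points with P and return (n, 2) pixels."""
    proj = (P @ np.hstack([X, np.ones((len(X), 1))]).T).T
    return proj[:, :2] / proj[:, 2:3]

# Synthetic check with a hypothetical camera (values are illustrative only).
rng = np.random.default_rng(0)
X_world = rng.uniform(-100.0, 100.0, size=(12, 3)) + np.array([0.0, 0.0, 500.0])
P_true = np.array([[1000.0, 0.0, 960.0, 10.0],
                   [0.0, 1000.0, 540.0, 20.0],
                   [0.0, 0.0, 1.0, 0.5]])
x_image = project_all(P_true, X_world)

P_est = dlt_camera_matrix(X_world, x_image)
max_reproj_error = np.abs(project_all(P_est, X_world) - x_image).max()
```

On noise-free data the reprojection error of the linear estimate is at the level of floating-point round-off; with real corner detections the refinement step becomes important.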
Based on the camera calibration matrices and the 2D image coordinates, the estimated 3D points can be obtained. Comparing them with the actual 3D points gives the error in the X, Y, and Z directions (Figure 4). These experimental results showed that this stereo camera set had good accuracy when the distance between the cameras and the target was less than 800 mm. The statistical analysis of the errors in the X, Y, and Z directions is shown in Table 1. The mean error is 0.42 mm in the X direction, 0.36 mm in the Y direction, and 2.78 mm in the Z direction.
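The estimated 3D points come from intersecting the viewing rays of a corner seen in the left and right images. A minimal sketch of linear triangulation, followed by the per-axis error against a known point, is given below. The two camera matrices are hypothetical (made-up intrinsics and a 30 mm baseline along X), not the calibrated matrices from this study.

```python
import numpy as np

def triangulate(P_left, P_right, x_left, x_right):
    """Linear triangulation of one 3D point from a left/right pixel pair."""
    A = np.array([
        x_left[0] * P_left[2] - P_left[0],
        x_left[1] * P_left[2] - P_left[1],
        x_right[0] * P_right[2] - P_right[0],
        x_right[1] * P_right[2] - P_right[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X_h = Vt[-1]                 # homogeneous solution of A X = 0
    return X_h[:3] / X_h[3]

def pixel_of(P, X):
    """Project a 3D point X with camera matrix P to pixel coordinates."""
    p = P @ np.append(X, 1.0)
    return p[:2] / p[2]

# Hypothetical stereo pair: identical intrinsics, right camera shifted 30 mm in X.
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
P_left = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P_right = K @ np.hstack([np.eye(3), np.array([[-30.0], [0.0], [0.0]])])

X_true = np.array([40.0, 25.0, 600.0])   # ground truth point, in mm
X_est = triangulate(P_left, P_right,
                    pixel_of(P_left, X_true), pixel_of(P_right, X_true))
error_xyz = np.abs(X_est - X_true)       # per-axis error in X, Y, Z
```

With a short baseline, a small pixel error shifts the ray intersection mainly along the depth direction, which is consistent with the Z error in Table 1 being larger than the X and Y errors.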

Image Acquisition
To make a full-view reconstruction of the plant or tree, multiple images had to be taken from different view angles around the target. The stereo camera (shown in Figure 1) and a laptop with image acquisition software were used to acquire the images. One setup of the experiment is shown in Figure 5, where the target plant is in the center and the stereo camera positions are shown around it. Images taken from adjacent locations should have an overlapping region.

Feature Points Detection and Matching
At the beginning, feature points were detected as Harris corners [41]. A pixel was selected as a salient pixel if its response was an eight-way local maximum. Normalized cross correlation (NCC) and normalized sum of squared differences (NSSD) described by Richard [42] could be used to match the features. Harris corner features were not invariant to affine and scale transforms. Mikolajczyk and Schmid [43] provided different scale and affine invariant feature point detectors, such as Harris-Laplace and Harris-Affine. Mikolajczyk and Schmid [44] did a performance evaluation for four different local feature detectors (Harris-Laplace, Hessian-Laplace, Harris-Affine, and Hessian-Affine) and 10 different feature descriptors. Lowe [28] provided a Scale Invariant Feature Transform (SIFT) descriptor to describe the detected keypoints. Using Lowe's SIFT research, Yan and Sukthankar [45] derived a PCA-based SIFT (PCA-SIFT), and Morel and Yu [46] provided an affine SIFT (ASIFT). To enhance the computing speed of SIFT, speeded up robust features (SURF) were provided by Bay et al. [47]. To further improve the computation speed of SIFT, a parallel algorithm called SiftGPU was provided by Wu [35].
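For reference, normalized cross correlation between two equally sized grayscale patches can be sketched as follows. The function name and the toy patches are illustrative. The score lies in [−1, 1] and does not change under a positive gain and an offset in brightness, which is why it can be used to compare corner neighborhoods between views.

```python
import numpy as np

def ncc(patch_a, patch_b):
    """Normalized cross correlation of two equally sized grayscale patches."""
    a = np.asarray(patch_a, dtype=float).ravel()
    b = np.asarray(patch_b, dtype=float).ravel()
    a = a - a.mean()             # remove the brightness offset
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:             # flat patch: correlation is undefined, return 0
        return 0.0
    return float(a @ b / denom)

patch = np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]])
brighter = 2.0 * patch + 15.0        # same texture, different gain and offset
score_same = ncc(patch, brighter)    # close to 1
score_flip = ncc(patch, -patch)      # close to -1
```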
Snavely et al. [27] applied SIFT to multiple-view reconstruction from unordered images. Snavely [36] provided the software called Bundler to realize this method. In Snavely's research, the SIFT feature points for each image were detected. Each pair of images was then matched using the ANN algorithm from Arya et al. [48]. This process was conducted in serial computation. The required computation time increased significantly as the number of input images and the number of feature points per image increased. Santos and Oliveira [34] applied Bundler to their plant phenotyping, and they reported that almost one hour was needed to match the features between image pairs for the set of 143 images, and almost 30 min for the set of 77 images.
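The growth in matching time follows from the number of image pairs: exhaustive pairwise matching of n images requires n(n − 1)/2 comparisons. The short calculation below evaluates this for the two image counts reported by Santos and Oliveira [34], under the assumption that every pair is matched.

```python
def num_image_pairs(n_images):
    """Number of unordered image pairs in exhaustive pairwise matching."""
    return n_images * (n_images - 1) // 2

pairs_143 = num_image_pairs(143)  # 10153 pairs
pairs_77 = num_image_pairs(77)    # 2926 pairs
```

Roughly doubling the number of images therefore more than triples the number of pairs to match.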
Wu [37] provided a fast method, called visual structure from motion (VisualSFM), to accelerate feature point detection, matching, bundle adjustment, and 3D reconstruction. Wu's method was applied in this paper.

Sparse Bundle Adjustment
Given a set of images, the matched feature points, also known as 2D projections, could be found through the feature matching algorithm introduced in the previous section. Each matched feature point had a corresponding 3D point in the scene. The camera matrices and 3D points could be estimated through the bundle adjustment method [40]. The j-th 3D point $X_j$ is projected onto the i-th image as $\hat{x}^i_j$ through the i-th camera calibration matrix $P^i$, where $\hat{x}^i_j = P^i X_j$ [40]. By minimizing the error between the re-projected projection $\hat{x}^i_j$ and the actual projection $x^i_j$, the camera calibration matrices $P^i$ and the sparse 3D points $X_j$ could be estimated. A software package called sparse bundle adjustment (SBA) was provided by Lourakis and Argyros [49] to realize this minimization.
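The quantity that bundle adjustment minimizes can be written down in a few lines. The sketch below only evaluates the sum of squared reprojection errors for given cameras and points; the minimization itself is what a package such as SBA carries out. The data layout (a dictionary keyed by image index i and point index j) and the toy values are assumptions made for illustration.

```python
import numpy as np

def reprojection_error(cameras, points, observations):
    """Sum of squared reprojection errors over all observed projections.

    cameras: list of 3x4 camera matrices, one per image i
    points: (m, 3) array of 3D points, one row per point j
    observations: dict mapping (i, j) -> observed pixel (x, y)
    """
    total = 0.0
    for (i, j), x_obs in observations.items():
        proj = cameras[i] @ np.append(points[j], 1.0)
        x_hat = proj[:2] / proj[2]                    # re-projected point
        total += float(np.sum((x_hat - np.asarray(x_obs)) ** 2))
    return total

# Toy example: one camera P = [I | 0], one point, one slightly offset observation.
cameras = [np.hstack([np.eye(3), np.zeros((3, 1))])]
points = np.array([[1.0, 2.0, 4.0]])                  # projects to (0.25, 0.5)
observations = {(0, 0): (0.25, 0.6)}
cost = reprojection_error(cameras, points, observations)  # (0.5 - 0.6)^2 = 0.01
```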

Dense 3D Reconstruction Using CMVS and PMVS
The patch model developed by Furukawa and Ponce [33,50] to produce 3D dense reconstruction from multiple view stereo (MVS) was used in this research. Patches were reconstructed through three steps: feature matching, patch expansion, and patch filtering. Feature matching was used to generate an initial set of patches. Then the patches were made denser. The outliers were removed by filtering. Finally, the patches were used to build a polygonal mesh. Furukawa and Ponce [51] developed software called PMVS to implement this method. PMVS used the output (camera matrices) from Bundler as its input. Other inputs for PMVS came from another software package called CMVS [52].

Stereo Reconstruction Using VisualSFM
VisualSFM, which was proposed by Wu [37], integrated three technologies: feature point detection and matching [35], multicore bundle adjustment [53], and dense 3D reconstruction [33]. Multiple images from a full view of the plant/tree were imported into this software. A fully reconstructed result was generated through the three steps mentioned above.

Metric Reconstruction
The result from bundle adjustment was not a metric reconstruction, which means that the reconstructed result did not show the actual size of the target.
A direct reconstruction method using ground truth was provided by Hartley and Zisserman [40] to realize the metric reconstruction. Using pre-calibrated stereo cameras, the Euclidean ground truth of a set of 3D points $X^i_{euc}$ could be solved from the 2D correspondences $x^i_1 \leftrightarrow x^i_2$, and the estimated 3D points $X^i_{est}$ could be obtained from bundle adjustment. The estimated 3D points and the Euclidean 3D points were related through a homography transformation (H), so that $X^i_{euc} = H X^i_{est}$. The first two images from the stereo camera were used to solve the Euclidean ground truth from the 2D projections. From our stereo camera calibration we knew that this stereo camera pair had good accuracy only when the distance between the camera and the target was less than 580 mm. Therefore, those 3D points whose Z coordinates were larger than 580 mm were filtered out as outliers.
To minimize the homography fitting error, these two sets of 3D points had to be normalized. After normalization, using the method described by Hartley and Zisserman [40], the centroid of the new points was at the origin, and the average distance from the origin was $\sqrt{3}$. After applying normalization, the homography $H_{est}$ between $\{X^i_{newpts1}\}$ and $\{X^i_{newpts2}\}$ was estimated using the rigid transformation of Forsyth and Ponce [54]. By fitting the rigid transformation, we get $X^i_{newpts1} = H_{est} X^i_{newpts2}$. To de-normalize it, we have $H = T_1^{-1} H_{est} T_2$, where $T_1$ and $T_2$ denote the normalizing transformations applied to the Euclidean and the estimated point sets, respectively. Applying H to all the 3D points from bundle adjustment, we can transfer them to metric scale. The new camera calibration matrix was $P^i_{euc} = P^i_{est} H^{-1}$.
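The normalize, fit, and de-normalize sequence can be sketched as follows. This is a minimal sketch under the assumption that the estimated points differ from the Euclidean points by a rotation, a translation, and a uniform scale, which is what normalization followed by a rigid fit can recover. The rigid fit uses a standard SVD-based least-squares solution; all function names and the synthetic data are illustrative and are not taken from the authors' code.

```python
import numpy as np

def normalizing_transform(points):
    """4x4 similarity that moves the centroid to the origin and scales the
    average distance from the origin to sqrt(3)."""
    centroid = points.mean(axis=0)
    mean_dist = np.linalg.norm(points - centroid, axis=1).mean()
    s = np.sqrt(3.0) / mean_dist
    T = np.eye(4)
    T[:3, :3] *= s
    T[:3, 3] = -s * centroid
    return T

def transform_points(T, points):
    """Apply a 4x4 transformation to an (n, 3) array of points."""
    homog = np.hstack([points, np.ones((len(points), 1))])
    out = (T @ homog.T).T
    return out[:, :3] / out[:, 3:4]

def fit_rigid(src, dst):
    """Least-squares rotation and translation with dst ~ R @ src + t,
    returned as a 4x4 matrix (SVD-based solution)."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    U, _, Vt = np.linalg.svd((src - c_src).T @ (dst - c_dst))
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against a reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    H = np.eye(4)
    H[:3, :3] = R
    H[:3, 3] = c_dst - R @ c_src
    return H

def metric_upgrade(X_est, X_euc):
    """Return H such that X_euc ~ H @ X_est in homogeneous coordinates."""
    T_euc = normalizing_transform(X_euc)
    T_est = normalizing_transform(X_est)
    H_norm = fit_rigid(transform_points(T_est, X_est),
                       transform_points(T_euc, X_euc))
    return np.linalg.inv(T_euc) @ H_norm @ T_est  # undo both normalizations

# Synthetic check: build "estimated" points by applying a known similarity.
rng = np.random.default_rng(1)
X_euc = rng.uniform(-200.0, 200.0, size=(20, 3)) + np.array([0.0, 0.0, 400.0])
angle = np.deg2rad(30.0)
S = np.eye(4)
S[:3, :3] = 0.01 * np.array([[np.cos(angle), -np.sin(angle), 0.0],
                             [np.sin(angle), np.cos(angle), 0.0],
                             [0.0, 0.0, 1.0]])
S[:3, 3] = [0.5, -0.2, 1.0]
X_est = transform_points(S, X_euc)   # stands in for bundle adjustment output
H = metric_upgrade(X_est, X_euc)
X_back = transform_points(H, X_est)  # should reproduce X_euc
```

Because both point sets have the same size after normalization, the remaining difference between them is a rotation, so a rigid fit is sufficient at that stage. A camera matrix from the same reconstruction would be updated by multiplying it on the right by `np.linalg.inv(H)`.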

Experimental Results and Discussion
Four test experiments were conducted, one was a box with known geometry, and the other three were a croton plant with salient features, a jalapeno pepper plant with medium-size leaves, and a lemon tree with small leaves.
Test 1: A hexagonal box of known geometry was used to verify the reconstruction result. The box was placed on top of a table, and the stereo camera was moved manually around it to take the images. Images taken at adjacent locations should overlap, which helps feature matching. The side length of the hexagon is 64 mm, and the height is 70 mm. To give the box texture, paper printed with citrus leaf images was wrapped around it, as shown in Figure 6. Approximately 86 images were taken from various positions around the box using the stereo camera. The box was first reconstructed using VisualSFM [37]; the result is shown in Figure 7A. The box was then reconstructed by applying the metric reconstruction method (described in step 2.8). The result, shown in Figure 7B, reflects the real size of the target. The reconstructed length of each side of the hexagon and the reconstructed height of each side face are shown in Tables 2 and 3. This verification test shows that the hexagonal box is well reconstructed: the estimated lengths and heights of the box are very close to the actual size. The method was then applied to more complicated objects, such as plants and a small tree.
Test 2: Three kinds of plants with different leaf sizes were reconstructed using the method introduced above. The croton plant has large, sparse leaves with salient features; the jalapeno pepper plant has medium-sized, sparse leaves; and the lemon tree has small, dense leaves. They are shown in Figure 8. First, the objects were reconstructed in projective views using VisualSFM by Wu [37]. Then the metric reconstruction algorithm (described in step 2.8) was applied to obtain the 3D reconstruction in Euclidean space. For the croton plant, the first pair of images was used as the ground truth. The feature points of these two images were extracted and matched. Together with the camera matrices of the stereo cameras, the actual 3D points were calculated using the triangulation method of Hartley and Zisserman [40]. The estimated 3D points for the same 2D correspondences were taken from the VisualSFM reconstruction. A rigid transformation between the estimated and actual 3D points was then computed. Applying this transformation to all the estimated 3D points from all the images gave the final metric 3D reconstruction, which is shown in Figure 9A. A similar process was applied to the other two plants: the first pair of images was used for the pepper plant, and the seventh pair for the lemon tree. The reconstructed views of these two targets are displayed in bounding boxes in Figure 9B,C, respectively.
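The alignment step in Test 2, which maps the estimated 3D points onto the triangulated ground-truth points, can be illustrated with a least-squares fit. The following is only a minimal sketch, not the implementation used in this research: it assumes NumPy and an SVD-based (Kabsch/Umeyama-style) solution, and the function names `estimate_similarity` and `apply_transform` are hypothetical. The text speaks of a rigid transform; the optional scale factor is an added assumption here, included because SfM output is commonly defined only up to an arbitrary scale. Setting `with_scale=False` gives a purely rigid fit.

```python
import numpy as np


def estimate_similarity(src, dst, with_scale=True):
    """Least-squares fit of dst ~= s * R @ src + t for matched (N, 3) point sets.

    src: estimated 3D points (e.g., from the SfM reconstruction).
    dst: actual 3D points (e.g., triangulated from the calibrated stereo pair).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)

    # Guard against a reflection so that R is a proper rotation.
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0
    R = U @ D @ Vt

    if with_scale:
        var_src = (src_c ** 2).sum() / len(src)
        s = float((S * np.diag(D)).sum() / var_src)
    else:
        s = 1.0  # purely rigid fit, as named in the text

    t = mu_dst - s * R @ mu_src
    return s, R, t


def apply_transform(points, s, R, t):
    """Map every estimated point into the metric frame."""
    return s * np.asarray(points, dtype=float) @ R.T + t
```

In use, the transform would be estimated once from the points seen in the ground-truth stereo pair and then applied to the whole estimated point cloud.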
To roughly calculate the volume of the reconstructed plant canopy, the bounding box was divided into voxels. If a 3D point lies inside a voxel, that voxel is marked as used. Unused voxels are removed, as shown in Figure 10, so every reconstructed 3D point resides inside one of the remaining voxels. The sum of the volumes of the remaining voxels is taken as the canopy volume. There is a tradeoff between the voxel size and the estimated canopy volume; this tradeoff was not analyzed in this research since it is not the primary task. The estimated volumes for the three plants are shown in Table 4.
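The voxel-counting estimate of canopy volume can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the code used in this research: the function name `voxel_volume` is hypothetical, the grid is assumed to be axis-aligned with its origin at the minimum corner of the bounding box, and the voxel size is a free parameter that the text does not specify.

```python
import math


def voxel_volume(points, voxel_size):
    """Approximate the volume of a point cloud by counting occupied voxels.

    points: iterable of (x, y, z) coordinates in metric units.
    voxel_size: edge length of each cubic voxel, in the same units.
    Returns the summed volume of all voxels containing at least one point.
    """
    if voxel_size <= 0:
        raise ValueError("voxel_size must be positive")
    points = list(points)
    if not points:
        return 0.0

    # Grid origin: minimum corner of the axis-aligned bounding box.
    mins = [min(p[k] for p in points) for k in range(3)]

    # A voxel is "used" if at least one point falls inside it.
    occupied = set()
    for p in points:
        index = tuple(math.floor((p[k] - mins[k]) / voxel_size) for k in range(3))
        occupied.add(index)

    # Unused voxels are never stored, so only used voxels are summed.
    return len(occupied) * voxel_size ** 3
```

With points given in millimetres the result is in cubic millimetres. Calling the function on the same points with several voxel sizes shows how strongly the estimate depends on that choice: large voxels tend to overestimate the volume, while very small voxels leave gaps between sparse points.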

Figure 1. Stereo cameras used to acquire images: (A) whole view; (B) inside view.

Figure 4. Error plots in the X, Y, and Z directions: (A) errors in the X direction; (B) errors in the Y direction; (C) errors in the Z direction.

Figure 5. An example of the stereo camera setup for image acquisition, with the 3D reconstruction results (56 stereo pairs were used).

Figure 10. Demonstration of the volume calculation: (A) bounding box divided into voxels; (B) voxels remaining after the unused voxels are removed.

Table 1. Statistical analysis of errors between estimated and actual corners.

Table 2. Estimated length vs. actual length.