Vision-Aided Localization and Mapping in Forested Environments Using Stereo Images

Forests are traditionally characterized by stand-level descriptors, such as basal area, mean diameter, and stem density. In recent years, there has been a growing interest in enhancing the resolution of forest inventory to examine the spatial structure and patterns of trees across landscapes. The spatial arrangement of individual trees is closely linked to various non-monetary forest aspects, including water quality, wildlife habitat, and aesthetics. Additionally, associating individual tree positions with dendrometric variables like diameter, taper, and species can provide data for highly optimized, site-specific silvicultural prescriptions designed to achieve diverse management objectives. Aerial photogrammetry has proven effective for mapping individual trees; however, its utility is limited due to the inability to directly estimate many dendrometric variables. In contrast, terrestrial mapping methods can directly observe essential individual tree characteristics, such as diameter, but their mapping accuracy is governed by the accuracy of the global satellite navigation system (GNSS) receiver and the density of the canopy obstructions between the receiver and the satellite constellation. In this paper, we introduce an integrated approach that combines a camera-based motion and tree detection system with GNSS positioning, yielding a stem map with twice the accuracy of using a consumer-grade GNSS receiver alone. We demonstrate that large-scale stem maps can be generated in real time, achieving a root mean squared position error of 2.16 m. We offer an in-depth explanation of a visual egomotion estimation algorithm designed to enhance the local consistency of GNSS-based positioning. Additionally, we present a least squares minimization technique for concurrently optimizing the pose track and the positions of individual tree stem[s].


Introduction
The widespread adoption of sustainable forestry practices in much of the developed world has given rise to increasingly complex forest management objectives catering to a diverse array of interests. Consequently, silvicultural prescriptions often take into account numerous non-monetary factors, including forest resiliency and adaptability, wildlife conservation, aesthetic preservation, hydrological values, and other ecological functions that are dependent on stand structure [1]. In general, the incorporation of non-timber values into management objectives has served as the primary driver for transitioning silvicultural practices from homogeneous, even-aged systems to heterogeneous, uneven-aged systems. This paradigm shift directly affects forest operations; implementing complex silvicultural prescriptions becomes more costly in terms of layout, harvesting, and administration [2]. This challenge has spurred the demand for advanced precision forestry tools that offer accurate, real-time machine positioning, as well as forest measurement and mapping capabilities. Such tools have the potential to eliminate certain layout tasks, such as individual tree and boundary marking, thereby reducing operational costs and enhancing the economic feasibility of alternative silvicultural systems.
Real-time machine positioning and forest mapping are essential components of sustainable forest harvesting [3]. However, practical low-cost technologies for precise localization and large-scale mapping in forested environments during harvesting remain underdeveloped. Unlike the substantial advancements in automation introduced in agricultural systems [4], forest harvesting has not experienced groundbreaking technological progress in terms of automation and robotics. This lag can be attributed, in part, to the challenging environments in which forest machines operate, as well as potential cultural factors, resistance to change, or inadequate research and development efforts.
In this paper, we introduce visual-processing algorithms that facilitate precise realtime machine positioning and forest mapping. We employ visual egomotion estimation from a stereo camera to maintain the position of a forestry machine during instances of degraded or failed reception from a global navigation satellite system (GNSS). Furthermore, we demonstrate how precise position maintenance can be utilized to generate tree maps when combined with a tree-detection algorithm. Below, we itemize the main contributions of this work: • A novel camera-based system that integrates GNSS, visual odometry, and a treedetection system and egomotion estimation. • A low-cost system using an off-the-shelf stereo camera and a consumer-grade GNSS receiver. • An efficient pixel-selection method with predictable inter-frame runtime for direct image alignment. • A global orientation parameter in the optimization framework for preliminary alignment between the GNSS and visual odometry pose track. • A system that provides the tree position and dendrometric information to the user in real time.
In the remainder of this introduction, we will discuss the crucial role of GNSS in forest operations, along with the associated challenges and limitations. Additionally, we will introduce the concepts of visual odometry (VO) and simultaneous localization and mapping (SLAM).

Global Positioning
Global navigation satellite systems (GNSSs) have played a significant role in forest resource management. These systems have been employed for inventory plot localization [5], forest traverse surveys [6], mapping forest disturbances [7], machine tracking [8,9], automated time studies [10,11], and general operational monitoring [12]. Recently, GNSS technology has been applied to enhance occupational safety in logging operations by providing virtual geofences around workers on the job site [13][14][15]. Most modern forest machines come equipped with a GNSS receiver from the factory, enabling real-time positioning of the machine during operation.
While highly accurate GNSS positioning may not be essential for many forestry applications, it becomes crucial when constraining machines with a virtual boundary, particularly when boundaries coincide with ownership demarcations or delineate highvalue, sensitive, or hazardous areas. High-precision localization is also necessary for mapping individual trees. Challenges with GNSS accuracy have impeded the widespread adoption of GNSS technology in the virtual boundary and mapping domains.
GNSS accuracy depends on various factors, one of which is the number and geometry of visible satellites. This is characterized by an index called position dilution of precision (PDOP), which is a multiplicative term that scales the expected accuracy of the receiver. PDOP values less than 1 provide the highest possible accuracy, while PDOP values greater than 20 generally render coordinate readings futile, e.g., a PDOP value of 20 with a GNSS receiver capable of 3 m accuracy results in an actual accuracy of 60 m. The forest canopy can block signals from reaching the receiver, an effect known as the canopy effect [6]. Another important factor-primarily responsible for degraded GNSS accuracy under forest canopy-is signal diffraction and reflection, known as multipath errors [16]. Lastly, GNSS accuracy also depends on the class of receiver, such as survey-grade or consumer-grade receivers. Research on GNSS accuracy in forested environments has shown consumergrade receivers to have an accuracy range between 4 and 12 m [17][18][19][20][21]. In this work, we utilize a specific GNSS called the global positioning system (GPS), operated by the United States Air Force. The GPS satellite constellation comprises 33 satellites, 31 of which are currently operational.

Visual Odometry
Estimating the egocentric motion of a camera is a fundamental task in computer vision. Egomotion estimation aims to determine a 3D geometric transformation that describes the incremental translational and rotational change of a camera in motion relative to the observed environment. Motion estimation occurs in discrete time steps, where a new estimate at time t reveals the camera's motion relative to its pose at time t − 1. Each new estimate can be incrementally composed with the previous to produce a pose graph, i.e., a track of position and orientation over time. Visual odometry (VO), coined by Nistér et al. [22], is the term often used to describe time-integrated motion estimates and is analogous to other variants of odometry, e.g., wheeled odometry, where wheel encoders are used to estimate the traveled distance in ground vehicles. Odometric navigation is subject to errors that accumulate over time, eventually leading to positional drift, a well-known property of dead reckoning navigation systems.
Early research on the problem of recovering relative camera poses tackled the problem of structure-from-motion [23,24]. Structure-from-motion is the problem of recovering both 3D structure and camera poses from a set of unordered images. VO is a special case of structure-from-motion, where images are sequenced and pose recovery is the sole objective. To date, most algorithms for VO are based on a feature extraction and matching pipeline [25][26][27][28]. These algorithms are known in the vision community as feature-based methods. In general, feature-based methods involve three steps: (1) feature extraction and description using one of many available approaches (e.g., FAST detector [29,30], SIFT detector/descriptor [31] Harris detector [32], Shi and Tomasi detector [33], and SURF detector/descriptor [34]); (2) temporal feature matching, e.g., using mutual consistency or constrained matching [35]; and (3) pose recovery by minimizing the 3D-2D reprojection error using a perspective from n point (PnP) algorithm, e.g., EPnP [36], which is typically embedded in a random sample consensus (RANSAC) scheme [37]. See [38] for an overview of VO fundamentals in the context of feature-based methods.
Recent approaches to VO have migrated away from feature-based approaches due to complexity and the plethora of configurations in feature-based pipelines. Recent approaches use image intensities directly rather than extracting and matching features [39][40][41][42][43][44]. These methods are called direct methods and are founded on the work by Lucas and Kanade [45] introducing parametric image alignment. In contrast to feature-based methods, which recover relative camera motion by minimizing reprojection error, direct methods minimize the photometric error, the sum of squared differences in image intensities between two consecutive frames.
In this work, we employ a direct method to solve for the incremental 6-DoF motion parameters. We provide a detailed description of image warping and pixel selection techniques, as well as the optimization procedure. Our method is somewhat simplified in contrast to state-of-the-art methods, e.g., [40,43], which involve keyframe selection and windowed bundle adjustment optimization to mitigate trajectory drift. As we will show in the localization and mapping section, we align the odometry track with global coordinates from a GNSS, and therefore, maintaining global consistency within our VO framework is unnecessary. Instead, in this work, we focus on local consistency and computational efficiency.

Simultaneous Localization and Mapping
SLAM is a critical problem in robotics, which involves constructing a map of an unknown environment while concurrently determining the robot's location within it. If the robot's position and orientation are known at any given time, mapping its surroundings using sensor data becomes a straightforward task. Conversely, if the map is already known, the problem is reduced to localization, where the robot's observations are utilized to determine its position and orientation within the map. When neither the pose nor the map is known, both the map and path must be estimated simultaneously. SLAM can be broadly classified into online and full SLAM problems. Online SLAM incrementally estimates the current pose along with the map, while full SLAM estimates the entire path and map using data from all poses and observations [46]. There has been research in forest harvesting related to the localization problem [47,48]; however, a stem map is required in order to localize the machine, which we consider to be a major limitation of such an approach since stem maps, in general, are not readily available.
Visual odometry (VO) and vision-based SLAM are closely related, as VO provides an estimate of the robot's path. The key difference is that SLAM mandates maintaining a map-even if it is not ultimately needed-so that the robot can recognize previously explored areas. When the robot revisits an area on the map, it provides additional constraints to optimize the trajectory and map, ensuring global consistency. This process is known as loop closure. Generally, VO is focused on local consistency, while SLAM aims for global consistency. It is worth noting that if VO is free from drift, the SLAM problem is reduced to merely mapping observations. However, error-free VO has never been achieved in practice, making loop closures essential for maintaining global consistency. To recognize loop closures, robots need to operate in environments containing distinct features that can be reidentified upon returning to a previously mapped location.
Detecting loop closures in forested environments can be challenging, as the spatial configurations of features, such as trees, may not be unique enough to provide robust information for redetection. Integrating GNSS positioning with SLAM can assist in identifying loop closures by maintaining a globally consistent position path. In this paper, we demonstrate how intermittent and degraded GPS reception can be fused with VO to provide a globally consistent position estimate. Additionally, we present a graph-based SLAM algorithm for refining the estimated path and map simultaneously. For an introductory tutorial on graph-based SLAM, readers can refer to [49].

Notation
In our method description below, we denote vectors as bold lowercase letters, e.g., v, matrices as bold capital letters, e.g., M, and scalars as lowercase italic letters, e.g., s. We use the notation · as a shorthand for · 2 , i.e., the Euclidean norm. We represent images as functions, I : Ω → R 3 for 3-channel color images, and I : Ω → R for gray-scale images where Ω ⊂ R 2 is the image domain. Sets are denoted by capital script letters, e.g., A, and the number of elements in a set is given by |A|. See Appendix A for an overview of Lie groups and rigid transformations used in this paper.

Direct Visual Odometry
Given a reference image I t−1 : Ω → R acquired at time t − 1 and an input image I t : Ω → R acquired at time t, we seek to estimate the 3D egomotion of the camera between the frames. Estimating the camera's motion is performed by solving for the parameters of a warp, ξ t−1:t , which relates the pixels in I t−1 to the pixels in I t . For brevity, we drop the time subscript on the warping parameters and denote it by ξ, and denote the reference image by I and the input image by I . The intensity of a pixel in the reference image is given by I(p), where p = (u, v) T ∈ Ω. Similarly, the intensity of a pixel in the input image is given by I (p ), where the position vector p is the result after warping p according to the parameters ξ, Leaving the warping function undefined for the moment, the images are registered, or aligned, by minimizing the photometric error according to the following objective: where Φ ⊂ Ω is a set of selected pixels from the image domain. Solving this expression is non-linear regardless of the warping function, as pixel intensities are unrelated to their coordinates. A common method for minimizing the objective is the Gauss-Newton (GN) algorithm. In general, the expression is linearized w.r.t. some small change in the parameters, ∆ξ, and then the parameters are updated by until ξ and ∆ξ converge. This formulation, as well as a solution procedure, was first given by Lucas and Kanade [45] in their seminal work, on which parametric image alignment, and subsequently direct VO, is founded. For a detailed presentation of optimization algorithms and alternative formulations, see Baker and Matthews [50]. We will discuss our optimization technique in a latter section after introducing the warping function.

Image Warping
The egomotion of a camera with no assumed holonomic constraints has 6 degrees of freedom; rotation about the x, y and z axes and translation along the x, y and z axes. The motion according to these degrees of freedom is represented by a transformation matrix in the special Euclidean Lie group: where R ∈ SO(3) is a 3D rotation matrix and t ∈ R 3 is a 3D displacement or translation vector. This is a rigid-body transformation that encodes the change in rotation and translation of a non-deformable object in motion between two discrete points in time. Although the matrix T has 6 degrees of freedom, there are 12 values that make up the non-homogeneous portion of the matrix: 9 values in the rotation matrix and 3 values for translation. This imposes unnecessary computational demands in an optimization setting. For this reason, we use the Lie algebra se(3) to parameterize the transformation. We denote the transformation matrix as a function of the parameters ξ ∈ se(3), The exponential map of se(3) can be computed in closed form, as can the logarithm map that takes SE(3) back to the algebra se(3), Using the Lie algebra parameterization of the transformation, we define a warping function W : (R 3 × R 6 ) → R 2 that takes a homogeneous pixel coordinate as an input, back projects it to R 3 , transforms the back projection according to the warping parameters, and then projects it back to R 2 , where the notation p = (u, v, 1) T ∈ P 2 denotes a pixel position vector in homogeneous coordinates. The function π −1 (·) performs the back-projection and is defined as where z is a depth measurement for the pixel and K −1 is the inverse of the projection matrix. Note that we homogenized, i.e., appended a fictitious coordinate, to the back-projected vector so that it is compatible with the transformation matrix. The transformation matrix is not homogeneous, i.e., the last row of the matrix shown in Equation (5) is omitted, so the resulting vector following the transformation has three dimensions. The function that performs the forward projection is defined as where x = (x, y, z) T ∈ R 3 , K is the camera projection matrix, and n : R 3 → R 2 normalizes the homogeneous coordinates, i.e., n (x, y, w) T = (x/w, y/w) T . Finally, the projection matrix and its inverse are defined as where f is the focal length, assuming unit aspect ratio pixels, and (c u , c v ) is the principal point. The only variable that still needs attention is the depth measurement z used in the back-projection function. We estimate the depth of each pixel by first computing the stereo correspondence via semi-global matching [51]. We use a real-time GPU implementation presented in [52]. Due to the smoothness constraints imposed by the semi-global matching scheme, occlusions are filled by some non-zero value. We handle occlusions by right-left consistency check: The disparity map D ,r is computed first, then the second disparity map D r, by reflecting the left and right images along the v-axis and using the right image in place of the left, and the left in place of the right. The final disparity map D is equal to D ,r when the absolute difference between D ,r and D r, , evaluated at a position vector p, does not exceed some threshold δ. Otherwise, D(p) takes zero, In this work, we use δ = 1. We do not perform sub-pixel refinement to interpolate disparity values. We simply use positive integers to represent the disparity map, D : Ω → N. The depth estimate, z, for the pixel p is given via triangulation, where f is the focal length and b is the baseline distance of the stereo rig in centimeters. This assumes that the stereo camera is calibrated and row-aligned. We follow methods presented in [53] for camera calibration.

Optimization
We will now extend our discussion regarding the minimization of photometric error. As noted earlier, minimizing the expression defined in Equation (2) is a non-linear optimization task. This formulation requires a linearization step during each GN iteration. Namely, the Jacobian of the warp and the Hessian need to be computed during each iteration, which can lead to significant computational demands depending on the size of the Jacobian. Following from Baker and Matthews [50], we redefine the objective under the inverse compositional (IC) formulation by interchanging the roles of the reference and input image and solving for incremental warp parameters instead of additive updates as in the Lucas-Kanade formulation. Given some initial guess of the parameters, ξ, the objective is to minimize the following expression w.r.t the incremental warp parameters, The parameters are updated by inverting the incremental warp parameters and composing with the current estimate, ξ ← ξ • ∆ξ −1 , where the notation • denotes composition. The incremental warp parameters need to be inverted at each GN iteration as the linearization, which we will discuss next, is performed on the reference image. The update rule can be explicitly written as The Jacobian encodes the partial derivatives of the reference image at each pixel p {i} m 1 with respect to the six warping parameters, The linearization of Equation (14) can be achieved by performing a first-order Taylor expansion about the current estimate of the parameters. Denoting the ith row in the Jacobian as ∂I (p) /∂ξ, corresponding to some pixel p ∈ Φ and applying the chain rule, we obtain The partial derivative of the reference image w.r.t. some pixel position p is simply the gradient vector of the image along the u and v axes, We purposefully denoted the gradient vector as a row vector. Taking p to be equal to a reduced form of the forward projection function π(·) such that p = ( f x /z + c u , f y /z + c v ) T , and x = (x, y, z) T to be a back projection of the pixel, the partial derivative can be written as Finally, the partial derivative of the back-projected pixel x w.r.t. the warping parameters evaluated at the identity warp ξ = 0 takes the form, This result follows from the skew-symmetric matrix operator used when computing the SE(3) exponential map of the warp parameters. Multiplying out the last two partials gives For convenience, we also show a row in the Jacobian corresponding to some pixel p written out explicitly, The row vector stated above is computed for each pixel p ∈ Φ and stacked into the m × 6 Jacobian matrix, again where m is the number of pixels in Φ. The columns of the Jacobian can be visualized as the steepest descent images shown in Figure 1.
Since the linearization is performed on the coordinate frame of the reference image, the Jacobian J and the GN approximation of the Hessian H = J T J, as well as its inverse H −1 only need to be computed once. These are the computational savings of the IC formulation [50] over the original Lucas-Kanade algorithm [45].
Steepest descent images; each image corresponds to a column in the Jacobian matrix.

Pixel Selection
Since minimizing photometric error is based on image gradients, only pixels with a non-zero gradient contribute to optimization. Therefore, including pixels with zero gradient in the Jacobian introduces unnecessary computations. In feature-rich environments, such as forests, selecting pixels with a non-zero gradient might not significantly reduce the number of pixels used in optimization. As shown in Figure 2a, selecting all non-zero gradient pixels results in using approximately 85% of the image. We can further reduce the number of pixels, thus decreasing computation, by thresholding pixels based on gradient magnitude. This approach, however, results in a Jacobian matrix that is subject to change in dimension between frames since the distribution of gradient magnitude is not guaranteed to be consistent; we must allocate enough memory to store the expected maximum number of non-zero gradient pixels across all frames, which cannot be determined in advance, or dynamically allocate memory prior to optimizing each frame. This issue can be resolved by selecting a percentage of the total number of pixels in the image either by performing binary search for a gradient magnitude threshold that results in the desired number of pixels or sorting the gradients in descending order and selecting the first N p 100 pixels, where p is the desired percentage and N is the number of pixels in the image. In Figure 2b-d, we show the selected pixels resulting from desired percentages ranging from 25% to 75%.
Prior to selecting pixels based on the gradient magnitude, we discard all pixels with a disparity value of zero since these pixels are projected to infinity during image warping. Zero disparity pixels are apparent in Figure 2 as areas in the images that obviously have non-zero gradients but are too distant to have a non-zero disparity value.

Robustness
GN optimization assumes Gaussian distributed errors. It is often the case in real-world data, however, that non-Gaussian errors arise due to inaccurate pixel correspondences during disparity computation, motion-induced occlusions, illumination changes and autoexposure adjustments. A cost function for minimizing photometric error that is insensitive to non-Gaussian distributed noise is said to be robust. In the GN framework, this can be achieved with iterative re-weighted least squares (IRLS) using a robust cost function. The decision of the cost function is somewhat arbitrary, typically selected through empirical evaluation or by means of some prior knowledge regarding the structure of outliers in the data. In this work, we choose to use Tukey's biweight cost function [54,55] since it suppresses large residuals in contrast to Huber's cost function [56] which simply down weights their influence. Tukey's biweight cost function takes the form where the constant c is usually chosen to be 4.6851 to achieve 95% asymptotic efficiency with the normal distribution and r i is the ith residual from the error vector given by Equation (17). The cost function assumes that the residuals have unit variance. A common choice to estimate the scale parameter to standardize the residuals is to compute the median absolute deviation (MAD) and multiply by the expected MAD for a standard normal distribution,ŝ where k = 1.4826. To incorporate robustness in the GN minimization routine, we construct a weight vector w, where each weight w i = ρ( r i/ŝ) and set the diagonal entries of a weight matrix equal to the weight vector, i.e., W = diag(w). It follows that the solution to the incremental update of the linearized system under the IRLS framework takes the form which is used in place of the normal equations shown in Equation (16).

Localization and Mapping
In this section, we present a graph-based algorithm for maintaining a globally and locally consistent pose track using the estimated frame-to-frame egomotion parameters from the previous section and a consumer-grade GPS receiver. We also show how the optimized pose graph can be refined to generate a map of detected tree stems. Our algorithm consists of two phases: global alignment, in which we minimize errors between the odometry-based pose graph and global positions provided by the GPS receiver, and local refinement, where we relate poses by multiple observations of tree stems and simultaneously optimize the configuration of the pose graph and tree stem positions.
In order to increase the efficiency of computations, we use the se(2) Lie algebra of rigid transformations to represent the graph, and ultimately the map, as opposed to the se(3) parameters we optimized for in the odometry section. We do this by simply extracting the translation parameters corresponding to the x and z axes in se(3) to represent the translation along the x and y saxes on a planar pose graph, and the rotation component about the y axis from the se(3) parameters to represent the heading. Using se(2) parameters produces a meaningful map that can easily be presented on a 2D display monitor.

Global Alignment
Given a set of frame-to-frame odometry observations, {∆ξ 1 , ∆ξ 2 , . . . , ∆ξ m }, where each ∆ξ i = (x, y, θ) T ∈ se (2), we seek to align a pose graph constructed from the odometry observations with a set of global position readings from a GPS receiver. We assume odometry observations to be locally consistent but subject to drift and assume GPS coordinates to be locally bounded by a Gaussian distribution specified by an arbitrarily large covariance matrix that captures the expected errors due to multi-pass signals and geometric dillution of precision. We denote a GPS coordinate as g = (x, y) T ∈ R 2 , where x and y represent the global position estimate in meters within the Universal Transverse Mercator (UTM) coordinate system. We also convert the translation component of the odometry observations to meters for compatibility with the UTM coordinate frame.

Graph Construction
A pose graph is used to represent the camera poses and the motion constraints between the poses. A node, or vertex, in the graph denotes a pose, i.e., a position and orientation, and an edge denotes the relative motion constraint given by the odometry observation. We use v i to denote the ith node in the graph and ∆ξ i to denote the relative motion between v i and v i+1 . We construct the initial pose graph by sequentially transforming poses with the odometry observations. First, we fix the first node in the graph to the zero vector, then each subsequent pose is computed by right multiplying the previous pose with a homogeneous transformation matrix T(∆ξ) ∈ SE(2), representing the exponential map of the odometry observation, To simplify the notation, we use ξ ij to denote the motion constraint between the poses v i We also take n to be equal to the number of poses, which is the number of odometry constraints plus one, i.e., n def = m + 1. Associated with each motion constraint is a 3 × 3 covariance matrix Σ ij that represents the uncertainty of the motion. As we describe in the optimization section that follows, we use the information matrix ij to represent the strength of the edge, or constraint, in the graph. Let C be an ordered set of 2-tuples representing the correspondences between poses and GPS readings. Thus, the tuple (i, k) ∈ C specifies that pose v i corresponds to the GPS reading g k . We insert GPS coordinates as nodes in the graph and add an edge to the corresponding camera pose. We also translate all GPS coordinates according to the first correspondence in C. We do this by storing the translation, g 0 ← g k∈C 1 , where C 1 is the first GPS-odometry correspondence, and subtracting g 0 from all coordinates in the track, We store the translation so that we can invert the pose graph back to the original UTM coordinates after optimization. Associated with each GPS coordinate is a 2 × 2 covariance matrix representing the expected accuracy of the receiver. We invert to covariance matrix, as we did with the odometry covariance, to obtain an information matrix Q k . The values in this matrix depend on the expected accuracy of the GPS receiver and the environment in which the receiver is operating. For example, a consumer-grade GPS device operating under a dense forest canopy will have a relatively large uncertainty and thus small values in the information matrix.
Although we anchor the GPS track to a camera pose in the graph, their global orientation will likely differ. Thus, we introduce a global orientation parameter φ as another node in the graph that imposes a constraint on each node representing a camera pose that has a corresponding GPS reading. See Figure 3 for an illustration of the graph.

Optimization
Given the graph structure outline above, we seek to find the optimal configuration of the state vector, that minimizes the sum of squared errors. The state is simply a vector consisting of the global orientation parameter followed by the camera poses, where each pose is parameterized as v i = (x, y, θ) T . Thus, the size of this vector is 1 + dn, where d is the number of dimensions used to describe a camera pose, in this case 3, and n is the number of camera poses. The errors associated with the state configuration are given by two functions: one corresponding to the relative motion constraints given by the odometry observations and one corresponding to the constraints imposed by the observations from the GPS. The odometry error function, which is equivalent to the error function used in [49], is given by where the notation R(θ) represents a 2D rotation matrix of the form and θ i , θ j and θ ij are the rotation angles corresponding to v i , v j and ξ ij , respectively. The notations t i , t j and t ij represent the translation vectors from v i , v j and ξ ij , respectively. The error function defined in Equation (31) gives the translational and rotational errors by first transforming pose v j into the coordinate frame of v i , then computing the differences according to the odometry observation. Thus, this function returns zero when the state vector is configured according to Equations (28a) and (28b). The second error function calculates the position difference between a pose and its corresponding GPS coordinate according to the global orientation parameter, This function rotates the translation vector of pose v i about the origin of the coordinate frame according to a rotation matrix constructed from the global orientation parameter φ and returns the offset to the corresponding GPS coordinate. Given the two error functions, we can write the sum of squared errors of the state configuration s as where the error vectors r ij and r ik are weighted by the information matrices to incorporate the degree of belief given to the observations. This leads to the following objective function: This is a non-linear least-squares optimization problem that can be solved with the GN algorithm. Specifically, the error function is linearized about a current estimate of the state configuration and we iteratively solve the linear system and incrementally update the state vector. Since our initial estimate of the state vector might be far from a minimum solution, we use a dampened version of the GN algorithm called the Levenberg-Marquardt (LM) algorithm [57,58]. The linear system that is solved during each LM iteration takes the form where ∆s is the solution to the linear system that provides incremental improvements to the state vector in the non-linear solution space bys ←s + ∆s, wheres is the current estimate of the state configuration. The variable λ is a non-negative damping factor that, when large, forces the update to behave as the steepest descent and, when small, brings the algorithm closer to GN. We initialize λ = trace(H) and divide by two if the objective value decreases and multiply by two if the objective value increases. To construct the Hessian H and the gradient vector b, we first take the derivatives of the error functions with respect to the state parameters evaluated at the current estimated of the state. We initialize the poses in the state vector according to Equations (28a) and (28b), and set the global orientation parameter to φ = 0. The partial derivatives of the odometry error function can be written as where ∂R(θ i ) T /∂θ i in A ij is given by The derivatives for the second error function r ik (v i , φ) are defined as The partial derivate of the rotation matrix w.r.t. the rotation angle in C ik takes the same form as presented in Equation (38). Note that we have not taken any derivatives w.r.t. the GPS coordinates, as we do not wish to reconfigure them, and thus they do not appear in the state vector. For clarity, we specify the dimensions of these matrices: both A and B are 3 × 3 matrices, C is a 2 × 1 matrix, and D is a 2 × 3 matrix. These derivatives lead to sparse Jacobian matrices for each of the error functions, We use ellipses to indicate that unspecified values in the matrix are zeros. Since the odometry part of the pose graph only has constraints between consecutive nodes, the Jacobian J ij will always have a contiguous 3 × 6 block of non-zero values corresponding to the odometry constraints between nodes i and j. Furthermore, the Jacobian J ik will always have non-zero values in the first 2 × 1 block corresponding to the global orientation parameter and a 2 × 3 non-zero block at the ith node representing the constraint between the GPS-odometry correspondences. Now that the Jacobians are specified, we obtain the Gauss-Netwon approximation to the Hessian matrices by and the gradient vectors by From an implementation standpoint, it is easier to construct the Hessian directly using our definitions for the non-zeros blocks in the Jacobians. The Hessian matrix for the ith pose in graph, assuming there is a corresponding GPS coordinate with the node, takes the form Since the Jacobians are sparse, the resulting Hessian matrix is also sparse. Therefore, in practice, it is advantageous to use a memory-efficient sparse storage scheme for these matrices, e.g., compressed sparse column or row matrices.
The gradient vector for the ith pose can be constructed directly with The final Hessian matrix and gradient vector for the linear system are obtained by summing over all the constraint-wise Hessians and descent vectors, This linearization is performed during each iteration of the LM algorithm. We take advantage of the sparse structure of the Hessian and solve the system using sparse Cholesky factorization. We terminate the algorithm when the Euclidean norm of the linear increment to the state vector ∆s is less than some small threshold, e.g., 0.001, and take the optimal configuration as s =s + ∆s. Finally, we rotate each pose in the optimal state vector according the global orientation parameter and translate back to the UTM coordinate system and update the rotation component of each pose, Hereinafter, we omit the global orientation parameter and denote the globally aligned state vector asŝ In order to resolve the global orientation parameter, it is required that we have a minimum of two GPS observations. We also add robustness to non-Gaussian distributed GPS errors using the same approach as in Section 2.2.2. We compute weights for the residuals corresponding to the global position coordinates using Tukey's biweight cost function, and the weights are used to scale the information matrix Q k . We conclude this section by providing a note on the values in the information matrices. In general, the information matrix for the odometry observations consists of large values relative to the values in the information matrix for the global position observations. This is consistent with the fact that global position measurements are typically degraded under the canopy of a forest. Furthermore, VO is expected to perform well in feature-rich environments, such as a forest. We also note that we provide a substantially larger value to the position in the odometry information matrix corresponding to the heading. This makes the translation component of the nodes in the pose graph more elastic in order to conform to the GPS track while keeping the heading stiff to help maintain an accurate reconstruction of tree stem observations, which we address in the next section.

Local Refinement
Given the globally aligned state vectorŝ, we will now optimize for a refined state vector by taking into account observations of tree stems. There are three main steps in local refinement: (1) We transform each tree stem observation to the world coordinate frame according to the globally aligned state vector we optimized in the previous section.
(2) We associate the observations corresponding to an individual tree stem position in the global map. (3) We optimize for a new configuration of the state that minimizes both the error functions from the previous section and an additional error function representing the discrepancy between observations of tree stems and their associated global position.

Graph Augmentation
Tree stems are detected in each frame using the convolutional neural network (CNN) object detector outlined in [59]. The input image to the network is resized to a resolution of 128 columns and 352 rows for real-time performance. We also detect the ground plane and breast height using the RANSAC-based algorithm presented in [60]. For each bounding box predicted by the CNN object detector, we extract the image coordinate, where the center-line of the bounding box along the u-axis of the image intersects the ground plane positioned at breast height. A disparity value is assigned to this image coordinate using Equation (5) in [59]. Using the inverse projection matrix and the disparity assignment, we project the image coordinate to R 3 . We represent a tree stem observation in the camera coordinate frame of the ith camera pose as z iq = (x, y) T , where x and y correspond to the back-projected image coordinate along the x and z axes of the camera coordinate frame. The index q ∈ Z i where Z i is a set of indices denoting the observations of stems from the ith camera. Thus, |Z i | is the number of tree stem observations in camera i and |Z | is the number of camera poses in the graph. We use the notation (i, q) ∈ Z to index the q th observation in camera i.
Recalling that a camera pose in the globally aligned state vector is represented aŝ v i = x,ŷ,θ T , we can transform the observations from the camera coordinate frame to the world frame withẑ Given all tree stem observations in the global coordinate frame, we perform data association by clustering spatially similar observations. We use the density-based spatial clustering of applications with the noise (DBSCAN) algorithm [61] to cluster the observations. DBSCAN takes two parameters and a distance function. For the distance function, we simply use the Euclidean distance. The two parameters correspond to the search radius and the minimum number of points required for a cluster. We use a search radius of 1 m and 10 as the minimum number of points in a cluster. The algorithm yields a label for each observation that specifies to which cluster observationẑ iq belongs. A label equal to zero denotes an outlier, i.e., an observation that does not belong to any cluster. We denote the set of unique labels as L and use the correspondence set M to specify that observationẑ iq is assigned to label ∈ L. The notation ((i, q), ) ∈ M is used to denote that observationẑ iq is assigned to cluster . For each cluster of tree stem observations, we compute the center of the clusters by where m = (x, y) T is the center of the cluster in the global coordinate frame, and the notation 1 takes 1 when the observationẑ iq is assigned to cluster and zero otherwise. Given the cluster centers we extend our state vector aŝ Note that we omitted the global orientation parameter. We can reconcile this in the Hessian matrix and gradient vector by removing the first row and column in the Hessian and the first row in the gradient vector, and redefining D ik as (I 2 , 0). See Figure 4 for a graphical illustration of the augmented graph.

Refinement
In order to find an optimal configuration of the new state vector, we must define an error function corresponding to the expected tree stem location that we initialized by computing the cluster centers. The error is simply the difference between the expected position of the stem m and the observationẑ iq . Since the observations in the global coordinate frame are subject to change during optimization, we will define the error functions on the observations from the coordinate frame of the corresponding camera, i.e., z iq , as we did with the error functions defined in the previous section. Thus, the error function is defined as which leads to an additional term in the objective function, The matrix Q iq is the information matrix for the q th observation in camera i. Since the information matrix in the inverse of the covariance matrix, we can define Q iq in terms of the expected measurement noise of the sensor. As we back projected the detected stem to the camera coordinate frame using parameters from the stereo camera, the information matrix will, in general, have relatively small values for stems detected far from the camera rig. We proceed in the same manner as we did in the previous section by linearizing the error function and constructing the Jacobian to compute the Hessian matrix and gradient vector. The partial derivatives of the new error function can be written as E i is a 2 × 3 matrix and F i is a 2 × 2 matrix. These definitions can be used directly to compute the blocks with the sparse Hessian matrix, . .
where the Hessian matrix is now a (1 + 3n + 2|L|) × (1 + 3n + 2|L|) matrix. Similarly, the entries to the gradient vector are The complete linear system is constructed by summing the Hessian matrices and gradient vectors for all the tree stem observations and adding them to the other Hessians and gradient vectors defined in Equation (45). Again, the first row and column of the Hessian, as well as the first row in the gradient vector, are removed to account for the omission of the global orientation parameter. We solve the linear system, (H + λI)∆s = −b, using sparse Cholesky factorization and update the state vector byŝ ← ∆s. As we did for global alignment, we terminate the algorithm when ∆s is less than some predetermined convergence threshold.

Analysis and Discussion
To test VO and the localization and mapping algorithms outlined above, we acquired a video sequence of a 1115 m path through a sparse ponderosa pine (Pinus ponderosa Douglas ex Lawson) forest in Western Montana. The video was captured by walking a hand-held 12 cm baseline ZED stereo camera [62] through the forest. The camera was operated at 10 frames per second and VGA resolution (480 × 640). Mounted on top of the camera field monitor was an antenna connected to a GlobalTop FGPMMOPA6H GPS module [63] that was queried for a GPS coordinate reading and a PDOP value after each video frame capture. The GPS coordinates and video frames were stored on an embedded backpack computer. Figure 5a shows the GPS coordinate readings projected on the UTM coordinate system after translating the position track by subtracting the first GPS coordinate from all coordinates. We estimated camera egomotion using the direct VO algorithm presented in Section 2. We used a camera resolution of 240 × 320 to estimate egomotion and a 25% gradient magnitude threshold during pixel selection. Frame-to-frame egomotion parameters were composed to construct an odometry track using Equation (28). Figure 5b shows the visual odometry position track after converting the translation components of the egomotion parameters to meters. Figure 6 shows the GPS track (black line) and the optimized VO track after global alignment and local refinement (red line). The blue dashed line in Figure 6 shows the VO track after only applying the optimized global orientation parameter. As the figure suggests, the VO position track, although locally consistent, is subject to drift after approximately 300 m.

Localization
In this section, we demonstrate the performance of GPS and VO integrated localization under various scenarios of degraded and intermittent GPS reception. We refer to the optimized position track shown in Figure 6 (red line) as the ground truth position track. We acknowledge that this track has not actually been ground truthed with survey grade equipment; however, this track is deemed optimal, given the available data.

Degraded GPS Reception
We simulated degraded GPS reception by adding zero mean Gaussian noise with a standard deviation of 5 m to each observed GPS coordinate. The incremental update to the state vector and the current estimate of the state converged after 15 LM iterations. Figure 7 shows six snapshots during optimization. The global orientation parameter converged after the 6th iteration. Figure 8 shows the converged path after global alignment and local refinement with a root mean squared error (RMSE) of less than 0.1 m compared to the ground truth position track. This suggests that, as long as GPS errors are Gaussian distributed, we can expect accurate position tracking.
We also tested robustness to non-Gaussian distributed noise in GPS readings. We randomly selected 5% of the coordinates from the GPS track and added uniformly distributed noise with a range of 0 to 200 m. Figure 9 shows the converged path after global alignment and local refinement with non-Gaussian distributed noise in GPS coordinates plotted over the ground truth path (RMSE < 0.1 m). According to this result, the algorithm is insensitive to GPS coordinate outliers.

Intermittent GPS Reception
When GPS is used under an extremely dense forest canopy, reception might only be intermittently reliable when the receiver is stationary for long periods of time or when the receiver crosses openings in the canopy. We simulated this scenario by extracting 12 coordinates from the GPS track; we extracted the starting position, ending position and randomly selected 10 coordinates along 100 m intervals from the track. Figure 10 shows six snapshots during optimization with 12 intermittent GPS coordinate readings. The algorithm converged after 21 iterations. Following local refinement, the optimized path had a RMSE of 2.7 m compared to the ground truth position track in Figure 11. This is an important application of the proposed algorithm in situations where GPS reception is only available intermittently.

Mapping
In order to generate a stem map, we performed global alignment using all the coordinates from the GPS track and the VO path shown in Figure 5a,b. Figure 12 shows the result after clustering the observed tree stem position. The black dots in the figure denote the cluster centers, and the ellipses around each center show the covariance matrix of the clusters at four standard deviations. There was a total of 6205 tree stem observations and 140 clusters, i.e., individual tree stems. Figure 13 shows the optimized cluster centers, i.e., tree stem positions, and the covariance structure of the observations at four standard deviations. Local refinement converged after nine iterations. Note that the ellipses representing the covariances of the clusters are smaller after local refinement. This is a result of simultaneously optimizing the position track and the cluster centers.
To test the accuracy of the stem map generated from the local refinement step, we collected a ground truth stem map of the 12 acre forest, from which we acquired the video and GPS track. We obtained global coordinates for each individual tree in the stand using a TruePulse 360B laser range finder and a 13-bit BEI industrial rotary encoder mounted on a tripod. We installed a reflector target near the center of the stand, at which a GPS coordinate was acquired using a mapping grade receiver. We mapped subsections of the stand by first aligning the laser and encoder to north, using the appropriate declination for the area, then recorded the distance and angle to the target to globally localize the plot center. We recorded the distance and angle to each individual tree within view from the plot center. We repeated this process moving clockwise around the reflector target mapping small subsections of the stand until the entire stand was mapped. Figure 14 shows the ground truth stem map (green triangles), the predicted tree stem position from the camera (red circles), and the optimized position track (yellow line) superimposed on an aerial photograph of the stand. We manually associated each observed tree stem position from the camera with its corresponding ground truth tree stem. Among the 140 predicted tree stem positions, we classified 2 observations as false positives, i.e., predicted tree stems that correspond to a tree in the aerial photograph that were not recorded during the collection of the ground truth stem map, and 7 tree stems as duplicate observations, i.e., observed tree stems that belong to the same ground truth stem. For all predicted and ground truth stem correspondences (n = 140), we calculated a RMSE of 2.16 m. This is a 46% improvement over the best-case expected accuracy of a consumer-grade GPS device under forest canopy (4 m).

Conclusions
In this paper, we presented a real-time algorithm for VO estimation and a novel method for integrating VO and a vision-based tree stem detection system with GNSS positioning to generate accurate stem maps. Our approach is based on existing and widely applied optimization techniques, i.e., GN and LM non-linear least squares, which provide efficient solution procedures to the optimization problems formulated in this paper.
Although GNSS positioning is used to maintain a globally consistent position track, the heading of the camera is subject to drift; global positioning cannot be used to infer the heading of the camera. This issue did not appear to be significant in our dataset; however, we expect that camera heading will eventually drift since it is unconstrained and only estimated from visual egomotion. This issue can be resolved by incorporating a global direction sensor, i.e., magnetometer, to maintain a globally consistent heading. The inclusion of directional data in the global alignment optimization step requires minimal modifications to the presented algorithm. Since electronic direction sensors are relatively inexpensive, we recommend using such a sensor for large-scale mapping applications.
The successful integration of real-time localization and mapping into forestry practices presents new opportunities for enhancing decision-making processes and implementing complex silvicultural prescriptions. This research provides a practical, low-cost solution that addresses the need for accurate and efficient positioning and mapping in forested environments.

Patents
US Patent No. US011481972B2: Method of performing dendrometry and forest mapping. R ∈ SO(2), and its Lie algebra by θ ∈ so (2). The closed-form exponential map that converts the Lie algebra, so (2), to the Lie group, SO(2), is given by where θ × denotes a skew symmetric matrix generated from θ. The inverse of the exponential map is given by the logarithm map that brings the transformation matrix back to the algebra, θ = ln(R) = tan −1 r 21 r 11 ∈ so(2) , where r 21 and r 11 are entries in R. In practice the tan −1 function is replaced by the twoargument inverse tangent function arctan2(y, x) for appropriate quadrant checking.
In the case of 3-dimensional rotations, we represent the transformation as a 3 × 3 matrix R ∈ SO(3), and its algebra as ω ∈ so(3), where ω = (α, β, γ) T is a vector encoding the rotation angles about the x, y and z-axes. The closed-form exponential is given by where ϑ = √ ω T ω and ω × is defined as Equation (A3) is Rodrigues' formula for rotating a vector in space, given an axis and a rotation angle. The logarithm map that brings the transformation matrix back to the algebra is given by The algebra is taken from the off-diagonal components of ω × , i.e., skew symmetric matrix to vector. The notation Tr(·) is the trace function of a square matrix, which returns the sum of the elements along the diagonal. In both the 2D and 3D cases, the column vectors of the matrices representing the transformations are orthogonal. Thus, the inverse is equivalent to the matrix transpose, Rigid transformations in 2 and 3 dimensions are represented by the special Euclidean Lie groups SE(2) and SE (3). By definition, these groups include rotations from the special orthogonal group described in the previous section, A rigid transformation in 2D takes the form This matrix is homogeneous for compatibility with inversion and composition operations. We represent the parameter space, i.e., the Lie algebra, as ξ = (θ, x, y) T = (θ, u) T ∈ se (2). The exponential map is given in closed form as where exp(θ × ) is defined in Equation (A1) and The logarithm map that brings the transformation matrix back to the algebra is given by where θ = ln(R) is defined in Equation (A2) and V −1 is defined as Rigid transformations in 3-dimensional space are written similarly to our definition for SE(2) but with 3D rotation matrices and 3-vector translations, We use the variable ξ to represent the algebra, ξ = (ω, u) T ∈ se (3), where ω = (α, β, γ) T and u = (x, y, z) T . The exponential map again can be computed in closed form, where R, defined in Equation (A3) is restated below for convenience, and V are given by Finally, the logarithm map is written as ξ = ln(T) = ln(R) The logarithm map of the rotation part is given by Equation (A5), and the inverse of V is defined as Composition of two transformation is equivalent for SE(2) and SE (3), and inverting a transformation is given by As a result, composition and inversion in a single operation is written as