Article

Vision-Aided Localization and Mapping in Forested Environments Using Stereo Images

Department of Forest Engineering, Resources and Management, College of Forestry, Oregon State University, Corvallis, OR 97331, USA
* Author to whom correspondence should be addressed.
Current address: Silvx Labs, Missoula, MT 59802, USA.
Sensors 2023, 23(16), 7043; https://doi.org/10.3390/s23167043
Submission received: 14 July 2023 / Revised: 4 August 2023 / Accepted: 7 August 2023 / Published: 9 August 2023
(This article belongs to the Section Sensing and Imaging)

Abstract

Forests are traditionally characterized by stand-level descriptors, such as basal area, mean diameter, and stem density. In recent years, there has been a growing interest in enhancing the resolution of forest inventory to examine the spatial structure and patterns of trees across landscapes. The spatial arrangement of individual trees is closely linked to various non-monetary forest aspects, including water quality, wildlife habitat, and aesthetics. Additionally, associating individual tree positions with dendrometric variables like diameter, taper, and species can provide data for highly optimized, site-specific silvicultural prescriptions designed to achieve diverse management objectives. Aerial photogrammetry has proven effective for mapping individual trees; however, its utility is limited due to the inability to directly estimate many dendrometric variables. In contrast, terrestrial mapping methods can directly observe essential individual tree characteristics, such as diameter, but their mapping accuracy is governed by the accuracy of the global navigation satellite system (GNSS) receiver and the density of the canopy obstructions between the receiver and the satellite constellation. In this paper, we introduce an integrated approach that combines a camera-based motion and tree detection system with GNSS positioning, yielding a stem map with twice the accuracy of using a consumer-grade GNSS receiver alone. We demonstrate that large-scale stem maps can be generated in real time, achieving a root mean squared position error of 2.16 m. We offer an in-depth explanation of a visual egomotion estimation algorithm designed to enhance the local consistency of GNSS-based positioning. Additionally, we present a least squares minimization technique for concurrently optimizing the pose track and the positions of individual tree stems.

1. Introduction

The widespread adoption of sustainable forestry practices in much of the developed world has given rise to increasingly complex forest management objectives catering to a diverse array of interests. Consequently, silvicultural prescriptions often take into account numerous non-monetary factors, including forest resiliency and adaptability, wildlife conservation, aesthetic preservation, hydrological values, and other ecological functions that are dependent on stand structure [1]. In general, the incorporation of non-timber values into management objectives has served as the primary driver for transitioning silvicultural practices from homogeneous, even-aged systems to heterogeneous, uneven-aged systems. This paradigm shift directly affects forest operations; implementing complex silvicultural prescriptions becomes more costly in terms of layout, harvesting, and administration [2]. This challenge has spurred the demand for advanced precision forestry tools that offer accurate, real-time machine positioning, as well as forest measurement and mapping capabilities. Such tools have the potential to eliminate certain layout tasks, such as individual tree and boundary marking, thereby reducing operational costs and enhancing the economic feasibility of alternative silvicultural systems.
Real-time machine positioning and forest mapping are essential components of sustainable forest harvesting [3]. However, practical low-cost technologies for precise localization and large-scale mapping in forested environments during harvesting remain underdeveloped. Unlike the substantial advancements in automation introduced in agricultural systems [4], forest harvesting has not experienced groundbreaking technological progress in terms of automation and robotics. This lag can be attributed, in part, to the challenging environments in which forest machines operate, as well as potential cultural factors, resistance to change, or inadequate research and development efforts.
In this paper, we introduce visual-processing algorithms that facilitate precise real-time machine positioning and forest mapping. We employ visual egomotion estimation from a stereo camera to maintain the position of a forestry machine during instances of degraded or failed reception from a global navigation satellite system (GNSS). Furthermore, we demonstrate how precise position maintenance can be utilized to generate tree maps when combined with a tree-detection algorithm. Below, we itemize the main contributions of this work:
  • A novel camera-based system that integrates GNSS, visual odometry (egomotion estimation), and a tree-detection system.
  • A low-cost system using an off-the-shelf stereo camera and a consumer-grade GNSS receiver.
  • An efficient pixel-selection method with predictable inter-frame runtime for direct image alignment.
  • A global orientation parameter in the optimization framework for preliminary alignment between the GNSS and visual odometry pose track.
  • A system that provides the tree position and dendrometric information to the user in real time.
In the remainder of this introduction, we will discuss the crucial role of GNSS in forest operations, along with the associated challenges and limitations. Additionally, we will introduce the concepts of visual odometry (VO) and simultaneous localization and mapping (SLAM).

1.1. Global Positioning

Global navigation satellite systems (GNSSs) have played a significant role in forest resource management. These systems have been employed for inventory plot localization [5], forest traverse surveys [6], mapping forest disturbances [7], machine tracking [8,9], automated time studies [10,11], and general operational monitoring [12]. Recently, GNSS technology has been applied to enhance occupational safety in logging operations by providing virtual geofences around workers on the job site [13,14,15]. Most modern forest machines come equipped with a GNSS receiver from the factory, enabling real-time positioning of the machine during operation.
While highly accurate GNSS positioning may not be essential for many forestry applications, it becomes crucial when constraining machines with a virtual boundary, particularly when boundaries coincide with ownership demarcations or delineate high-value, sensitive, or hazardous areas. High-precision localization is also necessary for mapping individual trees. Challenges with GNSS accuracy have impeded the widespread adoption of GNSS technology in the virtual boundary and mapping domains.
GNSS accuracy depends on various factors, one of which is the number and geometry of visible satellites. This is characterized by an index called position dilution of precision (PDOP), which is a multiplicative term that scales the expected accuracy of the receiver. PDOP values less than 1 provide the highest possible accuracy, while PDOP values greater than 20 generally render coordinate readings futile, e.g., a PDOP value of 20 with a GNSS receiver capable of 3 m accuracy results in an actual accuracy of 60 m. The forest canopy can block signals from reaching the receiver, an effect known as the canopy effect [6]. Another important factor—primarily responsible for degraded GNSS accuracy under forest canopy—is signal diffraction and reflection, known as multipath errors [16]. Lastly, GNSS accuracy also depends on the class of receiver, such as survey-grade or consumer-grade receivers. Research on GNSS accuracy in forested environments has shown consumer-grade receivers to have an accuracy range between 4 and 12 m [17,18,19,20,21]. In this work, we utilize a specific GNSS called the global positioning system (GPS), operated by the United States Space Force. The GPS satellite constellation comprises 33 satellites, 31 of which are currently operational.

1.2. Visual Odometry

Estimating the egocentric motion of a camera is a fundamental task in computer vision. Egomotion estimation aims to determine a 3D geometric transformation that describes the incremental translational and rotational change of a camera in motion relative to the observed environment. Motion estimation occurs in discrete time steps, where a new estimate at time $t$ reveals the camera’s motion relative to its pose at time $t-1$. Each new estimate can be incrementally composed with the previous to produce a pose graph, i.e., a track of position and orientation over time. Visual odometry (VO), coined by Nistér et al. [22], is the term often used to describe time-integrated motion estimates and is analogous to other variants of odometry, e.g., wheeled odometry, where wheel encoders are used to estimate the traveled distance in ground vehicles. Odometric navigation is subject to errors that accumulate over time, eventually leading to positional drift, a well-known property of dead reckoning navigation systems.
Early research on the problem of recovering relative camera poses tackled the problem of structure-from-motion [23,24]. Structure-from-motion is the problem of recovering both 3D structure and camera poses from a set of unordered images. VO is a special case of structure-from-motion, where images are sequenced and pose recovery is the sole objective. To date, most algorithms for VO are based on a feature extraction and matching pipeline [25,26,27,28]. These algorithms are known in the vision community as feature-based methods. In general, feature-based methods involve three steps: (1) feature extraction and description using one of many available approaches (e.g., FAST detector [29,30], SIFT detector/descriptor [31], Harris detector [32], Shi and Tomasi detector [33], and SURF detector/descriptor [34]); (2) temporal feature matching, e.g., using mutual consistency or constrained matching [35]; and (3) pose recovery by minimizing the 3D-2D reprojection error using a perspective-from-n-point (PnP) algorithm, e.g., EPnP [36], which is typically embedded in a random sample consensus (RANSAC) scheme [37]. See [38] for an overview of VO fundamentals in the context of feature-based methods.
Recent approaches to VO have migrated away from feature-based approaches due to complexity and the plethora of configurations in feature-based pipelines. Recent approaches use image intensities directly rather than extracting and matching features [39,40,41,42,43,44]. These methods are called direct methods and are founded on the work by Lucas and Kanade [45] introducing parametric image alignment. In contrast to feature-based methods, which recover relative camera motion by minimizing reprojection error, direct methods minimize the photometric error, the sum of squared differences in image intensities between two consecutive frames.
In this work, we employ a direct method to solve for the incremental 6-DoF motion parameters. We provide a detailed description of image warping and pixel selection techniques, as well as the optimization procedure. Our method is somewhat simplified in contrast to state-of-the-art methods, e.g., [40,43], which involve keyframe selection and windowed bundle adjustment optimization to mitigate trajectory drift. As we will show in the localization and mapping section, we align the odometry track with global coordinates from a GNSS, and therefore, maintaining global consistency within our VO framework is unnecessary. Instead, in this work, we focus on local consistency and computational efficiency.

1.3. Simultaneous Localization and Mapping

SLAM is a critical problem in robotics, which involves constructing a map of an unknown environment while concurrently determining the robot’s location within it. If the robot’s position and orientation are known at any given time, mapping its surroundings using sensor data becomes a straightforward task. Conversely, if the map is already known, the problem is reduced to localization, where the robot’s observations are utilized to determine its position and orientation within the map. When neither the pose nor the map is known, both the map and path must be estimated simultaneously. SLAM can be broadly classified into online and full SLAM problems. Online SLAM incrementally estimates the current pose along with the map, while full SLAM estimates the entire path and map using data from all poses and observations [46]. There has been research in forest harvesting related to the localization problem [47,48]; however, a stem map is required in order to localize the machine, which we consider to be a major limitation of such an approach since stem maps, in general, are not readily available.
Visual odometry (VO) and vision-based SLAM are closely related, as VO provides an estimate of the robot’s path. The key difference is that SLAM mandates maintaining a map—even if it is not ultimately needed—so that the robot can recognize previously explored areas. When the robot revisits an area on the map, it provides additional constraints to optimize the trajectory and map, ensuring global consistency. This process is known as loop closure. Generally, VO is focused on local consistency, while SLAM aims for global consistency. It is worth noting that if VO is free from drift, the SLAM problem is reduced to merely mapping observations. However, error-free VO has never been achieved in practice, making loop closures essential for maintaining global consistency. To recognize loop closures, robots need to operate in environments containing distinct features that can be reidentified upon returning to a previously mapped location.
Detecting loop closures in forested environments can be challenging, as the spatial configurations of features, such as trees, may not be unique enough to provide robust information for redetection. Integrating GNSS positioning with SLAM can assist in identifying loop closures by maintaining a globally consistent position path. In this paper, we demonstrate how intermittent and degraded GPS reception can be fused with VO to provide a globally consistent position estimate. Additionally, we present a graph-based SLAM algorithm for refining the estimated path and map simultaneously. For an introductory tutorial on graph-based SLAM, readers can refer to [49].

1.4. Notation

In our method description below, we denote vectors as bold lowercase letters, e.g., $\mathbf{v}$, matrices as bold capital letters, e.g., $\mathbf{M}$, and scalars as lowercase italic letters, e.g., $s$. We use the notation $\|\cdot\|$ as a shorthand for $\|\cdot\|_2$, i.e., the Euclidean norm. We represent images as functions, $I : \Omega \rightarrow \mathbb{R}^3$ for 3-channel color images and $I : \Omega \rightarrow \mathbb{R}$ for gray-scale images, where $\Omega \subset \mathbb{R}^2$ is the image domain. Sets are denoted by capital script letters, e.g., $\mathcal{A}$, and the number of elements in a set is given by $|\mathcal{A}|$. See Appendix A for an overview of Lie groups and rigid transformations used in this paper.

2. Direct Visual Odometry

Given a reference image $I_{t-1} : \Omega \rightarrow \mathbb{R}$ acquired at time $t-1$ and an input image $I_t : \Omega \rightarrow \mathbb{R}$ acquired at time $t$, we seek to estimate the 3D egomotion of the camera between the frames. Estimating the camera’s motion is performed by solving for the parameters of a warp, $\xi_{t-1:t}$, which relates the pixels in $I_{t-1}$ to the pixels in $I_t$. For brevity, we drop the time subscript on the warping parameters and denote it by $\xi$, and denote the reference image by $I$ and the input image by $I'$. The intensity of a pixel in the reference image is given by $I(\mathbf{p})$, where $\mathbf{p} = (u, v)^T \in \Omega$. Similarly, the intensity of a pixel in the input image is given by $I'(\mathbf{p}')$, where the position vector $\mathbf{p}'$ is the result after warping $\mathbf{p}$ according to the parameters $\xi$,
$$\mathbf{p}' = W(\mathbf{p}; \xi). \tag{1}$$
Leaving the warping function undefined for the moment, the images are registered, or aligned, by minimizing the photometric error according to the following objective:
$$\min_{\xi} \sum_{\mathbf{p} \in \Phi} \left\| I(\mathbf{p}) - I'\big(W(\mathbf{p}; \xi)\big) \right\|^2, \tag{2}$$
where $\Phi \subseteq \Omega$ is a set of selected pixels from the image domain. Solving this expression is non-linear regardless of the warping function, as pixel intensities are unrelated to their coordinates. A common method for minimizing the objective is the Gauss–Newton (GN) algorithm. In general, the expression
$$\sum_{\mathbf{p} \in \Phi} \left\| I(\mathbf{p}) - I'\big(W(\mathbf{p}; \xi + \Delta\xi)\big) \right\|^2, \tag{3}$$
is linearized w.r.t. some small change in the parameters, $\Delta\xi$, and then the parameters are updated by
$$\xi_{\kappa+1} = \xi_\kappa + \Delta\xi, \tag{4}$$
until $\xi$ and $\Delta\xi$ converge.
This formulation, as well as a solution procedure, was first given by Lucas and Kanade [45] in their seminal work, on which parametric image alignment, and subsequently direct VO, is founded. For a detailed presentation of optimization algorithms and alternative formulations, see Baker and Matthews [50]. We will discuss our optimization technique in a later section after introducing the warping function.

2.1. Image Warping

The egomotion of a camera with no assumed holonomic constraints has 6 degrees of freedom: rotation about the $x$, $y$, and $z$ axes and translation along the $x$, $y$, and $z$ axes. The motion according to these degrees of freedom is represented by a transformation matrix in the special Euclidean Lie group:
$$\mathbf{T} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0}^T & 1 \end{bmatrix} \in \mathrm{SE}(3), \tag{5}$$
where $\mathbf{R} \in \mathrm{SO}(3)$ is a 3D rotation matrix and $\mathbf{t} \in \mathbb{R}^3$ is a 3D displacement or translation vector. This is a rigid-body transformation that encodes the change in rotation and translation of a non-deformable object in motion between two discrete points in time. Although the matrix $\mathbf{T}$ has 6 degrees of freedom, there are 12 values that make up the non-homogeneous portion of the matrix: 9 values in the rotation matrix and 3 values for translation. This imposes unnecessary computational demands in an optimization setting. For this reason, we use the Lie algebra $\mathfrak{se}(3)$ to parameterize the transformation. We denote the transformation matrix as a function of the parameters $\xi \in \mathfrak{se}(3)$,
$$\mathbf{T}(\xi) = \exp(\hat{\xi}) \in \mathrm{SE}(3). \tag{6}$$
The exponential map of $\mathfrak{se}(3)$ can be computed in closed form, as can the logarithm map that takes $\mathrm{SE}(3)$ back to the algebra $\mathfrak{se}(3)$,
$$\xi = \ln\big(\mathbf{T}(\xi)\big) \in \mathfrak{se}(3). \tag{7}$$
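As a concrete illustration of this parameterization, the following minimal sketch maps a twist vector to its matrix form and back using a general-purpose matrix exponential and logarithm; the closed-form expressions referenced above would be used in a performance-critical implementation, and the rotation-first component ordering is an assumption made here to match the Jacobian in Equation (22).
```python
import numpy as np
from scipy.linalg import expm, logm

def hat(xi):
    """Map a twist xi = (w1, w2, w3, t1, t2, t3) to its 4x4 matrix form (xi^).
    The rotation-first ordering is an illustrative convention."""
    w, v = xi[:3], xi[3:]
    W = np.array([[0.0, -w[2], w[1]],
                  [w[2], 0.0, -w[0]],
                  [-w[1], w[0], 0.0]])
    X = np.zeros((4, 4))
    X[:3, :3] = W
    X[:3, 3] = v
    return X

def se3_exp(xi):
    """Exponential map se(3) -> SE(3), Eq. (6); expm stands in for the closed form."""
    return expm(hat(xi))

def se3_log(T):
    """Logarithm map SE(3) -> se(3), Eq. (7), returned as a 6-vector."""
    X = np.real(logm(T))
    w = np.array([X[2, 1], X[0, 2], X[1, 0]])
    return np.concatenate([w, X[:3, 3]])
```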
Using the Lie algebra parameterization of the transformation, we define a warping function $W : \mathbb{R}^3 \times \mathbb{R}^6 \rightarrow \mathbb{R}^2$ that takes a homogeneous pixel coordinate as an input, back projects it to $\mathbb{R}^3$, transforms the back projection according to the warping parameters, and then projects it back to $\mathbb{R}^2$,
$$W(\tilde{\mathbf{p}}; \xi) = \pi\Big(\mathbf{T}(\xi)\, \pi^{-1}(\tilde{\mathbf{p}}, z)\Big), \tag{8}$$
where the notation $\tilde{\mathbf{p}} = (u, v, 1)^T \in \mathbb{P}^2$ denotes a pixel position vector in homogeneous coordinates. The function $\pi^{-1}(\cdot)$ performs the back-projection and is defined as
$$\pi^{-1}(\tilde{\mathbf{p}}, z) = \Big( \big(z \mathbf{K}^{-1} \tilde{\mathbf{p}}\big)^T, 1 \Big)^T \in \mathbb{R}^4, \tag{9}$$
where $z$ is a depth measurement for the pixel and $\mathbf{K}^{-1}$ is the inverse of the projection matrix. Note that we homogenized, i.e., appended a fictitious coordinate to, the back-projected vector so that it is compatible with the transformation matrix. The transformation matrix is not homogeneous, i.e., the last row of the matrix shown in Equation (5) is omitted, so the resulting vector following the transformation has three dimensions. The function that performs the forward projection is defined as
$$\pi(\mathbf{x}) = \tilde{n}(\mathbf{K}\mathbf{x}) \in \mathbb{R}^2, \tag{10}$$
where $\mathbf{x} = (x, y, z)^T \in \mathbb{R}^3$, $\mathbf{K}$ is the camera projection matrix, and $\tilde{n} : \mathbb{R}^3 \rightarrow \mathbb{R}^2$ normalizes the homogeneous coordinates, i.e., $\tilde{n}\big((x, y, w)^T\big) = (x/w, y/w)^T$. Finally, the projection matrix and its inverse are defined as
$$\mathbf{K} = \begin{bmatrix} f & 0 & c_u \\ 0 & f & c_v \\ 0 & 0 & 1 \end{bmatrix}, \qquad \mathbf{K}^{-1} = \begin{bmatrix} \frac{1}{f} & 0 & -\frac{c_u}{f} \\ 0 & \frac{1}{f} & -\frac{c_v}{f} \\ 0 & 0 & 1 \end{bmatrix}, \tag{11}$$
where $f$ is the focal length, assuming unit aspect ratio pixels, and $(c_u, c_v)$ is the principal point.
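A minimal sketch of the warp defined by Equations (8)–(11) follows; it reuses the se3_exp helper from the earlier sketch, takes the transformation as a 4×4 matrix, and performs no bounds checking, so it is illustrative rather than a reproduction of the authors’ implementation.
```python
import numpy as np

def warp(p, z, T, K, K_inv):
    """Warp pixel p = (u, v) with depth z by the rigid transform T = exp(xi^),
    following Eqs. (8)-(11)."""
    u, v = p
    x = z * (K_inv @ np.array([u, v, 1.0]))      # back-projection, Eq. (9)
    x = np.append(x, 1.0)                        # homogenize for the 4x4 transform
    x = (T @ x)[:3]                              # rigid motion; drop the homogeneous row
    q = K @ x                                    # forward projection, Eq. (10)
    return np.array([q[0] / q[2], q[1] / q[2]])  # homogeneous normalization n~(.)
```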
The only variable that still needs attention is the depth measurement $z$ used in the back-projection function. We estimate the depth of each pixel by first computing the stereo correspondence via semi-global matching [51]. We use a real-time GPU implementation presented in [52]. Due to the smoothness constraints imposed by the semi-global matching scheme, occlusions are filled by some non-zero value. We handle occlusions with a right–left consistency check: the disparity map $D_{l,r}$ is computed first, then a second disparity map $D_{r,l}$ is computed by reflecting the left and right images along the $v$-axis and using the right image in place of the left, and the left in place of the right. The final disparity map $D$ is equal to $D_{l,r}$ when the absolute difference between $D_{l,r}$ and $D_{r,l}$, evaluated at a position vector $\mathbf{p}$, does not exceed some threshold $\delta$. Otherwise, $D(\mathbf{p})$ is set to zero,
$$D(\mathbf{p}) = \begin{cases} D_{l,r}(\mathbf{p}) & \text{if } |D_{l,r}(\mathbf{p}) - D_{r,l}(\mathbf{p})| \le \delta \\ 0 & \text{otherwise} \end{cases}, \quad \forall \mathbf{p} \in \Omega. \tag{12}$$
In this work, we use $\delta = 1$. We do not perform sub-pixel refinement to interpolate disparity values. We simply use positive integers to represent the disparity map, $D : \Omega \rightarrow \mathbb{N}$. The depth estimate, $z$, for the pixel $\mathbf{p}$ is given via triangulation,
$$z(\mathbf{p}) = \frac{f\, b}{D(\mathbf{p})}, \tag{13}$$
where $f$ is the focal length and $b$ is the baseline distance of the stereo rig in centimeters. This assumes that the stereo camera is calibrated and row-aligned. We follow methods presented in [53] for camera calibration.
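The consistency check and triangulation can be sketched as follows. OpenCV’s StereoSGBM is used here only as a stand-in for the GPU semi-global matcher cited above, and the block size and disparity range are illustrative values rather than the paper’s settings.
```python
import numpy as np
import cv2

def disparity_with_lr_check(left, right, delta=1, num_disp=64):
    """Right-left consistency check of Eq. (12); inconsistent pixels get zero disparity."""
    sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=num_disp, blockSize=5)
    d_lr = sgbm.compute(left, right).astype(np.float32) / 16.0            # D_{l,r}
    # Reflect both images about the v-axis and swap their roles to obtain D_{r,l}.
    d_rl = sgbm.compute(cv2.flip(right, 1), cv2.flip(left, 1)).astype(np.float32) / 16.0
    d_rl = cv2.flip(d_rl, 1)
    consistent = np.abs(d_lr - d_rl) <= delta
    return np.where(consistent & (d_lr > 0), np.rint(d_lr), 0).astype(np.int32)

def depth_from_disparity(D, f, b):
    """Triangulated depth z = f*b / D, Eq. (13); zero-disparity pixels keep depth 0."""
    D = D.astype(np.float32)
    return np.where(D > 0, f * b / np.where(D > 0, D, 1.0), 0.0)
```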

2.2. Optimization

We will now extend our discussion regarding the minimization of photometric error. As noted earlier, minimizing the expression defined in Equation (2) is a non-linear optimization task. This formulation requires a linearization step during each GN iteration. Namely, the Jacobian of the warp and the Hessian need to be computed during each iteration, which can lead to significant computational demands depending on the size of the Jacobian. Following from Baker and Matthews [50], we redefine the objective under the inverse compositional (IC) formulation by interchanging the roles of the reference and input image and solving for incremental warp parameters instead of additive updates as in the Lucas–Kanade formulation. Given some initial guess of the parameters, $\xi$, the objective is to minimize the following expression w.r.t. the incremental warp parameters,
$$\Delta\xi = \operatorname*{argmin}_{\Delta\xi} \sum_{\mathbf{p} \in \Phi} \left\| I\big(W(\tilde{\mathbf{p}}; \Delta\xi)\big) - I'\big(W(\tilde{\mathbf{p}}; \xi)\big) \right\|^2. \tag{14}$$
The parameters are updated by inverting the incremental warp parameters and composing with the current estimate, $\xi \leftarrow \xi \circ \Delta\xi^{-1}$, where the notation $\circ$ denotes composition. The incremental warp parameters need to be inverted at each GN iteration as the linearization, which we will discuss next, is performed on the reference image. The update rule can be explicitly written as
$$\xi_{\kappa+1} = \xi_\kappa \circ \Delta\xi^{-1} = \ln\Big(\exp(\hat{\xi}_\kappa)\exp(\widehat{\Delta\xi})^{-1}\Big) = \ln \begin{bmatrix} \mathbf{R}_\kappa \Delta\mathbf{R}^T & -\mathbf{R}_\kappa \Delta\mathbf{R}^T \Delta\mathbf{t} + \mathbf{t}_\kappa \\ \mathbf{0}^T & 1 \end{bmatrix}. \tag{15}$$
According to the GN algorithm, the incremental warp parameters $\Delta\xi$ are given by the normal equations,
$$\mathbf{J}^T \mathbf{J}\, \Delta\xi = \mathbf{J}^T \mathbf{r} \quad\Longrightarrow\quad \Delta\xi = \big(\mathbf{J}^T \mathbf{J}\big)^{-1} \mathbf{J}^T \mathbf{r}, \tag{16}$$
where $\mathbf{J}$ is an $m \times 6$ Jacobian matrix, $\mathbf{J}^T\mathbf{J}$ is the Gauss–Newton approximation of the Hessian matrix, and $\mathbf{r}$ is the vector of residuals given by
$$\mathbf{r} = I'\big(W(\tilde{\mathbf{p}}; \xi)\big) - I(\mathbf{p}). \tag{17}$$
The Jacobian encodes the partial derivatives of the reference image at each pixel $\mathbf{p}_i$, $i = 1, \ldots, m$, with respect to the six warping parameters,
$$\mathbf{J} = \begin{bmatrix} \partial I(\mathbf{p}_1) / \partial \xi^T \\ \partial I(\mathbf{p}_2) / \partial \xi^T \\ \vdots \\ \partial I(\mathbf{p}_m) / \partial \xi^T \end{bmatrix} = \begin{bmatrix} \frac{\partial I(\mathbf{p}_1)}{\partial \xi_1} & \frac{\partial I(\mathbf{p}_1)}{\partial \xi_2} & \cdots & \frac{\partial I(\mathbf{p}_1)}{\partial \xi_6} \\ \frac{\partial I(\mathbf{p}_2)}{\partial \xi_1} & \frac{\partial I(\mathbf{p}_2)}{\partial \xi_2} & \cdots & \frac{\partial I(\mathbf{p}_2)}{\partial \xi_6} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial I(\mathbf{p}_m)}{\partial \xi_1} & \frac{\partial I(\mathbf{p}_m)}{\partial \xi_2} & \cdots & \frac{\partial I(\mathbf{p}_m)}{\partial \xi_6} \end{bmatrix}. \tag{18}$$
The linearization of Equation (14) can be achieved by performing a first-order Taylor expansion about the current estimate of the parameters. Denoting the $i$th row in the Jacobian as $\frac{\partial I(\mathbf{p})}{\partial \xi}$, corresponding to some pixel $\mathbf{p} \in \Phi$, and applying the chain rule, we obtain
$$\frac{\partial I(\mathbf{p})}{\partial \xi^T} = \frac{\partial I(\mathbf{p})}{\partial \mathbf{p}} \frac{\partial \mathbf{p}}{\partial \mathbf{x}} \frac{\partial \mathbf{x}}{\partial \xi}. \tag{19}$$
The partial derivative of the reference image w.r.t. some pixel position $\mathbf{p}$ is simply the gradient vector of the image along the $u$ and $v$ axes,
$$\frac{\partial I(\mathbf{p})}{\partial \mathbf{p}} = \nabla I(\mathbf{p}) = \big[ I_u(\mathbf{p}),\; I_v(\mathbf{p}) \big]. \tag{20}$$
We purposefully denoted the gradient vector as a row vector. Taking $\mathbf{p}$ to be equal to a reduced form of the forward projection function $\pi(\cdot)$, such that $\mathbf{p} = \big(\frac{f x}{z} + c_u,\; \frac{f y}{z} + c_v\big)^T$, and $\mathbf{x} = (x, y, z)^T$ to be a back projection of the pixel, the partial derivative can be written as
$$\frac{\partial \mathbf{p}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{f}{z} & 0 & -\frac{f x}{z^2} \\ 0 & \frac{f}{z} & -\frac{f y}{z^2} \end{bmatrix}. \tag{21}$$
Finally, the partial derivative of the back-projected pixel $\mathbf{x}$ w.r.t. the warping parameters evaluated at the identity warp $\xi = \mathbf{0}$ takes the form
$$\frac{\partial \mathbf{x}}{\partial \xi}\bigg|_{\xi = \mathbf{0}} = \big[ -[\mathbf{x}]_\times \;\; \mathbf{I}_3 \big] = \begin{bmatrix} 0 & z & -y & 1 & 0 & 0 \\ -z & 0 & x & 0 & 1 & 0 \\ y & -x & 0 & 0 & 0 & 1 \end{bmatrix}. \tag{22}$$
This result follows from the skew–symmetric matrix operator used when computing the $\mathrm{SE}(3)$ exponential map of the warp parameters. Multiplying out the last two partials gives
$$\frac{\partial I(\mathbf{p})}{\partial \xi^T} = \frac{\partial I(\mathbf{p})}{\partial \mathbf{p}} \frac{\partial \mathbf{p}}{\partial \mathbf{x}} \frac{\partial \mathbf{x}}{\partial \xi} = \big[ I_u(\mathbf{p}),\; I_v(\mathbf{p}) \big] \begin{bmatrix} -\frac{f x y}{z^2} & \frac{f x^2 + f z^2}{z^2} & -\frac{f y}{z} & \frac{f}{z} & 0 & -\frac{f x}{z^2} \\ -\frac{f y^2 + f z^2}{z^2} & \frac{f x y}{z^2} & \frac{f x}{z} & 0 & \frac{f}{z} & -\frac{f y}{z^2} \end{bmatrix}. \tag{23}$$
For convenience, we also show a row in the Jacobian corresponding to some pixel $\mathbf{p}$ written out explicitly,
$$\frac{\partial I(\mathbf{p})}{\partial \xi} = \begin{bmatrix} -\dfrac{I_u(\mathbf{p})\, f x y + I_v(\mathbf{p})\, (f y^2 + f z^2)}{z^2} \\[2pt] \dfrac{I_u(\mathbf{p})\, (f x^2 + f z^2) + I_v(\mathbf{p})\, f x y}{z^2} \\[2pt] \dfrac{-I_u(\mathbf{p})\, f y + I_v(\mathbf{p})\, f x}{z} \\[2pt] \dfrac{I_u(\mathbf{p})\, f}{z} \\[2pt] \dfrac{I_v(\mathbf{p})\, f}{z} \\[2pt] -\dfrac{I_u(\mathbf{p})\, f x + I_v(\mathbf{p})\, f y}{z^2} \end{bmatrix}^T \in \mathbb{R}^6. \tag{24}$$
The row vector stated above is computed for each pixel $\mathbf{p} \in \Phi$ and stacked into the $m \times 6$ Jacobian matrix, again where $m$ is the number of pixels in $\Phi$. The columns of the Jacobian can be visualized as the steepest descent images shown in Figure 1.
Since the linearization is performed on the coordinate frame of the reference image, the Jacobian $\mathbf{J}$ and the GN approximation of the Hessian $\mathbf{H} = \mathbf{J}^T\mathbf{J}$, as well as its inverse $\mathbf{H}^{-1}$, only need to be computed once. These are the computational savings of the IC formulation [50] over the original Lucas–Kanade algorithm [45].
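To make the IC procedure concrete, the following sketch combines the precomputed Jacobian with the warp and se(3) helpers from the earlier sketches. It is a minimal, unweighted, single-level illustration under the assumption that the Jacobian columns follow the same twist ordering as se3_exp; the robust weighting of Section 2.2.2 and any coarse-to-fine scheme are omitted.
```python
import numpy as np

def bilinear(I, q):
    """Bilinearly interpolate image I at a sub-pixel coordinate q = (u, v)."""
    u, v = q
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    a, b = u - u0, v - v0
    return ((1 - a) * (1 - b) * I[v0, u0] + a * (1 - b) * I[v0, u0 + 1]
            + (1 - a) * b * I[v0 + 1, u0] + a * b * I[v0 + 1, u0 + 1])

def ic_gauss_newton(I_ref, I_in, pixels, depths, J, K, K_inv, max_iter=30, tol=1e-8):
    """Inverse-compositional Gauss-Newton for Eqs. (14)-(17). `pixels` holds the
    selected (u, v) coordinates, `depths` their depths, and J the m x 6 Jacobian
    precomputed on the reference image. Out-of-bounds handling is omitted."""
    H_inv = np.linalg.inv(J.T @ J)           # Hessian and its inverse, computed once
    T = np.eye(4)                            # current estimate exp(xi^)
    ref = np.array([float(I_ref[v, u]) for u, v in pixels])
    for _ in range(max_iter):
        warped = [warp(p, z, T, K, K_inv) for p, z in zip(pixels, depths)]
        r = np.array([bilinear(I_in, q) for q in warped]) - ref   # residuals, Eq. (17)
        dxi = H_inv @ (J.T @ r)                                   # normal equations, Eq. (16)
        T = T @ np.linalg.inv(se3_exp(dxi))                       # xi <- xi o dxi^{-1}, Eq. (15)
        if np.linalg.norm(dxi) < tol:
            break
    return se3_log(T)
```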

2.2.1. Pixel Selection

Since minimizing photometric error is based on image gradients, only pixels with a non-zero gradient contribute to optimization. Therefore, including pixels with zero gradient in the Jacobian introduces unnecessary computations. In feature-rich environments, such as forests, selecting pixels with a non-zero gradient might not significantly reduce the number of pixels used in optimization. As shown in Figure 2a, selecting all non-zero gradient pixels results in using approximately 85% of the image. We can further reduce the number of pixels, thus decreasing computation, by thresholding pixels based on gradient magnitude. This approach, however, results in a Jacobian matrix that is subject to change in dimension between frames since the distribution of gradient magnitude is not guaranteed to be consistent; we must either allocate enough memory to store the expected maximum number of non-zero gradient pixels across all frames, which cannot be determined in advance, or dynamically allocate memory prior to optimizing each frame. This issue can be resolved by selecting a fixed percentage of the total number of pixels in the image, either by performing a binary search for a gradient magnitude threshold that yields the desired number of pixels or by sorting the gradients in descending order and selecting the first $N \cdot p / 100$ pixels, where $p$ is the desired percentage and $N$ is the number of pixels in the image. In Figure 2b–d, we show the selected pixels resulting from desired percentages ranging from 25% to 75%.
Prior to selecting pixels based on the gradient magnitude, we discard all pixels with a disparity value of zero since these pixels are projected to infinity during image warping. Zero disparity pixels are apparent in Figure 2 as areas in the images that obviously have non-zero gradients but are too distant to have a non-zero disparity value.
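A sketch of this selection strategy, combining the gradient-magnitude ranking with the zero-disparity exclusion described above, might look as follows; the function and argument names are illustrative.
```python
import numpy as np

def select_pixels(grad_mag, disparity, percent=25):
    """Keep the top `percent`% of pixels by gradient magnitude (Section 2.2.1),
    discarding pixels whose disparity is zero so they cannot project to infinity."""
    ranked = np.where(disparity > 0, grad_mag, -1.0)        # push invalid pixels to the bottom
    n_keep = int(grad_mag.size * percent / 100)             # first N*p/100 pixels
    flat = np.argsort(ranked.ravel())[::-1][:n_keep]        # sort magnitudes in descending order
    rows, cols = np.unravel_index(flat, grad_mag.shape)
    return np.stack([cols, rows], axis=1)                   # (u, v) pixel coordinates
```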

2.2.2. Robustness

GN optimization assumes Gaussian distributed errors. It is often the case in real-world data, however, that non-Gaussian errors arise due to inaccurate pixel correspondences during disparity computation, motion-induced occlusions, illumination changes, and auto-exposure adjustments. A cost function for minimizing photometric error that is insensitive to non-Gaussian distributed noise is said to be robust. In the GN framework, this can be achieved with iteratively re-weighted least squares (IRLS) using a robust cost function. The choice of cost function is somewhat arbitrary, typically made through empirical evaluation or by means of some prior knowledge regarding the structure of outliers in the data. In this work, we choose to use Tukey’s biweight cost function [54,55] since it suppresses large residuals, in contrast to Huber’s cost function [56], which simply down-weights their influence.
Tukey’s biweight cost function takes the form
$$\rho(r_i) = \begin{cases} \left(1 - \left(\frac{r_i}{c}\right)^2\right)^2 & \text{if } |r_i| \le c \\ 0 & \text{otherwise,} \end{cases} \tag{25}$$
where the constant $c$ is usually chosen to be 4.6851 to achieve 95% asymptotic efficiency with the normal distribution and $r_i$ is the $i$th residual from the error vector given by Equation (17). The cost function assumes that the residuals have unit variance. A common choice to estimate the scale parameter to standardize the residuals is to compute the median absolute deviation (MAD) and multiply by the expected MAD for a standard normal distribution,
$$\hat{s} = k \cdot \operatorname*{median}_i\, |r_i|, \tag{26}$$
where $k = 1.4826$. To incorporate robustness in the GN minimization routine, we construct a weight vector $\mathbf{w}$, where each weight $w_i = \rho(r_i / \hat{s})$, and set the diagonal entries of a weight matrix equal to the weight vector, i.e., $\mathbf{W} = \operatorname{diag}(\mathbf{w})$. It follows that the solution to the incremental update of the linearized system under the IRLS framework takes the form
$$\Delta\xi = \big(\mathbf{J}^T \mathbf{W} \mathbf{J}\big)^{-1} \mathbf{J}^T \mathbf{W} \mathbf{r}, \tag{27}$$
which is used in place of the normal equations shown in Equation (16).
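A minimal sketch of this weighting scheme, assuming the residual vector of Equation (17) and the Jacobian from Section 2.2, is shown below; applying it amounts to replacing the plain normal-equation step in the earlier Gauss–Newton sketch with irls_step.
```python
import numpy as np

def tukey_weights(r, c=4.6851, k=1.4826):
    """Tukey biweight weights (Eq. (25)) on residuals standardized by the
    MAD-based scale estimate of Eq. (26)."""
    s = k * np.median(np.abs(r))
    u = r / max(s, 1e-12)
    return np.where(np.abs(u) <= c, (1.0 - (u / c) ** 2) ** 2, 0.0)

def irls_step(J, r, w):
    """One robust update, Eq. (27): solve (J^T W J) dxi = J^T W r with W = diag(w),
    without forming the diagonal matrix explicitly."""
    JW = J * w[:, None]              # rows of J scaled by their weights, i.e., W J
    return np.linalg.solve(JW.T @ J, JW.T @ r)
```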

3. Localization and Mapping

In this section, we present a graph-based algorithm for maintaining a globally and locally consistent pose track using the estimated frame-to-frame egomotion parameters from the previous section and a consumer-grade GPS receiver. We also show how the optimized pose graph can be refined to generate a map of detected tree stems. Our algorithm consists of two phases: global alignment, in which we minimize errors between the odometry-based pose graph and global positions provided by the GPS receiver, and local refinement, where we relate poses by multiple observations of tree stems and simultaneously optimize the configuration of the pose graph and tree stem positions.
In order to increase the efficiency of computations, we use the $\mathfrak{se}(2)$ Lie algebra of rigid transformations to represent the graph, and ultimately the map, as opposed to the $\mathfrak{se}(3)$ parameters we optimized for in the odometry section. We do this by simply extracting the translation parameters corresponding to the $x$ and $z$ axes in $\mathfrak{se}(3)$ to represent the translation along the $x$ and $y$ axes on a planar pose graph, and the rotation component about the $y$ axis from the $\mathfrak{se}(3)$ parameters to represent the heading. Using $\mathfrak{se}(2)$ parameters produces a meaningful map that can easily be presented on a 2D display monitor.
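Under the twist ordering assumed in the earlier sketches, and assuming small inter-frame rotations so that the rotation component about the camera $y$ axis can be read directly from the twist, this reduction can be sketched as:
```python
import numpy as np

def se3_to_se2(xi):
    """Collapse a 6-DoF odometry increment to the planar (x, y, theta) parameters
    used for the pose graph: camera x/z translation becomes planar x/y and the
    rotation about the camera y axis becomes the heading change."""
    wx, wy, wz, tx, ty, tz = xi
    return np.array([tx, tz, wy])
```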

3.1. Global Alignment

Given a set of frame-to-frame odometry observations, $\{\Delta\xi_1, \Delta\xi_2, \ldots, \Delta\xi_m\}$, where each $\Delta\xi_i = (x, y, \theta)^T \in \mathfrak{se}(2)$, we seek to align a pose graph constructed from the odometry observations with a set of global position readings from a GPS receiver. We assume odometry observations to be locally consistent but subject to drift and assume GPS coordinates to be locally bounded by a Gaussian distribution specified by an arbitrarily large covariance matrix that captures the expected errors due to multipath signals and geometric dilution of precision. We denote a GPS coordinate as $\mathbf{g} = (x, y)^T \in \mathbb{R}^2$, where $x$ and $y$ represent the global position estimate in meters within the Universal Transverse Mercator (UTM) coordinate system. We also convert the translation component of the odometry observations to meters for compatibility with the UTM coordinate frame.

3.1.1. Graph Construction

A pose graph is used to represent the camera poses and the motion constraints between the poses. A node, or vertex, in the graph denotes a pose, i.e., a position and orientation, and an edge denotes the relative motion constraint given by the odometry observation. We use $\mathbf{v}_i$ to denote the $i$th node in the graph and $\Delta\xi_i$ to denote the relative motion between $\mathbf{v}_i$ and $\mathbf{v}_{i+1}$. We construct the initial pose graph by sequentially transforming poses with the odometry observations. First, we fix the first node in the graph to the zero vector, then each subsequent pose is computed by right multiplying the previous pose with a homogeneous transformation matrix $\mathbf{T}(\Delta\xi) \in \mathrm{SE}(2)$, representing the exponential map of the odometry observation,
$$\mathbf{v}_1 = (0, 0, 0)^T, \tag{28a}$$
$$\mathbf{v}_{i+1} = \mathbf{T}(\Delta\xi_i)\, \mathbf{v}_i, \quad i = 1, 2, \ldots, m. \tag{28b}$$
To simplify the notation, we use $\xi_{ij}$ to denote the motion constraint between the poses $\mathbf{v}_i$ and $\mathbf{v}_j$, where $\mathbf{v}_j \overset{\text{def}}{=} \mathbf{v}_{i+1}$. We also take $n$ to be equal to the number of poses, which is the number of odometry constraints plus one, i.e., $n \overset{\text{def}}{=} m + 1$. Associated with each motion constraint is a $3 \times 3$ covariance matrix $\boldsymbol{\Sigma}_{ij}$ that represents the uncertainty of the motion. As we describe in the optimization section that follows, we use the information matrix $\mathbf{Q}_{ij} = \boldsymbol{\Sigma}_{ij}^{-1}$ to represent the strength of the edge, or constraint, in the graph.
Let $\mathcal{C}$ be an ordered set of 2-tuples representing the correspondences between poses and GPS readings. Thus, the tuple $(i, k) \in \mathcal{C}$ specifies that pose $\mathbf{v}_i$ corresponds to the GPS reading $\mathbf{g}_k$. We insert GPS coordinates as nodes in the graph and add an edge to the corresponding camera pose. We also translate all GPS coordinates according to the first correspondence in $\mathcal{C}$. We do this by storing the translation, $\mathbf{g}_0 \leftarrow \mathbf{g}_k$ with $(i, k) \in \mathcal{C}_1$, where $\mathcal{C}_1$ is the first GPS–odometry correspondence, and subtracting $\mathbf{g}_0$ from all coordinates in the track,
$$\mathbf{g}_k \leftarrow \mathbf{g}_k - \mathbf{g}_0, \quad k = 1, 2, \ldots, |\mathcal{C}|. \tag{29}$$
We store the translation so that we can invert the pose graph back to the original UTM coordinates after optimization. Associated with each GPS coordinate is a $2 \times 2$ covariance matrix representing the expected accuracy of the receiver. We invert the covariance matrix, as we did with the odometry covariance, to obtain an information matrix $\mathbf{Q}_k$. The values in this matrix depend on the expected accuracy of the GPS receiver and the environment in which the receiver is operating. For example, a consumer-grade GPS device operating under a dense forest canopy will have a relatively large uncertainty and thus small values in the information matrix.
Although we anchor the GPS track to a camera pose in the graph, their global orientation will likely differ. Thus, we introduce a global orientation parameter ϕ as another node in the graph that imposes a constraint on each node representing a camera pose that has a corresponding GPS reading. See Figure 3 for an illustration of the graph.
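The construction of the initial graph can be sketched as follows; the container names are illustrative, and the chaining step applies Equations (28a) and (28b) by composing each planar increment in the frame of the previous pose.
```python
import numpy as np

def build_pose_graph(odometry, gps, corr):
    """Build the initial pose graph of Section 3.1.1. `odometry` holds (x, y, theta)
    se(2) increments, `gps` holds (x, y) UTM readings, and `corr` holds
    (pose index, gps index) correspondence tuples."""
    poses = [np.zeros(3)]
    for dx, dy, dth in odometry:                    # Eqs. (28a) and (28b)
        x, y, th = poses[-1]
        c, s = np.cos(th), np.sin(th)
        poses.append(np.array([x + c * dx - s * dy,
                               y + s * dx + c * dy,
                               th + dth]))
    g0 = np.asarray(gps[corr[0][1]], dtype=float)   # stored to undo the shift after optimization
    gps_local = np.asarray(gps, dtype=float) - g0   # Eq. (29)
    return np.array(poses), gps_local, g0
```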

3.1.2. Optimization

Given the graph structure outlined above, we seek to find the optimal configuration of the state vector,
$$\mathbf{s} = \big(\phi,\; \mathbf{v}_1^T,\; \mathbf{v}_2^T,\; \ldots,\; \mathbf{v}_n^T\big)^T, \tag{30}$$
that minimizes the sum of squared errors. The state is simply a vector consisting of the global orientation parameter followed by the camera poses, where each pose is parameterized as $\mathbf{v}_i = (x, y, \theta)^T$. Thus, the size of this vector is $1 + dn$, where $d$ is the number of dimensions used to describe a camera pose, in this case 3, and $n$ is the number of camera poses. The errors associated with the state configuration are given by two functions: one corresponding to the relative motion constraints given by the odometry observations and one corresponding to the constraints imposed by the observations from the GPS. The odometry error function, which is equivalent to the error function used in [49], is given by
$$\mathbf{r}_{ij}(\mathbf{v}_i, \mathbf{v}_j) = \mathbf{T}(\xi_{ij})^{-1}\big(\mathbf{T}(\mathbf{v}_i)^{-1}\, \mathbf{T}(\mathbf{v}_j)\big) = \begin{bmatrix} \mathbf{R}(\theta_{ij})^T \big( \mathbf{R}(\theta_i)^T (\mathbf{t}_j - \mathbf{t}_i) - \mathbf{t}_{ij} \big) \\ \theta_j - \theta_i - \theta_{ij} \end{bmatrix}, \tag{31}$$
where the notation $\mathbf{R}(\theta)$ represents a 2D rotation matrix of the form
$$\mathbf{R}(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}, \tag{32}$$
and $\theta_i$, $\theta_j$, and $\theta_{ij}$ are the rotation angles corresponding to $\mathbf{v}_i$, $\mathbf{v}_j$, and $\xi_{ij}$, respectively. The notations $\mathbf{t}_i$, $\mathbf{t}_j$, and $\mathbf{t}_{ij}$ represent the translation vectors from $\mathbf{v}_i$, $\mathbf{v}_j$, and $\xi_{ij}$, respectively. The error function defined in Equation (31) gives the translational and rotational errors by first transforming pose $\mathbf{v}_j$ into the coordinate frame of $\mathbf{v}_i$, then computing the differences according to the odometry observation. Thus, this function returns zero when the state vector is configured according to Equations (28a) and (28b).
The second error function calculates the position difference between a pose and its corresponding GPS coordinate according to the global orientation parameter,
$$\mathbf{r}_{ik}(\mathbf{v}_i, \phi) = \mathbf{R}(\phi)\, \mathbf{t}_i - \mathbf{g}_k. \tag{33}$$
This function rotates the translation vector of pose $\mathbf{v}_i$ about the origin of the coordinate frame according to a rotation matrix constructed from the global orientation parameter $\phi$ and returns the offset to the corresponding GPS coordinate.
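For reference, the two error functions can be written compactly as follows; this is a direct transcription of Equations (31)–(33) rather than an optimized implementation.
```python
import numpy as np

def rot2(theta):
    """2D rotation matrix R(theta), Eq. (32)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def odom_residual(v_i, v_j, xi_ij):
    """Odometry error of Eq. (31): express v_j in the frame of v_i and compare it
    with the odometry constraint xi_ij = (x, y, theta)."""
    t_err = rot2(xi_ij[2]).T @ (rot2(v_i[2]).T @ (v_j[:2] - v_i[:2]) - xi_ij[:2])
    return np.append(t_err, v_j[2] - v_i[2] - xi_ij[2])

def gps_residual(v_i, phi, g_k):
    """GPS error of Eq. (33): rotate the pose translation by the global orientation
    parameter and compare it with the (shifted) GPS coordinate g_k."""
    return rot2(phi) @ v_i[:2] - g_k
```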
Given the two error functions, we can write the sum of squared errors of the state configuration $\mathbf{s}$ as
$$\epsilon(\mathbf{s}) = \sum_{ij} \mathbf{r}_{ij}^T \mathbf{Q}_{ij}\, \mathbf{r}_{ij} + \sum_{(i,k) \in \mathcal{C}} \mathbf{r}_{ik}^T \mathbf{Q}_k\, \mathbf{r}_{ik}, \tag{34}$$
where the error vectors $\mathbf{r}_{ij}$ and $\mathbf{r}_{ik}$ are weighted by the information matrices to incorporate the degree of belief given to the observations. This leads to the following objective function:
$$\mathbf{s}^* = \operatorname*{argmin}_{\mathbf{s}}\; \epsilon(\mathbf{s}). \tag{35}$$
This is a non-linear least-squares optimization problem that can be solved with the GN algorithm. Specifically, the error function is linearized about a current estimate of the state configuration and we iteratively solve the linear system and incrementally update the state vector. Since our initial estimate of the state vector might be far from a minimum solution, we use a dampened version of the GN algorithm called the Levenberg–Marquardt (LM) algorithm [57,58]. The linear system that is solved during each LM iteration takes the form
$$\big(\mathbf{H} + \lambda \mathbf{I}\big)\, \Delta\mathbf{s} = \mathbf{b}, \tag{36}$$
where $\Delta\mathbf{s}$ is the solution to the linear system that provides incremental improvements to the state vector in the non-linear solution space by $\bar{\mathbf{s}} \leftarrow \bar{\mathbf{s}} + \Delta\mathbf{s}$, where $\bar{\mathbf{s}}$ is the current estimate of the state configuration. The variable $\lambda$ is a non-negative damping factor that, when large, forces the update to behave as the steepest descent and, when small, brings the algorithm closer to GN. We initialize $\lambda = \operatorname{trace}(\mathbf{H})$ and divide it by two if the objective value decreases and multiply it by two if the objective value increases.
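The outer LM loop can be sketched as below. The functions build_system and objective are placeholders for the constraint-wise assembly of the Hessian and gradient described in the remainder of this section and for Equation (34), and a general sparse direct solver stands in for the sparse Cholesky factorization used later.
```python
import numpy as np
from scipy.sparse import identity
from scipy.sparse.linalg import spsolve

def lm_optimize(build_system, objective, s0, tol=1e-3, max_iter=100):
    """Levenberg-Marquardt loop for Eq. (36). `build_system(s)` returns the sparse
    Hessian H and right-hand side b at s; `objective(s)` returns eps(s)."""
    s = np.asarray(s0, dtype=float)
    H, b = build_system(s)
    lam = H.diagonal().sum()                 # initialize lambda = trace(H)
    best = objective(s)
    for _ in range(max_iter):
        ds = spsolve((H + lam * identity(H.shape[0])).tocsc(), b)
        if objective(s + ds) < best:         # accept the step: move toward Gauss-Newton
            s, best = s + ds, objective(s + ds)
            lam /= 2.0
            H, b = build_system(s)
        else:                                # reject the step: move toward steepest descent
            lam *= 2.0
        if np.linalg.norm(ds) < tol:         # terminate on a small increment
            break
    return s
```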
To construct the Hessian $\mathbf{H}$ and the gradient vector $\mathbf{b}$, we first take the derivatives of the error functions with respect to the state parameters evaluated at the current estimate of the state. We initialize the poses in the state vector according to Equations (28a) and (28b), and set the global orientation parameter to $\phi = 0$. The partial derivatives of the odometry error function can be written as
$$\mathbf{A}_{ij} = \frac{\partial \mathbf{r}_{ij}}{\partial \mathbf{v}_i}\bigg|_{\mathbf{s} = \bar{\mathbf{s}}} = \begin{bmatrix} -\mathbf{R}(\theta_{ij})^T \mathbf{R}(\theta_i)^T & \mathbf{R}(\theta_{ij})^T \dfrac{\partial \mathbf{R}(\theta_i)^T}{\partial \theta_i} (\mathbf{t}_j - \mathbf{t}_i) \\ \mathbf{0}^T & -1 \end{bmatrix}, \tag{37a}$$
$$\mathbf{B}_{ij} = \frac{\partial \mathbf{r}_{ij}}{\partial \mathbf{v}_j}\bigg|_{\mathbf{s} = \bar{\mathbf{s}}} = \begin{bmatrix} \mathbf{R}(\theta_{ij})^T \mathbf{R}(\theta_i)^T & \mathbf{0} \\ \mathbf{0}^T & 1 \end{bmatrix}, \tag{37b}$$
where $\frac{\partial \mathbf{R}(\theta_i)^T}{\partial \theta_i}$ in $\mathbf{A}_{ij}$ is given by
$$\frac{\partial \mathbf{R}(\theta)^T}{\partial \theta} = \begin{bmatrix} -\sin\theta & \cos\theta \\ -\cos\theta & -\sin\theta \end{bmatrix}. \tag{38}$$
The derivatives for the second error function $\mathbf{r}_{ik}(\mathbf{v}_i, \phi)$ are defined as
$$\mathbf{C}_{ik} = \frac{\partial \mathbf{r}_{ik}}{\partial \phi}\bigg|_{\mathbf{s} = \bar{\mathbf{s}}} = \frac{\partial \mathbf{R}(\phi)}{\partial \phi}\, \mathbf{t}_i, \tag{39a}$$
$$\mathbf{D}_{ik} = \frac{\partial \mathbf{r}_{ik}}{\partial \mathbf{v}_i}\bigg|_{\mathbf{s} = \bar{\mathbf{s}}} = \big[ \mathbf{R}(\phi) \;\; \mathbf{0} \big], \tag{39b}$$
The partial derivative of the rotation matrix w.r.t. the rotation angle in $\mathbf{C}_{ik}$ takes the same form as presented in Equation (38). Note that we have not taken any derivatives w.r.t. the GPS coordinates, as we do not wish to reconfigure them, and thus they do not appear in the state vector. For clarity, we specify the dimensions of these matrices: both $\mathbf{A}$ and $\mathbf{B}$ are $3 \times 3$ matrices, $\mathbf{C}$ is a $2 \times 1$ matrix, and $\mathbf{D}$ is a $2 \times 3$ matrix. These derivatives lead to sparse Jacobian matrices for each of the error functions,
$$\mathbf{J}_{ij} = \big[\, \cdots \;\; \mathbf{A}_{ij} \;\; \mathbf{B}_{ij} \;\; \cdots \,\big] \in \mathbb{R}^{3 \times (1 + 3n)}, \tag{40a}$$
$$\mathbf{J}_{ik} = \big[\, \mathbf{C}_{ik} \;\; \cdots \;\; \mathbf{D}_{ik} \;\; \cdots \,\big] \in \mathbb{R}^{2 \times (1 + 3n)}. \tag{40b}$$
We use ellipses to indicate that unspecified values in the matrix are zeros. Since the odometry part of the pose graph only has constraints between consecutive nodes, the Jacobian $\mathbf{J}_{ij}$ will always have a contiguous $3 \times 6$ block of non-zero values corresponding to the odometry constraints between nodes $i$ and $j$. Furthermore, the Jacobian $\mathbf{J}_{ik}$ will always have non-zero values in the first $2 \times 1$ block corresponding to the global orientation parameter and a $2 \times 3$ non-zero block at the $i$th node representing the constraint between the GPS–odometry correspondences. Now that the Jacobians are specified, we obtain the Gauss–Newton approximation to the Hessian matrices by
$$\mathbf{H}_{ij} = \mathbf{J}_{ij}^T \mathbf{Q}_{ij}\, \mathbf{J}_{ij}, \tag{41a}$$
$$\mathbf{H}_{ik} = \mathbf{J}_{ik}^T \mathbf{Q}_k\, \mathbf{J}_{ik}, \tag{41b}$$
and the gradient vectors by
$$\mathbf{b}_{ij} = \mathbf{J}_{ij}^T \mathbf{Q}_{ij}\, \mathbf{r}_{ij}, \tag{42a}$$
$$\mathbf{b}_{ik} = \mathbf{J}_{ik}^T \mathbf{Q}_k\, \mathbf{r}_{ik}. \tag{42b}$$
From an implementation standpoint, it is easier to construct the Hessian directly using our definitions for the non-zero blocks in the Jacobians. The Hessian matrix for the $i$th pose in the graph, assuming there is a corresponding GPS coordinate associated with the node, takes the form
$$\mathbf{H}_{ij} + \mathbf{H}_{ik} = \begin{bmatrix} \mathbf{C}_{ik}^T \mathbf{Q}_k \mathbf{C}_{ik} & \mathbf{C}_{ik}^T \mathbf{Q}_k \mathbf{D}_{ik} & \\ \mathbf{D}_{ik}^T \mathbf{Q}_k \mathbf{C}_{ik} & \mathbf{A}_{ij}^T \mathbf{Q}_{ij} \mathbf{A}_{ij} + \mathbf{D}_{ik}^T \mathbf{Q}_k \mathbf{D}_{ik} & \mathbf{A}_{ij}^T \mathbf{Q}_{ij} \mathbf{B}_{ij} \\ & \mathbf{B}_{ij}^T \mathbf{Q}_{ij} \mathbf{A}_{ij} & \mathbf{B}_{ij}^T \mathbf{Q}_{ij} \mathbf{B}_{ij} \end{bmatrix}. \tag{43}$$
Since the Jacobians are sparse, the resulting Hessian matrix is also sparse. Therefore, in practice, it is advantageous to use a memory-efficient sparse storage scheme for these matrices, e.g., compressed sparse column or row matrices.
The gradient vector for the ith pose can be constructed directly with
$$\mathbf{b}_{ij} + \mathbf{b}_{ik} = \begin{bmatrix} \mathbf{C}_{ik}^T \mathbf{Q}_k\, \mathbf{r}_{ik} \\ \mathbf{A}_{ij}^T \mathbf{Q}_{ij}\, \mathbf{r}_{ij} + \mathbf{D}_{ik}^T \mathbf{Q}_k\, \mathbf{r}_{ik} \\ \mathbf{B}_{ij}^T \mathbf{Q}_{ij}\, \mathbf{r}_{ij} \end{bmatrix}. \tag{44}$$
The final Hessian matrix and gradient vector for the linear system are obtained by summing over all the constraint-wise Hessians and descent vectors,
$$\mathbf{H} = \sum_{ij} \mathbf{H}_{ij} + \sum_{(i,k) \in \mathcal{C}} \mathbf{H}_{ik}, \tag{45a}$$
$$\mathbf{b} = \sum_{ij} \mathbf{b}_{ij} + \sum_{(i,k) \in \mathcal{C}} \mathbf{b}_{ik}. \tag{45b}$$
This linearization is performed during each iteration of the LM algorithm. We take advantage of the sparse structure of the Hessian and solve the system using sparse Cholesky factorization. We terminate the algorithm when the Euclidean norm of the linear increment to the state vector, $\|\Delta\mathbf{s}\|$, is less than some small threshold, e.g., 0.001, and take the optimal configuration as $\mathbf{s}^* = \bar{\mathbf{s}} + \Delta\mathbf{s}$. Finally, we rotate each pose in the optimal state vector according to the global orientation parameter, translate back to the UTM coordinate system, and update the rotation component of each pose,
$$\hat{\mathbf{v}}_i = \begin{bmatrix} \mathbf{R}(\phi)\, \mathbf{t}_i + \mathbf{g}_0 \\ \theta_i + \phi \end{bmatrix}, \quad i = 1, \ldots, n. \tag{46}$$
Hereinafter, we omit the global orientation parameter and denote the globally aligned state vector as
$$\hat{\mathbf{s}} = \big(\hat{\mathbf{v}}_1^T,\; \hat{\mathbf{v}}_2^T,\; \ldots,\; \hat{\mathbf{v}}_n^T\big)^T. \tag{47}$$
In order to resolve the global orientation parameter, it is required that we have a minimum of two GPS observations. We also add robustness to non-Gaussian distributed GPS errors using the same approach as in Section 2.2.2. We compute weights for the residuals corresponding to the global position coordinates using Tukey’s biweight cost function, and the weights are used to scale the information matrix Q k .
We conclude this section by providing a note on the values in the information matrices. In general, the information matrix for the odometry observations consists of large values relative to the values in the information matrix for the global position observations. This is consistent with the fact that global position measurements are typically degraded under the canopy of a forest. Furthermore, VO is expected to perform well in feature-rich environments, such as a forest. We also note that we assign a substantially larger value to the entry of the odometry information matrix corresponding to the heading. This makes the translation component of the nodes in the pose graph more elastic in order to conform to the GPS track while keeping the heading stiff to help maintain an accurate reconstruction of tree stem observations, which we address in the next section.

3.2. Local Refinement

Given the globally aligned state vector s ^ , we will now optimize for a refined state vector by taking into account observations of tree stems. There are three main steps in local refinement: (1) We transform each tree stem observation to the world coordinate frame according to the globally aligned state vector we optimized in the previous section. (2) We associate the observations corresponding to an individual tree stem position in the global map. (3) We optimize for a new configuration of the state that minimizes both the error functions from the previous section and an additional error function representing the discrepancy between observations of tree stems and their associated global position.

3.2.1. Graph Augmentation

Tree stems are detected in each frame using the convolutional neural network (CNN) object detector outlined in [59]. The input image to the network is resized to a resolution of 128 columns and 352 rows for real-time performance. We also detect the ground plane and breast height using the RANSAC-based algorithm presented in [60]. For each bounding box predicted by the CNN object detector, we extract the image coordinate where the center-line of the bounding box along the $u$-axis of the image intersects the ground plane positioned at breast height. A disparity value is assigned to this image coordinate using Equation (5) in [59]. Using the inverse projection matrix and the disparity assignment, we project the image coordinate to $\mathbb{R}^3$. We represent a tree stem observation in the camera coordinate frame of the $i$th camera pose as $\mathbf{z}_{iq} = (x, y)^T$, where $x$ and $y$ correspond to the back-projected image coordinate along the $x$ and $z$ axes of the camera coordinate frame. The index $q \in \mathcal{Z}_i$, where $\mathcal{Z}_i$ is a set of indices denoting the observations of stems from the $i$th camera. Thus, $|\mathcal{Z}_i|$ is the number of tree stem observations in camera $i$ and $|\mathcal{Z}|$ is the number of camera poses in the graph. We use the notation $(i, q) \in \mathcal{Z}$ to index the $q$th observation in camera $i$.
Recalling that a camera pose in the globally aligned state vector is represented as $\hat{\mathbf{v}}_i = (\hat{x}, \hat{y}, \hat{\theta})^T$, we can transform the observations from the camera coordinate frame to the world frame with
$$\hat{\mathbf{z}}_{iq} = \mathbf{R}(\hat{\theta}_i)\, \mathbf{z}_{iq} + \hat{\mathbf{t}}_i, \quad \forall (i, q) \in \mathcal{Z}. \tag{48}$$
Given all tree stem observations in the global coordinate frame, we perform data association by clustering spatially similar observations. We use the density-based spatial clustering of applications with noise (DBSCAN) algorithm [61] to cluster the observations. DBSCAN takes two parameters and a distance function. For the distance function, we simply use the Euclidean distance. The two parameters correspond to the search radius and the minimum number of points required for a cluster. We use a search radius of 1 m and 10 as the minimum number of points in a cluster. The algorithm yields a label for each observation that specifies to which cluster observation $\hat{\mathbf{z}}_{iq}$ belongs. A label equal to zero denotes an outlier, i.e., an observation that does not belong to any cluster. We denote the set of unique labels as $\mathcal{L}$ and use the correspondence set $\mathcal{M}$ to specify that observation $\hat{\mathbf{z}}_{iq}$ is assigned to label $l \in \mathcal{L}$. The notation $\big((i, q), l\big) \in \mathcal{M}$ is used to denote that observation $\hat{\mathbf{z}}_{iq}$ is assigned to cluster $l$.
For each cluster of tree stem observations, we compute the center of the cluster by
$$\mathbf{m}_l = \frac{1}{\sum_{(i,q) \in \mathcal{Z}} \mathbb{1}_l} \sum_{(i,q) \in \mathcal{Z}} \mathbb{1}_l\, \hat{\mathbf{z}}_{iq}, \quad \forall l \in \mathcal{L}, \tag{49}$$
where $\mathbf{m}_l = (x, y)^T$ is the center of the cluster in the global coordinate frame, and the indicator $\mathbb{1}_l$ takes 1 when the observation $\hat{\mathbf{z}}_{iq}$ is assigned to cluster $l$ and zero otherwise. Given the cluster centers, we extend our state vector as
$$\hat{\mathbf{s}} = \big(\hat{\mathbf{v}}_1^T,\; \hat{\mathbf{v}}_2^T,\; \ldots,\; \hat{\mathbf{v}}_n^T,\; \mathbf{m}_1^T,\; \mathbf{m}_2^T,\; \ldots,\; \mathbf{m}_{|\mathcal{L}|}^T\big)^T. \tag{50}$$
Note that we omitted the global orientation parameter. We can reconcile this in the Hessian matrix and gradient vector by removing the first row and column in the Hessian and the first row in the gradient vector, and redefining $\mathbf{D}_{ik}$ as $\big[\mathbf{I}_2 \;\; \mathbf{0}\big]$. See Figure 4 for a graphical illustration of the augmented graph.
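The data association step can be sketched with an off-the-shelf DBSCAN implementation as follows; scikit-learn is used here purely for illustration, and it marks outliers with the label −1 rather than zero as in the text.
```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_stems(obs_world):
    """Cluster stem observations already transformed to the world frame by Eq. (48).
    The 1 m radius and 10-point minimum follow the values given in the text."""
    labels = DBSCAN(eps=1.0, min_samples=10).fit_predict(obs_world)
    centers = {}
    for l in set(labels) - {-1}:                            # -1 marks outliers in scikit-learn
        centers[l] = obs_world[labels == l].mean(axis=0)    # cluster center m_l, Eq. (49)
    return labels, centers
```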

3.2.2. Refinement

In order to find an optimal configuration of the new state vector, we must define an error function corresponding to the expected tree stem location that we initialized by computing the cluster centers. The error is simply the difference between the expected position of the stem $\mathbf{m}_l$ and the observation $\hat{\mathbf{z}}_{iq}$. Since the observations in the global coordinate frame are subject to change during optimization, we will define the error functions on the observations from the coordinate frame of the corresponding camera, i.e., $\mathbf{z}_{iq}$, as we did with the error functions defined in the previous section. Thus, the error function is defined as
$$\mathbf{r}_{il}\big(\hat{\mathbf{v}}_i, \mathbf{m}_l\big) = \mathbf{R}(\hat{\theta}_i)^T \big(\mathbf{m}_l - \hat{\mathbf{t}}_i\big) - \mathbf{z}_{iq}, \tag{51}$$
which leads to an additional term in the objective function,
$$\epsilon(\hat{\mathbf{s}}) = \sum_{ij} \mathbf{r}_{ij}^T \mathbf{Q}_{ij}\, \mathbf{r}_{ij} + \sum_{(i,k) \in \mathcal{C}} \mathbf{r}_{ik}^T \mathbf{Q}_k\, \mathbf{r}_{ik} + \sum_{((i,q),\, l) \in \mathcal{M}} \mathbf{r}_{il}^T \mathbf{Q}_{iq}\, \mathbf{r}_{il}. \tag{52}$$
The matrix $\mathbf{Q}_{iq}$ is the information matrix for the $q$th observation in camera $i$. Since the information matrix is the inverse of the covariance matrix, we can define $\mathbf{Q}_{iq}$ in terms of the expected measurement noise of the sensor. As we back projected the detected stem to the camera coordinate frame using parameters from the stereo camera, the information matrix will, in general, have relatively small values for stems detected far from the camera rig.
We proceed in the same manner as we did in the previous section by linearizing the error function and constructing the Jacobian to compute the Hessian matrix and gradient vector. The partial derivatives of the new error function can be written as
$$\mathbf{E}_{il} = \frac{\partial \mathbf{r}_{il}}{\partial \hat{\mathbf{v}}_i}\bigg|_{\mathbf{s} = \hat{\mathbf{s}}} = \begin{bmatrix} -\mathbf{R}(\hat{\theta}_i)^T & \dfrac{\partial \mathbf{R}(\hat{\theta}_i)^T}{\partial \hat{\theta}_i} \big(\mathbf{m}_l - \hat{\mathbf{t}}_i\big) \end{bmatrix}, \tag{53a}$$
$$\mathbf{F}_{il} = \frac{\partial \mathbf{r}_{il}}{\partial \mathbf{m}_l}\bigg|_{\mathbf{s} = \hat{\mathbf{s}}} = \mathbf{R}(\hat{\theta}_i)^T. \tag{53b}$$
$\mathbf{E}_{il}$ is a $2 \times 3$ matrix and $\mathbf{F}_{il}$ is a $2 \times 2$ matrix. These definitions can be used directly to compute the blocks within the sparse Hessian matrix,
$$\mathbf{H}_{il} = \begin{bmatrix} \mathbf{E}_{il}^T \mathbf{Q}_{iq}\, \mathbf{E}_{il} & \mathbf{E}_{il}^T \mathbf{Q}_{iq}\, \mathbf{F}_{il} \\ \mathbf{F}_{il}^T \mathbf{Q}_{iq}\, \mathbf{E}_{il} & \mathbf{F}_{il}^T \mathbf{Q}_{iq}\, \mathbf{F}_{il} \end{bmatrix}, \tag{54}$$
where the Hessian matrix is now a $(1 + 3n + 2|\mathcal{L}|) \times (1 + 3n + 2|\mathcal{L}|)$ matrix. Similarly, the entries to the gradient vector are
$$\mathbf{b}_{il} = \begin{bmatrix} \mathbf{E}_{il}^T \mathbf{Q}_{iq}\, \mathbf{r}_{il} \\ \mathbf{F}_{il}^T \mathbf{Q}_{iq}\, \mathbf{r}_{il} \end{bmatrix}. \tag{55}$$
The complete linear system is constructed by summing the Hessian matrices and gradient vectors for all the tree stem observations and adding them to the other Hessians and gradient vectors defined in Equation (45). Again, the first row and column of the Hessian, as well as the first row in the gradient vector, are removed to account for the omission of the global orientation parameter. We solve the linear system, $(\mathbf{H} + \lambda\mathbf{I})\, \Delta\mathbf{s} = \mathbf{b}$, using sparse Cholesky factorization and update the state vector by $\hat{\mathbf{s}} \leftarrow \hat{\mathbf{s}} + \Delta\mathbf{s}$. As we did for global alignment, we terminate the algorithm when $\|\Delta\mathbf{s}\|$ is less than some predetermined convergence threshold.
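A direct transcription of the stem error function in Equation (51), with the pose and cluster center taken from the augmented state vector, might look like this; it is provided only to make the refinement term concrete.
```python
import numpy as np

def stem_residual(v_hat, m_l, z_iq):
    """Stem error of Eq. (51): the expected stem position m_l is expressed in the
    frame of camera i and compared with the observation z_iq made from that camera."""
    x, y, th = v_hat
    R_T = np.array([[np.cos(th), np.sin(th)],
                    [-np.sin(th), np.cos(th)]])      # R(theta)^T
    return R_T @ (m_l - np.array([x, y])) - z_iq
```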

4. Analysis and Discussion

To test VO and the localization and mapping algorithms outlined above, we acquired a video sequence of a 1115 m path through a sparse ponderosa pine (Pinus ponderosa Douglas ex Lawson) forest in Western Montana. The video was captured by walking a hand-held 12 cm baseline ZED stereo camera [62] through the forest. The camera was operated at 10 frames per second and VGA resolution (480 × 640). Mounted on top of the camera field monitor was an antenna connected to a GlobalTop FGPMMOPA6H GPS module [63] that was queried for a GPS coordinate reading and a PDOP value after each video frame capture. The GPS coordinates and video frames were stored on an embedded backpack computer. Figure 5a shows the GPS coordinate readings projected on the UTM coordinate system after translating the position track by subtracting the first GPS coordinate from all coordinates.
We estimated camera egomotion using the direct VO algorithm presented in Section 2. We used a camera resolution of 240 × 320 to estimate egomotion and a 25% gradient magnitude threshold during pixel selection. Frame-to-frame egomotion parameters were composed to construct an odometry track using Equation (28). Figure 5b shows the visual odometry position track after converting the translation components of the egomotion parameters to meters. Figure 6 shows the GPS track (black line) and the optimized VO track after global alignment and local refinement (red line). The blue dashed line in Figure 6 shows the VO track after only applying the optimized global orientation parameter. As the figure suggests, the VO position track, although locally consistent, is subject to drift after approximately 300 m.

4.1. Localization

In this section, we demonstrate the performance of GPS and VO integrated localization under various scenarios of degraded and intermittent GPS reception. We refer to the optimized position track shown in Figure 6 (red line) as the ground truth position track. We acknowledge that this track has not actually been ground truthed with survey grade equipment; however, this track is deemed optimal, given the available data.

4.1.1. Degraded GPS Reception

We simulated degraded GPS reception by adding zero mean Gaussian noise with a standard deviation of 5 m to each observed GPS coordinate. The incremental update to the state vector and the current estimate of the state converged after 15 LM iterations. Figure 7 shows six snapshots during optimization. The global orientation parameter converged after the 6th iteration. Figure 8 shows the converged path after global alignment and local refinement with a root mean squared error (RMSE) of less than 0.1 m compared to the ground truth position track. This suggests that, as long as GPS errors are Gaussian distributed, we can expect accurate position tracking.
We also tested robustness to non-Gaussian distributed noise in GPS readings. We randomly selected 5% of the coordinates from the GPS track and added uniformly distributed noise with a range of 0 to 200 m. Figure 9 shows the converged path after global alignment and local refinement with non-Gaussian distributed noise in GPS coordinates plotted over the ground truth path (RMSE < 0.1 m). According to this result, the algorithm is insensitive to GPS coordinate outliers.

4.1.2. Intermittent GPS Reception

When GPS is used under an extremely dense forest canopy, reception might only be intermittently reliable, such as when the receiver is stationary for long periods of time or when the receiver crosses openings in the canopy. We simulated this scenario by extracting 12 coordinates from the GPS track; we extracted the starting position, the ending position, and 10 randomly selected coordinates along 100 m intervals of the track. Figure 10 shows six snapshots during optimization with 12 intermittent GPS coordinate readings. The algorithm converged after 21 iterations. Following local refinement, the optimized path had an RMSE of 2.7 m compared to the ground truth position track in Figure 11. This is an important application of the proposed algorithm in situations where GPS reception is only available intermittently.

4.2. Mapping

In order to generate a stem map, we performed global alignment using all the coordinates from the GPS track and the VO path shown in Figure 5a,b. Figure 12 shows the result after clustering the observed tree stem positions. The black dots in the figure denote the cluster centers, and the ellipses around each center show the covariance matrix of the clusters at four standard deviations. There was a total of 6205 tree stem observations and 140 clusters, i.e., individual tree stems. Figure 13 shows the optimized cluster centers, i.e., tree stem positions, and the covariance structure of the observations at four standard deviations. Local refinement converged after nine iterations. Note that the ellipses representing the covariances of the clusters are smaller after local refinement. This is a result of simultaneously optimizing the position track and the cluster centers.
To test the accuracy of the stem map generated from the local refinement step, we collected a ground truth stem map of the 12-acre forest from which we acquired the video and GPS track. We obtained global coordinates for each individual tree in the stand using a TruePulse 360B laser range finder and a 13-bit BEI industrial rotary encoder mounted on a tripod. We installed a reflector target near the center of the stand, at which a GPS coordinate was acquired using a mapping-grade receiver. We mapped subsections of the stand by first aligning the laser and encoder to north, using the appropriate declination for the area, and then recording the distance and angle to the target to globally localize the plot center. We recorded the distance and angle to each individual tree within view from the plot center. We repeated this process, moving clockwise around the reflector target and mapping small subsections of the stand, until the entire stand was mapped.
Figure 14 shows the ground truth stem map (green triangles), the predicted tree stem positions from the camera (red circles), and the optimized position track (yellow line) superimposed on an aerial photograph of the stand. We manually associated each observed tree stem position from the camera with its corresponding ground truth tree stem. Among the 140 predicted tree stem positions, we classified 2 observations as false positives, i.e., predicted tree stems that correspond to a tree in the aerial photograph that was not recorded during the collection of the ground truth stem map, and 7 tree stems as duplicate observations, i.e., observed tree stems that belong to the same ground truth stem. For all predicted and ground truth stem correspondences ($n = 140$), we calculated an RMSE of 2.16 m. This is a 46% improvement over the best-case expected accuracy of a consumer-grade GPS device under forest canopy (4 m).

5. Conclusions

In this paper, we presented a real-time algorithm for VO estimation and a novel method for integrating VO and a vision-based tree stem detection system with GNSS positioning to generate accurate stem maps. Our approach is based on existing and widely applied optimization techniques, namely Gauss–Newton (GN) and Levenberg–Marquardt (LM) non-linear least squares, which provide efficient solution procedures for the optimization problems formulated in this paper.
Although GNSS positioning maintains a globally consistent position track, the heading of the camera is subject to drift; global positioning alone cannot be used to infer the camera heading. This issue did not appear to be significant in our dataset; however, we expect the camera heading to eventually drift since it is unconstrained and estimated only from visual egomotion. The issue can be resolved by incorporating a global direction sensor, such as a magnetometer, to maintain a globally consistent heading. Including directional data in the global alignment optimization step requires only minimal modifications to the presented algorithm. Since electronic direction sensors are relatively inexpensive, we recommend using one for large-scale mapping applications.
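One simple way to include such directional data is to add, for each compass reading, a wrapped angular residual between the estimated pose heading and the measured heading to the global alignment cost. The sketch below shows only that residual and is an illustrative assumption, not the formulation used in this paper.

```python
import numpy as np

def heading_residual(pose_theta, measured_heading):
    """Angular residual between an estimated camera heading and a magnetometer
    reading, wrapped to (-pi, pi] so the least squares cost stays smooth
    across the +/- pi boundary."""
    err = pose_theta - measured_heading
    return np.arctan2(np.sin(err), np.cos(err))
```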
The successful integration of real-time localization and mapping into forestry practices presents new opportunities for enhancing decision-making processes and implementing complex silvicultural prescriptions. This research provides a practical, low-cost solution that addresses the need for accurate and efficient positioning and mapping in forested environments.

6. Patents

US Patent No. US011481972B2: Method of performing dendrometry and forest mapping.

Author Contributions

Conceptualization, L.A.W. and W.C.; methodology, L.A.W. and W.C.; software, L.A.W.; validation, L.A.W. and W.C.; formal analysis, L.A.W.; investigation, L.A.W. and W.C.; resources, L.A.W.; data curation, L.A.W.; writing—original draft preparation, L.A.W.; writing—review and editing, L.A.W. and W.C.; visualization, L.A.W.; supervision, W.C.; project administration, W.C.; funding acquisition, W.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the U.S. Forest Service National Technology and Development Program under contract number 16CS-1113-8100-017.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

We gratefully acknowledge the invaluable assistance of Jenny Perth from the USDA Forest Service’s National Technology and Development Program, whose diligent efforts in collecting validation data significantly contributed to the success of this research.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

In this appendix, we provide an overview of the mathematical foundations of the special orthogonal and special Euclidean Lie groups. We present the closed-form exponential and logarithm maps for these groups and their corresponding Lie algebras. Although these derivations are available in standard textbooks, we summarize the essential concepts and formulas here as a convenient reference for the reader.
The special orthogonal Lie groups SO(2) and SO(3) represent rotations in 2 and 3 dimensions, respectively. In SO(2), we represent a transformation by a 2 × 2 matrix, R ∈ SO(2), and its Lie algebra by θ ∈ so(2). The closed-form exponential map that converts the Lie algebra, so(2), to the Lie group, SO(2), is given by
\[
R = \exp(\theta_\times) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \in \mathrm{SO}(2),
\tag{A1}
\]
where θ× denotes a skew-symmetric matrix generated from θ. The inverse of the exponential map is given by the logarithm map that brings the transformation matrix back to the algebra,
\[
\theta = \ln R = \tan^{-1}\!\left(\frac{r_{21}}{r_{11}}\right) \in \mathfrak{so}(2),
\tag{A2}
\]
where r21 and r11 are entries in R. In practice, the tan⁻¹ function is replaced by the two-argument inverse tangent function arctan2(y, x) for appropriate quadrant checking.
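A minimal NumPy sketch of the SO(2) exponential and logarithm maps in Equations (A1) and (A2); this is illustrative code, not code from the described system.

```python
import numpy as np

def so2_exp(theta):
    """Exponential map so(2) -> SO(2) of Equation (A1)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s,  c]])

def so2_log(R):
    """Logarithm map SO(2) -> so(2) of Equation (A2), using arctan2 for
    appropriate quadrant checking."""
    return float(np.arctan2(R[1, 0], R[0, 0]))
```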
In the case of 3-dimensional rotations, we represent the transformation as a 3 × 3 matrix R ∈ SO(3) and its algebra as ω ∈ so(3), where ω = (α, β, γ)^T is a vector encoding the rotation angles about the x, y, and z axes. The closed-form exponential is given by
\[
R = \exp(\boldsymbol{\omega}_\times) = I + \frac{\sin\vartheta}{\vartheta}\,\boldsymbol{\omega}_\times + \frac{1-\cos\vartheta}{\vartheta^{2}}\,\boldsymbol{\omega}_\times^{2} \in \mathrm{SO}(3),
\tag{A3}
\]
where ϑ = √(ω^T ω) and ω× is defined as
\[
\boldsymbol{\omega}_\times = \begin{bmatrix} 0 & -\gamma & \beta \\ \gamma & 0 & -\alpha \\ -\beta & \alpha & 0 \end{bmatrix}.
\tag{A4}
\]
Equation (A3) is Rodrigues’ formula for rotating a vector in space, given an axis and a rotation angle. The logarithm map that brings the transformation matrix back to the algebra is given by
\[
\vartheta = \cos^{-1}\!\left(\frac{\operatorname{Tr}(R)-1}{2}\right), \qquad
\boldsymbol{\omega}_\times = \ln R = \frac{\vartheta}{2\sin\vartheta}\left(R - R^{T}\right) \in \mathfrak{so}(3).
\tag{A5}
\]
The algebra is taken from the off-diagonal components of ω×, i.e., skew-symmetric matrix to vector. The notation Tr(·) is the trace function of a square matrix, which returns the sum of the elements along the diagonal.
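The SO(3) maps of Equations (A3)–(A5) can be sketched in the same way. The small-angle guards below are an implementation detail we add to avoid division by zero; they are not part of the closed-form expressions.

```python
import numpy as np

def skew(w):
    """Skew-symmetric matrix of Equation (A4)."""
    a, b, g = w
    return np.array([[0.0, -g,   b],
                     [g,   0.0, -a],
                     [-b,  a,   0.0]])

def so3_exp(w):
    """Rodrigues' formula, Equation (A3)."""
    t = np.linalg.norm(w)
    if t < 1e-10:                       # small-angle guard
        return np.eye(3)
    W = skew(w)
    return np.eye(3) + (np.sin(t) / t) * W + ((1.0 - np.cos(t)) / t**2) * (W @ W)

def so3_log(R):
    """Logarithm map of Equation (A5); returns the rotation vector omega."""
    t = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if t < 1e-10:
        return np.zeros(3)
    W = (t / (2.0 * np.sin(t))) * (R - R.T)
    return np.array([W[2, 1], W[0, 2], W[1, 0]])   # skew matrix to vector
```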
In both the 2D and 3D cases, the column vectors of the matrices representing the transformations are orthogonal. Thus, the inverse is equivalent to the matrix transpose,
\[
R^{-1} = R^{T} \in \mathrm{SO}(2), \qquad R^{-1} = R^{T} \in \mathrm{SO}(3).
\]
Rigid transformations in 2 and 3 dimensions are represented by the special Euclidean Lie groups SE(2) and SE(3). By definition, these groups include rotations from the special orthogonal groups described in the previous section,
\[
\mathrm{SE}(2) \overset{\mathrm{def}}{=} \mathrm{SO}(2) \times \mathbb{R}^{2}, \qquad
\mathrm{SE}(3) \overset{\mathrm{def}}{=} \mathrm{SO}(3) \times \mathbb{R}^{3}.
\]
A rigid transformation in 2D takes the form
\[
T = \begin{bmatrix} R \in \mathrm{SO}(2) & \mathbf{t} \in \mathbb{R}^{2} \\ \mathbf{0}^{T} & 1 \end{bmatrix} \in \mathrm{SE}(2).
\]
This matrix is homogeneous for compatibility with inversion and composition operations. We represent the parameter space, i.e., the Lie algebra, as ξ = (θ, x, y)^T = (θ, u^T)^T ∈ se(2). The exponential map is given in closed form as
\[
T = \exp(\hat{\boldsymbol{\xi}}) = \begin{bmatrix} \exp(\theta_\times) & V\mathbf{u} \\ \mathbf{0}^{T} & 1 \end{bmatrix} = \begin{bmatrix} R & \mathbf{t} \\ \mathbf{0}^{T} & 1 \end{bmatrix} \in \mathrm{SE}(2),
\]
where exp(θ×) is defined in Equation (A1) and
\[
V = \frac{1}{\theta}\begin{bmatrix} \sin\theta & -(1-\cos\theta) \\ 1-\cos\theta & \sin\theta \end{bmatrix}.
\]
The logarithm map that brings the transformation matrix back to the algebra is given by
\[
\boldsymbol{\xi} = \begin{bmatrix} \theta \\ V^{-1}\mathbf{t} \end{bmatrix} \in \mathfrak{se}(2),
\]
where θ = ln(R) is defined in Equation (A2) and V⁻¹ is defined as
\[
\vartheta = \frac{\sin\theta}{\theta}, \qquad
\phi = \frac{1-\cos\theta}{\theta}, \qquad
V^{-1} = \frac{1}{\vartheta^{2}+\phi^{2}}\begin{bmatrix} \vartheta & \phi \\ -\phi & \vartheta \end{bmatrix}.
\]
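A corresponding sketch of the SE(2) exponential and logarithm maps, again with small-angle guards added purely for numerical safety (an implementation assumption beyond the closed forms above):

```python
import numpy as np

def se2_exp(xi):
    """Exponential map se(2) -> SE(2) for xi = (theta, x, y)."""
    theta, u = xi[0], np.asarray(xi[1:3], dtype=float)
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    if abs(theta) < 1e-10:
        V = np.eye(2)                                  # limit of V as theta -> 0
    else:
        V = (1.0 / theta) * np.array([[s, -(1.0 - c)],
                                      [1.0 - c, s]])
    T = np.eye(3)
    T[:2, :2] = R
    T[:2, 2] = V @ u
    return T

def se2_log(T):
    """Logarithm map SE(2) -> se(2) using the closed-form inverse of V."""
    theta = np.arctan2(T[1, 0], T[0, 0])
    if abs(theta) < 1e-10:
        Vinv = np.eye(2)
    else:
        a = np.sin(theta) / theta
        b = (1.0 - np.cos(theta)) / theta
        Vinv = np.array([[a, b], [-b, a]]) / (a * a + b * b)
    return np.concatenate(([theta], Vinv @ T[:2, 2]))
```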
Rigid transformations in 3-dimensional space are written similarly to our definition for SE(2) but with 3D rotation matrices and 3-vector translations,
\[
T = \begin{bmatrix} R \in \mathrm{SO}(3) & \mathbf{t} \in \mathbb{R}^{3} \\ \mathbf{0}^{T} & 1 \end{bmatrix} \in \mathrm{SE}(3).
\]
We use the variable ξ to represent the algebra, ξ = (ω^T, u^T)^T ∈ se(3), where ω = (α, β, γ)^T and u = (x, y, z)^T. The exponential map again can be computed in closed form,
\[
\exp(\hat{\boldsymbol{\xi}}) = \exp\!\begin{bmatrix} \boldsymbol{\omega}_\times & \mathbf{u} \\ \mathbf{0}^{T} & 0 \end{bmatrix} = \begin{bmatrix} R & V\mathbf{u} \\ \mathbf{0}^{T} & 1 \end{bmatrix} \in \mathrm{SE}(3),
\]
where R, defined in Equation (A3) and restated below for convenience, and V are given by
\[
\theta = \sqrt{\boldsymbol{\omega}^{T}\boldsymbol{\omega}}, \qquad
\vartheta = \frac{\sin\theta}{\theta}, \qquad
\phi = \frac{1-\cos\theta}{\theta^{2}}, \qquad
\psi = \frac{1-\vartheta}{\theta^{2}},
\]
\[
R = I + \vartheta\,\boldsymbol{\omega}_\times + \phi\,\boldsymbol{\omega}_\times^{2}, \qquad
V = I + \phi\,\boldsymbol{\omega}_\times + \psi\,\boldsymbol{\omega}_\times^{2}.
\]
Finally, the logarithm map is written as
\[
\boldsymbol{\xi} = \ln(T) = \begin{bmatrix} \ln(R) \\ V^{-1}\mathbf{t} \end{bmatrix} \in \mathfrak{se}(3).
\]
The logarithm map of the rotation part is given by Equation (A5), and the inverse of V is defined as
\[
V^{-1} = I - \frac{1}{2}\,\boldsymbol{\omega}_\times + \frac{1}{\theta^{2}}\left(1 - \frac{\vartheta}{2\phi}\right)\boldsymbol{\omega}_\times^{2}.
\]
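The SE(3) maps follow the same pattern. The sketch below reuses skew() and so3_log() from the SO(3) sketch above and adds small-angle guards, which are an implementation assumption rather than part of the closed forms.

```python
import numpy as np

def se3_exp(xi):
    """Exponential map se(3) -> SE(3) for xi = (omega, u) as a length-6 vector."""
    w, u = np.asarray(xi[:3], dtype=float), np.asarray(xi[3:6], dtype=float)
    t = np.linalg.norm(w)
    if t < 1e-10:
        R, V = np.eye(3), np.eye(3)
    else:
        W = skew(w)
        vt = np.sin(t) / t                   # sin(theta)/theta
        phi = (1.0 - np.cos(t)) / t**2
        psi = (1.0 - vt) / t**2
        R = np.eye(3) + vt * W + phi * (W @ W)
        V = np.eye(3) + phi * W + psi * (W @ W)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ u
    return T

def se3_log(T):
    """Logarithm map SE(3) -> se(3) using the closed-form inverse of V."""
    R, trans = T[:3, :3], T[:3, 3]
    w = so3_log(R)
    t = np.linalg.norm(w)
    if t < 1e-10:
        Vinv = np.eye(3)
    else:
        W = skew(w)
        vt = np.sin(t) / t
        phi = (1.0 - np.cos(t)) / t**2
        Vinv = np.eye(3) - 0.5 * W + (1.0 / t**2) * (1.0 - vt / (2.0 * phi)) * (W @ W)
    return np.concatenate((w, Vinv @ trans))
```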
Composition of two transformations takes the same form for SE(2) and SE(3),
\[
T_1 T_2 = \begin{bmatrix} R_1 R_2 & R_1\mathbf{t}_2 + \mathbf{t}_1 \\ \mathbf{0}^{T} & 1 \end{bmatrix},
\]
and inverting a transformation is given by
\[
T^{-1} = \begin{bmatrix} R^{T} & -R^{T}\mathbf{t} \\ \mathbf{0}^{T} & 1 \end{bmatrix}.
\]
As a result, composition with an inverted transformation can be written in a single operation as
\[
T_1 T_2^{-1} = \begin{bmatrix} R_1 R_2^{T} & \mathbf{t}_1 - R_1 R_2^{T}\mathbf{t}_2 \\ \mathbf{0}^{T} & 1 \end{bmatrix}.
\]
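These composition and inversion rules translate directly into code; the sketch below works for both the 3 × 3 SE(2) and 4 × 4 SE(3) homogeneous forms.

```python
import numpy as np

def compose(T1, T2):
    """Composition T1 * T2 of homogeneous transforms (3x3 for SE(2), 4x4 for SE(3))."""
    return T1 @ T2

def invert(T):
    """Closed-form inverse [R^T, -R^T t; 0, 1], valid because R is orthogonal."""
    n = T.shape[0] - 1
    Tinv = np.eye(n + 1)
    Tinv[:n, :n] = T[:n, :n].T
    Tinv[:n, n] = -T[:n, :n].T @ T[:n, n]
    return Tinv

# Relative transform between two poses in a single expression, T1 * T2^{-1}:
# rel = compose(T1, invert(T2))
```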

Figure 1. Steepest descent images; each image corresponds to a column in the Jacobian matrix.
Figure 2. Gradient magnitude-based pixel selection; selected pixels shown in red. Subfigure captions indicate the number of selected pixels for 240 × 320 resolution images. (a) Dense (≈65,000 px); (b) 75% (≈57,600 px); (c) 50% (≈38,400 px); (d) 25% (≈19,200 px).
Figure 3. Graphical model of global alignment. Circles denote nodes (vertices) and lines denote edges (constraints).
Figure 4. Graphical model of local refinement. Circles denote nodes (vertices) and lines denote edges (constraints). Odometry nodes are represented as v, GPS nodes as g, and tree stem positions as m.
Figure 5. GPS position track projected on a UTM coordinate system (a). VO position track (b). The blue square denotes the starting position and the red triangle denotes the end position.
Figure 6. GPS position track denoted by the black solid line. Optimized position track denoted by the red solid line. VO track rotated by the global orientation parameter denoted by the blue dashed line.
Figure 7. Global alignment with degraded GPS reception. Each subfigure shows a snapshot during optimization.
Figure 8. Converged globally aligned path with degraded GPS reception.
Figure 9. Converged globally aligned path with non-Gaussian GPS errors.
Figure 10. Global alignment with intermittent GPS reception. Each subfigure is a snapshot during optimization.
Figure 11. Converged globally aligned path with intermittent GPS reception.
Figure 12. Tree stem position averages (black dots), tree stem position covariances (black ellipses), and pose track (red line) before local refinement.
Figure 13. Tree stem position averages (black dots), tree stem position covariances (black ellipses), and pose track (red line) after local refinement.
Figure 14. Ground truth tree stem positions and estimated tree stem positions superimposed on an aerial photograph of the 12 ac forest stand. Green triangles denote ground truth stem positions, red circles show the estimated stem positions from the camera, the yellow line denotes the optimized position track, and the blue lines show associations between observed and estimated tree positions.