Robotics and autonomous systems are increasingly used as efficient data-gathering tools in environmental research and management and have the potential to significantly improve our capacity to monitor a wide range of environmental systems at large spatial and temporal scales [1]. In the specific context of river monitoring, autonomous survey vessels measuring physical attributes, such as water depth, velocity and discharge, can contribute towards a more efficient implementation of environmental assessments for flood protection, water supply and ecological restoration (as required by some legislation [2]). Conventionally, these physical river attributes are measured through hand-held mechanical meters or active acoustic sensors deployed from tethered, radio-controlled or manned platforms [4].
A key challenge to the development of autonomous river monitoring vessels is the accurate localisation of the vessel itself, required both for navigation and for spatial referencing (position error <1 m) of the collected data to allow for accurate data analysis [7]. This is complicated by the natural occurrence of bank-side vegetation and urban settlement that preclude the conventional use of global positioning via line-of-sight global navigation satellite systems (GNSS; also commonly known as the Global Positioning System (GPS) when limited to the space and control segments operated by the United States Air Force) in many locales of interest. In such areas, the vessel location relative to its surrounding environment can be estimated based on range-measuring on-board sensors, such as lasers, sonar or cameras [10]. Localisation based on cameras is particularly attractive because of their relatively low cost and light weight, and because they retrieve the appearance, colour and texture of the environment, enabling the integration of high-level tasks such as ecological feature classification [11].
The process of estimating the ego-motion of a robot or vehicle using the input of one or multiple attached cameras is known as visual odometry [13], and the first description of vehicle navigation based solely on visual information reaches back to the work of [14] in 1980. The incremental pose estimate between camera frames is obtained from the change in the recorded images induced by motion, and the technique relies on sufficient illumination in the environment, a static, textured scene and sufficient scene overlap between consecutive frames [15]. Based on the sensor used, monocular and stereo visual odometry can be distinguished. In this study, we only consider the latter, because techniques based on a single camera rely on additional measurements, further on-board sensors or motion constraints in order to recover the absolute scale of the camera motion (scale ambiguity problem) [13]. In contrast, stereo vision using calibrated cameras with a known baseline allows for the extraction of depth information from every recorded frame through triangulation. Existing visual odometry algorithms can be categorised into feature-based (sparse) and appearance-based (dense) techniques [15]. The former estimate camera poses based on the displacement of a sparse set of salient features that are detected and matched across subsequent images. These techniques involve the projection of feature points from the (2D) image domain to the (3D) real-world domain. The pose increment between frames is then commonly computed by minimising the differences between corresponding 3D feature locations from subsequent frames (absolute orientation methods) or by minimising the error in the re-projection of the transformed 3D features into the image domain (perspective-n-point methods). For robustness against outliers, these optimisation procedures are frequently wrapped into a random sample consensus (RANSAC) scheme [18]. Using stereo cameras, sparse visual odometry has been demonstrated on aerial and ground vehicles in a range of settings, including outdoor urban environments [19], rough terrain [17] and even extraterrestrial terrain [26]. For example, on a 9 km-long trajectory with a motorcar in rough terrain, [17] achieved a root mean square error (RMSE) in the 3D position of 45.74 m (0.49% of the trajectory), which was reduced to 4.09 m (0.04% of the trajectory) by integrating the visual odometry with angular motion estimates from an inertial measurement unit (IMU).
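To make the absolute orientation step concrete, the following is a minimal sketch (not the implementation used in any of the cited works) of recovering the rigid transform between two sets of triangulated 3D feature points with the SVD-based Kabsch/Horn method; the stereo triangulation helper assumes a rectified pinhole pair with known focal length and baseline, and all function names are illustrative.

```python
import numpy as np

def triangulate_depth(disparity, focal_px, baseline_m):
    """Depth from stereo disparity for a rectified pair: Z = f * B / d."""
    return focal_px * baseline_m / disparity

def absolute_orientation(P_ref, P_tgt):
    """SVD-based (Kabsch/Horn) estimate of R, t minimising ||R p + t - q||
    over corresponding 3D points p in P_ref and q in P_tgt (both (N, 3))."""
    c_ref = P_ref.mean(axis=0)
    c_tgt = P_tgt.mean(axis=0)
    # 3x3 cross-covariance of the centred point sets.
    H = (P_ref - c_ref).T @ (P_tgt - c_tgt)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection (det = -1) solution.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = c_tgt - R @ c_ref
    return R, t
```

In practice this fit is wrapped in a RANSAC loop: transforms are repeatedly estimated from minimal three-point samples and the hypothesis with the largest inlier set is refined on all inliers.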
Dense visual odometry, on the other hand, avoids the potentially error-prone feature extraction and matching and instead estimates the camera motion with a direct model that involves the dense set of pixels for which depth information is available. The underpinning idea of this technique is that, after the camera has moved from a reference to a target frame, the re-projection of the dense cloud of previously-extracted and transformed 3D points onto the image plane yields a deformed or warped intensity image of the target frame. The solution is the camera pose increment that minimises a cost function based on the differences between the pixel intensities of the warped target frame and the reference frame. This optimisation has also been described as photo-consistency maximisation [27]. In [28], it was argued that minimising a cost function directly based on the image measurement (pixel intensities) avoids the systematic propagation of feature extraction and matching errors, reducing the resulting drift in the camera pose estimate. Dense visual odometry has found increased application with consumer depth cameras offering co-registered colour and depth imagery (RGB-D) in indoor environments [27], but has also been applied with monocular [30] and stereo cameras in urban settings [31]. For a 220 m-long loop trajectory with a motorcar in a city, [32] reported an RMSE of 1.37 m (0.6% of the trajectory).
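The warping-based cost described above can be sketched as follows. This is an illustrative example only, assuming a pinhole camera with intrinsic matrix K and nearest-neighbour sampling; real systems use bilinear interpolation, robust weighting and coarse-to-fine image pyramids, and the function name is hypothetical.

```python
import numpy as np

def photometric_cost(I_ref, depth_ref, I_tgt, R, t, K):
    """Sum of squared intensity differences after warping every reference
    pixel with known depth into the target frame through the pose (R, t)."""
    h, w = I_ref.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Back-project reference pixels to 3D using the dense depth map.
    Z = depth_ref
    X = (u - K[0, 2]) * Z / K[0, 0]
    Y = (v - K[1, 2]) * Z / K[1, 1]
    P = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)
    # Transform into the target camera frame and project with the pinhole model.
    Q = P @ R.T + t
    up = K[0, 0] * Q[:, 0] / Q[:, 2] + K[0, 2]
    vp = K[1, 1] * Q[:, 1] / Q[:, 2] + K[1, 2]
    ui = np.round(up).astype(int)
    vi = np.round(vp).astype(int)
    # Discard points behind the camera or warped outside the image.
    valid = (Q[:, 2] > 0) & (ui >= 0) & (ui < w) & (vi >= 0) & (vi < h)
    residual = I_tgt[vi[valid], ui[valid]] - I_ref.reshape(-1)[valid]
    return np.sum(residual.astype(float) ** 2)
```

The pose increment is the (R, t) that minimises this cost, typically found with Gauss-Newton or Levenberg-Marquardt iterations on the linearised residual.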
Previous studies assessing visual odometry in the inland waterway environment are rare and have focused exclusively on feature-based techniques, predominantly for aerial vehicles navigating a few metres above the water surface [33]. Stereo visual odometry in a river environment has been implemented in [34], who propose fusing a classic feature-based technique with inertial measurements from gyroscopes and accelerometers and intermittent readings from a GPS device through a graph-based optimisation to correct for unbounded position drift. Although the system was designed with a focus on aerial vehicles, tests were conducted on a manned floating platform. They report a consistent under-estimation of the platform translation (by 10% on average) due to a lack of features at close range, a problem specific to certain river environments where structure is limited to the river banks. After correcting for this bias, the system (visual odometry, IMU, sparse GPS) is shown to achieve a mean position error of 5 m over a 2 km traverse. In [36], it was argued that the limited reliability of existing visual odometry algorithms prevents these methods from being used for on-board guidance of a fully-autonomous vehicle in challenging environments. The authors emphasise the need for an increased understanding of the effect of covariates related to vehicle kinematics and scenery on the performance of existing visual odometry algorithms, in order to guide the development of more robust techniques. The challenges to robust visual odometry in the inland waterway environment arise from (i) a landscape structure that differs from that of indoor settings and urban environments, so that a specific structure of the scenery, such as orthogonality constraints and the presence of distinct corner features, cannot be assumed a priori, and (ii) the platform kinematics, which are specific to the respective environmental monitoring application.
Guided survey vessels equipped with active acoustic sensors to measure river discharge or to characterise the hydrodynamics of river cross-sections follow a distinct sampling strategy involving the repeated crossing of a lateral river section [4] (commonly four times or more). The vehicle operation differs from that covered in visual odometry assessment datasets in the automotive context (e.g., [39]) in its very low speed (ideally less than or equal to the average total water velocity [4]) and large changes in yaw (often with no translational motion) at the beginning and end of crossings. The latter have been shown to be potentially detrimental to the accuracy of feature-based visual odometry due to motion blur and degeneration of the linear system used to calculate the fundamental matrix [37]. Furthermore, dense visual odometry has been shown to be susceptible to errors from large camera orientation changes [32]. In addition to cross-sectional measurements, radio-controlled survey platforms and acoustic sensors are increasingly used for surveying the river bed topography (bathymetry) and the spatial distribution of water velocities in continuous, spatially-dense sampling trajectories over small areas of interest, such as near river engineering structures [5], or over river reaches of several kilometres in length [40]. The sceneries encountered in such applications can be dominated by feature-rich but repetitive vegetated river banks, distant features (e.g., with the cameras pointing directly up- or downstream on a wide river), reflections from the water surface or feature-poor engineered river structures, such as piers (see Figure 1). To be suitable for river monitoring platforms, a visual odometry system should be sufficiently robust and fast (real time) to allow for autonomous navigation given the mentioned variety of sceneries and distinct vehicle kinematics, and should enable spatial data referencing at accuracies similar to good-quality differentially-corrected GPS in order to meet common surveying standards [7].
Figure 1. Full (orange) and discharge measurement (blue) trajectory with exemplary left intensity image samples.
In this paper, we examine the use of unaided (i.e., without integrating other sensors) visual odometry for navigating an autonomous watercraft following typical trajectories for sampling water depth, velocities and river discharge. Our study is unique because it (i) focuses on real-world river monitoring applications using established survey vessels and statistically robust data sampling strategies, (ii) assesses both feature-based and appearance-based visual odometry approaches, (iii) introduces a technique for ground-truthing position estimates at accuracies of a few centimetres in truly GNSS-denied outdoor environments based on an electronic theodolite integrated with an electronic distance meter (EDM) and tracking capability (Total Station) and (iv) quantifies the error contribution of covariates related to river environment scenery and platform kinematics through multiple linear regression analysis. Thereby, we contribute to the cross-disciplinary application of techniques from the domain of mobile robotics and address the need for an increased understanding and subsequent improvement of the reliability of existing visual odometry algorithms when applied to real-world applications in challenging environments [37].
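As an illustration of point (iv), such a covariate analysis can be set up as an ordinary least-squares fit of per-segment position error on kinematic and scenery covariates. This is a generic sketch, not the exact model of this study; the example covariates named in the comments (speed, yaw rate, feature count) are illustrative.

```python
import numpy as np

def fit_error_model(covariates, errors):
    """Multiple linear regression (OLS with intercept) of visual odometry
    position error on covariates such as platform speed, yaw rate and the
    number of tracked features. covariates: (N, k); errors: (N,).
    Returns the (k + 1,) coefficient vector [intercept, slopes...]."""
    X = np.column_stack([np.ones(len(errors)), covariates])
    beta, *_ = np.linalg.lstsq(X, errors, rcond=None)
    return beta
```

The fitted slopes indicate how strongly each covariate contributes to the odometry error, which is the kind of evidence needed to guide the development of more robust algorithms.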