Regressed Terrain Traversability Cost for Autonomous Navigation Based on Image Textures

Abstract: The exploration of remote, unknown, rough environments by autonomous robots strongly depends on the ability of the on-board system to build an accurate predictor of terrain traversability. Terrain traversability prediction can be made more cost efficient by using texture information of 2D images obtained by a monocular camera. In cases where the robot is required to operate on a variety of terrains, it is important to consider that terrains sometimes contain spiky objects that appear as non-uniform in the texture of terrain images. This paper presents an approach to estimate the terrain traversability cost based on terrain non-uniformity detection (TNUD). Terrain images undergo a multiscale analysis to determine whether a terrain is uniform or non-uniform. Terrains are represented using a texture and a motion feature computed from terrain images and acceleration signal, respectively. Both features are then combined to learn independent Gaussian Process (GP) predictors, and consequently, predict vibrations using only image texture features. The proposed approach outperforms conventional methods relying only on image features without utilizing TNUD.


Background
The autonomous exploration of unstructured environments by mobile robots is witnessing increased interest, as it helps to accomplish diverse tasks such as search and rescue, surveying and data collection (e.g., [1] with unmanned ground vehicles and [2] with unmanned aerial vehicles), and surveillance. Performing these missions with autonomous mobile robots presents several advantages. It allows the avoidance of human intervention in hazardous areas [3] and overcomes difficulties that may arise with remotely controlled mobile robots, such as a loss of connectivity or degraded latency. With robots being utilized in a wide spectrum of unstructured environments, the need for systems that build reliable representations of the environment is critical. Therefore, the estimation of the traversability of terrains is of great importance. In the so-called model-based approach, traversability is estimated based on mechanical analysis or simulation (e.g., [4,5]). However, this approach requires detailed knowledge of the terrain properties and incurs high computational complexity [6]; hence, it is not easily applicable to certain environments. The other approach, which is becoming more popular, is to rely on machine learning techniques based on many data samples.
While supervised learning methods have proven to be efficient (e.g., [7][8][9]), there are several drawbacks to such methods. Manually labeling a large amount of data to train the classifier is tedious and subject to inconsistency. As a result, this approach is not flexible, as it is unable to adapt to dynamic environments without additional data annotation and training. As an alternative, self-supervised learning, which uses proprioceptive sensor information, such as inertial data, as label information to determine whether a terrain is traversable, has been intensively investigated [10]. The self-supervised approach, also called the near-to-far strategy [11,12], is promising for making mobile robots adaptive to a variety of environments.

Related Works on Unsupervised Learning Approaches to Traversability Prediction
As an input for the traversability prediction, LiDAR (either 2D [13] or 3D [9,14]) and stereo camera images [15][16][17] have been widely used. Shape attributes of the ground have been extensively investigated for traversability estimation. The idea is to use 3D environment analysis methods and chart the mobility smoothness in a hierarchical manner. Based on this, 3D Ladar information has been clustered according to scatterness (grass), surfaceness (ground), and linearness (tree trunks) for terrain modeling [18].
Although laser sensors are more precise than stereo cameras, they are still costly and do not necessarily provide sufficient precision due to the sparseness of the 3D points. In particular, they are not perfectly able to detect small unevenness in the terrain in front of a robot, despite their high applicability for detecting obstacles of sufficient size and height. Compared with laser sensors, cameras are more affordable and provide terrain attributes different from the 3D geometry obtained through point clouds. Thus, it is important to exploit the availability of image information for the prediction of traversability. In this paper, we investigate the applicability of 2D image features obtained by a camera for traversability prediction.
From the viewpoint of output as a measure of traversability for the self-supervised learning, the discrimination of traversable/non-traversable terrains [10,16] and the classification of terrains, such as gravel/sand/grass [19][20][21][22], are frequently used.
One problem of discriminating traversable/non-traversable terrains as the output of the classifier is that the boundary between the two classes is inconsistent and depends on the environment or requirements on the robot. When all regions around a robot are classified as traversable, the robot might still prefer to select a path with minimum (small) damage to its body considering the accumulation of fatigue during a long operational period. In contrast, even if every region around it is classified as non-traversable, the robot might have to search for the least hazardous path to escape from a situation. Thus, the predictability of a continuous cost is more important for avoiding damage to the robot. From the same viewpoint, classifying terrains (gravel/sand/etc.) is not the ultimate goal of traversability prediction; however, cost/damage to the robot should be considered.
As continuous measures of traversability assessment, the slippage of the wheels of planetary rovers [6] and the deformability of rocky terrains [11,23] have been investigated. Considering the general use of mobile robots in outdoor environments, however, it is useful to consider not only large-scale unevenness, such as rocky and slippery terrains, but also small-scale unevenness, which causes vibrations or (vertical) acceleration. By considering such small-scale unevenness, it becomes possible to reduce the accumulated damage to the robot and the instability of its cargo, and to improve the comfort of passengers.

Objective and Approach
In this paper, we propose a framework to enable mobile robots to autonomously learn terrain traversability using only their on-board sensors in a self-supervised manner. Terrain traversability is predicted as a cost measured by an acceleration sensor mounted on the mobile robot. Focusing on the usage of image features for terrains with small-scale unevenness, a texture-based prediction of traversability was proposed in [24]. However, the applicability of the texture information in 2D images has not been sufficiently investigated because, in natural environments, terrains are not always uniform and often contain spiky materials such as relatively large stones and roots of trees. This non-uniformity makes traversability cost prediction challenging. In this paper, a detection of such non-uniformity in terrains is proposed based on multiscale local image features. It is shown that this detection improves the prediction performance of the texture-based approach. An advantage of the texture-based approach is that the sensor is affordable, and the approach can still predict motion features of the traversing robot without high-cost 3D sensing of the terrain geometry.
To reduce the difficulty of motion feature prediction due to spiky objects in terrains, the classification of images into uniform/non-uniform is introduced. Predictors generated by a Gaussian process [25] are independently applied to uniform/non-uniform terrains so that each predictor can be specialized to learn each type of terrain. The proposed framework for improving the prediction performance is evaluated through comparison with existing texture-based motion feature prediction.
The remainder of the paper is structured as follows. In Section 2, the problem definition is given. Section 3 gives an overview of the system architecture, introduces how terrain uniformity detection is applied, and describes terrain traversability estimation from exteroceptive and proprioceptive information. Section 4 presents the experimental results of the proposed method. Finally, Section 5 concludes this paper.

Problem Definition
Terrains considered in this study are not always uniform due to irregular obstacles, such as large stones and roots of trees, and are assumed to be traversable even at a higher cost than homogeneous terrains. Typical obstacle size varies between 40 and 80 mm in height, which is a significant mobility challenge for the mobile robot, as the wheel radius is 105 mm. Terrains are assumed to be rigid. Deformable terrains [11] that can lead the mobile robot to lose balance, or cause slippage/sinkage [26], are not covered in this paper. In addition, untraversable terrains are also beyond the scope of this paper. Samples of the terrains investigated in this paper are introduced in Figures 1 and 2. This study was conducted with a mobile platform (Pioneer 3AT) widely used for navigation and traversability projects in outdoor environments [27][28][29][30]. The platform is driven by four wheels and is capable of traversing bumps up to 100 mm, gaps up to 150 mm, and slopes up to 19°. The mobile platform is equipped with a vision sensor and an acceleration sensor, as shown in Figure 3. The camera gathers terrain images, and the IMU registers the acceleration signal generated during a sequence of a run. The goal is to estimate vibrations as a measure of traversability only from terrain images. The predicted traversability cost will be used to anticipate the motion nature of subsequent terrains and is expected to enable a safe run. The focus is not on which robot offers the best handling of terrain unevenness but on enabling any type of robot, regardless of its construction/configuration, to traverse non-uniform terrains. In the bird's-eye view given by Figure 4, the gray circle represents the mobile robot and the black rectangles are the wheels. The wheel base and the wheel radius are denoted by L and r, respectively. The system accepts as input the angular velocities v_l and v_r of the left and right wheels, respectively.
The mobile robot motion is described as follows:

dx/dt = (r/2)(v_l + v_r) sin φ, dz/dt = (r/2)(v_l + v_r) cos φ, dφ/dt = (r/L)(v_r − v_l), (1)

where x and z give the robot position and φ is the orientation of the platform. The angular velocities of the left and right wheels are kept equal; therefore, the robot runs straight along the world Z_w-axis. Moreover, the effects of steering and acceleration/deceleration are nonexistent.
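As a concrete illustration, the straight-run kinematics above can be sketched in a few lines of Python. The wheel radius matches the 105 mm stated earlier; the wheel base and wheel speeds below are placeholder values, not the Pioneer 3AT's actual specification:

```python
import numpy as np

def step(state, v_l, v_r, r=0.105, L=0.38, dt=0.01):
    """One Euler step of the differential-drive kinematics above.
    r is the wheel radius [m]; L = 0.38 m is a placeholder wheel base;
    v_l, v_r are the left/right wheel angular velocities [rad/s]."""
    x, z, phi = state
    v = 0.5 * r * (v_l + v_r)            # forward speed
    x += v * np.sin(phi) * dt
    z += v * np.cos(phi) * dt
    phi += (r / L) * (v_r - v_l) * dt    # yaw rate
    return (x, z, phi)

# Equal wheel speeds -> a straight run along the world Z_w-axis.
s = (0.0, 0.0, 0.0)
for _ in range(100):
    s = step(s, 1.9, 1.9)   # ~0.2 m/s forward speed
print(s)                    # x and phi stay zero; z advances ~0.2 m
```

With equal wheel speeds the yaw rate is zero, so the integration reduces to a pure advance along Z_w, matching the straight-run assumption of the problem setting.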
The mobile robot has no suspension system; that is, it is a rigid body. Hence, the effects of the interaction of the platform with the terrain through the right and left wheels will appear directly in the sensor values. Operating in outdoor environments poses a significant lighting-conditions challenge. To simplify the problem setting, experiments were conducted under fair light conditions with neither shadows nor strong backlight.

Algorithm Architecture-Overview
Using data received from its sensors, the mobile robot builds a model of the environment to predict the terrain vibrations necessary to enable safe autonomous navigation. The overall system architecture is given in Figure 5. The architecture consists of the sensors available on the robot, which are a mono-vision system, an acceleration sensor, and odometers. The camera is responsible for gathering images of subsequent terrains. The inertial unit measures acceleration signals generated while traversing. The wheel odometer tracks the position of the mobile platform over time. The proposed method follows offline and online processes to achieve traversability cost prediction. In the offline process, the sensor information required for generating the traversability cost predictor is collected. Multiscale analysis measures a low-level image feature, the contrast distance, to localize irregularities in terrain images. In terrain non-uniformity detection (TNUD), the resulting feature map from the multiscale analysis is tested to determine whether the terrain is homogeneous or non-homogeneous. Region of Interest (ROI) localization extracts the image regions traversed by the mobile robot based on its physical features and the camera model. The texture analysis calculates the texture features using the fractal dimension. Using the acceleration signal, the vibration analysis calculates the motion feature. Based on the output of TNUD, image and motion features are used to approximate either the function for uniform terrains or that for non-uniform terrains.
In the online process, the same steps as in the offline process are executed to obtain the terrain image properties, that is, the texture feature and the terrain category. Based on the terrain category, the vibrations of the subsequent terrain are inferred using the texture information and the learned function.

Terrain Non-Uniformity Detection-Region Extraction
Terrain non-uniformity detection aims at identifying whether the terrain ahead of the unmanned ground vehicle is uniformly distributed or contains spiky material. TNUD can be treated as a low-level image processing problem. A bump in a terrain image, as an example of spiky terrain, can differ from the background in terms of contrast. Thus, to detect bumps, we evaluate the lightness contrast distance for each block in the terrain image. Let I denote the input terrain image. I is divided into overlapping blocks of size w × w pixels. Bumps tend to have a higher contrast value than their surrounding regions. Therefore, we measure the contrast distance between the local and global contrast values of the blocks and the input image I, respectively. For this purpose, two additional scale images are produced by resizing the original image to half and a quarter size by bicubic interpolation. Then, the R, G, B values are converted to L*, a*, b* values in the Commission Internationale de l'Eclairage (CIE 1976) (L*, a*, b*) color space (CIELAB) as described in [31]. Using the lightness channel, the global lightness contrast of the image is measured by:

C_g = STD(I) / MEAN(I), (2)

where STD(I) and MEAN(I) denote the standard deviation and mean values of the input image I, respectively. Let x denote a block in the input image I, and let C(x) denote the lightness contrast measured for x. C(x) is given as:

C(x) = STD(x) / MEAN(x), (3)

where STD(x) and MEAN(x) denote the standard deviation and mean values of the block x.
The contrast distance is calculated using the results of Equations (2) and (3) as follows:

CD(x) = |C(x) − C_g|, (4)

where CD(x) denotes the lightness contrast distance value of the block x. Sample results are given in Figures 6 and 7. To detect non-uniformity in a terrain image, all scale maps are combined to compute a refined contrast distance map using a maximum operator. As shown in Figure 8, non-uniform image regions tend to have a higher contrast distance value than the surrounding uniform areas. From the terrain images, regions traversed by the mobile robot are considered for texture extraction. For this purpose, the camera model in [32] is employed to define ROIs. As shown in Figure 9, the camera is defined by the camera frame {O_c, x_c, y_c, z_c}, where O_c is the center. The following equation projects a point P ∈ R^4 in the world reference frame to a point p ∈ R^3 on the image plane (both in homogeneous coordinates, with one as the last element):

p = K [R|t] P, (5)

where [R|t] ∈ R^{4×4} is the extrinsic parameter matrix, and K ∈ R^{3×4} is the camera parameter matrix obtained through a calibration process.
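A minimal sketch of the per-block contrast distance computation (Equations (2) and (3) and the block-wise distance) might look as follows; the block size, stride, and synthetic test image are illustrative choices, not the paper's settings:

```python
import numpy as np

def lightness_contrast(img):
    # Lightness contrast: standard deviation over mean of the lightness values.
    return img.std() / img.mean()

def contrast_distance_map(L_chan, w=32, stride=16):
    """Per-block contrast distance |C(x) - C_g| over overlapping w-by-w blocks
    of the CIELAB L* channel. The multiscale refinement would repeat this at
    the 1/2 and 1/4 scales and fuse the maps with a maximum operator."""
    Cg = lightness_contrast(L_chan)
    H, W = L_chan.shape
    rows = range(0, H - w + 1, stride)
    cols = range(0, W - w + 1, stride)
    cd = np.zeros((len(rows), len(cols)))
    for i, y in enumerate(rows):
        for j, x in enumerate(cols):
            cd[i, j] = abs(lightness_contrast(L_chan[y:y + w, x:x + w]) - Cg)
    return cd

# Example: a flat 128x128 lightness image with a 16x16 high-contrast patch.
L_chan = np.full((128, 128), 100.0)
yy, xx = np.meshgrid(np.arange(16), np.arange(16), indexing="ij")
L_chan[16:32, 16:32] = np.where((yy + xx) % 2 == 0, 50.0, 150.0)
cd = contrast_distance_map(L_chan)
# Blocks overlapping the patch score a larger contrast distance than uniform ones.
```

On the synthetic image, blocks containing the high-contrast patch stand out against the uniform background, which is the behavior the refined map exploits for non-uniformity detection.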
Figure 9. Camera frame for mapping a 3D point onto the image plane.
The above-mentioned model is combined with the physical characteristics of the UGV, that is, the wheel base L and the width W, to determine the image regions traversed by the mobile robot. As shown in Figure 10, both the camera and world reference frames are centered on the platform. The world reference frame is given by {O_w, x_w, y_w, z_w} with O_w as the origin. The objective is to map points identifying the inner and outer bounds of the terrain region crossed by the vehicle. All points belong to the same line; hence, only the depth components with respect to the z_w-axis change. For both the left and right sides, the coordinates of these points are expressed as follows:

P_ro = [W/2, 0, z, 1]^T, P_ri = [W/2 − w, 0, z, 1]^T, P_lo = [−W/2, 0, z, 1]^T, P_li = [−W/2 + w, 0, z, 1]^T, (6)

where w denotes the wheel width, P_ro and P_ri denote the points lying on the right outer and inner bounds, respectively, and P_lo and P_li denote the left outer and inner bounds. The results of this operation are shown in Figure 11. To increase the number of terrain samples, two terrain patches are extracted from a single image. A warp perspective transformation is performed to remove the trapezoidal shape introduced when taking a picture.
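The projection of the bound points onto the image plane can be sketched as below; the intrinsic/extrinsic values and the platform width are hypothetical stand-ins for the calibration results, used only for illustration:

```python
import numpy as np

def project(P, K, R, t):
    """Map a homogeneous world point P in R^4 onto the image plane via
    p = K [R|t] P, with [R|t] a 4x4 homogeneous extrinsic matrix and K a
    3x4 camera matrix; returns pixel coordinates after the perspective divide."""
    Rt = np.eye(4)
    Rt[:3, :3], Rt[:3, 3] = R, t
    p = K @ Rt @ P
    return p[:2] / p[2]

# Hypothetical calibration values (placeholders, not the paper's calibration).
K = np.hstack([np.array([[800.0, 0.0, 960.0],
                         [0.0, 800.0, 540.0],
                         [0.0, 0.0, 1.0]]), np.zeros((3, 1))])
R, t = np.eye(3), np.zeros(3)

# A right-outer-bound point x = W/2 at depth z = 2 m, with W = 0.5 m assumed.
print(project(np.array([0.25, 0.0, 2.0, 1.0]), K, R, t))  # -> [1060.  540.]
```

Sweeping the depth z over the ROI range and projecting the four bound points traces the trapezoidal terrain region that the warp perspective transformation later rectifies.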

Texture and Vibration Features Extraction/Association
In this paper, texture features are obtained using segmentation-based fractal texture analysis (SFTA) [33]. The SFTA method operates in two steps. In the first step, the input grayscale image is split into binary images using two-threshold binary decomposition (TTBD) [33]. In the second step, the binary images are used to compute the fractal dimension from their region boundaries. TTBD returns a set of thresholds T calculated by the multilevel algorithm [34]. After obtaining the targeted threshold number n_t, pairs of contiguous thresholds from T ∪ {n_l}, with n_l being the highest gray level in the input image, together with pairs of thresholds {t, n_l} with t ∈ T, are employed to obtain the binary images as follows:

I_b(x, y) = 1 if t_l < I(x, y) ≤ t_u, and 0 otherwise, (7)

where I_b(x, y) denotes the binary value at pixel (x, y), and t_l and t_u denote the lower and upper threshold values, respectively. Therefore, 2n_t binary images are obtained. In this paper, the SFTA feature vector includes only the fractal dimensions of the boundaries. Let ∆(x, y) denote the border image of the binary image I_b, which is obtained by the following equation:

∆(x, y) = 1 if I_b(x, y) = 1 and ∃(x', y') ∈ N_{x,y} : I_b(x', y') = 0, and 0 otherwise, (8)

where N_{x,y} is the 8-connexity neighborhood of the pixel (x, y). ∆(x, y) takes on a value of one if the pixel at location (x, y) in the related binary image I_b has a value of one and at least one neighbor pixel with a value of zero. The border image serves to compute the fractal dimension D ∈ R by the box-counting method as follows:

D = lim_{ε→0} log N(ε) / log(ε^{−1}), (9)

where N(ε) is the number of hyper-cubes of length ε that fill the object. In this paper, the obtained SFTA feature vector is referred to as x ∈ R^{2n_t}. Let a_k, k = 1, · · · , K, be the vertical acceleration signal representing vibrations generated from the interaction of the wheels with the terrain when traversing a short-range distance, where k is the time step and K is the total number of time steps.
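The two SFTA steps can be sketched as below. Note two simplifications made for illustration: the threshold selection uses evenly spaced gray-level quantiles as a stand-in for TTBD's multilevel thresholds, and the box sizes for the box-counting estimate are arbitrary choices:

```python
import numpy as np

def ttbd_thresholds(img, nt):
    # Stand-in for the multilevel thresholds of TTBD: evenly spaced
    # gray-level quantiles (an assumption made for this illustration).
    return list(np.percentile(img, np.linspace(0, 100, nt + 2)[1:-1]))

def binary_decomposition(img, T):
    # Binary images from contiguous threshold pairs of T U {n_l}
    # plus the pairs {t, n_l}, giving 2*n_t images in total.
    nl = float(img.max())
    bounds = T + [nl]
    imgs = [(img > tl) & (img <= tu) for tl, tu in zip(bounds[:-1], bounds[1:])]
    imgs += [(img > t) & (img <= nl) for t in T]
    return imgs

def border_image(b):
    # A pixel is on the border if it is one and at least one of its
    # 8-connected neighbors is zero.
    p = np.pad(b, 1, constant_values=False)
    H, W = b.shape
    neighbors_all_one = np.ones_like(b)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if (dy, dx) != (0, 0):
                neighbors_all_one &= p[1 + dy:1 + dy + H, 1 + dx:1 + dx + W]
    return b & ~neighbors_all_one

def box_counting_dimension(border, sizes=(1, 2, 4, 8)):
    # Fractal dimension: slope of log N(eps) against log(1/eps).
    counts = []
    for eps in sizes:
        H, W = border.shape
        counts.append(sum(border[y:y + eps, x:x + eps].any()
                          for y in range(0, H, eps) for x in range(0, W, eps)))
    return np.polyfit(np.log(1.0 / np.asarray(sizes)), np.log(counts), 1)[0]
```

Concatenating the box-counting dimension of each of the 2n_t border images yields the texture feature vector; a smooth boundary gives a dimension near one, while ragged, non-uniform boundaries push it higher.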
The amplitude distance is retained to describe the behavior of the mobile platform, and is measured by:

a_d = max_k a_k − min_k a_k. (10)

According to [35], all motion features (amplitude distance, root mean square, kurtosis, skewness, and crest factor) have a similar level of correlation with the SFTA feature. The amplitude distance feature was chosen as it is easy to implement and to understand. The acceleration signal does not undergo any noise filtering operation, and it is unclear whether any filtering is performed at the sensor level. To measure the noise contribution to the measurements, the signal-to-noise ratio (SNR) of the acceleration signal at rest was computed as follows:

SNR = μ/σ, (11)

where μ and σ are the mean and standard deviation of the acceleration signal, respectively. Results of the SNR calculation for four signals are given in Table 1. The SNR values are high, which means that the noise does not severely impact our intended purpose. Two subsequent terrain segments serve for image feature extraction. Accordingly, the relevant acceleration segment for motion feature calculation is paired with the image features by means of coordinate transformation and odometry. As shown in Figure 12, at time t = 0, the robot is located at the origin of the world reference frame. The position at which a new terrain image is taken, denoted here by Z_Ii, is expressed as:

Z_Ii = i · d_s, i = 1, · · · , N_image, (12)

where d_s denotes the sampling distance, and N_image is the total number of terrain images acquired during a run. Due to the camera tilt angle α, each image covers the terrain only from a certain position, Z_Ii + Z_BZ, where Z_BZ is the blind spot distance. The motive behind focusing on a short distance range l covered by the images for texture feature extraction is that pixels farther from the camera focus point are subject to more noise. Thus, visual information may fail to faithfully represent the environment.
The acceleration signal sequence generated when traversing a distance l is used for motion feature extraction, and is limited according to the ROI as follows:

A_i = {a_k | Z_Ii + Z_BZ ≤ z_k ≤ Z_Ii + Z_BZ + l}, (13)

where A_i denotes the acceleration segment associated with the ROI of image i, and z_k is the robot position at time step k.
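A compact sketch of the motion feature extraction and the odometry-based ROI pairing follows; all numeric values are hypothetical:

```python
import numpy as np

def amplitude_distance(a):
    # Motion feature: max-min spread of the vertical acceleration segment.
    return float(a.max() - a.min())

def snr(a_rest):
    # Signal-to-noise ratio: mean over standard deviation of the signal at rest.
    return float(a_rest.mean() / a_rest.std())

def roi_acceleration(a, z, z_image, z_bz, l):
    """Keep the samples whose odometry position z_k falls inside the ROI
    [Z_Ii + Z_BZ, Z_Ii + Z_BZ + l] of image i (all positions in meters)."""
    mask = (z >= z_image + z_bz) & (z <= z_image + z_bz + l)
    return a[mask]

# Hypothetical vertical-acceleration samples with their odometry positions.
a = np.array([9.7, 9.9, 10.3, 9.8, 10.0])
z = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
seg = roi_acceleration(a, z, z_image=0.5, z_bz=0.3, l=0.8)
print(amplitude_distance(seg))  # spread over the ROI segment only
```

The resulting (texture feature, amplitude distance) pairs are exactly the training samples consumed by the regressor in the next section.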

Traversability Cost Regression using Gaussian Process (GP)
The Gaussian process [25] is used for regression analysis. Let D = {(x_j, y_j)}, j = 1, · · · , n, denote a set of training samples with pairs of SFTA feature vector inputs x_j and motion feature outputs y_j ∈ R, where n denotes the total number of training samples. The goal is to evaluate the predictive distribution of the function value f_* for a test input x_*. The noise is additive, independent, and normally distributed. The relationship between the function f(x) and the observed noisy targets y is given by:

y_j = f(x_j) + ε_j, (14)

where ε_j is noise with distribution N(0, σ²_noise), in which σ²_noise denotes the variance of the noise. The notation N(a, A) is retained for the normal distribution with mean a and covariance A. Gaussian process regression is a Bayesian algorithm that assumes that, a priori, the function values are distributed as p(f | x_1, x_2, · · · , x_n) = N(0, K), where f = [f_1, f_2, · · · , f_n] contains the latent function values, f_j = f(x_j), and K is a covariance matrix whose entries are given by the covariance function, K_ij = k(x_i, x_j). A widely employed covariance function is the squared exponential, given by:

k(x_i, x_j) = σ² exp(−||x_i − x_j||² / (2λ²)), (15)

where σ² controls the variance, and λ is the isotropic length scale parameter describing the smoothness of the function. We compute the covariance function among all possible combinations of data points. The joint prior distribution of the training outputs y and the test output y_* is:

[y; y_*] ~ N(0, [[K + σ²_noise I, k_*], [k_*^T, k(x_*, x_*) + σ²_noise]]), (16)

where k_* = [k(x_1, x_*), · · · , k(x_n, x_*)]^T. The predictive distribution of y_* for Gaussian process regression is y_* ~ N(ȳ_*, σ²_y*), where:

ȳ_* = k_*^T (K + σ²_noise I)^{−1} y, (17)
σ²_y* = k(x_*, x_*) + σ²_noise − k_*^T (K + σ²_noise I)^{−1} k_*. (18)
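A direct NumPy transcription of the predictive mean and variance might read as follows; the hyperparameter defaults and the 1D toy data are illustrative, not the values trained in the experiments:

```python
import numpy as np

def sq_exp(A, B, sigma2=1.0, lam=0.2):
    # Squared-exponential covariance between two sets of input points.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma2 * np.exp(-0.5 * d2 / lam ** 2)

def gp_predict(X, y, Xs, sigma2_noise=1e-2, lam=0.2):
    """Predictive mean and variance of GP regression: mean = k_*^T (K + s2 I)^-1 y,
    var = k(x_*, x_*) + s2 - k_*^T (K + s2 I)^-1 k_*."""
    Ky = sq_exp(X, X, lam=lam) + sigma2_noise * np.eye(len(X))
    Ks = sq_exp(X, Xs, lam=lam)                  # n x m cross-covariances k_*
    mean = Ks.T @ np.linalg.solve(Ky, y)
    var = np.diag(sq_exp(Xs, Xs, lam=lam)) + sigma2_noise \
        - np.einsum('ij,ij->j', Ks, np.linalg.solve(Ky, Ks))
    return mean, var

# 1D toy problem standing in for the SFTA-feature inputs.
X = np.linspace(0.0, 1.0, 20)[:, None]
y = np.sin(2 * np.pi * X).ravel()
mean, var = gp_predict(X, y, X)
```

In the proposed framework, two such predictors are fit independently, one on the samples labeled uniform by TNUD and one on the non-uniform samples, so each can specialize to its terrain type.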

Experimental Settings
The goal of the experiment is to verify the ability of the proposed approach to model terrain information. We validate the effectiveness of introducing TNUD to the traversability cost prediction by comparing it with the framework introduced in [24], where the Gaussian process was directly applied to cost prediction without non-uniformity detection. For this purpose, we compute the root-mean-squared prediction error (RMSE), which is defined by:

RMSE = sqrt( (1/N) Σ_{i=1}^{N} (μ_i − f(x_i))² ), (19)

where N is the number of test samples, μ_i denotes the predicted mean vibration for the input image texture x_i, and f(x_i) is the corresponding ground-truth vibration. The experiment was conducted at Sanaru Lake in Hamamatsu, Japan, on a wide portfolio of terrains, where a full database, including terrain images and acceleration signals, was recorded and later randomly divided into training and test sets. The total number of samples is 2582, from which 90% are allocated for training, and the remaining 10% are used for testing. The refined maps resulting from the multiscale analysis were thresholded with a lightness contrast distance threshold of 0.5. The remaining experimental settings are given in [24]. All experiments were performed on an Intel Core i7-3770 3.4 GHz CPU with 8 GB of RAM.
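For reference, the RMSE criterion and the random 90/10 split can be sketched as:

```python
import numpy as np

def rmse(mu, f):
    # Root-mean-squared error between predicted and ground-truth vibrations.
    return float(np.sqrt(np.mean((np.asarray(mu) - np.asarray(f)) ** 2)))

# Random 90/10 split of the 2582 recorded samples, as in the experiment
# (the seed is arbitrary; the paper does not specify one).
rng = np.random.default_rng(0)
idx = rng.permutation(2582)
train_idx, test_idx = idx[:2324], idx[2324:]
```

With 2582 samples, the split yields 2324 training and 258 test samples, matching the per-category counts reported in the results.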

Results and Discussion
As a result of the TNUD discrimination, the training processes for homogeneous and non-homogeneous terrains were carried out using 1956 and 368 observations, respectively. The predictions of the motion information for homogeneous and non-homogeneous terrains were generated using 220 and 38 image texture samples, respectively. To facilitate the interpretation of the regression results, we summarize the results in Table 2 and plot the predicted vibrations for both homogeneous and non-homogeneous terrains in Figures 13 and 14. The prediction error for both homogeneous and non-homogeneous terrains is far lower than the prediction error of the regressor proposed in [24]. Moreover, the failed prediction error is also lower than that of the method proposed in [24]. The results are shown in Table 3. The current framework sometimes fails to predict the vibrations of terrain samples judged by the multiscale analysis to be either uniform or non-uniform. In the case of non-uniform terrains, as shown in Figure 14, the Gaussian process outputs negative vibration predictions since we do not impose any restrictions in this regard. Such negative vibration predictions are not consistent with the nature of the feature used in this study, which is always positive, as described by Equation (10). The only computations performed on the image are the global contrast measurement and the contrast distance maps for scales 1, 2, and 3. The purpose of doing so was to understand and evaluate the behavior of the contrast distance feature toward non-uniformity. Multiscale analysis for scales 1, 2, and 3 incurs a relatively high computation time, as shown in Table 4. The scale 1 contrast distance map was generated from an image of size 1080 × 1920, the scale 2 map from an image of size 540 × 960, and the scale 3 map from an image of size 270 × 480.
Since we could confirm that non-uniform regions have a higher contrast distance feature value than uniform regions, we propose in future work to accelerate this process by focusing only on the ROI and not the whole image. The ROI is approximately 10% of the whole image; thus, the computational time will decrease drastically. Compared with [36], where motion prediction with varying speed based on a 3D reconstruction problem setting was investigated, our platform ran at a constant speed of 0.2 m/s. We propose to study the effect of speed variations on the prediction of vibrations in future work.

Conclusions
In this paper, a traversability cost prediction was presented based on terrain non-uniformity detection for uneven outdoor terrain environments. The traversability cost was represented by the max-min difference of vertical acceleration of the mobile robot as a motion feature. Based on multiscale analysis of contrast distance feature maps, the non-uniformity of textures in an image is detected. Based on the discrimination of uniform/non-uniform terrains, the GP-based predictor is applied independently to learn the traversability cost. In the experiment, it was verified that the proposed non-uniformity detection helps to improve the prediction performance.
It was also observed in the experiment that it is naturally difficult to predict large accelerations on non-uniform terrains in certain cases. As future work, instead of applying the same framework of texture-based prediction to both uniform and non-uniform terrains, we can investigate a more suitable approach for the cost prediction of non-uniform terrains. In the case of complex mobile robot architectures, we expect that the same methodology can be applied by collecting motion data on various terrains and extending the framework to consider the mechanical properties as well as the correlation between speed of motion and traversability cost.