Laplacian Support Vector Machine for Vibration-Based Robotic Terrain Classiﬁcation

: The achievement of robot autonomy has environmental perception as a prerequisite. The hazards rendered from uneven, soft and slippery terrains, which are generally named non-geometric hazards, are another potential threat reducing the traversing efﬁcient, and therefore receiving more and more attention from the robotics community. In the paper, the vibration-based terrain classiﬁcation (VTC) is investigated by taking a very practical issue, i.e., lack of labels, into consideration. According to the intrinsic temporal correlation existing in the sampled terrain sequence, a modiﬁed Laplacian SVM is proposed to utilise the unlabelled data to improve the classiﬁcation performance. To the best of our knowledge, this is the ﬁrst paper studying semi-supervised learning problem in robotic terrain classiﬁcation. The experiment demonstrates that: (1) supervised learning (SVM) achieves a relatively low classiﬁcation accuracy if given insufﬁcient labels; (2) feature-space homogeneity based semi-supervised learning (traditional Laplacian SVM) cannot improve supervised learning’s accuracy, and even makes it worse; (3) feature- and temporal-space based semi-supervised learning (modiﬁed Laplacian SVM), which is proposed in the paper, could increase the classiﬁcation accuracy very signiﬁcantly.


Introduction
Achieving autonomous motion of a mobile robot is one of the most challenging problems in robotics, and the key to its success consists of the following four parts: environmental perception, pose estimation, motion control and route planning [1]. The implementation of pose estimation, motion control and route planning often requires us to introduce environmental information to some extent, so accurate environmental perception is of great importance [2]. The environmental humps (e.g., walls) and sinks (e.g., rivers) that robots cannot traverse are referred to as geometric hazards, which have been investigated extensively [3]. On the other hand, the hazards rendered from the uneven, soft and slippery terrains, which are often called non-geometric hazards, are receiving more and more attention from the robotics community [4].Different from geometric hazards, non-geometric hazards do not obstruct the traversing robot completely, but have a great impact on the traversing efficiency [5]. Inappropriately planned routes and an improper control strategy may lead the robot to waste too much energy, or even cause a loss in mobility. Therefore, if the robot can predict its current and forward terrain type accurately and in real time, then it can replan its route in time to avoid non-geometric hazards. Apart from its great effect on route planing, robotic terrain classification also contributes to accuracy of over 90%. In [32], a training phase using the waveform representation of the vibration data gathered from an accelerometer is first executed, and then linear discriminant analysis is used for online classification. A comparative study is presented in [33], and the results show the best performance of SVM classifier compared with other classifiers-probabilistic neural networks, kNN, etc. In [34], a terrain classifier is proposed that uses vibration data and motor control data for legged robots, for which one-versus-one SVM is employed. In [35], the measurements from an accelerometer are used to classify the road terrain for land vehicles. In the work, the principal component analysis is employed to determine the best features, and three classifiers including Naive Bayes, neural network and SVM are evaluated. In order to take the temporal coherence into consideration, an adaptive Bayesian filter is employed to correct the classification results. In their work, the terrain predictions do not only rely on the current vibration measurements, but also on the nearest several classifications [36]. Similarly in [37], a Markovian random field based clustering approach is presented to group vibration data, in which the inherent temporal dependencies between consecutive measurements are also considered when predicting the terrain types. In addition, terrain vision could be an auxiliary measure for improving the vibration based terrain classification [38,39].
Aforementioned works were developed based on SVM, kNN, Naive Bayes, etc. Apart from the traditional machine learning methods, the artificial neural networks, which have a stronger ability to fit any non-linear relationship, were introduced to tackle robotic terrain classification problems. In [40], a terrain classification method based on a 3-axis accelerometer is proposed. After processing the gathered 3-dimensional vibration data by fast Fourier transformation and manual labelling, the labelled training set is constructed and then used to derive a modified back propagation artificial neural network (ANN) to classify the traversed terrains. The real-world experiment results show their work could realise the classification of five different terrains with about 90% accuracy. Furthermore in [41], an ANN with deeper multi-layer perception is introduced, the accuracy of which is increased significantly compared with that of [40]. Dispensing with feature extraction, the recurrent neural network is able to operate the vibration data in time domain and competent for classification of 14 different terrain types [42]. Similar work based on ANN could be found in [43,44].
Although much work has been done, most studies treated the terrain classification task as a supervised learning problem. As terrain type is defined in terms of its appearance by a human, a robot could gather the terrain images and vibration data synchronously, and save them in pairs. Afterwards, all vibrations are labelled by a human according to the terrain images. However, it is a repetitive and burdensome task for a human to label all terrain images. Additionally, the site at which a field robot works may be far from urban areas, which cannot guarantee reliable and fast communication, so it is impracticable for the robot to request all the manual labels. As a result, only a small portion of the terrain images could be labelled manually. To the best of our knowledge, such a semi-supervised learning problem has not been studied in robotic terrain classification. In this paper, the vibration-based terrain classification (VTC) is investigated by introducing the semi-supervised learning framework. According to the intrinsic temporal correlation existing in the sampled terrain sequence, a modified Laplacian SVM is proposed to utilise the unlabelled data to improve the classification performance.
The rest of the paper is organised as follows: Section 2 illustrates the framework and flow chart of the terrain classification system, and expatiates on the feature extraction method and semi-supervised learning algorithm. Section 3 presents the real-world experiment of our method compared with some existing ones, as well as the performances by adjusting the parameters of our method. The paper is concluded in Section 4.

Methodology
The terrain classification system is illustrated in Figure 1. A camera mounted on the support top captures the terrain image in the forward direction. Once the robot moves a very short distance, the sub-images of the terrain patches that the robot wheels traversed could be picked out, according to the relative distance measured by localisation sensors (e.g., GPS, odometry). It is known that odometry could realise the relative localisation with high accuracy within a small distance, and the terrain is spatially continuous (which means that the terrain patches might be of the same class within a wide area), so the effect of localisation uncertainty on the learning problem could be ignored.
Because a robot may be equipped with shock absorbers, the vibration sensors are preferred to be mounted on the axle. Hence, vibration data and the corresponding image of terrain patch could be matched. As terrain type is defined in terms of its appearance by a human, the robot could send the terrain images to a human and request labels. Field robots are designed to execute tasks in fields which are far from urban areas, and a field robot cannot guarantee reliable and fast communication, so it is impracticable for the robot to request all the manual labels. As a result, only a small portion of the terrain images are labelled manually, and a semi-supervised learning method will be employed to train the classifier after the labels arriving.
The problems is formulated as follows. A sample is denoted by . The work of terrain classification aims to classify the sample set X = {x 1 , x 2 , · · · , x n } into L subsets, where L is the number of terrain types. Under the semi-supervised learning framework, the robotic system requests samples to be labelled, and predicts the remaining u = n − unlabelled samples without any instructions from humans.

Semi-Supervised
Learning Labels

Feature Extraction
An accelerometer is employed to collect the z-axis acceleration sequences at 100 Hz. Actually, the detected acceleration do not only contain a pure motion-induced vibration, but also the gravitational acceleration. Because the robot works on horizontal ground, the gravity could be treated as a constant number, and therefore, subtracting the gravitational acceleration from the acceleration sequence could yield the vibration sequence. In addition, all vibration sequences are segmented into sub-sequences which are called vibration frames. Each vibration frame is composed of m successive vibration points, and overlaps its forward/backward vibration frames by 50% to guarantee the classification timeliness. Define a vibration frame by a = (a 1 , a 2 , · · · , a m ). Now we are in a position to extract features from a in the time domain and the frequency domain.

Frequency-Domain Features
Transforming the vibration sequence from the time domain to the frequency domain is usually very helpful, as it could extract discriminative signal components and simplify the mathematical analysis. As a frequently-used digital signal decomposition tool, the discrete Fourier transform (DFT) could output the amplitude spectrum of sequence in a time-discrete way. The κ-point DFT on the vibration frame a is defined by where j 2 = −1, and k is the frequency. In the paper, we use the fast Fourier transform (FFT) to implement DFT, thereby accelerating the signal decomposition. The parameter κ should be an integer which can be factored into a product of small prime numbers, or a power of 2, simply. If κ > n, the vibration frame a should be padded with zeros. In other words, the terms [a n+1 : a κ ] are set to zeros. Our experiment employs an accelerometer with up to 100 Hz frequency. Because the terrain prediction is desired to be provided every second, i.e., terrain classification works at 1 Hz, we set κ = 2 7 = 128. By using 128-point FFT, a 128-dimensional feature vector is obtained. In order to increase the classification speed and reduce redundancy features, we can sample some entries uniformly from the spectral vector to constitute the frequency-domain feature vector.

Time-Domain Features
Apart from extracting features in frequency domain, we can also extract features in the time domain directly. The existing literature has proposed many time-domain features and achieved an acceptable classification accuracy, but only a portion of them contribute primarily to the classification performance [24]. In this paper, the time-domain feature vector x = (x (1) , x (2) , · · · , x (10) ) is shown as follows: • x (1) : The number of sign variations where I(·) is an indicator function, which outputs 1 if the expression in (·) holds, or 0 otherwise. This feature is an approximation of the frequency of a. • x (2) : The mean of a which measures the coarse degree of terrains. This feature may considerably diverge from zero for some coarse terrains. • x (3) : The number of sign changes inā wherē This is a complement to x (1) , which avoids x (1) ≈ 0 for even a high-frequency vibration sequence when the robot is traversing coarse terrain. • x (4) : The variance of a Intuitively, the variance is higher when the terrain becomes coarser. • x (5) : The autocorrelation function of a where τ < m is an integer indicating time difference. As a measure of non-randomness, x (5) gets larger with a stronger dependency between a i and a i+τ . Obviously, this feature can be extended by setting τ = 1, 2, · · · , m − 1. However, according to Khintchine's law, it should be guaranteed that τ m bounds the estimation error of x (5) . In the paper, we choose τ = 1. • x (6) : The maximum value in a which indicates the biggest bump of the terrain. • x (7) : The minimum value in a which indicates the deepest puddle of the terrain. • x (8) : The 2 -norm of a which reflects the energy of a. If x (2) → 0, x (8) has the similar function as x (4) . Instead, we can also use the 1 -norm; i.e., • x (9) : The impulse factor of a which measures the impact degree in a. • x (10) : The kurtosis of a which measures the deviation degree of the a with Gaussian distribution.

Laplacian Support Vector Machine
The Laplacian SVM (LapSVM) uitised in this paper is an extension of the traditional support vector machine (SVM). It is worth noting that the LapSVM belongs to semi-supervised learning, which is different from SVM. The LapSVM trains the model from the labelled and unlabelled data according to the manifold assumption, while the SVM only uses the labelled data. Consequently, we will first introduce the SVM model, and then expatiate on the formulation of LapSVM, including the construction of similarity matrices in the remainder of this chapter.

SVM Model
The SVMs are a series of classification algorithms that divide data into two groups with a separating hyperplane. Considering the incompleteness of training data and the existence of the noise interferes, the separating hyperplane of maximum margin is applied in the SVMs to improve the robustness. Hence, the separating hyperplane can be represented as the following linear equation where ω = (ω 1 ; ω 2 ; . . . ; ω d ) is a normal vector with respect to the hyperplane, b is a scalar bias deciding the distance from the origin to hyperplane and denotes the transpose. In general, the classification tasks are hardly completed in the data space when the samples cannot be classified linearly (e.g., Xor classification problems). Hence, the kernel function is introduced to SVM to map the samples from original data space to a high-dimensional space, where an adequate separating hyperplane could be found for the nonlinear classification problems. First, given the mapping function φ : x → φ(x), the hyperplane in Equation (13) can be rewritten as: According to the representer theorem proposed in [45], the kernel function is denoted by , and thereby we have where α i denotes the Lagrangian multiplier. The samples x i with α i > 0 determine the decision function; hence naming them support vectors.

Semi-Supervised Learning
As an extension of SVM, the LapSVM introduces a manifold regularisation term to improve the smoothness of model. By utilising the similarity among the labelled and unlabelled samples, the Laplacian matrix of the graph is created-Laplacian SVM [46]. The LapSVM is achieved by solving the following optimization problem, where f denotes the decision function, y i ∈ {−1, +1} denotes the labels, V denotes a given loss . The coefficients γ A and γ M control the complexity of f in the reproducing kernel Hilbert space (RKHS) and in the intrinsic geometry of the marginal distribution, respectively.
In Equation (16), the regularisation term f 2 K can be expanded in terms of the expansion coefficients α and kernel matrix K = [k(x i , x j )] n×n as follows, Similarly, the regularisation term f 2 M in Equation (16) can be rewritten, which is based on the manifold assumption, where w ij is the similarity between the iand j-th samples, thereby denoting the similarity matrix The construction of the similarity matrix W is introduced in the next section.

Similarity Matrix
As shown in Figure 2, we observe that samples of terrain types comply with the homogeneities both in the feature space and temporal dimension, which could be utilised to improve the classification performance under the lack of labels. The first similarity matrix W 1 is established based on the homogeneity in feature space. The (i, j)-th element w where N 1 (x j ) denotes the set of k 1 -nearest neighbouring samples of x i under the metric of euclidean distance in feature space, and t 1 > 0 denotes the width of Guassian kernel.
1,4 under k = 3. As the right subfigure shows, N 1 ( Analogously, the second similarity matrix W 2 is established based on the homogeneity in temporal dimension. The (i, j)-th element w (2) i,j in W 2 is denoted by where k 2 ≥ 2 is an even, and t 2 > 0 denotes the width of Guassian kernel.
The two similarity matrixes W 1 and W 2 can be merged into one similarity matrix W by where µ ∈ (0, 1) denotes the weight and σ(·) denotes a given nonlinear function. The weight coefficient µ selects which homogeneity is more convinced. For example, if terrain switches from one type to another more frequently over time, µ should be greater because of the weaker temporal correlation. The nonlinear function σ(·), e.g., (·) 2 , could increase the similarity between two samples which are similar both in feature space and temporal dimension.

Solution of LapSVM
According to Equations (17), (18) and (21), Equation (16) is rewritten as the solution of which is the targeted SVM classification model. According to the representer theorem proposed in [45], the solution of Equation (22) can be found in RKHS and is expressed by where K(·, ·) denotes the kernel, and α * i and b * is worked out by the preconditioned conjugate gradient (PCG) algorithm [46].

Experiment Setup
The experiment is conducted by a four-wheel mobile robot, the photo and electronic system of which are shown in Figure 3. The robot is 340 mm in length, 270 mm in width, 230 mm in height and 2.6 kg in mass. The diameter and width of the wheels are 130 mm and 60 mm, respectively. With a power supply of 12 V, the robot could traverse coarse ground at a speed of up to 1.5 m/s. The sensing system is composed of an accelerometer, gyroscope, magnetometer and odometer. The data collector reads the accelerometer and odometrer at 100 Hz and 1 Hz, respectively. All data is recorded in the onboard flash memory, and transferred to a computer (3.20 GHz, 8 GB RAM). Controlled by a smart phone via Bluetooth, the robot traverses six typical terrains. Photos of patches of terrains and the related vibration sub-sequences are shown in Figure 4. Some of them are artificial terrains (e.g., asphalt road), while some are natural ones (e.g., natural grass). These terrains are different in rigidity, roughness and flatness. Compared with other terrains, it is observed that the interaction between the robot and the cobble path generates a highly distinguishable vibration. The vibration has higher frequency, larger magnitude and weaker autocorrelation, because the cobble path is relatively rigid and irregular. The vibrations of the other five terrains may not be easy to discriminate intuitively; their slight differences, however, still can be found in terms of their variational tendencies. The dataset is composed of 1584 vibration frames belonging to six different terrains (natural grass (NG), asphalt road (AR), cobble path (CP), artificial grass (AG), sand beach (SB), plastic track (PT)), and each terrain has 264 frames.

Data Visualisation
We visualise our data through t-distributed stochastic neighbour embedding (t-SNE). As shown in [47], t-SNE is a non-linear technique for dimensionality reduction that is particularly well suited for the visualisation of high-dimensional datasets. The t-SNE minimises the divergence between two distributions: a distribution that measures pairwise similarities of the input objects and a distribution that measures pairwise similarities of the corresponding low-dimensional points in the embedding. The Kullback-Leibler (KL) divergence of the joint probability of the original space and the embedded space is used to evaluate the quality of visualisation effect. That is to say, the function of KL divergence is used as a loss function, and then the loss function is minimised by gradient descent, and finally the convergence result is obtained. Figure 5 shows the t-SNE visualisation of the time-domain features and frequency-domain features of our data. It is easy to derive the following conclusions: (1) The data of CB do not intersect with and are far away from the other data; hence CB could be recognised easily and accurately. This is because CB is relatively rigid and irregular, and CB-induced vibration is distinguishable compared with other terrain types, which is also demonstrated in Figure 4. (2) The data of AR are relatively clustered and barely intersect with those of NG, AG, SB, PT, both in the time domain and frequency domain, so AR could be recognised with the second accuracy. (3) The data of terrains other than CB and AR may intersect with others more or less, so there may exist confusion in the classification of the four terrains. In particular, for NG and SB, the data are embedded into each other, so it is a challenge to distinguish them. (4) Compared with the time-domain features, the frequency-domain features have more clustered behaviour in the 2-dimensional feature space. It can be predicted that the frequency-domain features could yield a better classification accuracy.  As shown in Figure 6, we split the terrain sequence into segments and concatenate them into a new rearranged terrain sequence. We use dwelling time T d to describe the terrain switching frequency. It is observed that the terrain sequence implicates temporal correlation; i.e., the next terrain has the same type as the current type very possibly. From top to bottom, each terrain dwells for 264, 66 or 22 sampling points, respectively. In the following experiment, we will show the influence of classification accuracy by different dwelling times.

Experiment Coding
We will evaluate the classification performance by SVM, the traditional LapSVM (t-LapSVM), and the proposed LapSVM (p-LapSVM), all conducted by MATLAB. If using SVM, the labelled data are used to train the classifier. We use the famous tool "LIBSVM" to train a 6-class classifier and test it directly [48]. If using LapSVM, both the labelled and unlabelled data are used to train the classifier. These trained classifiers are tested on the unlabelled data. The tool of binary t-LapSVM [47] written in MATLAB has been released at http://www.dii.unisi.it/~melacci/lapsvmp/. Multi-class t-LapSVM is realised by one-versus-one strategy. The t-LapSVM only considers the homogeneity in the feature space, while the p-LapSVM considers the homogeneities in the feature space and the temporal dimension. Hence, we modify the t-LapSVM code through adding an extra similarity matrix coupled with its weight, thereby deriving the p-LapSVM tool. As for the other tools, those concerning machine learning could be easily realised by using the built-in functions of MATLAB.

Experiments on the Labelling Strategy 1 (LS1)
In this part, we randomly select the samples with equal number e from each class, and label them. This is an ideal yet unrealisable labelling strategy, which is used to evaluate the effectiveness of the semi-supervised learning algorithm.

SVM
The experiment results using SVM under LS1 are shown in Figure 7. Using time-domain features and e = 30, the total classification accuracy is 75.64%. As Figure 7a illustrated, CP could be classified with 100% accuracy, while the other five terrains could not be classified with acceptable accuracies. It was found that there were obvious confusions between NG and SB; AG and PT; and SB and PT, which can be interpreted from Figure 5a. Adjusting e from 10 to 100, the number of labelled data increased, but it is not easy to observe the increasing of classification accuracy, as shown in Figure 7c. Observe the performances in the frequency domain. The total classification accuracy is 82.19% under e = 30. The confusions between NG and SB; AG and PT; and SB and PT still exist, but the other confusions are reduced through observing the differences between Figure 7a and Figure 7b. With e increasing, the classification accuracy is increased slightly, as shown in Figure 7d.

t-LapSVM
The experiment results of t-LapSVM under LS1 are shown in Figure 8. Using time-domain features and e = 30, the total classification accuracy is 76.21%, so the classification is almost not improved from SVM to t-LapSVM. Comparing Figure 8a with Figure 7a, it is observed that some confusions are weakened and some are strengthened. This is because the unlabelled data could be labelled correctly if the feature-space homogeneity holds; otherwise they may be labelled incorrectly. As Figure 5a shows, the data of different classes intersect partially; i.e., the feature-space homogeneity does not hold. Therefore, the data which could be correctly predicted by SVM may not be predicted correctly by t-LapSVM. Although increasing the number of labelled samples, the classification cannot be improved, which is observed by comparing Figure 8c with Figure 7c. The total classification accuracy using frequency-domain features is 78.85% under e = 30. Observing Figure 8b,d, it is found that the classification performs worse by introducing semi-supervised learning. Hence, inappropriate utilisation of unlabelled data may cause a deterioration in classification. In Table 1, we present the classification performances with different parameters of t-LapSVM. It was observed that changing the values of k 1 , γ K , γ M and t 1 did not cause a significant variation in classification performance, which may reveal that the assumption of feature-space homogeneity is invalid.

p-LapSVM
The experiment results of p-LapSVM under LS1 are shown in Figure 9. Using time-domain features and e = 30, the total classification accuracy is 90.95%, so the classification is increased by about 15% compared with t-LapSVM. Comparing Figure 9a with Figure 8a, it is clear that most confusions are weakened significantly. Additionally, as shown in Figure 9c, the total classification accuracy increases from 85% to 98% with increasing from 10 to 100. Table 2 lists the total classification accuracies under different parameters of p-LapSVM. It can be found that the total classification accuracy could even reach 99.64% if the parameters could be set appropriately. The parameter setting obeys the following rule: the homogeneity in temporal dimension should be weighted more under a large dwelling time (i.e., terrain switches infrequently); otherwise, weighted less under a small dwelling time (i.e., terrain switches frequently). As column 5 shows, t 2 = 1 and µ = 1, meaning the homogeneity in temporal dimension receives the highest weight, so the total classification accuracy is extremely high under T d = 264 but extremely low under T d = 22. Hence, the homogeneity in temporal dimension should be weighted moderately rather than exceedingly, especially in the situation in which T d cannot be determined. The experiment demonstrates that the p-LapSVM could properly utilise the homogeneity in temporal dimension, and thereby realise the terrain classification with high accuracy under the lack of labels. In the frequency domain, the total classification accuracy is 93.30% under e = 30, and the classification could be further improved with a larger e .

Experiments on Labelling Strategy 2 (LS2)
The LS1 is not realizable in practice, so we try to employ another practicable labelling strategy. If the number of total labelling samples is set to a , then we use a -clustering algorithm to yield a clusters and randomly select one sample from each cluster to request manual labelling. It is noted that a = 6 × e . In this part, we aim to show the influence of class imbalance, which is caused by the labelling strategy, on the semi-supervised learning accuracy. As shown in Figure 10, the classes of the selected samples are not balanced, but not so seriously. The classification accuracies of p-LapSVM over T d under LS2 are shown in Figures 11 and 12. It is easy to find that the total classification accuracy decreases under the same total number of labelled samples, for the presence of class imbalance. However, a significant improvement in accuracy can be still observed by using the modified LapSVM.

Conclusions
In the paper, the semi-supervised learning problem for terrain classification is investigated. To the best of our knowledge, such a semi-supervised learning problem has not been studied in robotic terrain classification. Based on the homogeneities in feature space and the temporal dimension, a modified Laplacian SVM is proposed, and thereby the intrinsic information of unlabelled data could be sufficiently used to increase the classification accuracy. As the experiment demonstrated, the modified Laplacian SVM overwhelms the traditional Laplacian SVM in accuracy.

Conflicts of Interest:
The authors declare no conflict of interest.