Setup and devices
The FoB was oriented so that its X/Y plane was parallel to the horizontal plane of the workspace. The device returns 6 double-precision numbers describing the position (x, y and z, in inches) and rotation (azimuth, elevation and roll, in degrees) of the sensor with respect to a magnetic base mounted about one metre away from the subject. The spatial resolution of the acquired data is 0.1 inches and 0.5 degrees.
The E504, after the standard calibration phase, returns one true/false value denoting the validity of the gaze coordinates (that is, whether the pupil is in sight of the camera and correctly recognised) and two double-precision numbers indicating the coordinates of the subject’s gaze with respect to the monitor. Its precision is about 1 degree, which we verified to correspond to less than one pixel on the monitor; this was considered acceptable. The device can nominally stream at up to 50Hz. We configured it so as not to filter the gaze signal in any way, in order to read the “raw” pupil movement; note, however, that the signal was filtered off-line during the data analysis, as described below.
The CyberGlove was used as an on-off switch, to detect when the subject’s hand would close, by monitoring one of its sensors via a threshold.
The monitor showed the slave’s workspace; the slave is the humanoid platform Babybot, composed of two colour cameras, a commercial 6-degrees-of-freedom robotic arm, a pan/tilt head and a humanoid hand (see, e.g., Natale, Orabona, Berton, Metta, & Sandini, 2005). During the experiment, we only employed one of its colour cameras.
Figure 1, panel (d) (reproduced from Natale et al., 2005), shows a black-and-white representation of the workspace as seen on the monitor by the subject, that is, the “stimulus” presented to the participants in the experiment.
All data were collected, synchronised, and saved in real time at a frequency of slightly less than 50Hz, this being the best frequency obtained from the gaze tracker.
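In practice such an acquisition boils down to a single timed loop that polls the three devices and writes one timestamped record per cycle. The sketch below is only illustrative: the reader functions read_fob, read_e504 and read_glove are hypothetical placeholders for the actual device drivers, and the CSV output format is an assumption of ours.

```python
import csv
import time

SAMPLE_PERIOD = 1.0 / 50.0  # target ~50 Hz; in practice limited by the gaze tracker


def acquire(duration_s, read_fob, read_e504, read_glove, path="session.csv"):
    """Poll the three devices and save one synchronised, timestamped row per cycle.

    read_fob()   -> (x, y, z, azimuth, elevation, roll)   # hypothetical driver call
    read_e504()  -> (valid, gaze_x, gaze_y)               # hypothetical driver call
    read_glove() -> True if the hand is closed            # hypothetical driver call
    """
    t0 = time.monotonic()
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["t", "x", "y", "z", "az", "el", "roll",
                         "gaze_valid", "gaze_x", "gaze_y", "hand_closed"])
        while (t := time.monotonic() - t0) < duration_s:
            fob = read_fob()
            valid, gx, gy = read_e504()
            closed = read_glove()
            writer.writerow([t, *fob, int(valid), gx, gy, int(closed)])
            # sleep away whatever is left of this cycle
            time.sleep(max(0.0, SAMPLE_PERIOD - ((time.monotonic() - t0) - t)))
```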
Building the data set
The first question was what pieces of data to consider to train our machine, that is, how to filter and/or manipulate the data obtained from the setup. It is well-known that, when a human subject wants to grasp an object, he/she fixates the desired object and then performs the reaching action, without looking at his/her own hand while reaching (see, e.g., Johansson, Westling, Bäckström, & Flanagan, 2001). Therefore, we considered (a) the average of the subjects’ hand velocity, (b) the variance of the subjects’ gaze coordinates and (c) whether the subjects’ right hand was open or closed. We then expect, while fixating and reaching,
1. the gaze coordinates to hover around the point on the screen where the desired object is seen; that is, their standard deviation over some time to be small; and
2. the hand to move towards the object on the screen, that is, the hand velocity components to be on average large.
The instants in which the hand is closed signal the intention to grasp, whereas those in which it is open represent negative examples. Data (a) were easily obtained by differentiating in time the hand position x, y, z coordinates obtained from the FoB and then averaging these values over a certain time window (see below); data (b) were obtained by evaluating the standard deviation, over the same time window, of the gaze coordinates obtained from the E504; and lastly, data (c) were obtained directly from the CyberGlove. (Samples corresponding to negative values of the E504 validity flag were ignored; we manually verified that this did not hamper the overall statistics.)
Thus, from each subject we obtained a sequence of 6-tuples (the three hand velocity coordinates, the two gaze coordinates and the open/closed hand flag). The above considerations should hold over a certain time window, characteristic of the fixation/reaching operations; call it τ. In general each subject will have a different τ(i), i = 1,...,7. Driven by this, we decided to feed the learning system the following data: for each user i (and therefore for each sequence) and for a range of different values Tc attributed to τ(i), the hand velocity average values over Tc (three real numbers) and the gaze position standard deviations over Tc (two real numbers). Training was enforced by requiring that the system guess, instant by instant, whether the hand was closed or not, represented as an integer value, 1 or −1 respectively. The problem of guessing when the subject wants to grasp was thus turned into a typical supervised learning problem.
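As a concrete illustration, the construction of the learning samples for a single subject and a single window length Tc could be sketched as follows; the 50 Hz sampling rate, the array layout and the function name build_samples are our own assumptions, not part of the original setup.

```python
import numpy as np

FS = 50.0  # approximate sampling rate (Hz)


def build_samples(hand_pos, gaze, hand_closed, Tc):
    """Turn one subject's raw sequence into (input, label) pairs for a window length Tc.

    hand_pos:    (N, 3) hand x, y, z from the FoB
    gaze:        (N, 2) gaze coordinates from the E504
    hand_closed: (N,)   open/closed flag from the CyberGlove
    """
    w = max(1, int(round(Tc * FS)))           # window length in samples
    vel = np.diff(hand_pos, axis=0) * FS      # differentiate position to get velocity
    X, y = [], []
    for t in range(w, vel.shape[0]):
        v_mean = vel[t - w:t].mean(axis=0)          # (a) average hand velocity over Tc
        g_std = gaze[t - w:t].std(axis=0)           # (b) gaze standard deviation over Tc
        X.append(np.concatenate([v_mean, g_std]))   # 5 real numbers (3 without the gaze)
        y.append(1 if hand_closed[t] else -1)       # (c) target: hand closed (1) or open (-1)
    return np.asarray(X), np.asarray(y)
```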
Grasping speed
In choosing the range for Tc, we were driven by the main consideration that a moving time window should not be longer than the interval of time between one grasping attempt and the next. In fact, a longer time window could trick the system into mixing data obtained during two or more independent grasping attempts.
By examining all sequences we found that the interval between one grasping attempt and the next lasted on average 7.1 ± 1.8 seconds. We therefore let Tc range in the interval 0.1,...,5 seconds.
In general, we expected to find a smallest effective value for Tc, which would then be the required τ(i) for each user, reasoning that shorter values would convey too little information about the ongoing movement, whereas for longer ones the moving averages would reach a plateau, tending to the overall average values of the hand velocity and gaze standard deviations. In fact, a moving average is roughly equivalent to a low-pass filter, and the longer Tc, the lower the cutoff frequency; evaluating the smallest effective value for Tc is tantamount to finding the right cutoff frequency, that is, to filtering out noise without damaging the signal.
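To make the low-pass analogy concrete, recall the standard result for an N-sample moving average (boxcar) filter, which is a textbook fact rather than a figure from our study: sampling at frequency f_s, a window of N = Tc · f_s samples has magnitude response

\[
|H(f)| = \frac{1}{N}\left|\frac{\sin(\pi f N / f_s)}{\sin(\pi f / f_s)}\right|,
\qquad
f_{-3\,\mathrm{dB}} \approx \frac{0.443\, f_s}{N} = \frac{0.443}{T_c}.
\]

At our sampling rate this means, for instance, that Tc = 0.1 s corresponds to a cutoff of roughly 4.4 Hz, while Tc = 5 s corresponds to roughly 0.09 Hz.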
Support Vector Machines
Our machine learning system is based upon Support Vector Machines (SVMs). Introduced in the early 90s by Boser, Guyon and Vapnik (Boser et al., 1992), SVMs are a class of kernel-based learning algorithms deeply rooted in Statistical Learning Theory (Vapnik, 1998), now extensively used with good results in, e.g., speech recognition, object classification and function approximation (Cristianini & Shawe-Taylor, 2000). For an extensive introduction to the subject see, e.g., Burges (1998).
We are interested here in the problem of SVM classification, that is: given a set S of l training samples (x_i, y_i), i = 1,...,l, with x_i ∈ ℝⁿ and y_i ∈ {−1, 1}, find a function f, drawn from a suitable functional space F, which best approximates the probability distribution of the source of the elements of S. This function will be called a model of the unknown probability distribution. In order to decide whether a sample belongs to either category, the sign of f is considered, with the convention that sgn(f(x)) ≥ 0 indicates y = 1 and vice-versa. In practice, f(x) is a sum of l elementary functions K(x, x_i), each one centered on a point x_i in S and weighted by a real coefficient α_i:

\[
f(x) = \sum_{i=1}^{l} \alpha_i K(x, x_i) + b, \qquad (1)
\]

where b ∈ ℝ. The choice of K, the so-called kernel, is done a priori and defines F once and for all; it is therefore crucial. According to standard practice (see, e.g., Cristianini & Shawe-Taylor, 2000) we have chosen a Gaussian kernel, which has one positive parameter σ, the standard deviation of the Gaussian functions used to build (1). Notice that this is not related to the fact that the target probability distribution might or might not be Gaussian.
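For reference, the standard textbook form of the Gaussian kernel is

\[
K(x, y) = \exp\!\left(-\frac{\|x - y\|^{2}}{2\sigma^{2}}\right),
\]

with σ playing the role of a common width for all the Gaussian bumps appearing in (1).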
Now, let C be a positive parameter; then the α_i and b are found by minimising, with respect to these coefficients, the quantity L_P (the training phase), where

\[
L_P = R(f) + C \sum_{i=1}^{l} L\bigl(y_i, f(x_i)\bigr). \qquad (2)
\]

Here R is a regularisation term and L is a loss functional. In practice, after the training phase, some of the α_i will be zero; the x_i associated with non-zero α_i are called support vectors. Both the training time (i.e., the time required by the training phase) and the testing time (i.e., the time required to evaluate f at a point not in S) crucially depend on the total number of support vectors; therefore, this number is an indicator of how hard the problem is. Since the number of support vectors is proportional to the size of the sample set (Steinwart, 2003), an even better indicator of the hardness of the problem is the percentage of support vectors with respect to the sample set size. We will denote this percentage by the symbol pSV and call it the size of the related model. Since we aim to implement the system on-line, models with the smallest possible size are to be preferred.
In (2), minimising the sum of R and L together ensures that the solution approximates well the values in the training set while at the same time avoiding overfitting, i.e., poor accuracy on points outside S. Smaller values of the parameter C give more importance to the regularisation term, and vice-versa.
There are, therefore, two parameters to be tuned in our setting: C and σ. In all our tests we found the optimal values of C and σ by grid search with 3-fold cross-validation. This ensures that the obtained models will have a high generalisation power, i.e., their guess will be accurate also on samples not in S.
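Such a model-selection step can be sketched with scikit-learn's GridSearchCV, again used only as an illustrative stand-in for the LIBSVM-based tooling described below; the grid values in the usage comment are placeholders, with the ranges actually adopted reported at the end of this Section.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def select_model(X, y, C_grid, sigma_grid):
    """Grid search over C and sigma with 3-fold cross-validation."""
    param_grid = {
        "C": list(C_grid),
        # map the Gaussian width sigma to scikit-learn's gamma parameter
        "gamma": [1.0 / (2.0 * s ** 2) for s in sigma_grid],
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
    search.fit(X, y)
    return search.best_estimator_, search.best_params_


# example with placeholder grids (not the values used in the study):
# best, params = select_model(X, y, np.logspace(-1, 3, 5), np.logspace(0, 2, 5))
```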
Notice, lastly, that the quantity to be minimised in Equation (2) is convex; thanks to this, and to the use of a kernel, SVMs have the advantages that their training is guaranteed to reach a global solution and that they can easily work in high-dimensional, non-linear feature spaces, as opposed to analogous algorithms such as artificial neural networks. As a matter of fact, SVMs are best employed when the chosen kernel maps the samples to a space in which the problem is linearly separable, that is, in which a hyperplane (a linear function) can be found that separates the samples labelled 1 from those labelled −1.
We have employed LIBSVM v2.82 (Chang & Lin, 2001), a standard, efficient implementation of SVMs.
According to the procedure described in the previous parts of this Section, we decided to set up an SVM for each user i and each value of the time window Tc. In order to compare the performance with and without the use of the gaze signal, we defined the input space to be ℝ³ in the case of not using the gaze (the 3 numbers representing the average hand velocity over Tc) and ℝ⁵ in the case of using the gaze (the 5 numbers representing the hand velocity averages and gaze position standard deviations over Tc). According to this, and to the experience gathered in previous work (Castellini & Sandini, 2007), the ranges of the parameters C and σ were chosen as follows:
C was 10^k with k = −1,...,3 in steps of 0.2, whereas σ was √5 · 10^k with k = 0,...,2, in steps of 0.2 (√3 · 10^k in the case of not using the gaze).
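Putting the pieces together, the experimental grid amounts to training one model per (user, Tc, feature set) triple. The loop below is a schematic of that bookkeeping only: the dictionary of raw per-subject sequences, the 0.1 s step over Tc and the reuse of the build_samples and select_model helpers from the earlier sketches are all our own assumptions.

```python
import numpy as np

C_GRID = [10.0 ** k for k in np.arange(-1.0, 3.0 + 1e-9, 0.2)]   # C range as given above
SIGMA_EXPONENTS = np.arange(0.0, 2.0 + 1e-9, 0.2)                # exponent range for sigma
TC_GRID = np.arange(0.1, 5.0 + 1e-9, 0.1)                        # seconds; step is an assumption


def train_all_models(sequences):
    """sequences: {user_id: (hand_pos, gaze, hand_closed)} -- hypothetical raw data store."""
    results = {}
    for user, (hand_pos, gaze, hand_closed) in sequences.items():
        for Tc in TC_GRID:
            X, y = build_samples(hand_pos, gaze, hand_closed, Tc)   # earlier sketch: 5 features
            for use_gaze, dim in ((True, 5), (False, 3)):
                X_d = X if use_gaze else X[:, :3]                   # keep only the velocity averages
                sigma_grid = [np.sqrt(dim) * 10.0 ** k for k in SIGMA_EXPONENTS]
                model, params = select_model(X_d, y, C_GRID, sigma_grid)  # earlier sketch
                results[(user, Tc, use_gaze)] = (model, params)
    return results
```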