A scanpath is the spatiotemporal sequence of fixations and saccades performed during one trial of eye movement measurements. The term scanpath was introduced by Noton and Stark (1971a, 1971b). In their study, participants’ eye movement sequences were compared across encoding and recognition of simple line drawings. Scanpath analysis revealed high similarity across these conditions, leading to the conclusion that the idiosyncratic scanpath of a participant directly reveals his or her current cognitive processes. Although such a strong connection between cognition and scanpaths is no longer assumed, it is still argued that the scanpath reveals something about the underlying cognitive processes involved in eye movement control. Foulsham, Dewhurst, Nyström, Jarodzka, Johansson, Underwood, and Holmqvist (2012), for instance, recently argued that scanpaths are indeed similar between encoding and retrieval of spatial information because eye movements are used as spatial retrieval cues for memory content. Scanpath similarities have also been found between visual perception and imagery (Laeng & Teodorescu, 2002) as well as between different levels of expertise in a sensorimotor task (Foerster et al., 2011). Scanpaths during a highly trained sequential sensorimotor task were even similar with and without visual input (Foerster, Carbone, Koesling, & Schneider, 2012). Hence, in all these experimental conditions, scanpath similarity measures are essential for understanding how various factors shape eye movement control over space and time.
However, different methods can be used to determine scanpath similarity quantitatively. Some methods operate on exact x- and y-coordinates, others on regions of interest. Some methods align fixations according to their numerical index, others according to their temporal position in the path. Each procedure has benefits and limits, so the choice of method depends on the task and the research question. For research on sequential tasks in particular, such as sensorimotor real-world tasks, none of the existing methods is suitable. The most prominent methods are the string edit methods (Brandt & Stark, 1997; Hacisalihzade, Stark, & Allen, 1992; Levenshtein, 1966; Noton & Stark, 1971a, 1971b). String edit methods compare fixations within scanpaths according to their spatial similarity along their numerical and, more recently, temporal position within the paths (Cristino, Mathôt, Theeuwes, & Gilchrist, 2010). The first step of these methods is to superimpose a grid onto the space in which the scanpaths were executed, for instance onto the computer screen in a laboratory experiment. Each grid region is then labeled individually, usually by letters, numbers, or a combination of both. In the next step, each of the to-be-compared scanpaths is expressed as a string according to the sequence of region labels it passes through. A scanpath with a first fixation in B, a second in D, and a last fixation in G would become the string BDG. To evaluate the similarity between two paths, the string corresponding to one of the paths is transformed into the string corresponding to the other path by insertion, deletion, and substitution of individual labels. Transforming the string GNIRF into AIJRF would consist of one substitution of G by A, one deletion of N, and one insertion of J after I, adding up to three editing steps. The number of editing steps needed to transform one string into the other is the dissimilarity value. This value is often normalized so that perfect similarity is expressed by 0 and “maximum” dissimilarity by 1 (for details see, e.g., Cristino et al., 2010). There are several variants of string editing. The editing steps can be weighted differently. The duration of fixations can be taken into account, for instance, by repeating the label of a fixation in proportion to its duration. The substitution by a label that is far away within the grid can be weighted higher than the substitution by a label that is nearby (e.g., Cristino et al., 2010). However, all string edit methods compare fixation locations according to their position along the scanpaths, meaning that fixations are more likely to be aligned if they have similar numerical indices within the path or were executed after a similar interval from trial onset. This procedure is reasonable if the to-be-compared scanpaths are executed in response to a relatively stable visual input, such as in picture viewing.
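The basic string edit computation described above can be illustrated with a minimal sketch. The function names are hypothetical, all editing steps are weighted equally, and normalizing by the length of the longer string is only one common choice assumed here; the cited methods may weight and normalize differently.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, label_a in enumerate(a, start=1):
        curr = [i]
        for j, label_b in enumerate(b, start=1):
            cost = 0 if label_a == label_b else 1
            curr.append(min(prev[j] + 1,          # delete label_a
                            curr[j - 1] + 1,      # insert label_b
                            prev[j - 1] + cost))  # substitute (or keep a match)
        prev = curr
    return prev[-1]


def normalized_dissimilarity(a: str, b: str) -> float:
    """Edit distance divided by the longer string length (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


print(edit_distance("GNIRF", "AIJRF"))             # 3 editing steps, as in the text
print(normalized_dissimilarity("GNIRF", "AIJRF"))  # 0.6
```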
In real-world sensorimotor tasks such as walking (Jovancevic-Misic & Hayhoe, 2009), sandwich making (Hayhoe, Shrivastava, Mruczek, & Pelz, 2003), or sports (Land & McLeod, 2000), however, trial completion times vary across trial repetitions and participants. In addition, stimulus viewing times are uncontrollable because participants actively change their environment while performing the task. Fixations performed during one and the same functional unit of the task (e.g., a specific sub-action) can have a different index in the two paths and might be executed at a different interval after trial onset. When performing a task for the first time, participants typically need longer and perform more fixations than after having acquired a reasonable degree of expertise (Epelboim, Steinman, Kowler, Edwards, Pizlo, Erkelens, & Collewijn, 1995; Foerster et al., 2011). Apart from these intra-individual differences during skill acquisition, there are inter-individual differences, in that some participants may execute more fixations and take longer to perform a specific task than others. Imagine a child and its parents tying a shoe. The child will take much longer than the adults even though it works through exactly the same sub-actions to fulfill the task. When investigating such an everyday action, one might want to know whether they all fixate similar locations on the shoe or the lace while performing exactly the same sub-action. Do both look at the lace while setting the loop, or does the child look at its own finger while the adult looks at the lace? The overall question is how similar scanpaths are during such sequential sensorimotor tasks when eye movements are compared within the sub-actions along the task sequence.
In a study on learning the sensorimotor skill of speed stacking (Foerster et al., 2011), we wanted to know how eye movement patterns differ across low and high levels of expertise. Therefore, we used mobile eye tracking to investigate eye movements in speed stacking before and after a 14-day training session. In speed stacking, pyramids of plastic cups have to be stacked up and down as fast as possible in a fixed sequence. Participants became faster and performed fewer fixations while learning the task. However, we wanted to know whether they nevertheless looked at similar locations to guide their hand movements, i.e., whether scanpaths are similar across levels of expertise. As the scanpaths were much shorter after training, traditional scanpath similarity methods could not help answer the question of whether participants looked at similar locations while performing the same sub-action of the task. Therefore, we developed a scanpath similarity method with a functional matching procedure. First, we divided the task into 44 units. The object-related actions (ORAs) of the task were used as units. According to Land and Hayhoe (2001), an ORA is an act that is performed on an object without interruption. In the case of speed stacking, an ORA was defined as stacking a single cup or stack of cups up onto or down from other cups or stacks. Second, the cups’ starting configuration for each of the 44 ORAs was drawn schematically with a common coordinate system (see real-world example at the end of the present paper and Figure 7). The locations of all fixations were plotted in the scheme with respect to the cup arrangement of the corresponding ORA. This was done manually based on the frame-by-frame video information of the mobile eye tracker’s recordings. Afterwards, standardized x- and y-coordinates as well as an ORA index could be assigned to each fixation. This means that, first, the location of each fixation was standardized according to the visual input during the task and, second, each fixation was labeled according to the functional unit of the task (here, the ORA) in which it occurred. After all fixations had been labeled, the mean fixation location within each functional unit was calculated for each scanpath. Then, Euclidean distances were calculated between the to-be-compared scanpaths. Finally, distance values were evaluated by testing them against random baseline values of the same data set. The random baseline values were calculated by first scrambling fixation locations and then calculating the Euclidean distances between the scrambled path and the actually observed path. Thus, the method tests whether fixation locations are more similar within the functional units of the to-be-compared paths than across the functional units of one and the same path. In other words, it is tested whether the variance across scanpaths is smaller than the variance within a scanpath.
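The core of this earlier procedure, averaging fixation locations within each functional unit and then taking Euclidean distances between two paths, can be sketched as follows. This is only an illustration under assumptions: the function names and the `(unit, x, y)` tuple layout are hypothetical, and units that occur in only one of the two paths are simply skipped.

```python
from collections import defaultdict
from math import hypot


def mean_location_per_unit(fixations):
    """fixations: iterable of (unit, x, y). Returns {unit: (mean_x, mean_y)}."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for unit, x, y in fixations:
        entry = sums[unit]
        entry[0] += x
        entry[1] += y
        entry[2] += 1
    return {unit: (sx / n, sy / n) for unit, (sx, sy, n) in sums.items()}


def unit_distances(path_a, path_b):
    """Euclidean distance between the mean fixation locations of each shared unit."""
    means_a = mean_location_per_unit(path_a)
    means_b = mean_location_per_unit(path_b)
    shared = sorted(means_a.keys() & means_b.keys())
    return {
        unit: hypot(means_a[unit][0] - means_b[unit][0],
                    means_a[unit][1] - means_b[unit][1])
        for unit in shared
    }


# Made-up data: two fixations in unit 1 of path A, one fixation elsewhere.
path_a = [(1, 0.0, 0.0), (1, 2.0, 0.0), (2, 5.0, 5.0)]
path_b = [(1, 1.0, 4.0), (2, 5.0, 8.0)]
print(unit_distances(path_a, path_b))  # {1: 4.0, 2: 3.0}
```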
The question the method has answered so far is whether the location sequence is similar in two scanpaths. However, not only the sequence of fixation locations can be of interest when comparing scanpaths, but also other characteristics of eye movements within the paths, such as fixation durations, saccade lengths, or saccade directions (cf. Dewhurst, Nyström, Jarodzka, Foulsham, Johansson, & Holmqvist, 2012; Jarodzka, Holmqvist, & Nyström, 2010). In addition, averaging fixation locations within task components is a sub-optimal calculation step if many different fixations belong to each functional unit of the task. Therefore, we extended our method to strengthen its applicability for scanpath comparisons in sequential tasks. Here, we report the extended version of our functionally sequenced scanpath similarity method (FuncSim). First, we define the types of tasks whose scanpaths can be investigated with FuncSim. Then, the prerequisites and algorithms of FuncSim are described in detail. Afterwards, we outline the advantages of the method compared to traditional methods. Finally, we compare FuncSim with recent scanpath similarity methods (ScanMatch by Cristino et al., 2010, and MultiMatch by Dewhurst et al., 2012; Jarodzka et al., 2010), first on the basis of artificial examples and second with the help of a real-world example.
Sequential tasks as the domains of application
The functionally sequenced scanpath similarity method has proven useful in comparing scanpaths in the sensorimotor task of speed stacking across levels of expertise, across participants, and across lighting conditions (Foerster et al., 2011, 2012). The method is useful in all tasks that can be divided into distinct functional units, as is often the case with sensorimotor tasks. A characteristic of these sensorimotor tasks is that they consist of distinct sub-actions or sub-tasks that have to be accomplished. The sequence of these sub-units can be completely fixed, partly fixed, or completely variable. When playing a specific piece on the piano, for instance, the sequence of notes to be played is completely fixed. However, when making a peanut butter and jelly sandwich (Hayhoe, Shrivastava, Mruczek, & Pelz, 2003), you can decide whether you want to spread the peanut butter or the jelly first. Nevertheless, you cannot spread jelly or peanut butter on the bread before opening the respective jar. Sandwich making is an example of a task with a partly fixed sequence of sub-units. Note that the sub-units of piano playing consist of the same type of action (key presses) on different objects of the same type (piano keys), while sandwich making consists of many different types of actions on different types of objects. Tasks that can be investigated with this method are not restricted to tasks on graspable objects in the world. Reading aloud is an example. Here, sub-units are not actions on graspable objects, but actions (speaking) directed at sentences, words, or syllables on paper or a computer screen. Do participants look at the same place in the text while uttering the same sub-unit of it? Moreover, there do not necessarily have to be objects in the world. Reciting a speech or a poem by heart is an example. In this case, sub-units (spoken words) are performed on internally activated object representations. Sub-units do not have to be motor responses; they can be internal cognitive “responses”. External signals can be used to determine the cognitive sub-task of the participant and the sub-units at the same time. Tone signals could, for instance, instruct participants which cognitive sub-task they are to perform, e.g., a first tone indicates to count numbers on a computer screen, a second tone indicates to sum the numbers, a third tone indicates to multiply them, etc. Finally, sub-units can be of varying complexity and execution time. The important prerequisite for using FuncSim is that fixations can be assigned to distinct units of the task.
FuncSim: The functionally sequenced scanpath similarity method
In previous publications (Foerster et al., 2011, 2012), we introduced the sub-action sequenced linear distance method, which allows scanpath similarity to be determined within functional units of a task. Here, we report two major extensions of our method. First, we introduce a second alignment variant that makes the method more robust across tasks. Second, we add the possibility to compare scanpaths according to multiple characteristics (cf. Dewhurst et al., 2012; Jarodzka et al., 2010) to expand the number of research questions that can be investigated with this method.
Before using FuncSim, it has to be ensured that fixation location data belong to a common coordinate system. This is already the case in static eye tracking with fixed body and head position. However, in mobile eye tracking without such a movement restriction, the location of each fixation has to be standardized, e.g., in world coordinates according to the visual input during the task (Figure 1). In stacking a pyramid of cups (Foerster et al., 2011, 2012), for example, fixation locations could be standardized according to the bottom cup that is not moved. In tying a shoe, fixation locations could be standardized according to the location of the shoe. In mobile eye tracking, this is usually done by manual frame-by-frame coding. In this laborious procedure, fixations are extracted from the eye-tracking video frame by frame. It is manually coded whether the change in the eye-marker position on the video is small enough to be part of the same fixation or large enough to be part of a saccade to another position. Although this procedure is prone to errors, it is still the state of the art in mobile eye-tracking research (Holmqvist, Nyström, Andersson, Dewhurst, Jarodzka, & Van de Weijer, 2011, pp. 175-176). However, with the fast progress in eye-tracking technologies, semi-automated computer-vision-based analysis procedures might become available in the future.
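As a rough illustration of standardizing coordinates to a reference object, the following sketch subtracts the reference object’s position from the gaze position in each coded video frame. It is an assumption-laden simplification: the function name is hypothetical, the reference position is assumed to be known per frame (e.g., from manual coding), and only translation is handled, not rotation or scale changes of the scene camera.

```python
def standardize(gaze_xy, reference_xy):
    """Express gaze positions relative to a reference object, frame by frame.

    gaze_xy and reference_xy are equally long lists of (x, y) video coordinates.
    """
    if len(gaze_xy) != len(reference_xy):
        raise ValueError("need one reference position per gaze sample")
    return [(gx - rx, gy - ry)
            for (gx, gy), (rx, ry) in zip(gaze_xy, reference_xy)]


# The reference object (e.g., the unmoved bottom cup) drifts in the video
# because the head moves; the standardized gaze position stays stable.
gaze = [(320, 240), (330, 250)]
bottom_cup = [(300, 200), (310, 210)]
print(standardize(gaze, bottom_cup))  # [(20, 40), (20, 40)]
```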
Moreover, each fixation has to be labeled according to the functional unit of the task in which it was executed (Table 1 and Figure 2). If a fixation starts in one functional unit and ends in the next functional unit, we recommend duplicating it and labeling it once with the former and once with the latter functional unit. This means that the coordinates and fixation duration of this fixation appear twice, as in the shaded lines of Table 1. If a fixation spans even more functional units, it can be added and labeled as often as functional units are traversed. After this labeling procedure, each functional unit contains all fixations that have been made from its onset to its offset. While this is our recommendation, the user can of course decide to assign fixations spanning several functional units solely to the unit in which they start or solely to the unit in which they end. If all fixations in a specific task are known to be just-in-time, it might be plausible to assign fixations to the units in which they start. However, if fixations in the investigated task are known to be largely anticipatory, it might be more useful to assign fixations to the units in which they end.
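The labeling options described above can be sketched as a small helper. The names, the 1-based unit indices, and the representation of units by their onset times are assumptions for illustration; the `policy` argument switches between the recommended duplication (`"all"`) and assignment to the starting or ending unit only.

```python
from bisect import bisect_left, bisect_right


def label_fixations(fixations, unit_onsets, policy="all"):
    """Assign fixations to functional units.

    fixations:   list of (onset_ms, offset_ms, x, y), in temporal order
    unit_onsets: sorted onset times of units 1, 2, 3, ... (first onset = trial start)
    policy:      "all" duplicates a fixation into every unit it spans,
                 "start" keeps only the unit in which it starts,
                 "end" keeps only the unit in which it ends
    Returns a list of (unit, x, y, duration_ms).
    """
    labeled = []
    for onset, offset, x, y in fixations:
        first = max(1, bisect_right(unit_onsets, onset))
        # An offset that coincides with a unit onset still counts as the earlier unit.
        last = max(first, bisect_left(unit_onsets, offset))
        if policy == "start":
            units = [first]
        elif policy == "end":
            units = [last]
        else:
            units = range(first, last + 1)
        # Copies keep the full measured duration, as recommended in the text.
        labeled.extend((unit, x, y, offset - onset) for unit in units)
    return labeled


onsets = [0, 1000, 2500]  # made-up onsets of units 1, 2, and 3
fixes = [(0, 400, 10, 10), (400, 1200, 20, 15), (1200, 2600, 30, 5)]
for row in label_fixations(fixes, onsets):
    print(row)
```

With these made-up values, the second and third fixations each cross a unit boundary and therefore appear twice under the `"all"` policy.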
In sensorimotor real-world tasks, the functional units are usually actions. In reading, functional units could be words, so that fixations would be labeled according to the word that was spoken at the same time. In externally sequenced tasks, external signals such as response cues could be used as functional units. In a laboratory task, a response cue can be a stimulus on the computer screen or a tone from the loudspeaker that announces the next sub-task, e.g., solve the first mathematical problem in response to a high tone, solve the second problem in response to a high tone. In other tasks, other functional units could of course be chosen.
After all fixations have been labeled, the data can be analyzed with FuncSim. The first step of the FuncSim algorithm is the calculation of length and direction values. As lengths and directions can only be calculated from two successive fixations, the next fixation is always taken into account. This means that the last length and direction within each functional unit are calculated on the basis of the saccade that starts in the regarded unit and ends in the next functional unit. In addition, no length and no direction values are assigned to a path’s very last fixation. Length values are calculated in the same measure as the x- and y-coordinates (e.g., cm, pixels, degrees of visual angle). Direction values are calculated in degrees from -180 to +180 relative to the horizontal. Copied fixations carry the same characteristics in each functional unit. Length and direction calculations for the artificial scanpaths of Table 1 can be seen in Table 2.
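A minimal sketch of this first step is given below. The function name is hypothetical, and two points are assumptions rather than statements from the text: the sign convention of the direction depends on whether the y-axis points up or down in the chosen coordinate system, and the values are computed on the original fixation sequence (before fixations are copied into several units) so that copies can inherit the same length and direction.

```python
from math import atan2, degrees, hypot


def lengths_and_directions(fixations):
    """fixations: ordered list of (x, y) for one scanpath.

    Returns one (length, direction_deg) per fixation, describing the step to
    the next fixation. The very last fixation gets (None, None).
    """
    values = []
    for (x0, y0), (x1, y1) in zip(fixations, fixations[1:]):
        dx, dy = x1 - x0, y1 - y0
        # atan2 yields an angle to the horizontal between -180 and +180 degrees.
        values.append((hypot(dx, dy), degrees(atan2(dy, dx))))
    if fixations:
        values.append((None, None))
    return values


print(lengths_and_directions([(0, 0), (3, 4), (3, 0)]))
```

For these made-up coordinates, the first step has length 5 and a direction of roughly 53 degrees, the second has length 4 and a direction of -90 degrees, and the last fixation has no values.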
In the second step of the FuncSim algorithm, a random version of scanpath 1 is created. To this end, functional units and their fixations are scrambled under the constraint that no functional unit remains at its old position within the sequence. Fixation sequences within functional units are kept constant (Figure 2). Thus, the calculated random scanpath is a permutation of scanpath 1’s functional unit sequence. The random scanpath will be used for the computation of random baseline difference scores in the last step of the FuncSim algorithm. Note that this random baseline is conservative, as the permuted scanpath contains actually fixated locations rather than locations drawn at random from the whole coordinate range.
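A permutation in which no element keeps its position is known as a derangement. The sketch below produces one by reshuffling until the constraint holds; this rejection approach, the function names, and the dictionary layout are assumptions for illustration and not necessarily how FuncSim itself implements the step.

```python
import random


def permute_units(units, rng=random):
    """Return a shuffled copy of `units` in which no unit keeps its position.

    Assumes unique unit labels; `rng` is the random module or a random.Random.
    """
    if len(units) < 2:
        raise ValueError("need at least two functional units")
    while True:
        shuffled = list(units)
        rng.shuffle(shuffled)
        if all(old != new for old, new in zip(units, shuffled)):
            return shuffled


def random_baseline_path(path, rng=random):
    """path: {unit: [fixations]} in task order.

    Returns {unit position: fixations taken from a different unit}. The
    fixation sequence inside each unit is left untouched.
    """
    positions = list(path)
    sources = permute_units(positions, rng)
    return {pos: path[src] for pos, src in zip(positions, sources)}


scanpath_1 = {1: ["a1", "a2"], 2: ["b1"], 3: ["c1", "c2"], 4: ["d1"]}
print(random_baseline_path(scanpath_1, random.Random(1)))
```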
In the third step, FuncSim aligns the to-be-compared scanpaths to each other in a meaningful way. Two variants of FuncSim can be used to align the scanpaths. The original procedure “average” averages fixation locations, fixation durations, saccade lengths, and saccade directions within each functional unit, so that each functional unit contains only one value for each dimension (one line per functional unit, Table 3 and Figure 3). Especially with many different fixations within functional units, an alternative to averaging is useful. The new variant “reldur” (relative duration) takes the fixation durations into account, i.e., fixations within functional units are aligned according to their durations relative to the sum of all fixation durations within the functional unit (Table 3, Figure 3, and Figure 4). In this way, multiple fixations within a specific functional unit across the to-be-compared paths are aligned according to their contribution to the functional unit’s overall dwell time. As an example, in scanpath 2, the first fixation (350 ms) lasts for 7/11 of the dwell time in functional unit 1 (350 ms + 200 ms = 550 ms), while the second fixation (200 ms) lasts for only 4/11. When scaling the shorter overall dwell time of scanpath 2 in functional unit 1 to the longer overall dwell time of scanpath 1, this ratio is maintained (Figure 4). The rationale behind this alignment procedure within functional units is that the segments of a unit take up constant proportions of the unit’s total duration. This means that when a unit is performed at a different speed, all unit segments are scaled in time. It has been shown that this is the case for the timing of action sequences (Lacquaniti, Terzuolo, & Viviani, 1984; Viviani & Terzuolo, 1980; Wing, 1978). Therefore, aligning fixations within functional units of a task according to their relative duration aims to compare the same segments within units to each other. However, if there is reason to believe that the investigated functional units do not follow this relative timing, then the “average” alignment procedure might be more adequate even with more fixations per functional unit.
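One possible reading of the “reldur” alignment is sketched below: within a functional unit, each path’s fixations are laid out on a common 0-to-1 scale of relative dwell time, and every stretch on which a fixation of one path overlaps a fixation of the other yields one aligned pair. The pairing rule, the function names, and the durations used for scanpath 1 are assumptions for illustration; only the 350 ms and 200 ms fixations of scanpath 2 come from the text. Exact fractions are used so that coinciding fixation boundaries are detected reliably, which requires integer durations.

```python
from fractions import Fraction


def relative_bounds(durations):
    """End point of each fixation on a 0..1 scale of the unit's dwell time.

    durations: integer fixation durations (e.g., in ms) within one unit.
    """
    total = sum(durations)
    bounds, running = [], 0
    for duration in durations:
        running += duration
        bounds.append(Fraction(running, total))
    return bounds


def align_reldur(durations_a, durations_b):
    """Pair the fixation indices of two paths within one functional unit."""
    ends_a = relative_bounds(durations_a)
    ends_b = relative_bounds(durations_b)
    pairs, i, j = [], 0, 0
    while i < len(ends_a) and j < len(ends_b):
        pairs.append((i, j))
        # Move on from whichever fixation ends first (both if they end together).
        if ends_a[i] < ends_b[j]:
            i += 1
        elif ends_b[j] < ends_a[i]:
            j += 1
        else:
            i += 1
            j += 1
    return pairs


print(relative_bounds([350, 200]))               # [Fraction(7, 11), Fraction(1, 1)]
print(align_reldur([300, 300, 400], [350, 200])) # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

In this hypothetical unit, three fixations of scanpath 1 and two fixations of scanpath 2 produce four aligned pairs, which illustrates why the number of comparison lines under “reldur” depends on both paths.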
In a fourth step, FuncSim calculates the difference values between assigned fixations within the functional units of the to-be-compared scanpaths. Location dissimilarity is calculated as the Euclidean distance between assigned fixation locations. Location difference is calculated in the imported measure. This can, for instance, be pixels, cm, or degrees of visual angle. Apart from this location difference score, we extended our method by the calculation of fixation duration differences, saccade length differences, and saccade direction differences. Fixation duration difference is the absolute difference between the durations of the assigned fixations. Note that the actually measured fixation durations are compared and not the relative fixation durations from the alignment procedure. Fixation duration difference is calculated in the imported measure, e.g., in milliseconds or seconds. Length difference is the absolute difference between the assigned fixation-to-fixation lengths and is calculated in the same measure as location difference. Direction difference is the absolute difference between the assigned fixation-to-fixation directions and is calculated in degrees. Line-separated between-path differences (BPDs) in all dimensions for the artificial scanpaths of Table 1 and Figure 2 can be seen in Table 4. The averages of all difference scores within a dimension (location, duration, length, and direction) serve as the final dissimilarity values of FuncSim (Means in Table 4). These values are not normalized further. This could of course be done. For instance, the mean Euclidean distance could be normalized using the diagonal of the eye-tracking video or of the task’s work space as the “maximum” dissimilarity value. However, one and the same quantitative difference value does not always represent the same qualitative dissimilarity. If the participants’ task is to thread a needle, only much smaller distance values should be regarded as similar than if the task is to throw a ball into a basket. Thus, the size of the task objects is an important criterion for judging the magnitude of the calculated mean Euclidean distance. The same holds for the other difference scores. Instead of normalizing the difference scores according to an arbitrary value of the measurement, we conducted a similarity evaluation procedure based on the variance across the functional units of the task.
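The four difference scores can be sketched for a list of already aligned fixation pairs. The `Fixation` record and the function names are hypothetical, and the sketch assumes that every fixation in a pair has length and direction values (the very last fixation of a path has none and would need separate handling). The direction difference is taken as the plain absolute difference, as described in the text, without wrapping around 180 degrees.

```python
from dataclasses import dataclass
from math import hypot
from statistics import mean


@dataclass
class Fixation:
    x: float
    y: float
    duration: float   # measured duration, e.g., in ms
    length: float     # fixation-to-fixation length to the next fixation
    direction: float  # fixation-to-fixation direction in degrees


def differences(a: Fixation, b: Fixation) -> dict:
    """Difference scores for one aligned pair of fixations."""
    return {
        "location": hypot(a.x - b.x, a.y - b.y),  # Euclidean distance
        "duration": abs(a.duration - b.duration),
        "length": abs(a.length - b.length),
        "direction": abs(a.direction - b.direction),
    }


def mean_differences(aligned_pairs) -> dict:
    """Average each dimension over all aligned pairs (one pair = one line)."""
    rows = [differences(a, b) for a, b in aligned_pairs]
    return {dim: mean(row[dim] for row in rows) for dim in rows[0]}


pairs = [
    (Fixation(0, 0, 300, 5, 0), Fixation(3, 4, 200, 7, 90)),
    (Fixation(1, 1, 100, 2, -45), Fixation(1, 1, 150, 2, 45)),
]
print(mean_differences(pairs))
```

For these made-up pairs, the mean differences are 2.5 for location, 75 for duration, 1 for length, and 90 for direction.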
Therefore, in a final step, FuncSim calculates random baseline differences in all dimensions that can be used for statistical testing. Computation of a random baseline was inspired by the method reported in the study of ’t Hart, Vockeroth, Schumann, Bartl, Schneider, König, and Einhäuser (2009). Random baseline differences are calculated in each dimension by comparing scanpath 1 with its randomly chosen unit sequence permutation (see step 2). Alignment of scanpath 1 and its permutation is performed according to the chosen procedure (“average” or “reldur”, see step 3). Difference scores are calculated as described in the last paragraph. Line-separated random baseline differences (RBDs) in all dimensions for the artificial scanpaths of Table 1, Table 2, and Table 3 can be seen in Table 4. Note that with “reldur” alignment a different number of lines arises for RBDs than for BPDs, as different scanpaths are compared (scanpath 1 to its permutation instead of to scanpath 2). The averages of all difference scores within a dimension serve as the final dissimilarity values of FuncSim’s random baseline (Means in Table 4). Again, these values are not normalized further.
A t-test can be performed to evaluate whether the calculated between-paths difference (BPD) scores across to-be-compared scanpaths are significantly smaller than the random baseline difference (RBD) scores calculated within the same scanpath. Therefore, with the help of FuncSim it can be compared whether gaze characteristics (fixation location, fixation duration, fixation-to-fixation length, and fixation-to-fixation direction) are more similar within functional units across to-be-compared scanpaths than across functional units within one and the same scanpath. In this way, FuncSim is an adequate method to judge whether scanpaths are similar if participants are engaged in the same functional unit of a task. Here is a summary of the preparation steps needed to use FuncSim as well as FuncSim’s algorithm steps:
Preparation:
- A. Standardizing x- and y-coordinates.
- B. Labeling fixations according to functional units.
FuncSim algorithm:
- 1. Calculating length and direction values.
- 2. Creating baseline path by random unit permutation.
- 3. Aligning scanpaths according to functional units (“average” or “reldur” variant).
- 4. Calculating between-path differences (BPDs).
- 5. Calculating random baseline differences (RBDs).
Post-testing:
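The post-test, comparing BPDs with RBDs by means of a t-test, can be sketched as follows. The text does not specify which t-test variant is used; because the number of BPD and RBD lines can differ under “reldur” alignment, this sketch assumes an independent-samples test with unequal variances (Welch’s t-test). The function name is hypothetical, and the sketch returns only the t statistic and the degrees of freedom; the one-sided p-value would be obtained from a t distribution table or a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(bpd, rbd):
    """Welch's t statistic and degrees of freedom for BPD versus RBD scores.

    A clearly negative t supports the hypothesis that between-path
    differences are smaller than random baseline differences.
    """
    n1, n2 = len(bpd), len(rbd)
    se1 = variance(bpd) / n1  # squared standard error of the BPD mean
    se2 = variance(rbd) / n2  # squared standard error of the RBD mean
    t = (mean(bpd) - mean(rbd)) / sqrt(se1 + se2)
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


# Made-up location differences in one dimension.
t, df = welch_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0])
print(round(t, 3), round(df, 2))  # -4.041 4.96
```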
In the next section, the advantages of FuncSim compared to other scanpath similarity methods will be presented. In the sections thereafter, these advantages will be supported by mathematical examples.