Introduction
Both eye tracking and motion capture technologies are nowadays widely used in the human sciences (e.g., in music research or sign language linguistics), although the two technologies are usually employed separately. However, measuring eye and body movements simultaneously would offer great potential for investigating action-perception links and cross-modal interaction in human behavior in general, and in musical behavior and sign language in particular. Especially in communicative and joint actions, such as making music or dancing together, combining different data acquisition tools like motion capture and eye tracking would provide new and innovative possibilities for conducting research.
Possible research questions of interest include whether performers in a musical ensemble coordinate eye and body movements to create successful joint performances, or whether gaze directions reflect participants’ movements and interactive behaviors when dancing with another person. In sign language research, in which eye behavior, together with the activity of the hands and other parts of the body, has been argued to be an important means of organizing linguistic structure, possible research questions include how exactly signers coordinate eye gaze and eye movements with manually produced linguistic units, and how the temporal alignment of eye and hand behaviors differs, for example, between native signers and sign language learners.
However, the biggest challenge in combining separate data acquisition technologies, such as motion capture and eye tracking, is reliably synchronizing the devices so that the data can either be recorded at the same time or be precisely aligned afterwards. Accurate synchronization of the different data streams is crucial for time-critical analysis of the data and for relating the different data streams to each other in order to answer the research questions at hand.
Research Using Motion Capture and Eye Tracking
Both technologies have been used separately in various research areas such as psychology, biomechanics, education, sports, linguistics, and music. Since the authors are mainly familiar with research in music and sign language, the following literature review will focus on these research fields.
Music and Motion Capture
In the music field, motion capture has been used, for example, to study gestures during music performance or spontaneous movement responses to music. In terms of performers’ gestures, Thompson and Luck (2012) investigated expressivity during piano performances, finding increased movement in structurally important parts when playing expressively compared to playing without expression.
Van Zijl and Luck (2013) addressed the role of experienced emotions on movement characteristics during music performance, finding increased movement when playing with a sad expression compared to playing while actually feeling sad.
Glowinski et al. (2013) studied the movements of a string quartet during performance, finding different head movement patterns in joint versus solo performances.
In research on music-induced movement, Burger, Thompson, Luck, Saarikallio, and Toiviainen (2013a) explored relationships between spontaneous full-body movement and musical characteristics such as pulse clarity and spectral content, finding that clearer pulses and stronger spectral content in low and high frequencies encouraged participants to move more.
Van Dyck et al. (2013) showed that participants’ spontaneous movements increased with the presence of the bass drum. Carlson, Burger, London, Thompson, and Toiviainen (2016) focused on personality characteristics in relation to music-induced movement, finding that participants with higher conscientiousness and lower extraversion showed greater responsiveness to tempo changes.
Haugen (2016) studied music-dance relationships in both Brazilian Samba and Norwegian Telespringar, while Naveda and Leman (2010) investigated spatiotemporal representations in dance gestures of Samba and the Charleston.
Movement has also been studied from the perspective of perception. Vuoskoski, Thompson, Clarke, and Spence (2014) showed stick-figure animations to participants and studied the perception of expressivity in musical performances, finding that the visual component seems to contribute more strongly to the communication of expressivity than the auditory one. Burger, Thompson, Saarikallio, Luck, and Toiviainen (2013b) investigated the attribution of emotions to music-induced movement by showing participants stick-figure animations of spontaneous dance movement, finding that the dance movements were perceived as conveying positive rather than negative emotions.
Su and Keller (2018) studied synchronization when perceiving stick-figure videos of dance movements of oneself and others, finding that participants, especially musicians, synchronized more accurately with others’ movements than with their own.
Sign Language Linguistics and Motion Capture
In sign language linguistics, motion capture has been used in a few studies to investigate various linguistically relevant phenomena from an articulatory perspective. In early work, Wilbur (1990) showed that there is a link between stressed sign production and certain kinematic variables such as displacement, velocity, and acceleration.
Wilcox (1992), in turn, looked at the production of consecutive hand alphabets (i.e., fingerspelling) and showed, for instance, that the velocity peaks of the finger movements to target alphabets are a significant feature in the organization of fingerspelling.
More recently, Tyrone and Mauk (2010) examined sign lowering (i.e., producing the sign lower than its citation form) in American Sign Language and found that it is affected in predictable ways by production rate, phonetic context, and position within an utterance (see also Mauk & Tyrone, 2012).
Jantunen (2012), in turn, investigated whether the signed syllable, that is, a sequential movement of the articulator(s), could be empirically defined with the help of a single acceleration peak. He found that this was not the case, as the number of acceleration peaks in syllables could vary from zero to three, and acceleration peaks could also be found outside the syllable domain. In another study, Jantunen (2013) compared sign strokes (“signs”) with non-strokes (“transitions”) and established that there is a kinematic difference between them.
In more recent work, Puupponen, Wainio, Burger, and Jantunen (2015) analyzed the kinematic characteristics and functional properties of different head movements in Finnish Sign Language and showed that, unlike results reported in some earlier studies, there is no perfect correspondence between their forms and functions.
Music and Eye Tracking
Eye tracking has been frequently used to study music (sight-)reading. Looking at amateur musicians, Penttinen, Huovinen, and Ylitalo (2013) found that more experienced musicians used shorter fixation times and more linear scanning of the notated music. Focusing on adult music students, Penttinen, Huovinen, and Ylitalo (2015) found that performance majors showed shorter fixation durations and larger eye-hand spans. Professional performers also showed more efficient fixations that helped them anticipate difficulties and potential problems compared to non-musicians (Drai-Zerbib, Baccino, & Bigand, 2012).
Hadley, Sturt, Eerola, and Pickering (2017) found that harmonically incongruent melodies caused rapid disruption in eye movements and pupil dilation.
Gruhn et al. (2006) investigated differences between saccadic eye movements in musicians and non-musicians, finding that musicians had more express saccades, stronger voluntary eye control, and more stable fixations than non-musicians.
Laeng, Eidet, Sulutvedt, and Panksepp (2016) found relationships between pupil dilation and musical chills, in that pupil size increased around the moment of experiencing the chill. Gingras, Marin, Puig-Waldmüller, and Fitch (2015) could predict pupillary responses from music-induced arousal and individual differences: pupils dilated more for arousing or tense excerpts, in particular when the excerpts were liked less.
Fink, Geng, Hurley, and Janata (2017) investigated the role of attention during music listening on pupil dilation, finding pupil dilation in response to deviants in complex musical rhythms.
Woolhouse and Lai (2014) studied participants’ eye movements while observing dance movements, finding more fixations on the upper rather than the lower body, as well as greater dwell times for the head than for the torso, legs, or feet.
Sign Language Linguistics and Eye Tracking
In sign language linguistics, the use of eye tracking has been very rare. Concerning perception studies, Muir and Richardson (2005) found that native signers tend to fixate on the upper face of the addressee, especially if the addressee is close by. Emmorey, Thompson, and Colvin (2009) showed that this tends not to be the case for signing beginners, who prefer to look at the mouth area.
Wehrmeyer (2014) showed that the viewing habits of deaf and hearing adults also differ in other contexts, for example, when watching sign language interpreted news broadcasts.
Concerning production studies, Thompson, Emmorey, and Kluender (2006) found that signers’ gaze behavior differs depending on the type of the verb sign and how it is modified in the signing space. In a follow-up study (2009), they also showed that this gaze behavior is affected by signing skill. A recent study by Hosemann (2011), however, suggested that the pattern found by Thompson et al. (2006) may not be so systematic.
Combining Motion Capture and Eye Tracking
Within the music field, only very few studies so far have tried to combine motion capture and eye tracking, while in sign language research, motion capture and eye tracking have not been used together before. In music-related research, Bishop and Goebl (2017) studied visual attention during duet performances, expecting that visual attention would decline with repetition of the piece as the performers got to know each other’s intentions. Marandola (2017) investigated hand-eye synchronization in xylophone performance, suggesting that Western musicians prepare for the hits to be performed with their gaze, while Cameroonian musicians tend to look away from the instrument.
What Is Motion Capture?
Different systems for recording motion capture are available (Burger, 2013). Inertial systems track the acceleration and orientation of sensors attached to participants/objects in three dimensions, while magnetic systems measure the three-dimensional position and orientation of objects in a magnetic field. Of more importance for this paper are camera-based systems, in particular infrared-based optical motion capture systems. In such systems, cameras send out infrared light that is reflected by (passive, wireless) markers attached to participants and/or objects, so that these reflections can be recorded by the cameras. These systems consist of an array of several cameras daisy-chained together to represent the data in three-dimensional space. Using a method called direct linear transformation, the system acquires the exact position and orientation of each camera with respect to the others and the floor, in order to create the three-dimensional representation of the capture space and triangulate the marker positions (Robertson, Caldwell, Hamill, Kamen, & Whittlesey, 2004).
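To illustrate the underlying principle of triangulation, the following minimal Python sketch (not taken from any vendor's software) recovers one marker position from its 2D image coordinates in two cameras, assuming the 3x4 projection matrices P1 and P2 are already known from calibration:

import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear triangulation of one 3D point from two camera views.

    P1, P2 : 3x4 camera projection matrices (assumed known from calibration)
    x1, x2 : (u, v) image coordinates of the same marker in each view
    Returns the estimated 3D marker position.
    """
    # Each view contributes two linear constraints on the homogeneous 3D point X.
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # Solve A @ X = 0 in the least-squares sense via SVD.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]   # de-homogenize

In practice, a full system combines many more cameras and solves the same kind of least-squares problem for every marker in every frame.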
Since optical motion capture systems work only with reflections (i.e., passive reflective markers), these markers need to be labeled to identify which body part or object each marker represents. Two main approaches to data labeling exist. Some systems, such as those manufactured by Vicon or OptiTrack, let the user define the locations of the markers and create a body model prior to the recording, which is then applied during the recording or in post-processing. If the model works correctly, the data is labeled automatically. However, if the model fails (due to, for instance, marker loss), manual labeling is required. In the Qualisys system, the user first records the raw markers without any body model. Afterwards, one recording is labeled manually, from which a body model is created that is then applied to the other recordings to label them automatically. Here, too, manual labeling is required if the model fails. However, the model can be improved by updating it after each recording. The main challenge of optical systems is that occlusions of markers during the recording cause marker loss and gaps in the data. Thus, such occlusions should be prevented by careful marker placement and camera positioning before and during the recording.
Optical motion capture systems have high temporal and spatial resolution: recent systems track at up to 10,000 frames per second with a spatial resolution of less than one millimeter. In music- and sign language-related applications, standard capture rates range from 60 to 240 Hz (most often 100–120 Hz), which is sufficient for capturing most relevant activities, such as playing instruments or dancing (Burger, 2013).
What Is Eye Tracking?
In the case of eye tracking, camera-based trackers are most widely used nowadays: an infrared light source illuminates the eye, and the tracker detects the pupil using the so-called corneal reflections, yielding a variety of measures including the position and dilation of the pupil (Holmqvist et al., 2011). Screen-based or stationary eye trackers are attached to the object to be tracked, usually a screen, with the participant placed in a stationary position in front of the screen and the tracking system. Mobile eye trackers, on the other hand, are head-mounted eye trackers worn like glasses, so the participant can move in space while the tracker captures the eye movements and the scene being observed. Mobile eye trackers therefore have two kinds of cameras: one (infrared-based) records the eye/pupil, and the other (a regular pixel-based camera with a fish-eye lens) records the field or scene, representing what the participant sees.
Eye trackers also require calibration, usually by providing four fixed points in space that the participant is asked to focus on one after another while keeping the head still (i.e., moving only the eyes). With these four points, the system can combine the eye positions with the field video and display the gaze focus as a crosshair in the field video. Most mobile eye trackers track at rates of 50 or 60 Hz. Both mocap and eye tracking systems yield numerical data representations of body and eye movement, respectively, that can be processed computationally.
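As an illustration of what such a four-point calibration makes possible, the following Python sketch (a simplified stand-in, not the algorithm of any particular eye tracker) estimates a homography from four correspondences between raw pupil coordinates and field-video coordinates and then maps new pupil positions into the field video:

import numpy as np

def fit_homography(pupil_pts, field_pts):
    """Estimate a 3x3 homography H mapping pupil coordinates to field-video
    coordinates from four (or more) calibration point correspondences."""
    A = []
    for (x, y), (u, v) in zip(pupil_pts, field_pts):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    return Vt[-1].reshape(3, 3)

def map_gaze(H, pupil_xy):
    """Project a raw pupil coordinate into field-video pixel coordinates."""
    p = H @ np.array([pupil_xy[0], pupil_xy[1], 1.0])
    return p[:2] / p[2]

Commercial trackers use more elaborate eye models, but the basic idea of mapping eye-camera measurements into the scene image is the same.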
Synchronization of Motion Capture and Eye Tracking
Reliable and accurate synchronization between the motion capture system and the eye tracker is crucial for relating both data streams to each other and for time-critical analysis of the data. Different approaches have been developed, and the two studies mentioned above employed different methods. One possibility is to use (i.e., purchase) solutions offered by the manufacturers (e.g., sync boxes or plug-ins, as in Bishop & Goebl, 2017); another is to use (analog) claps equipped with mocap markers and recorded by the eye tracking glasses’ field camera, as in Marandola (2017). However, manual claps require the researcher to synchronize the data manually, which is a rather time-consuming effort. Moreover, since the video (of the eye tracker field camera) recording the clap is based on (changing) pixels, finding the exact frame to which the mocap data should be synchronized might be more challenging than working with digital time series representations of motion capture and eye movement data. Another potential challenge for synchronization might arise from differences in the starting times of the recordings of the two eye tracking cameras. This means that the delay between the start of the eye camera and the field camera has to be additionally quantified for each recording, resulting in possible inconsistencies.
Ready-made solutions offered by the manufacturers are available for several motion capture system and eye tracker combinations, although not for all available eye tracking systems. Furthermore, such a plug-in is relatively cost-intensive and usually requires a complicated technical setup using two computers (one for running the motion capture recording software, the other for running the eye tracker software, at least in the case of the Qualisys motion capture system) that are linked via a wireless network connection, which might cause computer/system security issues or delays/lags in the processing. Other solutions (e.g., from NaturalPoint OptiTrack) work via a sync box connecting the different devices, for instance via a TTL signal and/or SMPTE timecode (see below), which is also cost-intensive and possibly requires engineering knowledge, as cables might need customized connectors to fit the available inputs and outputs of the devices and computers.
Synchronizing different devices is a technically challenging problem. The challenge is not only to ensure that recordings start at the same time, but also that they do not drift apart over the course of the recording (such that one recording would be longer or contain fewer frames than the other). In addition, the sampling points of the different systems may be locally misaligned (due to an unstable sampling rate), which is referred to as jitter. While high-quality motion capture systems, such as the Qualisys system used in this case, exhibit close to zero drift and jitter (one part per million according to Qualisys customer support), eye trackers are reported to exhibit some drift and jitter (Holmqvist et al., 2011).
Different ways to synchronize devices have been developed and are used in industrial and research applications. One way is to send TTL (transistor-transistor logic) triggers indicating the start and stop of a recording. Other developments include timecode and genlock/sync (McDougal, 2015). Timecode, such as the SMPTE timecode developed by the Society of Motion Picture and Television Engineers, is a standard in the film industry for linking cameras or video and audio material. The SMPTE timecode indexes each recorded frame (or every second, third, etc., depending on the frame rate of the devices) with a time stamp, offering sync points for post-processing. However, such timecodes can still exhibit jitter as well as drift if the devices are not strictly kept together by, for instance, a central clock or a reference signal genlocking the devices. Such devices are relatively expensive and require some engineering knowledge to set up correctly. Often, they also require a cable connection between the device and the recording computer. While this is less of a problem for the motion capture system (since the pulses would only be sent to the cameras), the (wireless) eye tracker would lose its mobility. Some systems offer wireless synchronization via WLAN; however, this is likely to introduce delays, inconsistencies, and data loss due to unreliability and loss of the signal.
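For illustration, an SMPTE timecode simply encodes hours, minutes, seconds, and frames. The following Python sketch (assuming a non-drop-frame timecode at a known frame rate) converts a timecode string to an absolute frame index and to seconds, which is the kind of bookkeeping needed to align frames across devices in post-processing:

def smpte_to_frames(timecode: str, fps: int) -> int:
    """Convert a non-drop-frame SMPTE timecode 'HH:MM:SS:FF' to a frame index."""
    hh, mm, ss, ff = (int(part) for part in timecode.split(":"))
    return ((hh * 60 + mm) * 60 + ss) * fps + ff

def smpte_to_seconds(timecode: str, fps: int) -> float:
    """Convert a non-drop-frame SMPTE timecode to seconds."""
    return smpte_to_frames(timecode, fps) / fps

# Example: the same timecode expressed as frames and as seconds at 25 fps
print(smpte_to_frames("00:01:30:10", fps=25))    # 2260 frames
print(smpte_to_seconds("00:01:30:10", fps=25))   # 90.4 s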
Another option that has been developed to synchronize different devices is the lab streaming layer (LSL). The LSL is a system for the synchronized collection of various time series data over a network. However, it requires programming and computer knowledge, especially if the motion capture and eye tracker systems at hand are not among the already supported devices. Thus, it might not be suitable and easy to use for everyone.
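As a rough sketch of what using the LSL involves (assuming the Python binding pylsl is installed and that a small custom script pushes the eye tracker samples, which is exactly the kind of programming effort referred to above), a gaze stream could be published as follows:

from pylsl import StreamInfo, StreamOutlet, local_clock

# Describe a two-channel gaze stream (x/y pupil position) at a nominal 50 Hz.
# The stream name and source id below are illustrative, not tied to any device.
info = StreamInfo(name="EyeTrackerGaze", type="Gaze",
                  channel_count=2, nominal_srate=50,
                  channel_format="float32", source_id="eyetracker01")
outlet = StreamOutlet(info)

# In a real setup, samples would come from the eye tracker's own API;
# here a placeholder sample is pushed with an LSL timestamp.
sample = [0.42, 0.17]            # hypothetical normalized pupil coordinates
outlet.push_sample(sample, local_clock())

A corresponding script on the receiving side would resolve the stream and collect the time-stamped samples alongside the mocap data, which is where the additional programming effort lies.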
Aim of This Paper
In order to avoid such device-specific, hardware- and/or software-based solutions, we aimed for a device-free, behavior-based approach to reliably synchronize the two systems that can be used with any combination of motion capture and eye tracking systems. The approach should be easy for the participant to perform and automatically processable by a computational algorithm, to avoid manually synchronizing each recording separately. Such a solution places low demands on technical knowledge and can be used with any combination of eye tracker and motion capture system at no extra cost. Furthermore, the synchronization is based purely on the numerical representations of both the mocap and eye tracker data, so possible differences in the recording start times of the different (eye tracker) cameras do not affect the synchronization accuracy.
This computational synchronization solution was developed in a pilot phase, and a refined version of it was subsequently tested in a second, larger data collection. This paper describes this development as well as the evaluation of the accuracy in comparison to manual synchronization of the recordings.
Results and Evaluation
We will first present the results regarding the alignment of computationally extracted sync points and the manually acquired “ground truth data” to evaluate the accuracy of the sync point extraction.
Table 2 displays the differences between the temporal locations from the computational synchronization approach and the manual “ground truth data” of both mocap and eye tracker for each of the 40 recordings. The differences are given in frames.
For the mocap data, the “ground truth data” equaled the computationally derived sync points in all but five trials for the nod at the beginning and three trials for the nod at the end. In all of these cases, the difference was one frame (5 ms). For the pupil data, the “ground truth data” matched the computationally derived sync points in all but four trials for the nod at the beginning and five trials for the nod at the end. In these instances, each difference was also one frame (20 ms). In all cases but one (P2, end nod mocap), the computationally derived sync point was located one frame after the “ground truth” point.
In order to further evaluate the accuracy of the synchronization solution over time (i.e., drift), the durations between the nod at the beginning and the nod at the end were compared between the mocap system and the eye tracker for each trial. The results of the short recordings are shown in Table 3, while the five longer recordings are presented in Table 4.
For the short recordings, the duration differences are on average below the eye tracker’s sampling interval (20 ms at 50 Hz); in eight out of ten cases, the average difference is below half of that interval. In 35 of the 40 recordings, the difference is at most half of the eye tracker’s sampling interval. In 14 of the short recordings, the eye tracker and mocap recordings were of exactly the same length, whereas in 13, the eye tracker recording was shorter than the mocap recording, and in the remaining 13, it was longer.
In the five longer recordings, the differences ranged from 0 to 35 ms, with four of the five recordings showing differences below the eye tracker’s sampling interval. For all long recordings, the eye tracker recordings were shorter than the mocap recordings.
Subjective Experiences
Participants were asked to rate four questions regarding their experience of the nod after the data collection on a 7-point scale. A detailed overview of the ratings is given in Table 5.
Participants were overall positive about the task. They found it easy to keep the eyes open and were overall rather comfortable with the task. It was very clear when to produce the nod. Furthermore, the nod was not perceived as disturbing.
Discussion
In this paper, we described the development of a computational approach to automatically synchronize recordings of a motion capture system and an eye tracker. The aim of the paper was to present a solution that is reliable and does not depend on a ready-made plug-in by the manufacturer, but is instead device-free and intrinsic to the recording of the data.
The measured accuracy of the motion capture data is very high: 90% (nine out of ten) of the pilot recordings at a frame rate of 120 Hz, and 90% (72 out of 80) of the recordings from the second data collection at 200 Hz, could be optimally aligned between the “ground truth data” and the computational solution, while the remaining ones showed a one-frame difference. The difference of one frame (8.3 ms at 120 Hz and 5 ms at 200 Hz) could be due to the smoothing of the data after the time derivation or to rounding during the calculation. Small inconsistencies could also have emerged from a slower or smoother movement during the nod. However, the time difference is so small that it can be considered negligible.
The accuracy of the eye tracker data was lower than that of the mocap data in the pilot recordings, though it increased in the second data collection. In the pilot, five out of the ten recordings (50%) could be optimally aligned, whereas the remaining five differed by one frame each. In the second data collection, 71 out of 80 sync points equaled the “ground truth data” (88.75%). These values suggest that the procedure can be regarded as reliable and accurate for the required purpose of time-critically synchronizing both systems. Our analysis showed a maximum difference of one frame in each system, suggesting a maximum difference (“worst case scenario”) of 25 ms between mocap and eye tracker, while the actual differences were mostly much smaller. These values should be sufficient for most research questions related to eye movement, unless very fast saccades and microsaccades are of interest (Holmqvist et al., 2011; Wierts, Janssen, & Kingma, 2008). However, if higher temporal accuracy of the eye movements is needed, the sampling frequency of the eye tracker should be (much) higher than 50 Hz.
We increased the sampling frequency of the mocap system from 120 Hz to 200 Hz to match the eye tracker sampling frequency in an integer relationship. Despite recording more data points per unit of time, this did not increase the accuracy of locating the sync points (the peaks of the nods), as we obtained the same percentage of correctly located sync points. However, it might still have reduced rounding errors when combining the data with the eye tracker and thus increased data accuracy when trimming the data, due to less noisy rounding and interpolation between the two systems.
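The benefit of the integer relationship can be illustrated with a minimal Python sketch: because 200 Hz is an exact multiple of 50 Hz, every fourth mocap frame coincides with one eye tracker frame, so the streams can be combined without interpolation (variable names and the sync index below are purely illustrative):

import numpy as np

MOCAP_FPS, EYE_FPS = 200, 50
factor = MOCAP_FPS // EYE_FPS            # 4, an exact integer ratio

mocap_channel = np.random.randn(2000)    # placeholder for one mocap marker channel
mocap_sync = 412                         # hypothetical sync-point frame index
# Starting from the mocap sync point, every 4th frame lines up with one eye frame.
mocap_at_eye_rate = mocap_channel[mocap_sync::factor]

With a non-integer ratio (e.g., 120 Hz against 50 Hz), the corresponding step would not be a whole number of frames, and rounding or interpolation would be needed.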
The less accurate synchronization result for the pupil data, especially in the pilot recordings, could be related to the “ground truth data” being based on a video signal rather than on a time series representation, as in the case of the mocap data. Local minima of a curve might be more clearly detectable than changes between frames of the eye tracker video, so that kind of “ground truth data” could be slightly less reliable. Issues in pupil detection (i.e., when the pupil had to be adjusted manually) could also have influenced the accuracy. Manual adjustment might have caused lower precision or larger differences between consecutive frames than automatic tracking, so the resulting velocity curve could have contained more noise. Furthermore, slight inconsistencies could have emerged due to the different starting points of the field and eye cameras. D-Lab does not use a global clock for its recordings but records the devices “as they are detected”, so there was a variable delay (ranging from 14 ms to 85 ms) between the start of the two cameras.
When trimming the data and comparing the resulting lengths of both recordings, the recordings were very similar in length. In most cases, the differences were below the eye tracker’s sampling interval (often even below half of it), so the accuracy should be sufficient for most applications, as mentioned above. The small differences in the lengths of the recordings could be related to rounding errors when deriving the sync points, or they could suggest that there is a small amount of drift in the alignment of the two data streams. Since the differences in the shorter recordings are both positive and negative (i.e., for some recordings the eye tracker recording is shorter, whereas in other cases the mocap recording is shorter), these are more likely due to rounding errors, whereas in the long recordings, the eye tracker recordings were all shorter than the mocap recordings, suggesting a trend that the eye tracker was “faster”. A more extensive investigation of drift as well as possible jitter would require appropriate hardware synchronizing the recordings via genlocking on a frame-to-frame basis.
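The drift check described above amounts to simple arithmetic on the detected sync points. A short Python sketch, with hypothetical frame indices and the sampling rates used in the second data collection, makes the comparison explicit:

MOCAP_FPS, EYE_FPS = 200, 50   # sampling rates of the two systems

# Hypothetical sync-point frame indices (nod at the beginning and at the end)
mocap_start, mocap_end = 412, 9214
eye_start, eye_end = 103, 2303

mocap_duration = (mocap_end - mocap_start) / MOCAP_FPS   # 44.01 s
eye_duration = (eye_end - eye_start) / EYE_FPS           # 44.00 s
print(f"duration difference: {(mocap_duration - eye_duration) * 1000:.1f} ms")  # 10.0 ms

A difference well below the eye tracker’s 20 ms sampling interval, as in this example, is consistent with rounding rather than systematic drift.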
Moreover, our longer recordings of about one minute were still relatively short. In order to further investigate drift and jitter between the motion capture system and eye tracker, longer recordings (e.g., about 10 minutes) should be made. However, since recordings in our studies are usually not longer than one to two minutes, we refrained from making longer recordings at this stage.
In the pilot data collection, only two out of six participants could be reliably synchronized using this approach. The other four proved difficult for different reasons. In two cases, the eye tracker could not reliably track the participants’ pupils due to technical difficulties. In the other two cases, the participants blinked at the moment of the nod. In these cases, the eyes were closed in the middle of the nod, so it was impossible to manually adjust (or estimate) the pupil position, which otherwise would still have made computational synchronization possible. In the second data collection, we provided the participants with more thorough and clear instructions and asked them to perform practice nods prior to the recording to familiarize them with the procedure. This clearly seemed to help, as none of the participants blinked during the nod in the second data collection. This finding strongly indicates the importance of giving participants clear instructions, explaining the procedure to them, and ensuring they understand the underlying rationale.
The assessment of participants’ subjective experiences related to the nod showed that it was not perceived as disturbing or difficult to perform. It seemed to be well integrated into the task, and it was clear when and how to produce it. Having a defined start and end of each recording might also have helped participants concentrate on the task. In order to prompt the participants even further, a metronome beat could be presented (in the case of hearing participants), so that the participant could synchronize the nod to, for instance, the fifth beat.
In order to check that the nod was performed successfully, a real-time or near real-time check could be included. If it were possible to, for instance, display the vertical displacement of the eye movement as a time series directly during the recording, the success of the nod (especially whether or not a blink occurred) could be checked immediately after it was performed.
The question remains whether the synchronization plug-in provided by Qualisys or a sync-box solution would have produced more accurate results. Since the plug-in’s technical setup involves two wirelessly connected computers, such connections could potentially introduce lags. However, in order to answer this question properly, the setup would have to be tested with both the plug-in and the nod and the results compared afterwards.
We also considered other motion sequences for the synchronization in order to potentially improve our approach and make the synchronization easier. We piloted two alternatives: (1) several consecutive nods and (2) a passive application of force, with someone else exerting a sudden strike to the participant’s head. However, our volunteers found both approaches more uncomfortable than the single nod. The consecutive nods felt very unnatural, and either the first or the last nod was less pronounced, making it difficult for the automatic detection to choose one. The sudden knock felt uncomfortable, since, despite the participants knowing about it, it still felt somewhat unexpected. Additionally, volunteers involuntarily blinked during the induced head movement, probably due to the sudden and unexpected exertion of force. A single nod therefore seems to be the best approach for this method.
The approach described here will be integrated into the MoCap Toolbox to make it accessible to everyone. As of now, it is available for free on the toolbox website (https://www.jyu.fi/hytk/fi/laitokset/mutku/en/research/materials/mocaptoolbox) and will be integrated into the toolbox with the next release. The function set includes a function to read the Dikablis eye tracker data into Matlab and convert it into a MoCap Toolbox-compatible data structure, as well as a function to automatically sync the eye tracker recording with the corresponding mocap recording. No Matlab expertise beyond a basic understanding of how to use the MoCap Toolbox, and no external devices, are needed to apply this syncing method. Furthermore, since the MoCap Toolbox stores the eye tracker data in the same way as mocap data, the same functions and procedures can be used to analyze the eye tracker data.
The MoCap Toolbox function also offers the possibility to adjust the thresholds for detecting the nod in both the mocap and the pupil data. In our data sets, a threshold of −2 reliably detected the nod in both the (z-scored) mocap and pupil data, though this might not be the case for other recordings. Adjustable thresholds that can account for participants performing the nod at different speeds and extents therefore make the function more flexible.
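To make the detection logic concrete, the following Python sketch illustrates one way such a thresholded nod detection can work; it is a simplified illustration rather than the actual toolbox function, and it assumes that the downward phase of the nod appears as a sharp negative peak in the z-scored velocity of the vertical coordinate (head marker or pupil position):

import numpy as np

def find_nod_sync_point(vertical_pos, fps, threshold=-2.0):
    """Return the frame index of a nod-like sync point in a 1-D signal.

    vertical_pos : vertical head-marker position or vertical pupil position
    fps          : sampling rate of the signal
    threshold    : threshold applied to the z-scored velocity (default -2)
    """
    velocity = np.gradient(vertical_pos) * fps          # time derivative
    z = (velocity - velocity.mean()) / velocity.std()   # z-score the velocity
    below = np.where(z < threshold)[0]                  # frames exceeding the threshold
    if below.size == 0:
        raise ValueError("No nod detected; consider adjusting the threshold.")
    # Take the most extreme downward velocity among the candidate frames.
    return below[np.argmin(z[below])]

Running such a detector on both the mocap and the pupil data yields one sync point per stream, after which the recordings can be trimmed to the common segment between the two nods.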
Furthermore, the nod was useful for manually synchronizing the different data streams used in the experiment. The motion capture, eye tracker, and regular video data could be accurately synchronized by using the nod as a reference when importing the data into the freely available audio and video annotation and transcription software ELAN (see the screenshot in Figure 8), developed at the Max Planck Institute for Psycholinguistics, The Language Archive, Nijmegen, The Netherlands. Given the different sampling frequencies of the systems, it would have been much more difficult to synchronize the recordings without the clear displacement provided by the nod.