Introduction
In daily life, humans receive a large quantity of information about the environment through sight and hearing. Fast processing of this information helps us to react rapidly and appropriately. Hence, the brain has a mechanism to direct attention towards particular regions or events, called salient regions or events. This attentional bias is not only influenced by visual and auditory information separately, but also by audio-visual interaction.
From psychophysical studies, we know that humans react faster to overlapping bimodal audio-visual stimuli than to unimodal (audio or visual) stimuli (Corneil, van Wanrooij, Munoz, & Van Opstal, 2002; Sinnett, Faraco, & Spence, 2008). Studies of audio-visual interaction concentrate on two areas: the influence of visual input on auditory perception and the influence of acoustic input on visual perception.
Early evidence of the influence of visual input on auditory perception is the “McGurk Effect”, a phenomenon that demonstrates a perceptual fusion between auditory and visual (lip-reading) information in speech perception. In this experiment, a film of a young woman repeating utterances of the syllable [ba] was dubbed onto lip movements for [ga]: normal adults reported hearing [da] (McGurk & MacDonald, 1976). The “McGurk Effect” works with perceivers of all language backgrounds (Cohen & Massaro, 1994), and it also works on young infants (Rosenblum, Schmuckler, & Johnson, 1997). Another well-known audio-visual interaction is that visual “lip-reading” helps speech to be understood when the speech is in poor acoustical conditions or in a foreign language (Jeffers & Barley, 1971; Summerfield, 1987).
Speech is a special audio stimulus, and numerous current studies focus on the audio-visual interaction of speech (Alho et al., 2012). A study by Tuomainen, Andersen, Tiippana, and Sams (2005) provided evidence of the existence of a specific mode of multi-sensory speech perception. More recently, observations of the mechanisms of speech stimuli and visual interaction have demonstrated that lip-read information is more strongly paired with speech information than with non-speech information (Vroomen & Stekelenburg, 2011). Other types of sound have been investigated less.
Auditory cues also influence visual perception. Previous studies showed that when auditory and visual signals come from the same location, the sound can guide attention toward a visual target (Perrott, Saberi, Brown, & Strybel, 1990; Spence & Driver, 1997). Moreover, other studies demonstrated that synchronous auditory and visual events can improve visual perception (Vroomen & De Gelder, 2000; Dalton & Spence, 2007). Another study considered the situation in which audio and visual information do not come from the same spatial location: a synchronous “pip” sound makes the visual object pop out phenomenally from its complex environment (Van der Burg, Olivers, & Bronkhorst, 2008).
Inspired by these studies of the influence of audio-visual interaction on human behavior, computer scientists have tried to simulate this attentional mechanism in computational attention models, which help to select important objects from a mass of information. Such models provide another way to better understand the attentional mechanism, and they are useful for applications such as video coding (Lee, Simone, & Ebrahimi, 2011) and video summarization (Wang & Ngo, 2012).
Studies in cognitive neuroscience show that eye movements are tightly linked to visual attention (Awh, Armstrong, & Moore, 2006). The study of eye movements enables a better understanding of the visual system and of the mechanisms in our brain that select salient regions. Furthermore, eye movements also reflect the influence of audio-visual interaction on human behavior. Quigley and her colleagues (Quigley, Onat, Harding, Cooke, & König, 2008) investigated how different locations of a sound source (played by loudspeakers at different locations: left, right, up and down) influence eye movements while viewing static images (Onat, Libertus, & König, 2007). The results showed that eye movements were spatially biased towards the regions of the scene corresponding to the location of the loudspeakers. Auditory influences on visual localization also depend on the size of the visual target (Heron, Whitaker, & McGraw, 2004). In videos, during dynamic face viewing, sound influences gaze to different face regions (Võ, Smith, Mital, & Henderson, 2012).
Although the interaction of features within the audio and visual modalities has been actively studied, the effect of sound on human gaze when looking at videos with their original soundtrack has been explored less. Our previous research (Song, Pellerin, & Granjon, 2011) showed that sound affects human gaze differently depending on the type of sound, and that the effect is greater for the on-screen speech class (the speakers appear on screen) than for the non-speech class (any kind of audio signal other than speech) and the non-sound class (intensity below 40 dB). Recently, Coutrot, Guyader, Ionescu, and Caplier (2012) showed that the original soundtrack of a video affects eye position, fixation duration and saccade amplitude, and Vilar et al. (2012), using non-original soundtracks, also concluded that sound affects human gaze.
In our previous research, we considered only three sound classes, with no strict control of sound events over time. In this paper, we provide a deeper investigation of which types of sound influence human gaze, using controlled sound events. A preliminary analysis was published in (Song, Pellerin, & Granjon, 2012). We first describe an audio-visual experiment with two groups of participants: with the original soundtrack, called the audio-visual (AV) condition, and without sound, called the visual (V) condition. Then, we examine the difference in eye position between the two groups of participants for thirteen more refined sound classes. The fixation durations of the groups with AV and V conditions are also compared.
Results
In order to investigate the effect of sound on visual gaze, we analyzed the difference of eye position between participants with AV condition and with V condition.
Figure 4 (a) shows an example of the eye positions of the two groups of participants. Figure 4 (b) shows an example of the density maps of the groups of participants with AV condition (Mhav) and with V condition (Mhv).
Comparison among three clusters of sound classes
We analyzed the Kullback-Leibler divergence (KL) between the eye positions of the participants in the two groups with AV and V conditions, among three clusters of classes (see Figure 2): “on-screen with one sound source”, “on-screen with more than one sound source” and “off-screen sound source”.
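As an illustration of how such a frame-by-frame KL value could be obtained, the sketch below builds Gaussian-smoothed eye-position density maps for the two groups and computes a symmetrized KL between them. The frame size, the smoothing width, the symmetrization, and all variable names are illustrative assumptions, not the exact parameters of this study.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(eye_xy, shape=(576, 720), sigma=20.0):
    """Gaussian-smoothed density map of one group's eye positions for one frame.

    eye_xy: iterable of (x, y) eye positions, one per participant.
    shape:  frame size in pixels (rows, cols); sigma: smoothing width in pixels.
    Both values are illustrative assumptions.
    """
    h = np.zeros(shape)
    for x, y in eye_xy:
        r, c = int(round(y)), int(round(x))
        if 0 <= r < shape[0] and 0 <= c < shape[1]:
            h[r, c] += 1.0
    h = gaussian_filter(h, sigma)
    h += 1e-12                   # avoid log(0) and division by zero
    return h / h.sum()           # normalize to a probability distribution

def kl_divergence(p, q):
    """Symmetrized Kullback-Leibler divergence between two density maps."""
    return float(0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))))

# Per frame: KL between the AV-condition map and the V-condition map
# kl_frame = kl_divergence(density_map(eye_av), density_map(eye_v))
```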
In this section, for each clip snippet, we investigated one second after the beginning of the second sound (from frame 6 to 30, to eliminate a reaction time of about 5 frames). We used an ANOVA test to compare KL among the different clusters of classes. This test requires the samples in each cluster to be independent. Because we consider continuous measurement over time and the eye position of most participants changes little between two adjacent frames, individual frames cannot be treated as independent samples. To solve this problem, we took the mean of the KL values over one second (from frame 6 to 30 after the beginning of the second sound) as one independent sample.
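A minimal sketch of this comparison, assuming per-frame KL values are already available for each clip snippet; the frame indexing and variable names are hypothetical.

```python
import numpy as np
from scipy.stats import f_oneway

def snippet_mean_kl(kl_per_frame):
    """One independent sample per clip snippet: the mean KL over frames 6 to 30
    after the onset of the second sound (indexing is illustrative)."""
    return float(np.mean(kl_per_frame[6:31]))

# kl_onscreen_one, kl_onscreen_many, kl_offscreen: one per-frame KL array per
# clip snippet in each cluster (hypothetical names).
# one  = [snippet_mean_kl(k) for k in kl_onscreen_one]
# many = [snippet_mean_kl(k) for k in kl_onscreen_many]
# off  = [snippet_mean_kl(k) for k in kl_offscreen]
# F, p = f_oneway(one, off)   # e.g., "one on-screen source" vs "off-screen source"
```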
Figure 4.
A sample frame from the singer class showing the eye positions of participants in the groups with AV condition (red points) and with V condition (blue points), and the corresponding density maps of the groups with AV condition, Mhav (red region), and with V condition, Mhv (blue region).
In Figure 5, with the ANOVA test, “off-screen sound source” presents the lowest KL among the three clusters of classes. The difference is significant between “on-screen with one sound source” and “off-screen sound source” (F(1,63)=4.72, p=0.034), and also significant between “on-screen with more than one sound source” and “off-screen sound source” (F(1,25)=4.67, p=0.041). The difference between “on-screen with one sound source” and “on-screen with more than one sound source” is not significant (F(1,69)=0.03, p=0.859). These results indicate that one or more localizable sound sources lead to a greater distance between the groups with AV and V conditions than a non-localizable sound source.
The results above were confirmed by two other metrics: cc and md.
To verify that the effect measured is really due to the second sound, we performed the same calculation for a period of one second (25 frames, from frame -24 to 0) before the transition from the first sound to the second sound, for all the classes. This “pre-transition” cluster (in Figure 5) can be considered as a baseline for the three other clusters. The difference is significant between “on-screen with one sound source” and “pre-transition” (F(1,133)=9.09, p=0.0031), and also significant between “on-screen with more than one sound source” and “pre-transition” (F(1,95)=4.65, p=0.034). The difference is not significant between “off-screen sound source” and “pre-transition” (F(1,89)=0.01, p=0.915). These results show that, for the second sound, one or more localizable sound sources lead to a greater distance between the groups with AV and V conditions than the pre-transition period (first sound).
Figure 5.
Kullback-Leibler divergence (KL) between participants with AV and V conditions in three clusters of classes: “on-screen with one sound source”, “on-screen with more than one sound source” and “off-screen sound source”, and compared to the “pre-transition” cluster. Larger KL values represent greater difference between groups with AV and V conditions.
To complete the previous study, we analyzed the entropy variation between before and after the sound transition. More precisely, in the AV condition (respectively the V condition), for each clip snippet, we calculated the mean entropy for a period of one second after the transition (from frame 6 to 30) and subtracted the mean entropy for one second before the transition (from frame -24 to 0). Then, we compared the entropy variations between the two conditions (AV and V) using a paired t-test. For the “on-screen with one sound source” and “on-screen with more than one sound source” clusters, the mean entropy variation is significantly larger in the AV condition than in the V condition (respectively t(53)=2.95, p=0.004 and t(15)=2.52, p=0.023) (Figure 6). Participants with AV condition are not only attracted by visually salient regions, such as faces and motion regions, but also by sound sources. For these two clusters, the entropy variation is negligible in the V condition. For “off-screen sound source”, the entropy variation is not significantly different between the AV and V conditions (t(9)=0.84, p=0.42) (Figure 6). In this case, participants with AV condition modify their behavior only slightly compared with the V condition.
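This entropy analysis could be sketched as follows, assuming the per-frame density maps of the earlier sketch; the use of base-2 entropy, the data layout and the variable names are assumptions.

```python
import numpy as np
from scipy.stats import ttest_rel

def entropy(density_map):
    """Shannon entropy (bits) of a normalized eye-position density map:
    higher entropy means more dispersed eye positions."""
    p = density_map[density_map > 0]
    return float(-np.sum(p * np.log2(p)))

def entropy_variation(maps):
    """Mean entropy over frames 6 to 30 after the transition minus mean entropy
    over frames -24 to 0 before it.  `maps` maps a frame index to a density map."""
    after = np.mean([entropy(maps[f]) for f in range(6, 31)])
    before = np.mean([entropy(maps[f]) for f in range(-24, 1)])
    return after - before

# One value per clip snippet and per condition, then a paired t-test across snippets:
# t, p = ttest_rel([entropy_variation(m) for m in maps_av_per_snippet],
#                  [entropy_variation(m) for m in maps_v_per_snippet])
```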
Analysis of thirteen sound classes
We analyzed the thirteen sound classes separately. We did not analyze the sound effect directly through the audio information, but through the eye positions of participants, which are also driven by visual information. In order to reduce the influence of visual information, we created a baseline for statistical comparison by performing a randomization (Edgington & Onghena, 2007): we fused the two groups of participants with AV and V conditions into one set of 36 participants. We randomly extracted 18 participants from this set to create a new group called G1; the remaining participants formed another new group, called G2. Afterwards, we calculated the KL between G1 and G2 for each frame. We repeated this procedure 5000 times, obtaining for each frame a distribution of 5000 random KL values (KLi, i=1,2,…,5000). Then, we took the mean of the 5000 KL values as the baseline (KLR). This KLR, which is influenced by the image only, is an estimate of the KL that can be expected between two random groups of participants. Finally, we calculated the difference (KLAVV - KLR), where KLAVV denotes the KL between the groups of participants with AV and V conditions. Because KLAVV is caused by the effect of both image and sound, and KLR is caused by the effect of the image only, the difference (KLAVV - KLR) is mainly caused by the effect of sound.
Figure 7 shows the difference over time between KLAVV and KLR for two classes: “speech” (human) and “impact and explosion” (non-human). If (KLAVV - KLR) is above 0, the difference between the AV and V groups is greater than that between two random groups. The behavior over time is different for the two presented sound classes.
Table 2 shows the results for frames 6 to 30 after the beginning of the second sound. The high values of (KLAVV - KLR) (and therefore low p values) for the marked classes (■): speech, singer, human noise, and singers, show that the human voice affects visual gaze significantly (p<0.05).
To verify that the effect measured above is really due to the second sound, we performed the same calculation for a period of one second (25 frames) before the beginning of the second sound. For all the sound classes, the estimated probabilities of the random mean KL values being higher than the mean KLAVV from frame -24 to 0 are above 0.1, suggesting that before the second sound, the eye positions of the groups with AV and V conditions are not significantly different for any sound class.
The results above were confirmed by two other metrics: cc and md.
Analysis of distance between sound source and eye position
In the previous section, we showed that the Kullback-Leibler divergence between the eye positions of participants with AV and V conditions is greater for the speech, singer, human noise and singers classes than for the others. In this section, we want to verify the assumption that participants with AV condition moved their eyes to the sound source after the beginning of the second sound. We only analyzed the “on-screen with one sound source” cluster of sound classes. We first located the approximate coordinates of the center of the sound source manually. Then, we calculated the Euclidean distance between the eye position of each participant with AV condition and the sound source. The mean of these Euclidean distances gives the value DAVS, which is affected by both image and sound information. Similarly, in order to reduce the influence of visual information, we created a baseline for statistical comparison by performing a randomization (Edgington & Onghena, 2007). We computed the mean Euclidean distance between the eye positions of the participants of G1 (18 participants randomly selected from the set of all the participants in the groups with AV and V conditions) and the sound source (Di, i=1,2,…,5000). We took the mean of the 5000 distance values as the baseline (DR), which is affected only by image information. Afterwards, for each frame, we calculated (DAVS - DR) for all the classes with one sound source. This difference reflects the influence of the sound information.
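The distance measure could be computed as sketched below; the variable names, the reuse of the 5000 random splits, and the pixel units are assumptions for illustration.

```python
import numpy as np

def mean_distance_to_source(eye_xy, source_xy):
    """Mean Euclidean distance (pixels) between one group's eye positions and
    the manually annotated centre of the sound source, for a single frame."""
    return float(np.mean(np.linalg.norm(np.asarray(eye_xy, dtype=float)
                                        - np.asarray(source_xy, dtype=float), axis=1)))

# Per frame (hypothetical names):
# d_avs = mean_distance_to_source(eye_av, source_xy)          # AV group: image + sound
# d_i   = [mean_distance_to_source(g1, source_xy) for g1 in random_groups]  # 5000 trials
# d_r   = np.mean(d_i)                                         # image-only baseline
# sound_effect = d_avs - d_r
```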
Figure 8 shows the difference over time between DAVS and DR for the “speech” and “impact and explosion” classes. When the values are negative, the group with AV condition is closer to the sound source than the random group. Again, different sound classes behave differently.
To find out which classes give the higher difference between DAVS and DR and to quantify the sound effect, we investigated the same duration of one second (25 frames) as in the previous analysis, from frame 6 to 30 after the beginning of the second sound. We compared the mean of DAVS over the 25 frames to the distribution of the means of Di (i=1,2,…,5000), where each mean of Di is computed between G1 and the sound source over the same 25 frames for random trial i. To estimate the probability of a random mean of Di being smaller than the mean of DAVS, we calculated p=n/5000, where n is the number of random means of Di that are smaller than the mean of DAVS.
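A sketch of this probability estimation, assuming the per-frame distances for the AV group and for each random trial have been collected as arrays; the names and array shapes are illustrative.

```python
import numpy as np

def permutation_p(d_avs_frames, d_r_frames_per_trial):
    """Estimate p = n / 5000 for one sound class.

    d_avs_frames:         per-frame DAVS values for frames 6 to 30 (25 values).
    d_r_frames_per_trial: array of shape (5000, 25) with, for each random trial,
                          the per-frame distance between G1 and the sound source.
    """
    mean_avs = np.mean(d_avs_frames)                 # temporal mean for the AV group
    mean_r = np.mean(d_r_frames_per_trial, axis=1)   # one temporal mean per random trial
    n = np.sum(mean_r < mean_avs)                    # random means smaller than the AV mean
    return n / len(mean_r)

# A small p means that almost no random group gets as close to the sound source
# as the group with AV condition does.
```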
Figure 8.
Average difference (DAVS - DR) over time for “speech” and “impact and explosion” classes. Dark regions represent (DAVS - DR) below 0, suggesting that the group with AV condition is closer to the sound source than the random group.
In Table 3, the mean of DAVS is smaller than the random means of Di (p<0.05), from frame 6 to 30 after the beginning of the second sound, for the speech, singer, and human noise classes (marked with ■), suggesting that participants tend to move their eyes to the sound source only when they hear a human voice.
Analysis of musical instrument subclass
Compared to the human voice classes, which have been widely discussed in recent decades, the music class has been explored less. To better understand the influence of audio-visual interaction, we propose a deeper investigation of eye movement behavior for the music class. In our music class database, four snippets show humans playing musical instruments. They represent the musical instrument subclass. In this subclass, there is more than one face in the scene, but only one person is playing an instrument (piano or guitar) when the corresponding music begins.
In the musical instrument subclass, what is more attractive to the participants? There is evidence that faces in a scene are preferred by the visual system over other object categories (Rossion et al., 2000; Langton, Law, Burton, & Schweinberger, 2008), and can be processed at the earliest stage after stimulus presentation (Ro, Russell, & Lavie, 2001). From our observation, we assume that a particular face, the Face of the player, attracts more attention than the other faces. From the previous calculations, we know that the sound source in the scene was attractive for participants with AV condition in the human voice sound classes. In the musical instrument subclass, do participants have a preference for the sound source, that is, the Musical instrument?
To measure which region (Musical instrument or Face of the player) is more attractive to the participants, we calculated the Euclidean distance between the eye positions of participants with AV condition and the Musical instrument (DAVM), and, respectively, the Euclidean distance between the eye positions of participants with AV condition and the Face of the player (DAVF). Again, we introduce a baseline DRM, the mean Euclidean distance between the random group G1 and the Musical instrument over the 5000 randomizations, and, respectively, a baseline DRF, the mean Euclidean distance between the random group G1 and the Face of the player.
Figure 9 illustrates the distances from the group with AV condition to the Musical instrument (a) and to the Face of the player (b) over time. Here, the dark regions below zero represent frames where the group with AV condition is closer to the Face of the player or to the Musical instrument than the random group. The Face of the player is reached more frequently after the beginning of the music until around frame 14. After that, both the Face of the player and the Musical instrument are reached about equally.
To quantify the measurement, we further investigated a period of one second (25 frames), from frame 6 to 30 after the beginning of the second sound. The probability of a random mean distance to the Musical instrument (the mean Euclidean distance between G1 and the Musical instrument) being smaller than the mean of DAVM is p=0.164. The probability of a random mean distance to the Face of the player (the mean Euclidean distance between G1 and the Face of the player) being smaller than the mean of DAVF is p=0.042. The results indicate that during this period of one second, participants move their eyes to the Face of the player rather than to the Musical instrument.
Figure 9.
Average difference of distances over time, (a) to the Musical instrument (DAVM - DRM) and (b) to the Face of the player (DAVF - DRF), for the 4 clip snippets of the musical instrument subclass.
Fixation duration analysis
We also investigated the effect of sound on the distribution of fixation duration for the whole database; it is common to study such parameters (Tatler, Hayhoe, Land, & Ballard, 2011). For each participant, we calculated the mean fixation duration for each clip and compared the conditions with a paired t-test. Per clip, the AV condition has a shorter average fixation duration (6.17 frames, 247 ms) than the V condition (6.82 frames, 273 ms), and the difference is significant (t(9)=2.479, p=0.035). Per participant, the AV condition again has a shorter average fixation duration (6.19 frames, 248 ms) than the V condition (6.75 frames, 270 ms), and the difference is also significant (t(35)=2.697, p=0.011). This means that participants with AV condition tend to move their eyes more frequently than participants with V condition. This result was confirmed with a more recent method, a mixed effects model (Baayen, Davidson, & Bates, 2008).
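The per-clip comparison could be sketched as follows, assuming mean fixation durations have been gathered in a nested structure; the data layout and names are hypothetical, and the per-participant test and the mixed effects model would follow the same pattern.

```python
import numpy as np
from scipy.stats import ttest_rel

# fix_dur[condition][clip][participant]: mean fixation duration (frames) per
# participant, per clip, per condition -- a hypothetical structure for illustration.

def per_clip_means(fix_dur, condition, clips):
    """One value per clip: the average over the participants of that condition."""
    return [np.mean(list(fix_dur[condition][c].values())) for c in clips]

# clips = range(10)
# t, p = ttest_rel(per_clip_means(fix_dur, 'AV', clips),
#                  per_clip_means(fix_dur, 'V', clips))   # per-clip comparison, df = 9
```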
Discussion
This study demonstrates that not only human speech but also singer(s) and human noise have a stronger effect on human gaze when looking freely at videos.
The Kullback-Leibler divergence (KL) between the groups with AV and V conditions is lower for the “off-screen sound source” cluster than for the two “on-screen sound source” clusters. This result indicates that a change in auditory information affects human gaze when the information is linked to a visual event in the video (Hidaka, Teramoto, Gyoba, & Suzuki, 2010; Gordon & Hibberts, 2011). The reason is perhaps that synchronized audio-visual events capture attention more than unpaired audio-visual stimuli (Van der Burg, Brederoo, Nieuwenstein, Theeuwes, & Olivers, 2010). The entropy variation between before and after the sound transition in the AV condition (compared to the V condition) shows that the eye positions of participants tend to be more dispersed after the transition when the sound source(s) is on-screen.
By calculating the difference between the temporal mean of KLAVV (the KL between the two groups of participants) and the randomization distribution of the mean KL, we conclude that the difference between participants with AV and V conditions is greater for the four human voice classes (speech, singer, human noise, and singers). To explain this difference, we assume that the participants with AV condition move their eyes to the sound source after the beginning of the second sound. The finding that the mean of DAVS (the distance between participants with AV condition and the sound source) is smaller than DR (the randomization baseline) implies that, after the auditory stimulus, participants searched for the sound source associated with the auditory information in the scene. This behavior is obvious when the auditory stimulus is a human voice. Such behavior has also been observed by other researchers, but only for the speech class. Kim and colleagues (Kim, Davis, & Krins, 2004; Tuomainen et al., 2005) provided evidence that acoustic and visual speech is strongly integrated only when the perceiver interprets the acoustic stimuli as speech. More recent observations of the mechanisms of speech stimuli and visual interaction demonstrated that lip-read information was more strongly paired with speech information than with non-speech information (Vroomen & Stekelenburg, 2011).
We also observed the reaction time of participants. In Figure 7 (a), the (KLAVV - KLR) value between participants with AV and V conditions for the “speech” class increases around frame 7. However, in Figure 8 (a), the eye position of participants with AV condition seems to reach the sound source after frame 14. Thus, it takes about 7 frames on average (280 ms) for a participant to move their eyes to the sound source after hearing the second sound.
Faces in the scene influence gaze not only in the human voice sound classes, but also in the musical instrument subclass. In this subclass, the distance between the eye positions of participants with AV condition and the Face of the player is smaller than the distance between the eye positions of participants with AV condition and the Musical instrument, even though the visual event linked to the acoustic stimulus is the instrument, not the face. The result shows that after the participants hear the music, they first tend to move their eyes to the Face of the player; after a while, both the face and the Musical instrument are reached. One possible explanation for this behavior is that participants respond faster to social stimuli (like faces) than to non-social stimuli (like houses) (Escoffier, Sheng, & Schirmer, 2010). This special attraction to the Face of the player among the other faces appears only when the music (from a musical instrument) can be heard simultaneously.
The comparison of fixation duration between the groups of participants with AV and V conditions was carried out for the whole database. We observed that the group with AV condition had a shorter fixation duration than the group with V condition. This may be caused by the fact that responses of participants to bimodal audio-visual stimuli are significantly faster than to unimodal visual stimuli (Sinnett et al., 2008). Recent research by Zou, Müller, and Shi (2012) also confirms that synchronous audio-visual stimuli facilitate visual search performance and lead to shorter reaction times than visual stimuli alone.
In conclusion, our results provide evidence of the influence of sound on gaze when looking at videos. This sound effect differs depending on the type of sound, and it can be measured only when the sound is a human voice. More precisely, the human voice drives participants to move their eyes towards the sound source. In future work, by simulating this eye movement behavior influenced by sound, it would be interesting to add auditory influence to a traditional computational visual saliency model (such as Itti, Koch, & Niebur, 1998) to create an audio-visual saliency model. This could help to increase prediction accuracy when the model is applied to videos with an original soundtrack.