Introduction
Operating under normal everyday conditions, the human brain continuously deals with often highly complex streams of input from different modalities. Attentional mechanisms allow the selection of a subset of this information for further processing. Overt attentional mechanisms, which involve the directed movement of sensory organs, play a particularly important role in human vision. Saccadic eye movements normally occur several times a second, and allow a sequential sampling of visual space by bringing different portions of the scene onto the fovea for maximum processing by the high density of photoreceptors found there. Eye movements are tightly linked to attention (
Hoffman & Subramaniam, 1995; Awh, Armstrong & Moore, 2006) and measuring where people look when presented with an image is a useful means of characterising what makes a visual stimulus pre-attentively interesting, or salient (Parkhurst, Law & Niebur, 2002). One of the most influential models of overt visual attention is currently the saliency map model (
Koch & Ullman, 1985;
Itti & Koch, 2001). Saliency map models decompose a stimulus into some constituent features (luminance contrast or red-green colour contrast, for example), calculate the centre-surround differences at different scales within each feature space, and then linearly combine these topographically aligned feature values to yield the salience measure for each region of the stimulus. The most salient points selected by the model can be verified by comparison with human fixation behaviour in response to the same stimuli. Unimodal visual (
Itti and Koch, 2001) and auditory (
Kayser et al., 2005) models have been implemented, and their performance has been shown to correlate with human behaviour (
Parkhurst et al., 2002;
Kayser et al., 2005). As such, studying what makes a stimulus spatially and temporally salient under natural conditions can broaden our understanding of attentional mechanisms.
Although the integration of features within single modalities and the role of this integration in capturing attention have been actively studied, research into similar crossmodal integration processes is scarce. We know from psychophysical studies that spatially and temporally overlapping multimodal stimuli elicit faster and more accurate responses in target detection tasks compared to unimodal stimuli (e.g.,
Frens et al., 1995;
Corneil et al., 2002). These benefits are attributed to an integration of the individual modalities, and not just statistical facilitation due to parallel processing of the unimodal information, as the measured crossmodal responses are on average found to be faster or more accurate than the best measured unimodal response. The corresponding neural mechanisms have also been investigated, in particular the multisensory integrative neurons of the superior colliculus (
Meredith and Stein, 1986;
Bell et al., 2005), an area which plays an important role in orienting behaviours such as eye and head movements (
Sparks, 1986). The response of collicular cells depends on the intensity of the stimulus: while many cells’ multisensory responses can be modelled as a summation of the responses to the individual unisensory cues, a superadditive response to multisensory stimulation with near-threshold stimuli is also a well-known phenomenon (
Stanford et al., 2005). Furthermore, reconsideration of the way in which early sensory processing areas are organised (
Wallace et al., 2004) and recent work clearly showing multisensory interactions in early sensory areas (e.g.,
Kayser et al., 2007) have introduced new questions about when and where crossmodal integration might take place (
Foxe & Schroeder, 2005), and how we should go about measuring and characterising it (
Kayser & Logothetis, 2007). Finally, a great deal of the research to date uses very simple and controllable stimuli, and although it is undoubtedly of benefit to use parametric stimuli and a rigorously defined task, the reliance on simple artificial stimuli and reaction time or detection tasks has been criticised (
De Gelder & Bertelson, 2003). In summary, we still lack an understanding of how information from different modalities is integrated for the control of attention, particularly under natural conditions.
We are interested in using subjects’ visual behaviour to infer something about the underlying integrative process: How is the information from different modalities combined in order to guide overt attention? We propose three hypotheses for this process, illustrated in
Figure 1. The end of this process will always involve the selection of a visual location to be attended, via a winner-takes-all (WTA) operation. Each hypothesis can be characterised by the shape of the interaction surface created when the unimodal and multimodal saliencies, namely the probabilities that a given region of space will be fixated under the corresponding stimulation conditions, are plotted against each other (right column of
Figure 1). The first hypothesis we consider (Maximum) involves a direct competition between modalities, and the maximally salient region across the two modalities is chosen for fixation. This is reflected in a sublinear integration surface. There is no integration at work here; it is a pure competition between unimodal saliencies. Such a competition or race model is an important benchmark in cross-modal research, and in reaction time experiments a violation of the model is taken as an indication of an integrative process (
Miller, 1982). We will use the Maximum hypothesis analogously here. The second hypothesis (Linear Integration) is of a linear integration of the unimodal saliencies and can be modelled as a weighted summation, with a corresponding planar surface used to model the multimodal saliency as a function of unimodal saliencies. As mentioned above, multimodal responses measured from single neurons in superior colliculus have been found to approximate the sum of the constituent unimodal responses, and a statistically optimal linear integration has also been found in human estimates of object properties provided by multimodal cues (
Ernst & Banks, 2002). The final proposed scenario (Supra-linear Integration) involves an expansive non-linearity. Such a multiplicative interaction has been reported at the neural level in the superior colliculus in response to threshold artificial stimuli, as mentioned earlier. Alternatively, this pattern of integration could be due to an earlier facilitatory or modulatory effect of the signals themselves, possibly in early sensory cortices before estimates of unimodal saliencies are complete. At the behavioural level, the latency and accuracy of saccades to auditory white noise and LEDs presented at various signal-to-noise ratios have been found to be best explained by a multiplicative audio-visual integration (
Corneil et al., 2002).
Figure 1.
Three hypotheses for audiovisual interaction. The processing leading up to the winner-takes-all selection of the next location to be fixated is shown for each hypothetical integration. The rightmost column (grey background) depicts the corresponding interaction plot for each scheme – visual saliency is plotted on the x-axis, auditory saliency on the y-axis, with colour value (increasing from light to dark) indicating the magnitude of audiovisual saliency for a stimulus region with given visual and auditory saliencies. See text for more detail.
Here we report the results of an eye-tracking study carried out using natural visual and auditory stimuli in order to determine the nature of the audiovisual interaction involved in the bottom-up control of human eye movements. By measuring the eye movements of 32 subjects presented with unimodal visual (V), auditory (A) and combined audiovisual (AV) stimuli, we can empirically determine saliency maps for the stimuli presented in each condition, and use these to directly assess the crossmodal interaction. Previous work by our group (
Onat et al., 2007) used this approach with natural stimuli: images of forest scenes and bird-songs presented from the left or right of the computer screen. The present experiment expands the categories of visual stimuli used, seeks to avoid any semantic relation between images and sounds, and increases the number of sound sources and their spatial dimensionality by using a simulated auditory space.
Methods
Subjects and Recording
Participants were recruited by advertisement at the University of Osnabrück, and were screened for hearing deficits before taking part in the experiment. Eye movement data from 32 subjects (13 male; aged 19-36, median age 23) were recorded. All had normal or corrected-to-normal vision, normal hearing, and were naive concerning the exact purpose of the experiment. Informed written consent was given by each participant, and the experiment was conducted in accordance with the guidelines of the Declaration of Helsinki. Subjects received payment for participation (€5) or credit towards a university course requirement.
All experiments took place in a dimly-lit room dedicated to eye tracking. The EyeLink II head-mounted camera-based eye tracker (SR Research, Ontario, Canada) was used to record binocular eye movements in pupil-only mode, at a sampling rate of 500 Hz and a nominal spatial resolution of 0.01°. Default velocity, acceleration, motion and duration thresholds (30°/s, 8000°/s², 0.1°, 4 ms) were used to define saccades.
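The saccade parsing itself was performed by the tracker software; purely for illustration, the following simplified sketch shows how velocity and acceleration thresholds of the quoted magnitude could be applied to a gaze trace sampled at 500 Hz. It is not the EyeLink parsing algorithm, and it ignores the motion and duration criteria.

```python
import numpy as np

def detect_saccade_samples(x_deg, y_deg, fs=500.0,
                           vel_thresh=30.0, acc_thresh=8000.0):
    """Flag samples exceeding velocity (deg/s) or acceleration (deg/s^2)
    thresholds; a simplified stand-in for the tracker's saccade parser."""
    dt = 1.0 / fs
    vx = np.gradient(x_deg, dt)          # horizontal velocity
    vy = np.gradient(y_deg, dt)          # vertical velocity
    speed = np.hypot(vx, vy)             # combined gaze speed
    accel = np.abs(np.gradient(speed, dt))
    return (speed > vel_thresh) | (accel > acc_thresh)
```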
A chin rest was offered to subjects to assist them in remaining as still as possible during the experiment, and was used by all participants. Before the experiment began, the system was calibrated using an array of 13 fixation spots presented serially in a random order, until an average absolute global error of less than 0.5° was reached. Subjects were also required to fixate a central fixation point before each stimulus presentation, which allowed for online correction of any small drift errors during the experiment. The background colour of the calibration and drift correction screens was set to the mean luminance of the images used in the experiment.
The experiment was controlled by a Python program running on a high-speed computer (Apple Mac Pro: Apple, CA, USA). Visual stimuli were shown on a 30” display (Apple Cinema HD display: Apple, CA, USA) at a resolution of 2560 × 1600 pixels. The images did not fully extend to the horizontal edges of the display, and these areas were covered with an aperture made from thick black card. Subjects were seated 60 cm from the display with the centre of the screen approximately at eye level, meaning that the images subtended approximately 44° × 36° of visual field. Auditory stimuli were presented through in-ear binaural earphones (ER-4B: Etymotic Research, IL, USA) via an external audio interface (Transit High-Resolution Mobile Audio Interface: M-Audio, CA, USA) at a comfortable listening volume chosen by the subject before the experiment using a white-noise test stimulus.
Stimuli
Images were of natural indoor or outdoor scenes captured using a high-quality digital camera (DSC-V1 Cyber-shot: Sony, Tokyo, Japan). 24 greyscale bitmap images (1944 × 1600 pixels, 12 indoor and 12 outdoor scenes) were used in the experiment (see
Figure 4 for sample stimuli) and were chosen with the aim of avoiding central presentation of any distinctive objects, while creating scenes that were interesting along their full spatial extent.
Each four-second auditory stimulus contained a single sound sustained over the complete duration, and the sounds were deliberately chosen so that the underlying auditory scene could not be readily identified. Eighteen basis sounds were taken from a collection recorded for the experiment at the University of Sheffield; these were created by manipulating various household and office objects in a repetitive and continuous way, for example dragging a hairbrush over a cork surface, or slowly unfolding a piece of aluminium foil, and were selected for their steady-state spectrograms, as verified by a human listener. The sounds are described in more detail in the Appendix. A later experiment with three naive listeners, who did not take part in the eye-tracking experiment, asked participants to first describe each sound in their own words, and then to judge whether pairs of sounds were the same or different (18 pairs, of which 12 were randomly paired mismatched sounds and 6 were matched pairs, with matching balanced over subjects). The first part of this test confirmed that the sounds could not be easily identified, with the three subjects adequately describing the manipulation and the objects for only 8, 8, and 4 of the 18 sounds, respectively. The range of responses given was very broad, including “crackling fire” and “footsteps in undergrowth”, suggesting that the sounds were suitably ambiguous and could be used with all visual stimuli without introducing a specific semantic bias. However, listeners could still differentiate between the sounds, with all three participants correctly classifying all pairs as same or different. In contrast to white noise, the sounds were perceived as distinct and can be considered natural.
The recordings of auditory stimuli were made in an acoustically isolated booth (IAC 400-A: IAC, Staines, UK) using a single microphone (B&K type 4190: Bruel & Kjaer, Denmark). The recorded signal was pre-amplified (Nexus 2690 microphone conditioning amplifier: Bruel & Kjaer, Denmark) and then digitised at 25 kHz (RP2.1 real-time processor: Tucker-Davis Technologies, FL, USA). The signals were later re-sampled to 44.1 kHz using polyphase filtering, as implemented in Matlab’s inbuilt resample function. All further pre-processing and analysis was carried out in Matlab (The MathWorks, MA, USA).
The chosen sounds were normalised to approximately equal loudness using the A-weighting curve (
IEC, 1979). The next step was to process the sounds so that they could be listened to through headphones and perceived as arising from a set of locations in 3-D space (for a more detailed account of the following procedure and alternatives, see
Blauert, 1997). This “spatialisation” of the sounds was achieved using RoomSim (
Campbell et al., 2005), an open source acoustic room simulation program. Given the dimensions of a rectangular room, the sound absorption properties of the room’s surfaces, and the locations of sensor and sound source, RoomSim computes an impulse response (IR) which models the source-to-sensor acoustic transformation for that particular environment. The program can also be used to create binaural room IRs by additionally incorporating a head-related transfer function (HRTF), which models the direction-specific spectral response characteristics of the two ears of a human listener. To spatialise a mono sound signal using the computed IRs, the signal is simply convolved with each of the two binaural IRs (one per ear), and the two resulting signals are then used as the left and right channels of a single stereo sound that can be played through earphones.
Four different artificial sound sources were used, each with 20° horizontal (azimuth) and 22.5° vertical (elevation) offset relative to the listener, roughly corresponding to the four corners of the computer screen. We used the dimensions of the room in which the experiment was conducted, and chose surface absorption properties that produced realistic-sounding reverberation. A generic HRTF from the CIPIC collection (
Algazi et al., 2001) was used, and a binaural room IR was created for each of the four sound sources. Each of the 18 sounds was convolved with each IR, yielding 72 spatialised auditory stimuli. As it was not known whether participants would be able to discern the difference between these artificial sources, each subject’s spatialisation ability was determined in a post-experimental test as described below.
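As a rough illustration of this pipeline, the Python sketch below resamples a mono recording and convolves it with a pair of binaural room IRs to give a two-channel stimulus; scipy routines stand in for the Matlab and RoomSim steps described above, and loading of the IRs from the simulation output is assumed to have happened elsewhere.

```python
import numpy as np
from scipy.signal import resample_poly, fftconvolve

def spatialise(mono_25k, ir_left, ir_right):
    """Convolve a mono recording with left/right binaural room IRs to give a
    two-channel (stereo) signal, after polyphase resampling to 44.1 kHz."""
    # 25 kHz -> 44.1 kHz: 25000 * 441 / 250 = 44100.
    mono = resample_poly(mono_25k, 441, 250)
    left = fftconvolve(mono, ir_left)
    right = fftconvolve(mono, ir_right)
    n = max(len(left), len(right))
    stereo = np.zeros((n, 2))
    stereo[:len(left), 0] = left
    stereo[:len(right), 1] = right
    # Scale to avoid clipping when the channel pair is written to an audio file.
    return stereo / np.max(np.abs(stereo))

# 18 basis sounds x 4 binaural room IRs (one per simulated source) = 72 stimuli.
```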
Figure 2.
Procedure for the eye-tracking experiment. Each of the 96 trials lasted for four seconds and consisted of either an image (V), a sound with white noise image displayed (A1-4), or simultaneously presented image and sound (AV1-4). Sounds were spatialised to one of four locations, corresponding approximately to the corners of the screen (1: top-left, 2: bottom-left, 3: top-right, 4: bottom-right). Drift correction was performed before presentation of each stimulus. See text for further details.
Experimental Procedure
An illustration of the eye-tracking experiment is shown in
Figure 2. The experiment consisted of unimodal (auditory: A, and visual: V) and multimodal (AV) conditions. Each auditory stimulus had one of four locations, which amounted to four different auditory and audiovisual conditions (A1-4, AV1-4). Participants’ eye movements were measured in 96 four-second trials, of which 24 were visual, 24 (6 × 4) were auditory, and 48 (12 × 4) were audiovisual. Each image was seen three times by each subject (1 × V, 2 × AV), and each of the 18 sounds was heard once at each of the four possible sound locations. Static white noise images were displayed during auditory trials in order to limit subjects’ fixations to the same spatial range as V and AV trials. The order of stimulus presentation and the assignment of auditory stimuli to A and AV trials were randomly generated for each subject, with the single constraint (to aid balancing within subjects) that repeated appearances of images in AV trials were accompanied by sounds from sources with common azimuth or elevation. This was a free-viewing experiment, with participants told only to “study the images and listen to the sounds carefully”.
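A sketch of how such a trial list could be assembled is given below; it follows the trial counts and the azimuth/elevation constraint described above, but the randomisation details are illustrative, and for brevity it does not additionally balance each AV sound over all four source locations as the actual design did.

```python
import random

SOURCES = {1: "top-left", 2: "bottom-left", 3: "top-right", 4: "bottom-right"}
# Pairs of sources sharing an azimuth (left/right) or an elevation (top/bottom).
VALID_AV_PAIRS = [(1, 2), (3, 4), (1, 3), (2, 4)]

def build_trials(images, sounds, rng=random.Random(0)):
    """Assemble 96 trials from 24 images and 18 sounds: 24 V, 24 A
    (6 sounds x 4 locations), and 48 AV (each image twice, with the two
    sources sharing an azimuth or an elevation)."""
    trials = [("V", img, None, None) for img in images]            # 24 visual
    a_sounds = rng.sample(sounds, 6)
    trials += [("A", None, snd, loc) for snd in a_sounds
               for loc in SOURCES]                                  # 24 auditory
    av_sounds = [s for s in sounds if s not in a_sounds] * 4        # 12 x 4 = 48
    rng.shuffle(av_sounds)
    for img in images:                                              # 48 audiovisual
        loc_a, loc_b = rng.choice(VALID_AV_PAIRS)
        trials += [("AV", img, av_sounds.pop(), loc_a),
                   ("AV", img, av_sounds.pop(), loc_b)]
    rng.shuffle(trials)
    return trials
```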
After the eye-tracking experiment, participants were briefly interviewed to determine whether they had realised that the sounds differed in spatial location, and were informed of the details of the spatialisation. In order to determine whether they had accurately perceived the spatial location of the auditory stimuli, participants then completed a four-alternative forced choice (4AFC) task, in which they listened again to each sound heard in the experiment and indicated their estimation of its location (upper-left, lower-left, upper-right, lower-right) by button press on marked keys on the numerical pad of a standard keyboard. The sounds were presented in a new random order for each subject, with the constraint that two different spatialisations of the same sound did not directly follow each other. Finally, subjects were debriefed about the purpose of the experiment and any questions they had were answered. The entire experiment lasted less than an hour, and was conducted in either English or German, depending on the participant’s language preference.
Figure 3.
Data from two representative images. Two of the images used in the experiment are shown (original resolution was 1944 x 1600 pixels), along with the image pdfs constructed from eye movements of all subjects presented with this image in V and AV conditions. Probability of fixation is denoted by increasing colour; see colour bars provided with each image for corresponding scale (for visualisation purposes, V and AV pdfs are equalised to the same range). Centre of gravity of each pdf is indicated by a cross.
Data Analysis
The recorded fixation data were used to create probability density functions (pdfs), which quantify the probability that a region of space is fixated by a subject, as follows. For each condition, subject, and stimulus, the pixel location of each recorded fixation was marked on a raw fixation map. To smooth the data and to compensate for the finite precision of the eye tracker, these raw maps were convolved with a Gaussian of unit integral (full width at half maximum: 1.5° visual angle). First fixations were not included due to the bias introduced by the central fixation spot which preceded each trial. Image pdfs for V and each of the 4 AV conditions (corresponding to the 4 sound locations) were produced by averaging over subjects (see
Figure 3 for examples); subject pdfs were made by averaging over images (
Figure 4). In the case of the auditory conditions, pdfs were made by averaging over subjects and sounds, resulting in a total of 4 pdfs.
Pdfs were characterised by their centre of gravity, and centre of gravity shifts between V and AV conditions, used to characterise the shift in fixation density due to auditory stimulation, were calculated by simple subtraction. The horizontal influence of sounds was evaluated by comparing the distributions of horizontal shifts grouped by common azimuth, i.e. shifts measured using AV pdfs with sound sources located to the top-left and bottom-left (AV1 and AV2) vs. those arising from AV pdfs with right-side sound stimulation (AV3 and AV4). The vertical auditory effect was evaluated by comparing distributions of vertical shifts grouped by common AV elevation. Image pdfs reveal consistent biases in subjects’ fixation behaviour when presented with the corresponding stimulus, and can thus be considered as empirical saliency maps. This plays an important role in subsequent analysis.
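A minimal sketch of the pdf construction and the centre-of-gravity measure follows (in Python, although the original analysis was carried out in Matlab); the pixels-per-degree value is an approximation derived from the viewing geometry given above (roughly 1944 px / 44°).

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_pdf(fix_x, fix_y, shape, fwhm_deg=1.5, px_per_deg=44.0):
    """Build an empirical saliency map (pdf) from fixation coordinates.

    Fixations are accumulated in a raw map, smoothed with a Gaussian whose
    FWHM is 1.5 deg of visual angle, and normalised to unit sum.
    """
    raw = np.zeros(shape)
    for x, y in zip(fix_x, fix_y):
        xi = min(max(int(round(x)), 0), shape[1] - 1)   # clip to image bounds
        yi = min(max(int(round(y)), 0), shape[0] - 1)
        raw[yi, xi] += 1
    sigma = (fwhm_deg * px_per_deg) / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    pdf = gaussian_filter(raw, sigma)
    return pdf / pdf.sum()

def centre_of_gravity(pdf):
    """Return the (x, y) centre of gravity of a pdf in pixel coordinates."""
    ys, xs = np.indices(pdf.shape)
    return (xs * pdf).sum(), (ys * pdf).sum()

# Shift in fixation density due to auditory stimulation, per image:
# shift = np.subtract(centre_of_gravity(pdf_av), centre_of_gravity(pdf_v))
```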
Two approaches were taken to model the audiovisual interaction and to test the hypotheses described in the Introduction. First, a multiple regression analysis was used to fit the measured multimodal saliency (pAV) with the unimodal saliencies (the empirical saliency maps constructed from the measured gaze data: pA and pV in the equation below) according to the following equation:

pAV = α0 + α1 · pV + α2 · pA + α3 · pA×V      (1)

Least squares regression was used to solve this equation separately for each AV stimulus, yielding 24 × 4 (images × sound source locations) estimates for each coefficient (α0 to α3). The last term of the equation, pA×V, is the normalised (to unit integral) element-wise product of the unimodal saliencies pA and pV, and is assumed to approximate the outcome of a multiplicative cross-modal interaction.
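A sketch of this per-stimulus regression is given below, assuming the three pdfs are available as arrays of equal shape; numpy's ordinary least-squares routine stands in for whichever solver was originally used in Matlab.

```python
import numpy as np

def fit_av_regression(p_v, p_a, p_av):
    """Fit p_AV = a0 + a1*p_V + a2*p_A + a3*p_AxV for one AV stimulus.

    p_AxV is the element-wise product of the unimodal pdfs, normalised to
    unit integral, and all maps are flattened into observation vectors.
    """
    p_axv = p_v * p_a
    p_axv = p_axv / p_axv.sum()
    X = np.column_stack([np.ones(p_av.size),
                         p_v.ravel(), p_a.ravel(), p_axv.ravel()])
    coeffs, *_ = np.linalg.lstsq(X, p_av.ravel(), rcond=None)
    return coeffs   # [a0, a1, a2, a3]
```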
The second approach involved creating a model of the interaction from the measured data, which was then evaluated in terms of the candidate interaction scenarios detailed above. As mentioned earlier, V and AV image pdfs constitute empirical saliency maps, and each point of the saliency map for a given image provides us with the saliency of that portion of stimulus space under the respective input condition. We also have a value for the saliency of visual space measured in the auditory condition, yielding a triplet of A, V, and AV saliency values for each pixel of each image. Data-driven integration plots were made by ordering these triplets in a three-dimensional V×A×AV space, thereby disregarding their spatial location and the stimulus from which they originated. Because of the high resolution of the images involved, the individual saliency maps were downsized by a factor of 4 (to 486×400 pixels) before creating the interaction plots. Saliency maps generally tend to have few highly salient areas, meaning the higher saliency areas of the V×A×AV space are sparsely populated. For this reason, we restricted our analysis to a subset of the space, chosen in order to maximise the amount of data included in the model while still requiring adequate density of data for estimation of audiovisual saliency values (see Results for more details). Data within this range were binned; the mean of each bin was used as an estimate for pAV, and the bin centres as estimates for pA and pV. A least squares analysis – weighted by the inverse variance of each bin – was then carried out to estimate the coefficients of the following equations:

pAV / G(pAV) = β0 + β1 · pV / G(pV) + β2 · pA / G(pA)

pAV / G(pAV) = β0 + β1 · pV / G(pV) + β2 · pA / G(pA) + β3 · pA×V / (G(pV) · G(pA))

with G(•) indicating the geometric mean. The first equation models the multimodal saliency as a linear integration of unimodal saliencies, while the second includes the multiplicative term pA×V. In the multiple regression analysis of (1) above, this term, which was created by multiplying the unimodal pdfs, could be normalised to unit integral in order to bring it into a range comparable to the unimodal pdfs. Here, the saliencies no longer belong to a structured probability distribution, so the saliency data belonging to each condition were normalised by their respective geometric mean (G(pC)) in order to bring the saliency values into the same exponent range.
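The sketch below illustrates one way the binning and the inverse-variance-weighted fit of the linear model could be implemented; the number of bins, the minimum bin occupancy, and the restriction of the analysed range are placeholders, since the exact values are given in the Results. The multiplicative term could be appended as an extra column in the same way.

```python
import numpy as np

def binned_weighted_fit(p_v, p_a, p_av, n_bins=20, min_count=10):
    """Normalise each saliency by its geometric mean, bin the (V, A) plane,
    and fit the mean AV saliency per bin with inverse-variance weights."""
    gmean = lambda x: np.exp(np.mean(np.log(x[x > 0])))
    v = p_v.ravel() / gmean(p_v)
    a = p_a.ravel() / gmean(p_a)
    av = p_av.ravel() / gmean(p_av)

    v_edges = np.linspace(v.min(), v.max(), n_bins + 1)
    a_edges = np.linspace(a.min(), a.max(), n_bins + 1)
    rows = []
    for i in range(n_bins):
        for j in range(n_bins):
            in_bin = ((v >= v_edges[i]) & (v < v_edges[i + 1]) &
                      (a >= a_edges[j]) & (a < a_edges[j + 1]))
            if in_bin.sum() < min_count:        # require adequate data density
                continue
            v_c = 0.5 * (v_edges[i] + v_edges[i + 1])   # bin centres serve as
            a_c = 0.5 * (a_edges[j] + a_edges[j + 1])   # estimates of p_V, p_A
            rows.append((v_c, a_c, av[in_bin].mean(), av[in_bin].var()))

    v_c, a_c, av_mean, av_var = map(np.array, zip(*rows))
    w = np.sqrt(1.0 / np.maximum(av_var, 1e-12))        # inverse-variance weights
    X = np.column_stack([np.ones_like(v_c), v_c, a_c])
    betas, *_ = np.linalg.lstsq(X * w[:, None], av_mean * w, rcond=None)
    return betas   # [b0, b1, b2] for the linear model
```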
Figure 4.
Data from two representative subjects. Pdfs constructed from eye movements of separate subjects are shown for each experimental condition: unimodal visual (V), unimodal auditory (A1-4), and multimodal audiovisual (AV1-4). Each pdf includes fixations made by the subject when presented with all stimuli of the corresponding condition. Probability of fixation is denoted by increasing colour; see colour bars provided with each condition and subject for corresponding scale. Centre of gravity is indicated on each pdf by a cross.
Discussion
We investigated the integration of auditory and visual information during overt attention by measuring the eye movements of 32 subjects, who were asked to freely explore natural stimuli in visual, auditory and audiovisual conditions. Analysis of the recorded eye movements revealed a systematic shift of gaze towards the sound location of multimodal stimuli. Auditory information thus has an effect on the selection of candidate fixation points. Furthermore, the pattern of fixation was found to be similar for images shown in V and AV conditions, implying that visually salient regions remain salient under multimodal stimulation. Taken together, this suggests that simultaneously presented visual and auditory information integrate before selection of the next fixation point. Reconsidering the three alternative processes detailed in the Introduction, we can now reject the first (Maximum) hypothesis of competing modalities, in which the maximally salient information wins out in the selection of the next target of overt attention. Next, multiple regression analysis of the saliency maps grouped by stimulus suggested that a linear combination of unimodal saliencies could adequately explain the multimodal saliency (supporting the Linear Integration hypothesis). Finally, the interaction plot constructed from the data was found to fit extremely well with a linear model, and little was gained by including the multiplicative term corresponding to the Supra-linear Integration hypothesis. These results therefore support our second hypothesis of a linear integration of visual and auditory information in the bottom-up control of human eye movements.
The auditory stimuli had been processed to differ in spatial location using a generic non-individualised technique, and based on reports in the literature (e.g.,
Wenzel et al., 1993), it was expected that subjects would have difficulty in vertically discriminating the locations of these simulated sounds. However, the results of the four-alternative forced choice task revealed that, overall, all subjects performed above chance level. This first attempt at simulating auditory sources differing in spatial location was thus successful, and the spatialisation approach used could be further developed to cover more of the stimulus space, allowing a more systematic examination of horizontal and vertical effects of auditory stimulation. There is certainly room for improvement, for example regarding the more subtle, subjective aspects of auditory perception, such as the feeling that the sound is truly situated outside the head and not just directionally presented through headphones. One factor that can help to make sounds more realistic is reverberation, which was included in the simulation process we used to synthesise our stimuli. Fixation selection is an unconscious process, and it is not obvious whether and how any improvement in subjective auditory aspects, such as externalisation, would affect visual behaviour. In addition, there is a known trade-off between the directional accuracy of sound localisation and the authenticity of the artificial auditory experience (
Shinn-Cunningham et al., 2005), which was not explored in any great depth here. Nonetheless, the results of the 4AFC show that subjects could discriminate between the spatial locations of different sound sources. Additionally, fixation was shifted towards the sound source, showing a behavioural effect of the spatialised sounds. For the present purposes, this suffices to show that the stimuli were indeed perceived to differ in spatial location.
Some further aspects of the auditory stimuli also warrant discussion. First, the sounds were created to be reasonably stationary over time, and were additionally fixed to a single spatial location for their duration. It can be argued that natural sounds are often transient in time and space, and that they serve only to redirect attention. However, the analysis used here is dependent on the empirical saliency maps created from the recorded visual behaviour of many subjects. The construction of our integration model thus depended on having as much data as possible, which led us to choose continuous, spatially static sounds in order to prolong any spatial bias in participants’ overt behaviour and to thus sample more data. Examining the progression of this spatial bias over time, we found that the orienting effect of sound on eye movements was strongest between 250 and 2500 ms. Within this temporal interval of interest, there was no indication that processing early after stimulus onset is any different to later processing, but the exact time course of the effect was not investigated any further. With enough data, a future study could address this in detail.
Second, we deliberately aimed to avoid semantic congruency between visual and auditory stimuli. This could of course be done differently. However, introducing meaningful content to the auditory modality also introduces associated problems. On the one hand, a readily identifiable sound might provide a more complex spatial cue by virtue of the listener’s world knowledge – birds usually sing from trees, footsteps usually emanate from the ground, and so on. In order to create a realistic integration model, the auditory saliency map would have to reflect this, which would require more data. On the other hand, the semantic information contained in the sound could be used to create congruent and incongruent audiovisual stimuli. There are different cases that must be dealt with here. In congruent stimuli, the sound-object, for example an animal, might already be visible, i.e. already visually salient. If the sound is incongruent with the presented visual scene, the sound could become salient simply by virtue of its incongruence, an effect that is difficult to control. Furthermore, in cases where the sound-generating agent is not visible, the sound might elicit visual search, which cannot be compared to free-viewing conditions. It is indeed interesting to explore the role of semantic congruence, but there are pitfalls involved, and here we chose to avoid them.
As mentioned earlier, a previous study by our group used the same approach but with visual images of forest scenes and birdsong presented from loudspeakers at the left or right of the computer screen (
Onat et al., 2007). In that study there was high semantic congruence between images and sounds, and the same result was found – eye movements during audiovisual stimulation were well explained by a linear integration of visual and auditory saliency maps. There were several further differences between the two experiments. First, concerning the stimuli – the study described here increased the dimensionality of the auditory space, and deliberately avoided semantic correspondence between images and sounds. The means of presenting the stimuli also differed. Here, sounds were created using simulation software and presented through earphones, while the images spanned more of subjects’ visual field. Despite these differences, our findings confirm the results of the previous experiment and support the major finding here: that a linear integration of unimodal saliencies provides a good explanation of the process underlying crossmodal integration for the control of overt attention.
In terms of the shift in fixation density towards the sound location during multimodal stimulation, we additionally investigated the effect in the vertical direction. The horizontal effect of sounds played in audiovisual conditions was clear, while the vertical effect was much smaller and did not reach significance when considered over subjects. One reason for this may lie in the ambiguity of the perceived sound locations – the 4AFC task revealed that subjects made most errors in vertically distinguishing sound locations. Alternatively, the stronger spatial biasing by the horizontal auditory component may simply reflect an intrinsic tendency of human viewers to attend more to horizontal aspects of visual space, reflecting the spatial layout of the human world. This is certainly an aspect of this study that warrants further investigation, possibly by means of a further increase in the auditory coverage of stimulus space.
Our ultimate goal is to understand attentional processes under fully natural conditions, and this study can be considered as a step in this direction. In terms of visual stimuli, this requires a move beyond static images, and the use of video stimuli introduces a new set of image features in the form of motion features. These have been investigated in a pilot eye-tracking experiment within our group (Açık et al., in preparation), and further research is underway. In addition, naturally behaving observers act within their environment; as well as eye movements, head and body movements are an important part of overt attention. Recent work by our group and others has also begun to address this issue (e.g. Schumann et al., in press). The paradigm used here is certainly amenable to more advanced stimulation techniques, as the required uni- and multi-modal saliencies are measured empirically from the participants’ visual behaviour, which removes the need for a robust model of saliency computation from the visual, auditory, or audiovisual stimuli themselves.
Finally, the hypothesis best supported by the data is that a linear integration of unimodal saliencies is involved in the control of eye movements during free viewing of natural stimuli. Previous work in our group (
Schumann et al., 2007) has also shown that, within the visual modality, the pair-wise integration of the information provided by visual features is well-modelled by a linear integration. Taken together, these results hint that there might be a general principle of linear integration underlying the combination of information in the brain. As pointed out earlier, neurons in the superior colliculus have been found to integrate unimodal stimuli in a linear fashion, but this depends on stimulus efficacy, with stimuli close to threshold levels inducing a supra-linear integration. Working with artificial stimuli, some researchers have found the same pattern of inverse effectiveness (
Corneil et al., 2002) when auditory and visual stimuli are embedded in noisy backgrounds, but others using different artificial stimulation methods have failed to do so (
e.g. Frens et al., 1995). Here we used natural auditory and visual stimuli and found that a linear combination of unimodal saliencies well explains the integration process underlying overt visual attention. It remains to be seen whether further factors, such as a manipulation of semantic congruency, can have an effect on the integrative processing scheme found here.