Introduction
Though diverse forms of music exist across the globe, all music shares the property of evolving through time. While certain scales, modes, meters, or timbres may be more or less prevalent depending on the culture in question, the use of time to organize sound is universal. Therefore, rhythm, one of the most basic elements of music, provides an excellent scientific starting point for questioning and characterizing the neural mechanisms underlying music-induced changes in motor behavior and attentional state. To remain consistent with previous literature, here rhythm is defined as patterns of duration, timing, and stress in the amplitude envelope of an auditory signal (a physical property), whereas meter is a perceptual phenomenon that tends to include the pulse (beat or tactus) frequency perceived in a rhythmic sequence, as well as slower and faster integer-related frequencies (London, 2012).
Previous studies have shown that the presence of meter affects attention and motor behavior. For instance, perceptual sensitivity is enhanced and reaction times are decreased when targets occur in phase with an on-going metric periodicity (Bergeson & Trehub, 2006; Jones, Johnston, & Puente, 2006; Yee, Holleran, & Jones, 1994). Interestingly, this facilitation via auditory regularity is observed not only for auditory targets, including speech (Cason & Schön, 2012), but also for visual targets (Bolger, Coull, & Schön, 2014; Bolger, Trost, & Schön, 2013; Escoffier, Sheng, & Schirmer, 2010; Grahn, 2012; Grahn, Henry, & McAuley, 2011; Hove, Fairhurst, Kotz, & Keller, 2013; Miller, Carlson, & McAuley, 2013). One promising theory that accounts for these results is Dynamic Attending Theory (Jones & Boltz, 1989; Large & Snyder, 2009).
Dynamic Attending Theory
Dynamic Attending Theory (DAT) posits that the neural mechanisms of attention are susceptible to entrainment by an external stimulus, allowing for temporal predictions and therefore attention and motor coordination to specific time points (Jones & Boltz, 1989; Large & Palmer, 2002; Large & Snyder, 2009). For any given stimulus, the periodicities with the most energy will capture attention most strongly. Neurobiologically, the proposed mechanism is entrainment of the membrane potential of neurons (e.g., neurons in primary auditory cortex, in the case of auditory entrainment) to the external stimulus. These phase-locked fluctuations in membrane potential alter the probability of firing action potentials at any given point in time (see Schroeder and Lakatos (2009) for a review).
Similarly, the recent Active Sensing Hypothesis (Lakatos, Karmos, Mehta, Ulbert, & Schroeder, 2008; Schroeder & Lakatos, 2009; Schroeder, Wilson, Radman, Scharfman, & Lakatos, 2010) proposes that perception occurs actively via motor sampling routines, that neural oscillations serve to selectively enhance or suppress input, cross-modally, and that cortical entrainment is, in and of itself, a mechanism of attentional selection. Higher frequency oscillations can become nested within lower frequency ones via phase-phase coupling, phase-amplitude coupling, or amplitude-amplitude coupling, allowing for processing of different stimulus attributes in parallel (Buzsaki, 2004; Siegel, Donner, & Engel, 2012). Henry and Herrmann (2014) have connected the ideas of Active Sensing to those of DAT by highlighting the critical role of low frequency neural oscillations. In summary, DAT and Active Sensing are not incompatible, as outlined in Morillon, Hackett, Kajikawa, and Schroeder (2015).
Interestingly, studies of neural entrainment are typically separate from those investigating sensorimotor synchronization, defined as spontaneous synchronization of one’s motor effectors with an external rhythm (Repp, 2005). However, a recent study confirms that the amplitude of neural entrainment at the beat frequency explains variability in sensorimotor synchronization accuracy, as well as temporal prediction capabilities (Nozaradan, Peretz, & Keller, 2016). Although motor entrainment, typically referred to as sensorimotor synchronization, is not explicitly mentioned as a mechanism of DAT, the role of the motor system in shaping perception is discussed in many DAT papers (Grahn & Rowe, 2013; Iversen, Repp, & Patel, 2009; Large, Herrera, & Velasco, 2015; Morillon, Schroeder, & Wyart, 2014; Teki, Grube, Kumar, & Griffiths, 2011) and is a core tenet of the Active Sensing Hypothesis (Morillon et al., 2015).
In this paper, we test whether a computational model of Dynamic Attending Theory can predict attentional fluctuations in response to rhythmic patterns. We also attempt to bridge the gap between motor and cortical entrainment by investigating coupling of pupil dynamics to musical stimuli. We consider pupil dynamics to be both a motor behavior and an overt index of attention, as we discuss in more detail below.
Sensori(oculo)motor coupling
Though most sensorimotor synchronization research has focused on large-scale motor effectors, the auditory system also seems to have a tight relationship with the ocular motor system. For instance, Schaefer, Süss, and Fiebig (1981) showed that eye movements can synchronize with a moving acoustic target whether it is real or imagined, in light or in darkness. With regard to rhythm, a recent paper by Maroti, Knakker, Vidnyanszky, and Weiss (2017) suggests that the tempo of rhythmic auditory stimuli modulates both fixation durations and inter-saccade intervals: rhythms with faster tempi result in shorter fixations and inter-saccade intervals, and vice versa. These results seem to fit with those observed in audiovisual illusions, which illustrate the ability of auditory stimuli to influence visual perception and even enhance visual discrimination (Recanzone, 2002; Sekuler, Sekuler, & Lau, 1997). Such cross-modal influencing of perception also occurs when participants are asked to engage in purely imaginary situations (Berger & Ehrsson, 2013).
Though most studies have focused on eyeball movements, some (outlined below) have begun to analyze the effect of auditory stimuli on pupil dilation. Such an approach holds particular promise, as changes in pupil size reflect sub-second changes in attentional state related to locus coeruleus-mediated noradrenergic (LC-NE) functioning (Aston-Jones, Rajkowski, Kubiak, & Alexinsky, 1994; Berridge & Waterhouse, 2003; Rajkowski, Kubiak, & Aston-Jones, 1993). The LC-NE system plays a critical role in sensory processing, attentional regulation, and memory consolidation. Its activity is time-locked to theta oscillations in hippocampal CA1, and is theorized to be capable of phase-resetting forebrain gamma band fluctuations, which are similarly implicated in a broad range of cognitive processes (Sara, 2015).
In the visual domain, the pupil can dynamically follow the frequency of an attended luminance flicker and index the allocation of visual attention (Naber, Alvarez, & Nakayama, 2013), as well as the spread of attention, whether cued endogenously or exogenously (Daniels, Nichols, Seifert, & Hock, 2012). However, such a pupillary entrainment effect has never been studied in the auditory domain. Theoretically, though, pupil dilation should be susceptible to auditory entrainment, like other autonomic responses, such as respiration, heart rate, and blood pressure, which can become entrained to slow periodicities present in music (see Trost, Labbe, and Grandjean (2017) for a recent review on autonomic entrainment).
In the context of audition, the pupil seems to be a reliable index of neuronal auditory cortex activity and behavioral sensory sensitivity. McGinley, David, and McCormick (2015) simultaneously recorded neurons in auditory cortex, medial geniculate (MG), and hippocampal CA1 in conjunction with pupil size, while mice detected auditory targets embedded in noise. They found that pupil diameter was tightly related to both ripple activity in CA1 (in a 180 degree antiphase relationship) and neuronal membrane fluctuations in auditory cortex. Slow rhythmic activity and high membrane potential variability were observed in conjunction with constricted pupils, while high frequency activity and high membrane potential variability were observed with highly dilated pupils. At intermediate levels of pupil dilation, the membrane was hyperpolarized and the variance in its potential was decreased. The same inverted U relationship was observed for MG neurons as well. Crucially, in the behavioral task, McGinley et al. (2015) found that the decrease in membrane potential variance at intermediate pre-stimulus pupil sizes predicted the best performance on the task. Variability of membrane potential was smallest on detected trials (intermediate pupil size), largest on false alarm trials (large pupil size), and intermediate on miss trials (small pupil size). Though this study was performed on mice, it provides compelling neurophysiological evidence for using pupil size as an index of auditory processing.
The same inverted U relationship between pupil size and task performance has been observed in humans during a standard auditory oddball task. For instance, Murphy, Robertson, Balsters, and O’Connell (2011) showed that baseline pupil diameter predicts both reaction time and P300 amplitude in an inverted U fashion on an individual trial basis. Additionally, Murphy et al. (2011) found that baseline pupil diameter is negatively correlated with the phasic pupillary response elicited by deviants. Because of the well-established relationship between tonic neuronal activity in locus coeruleus and pupil diameter, it is theorized that both the P300 amplitude and pupil diameter index locus coeruleus-norepinephrine activity (Aston-Jones et al., 1994; Joshi, Li, Kalwani, & Gold, 2016; Murphy, O’Connell, O’Sullivan, Robertson, & Balsters, 2014; Murphy et al., 2011; Rajkowski et al., 1993).
Moving towards musical stimuli, pupil size has been found to be larger for: more arousing stimuli (Gingras, Marin, Puig-Waldmuller, & Fitch, 2015; Weiss, Trehub, Schellenberg, & Habashi, 2016), well-liked stimuli (Lange, Zweck, & Sinn, 2017), more familiar stimuli (Weiss et al., 2016), psychologically and physically salient auditory targets (Beatty, 1982; Hong, Walz, & Sajda, 2014; Liao, Yoneya, Kidani, Kashino, & Furukawa, 2016), more perceptually stable auditory stimuli (Einhauser, Stout, Koch, & Carter, 2008), and chill-evoking musical passages (Laeng, Eidet, Sulutvedt, & Panksepp, 2016). A particularly relevant paper by Damsma and van Rijn (2017) showed that the pupil responds to unattended omissions in on-going rhythmic patterns when the omissions coincide with strong metrical beats but not weak ones, suggesting that the pupil is sensitive to internally generated hierarchical models of musical meter. While Damsma and van Rijn’s analysis of difference waves was informative, the continuous time series of pupil size may provide additional dynamic insights into complex auditory processing.
Of particular note, Kang and Wheatley (2015) demonstrated a relationship between attention and the time course of the pupillary signal while listening to music. To do this, they had participants listen to 30-sec clips of classical music while being eye-tracked. In the first phase of the experiment, participants listened to each clip individually (diotic presentation); in the second phase participants were presented with two different clips at once (dichotic presentation) and instructed to attend to one or the other. Kang & Wheatley compared the pupil signal during dichotic presentation to the pupil signal during diotic presentation of the attended vs. ignored clip. Using dynamic time warping to determine the similarity between the pupillary signals of interest, they showed that in the dichotic condition, the pupil signal was more similar to the pupil signal recorded during diotic presentation of the attended clip than to that recorded during diotic presentation of the unattended clip (Kang & Wheatley, 2015). Such a finding implies that the pupil time series is a time-locked, continuous dependent measure that can reveal fine-grained information about an attended auditory stimulus. However, it remains to be determined whether it is possible for the pupil to become entrained to rhythmic auditory stimuli and whether such oscillations would reflect attentional processes or merely passive entrainment.
Predicting dynamic auditory attention
Because the metric structure perceived by listeners is not readily derivable from the acoustic signal, a variety of algorithms have been developed to predict at what period listeners will perceive the beat. For example, most music software applications use algorithms to display tempo to users, and a variety of contests exist in the music information retrieval community for developing the most accurate estimation of perceived tempo, as well as individual beats, e.g. the Music Information Retrieval Evaluation eXchange (MIREX) Audio Beat Tracking task (Davies, Degara, & Plumbley, 2009). The beat period, however, is just one aspect of the musical meter. More sophisticated algorithms and models have been developed to predict all prominent metric periodicities in a stimulus, as well as the way in which attention might fluctuate as a function of the temporal structure of an audio stimulus, as predicted by Dynamic Attending Theory.
For instance, the Beyond-the-Beat (BTB) model (Tomic & Janata, 2008) parses audio in a way analogous to the auditory nerve and uses a bank of 99 damped linear oscillators (reson filters) tuned to frequencies between 0.25 and 10 Hz to model the periodicities present in a stimulus. Several studies have shown that temporal regularities present in behavioral movement data (tapping and motion capture) collected from participants listening to musical stimuli correspond to the modeled BTB periodicity predictions for those same stimuli (Hurley, Martens, & Janata, 2014; Janata, Tomic, & Haberman, 2012; Tomic & Janata, 2008). Recent work (Hurley, Fink, & Janata, 2018) suggests that an additional model calculation of time-varying temporal salience can predict participants’ perceptual thresholds for detecting intensity changes at a variety of probed time points throughout the modeled stimuli, i.e. participants’ time-varying fluctuations in attention when listening to rhythmic patterns.
In the current study, we further tested the BTB model’s temporal salience predictions by asking whether output from the model could predict the pupillary response to rhythmic musical patterns. We hypothesized that the model could predict neurophysiological signals, such as the pupillary response, which we use as a proxy for attention. Specifically, we expected that the pupil would become entrained to the rhythmic musical patterns in a stimulus-specific way.
We also expected to see phasic pupil dilation responses to intensity deviants. As in Hurley et al. (2018), we used an adaptive thresholding procedure to probe participants’ perceptual thresholds for detecting intensity increases (dB SPL) inserted at multiple time points throughout realistic, multi-part rhythmic stimuli. Each probed position within the stimulus had a corresponding value in terms of the model’s temporal salience predictions. We hypothesized that detection thresholds should be lower at moments of high model-predicted salience and vice versa. If perceptual thresholds differ for different moments in time, we assume this reflects fluctuations in attention, as predicted by DAT.
Materials
The five rhythmic patterns used in this study (Figure 1, left column, top panels) were initially created by Dr. Peter Keller via a custom audio sequencer in Max/MSP 4.5.7 (Cycling ‘74), for a previous experiment in our lab. Multitimbre percussive patterns, each consisting of the snap, shaker, and conga samples from a Proteus 2000 sound module (E-mu Systems, Scotts Valley, CA), were designed to be played back in a continuous loop at 107 beats per minute, with a 4/4 meter in mind. However, we remain agnostic as to the actual beat and metric periodicities listeners perceived in the stimuli, as we leave such predictions to the linear oscillator model. Each stimulus pattern lasted 2.2 s. We use the same stimulus names as in Hurley et al. (2018) for consistency. All stimuli can be accessed in the supplemental material of Hurley et al. (2018).
Please note that the intensity level changed dynamically throughout the experiment based on participants’ responses. The real-time, adaptive presentation of the stimuli is discussed further in the Adaptive Thresholding Procedure section below.
Linear oscillator model predictions
All stimuli were processed through the Beyond-the-Beat model (Tomic & Janata, 2008) to obtain mean periodicity profiles and temporal salience predictions. For full details about the architecture of the model and the periodicity surface calculations, see Tomic and Janata (2008). For details about the temporal salience calculations, please see Hurley et al. (2018).
In short, the model uses the Institute for Psychoacoustics and Electronic Music toolbox (Leman, Lesaffre, & Tanghe, 2001) to transform the incoming audio in a manner analogous to the auditory nerve, separating the signal into 40 different frequency bands, with center frequencies ranging from 141 to 8877 Hz. Then, onset detection is performed in each band by taking the half-wave rectified first-order difference of the root mean square (RMS) amplitude. Adjacent bands are averaged together to reduce redundancy and enhance computational efficiency. The signal from each of the remaining five bands is fed through a bank of 99 reson filters (linear oscillators) tuned to a range of frequencies up to 10 Hz. The oscillators driven most strongly by the incoming signal oscillate with the largest amplitude (Figure 2A). A windowed RMS on the reson filter outputs results in five periodicity surfaces (one for each of the five bands), which show the energy output at each reson-filter periodicity (Figure 2B). The periodicity surfaces are averaged together to produce an Average Periodicity Surface (Figure 2C). The profile plotted to the right of each stimulus (Figure 1, right column; Figure 2D) is termed the Mean Periodicity Profile (MPP) and represents the energy at each periodicity frequency, averaged over time. Periodicities in the MPP that exceed 5% of the MPP’s amplitude range are considered peak periodicities and are plotted as dark black lines against the gray profile (Figure 1, right column).
After determining the peak periodicities for each stimulus, we return to the output in each of the five bands from the reson filters. We mask this output to only contain activity from the peak frequencies. Taking the point-wise mean resonator amplitude across the peak-frequency reson filters in all five bands yields the time series shown directly beneath each stimulus pattern in Figure 1 (also see Figure 2E). We consider this output an estimate of salience over time.
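To make these steps concrete, the following MATLAB sketch illustrates the reson-filter stage, peak picking, and salience calculation for a single band of a hypothetical onset signal. The pole radius, gain normalization, and peak-picking details here are simplifying assumptions for illustration; the model’s exact conventions are given in Tomic and Janata (2008).

```matlab
% Simplified single-band illustration of the reson-filter stage.
% Pole radius, gain, and peak picking are assumed values, not the
% model's exact conventions (see Tomic & Janata, 2008).
fs     = 100;                            % model sampling rate (Hz)
freqs  = linspace(0.25, 10, 99);         % 99 oscillator frequencies (Hz)
onsets = zeros(1, fs*22);                % placeholder onset signal, one band
onsets(1:round(2.2*fs):end) = 1;         % an impulse every 2.2 s (one loop)

r = 0.99;                                % pole radius (sets oscillator damping)
resonOut = zeros(numel(freqs), numel(onsets));
for k = 1:numel(freqs)
    theta = 2*pi*freqs(k)/fs;
    a = [1, -2*r*cos(theta), r^2];       % two-pole resonator denominator
    b = 1 - r;                           % simple gain normalization (assumed)
    resonOut(k,:) = filter(b, a, onsets);
end

% Mean Periodicity Profile: RMS energy per oscillator, averaged over time
mpp = sqrt(mean(resonOut.^2, 2));

% Peak periodicities: those exceeding 5% of the profile's amplitude range
isPeak = mpp > min(mpp) + 0.05*(max(mpp) - min(mpp));

% Temporal salience: mean resonator amplitude across peak-frequency filters
salience = mean(abs(resonOut(isPeak, :)), 1);
```

In the full model, this masking and averaging is carried out across all five bands rather than the single band sketched here.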
In deciding the possible time points at which to probe perceptual thresholds, we tried to sample across the range of model-predicted salience values for each stimulus by choosing four temporal locations (dotted lines in lower panels of Figure 1). We treat the model predictions as a continuous variable.
To predict the temporal and spectral properties of the pupillary signal, we 1) extend the temporal salience prediction for multiple loop iterations, 2) convolve the extended temporal salience prediction with a canonical pupillary response function (McCloy, Larson, Lau, & Lee, 2016), and 3) calculate the spectrum of this extended, convolved prediction. The pupillary response function (PRF) is plotted in Figure 2F. Its parameters have been empirically derived, first by Hoeks and Levelt (1993), then refined by McCloy et al. (2016). The PRF is an Erlang gamma function, with the equation:

$$h(t) = t^{n} e^{-nt/t_{\max}}$$

where h is the impulse response of the pupil, with latency tmax. Hoeks and Levelt (1993) derived n as 10.1, which represents the number of neural signaling steps between attentional pulse and pupillary response. They derived tmax as 930 ms when participants responded to suprathreshold auditory tones with a button press. More recently, McCloy et al. (2016) estimated tmax to suprathreshold auditory tones in the absence of a button press to be 512 ms. They show that this non-motor PRF is more accurate in correctly deconvolving precipitating attentional events and that it can be used even when there are occasional motor responses involved, e.g. responses to deviants, as long as they are balanced across conditions. Hence, in our case, we model the continuous pupillary response to our stimuli using the non-motor PRF and simply treat any motor responses to deviants as noise that is balanced across all of our conditions (stimuli).
Though previous studies have taken a deconvolution approach (deconvolving the recorded pupil data to get an estimate of the attentional pulses that elicited it), note that we here take a forward, convolutional approach. This allows us to generate predicted pupil data (Figure 2G), which we compare to our recorded pupil data. With this approach, we avoid the issue of not being certain when exactly the attentional pulse occurred, i.e. with deconvolution it is unclear what the relationships are between the stimulus, attentional pulse, and the system’s delay (see discussion in Hoeks and Levelt (1993), p. 24); also note that deconvolution approaches often require an additional temporal alignment technique, such as an optimization algorithm, e.g. Wierda, van Rijn, Taatgen, and Martens (2012), and/or dynamic time-warping, e.g. Kang and Wheatley (2015). Here, we take the empirically derived delay, t, of the pupil to return to baseline as 1300 ms, from McCloy et al. (2016), Figure 1a.
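As an illustration of this forward approach, a minimal MATLAB sketch might construct the non-motor PRF and convolve it with the extended salience prediction as follows. The 4-s support and unit-peak normalization are our assumptions for illustration, not prescriptions of McCloy et al. (2016).

```matlab
% Sketch: build the non-motor PRF and convolve it with the extended
% temporal salience prediction (salience from the model sketch above).
fs   = 100;                         % sampling rate of the model output (Hz)
n    = 10.1;                        % signaling steps (Hoeks & Levelt, 1993)
tmax = 0.512;                       % non-motor latency, 512 ms (McCloy et al., 2016)
t    = (0:1/fs:4)';                 % 4-s support for the impulse response (assumed)
prf  = t.^n .* exp(-n .* t ./ tmax);
prf  = prf ./ max(prf);             % unit-peak normalization (assumed)

salienceLooped = repmat(salience(:), 8, 1);      % extend over 8 loop iterations
pupilPred = conv(salienceLooped, prf);
pupilPred = pupilPred(1:numel(salienceLooped));  % trim the convolution tail

% Spectrum of the predicted pupillary signal
[pxx, f] = periodogram(pupilPred - mean(pupilPred), [], [], fs);
```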
Alternative models
An important consideration is whether the complexity of the linear oscillator model is necessary to accurately predict behavioral and pupillary data for rhythmic stimuli. To address this question, two alternative models are considered in our analyses, each representing different, relevant aspects of the acoustic input sequence.
Full resonator output: Rather than masking the resonator output at the peak periodicities determined by the Mean Periodicity Profile to get an estimate of salience over time that is driven by the likely relevant metric frequencies, it is possible to simply average the output from all reson filters over time. Such a prediction provides a useful comparison to our peak-filtered output, allowing us to test whether the prominent metric periodicities, specifically, play a role in predicting attention over time.
Amplitude Envelope: The spectrum of the amplitude envelope of a sound signal has been shown to predict neural entrainment frequencies, e.g. Nozaradan, Peretz, and Mouraux (2012) show cortical steady-state evoked potentials at peak frequencies in the envelope spectrum. Hence, as a comparison to the linear oscillator model predictions, we also used the amplitude envelope of our stimuli as a predictor. To extract the amplitude envelope of our stimuli, we repeated each stimulus for multiple loops and calculated the root mean square envelope using MATLAB’s envelope function with the ‘rms’ flag and a sliding window of 50ms. Proceeding with just the upper half of the envelope, we low-pass filtered the signal at 50 Hz using a 3rd order Butterworth filter then down-sampled to 100 Hz to match the resolution of the oscillator model. To predict pupil data, we convolved the envelope with the PRF, as previously detailed for the linear oscillator model.
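A minimal MATLAB sketch of this envelope extraction, assuming a placeholder file name and loop count:

```matlab
% Sketch of the amplitude-envelope predictor ('stimulus.wav' is a placeholder).
[x, fsAudio] = audioread('stimulus.wav');
x = repmat(x(:,1), 8, 1);                   % repeat the stimulus for multiple loops

winLen = round(0.05 * fsAudio);             % 50-ms sliding RMS window
[envUp, ~] = envelope(x, winLen, 'rms');    % keep only the upper envelope

[b, a] = butter(3, 50/(fsAudio/2));         % 3rd-order low-pass at 50 Hz
envUp = filtfilt(b, a, envUp);

envUp = resample(envUp, 100, fsAudio);      % down-sample to 100 Hz
% envUp can now be convolved with the PRF, as sketched above.
```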
Apparatus
Participants were tested individually in a dimly lit, sound-attenuating room, at a desk with a computer monitor, infrared eye-tracker, Logitech Z-4 speaker system, and a Dell keyboard connected to the computer via USB serial port. Participants were seated approximately 60 cm away from the monitor. Throughout the experiment, the screen was gray with a luminance of 17.7 cd/m², a black fixation cross in the center, and a refresh rate of 60 Hz. The center to edge of the fixation cross subtended 2.8° of visual angle. Pupil diameter of the right eye was recorded with an Eyelink 1000 (SR Research) sampling at 500 Hz in remote mode, using Pupil-CR tracking and the ellipse pupil tracking model. Stimuli were presented at a comfortable listening level, individually selected by each participant, through speakers that were situated on the right and left sides of the computer monitor. During the experiment, auditory stimuli were adaptively presented through Max/MSP (Cycling ’74; code available at: https://github.com/janatalab/attmap.git), which also recorded behavioral responses and sent event codes to the eye-tracking computer via a custom Python socket.
Procedure
Participants were instructed to listen to the music, to maintain their gaze as comfortably as possible on the central fixation cross, and to press the “spacebar” key any time they heard an increase in volume (a deviant). They were informed that some increases might be larger or smaller than others and that they should respond to any such change. A 1 min practice run was delivered under the control of our experimental web interface, Ensemble (Tomic & Janata, 2007), after which participants were asked if they had any questions.
During the experiment proper, a run was approximately 7 min long and consisted of approximately 190 repetitions of a stimulus pattern. There were no pauses between repetitions; thus, a continuously looping musical scene was created. Please note that the exact number of loop repetitions any participant heard varied according to the adaptive procedure outlined below. Each participant heard each stimulus once, resulting in five total runs of approximately 7 min each, i.e. a roughly 35 min experiment.
Stimulus order was randomized throughout the experiment. Messages to take a break and continue when ready were presented after each run of each stimulus. Following the deviance detection task, participants completed questionnaires assessing musical experience, imagery abilities, genre preferences, etc. These questionnaire data were collected as part of larger ongoing projects in our lab and will not be reported in this study. In total, the experimental session lasted approximately 50 min; this includes the auditory task, self-determined breaks between runs (which typically lasted 5-30 s), and the completion of surveys.
Adaptive Thresholding Procedure: Though participants experienced a continuous musical scene, we can think of each repetition of the stimulus loop as the fundamental organizing unit of the experiment that determined the occurrence of deviants. Specifically, after every standard (no-deviant) loop iteration, there was an 80% chance of a deviant, in one of the four probed temporal locations, without replacement, on the following loop iteration. After every deviant loop, there was a 100% chance of a no-deviant loop. The Zippy Estimation by Sequential Testing (ZEST) (King-Smith, Grigsby, Vingrys, Benes, & Supowit, 1994; Marvit, Florentine, & Buus, 2003) algorithm was used to dynamically change the decibel level of each deviant, depending on the participant’s prior responses and an estimated probability density function (p.d.f.) of their threshold.
The ZEST algorithm tracked thresholds for each of the four probed temporal locations separately during each stimulus run. A starting amplitude increase of 10 dB SPL was used as an initial difference limen. On subsequent trials, the p.d.f. for each probed location was calculated based on whether the participant detected the probe or not, within a 1000 ms window following probe onset. ZEST uses Bayes’ theorem to constantly reduce the variance of a posterior probability density function by reducing uncertainty in the participant’s threshold probability distribution, given the participant’s preceding performance. The mean of the resultant p.d.f. determines the magnitude of the following deviant at that location. The mean of the estimated probability density function on the last deviant trial is the participant’s estimated perceptual threshold.
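The logic of a single ZEST update can be sketched as follows; the prior and the psychometric-function slope below are illustrative assumptions, and the exact functions used are specified in King-Smith et al. (1994) and Marvit et al. (2003).

```matlab
% One illustrative ZEST update step. Prior and psychometric slope are
% assumed values, not those of the actual experiment.
dB  = 0:0.1:20;                        % candidate thresholds (dB SPL increase)
pdf = normpdf(dB, 10, 4);              % prior p.d.f. over thresholds (assumed)
pdf = pdf ./ sum(pdf);

testLevel = sum(dB .* pdf);            % present the deviant at the p.d.f. mean

% Likelihood of the response given each candidate threshold
% (cumulative-Gaussian psychometric function with an assumed slope)
detected = true;                       % response within 1000 ms of probe onset
pDetect  = normcdf(testLevel - dB, 0, 2);
if detected
    likelihood = pDetect;
else
    likelihood = 1 - pDetect;
end

pdf = pdf .* likelihood;               % Bayes' rule: posterior ~ prior x likelihood
pdf = pdf ./ sum(pdf);
nextLevel = sum(dB .* pdf);            % posterior mean sets the next deviant level
```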
Compared to a traditional staircase procedure, ZEST allows for relatively quick convergence on perceptual thresholds. Because the ZEST procedure aims to minimize variance, using reversals as a stopping rule, as in a staircase procedure, does not make sense. Here, we used 20 observations as a stopping rule because Marvit et al. (2003) showed that 18 observations were sufficient in a similar auditory task and Hurley et al. (2018) demonstrated that, on average, 11 trials allowed for reliable estimation of perceptual threshold when using a dynamic stopping rule. For the current study, 20 was a conservative choice for estimating thresholds, which simultaneously enabled multiple observations over which to average pupillary data.
In summary, each participant was presented with a deviant at each of the four probed locations, in each stimulus, 20 times. The intensity change was always applied to the audio file for 200 ms, i.e. participants heard an increase in volume of the on-going rhythmic pattern for 200 ms before the pattern returned to the initial listening volume. The dB SPL of each deviant was adjusted dynamically based on participants’ prior responses. The mean of the estimated probability density function on the last deviant trial (observation 20) was the participant’s estimated threshold. Examples of this adaptive stimulus presentation are accessible online (Hurley et al. (2018): Supplemental Material: http://dx.doi.org/10.1037/xhp0000563.supp).
Analysis
Perceptual Thresholds: Participants’ perceptual thresholds for detecting deviants at each of the probed temporal locations were computed via ZEST (King-Smith et al., 1994; Marvit et al., 2003); see the Adaptive Thresholding Procedure section above for further details.
Reaction Time: Reaction times were calculated for each trial for each participant, from deviant onset until button press. Trials containing reaction times that did not fall within three scaled median absolute deviations from the median were removed from subsequent analysis. This process resulted in the removal of 0.12% of the data.
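This criterion matches MATLAB’s default median-based outlier rule, so one way to implement it is the following (the variable name is illustrative):

```matlab
% Remove trials whose RT lies more than three scaled MADs from the median
% (MATLAB's default 'median' outlier rule); rt is an illustrative variable.
keep = ~isoutlier(rt, 'median');
rt   = rt(keep);
```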
Pupil Preprocessing: Blinks were identified in the pupil data using the Eyelink parser blink detection algorithm (SR Research, 2009), which identifies blinks as periods of loss in pupil data surrounded by saccade detection, presumed to occur based on the sweep of the eyelid during the closing and opening of the eye. Saccades were also identified using Eyelink’s default algorithm.
Subsequently, all ocular data were preprocessed using custom scripts and third party toolboxes in MATLAB version 9.2 (MATLAB, 2017a). Samples consisting of blinks or saccades were set to NaN, as was any sample that was 20 arbitrary units greater than the preceding sample. A sliding window of 25 samples (50 ms) was used around all NaN events to remove edge artifacts. Missing pupil data were imputed by linear interpolation. Runs requiring 30% or more interpolation were discarded from further analysis, which equated to 9% of the data. The pupil time series for each participant, each run (~7 min), was high-pass filtered at .05 Hz, using a 3rd order Butterworth filter, to remove any large-scale drift in the data. For each participant, each stimulus run, pupil data were normalized as follows: zscoredPupilData = (rawData – mean(rawData)) / std(rawData). See Figure S1 in the Supplementary Materials accompanying this article for a visualization of these pupil pre-processing steps.
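A minimal MATLAB sketch of this pre-processing pipeline, assuming illustrative names for the blink and saccade masks and a column-vector pupil trace:

```matlab
% Sketch of the pupil pre-processing pipeline (mask variables and
% thresholds as described above; names are illustrative).
fs = 500;                                       % Eyelink sampling rate (Hz)
pupil(blinkMask | saccadeMask) = NaN;           % blink/saccade samples to NaN
pupil([false; diff(pupil) > 20]) = NaN;         % >20 a.u. jumps to NaN

% Widen each NaN event by 25 samples (50 ms) to remove edge artifacts
bad = movmax(double(isnan(pupil)), [25 25]) > 0;
pupil(bad) = NaN;

if mean(isnan(pupil)) >= 0.30                   % discard runs needing >=30% interpolation
    error('Run discarded: too much missing data.');
end
pupil = fillmissing(pupil, 'linear');           % impute by linear interpolation

[b, a] = butter(3, 0.05/(fs/2), 'high');        % 0.05-Hz 3rd-order high-pass
pupil  = filtfilt(b, a, pupil);

pupilZ = (pupil - mean(pupil)) ./ std(pupil);   % z-score per run
```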
Collectively, the preprocessing procedures and some of the statistical analyses reported below relied on the Signal Processing Toolbox (v. 7.4), the Statistics and Machine Learning Toolbox (v. 11.1), the Bioinformatics Toolbox (v. 4.4), Ensemble (Tomic & Janata, 2007), and the Janata Lab Music Toolbox (Janata, 2009; Tomic & Janata, 2008). All custom analysis code is available upon request.
Pupil Dilation Response: The pupil dilation response (PDR) was calculated for each probed deviant location in each stimulus by time-locking the pupil data to deviant onset. A baseline period was defined as 200 ms preceding deviant onset. The mean pupil size from the baseline period was subtracted from the trial pupil data (deviant onset through 3000 ms). The mean and max pupil size were calculated within this 3000 ms window. We chose 3000 ms because the pupil dilation response typically takes around 2500 ms to return to baseline following a motor response (McCloy et al., 2016); therefore, 3000 ms seemed a safe window length. Additionally, the velocity of the change in pupil size from deviant onset to max dilation was calculated as the slope of a line fit from pupil size at deviant onset to max pupil size in the window, similar to Figure 1C in Wang, Boehnke, Itti, and Munoz (2014). The latency until pupil size maximum was defined as the duration (in ms) it took from deviant onset until the pupil reached its maximum size. Trials containing a baseline mean pupil size that was greater than three scaled median absolute deviations from the median were removed from subsequent analyses (0.2% of all trials).
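For clarity, these PDR metrics can be sketched in MATLAB as follows (indexing and variable names are illustrative; onsetIdx marks deviant onset in samples):

```matlab
% Sketch of the PDR metrics for one deviant trial.
fs    = 500;
base  = mean(pupil(onsetIdx - round(0.2*fs) : onsetIdx - 1));  % 200-ms baseline
trial = pupil(onsetIdx : onsetIdx + 3*fs) - base;              % 0-3000 ms window

meanPDR = mean(trial);
[maxPDR, iMax] = max(trial);
latencyMs = 1000 * (iMax - 1) / fs;            % onset-to-peak latency (ms)
velocity  = (maxPDR - trial(1)) / latencyMs;   % slope from onset to maximum
```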
Time-Frequency Analyses: To examine the spectrotemporal overlap between our varied model predictions and the observed pupillary signals, we calculated the spectrum of the average pupillary signal to 8-loop epochs of each stimulus, for each participant. We then averaged the power at each frequency across all participants, for each stimulus. We compared the continuous time series and the power spectral density for the recorded pupil signal for each stimulus to those predicted by the model predictions convolved with the pupillary response function. These two average analyses are included for illustrative purposes; note that the main analysis of interest is on the level of the single participant, single stimulus, as outlined below.
To compare the fine-grained similarity between the pupil size time series for any given stimulus and the linear oscillator model prediction for that stimulus, we computed the Cross Power Spectral Density (CPSD) between the pupil time series and itself, the model time series and itself, and the pupil time series and the model time series. The CPSD was calculated using Welch’s method (Welch, 1967), with a 4.4 s window and 75% overlap. For each participant, each stimulus, we computed 1) the CPSD between the pupil trace and the model prediction for that stimulus and 2) the CPSD between the pupil trace and the model predictions for all other stimuli, which served as the null distribution of coherence estimates.
The phase coherence between the pupil and the model for any given stimulus was defined as the squared absolute value of the pupil-model CPSD, divided by the product of the power spectral densities of the two individual signals (i.e., the CPSD of each signal with itself):

$$C(f) = \frac{|P_{pm}(f)|^{2}}{P_{pp}(f)\,P_{mm}(f)}$$

where p indexes the pupil signal and m the model prediction. We then calculated a single true and null coherence estimate for each participant, each stimulus, by finding the true vs. null coherence at each model-predicted peak frequency (see Table S2) under 3 Hz and averaging.
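A minimal MATLAB sketch of this coherence calculation, using hypothetical peak frequencies in place of the actual values in Table S2:

```matlab
% Sketch of the pupil-model coherence estimate.
fs  = 100;                                  % common rate of pupil and model signals
win = round(4.4 * fs);                      % 4.4-s Welch window
nov = round(0.75 * win);                    % 75% overlap

% Magnitude-squared coherence: |Pxy|.^2 ./ (Pxx .* Pyy)
[cxy, f] = mscohere(pupilTrace, modelPred, hann(win), nov, [], fs);

% Average the coherence at model-predicted peak frequencies below 3 Hz
peakF   = [0.45, 0.91, 1.82];               % hypothetical peak frequencies (Hz)
peakF   = peakF(peakF < 3);
trueCoh = mean(interp1(f, cxy, peakF));
```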
Discussion
The current experiment used a linear oscillator model to predict both perceptual detection thresholds and the pupil signal during a continuous auditory psychophysical task. During the task, participants listened to repeating percussion loops and detected momentary intensity increments. We hypothesized that the linear oscillator model would predict perceptual thresholds for detecting intensity deviants that were adaptively embedded into our stimuli, as well as the continuous pupillary response to the stimuli.
The linear oscillator model reflects the predictions of Dynamic Attending Theory (DAT), which posits that attention can become entrained by an external (quasi)-periodic stimulus. The model is driven largely by onsets detected in the acoustic envelope of the input signal, which are fed through a bank of linear oscillators (reson filters). From there it is possible to calculate which oscillators are most active, mask the output at those peak frequencies, and average over time. Throughout the paper, we considered this peak-filtered signal the ideal prediction of temporal salience, as in Hurley et al. (2018); however, for comparison, we also tested how well output from all resonators would predict our data, as well as how well the amplitude envelope alone (without any processing through the oscillator model) would do in predicting both perceptual thresholds and the pupillary signal.
The peak-filtered model was best at predicting perceptual thresholds, providing an important replication and extension of our previous study (Hurley et al., 2018). In the present study we used only complex stimuli and intensity increments, but our previous study showed the same predictive effects of the peak-filtered model for both intensity increments and decrements, as well as simple and complex stimuli (Hurley et al., 2018). We assume that such results imply that the peaks extracted by our model are attentionally relevant periodicities that guide listeners’ attention throughout complex auditory scenes. The fact that perceptual thresholds were higher at moments of low predicted salience, i.e. a deviant needed to be louder at that moment in time for participants to hear it and vice versa, indicates that attention is not evenly distributed throughout time, in line with the predictions of DAT. However, we note that the linear oscillator model’s temporal salience prediction was strongly correlated with the magnitude of the amplitude envelope of the signal.
Indeed, when it comes to the pupil signal, both the peak-filtered model and the amplitude envelope performed almost identically. This similarity is likely a result of a variety of factors: 1) the rhythms used in the current study are stationary (unchanging over time) and, though there are some moments lacking acoustic energy, overall, the prominent periodicities are present in the acoustic signal. Hence, the Fourier Transform of the amplitude envelope yields roughly identical peak periodicities to that of the model. 2) Convolving both signals with the pupillary response function smears out most subtle differences between the two signals, making them even more similar.
Regardless of the ambiguity regarding which model may be a better predictor, the pupillary results reported in this paper are exciting nonetheless. First and foremost, we show that the pupil can entrain to a rhythmic auditory stimulus. To our knowledge, we are the first to report such a finding, though others have reported pupillary entrainment in the visual domain (Daniels et al., 2012; Naber et al., 2013). The continuous pupillary signal and the pupil spectrum to each stimulus were both well predicted by the linear oscillator model and the amplitude envelope of the audio signal. That pupil dilation/constriction dynamics, controlled by the smooth dilator and sphincter muscles of the iris, respectively, entrain to auditory stimuli is in line with a large literature on music-evoked entrainment. Though the pupil has never been mentioned in this literature, other areas of the autonomic nervous system have been shown to entrain to music (Trost et al., 2017). It remains to be tested how pupillary oscillations might relate to cortical neural oscillations, as highlighted in the introduction of this paper. Are pupillary delta oscillations phase-locked to cortical delta? Do cortical steady-state evoked potentials overlap with those of the pupil? Pupillometry is a more mobile and cost-effective method than EEG; as such, characterizing the relationship between pupillary and cortical responses to music will hopefully allow future studies to use pupillometry in situations that otherwise might have required EEG.
Furthermore, we have shown not only that the pupil entrains to the prominent periodicities present in our stimuli, but also that the oscillatory pupillary response to each stimulus is unique. These results extend those of Kang and Wheatley (2015) and speak to the effectiveness of using pupillometry in the context of music cognition studies. Unlike Kang and Wheatley (2015), we did not use deconvolution or dynamic time-warping to assess the fit of our pupil data with our stimuli; rather, we took a forward approach to modeling our stimuli, convolving stimulus-specific predictions with a pupillary response function, effectively removing the need for algorithms like dynamic time-warping or fitting optimizations. We hope that this approach will prove beneficial for others, especially given the simplicity of calculating the amplitude envelope of a signal and convolving it with the pupillary response function. With regard to our linear oscillator model, future work will use a wider and more temporally dynamic variety of stimuli to assess the power of our linear oscillator model vs. the amplitude envelope in predicting the time-varying pupil signal across a diverse range of musical cases. Hopefully such a study will shed more light on the issue of whether pupillary oscillations are an evoked response or a reflection of attention, and in what contexts one might need to use a more complex model, if any.
Even if the oscillations are driven more by the stimuli than by endogenous attention, such oscillations nevertheless shape subsequent input and reflect, in some way, the likely attended features of the auditory, and perhaps visual, input. Because pupil dilation blurs the retinal image and widens the visual receptive field, while pupil constriction sharpens the retinal image, narrowing the visual receptive field (Daniels et al., 2012), oscillations in pupil size that are driven by auditory stimuli may also have ramifications for visual attention (Mathot & Van der Stigchel, 2015) and audiovisual integration. For example, visual attention should be more spatially spread (greater sensitivity) at moments of greater auditory temporal salience (larger pupil dilation). It is possible, however, that such small changes in pupil size elicited by music are negligible with respect to visual sensitivity and/or acuity (see Mathôt (2018) for a discussion). In short, such interactions remain to be empirically tested.
Another important finding of the current study is the pupil dilation response (PDR) to deviants. Of particular interest is the result that the pupil responds to deviants even when participants do not report hearing them, providing further evidence that pupillary responses are indicators of preconscious processing (Laeng, Sirois, & Gredeback, 2012). However, the current results raise an additional important question of whether the PDR might be more akin to the mismatch negativity (MMN) than to the P3, which requires conscious attention to be elicited (Polich, 2007). Others have shown a PDR to deviants in the absence of attention (Damsma & van Rijn, 2017; Liao et al., 2016) and here we show a PDR to deviants that did not reach participants’ decision thresholds, or possibly conscious awareness, despite their focused attention. Hence, though there is evidence connecting the P3a to the PDR and the LC-NE system (Murphy et al., 2011; Nieuwenhuis, De Geus, & Aston-Jones, 2011), an important avenue of future research will be to disentangle the relationship between the PDR, P3a, and MMN, which can occur without the involvement of conscious attention (Naatanen, Paavilainen, Rinne, & Alho, 2007) or perception (Allen, Kraus, & Bradlow, 2000). While all three of these measures can be used as indices of deviance detection, the P3a has been proposed to reflect comparison of sensory input with a previously formed mental expectation that is distributed across sensory and motor regions (Janata, 2012; Navarro Cebrian & Janata, 2010), whereas the MMN and PDR have been interpreted as more sensory-driven, pre-attentive comparison processes.
Though we are enthusiastic about the PDR results, a few considerations remain. Since the present experiment only utilized intensity increments as deviants, it could be argued that the pupil dilation response observed on trials during which a deviant was presented below perceptual threshold does not reflect subthreshold processes but rather a linear response to stimulus amplitude. To this argument, we point to the fact that there was no correlation between the mean or max evoked pupil size on any given trial and the contrast in dB SPL on that trial. However, to further address this possible alternative interpretation, we conducted an experiment involving both intensity increments and decrements. Those data show the same PDR patterns on hit vs. missed trials for both increments and decrements, as reported in the current study (Fink, Hurley, Geng, & Janata, 2017). In addition, previous studies of rhythmic violations showed a standard PDR to the omission of an event (e.g. Damsma and van Rijn (2017)), suggesting that the PDR is not specific to intensity increment deviance.
An additional critique might be that the difference between the pupil size on hit vs. missed trials is because hit trials require a button press while miss trials do not. Though it may be the case that the additional motor response required to report deviance detection results in a larger pupil size, this is unlikely to fully account for the difference in results (Einhauser et al., 2008; Laeng et al., 2016). Critically, even if a button press results in a greater pupil dilation, this does not change the fact that a PDR is observed on trials in which a deviant was presented but not reported as heard (uncontaminated by a button press).
In summary, our study contributes to a growing literature emphasizing the benefits of using eye-tracking in musical contexts (for further examples, please see the other articles in this Special Issue on Music & Eye-Tracking). We have shown that the pupil of the eye can reliably index deviance detection, as well as rhythmic entrainment. Considered in conjunction with previous studies from our lab, the linear oscillator model utilized in this paper is a valuable predictor on multiple scales – tapping and large body movements (Hurley et al., 2014; Janata et al., 2012; Tomic & Janata, 2008), perceptual thresholds (Hurley et al., 2018), and pupil dynamics (current study). In general, the model can explain aspects of motor entrainment within a range of .25–10 Hz – the typical range of human motor movements. Future work should further compare the strengths and limitations of models of rhythmic attending (e.g. Forth, Agres, Purver, and Wiggins (2016); Large et al. (2015)), the added benefits of such models over simpler predictors such as the amplitude envelope, and the musical contexts in which one model is more effective than another.