Arousal States as a Key Source of Variability in Speech Perception and Learning

Abstract: The human brain exhibits the remarkable ability to categorize speech sounds into distinct, meaningful percepts, even in challenging tasks like learning non-native speech categories in adulthood and hearing speech in noisy listening conditions. In these scenarios, there is substantial variability in perception and behavior, both across individual listeners and individual trials. While there has been extensive work characterizing stimulus-related and contextual factors that contribute to variability, recent advances in neuroscience are beginning to shed light on another potential source of variability that has not been explored in speech processing. Specifically, there are task-independent, moment-to-moment variations in neural activity in broadly distributed cortical and subcortical networks that affect how a stimulus is perceived on a trial-by-trial basis. In this review, we discuss factors that affect speech sound learning and moment-to-moment variability in perception, particularly arousal states—neurotransmitter-dependent modulations of cortical activity. We propose that a more complete model of speech perception and learning should incorporate subcortically mediated arousal states that alter behavior in ways that are distinct from, yet complementary to, top-down cognitive modulations. Finally, we discuss a novel neuromodulation technique, transcutaneous auricular vagus nerve stimulation (taVNS), which is particularly well-suited to investigating causal relationships between arousal mechanisms and performance in a variety of perceptual tasks. Together, these approaches provide novel testable hypotheses for explaining variability in classically challenging tasks, including non-native speech sound learning.


Introduction
The ability to perceive speech, especially under challenging conditions, reflects a remarkable set of computational processes that the human brain is well-adapted to perform. A major challenge that both expert and novice listeners face when learning a new language is substantial variability in input across different speakers, contexts, and listening environments. To comprehend spoken language, listeners must transform a highly variable and often noisy acoustic signal into meaningful linguistic units. Listeners use numerous sources of knowledge to overcome this variability, including (but not limited to) visual information (Campbell 2008; McGurk and MacDonald 1976), coarticulation (Kang et al. 2016; Mann and Repp 1980), lexical status (Ganong 1980; Luthra et al. 2021), semantic information (Kutas and Federmeier 2000; Miller and Isard 1963), and discourse structure (Brouwer et al. 2013; Van Berkum et al. 2005). In addition, the active processes of perception and comprehension are modulated by task- and goal-driven factors like attention (Heald and Nusbaum 2014; Huyck and Johnsrude 2012). Together, these factors provide listeners with the flexibility necessary to comprehend speech, including under conditions where the signal-to-noise ratio of the input is decreased (Guediche et al. 2014), or in the context of challenging tasks like acquisition of an unfamiliar language (Birdsong 2018).
However, the same speech sounds can be perceived quite differently due to ambiguity in the input (e.g., masking in a noisy environment) or ambiguity in the listener's perceptual or cognitive representations of speech (e.g., due to variation in familiarity with the language). While there has been extensive work characterizing the contributions of these stimulus-related factors to trial-by-trial variability in perception (Guediche et al. 2014; Heald and Nusbaum 2014), recent advances in neuroscience are beginning to shed light on another, complementary set of factors that may be just as important to understanding behavioral variability. Specifically, moment-to-moment variations in neural activity that are not directly related to a given task may play a key role in perceptual and behavioral outcomes.
In this review, we examine the evidence for the ability of arousal states (neurotransmitter-dependent modulations of cortical activity) to affect behaviors that are central to the ability to learn new languages. Specifically, we focus on how arousal states may influence non-native sound learning in adulthood and perception of acoustically ambiguous speech, since these constitute concrete examples of core processes that reflect the tight coordination between cortical perceptual systems and subcortical arousal systems. While arousal states are also likely to be involved in learning to produce the sounds of a non-native language, here we refer to "sound learning" solely with regard to the process of learning to identify and discriminate non-native speech sound categories.
First, we provide an overview of the physiological basis of arousal states and describe how brain activity may be modulated by fluctuations in these systems. Second, we will discuss the challenging task of learning to discriminate and identify non-native speech sound categories in adulthood, focusing in particular on our current understanding of the neural processes that the arousal system may act upon. Third, we will discuss examples of moment-to-moment perceptual variability that provide crucial insight into factors that affect how speech sounds are processed. Fourth, we propose that changes in arousal states may be able to explain this moment-to-moment variability. Finally, we introduce an emerging neuromodulation technique, non-invasive transcutaneous auricular vagus nerve stimulation (taVNS), that can be used to provide causal tests of arousal mechanisms in a variety of tasks, including sound learning. We propose that taVNS holds promise as both a scientific and translational tool for understanding and manipulating arousal states that may play a major role in learning and perceptual outcomes.

The Physiology of Arousal States
The arousal system is one of the most fundamental mechanisms in the vertebrate central nervous system (Coull 1998; Whyte 1992). It is composed of multiple overlapping components that modulate core bodily functions and states including wakefulness/alertness, body-wide motoric activity, and affective reactivity (Satpute et al. 2019), as well as other functions such as neural plasticity (Coull et al. 1999; Martins and Froemke 2015; Unsworth and Robison 2017).
Arousal is generally divided into two subtypes with related but dissociable mechanisms. Tonic arousal is characterized by slow fluctuations, such as those governed by circadian rhythm, and corresponds behaviorally to states of wakefulness/alertness. In contrast, phasic arousal refers to rapid (i.e., on the level of seconds and milliseconds) fluctuations in neural responsivity operating within stages of tonic arousal (Satpute et al. 2019; Whyte 1992). While tonic arousal may have important implications for perceptual processes, moment-to-moment variability in perception is more strongly influenced by phasic arousal (McGinley et al. 2015b).
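The tonic/phasic distinction can be made concrete with a toy simulation in which a slow drift (tonic) is superimposed with brief, rapidly decaying bursts (phasic). All timescales, rates, and magnitudes below are our own illustrative choices, not estimates from the physiology literature.

```python
import math
import random

def arousal_trace(n_seconds, dt=0.05, seed=0):
    """Toy decomposition of arousal into a slow tonic component and
    brief phasic bursts. Returns (tonic, phasic, combined) samples."""
    rng = random.Random(seed)
    n = int(n_seconds / dt)
    tonic, phasic, combined = [], [], []
    burst = 0.0
    for i in range(n):
        t = i * dt
        # Tonic: slow sinusoidal drift (illustrative ~10-minute period)
        slow = 0.5 + 0.3 * math.sin(2 * math.pi * t / 600.0)
        # Phasic: occasional bursts that decay over ~1.5 s
        if rng.random() < 0.02:
            burst += 0.4
        burst *= math.exp(-dt / 1.5)
        tonic.append(slow)
        phasic.append(burst)
        combined.append(slow + burst)
    return tonic, phasic, combined
```

The point of the sketch is simply that the two components operate at very different timescales within the same overall signal, which is why the rapid phasic component is the more plausible candidate for explaining trial-by-trial variability.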
Anatomically, the arousal system consists of numerous ascending and descending pathways originating in the lower brainstem. These pathways activate subcortical structures that release neurotransmitters such as norepinephrine (NE) and acetylcholine (ACh; Quinkert et al. 2011). The release of these neurotransmitters modulates activity across the cortex. Among these pathways, the locus coeruleus-norepinephrine (LC-NE) pathway is one of the most important systems for controlling phasic arousal (Aston-Jones and Cohen 2005; Sara 2009; Unsworth and Robison 2017). The LC is a brainstem nucleus that is the main source of NE and projects to widespread cortical and subcortical sites (Berridge and Waterhouse 2003; McCormick and Pape 1990; Ranjbar-Slamloo and Fazlali 2020). In addition to controlling arousal, activation of the LC has been found to affect numerous cognitive processes such as attention, memory, and sensory processing (Poe et al. 2020; Sara and Bouret 2012). Crucially, the LC-NE system has rapid effects on cortical activity, reflected in dynamic changes in neurophysiology and psychophysiological measures like pupil dilation (Gilzenrat et al. 2010; McGinley et al. 2015b; Reimer et al. 2016).
Levels of arousal are often divided into distinct states, corresponding to low, moderate, and high arousal (Aston-Jones and Cohen 2005). While arousal states have long been appreciated in many other domains (e.g., Symmes and Anderson 1967), to date, there is little work on their role in speech perception and learning. In general, arousal states can be viewed as varying degrees of receptivity (how likely it is that a population will be activated by some input) and reactivity (how strongly a population responds to a given input) in neural populations. Brief changes in phasic arousal induced via air puff strongly suppress auditory responses in avian sensorimotor regions (Cardin and Schmidt 2003), and direct application of NE to auditory processing areas enhances or suppresses auditory responses depending on dosage (Cardin and Schmidt 2004). In mouse models, increased arousal broadens the response bandwidth of cells tuned to specific frequencies, which may affect sensory processing by generating a larger neural response to a particular stimulus. Similarly, the performance of mice attempting to detect a pure tone inside a complex auditory mask has been shown to be modulated by moment-to-moment arousal state (McGinley et al. 2015a).
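The notions of receptivity and reactivity can be captured in a minimal gain-control sketch, in which arousal both lowers an effective response threshold (receptivity) and scales response amplitude (reactivity). The functional form and all parameter values below are illustrative assumptions, not a model fitted to neural data.

```python
def neural_response(stimulus_drive, arousal, threshold=0.5, max_gain=2.0):
    """Toy gain model of a neural population's response to a stimulus.
    arousal in [0, 1]: higher arousal lowers the effective threshold
    (receptivity) and increases multiplicative gain (reactivity)."""
    effective_threshold = threshold * (1.0 - 0.5 * arousal)
    gain = 1.0 + (max_gain - 1.0) * arousal
    drive = stimulus_drive - effective_threshold
    return gain * max(drive, 0.0)  # rectified-linear response

# The identical stimulus yields different responses across arousal states
responses = {a: neural_response(0.6, a) for a in (0.1, 0.5, 0.9)}
```

In this sketch, a weak stimulus that fails to exceed threshold under low arousal produces no response at all, while the same stimulus under high arousal produces a large one, which is the intuition behind arousal-dependent trial-to-trial variability for a physically identical input.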
Emerging research in many domains, including working memory, attentional control (Unsworth and Robison 2017), and plasticity (Martins and Froemke 2015), collectively points to the importance of neuromodulator-dependent arousal states as a critical source of individual neural variability. With regard to speech, prior studies have tended to focus on how arousal states affect production (e.g., Kleinow and Smith 2006) or how perception and comprehension may affect arousal (e.g., Zekveld et al. 2018). Research that focuses specifically on the effect of arousal states (independent of anxiety, cf. Mattys et al. 2013) on perception and learning is only recently beginning to emerge. For example, a recent EEG study found that during sleep, neural responses phase-locked to speech (isolated vowel sounds) are modulated by changes in arousal state (Mai et al. 2019). Similarly, Llanos et al. (2020) recently demonstrated that performance on a non-native speech sound learning task could be modulated using stimulation techniques that target arousal mechanisms. We propose that these rapid fluctuations in arousal states may explain moment-to-moment variability in neural activity and behavior during speech perception, with strong implications for explaining variability in sound learning in adulthood.

Emergence of Non-Native Speech Category Representations in Adulthood
A clear example of variability in speech perception comes from the domain of non-native speech sound learning. Difficulties with perceptual discrimination and identification can disproportionately affect progress when learning a new language. After developmental sensitive periods (usually 4-12 months; Kral et al. 2019; Kuhl 2004, 2010), there is greater neural commitment (and preference) to the phonological inventories of languages that the infant has been exposed to and reduced neural commitment (and preference) to non-native sound categories (Werker and Hensch 2015; Yu and Zhang 2018). This early experience can impact later learning (Finn et al. 2013; Kuhl et al. 2005; though cf. Birdsong and Vanhove 2016), making it notoriously difficult for adults to acquire certain non-native speech sound categories (e.g., Japanese learners acquiring /r/ vs. /l/ distinctions; Bradlow 2008; Zhang et al. 2005). However, laboratory-based training approaches have demonstrated that robust and generalizable learning is achievable in adulthood and that this learning is retained well after the training period (Myers 2014; Reetzke et al. 2018). Training approaches differ widely, as do the learning outcomes across individuals (Chandrasekaran et al. 2010; Golestani and Zatorre 2009). Some approaches use synthesized stimuli with acoustic information constrained to generate specific contrasts (Scharinger et al. 2013) or reflect native-like distributions (Reetzke et al. 2018; Zhang et al. 2009), while others use naturalistic stimuli without distributional constraints (Chandrasekaran et al. 2010; Sadakata and McQueen 2013). Some actively leverage talker variability (Brosseau-Lapré et al. 2013; Perrachione et al. 2011), others do not. Some approaches involve no feedback, some involve incidental or implicit feedback (Lim and Holt 2011), while some provide varying levels of explicit feedback.
Despite the large number of training approaches, two consistent themes emerge from the speech sound learning literature: first, on average, adults can learn even difficult non-native phonetic categories, and second, large-scale individual differences persist across various training approaches. However, the underlying neural mechanisms and the sources of this individual variability remain unclear.
A majority of laboratory-based phonetic training approaches have used three training characteristics: (1) naturalistic stimuli produced by native speakers, (2) high variability, using multiple talkers and segmental contexts, and (3) trial-by-trial feedback. While #1 and #2 help listeners focus on dimensions that are category-relevant (and ignore dimensions that are highly variable across talkers/segments), #3 allows listeners the opportunity to monitor and learn from errors. These three training characteristics have engendered significant and generalizable learning for various speech categories (e.g., the /l/~/r/ contrast, lexical tone, voice-onset-time distinctions, etc.). While other approaches (e.g., incidental and implicit learning) have also been shown to produce significant and generalizable improvements in language learning tasks, in this review we restrict our discussion to training approaches with the three training characteristics (natural speech, talker variability, and trial-by-trial feedback) described above.
Over the last several years, a series of studies have adopted a systems neuroscience perspective using multiple neuroimaging methods to provide a better understanding of how sensory and perceptual representations of categories emerge as a function of sound-to-category training. Non-invasive methods like functional magnetic resonance imaging (fMRI) offer high spatial precision and an opportunity to discern network-level activity, while electroencephalography (EEG) offers millisecond precision and an opportunity to discern sensory changes as a function of training (Menon and Crottaz-Herbette 2005). In addition, novel computational techniques such as multivariate decoding and functional connectivity analysis have allowed for a more mechanistic understanding of the neural correlates of speech sound learning.
For example, Feng et al. (2019) examined blood oxygen level-dependent (BOLD) activity using fMRI as participants acquired non-native Mandarin tone categories over a session of sound-to-category training. This study examined how tone category representations emerge within the auditory association cortex across the timescale of a single session (Figure 1). Within a few hundred trials, activation patterns that differentiate tonal categories (syllables produced by different talkers that vary on the basis of pitch patterns) emerged within the left superior temporal gyrus (STG). These emergent representations were robust to talker and segmental variability, suggesting that they reflect abstract category representations. Furthermore, contrasting correct versus incorrect trials showed activation of several striatal regions including the bilateral putamen, caudate nucleus, and nucleus accumbens. Participants who showed increased putamen activation showed more robust learning and employed more procedural-based (sound-to-reward mapping) learning strategies (Maddox and Chandrasekaran 2014). These results demonstrate that abstract category representations emerge primarily in secondary auditory cortical regions, and that a distinct subcortical network is sensitive to the type of feedback participants receive across trials during learning. Feng et al. (2019) then tested how these two networks interact and found that emergent representations in the left STG were more functionally coupled with the putamen in the latter half of the training paradigm when participants encountered incorrect feedback. They posit that this functional coupling serves to tune emergent category representations in a feedback-dependent manner to ensure continued reward (in the form of correct feedback).
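The logic of such multivariate decoding analyses can be sketched with synthetic data: if category-specific activation patterns generalize across talkers, a decoder trained on some talkers should classify trials from a held-out talker above chance. The data simulation and the nearest-centroid decoder below are our own illustrative stand-ins, not the actual analysis pipeline used in these studies.

```python
import numpy as np

def simulate_patterns(n_categories=4, n_talkers=4, n_voxels=50,
                      signal=1.0, noise=1.0, seed=0):
    """Simulate STG-like activation patterns: each tone category has a
    stable multivoxel signature, corrupted by trial-level noise.
    Purely synthetic data for illustrating the decoding logic."""
    rng = np.random.default_rng(seed)
    signatures = rng.normal(0, signal, (n_categories, n_voxels))
    X, y, talker = [], [], []
    for t in range(n_talkers):
        for c in range(n_categories):
            for _ in range(10):  # trials per category per talker
                X.append(signatures[c] + rng.normal(0, noise, n_voxels))
                y.append(c)
                talker.append(t)
    return np.array(X), np.array(y), np.array(talker)

def leave_one_talker_out_accuracy(X, y, talker):
    """Nearest-centroid decoding that tests generalization across
    talkers: train on all but one talker, test on the held-out one."""
    accs = []
    for held in np.unique(talker):
        train, test = talker != held, talker == held
        centroids = np.stack([X[train][y[train] == c].mean(axis=0)
                              for c in np.unique(y)])
        dists = ((X[test][:, None, :] - centroids[None]) ** 2).sum(axis=2)
        pred = dists.argmin(axis=1)
        accs.append((pred == y[test]).mean())
    return float(np.mean(accs))
```

Above-chance accuracy for held-out talkers is what licenses the interpretation that the decoded patterns reflect abstract, talker-invariant category representations rather than talker-specific acoustics.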
Recently, non-invasive methods like fMRI and EEG have been complemented by invasive direct neural recordings using electrodes implanted on the brain surface. For example, some individuals with conditions such as drug-resistant epilepsy undergo invasive electrophysiological monitoring as part of their clinical care, during which they may elect to participate in research, creating the opportunity to record electrical activity from strips or grids of electrodes placed directly on the surface of the cortex (Chang 2015). These electrocorticographic (ECoG) recordings have high spatial and temporal resolution (with a downside of limited spatial sampling of the brain), making it possible to record the activity of populations of neurons with millisecond precision and with sufficient signal-to-noise to examine activity patterns on a single trial basis and link activity to behavior. Combined with methodological advances that make it possible to track fine-grained neural computations at the level of single individuals and single trials, these approaches have vastly increased our ability to investigate variability in speech perception and sound learning.
A recent ECoG study examined variability in the early stages of non-native speech sound learning (Yi et al. 2021). Native English speakers were trained to identify sounds in Mandarin Chinese, which are distinguished by four distinct pitch patterns (lexical tones) that are difficult for native English speakers to learn (Chandrasekaran et al. 2010). As the participants listened to Mandarin lexical tones and received feedback on their ability to label the tone, a distributed set of neural populations in superior temporal and lateral frontal cortex showed a diverse set of changes, including both increases and decreases in tone-specific encoding patterns depending on whether each trial was behaviorally correct. Both behavior and neural activity were variable across trials (despite participants being presented with the same stimuli), providing a clear neural correlate of trial-wise variability during learning. However, it remains unclear why learning behavior was so variable across trials, in particular leading to non-monotonic learning curves.
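One way to quantify this trial-wise variability is to smooth the binary trial outcomes into a learning curve and test whether the curve ever dips after an earlier peak. The following is a minimal sketch; the window size and dip threshold are arbitrary illustrative choices, not parameters taken from Yi et al. (2021).

```python
def moving_average_accuracy(correct, window=20):
    """Sliding-window proportion correct over a sequence of 0/1 trial
    outcomes, yielding a smoothed trial-wise learning curve."""
    return [sum(correct[i:i + window]) / window
            for i in range(len(correct) - window + 1)]

def is_non_monotonic(curve, tolerance=0.05):
    """True if the curve drops more than `tolerance` below a running
    peak, i.e., performance dips rather than improving steadily."""
    peak = curve[0]
    for value in curve[1:]:
        if value < peak - tolerance:
            return True
        peak = max(peak, value)
    return False
```

A steadily improving learner produces a monotonically rising smoothed curve, while the dips flagged by `is_non_monotonic` are the kind of non-monotonic trajectory that stimulus- and task-based accounts struggle to explain.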
While most individuals can acquire novel speech sound categories in adulthood, there are large-scale differences in the extent of learning success (e.g., Llanos et al. 2020). Such variability may be driven by individual differences in the extent of engagement of the cortico-striatal network and, consequently, the robustness of the emergent representations, as well as factors related to the stimuli and training paradigm. Thus, it is crucial to understand the role of moment-to-moment perceptual and behavioral variability in mediating sound learning and, more broadly, perception.

Moment-to-Moment Variability in Speech Perception and Acquisition
The processes involved in mapping continuous sounds to abstract representations support the goal of successful comprehension. However, these processes also introduce additional sources of variability. For example, the exact same input may be perceived differently when presented multiple times depending on the context (e.g., "bank" as a place to store money versus the edge of a river) and the listener's prior knowledge (e.g., whether they are familiar with the role of suprasegmental pitch patterns in a tonal language like Mandarin Chinese).
The additional variability introduced by factors associated with the stimulus (e.g., lexical status) or task (e.g., attention, feedback) can explain some but not all trial-to-trial differences in perception and comprehension. Even when context, task, and prior knowledge are held constant, perception can still vary, indicating that there are stimulus- and task-independent processes that strongly influence behavior. Here, we propose another source of variability that explains the apparent non-deterministic nature of perception and learning: phasic fluctuations in arousal states, which modulate brain activity on a moment-by-moment basis, independent of stimulus and task constraints. Despite numerous advances in our understanding of the myriad extrinsic factors that generate variability in speech perception, current models have yet to address these stimulus- and task-independent sources of variability and their role in perception and learning.
In particular, we focus on studies that examine ambiguous sounds that can be perceived in a multistable fashion. Similar to the case of non-native speech sound category learning, these paradigms allow us to study how physically identical sounds are represented in the brain when they are perceived differently.

Behavioral Evidence for Stimulus- and Task-Independent Variability
One clear demonstration of the behavioral consequences of stimulus- and task-independent variability is the classic psycholinguistic phenomenon of phoneme restoration (Samuel 1981; Warren 1970). When a portion of the acoustic input is completely masked by noise (e.g., the /s/ sound in the word "legislature"), listeners often fail to report that any sound was missing and are unable to localize the noise even when told explicitly that it was there. This effect has been taken as strong evidence for the role of top-down modulation in speech perception; phonological, lexical, and semantic information act to "restore" the missing sound.
However, the restored phoneme that is perceived can change on a trial-by-trial basis. When the noise replaces a phoneme in a word that generates two possible English words (e.g., /fae#t Languages 2022, 7, x FOR PEER REVIEW 6 of 21 While most individuals can acquire novel speech sound categories in adulthood, there are large scale differences in the extent of learning success (e.g., Llanos et al. 2020). Such variability may be driven by individual differences in the extent of engagement of the cortico-striatal network ) and consequently, the robustness of the emergent representations, as well as factors related to the stimuli and training paradigm. Thus, it is crucial to understand the role of moment-to-moment perceptual and behavioral variability in mediating sound learning and more broadly, perception.

Moment-to-Moment Variability in Speech Perception and Acquisition
The processes involved in mapping continuous sounds to abstract representations support the goal of successful comprehension. However, these processes also introduce additional sources of variability. For example, the exact same input may be perceived differently when presented multiple times depending on the context (e.g., "bank" as a place to store money versus the edge of a river) and the listener's prior knowledge (e.g., whether they are familiar with the role of suprasegmental pitch patterns in a tonal language like Mandarin Chinese).
The additional variability introduced by factors associated with the stimulus (e.g., lexical status) or task (e.g., attention, feedback) can explain some but not all trial-to-trial differences in perception and comprehension. Even when context, task, and prior knowledge are held constant, perception can still vary, indicating that there are stimulusand task-independent processes that strongly influence behavior. Here, we propose another source of variability that explains the apparent non-deterministic nature of perception and learning: phasic fluctuations in arousal states, which modulate brain activity on a moment-by-moment basis, independent of stimulus and task constraints. Despite numerous advances in our understanding of the myriad extrinsic factors that generate variability in speech perception, current models have yet to address these stimulus-and taskindependent sources of variability and their role in perception and learning.
In particular, we focus on studies that examine ambiguous sounds that can be perceived in a multistable fashion. Similar to the case of non-native speech sound category learning, these paradigms allow us to study how physically identical sounds are represented in the brain when they are perceived differently.

Behavioral Evidence for Stimulus-and Task-Independent Variability
One clear demonstration of the behavioral consequences of stimulus-and taskindependent variability is the classic psycholinguistic phenomenon of phoneme restoration (Samuel 1981;Warren 1970). When a portion of the acoustic input is completely masked by noise (e.g., the /s/ sound in the word "legislature"), listeners often fail to report that any sound was missing and are unable to localize the noise even when told explicitly that it was there. This effect has been taken as strong evidence for the role of top-down modulation in speech perception; phonological, lexical, and semantic information act to "restore" the missing sound.
However, the restored phoneme that is perceived can change on a trial-by-trial basis. When the noise replaces a phoneme in a word that generates two possible English words (e.g., /fae#t ɚ / could be "faster" or "factor"), listeners report bistable perception of the same ambiguous acoustic input. Strikingly, even when provided with strong extrinsic cues (e.g., "She drove the car /fae#tɚ/"), perception still varies (Leonard et al. 2016), suggesting that there is an additional source of variability that overrides stimulus characteristics and task goals. ɚ Another common situation in which the same stimulus is repeatedly presented, and when variability in the perceptual response of the listener is a crucial, desired feature, is during second language acquisition. Listeners swiftly adapt to accented speech (Bradlow and Bent 2008;Norris et al. 2003) and with repeated exposure or training can learn to distinguish sounds that were initially indiscriminable (Bradlow 2008). The input may be / could be "faster" or "factor"), listeners report bistable perception of the same ambiguous acoustic input. Strikingly, even when provided with strong extrinsic cues (e.g., "She drove the car /fae#t Languages 2022, 7, x FOR PEER REVIEW While most individuals can acquire novel speech sound categories in ad there are large scale differences in the extent of learning success (e.g., Llanos et Such variability may be driven by individual differences in the extent of engag the cortico-striatal network ) and consequently, the robustness of t gent representations, as well as factors related to the stimuli and training paradig it is crucial to understand the role of moment-to-moment perceptual and behavi iability in mediating sound learning and more broadly, perception.

Moment-to-Moment Variability in Speech Perception and Acquisition
The processes involved in mapping continuous sounds to abstract represe support the goal of successful comprehension. However, these processes also i additional sources of variability. For example, the exact same input may be perce ferently when presented multiple times depending on the context (e.g., "bank" a to store money versus the edge of a river) and the listener's prior knowledge (e.g., they are familiar with the role of suprasegmental pitch patterns in a tonal langu Mandarin Chinese).
The additional variability introduced by factors associated with the stimu lexical status) or task (e.g., attention, feedback) can explain some but not all tria differences in perception and comprehension. Even when context, task, an knowledge are held constant, perception can still vary, indicating that there are s and task-independent processes that strongly influence behavior. Here, we pro other source of variability that explains the apparent non-deterministic nature o tion and learning: phasic fluctuations in arousal states, which modulate brain ac a moment-by-moment basis, independent of stimulus and task constraints. De merous advances in our understanding of the myriad extrinsic factors that gener ability in speech perception, current models have yet to address these stimulus-a independent sources of variability and their role in perception and learning.
In particular, we focus on studies that examine ambiguous sounds that can ceived in a multistable fashion. Similar to the case of non-native speech sound learning, these paradigms allow us to study how physically identical sounds a sented in the brain when they are perceived differently.

Behavioral Evidence for Stimulus- and Task-Independent Variability
One clear demonstration of the behavioral consequences of stimulus- and task-independent variability is the classic psycholinguistic phenomenon of phoneme restoration (Samuel 1981; Warren 1970). When a portion of the acoustic input is completely masked by noise (e.g., the /s/ sound in the word "legislature"), listeners often fail to report that any sound was missing and are unable to localize the noise even when told explicitly that it was there. This effect has been taken as strong evidence for the role of top-down modulation in speech perception; phonological, lexical, and semantic information act to "restore" the missing sound.
However, the restored phoneme that is perceived can change on a trial-by-trial basis. When the noise replaces a phoneme in a word that generates two possible English words (e.g., /fæ#tɚ/ could be "faster" or "factor"), listeners report bistable perception of the same ambiguous acoustic input. Strikingly, even when provided with strong extrinsic cues (e.g., "She drove the car /fæ#tɚ/"), perception still varies (Leonard et al. 2016), suggesting that there is an additional source of variability that overrides stimulus characteristics and task goals.
Another common situation in which the same stimulus is repeatedly presented, and when variability in the perceptual response of the listener is a crucial, desired feature, is during second language acquisition. Listeners swiftly adapt to accented speech (Bradlow and Bent 2008;Norris et al. 2003) and with repeated exposure or training can learn to distinguish sounds that were initially indiscriminable (Bradlow 2008). The input may be physically unambiguous, yet perceptual or linguistic representations are less robust, generating increased variability across trials and among individuals (Chandrasekaran et al. 2010;Paulon et al. 2020;Reetzke et al. 2018). Indeed, decreasing this variability is a primary goal of the learning process, meaning that it is crucial to understand the neural mechanisms that contribute to it.

Neural Evidence for Arousal-Related Variability
As with speech sound learning (Yi et al. 2021), invasive electrophysiological methods are enabling us to examine the moment-to-moment neural correlates of perceptual variability. A recent ECoG study examined the neural encoding of ambiguous speech sounds in a phoneme restoration task, where trial-by-trial perceptual variability was a key characteristic of the behavior (Leonard et al. 2016). When ECoG participants were presented with both unambiguous (e.g., 'factor' and 'faster') and ambiguous ([fæ#tɚ]) stimuli, activity in the superior temporal gyrus (STG) reflected the acoustic-phonetic properties (Mesgarani et al. 2014; Yi et al. 2019) of the perceived sound on a trial-by-trial basis. That is, on a trial when a participant reported hearing the noise burst as /s/ ([fæstɚ]), neural populations in STG exhibited activity that closely resembled the evoked activity to a real /s/, and the converse was true when the noise was perceived as /k/ ([fæktɚ]). This activity reflected the online perceptual experience on a trial-by-trial basis, rapidly changing when participants had distinct percepts on repeated presentations of the same stimulus. Strikingly, participants showed semi-random perceptual changes across trials, which was partially explained by activity in a left frontal network. However, it remains unclear what this activity reflects, and importantly, why certain trials were perceived as one word rather than another.
The phenomenon of phoneme restoration represents a clear case in which the stimulus and context are held constant, yet perceptual behavior and its neural correlates are still subject to variation. This leads to a crucial question: What are the sources of this trial-to-trial variability and how do they affect speech perception and speech sound learning on a mechanistic level?

Cortical State-Dependent Perception and Behavior
Perhaps the most common explanation for trial-by-trial variability in tasks like ambiguous stimulus perception and speech category learning is the interaction between bottom-up perceptual and top-down cognitive processes (Heald and Nusbaum 2014). One source of task-dependent, top-down modulation is a set of fronto-parietal regions known as the Multiple Demand (MD) network (Duncan 2010; Hasson et al. 2018), which is characterized by its involvement in numerous aspects of cognition. The MD network is generally defined by activity that scales with cognitive flexibility, task demands, and the engagement of fluid intelligence. It is involved in selecting or generating a response to particular cues (e.g., sensory stimuli) presented in particular contexts. This process is reflected in dynamic reconfigurations of activity patterns across the cortex, resulting in a greater proportion of neurons responding to relevant stimuli and increased similarity in response patterns to targets versus non-targets (Duncan 2010). While there is some debate regarding the role of the MD network in language processing (Diachek et al. 2020), there is clear evidence for its interactive and modulatory effects in trial-to-trial decisions for speech sound categorization, a core behavior for perception and learning. For example, when native listeners must recognize highly variable auditory input as corresponding to a single class (e.g., multiple instances of the same phoneme category spoken by different speakers), categorization is reflected in activity patterns within the MD network (Feng et al. 2021). Interactions between these MD nodes and core speech regions (bilateral STG) were associated with accumulation of evidence for a response, indicating that perceptual behavior is reflected not only in local populations tuned to specific features but also in the coordinated activity of distributed cortical networks.
Outside the MD network, there are other task-dependent processes that can change perceptual and neural representations of sound on a trial-by-trial basis. For example, focusing attention on a particular speaker in a noisy acoustic environment containing competing signals (e.g., one talker's voice in a room full of other talkers) has been found to enhance or inhibit responses of specific neural populations depending on which stream is being attended to (Brodbeck et al. 2020;Ding and Simon 2012;Mesgarani and Chang 2012). The deployment of attentional resources has been of particular interest to explaining variability with regards to speech perception under adverse listening conditions (for detailed reviews, see Guediche et al. 2014;Scott and McGettigan 2013). For example, momentary fluctuations in attention (indexed by prestimulus alpha phase) have been found to distinguish correct and incorrect lexical decisions (Strauss et al. 2015). Similarly, knowledge about the nature and structure of an unfamiliar stimulus can dramatically change both perception and neural encoding. Noise-vocoded speech (Davis et al. 2005; Davis and Johnsrude 2007) and sine-wave speech (Remez et al. 1981) constitute extreme examples of degraded auditory stimuli that most often are at first completely unintelligible to naïve listeners. However, after an extremely brief (sometimes a single exposure) presentation of information that cues listeners to the nature of these unfamiliar sounds, comprehension is greatly enhanced (Sohoglu et al. 2012), while activity in cortical speech networks suddenly shifts to resemble perception of normal, unfiltered speech (Holdgraf et al. 2016;Khoshkhoo et al. 2018). Adaptation to degraded speech has also been found to recruit executive networks involved in attention and perceptual learning (Erb et al. 2013).
In the case of sound learning, behavior and neural responses have been shown to be affected by task-dependent factors such as instructions or whether a stimulus is presented within a homogeneous or diverse set of stimuli. Different types of corrective feedback have been shown to have different effects on trial-to-trial performance when English speakers are learning to identify Mandarin tone speech sounds. When listeners learn to categorize non-native speech sounds presented by multiple talkers, talker-independent representations of those speech categories emerge in STG activity (Feng et al. 2019).
While the preceding examples illustrate cases where task-dependent processes modulate perception and behavior on a trial-by-trial basis, there is still substantial unexplained variability both within and across individuals. For example, variability in the firing patterns of neurons tuned to specific features has been shown to be an important component of the auditory processing system (Faisal et al. 2008;Stein et al. 2005). Even in simple actions such as a button press (Fox et al. 2007), task-independent fluctuations in the activity of cortical regions and networks may account for a considerable proportion of the variance (Sadaghiani and Kleinschmidt 2013;Taghia et al. 2018;Vidaurre et al. 2017). In the case of speech perception, moment-to-moment changes in activity patterns have been found to predict perceptual behavior when input is ambiguous (Leonard et al. 2016). A major unanswered question is what modulates task-independent variability from trial to trial.
We propose that these cortical activity patterns that reflect moment-to-moment variability in perception are an example of brain states. Brain states are network-level activity patterns that do not directly encode information like stimulus content, but which dynamically alter stimulus or behavioral representations via changes in functional connectivity (Taghia et al. 2018). Brain states have been well characterized in the domains of sleep (Steriade et al. 2001), memory, and attention (Harris and Thiele 2011), but they have yet to be studied in speech and language. We currently lack an understanding of how brain states modulate representations for speech sounds and importantly what mechanisms act to organize the brain states themselves. Specifically, we lack a mechanistic explanation as to how task-independent shifts in neural activity arise (McGinley et al. 2015b).
We propose that a major source of trial-to-trial variability resides in a fundamental biological system, the arousal system, which is known to drive widespread and rapid changes in the dynamics of cortical networks (Coull 1998;Raut et al. 2021;Whyte 1992). We hypothesize that rapid changes in arousal states modulate broad cortical activity patterns (brain states) independent of task demands and stimulus characteristics, which can lead to substantial variability in perception and behavior. Crucially, the impact of arousal states occurs in concert with other sources of variability, including top-down processes like categorical perception and task-dependent processes like attention. The rest of this review focuses on the putative role of arousal states in behavioral and perceptual variability.

Arousal States Modulate Brain States That Influence Perception
While it is clear that arousal can have important effects on neural activity, its effects on speech perception are not clearly understood. In particular, we lack a fundamental understanding of the links between arousal states (brainstem/LC-NE), cortical brain states (distributed network activity), and trial-by-trial variability in perceptual behavior (speech cortex/STG). We propose that arousal states modulate perceptual behavior and performance during auditory processing tasks by modulating brain states that influence perception (Figure 2).


Figure 2. Trial-by-trial variability in perception is modulated by cortical brain states driven by arousal.
Arousal state is mediated by the release of brainstem neuromodulators (such as norepinephrine; yellow box). Changes in arousal alter activity in numerous cortical sites including core speech cortex/STG (blue and red circles), non-core speech processing regions (green circles), and cortical structures comprising the Multiple Demand (MD) network (orange circles). An ambiguous stimulus (e.g., /fæ#tɚ/) may be perceived in multiple ways depending on moment-to-moment variability in brain states, the strength and configuration of functional networks (denoted by line thickness), and activity within cortical nodes tuned to specific features (shaded blue and red circles).
In human experiments, fluctuations in indices commonly associated with phasic arousal have been found to correlate with variability in perceptual outcomes, including the phase of scalp-recorded alpha oscillations. With regards to visual perception, pre-stimulus alpha phase has been shown to predict whether or not phosphenes (visual sensations perceived in the absence of an evoking stimulus) will be induced by transcranial magnetic stimulation (Dugue et al. 2011). Likewise, Strauss et al. (2015) found that correct and incorrect responses for a lexical decision task in noise could be predicted from prestimulus alpha phase. The authors suggest that in correct responses, the onset of the initial phoneme of the stimulus coincided with an optimal excitatory phase in which a perceptual object can be 'selected'. Whether this effect is driven by top-down factors like attention or by subcortical arousal mechanisms remains unclear.
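To make this kind of analysis concrete, below is a minimal numpy sketch of how instantaneous alpha phase might be estimated from a single pre-stimulus EEG trace. The sampling rate, band edges, and synthetic signal are illustrative assumptions, not parameters from Dugue et al. (2011) or Strauss et al. (2015):

```python
import numpy as np

FS = 500  # sampling rate in Hz (illustrative)

def alpha_phase(signal, fs, band=(8.0, 12.0)):
    """Instantaneous alpha-band phase via an FFT-based analytic signal.

    Zero out all frequencies except the positive alpha band (doubled, as
    in the Hilbert transform); the angle of the inverse transform is the
    instantaneous phase (in radians) at every sample.
    """
    freqs = np.fft.fftfreq(signal.size, d=1 / fs)
    spectrum = np.fft.fft(signal)
    keep = (freqs >= band[0]) & (freqs <= band[1])
    analytic = np.fft.ifft(np.where(keep, 2 * spectrum, 0))
    return np.angle(analytic)

# Synthetic single-trial "EEG": a 10 Hz alpha rhythm plus broadband noise.
rng = np.random.default_rng(0)
t = np.arange(0, 2.0, 1 / FS)
eeg = np.sin(2 * np.pi * 10 * t) + 0.3 * rng.standard_normal(t.size)

# Phase at a notional stimulus onset one second into the trace.
phase_at_onset = alpha_phase(eeg, FS)[500]
```

In a study like those cited, trials would then be binned by this pre-stimulus phase value and accuracy compared across bins.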
As stated in Section 2, levels of arousal are often separated into 'low', 'moderate', and 'high' states. Optimal performance in auditory tasks appears to coincide with moderate arousal states (McGinley et al. 2015a). For example, in an auditory detection task, listeners attempted to detect the presence of a faint pure tone embedded in noise (de Gee et al. 2020). When tones were presented on 50% of trials, participants exhibited increased sensitivity and decreased reaction time when in a higher state of arousal. Similarly, optimal sensitivity to pitch differences in an auditory judgment task has been found to correspond to moderate levels of arousal (Waschke et al. 2019).
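This inverted-U relationship between arousal level and performance can be caricatured with a toy model; the quadratic form and the location of the peak are assumptions made for illustration, not fits to the cited data:

```python
import numpy as np

def toy_sensitivity(arousal):
    """Toy inverted-U: perceptual sensitivity as a function of arousal,
    where arousal runs from 0 (drowsy) to 1 (hyperaroused) and
    sensitivity peaks at a moderate level (0.5). Purely illustrative."""
    return 1.0 - 4.0 * (arousal - 0.5) ** 2

arousal_levels = np.linspace(0.0, 1.0, 101)
sensitivities = toy_sensitivity(arousal_levels)

# The best-performing state is the moderate one, not either extreme.
optimal_arousal = arousal_levels[np.argmax(sensitivities)]
```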
These findings indicate two potential mechanisms by which arousal may modulate brain states to affect perception and behavior. First, moderate, 'optimal' levels of arousal may suppress the spontaneous firing rate of specific populations to a greater extent than their evoked activity, thereby increasing the signal-to-noise ratio (SNR) within specific network nodes (McBurney-Lin et al. 2019). Second, arousal may modulate the functional connectivity of broad cortical networks. Traveling waves of activity have been linked to fluctuations in arousal (Raut et al. 2021) and different levels of arousal have been found to correlate with changes in within- and between-network connectivity (Young et al. 2017). Given the distributed, interactional nature of the processes involved in speech processing and language learning (Evans and Davis 2015; Feng et al. 2021; Friederici 2012), it is likely that these processes are affected by arousal-driven fluctuations in global brain states.
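A toy numerical example of the first mechanism; all firing rates and suppression factors are invented for illustration, not drawn from McBurney-Lin et al. (2019):

```python
def rate_snr(evoked_rate, spontaneous_rate):
    """SNR of a rate-coded response: evoked firing relative to the
    spontaneous background it must stand out from."""
    return evoked_rate / spontaneous_rate

# Illustrative firing rates (spikes/s) at baseline arousal.
spontaneous, evoked = 10.0, 30.0

# Suppose moderate arousal suppresses spontaneous firing by 50% but
# evoked responses by only 10% (invented numbers): SNR rises even
# though overall activity falls.
snr_baseline = rate_snr(evoked, spontaneous)
snr_moderate = rate_snr(evoked * 0.9, spontaneous * 0.5)
```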
LC-NE-mediated arousal has also been shown to be a key driver of neural plasticity, a central component of learning (Marzo et al. 2009). The LC becomes active in response to a novel or salient stimulus, though this novelty response quickly habituates. However, if reinforcement is provided (e.g., in the form of a task reward), then the LC response re-emerges (Sara et al. 1994).
Given these robust effects on auditory processing and plasticity, we propose that changes in arousal state can explain variability in neural activity corresponding to performance in speech perception and learning. While not previously considered through the lens of arousal states, the mechanisms discussed here potentially explain the trial-by-trial variability observed in complex second language acquisition tasks like speech sound category learning (Yi et al. 2021). For example, it may be the case that differences in cortical responses to specific tones on correct versus incorrect trials reflect differences in SNR driven by rapid fluctuations in arousal (McBurney-Lin et al. 2019), with more accurate responses coinciding with moderate levels of arousal (Aston-Jones and Cohen 2005; de Gee et al. 2020). Alternatively, specific neural populations, primarily in bilateral STG and ventrolateral frontal cortex but also in broadly distributed regions within the MD network, exhibit shifts in neural response profiles in correspondence with learning. These shifts may reflect the influence of arousal state on cortical networks supporting tone categorization (Feng et al. 2021; Sara and Bouret 2012). Likewise, phasic arousal mechanisms may explain some of the stochastic behavior observed in multistable perception tasks (Leonard et al. 2016) and may also provide the modulatory effects necessary to allow top-down cognitive factors to influence perception (Holdgraf et al. 2016).

Using Non-Invasive Vagus Nerve Stimulation to Study the Effects of Arousal in Speech Perception and Learning
Thus far, evidence indicates that apparently random fluctuations in behavior and perception may be partially explainable by changes in arousal states. Perhaps the strongest evidence for the role of subcortically-mediated arousal states in perception and learning comes from studies that use causal manipulations of these putative mechanisms. In this section, we discuss a key method that will enable us to develop a mechanistic understanding of how arousal states influence speech sound perception and sound learning: transcutaneous auricular vagus nerve stimulation (taVNS).
Recent work has identified a relatively simple method for modulating arousal states via electrical stimulation. The peripheral nerves, and in particular the cranial nerves that project directly to the brainstem, can be targeted for modulating central nervous system activity (Adair et al. 2020;Bari and Pouratian 2012;Ginn et al. 2019). The vagus nerve in particular has been shown to be a key peripheral nerve associated with arousal and cognition, in part due to its widespread connectivity to a variety of systems throughout the brain and body (Vonck et al. 2014;Vonck and Larsen 2018). Traditionally, vagus nerve stimulation (VNS) involves the implantation of a cuff electrode around the cervical vagus nerve in the neck and a signal generator/battery in the chest (iVNS). More recently, a non-surgical alternative has been developed targeting the auricular branch of the vagus nerve using transcutaneous surface electrodes (taVNS; Frangos et al. 2015;Ventureyra 2000; Figure 3a,b).
Figure 3c. taVNS delivered during the presentation of easy-to-learn stimuli enhanced performance compared to control (top panel; orange and light blue lines). However, no effect was found when stimulation was delivered during feedback (bottom panel; purple line). ** and *** indicate increasing levels of statistical significance in a linear mixed effects model.
Application of taVNS involves delivering electrical activity to branches of the vagus nerve innervating the skin of the outer ear. Typically, electrodes are affixed to the target site using either an earbud (e.g., Frangos et al. 2015), clip (e.g., Fang et al. 2016), or moldable putty (e.g., Llanos et al. 2020; Schuerman et al. 2021). Stimulation patterns (i.e., the shape of the electrical waveform) can vary widely with regard to pulse shape (e.g., mono/biphasic square wave pulse), pulse amplitude, pulse width, and frequency (i.e., pulse delivery rate; for detailed reviews, see Kaniusas et al. 2019a; Yap et al. 2020). As evidenced in both animal (Hulsey et al. 2017; Loerwald et al. 2018; Morrison et al. 2020; Van Lysebettens et al. 2020) and human physiology (Badran et al. 2018; Schuerman et al. 2021; Yakunina et al. 2017), varying stimulation parameters can produce different effects, which will likely necessitate matching specific stimulation parameters to target outcomes.
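To make these waveform parameters concrete, the following is a purely illustrative sketch of how a charge-balanced biphasic square-wave pulse train can be parameterized by pulse amplitude, pulse width, and frequency. The specific values shown are hypothetical defaults for illustration, not clinical or experimental recommendations.

```python
import numpy as np

def biphasic_pulse_train(amplitude_ma=0.5, pulse_width_us=200,
                         frequency_hz=25, duration_s=0.5, fs=100_000):
    """Sample a charge-balanced biphasic square-wave pulse train.

    Each pulse is a positive phase followed immediately by an equal and
    opposite negative phase, repeated at `frequency_hz` for `duration_s`.
    Parameter values here are illustrative only.
    """
    n = int(duration_s * fs)
    signal = np.zeros(n)
    phase_samples = int(pulse_width_us * 1e-6 * fs)   # samples per phase
    period_samples = int(fs / frequency_hz)           # samples per pulse cycle
    for start in range(0, n, period_samples):
        signal[start:start + phase_samples] = amplitude_ma
        signal[start + phase_samples:start + 2 * phase_samples] = -amplitude_ma
    return signal

train = biphasic_pulse_train()
```

Because each positive phase is mirrored by an equal negative phase, the train sums to zero net charge, the property usually desired for safe transcutaneous stimulation.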
While there are likely differences in innervation between the auricular and cervical pathways (Butt et al. 2019; Cakmak 2019), exploratory studies have revealed that taVNS may achieve effects comparable to iVNS without the need for surgery (Kaniusas et al. 2019b; Schuerman et al. 2021). Crucially, both iVNS and taVNS have recently been shown to modulate complex perceptual, motor, and cognitive processes, such as tonotopy (Engineer et al. 2011; Shetake et al. 2012), somatotopy (Darrow et al. 2020; Pruitt et al. 2016), auditory stimulus-reward association (Lai and David 2021), and responses to speech sounds (Engineer et al. 2015).
These effects have recently been extended into the domain of non-native sound category learning. In a recent study, participants were trained to recognize Mandarin tones while receiving taVNS (Llanos et al. 2020). Participants were randomly assigned to three stimulation groups: the first received taVNS on two 'easy-to-learn' tones that could be differentiated on the basis of pitch height (tones 1 and 3; 'taVNS-easy'); the second received taVNS on two 'hard-to-learn' tones that differed in the direction of pitch change (tones 2 and 4; 'taVNS-hard'); the third received no stimulation ('Control'). During training, participants in the stimulation groups received peri-stimulus taVNS aligned to the onset of the auditory stimulus. Accuracy was greater in the taVNS-easy group than in both the taVNS-hard and control groups, as well as compared to a normative aggregate sample of 678 comparable listeners (Figure 3c, top). Furthermore, increases in accuracy were specific to the stimulated tones. This study demonstrates that it is possible to rapidly modulate a key component of L2 learning using taVNS.
This study also revealed that, along with the choice of stimulation parameters, the timing of stimulation relative to task events is likely to be a key consideration for the implementation and optimization of taVNS. Llanos et al. (2020) employed a peri-stimulus paradigm in which taVNS was aligned to the onset of the auditory stimulus. However, in a follow-up experiment, no learning enhancement was found when stimulation was paired with feedback on each trial (Figure 3c, bottom). Two recent studies on Mandarin word learning that directly compared peri-stimulus to continuously delivered taVNS found that both forms of taVNS improved performance. However, the patterns of effects differed between paradigms. For example, with regard to behavior, peri-stimulus taVNS was associated with increased accuracy on mismatch trials, whereas continuous stimulation was associated with a greater reduction in reaction time (Pandža et al. 2020; Phillips et al. 2021). Changes in tone-evoked pupillary responses on subsequent training days were greater for participants in the peri-stimulus group than for those receiving sham or continuous stimulation (Pandža et al. 2020). Interestingly, the relationship between accuracy and the amplitude of the N400 EEG event-related potential differed between peri-stimulus and continuous stimulation, with sham and continuous stimulation patterning together while peri-stimulus stimulation exhibited an inverse relationship relative to the other two (Phillips et al. 2021). Further research in this area is required to determine whether these differences in behavior and physiology stem from dissociable mechanisms.
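The distinction between the two timing paradigms can be sketched schematically: peri-stimulus stimulation delivers a brief window locked to each trial's stimulus onset, whereas continuous stimulation spans the session. All timing values in this sketch are hypothetical placeholders, not the parameters used in the studies above.

```python
def stim_windows(trial_onsets_s, paradigm="peri", stim_duration_s=0.5,
                 session_end_s=None):
    """Return (start, stop) stimulation windows in seconds.

    'peri': one brief stimulation window aligned to each stimulus onset;
    'continuous': a single window spanning the whole session.
    """
    if paradigm == "peri":
        return [(t, t + stim_duration_s) for t in trial_onsets_s]
    if paradigm == "continuous":
        end = (session_end_s if session_end_s is not None
               else trial_onsets_s[-1] + stim_duration_s)
        return [(0.0, end)]
    raise ValueError(f"unknown paradigm: {paradigm}")

peri = stim_windows([1.0, 3.0, 5.0], "peri")                          # three brief windows
cont = stim_windows([1.0, 3.0, 5.0], "continuous", session_end_s=10.0)  # one long window
```

The sketch makes the experimental contrast explicit: only the peri-stimulus schedule carries information about individual trial timing, which is one candidate explanation for the differing behavioral and physiological patterns.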
It remains to be established whether these types of learning enhancement effects are directly related to the regulation of arousal states. The neuromodulatory pathway of VNS is believed to overlap extensively with subcortical structures regulating arousal. The vagus nerve innervates the LC-NE system, as well as several others, via projections to the nucleus of the solitary tract (Van Bockstaele et al. 1999; Ruggiero et al. 2000). Accordingly, activity in the LC has been found to increase in response to VNS (Hulsey et al. 2017), even at stimulation amplitudes as low as 0.1 milliamps. Activation of the LC-NE system by VNS is also supported by research in non-human animals that has found rapid, dose-dependent increases in pupil dilation and cortical activation in response to VNS (Collins et al. 2021; Mridha et al. 2021), and similar findings have been reported in humans using i/taVNS (Desbeaumes Jodoin et al. 2015; Pandža et al. 2020; Sharon et al. 2021; Urbin et al. 2021; though cf. Burger et al. 2020; D'Agostini et al. 2021; Schevernels et al. 2016; Warren et al. 2019). taVNS has also been found to modulate other biomarkers of LC-NE activity and arousal, such as alpha oscillations (Sharon et al. 2021) and salivary alpha-amylase and cortisol levels (Warren et al. 2019). Finally, VNS has been found to elicit activation across distributed cortical regions, consistent with widespread neurotransmitter release (Cao et al. 2017; Schuerman et al. 2021). These findings suggest that rapid effects of VNS on performance are likely mediated by the arousal system. Given the established links between VNS and arousal, we propose that the modulation of performance observed in these studies reflects the targeted modulation of arousal states during the learning process.
When taVNS is paired with a specific task or behavior that engages a particular set of brain regions (e.g., STG during a speech perception task), changes in domain-general arousal can lead to reinforced activity patterns in core task-relevant areas (Engineer et al. 2019; Hulsey et al. 2016). The widespread release of arousal-modulating neurotransmitters may allow representations that are stored in cortical circuits to be either enhanced or disrupted, depending on factors like timing and behavioral task parameters (Berridge and Waterhouse 2003). For example, the identity of the phoneme perceived in a perceptual restoration task can be predicted by activity in broader cortical networks, including inferior frontal cortex, up to ~300 milliseconds before the onset of the sound (Leonard et al. 2016), suggesting that activity in broadly-distributed areas influences perception in a top-down fashion. Similarly, improvement in Mandarin tone learning has been found to correspond to dynamic changes in activity across the MD network (Feng et al. 2021). Rapid changes in arousal driven by VNS may act to alter the configuration of such networks during speech perception, suppressing or enhancing the activity of neural populations with response properties relevant to tone learning (Yi et al. 2021). More specifically, taVNS-induced arousal may modulate the SNR of task-relevant neural populations (i.e., those tuned to specific features, such as changes in pitch; McBurney-Lin et al. 2019). While more work is necessary to establish these links directly, we propose that this framework provides a set of testable hypotheses that may allow these neuromodulatory systems to be incorporated into a new model of speech perception and learning that accounts for within- as well as across-individual variability (Figure 2).
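The SNR-modulation idea can be illustrated with a deliberately simple toy model (not a model from any of the cited studies): if arousal applies a multiplicative gain to the evoked response of a feature-tuned population while additive noise stays fixed, the separation between signal and noise trials, quantified as a d'-style SNR, grows with gain. All quantities here are arbitrary simulation units.

```python
import numpy as np

rng = np.random.default_rng(0)

def population_snr(gain, n_trials=2000, signal=1.0, noise_sd=1.0):
    """d'-style SNR of a simulated population under multiplicative gain.

    Signal trials evoke `gain * signal` plus additive Gaussian noise;
    noise trials contain noise alone. SNR is the mean difference divided
    by the pooled standard deviation. A toy illustration only.
    """
    sig = gain * signal + rng.normal(0.0, noise_sd, n_trials)
    noi = rng.normal(0.0, noise_sd, n_trials)
    pooled_sd = np.sqrt((sig.var(ddof=1) + noi.var(ddof=1)) / 2)
    return (sig.mean() - noi.mean()) / pooled_sd

low, high = population_snr(gain=1.0), population_snr(gain=2.0)
# higher gain yields a larger separation between signal and noise trials
```

This captures, in miniature, the hypothesized mechanism: an arousal-driven gain change on task-relevant populations would improve discriminability without any change in the stimulus itself.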
To establish this link more explicitly requires reliable biomarkers that reflect changes in arousal state, as well as methods for modulating the activity of the arousal system in a causal manner. It has been known for more than half a century that changes in pupil diameter reflect cognitive load and affective arousal (Kahneman and Beatty 1966; Stanners et al. 1979). More recent research has revealed rapid fluctuations in pupillary responses depending on stimulus properties, behavior, and task demands (de Gee et al. 2017, 2020; Gilzenrat et al. 2010; McGinley et al. 2015a; Reimer et al. 2016). While pupil responses appear to be influenced by multiple subcortical pathways (Berridge and Waterhouse 2003; Larsen and Waters 2018), changes in pupil diameter show robust correlations with activity in the LC (Joshi et al. 2016). Furthermore, both electrical and optogenetic (Breton-Provencher and Sur 2019) stimulation of the LC increases pupil diameter. These findings suggest that activity in the LC-NE system influences, if not drives, rapid fluctuations in pupil diameter (Joshi et al. 2016). Overall, the evidence indicates that pupil dilation is a strong biomarker for LC-NE-mediated changes in arousal state.
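A standard first step in pupillometry analyses of the kind discussed above is to epoch the pupil trace around task events and express each epoch relative to its own pre-event baseline. The sketch below is a minimal, generic version of this procedure; the sampling rate and window lengths are illustrative assumptions, not values from any cited study.

```python
import numpy as np

def event_locked_dilation(pupil, events_s, fs=60, pre_s=0.5, post_s=2.0):
    """Average baseline-corrected, event-locked pupil response.

    `pupil`: 1-D pupil-diameter trace sampled at `fs` Hz;
    `events_s`: event times in seconds. Each epoch spans `pre_s` before
    to `post_s` after the event and is expressed as change from the mean
    of its own pre-event baseline.
    """
    pre, post = int(pre_s * fs), int(post_s * fs)
    epochs = []
    for t in events_s:
        i = int(t * fs)
        if i - pre < 0 or i + post > len(pupil):
            continue  # skip events too close to the recording edges
        epoch = pupil[i - pre:i + post].astype(float)
        epochs.append(epoch - epoch[:pre].mean())
    return np.mean(epochs, axis=0)

# Synthetic check: a step increase in diameter right at a 5 s event
trace = np.zeros(600)
trace[300:] = 1.0
response = event_locked_dilation(trace, [5.0])
```

Baseline correction of this kind is what lets trial-by-trial dilation serve as a within-subject arousal readout, rather than being swamped by slow drifts in absolute pupil size.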

Conclusions
In this review, we have discussed the numerous factors that influence trial-by-trial variability in speech perception and learning tasks. Studies that employ invasive electrophysiological techniques or combine non-invasive imaging with advanced computational methods are rapidly generating new insights into the neural foundations of perception and non-native speech sound learning. Together, these recent advances have begun to explain the sources of neural variability underlying behavioral differences between individuals. Furthermore, these techniques make it possible to investigate the time course of perception and learning within a single individual, generating novel insights regarding how the dynamics of neural activity reflect stimulus-related (e.g., context) or task-related (e.g., attention) variability.
However, it is clear that a substantial portion of this variability has not been explained by traditionally studied factors like top-down cognitive modulation. Classic examples of multistable perception such as phoneme restoration clearly demonstrate that variability exists that is both stimulus- and task-independent. Given the challenges listeners and learners face in real-world scenarios, understanding as much of this variability as possible may be crucial to developing an accurate model of perception, and for creating translational and pedagogical tools for improving outcomes in fields such as second language acquisition.
We propose that a more complete model of speech perception and learning should consider the role of subcortically-mediated arousal states. Emerging research is demonstrating how rapid fluctuations in arousal state can affect perceptual outcomes as well as related behavior. To determine how arousal states fit into a more complete model of perception, it is crucial that we are able not only to track correlations between arousal and perception, but also to manipulate arousal states in order to identify causal links between the two. In this regard, taVNS constitutes a novel, promising tool for studying the brain systems that underlie these mechanisms. With such tools, it is increasingly feasible to integrate long-overlooked systems into our thinking about complex behaviors like speech perception and non-native sound learning. We are excited for the next several years of research on this topic and are optimistic that this work will contribute to major advances in our understanding.

Conflicts of Interest:
The authors declare no conflict of interest.