When words appear in sentences—as opposed to in isolation—their occurrence is evidently affected by syntactic, semantic and other factors. Research within psycholinguistics over the past half-century has exposed the role of some of these sentence-level factors in accounting for eye movements.
Clifton et al. (2007) provides a review of this work, and calls for the development of explicit theories that combine word-level and sentence-level factors. Of course, such combined models would be unnecessary if it turned out that sentence-level factors actually have very little effect on eye movements. These sorts of factors do not figure in current models of eye-movement control such as E-Z Reader (Pollatsek, Reichle, & Rayner, 2006) and SWIFT (Engbert, Nuthmann, Richter, & Kliegl, 2005), whose difficulty predictions derive primarily from statistical properties of individual words and their immediate neighbors.
In this paper, we cast doubt on this simpler view by exhibiting a quantitative model that takes into account both word-level and sentence-level factors in explaining eye fixation durations and regression probabilities. We show that the surprisal value of a word, on a grammar-based parsing model, is an important predictor of processing difficulty independent of factors such as word length, frequency, and empirical predictability. This result harmonizes with the rise of probabilistic theories in psycholinguistics defined over grammatical representations such as constituents and dependency relations (Jurafsky, 1996; Crocker & Brants, 2000; Keller, 2003). In addition to demonstrating the effect of surprisal on eye-movement measures, we also show that surprisal has only a small, statistically non-significant effect on empirical predictability.
The paper is organized into three sections. The first section explains the concept of surprisal, summarizing the Hale (2001) formulation. The second section marshals several predictors—surprisal, word length, unigram frequency, bigram frequency (transitional probability in the sense of McDonald & Shillcock, 2003) and empirical predictability values—in a quantitative model of fixation durations and regression probabilities. We fit this model to the measurements recorded in the Potsdam Sentence Corpus (Kliegl, Nuthmann, & Engbert, 2006), making it possible to determine which predictors account for readers’ fixation durations and regressive eye movements. The last section discusses implications of this fitted model for various linking hypotheses between eye movement measures and parsing theories. This final section also discusses the implications of the results for E-Z Reader (Pollatsek et al., 2006) and SWIFT (Engbert et al., 2005).
Surprisal
Surprisal is a human sentence processing complexity metric; it offers a theoretical reason why a particular word should be easier or more difficult to comprehend at a given point in a sentence. Although various complexity metrics have been proposed over the years (Miller & Chomsky, 1963; Kaplan, 1972; Gibson, 1991; Stabler, 1994; Morrill, 2000; Rohde, 2002; Hale, 2006), surprisal has lately come to prominence within the field of human sentence processing (Park & Brew, 2006; Levy, in press; Demberg & Keller, 2008). This renewal of interest coincides with a growing consensus in that field that both absolute and graded grammatical factors should figure in an adequate theory. Surprisal combines both sorts of considerations.
This combination is made possible by the assumption of a probabilistic grammar. Surprisal presupposes that sentence-comprehenders know a grammar describing the structure of the word-sequences they hear. This grammar not only says which words can combine with which other words but also assigns a probability to all well-formed combinations. Such a probabilistic grammar assigns exactly one structure to unambiguous sentences. But even before the final word, one can use the grammar to answer the question: what structures are compatible with the words that have been heard so far? This set of structures may contract more or less radically as a comprehender makes their way through a sentence.
The idea of surprisal is to model processing difficulty as a logarithmic function of the probability mass eliminated by the most recently added word. This number is a measure of the information value of the word just seen as rated by the grammar’s probability model; it is nonnegative and unbounded. More formally, define the prefix probability of an initial substring w = w1 · · · wn to be the total probability of all grammatical analyses that derive w as a left-prefix (definition 1). (In this definition, G is a probabilistic grammar; the only restriction on G is that it provide a set of derivations, D(G, w), that assign a probability to particular strings. When D(G, u) = ∅ we say that G does not derive the string u. The expression D(G, wv) denotes the set of derivations on G that derive w as the initial part of a larger string, the rest of which is v. See Jurafsky and Martin (2000), Manning and Schütze (2000) or Charniak (1993) for more details on probabilistic grammars.)

(1)   prefix-probability(w1 · · · wn) = Σ Prob(d), summing over all derivations d ∈ D(G, w1 · · · wn v)

Where the grammar G and prefix string w (but not w’s length, n) are understood, this quantity is abbreviated by the forward probability symbol αn. (Computational linguists typically define a state-dependent forward probability αn(q) that depends on the particular destination state q at position n. These values are indicated in red inside the circles in Figure 3(a). It is natural to extend this definition to state sets by summing the state-dependent α values for all members. To define the surprisal of a left-contextualized word on a grammar, the summation ranges over all grammatically-licensed parser states at that word’s position. The notation αn (without any parenthesized q argument) denotes this aggregate quantity.)

The surprisal of the nth word is then the log-ratio of the prefix probability before seeing the word to the prefix probability after seeing it (definition 2). As the logarithm of a probability ratio, this quantity is measured in bits.

(2)   surprisal(n) = log2(αn−1 / αn)
Consider some consequences of this definition. Using a law of logarithms, one could rewrite definition 2 as the difference log2 αn−1 − log2 αn. On a well-defined probabilistic grammar, the prefix probabilities α are always less than one and strictly nonincreasing from left to right; both logarithms are therefore negative, but the later, smaller one is subtracted from the earlier one, so the difference is nonnegative. For instance, if a given word brings the prefix probability down from 0.6 to 0.01, the surprisal value is log2(0.6/0.01) ≈ 5.91 bits.
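As a quick check of this arithmetic, here is a minimal sketch (ours, not part of the original study) that applies definition 2 directly:

```python
import math

# Definition 2: surprisal is the log-ratio of successive prefix probabilities.
# The numbers below are the illustrative values from the text (0.6 -> 0.01).
alpha_prev = 0.6    # prefix probability before seeing the word
alpha_curr = 0.01   # prefix probability after seeing the word

surprisal_bits = math.log2(alpha_prev / alpha_curr)
print(round(surprisal_bits, 2))  # ≈ 5.91 bits
```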
Intuitively, surprisal increases when a parser is required to build some low-probability structure. The key insight is that the relevant structure’s size need not be fixed in advance, as it is with Markov models. Rather, appropriate probabilistic grammars can provide a larger domain of locality. This paper considers two probabilistic grammars, one based on hierarchical phrase-structure (the probabilistic context-free phrase-structure grammars were unlexicalized; see Stolcke (1995) for more information on the methods used in this work. For this purpose, we adapted Levy’s implementation of the Stolcke parser, available from http://idiom.ucsd.edu/∼rlevy/prefixprobabilityparser.html) and another based on word-to-word dependencies. These two grammar types were chosen to illustrate surprisal’s compatibility with different grammar formalisms. Since the phrase-structure approach has already been presented in Hale (2001), the next two sub-sections elaborate the dependency grammar approach.
Estimating the Parser’s Probability Model
Consider the German sentence in Example 3.
- (3)
Der alte Kapitän goss stets ein wenig
the old captain poured always a little
Rum in seinen Tee
rum in his tea
“The old captain always poured a little rum in his tea”
A probabilistic dependency parser can proceed through this sentence from left to right, connecting words that stand in probable head-dependent relationships (Nivre, 2006). In this paper, parser-action probabilities are estimated from the union of two German newspaper corpora, NEGRA (Skut, Krenn, Brants, & Uszkoreit, 1997) and TIGER (König & Lezius, 2003), as in Figure 1.

Figure 1 defines the method of estimating the parser probabilities from the corpus data. A simulation of the parser is run on the training data, yielding a series of parser states and transitions for all sentences in the corpora. This information instantiates several features (Hall, 2007), which are then used to condition the probability of each transition. A Maximum Entropy model (Charniak & Johnson, 2005) was used to weight each feature for better accuracy.
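To make the estimation step concrete, the following is a minimal sketch (not the authors’ implementation) of maximum-entropy estimation of transition probabilities; the feature names and toy training data are hypothetical:

```python
# A sketch of estimating parser-action probabilities with a maximum-entropy
# classifier, i.e. multinomial logistic regression. In the paper, the states
# and transitions come from a simulation of the Nivre (2006) parser over the
# NEGRA and TIGER corpora; here they are made up for illustration.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# One feature dictionary per observed parser state, paired with the transition
# that the treebank-guided simulation took in that state.
train_states = [
    {"stack_top_pos": "ART", "buffer_front_pos": "ADJA"},
    {"stack_top_pos": "ADJA", "buffer_front_pos": "NN"},
    {"stack_top_pos": "NN", "buffer_front_pos": "VVFIN"},
]
train_actions = ["shift", "left-arc", "left-arc"]

vec = DictVectorizer()
X = vec.fit_transform(train_states)

maxent = LogisticRegression(max_iter=1000)  # MaxEnt = multinomial logistic regression
maxent.fit(X, train_actions)

def transition_probabilities(state_features):
    """Conditional distribution over parser actions given the current state."""
    probs = maxent.predict_proba(vec.transform([state_features]))[0]
    return dict(zip(maxent.classes_, probs))

print(transition_probabilities({"stack_top_pos": "ART", "buffer_front_pos": "ADJA"}))
```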
Estimating Surprisal
The prefix probability (definition 1) may be approximated, to a degree of accuracy controlled by k, by summing the probabilities of the top k most probable analyses defined by the dependency parser. Surprisals can then be computed by applying definition 2, following Boston and Hale (2007).
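The following sketch (ours, with made-up probabilities) shows this k-best approximation: the prefix probability at each word is approximated by the summed probability of the k most probable analyses, and definition 2 is applied to successive approximations:

```python
import math

def prefix_probability(analysis_probs, k=3):
    """Approximate definition 1: sum the probabilities of the k best analyses."""
    return sum(sorted(analysis_probs, reverse=True)[:k])

def surprisal(prev_probs, curr_probs, k=3):
    """Definition 2: log-ratio of successive prefix probabilities, in bits."""
    return math.log2(prefix_probability(prev_probs, k) /
                     prefix_probability(curr_probs, k))

# Hypothetical analysis probabilities before and after reading a word.
before = [1.0]                 # a single analysis covers the prefix so far
after = [0.40, 0.10, 0.05]     # the three best analyses after the new word
print(round(surprisal(before, after), 2))  # ≈ 0.86 bits for these made-up numbers
```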
Figure 2 shows the surprisals associated with just two of the words in Example 3. Figure 2 also depicts the dependency relations for this sentence, as annotated in the Potsdam Sentence Corpus. (The labels in the second line (e.g., VVFIN) symbolize the grammatical category for each word as described in the NEGRA annotation manual (Skut et al., 1997). We are presuming a tagger that accomplishes this task; see Chapter 10 of Manning and Schütze (2000).) Following Tesnière (1959) and Hayes (1964), the word at the arrow head is identified as the ‘dependent’ and the other as the ‘head’ or ‘governor’. The associated part-of-speech tag is written below each word; this figures into the surprisal calculation via the parser’s probability model. The thermometers indicate surprisal magnitudes; at alte, 0.74 bits amounts to very little surprise. In TIGER and NEGRA newspaper text, it is quite typical to see an adjective (ADJA) following an article (ART) unconnected by any dependency relation. By contrast, the preposition in is much more unexpected: its surprisal value is 23.83 bits.
The surprisal values are the result of a calculation that makes crucial reference to instantaneous descriptions of the incremental parser. Figure 3(a) schematically depicts this calculation. At the beginning of Example 3, the parser has seen der, but the prefix probability is still 1.0, reflecting the overwhelming likelihood that a sentence begins with an article. Upon hearing the second word alte, the top k = 3 destination states are, for example, q8, q17 and q26 (the state labels are arbitrary). Figure 3(b) reads off the grammatical significance of these alternative destinations: either alte becomes a dependent of der, or der becomes a dependent of alte, or no dependency is predicated. Each transition from state q1 to states q8, q17 and q26 has a corpus-estimated probability denoted by the values above the arc (e.g., the transition probability to q8 is 0.3). Approximating definition 1, we find that the total probability of all state trajectories (this work takes the Nivre (2006) transition system to be sound and complete with respect to a probabilistic dependency grammar that could, in principle, be written down) arriving in one of those top 3 states is 0.6, and thus the surprisal at alte is 0.740 bits.
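Spelled out, and assuming for illustration that the three transition probabilities in Figure 3(a) sum to 0.6 (the text only quotes the value 0.3 for q8), the calculation behind this number is:

```latex
\alpha_1 = 1.0, \qquad
\alpha_2 \;\approx\; \alpha_2(q_8) + \alpha_2(q_{17}) + \alpha_2(q_{26}) \;=\; 0.6, \qquad
\mathrm{surprisal}(\textit{alte}) \;=\; \log_2 \frac{\alpha_1}{\alpha_2}
  \;=\; \log_2 \frac{1.0}{0.6} \;\approx\; 0.74 \text{ bits}.
```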
When the parser arrives at in, the prefix probability has made its way down to 6.9 × 10⁻⁶³. Such minuscule probabilities are not uncommon in broad-coverage modeling. What matters for the surprisal calculation is not the absolute value of the prefix probability, but rather the ratio between the old prefix probability and the new one. A high αn−1/αn ratio means that structural alternatives have been reduced in probability, or even completely ruled out, since the last word.
For instance, the action that attaches the preposition in to its governing verb goss is assigned a probability of just over one-third. That action in this left-context leads to the successor state q88 with the highest forward probability (indicated inside the circles in red). Metaphorically, the preposition tempers the parser’s belief that goss has only a single dependent. Of course, k-best parsing considers other alternatives, such as state q96 in which no attachment is made, in anticipation that some future word will attach in as a left-dependent. However, these alternative actions are all dominated by the one that sets up the correct goss–in dependency. This relationship would be ignored in a 3-gram model because it spans four words. By contrast, this attachment is available to the Nivre (2006) transition system because of its stack-structured memory. In fact, attachments to stets, ‘always’, ein, ‘a’, and wenig, ‘little’, are all excluded from consideration because the parser is projective, i.e., it does not allow crossing dependencies (Kahane, Nasr, & Rambow, 1998; Buch-Kromann, 2007).
The essence of the explanation is that difficult words force transitions through state-sets whose forward probability is much smaller than at the last word. This explanation is interpretable in light of the linguistic claims made by the parser. However, the explanation is also a numerical one that can be viewed as just another kind of predictor. The next section applies this perspective to modeling observed fixation durations and regression frequencies.
Predicting eye movements: The role of surprisal
Having sketched a particular formalization of sentence-level syntactic factors in the previous section, this section takes up several other factors (Table 1) that figure in models of eye-movement control. Two subsections report answers to two distinct but related questions. The first question is: can surprisal stand in, perhaps only partly, for empirical predictability? If empirical predictability could be approximated by surprisal, this would save eye-movement researchers a great deal of effort; there would no longer be a need to engage in the time-consuming process of gathering predictability scores. Unfortunately, the answer to this first question is negative: including surprisal in a model that already contains word-level factors such as length and bigram frequency does not allow it to do significantly better at predicting empirical predictability scores in the Cloze-type data we considered.
The second question pertains to eye-movement data. The second subsection proceeds by defining a variety of dependent measures commonly used in eye movement research. Then it takes up the question, does adding surprisal as an explanatory factor result in a better statistical model of eye-movement data? The answer here is affirmative for a variety of fixation duration measures as well as regression likelihoods.
Does surprisal approximate empirical predictability?
The Potsdam Sentence Corpus (PSC) consists of 144 German sentences overlaid with a variety of related information (Kliegl, Nuthmann, & Engbert, 2006). One kind of information comes from a predictability study in which native speakers were asked to guess a word given its left-context in the PSC (Kliegl et al., 2004). The probability of correctly guessing the word was estimated from the responses of 272 participants. This diverse pool included high school students, university students, and adults as old as 80 years. As a result of this study, every PSC word—except the first word of each sentence, which has no left context—has associated with it an empirical word-predictability value that ranges from 0 to 1 with a mean (standard deviation) of 0.20 (0.28). These predictability values were submitted to a logit transformation in order to correct for the dependency between mean probabilities and the associated standard deviations; see Kliegl et al. (2004) for details.
The Deviance Information Criterion or DIC (Spiegelhalter, Best, Carlin, & Linde, 2002; Spiegelhalter, 2006; Gelman & Hill, 2007, pp. 524-527) was used to compare the relative quality of fit between models. The DIC depends on the summary measure of fit, deviance, defined as d = −2 × log-likelihood. Adding a new predictor that represents noise is expected to reduce deviance by 1; more generally, adding k noise predictors will reduce deviance by an amount corresponding to the χ² distribution with k degrees of freedom. DIC is the sum of mean deviance and 2 × the effective number of parameters; mean deviance is the average of the deviance over all simulated parameter vectors, and the effective number of parameters depends on the amount of pooling in the mixed-effects model. Thus, in mixed-effects models DIC plays the role of the Akaike Information Criterion (Akaike, 1973; Wagenmakers & Farrell, 2004), for which the number of estimated parameters can be determined exactly.
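Written out schematically, following the description above (our notation, not a formula taken from the cited sources):

```latex
d(\theta) \;=\; -2\,\log L(\theta), \qquad
\bar{d} \;=\; \text{mean of } d(\theta) \text{ over the simulated parameter vectors}, \qquad
\mathrm{DIC} \;=\; \bar{d} \;+\; 2\,p_{\mathrm{eff}},
```

where p_eff denotes the effective number of parameters.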
In the linear mixed-effects models, neither version of surprisal showed a statistically significant effect. (An absolute t-value of 2 or greater indicates statistical significance at α = 0.05. The t-values in mixed-effects models are only approximations because determining the exact degrees of freedom is non-trivial; Gelman & Hill, 2007.) However, the sign of the coefficient was negative for both variants of surprisal and DIC values were lower when surprisal was added as a predictor. This is as expected: more surprising words are harder to predict. The DIC was 2229 for the simpler model, versus 2220 for each of the two more complex models.
Table 2 summarizes the models including surprisal as a predictor.
In sum, the analyses show that surprisal scores exhibit rather weak relations with empirical predictability scores; indeed, these relations are much weaker than those of unigram frequency, word length, and corpus-based bigram frequency. Given the reduction in DIC values, however, including surprisal as part of an explanation for empirical word predictability appears to be motivated. This finding is consistent with the intuition that predictability subsumes syntactic parsing cost, among other factors, although surprisal is clearly not the dominant predictor.
The relation between surprisal and empirical word predictability, though weak, nevertheless raises the possibility that surprisal scores may account for variance in fixation durations independent of the variance accounted for by empirical predictability. We investigate this question next using eye movement data from the Potsdam Sentence Corpus.
Does surprisal predict eye movements?
Surprisal formalizes a notion of parsing cost that appears to be distinct from any similar cost that may be subsumed in empirical predictability protocols. It may thus provide a way to account for eye movement data by bringing in a delimited class of linguistic factors that are not captured by conscious reflection about upcoming words.
To investigate this question empirically, we chose several of the dependent eye movement measures in common use (Table 3 and Table 4). A distinct class of “first pass” measures reflects the first left-to-right sweep of the eye over the sentence. A second distinction relates to “early” and “late” measures. A widely accepted belief is that the former but not the latter reflect processes that begin when a word is accessed from memory (Clifton et al., 2007, p. 349). Although these definitions are fairly standard in the literature, controversy remains about the precise cognitive process responsible for a particular dependent measure.
In general, human comprehenders tend to read more slowly under conditions of cognitive duress. For instance, readers make regressive eye movements more often and go more slowly during the disambiguating region of syntactically-ambiguous sentences (Frazier & Rayner, 1982). They also slow down when a phrase must be ‘integrated’ as the argument of a verb that does not ordinarily take that kind of complement, e.g., “eat justice” provokes a slowdown compared to “eat pizza.”
The surprisal complexity metric, if successful in accounting for eye movement data, would fit into the gap between these sorts of heuristic claims and measurable empirical data, alongside computational accounts such as Green and Mitchell (2006), Vasishth and Lewis (2006), Lewis et al. (2006) and Vasishth et al. (in press). We used the dependent measures in Table 3 and Table 4 to fit separate linear mixed-effects models that take into account the candidate predictors introduced in the last section: the n-gram factors, word length, and empirical predictability. For the analysis of regression probabilities (coded as a binary response for each word: 1 signified that a regression occurred at a word, and 0 that it did not), we used a generalized linear mixed-effects model with a binomial link function (Bates & Sarkar, 2007; Gelman & Hill, 2007). Sentences and participants were treated as partially crossed random factors; that is, we estimated the variances associated with differences between participants and differences between sentences, in addition to the residual variance of the dependent measures. Then we compared the Deviance Information Criterion values of these simpler models with those of more complex models that had one additional predictor: either surprisal based on the dependency grammar, or surprisal based on the phrase-structure grammar.
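In schematic form (our notation; the predictor set and random-effects structure follow the description above, and the surprisal term is present only in the more complex models), the model fitted for a reading-time measure y of participant i on word w of sentence j is roughly:

```latex
\log y_{ijw} \;=\; \beta_0
  + \beta_1\,\mathrm{length}^{-1}_{w}
  + \beta_2\,\mathrm{unigram}_{w}
  + \beta_3\,\mathrm{bigram}_{w}
  + \beta_4\,\mathrm{predictability}_{w}
  + \beta_5\,\mathrm{surprisal}_{w}
  + u_i + v_j + \varepsilon_{ijw},
\qquad u_i \sim N(0, \sigma^2_{\mathrm{subj}}),\;
       v_j \sim N(0, \sigma^2_{\mathrm{sent}}),\;
       \varepsilon_{ijw} \sim N(0, \sigma^2).
```

For regression probabilities, the left-hand side is replaced by the log-odds of a regression at the word and the residual term is dropped, yielding the binomial generalized linear mixed model described above.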
The calculation of the dependent measures was carried out using the em package developed by Logačev and Vasishth (2006). For first-fixation durations, only values that were not identical to single-fixation durations were analyzed. In each reading-time analysis reported below, reading times below 50 ms were removed and the dependent measures were log transformed. All predictors were centered in order to make the intercepts of the statistical models easier to interpret.
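A minimal sketch of this preprocessing (hypothetical column names and values; the actual dependent measures were computed with the em package):

```python
import numpy as np
import pandas as pd

# Hypothetical word-level data frame with one row per word and measure.
df = pd.DataFrame({
    "rt": [180.0, 45.0, 230.0, 310.0],           # a reading-time measure in ms
    "surprisal": [0.74, 5.2, 12.1, 23.83],        # bits, from the parser
    "log_unigram": [-3.1, -6.4, -2.2, -8.0],
    "logit_predictability": [0.8, -1.2, 0.1, -2.5],
})

# Remove implausibly short reading times (< 50 ms) and log-transform the rest.
df = df[df["rt"] >= 50].copy()
df["log_rt"] = np.log(df["rt"])

# Center every predictor so that the model intercepts are easier to interpret.
for col in ["surprisal", "log_unigram", "logit_predictability"]:
    df[col + "_c"] = df[col] - df[col].mean()

print(df.head())
```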
Results
The main results of this paper are summarized in Table 5, Table 6, Table 7 and Table 8. In the multiple regression tables (Table 6, Table 7 and Table 8), a predictor is statistically significant if the absolute t-value is greater than two (p-values are not shown for the reading-time dependent measures because in linear mixed-effects models the degrees of freedom are difficult to estimate; Gelman & Hill, 2007).
In order to facilitate comprehension, the multiple regression tables (Table 6, Table 7 and Table 8) are summarized in a more compact form in Figure 4 and Figure 5. The graphical summary has the advantage that it is possible, at a glance, to see the consistency in the signs of the coefficients across different measures; the tables will not yield this information without a struggle. The figures are interpreted as follows. The error bars signify 95% confidence intervals for the coefficient estimates; consequently, if an error bar does not cross the zero line, the corresponding coefficient is statistically significant. This visual test is equivalent to checking whether the absolute t-value exceeds two.
In general, both early and late fixation-duration-based dependent measures exhibited clear effects of unigram frequency, bigram frequency, and logit predictability after statistically controlling for the remaining predictors (Figure 4 and Figure 5). One exception was first-fixation duration (which excludes durations that were also single-fixation durations); here, the effects of predictability and of the reciprocal of length were not significant.
These simpler models were augmented with one of two surprisal factors, one based on the dependency grammar, the other based on the phrase-structure grammar. As summarized in Table 5, for virtually every dependent measure the predictive error (DIC value) was lower in the more complex model that included surprisal. One exception was regression probability, for which the phrase-structure-based surprisal did not reduce the DIC.
For fixation durations (Table 6, Table 7, Figure 4 and Figure 5), in general both versions of surprisal had a significant effect in the predicted direction (that is, longer durations for higher surprisal values). One exception was the effect of phrase-structure-based surprisal on rereading time; here, reading time was longer for lower surprisal values. However, since the rereading-time data are sparse (about one tenth of the other measures; the sparseness is also reflected in the relatively wide confidence intervals for the rereading-time coefficient estimates), it may be difficult to interpret this result, especially given the consistently positive coefficients for surprisal in all other dependent measures.
For regression probabilities (Table 8), dependency-grammar-based surprisal had a significant effect over and above the other predictors: an increase in surprisal predicts a greater likelihood of a regression. Phrase-structure-based surprisal is not a significant predictor of regression probability, but the sign of its coefficient points in the same direction as in the dependency-based model.
Discussion
The work presented in this paper showed that surprisal values calculated with a dependency grammar as well as with a phrase-structure grammar are significant predictors of reading times and regressions. The role of these surprisals as predictors was still significant even when empirical word predictability, n-gram frequency and word length were also taken into account. On the other hand, surprisal did not appear to have a significant effect on empirical predictability as computed in eye-movement research.
The high-level factor, surprisal, appears in both the so-called early and late measures, with comparable magnitudes of the coefficients for surprisal. This finding is thus hard to reconcile with a simple identification of early measures with syntactic parsing costs and late measures with durations of post-syntactic events. It may be that late measures include the time-costs of syntactic processes initiated much earlier.
The early effects of parsing costs are of high relevance for the further development of eye-movement control models such as E-Z Reader (Pollatsek et al., 2006) and SWIFT (Engbert et al., 2005). In these models, fixation durations at a word are a function of word-identification difficulty, which in turn is assumed to depend on word-level variables such as frequency, length and predictability. Although these variables can account for a large proportion of the variance in fixation durations and other measures, we have shown that surprisal plays an important role as well. Of these three predictors, empirical predictability is an “expensive” input variable because it needs to be determined in an independent norming study and applies only to the sentences used in this study. This fact greatly limits the simulation of eye movements collected on new sentences. It had been our hope that surprisal measures (which can also be computed from available treebanks) could be used as a generally available substitute for empirical predictability. Our results did not match these expectations for the two types of surprisal scores examined here. Nevertheless, given the computational availability of surprisal values, surprisal is clearly a candidate for inclusion as a fourth input variable in future versions of computational models. As Clifton et al. (2007) note, no model of eye-movement control currently takes factors such as syntactic parsing cost and semantic processing difficulty into account. While some of this variance is probably captured indirectly by empirical predictability, the contribution of this paper is to demonstrate how syntactic parsing costs can be estimated using probabilistic knowledge of grammar.