What Limits Our Capacity to Process Nested Long-Range Dependencies in Sentence Comprehension?
1. What Are the Computational Mechanisms that Explain Our Limited Ability to Process Specific Sentences?
- “The dog that the cat chased ran away.”
- “The dog that the cat that the mouse bit chased ran away.”
- “The mouse bit the cat that chased the dog that ran away.”
2. Empirical Evidence for Processing Difficulties in Sentence Comprehension
3. Explanations for Syntactic Capacity Limitation by Cognitive Models
3.1. Memory-Based Theories
3.1.1. Dependency Locality Theory
3.1.2. ACT-R Based
3.2. Expectation-Based Theories
3.3. Symbolic Neural Architectures
4. Understanding Capacity Limitation in Light of Neural Language Models
4.1. NLMs Predict that Capacity Limitation Results from a Functional Specification of Dedicated Syntax and Number Units
4.1.1. A Sparse Neural Circuit for Long-Range Number Agreement
- Syntax units: units in the network whose activity is predictive of transient syntactic properties of the sentence. A particular set of syntax units was found to be predictive of the transient depth of the syntactic tree, which is an index of syntactic complexity. The activity of one of the syntax units was found to follow the structure of the main subject-verb dependency in various sentences (green curve in Figure 4. See also Figure 3 in Ref. ). Specifically, the activity of this syntax unit is positive throughout the main subject-verb dependency and changes sign only at the main verb. As discussed further below, this activity profile allows the grammatical number of the main subject to be carried across long distances.
- Long-range number units: units that can encode grammatical number (singular or plural) for long-range dependencies. Out of a total of 1300 units in the neural network, only two were identified as long-range number units: one unit for singular and the other for plural number. Long-range number units were shown to encode the grammatical number of the main subject (‘keys’ in “The keys to the cabinet are...”) and robustly store it up to the main verb (‘are’) across possible attractors (‘cabinet’). Figure 4A illustrates the activity profile of the singular (red) and plural (blue) long-range units during the processing of a subject-extracted relative clause whose main subject is singular. Note that while the plural unit is silent (i.e., its activity is around zero), the singular unit is active throughout the subject-verb dependency, and beyond the attractor. Figure 4B illustrates the ‘mirror’ case, in which the main subject of the sentence is plural. Importantly, ablation of either of the long-range number units was shown to reduce the performance of the network on the corresponding number-agreement task to chance level. In sum, the long-range units carry out the long-range number agreement task in the network.
- Short-range number units: units that encode the grammatical number of the last encountered noun. In contrast to long-range units, short-range number units can carry grammatical number only across dependencies that do not contain intervening nouns of the opposite number. Short-range number units were identified using decoding methods.
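The division of labor among the three unit types above can be illustrated with a toy, hand-wired sketch. This is not the trained network: the single signed memory cell (+1 for singular, -1 for plural), the binary gates, and the function names are all assumptions made for illustration. The 'syntax gate' plays the role of the syntax unit, opening the memory at the main subject and protecting it until the main verb, whereas the short-range unit simply overwrites its content at every noun.

```python
# Toy illustration only (hypothetical names, hand-set gates, not a trained LSTM).
# Input coding (an assumption): +1 = singular noun, -1 = plural noun, 0 = other word.

def run_long_range_unit(number_inputs, subject_index, verb_index):
    """Store the number of the main subject and hold it up to the main verb."""
    cell = 0.0
    trace = []
    for t, x in enumerate(number_inputs):
        # The syntax gate lets input in only at the main subject...
        input_gate = 1.0 if t == subject_index else 0.0
        # ...and keeps the stored value intact until (and including) the main verb.
        forget_gate = 1.0 if subject_index < t <= verb_index else 0.0
        cell = forget_gate * cell + input_gate * x
        trace.append(cell)
    return trace


def run_short_range_unit(number_inputs):
    """Always hold the number of the most recently encountered noun."""
    cell = 0.0
    trace = []
    for x in number_inputs:
        if x != 0:
            cell = x  # every noun overwrites the previous content
        trace.append(cell)
    return trace


# "The keys to the cabinet are": plural subject, singular attractor.
sentence = [0, -1, 0, 0, +1, 0]
long_trace = run_long_range_unit(sentence, subject_index=1, verb_index=5)
short_trace = run_short_range_unit(sentence)
print(long_trace[5])   # value available at the verb: plural, as required
print(short_trace[5])  # value available at the verb: singular, set by the attractor
```

In this sketch, setting the long-range cell to zero at the verb would leave no number information to read out, which is the toy analogue of the ablation result described above.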
4.1.2. Processing of Nested Long-Range Dependencies and Capacity Limitation in NLMs
4.1.3. Capacity Limitation also Emerges in the Case of Deeper Nesting
4.1.4. Varying Processing Difficulties of Sentences with Two Long-Range Dependencies
4.2. Neural Language Models Make Precise Predictions on How Humans Process Sentences
- Behavioral predictions:
- Embedded dependencies are more error prone: the main, ‘outer’, dependency is processed by the long-range mechanism, which was shown to be robust, protecting the carried grammatical number from intervening information. Embedded dependencies, however, are more susceptible to interference and processing failures. Indeed, we found that both humans and neural language models make more number-agreement errors on the embedded verb of center-embedded clauses than on the main one [41]. In this study, humans and neural networks were tested on both simple object-extracted relative clauses (1) and object-extracted relative clauses in which the inner dependency is also a long-range one (“The dog that the cats near the car chase runs away”). Both humans and neural networks make more agreement errors on the inner verb (‘chase’) than on the outer one (‘runs’).
- Agreement-error patterns show primacy and recency effects: in doubly center-embedded sentences, the main agreement can be robustly processed by the long-range mechanism and the innermost one by short-range units, so the middle long-range dependency is predicted to be the most error prone. In a recent study [42], we tested agreement errors of a variety of neural language models trained on deeply nested long-range dependencies generated from an artificial grammar. Neural language models were indeed found to make more number-agreement errors on middle dependencies than on the outermost and innermost ones, showing ‘recency and primacy effects’ in the error patterns. The model therefore predicts similar agreement-error patterns in human behavior. We note that this error pattern is consistent with ‘structural forgetting’, a phenomenon reported in humans, e.g., [45], in which English speakers tend to judge ungrammatical sentences with double center embedding and a missing verb as grammatically correct (e.g., “The patient who the nurse who the clinic had hired met Jack”). Importantly, structural forgetting occurs with sentences in which the middle verb is omitted.
- Neural predictions:
- Long- and short-range units reside in different cortical regions: the two types of units that emerged in the models during training suggest that a similar division may be observed in cortical processing. In particular, Dehaene et al. [46] have suggested a hierarchy of cortical processing in language, from low-level transition regularities to high-level structured patterns across the cortex. In accordance with this view, long-range units are predicted to be found in higher-level cortical regions, and short-range units in lower-level ones, or in both (in NLMs trained on natural-language data, long-range number units tend to emerge in the highest layer of the network [26,41], whereas short-range units can be found across several layers). Since long-range units are predicted to be sparse, they might be found in highly localized regions of the brain. Note, however, that the small number of syntax and number units that emerged in the network is not a realistic estimate of the number of corresponding neurons in the brain unless several corrections are taken into consideration. First, the total number of units in the NLM is several orders of magnitude smaller than that in the brain. Second, the NLM is commonly considered a ‘rate model’. Consequently, the activity of a single unit in the model in response to a feature would map to a large number of spiking neurons in the brain, all responsive to the same feature. Taken together, a single unit in the NLM could therefore correspond to possibly more than neurons in the brain. The ‘sparsity’ of the mechanism should therefore not be construed as an extreme localist, ‘grandmother cell’ view, e.g., [47].
- Specific syntactic cortical regions show persistent activity throughout long-range dependencies: the activity pattern of the syntax unit during sentence processing suggests similar dynamics in cortical regions related to syntactic processing. Specifically, activity of certain syntactic regions is predicted to persist throughout a long-range dependency, in order to gate feature-information storage (grammatical number, gender, etc.) in other regions.
- Specific syntactic cortical regions project onto long- but not short-range units: while the activity of long-range units was found to follow the structure of the long-range dependencies, as conveyed by the syntax unit, the activity of the short-range units is not structure sensitive. Neural activity related to syntactic processing of long-range dependencies, presumably in syntactic brain regions, is predicted to drive neural activity related to long-range encoding of grammatical number, but not that related to short-range encoding.
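The behavioral predictions above are stated in terms of agreement-error rates at each dependency (outer, middle, inner). A minimal sketch of how such rates could be tallied is given below, assuming each trial supplies the probability a model (or a group of participants) assigns to the correct and to the incorrect verb form. The function name, the trial format, and the example numbers are hypothetical; the numbers are made up for illustration and are not results from any study.

```python
from collections import defaultdict


def agreement_error_rates(trials):
    """Return the fraction of agreement errors per dependency position.

    `trials` is an iterable of (position, p_correct, p_wrong) tuples, where
    p_correct and p_wrong are the probabilities assigned to the verb form that
    agrees, or fails to agree, with its subject. A trial counts as an error
    when the wrong form is at least as probable as the correct one.
    """
    errors = defaultdict(int)
    counts = defaultdict(int)
    for position, p_correct, p_wrong in trials:
        counts[position] += 1
        if p_wrong >= p_correct:
            errors[position] += 1
    return {position: errors[position] / counts[position] for position in counts}


# Made-up trials for a doubly center-embedded sentence (illustration only).
example_trials = [
    ("outer", 0.9, 0.1), ("outer", 0.8, 0.2),
    ("middle", 0.4, 0.6), ("middle", 0.7, 0.3),
    ("inner", 0.9, 0.1), ("inner", 0.6, 0.4),
]
rates = agreement_error_rates(example_trials)
print(rates)
```

A higher rate for the middle position than for the outer and inner ones would correspond to the primacy and recency pattern described in the text.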
5. Remaining Challenges
- Language acquisition: artificial neural networks fail to capture the speed and ease of language acquisition in human children. Children are known to acquire language from a relatively small number of stimuli compared to the complexity of the task of inferring underlying regularities in natural data [48]. In contrast, current state-of-the-art neural language models require large amounts of training data, which need to be presented to the model several times. Finding the structural and learning biases required to reduce this data thirst to a level similar to that of humans remains one of the major challenges in the field.
- Compositionality: neural language models fail to achieve systematic compositionality [49]. Human language is characterized by systematic compositionality: people can understand and produce a potentially infinite number of novel combinations from a finite set of known elements [50,51]. For example, once English speakers learn a new verb, such as “dax”, they can immediately understand and produce sentences like “dax again” or “dax twice”. This algebraic capacity, famously argued to be a property of symbolic systems but not of neural networks [52,53], remains a challenge for modern neural language models as well. Consequently, current neural language models are expected to be strongly limited in their ability to account for brain responses to new words.
- Fit to brain data: neural language models fail to achieve a good fit to neuroimaging and electrophysiological data (e.g., fMRI, MEG, intracranial recordings). Both language models and neural language models have been used in recent years to model brain data acquired from subjects engaged in naturalistic tasks, such as active listening to stories, e.g., [54,55,56,57]. Typically, activations of language models during sentence processing are extracted and used as predictors in linear regression models, which are then fit to the brain data. However, despite preliminary success, current models still explain only a relatively small portion of the variance in brain data. This suggests that additional, and possibly critical, processes beyond those of artificial neural networks are engaged in the human brain when it processes sentences.
- Biological plausibility: although inspired by findings in neuroscience, various aspects of the dynamics and learning in common neural language models cannot be directly mapped onto biologically plausible mechanisms in the brain. For example, while learning and plasticity in the brain are known to occur locally [58,59], the back-propagation algorithm used to train neural networks [60] relies on error propagation across distant parts of the network. On the other hand, neural-network models that are more faithful to brain dynamics are still hard to tame, e.g., [61,62], and can simulate only limited aspects of linguistic phenomena. Constructing a neural language model that is both biologically plausible and achieves high performance on linguistic tasks remains another challenge in the field.
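The 'fit to brain data' procedure described above (model activations used as predictors in a linear regression, evaluated by the variance explained) can be sketched in a few lines. This is a deliberately reduced version under stated assumptions: a single predictor, ordinary least squares, and in-sample evaluation, with hypothetical function names. Studies of this kind typically use many predictors, regularized regression, and held-out data.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor: y ~ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def explained_variance(x, y, slope, intercept):
    """Coefficient of determination (R^2) of the fitted line on (x, y)."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical values: x = activation of one model unit per word,
# y = a brain signal measured at the same words (made up for illustration).
unit_activation = [0.0, 1.0, 2.0, 3.0]
brain_signal = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(unit_activation, brain_signal)
r_squared = explained_variance(unit_activation, brain_signal, slope, intercept)
print(slope, intercept, r_squared)
```

The toy data here are perfectly linear, so the fit is exact; the point made in the text is that, with real recordings, the corresponding value remains low.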
6. Summary and Conclusions
Conflicts of Interest
References
1. Chomsky, N.; Miller, G.A. Introduction to the Formal Analysis of Natural Languages. In Handbook of Mathematical Psychology; Duncan Luce, R., Bush, R.R., Galanter, E., Eds.; John Wiley & Sons: New York, NY, USA, 1963; pp. 269–321.
2. Karlsson, F. Constraints on multiple center-embedding of clauses. J. Linguist. 2007, 43, 365–392.
3. Grodner, D.; Gibson, E. Consequences of the serial nature of linguistic input for sentential complexity. Cogn. Sci. 2005, 29, 261–290.
4. Bock, K.; Miller, C. Broken agreement. Cogn. Psychol. 1991, 23, 45–93.
5. Franck, J.; Vigliocco, G.; Nicol, J. Subject-verb agreement errors in French and English: The role of syntactic hierarchy. Lang. Cogn. Process. 2002, 17, 371–404.
6. Franck, J.; Lassi, G.; Frauenfelder, U.H.; Rizzi, L. Agreement and movement: A syntactic analysis of attraction. Cognition 2006, 101, 173–216.
7. Franck, J.; Frauenfelder, U.H.; Rizzi, L. A syntactic analysis of interference in subject–verb agreement. In MIT Working Papers in Linguistics; University of Geneva: Geneva, Switzerland, 2007; pp. 173–190.
8. Friedmann, N.; Belletti, A.; Rizzi, L. Relativized relatives: Types of intervention in the acquisition of A-bar dependencies. Lingua 2009, 119, 67–88.
9. Kennedy, A.; Pynte, J. Parafoveal-on-foveal effects in normal reading. Vis. Res. 2005, 45, 153–168.
10. Kliegl, R.; Nuthmann, A.; Engbert, R. Tracking the mind during reading: The influence of past, present, and future words on fixation durations. J. Exp. Psychol. Gen. 2006, 135, 12.
11. Demberg, V.; Keller, F. Cognitive Models of Syntax and Sentence Processing. In Human Language: From Genes and Brains to Behavior; Hagoort, P., Ed.; The MIT Press: Cambridge, MA, USA, 2019; pp. 293–312.
12. Gibson, E. Linguistic complexity: Locality of syntactic dependencies. Cognition 1998, 68, 1–76.
13. Pearlmutter, N.J.; Gibson, E. Recency in verb phrase attachment. J. Exp. Psychol. Learn. Mem. Cogn. 2001, 27, 574.
14. Gibson, E. The dependency locality theory: A distance-based theory of linguistic complexity. Image Lang. Brain 2000, 2000, 95–126.
15. Anderson, J.R. The Architecture of Cognition; Psychology Press: London, UK, 2013.
16. Lewis, R.L.; Vasishth, S. An activation-based model of sentence processing as skilled memory retrieval. Cogn. Sci. 2005, 29, 375–419.
17. Chomsky, N. Barriers; MIT Press: Cambridge, MA, USA, 1986; Volume 13.
18. Hale, J. A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies; Association for Computational Linguistics: Stroudsburg, PA, USA, 2001; pp. 1–8.
19. Jurafsky, D. A probabilistic model of lexical and syntactic access and disambiguation. Cogn. Sci. 1996, 20, 137–194.
20. Levy, R. Expectation-based syntactic comprehension. Cognition 2008, 106, 1126–1177.
21. Smolensky, P. Tensor product variable binding and the representation of symbolic structures in connectionist networks. Artif. Intell. 1990, 46, 159–216.
22. Gayler, R.W. Vector symbolic architectures answer Jackendoff’s challenges for cognitive neuroscience. arXiv 2004, arXiv:cs/0412059.
23. Kanerva, P. Binary spatter-coding of ordered K-tuples. In International Conference on Artificial Neural Networks; Springer: Berlin/Heidelberg, Germany, 1996; pp. 869–873.
24. Plate, T.A. Holographic reduced representations. IEEE Trans. Neural Netw. 1995, 6, 623–641.
25. Christiansen, M.; Chater, N. Toward a connectionist model of recursion in human linguistic performance. Cogn. Sci. 1999, 23, 157–205.
26. Lakretz, Y.; Kruszewski, G.; Desbordes, T.; Hupkes, D.; Dehaene, S.; Baroni, M. The emergence of number and syntax units in LSTM language models. arXiv 2019, arXiv:1903.07435.
27. Nelson, M.; El Karoui, I.; Giber, K.; Yang, X.; Cohen, L.; Koopman, H.; Cash, S.; Naccache, L.; Hale, J.; Pallier, C.; et al. Neurophysiological dynamics of phrase-structure building during sentence processing. Proc. Natl. Acad. Sci. USA 2017, 114, E3669–E3678.
28. King, J.R.; Dehaene, S. Characterizing the dynamics of mental representations: The temporal generalization method. Trends Cogn. Sci. 2014, 18, 203–210.
29. Elman, J. Finding structure in time. Cogn. Sci. 1990, 14, 179–211.
30. Gulordava, K.; Bojanowski, P.; Grave, E.; Linzen, T.; Baroni, M. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1 June 2018; pp. 1195–1205.
31. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 1 June 2019; pp. 4171–4186.
32. Linzen, T.; Dupoux, E.; Goldberg, Y. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Trans. Assoc. Comput. Linguist. 2016, 4, 521–535.
33. Marvin, R.; Linzen, T. Targeted syntactic evaluation of language models. arXiv 2018, arXiv:1808.09031.
34. Futrell, R.; Wilcox, E.; Morita, T.; Levy, R. RNNs as psycholinguistic subjects: Syntactic state and grammatical dependency. arXiv 2018, arXiv:1809.01329.
35. Linzen, T.; Leonard, B. Distinct patterns of syntactic agreement errors in recurrent networks and humans. In Proceedings of the CogSci 2018—40th Annual Meeting of the Cognitive Science Society, Madison, WI, USA, 25–28 July 2018; pp. 692–697.
36. Goldberg, Y. Assessing BERT’s Syntactic Abilities. arXiv 2019, arXiv:1901.05287.
37. Belinkov, Y.; Glass, J. Analysis methods in neural language processing: A survey. Trans. Assoc. Comput. Linguist. 2019, 7, 49–72.
38. Hupkes, D.; Veldhoen, S.; Zuidema, W. Visualisation and ‘diagnostic classifiers’ reveal how recurrent and recursive neural networks process hierarchical structure. J. Artif. Intell. Res. 2018, 61, 907–926.
39. Giulianelli, M.; Harding, J.; Mohnert, F.; Hupkes, D.; Zuidema, W. Under the hood: Using diagnostic classifiers to investigate and improve how language models track agreement information. In Proceedings of the EMNLP BlackboxNLP Workshop 2018, Brussels, Belgium, 1 November 2018; pp. 240–248.
40. Traxler, M.J.; Morris, R.K.; Seely, R.E. Processing subject and object relative clauses: Evidence from eye movements. J. Mem. Lang. 2002, 47, 69–90.
41. Lakretz, Y.; Hupkes, D.; Vergallito, A.; Marelli, M.; Baroni, M.; Dehaene, S. Processing of nested dependencies by humans and neural language models. 2020, in preparation.
42. Lakretz, Y.; Desbordes, T.; King, J.R.; Crabbé, B.; Oquab, M.; Dehaene, S. Can RNNs learn Recursive Nested Subject-Verb Agreements? ACL 2020, under review.
43. Joulin, A.; Mikolov, T. Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015.
44. Woods, W.A. Transition network grammars for natural language analysis. Commun. ACM 1970, 13, 591–606.
45. Gibson, E.; Thomas, J. Memory limitations and structural forgetting: The perception of complex ungrammatical sentences as grammatical. Lang. Cogn. Process. 1999, 14, 225–248.
46. Dehaene, S.; Meyniel, F.; Wacongne, C.; Wang, L.; Pallier, C. The neural representation of sequences: From transition probabilities to algebraic patterns and linguistic trees. Neuron 2015, 88, 2–19.
47. Quiroga, R.Q.; Reddy, L.; Kreiman, G.; Koch, C.; Fried, I. Invariant visual representation by single neurons in the human brain. Nature 2005, 435, 1102–1107.
48. Chomsky, N. On cognitive structures and their development: A reply to Piaget. In Philosophy of Mind: Classical Problems/Contemporary Issues; The MIT Press: Cambridge, MA, USA, 2006; pp. 751–755.
49. Lake, B.; Baroni, M. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In Proceedings of the ICML 2018—International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 2879–2888.
50. Chomsky, N. Syntactic Structures; Mouton: Berlin, Germany, 1957.
51. Montague, R. Universal Grammar. Theoria 1970, 36, 373–398.
52. Fodor, J.; Pylyshyn, Z. Connectionism and cognitive architecture: A critical analysis. Cognition 1988, 28, 3–71.
53. Fodor, J.; Lepore, E. The Compositionality Papers; Oxford University Press: Oxford, UK, 2002.
54. Wehbe, L.; Murphy, B.; Talukdar, P.; Fyshe, A.; Ramdas, A.; Mitchell, T. Simultaneously uncovering the patterns of brain regions involved in different story reading subprocesses. PLoS ONE 2014, 9, e112575.
55. Huth, A.G.; De Heer, W.A.; Griffiths, T.L.; Theunissen, F.E.; Gallant, J.L. Natural speech reveals the semantic maps that tile human cerebral cortex. Nature 2016, 532, 453–458.
56. Jain, S.; Huth, A. Incorporating context into language encoding models for fMRI. In Proceedings of the Thirty-Second Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 6628–6637.
57. Toneva, M.; Wehbe, L. Interpreting and improving natural-language processing (in machines) with natural language-processing (in the brain). In Proceedings of the Thirty-Third Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 14928–14938.
58. Hebb, D.O. The Organization of Behavior: A Neuropsychological Theory; J. Wiley: New York, NY, USA, 1949.
59. Bi, G.Q.; Poo, M.M. Synaptic modification by correlated activity: Hebb’s postulate revisited. Ann. Rev. Neurosci. 2001, 24, 139–166.
60. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536.
61. Treves, A. Frontal latching networks: A possible neural basis for infinite recursion. Cogn. Neuropsychol. 2005, 22, 276–291.
62. Russo, E.; Pirmoradian, S.; Treves, A. Associative latching dynamics vs. syntax. In Advances in Cognitive Neurodynamics (II); Springer: Berlin/Heidelberg, Germany, 2011; pp. 111–115.
Cognitive Theories for Sentence Processing

| Theory | Grammar | Parsing Algorithm | Limiting Resource | Explanation for Capacity Limitation and Processing Breakdowns |
|---|---|---|---|---|
| DLT | Dependency grammar | Dependency parsing | Energy units | Too many long-range structural integrations take place at a given word, exceeding unit resources. |
| ACT-R based | pCFG | Left-corner | Temporal activity | High similarity among memory items causes unresolvable interference. |
| Expectation-based | pCFG | None | Probability mass | Frequent syntactic structures ‘consume’ most of the probability mass, leading rarer structures to generate high surprisal. |
| Symbolic neural architectures | None | None | Dimensionality | Highly complex syntactic structures require higher state-space dimensionality than that available. |
| Neural Language Models | None | None | Specialized syntax units | The neural circuit for long-range dependencies is sparse and can therefore process only a limited number of nested, or crossed, dependencies. |
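The 'surprisal' in the expectation-based row of the table is the negative log probability of a word given its context. A minimal sketch, assuming probabilities are supplied directly rather than estimated by a grammar or language model (the function name and example values are illustrative):

```python
import math


def surprisal(probability):
    """Surprisal in bits: -log2 of a word's conditional probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# A continuation holding half of the probability mass costs 1 bit;
# a rarer continuation with probability 1/8 costs 3 bits.
print(surprisal(0.5), surprisal(0.125))
```

The rarer a structure is under the model's distribution, the higher its surprisal, which is how this family of theories links frequency to processing difficulty.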
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Lakretz, Y.; Dehaene, S.; King, J.-R. What Limits Our Capacity to Process Nested Long-Range Dependencies in Sentence Comprehension? Entropy 2020, 22, 446. https://doi.org/10.3390/e22040446