Assessment of Word-Level Neural Language Models for Sentence Completion
Abstract
1. Introduction
- We demonstrate that, when properly trained, simple RNN LMs are highly competitive for sentence completion. Our word-level RNNs achieved results beyond the previous best reported on the MSR and SAT datasets (a minimal candidate-scoring sketch follows this list).
- We verify that the transfer-learning approach of pre-training a transformer-based LM on large data and fine-tuning it for the target task is also viable for sentence completion. Our experiments compared several pre-trained networks under different fine-tuning settings; performance varied considerably across networks, and certain configurations achieved state-of-the-art results on both datasets.
- We collected a new cloze-style dataset, written in Korean, from the government’s official examinations. Experimental results show that models effective on the English datasets underperformed on the Korean dataset, leaving room for further investigation.
- The PyTorch implementation code (https://github.com/heevery/sentence-completion) used for the experiments is made available to encourage subsequent studies on neural approaches to machine comprehension.
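As a concrete illustration of LM-based scoring, the sketch below fills each answer choice into the blank and picks the candidate whose completed sentence receives the highest log-probability under a pre-trained causal LM. It is a minimal sketch only: it uses an off-the-shelf GPT-2 through the HuggingFace Transformers library rather than the word RNNs trained in this paper, and the example question is invented for illustration.

```python
# Minimal sketch of LM-based candidate scoring for sentence completion.
# Assumptions: an off-the-shelf GPT-2 (not the paper's word RNN) and an
# invented example question; "full sentence" scoring = sum of token log-probs.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(sentence: str) -> float:
    """Sum of log-probabilities of all tokens in the sentence under the LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                      # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]                                # token t predicted from t-1
    return log_probs.gather(2, targets.unsqueeze(-1)).sum().item()

def solve(stem: str, choices):
    """Fill each choice into the blank and return the index of the best one."""
    scores = [sentence_log_prob(stem.replace("________", c)) for c in choices]
    return max(range(len(choices)), key=scores.__getitem__)

# Hypothetical MSR-style question:
stem = "The detective had been ________ the suspect for three days."
choices = ["devouring", "following", "singing", "painting", "freezing"]
print(choices[solve(stem, choices)])
```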
2. Related Work
3. Methods
3.1. Word-Level RNN LM
3.2. LM-Based Scoring
3.3. Fine-Tuning Pre-Trained LM for Sentence Completion
4. MSR Sentence Completion
4.1. Results with the Official Training Data
4.2. Results with External Data
5. SAT Sentence Completion
6. TOPIK Cloze Questions
7. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- Hermann, K.M.; Kociský, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; Blunsom, P. Teaching Machines to Read and Comprehend. In Proceedings of the NIPS, Montreal, QC, Canada, 7–12 December 2015; pp. 1693–1701. [Google Scholar]
- Chen, D.; Bolton, J.; Manning, C.D. A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task; ACL (1); The Association for Computer Linguistics: Stroudsburg, PA, USA, 2016. [Google Scholar]
- Hill, F.; Bordes, A.; Chopra, S.; Weston, J. The Goldilocks Principle: Reading Children’s Books with Explicit Memory Representations. arXiv 2016, arXiv:1511.02301. [Google Scholar]
- Zweig, G.; Platt, J.C.; Meek, C.; Burges, C.J.C.; Yessenalina, A.; Liu, Q. Computational Approaches to Sentence Completion; ACL (1); The Association for Computer Linguistics: Stroudsburg, PA, USA, 2012; pp. 601–610. [Google Scholar]
- Zweig, G.; Burges, C.J.C. A Challenge Set for Advancing Language Modeling; WLM@NAACL-HLT; Association for Computational Linguistics: Stroudsburg, PA, USA, 2012; pp. 29–36. [Google Scholar]
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
- Tang, E. Assessing the Effectiveness of Corpus-Based Methods in Solving SAT Sentence Completion Questions. JCP 2016, 11, 266–279. [Google Scholar] [CrossRef]
- Woods, A. Exploiting Linguistic Features for Sentence Completion; ACL (2); The Association for Computer Linguistics: Stroudsburg, PA, USA, 2016. [Google Scholar]
- Melamud, O.; Goldberger, J.; Dagan, I. Context2vec: Learning Generic Context Embedding with Bidirectional LSTM. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, Berlin, Germany, 11–12 August 2016; pp. 51–61. [Google Scholar]
- Tran, K.M.; Bisazza, A.; Monz, C. Recurrent Memory Networks for Language Modeling; HLT-NAACL; The Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 321–331. [Google Scholar]
- Park, H.; Cho, S.; Park, J. Word RNN as a Baseline for Sentence Completion. In Proceedings of the 2018 IEEE 5th International Congress on Information Science and Technology (CiSt), Marrakech, Morocco, 21–27 October 2018; pp. 183–187. [Google Scholar]
- Józefowicz, R.; Vinyals, O.; Schuster, M.; Shazeer, N.; Wu, Y. Exploring the Limits of Language Modeling. arXiv 2016, arXiv:1602.02410. [Google Scholar]
- Melis, G.; Dyer, C.; Blunsom, P. On the State of the Art of Evaluation in Neural Language Models. arXiv 2018, arXiv:1707.05589. [Google Scholar]
- Howard, J.; Ruder, S. Universal Language Model Fine-tuning for Text Classification; ACL (1); Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 328–339. [Google Scholar]
- Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training; Technical Report; OpenAI: San Francisco, CA, USA, 2018. [Google Scholar]
- Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding; NAACL-HLT (1); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186. [Google Scholar]
- Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.G.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv 2019, arXiv:1906.08237. [Google Scholar]
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
- Hardalov, M.; Koychev, I.; Nakov, P. Beyond English-only Reading Comprehension: Experiments in Zero-Shot Multilingual Transfer for Bulgarian. arXiv 2019, arXiv:1908.01519. [Google Scholar]
- Cui, Y.; Liu, T.; Chen, Z.; Wang, S.; Hu, G. Consensus Attention-based Neural Networks for Chinese Reading Comprehension. arXiv 2016, arXiv:1607.02250. [Google Scholar]
- Liu, S.; Zhang, X.; Zhang, S.; Wang, H.; Zhang, W. Neural Machine Reading Comprehension: Methods and Trends. Appl. Sci. 2019, 9, 3698. [Google Scholar] [CrossRef] [Green Version]
- Deutsch, D.; Roth, D. Summary Cloze: A New Task for Content Selection in Topic-Focused Summarization; EMNLP/IJCNLP (1); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 3718–3727. [Google Scholar]
- Schwartz, R.; Sap, M.; Konstas, I.; Zilles, L.; Choi, Y.; Smith, N.A. Story Cloze Task: UW NLP System; LSDSem@EACL; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 52–55. [Google Scholar]
- Xie, Q.; Lai, G.; Dai, Z.; Hovy, E.H. Large-scale Cloze Test Dataset Designed by Teachers. arXiv 2017, arXiv:1711.03225. [Google Scholar]
- Huh, M.; Agrawal, P.; Efros, A.A. What makes ImageNet good for transfer learning? arXiv 2016, arXiv:1608.08614. [Google Scholar]
- Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep Contextualized Word Representations; NAACL-HLT; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 2227–2237. [Google Scholar]
- Conneau, A.; Lample, G. Cross-lingual Language Model Pretraining. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 7057–7067. [Google Scholar]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
- Pires, T.; Schlinger, E.; Garrette, D. How Multilingual is Multilingual BERT? ACL (1); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4996–5001. [Google Scholar]
- Wu, S.; Dredze, M. Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT; EMNLP/IJCNLP (1); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 833–844. [Google Scholar]
- Artetxe, M.; Ruder, S.; Yogatama, D. On the Cross-lingual Transferability of Monolingual Representations. arXiv 2019, arXiv:1910.11856. [Google Scholar]
- Mikolov, T.; Karafiát, M.; Burget, L.; Cernocký, J.; Khudanpur, S. Recurrent neural network based language model. In Proceedings of the Eleventh Annual Conference of the International Speech Communication Association, Chiba, Japan, 26–30 September 2010; pp. 1045–1048. [Google Scholar]
- Zaremba, W.; Sutskever, I.; Vinyals, O. Recurrent Neural Network Regularization. arXiv 2014, arXiv:1409.2329. [Google Scholar]
- Berglund, M.; Raiko, T.; Honkala, M.; Kärkkäinen, L.; Vetek, A.; Karhunen, J. Bidirectional Recurrent Neural Networks as Generative Models. In Proceedings of the NIPS, Montreal, QC, Canada, 7–12 December 2015; pp. 856–864. [Google Scholar]
- Trinh, T.H.; Le, Q.V. A Simple Method for Commonsense Reasoning. arXiv 2018, arXiv:1806.02847. [Google Scholar]
- Mirowski, P.; Vlachos, A. Dependency Recurrent Neural Language Models for Sentence Completion; ACL (2); The Association for Computer Linguistics: Stroudsburg, PA, USA, 2015; pp. 511–517. [Google Scholar]
- Levesque, H.J.; Davis, E.; Morgenstern, L. The Winograd Schema Challenge. In Proceedings of the Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, Rome, Italy, 10–14 June 2012. [Google Scholar]
- Mikolov, T. Statistical Language Models Based on Neural Networks. Presentation at Google, Mountain View, CA, USA, 2 April 2012. [Google Scholar]
- Chelba, C.; Mikolov, T.; Schuster, M.; Ge, Q.; Brants, T.; Koehn, P.; Robinson, T. One billion word benchmark for measuring progress in statistical language modeling. arXiv 2014, arXiv:1312.3005. [Google Scholar]
- Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv 2016, arXiv:1609.08144. [Google Scholar]
- Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. arXiv 2019, arXiv:1910.03771. [Google Scholar]
- Davies, M.; Fuchs, R. Expanding horizons in the study of World Englishes with the 1.9 billion word Global Web-based English Corpus (GloWbE). English World-Wide 2015, 36, 1–28. [Google Scholar] [CrossRef]
- Han, X.; Eisenstein, J. Unsupervised Domain Adaptation of Contextualized Embeddings for Sequence Labeling; EMNLP/IJCNLP (1); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4237–4247. [Google Scholar]
- Kondratyuk, D.; Straka, M. 75 Languages, 1 Model: Parsing Universal Dependencies Universally; EMNLP/IJCNLP (1); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 2779–2795. [Google Scholar]
- Houlsby, N.; Giurgiu, A.; Jastrzębski, S.; Morrone, B.; de Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-Efficient Transfer Learning for NLP. Proc. Mach. Learn. Res. 2019, 97, 2790–2799. [Google Scholar]
- Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. arXiv 2019, arXiv:1911.02116. [Google Scholar]
Accuracy [%] by LM formulation and scoring scheme (a tentative formalization of the scoring schemes follows the table):

LM Formulation | Blank | Full | Partial
---|---|---|---
Unidirectional LM | 50.5 (0.4) | 69.4 (0.8) | 56.4 (0.9) |
Bidirectional LM | 69.8 (0.5) | 72.3 (1.1) | 63.7 (1.4) |
Masked LM | N/A | 58.2 (1.6) | N/A |
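For orientation only, one plausible formalization of the three scoring schemes compared above is given below, assuming a unidirectional LM and a candidate c that fills the blank at positions a through b of the completed sentence w_1, ..., w_T. This is an illustrative assumption; the paper's exact definitions, in particular for the bidirectional and masked LMs, may differ.

```latex
% Tentative formalization (assumption, not the paper's exact definitions):
% a candidate c fills the blank at positions a..b of the completed sentence w_1..w_T.
\begin{align*}
S_{\mathrm{blank}}(c)   &= \sum_{t=a}^{b} \log p(w_t \mid w_{<t}) && \text{(blank words only)} \\
S_{\mathrm{full}}(c)    &= \sum_{t=1}^{T} \log p(w_t \mid w_{<t}) && \text{(whole completed sentence)} \\
S_{\mathrm{partial}}(c) &= \sum_{t=a}^{T} \log p(w_t \mid w_{<t}) && \text{(from the blank to the end)}
\end{align*}
```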
Accuracy [%] on the MSR sentence completion challenge (a slash separates development/test values):

Model (layer sizes) | Accuracy
---|---
RNNLMs [39] | 55.4 |
Skip-gram [6] | 48.0 |
Skip-gram + RNNLMs [6] | 59.2/58.7 |
NPMI + Co-occ. Freq. + LSA + CBOW + CSKIP [7] | 48 |
PMI using Unigrams + Bigrams + Trigrams [8] | 61.44 |
Context2vec (300-600-600) [9] | 66.2/64.0 |
LSTM (256-256-256) [10] | 56.0 |
Unidirectional-RM (512-512-512) [10] | 69.2 |
Bidirectional-RM (512-512-512) [10] | 67.0 |
Unidirectional word RNN (200-600-400) | 69.1 (0.9)/69.6 (0.8) |
Bidirectional word RNN (200-600-400) | 72.5 (1.4)/72.0 (2.0) |
Unidirectional word RNN ensemble | 72.0/71.5 |
Bidirectional word RNN ensemble | 74.1/74.6 |
Model | Dev. | Test |
---|---|---|
Bidirectional word RNN ensemble | 74.1 | 74.6 |
LM1B | 70.2 | 67.7 |
BERT-base-uncased | 55.38 | 58.46 |
BERT-base-cased | 60.58 | 60.19 |
BERT-base-multilingual-uncased | 22.50 | 20.58 |
BERT-base-multilingual-cased | 41.54 | 41.92 |
BERT-large-uncased | 54.23 | 55.00 |
BERT-large-cased | 52.69 | 49.23 |
BERT-large-uncased-wwm | 77.69 | 77.12 |
BERT-large-cased-wwm | 75.19 | 77.12 |
GPT2 | 46.73 | 44.04 |
GPT2-medium | 54.62 | 52.69 |
Accuracy with different input constructions for fine-tuning (a sketch of the encoding mechanics follows this table):

Input Construction | Accuracy
---|---
Proposed method | 84.2 (2.3) |
with [SEP] | 84.2 (1.3) |
with segment identifiers | 84.6 (2.9) |
with [SEP] and segment identifiers | 74.6 (1.8) |
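The comparison above toggles the use of a [SEP] token and segment (token type) identifiers when encoding a question together with a candidate answer. The sketch below shows the general mechanics of the two extremes with the HuggingFace BERT tokenizer; the stem is a hypothetical example, and the paper's exact "proposed" input format is not reproduced here.

```python
# Sketch of two BERT input-construction variants: a single filled-in segment
# versus a stem/choice pair with [SEP] and segment (token type) identifiers.
# The stem below is a hypothetical example, not taken from the datasets.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

stem = "I'm going to the ________ to withdraw money."
choice = "bank"

# Variant A: fill the blank and encode one segment
# -> [CLS] ... [SEP], token_type_ids are all 0.
enc_a = tokenizer(stem.replace("________", choice))

# Variant B: encode stem and choice as a pair
# -> [CLS] stem [SEP] choice [SEP], token_type_ids mark the second segment.
enc_b = tokenizer(stem, choice)

print(tokenizer.convert_ids_to_tokens(enc_a["input_ids"]))
print(tokenizer.convert_ids_to_tokens(enc_b["input_ids"]))
print(enc_b["token_type_ids"])
```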
MSR test accuracy [%] without and with fine-tuning:

Model | w/o Fine-Tuning | w/ Fine-Tuning
---|---|---
BERT-base-uncased | 58.46 | 68.65 |
BERT-base-cased | 60.19 | 73.27 |
BERT-base-multilingual-uncased | 20.58 | 21.92 |
BERT-base-multilingual-cased | 41.92 | 47.88 |
BERT-large-uncased | 55.00 | 79.23 |
BERT-large-cased | 49.23 | 79.42 |
BERT-large-uncased-wwm | 77.12 | 86.15 |
BERT-large-cased-wwm | 77.12 | 85.77 |
GPT2 | 44.04 | 38.27 |
GPT2-medium | 52.69 | 59.23 |
Accuracy [%] on the SAT sentence completion questions by model and training corpus:

Model (layer sizes) | Training Corpus | Accuracy
---|---|---
NPMI + Co-occ. Freq. + LSA + CBOW + CSKIP [7] | GloWbE | 59
PMI using Unigrams + Bigrams + Trigrams [8] | English Gigaword | 58.95
Unidirectional word RNN (200-600-400) | 19th-century novels | 29.6 (1.5)
Bidirectional word RNN (200-600-400) | 19th-century novels | 33.3 (2.0)
Unidirectional word RNN (500-2000-500) | 1B word benchmark | 66.5
Bidirectional word RNN (500-2000-500) | 1B word benchmark | 69.1
LM1B (1024-8192-1024) | 1B word benchmark | 71.0
SAT accuracy [%] without and with fine-tuning:

Model | w/o Fine-Tuning | w/ Fine-Tuning
---|---|---
BERT-base-uncased | 30.92 | 42.11 |
BERT-base-cased | 30.92 | 52.63 |
BERT-base-multilingual-uncased | 17.11 | 22.37 |
BERT-base-multilingual-cased | 23.03 | 36.84 |
BERT-large-uncased | 25.66 | 73.03 |
BERT-large-cased | 27.63 | 73.03 |
BERT-large-uncased-wwm | 63.82 | 80.26 |
BERT-large-cased-wwm | 59.87 | 80.92 |
GPT2 | 38.16 | 32.24 |
GPT2-medium | 53.29 | 53.29 |
The number of questions by section, question type, and examination level:

Section | Type | Level 1 | Level 2 | Novice | Level I | Level II | Total
---|---|---|---|---|---|---|---
Vocabulary | Short-single | 30 | 28 | 175 | | | 233
Vocabulary | Short-multi | 78 | 91 | 303 | | | 472
Vocabulary | Long-single | 6 | 14 | 48 | | | 68
Vocabulary | Long-multi | 32 | 12 | 102 | | | 146
Vocabulary | Subtotal | 146 | 145 | 628 | | | 919
Reading | Short-single | 26 | 19 | 100 | 36 | 12 | 193
Reading | Short-multi | 4 | | | | | 4
Reading | Long-single | 19 | 13 | 65 | 49 | 66 | 212
Reading | Subtotal | 49 | 32 | 165 | 85 | 78 | 409
Writing | Short-single | 25 | 22 | | | | 47
Writing | Short-multi | 109 | 94 | 129 | | | 332
Writing | Long-single | | 32 | 21 | | | 53
Writing | Long-multi | 41 | 22 | | | | 63
Writing | Subtotal | 175 | 170 | 150 | | | 495
Total | | 370 | 347 | 943 | 85 | 78 | 1823
Average passage and choice lengths in words and characters:

Section | Type | Passage (Words) | Passage (Chars) | Choice (Words) | Choice (Chars)
---|---|---|---|---|---
Vocabulary | Short-single | 5.9 | 14.7 | 1.0 | 2.0
Vocabulary | Short-multi | 11.2 | 25.0 | 1.3 | 3.5
Vocabulary | Long-single | 42.4 | 107.7 | 1.7 | 4.2
Vocabulary | Long-multi | 30.5 | 69.7 | 1.4 | 3.9
Vocabulary | Subtotal | 15.2 | 35.6 | 1.3 | 3.3
Reading | Short-single | 9.7 | 25.3 | 1.0 | 3.3
Reading | Short-multi | 10.2 | 26.0 | 1.2 | 5.0
Reading | Long-single | 52.4 | 139.1 | 2.3 | 6.6
Reading | Subtotal | 31.8 | 84.3 | 1.7 | 5.0
Writing | Short-single | 14.0 | 35.9 | 3.7 | 10.1
Writing | Short-multi | 10.9 | 24.5 | 3.4 | 8.7
Writing | Long-single | 45.3 | 117.0 | 4.0 | 11.1
Writing | Long-multi | 41.4 | 93.1 | 3.4 | 8.5
Writing | Subtotal | 18.8 | 44.2 | 3.5 | 9.1
Total | | 19.9 | 48.9 | 2.0 | 5.2
Type | Question | Translation |
---|---|---|
Short-single | 돈을 찾으러 ________에 갑니다. 1. 은행 2. 운동장 3. 경찰서 4. 백화점 | I’m going to the ________ to withdraw money. 1. bank 2. playground 3. police station 4. department store
Short-multi | 아침에 다 같이 식사하세요? 우리는 ________ 시간이 다 다르니까 같이 못 먹어요. 1. 쉬는 2. 끝나는 3. 일어나는 4. 내는 | Do you eat together in the morning? No, we can’t eat together because the times we ________ are different. 1. rest 2. finish 3. wake up 4. pay
Long-single | 친절은 다른 사람을 위한 따뜻한 마음과 행동입니다. 친절한 사람은 다른 사람에게 ________ 행동을 하지 않습니다. 그리고 남이 어려울 때 적극적으로 도와 줍니다. 친절한 말과 행동은 이 세상을 더 아름답게 만듭니다. 1. 좋은 2. 기쁜 3. 나쁜 4. 착한 | Kindness is a warm heart and action for others. Kind people do not act ________ others. They actively help when others are in trouble. Kind words and actions make this world more beautiful. 1. well with 2. happy with 3. bad to 4. nicely to |
Long-multi | 경찰관: 어디에서 잃어 버리셨어요? 아주머니: 택시 안에 놓고 내렸어요. 경찰관: 가방이 ________ 아주머니: 까만 색 큰 가방이에요. 경찰관: 뭐가 들어 있습니까? 아주머니: 지갑이요. 꼭 찾아야 하는데요. 경찰관: 알아보겠습니다. 아주머니: 아저씨, 꼭 좀 부탁드립니다. 경찰관: 너무 걱정하지 마십시오. 댁에 가서 기다리세요. 1. 어떻게 생겼어요? 2. 어떤 가게에서 샀어요? 3. 어떻게 만들었어요? 4. 어디에서 샀어요? | Officer: Where did you lose it? Ma’am: I left it in the cab. Officer: Your bag, ________ Ma’am: It’s a big black bag. Officer: What do you have in it? Ma’am: My wallet, I must find it. Officer: I’ll see what I can do. Ma’am: Please, officer. Officer: Don’t worry too much, go home and wait. 1. what does it look like? 2. what store did you buy it from? 3. how did you make it? 4. where did you buy it? |
Model | Dev. | Test |
---|---|---|
Unidirectional word RNN | 43.4 (0.5) | 42.3 (0.6) |
Bidirectional word RNN | 43.3 (0.5) | 40.8 (1.1) |
BERT-base-multilingual-uncased | 25.8 | 24.9 |
BERT-base-multilingual-cased | 29.9 | 32.6 |
Fine-tuned BERT-base-multilingual-uncased | 30.0 | 27.3
Fine-tuned BERT-base-multilingual-cased | 38.0 | 33.7