Combining Transformer Embeddings with Linguistic Features for Complex Word Identification
Abstract
1. Introduction
2. Related Work
3. Dataset
4. System Description
4.1. Features
- PROPN: Number of proper nouns within the sentence.
- AUX: Number of auxiliaries within the sentence.
- VERB: Number of verbs within the sentence.
- ADP: Number of adpositions (prepositions and postpositions) within the sentence.
- NOUN: Number of nouns within the sentence.
- NN: Number of nouns, singular or mass (Penn Treebank NN tag).
- SYM: Number of symbols within the sentence.
- NUM: Number of numerals within the sentence.
- Absolute frequency: the absolute frequency of the target word.
- Relative frequency: the relative frequency of the target word.
- Number of words in the sentence (NumSentenceWords): the number of words in the sentence [25,29].

Based on the work proposed by [25] on exploring linguistic features for lexical complexity prediction, we also implemented the following features:
- Part of Speech (POS): the Part-of-Speech category of the target word.
- Relative frequency of the previous token: the relative frequency of the word before the token.
- Relative frequency of the word after the token: the relative frequency of the word after the token.
- Length of the previous word: the number of characters in the word before the token.
- Length of the following word: the number of characters in the word after the token.
- Lexical diversity (MTLD): the lexical diversity of the sentence containing the target word, computed with the Measure of Textual Lexical Diversity.

Additionally, the following WordNet features were also considered for each target word, as in the work carried out by [26]; an illustrative extraction sketch covering these features is given after this list:
- Number of synonyms [22].
- Number of hyponyms [22].
- Number of hypernyms [22].
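As a sketch of how these hand-crafted features can be obtained for a single (sentence, target word) pair, the snippet below computes most of them. The choice of libraries (spaCy for POS tagging, NLTK's WordNet interface, and the wordfreq package for relative frequencies) is an assumption made for illustration only, not the authors' exact setup; the absolute-frequency and MTLD features are omitted for brevity.

```python
# Illustrative feature extraction for one (sentence, target) pair.
# Library choices (spaCy, NLTK WordNet, wordfreq) are assumptions, not the authors' exact setup.
import spacy
from nltk.corpus import wordnet as wn
from wordfreq import word_frequency

nlp = spacy.load("en_core_web_sm")

def extract_features(sentence: str, target: str) -> dict:
    doc = nlp(sentence)
    tokens = [t for t in doc if not t.is_space]

    # Counts of coarse POS categories in the sentence (PROPN, AUX, VERB, ADP, NOUN, SYM, NUM)
    # plus the fine-grained Penn Treebank NN tag (noun, singular or mass).
    feats = {pos: sum(t.pos_ == pos for t in tokens)
             for pos in ("PROPN", "AUX", "VERB", "ADP", "NOUN", "SYM", "NUM")}
    feats["NN"] = sum(t.tag_ == "NN" for t in tokens)
    feats["NumSentenceWords"] = len(tokens)

    # Locate the first occurrence of the target word for the token-centred features.
    idx = next((i for i, t in enumerate(tokens) if t.text.lower() == target.lower()), None)
    prev_tok = tokens[idx - 1] if idx else None
    next_tok = tokens[idx + 1] if idx is not None and idx + 1 < len(tokens) else None

    feats["TargetPOS"] = tokens[idx].pos_ if idx is not None else "X"
    feats["TargetRelFreq"] = word_frequency(target.lower(), "en")
    feats["PrevRelFreq"] = word_frequency(prev_tok.text.lower(), "en") if prev_tok else 0.0
    feats["NextRelFreq"] = word_frequency(next_tok.text.lower(), "en") if next_tok else 0.0
    feats["PrevLen"] = len(prev_tok.text) if prev_tok else 0
    feats["NextLen"] = len(next_tok.text) if next_tok else 0

    # WordNet features for the target word: synonyms, hyponyms, hypernyms.
    synsets = wn.synsets(target)
    feats["NumSynonyms"] = sum(len(s.lemmas()) for s in synsets)
    feats["NumHyponyms"] = sum(len(s.hyponyms()) for s in synsets)
    feats["NumHypernyms"] = sum(len(s.hypernyms()) for s in synsets)
    return feats
```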
4.2. Traditional Machine Learning Classifiers
5. Results
6. Discussion
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Rico-Sulayes, A. General lexicon-based complex word identification extended with stem n-grams and morphological engines. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), CEUR-WS, Malaga, Spain, 23 September 2020.
- Uluslu, A.Y. Automatic Lexical Simplification for Turkish. arXiv 2022, arXiv:2201.05878.
- Shardlow, M.; Cooper, M.; Zampieri, M. CompLex: A New Corpus for Lexical Complexity Prediction from Likert Scale Data. In Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI), Marseille, France, 11 May 2020.
- Singh, S.; Mahmood, A. The NLP cookbook: Modern recipes for transformer based deep learning architectures. IEEE Access 2021, 9, 68675–68702.
- Nandy, A.; Adak, S.; Halder, T.; Pokala, S.M. cs60075_team2 at SemEval-2021 Task 1: Lexical Complexity Prediction using Transformer-based Language Models pre-trained on various text corpora. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Online, 5–6 August 2021; pp. 678–682.
- Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, Virtual Event, 3–10 March 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 610–623.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of Advances in Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
- Canete, J.; Chaperon, G.; Fuentes, R.; Ho, J.H.; Kang, H.; Pérez, J. Spanish pre-trained BERT model and evaluation data. In Proceedings of the PML4DC, ICLR 2020, Addis Ababa, Ethiopia, 26 April–1 May 2020.
- Yaseen, T.B.; Ismail, Q.; Al-Omari, S.; Al-Sobh, E.; Abdullah, M. JUST-BLUE at SemEval-2021 Task 1: Predicting Lexical Complexity using BERT and RoBERTa Pre-trained Language Models. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Online, 5–6 August 2021; pp. 661–666.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186.
- Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 8440–8451.
- Shardlow, M.; Evans, R.; Paetzold, G.H.; Zampieri, M. SemEval-2021 Task 1: Lexical Complexity Prediction. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Online, 5–6 August 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 1–16.
- Mc Laughlin, G.H. SMOG grading-a new readability formula. J. Read. 1969, 12, 639–646.
- Dale, E.; Chall, J.S. A formula for predicting readability: Instructions. Educ. Res. Bull. 1948, 27, 37–54.
- Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
- Ortiz-Zambrano, J.A.; Montejo-Ráez, A. Complex words identification using word-level features for SemEval-2020 Task 1. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Online, 5–6 August 2021; pp. 126–129.
- El Mamoun, N.; El Mahdaouy, A.; El Mekki, A.; Essefar, K.; Berrada, I. CS-UM6P at SemEval-2021 Task 1: A Deep Learning Model-based Pre-trained Transformer Encoder for Lexical Complexity. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Online, 5–6 August 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 585–589.
- Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; pp. 38–45.
- Zhuang, L.; Wayne, L.; Ya, S.; Jun, Z. A Robustly Optimized BERT Pre-training Approach with Post-training. In Proceedings of the 20th Chinese National Conference on Computational Linguistics, Huhhot, China, 13–15 August 2021; Chinese Information Processing Society of China: Huhhot, China, 2021; pp. 1218–1227.
- Liebeskind, C.; Elkayam, O.; Liebeskind, S. JCT at SemEval-2021 Task 1: Context-aware Representation for Lexical Complexity Prediction. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Online, 5–6 August 2021; pp. 138–143.
- Zaharia, G.E.; Cercel, D.C.; Dascalu, M. UPB at SemEval-2021 Task 1: Combining Deep Learning and Hand-Crafted Features for Lexical Complexity Prediction. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Online, 5–6 August 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 609–616.
- Mosquera, A. Alejandro Mosquera at SemEval-2021 Task 1: Exploring Sentence and Word Features for Lexical Complexity Prediction. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Online, 5–6 August 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 554–559.
- Vettigli, G.; Sorgente, A. CompNA at SemEval-2021 Task 1: Prediction of lexical complexity analyzing heterogeneous features. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Online, 5–6 August 2021; pp. 560–564.
- Paetzold, G.; Specia, L. SV000gg at SemEval-2016 Task 11: Heavy gauge complex word identification with system voting. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), San Diego, CA, USA, 16–17 June 2016; pp. 969–974.
- Ronzano, F.; Anke, L.E.; Saggion, H. TALN at SemEval-2016 Task 11: Modelling complex words by contextual, lexical and semantic features. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), San Diego, CA, USA, 16–17 June 2016; pp. 1011–1016.
- Gooding, S.; Kochmar, E. CAMB at CWI shared task 2018: Complex word identification with ensemble-based voting. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, New Orleans, LA, USA, 5–6 June 2018; pp. 184–194.
- Desai, A.T.; North, K.; Zampieri, M.; Homan, C. LCP-RIT at SemEval-2021 Task 1: Exploring Linguistic Features for Lexical Complexity Prediction. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Online, 5–6 August 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 548–553.
- Rayner, K.; Duffy, S.A. Lexical complexity and fixation times in reading: Effects of word frequency, verb complexity, and lexical ambiguity. Mem. Cogn. 1986, 14, 191–201.
- Shardlow, M. A Comparison of Techniques to Automatically Identify Complex Words. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Student Research Workshop, Sofia, Bulgaria, 5–7 August 2013; pp. 103–109.
- Paetzold, G. UTFPR at SemEval-2021 Task 1: Complexity Prediction by Combining BERT Vectors and Classic Features. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Online, 5–6 August 2021; pp. 617–622.
- Shardlow, M.; Evans, R.; Zampieri, M. Predicting lexical complexity in English texts: The Complex 2.0 dataset. Lang. Resour. Eval. 2022, 56, 1153–1194.
- Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of the COMPSTAT’2010, 19th International Conference on Computational Statistics, Paris, France, 22–27 August 2010; pp. 177–186.
- Crammer, K.; Dekel, O.; Keshet, J.; Shalev-Shwartz, S.; Singer, Y. Online Passive-Aggressive Algorithms. J. Mach. Learn. Res. 2006, 7, 551–585.
- Song, B.; Pan, C.; Wang, S.; Luo, Z. DeepBlueAI at SemEval-2021 Task 7: Detecting and Rating Humor and Offense with Stacking Diverse Language Model-Based Methods. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Online, 5–6 August 2021; pp. 1130–1134.
- Taya, Y.; Kanashiro Pereira, L.; Cheng, F.; Kobayashi, I. OCHADAI-KYOTO at SemEval-2021 Task 1: Enhancing Model Generalization and Robustness for Lexical Complexity Prediction. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Online, 5–6 August 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 17–23.
Id | Corpus | Sentence | Token | Complexity |
---|---|---|---|---|
3ZL....32A | Bible | Behold, there came up out of the river seven cattle, sleek and fat, and they fed in the marsh grass. | river | 0.100 |
34R...E5C | Bible | I am a fellow bondservant with you and with your brothers, the prophets, and with those who keep the words of this book. | brothers | 0.400 |
3GM...UY3 | Biomed | Supplementary data are available at NAR online. | online | 0.107 |
3KI...67D | Biomed | In lens epithelium derived from alphaAKO lenses, cell growth rates were reported to be 50% lower compared to wild type, suggesting a role for alphaA in regulating the cell cycle. | growth | 0.107 |
3VA...PSC | Europarl | (ES) Mr President, as a Spanish Member resident in the Canary Islands, I want to thank you for remembering the victims of the accident on 20 August. | victims | 0.191 |
3H6...PWP | Europarl | Over 40% of the energy we use is consumed in buildings and 75% of the buildings standing today will still be here in 2050, so we need to tackle energy efficiency in existing buildings as well as in new stock. | efficiency | 0.333 |
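Each CompLex entry therefore consists of an identifier, the source corpus, the sentence, the target token, and its continuous complexity score. Assuming the data are distributed as a tab-separated file (the file name below is hypothetical), a minimal loading sketch looks like this:

```python
# Loading CompLex entries; the file name and TSV layout are assumptions for illustration.
import pandas as pd

df = pd.read_csv("lcp_single_train.tsv", sep="\t", header=0,
                 names=["id", "corpus", "sentence", "token", "complexity"],
                 quoting=3)  # QUOTE_NONE keeps the raw sentences intact

print(df.groupby("corpus")["complexity"].mean())  # average complexity per genre
```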
Subset | Genre | Context | Unique Tokens | Average Complexity
---|---|---|---|---
All | Total | 10,800 | 5617 | 0.321
 | Europarl | 3600 | 2227 | 0.303
 | Biomed | 3600 | 1904 | 0.353
 | Bible | 3600 | 1934 | 0.307
Single | Total | 9000 | 4129 | 0.302
 | Europarl | 3000 | 1725 | 0.286
 | Biomed | 3000 | 1388 | 0.325
 | Bible | 3000 | 1462 | 0.293
MWE | Total | 1800 | 1488 | 0.419
 | Europarl | 600 | 502 | 0.388
 | Biomed | 600 | 516 | 0.491
 | Bible | 600 | 472 | 0.377
Feature Identifier | Description
---|---
LF | Linguistic features.
BERT-S | Sentence encodings from the BERT model.
BERT-W | Token/word encodings from the BERT model.
XLMR-S | Sentence encodings from the XLM-RoBERTa model.
XLMR-W | Token/word encodings from the XLM-RoBERTa model.
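As an illustration of how one configuration from the table above, BERT-W⊕BERT-S⊕LF, can be assembled, the sketch below concatenates a token-level and a mean-pooled sentence-level BERT encoding with the linguistic features and fits a support vector regressor. Hugging Face transformers and scikit-learn are assumed here for illustration, and the word-piece alignment and sentence pooling are deliberately simplified; they are not the authors' exact procedure.

```python
# Sketch of the BERT-W ⊕ BERT-S ⊕ LF configuration with an SVR regressor.
# Library choices and pooling strategy are assumptions made for illustration.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.svm import SVR

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def encode(sentence: str, target: str) -> np.ndarray:
    """Concatenate the token (BERT-W) and sentence (BERT-S) encodings."""
    enc = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, 768)
    sent_vec = hidden.mean(dim=0)                    # BERT-S: mean-pooled sentence encoding
    # BERT-W: average the hidden states of the word pieces belonging to the target
    # (simplified matching by word-piece id).
    target_ids = tokenizer(target, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    positions = [i for i, wp in enumerate(ids) if wp in target_ids] or [0]
    word_vec = hidden[positions].mean(dim=0)
    return torch.cat([word_vec, sent_vec]).numpy()

def train(sentences, targets, X_lf, y):
    """X_lf: matrix of hand-crafted linguistic features; y: gold complexity scores."""
    X_emb = np.vstack([encode(s, t) for s, t in zip(sentences, targets)])
    X = np.hstack([X_emb, np.asarray(X_lf)])         # BERT-W ⊕ BERT-S ⊕ LF
    return SVR().fit(X, y)
```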
Results for the Deep Learning Approaches with CompLex

Configuration | Model | Alg | MAE | MSE | RMSE | Pearson | R²
---|---|---|---|---|---|---|---
BERT-W⊕BERT-S⊕LF | fine-tuned | SVR | 0.068898 | 0.009908 | 0.094294 | 0.891137 | 0.80
BERT-W⊕BERT-S⊕XLMR-W⊕XLMR-S⊕LF | fine-tuned | SVR | 0.068899 | 0.009908 | 0.094296 | 0.8911367 | 0.80
BERT-W⊕BERT-S⊕XLMR-W⊕XLMR-S⊕LF | pre-trained | SVR | 0.068899 | 0.009908 | 0.094296 | 0.891136 | 0.79
BERT-W⊕BERT-S⊕LF | fine-tuned | GBR | 0.068972 | 0.011898 | 0.099297 | 0.927208 | 0.87
BERT-W⊕BERT-S⊕XLMR-W⊕XLMR-S⊕LF | fine-tuned | GBR | 0.068974 | 0.011898 | 0.099297 | 0.927218 | 0.86
BERT-W⊕BERT-S⊕XLMR-W⊕XLMR-S⊕LF | pre-trained | GBR | 0.068974 | 0.011898 | 0.099297 | 0.927218 | 0.87
BERT-W⊕LF | fine-tuned | SVR | 0.069900 | 0.009913 | 0.094392 | 0.890911 | 0.79
BERT-W⊕LF | fine-tuned | GBR | 0.070124 | 0.012018 | 0.099372 | 0.926726 | 0.86
BERT-W⊕BERT-S | fine-tuned | SVR | 0.071623 | 0.009204 | 0.095901 | 0.874019 | 0.77
BERT-W⊕LF | pre-trained | SVR | 0.074342 | 0.009909 | 0.095558 | 0.864029 | 0.75
BERT-W⊕BERT-S | pre-trained | SVR | 0.074394 | 0.009323 | 0.096034 | 0.873943 | 0.75
BERT-W⊕LF | pre-trained | GBR | 0.075430 | 0.012348 | 0.100003 | 0.900043 | 0.80
BERT-W | pre-trained | RLM | 0.075552 | 0.009734 | 0.098232 | 0.789901 | 0.63
BERT-W | fine-tuned | SVR | 0.075897 | 0.009520 | 0.097119 | 0.864816 | 0.76
BERT-W | pre-trained | SVR | 0.075938 | 0.009559 | 0.097123 | 0.864 | 0.76
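The MAE, MSE, RMSE, Pearson and R² values reported above can be computed from a vector of gold complexity scores and a vector of predictions; a minimal sketch, assuming scikit-learn and SciPy purely for illustration, is:

```python
# Evaluation metrics used in the results tables.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "Pearson": pearsonr(y_true, y_pred)[0],
        "R2": r2_score(y_true, y_pred),
    }
```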
Rank | Team Name | Pearson | MAE | MSE | R²
---|---|---|---|---|---
1 | JUST Blue | 0.7886 | 0.0609 | 0.0062 | 0.6172
2 | DeepBlueAI | 0.7882 | 0.0610 | 0.0061 | 0.6210
3 | OCHADAI-KYOTO | 0.7772 | 0.0617 | 0.0065 | 0.6015
4 | ia pucp | 0.7704 | 0.0618 | 0.0066 | 0.5929
5 | Alejandro M. | 0.7790 | 0.0619 | 0.0064 | 0.6062
- | Our system | - | 0.0875 | 0.0131 | 0.1930