Open Access Article

Leaving No Stone Unturned: Flexible Retrieval of Idiomatic Expressions from a Large Text Corpus

1 School of Computer Science and Informatics, Cardiff University, Cardiff CF24 3AA, UK
2 School of English, Communication and Philosophy, Cardiff University, Cardiff CF10 3EU, UK
* Author to whom correspondence should be addressed.
Academic Editor: Chris Biemann
Mach. Learn. Knowl. Extr. 2021, 3(1), 263-283; https://doi.org/10.3390/make3010013
Received: 2 February 2021 / Revised: 24 February 2021 / Accepted: 25 February 2021 / Published: 3 March 2021
(This article belongs to the Section Data)
Idioms are multi-word expressions whose meaning cannot always be deduced from the literal meaning of constituent words. A key feature of idioms that is central to this paper is their peculiar mixture of fixedness and variability, which poses challenges for their retrieval from large corpora using traditional search approaches. These challenges hinder insights into idiom usage, affecting users who are conducting linguistic research as well as those involved in language education. To facilitate access to idiom examples taken from real-world contexts, we introduce an information retrieval system designed specifically for idioms. Given a search query that represents an idiom, typically in its canonical form, the system expands it automatically to account for the most common types of idiom variation including inflection, open slots, adjectival or adverbial modification and passivisation. As a by-product of query expansion, other types of idiom variation captured include derivation, compounding, negation, distribution across multiple clauses as well as other unforeseen types of variation. The system was implemented on top of Elasticsearch, an open-source, distributed, scalable, real-time search engine. Flexible retrieval of idioms is supported by a combination of linguistic pre-processing of the search queries, their translation into a set of query clauses written in a query language called Query DSL, and analysis, an indexing process that involves tokenisation and normalisation. Our system outperformed the phrase search in terms of recall and outperformed the keyword search in terms of precision. Out of the three, our approach was found to provide the best balance between precision and recall. By providing a fast and easy way of finding idioms in large corpora, our approach can facilitate further developments in fields such as linguistics, language education and natural language processing.
Keywords: information retrieval; natural language processing; corpus linguistics; multi-word expressions; idioms
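
To make the idea of translating an idiom into Query DSL clauses more concrete, the following is a minimal sketch and not the system described in the abstract. It assumes that the idiom's content words have already been normalised (for example, lowercased and stemmed) to match the terms produced by the index analyser, because `span_term` clauses are not analysed at query time. The function name `build_idiom_query`, the field name `text` and the default `slop` value are hypothetical choices made for illustration only. The sketch uses the Elasticsearch `span_near` query, whose `slop` parameter tolerates intervening words (open slots, modifiers) and whose `in_order` flag can be disabled to tolerate reordering such as passivisation.

```python
import json


def build_idiom_query(idiom_tokens, field="text", slop=3, in_order=False):
    """Build an Elasticsearch Query DSL body for a proximity search.

    Illustrative sketch only. `idiom_tokens` are assumed to be already
    normalised (e.g. lowercased and stemmed) so that they match the
    indexed terms; `field` and `slop` are hypothetical defaults.
    """
    # One span_term clause per content word of the idiom.
    clauses = [{"span_term": {field: token}} for token in idiom_tokens]
    return {
        "query": {
            "span_near": {
                "clauses": clauses,
                # Maximum number of intervening unmatched positions.
                "slop": slop,
                # False lets the clauses match in any order.
                "in_order": in_order,
            }
        }
    }


# Example: content words of "spill the beans", assumed to be stemmed.
example = build_idiom_query(["spill", "bean"])
print(json.dumps(example, indent=2))
```

Under these assumptions, the resulting body could be sent to the search endpoint of an index whose analyser stems tokens, so that variants such as "spilled the proverbial beans" or "the beans were spilled" would be candidates for a match. A larger `slop` would be expected to raise recall at the cost of precision, which mirrors the trade-off between phrase search and keyword search discussed in the abstract.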

MDPI and ACS Style

Hughes, C.; Filimonov, M.; Wray, A.; Spasić, I. Leaving No Stone Unturned: Flexible Retrieval of Idiomatic Expressions from a Large Text Corpus. Mach. Learn. Knowl. Extr. 2021, 3, 263-283. https://doi.org/10.3390/make3010013

AMA Style

Hughes C, Filimonov M, Wray A, Spasić I. Leaving No Stone Unturned: Flexible Retrieval of Idiomatic Expressions from a Large Text Corpus. Machine Learning and Knowledge Extraction. 2021; 3(1):263-283. https://doi.org/10.3390/make3010013

Chicago/Turabian Style

Hughes, Callum; Filimonov, Maxim; Wray, Alison; Spasić, Irena. 2021. "Leaving No Stone Unturned: Flexible Retrieval of Idiomatic Expressions from a Large Text Corpus" Mach. Learn. Knowl. Extr. 3, no. 1: 263-283. https://doi.org/10.3390/make3010013
