Open Access Article

Unsupervised Keyphrase Extraction for Web Pages

1 Bernoulli Institute, Department of Artificial Intelligence, University of Groningen, PO Box 407, 9700 AK Groningen, The Netherlands
2 Dataprovider.com, Helperpark 292, 9723 ZA Groningen, The Netherlands
* Authors to whom correspondence should be addressed.
Multimodal Technologies Interact. 2019, 3(3), 58; https://doi.org/10.3390/mti3030058
Received: 28 June 2019 / Revised: 15 July 2019 / Accepted: 26 July 2019 / Published: 31 July 2019
(This article belongs to the Special Issue Text Mining in Complex Domains)
Keyphrase extraction is an important part of natural language processing (NLP) research, although little of this research has been done in the domain of web pages. The World Wide Web contains billions of pages that are potentially interesting for various NLP tasks, yet it remains largely untouched in scientific research. Current research is often applied only to clean corpora, such as abstracts and articles from academic journals, or to sets of scraped texts from a single domain. However, textual data from web pages differs from normal text documents: it is structured using HTML elements and often consists of many small fragments. Furthermore, these elements are used in a highly inconsistent manner and are likely to contain noise. We evaluated the keyphrases extracted by several state-of-the-art extraction methods and found that these methods did not transfer well to web pages. We therefore propose WebEmbedRank, an adaptation of a recently proposed extraction method that can make robust use of structural information in web pages. We compared this novel method to baselines and state-of-the-art methods using a manually annotated dataset and found that WebEmbedRank achieved significant improvements over existing extraction methods on web pages.
Keywords: unsupervised keyphrase extraction; sequence embeddings; web pages; WebEmbedRank
MDPI and ACS Style

Haarman, T.; Zijlema, B.; Wiering, M. Unsupervised Keyphrase Extraction for Web Pages. Multimodal Technologies Interact. 2019, 3, 58.

