Submit to Special Issue Submit Abstract to Special Issue Review for Information Propose a Special Issue

Journal Menu

Journal Browser

► Journal Browser

Information Extraction and Language Discourse Processing

Print Special Issue Flyer
Special Issue Editors
Special Issue Information
Keywords
Benefits of Publishing in a Special Issue
Published Papers

A special issue of Information (ISSN 2078-2489). This special issue belongs to the section "Artificial Intelligence".

Deadline for manuscript submissions: 31 January 2026 | Viewed by 36416

Share This Special Issue

Special Issue Editors

Dr. Jennifer D'Souza

E-Mail Website
Guest Editor

TIB–Leibniz Information Centre for Science and Technology, 30167 Hannover, Germany
Interests: information extraction; text mining; natural language processing; knowledge graphs

Prof. Dr. Chengzhi Zhang

E-Mail Website
Guest Editor

Professor, School of Economics and Management, Nanjing University of Science and Technology (NJUST), No. 200, Xiaolingwei, 210094 Nanjing, China
Interests: scientific text mining; knowledge entity extraction and evaluation; social media mining

Special Issue Information

Dear Colleagues,

Information extraction (IE) plays an increasingly important and pervasive role in today’s era of digitalized communication media based on the Semantic Web. E.g., search engine results, as snippets, are slowly replaced by “rich snippets”; there is an interest in converting scholarly publications to structured records available in such downstream IT applications as Leaderboards, etc. IE is thus the task of automatically extracting structured information from unstructured and/or semi-structured electronically represented documents. In most cases, this activity concerns processing of human language texts by means of natural language processing (NLP). The automatic extraction of information from unstructured sources has opened up new avenues for querying, organizing, and analyzing data by drawing upon the clean semantics of structured databases and the abundance of unstructured data.

Apart from extrinsic models of IE, research in linguistics and computational linguistics have long pointed out that text is not just simple sequence of clauses and sentences but rather follows a highly elaborated structure formalized within discourse. The framework used for discourse analysis has long since been rhetorical structure theory (RST). Within a well-written text, no unit of the text is completely isolated; interpretation requires understanding the unit’s relation with the context. Research in discourse analysis aims to unmask such relations in the text, which is helpful for many downstream applications such as summarization, information retrieval, and question answering.

This Special Issue seeks novel research reports on the spectrum that blends information extraction and language discourse processing research in diverse communities. The editors welcome submissions along various dimensions derived from the nature of the extraction task, the advanced neural techniques used for extraction, the variety of input resources exploited, and the type of output produced. Quantitative, qualitative, and mixed methods studies are welcome, as are case studies and experience reports if they describe an impactful application at a scale that delivers useful lessons to the journal readership.

Topics of interest include (but are not limited to):

Knowledge base population with discourse-centric information extraction (IE)
Coreference resolution and its impact on discourse-centric IE
Relationship extraction leveraging linguistic discourse
Template filling
Impact of pragmatics or rhetorics on information extraction
Discourse-centric IE at scale
Intelligent and novel assessment models of discourse-centric IE
Survey of discourse-centric IE in natural language processing (NLP)
Challenges implementing discourse-centric IE in real-world scenarios
Modeling domains using discourse-centric IE
Human–AI hybrid systems for learning discourse and IE
Application of discourse-centric IE

Dr. Jennifer D'Souza
Prof. Dr. Chengzhi Zhang
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Information is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1800 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

coherence
topic focus
information structure
conversation structure
discourse processing
scholarly discourse processing
anaphora resolution

Benefits of Publishing in a Special Issue

Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (11 papers)

Download All Papers

Order results

Result details

Show export options Show export options

Select all

Export citation of selected articles as:

Research

22 pages, 493 KiB

Open AccessArticle

Improving Performance of Automatic Keyword Extraction (AKE) Methods Using PoS Tagging and Enhanced Semantic-Awareness

by Enes Altuncu, Jason R. C. Nurse, Yang Xu, Jie Guo and Shujun Li

Information 2025, 16(7), 601; https://doi.org/10.3390/info16070601 - 13 Jul 2025

Viewed by 175

Abstract

Automatic keyword extraction (AKE) has gained more importance with the increasing amount of digital textual data that modern computing systems process. It has various applications in information retrieval (IR) and natural language processing (NLP), including text summarisation, topic analysis and document indexing. This paper proposes a simple but effective post-processing-based universal approach to improving the performance of any AKE methods, via an enhanced level of semantic-awareness supported by PoS tagging. To demonstrate the performance of the proposed approach, we considered word types retrieved from a PoS tagging step and two representative sources of semantic information—specialised terms defined in one or more context-dependent thesauri, and named entities in Wikipedia. The above three steps can be simply added to the end of any AKE methods as part of a post-processor, which simply re-evaluates all candidate keywords following some context-specific and semantic-aware criteria. For five state-of-the-art (SOTA) AKE methods, our experimental results with 17 selected datasets showed that the proposed approach improved their performances both consistently (up to 100% in terms of improved cases) and significantly (between 10.2% and 53.8%, with an average of 25.8%, in terms of F1-score and across all five methods), especially when all the three enhancement steps are used. Our results have profound implications considering the fact that our proposed approach can be easily applied to any AKE method with the standard output (candidate keywords and scores) and the ease to further extend it. Full article

(This article belongs to the Special Issue Information Extraction and Language Discourse Processing)

► Show Figures

Figure 1

21 pages, 3129 KiB

Open AccessArticle

Optimizing Contextonym Analysis for Terminological Definition Writing

by Antonio San Martín

Information 2025, 16(4), 257; https://doi.org/10.3390/info16040257 - 22 Mar 2025

Viewed by 474

Abstract

To write terminological definitions that meet user needs, terminologists require methods that help them effectively select the most relevant information to be included in a definition. In this sense, a corpus technique that can be useful for the definition of terms is contextonym analysis. It involves the quantitative analysis of the other terms with which the term to be defined usually co-occurs (i.e., its contextonyms), regardless of any syntactic or semantic relationship. This paper presents a study conducted to determine the optimal configuration for extracting contextonyms for the creation of terminological definitions. More specifically, this study aims to create a word sketch column in Sketch Engine that lists contextonyms, offering a user-friendly method for their extraction. This study has identified that the optimal context window for extracting contextonyms in the form of word sketches in English to inform definition writing is 50 tokens, and that these contextonyms should be ranked by frequency. Full article

(This article belongs to the Special Issue Information Extraction and Language Discourse Processing)

► Show Figures

Graphical abstract

30 pages, 1605 KiB

Open AccessArticle

From Misinformation to Insight: Machine Learning Strategies for Fake News Detection

by Despoina Mouratidis, Andreas Kanavos and Katia Kermanidis

Information 2025, 16(3), 189; https://doi.org/10.3390/info16030189 - 28 Feb 2025

Cited by 1 | Viewed by 5404

Abstract

In the digital age, the rapid proliferation of misinformation and disinformation poses a critical challenge to societal trust and the integrity of public discourse. This study presents a comprehensive machine learning framework for fake news detection, integrating advanced natural language processing techniques and deep learning architectures. We rigorously evaluate a diverse set of detection models across multiple content types, including social media posts, news articles, and user-generated comments. Our approach systematically compares traditional machine learning classifiers (Naïve Bayes, SVMs, Random Forest) with state-of-the-art deep learning models, such as CNNs, LSTMs, and BERT, while incorporating optimized vectorization techniques, including TF-IDF, Word2Vec, and contextual embeddings. Through extensive experimentation across multiple datasets, our results demonstrate that BERT-based models consistently achieve superior performance, significantly improving detection accuracy in complex misinformation scenarios. Furthermore, we extend the evaluation beyond conventional accuracy metrics by incorporating the Matthews Correlation Coefficient (MCC) and Receiver Operating Characteristic–Area Under the Curve (ROC–AUC), ensuring a robust and interpretable assessment of model efficacy. Beyond technical advancements, we explore the ethical implications of automated misinformation detection, addressing concerns related to censorship, algorithmic bias, and the trade-off between content moderation and freedom of expression. This research not only advances the methodological landscape of fake news detection but also contributes to the broader discourse on safeguarding democratic values, media integrity, and responsible AI deployment in digital environments. Full article

(This article belongs to the Special Issue Information Extraction and Language Discourse Processing)

► Show Figures

Graphical abstract

17 pages, 766 KiB

Open AccessArticle

Semi-Supervised Relation Extraction Corpus Construction and Models Creation for Under-Resourced Languages: A Use Case for Slovene

by Timotej Knez, Miha Štravs and Slavko Žitnik

Information 2025, 16(2), 143; https://doi.org/10.3390/info16020143 - 15 Feb 2025

Cited by 1 | Viewed by 600

Abstract

The goal of relation extraction is to recognize head and tail entities in a document and determine a relation between them. While a lot of progress was made in solving automated relation extraction in widely used languages such as English, the use of these methods for under-resourced languages and domains is limited due to the lack of training data. In this work, we present a pipeline using distant supervision for constructing a relation extraction corpus in an arbitrary language. The corpus construction combines Wikipedia documents in the target language with relations in the WikiData knowledge graph. We demonstrate the process by constructing a new corpus for relation extraction in the Slovene language. Our corpus captures 20 unique relation types. The final corpus contains 811,032 relations annotated in 244,437 sentences. We use the corpus to train models using three architectures and evaluate them on the task of Slovene relation extraction. We achieve comparable performance to approaches on English data. Full article

(This article belongs to the Special Issue Information Extraction and Language Discourse Processing)

► Show Figures

Figure 1

30 pages, 667 KiB

Open AccessArticle

Large Language Models for Electronic Health Record De-Identification in English and German

by Samuel Sousa, Michael Jantscher, Mark Kröll and Roman Kern

Information 2025, 16(2), 112; https://doi.org/10.3390/info16020112 - 6 Feb 2025

Cited by 1 | Viewed by 2103

Abstract

Electronic health record (EHR) de-identification is crucial for publishing or sharing medical data without violating the patient’s privacy. Protected health information (PHI) is abundant in EHRs, and privacy regulations worldwide mandate de-identification before downstream tasks are performed. The ever-growing data generation in healthcare and the advent of generative artificial intelligence have increased the demand for de-identified EHRs and highlighted privacy issues with large language models (LLMs), especially data transmission to cloud-based LLMs. In this study, we benchmark ten LLMs for de-identifying EHRs in English and German. We then compare de-identification performance for in-context learning and full model fine-tuning and analyze the limitations of LLMs for this task. Our experimental evaluation shows that LLMs effectively de-identify EHRs in both languages. Moreover, in-context learning with a one-shot setting boosts de-identification performance without the costly full fine-tuning of the LLMs. Full article

(This article belongs to the Special Issue Information Extraction and Language Discourse Processing)

► Show Figures

Figure 1

25 pages, 514 KiB

Open AccessArticle

Bridging Linguistic Gaps: Developing a Greek Text Simplification Dataset

by Leonidas Agathos, Andreas Avgoustis, Xristiana Kryelesi, Aikaterini Makridou, Ilias Tzanis, Despoina Mouratidis, Katia Lida Kermanidis and Andreas Kanavos

Information 2024, 15(8), 500; https://doi.org/10.3390/info15080500 - 20 Aug 2024

Cited by 1 | Viewed by 1480

Abstract

Text simplification is crucial in bridging the comprehension gap in today’s information-rich environment. Despite advancements in English text simplification, languages with intricate grammatical structures, such as Greek, often remain under-explored. The complexity of Greek grammar, characterized by its flexible syntactic ordering, presents unique challenges that hinder comprehension for native speakers, learners, tourists, and international students. This paper introduces a comprehensive dataset for Greek text simplification, containing over 7500 sentences across diverse topics such as history, science, and culture, tailored to address these challenges. We outline the methodology for compiling this dataset, including a collection of texts from Greek Wikipedia, their annotation with simplified versions, and the establishment of robust evaluation metrics. Additionally, the paper details the implementation of quality control measures and the application of machine learning techniques to analyze text complexity. Our experimental results demonstrate the dataset’s initial effectiveness and potential in reducing linguistic barriers and enhancing communication, with initial machine learning models showing promising directions for future improvements in classifying text complexity. The development of this dataset marks a significant step toward improving accessibility and comprehension for a broad audience of Greek speakers and learners, fostering a more inclusive society. Full article

(This article belongs to the Special Issue Information Extraction and Language Discourse Processing)

► Show Figures

Figure 1

19 pages, 725 KiB

Open AccessArticle

Evaluating Large Language Models for Structured Science Summarization in the Open Research Knowledge Graph

by Vladyslav Nechakhin, Jennifer D’Souza and Steffen Eger

Information 2024, 15(6), 328; https://doi.org/10.3390/info15060328 - 5 Jun 2024

Cited by 4 | Viewed by 4580

Abstract

Structured science summaries or research contributions using properties or dimensions beyond traditional keywords enhance science findability. Current methods, such as those used by the Open Research Knowledge Graph (ORKG), involve manually curating properties to describe research papers’ contributions in a structured manner, but this is labor-intensive and inconsistent among human domain-expert curators. We propose using Large Language Models (LLMs) to automatically suggest these properties. However, it is essential to assess the readiness of LLMs like GPT-3.5, Llama 2, and Mistral for this task before their application. Our study performs a comprehensive comparative analysis between the ORKG’s manually curated properties and those generated by the aforementioned state-of-the-art LLMs. We evaluate LLM performance from four unique perspectives: semantic alignment with and deviation from ORKG properties, fine-grained property mapping accuracy, SciNCL embedding-based cosine similarity, and expert surveys comparing manual annotations with LLM outputs. These evaluations occur within a multidisciplinary science setting. Overall, LLMs show potential as recommendation systems for structuring science, but further fine-tuning is recommended to improve their alignment with scientific tasks and mimicry of human expertise. Full article

(This article belongs to the Special Issue Information Extraction and Language Discourse Processing)

► Show Figures

Figure 1

22 pages, 461 KiB

Open AccessArticle

The Power of Context: A Novel Hybrid Context-Aware Fake News Detection Approach

by Jawaher Alghamdi, Yuqing Lin and Suhuai Luo

Information 2024, 15(3), 122; https://doi.org/10.3390/info15030122 - 21 Feb 2024

Cited by 5 | Viewed by 3227

Abstract

The detection of fake news has emerged as a crucial area of research due to its potential impact on society. In this study, we propose a robust methodology for identifying fake news by leveraging diverse aspects of language representation and incorporating auxiliary information. Our approach is based on the utilisation of Bidirectional Encoder Representations from Transformers (BERT) to capture contextualised semantic knowledge. Additionally, we employ a multichannel Convolutional Neural Network (mCNN) integrated with stacked Bidirectional Gated Recurrent Units (sBiGRU) to jointly learn multi-aspect language representations. This enables our model to effectively identify valuable clues from news content while simultaneously incorporating content- and context-based cues, such as user posting behaviour, to enhance the detection of fake news. Through extensive experimentation on four widely used real-world datasets, our proposed framework demonstrates superior performance (↑3.59% (PolitiFact), ↑6.8% (GossipCop), ↑2.96% (FA-KES), and ↑12.51% (LIAR), considering both content-based features and additional auxiliary information) compared to existing state-of-the-art approaches, establishing its effectiveness in the challenging task of fake news detection. Full article

(This article belongs to the Special Issue Information Extraction and Language Discourse Processing)

► Show Figures

Figure 1

26 pages, 1592 KiB

Open AccessArticle

FinChain-BERT: A High-Accuracy Automatic Fraud Detection Model Based on NLP Methods for Financial Scenarios

by Xinze Yang, Chunkai Zhang, Yizhi Sun, Kairui Pang, Luru Jing, Shiyun Wa and Chunli Lv

Information 2023, 14(9), 499; https://doi.org/10.3390/info14090499 - 12 Sep 2023

Cited by 14 | Viewed by 8716

Abstract

This research primarily explores the application of Natural Language Processing (NLP) technology in precision financial fraud detection, with a particular focus on the implementation and optimization of the FinChain-BERT model. Firstly, the FinChain-BERT model has been successfully employed for financial fraud detection tasks, improving the capability of handling complex financial text information through deep learning techniques. Secondly, novel attempts have been made in the selection of loss functions, with a comparison conducted between negative log-likelihood function and Keywords Loss Function. The results indicated that the Keywords Loss Function outperforms the negative log-likelihood function when applied to the FinChain-BERT model. Experimental results validated the efficacy of the FinChain-BERT model and its optimization measures. Whether in the selection of loss functions or the application of lightweight technology, the FinChain-BERT model demonstrated superior performance. The utilization of Keywords Loss Function resulted in a model achieving 0.97 in terms of accuracy, recall, and precision. Simultaneously, the model size was successfully reduced to 43 MB through the application of integer distillation technology, which holds significant importance for environments with limited computational resources. In conclusion, this research makes a crucial contribution to the application of NLP in financial fraud detection and provides a useful reference for future studies. Full article

(This article belongs to the Special Issue Information Extraction and Language Discourse Processing)

► Show Figures

Figure 1

18 pages, 1431 KiB

Open AccessArticle

Exploring a Multi-Layered Cross-Genre Corpus of Document-Level Semantic Relations

by Gregor Williamson, Angela Cao, Yingying Chen, Yuxin Ji, Liyan Xu and Jinho D. Choi

Information 2023, 14(8), 431; https://doi.org/10.3390/info14080431 - 1 Aug 2023

Cited by 1 | Viewed by 1616

Abstract

This paper introduces a multi-layered cross-genre corpus, annotated for coreference resolution, causal relations, and temporal relations, comprising a variety of genres, from news articles and children’s stories to Reddit posts. Our results reveal distinctive genre-specific characteristics at each layer of annotation, highlighting unique challenges for both annotators and machine learning models. Children’s stories feature linear temporal structures and clear causal relations. In contrast, news articles employ non-linear temporal sequences with minimal use of explicit causal or conditional language and few first-person pronouns. Lastly, Reddit posts are author-centered explanations of ongoing situations, with occasional meta-textual reference. Our annotation schemes are adapted from existing work to better suit a broader range of text types. We argue that our multi-layered cross-genre corpus not only reveals genre-specific semantic characteristics but also indicates a rich contextual interplay between the various layers of semantic information. Our MLCG corpus is shared under the open-source Apache 2.0 license. Full article

(This article belongs to the Special Issue Information Extraction and Language Discourse Processing)

► Show Figures

Figure 1

21 pages, 339 KiB

Open AccessArticle

Extracting Narrative Patterns in Different Textual Genres: A Multilevel Feature Discourse Analysis

by María Miró Maestre, Marta Vicente, Elena Lloret and Armando Suárez Cueto

Information 2023, 14(1), 28; https://doi.org/10.3390/info14010028 - 31 Dec 2022

Cited by 1 | Viewed by 4033

Abstract

We present a data-driven approach to discover and extract patterns in textual genres with the aim of identifying whether there is an interesting variation of linguistic features among different narrative genres depending on their respective communicative purposes. We want to achieve this goal by performing a multilevel discourse analysis according to (1) the type of feature studied (shallow, syntactic, semantic, and discourse-related); (2) the texts at a document level; and (3) the textual genres of news, reviews, and children’s tales. To accomplish this, several corpora from the three textual genres were gathered from different sources to ensure a heterogeneous representation, paying attention to the presence and frequency of a series of features extracted with computational tools. This deep analysis aims at obtaining more detailed knowledge of the different linguistic phenomena that directly shape each of the genres included in the study, therefore showing the particularities that make them be considered as individual genres but also comprise them inside the narrative typology. The findings suggest that this type of multilevel linguistic analysis could be of great help for areas of research within natural language processing such as computational narratology, as they allow a better understanding of the fundamental features that define each genre and its communicative purpose. Likewise, this approach could also boost the creation of more consistent automatic story generation tools in areas of language generation. Full article

(This article belongs to the Special Issue Information Extraction and Language Discourse Processing)

► Show Figures

Journal Menu

Journal Browser

Information Extraction and Language Discourse Processing

Share This Special Issue

Special Issue Editors

Special Issue Information

Keywords

Benefits of Publishing in a Special Issue

Published Papers (11 papers)

Research

Further Information

Guidelines

MDPI Initiatives

Follow MDPI