Data Mining Applied in Natural Language Processing

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Artificial Intelligence".

Deadline for manuscript submissions: closed (15 September 2024) | Viewed by 11896

Special Issue Editor

Dr. Ruifan Li
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
Interests: artificial intelligence and applications; vision and language

Special Issue Information

Dear Colleagues,

The objective of this Special Issue is to invite diverse submissions, fostering a collaborative effort to comprehensively understand the emerging opportunities and challenges in the realm of data mining applied to natural language processing. We aim to identify key tasks, evaluate the current state of the art, showcase inventive methodologies and ideas, introduce substantial real-world systems or applications, propose new datasets, and discuss future directions. Through this coordinated effort, we aspire to advance our understanding of the intricate interplay between data mining and natural language processing, paving the way for further advancements in the field.

In this Special Issue, original research articles and reviews are welcome. Research areas may include (but are not limited to) the following (in alphabetical order):

  • Computational Social Science and Social Media;
  • Dialogue and Interactive Systems;
  • Discourse and Pragmatics;
  • Information Extraction;
  • Interpretability and Analysis of Models for NLP;
  • Linguistic Theories, Cognitive Modeling, and Psycholinguistics;
  • Machine Learning for NLP;
  • Machine Translation and Multilinguality;
  • Named Entity Recognition and Text Classification;
  • Phonology, Morphology, and Word Segmentation;
  • Semantics: Lexical, Sentence-Level, Textual Inference, and Other Areas;
  • Sentiment Analysis and Opinion Mining;
  • Summarization;
  • Syntax: Tagging, Chunking, and Parsing;
  • Text Mining and Information Retrieval.

We look forward to receiving your contributions.

Technical Program Committee Member:

Dr. Yu Zhao, Southwestern University of Finance and Economics

Dr. Ruifan Li
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • data mining
  • natural language processing
  • text mining
  • sentiment analysis
  • machine translation
  • deep learning
  • named entity recognition
  • text classification
  • cross-language NLP

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (8 papers)

Research

17 pages, 302 KiB  
Article
Comparative Analysis of Graph Neural Networks and Transformers for Robust Fake News Detection: A Verification and Reimplementation Study
by Soveatin Kuntur, Maciej Krzywda, Anna Wróblewska, Marcin Paprzycki and Maria Ganzha
Electronics 2024, 13(23), 4784; https://doi.org/10.3390/electronics13234784 - 4 Dec 2024
Cited by 1 | Viewed by 2201
Abstract
This study compares Transformer-based models and Graph Neural Networks (GNNs) for fake news detection across three datasets: FakeNewsNet, ISOT, and WELFake. Transformer models (BERT, RoBERTa, GPT-2) demonstrated superior performance, achieving mean accuracies above 85% on FakeNewsNet and exceeding 98% on ISOT and WELFake. Specifically, RoBERTa achieved 86.16% accuracy on FakeNewsNet and 99.99% on ISOT, while GPT-2 reached 99.72% on WELFake. In contrast, GNNs (GCN, GraphSAGE, GIN, GAT) exhibited lower performance. GCN achieved 71% accuracy on FakeNewsNet but dropped to 53.30% on ISOT and 50.28% on WELFake, with F1 scores reflecting similar trends. Other GNNs, like GraphSAGE, showed even lower results, particularly on ISOT and WELFake, where performance hovered around 50%. Our findings indicate that while Transformers provide exceptional accuracy and reliability, GNNs offer potential efficiency benefits for resource-constrained scenarios despite their lower predictive performance. This study informs model selection for fake news detection tasks and encourages the exploration of hybrid approaches to balance accuracy and computational efficiency.
(This article belongs to the Special Issue Data Mining Applied in Natural Language Processing)

17 pages, 2057 KiB  
Article
Fake Review Detection Model Based on Comment Content and Review Behavior
by Pengfei Sun, Weihong Bi, Yifan Zhang, Qiuyu Wang, Feifei Kou, Tongwei Lu and Jinpeng Chen
Electronics 2024, 13(21), 4322; https://doi.org/10.3390/electronics13214322 - 4 Nov 2024
Viewed by 1840
Abstract
With the development of the Internet, services such as catering, beauty, accommodation, and entertainment can be reserved or consumed online. Consumers therefore increasingly rely on online information to choose merchants, products, and services, with reviews becoming a crucial factor in their decision making. However, the authenticity of reviews is highly contested in Internet-based life-service consumption. In recent years, due to the rapid growth of these industries, the detection of fake reviews has gained increasing attention. Fake reviews seriously mislead customers and damage the authenticity of online reviews. Various fake review classifiers have been developed that take into account the content of the reviews and the behavior involved in the reviews, such as rating and time. However, no previous research has considered the credibility of reviewers and merchants when identifying fake reviews. To improve the accuracy of existing fake review classification and detection methods, this study uses a comment text processing module to model the content of reviews, uses a reviewer behavior processing module and a reviewed merchant behavior processing module to model the review behavior sequences that imply reviewer credibility and merchant credibility, respectively, and finally merges these features for fake review classification. The experimental results show that, compared to other models, the model proposed in this paper improves the classification performance by simultaneously modeling the content of reviews and the credibility of reviewers and merchants.
(This article belongs to the Special Issue Data Mining Applied in Natural Language Processing)
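
The three-branch fusion described in this abstract can be sketched in a few lines of plain Python. Everything below is a hypothetical illustration, not the authors' code: the hashed bag-of-words text encoder, the rating-statistics behavior encoder, and the linear scorer are stand-ins for the paper's learned modules.

```python
def encode_text(tokens, dim=8):
    """Toy hashed bag-of-words vector standing in for the comment text module."""
    vec = [0.0] * dim
    for tok in tokens:
        vec[hash(tok) % dim] += 1.0
    return vec

def encode_behavior(ratings):
    """Summary statistics over a rating sequence, a stand-in credibility feature."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / n
    return [mean, var, float(min(ratings)), float(max(ratings))]

def fuse_and_score(text_vec, reviewer_vec, merchant_vec, weights):
    """Concatenate the three feature branches and apply a linear scorer."""
    features = text_vec + reviewer_vec + merchant_vec
    return sum(w * f for w, f in zip(weights, features))
```

The key design point is the concatenation step: review content and the two credibility-bearing behavior sequences enter the classifier as one joint feature vector rather than as separate decisions.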

12 pages, 491 KiB  
Article
Quantum-Inspired Fusion for Open-Domain Question Answering
by Ruixue Duan, Xin Liu, Zhigang Ding and Yangsen Zhang
Electronics 2024, 13(20), 4135; https://doi.org/10.3390/electronics13204135 - 21 Oct 2024
Viewed by 831
Abstract
Open-domain question-answering systems need models capable of referencing multiple passages simultaneously to generate accurate answers. The Rational Fusion-in-Decoder (RFiD) model focuses on differentiating between causal relationships and spurious features by utilizing the encoders of the Fusion-in-Decoder model. However, RFiD's reliance on partial token information limits its ability to determine whether the corresponding passage is a rationale for the question, potentially leading to inappropriate answers. To address this issue, we propose a Quantum-Inspired Fusion-in-Decoder (QFiD) model. Our approach introduces a Quantum Fusion Module (QFM) that maps single-dimensional hidden states into multi-dimensional ones, enabling the model to capture more comprehensive token information. The classical mixture method from quantum information theory is then used to fuse all of this information. Based on the fused information, the model can accurately predict the relationship between the question and passage. Experimental results on two prominent ODQA datasets, Natural Questions and TriviaQA, demonstrate that QFiD outperforms strong baselines in automatic evaluations.
(This article belongs to the Special Issue Data Mining Applied in Natural Language Processing)
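
The "classical mixture" that this abstract borrows from quantum information theory has a compact form: each vector is lifted to a density matrix |v⟩⟨v| and the collection is fused as the probability-weighted sum ρ = Σᵢ pᵢ|vᵢ⟩⟨vᵢ|. The sketch below shows only this standard construction, under the assumption that the paper applies it to token hidden states; it is not the authors' implementation.

```python
import math

def density_matrix(vec):
    """Outer product |v><v| of a unit-normalized vector."""
    norm = math.sqrt(sum(x * x for x in vec))
    v = [x / norm for x in vec]
    return [[a * b for b in v] for a in v]

def classical_mixture(vectors, probs):
    """rho = sum_i p_i |v_i><v_i|; trace(rho) == 1 when probs sum to 1."""
    d = len(vectors[0])
    rho = [[0.0] * d for _ in range(d)]
    for vec, p in zip(vectors, probs):
        dm = density_matrix(vec)
        for i in range(d):
            for j in range(d):
                rho[i][j] += p * dm[i][j]
    return rho
```

Because each |vᵢ⟩⟨vᵢ| is rank one with unit trace, the mixture is a valid density matrix, so it aggregates all token directions instead of collapsing them to a single vector.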

13 pages, 933 KiB  
Article
Dynamic Assessment-Based Curriculum Learning Method for Chinese Grammatical Error Correction
by Ruixue Duan, Zhiyuan Ma, Yangsen Zhang, Zhigang Ding and Xiulei Liu
Electronics 2024, 13(20), 4079; https://doi.org/10.3390/electronics13204079 - 17 Oct 2024
Viewed by 1065
Abstract
Current mainstream Chinese grammatical error correction methods rely on deep neural network models, which require a large amount of high-quality data for training. However, existing Chinese grammatical error correction corpora have low annotation quality and high noise levels, leading to a low generalization ability of the models and difficulty in handling complex sentences. To address this issue, this paper proposes a dynamic assessment-based curriculum learning method for Chinese grammatical error correction. The proposed approach focuses on two key components: defining the difficulty of training samples and devising an effective training strategy. In the difficulty assessment phase, we enhance the accuracy of the curriculum sequence by dynamically updating the evaluation model. During the training strategy phase, a multi-stage dynamic progressive approach is employed to select training samples of varying difficulty levels, which helps prevent the model from prematurely converging to local optima and enhances the overall training effectiveness. Experimental results on the MuCGEC and NLPCC 2018 Chinese grammatical error correction datasets show that the proposed curriculum learning method significantly improves the model’s error correction performance, with F0.5 scores increasing by 0.9 and 1.05, respectively, validating the method’s effectiveness.
(This article belongs to the Special Issue Data Mining Applied in Natural Language Processing)
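
The "multi-stage progressive" training strategy this abstract describes follows a common curriculum-learning pattern, sketched below under the assumption that each stage admits a progressively larger, easier-first fraction of the data. The function and its schedule are hypothetical illustrations, not the paper's exact algorithm, which additionally re-scores difficulty with a dynamically updated evaluation model.

```python
import random

def curriculum_stages(samples, difficulty, n_stages=3, seed=0):
    """Sort samples by a difficulty score and reveal them progressively:
    stage k trains on the easiest k/n_stages fraction, shuffled in-stage."""
    rng = random.Random(seed)
    ordered = sorted(samples, key=difficulty)
    stages = []
    for k in range(1, n_stages + 1):
        pool = ordered[: max(1, len(ordered) * k // n_stages)]
        batch = list(pool)   # copy so the sorted order is preserved
        rng.shuffle(batch)
        stages.append(batch)
    return stages
```

Early stages expose the model only to clean, easy samples, which is what discourages premature convergence to local optima on noisy corpora.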

17 pages, 1076 KiB  
Article
Prompt-Based End-to-End Cross-Domain Dialogue State Tracking
by Hengtong Lu, Lucen Zhong, Huixing Jiang, Wei Chen, Caixia Yuan and Xiaojie Wang
Electronics 2024, 13(18), 3587; https://doi.org/10.3390/electronics13183587 - 10 Sep 2024
Viewed by 753
Abstract
Cross-domain dialogue state tracking (DST) focuses on using labeled data from source domains to train a DST model for target domains. It is of great significance for transferring a dialogue system into new domains. Most existing cross-domain DST models track each slot independently, which leads to poor performance because the correlation among different slots is not considered, as well as low efficiency of training and inference. This paper therefore proposes a prompt-based end-to-end cross-domain DST method for efficiently tracking all slots simultaneously. A dynamic prompt template shuffle method is proposed to alleviate the bias of the slot order, and a dynamic prompt template sampling method is proposed to alleviate the bias of the slot number. The experimental results on the MultiWOZ 2.0 and MultiWOZ 2.1 datasets show that our approach consistently outperforms the state-of-the-art baselines in all target domains and improves both training and inference efficiency by at least 5 times.
(This article belongs to the Special Issue Data Mining Applied in Natural Language Processing)

18 pages, 1584 KiB  
Article
Automatic Generation and Evaluation of French-Style Chinese Modern Poetry
by Li Zuo, Dengke Zhang, Yuhai Zhao and Guoren Wang
Electronics 2024, 13(13), 2659; https://doi.org/10.3390/electronics13132659 - 6 Jul 2024
Viewed by 1203
Abstract
Literature, including poetry, has a strong cultural imprint and regional color, and natural language itself is part of a poem's style. It is interesting to attempt to use one language to present poetry in another language's style. In this study, we therefore propose a method to fine-tune a pre-trained model in a targeted manner to automatically generate French-style modern Chinese poetry and conduct a multi-faceted evaluation of the generated results. On a five-point scale based on human evaluation, judges assigned scores between 3.29 and 3.93 in seven dimensions, which reached 80.8–93.6% of the scores of the Chinese versions of real French poetry in these dimensions. In terms of high-frequency poetic imagery, the consistency of the top 30–50 high-frequency poetic images between the poetry generated by the fine-tuned model and the French poetry reached 50–60%. In terms of syntactic features, compared with the poems generated by the baseline model, the distribution frequencies of three special types of words that appear relatively frequently in French poetry increased by 12.95%, 15.81%, and 284.44% per 1000 Chinese characters in the poetry generated by the fine-tuned model. The human evaluation, poetic image distribution, and syntactic feature statistics show that the targeted fine-tuned model helps transfer language style: it can successfully generate modern Chinese poetry in a French style.
(This article belongs to the Special Issue Data Mining Applied in Natural Language Processing)

20 pages, 728 KiB  
Article
Semantic Augmentation in Chinese Adversarial Corpus for Discourse Relation Recognition Based on Internal Semantic Elements
by Zheng Hua, Ruixia Yang, Yanbin Feng and Xiaojun Yin
Electronics 2024, 13(10), 1944; https://doi.org/10.3390/electronics13101944 - 15 May 2024
Viewed by 1305
Abstract
This paper proposes incorporating linguistic semantic information into discourse relation recognition and constructing a Semantic Augmented Chinese Discourse Corpus (SACA) comprising 9546 adversative complex sentences. For adversative complex sentences, we suggest a quadruple (P, Q, R, Qβ) representing the internal semantic elements, where the semantic opposition between Q and Qβ forms the basis of the adversative relationship, P denotes the premise, and R represents the adversative reason. The overall annotation approach of this corpus follows the Penn Discourse Treebank (PDTB), except for the classification of senses, for which we combined insights from the Chinese Discourse Treebank (CDTB) and obtained eight sense categories for Chinese adversative complex sentences. Based on this corpus, we explore the relationship between sense classification and internal semantic elements within our newly proposed Chinese Adversative Discourse Relation Recognition (CADRR) task. Leveraging deep learning techniques, we constructed various classification models, including one that utilizes internal semantic element features, demonstrating their effectiveness and the applicability of our SACA corpus. Compared with pre-trained models, our model incorporates internal semantic element information to achieve state-of-the-art performance.
(This article belongs to the Special Issue Data Mining Applied in Natural Language Processing)
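
The (P, Q, R, Qβ) annotation scheme amounts to a small record per sentence. The dataclass below is a hypothetical illustration of that structure, with an invented English example; the corpus itself annotates Chinese sentences.

```python
from dataclasses import dataclass

@dataclass
class AdversativeQuadruple:
    """One annotated adversative complex sentence, following the
    quadruple (P, Q, R, Q_beta) described above: the semantic opposition
    between Q and Q_beta carries the adversative relation."""
    premise: str       # P: the premise
    clause: str        # Q: the stated clause
    reason: str        # R: the adversative reason
    expectation: str   # Q_beta: the implicit expectation that Q opposes
```

For instance, in "He studied all night, but he failed the exam because the questions were unusually hard", P is the studying, Q is the failure, R is the hard questions, and Qβ is the implicit expectation that he would pass.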

20 pages, 2744 KiB  
Article
CogCol: Code Graph-Based Contrastive Learning Model for Code Summarization
by Yucen Shi, Ying Yin, Mingqian Yu and Liangyu Chu
Electronics 2024, 13(10), 1816; https://doi.org/10.3390/electronics13101816 - 8 May 2024
Viewed by 1346
Abstract
Summarizing source code in natural language aims to help developers better understand existing code, making software development more efficient. Since source code is highly structured, recent research uses code structure information such as the Abstract Syntax Tree (AST) to enhance structural understanding rather than treating summarization as a plain translation task. However, an AST can only represent the syntactic relationships within a code snippet; it cannot reflect high-level relationships such as the control and data dependencies in the program dependence graph. Moreover, prior work treats the AST as the unique structural representation of a code snippet corresponding to one summarization, so models are easily affected by simple perturbations because they lack an understanding of code with similar structure. To handle these problems, we build CogCol, a Code graph-based Contrastive learning model. CogCol is a Transformer-based model that converts code graphs into unique sequences to enhance the model's structure learning. In detail, CogCol uses supervised contrastive learning, building several kinds of code graphs as positive samples to enhance the structural representation of code snippets and the model's generalizability. Moreover, experiments on a widely used open-source dataset show that CogCol significantly improves over state-of-the-art code summarization models under METEOR, BLEU, and ROUGE.
(This article belongs to the Special Issue Data Mining Applied in Natural Language Processing)
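
The supervised contrastive objective this abstract relies on has a standard form: each anchor is pulled toward all same-label samples (here, different graph views of the same snippet) and pushed away from the rest. The pure-Python sketch below shows that standard loss, not CogCol's implementation; the temperature and the dot-product similarity are conventional choices, assumed rather than taken from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def supcon_loss(embeddings, labels, tau=0.5):
    """Supervised contrastive loss: for each anchor i, average
    -log( exp(z_i.z_p / tau) / sum_{a != i} exp(z_i.z_a / tau) )
    over its same-label positives p."""
    total, count = 0.0, 0
    n = len(embeddings)
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # anchors without positives contribute nothing
        denom = sum(math.exp(dot(embeddings[i], embeddings[a]) / tau)
                    for a in range(n) if a != i)
        loss_i = sum(-math.log(math.exp(dot(embeddings[i], embeddings[p]) / tau)
                               / denom)
                     for p in positives)
        total += loss_i / len(positives)
        count += 1
    return total / count
```

The loss is minimized when views of the same snippet cluster together, which is exactly the perturbation robustness the paper seeks from its multiple code graphs.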
