Data and Text Mining: New Approaches, Achievements and Applications

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (20 March 2025) | Viewed by 23898

Special Issue Editor


Dr. Ayoub Bagheri
Guest Editor
Department of Methodology and Statistics, Faculty of Social and Behavioral Sciences, Utrecht University, 3584 CH Utrecht, Netherlands
Interests: applied data science; social/human data science; computational social science; data mining; text mining; natural language processing; statistical learning; machine learning; deep learning; big data analysis

Special Issue Information

Dear Colleagues,

In today's data-driven world, the field of data and text mining has emerged as a central domain, addressing innovative techniques for analysing and systematically extracting valuable insights, as well as managing large and complex datasets that exceed the capabilities of traditional data processing techniques. The advent of big data has ushered in a transformative era, and its profound impact can be seen across multiple sectors, including social sciences, healthcare, international development, education, and beyond. Furthermore, as we move further into the realm of text mining, we are witnessing remarkable advances in natural language processing (NLP). These advances enable us to unravel the intricate tapestry of human language, opening the door to a wealth of unexplored knowledge and opportunities for discovery. In this Special Issue, we therefore explore this dynamic convergence of data and text mining.

We look forward to your contributions to this Special Issue.

Dr. Ayoub Bagheri
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, proceed to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • data mining
  • text mining
  • big data
  • natural language processing (NLP)
  • computational social sciences
  • knowledge discovery
  • statistical learning
  • machine learning
  • data analysis
  • information retrieval

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information is available in MDPI's Special Issue policies.

Published Papers (9 papers)

Research

24 pages, 1200 KiB  
Article
Exploring Core Knowledge in Interdisciplinary Research: Insights from Topic Modeling Analysis
by Shuangyan Wu, Mixin Lin, Mengxiao Ji and Ting Wang
Appl. Sci. 2024, 14(21), 10054; https://doi.org/10.3390/app142110054 - 4 Nov 2024
Cited by 1 | Viewed by 3786
Abstract
Although interdisciplinary research has garnered extensive attention in academia, its core knowledge structure has yet to be systematically explored. To address this gap, this study aims to uncover the underlying core knowledge topics within interdisciplinary research, enabling researchers to gain a deeper understanding of the knowledge framework, improve research efficiency, and offer insights for future inquiries. Based on the Web of Science (WoS) database, this study collected 153 highly cited papers and employed the LDA topic model to identify latent topics and extract the knowledge structure within interdisciplinary research. The findings indicate that the core knowledge topics of interdisciplinary research can be categorized into four major areas: the knowledge framework and social impact of interdisciplinary research, multidisciplinary approaches in cancer treatment and patient care, Covid-19 multidisciplinary care and rehabilitation, and multidisciplinary AI and optimization in industrial applications. Moreover, the study reveals that AI-related interdisciplinary research topics are rapidly emerging. Through an in-depth analysis of these topics, the study discusses potential future directions for interdisciplinary research, including the cultivation and development of interdisciplinary talent, evaluation systems and policy support for interdisciplinary research, international cooperation and interdisciplinary globalization, and AI and interdisciplinary research optimization. This study not only uncovers the core knowledge structure of interdisciplinary research but also demonstrates the effectiveness of the LDA topic model as a data mining tool for revealing key topics and trends, providing practical tools for future research. However, this study has two main limitations: the time lag of highly cited papers and the dynamic evolution of interdisciplinary research. 
Future research should address these limitations to further enhance the understanding of interdisciplinary research.
(This article belongs to the Special Issue Data and Text Mining: New Approaches, Achievements and Applications)

14 pages, 5197 KiB  
Article
Artificial Intelligence Approach to Business Process Re-Engineering the Information Flow of Warehouse Shipping Orders: An Italian Case Study
by Matteo Merli, Filippo Emanuele Ciarapica, Kishore Chalakkal Varghese and Maurizio Bevilacqua
Appl. Sci. 2024, 14(21), 9894; https://doi.org/10.3390/app14219894 - 29 Oct 2024
Viewed by 1262
Abstract
Artificial intelligence is significantly reshaping the work environment. It presents a novel challenge and also creates numerous opportunities in various sectors. This study examines a case of artificial intelligence use in the logistics sector, specifically the implementation of text recognition algorithms that perform tasks typically carried out by human operators. The implementation under analysis was carried out at an Italian company in Tolentino, in the province of Macerata. The tool autonomously reads shipping notes, with the aim of simplifying the flow of information in the goods receipt phase and reducing the workload of warehouse staff. As illustrated in the following chapters, after several rounds of training, technical improvements, and structural changes to the shipping notes, the information flow is greatly simplified, significantly facilitating this logistics phase.
(This article belongs to the Special Issue Data and Text Mining: New Approaches, Achievements and Applications)

14 pages, 4674 KiB  
Article
Machine Learning Accelerated Design of High-Temperature Ternary and Quaternary Nitride Superconductors
by Md Tohidul Islam, Qinrui Liu and Scott Broderick
Appl. Sci. 2024, 14(20), 9196; https://doi.org/10.3390/app14209196 - 10 Oct 2024
Cited by 1 | Viewed by 1212
Abstract
The recent advancements in the field of superconductivity have been significantly driven by the development of nitride superconductors, particularly niobium nitride (NbN). Multicomponent nitrides offer a promising platform for achieving high-temperature superconductivity. Beyond their high superconducting transition temperature (Tc), niobium-based compounds are notable for their superior superconducting and mechanical properties, making them suitable for a wide range of device applications. In this work, machine learning is used to identify ternary and quaternary nitrides that can surpass the properties of binary NbN. Specifically, Nb0.35Ta0.23Ti0.42N shows an 84.95% improvement in Tc compared to base NbN, while the ternary composition Nb0.55Ti0.45N exhibits a 17.29% improvement. This research provides a valuable reference for the further exploration of high-temperature superconductors in diversified ternary and quaternary compositions.
(This article belongs to the Special Issue Data and Text Mining: New Approaches, Achievements and Applications)

17 pages, 1345 KiB  
Article
Automating Systematic Literature Reviews with Retrieval-Augmented Generation: A Comprehensive Overview
by Binglan Han, Teo Susnjak and Anuradha Mathrani
Appl. Sci. 2024, 14(19), 9103; https://doi.org/10.3390/app14199103 - 9 Oct 2024
Cited by 3 | Viewed by 9096
Abstract
This study examines Retrieval-Augmented Generation (RAG) in large language models (LLMs) and their significant application for undertaking systematic literature reviews (SLRs). RAG-based LLMs can potentially automate tasks like data extraction, summarization, and trend identification. However, while LLMs are exceptionally proficient in generating human-like text and interpreting complex linguistic nuances, their dependence on static, pre-trained knowledge can result in inaccuracies and hallucinations. RAG mitigates these limitations by integrating LLMs’ generative capabilities with the precision of real-time information retrieval. We review in detail the three key processes of the RAG framework: retrieval, augmentation, and generation. We then discuss applications of RAG-based LLMs to SLR automation and highlight future research topics, including integration of domain-specific LLMs, multimodal data processing and generation, and utilization of multiple retrieval sources. We propose a framework of RAG-based LLMs for automating SLRs, which covers four stages of the SLR process: literature search, literature screening, data extraction, and information synthesis. Future research aims to optimize the interaction between LLM selection, training strategies, RAG techniques, and prompt engineering to implement the proposed framework, with particular emphasis on the retrieval of information from individual scientific papers and the integration of these data to produce outputs addressing various aspects such as current status, existing gaps, and emerging trends.
(This article belongs to the Special Issue Data and Text Mining: New Approaches, Achievements and Applications)
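
The retrieval and augmentation steps named in the abstract above can be illustrated with a minimal sketch. It is not the authors' framework: the bag-of-words cosine similarity, the function names (`bow`, `cosine`, `retrieve`, `augment`), and the sample passages are all assumptions chosen for illustration, and the generation step (passing the prompt to an LLM) is deliberately left out.

```python
import math
from collections import Counter


def bow(text):
    """Turn text into a bag-of-words count vector (whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, passages, k=1):
    """Retrieval: return the k passages most similar to the query."""
    q = bow(query)
    ranked = sorted(passages, key=lambda p: cosine(q, bow(p)), reverse=True)
    return ranked[:k]


def augment(query, retrieved):
    """Augmentation: prepend the retrieved passages to the question."""
    context = "\n".join(f"- {p}" for p in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical passages standing in for an indexed literature corpus.
passages = [
    "Active learning reduces screening workload in systematic reviews.",
    "Niobium nitride is a superconductor.",
]
query = "How can screening workload be reduced?"
prompt = augment(query, retrieve(query, passages, k=1))
print(prompt)
```

A real pipeline would likely replace the bag-of-words similarity with dense embeddings and send `prompt` to a language model for the generation step.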

27 pages, 2051 KiB  
Article
A Transparent Pipeline for Identifying Sexism in Social Media: Combining Explainability with Model Prediction
by Hadi Mohammadi, Anastasia Giachanou and Ayoub Bagheri
Appl. Sci. 2024, 14(19), 8620; https://doi.org/10.3390/app14198620 - 24 Sep 2024
Cited by 2 | Viewed by 1714
Abstract
In this study, we present a new approach that combines multiple Bidirectional Encoder Representations from Transformers (BERT) architectures with a Convolutional Neural Network (CNN) framework designed for sexism detection in text at a granular level. Our method relies on the analysis and identification of the most important terms contributing to sexist content using Shapley Additive Explanations (SHAP) values. This approach involves defining a range of Sexism Scores based on both model predictions and explainability, moving beyond binary classification to provide a deeper understanding of the sexism-detection process. Additionally, it enables us to identify specific parts of a sentence and their respective contributions to this range, which can be valuable for decision makers and future research. In conclusion, this study introduces an innovative method for enhancing the clarity of large language models (LLMs), which is particularly relevant in sensitive domains such as sexism detection. The incorporation of explainability into the model represents a significant advancement in this field. The objective of our study is to bridge the gap between advanced technology and human comprehension by providing a framework for creating AI models that are both efficient and transparent. This approach could serve as a pipeline for future studies to incorporate explainability into language models.
(This article belongs to the Special Issue Data and Text Mining: New Approaches, Achievements and Applications)

24 pages, 3998 KiB  
Article
Automatic Era Identification in Classical Arabic Poetry
by Nariman Makhoul Sleiman, Ali Ahmad Hussein, Tsvi Kuflik and Einat Minkov
Appl. Sci. 2024, 14(18), 8240; https://doi.org/10.3390/app14188240 - 12 Sep 2024
Viewed by 1213
Abstract
The authenticity of classical Arabic poetry has long been challenged by claims that some part of the pre-Islamic poetic heritage should not be attributed to this era. According to these assertions, some of this legacy was produced after the advent of Islam and ascribed, for different reasons, to pre-Islamic poets. As pre-Islamic poets were illiterate, medieval Arabic literature devotees relied on Bedouin oral transmission when writing down and collecting the poems about two centuries later. This process left the identity of the real poets who composed these poems and the period in which they worked unresolved. In this work, we seek to answer the questions of how and to what extent we can identify the period in which classical Arabic poetry was composed, where we exploit modern-day automatic text processing techniques for this aim. We consider a dataset of Arabic poetry collected from the diwans (‘collections of poems’) of thirteen Arabic poets that corresponds to two main eras: the pre-ʿAbbāsid era (covering the period between the 6th and the 8th centuries CE) and the ʿAbbāsid era (starting in the year 750 CE). Some poems in each diwan are considered ‘original’; i.e., poems that are attributed to a certain poet with high confidence. The diwans also include, however, an additional section of poems that are attributed to a poet with reservations, meaning that these poems might have been composed by another poet and/or in another period. We trained a set of machine learning algorithms (classifiers) in order to explore the potential of machine learning techniques to automatically identify the period in which a poem had been written. In the training phase, we represent each poem using various types of features (characteristics) designed to capture lexical, topical, and stylistic aspects of this poetry. 
By training and assessing automatic models of period prediction using the ‘original’ poetry, we obtained highly encouraging results, with F1 scores ranging from 0.73 to 0.90 for the various periods. Moreover, we observe that the stylistic features, which pertain to elements that characterize Arabic poetry, as well as the other feature types, are all indicative of the period in which the poem was written. We applied the resulting prediction models to poems for which the authorship period is under dispute (‘attributed’) and obtained interesting anecdotal results, suggesting that some of the poems may belong to different eras, an issue to be further examined by Arabic poetry researchers.
(This article belongs to the Special Issue Data and Text Mining: New Approaches, Achievements and Applications)

26 pages, 1413 KiB  
Article
Active Learning for Biomedical Article Classification with Bag of Words and FastText Embeddings
by Paweł Cichosz
Appl. Sci. 2024, 14(17), 7945; https://doi.org/10.3390/app14177945 - 6 Sep 2024
Viewed by 1304
Abstract
In several applications of text classification, training document labels are provided by human evaluators, and therefore, gathering sufficient data for model creation is time consuming and costly. The labeling time and effort may be reduced by active learning, in which classification models are created based on relatively small training sets, which are obtained by collecting class labels provided in response to labeling requests or queries. This is an iterative process with a sequence of models being fitted, and each of them is used to select query articles to be added to the training set for the next one. Such a learning scenario may pose different challenges for machine learning algorithms and text representation methods used for text classification than ordinary passive learning, since they have to deal with very small, often imbalanced data, and the computational expense of both model creation and prediction has to remain low. This work examines how classification algorithms and text representation methods that have been found particularly useful by prior work handle these challenges. The random forest and support vector machines algorithms are coupled with the bag of words and FastText word embedding representations and applied to datasets consisting of scientific article abstracts from systematic literature review studies in the biomedical domain. Several strategies are used to select articles for active learning queries, including uncertainty sampling, diversity sampling, and strategies favoring the minority class. Confidence-based and stability-based early stopping criteria are used to generate active learning termination signals. The results confirm that active learning is a useful approach to creating text classification models with limited access to labeled data, making it possible to save at least half of the human effort needed to assign relevant or irrelevant class labels to training articles. 
Two of the four examined combinations of classification algorithms and text representation methods were the most successful: the SVM algorithm with the FastText representation and the random forest algorithm with the bag of words representation. Uncertainty sampling turned out to be the most useful query selection strategy, and confidence-based stopping was found more universal and easier to configure than stability-based stopping.
(This article belongs to the Special Issue Data and Text Mining: New Approaches, Achievements and Applications)
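
Uncertainty sampling, the query selection strategy the abstract above reports as most useful, can be sketched as follows. This is an illustrative assumption rather than the paper's implementation: the function name `uncertainty_sample`, the binary relevant/irrelevant setting, and the example probabilities are hypothetical, and the classifier that would produce the probabilities is omitted.

```python
def uncertainty_sample(probabilities, batch_size=1):
    """Pick the unlabeled documents the current model is least sure about.

    `probabilities` holds, for each unlabeled document, the model's
    predicted probability of the 'relevant' class. In a binary task the
    least certain predictions are those closest to 0.5.
    """
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:batch_size]


# Hypothetical predicted probabilities for five unlabeled abstracts.
pool_probs = [0.95, 0.52, 0.10, 0.45, 0.70]
query = uncertainty_sample(pool_probs, batch_size=2)
print(query)  # indices of the documents to send to the human labeler
```

In an active learning loop, the selected documents would be labeled, added to the training set, and the model refitted before the next round of selection.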

Review

45 pages, 5583 KiB  
Review
From Tweets to Threats: A Survey of Cybersecurity Threat Detection Challenges, AI-Based Solutions and Potential Opportunities in X
by Omar Alsodi, Xujuan Zhou, Raj Gururajan, Anup Shrestha and Eyad Btoush
Appl. Sci. 2025, 15(7), 3898; https://doi.org/10.3390/app15073898 - 2 Apr 2025
Viewed by 833
Abstract
The pervasive use of social media platforms, such as X (formerly Twitter), has become a part of our daily lives, simultaneously increasing the threat of cyber attacks. To address this risk, numerous studies have explored methods to detect and predict cyber attacks by analyzing X data. This study specifically examines the application of AI techniques for predicting potential cyber threats on X. DeepNN consistently outperforms competing methods in terms of overall and average figure of merit. While character-level feature extraction methods are abundant, we contend that a semantic focus is more beneficial for this stage of the process. The findings indicate that current studies often lack comprehensive evaluations of critical aspects such as prediction scope, types of cybersecurity threats, feature extraction techniques, algorithm complexity, information summarization levels, scalability over time, and performance measurements. This review primarily focuses on identifying AI methods used to detect cyber threats on X and investigates existing gaps and trends in this area. Notably, over the past few years, few review articles have been published on detecting cyber threats on X, especially ones concentrating on recent journal articles rather than conference papers.
(This article belongs to the Special Issue Data and Text Mining: New Approaches, Achievements and Applications)

21 pages, 1069 KiB  
Review
Reproducibility and Data Storage for Active Learning-Aided Systematic Reviews
by Peter Lombaers, Jonathan de Bruin and Rens van de Schoot
Appl. Sci. 2024, 14(9), 3842; https://doi.org/10.3390/app14093842 - 30 Apr 2024
Cited by 3 | Viewed by 2070
Abstract
In the screening phase of a systematic review, screening prioritization via active learning effectively reduces the workload. However, the PRISMA guidelines are not sufficient for reporting the screening phase in a reproducible manner. Text screening with active learning is an iterative process, but the labeling decisions and the training of the active learning model can happen independently of each other in time. Therefore, it is not trivial to store the data from both events so that one can still know which iteration of the model was used for each labeling decision. Moreover, many iterations of the active learning model will be trained throughout the screening process, producing an enormous amount of data (think of many gigabytes or even terabytes of data), and machine learning models are continually becoming larger. This article clarifies the steps in an active learning-aided screening process and what data is produced at every step. We consider what reproducibility means in this context and we show that there is tension between the desire to be reproducible and the amount of data that is stored. Finally, we present the RDAL Checklist (Reproducibility and Data storage for Active Learning-Aided Systematic Reviews Checklist), which helps users and creators of active learning software make their screening process reproducible.
(This article belongs to the Special Issue Data and Text Mining: New Approaches, Achievements and Applications)