Information

2026

23 pages, 989 KB

Open Access Article

AI-Driven Corruption Risk Indicator Detection: A Comparative Evaluation of Transformer-Based NLP Models in Unstructured Procurement Data

by Nikolaos Peppes, Theodoros Alexakis, Emmanouil Daskalakis and Evgenia Adamopoulou

Information 2026, 17(4), 329; https://doi.org/10.3390/info17040329 - 28 Mar 2026

Viewed by 555

Abstract

The detection of corruption-related indicators within unstructured, textual procurement data remains a complex task due to linguistic ambiguity, contextual variation and domain-specific terminology. This study presents a comparative evaluation of three transformer-based Natural Language Processing (NLP) architectures (BERT-base-uncased, RoBERTa-base and DeBERTa-v3-base) for automated corruption risk indicator detection in procurement texts coming from heterogeneous sources. A unified dataset is constructed by linking unstructured technical documentation with structured procurement outcomes, enabling an outcome-driven risk labeling strategy. Performance evaluation is conducted through different metrics, including precision, recall, F1-score and ROC-AUC, complemented by explainability analysis using Integrated Gradients. The results demonstrate a clear performance progression and highlight the comparative strengths of the evaluated architectures. Overall, this study highlights the potential of contextual transformer models to support scalable, transparent and operational anti-corruption monitoring systems. Full article

24 pages, 1972 KB

Open Access Article

Exploring the Topics and Sentiments of AI-Related Public Opinions: An Advanced Machine Learning Text Analysis

by Wullianallur Raghupathi, Jie Ren and Tanush Kulkarni

Information 2026, 17(2), 134; https://doi.org/10.3390/info17020134 - 1 Feb 2026

Viewed by 3070

Abstract

This study investigates the evolution of public sentiment and discourse surrounding artificial intelligence through a comprehensive multi-method analysis of 28,819 Reddit comments spanning March 2015 to May 2024. We address three research questions: (1) what dominant topics characterize AI discourse, (2) how sentiment has changed over time, particularly following ChatGPT’s release, and (3) what linguistic patterns distinguish positive from negative discourse. To answer them, we employ 28 distinct analytical techniques to provide validated insights into public AI perception. Methodologically, the study integrates VADER sentiment analysis, Linguistic Inquiry and Word Count (LIWC) analysis with regression validation, dual topic modeling using Latent Dirichlet Allocation and Non-negative Matrix Factorization for cross-validation, four-dimensional tone analysis, named entity recognition, emotion detection, and advanced NLP techniques including sarcasm detection, stance classification, and toxicity analysis. A key methodological contribution is the validation of LIWC categories through linear regression (R² = 0.049, p < 0.001) and logistic regression (61% accuracy), moving beyond the descriptive statistics typical of prior linguistic analyses. Results reveal a pronounced decline in positive sentiment from +0.320 in 2015 to +0.053 in 2024. Contrary to expectations, sentiment decreased following ChatGPT’s November 2022 release, with negative comments increasing from 31.9% to 35.1%, suggesting that direct exposure to powerful AI capabilities intensifies rather than alleviates public concerns. LIWC regression analysis identified negative emotion words (β = −0.083) and positive emotion words (β = +0.063) as the strongest sentiment predictors, confirming that affective rather than technical engagement drives public AI attitudes.
Topic modeling revealed nine coherent themes, with facial recognition, algorithmic bias, AI ethics, and social media misinformation emerging as dominant concerns across both LDA and NMF analyses. Network analysis identified regulation as a central hub (degree centrality = 0.929) connecting all major AI concerns, indicating strong public appetite for governance frameworks. These findings contribute to theoretical understandings of technology risk perception, provide practical guidance for AI developers and policymakers, and demonstrate validated computational methods for tracking public opinion toward emerging technologies. Full article
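
Degree centrality, as reported for the regulation node, is conventionally the fraction of other nodes to which a node is directly connected. The following is a minimal sketch of that standard definition, not the authors' code; the toy concept graph and the function name `degree_centrality` are hypothetical and chosen only for illustration.

```python
def degree_centrality(graph):
    """Return normalized degree centrality for an undirected graph.

    `graph` maps each node to the set of its neighbours. The input used
    below is a hypothetical toy graph, not data from the study.
    """
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    # Each node's degree divided by the number of other nodes (n - 1).
    return {node: len(neighbours) / (n - 1) for node, neighbours in graph.items()}


# Toy co-occurrence graph in which "regulation" links to every other concept.
toy_graph = {
    "regulation": {"bias", "privacy", "jobs"},
    "bias": {"regulation", "privacy"},
    "privacy": {"regulation", "bias"},
    "jobs": {"regulation"},
}
centrality = degree_centrality(toy_graph)
```

Under this definition, a value close to 1 indicates a node that is directly linked to almost every other node in the network.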

25 pages, 1075 KB

Open Access Article

Prompt-Based Few-Shot Text Classification with Multi-Granularity Label Augmentation and Adaptive Verbalizer

by Deling Huang, Zanxiong Li, Jian Yu and Yulong Zhou

Information 2026, 17(1), 58; https://doi.org/10.3390/info17010058 - 8 Jan 2026

Viewed by 730

Abstract

Few-Shot Text Classification (FSTC) aims to classify text accurately into predefined categories using minimal training samples. Recently, prompt-tuning-based methods have achieved promising results by constructing verbalizers that map input data to the label space, thereby maximizing the utilization of pre-trained model features. However, existing verbalizer construction methods often rely on external knowledge bases, which require complex noise filtering and manual refinement, making the process time-consuming and labor-intensive, while approaches based on pre-trained language models (PLMs) frequently overlook inherent prediction biases. Furthermore, conventional data augmentation methods focus on modifying input instances while overlooking the integral role of label semantics in prompt tuning. This disconnection often leads to a trade-off where increased sample diversity comes at the cost of semantic consistency, resulting in marginal improvements. To address these limitations, this paper first proposes a novel Bayesian Mutual Information-based method that optimizes label mapping to retain general PLM features while reducing reliance on irrelevant or unfair attributes to mitigate latent biases. Based on this method, we propose two synergistic generators that synthesize semantically consistent samples by integrating label word information from the verbalizer to effectively enrich data distribution and alleviate sparsity. To guarantee the reliability of the augmented set, we propose a Low-Entropy Selector that serves as a semantic filter, retaining only high-confidence samples to safeguard the model against ambiguous supervision signals. Furthermore, we propose a Difficulty-Aware Adversarial Training framework that fosters generalized feature learning, enabling the model to withstand subtle input perturbations. 
Extensive experiments demonstrate that our approach outperforms state-of-the-art methods on most few-shot and full-data splits, with F1 score improvements of up to +2.8% on the standard AG’s News benchmark and +1.0% on the challenging DBPedia benchmark. Full article
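
The idea of an entropy-based filter that keeps only high-confidence augmented samples can be sketched as follows. This is an illustrative assumption about how such a selector might work, not the authors' implementation: the function names, the threshold `max_entropy`, and the example probabilities are all hypothetical.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_low_entropy(samples, max_entropy):
    """Keep (text, probs) pairs whose predicted class distribution is confident.

    `samples` is a list of (text, class_probabilities) pairs; the threshold
    `max_entropy` is a hypothetical tuning parameter.
    """
    return [(text, probs) for text, probs in samples
            if shannon_entropy(probs) <= max_entropy]


# Hypothetical augmented samples with model-predicted class probabilities.
candidates = [
    ("stocks rally after earnings", [0.94, 0.02, 0.02, 0.02]),
    ("something happened today", [0.25, 0.25, 0.25, 0.25]),
]
kept = select_low_entropy(candidates, max_entropy=0.5)
```

A uniform prediction has the maximum possible entropy and is discarded, while a sharply peaked prediction passes the filter.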

2025

19 pages, 726 KB

Open Access Article

Structural–Semantic Term Weighting for Interpretable Topic Modeling with Higher Coherence and Lower Token Overlap

by Dmitriy Rodionov, Evgenii Konnikov, Gleb Golikov and Polina Yakob

Information 2026, 17(1), 22; https://doi.org/10.3390/info17010022 - 31 Dec 2025

Cited by 1 | Viewed by 451

Abstract

Topic modeling of large news streams is widely used to reconstruct economic and political narratives, which requires coherent topics with low lexical overlap while remaining interpretable to domain experts. We propose TF-SYN-NER-Rel, a structural–semantic term weighting scheme that extends classical TF-IDF by integrating positional, syntactic, factual, and named-entity coefficients derived from morphosyntactic and dependency parses of Russian news texts. The method is embedded into a standard Latent Dirichlet Allocation (LDA) pipeline and evaluated on a large Russian-language news corpus from the online archive of Moskovsky Komsomolets (over 600,000 documents), with political, financial, and sports subsets obtained via dictionary-based expert labeling. For each subset, TF-SYN-NER-Rel is compared with standard TF-IDF under identical LDA settings, and topic quality is assessed using the C_v coherence metric. To assess robustness, we repeat model training across multiple random initializations and report aggregate coherence statistics. Quantitative results show that TF-SYN-NER-Rel improves coherence and yields smoother, more stable coherence curves across the number of topics. Qualitative analysis indicates reduced lexical overlap between topics and clearer separation of event-centered and institutional themes, especially in political and financial news. Overall, the proposed pipeline relies on CPU-based NLP tools and sparse linear algebra, providing a computationally lightweight and interpretable complement to embedding- and LLM-based topic modeling in large-scale news monitoring. Full article

15 pages, 770 KB

Open Access Article

Analysis of Large Language Models for Company Annual Reports Based on Retrieval-Augmented Generation

by Abhijit Mokashi, Bennet Puthuparambil, Chaissy Daniel and Thomas Hanne

Information 2025, 16(9), 786; https://doi.org/10.3390/info16090786 - 10 Sep 2025

Cited by 1 | Viewed by 4496

Abstract

Large language models (LLMs) like ChatGPT-4 and Gemini 1.0 demonstrate significant text generation capabilities but often struggle with outdated knowledge, domain specificity, and hallucinations. Retrieval-Augmented Generation (RAG) offers a promising solution by integrating external knowledge sources to produce more accurate and informed responses. This research investigates RAG’s effectiveness in enhancing LLM performance for financial report analysis. We examine how RAG and the specific prompt design improve the provision of qualitative and quantitative financial information in terms of accuracy, relevance, and verifiability. Employing a design science research approach, we compare ChatGPT-4 responses before and after RAG integration, using annual reports from ten selected technology companies. Our findings demonstrate that RAG improves the relevance and verifiability of LLM outputs (by 0.66 and 0.71, respectively, on a scale from 1 to 5), while also reducing irrelevant or incorrect answers. Prompt specificity is shown to critically impact response quality. This study indicates RAG’s potential to mitigate LLM biases and inaccuracies, offering a practical solution for generating reliable and contextually rich financial insights. Full article

10 pages, 466 KB

Open Access Article

The Negative Concord Mystery: Insights from a Language Model

by William O’Grady, Haopeng Zhang and Miseon Lee

Information 2025, 16(8), 710; https://doi.org/10.3390/info16080710 - 20 Aug 2025

Viewed by 2150

Abstract

An important recent development in the field of linguistics is the use of small language models to investigate language acquisition. Following this line of research, we investigate the mysterious appearance of ‘negative concord’ (e.g., I didn’t do nothing) in the speech of children whose environment offers no exposure to patterns of this sort. Drawing on a 10-million-word version of the BabyLM corpus, we show that the preference for negative concord over patterns involving a single negative (e.g., I did nothing) can be traced to a cognitive force known as biuniqueness, whose effects will be examined with the help of data from both natural speech and a language model. Full article

16 pages, 2741 KB

Open Access Article

EVOCA: Explainable Verification of Claims by Graph Alignment

by Carmela De Felice, Carmelo Fabio Longo, Misael Mongiovì, Daniele Francesco Santamaria and Giusy Giulia Tuccari

Information 2025, 16(7), 597; https://doi.org/10.3390/info16070597 - 11 Jul 2025

Viewed by 1717

Abstract

The paper introduces EVOCA (Explainable Verification Of Claims by Graph Alignment), a hybrid approach that combines Natural Language Processing (NLP) techniques with the structural advantages of knowledge graphs to manage and reduce the amount of evidence required to evaluate statements. The approach leverages the explicit and interpretable structure of semantic graphs, which naturally represent the semantic structure of a sentence or a set of sentences and explicitly encode the relationships among different concepts, thereby facilitating the extraction and manipulation of relevant information. The primary objective of the proposed tool is to condense the evidence into a short sentence that preserves only the information that is salient and relevant to the target claim. This process eliminates superfluous and redundant information, which could impact the performance of the subsequent verification task, and provides useful information for explaining the outcome. To achieve this, EVOCA generates a sub-graph in Abstract Meaning Representation (AMR) that represents the tokens of the claim–evidence pair that exhibit high semantic similarity. The structured representation offered by the AMR graph not only aids in identifying the most relevant information but also improves the interpretability of the results. The resulting sub-graph is converted back into natural language with the SPRING AMR tool, producing a concise but meaning-rich “sub-evidence” sentence. The output can be processed by lightweight language models to determine whether the evidence supports, contradicts, or is neutral about the claim. The approach is tested on the 4297 sentence pairs of the Climate-BERT-fact-checking dataset, and the promising results are discussed. Full article

22 pages, 509 KB

Open Access Article

Aspect-Enhanced Prompting Method for Unsupervised Domain Adaptation in Aspect-Based Sentiment Analysis

by Binghan Lu, Kiyoaki Shirai and Natthawut Kertkeidkachorn

Information 2025, 16(5), 411; https://doi.org/10.3390/info16050411 - 16 May 2025

Cited by 1 | Viewed by 2355

Abstract

This study proposes an Aspect-Enhanced Prompting (AEP) method for unsupervised Multi-Source Domain Adaptation in Aspect Sentiment Classification, where data from the target domain are completely unavailable for model training. The proposed AEP is based on two generative language models: one generates a prompt from a given review, while the other follows the prompt and classifies the sentiment of an aspect. The first model extracts Aspect-Related Features (ARFs), which are words closely related to the aspect, from the review and incorporates them into the prompt in a domain-agnostic manner, thereby directing the second model to identify the sentiment accurately. Our framework incorporates an innovative rescoring mechanism and a cluster-based prompt expansion strategy. Both are intended to enhance the robustness of the generation of the prompt and the adaptability of the model to diverse domains. The results of experiments conducted on five datasets (Restaurant, Laptop, Device, Service, and Location) demonstrate that our method outperforms the baselines, including a state-of-the-art unsupervised domain adaptation method. The effectiveness of both the rescoring mechanism and the cluster-based prompt expansion is also validated through an ablation study. Full article

20 pages, 1902 KB

Open Access Article

Distantly Supervised Relation Extraction Method Based on Multi-Level Hierarchical Attention

by Zhaoxin Xuan, Hejing Zhao, Xin Li and Ziqi Chen

Information 2025, 16(5), 364; https://doi.org/10.3390/info16050364 - 29 Apr 2025

Cited by 1 | Viewed by 1517

Abstract

Distantly Supervised Relation Extraction (DSRE) aims to automatically identify semantic relationships within large text corpora by aligning with external knowledge bases. Despite the success of current methods in automating data annotation, they introduce two main challenges: label noise and data long-tail distribution. Label noise results in inaccurate annotations, which can undermine the quality of relation extraction. The long-tail problem, on the other hand, leads to an imbalanced model that struggles to extract less frequent, long-tail relations. In this paper, we introduce a novel relation extraction framework based on multi-level hierarchical attention. This approach utilizes Graph Attention Networks (GATs) to model the hierarchical structure of the relations, capturing the semantic dependencies between relation types and generating relation embeddings that reflect the overall hierarchical framework. To improve the classification process, we incorporate a multi-level classification structure guided by hierarchical attention, which enhances the accuracy of both head and tail relation extraction. A local probability constraint is introduced to ensure coherence across the classification levels, fostering knowledge transfer from frequent to less frequent relations. Experimental evaluations on the New York Times (NYT) dataset demonstrate that our method outperforms existing baselines, particularly in the context of long-tail relation extraction, offering a comprehensive solution to the challenges of DSRE. Full article

20 pages, 4029 KB

Open Access Article

AI Narrative Modeling: How Machines’ Intelligence Reproduces Archetypal Storytelling

by Igor Kabashkin, Olga Zervina and Boriss Misnevs

Information 2025, 16(4), 319; https://doi.org/10.3390/info16040319 - 17 Apr 2025

Cited by 6 | Viewed by 10365

Abstract

This study examines how large language models reproduce Jungian archetypal patterns in storytelling. Results indicate that AI excels at replicating structured, goal-oriented archetypes (Hero, Wise Old Man), but it struggles with psychologically complex and ambiguous narratives (Shadow, Trickster). Expert evaluations confirmed these patterns, rating AI higher on narrative coherence and thematic alignment than on emotional depth and creative originality. Full article

24 pages, 2290 KB

Open Access Article

nBERT: Harnessing NLP for Emotion Recognition in Psychotherapy to Transform Mental Health Care

by Abdur Rasool, Saba Aslam, Naeem Hussain, Sharjeel Imtiaz and Waqar Riaz

Information 2025, 16(4), 301; https://doi.org/10.3390/info16040301 - 9 Apr 2025

Cited by 24 | Viewed by 6329

Abstract

The rising prevalence of mental health disorders, particularly depression, highlights the need for improved approaches in therapeutic interventions. Traditional psychotherapy relies on subjective assessments, which can vary across therapists and sessions, making it challenging to track emotional progression and therapy effectiveness objectively. Leveraging the advancements in Natural Language Processing (NLP) and domain-specific Large Language Models (LLMs), this study introduces nBERT, a fine-tuned Bidirectional Encoder Representations from Transformers (BERT) model integrated with the NRC Emotion Lexicon, to elevate emotion recognition in psychotherapy transcripts. The goal of this study is to provide a computational framework that aids in identifying emotional patterns, tracking patient-therapist emotional alignment, and assessing therapy outcomes. Addressing the challenge of emotion classification in text-based therapy sessions, where non-verbal cues are absent, nBERT demonstrates its ability to extract nuanced emotional insights from unstructured textual data, providing a data-driven approach to enhance mental health assessments. Trained on a dataset of 2021 psychotherapy transcripts, the model achieves an average precision of 91.53%, significantly outperforming baseline models. This capability not only improves diagnostic accuracy but also supports the customization of therapeutic strategies. By automating the interpretation of complex emotional dynamics in psychotherapy, nBERT exemplifies the transformative potential of NLP and LLMs in revolutionizing mental health care. Beyond psychotherapy, the framework enables broader LLM applications in the life sciences, including personalized medicine and precision healthcare. Full article

31 pages, 23911 KB

Open Access Article

GSAF: An ML-Based Sentiment Analytics Framework for Understanding Contemporary Public Sentiment and Trends on Key Societal Issues

by Abdul Moid Khan Mohammed, G. G. Md. Nawaz Ali and Samantha S. Khairunnesa

Information 2025, 16(4), 271; https://doi.org/10.3390/info16040271 - 27 Mar 2025

Cited by 3 | Viewed by 2564

Abstract

This paper presents a Generalized Sentiment Analytics Framework (GSAF) for understanding public sentiments on different key societal issues in real time. The framework uses natural language processing techniques to compute sentiments and displays them by emotion, leveraging publicly available social media data (i.e., threads on X, formerly Twitter). As a case study of our developed framework, we have leveraged over 3 million tweets to map, analyze, and visualize public sentiment state-wise across the United States on different societal issues. With X as a key social media platform, this study harnesses its vast user base to provide real-time insights into emotional responses surrounding key societal and political events. Built using R and the Shiny web framework, the platform offers users interactive visualizations of emotion-specific sentiments, such as anger, joy, and trust, displayed on a U.S. state-level choropleth map. The platform allows keyword-based searches and employs advanced text-processing techniques to filter and clean tweet data for robust analysis. Furthermore, it implements efficient caching mechanisms to enhance performance, comparing various strategies like LRU and Size-Based Eviction. This research highlights the potential of sentiment analysis for policymaking, marketing, and public discourse, providing a valuable tool for understanding and predicting public sentiment trends. Full article
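
For readers unfamiliar with the least-recently-used (LRU) strategy named in the abstract, the following is a minimal sketch of the general technique. It is written in Python purely for illustration, whereas the platform itself is built in R, and it is not the authors' caching code; the class name `LRUCache`, its methods, and the cached keys are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Least-recently-used cache with a fixed maximum number of entries."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used


# Hypothetical keyword searches cached by the dashboard.
cache = LRUCache(capacity=2)
cache.put("climate", "result A")
cache.put("election", "result B")
cache.get("climate")              # "climate" becomes the most recent entry
cache.put("economy", "result C")  # evicts "election"
```

A size-based eviction policy would instead bound the total size of the stored values rather than the number of entries.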

16 pages, 247 KB

Open Access Article

Show Me All Writing Errors: A Two-Phased Grammatical Error Corrector for Romanian

by Mihai-Cristian Tudose, Stefan Ruseti and Mihai Dascalu

Information 2025, 16(3), 242; https://doi.org/10.3390/info16030242 - 18 Mar 2025

Cited by 1 | Viewed by 2318

Abstract

Nowadays, grammatical error correction (GEC) has a significant role in writing since even native speakers often face challenges with proficient writing. This research is focused on developing a methodology to correct grammatical errors in the Romanian language, a less-resourced language for which there are currently no up-to-date GEC solutions. Our main contributions include an open-source synthetic dataset of 345,403 Romanian sentences, a manually curated dataset of 3054 social media comments, a two-phased GEC approach, and a comparison with several Romanian models, including RoMistral and RoLama3, as well as LanguageTool, GPT-4o mini, and GPT-4o. We use the synthetic dataset to fine-tune our models, while we rely on two real-life datasets with genuine human mistakes (i.e., CNA and RoComments) to evaluate performance. Building an artificial dataset was necessary because of the scarcity of real-life mistake datasets, whereas introducing RoComments, a new genuine dataset, is motivated by the need to cover errors made by native speakers in social media comments. In our two-phased approach, we first identify the location of erroneous tokens in the sentence; next, the erroneous tokens are replaced by an encoder–decoder model. Our approach achieved an F_{0.5} score of 0.57 on CNA and 0.64 on RoComments, surpassing by a considerable margin LanguageTool as well as an end-to-end version based on Flan-T5 and mT0 in most setups. While our two-phased method did not outperform GPT-4o, arguably because of its smaller size and language exposure, it obtained on-par results with GPT-4o mini and achieved higher performance than all Romanian LLMs. Full article
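
The F_{0.5} measure weights precision more heavily than recall, a common choice in GEC evaluation because a wrong correction is usually considered worse than a missed one. A minimal sketch of the standard F-beta formula follows; the function name and the example precision and recall values are hypothetical and are not taken from the paper.

```python
def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score; beta < 1 favours precision over recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical system: precise corrections, but many errors left untouched.
score = f_beta(precision=0.8, recall=0.4)
```

With these illustrative inputs the score sits closer to the precision than the balanced F1 would, which is the intended effect of setting beta to 0.5.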

14 pages, 1804 KB

Open Access Article

A Spoofing Speech Detection Method Combining Multi-Scale Features and Cross-Layer Information

by Hongyan Yuan, Linjuan Zhang, Baoning Niu and Xianrong Zheng

Information 2025, 16(3), 194; https://doi.org/10.3390/info16030194 - 2 Mar 2025

Cited by 1 | Viewed by 2614

Abstract

Pre-trained self-supervised speech models can extract general acoustic features, providing feature inputs for various speech downstream tasks. Spoofing speech detection, which is a pressing issue in the age of generative AI, requires both global information and local features of speech. The multi-layer transformer structure in pre-trained speech models can effectively capture temporal information and global context in speech, but there is still room for improvement in handling local features. To address this issue, a speech spoofing detection method that integrates multi-scale features and cross-layer information is proposed. The method introduces a multi-scale feature adapter (MSFA), which enhances the model’s ability to perceive local features through residual convolutional blocks and squeeze-and-excitation (SE) mechanisms. Additionally, cross-adaptable weights (CAWs) are used to guide the model in focusing on task-relevant shallow information, thereby enabling the effective fusion of features from different layers of the pre-trained model. Experimental results show that the proposed method achieved an equal error rate (EER) of 0.36% and 4.29% on the ASVspoof2019 logical access (LA) and ASVspoof2021 LA datasets, respectively, demonstrating excellent detection performance and generalization ability. Full article

21 pages, 2702 KB

Open Access Article

Analyzing Fairness of Computer Vision and Natural Language Processing Models

by Ahmed Rashed, Abdelkrim Kallich and Mohamed Eltayeb

Information 2025, 16(3), 182; https://doi.org/10.3390/info16030182 - 27 Feb 2025

Cited by 3 | Viewed by 5484

Abstract

Machine learning (ML) algorithms play a critical role in decision-making across various domains, such as healthcare, finance, education, and law enforcement. However, concerns about fairness and bias in these systems have raised significant ethical and social challenges. To address these challenges, this research utilizes two prominent fairness libraries, Fairlearn by Microsoft and AIF360 by IBM. These libraries offer comprehensive frameworks for fairness analysis, providing tools to evaluate fairness metrics, visualize results, and implement bias mitigation algorithms. The study focuses on assessing and mitigating biases for unstructured datasets using Computer Vision (CV) and Natural Language Processing (NLP) models. The primary objective is to present a comparative analysis of the performance of mitigation algorithms from the two fairness libraries. This analysis involves applying the algorithms individually, one at a time, in one of the stages of the ML lifecycle, pre-processing, in-processing, or post-processing, as well as sequentially across more than one stage. The results reveal that some sequential applications improve the performance of mitigation algorithms by effectively reducing bias while maintaining the model’s performance. Publicly available datasets from Kaggle were chosen for this research, providing a practical context for evaluating fairness in real-world machine learning workflows. Full article

18 pages, 585 KB

Open AccessArticle

Improving Diacritical Arabic Speech Recognition: Transformer-Based Models with Transfer Learning and Hybrid Data Augmentation

by Haifa Alaqel and Khalil El Hindi

Information 2025, 16(3), 161; https://doi.org/10.3390/info16030161 - 20 Feb 2025

Cited by 4 | Viewed by 5104

Abstract

Diacritical Arabic (DA) refers to Arabic text with diacritical marks that guide pronunciation and clarify meanings, making their recognition crucial for accurate linguistic interpretation. These diacritical marks (short vowels) significantly influence meaning and pronunciation, and their accurate recognition is vital for the effectiveness of automatic speech recognition (ASR) systems, particularly in applications requiring high semantic precision, such as voice-enabled translation services. Despite its importance, leveraging advanced machine learning techniques to enhance ASR for diacritical Arabic has remained underexplored. A key challenge in developing DA ASR is the limited availability of training data. This study introduces a transformer-based approach leveraging transfer learning and data augmentation to address these challenges. Using a cross-lingual speech representation (XLSR) model pretrained on 53 languages, we fine-tune it on DA and integrate connectionist temporal classification (CTC) with transformers for improved performance. Data augmentation techniques, including volume adjustment, pitch shift, speed alteration, and hybrid strategies, further mitigate data limitations, significantly reducing word error rates (WER). Our methods achieve a WER of 12.17%, outperforming traditional ASR systems and setting a new benchmark for DA ASR. These findings demonstrate the potential of advanced machine learning to address longstanding challenges in DA ASR and enhance its accuracy. Full article
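
The word error rate (WER) reported above is conventionally the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `word_error_rate` and the sample sentences are hypothetical, and whitespace tokenization is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()   # assumption: whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution and one deletion over six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.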

26 pages, 1974 KB

Open AccessArticle

Augmenting LLMs to Securely Retrieve Information for Construction and Facility Management

by David Krütli and Thomas Hanne

Information 2025, 16(2), 76; https://doi.org/10.3390/info16020076 - 22 Jan 2025

Cited by 6 | Viewed by 3441

Abstract

In the past few years, generative AI has seen remarkable progress. The emergence of the transformer architecture has facilitated the creation of highly advanced language models that generate text, summarize content, and translate languages with impressive accuracy. Our study introduces a retrieval-augmented generation system tailored to the dynamic needs of facility management. The proposed system aims to provide instant, accurate access to essential information by integrating advanced techniques from natural language processing and information retrieval paradigms. The implementation leverages the Mixtral 8x7B model for multilingual text processing and the Milvus vector database for efficient document storage and retrieval. The dataset used includes documents such as images, operation manuals, inspection results, blueprints, and technical drawings, in various file formats. This diverse dataset reflects the variety of information encountered in construction and facility management. The evaluation involved generating question–answer pairs pertinent to facility management tasks and assessing the system’s performance using metrics such as ROUGE, BLEU, and semantic similarity. The findings suggest that retrieval-augmented generation systems can significantly enhance operational efficiency by reducing the time and effort required to access information while maintaining high security and data privacy standards. Full article

2024

19 pages, 393 KB

Open AccessArticle

Causality Extraction from Medical Text Using Large Language Models (LLMs)

by Seethalakshmi Gopalakrishnan, Luciana Garbayo and Wlodek Zadrozny

Information 2025, 16(1), 13; https://doi.org/10.3390/info16010013 - 30 Dec 2024

Cited by 12 | Viewed by 5016

Abstract

This study explores the potential of natural language models, including large language models, to extract causal relations from medical texts, specifically from clinical practice guidelines (CPGs). The outcomes of causality extraction from clinical practice guidelines for gestational diabetes are presented, marking a first in the field. The results are reported on a set of experiments using variants of BERT (BioBERT, DistilBERT, and BERT) and using newer large language models (LLMs), namely, GPT-4 and LLAMA2. Our experiments show that BioBERT performed better than other models, including the large language models, with an average F1-score of 0.72. The GPT-4 and LLAMA2 results show similar performance but less consistency. The code and an annotated corpus of causal statements within the clinical practice guidelines for gestational diabetes are released. Extracting causal structures might help identify LLMs’ hallucinations and possibly prevent some medical errors if LLMs are used in patient settings. Some practical extensions of extracting causal statements from medical text would include providing additional diagnostic support based on less frequent cause–effect relationships, identifying possible inconsistencies in medical guidelines, and evaluating the evidence for recommendations. Full article

20 pages, 1505 KB

Open AccessArticle

Optimizing Tourism Accommodation Offers by Integrating Language Models and Knowledge Graph Technologies

by Andrea Cadeddu, Alessandro Chessa, Vincenzo De Leo, Gianni Fenu, Enrico Motta, Francesco Osborne, Diego Reforgiato Recupero, Angelo Salatino and Luca Secchi

Information 2024, 15(7), 398; https://doi.org/10.3390/info15070398 - 10 Jul 2024

Cited by 14 | Viewed by 4573

Abstract

Online platforms have become the primary means for travellers to search, compare, and book accommodations for their trips. Consequently, online platforms and revenue managers must acquire a comprehensive understanding of these dynamics to formulate competitive and appealing offerings. Recent advancements in natural language processing, specifically through the development of large language models, have demonstrated significant progress in capturing the intricate nuances of human language. On the other hand, knowledge graphs have emerged as potent instruments for representing and organizing structured information. Nevertheless, effectively integrating these two powerful technologies remains an ongoing challenge. This paper presents an innovative deep learning methodology that combines large language models with domain-specific knowledge graphs for classification of tourism offers. The main objective of our system is to assist revenue managers in the following two fundamental dimensions: (i) comprehending the market positioning of their accommodation offerings, taking into consideration factors such as accommodation price and availability, together with user reviews and demand, and (ii) optimizing presentations and characteristics of the offerings themselves, with the intention of improving their overall appeal. For this purpose, we developed a domain knowledge graph covering a variety of information about accommodations and implemented targeted feature engineering techniques to enhance the information representation within a large language model. To evaluate the effectiveness of our approach, we conducted a comparative analysis against alternative methods on four datasets about accommodation offers in London. The proposed solution obtained excellent results, significantly outperforming alternative methods. Full article

16 pages, 233 KB

Open AccessArticle

The Personification of ChatGPT (GPT-4)—Understanding Its Personality and Adaptability

by Leandro Stöckli, Luca Joho, Felix Lehner and Thomas Hanne

Information 2024, 15(6), 300; https://doi.org/10.3390/info15060300 - 24 May 2024

Cited by 8 | Viewed by 17380

Abstract

Thanks to the publication of ChatGPT, Artificial Intelligence is now basically accessible and usable to all internet users. The technology behind it can be used in many chatbots, whereby the chatbots should be trained for the respective area of application. Depending on the application, the chatbot should react differently and thus, for example, also take on and embody personality traits to be able to help and answer people better and more personally. This raises the question of whether ChatGPT-4 is able to embody personality traits. Our study investigated whether ChatGPT-4’s personality can be analyzed using personality tests for humans. To test possible approaches to measuring the personality traits of ChatGPT-4, experiments were conducted with two of the most well-known personality tests: the Big Five and Myers–Briggs. The experiments also examine whether and how personality can be changed by user input and what influence this has on the results of the personality tests. Full article

17 pages, 3845 KB

Open AccessArticle

Robust Chinese Short Text Entity Disambiguation Method Based on Feature Fusion and Contrastive Learning

by Qishun Mei and Xuhui Li

Information 2024, 15(3), 139; https://doi.org/10.3390/info15030139 - 29 Feb 2024

Cited by 3 | Viewed by 2518

Abstract

To address the limitations of existing methods of short-text entity disambiguation, specifically in terms of their insufficient feature extraction and reliance on massive training samples, we propose an entity disambiguation model called COLBERT, which fuses LDA-based topic features and BERT-based semantic features, as well as using contrastive learning, to enhance the disambiguation process. Experiments on a publicly available Chinese short-text entity disambiguation dataset show that the proposed model achieves an F1-score of 84.0%, which outperforms the benchmark method by 0.6%. Moreover, our model achieves an F1-score of 74.5% with a limited number of training samples, which is 2.8% higher than the benchmark method. These results demonstrate that our model achieves better effectiveness and robustness and can reduce the burden of data annotation as well as training costs. Full article

12 pages, 2278 KB

Open AccessArticle

SRBerta—A Transformer Language Model for Serbian Cyrillic Legal Texts

by Miloš Bogdanović, Jelena Kocić and Leonid Stoimenov

Information 2024, 15(2), 74; https://doi.org/10.3390/info15020074 - 25 Jan 2024

Cited by 8 | Viewed by 4600

Abstract

Language is a unique ability of human beings. Although relatively simple for humans, the ability to understand human language is a highly complex task for machines. For a machine to learn a particular language, it must understand not only the words and rules used in a particular language, but also the context of sentences and the meaning that words take on in a particular context. In the experimental development we present in this paper, the goal was the development of SRBerta, a language model designed to understand the formal language of Serbian legal documents. SRBerta is the first of its kind since it has been trained using Cyrillic legal texts contained within a dataset created specifically for this purpose. The training process was carried out using minimal resources (a single NVIDIA Quadro RTX 5000 GPU) and performed in two phases: base model training and fine-tuning. We will present the structure of the model, the structure of the training datasets, the training process, and the evaluation results. Further, we will explain the accuracy metric used in our case and demonstrate that SRBerta achieves a high level of accuracy for the task of masked language modeling in Serbian Cyrillic legal texts. Finally, the SRBerta model and training datasets are publicly available for scientific and commercial purposes. Full article

2023

21 pages, 782 KB

Open AccessArticle

Offensive Text Span Detection in Romanian Comments Using Large Language Models

by Andrei Paraschiv, Teodora Andreea Ion and Mihai Dascalu

Information 2024, 15(1), 8; https://doi.org/10.3390/info15010008 - 21 Dec 2023

Cited by 4 | Viewed by 4666

Abstract

The advent of online platforms and services has revolutionized communication, enabling users to share opinions and ideas seamlessly. However, this convenience has also brought about a surge in offensive and harmful language across various communication mediums. In response, social platforms have turned to automated methods to identify offensive content. A critical research question emerges when investigating the role of specific text spans within comments in conveying offensive characteristics. This paper conducted a comprehensive investigation into detecting offensive text spans in Romanian language comments using Transformer encoders and Large Language Models (LLMs). We introduced an extensive dataset of 4800 Romanian comments annotated with offensive text spans. Moreover, we explored the impact of varying model sizes, architectures, and training data volumes on the performance of offensive text span detection, providing valuable insights for determining the optimal configuration. The results argue for the effectiveness of BERT pre-trained models for this span-detection task, showcasing their superior performance. We further investigated the impact of different sample-retrieval strategies for few-shot learning using LLMs based on vector text representations. The analysis highlights important insights and trade-offs in leveraging LLMs for offensive-language-detection tasks. Full article

18 pages, 466 KB

Open AccessArticle

Weakly Supervised Learning Approach for Implicit Aspect Extraction

by Aye Aye Mar, Kiyoaki Shirai and Natthawut Kertkeidkachorn

Information 2023, 14(11), 612; https://doi.org/10.3390/info14110612 - 13 Nov 2023

Cited by 2 | Viewed by 3031

Abstract

Aspect-based sentiment analysis (ABSA) is a process to extract an aspect of a product from a customer review and identify its polarity. Most previous studies of ABSA focused on explicit aspects, but implicit aspects have not yet been the subject of much attention. This paper proposes a novel weakly supervised method for implicit aspect extraction, which is a task to classify a sentence into a pre-defined implicit aspect category. A dataset labeled with implicit aspects is automatically constructed from unlabeled sentences as follows. First, explicit sentences are obtained by extracting explicit aspects from unlabeled sentences, while sentences that do not contain explicit aspects are preserved as candidates of implicit sentences. Second, clustering is performed to merge the explicit and implicit sentences that share the same aspect. Third, the aspect of the explicit sentence is assigned to the implicit sentences in the same cluster as the implicit aspect label. Then, the BERT model is fine-tuned for implicit aspect extraction using the constructed dataset. The results of the experiments show that our method achieves 82% and 84% accuracy for mobile phone and PC reviews, respectively, which are 20 and 21 percentage points higher than the baseline. Full article

15 pages, 1788 KB

Open AccessArticle

Multiple Information-Aware Recurrent Reasoning Network for Joint Dialogue Act Recognition and Sentiment Classification

by Shi Li and Xiaoting Chen

Information 2023, 14(11), 593; https://doi.org/10.3390/info14110593 - 1 Nov 2023

Cited by 1 | Viewed by 2176

Abstract

The task of joint dialogue act recognition (DAR) and sentiment classification (DSC) aims to predict both the act and sentiment labels of each utterance in a dialogue. Existing methods mainly focus on local or global semantic features of the dialogue from a single perspective, disregarding the impact of the other part. Therefore, we propose a multiple information-aware recurrent reasoning network (MIRER). Firstly, the sequence information is smoothly sent to multiple local information layers for fine-grained feature extraction through a BiLSTM-connected hybrid CNN group method. Secondly, to obtain global semantic features that are speaker-, context-, and temporal-sensitive, we design a speaker-aware temporal reasoning heterogeneous graph to characterize interactions between utterances spoken by different speakers, incorporating different types of nodes and meta-relations with node-edge-type-dependent parameters. We also design a dual-task temporal reasoning heterogeneous graph to realize the semantic-level and prediction-level self-interaction and interaction, and we constantly revise and improve the label in the process of dual-task recurrent reasoning. MIRER fully integrates context-level features, fine-grained features, and global semantic features, including speaker, context, and temporal sensitivity, to better simulate conversation scenarios. We validated the method on two public dialogue datasets, Mastodon and DailyDialog, and the experimental results show that MIRER outperforms various existing baseline models. Full article

15 pages, 1726 KB

Open AccessReview

Thematic Analysis of Big Data in Financial Institutions Using NLP Techniques with a Cloud Computing Perspective: A Systematic Literature Review

by Ratnesh Kumar Sharma, Gnana Bharathy, Faezeh Karimi, Anil V. Mishra and Mukesh Prasad

Information 2023, 14(10), 577; https://doi.org/10.3390/info14100577 - 20 Oct 2023

Cited by 11 | Viewed by 5231

Abstract

This literature review explores the existing work and practices in applying thematic analysis natural language processing techniques to financial data in cloud environments. This work aims to improve two of the five Vs of the big data system. We used the PRISMA approach (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) for the review. We analyzed the research papers published over the last 10 years about the topic in question using a keyword-based search and bibliometric analysis. The systematic literature review was conducted in multiple phases, and filters were applied to exclude papers based on the title and abstract initially, then based on the methodology/conclusion, and, finally, after reading the full text. The remaining papers were then considered and are discussed here. We found that automated data discovery methods can be augmented by applying an NLP-based thematic analysis on the financial data in cloud environments. This can help identify the correct classification/categorization and measure data quality for a sentiment analysis. Full article

18 pages, 516 KB

Open AccessArticle

Automated Assessment of Comprehension Strategies from Self-Explanations Using LLMs

by Bogdan Nicula, Mihai Dascalu, Tracy Arner, Renu Balyan and Danielle S. McNamara

Information 2023, 14(10), 567; https://doi.org/10.3390/info14100567 - 14 Oct 2023

Cited by 11 | Viewed by 4700

Abstract

Text comprehension is an essential skill in today’s information-rich world, and self-explanation practice helps students improve their understanding of complex texts. This study was centered on leveraging open-source Large Language Models (LLMs), specifically FLAN-T5, to automatically assess the comprehension strategies employed by readers while understanding Science, Technology, Engineering, and Mathematics (STEM) texts. The experiments relied on a corpus of three datasets (N = 11,833) with self-explanations annotated on 4 dimensions: 3 comprehension strategies (i.e., bridging, elaboration, and paraphrasing) and overall quality. Besides FLAN-T5, we also considered GPT3.5-turbo to establish a stronger baseline. Our experiments indicated that the performance improved with fine-tuning, having a larger LLM model, and providing examples via the prompt. Our best model considered a pretrained FLAN-T5 XXL model and obtained a weighted F1-score of 0.721, surpassing the 0.699 F1-score previously obtained using smaller models (i.e., RoBERTa). Full article
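
The weighted F1-score cited above is usually computed as the per-class F1 averaged with weights proportional to each class's share of the true labels. The sketch below illustrates that standard definition in plain Python; it is not the authors' evaluation code, the function name `weighted_f1` is hypothetical, and the toy labels are invented for illustration rather than taken from the study's datasets.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged by class support (count in y_true)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1  # weight by the class's share of true labels
    return score


# Invented toy labels, for illustration only.
y_true = ["paraphrase", "paraphrase", "paraphrase", "bridging"]
y_pred = ["paraphrase", "paraphrase", "bridging", "bridging"]
score = weighted_f1(y_true, y_pred)
```

Weighting by support means frequent classes dominate the score, which is why a weighted F1 is often reported alongside a macro average on imbalanced annotation schemes.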

28 pages, 6126 KB

Open AccessArticle

Social Media Analytics on Russia–Ukraine Cyber War with Natural Language Processing: Perspectives and Challenges

by Fahim Sufi

Information 2023, 14(9), 485; https://doi.org/10.3390/info14090485 - 31 Aug 2023

Cited by 34 | Viewed by 21105

Abstract

Utilizing social media data is imperative in comprehending critical insights on the Russia–Ukraine cyber conflict due to their unparalleled capacity to provide real-time information dissemination, thereby enabling the timely tracking and analysis of cyber incidents. The vast array of user-generated content on these platforms, ranging from eyewitness accounts to multimedia evidence, serves as invaluable resources for corroborating and contextualizing cyber attacks, facilitating the attribution of malicious actors. Furthermore, social media data afford unique access to public sentiment, the propagation of propaganda, and emerging narratives, offering profound insights into the effectiveness of information operations and shaping counter-messaging strategies. However, there have been hardly any studies reported on the Russia–Ukraine cyber war harnessing social media analytics. This paper presents a comprehensive analysis of the crucial role of social-media-based cyber intelligence in understanding Russia’s cyber threats during the ongoing Russo–Ukrainian conflict. This paper introduces an innovative multidimensional cyber intelligence framework and utilizes Twitter data to generate cyber intelligence reports. By leveraging advanced monitoring tools and NLP algorithms, like language detection, translation, sentiment analysis, term frequency–inverse document frequency (TF-IDF), latent Dirichlet allocation (LDA), Porter stemming, n-grams, and others, this study automatically generated cyber intelligence for Russia and Ukraine. Using 37,386 tweets originating from 30,706 users in 54 languages from 13 October 2022 to 6 April 2023, this paper reported the first detailed multilingual analysis on the Russia–Ukraine cyber crisis in four cyber dimensions (geopolitical and socioeconomic; targeted victim; psychological and societal; and national priority and concerns). It also highlights challenges faced in harnessing reliable social-media-based cyber intelligence. Full article

16 pages, 2427 KB

Open AccessArticle

Auditory Models for Formant Frequency Discrimination of Vowel Sounds

by Can Xu and Chang Liu

Information 2023, 14(8), 429; https://doi.org/10.3390/info14080429 - 31 Jul 2023

Viewed by 3701

Abstract

As formant frequencies of vowel sounds are critical acoustic cues for vowel perception, human listeners need to be sensitive to formant frequency change. Numerous studies have found that formant frequency discrimination is affected by many factors like formant frequency, speech level, and fundamental frequency. Theoretically, to perceive a formant frequency change, human listeners with normal hearing may need a relatively constant change in the excitation and loudness pattern, and this internal change in auditory processing is independent of vowel category. Thus, the present study examined whether such metrics could explain the effects of formant frequency and speech level on formant frequency discrimination thresholds. Moreover, a simulation model based on the auditory excitation-pattern and loudness-pattern models was developed to simulate the auditory processing of vowel signals and predict thresholds of vowel formant discrimination. The results showed that predicted thresholds based on auditory metrics incorporating auditory excitation or loudness patterns near the target formant showed high correlations and low root-mean-square errors with human behavioral thresholds in terms of the effects of formant frequency and speech level. In addition, the simulation model, which particularly simulates the spectral processing of acoustic signals in the human auditory system, may be used to evaluate the auditory perception of speech signals for listeners with hearing impairments and/or different language backgrounds. Full article

10 pages, 1031 KB

Open AccessArticle

Natural Syntax, Artificial Intelligence and Language Acquisition

by William O’Grady and Miseon Lee

Information 2023, 14(7), 418; https://doi.org/10.3390/info14070418 - 20 Jul 2023

Cited by 7 | Viewed by 5924

Abstract

In recent work, various scholars have suggested that large language models can be construed as input-driven theories of language acquisition. In this paper, we propose a way to test this idea. As we will document, there is good reason to think that processing pressures override input at an early point in linguistic development, creating a temporary but sophisticated system of negation with no counterpart in caregiver speech. We go on to outline a (for now) thought experiment involving this phenomenon that could contribute to a deeper understanding both of human language and of the language models that seek to simulate it. Full article

11 pages, 1379 KB

Open AccessArticle

MSGAT-Based Sentiment Analysis for E-Commerce

by Tingyao Jiang, Wei Sun and Min Wang

Information 2023, 14(7), 416; https://doi.org/10.3390/info14070416 - 19 Jul 2023

Cited by 7 | Viewed by 2735

Abstract

Sentence-level sentiment analysis, as a research direction in natural language processing, has been widely used in various fields. In order to address the problem that syntactic features were neglected in previous studies on sentence-level sentiment analysis, a multiscale graph attention network (MSGAT) sentiment analysis model based on dependency syntax is proposed. The model adopts RoBERTa_WWM as the text encoding layer, generates graphs on the basis of syntactic dependency trees, and obtains sentence sentiment features at different scales for text classification through a multilevel graph attention network. Compared with the existing mainstream text sentiment analysis models, the proposed model achieves better performance on both a hotel review dataset and a takeaway review dataset, with 94.8% and 93.7% accuracy and 96.2% and 90.4% F1 score, respectively. The results demonstrate the superiority and effectiveness of the model in Chinese sentence sentiment analysis. Full article

20 pages, 6480 KB

Open AccessArticle

Arabic Mispronunciation Recognition System Using LSTM Network

by Abdelfatah Ahmed, Mohamed Bader, Ismail Shahin, Ali Bou Nassif, Naoufel Werghi and Mohammad Basel

Information 2023, 14(7), 413; https://doi.org/10.3390/info14070413 - 16 Jul 2023

Cited by 13 | Viewed by 3532

Abstract

The Arabic language has always been an immense source of attraction to various people from different ethnicities by virtue of the significant linguistic legacy that it possesses. Consequently, a multitude of people from all over the world are yearning to learn it. However, people from different mother tongues and cultural backgrounds might experience some hardships regarding articulation due to the absence of some particular letters only available in the Arabic language, which could hinder the learning process. As a result, a speaker-independent and text-dependent efficient system that aims to detect articulation disorders was implemented. In the proposed system, we emphasize the prominence of “speech signal processing” in diagnosing Arabic mispronunciation using the Mel-frequency cepstral coefficients (MFCCs) as the optimum extracted features. In addition, long short-term memory (LSTM) was also utilized for the classification process. Furthermore, the analytical framework was incorporated with a gender recognition model to perform two-level classification. Our results show that the LSTM network significantly enhances mispronunciation detection along with gender recognition. The LSTM models attained an average accuracy of 81.52% in the proposed system, reflecting a high performance compared to previous mispronunciation detection systems. Full article

15 pages, 740 KB

Open AccessArticle

Text to Causal Knowledge Graph: A Framework to Synthesize Knowledge from Unstructured Business Texts into Causal Graphs

by Seethalakshmi Gopalakrishnan, Victor Zitian Chen, Wenwen Dou, Gus Hahn-Powell, Sreekar Nedunuri and Wlodek Zadrozny

Information 2023, 14(7), 367; https://doi.org/10.3390/info14070367 - 28 Jun 2023

Cited by 10 | Viewed by 8246

Abstract

This article presents a state-of-the-art system to extract and synthesize causal statements from company reports into a directed causal graph. The extracted information is organized by its relevance to different stakeholder group benefits (customers, employees, investors, and the community/environment). The presented method of synthesizing extracted data into a knowledge graph comprises a framework that can be used for similar tasks in other domains, e.g., medical information. The current work addresses the problem of finding, organizing, and synthesizing a view of the cause-and-effect relationships based on textual data in order to inform and even prescribe the best actions that may affect target business outcomes related to the benefits for different stakeholders (customers, employees, investors, and the community/environment). Full article

21 pages, 4772 KB

Open AccessArticle

Authorship Identification of Binary and Disassembled Codes Using NLP Methods

by Aleksandr Romanov, Anna Kurtukova, Anastasia Fedotova and Alexander Shelupanov

Information 2023, 14(7), 361; https://doi.org/10.3390/info14070361 - 25 Jun 2023

Cited by 3 | Viewed by 3672

Abstract

This article is part of a series aimed at determining the authorship of source codes. Analyzing binary code is a crucial aspect of cybersecurity, software development, and computer forensics, particularly in identifying malware authors. Any program is machine code, which can be disassembled using specialized tools and analyzed for authorship identification, similar to natural language text using Natural Language Processing methods. We propose an ensemble of fastText, support vector machine (SVM), and the authors’ hybrid neural network developed in previous works in this research. The improved methodology was evaluated using a dataset of source codes written in C and C++ languages collected from GitHub and Google Code Jam. The collected source codes were compiled into executable programs and then disassembled using reverse engineering tools. The average accuracy of author identification for disassembled codes using the improved methodology exceeds 0.90. Additionally, the methodology was tested on the source codes, achieving an average accuracy of 0.96 in simple cases and over 0.85 in complex cases. These results validate the effectiveness of the developed methodology and its applicability to solving cybersecurity challenges. Full article

14 pages, 376 KB

Open AccessArticle

An Intelligent Conversational Agent for the Legal Domain

by Flora Amato, Mattia Fonisto, Marco Giacalone and Carlo Sansone

Information 2023, 14(6), 307; https://doi.org/10.3390/info14060307 - 27 May 2023

Cited by 15 | Viewed by 5306

Abstract

An intelligent conversational agent for the legal domain is an AI-powered system that can communicate with users in natural language and provide legal advice or assistance. In this paper, we present CREA2, an agent designed to process legal concepts and be able to guide users on legal matters. The conversational agent can help users navigate legal procedures, understand legal jargon, and provide recommendations for legal action. The agent can also give suggestions helpful in drafting legal documents, such as contracts, leases, and notices. Additionally, conversational agents can help reduce the workload of legal professionals by handling routine legal tasks. CREA2, in particular, will guide the user in resolving disputes between people residing within the European Union, proposing solutions in controversies between two or more people who are contending over assets in a divorce, an inheritance, or the division of a company. The conversational agent can later be accessed through various channels, including messaging platforms, websites, and mobile applications. This paper presents a retrieval system that evaluates the similarity between a user’s query and a given question. The system uses natural language processing (NLP) algorithms to interpret user input and associate responses by addressing the problem as a semantic search similar question retrieval. Although a common approach to question and answer (Q&A) retrieval is to create labelled Q&A pairs for training, we exploit an unsupervised information retrieval system in order to evaluate the similarity degree between a given query and a set of questions contained in the knowledge base. We used the recently proposed SBERT model for the evaluation of relevance. In the paper, we illustrate the effective design principles, the implemented details and the results of the conversational system and describe the experimental campaign carried out on it. Full article

13 pages, 507 KB

Open AccessArticle

Multilingual Text Summarization for German Texts Using Transformer Models

by Tomas Humberto Montiel Alcantara, David Krütli, Revathi Ravada and Thomas Hanne

Information 2023, 14(6), 303; https://doi.org/10.3390/info14060303 - 25 May 2023

Cited by 18 | Viewed by 7267

Abstract

The tremendous increase in documents available on the Web has turned finding the relevant pieces of information into a challenging, tedious, and time-consuming activity. Text summarization is an important natural language processing (NLP) task used to reduce the reading requirements of text. Automatic text summarization is an NLP task that consists of creating a shorter version of a text document which is coherent and maintains the most relevant information of the original text. In recent years, automatic text summarization has received significant attention, as it can be applied to a wide range of applications such as the extraction of highlights from scientific papers or the generation of summaries of news articles. In this research project, we focus mainly on abstractive text summarization that extracts the most important contents from a text in a rephrased form. The main purpose of this project is to summarize texts in German. Unfortunately, most pretrained models are only available for English. We therefore focused on the German BERT multilingual model and the BART monolingual model for English, with a consideration of translation possibilities. For the experimental setup, we took the German Wikipedia article dataset and compared how well the multilingual model performed for German text summarization against machine-translated summaries from monolingual English language models. We used the ROUGE-1 metric to analyze the quality of the text summarization. Full article
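As a rough illustration of the ROUGE-1 metric named in this abstract, the sketch below counts clipped unigram overlap between a candidate summary and a reference summary. It is a minimal sketch, not the evaluation code used in the article: the function name `rouge1` and the whitespace tokenization are assumptions made for illustration, and published ROUGE implementations typically add stemming and other normalization.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> tuple[float, float, float]:
    """Return (precision, recall, F1) of unigram overlap.

    Hypothetical helper: tokens are lowercased and split on whitespace,
    which is an assumption, not the article's preprocessing.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if overlap == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f = rouge1("the cat sat", "the cat sat on the mat")
print(p, r, round(f, 3))
```

Precision rewards short summaries that contain only reference words, while recall rewards coverage of the reference, so the two are usually reported together or combined into F1.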

19 pages, 493 KB

Open AccessArticle

Distilling Knowledge with a Teacher’s Multitask Model for Biomedical Named Entity Recognition

by Tahir Mehmood, Alfonso E. Gerevini, Alberto Lavelli, Matteo Olivato and Ivan Serina

Information 2023, 14(5), 255; https://doi.org/10.3390/info14050255 - 24 Apr 2023

Cited by 3 | Viewed by 2670

Abstract

Single-task models (STMs) struggle to learn sophisticated representations from a finite set of annotated data. Multitask learning approaches overcome these constraints by simultaneously training various associated tasks, thereby learning generic representations among various tasks by sharing some layers of the neural network architecture. Because of this, multitask models (MTMs) have better generalization properties than those of single-task learning. Multitask model generalizations can be used to improve the results of other models. STMs can learn more sophisticated representations in the training phase by utilizing the extracted knowledge of an MTM through the knowledge distillation technique where one model supervises another model during training by using its learned generalizations. This paper proposes a knowledge distillation technique in which different MTMs are used as the teacher model to supervise different student models. Knowledge distillation is applied with different representations of the teacher model. We also investigated the effect of the conditional random field (CRF) and softmax function for the token-level knowledge distillation approach, and found that the softmax function leveraged the performance of the student model compared to CRF. The result analysis was also extended with statistical analysis by using the Friedman test. Full article

17 pages, 286 KB

Open AccessEditor’s ChoiceReview

Transformers in the Real World: A Survey on NLP Applications

by Narendra Patwardhan, Stefano Marrone and Carlo Sansone

Information 2023, 14(4), 242; https://doi.org/10.3390/info14040242 - 17 Apr 2023

Cited by 183 | Viewed by 33441

Abstract

The field of Natural Language Processing (NLP) has undergone a significant transformation with the introduction of Transformers. From the first introduction of this technology in 2017, the use of transformers has become widespread and has had a profound impact on the field of NLP. In this survey, we review the open-access and real-world applications of transformers in NLP, specifically focusing on those where text is the primary modality. Our goal is to provide a comprehensive overview of the current state-of-the-art in the use of transformers in NLP, highlight their strengths and limitations, and identify future directions for research. In this way, we aim to provide valuable insights for both researchers and practitioners in the field of NLP. In addition, we provide a detailed analysis of the various challenges faced in the implementation of transformers in real-world applications, including computational efficiency, interpretability, and ethical considerations. Moreover, we highlight the impact of transformers on the NLP community, including their influence on research and the development of new NLP models. Full article

19 pages, 2625 KB

Open AccessArticle

Novel Task-Based Unification and Adaptation (TUA) Transfer Learning Approach for Bilingual Emotional Speech Data

by Ismail Shahin, Ali Bou Nassif, Rameena Thomas and Shibani Hamsa

Information 2023, 14(4), 236; https://doi.org/10.3390/info14040236 - 12 Apr 2023

Cited by 1 | Viewed by 3089

Abstract

Modern developments in machine learning methodology have produced effective approaches to speech emotion recognition. The field of data mining is widely employed in numerous situations where it is possible to predict future outcomes by using the input sequence from previous training data. Since the input feature space and data distribution are the same for both training and testing data in conventional machine learning approaches, they are drawn from the same pool. However, because so many applications require a difference in the distribution of training and testing data, the gathering of training data is becoming more and more expensive. High performance learners that have been trained using similar, already-existing data are needed in these situations. To increase a model’s capacity for learning, transfer learning involves transferring knowledge from one domain to another related domain. To address this scenario, we have extracted ten multi-dimensional features from speech signals using OpenSmile and a transfer learning method to classify the features of various datasets. In this paper, we emphasize the importance of a novel transfer learning system called Task-based Unification and Adaptation (TUA), which bridges the disparity between extensive upstream training and downstream customization. We take advantage of the two components of the TUA, task-challenging unification and task-specific adaptation. Our algorithm is studied using the following speech datasets: the Arabic Emirati-accented speech dataset (ESD), the English Speech Under Simulated and Actual Stress (SUSAS) dataset and the Ryerson Audio-Visual Database of Emotional Speech and Song dataset (RAVDESS). Using the multidimensional features and transfer learning method on the given datasets, we were able to achieve an average speech emotion recognition rate of 91.2% on the ESD, 84.7% on the RAVDESS and 88.5% on the SUSAS datasets, respectively. Full article

15 pages, 1643 KB

Open AccessArticle

MBTI Personality Prediction Using Machine Learning and SMOTE for Balancing Data Based on Statement Sentences

by Gregorius Ryan, Pricillia Katarina and Derwin Suhartono

Information 2023, 14(4), 217; https://doi.org/10.3390/info14040217 - 3 Apr 2023

Cited by 37 | Viewed by 27127

Abstract

The rise of social media as a platform for self-expression and self-understanding has led to increased interest in using the Myers–Briggs Type Indicator (MBTI) to explore human personalities. Despite this, there needs to be more research on how other word-embedding techniques, machine learning algorithms, and imbalanced data-handling techniques can improve the results of MBTI personality-type predictions. Our research aimed to investigate the efficacy of these techniques by utilizing the Word2Vec model to obtain a vector representation of words in the corpus data. We implemented several machine learning approaches, including logistic regression, linear support vector classification, stochastic gradient descent, random forest, the extreme gradient boosting classifier, and the cat boosting classifier. In addition, we used the synthetic minority oversampling technique (SMOTE) to address the issue of imbalanced data. The results showed that our approach could achieve a relatively high F1 score (between 0.7383 and 0.8282), depending on the chosen model for predicting and classifying MBTI personality. Furthermore, we found that using SMOTE could improve the selected models’ performance (F1 score between 0.7553 and 0.8337), proving that the machine learning approach integrated with Word2Vec and SMOTE could predict and classify MBTI personality well, thus enhancing the understanding of MBTI. Full article
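The oversampling step mentioned in this abstract can be sketched as follows: SMOTE creates each synthetic minority sample by interpolating between a minority point and one of its nearest minority neighbours. This is a minimal sketch under stated assumptions, not the authors' pipeline; the function name `smote`, the parameter defaults, and the toy data are hypothetical, and the sketch needs at least two minority points.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points from a list of minority-class tuples.

    Hypothetical helper for illustration; requires len(minority) >= 2.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (Euclidean), excluding x itself.
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_new=5)
print(len(new_points))
```

Because every synthetic point lies on a segment between two real minority samples, the new data stay inside the region the minority class already occupies.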

24 pages, 1086 KB

Open AccessReview

Applications of Text Mining in the Transportation Infrastructure Sector: A Review

by Sudipta Chowdhury and Ammar Alzarrad

Information 2023, 14(4), 201; https://doi.org/10.3390/info14040201 - 23 Mar 2023

Cited by 17 | Viewed by 7260

Abstract

Transportation infrastructure is vital to the well-functioning of economic activities in a region. Due to the digitalization of data storage, ease of access to large databases, and advancement of social media, large volumes of text data that relate to different aspects of transportation infrastructure are generated. Text mining techniques can explore any large amount of textual data within a limited time and with limited resource allocation for generating easy-to-understand knowledge. This study aims to provide a comprehensive review of the various applications of text mining techniques in transportation infrastructure research. The scope of this research ranges across all forms of transportation infrastructure-related problems or issues that were investigated by different text mining techniques. These transportation infrastructure-related problems or issues may involve issues such as crash or accident investigation, driving behavior analysis, and construction activities. A Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA)-based structured methodology was used to identify relevant studies that implemented different text mining techniques across different transportation infrastructure-related problems or issues. A total of 59 studies from both the U.S. and other parts of the world (e.g., China and Bangladesh) were ultimately selected for review after a rigorous quality check. The results show that apart from simple text mining techniques for data pre-processing, the majority of the studies used topic modeling techniques for a detailed evaluation of the text data. Other techniques such as classification algorithms were also later used to predict and/or project future scenarios/states based on the identified topics. The findings from this study will hopefully provide researchers and practitioners with a better understanding of the potential of text mining techniques under different circumstances to solve different types of transportation infrastructure-related problems. They will also provide a blueprint to better understand the ever-evolving area of transportation engineering and infrastructure-focused studies. Full article

25 pages, 1760 KB

Open AccessEditor’s ChoiceReview

A Systematic Review of Transformer-Based Pre-Trained Language Models through Self-Supervised Learning

by Evans Kotei and Ramkumar Thirunavukarasu

Information 2023, 14(3), 187; https://doi.org/10.3390/info14030187 - 16 Mar 2023

Cited by 68 | Viewed by 17585

Abstract

Transfer learning is a technique utilized in deep learning applications to transmit learned inference to a different target domain. The approach is mainly to solve the problem of a few training datasets resulting in model overfitting, which affects model performance. The study was carried out on publications retrieved from various digital libraries such as SCOPUS, ScienceDirect, IEEE Xplore, ACM Digital Library, and Google Scholar, which formed the Primary studies. Secondary studies were retrieved from Primary articles using the backward and forward snowballing approach. Based on set inclusion and exclusion parameters, relevant publications were selected for review. The study focused on transfer learning pretrained NLP models based on the deep transformer network. BERT and GPT were the two elite pretrained models trained to classify global and local representations based on larger unlabeled text datasets through self-supervised learning. Pretrained transformer models offer numerous advantages to natural language processing models, such as knowledge transfer to downstream tasks that deal with drawbacks associated with training a model from scratch. This review gives a comprehensive view of transformer architecture, self-supervised learning and pretraining concepts in language models, and their adaptation to downstream tasks. Finally, we present future directions to further improvement in pretrained transformer-based language models. Full article

12 pages, 608 KB

Open AccessArticle

Adapting Off-the-Shelf Speech Recognition Systems for Novel Words

by Wiam Fadel, Toumi Bouchentouf, Pierre-André Buvet and Omar Bourja

Information 2023, 14(3), 179; https://doi.org/10.3390/info14030179 - 13 Mar 2023

Cited by 3 | Viewed by 5073

Abstract

Current speech recognition systems with fixed vocabularies have difficulties recognizing Out-of-Vocabulary words (OOVs) such as proper nouns and new words. This leads to misunderstandings or even failures in dialog systems. Ensuring effective speech recognition is crucial for the proper functioning of robot assistants. Non-native accents, new vocabulary, and aging voices can cause malfunctions in a speech recognition system. If this task is not executed correctly, the assistant robot will inevitably produce false or random responses. In this paper, we used a statistical approach based on distance algorithms to improve OOV correction. We developed a post-processing algorithm to be combined with a speech recognition model. In this sense, we compared two distance algorithms: Damerau–Levenshtein and Levenshtein distance. We validated the performance of the two distance algorithms in conjunction with five off-the-shelf speech recognition models. Damerau–Levenshtein, as compared to the Levenshtein distance algorithm, succeeded in minimizing the Word Error Rate (WER) when using the MoroccanFrench test set with five speech recognition systems, namely VOSK API, Google API, Wav2vec2.0, SpeechBrain, and Quartznet pre-trained models. Our post-processing method works regardless of the architecture of the speech recognizer, and its results on our MoroccanFrench test set outperformed the five chosen off-the-shelf speech recognizer systems. Full article
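To make the comparison in this abstract concrete, the sketch below implements plain Levenshtein distance and the restricted Damerau–Levenshtein variant (optimal string alignment), which additionally counts a swap of two adjacent characters as a single edit. It is a minimal sketch and not the authors' post-processing algorithm: the helper `nearest_word` and the example words are hypothetical, and the restricted variant is used here only because it is the shortest one to show.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein: also allows adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Adjacent characters swapped: count the swap as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def nearest_word(word, vocabulary, distance):
    """Hypothetical helper: closest vocabulary entry under a distance function."""
    return min(vocabulary, key=lambda v: distance(word, v))


print(levenshtein("rabta", "rabat"), osa_distance("rabta", "rabat"))
```

A transposed pair costs two edits under Levenshtein but one under the Damerau–Levenshtein variant, so the latter can rank the intended word closer when a recognizer swaps adjacent characters.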

21 pages, 756 KB

Open AccessReview

Reconsidering Read and Spontaneous Speech: Causal Perspectives on the Generation of Training Data for Automatic Speech Recognition

by Philipp Gabler, Bernhard C. Geiger, Barbara Schuppler and Roman Kern

Information 2023, 14(2), 137; https://doi.org/10.3390/info14020137 - 19 Feb 2023

Cited by 18 | Viewed by 7466

Abstract

Superficially, read and spontaneous speech—the two main kinds of training data for automatic speech recognition—appear as complementary, but are equal: pairs of texts and acoustic signals. Yet, spontaneous speech is typically harder for recognition. This is usually explained by different kinds of variation and noise, but there is a more fundamental deviation at play: for read speech, the audio signal is produced by recitation of the given text, whereas in spontaneous speech, the text is transcribed from a given signal. In this review, we embrace this difference by presenting a first introduction of causal reasoning into automatic speech recognition, and describing causality as a tool to study speaking styles and training data. After breaking down the data generation processes of read and spontaneous speech and analysing the domain from a causal perspective, we highlight how data generation by annotation must affect the interpretation of inference and performance. Our work discusses how various results from the causality literature regarding the impact of the direction of data generation mechanisms on learning and prediction apply to speech data. Finally, we argue how a causal perspective can support the understanding of models in speech processing regarding their behaviour, capabilities, and limitations. Full article

18 pages, 490 KB

Open AccessArticle

Multilingual Speech Recognition for Turkic Languages

by Saida Mussakhojayeva, Kaisar Dauletbek, Rustem Yeshpanov and Huseyin Atakan Varol

Information 2023, 14(2), 74; https://doi.org/10.3390/info14020074 - 28 Jan 2023

Cited by 24 | Viewed by 9482

Abstract

The primary aim of this study was to contribute to the development of multilingual automatic speech recognition for lower-resourced Turkic languages. Ten languages—Azerbaijani, Bashkir, Chuvash, Kazakh, Kyrgyz, Sakha, Tatar, Turkish, Uyghur, and Uzbek—were considered. A total of 22 models were developed (13 monolingual and 9 multilingual). The multilingual models that were trained using joint speech data performed more robustly than the baseline monolingual models, with the best model achieving an average character and word error rate reduction of 56.7%/54.3%, respectively. The results of the experiment showed that character and word error rate reduction was more likely when multilingual models were trained with data from related Turkic languages than when they were developed using data from unrelated, non-Turkic languages, such as English and Russian. The study also presented an open-source Turkish speech corpus. The corpus contains 218.2 h of transcribed speech with 186,171 utterances and is the largest publicly available Turkish dataset of its kind. The datasets and codes used to train the models are available for download from our GitHub page. Full article

2022

17 pages, 2657 KB

Open AccessReview

Automatic Sarcasm Detection: Systematic Literature Review

by Alexandru-Costin Băroiu and Ștefan Trăușan-Matu

Information 2022, 13(8), 399; https://doi.org/10.3390/info13080399 - 22 Aug 2022

Cited by 43 | Viewed by 11259

Abstract

Sarcasm is an integral part of human language and culture. Naturally, it has garnered great interest from researchers from varied fields of study, including Artificial Intelligence, especially Natural Language Processing. Automatic sarcasm detection has become an increasingly popular topic in the past decade. The research conducted in this paper presents, through a systematic literature review, the evolution of the automatic sarcasm detection task from its inception in 2010 to the present day. No such work has been conducted thus far and it is essential to establish the progress that researchers have made when tackling this task and, moving forward, what the trends are. This study finds that multi-modal approaches and transformer-based architectures have become increasingly popular in recent years. Additionally, this paper presents a critique of the work carried out so far and proposes future directions of research in the field. Full article

21 pages, 702 KB

Open AccessSystematic Review

Automatic Text Summarization of Biomedical Text Data: A Systematic Review

by Andrea Chaves, Cyrille Kesiku and Begonya Garcia-Zapirain

Information 2022, 13(8), 393; https://doi.org/10.3390/info13080393 - 19 Aug 2022

Cited by 39 | Viewed by 11122

Abstract

In recent years, the evolution of technology has led to an increase in text data obtained from many sources. In the biomedical domain, text information has also evidenced this accelerated growth, and automatic text summarization systems play an essential role in optimizing physicians’ time resources and identifying relevant information. In this paper, we present a systematic review in recent research of text summarization for biomedical textual data, focusing mainly on the methods employed, type of input data text, areas of application, and evaluation metrics used to assess systems. The survey was limited to the period between 1st January 2014 and 15th March 2022. The data collected was obtained from WoS, IEEE, and ACM digital libraries, while the search strategies were developed with the help of experts in NLP techniques and previous systematic reviews. The four phases of a systematic review by PRISMA methodology were conducted, and five summarization factors were determined to assess the studies included: Input, Purpose, Output, Method, and Evaluation metric. Results showed that 3.5% of 801 studies met the inclusion criteria. Moreover, Single-document, Biomedical Literature, Generic, and Extractive summarization proved to be the most common approaches employed, while techniques based on Machine Learning were performed in 16 studies and Rouge (Recall-Oriented Understudy for Gisting Evaluation) was reported as the evaluation metric in 26 studies. This review found that in recent years, more transformer-based methodologies for summarization purposes have been implemented compared to a previous survey. Additionally, there are still some challenges in text summarization in different domains, especially in the biomedical field in terms of demand for further research. Full article

15 pages, 851 KB

Open Access Article

Traditional Chinese Medicine Word Representation Model Augmented with Semantic and Grammatical Information

by Yuekun Ma, Zhongyan Sun, Dezheng Zhang and Yechen Feng

Information 2022, 13(6), 296; https://doi.org/10.3390/info13060296 - 10 Jun 2022

Cited by 3 | Viewed by 4753

Abstract

Text vectorization is the basic work of natural language processing tasks. High-quality vector representation with rich feature information can guarantee the quality of entity recognition and other downstream tasks in the field of traditional Chinese medicine (TCM). The existing word representation models mainly include the shallow models with relatively independent word vectors and the deep pre-training models with strong contextual correlation. Shallow models have simple structures but insufficient extraction of semantic and syntactic information, and deep pre-training models have strong feature extraction ability, but the models have complex structures and large parameter scales. In order to construct a lightweight word representation model with rich contextual semantic information, this paper enhances the shallow word representation model with weak contextual relevance at three levels: the part-of-speech (POS) of the predicted target words, the word order of the text, and the synonymy, antonymy and analogy semantics. In this study, we conducted several experiments in both intrinsic similarity analysis and extrinsic quantitative comparison. The results show that the proposed model achieves state-of-the-art performance compared to the baseline models. In the entity recognition task, the F1 value improved by 4.66% compared to the traditional continuous bag-of-words model (CBOW). The model is a lightweight word representation model, which reduces training time by 51% and memory usage by 89% compared to the pre-trained language model BERT. Full article

9 pages, 258 KB

Open Access Article

Contextualizer: Connecting the Dots of Context with Second-Order Attention

by Diego Maupomé and Marie-Jean Meurs

Information 2022, 13(6), 290; https://doi.org/10.3390/info13060290 - 8 Jun 2022

Cited by 1 | Viewed by 2904

Abstract

Composing the representation of a sentence from the tokens that it comprises is difficult, because such a representation needs to account for how the words present relate to each other. The Transformer architecture does this by iteratively changing token representations with respect to one another. This has the drawback of requiring computation that grows quadratically with respect to the number of tokens. Furthermore, the scalar attention mechanism used by Transformers requires multiple sets of parameters to operate over different features. The present paper proposes a lighter algorithm for sentence representation with complexity linear in sequence length. This algorithm begins with a presumably erroneous value of a context vector and adjusts this value with respect to the tokens at hand. In order to achieve this, representations of words are built combining their symbolic embedding with a positional encoding into single vectors. The algorithm then iteratively weighs and aggregates these vectors using a second-order attention mechanism, which allows different feature pairs to interact with each other separately. Our models report strong results in several well-known text classification tasks. Full article

2021

24 pages, 1480 KB

Open Access Article

Robust Complaint Processing in Portuguese

by Henrique Lopes-Cardoso, Tomás Freitas Osório, Luís Vilar Barbosa, Gil Rocha, Luís Paulo Reis, João Pedro Machado and Ana Maria Oliveira

Information 2021, 12(12), 525; https://doi.org/10.3390/info12120525 - 17 Dec 2021

Cited by 1 | Viewed by 6304

Abstract

The Natural Language Processing (NLP) community has witnessed huge improvements in recent years. However, most achievements are evaluated on curated benchmark corpora, with little attention devoted to user-generated content and less-resourced languages. Although recent approaches target the development of multi-lingual tools and models, they still underperform in languages such as Portuguese, for which linguistic resources do not abound. This paper exposes a set of challenges encountered when dealing with a real-world complex NLP problem, based on user-generated complaint data in Portuguese. This case study meets the needs of a country-wide governmental institution responsible for food safety and economic surveillance, and its responsibilities in handling a high number of citizen complaints. Beyond looking at the problem from an exclusively academic point of view, we adopt application-level concerns when analyzing the progress obtained through different techniques, including the need to obtain explainable decision support. We discuss modeling choices and provide useful insights for researchers working on similar problems or data. Full article

13 pages, 335 KB

Open Access Article

A Comparative Study of Arabic Part of Speech Taggers Using Literary Text Samples from Saudi Novels

by Reyadh Alluhaibi, Tareq Alfraidi, Mohammad A. R. Abdeen and Ahmed Yatimi

Information 2021, 12(12), 523; https://doi.org/10.3390/info12120523 - 15 Dec 2021

Cited by 10 | Viewed by 6027

Abstract

Part of Speech (POS) tagging is one of the most common techniques used in natural language processing (NLP) applications and corpus linguistics. Various POS tagging tools have been developed for Arabic. These taggers differ in several aspects, such as their modeling techniques, tag sets, and training and testing data. In this paper, we conduct a comparative study of five Arabic POS taggers, namely Stanford Arabic, CAMeL Tools, Farasa, MADAMIRA and Arabic Linguistic Pipeline (ALP), and examine their performance using text samples from Saudi novels. The testing data has been extracted from different novels that represent different types of narration. The main result we have obtained indicates that the ALP tagger performs better than the others in this particular case, and that Adjective is the most frequently mistagged POS type as compared to Noun and Verb. Full article

12 pages, 223 KB

Open Access Article

Developing Core Technologies for Resource-Scarce Nguni Languages

by Jakobus S. du Toit and Martin J. Puttkammer

Information 2021, 12(12), 520; https://doi.org/10.3390/info12120520 - 14 Dec 2021

Cited by 5 | Viewed by 4292

Abstract

The creation of linguistic resources is crucial to the continued growth of research and development efforts in the field of natural language processing, especially for resource-scarce languages. In this paper, we describe the curation and annotation of corpora and the development of multiple linguistic technologies for four official South African languages, namely isiNdebele, Siswati, isiXhosa, and isiZulu. Development efforts included sourcing parallel data for these languages and annotating each on token, orthographic, morphological, and morphosyntactic levels. These sets were in turn used to create and evaluate three core technologies, viz. a lemmatizer, a part-of-speech tagger, and a morphological analyzer for each of the languages. We report on the quality of these technologies, which improve on rule-based technologies previously developed as part of a similar initiative in 2013. These resources are made publicly accessible through a local resource agency with the intention of fostering further development of both resources and technologies that may benefit the NLP industry in South Africa. Full article

18 pages, 2353 KB

Open Access Article

A Knowledge-Based Sense Disambiguation Method to Semantically Enhanced NL Question for Restricted Domain

by Ammar Arbaaeen and Asadullah Shah

Information 2021, 12(11), 452; https://doi.org/10.3390/info12110452 - 31 Oct 2021

Cited by 3 | Viewed by 3628

Abstract

Within the space of question answering (QA) systems, the most critical module to improve overall performance is question analysis processing. Extracting the lexical semantic of a Natural Language (NL) question presents challenges at syntactic and semantic levels for most QA systems. This is due to the difference between the words posed by a user and the terms presently stored in the knowledge bases. Many studies have achieved encouraging results in lexical semantic resolution on the topic of word sense disambiguation (WSD), and several other works consider these challenges in the context of QA applications. Additionally, few scholars have examined the role of WSD in returning potential answers corresponding to particular questions. However, natural language processing (NLP) is still facing several challenges to determine the precise meaning of various ambiguities. Therefore, the motivation of this work is to propose a novel knowledge-based sense disambiguation (KSD) method for resolving the problem of lexical ambiguity associated with questions posed in QA systems. The major contribution is the proposed innovative method, which incorporates multiple knowledge sources. This includes the question’s metadata (date/GPS), context knowledge, and domain ontology into a shallow NLP. The proposed KSD method is developed into a unique tool for a mobile QA application that aims to determine the intended meaning of questions expressed by pilgrims. The experimental results reveal that our method obtained comparable and better accuracy performance than the baselines in the context of the pilgrimage domain. Full article

20 pages, 523 KB

Open Access Article

Optimizing Small BERTs Trained for German NER

by Jochen Zöllner, Konrad Sperfeld, Christoph Wick and Roger Labahn

Information 2021, 12(11), 443; https://doi.org/10.3390/info12110443 - 25 Oct 2021

Cited by 3 | Viewed by 4089

Abstract

Currently, the most widespread neural network architecture for training language models is the so-called BERT, which led to improvements in various Natural Language Processing (NLP) tasks. In general, the larger the number of parameters in a BERT model, the better the results obtained in these NLP tasks. Unfortunately, the memory consumption and the training duration drastically increase with the size of these models. In this article, we investigate various training techniques for smaller BERT models: we combine different methods from other BERT variants, such as ALBERT, RoBERTa, and relative positional encoding. In addition, we propose two new fine-tuning modifications leading to better performance: Class-Start-End tagging and a modified form of Linear Chain Conditional Random Fields. Furthermore, we introduce Whole-Word Attention, which reduces BERT's memory usage and leads to a small increase in performance compared to classical Multi-Head-Attention. We evaluate these techniques on five public German Named Entity Recognition (NER) tasks, of which two are introduced by this article. Full article

13 pages, 1336 KB

Open Access Article

Multi-Task Learning for Sentiment Analysis with Hard-Sharing and Task Recognition Mechanisms

by Jian Zhang, Ke Yan and Yuchang Mo

Information 2021, 12(5), 207; https://doi.org/10.3390/info12050207 - 12 May 2021

Cited by 28 | Viewed by 5861

Abstract

In the era of big data, multi-task learning has become one of the crucial technologies for sentiment analysis and classification. Most of the existing multi-task learning models for sentiment analysis are developed based on the soft-sharing mechanism that has less interference between different tasks than the hard-sharing mechanism. However, there are also fewer essential features that the model can extract with the soft-sharing method, resulting in unsatisfactory classification performance. In this paper, we propose a multi-task learning framework based on a hard-sharing mechanism for sentiment analysis in various fields. The hard-sharing mechanism is achieved by a shared layer to build the interrelationship among multiple tasks. Then, we design a task recognition mechanism to reduce the interference of the hard-shared feature space and also to enhance the correlation between multiple tasks. Experiments on two real-world sentiment classification datasets show that our approach achieves the best results and improves the classification accuracy over the existing methods significantly. The task recognition training process enables a unique representation of the features of different tasks in the shared feature space, providing a new solution reducing interference in the shared feature space for sentiment analysis. Full article

21 pages, 964 KB

Open Access Review

Ontology-Based Approach to Semantically Enhanced Question Answering for Closed Domain: A Review

by Ammar Arbaaeen and Asadullah Shah

Information 2021, 12(5), 200; https://doi.org/10.3390/info12050200 - 1 May 2021

Cited by 23 | Viewed by 8225

Abstract

For many users of natural language processing (NLP), it can be challenging to obtain concise, accurate and precise answers to a question. Systems such as question answering (QA) enable users to ask questions and receive feedback in the form of quick answers to questions posed in natural language, rather than in the form of lists of documents delivered by search engines. This task is challenging and involves complex semantic annotation and knowledge representation. This study reviews the literature detailing ontology-based methods that semantically enhance QA for a closed domain, by presenting a literature review of the relevant studies published between 2000 and 2020. The review reports that 83 of the 124 papers considered acknowledge the QA approach, and recommend its development and evaluation using different methods. These methods are evaluated according to accuracy, precision, and recall. An ontological approach to semantically enhancing QA is found to be adopted in a limited way, as many of the studies reviewed concentrated instead on NLP and information retrieval (IR) processing. While the majority of the studies reviewed focus on open domains, this study investigates the closed domain. Full article

Natural Language Processing and Applications: Challenges and Perspectives

Published Papers (56 papers)
