Computers

Research

44 pages, 4216 KB

Open Access Article

Legal AI in Low-Resource Languages: Building and Evaluating QA Systems for the Kazakh Legislation

by Diana Rakhimova, Assem Turarbek, Vladislav Karyukin, Assiya Sarsenbayeva and Rashid Alieyev

Computers 2025, 14(9), 354; https://doi.org/10.3390/computers14090354 - 27 Aug 2025

Viewed by 972

Abstract

The research focuses on the development and evaluation of a legal question–answer system for the Kazakh language, a low-resource and morphologically complex language. Four datasets were compiled from open legal sources (Adilet, Zqai, Gov, and a manually created synthetic set), containing question–answer pairs extracted from official legislative documents and government portals. Seven large language models (GPT-4o mini, GEMMA, KazLLM, LLaMA, Phi, Qwen, and Mistral) were fine-tuned using structured prompt templates, quantization methods, and domain-specific training to enhance contextual understanding and efficiency. The evaluation employed both automatic metrics (ROUGE and METEOR) and expert-based manual assessment. GPT-4o mini achieved the highest overall performance, with ROUGE-1: 0.309, ROUGE-2: 0.175, ROUGE-L: 0.263, and METEOR: 0.320, and received an expert score of 3.96, indicating strong legal reasoning capabilities and adaptability to Kazakh legal contexts. The results highlight GPT-4o mini’s superiority over other tested models in both quantitative and qualitative evaluations. This work demonstrates the feasibility and importance of developing localized legal AI solutions for low-resource languages, contributing to improved legal accessibility, transparency, and digital governance in Kazakhstan.

(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling)
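The ROUGE scores quoted in the abstract above measure n-gram and subsequence overlap between a generated answer and a reference answer. The following is a minimal sketch of ROUGE-N and ROUGE-L as plain F1 scores, assuming whitespace tokenization and no stemming; it is not the evaluation code used in the article, and the function names are illustrative only.

```python
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def f1(overlap, cand_len, ref_len):
    """Harmonic mean of precision (overlap/cand_len) and recall (overlap/ref_len)."""
    if overlap == 0 or cand_len == 0 or ref_len == 0:
        return 0.0
    precision, recall = overlap / cand_len, overlap / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n=1):
    """ROUGE-N F1 using clipped n-gram counts (assumption: whitespace tokens)."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate.split()), ngrams(reference.split())
    overlap = sum((cand & ref).values())  # Counter intersection clips the counts
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def rouge_l(candidate, reference):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    cand, ref = candidate.split(), reference.split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))
```

For the candidate "the court may appeal" against the reference "the court may hear the appeal", this sketch gives ROUGE-1 = 0.8, ROUGE-2 = 0.5, and ROUGE-L = 0.8. A morphologically rich language such as Kazakh would likely need stemming or subword tokenization before the overlap is counted.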

19 pages, 832 KB

Open Access Article

Leveraging Contrastive Semantics and Language Adaptation for Robust Financial Text Classification Across Languages

by Liman Zhang, Qianye Lin, Fanyu Meng, Siyu Liang, Jingxuan Lu, Shen Liu, Kehan Chen and Yan Zhan

Computers 2025, 14(8), 338; https://doi.org/10.3390/computers14080338 - 19 Aug 2025

Viewed by 570

Abstract

With the growing demand for multilingual financial information, cross-lingual financial sentiment recognition faces significant challenges, including semantic misalignment, ambiguous sentiment expression, and insufficient transferability. To address these issues, a unified multilingual recognition framework is proposed, integrating semantic contrastive learning with a language-adaptive modulation mechanism. This approach is built upon the XLM-R multilingual model and employs a semantic contrastive module to enhance cross-lingual semantic consistency. In addition, a language modulation module based on low-rank parameter injection is introduced to improve the model’s sensitivity to fine-grained emotional features in low-resource languages such as Chinese and French. Experiments were conducted on a constructed trilingual financial sentiment dataset encompassing English, Chinese, and French. The results demonstrate that the proposed model significantly outperforms existing methods in cross-lingual sentiment recognition tasks. Specifically, in the English-to-French transfer setting, the model achieved 73.6% in accuracy, 69.8% in F1-Macro, 72.4% in F1-Weighted, and a cross-lingual generalization score of 0.654. Further improvements were observed under multilingual joint training, reaching 77.3%, 73.6%, 76.1%, and 0.696, respectively. In overall comparisons, the proposed model attained the highest performance across cross-lingual scenarios, with 75.8% in accuracy, 72.3% in F1-Macro, and 74.7% in F1-Weighted, surpassing strong baselines such as XLM-R+SimCSE and LaBSE. These results highlight the model’s superior capability in semantic alignment and generalization across languages. The proposed framework demonstrates strong applicability and promising potential in multilingual financial sentiment analysis, public opinion monitoring, and multilingual risk modeling. Full article

(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling)
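The abstract above reports accuracy, F1-Macro, and F1-Weighted side by side. A minimal sketch of how the three differ, assuming single-label classification; the labels and predictions in the usage note are invented for illustration and are not data from the article.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 per label, computed as 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by each class's number of true examples."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)
```

With true labels `["pos", "pos", "pos", "neg", "neg", "neu"]` and predictions `["pos", "pos", "neg", "neg", "neg", "pos"]`, accuracy is about 0.667, weighted F1 is 0.6, and macro F1 drops to about 0.489 because the never-predicted "neu" class contributes a zero with full weight. This is why macro F1 is usually the lowest of the three on imbalanced sentiment data.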

15 pages, 4430 KB

Open Access Article

A Comprehensive Approach to Instruction Tuning for Qwen2.5: Data Selection, Domain Interaction, and Training Protocols

by Xungang Gu, Mengqi Wang, Yangjie Tian, Ning Li, Jiaze Sun, Jingfang Xu, He Zhang, Ruohua Xu and Ming Liu

Computers 2025, 14(7), 264; https://doi.org/10.3390/computers14070264 - 5 Jul 2025

Viewed by 973

Abstract

Instruction tuning plays a pivotal role in aligning large language models with diverse tasks, yet its effectiveness hinges on the interplay of data quality, domain composition, and training strategies. This study moves beyond qualitative assessment to systematically quantify these factors through extensive experiments on data selection, data mixture, and training protocols. By quantifying performance trade-offs, we demonstrate that the implicit method SuperFiltering achieves an optimal balance, whereas explicit filters can induce capability conflicts. A fine-grained analysis of cross-domain interactions quantifies a near-linear competition between code and math, while showing that tool use data exhibits minimal interference. To mitigate these measured conflicts, we compare multi-task, sequential, and multi-stage training strategies, revealing that multi-stage training significantly reduces Conflict Rates while preserving domain expertise. Our findings culminate in a unified framework for optimizing instruction tuning, offering actionable, data-driven guidelines for balancing multi-domain performance and enhancing model generalization, thus advancing the field by providing a methodology to move from intuition to systematic optimization. Full article

(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling)

19 pages, 1594 KB

Open Access Article

Leave as Fast as You Can: Using Generative AI to Automate and Accelerate Hospital Discharge Reports

by Alex Trejo Omeñaca, Esteve Llargués Rocabruna, Jonny Sloan, Michelle Catta-Preta, Jan Ferrer i Picó, Julio Cesar Alfaro Alvarez, Toni Alonso Solis, Eloy Lloveras Gil, Xavier Serrano Vinaixa, Daniela Velasquez Villegas, Ramon Romeu Garcia, Carles Rubies Feijoo, Josep Maria Monguet i Fierro and Beatriu Bayes Genis

Computers 2025, 14(6), 210; https://doi.org/10.3390/computers14060210 - 28 May 2025

Viewed by 1759

Abstract

Clinical documentation, particularly the hospital discharge report (HDR), is essential for ensuring continuity of care, yet its preparation is time-consuming and places a considerable clinical and administrative burden on healthcare professionals. Recent advancements in Generative Artificial Intelligence (GenAI) and the use of prompt engineering in large language models (LLMs) offer opportunities to automate parts of this process, improving efficiency and documentation quality while reducing administrative workload. This study aims to design a digital system based on LLMs capable of automatically generating HDRs using information from clinical course notes and emergency care reports. The system was developed through iterative cycles, integrating various instruction flows and evaluating five different LLMs combined with prompt engineering strategies and agent-based architectures. Throughout the development, more than 60 discharge reports were generated and assessed, leading to continuous system refinement. In the production phase, 40 pneumology discharge reports were produced, receiving positive feedback from physicians, with an average score of 2.9 out of 4, indicating the system’s usefulness, with only minor edits needed in most cases. The ongoing expansion of the system to additional services and its integration within a hospital electronic system highlights the potential of LLMs, when combined with effective prompt engineering and agent-based architectures, to generate high-quality medical content and provide meaningful support to healthcare professionals.

Hospital discharge reports (HDRs) are pivotal for continuity of care but consume substantial clinician time. Generative AI systems based on large language models (LLMs) could streamline this process, provided they deliver accurate, multilingual, and workflow-compatible outputs. We pursued a three-stage, design-science approach. Proof-of-concept: five state-of-the-art LLMs were benchmarked with multi-agent prompting to produce sample HDRs and define the optimal agent structure. Prototype: 60 HDRs spanning six specialties were generated and compared with clinician originals using ROUGE, with average scores comparable to those of specialized news-summarization models in Spanish and Catalan (lower scores). A qualitative audit of 27 HDR pairs showed recurrent divergences in medication dose (56%) and social context (52%). Pilot deployment: the AI-HDR service was embedded in the hospital’s electronic health record. In the pilot, 47 HDRs were autogenerated in real-world settings and reviewed by attending physicians. Missing information and factual errors were flagged in 53% and 47% of drafts, respectively, while written assessments downplayed the importance of these errors. An LLM-driven, agent-orchestrated pipeline can safely draft real-world HDRs, cutting administrative overhead while achieving clinician-acceptable quality, though not without errors that require human supervision. Future work should refine specialty-specific prompts to curb omissions, add temporal consistency checks to prevent outdated data propagation, and validate time savings and clinical impact in multi-center trials.

(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling)

33 pages, 2131 KB

Open Access Article

Domain- and Language-Adaptable Natural Language Interface for Property Graphs

by Ioannis Tsampos and Emmanouil Marakakis

Computers 2025, 14(5), 183; https://doi.org/10.3390/computers14050183 - 9 May 2025

Viewed by 1116

Abstract

Despite the growing adoption of Property Graph Databases, like Neo4j, interacting with them remains difficult for non-technical users due to the reliance on formal query languages. Natural Language Interfaces (NLIs) address this by translating natural language (NL) into Cypher. However, existing solutions are typically limited to high-resource languages; are difficult to adapt to evolving domains with limited annotated data; and often depend on Machine Learning (ML) approaches, including Large Language Models (LLMs), that demand substantial computational resources and advanced expertise for training and maintenance. We address these limitations by introducing a novel dependency-based, training-free, schema-agnostic Natural Language Interface (NLI) that converts NL queries into Cypher for querying Property Graphs. Our system employs a modular pipeline integrating entity and relationship extraction, Named Entity Recognition (NER), semantic mapping, triple creation via syntactic dependencies, and validation against an automatically extracted Schema Graph. The distinctive feature of this approach is the reduction in candidate entity pairs using syntactic analysis and schema validation, eliminating the need for candidate query generation and ranking. The schema-agnostic design enables adaptation across domains and languages. Our system supports single- and multi-hop queries, conjunctions, comparisons, aggregations, and complex questions through an explainable process. Evaluations on real-world queries demonstrate reliable translation results.

(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling)
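The last two pipeline stages named in the abstract above, validating a triple against a Schema Graph and turning it into Cypher, can be illustrated with a small sketch. Everything here is hypothetical: the schema contents, the `triple_to_cypher` helper, and the filter format are assumptions made for illustration, not the authors' implementation, and the upstream extraction steps are not shown.

```python
# Hypothetical schema graph: allowed (source label, relationship type, target label).
SCHEMA = {
    ("Person", "ACTED_IN", "Movie"),
    ("Person", "DIRECTED", "Movie"),
}


def triple_to_cypher(subject_label, relation, object_label, filters=None):
    """Render one schema-validated triple as a parameterized Cypher query.

    `filters` is a list of (variable, property, value) tuples, where the
    variable is "s" (subject node) or "o" (object node).
    """
    if (subject_label, relation, object_label) not in SCHEMA:
        raise ValueError("triple is not allowed by the schema graph")

    query = f"MATCH (s:{subject_label})-[:{relation}]->(o:{object_label})"
    params, conditions = {}, []
    for i, (var, prop, value) in enumerate(filters or []):
        key = f"p{i}"
        conditions.append(f"{var}.{prop} = ${key}")  # value passed as a parameter
        params[key] = value
    if conditions:
        query += " WHERE " + " AND ".join(conditions)
    return query + " RETURN s", params
```

A question such as "Who acted in The Matrix?" would, under these assumptions, map to `triple_to_cypher("Person", "ACTED_IN", "Movie", [("o", "title", "The Matrix")])`, which yields `MATCH (s:Person)-[:ACTED_IN]->(o:Movie) WHERE o.title = $p0 RETURN s` with parameters `{"p0": "The Matrix"}`. Rejecting triples that the schema does not allow is what prunes candidate entity pairs before any query is produced.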

19 pages, 2033 KB

Open Access Article

DeepStego: Privacy-Preserving Natural Language Steganography Using Large Language Models and Advanced Neural Architectures

by Oleksandr Kuznetsov, Kyrylo Chernov, Aigul Shaikhanova, Kainizhamal Iklassova and Dinara Kozhakhmetova

Computers 2025, 14(5), 165; https://doi.org/10.3390/computers14050165 - 29 Apr 2025

Cited by 2 | Viewed by 1183

Abstract

Modern linguistic steganography faces the fundamental challenge of balancing embedding capacity with detection resistance, particularly against advanced AI-based steganalysis. This paper presents DeepStego, a novel steganographic system leveraging GPT-4-omni’s language modeling capabilities for secure information hiding in text. Our approach combines dynamic synonym generation with semantic-aware embedding to achieve superior detection resistance while maintaining text naturalness. Through comprehensive experimentation, DeepStego demonstrates significantly lower detection rates compared to existing methods across multiple state-of-the-art steganalysis techniques. DeepStego supports higher embedding capacities while maintaining strong detection resistance and semantic coherence. The system shows superior scalability compared to existing methods. Our evaluation demonstrates perfect message recovery accuracy and significant improvements in text quality preservation compared to competing approaches. These results establish DeepStego as a significant advancement in practical steganographic applications, particularly suitable for scenarios requiring secure covert communication with high embedding capacity. Full article

(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling)

16 pages, 799 KB

Open Access Article

Advanced Identification of Prosodic Boundaries, Speakers, and Accents Through Multi-Task Audio Pre-Processing and Speech Language Models

by Francisco Javier Lima Florido and Gloria Corpas Pastor

Computers 2025, 14(3), 102; https://doi.org/10.3390/computers14030102 - 14 Mar 2025

Cited by 1 | Viewed by 1681

Abstract

In recent years, advances in deep neural networks (DNNs) and large language models (LLMs) have led to major breakthroughs and new levels of performance in Natural Language Processing (NLP), including tasks related to speech processing. Building on these trends, new models such as Whisper and Wav2Vec 2.0 achieve robust performance in speech processing tasks, even in speech-to-text translation and end-to-end speech translation, far exceeding all previous results. Although these models have shown excellent results in real-time speech processing, they still have accuracy issues for some tasks and high latency when working with large amounts of audio data. In addition, many of them need audio to be segmented and labelled for speech synthesis and annotation tasks. Speaker diarisation, background noise detection, prosodic boundary detection, and accent classification are some of the pre-processing tasks required in these cases. In this study, we will fine-tune a small Wav2Vec 2.0 base model for multi-task classification and audio segmentation. A corpus of spoken American English will be used for the experiments. We intend to explore this new approach and, more specifically, the performance of the model with regard to prosodic boundary detection for audio segmentation and advanced accent identification.

(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling)

12 pages, 965 KB

Open Access Article

Multifaceted Assessment of Responsible Use and Bias in Language Models for Education

by Ishrat Ahmed, Wenxing Liu, Rod D. Roscoe, Elizabeth Reilley and Danielle S. McNamara

Computers 2025, 14(3), 100; https://doi.org/10.3390/computers14030100 - 12 Mar 2025

Cited by 2 | Viewed by 2968

Abstract

Large language models (LLMs) are increasingly being utilized to develop tools and services in various domains, including education. However, due to the nature of the training data, these models are susceptible to inherent social or cognitive biases, which can influence their outputs. Furthermore, their handling of critical topics, such as privacy and sensitive questions, is essential for responsible deployment. This study proposes a framework for the automatic detection of biases and violations of responsible use using a synthetic question-based dataset mimicking student–chatbot interactions. We employ the LLM-as-a-judge method to evaluate multiple LLMs for biased responses. Our findings show that some models exhibit more bias than others, highlighting the need for careful consideration when selecting models for deployment in educational and other high-stakes applications. These results emphasize the importance of addressing bias in LLMs and implementing robust mechanisms to uphold responsible AI use in real-world services. Full article

(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling)

21 pages, 738 KB

Open Access Article

Unpacking Sarcasm: A Contextual and Transformer-Based Approach for Improved Detection

by Parul Dubey, Pushkar Dubey and Pitshou N. Bokoro

Computers 2025, 14(3), 95; https://doi.org/10.3390/computers14030095 - 6 Mar 2025

Cited by 1 | Viewed by 2761

Abstract

Sarcasm detection is a crucial task in natural language processing (NLP), particularly in sentiment analysis and opinion mining, where sarcasm can distort sentiment interpretation. Accurately identifying sarcasm remains challenging due to its context-dependent nature and linguistic complexity across informal text sources like social media and conversational dialogues. This study utilizes three benchmark datasets, namely, News Headlines, Mustard, and Reddit (SARC), which contain diverse sarcastic expressions from headlines, scripted dialogues, and online conversations. The proposed methodology leverages transformer-based models (RoBERTa and DistilBERT), integrating context summarization, metadata extraction, and conversational structure preservation to enhance sarcasm detection. The novelty of this research lies in combining contextual summarization with metadata-enhanced embeddings to improve model interpretability and efficiency. Performance evaluation is based on accuracy, F1 score, and the Jaccard coefficient, ensuring a comprehensive assessment. Experimental results demonstrate that RoBERTa achieves 98.5% accuracy with metadata, while DistilBERT offers a 1.74x speedup, highlighting the trade-off between accuracy and computational efficiency for real-world sarcasm detection applications. Full article

(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling)
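Of the three evaluation metrics named in the abstract above, the Jaccard coefficient is the least familiar. A minimal sketch for binary sarcasm labels, assuming 1 marks the sarcastic class; the example labels in the usage note are invented for illustration.

```python
def jaccard(y_true, y_pred, positive=1):
    """Jaccard coefficient for one class: TP / (TP + FP + FN)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    union = tp + fp + fn
    # Convention chosen here: return 0.0 when the class never occurs.
    return tp / union if union else 0.0


def f1_from_jaccard(j):
    """F1 and Jaccard are monotonically related: F1 = 2J / (1 + J)."""
    return 2 * j / (1 + j)
```

For true labels `[1, 1, 0, 0, 1]` and predictions `[1, 0, 0, 1, 1]` the Jaccard coefficient is 0.5, and the corresponding F1 score is 2/3. Because the two are monotonically related, Jaccard ranks models the same way F1 does on a single class but penalizes errors more heavily.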

35 pages, 633 KB

Open Access Article

Set-Word Embeddings and Semantic Indices: A New Contextual Model for Empirical Language Analysis

by Pedro Fernández de Córdoba, Carlos A. Reyes Pérez, Claudia Sánchez Arnau and Enrique A. Sánchez Pérez

Computers 2025, 14(1), 30; https://doi.org/10.3390/computers14010030 - 20 Jan 2025

Cited by 1 | Viewed by 1340

Abstract

We present a new word embedding technique in a (non-linear) metric space based on the shared membership of terms in a corpus of textual documents, where the metric is naturally defined by the Boolean algebra of all subsets of the corpus and a measure μ defined on it. Once the metric space is constructed, a new term (a noun, an adjective, a classification term) can be introduced into the model and analyzed by means of semantic projections, which in turn are defined as indexes using the measure μ and the word embedding tools. We formally define all necessary elements and prove the main results about the model, including a compatibility theorem for estimating the representability of semantically meaningful external terms in the model (which are written as real Lipschitz functions in the metric space), proving the relation between the semantic index and the metric of the space (Theorem 1). Our main result proves the universality of our word-set embedding, proving mathematically that every word embedding based on linear space can be written as a word-set embedding (Theorem 2). Since we adopt an empirical point of view for the semantic issues, we also provide the keys for the interpretation of the results using probabilistic arguments (to facilitate the subsequent integration of the model into Bayesian frameworks for the construction of inductive tools), as well as in fuzzy set-theoretic terms. We also show some illustrative examples, including a complete computational case using big-data-based computations. Thus, the main advantages of the proposed model are that the results on distances between terms are interpretable in semantic terms once the semantic index used is fixed and, although the calculations could be costly, it is possible to calculate the value of the distance between two terms without the need to calculate the whole distance matrix. “Wovon man nicht sprechen kann, darüber muss man schweigen”. Tractatus Logico-Philosophicus. L. Wittgenstein.

(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling)
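The abstract above does not spell out the metric, so the following sketch rests on an assumption: each term is represented by the set of documents that contain it, μ is the normalized counting measure, and the distance is the measure of the symmetric difference, d(A, B) = μ(A Δ B), which is one standard distance on a measure algebra. The corpus and function names are invented for illustration and may differ from the construction in the article.

```python
def term_sets(corpus):
    """Map each term to the set of document ids in which it occurs."""
    index = {}
    for doc_id, text in enumerate(corpus):
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def distance(term_a, term_b, index, n_docs):
    """Assumed metric: normalized counting measure of the symmetric difference."""
    docs_a = index.get(term_a, set())
    docs_b = index.get(term_b, set())
    return len(docs_a ^ docs_b) / n_docs


corpus = [
    "neural network training",
    "neural language model",
    "language model evaluation",
    "graph database query",
]
index = term_sets(corpus)
```

Here `distance("neural", "language", index, len(corpus))` is 0.5, while `distance("language", "model", index, len(corpus))` is 0.0 because both terms occur in exactly the same documents, so strictly this is a pseudometric on terms. Each distance needs only the two document sets involved, which matches the abstract's remark that a single distance can be computed without building the whole distance matrix.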

29 pages, 6016 KB

Open Access Article

Impact of Chatbots on User Experience and Data Quality on Citizen Science Platforms

by Akasha-Leonie Kessel, Soror Sahri, Sven Groppe, Jinghua Groppe, Hanieh Khorashadizadeh, Marc Pignal, Eva Perez Pimparé and Régine Vignes-Lebbe

Computers 2025, 14(1), 21; https://doi.org/10.3390/computers14010021 - 10 Jan 2025

Viewed by 3732

Abstract

Citizen science (CS) projects, which engage the general public in scientific research, often face challenges in ensuring high-quality data collection and maintaining user engagement. Recent advancements in Large Language Models (LLMs) present a promising solution by providing automated, real-time assistance to users, reducing the need for extensive human intervention, and offering instant support. The CS project Les Herbonautes, dedicated to mass digitization of the French National Herbarium, serves as a case study for this paper, which details the development and evaluation of a network of open source LLM agents to assist users during data collection. The research involved the review of related work, stakeholder meetings with the Muséum National d’Histoire Naturelle, and user and context analyses to formalize system requirements. With these, a prototype with a user interface in the form of a chatbot was designed and implemented using LangGraph, and afterward evaluated through expert evaluation to assess its effect on usability and user experience (UX). The findings indicate that such a chatbot can enhance UX and improve data quality by guiding users and providing immediate feedback. However, limitations due to the non-deterministic nature of LLMs exist, suggesting that workflows must be carefully designed to mitigate potential errors and ensure reliable performance. Full article

(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling)

19 pages, 21558 KB

Open Access Article

Visualizing Ambiguity: Analyzing Linguistic Ambiguity Resolution in Text-to-Image Models

by Wala Elsharif, Mahmood Alzubaidi, James She and Marco Agus

Computers 2025, 14(1), 19; https://doi.org/10.3390/computers14010019 - 8 Jan 2025

Cited by 1 | Viewed by 2184

Abstract

Text-to-image models have demonstrated remarkable progress in generating visual content from textual descriptions. However, the presence of linguistic ambiguity in the text prompts poses a potential challenge to these models, possibly leading to undesired or inaccurate outputs. This work conducts a preliminary study and provides insights into how text-to-image diffusion models resolve linguistic ambiguity through a series of experiments. We investigate a set of prompts that exhibit different types of linguistic ambiguities with different models and the images they generate, focusing on how the models’ interpretations of linguistic ambiguity compare to those of humans. In addition, we present a curated dataset of ambiguous prompts and their corresponding images known as the Visual Linguistic Ambiguity Benchmark (V-LAB) dataset. Furthermore, we report a number of limitations and failure modes caused by linguistic ambiguity in text-to-image models and propose prompt engineering guidelines to minimize the impact of ambiguity. The findings of this exploratory study contribute to the ongoing improvement of text-to-image models and provide valuable insights for future advancements in the field. Full article

(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling)

17 pages, 1448 KB

Open Access Article

LLaMA 3 vs. State-of-the-Art Large Language Models: Performance in Detecting Nuanced Fake News

by Stefan Emil Repede and Remus Brad

Computers 2024, 13(11), 292; https://doi.org/10.3390/computers13110292 - 11 Nov 2024

Cited by 2 | Viewed by 3643

Abstract

This study investigates the effectiveness of a proposed version of Meta’s LLaMA 3 model in detecting fake claims across bilingual (English and Romanian) datasets, focusing on a multi-class approach beyond traditional binary classifications in order to better mimic real-world scenarios. The research employs a proposed version of the LLaMA 3 model, optimized for identifying nuanced categories such as “Mostly True” and “Mostly False”, and compares its performance against leading large language models (LLMs) including OpenAI’s ChatGPT versions, Google’s Gemini, and similar LLaMA models. The analysis reveals that the proposed LLaMA 3 model consistently outperforms its base version and older LLaMA models, particularly on the Romanian dataset, achieving the highest accuracy of 39% and demonstrating superior capabilities in identifying nuanced claims compared with all the other large language models tested. However, the model’s performance across both languages highlights some challenges, with generally low accuracy and difficulties in handling ambiguous categories shared by all the LLMs. The study also underscores the impact of language and cultural context on model reliability, noting that even state-of-the-art models like ChatGPT-4o and Gemini exhibit inconsistencies when applied to Romanian text and to classification beyond a binary true/false scheme.

(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling)

13 pages, 853 KB

Open Access Article

Assessing Large Language Models Used for Extracting Table Information from Annual Financial Reports

by David Balsiger, Hans-Rudolf Dimmler, Samuel Egger-Horstmann and Thomas Hanne

Computers 2024, 13(10), 257; https://doi.org/10.3390/computers13100257 - 9 Oct 2024

Cited by 3 | Viewed by 4393

Abstract

The extraction of data from tables in PDF documents has been a longstanding challenge in the field of data processing and analysis. While traditional methods have been explored in depth, the rise of Large Language Models (LLMs) offers new possibilities. This article addresses the knowledge gaps regarding LLMs, specifically ChatGPT-4 and BARD, for extracting and interpreting data from financial tables in PDF format. This research is motivated by the real-world need to efficiently gather and analyze corporate financial information. The hypothesis is that LLMs (in this case, ChatGPT-4 and BARD) can accurately extract key financial data, such as balance sheets and income statements. The methodology involves selecting representative pages from 46 annual reports of large Swiss corporations listed in the SMI Expanded Index from 2022 and copy–pasting text from these into LLMs. Eight analytical questions were posed to the LLMs, and their responses were assessed for accuracy and for identifying potential error sources in data extraction. The findings revealed significant variance between the performance of ChatGPT-4 and BARD, with ChatGPT-4 generally exhibiting superior accuracy. This research contributes to understanding the capabilities and limitations of LLMs in processing and interpreting complex financial data from corporate documents.

(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling)

20 pages, 2961 KB

Open AccessArticle

Leveraging Large Language Models with Chain-of-Thought and Prompt Engineering for Traffic Crash Severity Analysis and Inference

by Hao Zhen, Yucheng Shi, Yongcan Huang, Jidong J. Yang and Ninghao Liu

Computers 2024, 13(9), 232; https://doi.org/10.3390/computers13090232 - 14 Sep 2024

Cited by 5 | Viewed by 4005

Abstract

Harnessing the power of Large Language Models (LLMs), this study explores the use of three state-of-the-art LLMs, specifically GPT-3.5-turbo, LLaMA3-8B, and LLaMA3-70B, for crash severity analysis and inference, framing it as a classification task. We generate textual narratives from original traffic crash tabular data using a pre-built template infused with domain knowledge. Additionally, we incorporate Chain-of-Thought (CoT) reasoning to guide the LLMs in analyzing the crash causes and then inferring the severity. This study also examines the impact of prompt engineering specifically designed for crash severity inference. The LLMs were tasked with crash severity inference to: (1) evaluate the models’ capabilities in crash severity analysis, (2) assess the effectiveness of CoT and domain-informed prompt engineering, and (3) examine the reasoning abilities with the CoT framework. Our results showed that LLaMA3-70B consistently outperformed the other models, particularly in zero-shot settings. The CoT and Prompt Engineering techniques significantly enhanced performance, improving logical reasoning and addressing alignment issues. Notably, the CoT offers valuable insights into LLMs’ reasoning process, unleashing their capacity to consider diverse factors such as environmental conditions, driver behavior, and vehicle characteristics in severity analysis and inference.

(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling)

24 pages, 22050 KB

Open AccessArticle

SOD: A Corpus for Saudi Offensive Language Detection Classification

by Afefa Asiri and Mostafa Saleh

Computers 2024, 13(8), 211; https://doi.org/10.3390/computers13080211 - 20 Aug 2024

Cited by 1 | Viewed by 2112

Abstract

Social media platforms like X (formerly known as Twitter) are integral to modern communication, enabling the sharing of news, emotions, and ideas. However, they also facilitate the spread of harmful content, and manual moderation of these platforms is impractical. Automated moderation tools, predominantly developed for English, are insufficient for addressing online offensive language in Arabic, a language rich in dialects and informally used on social media. This gap underscores the need for dedicated, dialect-specific resources. This study introduces the Saudi Offensive Dialectal dataset (SOD), consisting of over 24,000 tweets annotated across three levels: offensive or non-offensive, with offensive tweets further categorized as general insults, hate speech, or sarcasm. A deeper analysis of hate speech identifies subtypes related to sports, religion, politics, race, and violence. A comprehensive descriptive analysis of the SOD is also provided to offer deeper insights into its composition. Using machine learning, traditional deep learning, and transformer-based deep learning models, particularly AraBERT, our research achieves a significant F1-Score of 87% in identifying offensive language. This score improves to 91% with data augmentation techniques addressing dataset imbalances. These results, which surpass many existing studies, demonstrate that a specialized dialectal dataset enhances detection efficacy compared to mixed-language datasets.

(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling)

31 pages, 2905 KB

Open AccessArticle

On Using GeoGebra and ChatGPT for Geometric Discovery

by Francisco Botana, Tomas Recio and María Pilar Vélez

Computers 2024, 13(8), 187; https://doi.org/10.3390/computers13080187 - 30 Jul 2024

Cited by 6 | Viewed by 4263

Abstract

This paper explores the performance of ChatGPT and GeoGebra Discovery when dealing with automatic geometric reasoning and discovery. The emergence of Large Language Models has attracted considerable attention in mathematics, among other fields where intelligence should be present. We revisit a couple of elementary Euclidean geometry theorems discussed at the birth of Artificial Intelligence and a non-trivial inequality concerning triangles. GeoGebra succeeds in proving all these selected examples, while ChatGPT fails in one case. Our thesis is that both GeoGebra and ChatGPT could be used as complementary systems, where the natural language abilities of ChatGPT and the certified computer algebra methods in GeoGebra Discovery can cooperate in order to obtain sound and, more relevantly, interesting results.

(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling)

23 pages, 578 KB

Open AccessArticle

An NLP-Based Exploration of Variance in Student Writing and Syntax: Implications for Automated Writing Evaluation

by Maria Goldshtein, Amin G. Alhashim and Rod D. Roscoe

Computers 2024, 13(7), 160; https://doi.org/10.3390/computers13070160 - 25 Jun 2024

Cited by 2 | Viewed by 1948

Abstract

In writing assessment, expert human evaluators ideally judge individual essays with attention to variance among writers’ syntactic patterns. There are many ways to compose text successfully or less successfully. For automated writing evaluation (AWE) systems to provide accurate assessment and relevant feedback, they must be able to consider similar kinds of variance. The current study employed natural language processing (NLP) to explore variance in syntactic complexity and sophistication across clusters characterized in a large corpus (n = 36,207) of middle school and high school argumentative essays. Using NLP tools, k-means clustering, and discriminant function analysis (DFA), we observed that student writers employed four distinct syntactic patterns: (1) familiar and descriptive language, (2) consistently simple noun phrases, (3) variably complex noun phrases, and (4) moderate complexity with less familiar language. Importantly, each pattern spanned the full range of writing quality; there were no syntactic patterns consistently evaluated as “good” or “bad”. These findings support the need for nuanced approaches in automated writing assessment while informing ways that AWE can participate in that process. Future AWE research can and should explore similar variability across other detectable elements of writing (e.g., vocabulary, cohesion, discursive cues, and sentiment) via diverse modeling methods.

(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling)

Review

41 pages, 966 KB

Open AccessReview

ChatGPT’s Expanding Horizons and Transformative Impact Across Domains: A Critical Review of Capabilities, Challenges, and Future Directions

by Taiwo Raphael Feyijimi, John Ogbeleakhu Aliu, Ayodeji Emmanuel Oke and Douglas Omoregie Aghimien

Computers 2025, 14(9), 366; https://doi.org/10.3390/computers14090366 - 2 Sep 2025

Viewed by 756

Abstract

The rapid proliferation of Chat Generative Pre-trained Transformer (ChatGPT) marks a pivotal moment in artificial intelligence, eliciting responses from academic shock to industrial awe. As these technologies advance from passive tools toward proactive, agentic systems, their transformative potential and inherent risks are magnified globally. This paper presents a comprehensive, critical review of ChatGPT’s impact across five key domains: natural language understanding (NLU), content generation, knowledge discovery, education, and engineering. While ChatGPT demonstrates profound capabilities, significant challenges remain in factual accuracy, bias, and the inherent opacity of its reasoning, a core issue termed the “Black Box Conundrum”. To analyze these evolving dynamics and the implications of this shift toward autonomous agency, this review introduces a series of conceptual frameworks, each specifically designed to illuminate the complex interactions and trade-offs within these domains: the “Specialization vs. Generalization” tension in NLU; the “Quality–Scalability–Ethics Trilemma” in content creation; the “Pedagogical Adaptation Imperative” in education; and the emergence of “Human–LLM Cognitive Symbiosis” in engineering. The analysis reveals an urgent need for proactive adaptation across sectors. Educational paradigms must shift to cultivate higher-order cognitive skills, while professional practices (including practices within the education sector) must evolve to treat AI as a cognitive partner, leveraging techniques like Retrieval-Augmented Generation (RAG) and sophisticated prompt engineering. This paper argues for an overarching “Ethical–Technical Co-evolution Imperative”, charting a forward-looking research agenda that intertwines technological innovation with vigorous ethical and methodological standards to ensure responsible AI development and integration.
Ultimately, the analysis reveals that the challenges of factual accuracy, bias, and opacity are interconnected and acutely magnified by the emergence of agentic systems, demanding a unified, proactive approach to adaptation across all sectors.

(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling)

Other

34 pages, 5078 KB

Open AccessSystematic Review

Context-Aware Embedding Techniques for Addressing Meaning Conflation Deficiency in Morphologically Rich Languages Word Embedding: A Systematic Review and Meta Analysis

by Mosima Anna Masethe, Hlaudi Daniel Masethe and Sunday O. Ojo

Computers 2024, 13(10), 271; https://doi.org/10.3390/computers13100271 - 17 Oct 2024

Cited by 4 | Viewed by 3554

Abstract

This systematic literature review aims to evaluate and synthesize the effectiveness of various embedding techniques (word embeddings, contextual word embeddings, and context-aware embeddings) in addressing Meaning Conflation Deficiency (MCD). Using the PRISMA framework, this study assesses the current state of research and provides insights into the impact of these techniques on resolving meaning conflation issues. A thorough literature search found 403 articles on the subject, and the subsequent screening and selection process resulted in the inclusion of 25 studies in the meta-analysis. The evaluation adhered to the PRISMA principles, ensuring a methodical and transparent process. To estimate effect sizes and evaluate heterogeneity and publication bias among the chosen papers, several meta-analytic statistics were used: tau-squared (τ²), a parameter of random-effects models; H-squared (H²), a statistic used to measure heterogeneity; and I-squared (I²), which quantifies the degree of heterogeneity. The meta-analysis demonstrated a high degree of variation in effect sizes among the studies, with a τ² value of 8.8724. The significant degree of heterogeneity was further emphasized by the H² score of 8.10 and the I² value of 87.65%. A trim-and-fill analysis was performed to account for publication bias; it yielded a beta value of 5.95, a standard error of 4.767, a Z-value (or Z-score) of 1.25, which expresses the number of standard deviations by which a data point deviates from the mean, and a p-value (probability value) of 0.2, which is used to assess the significance of hypothesis test results. The results point to a sizable effect size, but the estimates are highly uncertain, as evidenced by the large standard error and non-significant p-value.
The review concludes that although context-aware embeddings show promise in addressing Meaning Conflation Deficiency, there is a great deal of variability and uncertainty in the available data. The varied findings among studies are highlighted by the large τ², I², and H² values, and the trim-and-fill analysis shows that adjusting for publication bias does not alter the non-significance of the effect size. To generate more trustworthy insights, future research should concentrate on enhancing methodological consistency, investigating other embedding strategies, and extending analysis across various languages and contexts. Even though the results indicate a sizable effect in addressing MCD through sophisticated word embedding techniques, like context-aware embeddings, there is still a great deal of variability and uncertainty because of various factors, including the different languages studied, the sizes of the corpora, and the embedding techniques used. These differences show that future research methods must be standardized so that study results can be compared with one another. The results emphasize how crucial it is to extend the linguistic scope to more morphologically rich and low-resource languages, where MCD is especially difficult. In practical terms, creating language-specific models for low-resource languages is one way to improve performance and consistency across Natural Language Processing (NLP) applications. These actions can advance the understanding of MCD, which will ultimately improve the performance of NLP systems across a variety of linguistic contexts.

(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling)
