Search Results (750)

Search Parameters:
Keywords = Natural Language Processing model (NLP)

19 pages, 1311 KB  
Article
An Interpretable Soft-Sensor Framework for Dissertation Peer Review Using BERT
by Meng Wang, Jincheng Su, Zhide Chen, Wencheng Yang and Xu Yang
Sensors 2025, 25(20), 6411; https://doi.org/10.3390/s25206411 - 17 Oct 2025
Abstract
Graduate education has entered the era of big data, and systematic analysis of dissertation evaluations has become crucial for quality monitoring. However, the complexity and subjectivity inherent in peer-review texts pose significant challenges for automated analysis. While natural language processing (NLP) offers potential solutions, most existing methods fail to adequately capture nuanced disciplinary criteria or provide interpretable inferences for educators. Inspired by soft-sensor design, this study employs a BERT-based model enhanced with additional attention mechanisms to quantify latent evaluation dimensions from dissertation reviews. The framework integrates Shapley Additive exPlanations (SHAP) to ensure the interpretability of model predictions, combining deep semantic modeling with SHAP to quantify feature importance in academic evaluation. The experimental results demonstrate that the implemented model outperforms baseline methods in accuracy, precision, recall, and F1-score. Furthermore, its interpretability mechanism reveals the key evaluation dimensions that experts prioritize during dissertation assessment. This analytical framework establishes an interpretable soft-sensor paradigm that bridges NLP with substantive review principles, providing actionable insights for dissertation improvement strategies.
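
The abstract stops at the architectural level; the snippet below is a minimal sketch of the SHAP step only, run on a deliberately simpler stand-in pipeline (TF-IDF features and a random-forest regressor rather than the paper's BERT model), with invented review texts and ratings.

```python
# Minimal sketch of SHAP-based interpretability on a stand-in pipeline:
# TF-IDF features + a random forest "soft sensor" predicting a review score.
# Illustrative only -- the paper uses a BERT model; texts and scores are made up.
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "Rigorous methodology and convincing experiments.",
    "Thin literature review; the contribution is unclear.",
    "Clear writing, solid validation, minor formatting issues.",
    "Weak experimental design and unsupported conclusions.",
]
scores = np.array([4.5, 2.0, 4.0, 1.5])  # hypothetical expert ratings

vec = TfidfVectorizer()
X = vec.fit_transform(reviews).toarray()
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, scores)

# SHAP values attribute each prediction to individual terms, surfacing
# which evaluation vocabulary drives the predicted score.
shap_values = shap.TreeExplainer(model).shap_values(X)   # shape: (n_docs, n_terms)
top = np.argsort(-np.abs(shap_values).mean(axis=0))[:5]
print([vec.get_feature_names_out()[i] for i in top])
```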

27 pages, 3065 KB  
Article
Chinese Financial News Analysis for Sentiment and Stock Prediction: A Comparative Framework with Language Models
by Hsiu-Min Chuang, Hsiang-Chih He and Ming-Che Hu
Big Data Cogn. Comput. 2025, 9(10), 263; https://doi.org/10.3390/bdcc9100263 - 16 Oct 2025
Abstract
Financial news has a significant impact on investor sentiment and short-term stock price trends. While many studies have applied natural language processing (NLP) techniques to financial forecasting, most have focused on single tasks or English corpora, with limited research in non-English contexts such as Taiwan. This study develops a joint framework to perform sentiment classification and short-term stock price prediction using Chinese financial news from Taiwan’s top 50 listed companies. Five types of word embeddings—one-hot, TF-IDF, CBOW, skip-gram, and BERT—are systematically compared across 17 traditional, deep, and Transformer models, as well as a large language model (LLaMA3) fully fine-tuned on the Chinese financial texts. To ensure annotation quality, sentiment labels were manually assigned by annotators with finance backgrounds and validated through a double-checking process. Experimental results show that a CNN using skip-gram embeddings achieves the strongest performance among deep learning models, while LLaMA3 yields the highest overall F1-score for sentiment classification. For regression, LSTM consistently provides the most reliable predictive power across different volatility groups, with Bayesian Linear Regression remaining competitive for low-volatility firms. LLaMA3 is the only Transformer-based model to achieve a positive R2 under high-volatility conditions. Furthermore, forecasting accuracy is higher for the five-day horizon than for the fifteen-day horizon, underscoring the increasing difficulty of medium-term forecasting. These findings confirm that financial news provides valuable predictive signals for emerging markets and that short-term sentiment-informed forecasts enhance real-time investment decisions.
(This article belongs to the Special Issue Natural Language Processing Applications in Big Data)
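
As a rough illustration of the embedding comparison described above, the sketch below trains CBOW and skip-gram vectors with gensim and scores averaged document vectors with a simple classifier; the toy tokenized headlines, labels, and hyperparameters are invented, and the paper's CNN/LSTM/Transformer models are not reproduced.

```python
# Compare CBOW vs. skip-gram document vectors on a toy headline classifier.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

docs = [["revenue", "beats", "forecast"], ["profit", "warning", "issued"],
        ["dividend", "raised", "again"], ["shares", "slump", "on", "loss"]] * 5
labels = np.array([1, 0, 1, 0] * 5)  # 1 = positive, 0 = negative (toy labels)

def doc_vectors(sg):
    # sg=0 trains CBOW, sg=1 trains skip-gram; document vector = mean word vector.
    w2v = Word2Vec(sentences=docs, vector_size=50, window=3, min_count=1, sg=sg, seed=1)
    return np.array([np.mean([w2v.wv[w] for w in d], axis=0) for d in docs])

for name, sg in [("CBOW", 0), ("skip-gram", 1)]:
    X = doc_vectors(sg)
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5).mean()
    print(f"{name}: CV accuracy = {acc:.2f}")
```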

15 pages, 2861 KB  
Article
Sustainable Real-Time NLP with Serverless Parallel Processing on AWS
by Chaitanya Kumar Mankala and Ricardo J. Silva
Information 2025, 16(10), 903; https://doi.org/10.3390/info16100903 - 15 Oct 2025
Viewed by 129
Abstract
This paper proposes a scalable serverless architecture for real-time natural language processing (NLP) on large datasets using Amazon Web Services (AWS). The framework integrates AWS Lambda, Step Functions, and S3 to enable fully parallel sentiment analysis with Transformer-based models such as DistilBERT, RoBERTa, and ClinicalBERT. By containerizing inference workloads and orchestrating parallel execution, the system eliminates the need for dedicated servers while dynamically scaling to workload demand. Experimental evaluation on the IMDb Reviews dataset demonstrates substantial efficiency gains: parallel execution achieved a 6.07× reduction in wall-clock duration, an 81.2% reduction in total computing time and energy consumption, and a 79.1% reduction in variable costs compared to sequential processing. These improvements directly translate into a smaller carbon footprint, highlighting the sustainability benefits of serverless architectures for AI workloads. The findings show that the proposed framework is model-independent and provides consistent advantages across diverse Transformer variants. This work illustrates how cloud-native, event-driven infrastructures can democratize access to large-scale NLP by reducing cost, processing time, and environmental impact while offering a reproducible pathway for real-world research and industrial applications.
(This article belongs to the Special Issue Generative AI Transformations in Industrial and Societal Applications)
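
The fan-out pattern at the core of this architecture can be sketched as follows; the Lambda function name, region, and payload schema are hypothetical, and the paper orchestrates the parallelism with Step Functions rather than a local thread pool.

```python
# Sketch of the fan-out pattern: invoke one Lambda per chunk of texts in parallel.
import json
from concurrent.futures import ThreadPoolExecutor

import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

def analyze_chunk(chunk_id, reviews):
    resp = lambda_client.invoke(
        FunctionName="sentiment-inference",        # hypothetical Lambda name
        InvocationType="RequestResponse",
        Payload=json.dumps({"chunk_id": chunk_id, "reviews": reviews}),
    )
    return json.loads(resp["Payload"].read())

chunks = [["great movie"], ["terrible plot"], ["decent acting"]]  # toy data
with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
    results = list(pool.map(lambda args: analyze_chunk(*args), enumerate(chunks)))
print(results)
```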

17 pages, 1005 KB  
Article
Leveraging Clinical Record Geolocation for Improved Alzheimer’s Disease Diagnosis Using DMV Framework
by Peng Zhang and Divya Chaudhary
Biomedicines 2025, 13(10), 2496; https://doi.org/10.3390/biomedicines13102496 - 14 Oct 2025
Viewed by 326
Abstract
Background: Early detection of Alzheimer’s disease (AD) is critical for timely intervention, but clinical assessments and neuroimaging are often costly and resource intensive. Natural language processing (NLP) of clinical records offers a scalable alternative, and integrating geolocation may capture complementary environmental risk signals. Methods: We propose the DMV (Data processing, Model training, Validation) framework that frames early AD detection as a regression task predicting a continuous risk score (“data_value”) from clinical text and structured features. We evaluated embeddings from Llama3-70B, GPT-4o (via text-embedding-ada-002), and GPT-5 (text-embedding-3-large) combined with a Random Forest regressor on a CDC-derived dataset (≈284 k records). Models were trained and assessed using 10-fold cross-validation. Performance metrics included Mean Squared Error (MSE), Mean Absolute Error (MAE), and R2; paired t-tests and Wilcoxon signed-rank tests assessed statistical significance. Results: Including geolocation (latitude and longitude) consistently improved performance across models. For the Random Forest baseline, MSE decreased by 48.6% when geolocation was added. Embedding-based models showed larger gains; GPT-5 with geolocation achieved the best results (MSE = 14.0339, MAE = 2.3715, R2 = 0.9783), and the reduction in error from adding geolocation was statistically significant (p < 0.001, paired tests). Conclusions: Combining high-quality text embeddings with patient geolocation yields substantial and statistically significant improvements in AD risk estimation. Incorporating spatial context alongside clinical text may help clinicians account for environmental and regional risk factors and improve early detection in scalable, data-driven workflows.
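
A minimal sketch of the evaluation protocol described in Methods, assuming synthetic data in place of the CDC-derived records: Random Forest regression with and without geolocation features, 10-fold cross-validation, and paired significance tests on the per-fold errors.

```python
# Compare MSE with and without geolocation features via 10-fold CV and paired tests.
import numpy as np
from scipy.stats import ttest_rel, wilcoxon
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n = 500
text_emb = rng.normal(size=(n, 32))                     # stand-in text embeddings
geo = rng.uniform([-90, -180], [90, 180], size=(n, 2))  # latitude, longitude
risk = text_emb[:, 0] + 0.05 * geo[:, 0] + rng.normal(scale=0.1, size=n)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)

mse_text = -cross_val_score(model, text_emb, risk, cv=cv, scoring="neg_mean_squared_error")
mse_both = -cross_val_score(model, np.hstack([text_emb, geo]), risk, cv=cv,
                            scoring="neg_mean_squared_error")

print("MSE text only :", mse_text.mean())
print("MSE + geo     :", mse_both.mean())
print("paired t-test :", ttest_rel(mse_text, mse_both))
print("Wilcoxon      :", wilcoxon(mse_text, mse_both))
```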

28 pages, 1206 KB  
Article
Integrated Subject–Action–Object and Bayesian Models of Intelligent Word Semantic Similarity Measures
by Siping Zeng, Xiaodong Liu, Wenguang Lin, Vasantha Gokula and Renbin Xiao
Systems 2025, 13(10), 902; https://doi.org/10.3390/systems13100902 - 13 Oct 2025
Viewed by 250
Abstract
Synonym similarity judgments based on semantic distance calculation play a vital role in supporting applications in the field of Natural Language Processing (NLP). However, existing semantic computing methods rely excessively on low-efficiency human supervision or high-quality datasets, which limits their further application. For these reasons, this paper proposes an automatic and intelligent method for calculating semantic similarity that integrates Subject–Action–Object (SAO) structures and WordNet to combine knowledge-based and corpus-based semantic similarity. First, the SAO structure is extracted from the Wikipedia dataset, and SAO similarity statistics are obtained by calculating co-occurrences of words in SAO. Second, the semantic similarity parameters of words are obtained based on WordNet, and these parameters are adjusted by Laplace Smoothing (LS). Finally, the semantic similarity is obtained by a Bayesian Model (BM) that combines the semantic similarity parameters with the SAO similarity statistics. Experimental results on well-known word similarity datasets show that the proposed method outperforms traditional methods and even Large Language Models (LLMs) in terms of accuracy. The Pearson, Spearman, and Kendall correlation indices between model scores and human judgements were used to demonstrate the superiority of the proposed algorithm.
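
The exact Bayesian combination is not spelled out in the abstract; the sketch below only illustrates the ingredients it names, blending a WordNet similarity with Laplace-smoothed SAO co-occurrence statistics, with toy counts and a simple weighted combination standing in for the paper's model.

```python
# Blend a WordNet-based similarity with Laplace-smoothed SAO co-occurrence statistics.
from nltk.corpus import wordnet as wn  # requires the WordNet resource: nltk.download("wordnet")

# Toy counts of how often two words co-occur inside extracted SAO triples.
sao_cooccurrence = {("car", "automobile"): 42, ("car", "banana"): 1}
total_pairs = sum(sao_cooccurrence.values())
vocab_pairs = len(sao_cooccurrence)

def sao_similarity(w1, w2, alpha=1.0):
    # Laplace smoothing so unseen pairs still get a small non-zero probability.
    count = sao_cooccurrence.get((w1, w2), 0)
    return (count + alpha) / (total_pairs + alpha * vocab_pairs)

def wordnet_similarity(w1, w2):
    s1, s2 = wn.synsets(w1), wn.synsets(w2)
    return max((a.wup_similarity(b) or 0.0) for a in s1 for b in s2) if s1 and s2 else 0.0

def combined_similarity(w1, w2, weight=0.5):
    # Simple weighted blend standing in for the paper's Bayesian combination.
    return weight * wordnet_similarity(w1, w2) + (1 - weight) * sao_similarity(w1, w2)

print(combined_similarity("car", "automobile"), combined_similarity("car", "banana"))
```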

23 pages, 2102 KB  
Article
Hawkish or Dovish? That Is the Question: Agentic Retrieval of FED Monetary Policy Report
by Ana Lorena Jiménez-Preciado, Mario Alejandro Durán-Saldivar, Salvador Cruz-Aké and Francisco Venegas-Martínez
Mathematics 2025, 13(20), 3255; https://doi.org/10.3390/math13203255 - 11 Oct 2025
Viewed by 274
Abstract
This paper develops a Natural Language Processing (NLP) pipeline to quantify the hawkish–dovish stance in the Federal Reserve’s semiannual Monetary Policy Reports (MPRs). The goal is to transform long-form central-bank text into reproducible stance scores and interpretable policy signals for research and monitoring. The corpus comprises 26 MPRs (26 February 2013 to 20 June 2025). PDFs are parsed and segmented, and chunks are embedded, indexed with FAISS, retrieved via LangChain, and scored by GPT-4o on a continuous scale from −2 (dovish) to +2 (hawkish). Reliability is assessed with a four-dimension validation suite: (i) semantic consistency using cosine-similarity separation, (ii) numerical consistency against theory-implied correlation ranges (e.g., Taylor-rule logic), (iii) bootstrap stability of reported metrics, and (iv) content-quality diagnostics. Results show a predominant Neutral distribution (50.0%), with Dovish (26.9%) and Hawkish (23.1%). The average stance is near zero (≈0.019) with volatility σ ≈ 0.866, and the latest window exhibits a hawkish drift of ~+0.8 points. The Numerical Consistency Score is 0.800, and the integrated validation score is 0.796, indicating publication-grade robustness. We conclude that an embedding-based, agentic RAG approach with GPT-4o yields a scalable, auditable measure of FED communication; limitations include biannual frequency and prompt/model sensitivity, but the framework is suitable for policy tracking and empirical applications.
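
The indexing-and-retrieval step can be sketched as below, using stand-in TF-IDF/SVD vectors in place of the paper's embeddings; the LangChain orchestration and GPT-4o scoring on the −2 to +2 scale are omitted.

```python
# Index report chunks in FAISS and retrieve the nearest chunks for a stance query.
import faiss
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

chunks = [
    "Inflation remains elevated and further tightening may be appropriate.",
    "The labor market has softened and rate cuts could be warranted.",
    "The committee judges risks to be roughly balanced.",
]
vec = TfidfVectorizer()
dense = TruncatedSVD(n_components=2, random_state=0).fit_transform(vec.fit_transform(chunks))
dense = np.ascontiguousarray(dense, dtype="float32")

index = faiss.IndexFlatL2(dense.shape[1])   # exact L2 index over chunk vectors
index.add(dense)

query = np.ascontiguousarray(dense[:1])      # stand-in query vector
distances, ids = index.search(query, 2)
print([chunks[i] for i in ids[0]])           # chunks to pass to the LLM scorer
```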

23 pages, 832 KB  
Article
Sentiment Analysis in Mexican Spanish: A Comparison Between Fine-Tuning and In-Context Learning with Large Language Models
by Tomás Bernal-Beltrán, Mario Andrés Paredes-Valverde, María del Pilar Salas-Zárate, José Antonio García-Díaz and Rafael Valencia-García
Future Internet 2025, 17(10), 445; https://doi.org/10.3390/fi17100445 - 29 Sep 2025
Viewed by 339
Abstract
The proliferation of social media has made Sentiment Analysis an essential tool for understanding user opinions, particularly in underrepresented language variants such as Mexican Spanish. Recent advances in Large Language Models have made effective sentiment analysis possible through in-context learning techniques, reducing the need for supervised training. This study compares the performance of zero-shot and few-shot learning with traditional fine-tuning approaches on tourism-related texts in Mexican Spanish. Two annotated datasets from the REST-MEX 2022 and 2023 shared tasks were used for this purpose. Results show that fine-tuning, particularly with the MarIA model, achieves the best overall performance. However, modern LLMs that use in-context learning strategies, such as Mixtral 8x7B for zero-shot and Mistral 7B for few-shot, demonstrate strong potential in low-resource settings by closely approximating the accuracy of fine-tuned models, suggesting that in-context learning is a viable alternative to fine-tuning for sentiment analysis in Mexican Spanish when labeled data is limited. These approaches can enable intelligent, data-driven digital services with applications in tourism platforms and urban information systems that enhance user experience and trust in large-scale socio-technical ecosystems.
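
For readers unfamiliar with in-context sentiment classification, a minimal zero-shot example using an off-the-shelf NLI model follows; the model, labels, hypothesis template, and sample review are illustrative stand-ins, not the Mixtral/Mistral or MarIA setups evaluated in the paper.

```python
# Zero-shot sentiment classification with a generic NLI model (illustrative stand-in).
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

review = "El hotel estaba limpio y el personal fue muy amable."  # toy tourism review
result = classifier(
    review,
    candidate_labels=["positive", "neutral", "negative"],
    hypothesis_template="The sentiment of this review is {}.",
)
print(result["labels"][0], result["scores"][0])
```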

23 pages, 5554 KB  
Article
Innovative Forecasting: “A Transformer Architecture for Enhanced Bridge Condition Prediction”
by Manuel Fernando Flores Cuenca, Yavuz Yardim and Cengis Hasan
Infrastructures 2025, 10(10), 260; https://doi.org/10.3390/infrastructures10100260 - 29 Sep 2025
Viewed by 392
Abstract
The preservation of bridge infrastructure has become increasingly critical as aging assets face accelerated deterioration due to climate change, environmental loading, and operational stressors. This issue is particularly pronounced in regions with limited maintenance budgets, where delayed interventions compound structural vulnerabilities. Although traditional bridge inspections generate detailed condition ratings, these are often viewed as isolated snapshots rather than part of a continuous structural health timeline, limiting their predictive value. To overcome this, recent studies have employed various Artificial Intelligence (AI) models. However, these models are often restricted by fixed input sizes and specific report formats, making them less adaptable to the variability of real-world data. Thus, this study introduces a Transformer architecture inspired by Natural Language Processing (NLP), treating condition ratings and other features as tokens within temporally ordered inspection “sentences” spanning 1993–2024. Owing to the self-attention mechanism, the model effectively captures long-range dependencies in deterioration patterns, enhancing forecasting accuracy. Empirical results demonstrate 96.88% accuracy for short-term prediction and 86.97% across seven years, surpassing comparable time-series models such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs). Ultimately, this approach enables a data-driven paradigm for structural health monitoring, allowing bridges to “speak” through inspection data and empowering engineers to “listen” with enhanced precision.
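
A minimal sketch of the token-sequence idea, treating yearly condition ratings as a vocabulary fed to a small Transformer encoder; the dimensions, rating scale, and toy inspection history are assumptions, not the paper's configuration.

```python
# Treat yearly condition ratings as tokens and predict the next rating with a Transformer.
import torch
import torch.nn as nn

class BridgeConditionTransformer(nn.Module):
    def __init__(self, num_ratings=10, d_model=32, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(num_ratings, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_ratings)  # next-rating classifier

    def forward(self, ratings):                      # ratings: (batch, seq_len)
        h = self.encoder(self.embed(ratings))        # self-attention over the inspection history
        return self.head(h[:, -1])                   # predict from the last time step

model = BridgeConditionTransformer()
history = torch.tensor([[7, 7, 6, 6, 5, 5, 4]])      # one bridge's yearly ratings (toy)
logits = model(history)
print("predicted next rating:", logits.argmax(dim=-1).item())
```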

15 pages, 930 KB  
Article
Design and Evaluation of Knowledge-Distilled LLM for Improving the Efficiency of School Administrative Document Processing
by Younhee Hong
Electronics 2025, 14(19), 3860; https://doi.org/10.3390/electronics14193860 - 29 Sep 2025
Viewed by 295
Abstract
This study proposed OP-LLM-SA, a knowledge-distillation-based lightweight model for building an on-premise AI system for public documents, and evaluated its performance on 80 public documents. The token accuracy was 92.36%, and the complete sentence rate was 97.19%, showing meaningful results compared to the original documents. During inference, only about 4.5 GB of GPU memory was required, indicating that the model can be used on general office computers, and the Korean-language Llama-3.2 model showed the best performance among the LLMs. This study is significant in that it proposes a system that can efficiently process public documents in an on-premise environment. In particular, it is expected to be helpful for teachers who are burdened with processing public documents. In the future, we plan to expand the application of text mining technology to various administrative document processing environments that handle public documents and personal information, as well as school administration.
(This article belongs to the Section Artificial Intelligence)
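
The abstract does not state the distillation objective; a commonly used formulation, shown below with placeholder logits, combines a temperature-softened KL term with the standard cross-entropy.

```python
# Generic knowledge-distillation loss: softened KL term + cross-entropy on hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale to keep gradient magnitudes comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(4, 32000)    # student vocabulary logits (placeholder)
teacher = torch.randn(4, 32000)    # teacher vocabulary logits (placeholder)
labels = torch.randint(0, 32000, (4,))
print(distillation_loss(student, teacher, labels))
```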

21 pages, 1197 KB  
Article
A Hybrid System for Automated Assessment of Korean L2 Writing: Integrating Linguistic Features with LLM
by Wonjin Hur and Bongjun Ji
Systems 2025, 13(10), 851; https://doi.org/10.3390/systems13100851 - 28 Sep 2025
Viewed by 466
Abstract
The global expansion of Korean language education has created an urgent need for scalable, objective, and consistent methods for assessing the writing skills of non-native (L2) learners. Traditional manual grading is resource-intensive and prone to subjectivity, while existing Automated Essay Scoring (AES) systems often struggle with the linguistic nuances of Korean and the specific error patterns of L2 writers. This paper introduces a novel hybrid AES system designed specifically for Korean L2 writing. The system integrates two complementary feature sets: (1) a comprehensive suite of conventional linguistic features capturing lexical diversity, syntactic complexity, and readability to assess writing form and (2) a novel semantic relevance feature that evaluates writing content. This semantic feature is derived by calculating the cosine similarity between a student’s essay and an ideal, high-proficiency reference answer generated by a Large Language Model (LLM). Various machine learning models are trained on the Korean Language Learner Corpus from the National Institute of the Korean Language to predict a holistic score on the 6-level Test of Proficiency in Korean (TOPIK) scale. The proposed hybrid system demonstrates superior performance compared to baseline models that rely on either linguistic or semantic features alone. The integration of the LLM-based semantic feature provides a significant improvement in scoring accuracy, more closely aligning the automated assessment with human expert judgments. By systematically combining measures of linguistic form and semantic content, this hybrid approach provides a more holistic and accurate assessment of Korean L2 writing proficiency. The system represents a practical and effective tool for supporting large-scale language education and assessment, aligning with the need for advanced AI-driven educational technology systems.
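
A compact sketch of the hybrid feature construction: a few hand-crafted linguistic features concatenated with a cosine-similarity score against a reference answer, then fed to a single regressor. The feature set, placeholder essays, reference text, and TOPIK-style labels are illustrative only.

```python
# Combine simple linguistic features with a semantic-similarity feature for essay scoring.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import cosine_similarity

essays = [
    "I visited Seoul and learned Korean quickly.",
    "My Korean writing has many grammar mistakes.",
    "Studying Korean at university is rewarding.",
]
reference = "A well-organized essay describing experiences of learning Korean with varied vocabulary."
topik_scores = np.array([4, 2, 5])               # hypothetical TOPIK-scale labels

def linguistic_features(text):
    tokens = text.split()
    return [len(tokens),                              # length
            len(set(tokens)) / max(len(tokens), 1),   # type-token ratio (lexical diversity)
            np.mean([len(t) for t in tokens])]        # mean token length (crude complexity proxy)

vec = TfidfVectorizer().fit(essays + [reference])
sem = cosine_similarity(vec.transform(essays), vec.transform([reference]))  # shape (n, 1)

X = np.hstack([np.array([linguistic_features(e) for e in essays]), sem])
model = Ridge().fit(X, topik_scores)
print(model.predict(X))
```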

31 pages, 2653 KB  
Article
A Machine Learning and Econometric Framework for Credibility-Aware AI Adoption Measurement and Macroeconomic Impact Assessment in the Energy Sector
by Adriana AnaMaria Davidescu, Marina-Diana Agafiței, Mihai Gheorghe and Vasile Alecsandru Strat
Mathematics 2025, 13(19), 3075; https://doi.org/10.3390/math13193075 - 24 Sep 2025
Viewed by 486
Abstract
Artificial intelligence (AI) adoption in strategic sectors such as energy is often framed in optimistic narratives, yet its actual economic contribution remains under-quantified. This study proposes a novel, multi-stage methodology at the intersection of machine learning, statistics, and big data analytics to bridge this gap. First, we construct a media-derived AI Adoption Score using natural language processing (NLP) techniques, including dictionary-based keyword extraction, sentiment analysis, and zero-shot classification, applied to a large corpus of firm-related news and scientific publications. To enhance reliability, we introduce a Misinformation Bias Score (MBS)—developed via zero-shot classification and named entity recognition—to penalise speculative or biased reporting, yielding a credibility-adjusted adoption metric. Using these scores, we classify firms and apply a Fixed Effects Difference-in-Differences (FE DiD) econometric model to estimate the causal effect of AI adoption on turnover. Finally, we scale firm-level results to the macroeconomic level via a Leontief Input–Output model, quantifying direct, indirect, and induced contributions to GDP and employment. Results show that AI adoption in Romania’s energy sector accounts for up to 42.8% of adopter turnover, contributing 3.54% to national GDP in 2023 and yielding a net employment gain of over 65,000 jobs, despite direct labour displacement. By integrating machine learning-based text analytics, statistical causal inference, and big data-driven macroeconomic modelling, this study delivers a replicable framework for measuring credible AI adoption and its economy-wide impacts, offering valuable insights for policymakers and researchers in digital transformation, energy economics, and sustainable development.
(This article belongs to the Special Issue Machine Learning, Statistics and Big Data, 2nd Edition)
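
The Leontief step reduces to x = (I - A)^(-1) d; a toy three-sector version is sketched below with invented coefficients, not the study's input-output tables.

```python
# Leontief input-output model: total output x = (I - A)^(-1) d for a demand change d.
import numpy as np

A = np.array([[0.10, 0.05, 0.02],     # energy
              [0.20, 0.15, 0.10],     # manufacturing
              [0.05, 0.10, 0.08]])    # services
d = np.array([100.0, 0.0, 0.0])       # AI-driven demand shock in the energy sector (toy)

leontief_inverse = np.linalg.inv(np.eye(3) - A)
x = leontief_inverse @ d               # direct + indirect output requirements
print("total output by sector:", x)
print("output multiplier (energy):", leontief_inverse[:, 0].sum())
```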

29 pages, 13141 KB  
Article
Automatic Complexity Analysis of UML Class Diagrams Using Visual Question Answering (VQA) Techniques
by Nimra Shehzadi, Javed Ferzund, Rubia Fatima and Adnan Riaz
Software 2025, 4(4), 22; https://doi.org/10.3390/software4040022 - 23 Sep 2025
Viewed by 514
Abstract
Context: Modern software systems have become increasingly complex, making it difficult to interpret raw requirements and effectively utilize traditional tools for software design and analysis. Unified Modeling Language (UML) class diagrams are widely used to visualize and understand system architecture, but analyzing them manually, especially for large-scale systems, poses significant challenges. Objectives: This study aims to automate the analysis of UML class diagrams by assessing their complexity using a machine learning approach. The goal is to support software developers in identifying potential design issues early in the development process and to improve overall software quality. Methodology: To achieve this, this research introduces a Visual Question Answering (VQA)-based framework that integrates both computer vision and natural language processing. Vision Transformers (ViTs) are employed to extract global visual features from UML class diagrams, while the BERT language model processes natural language queries. By combining these two models, the system can accurately respond to questions related to software complexity, such as class coupling and inheritance depth. Results: The proposed method demonstrated strong performance in experimental trials. The ViT model achieved an accuracy of 0.8800, with both the F1 score and recall reaching 0.8985. These metrics highlight the effectiveness of the approach in automatically evaluating UML class diagrams. Conclusions: The findings confirm that advanced machine learning techniques can be successfully applied to automate software design analysis. This approach can help developers detect design flaws early and enhance software maintainability. Future work will explore advanced fusion strategies, novel data augmentation techniques, and lightweight model adaptations suitable for environments with limited computational resources.
(This article belongs to the Topic Applications of NLP, AI, and ML in Software Engineering)
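
The fusion idea can be sketched independently of the pretrained encoders: concatenate a diagram embedding with a question embedding and classify the answer. Both encoder outputs below are random placeholders rather than real ViT/BERT features.

```python
# Late fusion of a diagram embedding and a question embedding for answer classification.
import torch
import torch.nn as nn

class UMLComplexityVQA(nn.Module):
    def __init__(self, vis_dim=768, txt_dim=768, hidden=256, num_answers=10):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(vis_dim + txt_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_answers),       # e.g. coupling / inheritance-depth buckets
        )

    def forward(self, diagram_features, question_features):
        return self.fusion(torch.cat([diagram_features, question_features], dim=-1))

model = UMLComplexityVQA()
diagram = torch.randn(1, 768)   # placeholder for ViT features of a class diagram
question = torch.randn(1, 768)  # placeholder for BERT features of the query
print(model(diagram, question).argmax(dim=-1))
```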

33 pages, 598 KB  
Review
Idea Density and Grammatical Complexity as Neurocognitive Markers
by Diego Iacono and Gloria C. Feltis
Brain Sci. 2025, 15(9), 1022; https://doi.org/10.3390/brainsci15091022 - 22 Sep 2025
Viewed by 573
Abstract
Language, a uniquely human cognitive faculty, is fundamentally characterized by its capacity for complex thoughts and structured expressions. This review examines two critical measures of linguistic performance: idea density (ID) and grammatical complexity (GC). ID quantifies the richness of information conveyed per unit of language, reflecting semantic efficiency and conceptual processing. GC, conversely, measures the structural sophistication of syntax, indicative of hierarchical organization and rule-based operations. We explore the neurobiological underpinnings of these measures, identifying key brain regions and white matter pathways involved in their generation and comprehension. This includes linking ID to a distributed network of semantic hubs, like the anterior temporal lobe and temporoparietal junction, and GC to a fronto-striatal procedural network encompassing Broca’s area and the basal ganglia. Moreover, a central theme is the integration of Chomsky’s theories of Universal Grammar (UG), which posits an innate human linguistic endowment, with their neurobiological correlates. This integration analysis bridges foundational models that first mapped syntax (Friederici’s work) to distinct neural pathways with contemporary network-based theories that view grammar as an emergent property of dynamic, inter-regional neural oscillations. Furthermore, we examine the genetic factors influencing ID and GC, including genes implicated in neurodevelopmental and neurodegenerative disorders. A comparative anatomical perspective across human and non-human primates illuminates the evolutionary trajectory of the language-ready brain. Also, we emphasize that, clinically, ID and GC serve as sensitive neurocognitive markers whose power lies in their often-dissociable profiles. For instance, the primary decline of ID in Alzheimer’s disease contrasts with the severe grammatical impairment in nonfluent aphasia, aiding in differential diagnosis. Importantly, as non-invasive and scalable metrics, ID and GC also provide a critical complement to gold-standard but costly biomarkers like CSF and PET. Finally, the review considers the emerging role of AI and Natural Language Processing (NLP) in automating these linguistic analyses, concluding with a necessary discussion of the critical challenges in validation, ethics, and implementation that must be addressed for these technologies to be responsibly integrated into clinical practice.
(This article belongs to the Section Neurolinguistics)
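
As a concrete, if crude, illustration of idea density, the snippet below approximates propositional density by counting verbs, adjectives, adverbs, prepositions, and conjunctions per ten words; this is a rough heuristic in the spirit of proposition-counting tools, not a validated clinical measure.

```python
# Rough idea-density estimate: propositional POS tags per ten words.
import nltk  # requires the punkt tokenizer and perceptron tagger resources via nltk.download()

PROPOSITION_TAGS = ("VB", "JJ", "RB", "IN", "CC")

def idea_density(text):
    tokens = nltk.word_tokenize(text)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    propositions = sum(tag.startswith(PROPOSITION_TAGS) for tag in tags)
    return 10.0 * propositions / max(len(tokens), 1)

print(idea_density("The quick brown fox jumps over the lazy dog near the river."))
```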

29 pages, 2935 KB  
Article
Optimising Contextual Embeddings for Meaning Conflation Deficiency Resolution in Low-Resourced Languages
by Mosima A. Masethe, Sunday O. Ojo and Hlaudi D. Masethe
Computers 2025, 14(9), 402; https://doi.org/10.3390/computers14090402 - 22 Sep 2025
Viewed by 412
Abstract
Meaning conflation deficiency (MCD) presents a continual obstacle in natural language processing (NLP), especially for low-resourced and morphologically complex languages, where polysemy and contextual ambiguity diminish model precision in word sense disambiguation (WSD) tasks. This paper examines the optimisation of contextual embedding models, namely XLNet, ELMo, BART, and their improved variations, to tackle MCD in such linguistic settings. Utilising Sesotho sa Leboa as a case study, the researchers devised an enhanced XLNet architecture with targeted hyperparameter optimisation, dynamic padding, early stopping, and class-balanced training. Comparative assessments reveal that the optimised XLNet attains an accuracy of 91% and exhibits balanced precision and recall of 92% and 91%, respectively, surpassing both its baseline counterpart and competing models. Optimised ELMo attained the highest overall metrics (accuracy: 92%, F1-score: 96%), whilst optimised BART demonstrated significant accuracy improvements (96%) despite a reduced recall. The results demonstrate that fine-tuning contextual embeddings with MCD-specific methodologies significantly improves semantic disambiguation for under-represented languages. This study offers a scalable and flexible optimisation approach suitable for additional low-resource language contexts.
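
Two of the listed optimisations, class-balanced training and early stopping, can be sketched generically as below; the linear stand-in model, synthetic data, and patience threshold are assumptions, not the paper's XLNet setup.

```python
# Class-balanced loss weights plus a simple early-stopping loop on a stand-in classifier.
import numpy as np
import torch
import torch.nn as nn
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 80 + [1] * 15 + [2] * 5)               # imbalanced sense labels (toy)
weights = compute_class_weight("balanced", classes=np.unique(y), y=y)
criterion = nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float32))

model = nn.Linear(16, 3)                                    # stand-in for the XLNet head
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
X = torch.randn(len(y), 16)
t = torch.tensor(y)

best_loss, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(100):
    optimizer.zero_grad()
    loss = criterion(model(X), t)   # in practice, monitor a held-out validation split
    loss.backward()
    optimizer.step()
    if loss.item() < best_loss - 1e-4:
        best_loss, bad_epochs = loss.item(), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                          # early stopping
            print(f"stopping at epoch {epoch}, loss {best_loss:.4f}")
            break
```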

40 pages, 3284 KB  
Article
SemaTopic: A Framework for Semantic-Adaptive Probabilistic Topic Modeling
by Amani Drissi, Salma Sassi, Richard Chbeir, Anis Tissaoui and Abderrazek Jemai
Computers 2025, 14(9), 400; https://doi.org/10.3390/computers14090400 - 19 Sep 2025
Viewed by 392
Abstract
Topic modeling is a crucial technique for Natural Language Processing (NLP), helping to automatically uncover coherent topics from large-scale text corpora. Yet, classic methods tend to suffer from limited semantic depth and weak topic coherence. In this regard, we present a new approach, “SemaTopic”, to improve the quality and interpretability of discovered topics. By exploiting semantic understanding and stronger clustering dynamics, our approach yields a more continuous, finer-grained, and more stable representation of topics. Experimental results demonstrate that SemaTopic achieves a relative gain of +6.2% in semantic coherence compared to BERTopic on the 20 Newsgroups dataset (Cv = 0.5315 vs. 0.5004), while maintaining stable performance across heterogeneous and multilingual corpora. These findings highlight “SemaTopic” as a scalable and reliable solution for practical text mining and knowledge discovery.
(This article belongs to the Special Issue Advances in Semantic Multimedia and Personalized Digital Content)
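
The reported Cv coherence can be computed with gensim for any set of discovered topics, as in the sketch below; the toy corpus and topic word lists are invented.

```python
# Compute C_v topic coherence for a set of topic word lists with gensim.
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

texts = [
    ["nlp", "topic", "model", "corpus"],
    ["semantic", "embedding", "cluster", "topic"],
    ["bridge", "concrete", "inspection", "rating"],
    ["sensor", "inspection", "bridge", "deck"],
]
topics = [["topic", "model", "corpus"], ["bridge", "inspection", "rating"]]

dictionary = Dictionary(texts)
cm = CoherenceModel(topics=topics, texts=texts, dictionary=dictionary, coherence="c_v")
print("C_v coherence:", cm.get_coherence())
```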